# Compute the expected Zipf count {zct[0..num-1]} for {num} words. # Look for a word {wd[i]} that has anomalously high probability. # {ct[i] <= zct[i]}, "freeze" it and proceed to the next one. If # {ct[i] > zct[i]}, # and repeat. Otherwise, the heuristic has a # partial failure; "freeze" the word and proceed to the next one. # # "Freezing" a word means declaring that it will not enter into # any new compounds, so its count will not change from then on # If fixating succeeds, the changes in {wd} and {ct} only affect # unfrozen words, and therefore only # the current word and succeeding ones. Since the new word has count # {N < ct[i]}, entries {0..i-1} will not be affected by the # pre-sort. So the scan resumes at the current word. Since {ct[i]} # decreases, it cannot loop forever. # # If there are no partial failures, the final count {ct[i]} will # be less than or equal to {zct[i]} for all {i}. This does not # ensure that they will be Zipfian, but should be close to. # # If there are several choices for the fixating of word # {wr[i]}, we pick the one which will bring {ct[i]} # closer to {zct[i]}, in some sense. # Compute ideal Zipf count for {num} distinct words: zntk = int(num * log(num) + 1); for (i = 0; i < num; i++) { zct[i] = int(nztk/(1+i) + 0.5); } # Prepare the list of word indices by decreasing count: split("", ord); for (i = 0; i < nwd; i++) { ord[i] = i; } index_sort_words(0); split("", frozen); # {frozen[w]} set means word {w} should not be compounded.