Zipf's law plot (word frequency as a function of frequency rank) for various texts.
The languages, texts, and word frequency files are:
[[Vietnamese language|Vietnamese]]. The first five books (the ''Pentateuch'') of the [[Cadman Vietnamese Bible]] (1934), probably translated from the English [[King James Bible]]. The text is in the ASCII VIQR encoding, mapped to lowercase, without hyphens.
* All five books. Sample: ''ban dda^`u ddu+'c chu'a tro+`i du+.ng ne^n tro+`i dda^'t va? dda^'t la`'' [...] ''da.y la.i cho dde^? ca'c ngu+o+i la`m theo no' trong xu+' ma` ca'c''. File viet/ptt/tot.1/gud.wfr (original 169480 words, truncated/filtered to 35027 words, ''N'' = 1631 distinct).
Synthetic text imitating [[Vietnamese language|Vietnamese]]. The text was generated by a [[Markov chain]] of order 3, trained on the Cadman Vietnamese Pentateuch.
* Whole generated text. Sample: ''ddo' no+i cha(`ng ra(`ng mi`nh xo^'p da^~ng ddi dde^` ca'ch tro+`i'' [...] ''dda(.c''. File viep/mky/tot.1/gud.wfr (original 39293 words, truncated/filtered to 35027 words, ''N'' = 3341 distinct).
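As a rough illustration of how such a synthetic text can be produced, the sketch below trains an order-3 Markov chain and samples from it. The description does not say whether the chain runs over characters or words; this sketch assumes characters. The function names <code>train</code> and <code>generate</code> are hypothetical, and the tiny corpus is only a stand-in taken from the sample quoted above, not the actual training text.

```python
import random
from collections import defaultdict

def train(text, order=3):
    """Map each `order`-character context to the list of characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time, sampling successors by observed frequency."""
    order = len(seed)
    out = seed
    while len(out) < length:
        successors = model.get(out[-order:])
        if not successors:  # context has no recorded successor, so stop early
            break
        out += rng.choice(successors)
    return out

# Stand-in training text (assumption: character-level chain, VIQR text as-is).
corpus = "ban dda^`u ddu+'c chu'a tro+`i du+.ng ne^n tro+`i dda^'t va? dda^'t la` "
model = train(corpus, order=3)
print(generate(model, corpus[:3], 60, random.Random(0)))
```

Because each new character depends only on the previous three, every run of four characters in the output also occurs in the training text, while longer stretches may be novel. That would be consistent with the non-words visible in the generated sample.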
The word frequency files */*/*/gud.wfr are available at the [https://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/Notes/tr-stats/dat/ UNICAMP website]. The original annotated full texts, before truncation/filtering, are in the companion files */*/org/main.src. The truncated/filtered texts (one word per line, without punctuation) are in */*/*/gud.tlw.
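The rank and frequency data behind a plot of this kind can be recomputed from a one-word-per-line text. The following is a minimal sketch, not the procedure actually used to build the gud.wfr files, whose internal format is not described here. The helper names <code>rank_frequency</code> and <code>read_words</code> are hypothetical, and the ASCII encoding of the gud.tlw files is an assumption based on the VIQR note above.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, count) tuples, most frequent word first (rank 1)."""
    counts = Counter(w for w in words if w)  # skip empty lines
    # Sort by descending count; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]

def read_words(path):
    """Read a one-word-per-line file (assumed layout and encoding of gud.tlw)."""
    with open(path, encoding="ascii") as f:
        return [line.strip() for line in f]

# Stand-in words; for a downloaded file: table = rank_frequency(read_words("gud.tlw"))
sample = "tro+`i dda^'t va? dda^'t la` tro+`i dda^'t".split()
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Plotting count against rank on logarithmic axes gives the usual Zipf presentation. The number of rows in the table corresponds to the number of distinct words ''N'' quoted for each text.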