Zipf's law plot (frequency as a function of frequency rank) for various texts.

The languages, texts, and word frequency files are:

* [[Chinese language|Chinese (Mandarin)]].
** The classical Chinese novel ''[[Dream of the Red Chamber]]'' or ''Dream of the Red Mansion'' (''Hong2 Lou2 Meng4'') by Cao2 Xue3 Qin2 and Gao E (~1750); with some errors and omissions. Chinese characters were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes, e.g. 'zuo4', 'zuo4.1', 'zuo4.2', so as to distinguish characters with the same pinyin. Each character is treated as a separate word.
*** Whole text. Sample: ''ci3 kai1 juan3 di4.2 yi1 hui2 ye3 zuo4.2 zhe3 zi4 yun2 yin1 ceng2 li4.4'' [...] ''dong1 bian1 wu1 nei4.1 guo4 lai2 dai4.1 le5 liu2.1''. File chin/red/tot.1/gud.wfr (original 706889 words, truncated/filtered to 35027 words, ''N'' = 2420 distinct).
** A Chinese translation of the first five books (''[[Pentateuch]]'') of the Old Testament Bible, from the [[Chinese Union Version|Union Version]] (1919). Chinese characters were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes, e.g. 'zuo4', 'zuo4.1', 'zuo4.2', so as to distinguish characters with the same pinyin. Each character is treated as a separate word.
*** Whole text. Sample: ''qi3 chu1.1 shen2.1 chuang4 zao4 tian1 di4 di4 shi4 kong1 xu1.1 hun4'' [...] ''zhei4 li3 wo3 yao4 jiang1 yi1 qie4 jie4.8 ming4 lü4.1''. File chin/ptt/tot.1/gud.wfr (original 174364 words, truncated/filtered to 35027 words, ''N'' = 1392 distinct).
** A Chinese translation of the first five books (''[[Pentateuch]]'') of the Old Testament Bible, possibly the [[New Chinese Version]] (1992). Chinese characters were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes, e.g. 'zuo4', 'zuo4.1', 'zuo4.2', so as to distinguish characters with the same pinyin. Each character is treated as a separate word.
*** Whole text. Sample: ''qi3 chu1.1 shen2.1 chuang4 zao4 tian1 di4 di4 shi4 kong1 xu1.1 hun4'' [...] ''he2 hua2''. File chin/ptn/tot.1/gud.wfr (original 193319 words, truncated/filtered to 35027 words, ''N'' = 1405 distinct).
** Transcripts of 92 selected Voice of America broadcasts in Chinese (1996-1998). Chinese characters were mapped 1:1 from GB (Guo Biao) to pinyin with tone marks and disambiguating suffixes, e.g. 'zuo4', 'zuo4.1', 'zuo4.2'. Each character is treated as a separate word.
*** Whole text. Sample: ''ge4.1 wei4.1 ting1 zhong4 mei3.1 guo2 zheng4.1 fu3 jue2.2 ding4 jin4 yi1'' [...] ''zheng4.1 zhi4.2 fan4.1 zheng4 shi4.13 kang4.1 yi4.2 hen3''. File chin/voa/tot.1/gud.wfr (original 58813 words, truncated/filtered to 35027 words, ''N'' = 1616 distinct).
* [[Chinese language|Chinese (Mandarin)]] in [[pinyin]].
** Transcripts of 92 selected Voice of America broadcasts in Chinese (1996-1998). Chinese characters were converted to pinyin with tone marks, e.g. 'qing2', by Ocrat.com. Note that the same pinyin word may represent two or more different characters.
*** Whole text. Sample: ''ge4 wei4 ting1 zhong4 mei3 guo2 zheng4 fu3 jue2 ding4 jin4 yi1 bu4 dong4'' [...] ''guo2 da4 lu4 qi2''. File chip/voa/tot.1/gud.wfr (original 59476 words, truncated/filtered to 35027 words, ''N'' = 830 distinct).

The word frequency files '*/*/*/gud.wfr' are available at the [https://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/Notes/tr-stats/dat/ UNICAMP website]. The original annotated full texts, before truncation/filtering, are in the companion files */*/org/main.src. The truncated/filtered texts (one word per line, without punctuation) are in */*/*/gud.tlw.
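As an illustration of how a rank-frequency table of the kind plotted here might be derived from one of the truncated/filtered word lists, the following is a minimal Python sketch, not the procedure actually used to produce the files. It assumes only what is stated above, namely that the word list has one word per line. The function names <code>read_word_list</code> and <code>rank_frequency</code>, the UTF-8 encoding, and the alphabetical tie-breaking rule are hypothetical choices made for this example; the layout of the gud.wfr files is not described here and is not reproduced.

```python
from collections import Counter


def read_word_list(path, encoding="utf-8"):
    """Read a one-word-per-line file, skipping blank lines.

    The encoding is an assumption; the actual files may use another one.
    """
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


def rank_frequency(words):
    """Return a list of (rank, word, count) tuples, most frequent first.

    Ties in count are broken alphabetically (a hypothetical choice) so that
    the result is deterministic.
    """
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


# Demonstration on the short Pentateuch sample quoted in the description,
# instead of a real file (which would be loaded with read_word_list).
sample = ("qi3 chu1.1 shen2.1 chuang4 zao4 tian1 "
          "di4 di4 shi4 kong1 xu1.1 hun4").split()
table = rank_frequency(sample)

for rank, word, count in table:
    print(rank, word, count)
```

Plotting the count against the rank on logarithmic axes for a full word list would give a curve of the kind shown in the image; the number of rows in the table corresponds to the number of distinct words, ''N''.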