Zipf's law plot (word frequency as a function of frequency rank) for various texts.

The languages, texts, and word frequency files are:

[[Arabic language|Arabic]]. The ''[[Quran]]'' (~650 CE). Based on the ''Unicode Quran'' document from the ''Sacred Texts'' site, maintained by John B. Hare, with several corrections. Arabic Unicode characters were mapped into [[ISO 8859-1|ISO Latin-1]] characters in a vaguely phonetic way. Version '''with''' vowel marks, hamza, and madda, but '''without''' sukuns.

* Whole text. Sample: ''<nowiki>bîsmî alllâhî alrrâµmânî alrrâµîymî alµâmdû lîllâhî râbbî al¿âlâmîynâ</nowiki>'' [...] ''<nowiki>tâttâ©î£ûwnâ mînhû sâkâräa wârîzqäa µâsânäa a¡înnâ fîy</nowiki>''. File arab/quv/tot.1/gud.wfr (original 77411 words, truncated/filtered to 35027 words, ''N'' = 10762 distinct).

The ''[[Quran]]'' (~650 CE). Based on the ''Unicode Quran'' document from the ''Sacred Texts'' site, maintained by John B. Hare, with several corrections. Arabic Unicode characters were mapped into [[ISO 8859-1|ISO Latin-1]] characters in a vaguely phonetic way, with joined articles. Version '''with''' short vowel marks, sukuns, hamza, and madda.

* Whole text. Sample: ''<nowiki>bîs°mî alllâhî alrrâµ°mânî alrrâµîymî al°µâm°dû lîllâhî râbbî</nowiki>'' [...] ''<nowiki>a!ân° attâ©î£îy mîn° al°jîbâalî bûyûwtäa</nowiki>''. File arab/quf/tot.1/gud.wfr (original 77394 words, truncated/filtered to 35027 words, ''N'' = 10935 distinct).

The ''[[Quran]]'' (~650 CE). Electronic phonetic transliteration from [[University of Southern California|USC]]'s Muslim Student Union, transliterated again into [[ISO 8859-1|ISO Latin-1]] characters. The result, particularly the short vowel marks, is partly guessed. The subtler phonetic marks (sukuns, hamzas, maddas, alef-maksura, teh-marbuta, etc.) were lost.

* Whole text. Sample: ''<nowiki>bîsmî allâhî alrrâµmânî alrrâµymî alµâmdû lîllâhî râbbî al¿âlâmynâ</nowiki>'' [...] ''<nowiki>wâmâ</nowiki>''. File arab/qph/tot.1/gud.wfr (original 77845 words, truncated/filtered to 35027 words, ''N'' = 9434 distinct).

The ''[[Quran]]'' (~650 CE). Electronic version of unknown provenance, without short vowel marks or sukuns, in a vaguely phonetic transliteration into [[ISO 8859-1|ISO Latin-1]] characters.

* Whole text. Sample: ''<nowiki>bsm allh alrµmn alrµym alµmd llh rb al¿almyn alrµmn alrµym malk ywm</nowiki>'' [...] ''<nowiki>nfwra nµn a¿lm bma ystm¿wn bh a£ ystm¿wn</nowiki>''. File arab/qcs/tot.1/gud.wfr (original 74212 words, truncated/filtered to 35027 words, ''N'' = 9025 distinct).

The ''[[Quran]]'' (~650 CE). Based on the ''Unicode Quran'' document from the ''Sacred Texts'' site, maintained by John B. Hare, with several corrections. Arabic Unicode characters were mapped into [[ISO 8859-1|ISO Latin-1]] characters in a vaguely phonetic way, with joined articles. Version '''without''' short vowel marks, sukuns, hamza, or madda.

* Whole text. Sample: ''<nowiki>bsm alllh alrrµmn alrrµym alµmd lllh rbb al¿lmyn alrrµmn alrrµym mlk ywm</nowiki>'' [...] ''<nowiki>walllh anzl mn alssma' ma' faµya bh alarð b¿d mwtha ann fy</nowiki>''. File arab/qud/tot.1/gud.wfr (original 77455 words, truncated/filtered to 35027 words, ''N'' = 8531 distinct).

The word frequency files '*/*/*/gud.wfr' are available at the [https://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/Notes/tr-stats/dat/ UNICAMP website]. The original annotated full texts, before truncation/filtering, are in the companion files '*/*/org/main.src'. The truncated/filtered texts (one word per line, without punctuation) are in '*/*/*/gud.tlw'.
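
The rank/frequency data behind a plot of this kind can be recomputed from a one-word-per-line text such as the gud.tlw files. The following is only a minimal sketch, not the tooling actually used for this plot: the function names <code>read_words</code> and <code>rank_frequency</code> are hypothetical, the ISO Latin-1 file encoding is an assumption based on the transliteration described above, and the internal layout of the gud.wfr files is not assumed at all.

```python
from collections import Counter


def read_words(path, encoding="latin-1"):
    """Read a one-word-per-line file, skipping blank lines.

    The Latin-1 default is an assumption based on the description above.
    """
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


def rank_frequency(words):
    """Return (rank, frequency, word) triples, most frequent word first."""
    counts = Counter(words)
    # Sort by decreasing frequency; break ties alphabetically so that the
    # ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, word)
            for rank, (word, freq) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # The first words of one of the samples quoted above, used here only
    # as a tiny stand-in for a full text.
    sample = ("bsm allh alrµmn alrµym alµmd llh rb al¿almyn "
              "alrµmn alrµym malk ywm").split()
    table = rank_frequency(sample)
    print("tokens:", len(sample), "distinct (N):", len(table))
    for rank, freq, word in table:
        print(rank, freq, word)
```

For a full text, <code>rank_frequency(read_words(path))</code> gives the same triples, where <code>path</code> is the local copy of a gud.tlw file. Plotting frequency against rank on logarithmic axes then shows whether the points fall near a straight line, as Zipf's law predicts.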