Some language samples for statistical analysis

Below are samples of prose text in several languages, cleaned and formatted for easy statistical analysis.

In each folder like "chin/ptt/", the file "main.src" describes the sample, the source file, and the encoding used, and contains the text in a somewhat human-readable format. The file "main.wds" has the words of the text, one per line, each prefixed by a one-letter type code and a space. The important type codes are 'a' for a text word and 'p' for punctuation. Lines with other codes can be ignored.
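As a rough illustration of the "main.wds" layout described above, here is a minimal Python sketch that keeps only the 'a' and 'p' lines and counts word frequencies. The function names (parse_wds_lines, word_frequencies, load_wds) and the sample lines are hypothetical, and the default "latin-1" encoding is only a byte-preserving assumption; the actual encoding of each sample is stated in its "main.src" file.

```python
from collections import Counter


def parse_wds_lines(lines):
    """Yield (type_code, token) pairs for 'a' (word) and 'p' (punctuation) lines.

    Assumes each useful line is: one-letter code, one space, then the token.
    Lines with any other code, or too short to hold a token, are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if len(line) < 3 or line[1] != " ":
            continue
        code, token = line[0], line[2:]
        if code in ("a", "p"):
            yield code, token


def word_frequencies(lines):
    """Count text words only (type code 'a'), ignoring punctuation."""
    return Counter(tok for code, tok in parse_wds_lines(lines) if code == "a")


def load_wds(path, encoding="latin-1"):
    """Read a main.wds file from disk (the encoding is an assumption, see above)."""
    with open(path, encoding=encoding) as f:
        return list(parse_wds_lines(f))


# Hypothetical sample lines, not taken from any of the actual files.
sample = ["# other code", "a the", "p ,", "a cat", "a the", "p ."]
print(word_frequencies(sample).most_common())  # [('the', 2), ('cat', 1)]
```

For a downloaded sample, something like load_wds("chin/ptt/main.wds", encoding="gb2312") might be appropriate, but check the encoding named in the corresponding "main.src" first.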

My university's WWW server and/or your browser may mangle some character codes if you try to open these files in the browser. To use them in any analysis, download them or fetch them with 'wget'.

Natural language texts

Arabic in a vaguely-phonetic encoding (JSAR)

arab/qcs/ Quran, consonants-only.
arab/qph/ Quran, phonetic, sort of.
arab/quv/ Quran, Unicode, with vowel marks.

Mandarin, Chinese characters in two-byte GB2312 encoding

chin/ptn/ Pentateuch, "New" translation.
chin/ptt/ Pentateuch, "Union" translation.
chin/red/ Novel Dream of the Red Mansion.
chin/voa/ Voice of America transcript (with pinyin by the side)

Mandarin, pinyin with numeric tone

chip/voa/ Voice of America (with chars in GB2312 by the side)

English

engl/cul/ Culpeper The English Physitian (1652).
engl/gav/ Bboard postings by former bitcoin machinist G. Andresen.
engl/poi/ Crime novel from 1920.
engl/twp/ Street theater Towneley Mystery plays (~1460).
engl/wow/ H. G. Wells War of the Worlds (1898).

French

fran/tal/ J. Verne's De la Terre à la Lune (1865).

Ge'ez (old Ethiopian), SERA encoding

geez/eno/ Book of Enoch ("1 Enoch"), from the Coptic Church bible.
geez/gok/ Glory of the Kings, from the Coptic Church bible.

German

germ/sim/ Der Abenteuerliche Simplicissimus Teutsch (1668)

Greek, in Roman transliteration

grek/nwt/ New Testament

Hebrew, in Roman transliteration

hebr/tan/ The Torah (Pentateuch), with vowel points.

Italian

ital/psp/ A. Manzoni, I Promessi Sposi (1840).

Latin

latn/ben/ Rules of the Benedictine Monks (~550).
latn/nwt/ New Testament, Vulgate version (~400).
latn/ock/ W. Ockam, Dialogus on politics (~1340)
latn/ptt/ Pentateuch, Vulgate version (~400)

Omaha-Ponca, NetSiouan encoding (may still have some bugs)

omah/jod/ Corpus collected by J. O. Dorsey (~1890)

Portuguese

port/csm/ M. Assis, novel Dom Casmurro (~1899), modern spelling.
port/cso/ M. Assis, novel Dom Casmurro (~1899), original spelling.

Russian, Roman transliteration

russ/pic/ A sci-fi novel (~1972).
russ/ptt/ Pentateuch, Synodal version (1876).

Spanish

span/qvi/ M. Cervantes, Don Quixote (1615).

Tibetan, ACIP encoding

tibe/ccv/ Ravigupta Commentary to Commentary on Valid Perception (~850).
tibe/pmi/ Kyabje T. R. A Play of Mistaken Illusion (~1970?).
tibe/vim/ Vimalakirti Sutra (~1300).

Vietnamese, VIQR encoding

viet/nwt/ New Testament, translation by Nguyễn Thị Thuấn (1961).
viet/ptt/ Pentateuch, Cadman version (1934).

Artificially generated texts

The texts below are not "natural", in that they were generated by various non-trivial algorithms. Some of them are random word salads with no meaningful content other than the probability tables used by the generating algorithm.

Vietnamese, VIQR encoding

viep/grs/ Random Vietnamese by a variant of Gordon Rugg's table and template method.
viep/mky/ Random Vietnamese by a character Markov chain of order 3.

Voynichese, EVA encoding

voyp/grm/ Random Voynichese generated manually by Gordon Rugg.
voyp/grs/ Random Voynichese generated automatically by Gordon Rugg.

Last edited on 2025-07-07 09:42:05 by stolfi