Some language samples for statistical analysis
Below please find some samples of prose text in several languages, cleaned and formatted for easy statistical analysis.
In each folder like "chin/ptt/", the file "main.src" describes the sample, the source file, and the encoding used, and has the text in a somewhat human-readable format. The file "main.wds" has the words of the text, one per line, each prefixed by a one-letter type code and a space. The important type codes are 'a' for a text word and 'p' for punctuation. File lines with other codes can be ignored.
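As a rough illustration of the "main.wds" layout described above, here is a minimal Python sketch that splits each line into its type code and word, then counts the text words. The function names and the sample lines (including the '#' code) are hypothetical, made up for this example; only the 'a' and 'p' codes come from the description.

```python
from collections import Counter

def parse_wds(lines):
    """Yield (type_code, word) pairs from the lines of a main.wds file."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Expect one type letter, one space, then the word; skip anything else.
        if len(line) < 3 or line[1] != " ":
            continue
        yield line[0], line[2:]

def word_frequencies(lines):
    """Count text words (type 'a'), ignoring punctuation and other codes."""
    return Counter(word for code, word in parse_wds(lines) if code == "a")

# Hypothetical sample lines, not taken from any of the real files.
sample = ["a the", "a cat", "p ,", "a the", "# comment"]
freqs = word_frequencies(sample)
print(freqs.most_common(2))
```

To run this on a real sample, pass an open file object instead of the list. The encoding to use when opening the file depends on the sample and is stated in its "main.src".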
My university's WWW server and/or your browser may mangle some character codes if you try to open these files in the browser. To use them in any analysis, download them or fetch them with 'wget'.
Natural language texts
- Arabic in a vaguely-phonetic encoding (JSAR)
- Mandarin, Chinese characters in two-byte GB2312 encoding
- chin/ptn/ Pentateuch, "New" translation.
- chin/ptt/ Pentateuch, "Union" translation.
- chin/red/ Novel Dream of the Red Mansion.
- chin/voa/ Voice of America transcript (with pinyin by the side)
- Mandarin, pinyin with numeric tone.
- chip/voa/ Voice of America (with chars in GB2312 by the side)
- English
- engl/cul/ Culpeper The English Physitian (1652).
- engl/gav/ Bboard postings by former bitcoin machinist G. Andresen.
- engl/poi/ Crime novel from 1920.
- engl/twp/ Street theater Towneley Mystery plays (~1460).
- engl/wow/ H. G. Wells War of the Worlds (1898).
- French
- fran/tal/ J. Verne's De la Terre à la Lune (1865).
- Ge'ez (old Ethiopian), SERA encoding
- geez/eno/ Book of Enoch ("1 Enoch"), from the Coptic Church bible.
- geez/gok/ Glory of the Kings, from the Coptic Church bible.
- German
- germ/sim/ Der Abenteuerliche Simplicissimus Teutsch (1668)
- Greek, in Roman transliteration
- Hebrew, in Roman transliteration
- hebr/tan/ The Torah (Pentateuch), with vowel points.
- Italian
- ital/psp/ A. Manzoni, I Promessi Sposi (1840).
- Latin
- latn/ben/ Rules of the Benedictine Monks (~550).
- latn/nwt/ New Testament, Vulgate version (~400).
- latn/ock/ W. Ockham, Dialogus on politics (~1340)
- latn/ptt/ Pentateuch, Vulgate version (~400)
- Omaha-Ponca, NetSiouan encoding (may still have some bugs)
- omah/jod/ Corpus collected by J. O. Dorsey (~1890)
- Portuguese
- port/csm/ M. Assis, novel Dom Casmurro (~1899), modern spelling.
- port/cso/ M. Assis, novel Dom Casmurro (~1899), original spelling.
- Russian, Roman transliteration
- Spanish
- Tibetan, ACIP encoding
- tibe/ccv/ Ravigupta Commentary to Commentary on Valid Perception (~850).
- tibe/pmi/ Kyabje T. R. A Play of Mistaken Illusion (~1970?).
- tibe/vim/ Vimalakirti Sutra (~1300).
- Vietnamese, VIQR encoding
- viet/nwt/ New Testament, translation by Nguyễn Thị Thuấn (1961).
- viet/ptt/ Pentateuch, Cadman version (1934).
Artificially generated texts
The texts below are not "natural" in that they were generated by various non-trivial algorithms. Some of them are random word salads with no meaningful content other than the probability tables used by the generating algorithm.
- Vietnamese, VIQR encoding
- viep/grs/ Random Vietnamese by a variant of Gordon Rugg's table and template method.
- viep/mky/ Random Vietnamese by a character Markov chain of order 3.
- Voynichese, EVA encoding
- voyp/grm/ Random Voynichese generated manually by Gordon Rugg.
- voyp/grs/ Random Voynichese generated automatically by Gordon Rugg.
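For readers unfamiliar with the term, a character Markov chain of order 3 picks each new character at random from those that followed the same three preceding characters in a training text. The Python sketch below shows the general technique only. It is not the program used to produce viep/mky/, and the function names and the toy English training string are assumptions made for this example.

```python
import random
from collections import defaultdict

def build_model(text, order=3):
    """Map each `order`-character context to the characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed, length, rng):
    """Extend `seed` one character at a time; len(seed) must equal the order."""
    order = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:  # dead end: this context was never followed by anything
            break
        out += rng.choice(followers)
    return out

# Toy training text, invented for illustration (not from any sample here).
training = "the cat sat on the mat and the rat sat on the cat "
model = build_model(training, order=3)
text = generate(model, training[:3], 40, random.Random(1))
print(text)
```

Keeping repeated followers in a list makes sampling proportional to how often each character followed the context in the training text, which is the "probability table" of such a generator.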
Last edited on 2025-07-07 09:42:05 by stolfi