Some language samples for statistical analysis

Below please find some samples of prose text in several languages, cleaned and formatted for easy statistical analysis.

In each folder like "chin/ptt/", the file "main.src" describes the sample, the source file and the encoding used, and has the text in a somewhat human-readable format. The file "main.wds" has the words of the text, one per line, each prefixed by a one-letter type code and a space. The important type codes are 'a' for a text word and 'p' for punctuation. File lines with other codes can be ignored.
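As an illustration of the "main.wds" layout described above, here is a minimal Python sketch that tallies word frequencies. It assumes each line is exactly one code letter, one space, and then the token. The function names are hypothetical, and the encoding must be taken from the corresponding "main.src", since it differs between samples.

```python
from collections import Counter


def parse_wds_lines(lines):
    """Yield (code, token) pairs from lines of the form '<code> <token>'."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Assumed layout: one-letter type code, one space, then the token.
        # Lines that do not fit this layout are skipped.
        if len(line) < 3 or line[1] != " ":
            continue
        yield line[0], line[2:]


def word_frequencies(lines):
    """Count text words (code 'a'); punctuation ('p') and other codes are ignored."""
    return Counter(tok for code, tok in parse_wds_lines(lines) if code == "a")


def read_wds(path, encoding):
    """Read a main.wds file; 'encoding' is whatever that sample's main.src states."""
    with open(path, encoding=encoding) as f:
        return word_frequencies(f)
```

For example, word_frequencies(["a the\n", "p ,\n", "a the\n"]) counts "the" twice and drops the comma. Passing the encoding explicitly, instead of relying on the platform default, avoids the kind of character mangling that a browser may introduce.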

My university's WWW server and/or your browser may mangle some character codes if you try to open these files in the browser. To use them in any analysis, download them or fetch them with 'wget'.

Natural language texts

Artificially generated texts

The texts below are not "natural" in that they were generated by various non-trivial algorithms. Some of them are random word salads with no meaningful content other than the probability tables used by the generating algorithm.


Last edited on 2025-07-07 09:42:05 by stolfi