New Portuguese and Spanish samples for @Quimqu: https://www.ic.unicamp.br/~stolfi/EXPORT/voynich/work/Notes/085/out/

All texts are derived from the classical 1899 novel "Dom Casmurro" by Brazilian author Machado de Assis.

The "text" folder has the full texts for reference, more or less as published, with capitals, punctuation, etc.

The "full", "head", and "tail" folders have the same texts, but converted to lowercase, with all punctuation (including apostrophe and hyphen) converted to blanks, and with numbers, symbols, and foreign language words replaced by " * " words. Paragraphs are separated by blank lines. Words in each paragraph are separated by blanks and newlines.

The "full" files have all the words of the novel, which are ~65'500 for Portuguese and ~69'000 for Spanish. The "head" and "tail" files are the same texts, trimmed to the first ~38'000 words and the last ~38'000 words, respectively. This token count should roughly match that of the VMS. Since each full text has fewer than 2 x 38'000 words, there is some overlap between each head and the corresponding tail, which is ~10'500 words for Portuguese and ~7'000 for Spanish. The last paragraph of "head" and the first paragraph of "tail" may be truncated.

The files are in the Unicode UTF-8 encoding. Besides blank and newline, the characters in the "full", "head", and "tail" files should be

Spanish: abcdefghijklmnopqrstuvwxyz áéíóúñüªº ãç
Portuguese: abcdefghijklmnopqrstuvwxyz áéíóúâêôàãõçü

In each folder there are four files:

"port-orig.txt" uses the original 1899 spelling (apart from capitalization).
"port-curr.txt" uses the "modern" spelling (as of ~1999).
"port-phon.txt" uses an ersatz phonetic spelling.
"span-curr.txt" has the Spanish translation of the novel, with (I suppose) current spelling.

The first three files in each folder (the Portuguese ones) should have an almost 1:1 mapping between their tokens, matching lexeme to lexeme. The differences are mostly hyphenated or separate words in one file that are joined in the other.
The phonetic files, in particular, have the adverbial suffix "mente" as a separate word.

There is no such simple mapping between the tokens of the Portuguese and Spanish files, even though most lexemes are cognates. The word order and the choice of synonyms are often different, etc. I suppose that the Spanish file has more words because the Portuguese contractions are split: "da" <--> "de la" etc.

In any structural word analysis that is independent of a lexical mapping, the three Portuguese samples in each folder should be almost coincident. And each Portuguese sample should plot close to the corresponding Spanish sample, because, even though they are different languages by different authors (Machado and the translator), the underlying text is the same.

On the other hand, each "head" should plot close to the corresponding "tail", because they are (mostly) different texts but have the same author, the same language, the same topic, and the same literary genre and style. They can therefore be used to test how sensitive the analysis is to sampling noise.

The most interesting comparison will be between a Portuguese "head" and a Spanish "tail" (or vice-versa), because what they have in common is only the topic and genre, and partly the style.
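
For anyone who wants to sanity-check the files before running an analysis, here is a minimal Python sketch of how the format described above could be read. It is only an illustration, not the script used to produce the samples. The function names (read_paragraphs, unexpected_chars, head_tail_overlap) are made up for this example, and it assumes that the placeholder for numbers, symbols, and foreign words shows up as a lone "*" token. It splits a text into paragraphs and tokens, reports any character outside the alphabets listed above, and redoes the head/tail overlap arithmetic from the approximate word counts.

```python
import re
import unicodedata


def nfc(s):
    """Normalize to composed form so that accented letters compare equal."""
    return unicodedata.normalize("NFC", s)


LATIN = "abcdefghijklmnopqrstuvwxyz"

# Alphabets as listed in the post (blank and newline are handled separately).
ALPHABET = {
    "port": set(nfc(LATIN + "áéíóúâêôàãõçü")),
    "span": set(nfc(LATIN + "áéíóúñüªº" + "ãç")),
}

# Assumption: replaced numbers/symbols/foreign words appear as a lone "*" token.
PLACEHOLDER = "*"


def read_paragraphs(text):
    """Split text at blank lines; return each paragraph as a list of tokens."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", nfc(text)):
        tokens = block.split()  # tokens are separated by blanks and newlines
        if tokens:
            paragraphs.append(tokens)
    return paragraphs


def unexpected_chars(paragraphs, lang):
    """Return the set of characters that are not in the alphabet for lang."""
    allowed = ALPHABET[lang]
    found = set()
    for tokens in paragraphs:
        for tok in tokens:
            if tok != PLACEHOLDER:
                found |= set(tok) - allowed
    return found


def head_tail_overlap(n_full, n_head, n_tail):
    """Number of tokens shared by a head of n_head and a tail of n_tail tokens."""
    return max(0, n_head + n_tail - n_full)


if __name__ == "__main__":
    sample = "dom casmurro\numa noite destas\n\n* capítulo primeiro"
    paras = read_paragraphs(sample)
    print(len(paras), "paragraphs,", sum(len(p) for p in paras), "tokens")
    print("unexpected characters:", unexpected_chars(paras, "port"))
    # Approximate counts from the post: the overlaps come out as 10500 and 7000.
    print(head_tail_overlap(65_500, 38_000, 38_000))
    print(head_tail_overlap(69_000, 38_000, 38_000))
```

To apply it to a real sample, read the file with open(..., encoding="utf-8") and pass its contents to read_paragraphs. An empty set from unexpected_chars means the file uses only the characters listed above.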