[quote="quimqu" pid='72165' dateline='1761345334']I’ve been working on the Voynich Manuscript using machine learning and data science.[/quote] Ugh! But you did get some interesting results in spite of these handicaps... :D [quote]each word (token) becomes a node, and we draw edges between them whenever they occur near each other in the text. using a sliding window of 5 tokens. Once you have the graph, you can study things like community structure, modularity, and assortativity[/quote] I suppose you consider how often the pair appears, not just yes/no? The VMS has ~39'000 tokens (word occurrences) and ~7'000 lexemes ("word types", distinct words). With a 5-token window you will collect about 4 undirected pairs per token (or 8 directed ones). That is ~156'000 undirected pairs, giving an average node degree of ~45 -- much less than the ~24 million edges of the complete graph, where each node has degree ~7000. Therefore much of the structure of the graph will be an artifact of random sampling. That is, even if chedal could occur next to 1000 different lexemes with equal frequency, the graph could show only a dozen of them, at random. The structure of truly random graphs has been studied a lot, but since the tokens have very different frequencies, it is not clear how to apply those results to this situation. For that reason, I would consider only pairs u,v where the frequency of u-near-v is significantly higher than what one would expect if the tokens were chosen independently with the observed lexeme frequencies (that is, by a word markov of order 0). With that cutoff criterion, this simple gibberish should give an almost empty graph. No? And also there is the problem that Voynichese seems to have a smaller vocabulary than other "control" texts of the same size. Thus the graph of Voynichese will have fewer nodes, hence perhaps a higher average degree. This difference alone could create differences in the community structure etc. [quote]The same pattern appears when comparing character entropy (morphological freedom) [...] It might reflect a controlled or encoded version of natural language, or simply a writing system with its own conventions. It’s also interesting that the same language can produce very different results depending on the type of text[/quote] Indeed. Character-based statistics are extremely sensitive to the dialect and spelling/encoding system used. If one were to write German with "x" instead of "sch", and with "q" or "j" for the two sounds of "ch", the character statistics would be different from those of official German. Moreover, the frequency of characters and character tuples is dominated by their occurence in the most common words. The popularity of "t", "h", and "e" in English is due in great part to "the" being the most common in English prose. But "the" may be nearly absent in some "technical" books like herbals or pharmaceutical manuals. And if one were to spell that word (only) as "ye" or "d'", the frequencies of "t", "e", "th" etc would measurably drop. [quote]English (Culpepper)[/quote] The name is "peper" not "pepper". I know because I made the same mistake... All the best, --stolfi