The tables below show the number of times each pair of words occurs in consecutive positions in the Biological section of the Voynich manuscript.
As usual, the data provides lots of tantalizing patterns, but no definite conclusions. At least, the patterns strongly suggest that the text is actually natural language, and not random garbage.
One obvious pattern in the table is that words with similar structure seem to have similar neighbor distributions.
Also, some words are unexpectedly common at the end of lines (just before the "//") or at the beginning (just after the "//"), while other words seem to avoid those positions. I suspect this effect arises because many line ends are also paragraph ends, and hence sentence ends. Unfortunately, many paragraph breaks seem to have been omitted in the Currier and FSG transcripts, so this data is particularly noisy.
Guesses, anyone?
The counts were obtained from the entire Biological section of the VMs (f75r--f84v), which is in Currier's "Language B". The version used was a mechanical stroke-level "consensus" of the Currier and FSG transcriptions.
The text had 7054 words, including end-of-line marks "//" and end-of-paragraph marks "=". Words with invalid characters, and words that were transcribed differently by Currier and Friedman, were mapped to the special word "???".
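The "???" rule can be illustrated with a small Python sketch. This is a simplification, not the procedure actually used: the real consensus was built at the stroke level, whereas this sketch assumes the two transcriptions are already aligned word by word and converted to the same encoding. The function name `consensus_word` and the `ALPHABET` set are assumptions made for illustration.

```python
UNREADABLE = "???"
MARKS = {"//", "="}  # end-of-line and end-of-paragraph marks

# Assumed set of valid characters, taken from the symbols that appear
# in the correspondence tables on this page; the real set may differ.
ALPHABET = set("8HPacegijkmnoqruwz")


def consensus_word(currier, fsg, alphabet=ALPHABET):
    """Return the agreed word, or "???" on disagreement or invalid characters."""
    if currier != fsg:
        return UNREADABLE          # the two transcriptions differ
    if currier in MARKS:
        return currier             # line and paragraph marks pass through
    if not currier or any(ch not in alphabet for ch in currier):
        return UNREADABLE          # empty word or invalid character
    return currier


# Toy input, not real transcription data.
pairs = [("zcc8a", "zcc8a"), ("zcc8a", "zcc8o"), ("zc*8a", "zc*8a"), ("//", "//")]
merged = [consensus_word(a, b) for a, b in pairs]
print(merged)
```

Mapping every doubtful word to a single placeholder keeps the word positions intact, so the neighbors of an unreadable word are still counted against "???" rather than against each other.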
The text was encoded with an ad-hoc stroke-level encoding (yes, ANOTHER one) that merges some easily confused letters and ignores differences which (I believe) are just calligraphic variations.
The encoding is basically the Frogguy alphabet, with the following changes:
Frogguy      9    2    4    x    a    s    e    e'   t    iiiv iiv  iv
-----------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
This table   a    r    q    e    a    z    c    z    c    m    m    n

Frogguy      ig   ir   qp   lp   dj   fj   eQPt eLPt eDJt eFJt cg   &
-----------  ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
This table   k    w    H    H    P    P    cHc  cHc  cPc  cPc  cj   ig
Here is the rough correspondence between my encoding and the original FSG encoding:
Table   FSG
-----   ---
8       8
H       H, D
P       P, F
a       A, G, CI
am      AM, AIN, CIIIL
an      AN, CM
ar      AR, CIR
aw      AIR, CIIR
c       C
cHc     HZ, DZ
cPc     PZ, FZ
ca      CA, CG, CCI, TI
cc      T, CC
cj      6
e       E
i       I
ig      7
iu      L
k       K
m       M
n       N
o       O
q       4
r       R
w       IR
z       2
za      2A, 2G, 2CI, SI
zc      S, 2C
Since the correspondence is (on purpose) ambiguous, I cannot easily map the table back to the FSG encoding.
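A short Python sketch shows how quickly the ambiguity grows. It uses only an excerpt of the correspondence listed above, and it assumes the word has already been split into table units, which is itself not always unambiguous. The name `fsg_candidates` is hypothetical.

```python
from itertools import product

# Excerpt of the Table -> FSG correspondence given above.
TABLE_TO_FSG = {
    "a": ["A", "G", "CI"],
    "cc": ["T", "CC"],
    "zc": ["S", "2C"],
    "8": ["8"],
}


def fsg_candidates(units):
    """List every FSG spelling consistent with a sequence of table units."""
    choices = [TABLE_TO_FSG[u] for u in units]
    return ["".join(combo) for combo in product(*choices)]


# One table-encoded word, pre-split into units (an assumption of this sketch).
candidates = fsg_candidates(["zc", "cc", "a"])
print(len(candidates), candidates)
```

Each unit multiplies the number of possible FSG spellings by the number of FSG symbols it stands for, so even a three-unit word here has 2 × 2 × 3 = 12 candidates.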
The whole word-pair frequency table would be huge (about 850 by 850), but quite sparse (fewer than 7000 non-zero entries). To keep the output small but readable, I partitioned the vocabulary into a small set of frequently occurring key words, and a large set of non-keys.
The word-pair frequency table was then split into four sections: key × key, key × non-key, non-key × key, and non-key × non-key. Only the first three sections were computed and printed; the first has about 25 × 25 entries, and the other two have about 25 × 830 entries.
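The four-way split can be sketched in Python as follows. This is an illustration only, not the script used for this page; `split_sections` is a hypothetical name, the counts are made up, and "nk1" and "nk2" are placeholder non-key words.

```python
from collections import Counter, defaultdict


def split_sections(pair_counts, keys):
    """Group (first, second) pair counts into the four key/non-key sections."""
    sections = defaultdict(Counter)
    for (first, second), n in pair_counts.items():
        row_kind = "key" if first in keys else "non-key"
        col_kind = "key" if second in keys else "non-key"
        sections[(row_kind, col_kind)][(first, second)] = n
    return sections


# Made-up counts; "nk1" and "nk2" stand for arbitrary non-key words.
pair_counts = {
    ("qoe", "zcc8a"): 3,
    ("qoe", "nk1"): 1,
    ("nk1", "qoe"): 2,
    ("nk1", "nk2"): 5,
}
sections = split_sections(pair_counts, keys={"qoe", "zcc8a"})

# Only the three sections that involve at least one key word are kept.
printed = {name: dict(c) for name, c in sections.items()
           if name != ("non-key", "non-key")}
print(printed)
```

Dropping the non-key × non-key section discards the largest and sparsest part of the table, which is what keeps the output manageable.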
Here is the list of key words used for this run:

    // qoe zcccHca oHam zcc8a eccc8a zccca oHar ccc8a oHcc8a cccca qoHan oe zccc8a zam qoHam oHc8a ccca 8am qoHar qoHcc8a zcca 8ar qoHcca qoHc8a cccHca 8ae oHcca qoHa zccHca oHae or qoHae ccccHca
To build the tables, I used the following Unix commands:
    cat infile.wds \
      | gawk \
          ' BEGIN { want = "="; } \
            /./ { print want, $0; want = $0; next; } \
          ' \
      | count-diword-freqs \
          -v rows=nonkeys.dic \
          -v cols=keys.dic \
The file infile.wds contains the input text, one word per line.
The command count-diword-freqs is another gawk script that counts the occurrences of each pair, and prints the formatted table to stdout.
The auxiliary files keys.dic and nonkeys.dic contain the two word sets, one word per line.
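For readers without gawk, the pairing and counting steps can be approximated in Python. This is a stand-in sketch, not a port of count-diword-freqs, whose source is not shown here. It assumes that `rows` selects the first word of each pair and `cols` the second; the function name `count_pairs` and the toy word list are made up for illustration.

```python
from collections import Counter


def count_pairs(words, rows, cols, first="="):
    """Count consecutive pairs (prev, cur) with prev in rows and cur in cols."""
    counts = Counter()
    prev = first  # like the gawk step, pair the first word with "="
    for cur in words:
        if not cur:
            continue  # skip empty lines, as the /./ pattern does
        if prev in rows and cur in cols:
            counts[(prev, cur)] += 1
        prev = cur
    return counts


# Toy input, one word per entry; not real manuscript text.
words = ["qoe", "zcc8a", "//", "qoe", "zcc8a", "oHam", "="]
keys = {"qoe", "zcc8a", "//"}
nonkeys = {"oHam", "="}

table = count_pairs(words, rows=nonkeys, cols=keys)
key_by_key = count_pairs(words, rows=keys, cols=keys)
print(dict(table))
print(dict(key_by_key))
```

Calling the function once per section (non-key rows with key columns, key rows with key columns, and so on) mirrors running the pipeline with different rows and cols files.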
97-08-08: Computed the tables and created these HTML pages.
97-12-10: Reorganized the HTML text, moving each table to a separate file. No substantial changes.