Hacking at the Voynich manuscript - Side notes
030 Combining the p-m-s and OKOKOKO paradigms
Last edited on 1998-09-21 01:50:21 by stolfi
This note attempts to rederive the p-m-s paradigm
in light of the OKOKOKO decomposition.
First, let's get the word lists for each section:
set sections = ( `cat ../023/text-sections/all.names` )
set seccom = `echo ${sections} | tr ' ' ','`
mkdir data
foreach sec ( $sections )
echo "${sec}"
mkdir data/${sec}
cat ../023/text-sections/${sec}.evt \
| words-from-evt \
| tr '*' '?' \
> data/${sec}/words.wds
end
mkdir data/all
cat data/{${seccom}}/words.wds > data/all/words.wds
mkdir data/ren
cat Rene-words.frq \
| gawk '/./{n=$1;for(i=0;i<n;i++){print $2;}}' \
> data/ren/words.wds
dicio-wc data/{${seccom},ren,all}/words.wds
lines words bytes file
------- ------- --------- ------------
2210 2210 13427 data/unk/words.wds
2211 2210 13467 data/pha/words.wds
10358 10356 66804 data/str/words.wds
7584 7584 44680 data/hea/words.wds
3338 3336 20045 data/heb/words.wds
6539 6539 37738 data/bio/words.wds
324 324 2089 data/ast/words.wds
389 387 2264 data/cos/words.wds
169 169 1018 data/zod/words.wds
28939 28939 172850 data/ren/words.wds
33122 33115 201532 data/all/words.wds
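The gawk one-liner above expands a frequency list (count and word per line) into one word per occurrence, i.e. the inverse of "sort | uniq -c". A self-contained run on made-up counts:

```shell
# Expand a "count word" frequency list into one word per occurrence,
# as done above for Rene-words.frq (sample counts are made up):
printf '   2 daiin\n   1 chedy\n' \
  | awk '/./{n=$1; for(i=0;i<n;i++){print $2;}}'
```

Piping the result back through "sort | uniq -c" recovers the original list.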
Factor the words into elements, and set aside the unreadable words
and the parsing bugs:
foreach sec ( $sections ren all )
echo ${sec}
cat data/${sec}/words.wds \
| factor-OK \
> data/${sec}/words.fac
cat data/${sec}/words.fac \
| egrep -e '^{[{}a-z?]*}$' \
> data/${sec}/words-gut.fac
cat data/${sec}/words.fac \
| egrep -v -e '^{[{}a-z?]*}$' \
> data/${sec}/words-bad.fac
egrep -e '^[^{]' data/${sec}/words-gut.fac | head -5
egrep -e '[^}]$' data/${sec}/words-gut.fac | head -5
egrep -e '[}][^{]' data/${sec}/words-gut.fac | head -5
egrep -e '[^}][{]' data/${sec}/words-gut.fac | head -5
end
dicio-wc data/{${seccom},ren,all}/words-{gut,bad}.fac
lines words bytes file
------- ------- --------- ------------
2190 2190 30362 data/unk/words-gut.fac
20 20 279 data/unk/words-bad.fac
2112 2112 28599 data/pha/words-gut.fac
99 98 1439 data/pha/words-bad.fac
10311 10311 149736 data/str/words-gut.fac
47 45 678 data/str/words-bad.fac
7532 7532 98697 data/hea/words-gut.fac
52 52 721 data/hea/words-bad.fac
3318 3318 45009 data/heb/words-gut.fac
20 18 270 data/heb/words-bad.fac
6530 6530 85290 data/bio/words-gut.fac
9 9 114 data/bio/words-bad.fac
312 312 4495 data/ast/words-gut.fac
12 12 201 data/ast/words-bad.fac
376 376 4954 data/cos/words-gut.fac
13 11 174 data/cos/words-bad.fac
162 162 2240 data/zod/words-gut.fac
7 7 104 data/zod/words-bad.fac
28939 28939 389191 data/ren/words-gut.fac
0 0 0 data/ren/words-bad.fac
32843 32843 449382 data/all/words-gut.fac
279 272 3980 data/all/words-bad.fac
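factor-OK is a local script, not shown in this note. Assuming it does greedy longest-prefix factoring of each word over the element dictionary (my reading of the outputs above; the real list is elem.dic), here is a minimal sketch using a made-up subset of the elements:

```shell
# Greedy longest-prefix factoring of a word into bracketed elements.
# The element list here is a small made-up subset of elem.dic,
# ordered longest first so that e.g. "ke" wins over "k".
echo qokeedy | awk '{
  n = split("iin che she ee ke te ch sh in q o k e d y a l r s t", E, " ")
  w = $0; out = ""
  while (length(w) > 0) {
    for (i = 1; i <= n; i++)
      if (index(w, E[i]) == 1) {        # element matches at the front
        out = out "{" E[i] "}"; w = substr(w, length(E[i]) + 1); break
      }
    if (i > n) { out = out w; break }   # unfactorable remainder: give up
  }
  print out                             # -> {q}{o}{ke}{e}{d}{y}
}'
```

The real script presumably also flags words it cannot factor; those are what the words-bad.fac files above collect.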
Count the elements:
foreach sec ( $sections ren all )
echo "${sec}"
cat data/${sec}/words-gut.fac \
| tr '{}' '\012\012' \
| egrep '.' \
| sort | uniq -c | expand \
| sort -b +0 -1nr \
> data/${sec}/elem.cts
end
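The counting pipeline above, run on two factored words given inline: braces become newlines, blank lines are dropped, then "sort | uniq -c" counts each element (modern sort spells the old "+0 -1nr" as "-k1,1nr"):

```shell
# Count elements in a stream of factored words, most frequent first.
printf '{q}{o}{ke}{d}{y}\n{o}{ke}{d}{y}\n' \
  | tr '{}' '\n\n' \
  | grep . \
  | sort | uniq -c \
  | sort -k1,1nr
```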
multicol \
-v titles="all ren $sections" \
data/{all,ren,$seccom}/elem.cts \
> elem.multi
compare-counts \
-titles "all ren $sections element" \
-remFreqs \
-sort 1 \
data/{all,ren,$seccom}/elem.cts \
> elem.cmp
all ren unk pha str hea heb bio ast cos zod element
--- --- --- --- --- --- --- --- --- --- --- -------
834 826 846 773 848 800 864 847 828 852 798 o
712 699 728 676 736 689 719 700 700 725 711 y
615 602 600 600 622 611 609 622 608 603 572 a
524 510 512 524 536 532 488 516 523 527 516 d
452 440 437 435 467 470 437 424 482 435 431 l
394 380 382 389 402 425 373 362 436 388 394 k
346 336 336 349 363 333 327 342 379 343 356 ch
300 291 276 298 319 280 278 308 326 297 296 r
259 248 246 262 274 253 251 242 311 271 294 q
225 215 209 228 235 215 217 226 286 234 269 iin
192 181 168 215 201 171 186 200 253 183 193 t
161 151 145 184 163 154 155 165 223 159 177 che
131 119 128 149 119 142 134 131 171 135 123 ee
113 104 105 136 107 112 119 117 159 106 110 sh
97 89 92 113 96 89 104 102 123 89 95 s
82 74 77 101 82 83 90 74 110 74 80 she
70 61 69 77 71 78 73 58 90 65 72 ke
60 54 56 71 58 70 62 51 81 57 63 p
51 41 53 67 47 63 56 33 78 54 55 in
44 35 43 63 40 54 47 30 72 42 49 m
37 28 36 57 33 51 37 22 60 34 20 te
31 23 31 53 31 35 34 18 56 29 19 cth
26 18 28 44 27 29 27 12 51 26 19 ckh
22 14 21 41 20 27 23 11 43 21 19 ir
19 12 19 39 16 25 21 10 39 18 12 eee
17 11 15 36 14 23 15 9 38 18 12 f
15 10 13 34 12 20 14 9 31 17 9 oa
13 9 11 30 10 18 11 7 21 11 6 e?
11 7 10 23 9 17 10 5 20 10 6 ckhe
10 6 9 20 8 14 8 5 16 8 4 cthe
8 5 7 19 7 12 7 4 14 8 4 cph
7 5 6 16 7 10 6 4 11 7 4 oy
6 4 6 16 6 7 6 4 11 7 4 n
6 3 5 15 5 6 5 4 8 6 3 iir
5 2 5 14 5 5 4 3 8 6 3 iiin
4 2 5 13 4 5 4 2 6 6 3 i?
4 2 3 12 3 4 3 2 6 6 3 cphe
3 2 3 10 3 4 3 2 6 5 3 oo
3 1 1 10 3 3 2 2 6 5 3 cfh
3 1 1 10 2 3 2 2 5 5 3 im
2 1 1 9 2 2 1 2 5 3 3 yo
2 1 1 8 1 2 1 2 5 3 3 de
2 1 1 5 1 2 1 2 4 3 3 iiir
1 1 1 5 1 2 1 1 3 3 3 il
1 1 1 2 1 2 1 1 0 3 3 j
1 1 0 2 1 2 0 1 0 3 3 x
1 1 0 2 0 2 0 1 0 3 3 is
1 1 0 2 0 2 0 1 0 2 3 ya
1 1 0 1 0 1 0 1 0 2 3 cfhe
1 1 0 1 0 1 0 1 0 1 . ay
0 1 0 0 0 1 0 1 0 1 . ao
0 1 0 0 0 1 0 1 0 1 . iim
0 0 0 0 0 1 0 1 0 1 . g
0 0 0 0 0 1 0 1 0 1 . ck
0 0 0 0 0 0 0 1 0 1 . iil
0 0 0 0 0 0 0 1 0 1 . ct
0 0 0 0 0 0 . 1 0 1 . id
0 0 0 0 0 0 . 1 0 1 . iid
0 0 0 0 0 0 . 0 0 1 . cthh
0 0 0 0 0 0 . 0 0 1 . b
0 0 0 0 0 0 . 0 0 1 . cphh
0 0 0 0 0 0 . 0 0 1 . ikh
0 0 0 0 0 0 . 0 0 1 . aa
0 0 0 0 0 0 . 0 0 1 . c?
0 0 0 0 0 0 . 0 0 1 . iis
0 0 0 0 0 0 . 0 0 1 . yoa
0 0 0 0 0 0 . 0 0 1 . iiil
0 0 0 0 0 0 . 0 0 1 . ikhe
0 0 0 0 0 0 . 0 0 1 . aoy
0 0 0 0 0 0 . 0 0 1 . cf
0 0 0 0 0 0 . 0 0 1 . chh
0 0 0 0 0 0 . 0 0 1 . cp
0 0 0 0 0 0 . 0 0 1 . h?
0 0 0 0 . 0 . 0 0 1 . iiid
0 0 0 0 . 0 . 0 0 0 . ij
0 0 0 0 . 0 . 0 0 0 . iph
0 0 0 0 . 0 . 0 0 0 . ith
0 0 0 0 . 0 . 0 0 0 . ithe
0 0 0 0 . 0 . . 0 0 . ithh
0 0 0 0 . 0 . . 0 0 . oao
0 0 0 0 . 0 . . 0 0 . ooa
0 0 0 0 . 0 . . . 0 . ooooooooo
0 0 0 0 . 0 . . . 0 . oya
0 0 0 . . 0 . . . 0 . pe
0 0 0 . . . . . . 0 . u
0 0 . . . . . . . 0 . yay
. 0 . . . . . . . . . yy
. 0 . . . . . . . . . cfhh
. 0 . . . . . . . . . ckhh
. 0 . . . . . . . . . kh
. . . . . . . . . . . v
Check whether our list of elements is complete:
cat data/{all,ren}/elem.cts | gawk '/./{print $2}' | sort | uniq > .bar
cat elem.dic | sort > .foo
bool 1-2 .bar .foo
cat elem-to-class.tbl | gawk '/./{print $1}' | sort > .baz
bool 1-2 .baz .foo
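bool is a local set-calculus script; "1-2" is presumably set difference. The same completeness check can be written with the standard comm (both inputs must be sorted):

```shell
# Elements that occur in the counts but are missing from the dictionary:
# comm -23 prints lines unique to the first (sorted) file.
printf 'ch\nee\nzz\n' > .bar     # elements actually seen (made-up sample)
printf 'ch\nee\nsh\n' > .foo     # contents of elem.dic (made-up sample)
comm -23 .bar .foo               # -> zz
rm -f .bar .foo
```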
Now let's enumerate all pairs of non-empty elements,
consecutive and non-consecutive, in each word:
foreach ptpn ( sep.0 con.1 )
set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
foreach sec ( $sections ren all )
echo "Enumerating ${ptp} element pairs for ${sec}..."
cat data/${sec}/words-gut.fac \
| nice enum-elem-pairs -v consecutive=${pfl} \
| tr -d '{}' \
| sort | uniq -c | expand \
| gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
| sort +0 -1nr +1 -2 \
> data/${sec}/elem-${ptp}-pair.cts
end
multicol \
-v titles="all ren ${sections}" \
data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
> elem-${ptp}-pair.multi
compare-counts \
-titles "all ren $sections pair" \
-freqs \
-sort 1 \
data/{all,ren,$seccom}/elem-${ptp}-pair.cts \
> elem-${ptp}-pair.cmp
end
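enum-elem-pairs is a local script as well; from the loop above, consecutive=1 ("con") presumably emits each pair of adjacent elements, and consecutive=0 ("sep") every ordered pair in the word. A sketch of the "con" case on one factored word:

```shell
# Adjacent ("con") element pairs of a factored word, one "a:b" per line;
# the "sep" case would instead print E[i] ":" E[j] for every i < j.
echo '{q}{o}{ke}{d}{y}' | awk '{
  gsub(/[{}]/, " ")                 # braces to spaces
  n = split($0, E, " ")             # elements into E[1..n]
  for (i = 1; i < n; i++) print E[i] ":" E[i+1]
}'
```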
Tabulate the element pairs, collapsing elements into classes:
foreach ptp ( sep con )
foreach sec ( ${sections} ren all )
echo "=== ${ptp} pairs for ${sec} ========================"
cat data/${sec}/elem-${ptp}-pair.cts \
| tr ':' ' ' \
| map-fields \
-v table=elem-to-class.tbl \
-v fields="2,3" \
| gawk '/./{printf "%7d %s:%s\n", $1,$2,$3;}' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-${ptp}-pair.cts
foreach ttpn ( freqs.3 counts.5 )
set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
cat data/${sec}/class-${ptp}-pair.cts \
| tr ':' ' ' | gawk '/./{print $1,"*",$2,$3;}' \
| tabulate-triple-counts \
-v rows=elem-classes.dic \
-v cols=elem-classes.dic \
-v ${ttp}=1 -v digits=${dig} \
> data/${sec}/class-${ptp}-pair.${ttp}
end
end
end
Here is a typical "sep" table, for the "bio" section:
Pairs with key = *
Pair probabilities (×999):
--- --- --- --- --- --- --- --- --- --- --- -----
      Q   O   S   D   X   H   N   I   W ETC   TOT
--- --- --- --- --- --- --- --- --- --- --- -----
Q . 79 13 14 11 36 . 7 . . 164
O . 76 84 27 25 61 2 36 1 . 317
S . 35 11 11 14 8 . 3 . . 86
D . 68 9 2 2 . . 4 . . 88
X . 83 12 42 8 12 . . . . 161
H . 86 20 29 24 . . 14 . . 177
N . . . . . . . . . . 0
I . . . . . . . . . . 0
W . 1 . . . . . . . . 3
ETC . . . . . . . . . . 0
--- --- --- --- --- --- --- --- --- --- --- -----
TOT . 432 151 128 88 121 5 67 4 . 999
Note that the classes H and X are rarely preceded by D
but often followed by it. I suppose most of these
cases are final "dy"s.
Now let's extract all subsequences of three non-empty
elements from each word:
foreach ptpn ( sep.0 )
set ptp = "${ptpn:r}"; set pfl = "${ptpn:e}"
foreach sec ( $sections ren all )
echo "Enumerating ${ptp} element triples for ${sec}..."
cat data/${sec}/words-gut.fac \
| nice enum-elem-triples -v consecutive=${pfl} \
| tr -d '{}' \
| sort | uniq -c | expand \
| gawk '/./{printf "%7d %s:%s:%s\n", $1,$2,$3,$4;}' \
| sort +0 -1nr +1 -2 \
> data/${sec}/elem-${ptp}-triple.cts
end
multicol \
-v titles="all ren ${sections}" \
data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
> elem-${ptp}-triple.multi
compare-counts \
-titles "all ren $sections triple" \
-freqs \
-sort 1 \
data/{all,ren,$seccom}/elem-${ptp}-triple.cts \
> elem-${ptp}-triple.cmp
end
Tabulate the triples sliced by their middle element, first collapsing the elements into classes:
foreach ptp ( sep )
foreach sec ( ${sections} ren all )
echo "=== ${ptp} triples for ${sec} ========================"
cat data/${sec}/elem-${ptp}-triple.cts \
| tr ':' ' ' \
| map-fields \
-v table=elem-to-class.tbl \
-v fields="2,3,4" \
| gawk '/./{printf "%7d %s:%s:%s\n", $1, $2,$3,$4;}' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-${ptp}-triple.cts
foreach ttpn ( freqs.3 counts.5 )
set ttp = "${ttpn:r}"; set dig = "${ttpn:e}"
cat data/${sec}/class-${ptp}-triple.cts \
| tr ':' ' ' | gawk '/./{print $1,$3,$2,$4;}' \
| sort -b +1 -2 +0 -1nr \
| tabulate-triple-counts \
-v rows=elem-classes.dic \
-v cols=elem-classes.dic \
-v ${ttp}=1 -v digits=${dig} \
> data/${sec}/class-${ptp}-triple.${ttp}
end
end
end
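The only twist over the pair case is the gawk '{print $1,$3,$2,$4}' step, which moves the middle element in front of the other two, so that the tabulator emits one table per middle element:

```shell
# "count first middle last" -> "count middle first last":
echo '     12 q:o:ke' \
  | tr ':' ' ' \
  | awk '/./{print $1, $3, $2, $4}'   # -> 12 o q ke
```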
It seems that, if we ignore the O's and Q's, most words have a "midfix"
consisting of D, X, and H elements, with a prefix of S elements
and a suffix of S and D elements.
Let's add to the factored word tables a second column with the
element classes:
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/words-gut.fac \
| gawk \
' /./{ \
s = $1; \
gsub(/[{}]/, " ", s); gsub(/ */, " ", s); \
gsub(/^ */, "", s); gsub(/ *$/, "", s); \
printf "%s %s\n", $0, s; \
} ' \
| map-fields \
-v table=elem-to-class.tbl \
-v forgiving=1 \
| gawk '/./{ \
e=$1; $1=""; c=$0; \
gsub(/^ */,":",c); gsub(/ *$/,":",c); \
gsub(/ */, ":", c); \
printf "%s %s\n", c,e; \
} ' \
| sort | uniq -c | expand \
| sort -b +0 -1nr +1 -2 \
> data/${sec}/class-wds.cts
end
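The class column added above can be sketched with a toy elem-to-class table hard-wired into awk (the real mapping lives in elem-to-class.tbl; the entries below are a made-up fragment of it):

```shell
# Map each element of a factored word to its class and prepend the
# resulting :CLASS:CLASS:...: string to the word.
echo '{q}{o}{ke}{d}{y}' | awk '
  BEGIN { cls["q"] = "Q"; cls["o"] = "O"; cls["ke"] = "H"
          cls["d"] = "D"; cls["y"] = "O" }
  {
    w = $0; gsub(/[{}]/, " ", w)
    n = split(w, E, " "); c = ":"
    for (i = 1; i <= n; i++) c = c cls[E[i]] ":"
    print c, $0                     # -> :Q:O:H:D:O: {q}{o}{ke}{d}{y}
  }'
```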
Looking at Rene's words, it seems that most words have at most one H,
and that all X's are consecutive and adjacent to it (except for the
intrusion of "O"s in some languages).
Let's tabulate the patterns of H and X elements, after removing the
O elements and any prefix or suffix consisting of elements other
than H and X:
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| sed \
-e 's/[O]://g' \
-e 's/ :[A-GI-WY-Z:]*/ :/' \
-e 's/:[A-GI-WY-Z:]*$/:/' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-midf.cts
end
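To see what the sed stack does, here it is applied to one made-up class string: drop the O's, then strip any leading and any trailing run of classes other than H and X:

```shell
# Midfix extraction on a sample "count :classes:" line.
echo '12 :Q:O:S:X:H:O:D:' \
  | sed \
      -e 's/[O]://g' \
      -e 's/ :[A-GI-WY-Z:]*/ :/' \
      -e 's/:[A-GI-WY-Z:]*$/:/'    # -> 12 :X:H:
```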
multicol \
-v titles="all ren ${sections}" \
data/{all,ren,${seccom}}/class-midf.cts \
> class-midf.multi
compare-counts \
-titles "all ren $sections midfix" \
-remFreqs \
-sort 1 \
data/{all,ren,$seccom}/class-midf.cts \
> class-midf.cmp
Now let's tabulate the prefix, suffix, and unifix patterns,
omitting the Q's and O's:
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| egrep -v -e '[HX]:' \
| sed \
-e 's/[O]://g' \
-e 's/ :[Q:]*/ :/' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-unif.cts
end
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| egrep -e '[HX]:' \
| sed \
-e 's/[O]://g' \
-e 's/ :[Q:]*/ :/' \
-e 's/[HX]:.*//' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-pref.cts
end
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| egrep -e '[HX]:' \
| sed \
-e 's/[O]://g' \
-e 's/ :[Q:]*/ :/' \
-e 's/ :.*[HX]:/ :/' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-suff.cts
end
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| egrep -e '[HX]:' \
| sed \
-e 's/[O]://g' \
-e 's/:[A-PR-Z:]*$/:/' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-qhaf.cts
end
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| egrep -v -e '[HX]:' \
| sed \
-e 's/[O]://g' \
-e 's/:[A-PR-Z:]*$/:/' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> data/${sec}/class-qsof.cts
end
Let's make comparative tables:
foreach fix ( midf unif pref suff qhaf qsof )
multicol \
-v titles="all ren ${sections}" \
data/{all,ren,${seccom}}/class-${fix}.cts \
> class-${fix}.multi
compare-counts \
-titles "all ren ${sections} ${fix}ix-pattern" \
-freqs \
-sort 1 \
data/{all,ren,${seccom}}/class-${fix}.cts \
> class-${fix}.cmp
end
THE OKOKOKO AND PMS PARADIGMS COMBINED
To a first approximation, the Voynichese words can be decomposed into
the following "elements":
Q = { q }
O = { a o y } ("circles")
H = { t te cth cthe k ke ckh ckhe
p pe cph cphe f fe cfh cfhe } ("gallows")
X = { ch che sh she ee eee } ("tables")
R = { d l r s } ("dealers")
F = { n m g j in iin ir iir } ("finals")
W = { e i cthh ith kh ct iiim iir is ETC. } ("weirdos")
The "p" and "f" elements are almost certainly calligraphic variants
of the corresponding "t" and "k" elements.
There are two classes of words: the "hard" ones, which contain Hs
and/or Xs, and the "soft" ones, which don't.
Let's ignore the O's for the moment. The "hard" words have the form
Q^a R^b X^c H^d X^e R^f F^g
where
a = 0 (86%) or 1 (14%)
b = 0 (90%) or 1 ( 9%)
d = 0 (49%) or 1 (49%)
c+e = 0 (52%) or 1 (43%) or 2 (4%)
f = 0 (42%) or 1 (53%) or 2 (4%)
g = 0 (85%) or 1 (14%)
The "soft" words have the form
Q^w R^x F^y
where
w = 0 (95%) or 1 ( 5%)
x = 0 (12%) or 1 (58%) or 2 (22%) or 3 ( 2%)
y = 0 (55%) or 1 (40%)
The "soft" schema above can be interpreted as a special case of the
"hard" schema with no X or Hs (i.e. c+d+e = 0), although the probabilities
are somewhat different.
Said another way, the typical Voynichese word has a "midfix"
(kernel, root), possibly empty, consisting of at most one gallows
surrounded by at most two tables. To the midfix is attached
a prefix having at most one "q" and at most one dealer;
and a suffix with at most two dealers and at most one final.
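This schema is exactly the egrep pattern of the fit check that follows; here it is applied to two made-up class strings (O's already removed), one that fits and one that doesn't:

```shell
# The "hard"-word schema as an egrep pattern (copied from the fit check).
pat=' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$'
echo '12 :Q:S:X:H:D:N:' | egrep -e "$pat"                    # fits: line is printed
echo '46 :H:H:'         | egrep -e "$pat" || echo 'no fit'   # -> no fit
```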
Let's now check how many words fit this paradigm:
foreach sec ( ${sections} ren all )
echo "$sec"
cat data/${sec}/class-wds.cts \
| gawk '/./{print $1,$2}' \
| sed -e 's/[O]://g' \
> /tmp/.foo
cat /tmp/.foo \
| egrep -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> /tmp/.foogut
cat /tmp/.foo \
| egrep -v -e ' :(Q:|)([SD]:|)(X:X:H:|H:X:X:|(X:|)(H:|)(X:|))([SD]:|)([SD]:|)([NI]:|)$' \
| combine-counts | sort -b +0 -1nr +1 -2 \
> /tmp/.foobad
/bin/rm -f data/${sec}/class-fit.cts
cat /tmp/.foogut | sed -e 's/ :/ +:/' >> data/${sec}/class-fit.cts
cat /tmp/.foobad | sed -e 's/ :/ -:/' >> data/${sec}/class-fit.cts
end
This paradigm fits 97% of all words in Rene's list (with multiplicities)
and 94% of all words in the interlinear file. The remaining 6%
includes the words containing "wild" (W) elements (3% of all words)
and long words that look like two words joined together.
The most common patterns in the interlinear that do
not fit the paradigm and do not contain a wild element
are
46 :H:H:
44 :H:X:H:
38 :H:S:X:
32 :H:S:X:D:
26 :H:S:X:S:
20 :D:I:S:
19 :D:S:X:D:
19 :H:H:S:
19 :X:D:X:
18 :S:I:S:
18 :X:X:H:X:
15 :I:S:
15 :S:S:X:
15 :X:S:H:
14 :H:I:S:
13 :D:S:X:
12 :D:I:D:
12 :S:S:X:D:
where S = { s l r }, D = { d }, I = { in iin ir iir }, N = { n m g j }.