Hacking at the Voynich manuscript - Side notes
023 Characterizing Voynichese sub-languages by QOKOKOKO element freqs
Last edited on 1998-07-04 10:25:19 by stolfi

1998-06-20 stolfi
=================

[ First version done on 1998-05-04, now redone with fresher data. ]

In Note 021, I tried to classify pages according to the frequencies of
certain keywords.  John Grove pointed out that the transcription which
I used (Friedman's) has inconsistencies which may masquerade as
language differences, e.g. "dain" in place of "daiin", or vice versa.
Also, it seems that spacing (word division) is quite inconsistent.

So, in an attempt to avoid those problems, I thought of using, instead
of words, the "elements" of the QOKOKOKO paradigm; see Notes 017 and
018.  Since I am still not clear on how to group the O's with the K's
(with the following K, with the preceding K, with both, or with
neither), I will leave them as separate elements.  Also, for simplicity
(without any conviction at all), I will split every double-letter O
into two elements.

Also, given Grove's observations on anomalous "p" and "t" distributions
at beginning-of-line, and the well-known attraction of certain elements
for end-of-line, it seems advisable to discard the first few and last
few elements of every line.

I. EXTRACTING AND COUNTING ELEMENTS

We will prepare two sets of statistics, one using the raw elements
("RAW") and one using element equivalence classes ("EQV").

  elem-to-class -describe

    element equivalence:
      map_ee_to_ch
      ignore_gallows_eyes
      join_ei
      equate_aoy
      collapse_ii
      equate_eights
      equate_pt
      erase_word_spaces
      append_tilde

This mapping will hopefully reduce transcription and sampling noise.

Factoring the text into elements:

  mkdir -p RAW EQV
  /bin/rm -rf {RAW,EQV}/efreqs
  mkdir -p {RAW,EQV}/efreqs

  foreach utype ( pages sections )
    foreach f ( `cat text-${utype}/all.names` )
      cat text-${utype}/${f}.evt \
        | lines-from-evt | egrep '.' \
        | factor-OK | egrep '.' \
        > /tmp/${utype}-${f}.els
    end
  end

Counting elements and computing relative frequencies:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    /bin/rm -rf ${etag}/efreqs
    mkdir -p ${etag}/efreqs
    foreach utype ( pages sections )
      set frdir = "${etag}/efreqs/${utype}"
      mkdir -p ${frdir}
      cp -p text-${utype}/all.names ${frdir}/
      foreach f ( `cat text-${utype}/all.names` )
        echo ${frdir}/$f.frq
        cat /tmp/${utype}-${f}.els \
          | trim-three-from-ends \
          | tr '{}' '\012\012' \
          | ${ecmd} | egrep '.' \
          | sort | uniq -c | expand \
          | sort -b +0 -1nr \
          | compute-freqs \
          > ${frdir}/${f}.frq
      end
    end
  end

Computing total frequencies:

  foreach ep ( cat.RAW elem-to-class.EQV )
    set etag = ${ep:e}; set ecmd = ${ep:r}
    foreach utype ( pages sections )
      set fmt = "${etag}/efreqs/${utype}/%s.frq"
      set frfiles = ( \
        `cat text-${utype}/all.names | gawk '/./{printf "'"${fmt}"'\n",$0;}'` \
      )
      echo ${frfiles}
      cat ${frfiles} \
        | gawk '/./{print $1, $3;}' \
        | combine-counts \
        | sort -b +0 -1nr \
        | compute-freqs \
        > ${etag}/efreqs/${utype}/tot.frq
    end
  end
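The helper scripts used above (lines-from-evt, factor-OK,
trim-three-from-ends, compute-freqs, combine-counts) are not listed
here.  Judging from its input (the "COUNT ELEM" lines produced by
"uniq -c") and from the "print $1, $3" extraction applied to the .frq
files elsewhere in this note, compute-freqs is assumed to turn each
"COUNT ELEM" line into a "COUNT FREQ ELEM" line, where FREQ is the
relative frequency.  A minimal gawk sketch of that assumed behavior:

  #! /usr/bin/gawk -f
  # Sketch only (assumed behavior of compute-freqs): read "COUNT ELEM"
  # lines, write "COUNT FREQ ELEM" lines, where FREQ is COUNT divided
  # by the sum of all counts.  Needs two passes over the data, so the
  # input is buffered in memory.
  /./ {
    n++; cnt[n] = $1; elem[n] = $2; tot += $1;
  }
  END {
    for (i = 1; i <= n; i++) {
      printf "%7d %8.6f %s\n", cnt[i], cnt[i]/tot, elem[i];
    }
  }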
II. TABULATING ELEMENT FREQUENCIES PER SECTION

  set sectags = ( `cat text-sections/all.names` )
  echo $sectags

  foreach etag ( RAW EQV )
    tabulate-frequencies \
      -dir ${etag}/efreqs/sections \
      -title "elem" \
      tot ${sectags}
  end

Elements sorted by frequency (× 99), per section:

  tot     unk     pha     str     hea     heb     bio     ast     cos     zod
  ------- ------- ------- ------- ------- ------- ------- ------- ------- -------
  17 o 16 o 23 o 15 o 20 o 14 y 15 o 18 o 16 o 20 o 12 y 12 a 9 y 11 y 11 y 14 o 15 y 13 y 13 y 13 a 9 a 11 y 8 l 11 a 9 ch 11 d 10 d 9 a 11 a 8 y 8 d 8 d 6 a 8 d 7 a 10 a 8 l 7 d 9 l 8 l 7 l 7 l 6 d 6 l 7 d 6 k 7 a 5 ch 6 d 7 t 5 k 5 r 4 r 6 k 6 l 5 l 6 q 5 ee 4 ch 5 r 4 ch 5 k 4 k 4 q 5 r 4 ch 6 k 5 r 4 k 5 d 4 r 4 ch 4 ch 4 ee 4 k 4 r 3 ee 4 k 4 r 5 ee 4 q 3 t 3 q 4 r 3 t 3 iin 3 che 3 s 4 t 3 ch 3 ee 3 iin 3 ee 3 iin 3 ee 3 iin 3 iin 3 che 3 r 3 l 3 iin 3 k 3 iin 3 q 3 che 3 ch 2 sh 2 q 2 she 3 t 2 sh 2 te 3 t 2 che 3 iin 3 che 2 q 2 t 2 t 2 che 2 q 2 iin 3 che 2 sh 2 ke 3 t 2 s 2 ee 2 ch 2 iin 2 ee 1 s 1 sh 1 ee 1 s 1 she 1 che 1 ke 1 in 2 ke 2 che 1 che 1 she 1 she 1 ? 1 sh 1 cth 1 sh 1 ke 1 q 1 s 1 sh 1 ke 1 p 1 sh 1 ke 1 ee 1 she 1 iin 1 ? 1 she 1 she 1 s 1 s 1 she 1 in 0 she 1 s 1 sh 1 e? 0 ke 0 p 0 in 0 ke 1 t 0 p 0 p 0 te 1 s 1 she 0 e? 0 ? 0 p 0 te 0 ckh 0 s 0 ckh 0 p 0 te 1 te 0 p 0 in 0 te 0 ir 0 ckhe 0 te 0 in 0 ckh 0 ckh 1 sh 0 te 0 ke 0 cth 0 cth 0 te 0 ir 0 m 0 f 0 p 0 p 0 ? 0 eee 0 ckh 0 f 0 p 0 eee 0 ke 0 in 0 cth 0 ir 0 cth 0 e? 0 ir 0 in 0 e? 0 ckh 0 cph 0 ir 0 ckhe 0 cth 0 in 0 m 0 ? 0 m 0 iiir 0 cth 0 te 0 m 0 f 0 ckh 0 m 0 cth 0 eee 0 ckh 0 in 0 f 0 cthe 0 cth 0 eee 0 eee 0 ckh 0 cthe 0 f 0 cph 0 cth 0 e? 0 f 0 e? 0 e? 0 cthe 0 eee 0 iir 0 m 0 eee 0 cthe 0 ckhe 0 ? 0 cthe 0 ir 0 in 0 cthe 0 q 0 e? 0 e? 0 ir 0 ? 0 ckhe 0 ? 0 cthe 0 iir 0 ir 0 ckhe 0 ? 0 f 0 m 0 e? 0 ckhe 0 iiin 0 i? 0 iir 0 cthe 0 cfh 0 m 0 iir 0 eee 0 eee 0 h? 0 il 0 cfh 0 cph 0 cthe 0 eee 0 i? 0 ir 0 iir 0 cph 0 j 0 ckhe 0 iir 0 cphe 0 j 0 cthe 0 n 0 iiin 0 cphe 0 ckhe 0 ij 0 iiin 0 ckhe 0 cphe 0 iiin 0 cfh 0 cphe 0 ? 0 cph 0 n 0 iir 0 cph 0 cph 0 cphe 0 cph 0 ck 0 f 0 i? 0 im 0 iiin 0 il 0 i? 0 cfhe 0 ikh 0 im 0 cphe 0 n 0 iir 0 n 0 iiin 0 i? 0 il 0 m 0 cfh 0 i? 0 i? 0 cphe 0 iir 0 im 0 m 0 iiir 0 x 0 cfhe 0 de 0 ct 0 cfh 0 n 0 de 0 iil 0 de 0 x 0 de 0 n 0 cfh 0 il 0 il 0 is 0 is 0 im 0 x 0 de 0 im 0 n 0 im 0 cfhe 0 de 0 i? 0 cfhe 0 id 0 cfhe 0 b 0 id 0 is 0 x 0 pe 0 cfh 0 ck 0 ith 0 is 0 iil 0 id 0 c? 0 j 0 iid 0 iil 0 iir 0 h? 0 id 0 cf 0 b 0 ck 0 iiid 0 g 0 cp 0 ct 0 iiil 0 h? 0 ct 0 iil 0 iim 0 iiil 0 g 0 id 0 iis 0 iiir 0 iil

I have compared these counts with those obtained by removing two, one,
or zero elements from each line end.  The conclusion is that the
ordering of the first six entries in each column is quite stable; it is
probably not an artifact.

Some quick observations: there seem to be three "extremal" samples,
namely hea ("ch" abundant), bio ("q" important), and zod ("t"
important).

There are too many "e?" elements; I must check where they come from,
and perhaps modify the set of elements to account for them.

[ It seems that many came from groups of the form "e[ktpf]e" and
"e[ktpf]ee", which could be "c[ktpf]h" and "c[ktpf]he" without
ligatures.  Most of the remaining ones come from Friedman's
transcription; there are practically none in the more careful
transcriptions. ]
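If that is indeed the explanation, a quick check would be to normalize
those groups in the transcription, before factoring, and see how many
"e?" elements remain.  A hypothetical gawk filter for that test (not
applied in the runs above):

  #! /usr/bin/gawk -f
  # Hypothetical normalization, not used above: rewrite unligatured
  # "e<gallows>ee" and "e<gallows>e" groups as the platform gallows
  # "c<gallows>he" and "c<gallows>h", to see how many "e?" elements
  # then disappear from the counts.
  {
    $0 = gensub(/e([ktpf])ee/, "c\\1he", "g");
    $0 = gensub(/e([ktpf])e/,  "c\\1h",  "g");
    print;
  }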
All valid elements that occur at least 10 times in the text:

  o y a q n in iin iiin r ir iir iiir d s is l il m im j de k t ke te
  p f cth ckh cthe ckhe cph cfh cphe cfhe ch che sh she ee eee x

Valid elements that occur less than 10 times in the whole text:

  iil ij pe ct ck id

Created a file "RAW/plots/vald/keys.dic" with all the valid elements.

Equiv-reduced elements sorted by frequency (× 99), per section:

  tot      unk      pha      str      hea      heb      bio      ast      cos      zod
  -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
  38 o~ 40 o~ 40 o~ 38 o~ 39 o~ 40 o~ 37 o~ 40 o~ 41 o~ 42 o~ 10 t~ 10 t~ 8 l~ 11 t~ 11 ch~ 12 d~ 10 d~ 10 ch~ 9 t~ 11 t~ 8 d~ 8 d~ 7 ch~ 8 d~ 9 t~ 10 t~ 9 t~ 8 t~ 9 l~ 8 ch~ 8 ch~ 7 l~ 7 d~ 8 ch~ 7 d~ 6 ch~ 8 l~ 7 d~ 7 ch~ 8 l~ 7 l~ 6 ch~ 6 t~ 6 l~ 6 l~ 5 l~ 6 q~ 5 r~ 6 d~ 5 d~ 4 r~ 5 r~ 4 r~ 4 q~ 5 r~ 4 r~ 5 ch~ 3 s~ 4 r~ 5 r~ 4 q~ 3 in~ 3 q~ 4 in~ 4 in~ 3 in~ 3 che~ 3 l~ 3 in~ 3 te~ 4 in~ 3 q~ 3 in~ 4 r~ 2 sh~ 3 che~ 3 in~ 3 te~ 2 sh~ 3 in~ 3 che~ 2 che~ 3 che~ 4 che~ 2 cth~ 2 q~ 3 r~ 3 che~ 2 q~ 2 che~ 1 te~ 2 sh~ 2 te~ 1 te~ 2 q~ 2 te~ 2 she~ 2 in~ 2 che~ 1 s~ 1 sh~ 1 te~ 1 s~ 1 she~ 2 s~ 1 sh~ 2 te~ 1 q~ 1 te~ 1 she~ 1 she~ 1 she~ 1 ?~ 1 sh~ 1 che~ 1 she~ 1 sh~ 1 ?~ 1 s~ 1 sh~ 1 cth~ 1 cth~ 1 cth~ 0 s~ 0 te~ 1 s~ 1 cth~ 1 e?~ 1 she~ 0 ?~ 1 s~ 1 s~ 1 sh~ 0 cth~ 0 she~ 1 cth~ 1 s~ 1 she~ 0 cth~ 0 e?~ 0 ir~ 0 ir~ 1 she~ 0 ir~ 0 cthe~ 0 ir~ 0 cthe~ 1 cth~ 0 e?~ 0 cthe~ 0 cthe~ 0 cthe~ 1 cthe~ 0 cthe~ 0 ir~ 0 cthe~ 0 e?~ 1 sh~ 0 ?~ 0 cth~ 0 ?~ 0 e?~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ir~ 0 ir~ 0 ir~ 0 ir~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 e?~ 0 ?~ 0 h?~ 0 cthe~ 0 cthe~ 0 q~ 0 n~ 0 id~ 0 i?~ 0 i?~ 0 n~ 0 id~ 0 ith~ 0 i?~ 0 id~ 0 i?~ 0 n~ 0 de~ 0 il~ 0 i?~ 0 i?~ 0 ct~ 0 il~ 0 il~ 0 i?~ 0 is~ 0 n~ 0 ct~ 0 n~ 0 ?~ 0 id~ 0 id~ 0 il~ 0 n~ 0 de~ 0 id~ 0 x~ 0 il~ 0 de~ 0 x~ 0 id~ 0 id~ 0 de~ 0 de~ 0 n~ 0 ct~ 0 x~ 0 il~ 0 de~ 0 is~ 0 is~ 0 b~ 0 i?~ 0 x~ 0 is~ 0 is~ 0 h?~ 0 h?~ 0 c?~ 0 ith~ 0 b~ 0 b~ 0 c?~

There are 23 valid elements with frequency > 20 under the equivalence:

  o t te cth cthe ch che sh she d de id l r q s m n in ir im il

Valid elements with frequency below 20:

  ct is g b x

Created a file "EQV/plots/vald/keys.dic" with all the valid elements,
collapsed by the above equivalence.

III. PAGE SCATTER-PLOTS

See Notes/021 for explanation of these plots.  Let's now compute the
frequencies of these keywords in each page and section:

  foreach dic ( vald )
    foreach etag ( RAW EQV )
      foreach utype ( pages sections )
        set frdir = "${etag}/efreqs/${utype}"
        set ptdir = "${etag}/plots/${dic}/${utype}"
        echo "${frdir}" "${ptdir}"
        /bin/rm -rf ${ptdir}
        mkdir -p ${ptdir}
        cp -p ${frdir}/all.names ${ptdir}
        foreach fnum ( tot `cat ${frdir}/all.names` )
          printf "%30s/%-7s " "${ptdir}" "${fnum}:"
          cat ${frdir}/${fnum}.frq \
            | gawk '/./{print $1, $3;}' \
            | est-dic-probs -v dic=${etag}/plots/${dic}/keys.dic \
            > ${ptdir}/${fnum}.pos
        end
      end
    end
  end
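The script est-dic-probs is not listed here either.  The assumed idea:
read the "COUNT ELEM" pairs, keep only the elements listed in the given
keys.dic (assumed to hold one valid element per line), and renormalize
the counts into probabilities; the real script may well use a smoothed
estimator instead.  A rough gawk sketch under those assumptions:

  #! /usr/bin/gawk -f
  # Sketch only (assumed behavior of est-dic-probs).  Invoke as, e.g.,
  #   gawk -f this-sketch.gawk -v dic=RAW/plots/vald/keys.dic
  # Reads "COUNT ELEM" pairs on stdin; writes "ELEM PROB" for every
  # element listed in the dictionary, where PROB is that element's
  # share of the total count over dictionary elements.
  BEGIN {
    while ((getline key < dic) > 0) {
      if (key != "") { valid[key] = 1; ord[++nk] = key; }
    }
    close(dic);
  }
  /./ { if ($2 in valid) { cnt[$2] += $1; tot += $1; } }
  END {
    for (i = 1; i <= nk; i++) {
      k = ord[i];
      printf "%-8s %8.6f\n", k, (tot > 0 ? cnt[k] / tot : 0);
    }
  }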
IV. SCATTER PLOTS

Let's plot them:

  set sys = "tot-hea"

  foreach dic ( vald )
    foreach etag ( RAW EQV )
      set ptdir = "${etag}/plots/${dic}/pages"
      set scdir = "${etag}/plots/${dic}/sections"
      set fgdir = "${etag}/plots/${dic}/${sys}"
      /bin/rm -rf ${fgdir}
      mkdir -p ${fgdir}
      cp -p ${ptdir}/all.names ${fgdir}/all.names
      make-3d-scatter-plots \
        ${ptdir} \
        ${fgdir} \
        ${scdir}/{tot,hea,heb,bio}.pos
    end
  end

Again, trying to separate Herbal-A from Pharma:

  set sys = "hea-pha"

  foreach dic ( vald )
    foreach etag ( RAW EQV )
      set ptdir = "${etag}/plots/${dic}/pages"
      set scdir = "${etag}/plots/${dic}/sections"
      set fgdir = "${etag}/plots/${dic}/${sys}"
      /bin/rm -rf ${fgdir}
      mkdir -p ${fgdir}
      cp -p ${ptdir}/all.names ${fgdir}/all.names
      make-3d-scatter-plots \
        ${ptdir} \
        ${fgdir} \
        ${scdir}/{hea,pha,heb,bio}.pos
    end
  end

The scatter plots made with the collapsed letters still show the main
sections as separate clusters, but touching each other.
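To put a number on how much the clusters overlap, one could also
compare the .pos vectors directly, e.g. by the Euclidean distance from
each page to the section reference points.  This is only an
illustration, not what make-3d-scatter-plots does; it assumes each .pos
file holds one "ELEM PROB" pair per line, and the page file name in the
usage comment is made up:

  #! /usr/bin/gawk -f
  # Illustration only.  Usage (page file name is hypothetical):
  #   gawk -f pos-dist.gawk \
  #     RAW/plots/vald/pages/f001r.pos \
  #     RAW/plots/vald/sections/{tot,hea,heb,bio}.pos
  # Prints the Euclidean distance from the first file (a page) to each
  # of the remaining files (section reference points).
  FNR == 1 { nf++; }
  NF < 2   { next; }
  nf == 1  { page[$1] = $2; next; }
  { sec[FILENAME, $1] = $2; seen[FILENAME] = 1; }
  END {
    for (f in seen) {
      d2 = 0;
      for (e in page) { d = page[e] - sec[f, e]; d2 += d * d; }
      printf "%-45s %8.6f\n", f, sqrt(d2);
    }
  }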