Hacking at the Voynich manuscript - Side notes
504 Attempt at factoring the 'edy' and 'air' word classes

Last edited on 1999-01-31 06:20:15 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt, originally
  done around 97-07-05.

Summary of previous relevant tasks:

  I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

  Around 97-11-01 I split landini-interln16.evt into many files,
  with one text unit per page. [Notebook-12.txt]

  On 97-11-05 I mapped those files from FSG and other ad-hoc
  alphabets to EVA. [Notebook-12.txt] The files are L16-eva/fNNxx.YY,
  and a machine-readable description of their contents and logical
  order is in L16-eva/INDEX.

  Then I started redoing some of the previous tasks using the new
  encoding.

  I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
  "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
  built the associated text files and word lists
  bio-{c,f}-eva{-{gut,fun,bad}.{wds,dic,frq},.txt}. [Note-001.txt]

  I mapped the word files bio-{c,f}-eva-gut.{wds,dic} to a reduced
  alphabet (ERA), obtaining files bio-{c,f}-era-gut.{wds,dic}.

  I built finite-state automata for these two files, and looked for
  productive states. The Friedman version seems cleaner. [Note-003.txt]

97-11-10 stolfi
===============

  Poring over the productive states of the automata and playing
  around with counting programs, I tentatively identified two major
  classes of morphologically similar words: one inflecting as
  -e*d?y or -e*d (the 'edy' class) and one inflecting as
  -e*d?[ao]i*[rlnm] (the 'air' class).

  Let's collect their "roots". To keep the tables manageable, and to
  reduce noise, let's work with the ERA-mapped files. Recall that "a"
  and "y" become "o" in ERA.

    cat bio-f-era-gut.wds \
      | egrep -e '[orlmnd]$' \
      | sed -e 's:e*d*o$:- -EDY:' \
      | sed -e 's:e*d$:- -EDY:' \
      | sed -e 's:e*d*oi*[rlmn]$:- -AIR:' \
      | egrep -e '- -' \
      > Note-004/.factored

    cat Note-004/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-004/.prefs-all.dic

    dicio-wc Note-004/.prefs-all.dic

     lines   words     bytes file
    ------ ------- --------- ------------
       266     266      1704 Note-004/.prefs-all.dic

  I built a file "Note-004/suffs.cls" containing only the lines
  "-EDY" and "-AIR".

    cat Note-004/.factored \
      | count-diword-freqs \
          -v rows=Note-004/.prefs-all.dic \
          -v cols=Note-004/suffs.cls \
          -v digits=4 \
      > Note-004/.tbl-1

  The raw count table was then extracted by hand from Note-004/.tbl-1
  and saved to Note-004/.tbl-2.

  These are the most popular prefixes from that table (10 or more
  occurrences), sorted by reverse prefix:

         prefix    TOT  EDY  AIR
    ----------- ------ ---- ----
              -    846   81  765
            ch-    971  771  200
           dch-     58   47   11
         okech-     12   11    1
           kch-     54   44   10
          okch-    104  101    3
           lch-    139  125   14
          olch-    111  105    6
         polch-     11   10    1
         rolch-     13   12    1
           och-     46   39    7
           pch-     43   26   17
          opch-     39   34    5
           rch-     47   43    4
             k-    208   97  111
            ek-     34   24   10
         cheek-    110  108    2
          chek-    189  172   17
           oek-     17   14    3
           chk-     29   12   17
            lk-     32   20   12
           olk-    148   86   62
          rolk-     17   13    4
            ok-   1652  873  779
           dok-     12    9    3
             l-     70   24   46
            ol-    109   58   51
           dol-     19   15    4
          okol-     31   22    9
             o-     32   18   14
             p-     19    1   18
            op-     10    4    6
             r-    253   18  235
            or-     48   15   33
           dor-     10    7    3
            ...    ...  ...  ...
    ----------- ------ ---- ----
            TOT   5992 3395 2597

  Let's do a scatter plot of the prefixes, using the EDY/AIR counts:

    gnuplot <<EOF
    set term x11
    # set term pbm color small
    # set output Note-004/pref-suff-plot.ppm
    set title "EDY vs AIR counts"
    set xlabel "-EDY count"
    set ylabel "-AIR count"
    plot \
      '< radial-squeeze -v x=3 -v y=4 -v noise=0.3 Note-004/.tbl-2' \
      using 1:2 title "collapsed prefixes" with points
    pause 120
    EOF

  The scatter plot did not show any obviously distinctive prefix
  classes.

  Let's try to classify the prefixes anyway, based on their EDY versus
  AIR preferences.

    cat Note-004/.tbl-2 \
      | classify-by-ratio \
          -v min=3 \
          -v xf=3 -v xn=3395 \
          -v yf=4 -v yn=2597 \
      | revbytes | sort -b +4 -5 | revbytes \
      > Note-004/pref-suff-class.txt

  Here are the significant entries (type > 0), sorted by reverse
  prefix.

         prefix    TOT  EDY  AIR R
    ----------- ------ ---- ---- -
              -    846   81  765 9
    ----------- ------ ---- ---- -
             d-      6    6    . 1
    ----------- ------ ---- ---- -
            ch-    971  771  200 2
           dch-     58   47   11 2
        olkech-      3    3    . 1
         okech-     12   11    1 1
           fch-      4    3    1 3
          ofch-      9    8    1 1
           kch-     54   44   10 2
          lkch-      3    3    . 1
         olkch-      7    7    . 1
          okch-    104  101    3 1
           lch-    139  125   14 1
          olch-    111  105    6 1
         dolch-      6    6    . 1
         kolch-      3    3    . 1
        okolch-      4    4    . 1
         polch-     11   10    1 1
         rolch-     13   12    1 1
           och-     46   39    7 2
           pch-     43   26   17 5
        chepch-      5    5    . 1
          lpch-      3    2    1 4
          opch-     39   34    5 2
           rch-     47   43    4 1
          orch-      6    6    . 1
    ----------- ------ ---- ---- -
             k-    208   97  111 6
            dk-      4    3    1 3
            ek-     34   24   10 3
         cheek-    110  108    2 1
        lcheek-      9    8    1 1
          chek-    189  172   17 1
         lchek-      4    4    . 1
         rchek-      6    6    . 1
           oek-     17   14    3 2
           chk-     29   12   17 7
            lk-     32   20   12 4
           olk-    148   86   62 5
          dolk-      3    3    . 1
         opolk-      2    .    2 9
          rolk-     17   13    4 3
            ok-   1652  873  779 5
           dok-     12    9    3 3
         cheok-      4    4    . 1
          chok-      8    4    4 6
           kok-      2    .    2 9
          olok-      5    4    1 2
           rok-      6    3    3 6
            rk-      5    1    4 8
    ----------- ------ ---- ---- -
             l-     70   24   46 7
           chl-      3    1    2 7
           okl-      3    .    3 9
            ol-    109   58   51 5
           dol-     19   15    4 2
         cheol-      8    5    3 4
          chol-      4    4    . 1
           kol-      6    5    1 2
          okol-     31   22    9 3
           lol-      8    6    2 3
          opol-      3    3    . 1
           rol-      7    6    1 2
          orol-      4    4    . 1
    ----------- ------ ---- ---- -
             o-     32   18   14 5
           cho-      3    3    . 1
           oko-      7    6    1 2
    ----------- ------ ---- ---- -
             p-     19    1   18 9
            ep-      9    5    4 5
         cheep-      4    4    . 1
          chep-      4    3    1 3
            op-     10    4    6 7
    ----------- ------ ---- ---- -
             r-    253   18  235 9
            lr-      2    .    2 9
            or-     48   15   33 8
           dor-     10    7    3 3
           kor-      8    4    4 6
          okor-      5    4    1 2
           lor-      3    .    3 9
          olor-      5    2    3 7
           por-      2    .    2 9
           ror-      6    1    5 9
         doror-      3    2    1 4
    ----------- ------ ---- ---- -

  Note that the EDY/AIR ratio depends rather strongly on the last
  letter of the prefix.
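
  To put rough numbers on how the EDY/AIR split varies with the end of
  the prefix, here is a small sketch that was not part of the original
  session. It computes the percentage of -EDY occurrences for the bare
  prefixes ch-, k-, l-, o-, p- and r-, with rows copied from the table
  above. The function name edy_share is made up for illustration, and
  plain awk is assumed in place of gawk.

```shell
# Hypothetical helper: for each "prefix TOT EDY AIR" row on stdin,
# print the prefix and the EDY share of TOT as a rounded percentage.
edy_share() {
  awk '$2 > 0 { printf "%s %.0f\n", $1, 100 * $3 / $2 }'
}

# Rows copied from the classification table (bare prefixes only).
edy_share <<'EOF'
ch- 971 771 200
k-  208  97 111
l-   70  24  46
o-   32  18  14
p-   19   1  18
r-  253  18 235
EOF
```

  For these rows it prints 79 for ch-, 47 for k-, 34 for l-, 56 for o-,
  5 for p- and 7 for r-, against an overall EDY share of 3395/5992,
  about 57%. Longer prefixes would have to be pooled by their last
  letter to get a full picture; this sketch does not attempt that.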

  These were the counts for prefixes containing 'c[ktpf]h' before we
  decided to map those compounds to 'e[ktpf]e':

  Counts for 'c[ktpf]h' before:

    ----------- ------ ---- ---- -
           cph-      9    5    4 5
        checph-      4    4    . 1
         chcph-      3    3    . 1
    ----------- ------ ---- ---- -
           ckh-     26   19    7 3
        checkh-     82   80    2 1
       lcheckh-      8    8    . 1
         chckh-    133  128    5 1
          ockh-     16   13    3 2
    ----------- ------ ---- ---- -

  Counts for 'e[ktpf]-' before:

    ----------- ------ ---- ---- -
            ep-      .    .    . 0
         cheep-      .    .    . 0
          chep-      .    .    . 0
    ----------- ------ ---- ---- -
            ek-      8    5    3 4
         cheek-     28   28    . 1
        lcheek-      .    .    . 0
          chek-     56   44   12 2
           oek-      .    .    . 0
    ----------- ------ ---- ---- -

  Counts for 'e[ktpf]-' after:

    ----------- ------ ---- ---- -
            ep-      9    5    4 5
         cheep-      4    4    . 1
          chep-      4    3    1 3
    ----------- ------ ---- ---- -
            ek-     34   24   10 3
         cheek-    110  108    2 1
        lcheek-      9    8    1 1
          chek-    189  172   17 1
           oek-     17   14    3 2
    ----------- ------ ---- ---- -
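
  The effect of that remapping on the factoring can be reproduced on a
  few words. The sketch below is an illustration only: it assumes the
  'c[ktpf]h' to 'e[ktpf]e' rewrite can be done with a single sed
  substitution (the note does not say where in the ERA mapping it was
  actually applied), the function name remap_and_factor is made up,
  and the three input words are ERA-style examples rather than entries
  checked against the word list. The three factoring substitutions are
  the ones used earlier in this note.

```shell
# Hypothetical helper: rewrite c[ktpf]h as e[ktpf]e, then split each
# word into "prefix- -CLASS" with the same three sed rules as the note.
remap_and_factor() {
  sed -e 's:c\([ktpf]\)h:e\1e:g' \
    | sed -e 's:e*d*o$:- -EDY:' \
    | sed -e 's:e*d$:- -EDY:' \
    | sed -e 's:e*d*oi*[rlmn]$:- -AIR:'
}

# Illustrative ERA-style words ("y" and "a" already mapped to "o").
printf '%s\n' chckhedo ckhedo cphoir | remap_and_factor
```

  This prints "chek- -EDY", "ek- -EDY" and "ep- -AIR". The trailing "e"
  of the rewritten compound is absorbed by the e* at the start of the
  suffix pattern, which is why the old chckh-, ckh- and cph- rows show
  up under chek-, ek- and ep- in the "after" table.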