Hacking at the Voynich manuscript - Side notes
003 Automaton-based analysis with reduced alphabet

Last edited on 1999-01-31 06:18:58 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt, originally done
  around 97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I went back and started redoing some of the previous tasks
    using the new encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>)
    versions of the "bio" section, in EVA alphabet, as files
    bio-{c,f}-eva.evt. I also built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad}.{wds,dic,frq},.txt}. [Note-001.txt]

    I built finite automata for the two versions, and looked for
    strange words and inflection classes.  Friedman's version looked a
    bit more regular than Currier's.  [Note-002.txt]
    
97-11-10 stolfi
===============
  
  After a couple of days of playing around with prefixes and suffixes,
  I concluded that, at this stage in the analysis, it was necessary to 
  collapse similar letters: not only to reduce transcription and sampling
  noise, but also to reduce the output to a manageable size.
  
  Therefore, I wrote a filter "eva2era" that maps strings of EVA letters
  to a reduced alphabet ERA, as follows:
  
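    function eva_to_era(txt)
    {
      # Converts a chunk of comment-free EVA to ERA.
      # The order of the substitutions matters.

      gsub(/sh/,   "ch",  txt);  # must come before s -> r
      gsub(/s/,    "r",   txt);
      gsub(/t/,    "k",   txt);  # also turns "cth" into "ckh"
      gsub(/ckh/,  "eke", txt);  # catches original "ckh" and former "cth"
      gsub(/cph/,  "epe", txt);
      gsub(/cfh/,  "efe", txt);
      gsub(/ei/,   "o",   txt);
      gsub(/a/,    "o",   txt);
      gsub(/y/,    "o",   txt);
      gsub(/iii*/, "i",   txt);  # any run of two or more "i"s
      gsub(/q/,    "",    txt);  # "q" is dropped altogether

      return txt;
    }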
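
  The mapping is easiest to check on a few sample words.  The sketch
  below wraps the function above in a small stdin-to-stdout filter.
  The wrapper name "eva2era_sketch" and the sample words are
  assumptions made for illustration; the actual "eva2era" script is
  not shown in this note and may differ.

```shell
# Sketch of an eva2era-style filter (hypothetical wrapper).
# Reads EVA words on stdin, one per line, and prints the ERA form of each.
eva2era_sketch() {
  awk '
    function eva_to_era(txt) {
      # Same substitutions, in the same order, as in the note.
      gsub(/sh/,   "ch",  txt)
      gsub(/s/,    "r",   txt)
      gsub(/t/,    "k",   txt)
      gsub(/ckh/,  "eke", txt)
      gsub(/cph/,  "epe", txt)
      gsub(/cfh/,  "efe", txt)
      gsub(/ei/,   "o",   txt)
      gsub(/a/,    "o",   txt)
      gsub(/y/,    "o",   txt)
      gsub(/iii*/, "i",   txt)
      gsub(/q/,    "",    txt)
      return txt
    }
    { print eva_to_era($0) }
  '
}

# Sample run on five EVA words.
printf '%s\n' qokeedy shedy daiin otedy cthy | eva2era_sketch
```

  The sample run prints okeedo, chedo, doin, okedo and ekeo, one per
  line.  Note how "cthy" goes through "ckhy" before becoming "ekeo".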

  I created files of good words from the two transcriptions: 
  
    foreach f ( f c )

      cat bio-${f}-eva-gut.wds \
        | egrep -v '.q' \
        | eva2era \
        > bio-${f}-era-gut.wds

      cat bio-${f}-era-gut.wds \
        | sort | uniq \
        > bio-${f}-era-gut.dic
        
      dicio-wc bio-${f}-era-gut.{wds,dic}

    end

     lines   words     bytes file        
    ------ ------- --------- ------------
      6166    6166     34893 bio-f-era-gut.wds
       763     763      5164 bio-f-era-gut.dic

      5864    5861     34612 bio-c-era-gut.wds
       940     939      6720 bio-c-era-gut.dic

  Note that I removed words with embedded (non-initial) "q"s.  The reason
  is that eva2era deletes every "q", so after the mapping those words
  could no longer be told apart from ordinary q-less words. There were
  16 such words in the Friedman version:
  
    lshdyqo qoqokeey oqoky qokeedyqokar oqokaiin cheyqy yqol olqo
    tyqoky oqol oqo qoqokal yqokaiin oqofchedy oqol yqor

  and 20 in the Currier version:
  
    oqokain lshdyqo qoqokeey oqoky sheq oqokain dyqokol dqokedy qolqol
    ysheeyqo cheyqytaiin olqo tyqoky oqol oqo chedyqoty teyqokedy
    yqokain oqofchedy yqolyqor
  
  The ones present in both (counting "ain" and "aiin" as the same
  ending, which eva2era would collapse anyway) are

    lshdyqo qoqokeey oqoky oqoka(i)in olqo tyqoky oqol oqo yqoka(i)in
    oqofchedy

  but I haven't checked whether they are really the same words of the VMs.
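
  The effect of the '.q' pattern used in the pipeline can be seen on a
  handful of words.  In the sketch below the function names and the
  sample words are assumptions; "grep -E" is used as the standard
  spelling of "egrep".

```shell
# keep_clean_words: drop every word that has a "q" anywhere except in the
# first position (the pattern needs one character before the "q").
keep_clean_words() { grep -Ev '.q'; }

# embedded_q_words: the complementary filter, listing what gets dropped.
embedded_q_words() { grep -E '.q'; }

# Sample run: only qokeedy and chedy survive.
printf '%s\n' qokeedy oqoky chedy olqo qoqokal | keep_clean_words
```

  A word-initial "q" is kept at this stage and only disappears inside
  eva2era; a word such as "olqo" would otherwise be turned into "olo".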
  
  The two ".dic" files differ in 459 words (even after eva2era collapse!). 
  To be precise, Friedman\Currier is 134 words, Currier\Friedman is 313
  words.
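
  Counts of this kind can be obtained with "comm" on sorted word
  lists.  The following is only a sketch of one way to do it, not
  necessarily the commands that produced the figures above; the
  function name "dic_compare" is an assumption.

```shell
# dic_compare A B: print three numbers on one line:
#   words only in A, words only in B, words in both.
dic_compare() {
  _a=$(mktemp) || return 1
  _b=$(mktemp) || return 1
  # comm needs both inputs sorted in the same collation order.
  LC_ALL=C sort -u "$1" > "$_a"
  LC_ALL=C sort -u "$2" > "$_b"
  _only_a=$(LC_ALL=C comm -23 "$_a" "$_b" | awk 'END { print NR }')
  _only_b=$(LC_ALL=C comm -13 "$_a" "$_b" | awk 'END { print NR }')
  _both=$(LC_ALL=C comm -12 "$_a" "$_b" | awk 'END { print NR }')
  rm -f "$_a" "$_b"
  echo "$_only_a $_only_b $_both"
}

# Intended use (file names as in this note):
#   dic_compare bio-f-era-gut.dic bio-c-era-gut.dic
```

  The first number corresponds to Friedman\Currier and the second to
  Currier\Friedman when the files are given in that order.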
  
  I built automata for the two files:

    foreach f ( f c )

      cat bio-${f}-era-gut.dic \
        | nice MaintainAutomaton \
            -add - \
            -dump Note-003/${f}-era.dmp
  
      nice AutoAnalysis \
        -load Note-003/${f}-era.dmp \
        -unprod Note-003/${f}-era-1-unp.sts \
          -maxUnprod 1 \
          -unprodSugg Note-003/${f}-era-1-unp.sugg

      nice AutoAnalysis \
        -load Note-003/${f}-era.dmp \
        -prod Note-003/${f}-era-1-prd.pst \
        -minProductivity 1 \
        -maxPrefSize 30 \
        -maxSuffSize 30 

    end

  Automata:

     strings  letters    states   finals     arcs  sub-sts lets/arc
    -------- --------  -------- -------- -------- -------- --------
         763     4401       360      122      857      710    5.135 Friedman
         940     5780       486      162     1119      940    5.165 Currier
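
  The last column appears to be simply the number of letters divided
  by the number of arcs.  This is inferred from the two rows above,
  not from the tools' documentation; the helper name is an assumption.

```shell
# lets_per_arc LETTERS ARCS: ratio of input letters to automaton arcs,
# printed with three decimals as in the table.
lets_per_arc() {
  awk -v letters="$1" -v arcs="$2" 'BEGIN { printf "%.3f\n", letters / arcs }'
}

lets_per_arc 4401 857     # Friedman, prints 5.135
lets_per_arc 5780 1119    # Currier,  prints 5.165
```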

  Strange words:
  
    Friedman: 40 unproductive states, 26 strange words:
   
      chlchpcheeo cholkeeeo efedorol epoloir kchdolkdo keoeoloin keokem
      koekeeo korolchrdo lkeoldo lroiror ochepolr oddcheo okino
      olkeeocheol olkeeoolor oloefeo olpoekeo pdolchor poldoko polkechol
      prchedol rchkchdo reokoin rokcheed rpcho
   
    Currier: 59 unproductive states, 32 strange words:
    
      chedkeolo chkolcheedo chlchpcheeo doleerolcheo dolkeeeoro ekokedor
      epoloir erer kchdolkdo kchololkee kedopocho keokeg koekeeo
      korolchrdo lolkedokoinol nlkeedo ochepolr ochokeed ofolcheko
      okchorolko okedoepeo okorlche pchchedol pcheokeeor pcholpchefedo
      pdolchor poldoroirol polkechol porolcho rcholcheekeo rofchom
      rolchkedokor

    The 8 words that are strange in both automata:
    
      chlchpcheeo epoloir kchdolkdo koekeeo korolchrdo ochepolr pdolchor
      polkechol

  Productive states in Friedman's automaton:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         11      22       2      44      21  { n, rolcheo, rcho, opcheo, ... }:{ (), l }
         60      22       2      44      21  { rolched, roin, oror, orol, ... }:{ (), o }
         33      14       2      28      13  { rolo, oloko, olchedo, okecho, ... }:{ (), r }
          8       4       4      16       9  { ekeo, cheeo, roro, chdo }:{ (), in, l, r }
         86      10       2      20       9  { olkche, okold, okeee, epee, ... }:{ do, o }
         34       5       3      15       8  { opchol, okchd, oched, chor, ... }:{ (), o, or }
         26       7       2      14       6  { olkol, okeedol, pcheol, ocheol, ... }:{ (), do }
         37       7       2      14       6  { rchek, orok, lked, lchek, ... }:{ eo, o }
         80       4       3      12       6  { pchedo, oekeo, kcho, cheolo }:{ (), l, r }
         83       4       3      12       6  { polchd, opched, olkeed, cheor }:{ (), o, ol }
        306       4       3      12       6  { polche, okche, lkee, kolch }:{ d, do, o }
        107       2       6      12       5  { okedo, chko }:{ (), in, ir, l, m, r }
          7       3       3       9       4  { oloro, eko, choko }:{ (), in, l }
          9       2       5      10       4  { ror, chd }:{ (), o, oin, ol, or }
         12       5       2      10       4  { opchd, okd, kold, dcheed, ... }:{ o, ol }
        103       5       2      10       4  { oroi, olkoi, koi, okedoi, ... }:{ n, r }
        168       3       3       9       4  { okolch, dolche, dee }:{ do, edo, o }
        188       2       4       8       3  { rolke, doke }:{ do, edo, eo, o }
        217       4       2       8       3  { okorolo, loin, keedo, dororo }:{ (), m }
        335       2       4       8       3  { okolo, lcheo }:{ (), l, m, r }
          6       3       2       6       2  { doko, okolko, chldo }:{ (), in }
         38       3       2       6       2  { ocheek, cheeek, roek }:{ eeo, eo }
         55       3       2       6       2  { orche, epch, cheepe }:{ edo, o }
         88       2       3       6       2  { lcheeke, chepch }:{ edo, eo, o }
        167       3       2       6       2  { oloke, lchepe, kech }:{ do, edo }
        218       2       3       6       2  { keed, doror }:{ (), o, om }
        528       2       3       6       2  { rched, old }:{ (), o, oir }
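
  In a minimal deterministic automaton for a finite word list, two
  prefixes lead to the same state exactly when they can be followed by
  the same set of suffixes.  The sketch below uses that fact to list
  prefix/suffix classes directly from a word list, in the same
  "{ prefs }:{ suffs }" layout as the table above.  It is a plain
  sort/awk illustration under that assumption, not the AutoAnalysis
  program, and the function name is hypothetical.

```shell
# Reads a word list on stdin (one word per line) and prints every class of
# two or more prefixes that share the same set of two or more suffixes.
prefix_suffix_classes() {
  # Stage 1: emit "prefix<TAB>suffix" for every split of every word.
  awk '{
    n = length($0)
    for (i = 1; i <= n; i++) {
      suf = substr($0, i + 1)
      if (suf == "") suf = "()"     # empty suffix, written as in the tables
      print substr($0, 1, i) "\t" suf
    }
  }' |
  LC_ALL=C sort -u |
  # Stage 2: join the sorted suffixes of each prefix into a single key.
  awk -F'\t' '
    NR > 1 && $1 != prev { print sufs "\t" prev; sufs = "" }
    { sufs = (sufs == "" ? $2 : sufs ", " $2); prev = $1 }
    END { if (NR > 0) print sufs "\t" prev }
  ' |
  LC_ALL=C sort |
  # Stage 3: group the prefixes that have the same key.
  awk -F'\t' '
    function show_class() {
      if (np > 1 && key ~ /,/) print "{ " prefs " }:{ " key " }"
    }
    NR > 1 && $1 != key { show_class(); prefs = ""; np = 0 }
    { prefs = (np == 0 ? $2 : prefs ", " $2); np++; key = $1 }
    END { show_class() }
  '
}

# Sample run on six made-up ERA-like words.
printf '%s\n' ekeo ekeol ekeor chdo chdol chdor | prefix_suffix_classes
```

  For the six sample words this prints two classes,
  "{ chdo, ekeo }:{ (), l, r }" and "{ chd, eke }:{ o, ol, or }".
  The second is the first one shifted by one letter, much as some
  neighbouring rows in the tables appear to be.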

  Productive states in Currier's automaton:

      state  nprefs  nsuffs  nwords  prodty  prefs/suffs
    ------- ------- ------- ------- -------  -----------------
         80      23       2      46      22  { rolched, oror, oloin, olkor, ... }:{ (), o }
         17      20       2      40      19  { rolkeo, rolcheo, opcheo, olkeo, ... }:{ (), l }
         35      17       2      34      16  { opolo, oloko, okecho, lcheeo, ... }:{ (), r }
         66      17       2      34      16  { rolkee, olkche, okeee, loee, ... }:{ do, o }
         21      12       2      24      11  { rchek, olkchd, lked, lchek, ... }:{ eo, o }
        186       5       3      15       8  { opchol, okchd, oched, kolchd, ... }:{ (), o, or }
        406       5       3      15       8  { pchedo, okolo, oekeo, lcho, ... }:{ (), l, r }
         14       3       4      12       6  { cheoko, cheeo, chdo }:{ (), in, l, r }
         12       5       2      10       4  { rolko, ololo, okolko, ooko, ... }:{ (), in }
         18       5       2      10       4  { olold, ofchd, kold, dcheed, ... }:{ o, ol }
        291       5       2      10       4  { roror, rolor, roir, pchor, ... }:{ (), ol }
        473       3       3       9       4  { polche, odche, lkee }:{ d, do, o }
        696       3       3       9       4  { polchd, olor, olched }:{ (), o, ol }
         49       4       2       8       3  { rek, roek, ocheek, cholke }:{ eeo, eo }
         67       4       2       8       3  { ode, rorch, lkch, cheepch }:{ edo, eo }
        145       4       2       8       3  { olkoi, okedoi, ki, chkoi }:{ n, r }
        238       2       4       8       3  { orche, doke }:{ do, edo, eo, o }
        308       4       2       8       3  { rcheed, okechd, kched, eked }:{ o, or }
        382       4       2       8       3  { okolr, lolr, lcheer, keeol }:{ (), oin }
        461       4       2       8       3  { okedol, oind, lkol, ldol }:{ (), or }
        508       2       4       8       3  { olcheo, lolo }:{ (), l, m, r }
         38       3       2       6       2  { oker, chr, chedol }:{ (), do }
         68       3       2       6       2  { rorc, lkc, cheepc }:{ hedo, heo }
        108       3       2       6       2  { okede, dcheoke, cheolch }:{ do, eo }
        112       3       2       6       2  { oroiro, olchol, cheolkoin }:{ (), ro }
        140       2       3       6       2  { ore, kech }:{ do, edo, eo }
        167       3       2       6       2  { rolkch, chokch, lpch }:{ do, edo }
        168       3       2       6       2  { rolkc, chokc, lpc }:{ hdo, hedo }
        514       2       3       6       2  { olr, lor }:{ (), oin, olo }
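
  In every row of the two tables above, nwords equals nprefs * nsuffs
  and prodty equals (nprefs - 1) * (nsuffs - 1).  This relation is
  read off the numbers themselves; that AutoAnalysis defines
  productivity this way is an assumption.  A small check, with a
  hypothetical function name, fed with the first five columns of a
  few rows:

```shell
# check_productivity: reads "state nprefs nsuffs nwords prodty" rows on stdin
# and reports, per state, whether both relations hold.
check_productivity() {
  awk '{
    ok = ($4 == $2 * $3 && $5 == ($2 - 1) * ($3 - 1))
    print $1, (ok ? "ok" : "MISMATCH")
  }'
}

# Three rows taken from Friedman's table.
printf '%s\n' '11 22 2 44 21' '8 4 4 16 9' '107 2 6 12 5' | check_productivity
```

  The three sample rows all come out as "ok".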