Hacking at the Voynich manuscript - Side notes
504 Attempt at factoring the 'edy' and 'air' word classes

Last edited on 1999-01-31 06:20:15 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt, originally done
  around 97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I started redoing some of the previous tasks using the new
    encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>) versions of the
    "bio" section, in EVA alphabet, as files bio-{c,f}-eva.evt. I also
    built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]

    I mapped the word files bio-{c,f}-eva-gut.{wds,dic} to a reduced
    alphabet (ERA), obtaining files bio-{c,f}-era-gut.{wds,dic}. I built
    finite-state automata for these two files, and looked for
    productive states. The Friedman version seems cleaner.
    [Note-003.txt]
    
97-11-10 stolfi
===============
  
  Poring over the productive states of the automata and playing around
  with counting programs, I tentatively identified two major classes
  of morphologically similar words: 
  
      one inflecting as -e*d?y or -e*d (the 'edy' class) and
      
      one inflecting as -e*d?[ao]i*[rlnm] (the 'air' class).
    
  Let's collect their "roots".  To keep the tables manageable,
  and reduce noise, let's work with ERA-mapped files. 
  Recall that "a" and "y" become "o" in ERA.
  
    cat bio-f-era-gut.wds \
      | egrep -e '[orlmnd]$' \
      | sed -e 's:e*d*o$:- -EDY:' \
      | sed -e 's:e*d$:- -EDY:' \
      | sed -e 's:e*d*oi*[rlmn]$:- -AIR:' \
      | egrep -e '- -' \
      > Note-004/.factored
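
  A quick way to see what the three substitutions do is to run them
  on a handful of words. The sketch below wraps the same filter in a
  function; the name `factor_words` and the sample words are made up
  for illustration and are not taken from the transcription.

```shell
# Same filter as above, wrapped in a function (the name is hypothetical).
factor_words() {
  grep -E -e '[orlmnd]$' \
    | sed -e 's:e*d*o$:- -EDY:' \
    | sed -e 's:e*d$:- -EDY:' \
    | sed -e 's:e*d*oi*[rlmn]$:- -AIR:' \
    | grep -E -e '- -'
}

# Made-up ERA-style words; "okech" is dropped by the first grep
# because it does not end in one of [orlmnd].
printf '%s\n' chedo okeedo ched okoir or okech | factor_words
```

  With these inputs the filter prints "ch- -EDY", "ok- -EDY",
  "ch- -EDY", "ok- -AIR" and "- -AIR". The last line shows how a word
  that consists of a suffix only ends up under the empty prefix "-".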

    cat Note-004/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-004/.prefs-all.dic

    dicio-wc Note-004/.prefs-all.dic

     lines   words     bytes file        
    ------ ------- --------- ------------
       266     266      1704 Note-004/.prefs-all.dic
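
  The "revbytes | sort | uniq | revbytes" idiom sorts the prefixes by
  their reversed spelling, so that prefixes ending in the same letters
  become neighbors. `revbytes` is a local tool that is not listed in
  these notes; the sketch below assumes that the `rev` utility from
  util-linux does the same thing on plain ASCII lines. The function
  name is hypothetical.

```shell
# Sort lines by their reversed spelling and drop duplicates.
# Assumption: `rev` (util-linux) stands in for the local `revbytes`.
sort_by_ending() {
  rev | LC_ALL=C sort -u | rev
}

printf '%s\n' ok- ch- olk- lch- ch- | sort_by_ending
```

  This prints ch-, lch-, olk-, ok-, which is the same relative order
  in which those prefixes appear in the tables of this note.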

  I built a file "Note-004/suffs.cls" containing only the lines 
  "-EDY" and "-AIR".

    cat Note-004/.factored \
      | count-diword-freqs \
          -v rows=Note-004/.prefs-all.dic \
          -v cols=Note-004/suffs.cls \
          -v digits=4 \
      > Note-004/.tbl-1

  The raw count table was then extracted by hand from Note-004/.tbl-1 and
  saved to Note-004/.tbl-2.
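
  The `count-diword-freqs` script is not listed here. As a rough
  stand-in, the raw counts can be tabulated with a few lines of awk.
  This is only a sketch under the assumption that each input line has
  the form "prefix suffix", as produced by the factoring step; the
  function name and the sample lines are hypothetical.

```shell
# Count, for each prefix, how often it occurs with -EDY and with -AIR.
# Output columns: prefix, TOT, EDY, AIR (same layout as the tables here).
tabulate_pairs() {
  awk '
    { tot[$1]++
      if ($2 == "-EDY") edy[$1]++
      else if ($2 == "-AIR") air[$1]++ }
    END {
      for (p in tot)
        printf "%-11s %6d %4d %4d\n", p, tot[p], edy[p] + 0, air[p] + 0
    }
  ' | LC_ALL=C sort
}

# Made-up sample; in practice the input would be Note-004/.factored.
printf '%s\n' 'ch- -EDY' 'ch- -EDY' 'ch- -AIR' '- -AIR' | tabulate_pairs
```

  Unlike the original pipeline, this sorts by plain prefix rather than
  by reversed prefix, and it needs no separate row and column lists.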

  These are the most popular prefixes from that table (10 or more
  occurrences), sorted by reverse prefix:
  
    prefix         TOT  EDY  AIR
    ----------- ------ ---- ----
    -              846   81  765
    ch-            971  771  200
    dch-            58   47   11
    okech-          12   11    1
    kch-            54   44   10
    okch-          104  101    3
    lch-           139  125   14
    olch-          111  105    6
    polch-          11   10    1
    rolch-          13   12    1
    och-            46   39    7
    pch-            43   26   17
    opch-           39   34    5
    rch-            47   43    4
    k-             208   97  111
    ek-             34   24   10
    cheek-         110  108    2
    chek-          189  172   17
    oek-            17   14    3
    chk-            29   12   17
    lk-             32   20   12
    olk-           148   86   62
    rolk-           17   13    4
    ok-           1652  873  779
    dok-            12    9    3
    l-              70   24   46
    ol-            109   58   51
    dol-            19   15    4
    okol-           31   22    9
    o-              32   18   14
    p-              19    1   18
    op-             10    4    6
    r-             253   18  235
    or-             48   15   33
    dor-            10    7    3
    ...            ...  ...  ...
    ----------- ------ ---- ----
    TOT           5992 3395 2597
  
  Let's do a scatter plot of the prefixes, using the EDY/AIR counts:
    
    gnuplot <<EOF
    set term x11
    # set term pbm color small
    # set output Note-004/pref-suff-plot.ppm
    set title "EDY vs AIR counts"
    set xlabel "-EDY count"
    set ylabel "-AIR count"
    plot \
      '< radial-squeeze -v x=3 -v y=4 -v noise=0.3 Note-004/.tbl-2' \
        using 1:2 title "collapsed prefixes" with points
    pause 120
    EOF
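
  The `radial-squeeze` filter is not listed in these notes. Going only
  by its name and parameters, one guess is that it keeps the direction
  of each (EDY, AIR) point but compresses its distance from the origin,
  so that a few very frequent prefixes do not flatten the rest of the
  plot. The sketch below implements that guess with a logarithmic
  radius; the formula, the function name and the parameter names `xf`
  and `yf` are assumptions, and the random `noise` jitter is left out.

```shell
# Guessed stand-in for radial-squeeze: map each point (x, y) to the same
# direction with radius log(1 + r) instead of r. "." counts as zero.
radial_squeeze_guess() {
  awk -v xf=3 -v yf=4 '
    { x = ($xf == "." ? 0 : $xf)
      y = ($yf == "." ? 0 : $yf)
      r = sqrt(x * x + y * y)
      s = (r > 0 ? log(1 + r) / r : 0)
      printf "%.3f %.3f\n", x * s, y * s }
  '
}

# Made-up rows in the "prefix TOT EDY AIR" layout.
printf '%s\n' 'a- 7 3 4' 'b- 0 . .' | radial_squeeze_guess
```

  Header and ruler lines of a real table would have to be stripped
  before feeding it to such a filter.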
      
  The scatter plot did not show any obviously distinctive prefix classes.
  
  Let's try anyway to classify the prefixes based on their EDY versus
  AIR preferences.

    cat Note-004/.tbl-2 \
      | classify-by-ratio \
          -v min=3 \
          -v xf=3 -v xn=3395 \
          -v yf=4 -v yn=2597 \
      | revbytes | sort -b -k5,5 | revbytes \
      > Note-004/pref-suff-class.txt

  Here are the significant entries (type > 0; the type is the R
  column), sorted by reverse prefix.
  
    prefix         TOT  EDY  AIR R
    ----------- ------ ---- ---- -
    -              846   81  765 9
    ----------- ------ ---- ---- -
    d-               6    6    . 1
    ----------- ------ ---- ---- -
    ch-            971  771  200 2
    dch-            58   47   11 2
    olkech-          3    3    . 1
    okech-          12   11    1 1
    fch-             4    3    1 3
    ofch-            9    8    1 1
    kch-            54   44   10 2
    lkch-            3    3    . 1
    olkch-           7    7    . 1
    okch-          104  101    3 1
    lch-           139  125   14 1
    olch-          111  105    6 1
    dolch-           6    6    . 1
    kolch-           3    3    . 1
    okolch-          4    4    . 1
    polch-          11   10    1 1
    rolch-          13   12    1 1
    och-            46   39    7 2
    pch-            43   26   17 5
    chepch-          5    5    . 1
    lpch-            3    2    1 4
    opch-           39   34    5 2
    rch-            47   43    4 1
    orch-            6    6    . 1
    ----------- ------ ---- ---- -
    k-             208   97  111 6
    dk-              4    3    1 3
    ek-             34   24   10 3
    cheek-         110  108    2 1
    lcheek-          9    8    1 1
    chek-          189  172   17 1
    lchek-           4    4    . 1
    rchek-           6    6    . 1
    oek-            17   14    3 2
    chk-            29   12   17 7
    lk-             32   20   12 4
    olk-           148   86   62 5
    dolk-            3    3    . 1
    opolk-           2    .    2 9
    rolk-           17   13    4 3
    ok-           1652  873  779 5
    dok-            12    9    3 3
    cheok-           4    4    . 1
    chok-            8    4    4 6
    kok-             2    .    2 9
    olok-            5    4    1 2
    rok-             6    3    3 6
    rk-              5    1    4 8
    ----------- ------ ---- ---- -
    l-              70   24   46 7
    chl-             3    1    2 7
    okl-             3    .    3 9
    ol-            109   58   51 5
    dol-            19   15    4 2
    cheol-           8    5    3 4
    chol-            4    4    . 1
    kol-             6    5    1 2
    okol-           31   22    9 3
    lol-             8    6    2 3
    opol-            3    3    . 1
    rol-             7    6    1 2
    orol-            4    4    . 1
    ----------- ------ ---- ---- -
    o-              32   18   14 5
    cho-             3    3    . 1
    oko-             7    6    1 2
    ----------- ------ ---- ---- -
    p-              19    1   18 9
    ep-              9    5    4 5
    cheep-           4    4    . 1
    chep-            4    3    1 3
    op-             10    4    6 7
    ----------- ------ ---- ---- -
    r-             253   18  235 9
    lr-              2    .    2 9
    or-             48   15   33 8
    dor-            10    7    3 3
    kor-             8    4    4 6
    okor-            5    4    1 2
    lor-             3    .    3 9
    olor-            5    2    3 7
    por-             2    .    2 9
    ror-             6    1    5 9
    doror-           3    2    1 4
    ----------- ------ ---- ---- -
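
  The `classify-by-ratio` script is not listed in these notes, so its
  exact rule is unknown. One rule that reproduces the R digit for the
  rows spot-checked against the table above is: divide each count by
  its column total (xn, yn), take the angle of the resulting point in
  degrees, and split the range from 0 to 90 degrees into nine bins of
  10 degrees each. This is a guess fitted to the table, not the
  original script, and it ignores the `min` parameter, whose role is
  not clear from the output. The function name is hypothetical.

```shell
# Guessed rule for the R column: angular bin of (EDY/xn, AIR/yn).
# R = 1 means almost pure EDY, R = 9 means almost pure AIR.
classify_angle() {
  awk -v xn=3395 -v yn=2597 '
    { x = ($3 == "." ? 0 : $3) / xn
      y = ($4 == "." ? 0 : $4) / yn
      deg = atan2(y, x) * 45 / atan2(1, 1)   # radians to degrees
      r = int(deg / 10) + 1
      if (r > 9) r = 9
      print $1, r }
  '
}

# Rows copied from the table above, in "prefix TOT EDY AIR" layout.
printf '%s\n' 'ch- 971 771 200' 'ok- 1652 873 779' '- 846 81 765' \
              'd- 6 6 .' 'lr- 2 . 2' 'pch- 43 26 17' | classify_angle
```

  For these six rows the sketch gives 2, 5, 9, 1, 9 and 5, which
  agrees with the R column of the table.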
  
  Note that the EDY/AIR ratio depends rather strongly on the
  last letter of the prefix.

  These were the counts for prefixes containing 'c[ktpf]h'
  before we decided to map those compounds to 'e[ktpf]e':
  
    Counts for 'c[ktpf]h' before:
    ----------- ------ ---- ---- -
    cph-             9    5    4 5
    checph-          4    4    . 1
    chcph-           3    3    . 1
    ----------- ------ ---- ---- -
    ckh-            26   19    7 3
    checkh-         82   80    2 1
    lcheckh-         8    8    . 1
    chckh-         133  128    5 1
    ockh-           16   13    3 2
    ----------- ------ ---- ---- -
    
    Counts for 'e[ktpf]-' before:
    ----------- ------ ---- ---- -
    ep-              .    .    . 0
    cheep-           .    .    . 0
    chep-            .    .    . 0
    ----------- ------ ---- ---- -
    ek-              8    5    3 4
    cheek-          28   28    . 1
    lcheek-          .    .    . 0
    chek-           56   44   12 2
    oek-             .    .    . 0
    ----------- ------ ---- ---- -

    Counts for 'e[ktpf]-' after:
    ----------- ------ ---- ---- -
    ep-              9    5    4 5
    cheep-           4    4    . 1
    chep-            4    3    1 3
    ----------- ------ ---- ---- -
    ek-             34   24   10 3
    cheek-         110  108    2 1
    lcheek-          9    8    1 1
    chek-          189  172   17 1
    oek-            17   14    3 2
    ----------- ------ ---- ---- -
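
  The jump in the 'e[ktpf]-' counts follows from the mapping itself:
  once 'c[ktpf]h' is rewritten as 'e[ktpf]e', the trailing 'e' is
  absorbed by the 'e*' at the start of the suffix pattern, so words
  formerly counted under 'chckh-' land under 'chek-' (for instance,
  26 + 8 = 34 for 'ek-'). The sketch below assumes the mapping is a
  plain sed substitution; the actual command is not shown in these
  notes, and the function names and sample words are made up.

```shell
# Assumed form of the compound mapping: c[ktpf]h -> e[ktpf]e.
map_compounds() {
  sed -e 's:c\([ktpf]\)h:e\1e:g'
}

# Same factoring filter as used earlier in this note.
factor_words() {
  grep -E -e '[orlmnd]$' \
    | sed -e 's:e*d*o$:- -EDY:' \
    | sed -e 's:e*d$:- -EDY:' \
    | sed -e 's:e*d*oi*[rlmn]$:- -AIR:' \
    | grep -E -e '- -'
}

# Made-up words containing the compounds, before and after mapping.
printf '%s\n' chckhedo checkhedo ckhoir | factor_words
printf '%s\n' chckhedo checkhedo ckhoir | map_compounds | factor_words
```

  Without the mapping the three words factor as "chckh- -EDY",
  "checkh- -EDY" and "ckh- -AIR"; with it they become "chek- -EDY",
  "cheek- -EDY" and "ek- -AIR".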