Hacking at the Voynich manuscript - Side notes
008 A prefix-midfix-suffix factorization of the bio section in EVA encoding

Last edited on 2025-05-01 18:20:53 by stolfi

  This is partly a remake of work from Notebook-1.txt, originally done around
  97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I went back and started redoing some of the previous tasks
    using the new encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>)
    versions of the "bio" section, in EVA alphabet, as files
    bio-{c,f}-eva.evt. I also built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]
    
    After mapping the data to a reduced alphabet (ERA) [Note-003.txt],
    I identified a paradigm which consists of 267 prefixes combined
    with 147 suffixes; the latter are maximal strings of the form
    [doirlmn]*.  If we use just the 91 most common prefixes
    (>= 3 occurrences) and the most common 24 suffixes (>= 20 occurrences),
    we can reproduce 5867 of the 6166 original words (95.1%)
    counting repetitions. If we look only at distinct words, 
    we get 534 out of 763 with 98 prefixes and 22 suffixes. [Note-006.txt]
    
    Later I did a similar factorization of the words in the full
    EVA alphabet, defining suffixes as [daoirsl]*[rlmnyd]. 
    This factorization had 526 prefixes and 206 suffixes,
    and left out 88 words (55 distinct). The top 160 prefixes
    and 30 suffixes generate 5176 out of 6166 words (84%).
    Ignoring word frequencies, with 175 prefixes and 33
    suffixes we can generate 712 out of 1342 words (53%). [Note-007.txt]
    
97-11-11 stolfi
===============

  The previous factorizations seemed unsatisfactory.  I thought perhaps
  I was being too greedy and including in the suffix things that
  belonged to the prefix.
  
  Looking at the suffix list of Note-007.txt, I concluded that 
  a suffix is made up of these components:
  
    [aoydsm] i*n i*r i*l
    
  So this will be the basis of my next factorization attempt.
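
  As a rough illustration of the component list above, here is a
  minimal sketch that tests whether a string can be written as a
  sequence of those components.  The function name and the reading of
  the list as the regular expression ([aoydsm]|i*[nrl])+ are my
  assumptions, not part of the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 is a concatenation of the
# components [aoydsm], i*n, i*r, i*l (the last three folded into i*[nrl]).
is_component_string() {
  printf '%s\n' "$1" | grep -Eq '^([aoydsm]|i*[nrl])+$'
}

# Sample strings chosen for illustration only.
for s in aiin dy aldy air ai ke; do
  if is_component_string "$s"; then
    echo "$s: yes"
  else
    echo "$s: no"
  fi
done
# prints:
#   aiin: yes
#   dy: yes
#   aldy: yes
#   air: yes
#   ai: no
#   ke: no
```

  Note that "ai" is rejected because a run of "i" must be closed by
  "n", "r" or "l" under this reading.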

  There is a strong indication that these letters are to be
  parsed in pairs { al ar aiin dy ol ... }.  But let's
  leave that for later.
  
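  Also each word seems to have a (generally short) prefix
  made with letters { qo o a y l d r s } (but beware not
  to eat the "s" in "sh").   So I will try a tripartition
  of the words into prefix, midfix and suffix.  The script matches
  both prefix and suffix with the single letter class [qoaydirslmn],
  and protects "sh" by temporarily mapping it to "X":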
    
    cat bio-f-eva-gut.wds \
      | sed \
          -e 's/sh/X/g' \
          -e 's/$/}/' \
          -e 's/^/{/' \
          -e 's/{\([qoaydirslmn][qoaydirslmn]*\)/\1{/' \
          -e 's/\([qoaydirslmn][qoaydirslmn]*\)}/}\1/' \
          -e 's/X/sh/g' \
          -e 's/{}/\./' \
          -e 's/\.//g' \
          -e 's/{/- -/' \
          -e 's/}/- -/' \
      > factored
      
  [ I redid all this processing on 97-12-10, after fixing a bug in the
  prefix-parsing "sed" script.  In its previous form, the script had
  left some "q"s attached to the midfix. ]

  Each line in the file factored is either a single word
  consisting entirely of prefix/suffix letters (a "unifix"),
  or a Voynich word broken into three Unix words
  ("prefix", "midfix", "suffix").
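
  To make the output format concrete, here is a minimal sketch that
  wraps the same sed steps in a function and applies them to a few
  sample words.  The function name factor_words is mine, and the two
  steps that turn "{}" into "." and then delete all dots are folded
  into one deletion of "{}", which is equivalent as long as the input
  contains no dots.

```shell
#!/bin/sh
# Hypothetical wrapper around the factoring steps described above.
# Reads one word per line, writes "prefix- -midfix- -suffix" or a bare unifix.
factor_words() {
  sed \
    -e 's/sh/X/g' \
    -e 's/$/}/' \
    -e 's/^/{/' \
    -e 's/{\([qoaydirslmn][qoaydirslmn]*\)/\1{/' \
    -e 's/\([qoaydirslmn][qoaydirslmn]*\)}/}\1/' \
    -e 's/X/sh/g' \
    -e 's/{}//' \
    -e 's/{/- -/' \
    -e 's/}/- -/'
}

# Sample words chosen for illustration only.
printf '%s\n' qokeedy chedy shol daiin | factor_words
# prints:
#   qo- -kee- -dy
#   - -che- -dy
#   - -sh- -ol
#   daiin
```

  The last word consists only of prefix/suffix letters, so the whole
  of it is swallowed by the prefix rule and it comes out as a unifix.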

    cat factored \
      | grep -v -e '- -' \
      > unifs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $1}' \
      > prefs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $2}' \
      > midfs-all.wds

    cat factored \
      | grep -e '- -' \
      | gawk '/./ {print $3}' \
      > suffs-all.wds
      
    dicio-wc {prefs,midfs,suffs,unifs}-all.wds

     lines   words     bytes file        
    ------ ------- --------- ------------
      4666    4666     13963 prefs-all.wds
      4666    4666     26318 midfs-all.wds
      4666    4666     18576 suffs-all.wds
      1516    1516      6418 unifs-all.wds

    foreach f ( prefs midfs suffs unifs )
      cat ${f}-all.wds \
        | sort | uniq -c | expand | sort -k1nr \
        > ${f}-all.frq
    end
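
  The frequency tables are plain "sort | uniq -c" counts ordered by
  decreasing count.  A minimal sketch in Bourne shell (the function
  name freq_table is mine; the final awk step only normalizes the
  column padding of "uniq -c", which differs between systems, and the
  secondary key -k2,2 is an added tie-breaker):

```shell
#!/bin/sh
# Hypothetical helper: count identical input lines, most frequent first.
freq_table() {
  sort | uniq -c | sort -k1,1nr -k2,2 | awk '{ print $1, $2 }'
}

# Six sample prefix fields, for illustration only.
printf '%s\n' qo- - qo- o- - qo- | freq_table
# prints:
#   3 qo-
#   2 -
#   1 o-
```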
      
    dicio-wc {prefs,midfs,suffs,unifs}-all.frq

     lines   words     bytes file        
    ------ ------- --------- ------------
        53     106       657 prefs-all.frq
       209     418      3295 midfs-all.frq
       113     226      1513 suffs-all.frq
       230     460      3013 unifs-all.frq

    pr -m -w 100 -e -t \
        {prefs,midfs,suffs,unifs}-all.frq \
      | expand \
      > joint-all.frq

    freq prefix     freq midfix     freq suffix     freq unifix
    ---- --------   ---- --------   ---- --------   ---- --------
     1859 -          824 -k-        1728 -dy         186 ol
     1296 qo-        588 -che-      1239 -y          126 qol
      607 o-         514 -she-       422 -aiin       106 daiin
      255 ol-        387 -kee-       254 -al          71 dal
      209 l-         354 -t-         245 -ol          64 dar
      108 y-         347 -ke-        157 -ar          56 saiin
       75 d-         179 -te-         86 -ain         55 or
       45 r-         121 -ch-         66 -or          50 sol
       36 qol-       113 -tee-        51 -d           48 dy
       29 s-         105 -shee-       36 -s           36 aiin
       23 q-          95 -chee-       28 -dar         28 dol
       21 sol-        83 -sh-         25 -dal         25 oly
       12 dy-         58 -pche-       21 -am          21 lol
        8 sal-        49 -chckh-      20 -            21 sal
        7 so-         38 -kch-        20 -aly         18 ar
        6 dal-        38 -p-          16 -a           18 iin
        6 olo-        33 -tche-       16 -l           18 raiin
        5 a-          31 -sheckh-     13 -oldy        17 sor
        5 dol-        28 -tch-        12 -daiin       15 al
        4 al-         27 -kche-       10 -air         15 sar
        4 lo-         26 -chcth-      10 -ary         14 s
        4 or-         25 -checkh-     10 -r           13 olor
        3 oqo-        25 -shckh-       9 -aldy        12 olol
        3 qod-        24 -shek-        7 -as          12 rol
        2 dl-         22 -kshe-        6 -ady         11 m
        2 do-         20 -ee-          6 -alor        11 ral
        2 lol-        17 -checth-      6 -dol         11 y
        2 olol-       17 -chek-        6 -dor         10 lor
        2 qoqo-       17 -tshe-        6 -oiin        10 oldy
        2 qor-        16 -pch-         6 -sdy         10 r
        2 rol-        15 -cth-         6 -sy           9 dain
        1 alo-        14 -cthe-        5 -o            9 olaiin
        1 aro-        14 -fche-        5 -oly          8 dam
        1 dar-        12 -chckhe-      4 -alol         8 ldy
        1 dor-        12 -shcth-       4 -an           8 ly
        1 ld-         11 -ckhe-        4 -dam          8 ory
        1 od-         11 -keee-        4 -m            8 qor
        1 odd-        10 -shckhe-      4 -ody          7 l
        1 oll-        10 -shecth-      3 -ay           7 orol
        1 oro-         9 -cheek-       3 -ydy          7 qoly
     ... ...         ... ...         ... ...         ... ...
    ---- --------   ---- --------   ---- --------   ---- --------
    4666 TOTAL      4666 TOTAL      4666 TOTAL      1516 TOTAL

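  Let's count the words with empty prefix and/or empty suffix.  The
  four patterns are mutually exclusive, and none of them matches a
  unifix line:
  
    cat factored | egrep '^-.*[^-]$'       | wc 
    cat factored | egrep '^[^-].*-$'       | wc
    cat factored | egrep '^-.*-$'          | wc
    cat factored | egrep '^[^-].*-.*[^-]$' | wc

    empty prefix only = 1849
    empty suffix only = 10
    both empty        = 10
    both non-empty    = 2797

  Check: 1849 + 10 = 1859 and 10 + 10 = 20 agree with the counts of
  the empty prefix and the empty suffix in the table above, and the
  four classes add up to the 4666 factored words.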
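
  The same tally can be made in one pass, which guarantees that each
  factored line falls in exactly one class.  This is only a sketch:
  the function name classify_factored is mine, and it assumes the
  three-field "prefix- -midfix- -suffix" layout, with a lone "-"
  standing for an empty prefix or suffix.

```shell
#!/bin/sh
# Hypothetical helper: classify factored lines by empty prefix/suffix.
classify_factored() {
  awk '
    NF != 3 { next }                 # unifix lines have a single field
    {
      p = ($1 == "-"); s = ($3 == "-")
      if (p && s)  both++
      else if (p)  ponly++
      else if (s)  sonly++
      else         none++
    }
    END {
      printf "empty prefix only = %d\n", ponly
      printf "empty suffix only = %d\n", sonly
      printf "both empty = %d\n", both
      printf "both non-empty = %d\n", none
    }'
}

# Five sample lines, for illustration only.
printf '%s\n' '- -k- -dy' 'qo- -ke- -' '- -che- -' 'qo- -k- -dy' 'ol' \
  | classify_factored
# prints:
#   empty prefix only = 1
#   empty suffix only = 1
#   both empty = 1
#   both non-empty = 1
```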
    
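  Listing the "non-hard" midfixes, i.e. those that contain any letter
  other than c, h, t, k, p, f, e (with "sh" counted as a single letter):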
  
    cat midfs-all.frq \
      | sed -e 's/sh/X/g' \
      | egrep '[^- 0-9Xchtkpfe]' \
      | sed -e 's/X/sh/g' \
      > midfs-anomalous.frq
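
  To show what this filter keeps and drops, here is a minimal sketch
  run on four frequency lines.  The function name anomalous_midfixes
  is mine; the first two sample lines are taken from the table above,
  while the last two ("-cho-" and "-ks-") are invented for the
  example and are not actual counts.

```shell
#!/bin/sh
# Hypothetical wrapper around the filter above: keep only the lines
# whose midfix has a letter outside { sh c h t k p f e }.
anomalous_midfixes() {
  sed -e 's/sh/X/g' \
    | grep -E '[^- 0-9Xchtkpfe]' \
    | sed -e 's/X/sh/g'
}

# The last two lines are made-up entries, not data from the manuscript.
printf '%s\n' \
  '    824 -k-' \
  '    514 -she-' \
  '      2 -cho-' \
  '      1 -ks-' \
  | anomalous_midfixes
# prints:
#         2 -cho-
#         1 -ks-
```

  Mapping "sh" to "X" first is what lets a lone "s" be flagged
  without also flagging every "sh".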