Hacking at the Voynich manuscript - Side notes
505 A complete factorization of words in ERA alphabet

Last edited on 2025-05-01 18:38:24 by stolfi

OBSOLETE

  This is partly a remake of work from Notebook-1.txt, originally done around
  97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]
    
    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]
    
    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    Then I went back and started redoing some of the previous tasks
    using the new encoding.
    
    I extracted the Currier (;C>) and Friedman-I (;F>)
    versions of the "bio" section, in EVA alphabet, as files
    bio-{c,f}-eva.evt. I also built the associated text files and word lists
    bio-{c,f}-eva{-{gut,fun,bad},{wds,dic,frq},.txt}. [Note-001.txt]
    
    Eventually I decided that it was necessary to map the data
    to a reduced alphabet (ERA) that identifies similar letters,
    both to reduce transcription and sampling noise and to
    make the results more manageable. Accordingly,
    I created files bio-{c,f}-era-gut.{wds,dic}. [Note-003.txt]

    After some ad-hoc hacking, I tentatively identified a paradigm
    which consists of a couple hundred prefixes combined with suffixes
    of the form -e*d and -e*d?y (the EDY class) and -e*d?[ao]i*[rlnm] (the
    AIR class). The prefixes and their statistics were saved to file 
    Note-004/pref-suff-class.txt [Note-004.txt]
    
97-11-10 stolfi
===============
  
  Let's see if there are any other interesting suffixes besides 
  the EDY and AIR classes:
  
    cat bio-f-era-gut.wds \
      | egrep -v '[do]$' \
      | egrep -v 'oi*[rlmn]$' \
      > Note-005/.fun-suffs.wds
      
    dicio-wc Note-005/.fun-suffs.wds
      
     lines   words     bytes file        
    ------ ------- --------- ------------
       174     174       717 Note-005/.fun-suffs.wds
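
  As a quick check on what the two filters keep and drop, here is a
  minimal sketch. The function name "residue" is hypothetical, the sample
  words are assembled by hand for illustration (not read from the data
  files), and grep -E is used in place of egrep.

```shell
# residue: drop words ending in d or o (the EDY class after the ERA collapse)
# and words ending in o, i*, then one of r l m n (the AIR class).
# Hypothetical helper; it mirrors the two egrep -v filters used above.
residue() {
  grep -E -v '[do]$' | grep -E -v 'oi*[rlmn]$'
}

# Hand-picked sample words, one per line.
printf '%s\n' okedo choin chol cheer okl | residue
# prints:
#   cheer
#   okl
```

  Here okedo is dropped by the first filter, choin and chol by the second,
  and only the -e*r and -l endings get through.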

  Only 174 of the 6166 words in bio-f-era-gut.wds survive both filters,
  so it seems we have pretty much got them all. Let's look at this
  remainder:

    cat Note-005/.fun-suffs.wds \
      | sort | uniq -c | expand \
      | revbytes | sort | revbytes \
      > Note-005/.fun-suffs.frq
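
  The revbytes | sort | revbytes idiom sorts lines by their endings.
  A minimal sketch of the same idea, assuming that revbytes simply reverses
  each line; the awk-based "revline" and the wrapper "rsort" are
  hypothetical stand-ins, not the actual tools used here.

```shell
# revline: reverse the characters of every input line
# (assumed stand-in for revbytes).
revline() {
  awk '{ s = ""; for (i = length($0); i > 0; i--) s = s substr($0, i, 1); print s }'
}

# rsort: sort lines by their reversed text, so that lines sharing an
# ending become adjacent.
rsort() {
  revline | LC_ALL=C sort | revline
}

printf '%s\n' chl cheer okl cher | rsort
# prints:
#   chl
#   okl
#   cheer
#   cher
```

  In the pipeline above the lines also carry the uniq -c counts in front,
  which does not matter much since the comparison starts from the end of
  the line.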
      
  The only significant suffixes in this list are { -l -e*r }:
  
    l(7) chl(6) lchl(2) dolchl(1) rolchl(1) kl(1) lkl(1) okl(4)

    r(24) cheer(6) dcheer(1) lcheer(1) olcheer(2) ocheer(2) rcheer(1)
    ekeer(1) oekeer(1) okeer(5) cher(11) lcher(1) rcher(2) cheker(1)
    oker(2) chr(4) lchr(1) rchr(3) chekr(1)
    
  There are also 11 isolated "m"s, and 18 isolated "in"s,
  and some funny suffixes:
  
    -ede(1) -ee(3) -e(2) -dl(5) -odl(1) 
    -ll(1) -nl(1) -oinl(1) -eerl(1) -em(4)
    -oinm(1) -een(2)
    -edr(1) -lr(6) -olr(6) -rlr(1)
  
  Looking at the prefixes, it seems that many of those that end
  in 'ol-' are followed by a very restricted class of suffixes.
  Let's see:
  
    cat bio-f-era-gut.wds \
      | egrep 'ol[edoirlmn][edoirlmn]*$' \
      | sed \
          -e 's/^/ /g' \
          -e 's/^.*[^edoirlmn]\([edoirlmn]*\)$/\1/' \
      | sort | uniq -c | expand \
      | sort -k1nr \
      > Note-005/.olsuffs.frq
  
  The significant suffixes found are
  
    olo(63) oldo(29) olor(23) olol(21) oloin(14) dolo(10)
    
  And the next, not so significant, are

    eoldo(7) doldo(5) eolo(5) lolo(5) olr(5) orolo(5) rolo(5) olom(4)
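
  The sed step in the pipeline above is a bit cryptic, so here is a
  minimal sketch of what it does. It prepends a blank to each word, so that
  even a word made up only of suffix letters has a non-suffix character in
  front, and then keeps whatever follows the last non-suffix character.
  The function name "ol_suffixes" and the sample words are hypothetical.

```shell
# ol_suffixes: select words that have "ol" followed by at least one more
# suffix letter at the end, then print their maximal terminal string of
# [edoirlmn] letters. Hypothetical helper mirroring the pipeline above.
ol_suffixes() {
  grep -E 'ol[edoirlmn][edoirlmn]*$' \
    | sed -e 's/^/ /' -e 's/^.*[^edoirlmn]\([edoirlmn]*\)$/\1/'
}

# Hand-picked sample words, one per line.
printf '%s\n' cholo okoldo dolo okol okedo | ol_suffixes
# prints:
#   olo
#   oldo
#   dolo
```

  The words okol and okedo are rejected by the grep step; dolo survives
  whole because the prepended blank is its only non-suffix character.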

  OK, so let's collect all the individual suffixes 
  (with EVA -> ERA collapse), including those special ones.
  We now have a simple definition of "suffix": a maximal
  terminal string consisting only of /[edoirlmn]/ characters.
  
    cat bio-f-era-gut.wds \
      | sed -e 's:\([edoirlmn]*\)$:- -\1:' \
      > Note-005/.factored
      
    dicio-wc bio-f-era-gut.wds Note-005/.factored
  
     lines   words     bytes file        
    ------ ------- --------- ------------
      6166    6166     34893 bio-f-era-gut.wds
      6166   12332     53391 Note-005/.factored
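
  To see what the factoring does to individual words, here is a minimal
  sketch that applies the same sed expression to a few hand-picked words.
  The wrapper name "factor" and the sample words are hypothetical.

```shell
# factor: split each word into "PREFIX- -SUFFIX", where SUFFIX is the
# maximal terminal string of [edoirlmn] letters (possibly empty).
# Hypothetical wrapper around the sed expression used above.
factor() {
  sed -e 's:\([edoirlmn]*\)$:- -\1:'
}

printf '%s\n' okedo olkeedo choin ol chek | factor
# prints:
#   ok- -edo
#   olk- -eedo
#   ch- -oin
#   - -ol
#   chek- -
```

  Every line gains exactly three bytes ("- -") and one extra field, which
  agrees with the counts above: 34893 + 3*6166 = 53391 bytes and
  2*6166 = 12332 words.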

  Now, let's collect the prefixes and suffixes:
  
    cat Note-005/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | uniq | revbytes \
      > Note-005/.prefs-all.dic

    cat Note-005/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq \
      > Note-005/.suffs-all.dic

    dicio-wc Note-005/.prefs-all.dic Note-005/.suffs-all.dic
  
     lines   words     bytes file        
    ------ ------- --------- ------------
       156     156      1023 Note-005/.prefs-all.dic
       219     219      1369 Note-005/.suffs-all.dic
  
  Great. Now let's count their occurrences and list the most important:

    cat Note-005/.factored \
      | gawk '/./ {print $1}' \
      | revbytes | sort | revbytes | uniq -c | expand \
      | sort -k1nr \
      > Note-005/.prefs-all.frq

  The 53 most important prefixes (at least 3 occurrences each),
  accounting for 6040 words:

    ok-(1730) -(1544) ch-(1036) k-(229) chek-(199) olk-(156) lch-(148)
    olch-(115) cheek-(110) okch-(106) dch-(61) rch-(57) kch-(54)
    och-(49) pch-(46) opch-(41) lk-(38) ek-(35) chk-(29) p-(25)
    oek-(18) rolk-(17) op-(14) rolch-(14) dok-(12) okech-(12)
    polch-(11) ep-(10) lcheek-(9) ofch-(9) chok-(8) olkch-(8)
    dolch-(7) orch-(6) rchek-(6) rok-(6) chep-(5) chepch-(5) olok-(5)
    rk-(5) cheep-(4) cheok-(4) dk-(4) fch-(4) lchek-(4) okolch-(4)
    chedch-(3) dolk-(3) kok-(3) kolch-(3) lkch-(3) lpch-(3) olkech-(3)
    
  The other 103 prefixes (fewer than 3 occurrences each), accounting for
  126 words total:
    
    chech-(2) cheeek-(2) chf-(2) dlch-(2) dorch-(2) ef-(2) epch-(2)
    kech-(2) lchep-(2) lok-(2) lolk-(2) ocheek-(2) odch-(2) of-(2)
    okolk-(2) olchek-(2) olfch-(2) olpch-(2) ook-(2) opolk-(2)
    orok-(2) pok-(2) roek-(2) chch-(1) chedok-(1) cheekch-(1)
    cheekeedch-(1) cheolch-(1) chkch-(1) chlchpch-(1) choek-(1)
    choep-(1) chokch-(1) cholch-(1) cholk-(1) chop-(1) chpch-(1)
    dcheek-(1) dchek-(1) dchok-(1) dkch-(1) dokch-(1) dolfch-(1)
    efch-(1) ekch-(1) kchdolk-(1) kchek-(1) kcheok-(1) keok-(1)
    koek-(1) kolk-(1) korch-(1) korolch-(1) ldch-(1) lf-(1) loch-(1)
    ochch-(1) ochek-(1) ochep-(1) oddch-(1) odok-(1) odorch-(1)
    oeek-(1) oep-(1) okchok-(1) okechek-(1) okeech-(1) okeeolch-(1)
    okeolch-(1) okoch-(1) okok-(1) okook-(1) okop-(1) olcheek-(1)
    olchk-(1) olek-(1) olkeeoch-(1) ollch-(1) oloef-(1) olokch-(1)
    ololch-(1) ololk-(1) olpoek-(1) ooch-(1) opolch-(1) ork-(1)
    pchef-(1) pdolch-(1) poldch-(1) poldok-(1) polk-(1) polkech-(1)
    porch-(1) prch-(1) rcheek-(1) rchkch-(1) rek-(1) reok-(1) rkch-(1)
    rokch-(1) rolchk-(1) rolkch-(1) rpch-(1)

  Now for the suffixes:

    cat Note-005/.factored \
      | gawk '/./ {print $2}' \
      | sort | uniq -c | expand \
      | sort -k1nr \
      > Note-005/.suffs-all.frq

  The 20 most significant suffixes (at least 30 occurrences),
  accounting for 5344 words:
  
    -edo(1161) -ol(690) -eo(608) -oin(561) -eedo(427) -o(348)
    -eeo(299) -or(267) -do(167) -eol(123) -doin(117) -dol(106)
    -rol(94) -roin(86) -dor(74) -olo(63) -ror(47) -eor(43) 
    -r(33) -ed(30)
    
  The 62 intermediate-frequency ones (fewer than 30 but
  at least 3 occurrences), accounting for another 659 words:
    
    -oldo(29) -edor(27) -lol(26) -oro(26) -l(23) -olor(23) -om(23)
    -edol(21) -olol(21) -eer(20) -orol(19) -ro(19) -in(18) -er(17)
    -odo(16) -eeol(15) -lo(15) -(14) -eed(14) -lor(14) -oloin(14)
    -oir(13) -edoin(12) -eeeo(11) -m(11) -oroin(11) -dolo(10) -d(9)
    -dom(9) -ldo(9) -oror(9) -eeedo(7) -eoldo(7) -oeedo(7) -doir(6)
    -doro(6) -lr(6) -dl(5) -doldo(5) -eeor(5) -eolo(5) -lolo(5)
    -odoin(5) -olr(5) -orolo(5) -rdo(5) -rolo(5) -rom(5) -eedol(4)
    -em(4) -loin(4) -olom(4) -on(4) -edom(3) -ee(3) -eedor(3) -ero(3)
    -oo(3) -ool(3) -oor(3) -roir(3) -rorol(3)
  
  The 137 least significant ones (fewer than 3 occurrences), accounting for
  only 163 words:
  
    -dedo(2) -deedo(2) -dolol(2) -dolor(2) -dorodo(2) -dororo(2) -e(2)
    -edoir(2) -een(2) -eero(2) -eodo(2) -eoin(2) -eolol(2) -eom(2)
    -eoo(2) -ldoin(2) -ldol(2) -lom(2) -loroin(2) -odol(2) -odor(2)
    -oeeedo(2) -oil(2) -ololo(2) -olorol(2) -orom(2) -deeedo(1)
    -deeo(1) -doedo(1) -doil(1) -doindo(1) -doinl(1) -doirodo(1)
    -doirol(1) -doiroldo(1) -dolord(1) -doloro(1) -dool(1) -dordo(1)
    -doroin(1) -dorom(1) -doror(1) -dororom(1) -drol(1) -ede(1)
    -ededo(1) -edeeo(1) -edeo(1) -edoldo(1) -edolo(1) -edool(1)
    -edoor(1) -edoro(1) -edorol(1) -edr(1) -eedeeo(1) -eedoldo(1)
    -eedom(1) -eeerol(1) -eeodo(1) -eeoin(1) -eeolo(1) -eeoolor(1)
    -eerl(1) -eerol(1) -eod(1) -eodoin(1) -eoeoloin(1) -eold(1)
    -eolor(1) -eoro(1) -eorol(1) -erdo(1) -ino(1) -ld(1) -lddo(1)
    -ldolor(1) -ldor(1) -ll(1) -lldor(1) -lod(1) -loeeo(1) -loinm(1)
    -loldo(1) -lolom(1) -lolor(1) -lorol(1) -lroiror(1) -lron(1)
    -lror(1) -n(1) -nl(1) -odeedo(1) -odeeo(1) -odoirol(1) -odorol(1)
    -oeeo(1) -oeo(1) -oinolo(1) -old(1) -olddo(1) -oldoir(1) -oldol(1)
    -oleedo(1) -oleeedo(1) -oleereo(1) -ollom(1) -olod(1) -oloino(1)
    -oloir(1) -ololdo(1) -olordo(1) -oloro(1) -oloroin(1) -oloror(1)
    -olro(1) -olrolo(1) -ooin(1) -ooo(1) -ooon(1) -ordo(1) -orodl(1)
    -orodo(1) -oroir(1) -orolom(1) -orolr(1) -ororo(1) -ororor(1)
    -rlr(1) -rodor(1) -roino(1) -roirol(1) -roldo(1) -rolor(1)
    -roro(1) -roroin(1) -roror(1)
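
  The band totals quoted above (number of distinct suffixes and number of
  words in each frequency band) can be rechecked with something like the
  following. This is only a sketch: the function name "tally" is
  hypothetical, and it assumes input lines of the form "COUNT ITEM" as in
  the .frq files.

```shell
# tally MIN [MAX]: count the entries whose COUNT lies in [MIN, MAX] and add
# up their counts. If MAX is omitted (or 0), there is no upper bound.
# Hypothetical helper; expects "COUNT ITEM" lines on standard input.
tally() {
  awk -v lo="$1" -v hi="${2:-0}" '
    $1 + 0 >= lo && (hi == 0 || $1 + 0 <= hi) { n++; w += $1 }
    END { printf "%d items, %d words\n", n, w }'
}

# A few entries copied from the suffix lists above.
printf '%s\n' '1161 -edo' '690 -ol' '33 -r' '30 -ed' '29 -oldo' '2 -e' | tally 30
# prints: 4 items, 1914 words
```

  On the full file, "tally 30", "tally 3 29" and "tally 1 2" should
  correspond to the three bands listed above.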

  Perhaps we should have left the "e*" part in the prefix...

  Here are all the 219 suffixes, sorted by their reversed spelling (so that
  suffixes with the same ending are adjacent) and grouped by last letters:

     14 -

      9 -d
     30 -ed
     14 -eed

      1 -ld
      1 -old
      1 -eold

      1 -eod
      1 -lod
      1 -olod

      1 -dolord

      2 -e
      1 -ede
      3 -ee

     23 -l

      5 -dl
      1 -orodl

      2 -oil
      1 -doil

      1 -ll

      1 -nl

      1 -doinl

    690 -ol
    106 -dol
     21 -edol
      4 -eedol
      2 -ldol
      1 -oldol
      2 -odol
    123 -eol
     15 -eeol
     26 -lol
     21 -olol
      2 -dolol
      2 -eolol
      3 -ool
      1 -dool
      1 -edool
     94 -rol
      1 -drol
      1 -eerol
      1 -eeerol
      1 -doirol
      1 -odoirol
      1 -roirol
     19 -orol
      1 -edorol
      1 -odorol
      1 -eorol
      1 -lorol
      2 -olorol
      3 -rorol

      1 -eerl

     11 -m
      4 -em

      1 -loinm

     23 -om
      9 -dom
      3 -edom
      1 -eedom
      2 -eom
      2 -lom
      1 -ollom
      4 -olom
      1 -lolom
      1 -orolom
      5 -rom
      2 -orom
      1 -dorom
      1 -dororom

      1 -n
      2 -een

     18 -in
    561 -oin
    117 -doin
     12 -edoin
      2 -ldoin
      5 -odoin
      1 -eodoin
      2 -eoin
      1 -eeoin
      4 -loin
     14 -oloin
      1 -eoeoloin
      1 -ooin
     86 -roin
     11 -oroin
      1 -doroin
      2 -loroin
      1 -oloroin
      1 -roroin

      4 -on
      1 -ooon
      1 -lron

    348 -o
    167 -do
      1 -lddo
      1 -olddo
   1161 -edo
      2 -dedo
      1 -ededo
    427 -eedo
      2 -deedo
      1 -odeedo
      7 -eeedo
      1 -deeedo
      1 -oleeedo
      2 -oeeedo
      1 -oleedo
      7 -oeedo
      1 -doedo
      9 -ldo
     29 -oldo
      5 -doldo
      1 -edoldo
      1 -eedoldo
      7 -eoldo
      1 -loldo
      1 -ololdo
      1 -roldo
      1 -doiroldo
      1 -doindo
     16 -odo
      2 -eodo
      1 -eeodo
      1 -doirodo
      1 -orodo
      2 -dorodo
      5 -rdo
      1 -erdo
      1 -ordo
      1 -dordo
      1 -olordo

    608 -eo
      1 -edeo
    299 -eeo
      1 -deeo
      1 -edeeo
      1 -eedeeo
      1 -odeeo
     11 -eeeo
      1 -oeeo
      1 -loeeo
      1 -oeo
      1 -oleereo

     15 -lo
     63 -olo
     10 -dolo
      1 -edolo
      5 -eolo
      1 -eeolo
      5 -lolo
      2 -ololo
      1 -oinolo
      5 -rolo
      1 -olrolo
      5 -orolo

      1 -ino
      1 -oloino
      1 -roino

      3 -oo
      2 -eoo
      1 -ooo

     19 -ro
      3 -ero
      2 -eero
      1 -olro
     26 -oro
      6 -doro
      1 -edoro
      1 -eoro
      1 -oloro
      1 -doloro
      1 -roro
      1 -ororo
      2 -dororo

     33 -r

      1 -edr

     17 -er
     20 -eer

     13 -oir
      6 -doir
      2 -edoir
      1 -oldoir
      1 -oloir
      3 -roir
      1 -oroir

      6 -lr
      5 -olr
      1 -orolr

      1 -rlr

    267 -or
     74 -dor
     27 -edor
      3 -eedor
      1 -ldor
      1 -lldor
      2 -odor
      1 -rodor
     43 -eor
      5 -eeor
     14 -lor
     23 -olor
      2 -dolor
      1 -ldolor
      1 -eolor
      1 -lolor
      1 -eeoolor
      1 -rolor
      3 -oor
      1 -edoor
     47 -ror
      1 -lroiror
      1 -lror
      9 -oror
      1 -doror
      1 -oloror
      1 -roror
      1 -ororor
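
  A listing like the one above can be approximated mechanically. The
  following is a minimal sketch, not the procedure actually used: it tags
  each "COUNT -SUFFIX" line with the reversed suffix as a sort key, sorts
  on that key, drops the key, and prints a blank line whenever the last
  letter changes. The function name "by_ending" is hypothetical, and the
  finer sub-grouping in the table above is not reproduced.

```shell
# by_ending: order "COUNT -SUFFIX" lines by reversed suffix and separate
# groups that differ in their last letter. Hypothetical helper.
by_ending() {
  awk '{ k = ""
         for (i = length($2); i > 0; i--) k = k substr($2, i, 1)
         print k "\t" $0 }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1 \
    | awk -F '\t' '{ c = substr($1, 1, 1)
                     if (NR > 1 && c != prev) print ""
                     prev = c
                     print $2 }'
}

# A few entries copied from the suffix counts above.
printf '%s\n' '690 -ol' '267 -or' '106 -dol' '74 -dor' '33 -r' | by_ending
# prints:
#   690 -ol
#   106 -dol
#
#   33 -r
#   267 -or
#   74 -dor
```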