Hacking at the Voynich manuscript - Side notes
104 Listing repeated words in Voynichese and other languages

Last edited on 2025-05-04 16:51:55 by stolfi

INTRODUCTION

  In this note we list and tabulate repeats -- lexemes that occur two
  or more times in consecutive positions in the text -- for Voynichese
  and other languages.
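
  As a minimal sketch of that definition (not the actual
  list-duplicate-words script, whose logic is not shown in this note),
  the Python fragment below counts repeats in a token list. The name
  count_repeats is hypothetical, and the fragment assumes that every
  adjacent pair of equal tokens counts as one repeat, so that a run
  "a a a" contributes two. The sample tokens are only illustrative.

```python
from collections import Counter

def count_repeats(tokens):
    """Count adjacent equal token pairs, keyed by lexeme (assumed convention)."""
    dups = Counter()
    for prev, curr in zip(tokens, tokens[1:]):
        if prev == curr:
            dups[curr] += 1
    return dups

# Illustrative token stream, not taken from any of the samples.
tokens = "daiin daiin chol chol chol shedy".split()
dups = count_repeats(tokens)
total = sum(dups.values())
# Per-lexeme counts, total repeats, and repeats per token.
print(dups.most_common(), total, total / len(tokens))
```

  The last printed value corresponds to the rate column of the
  summary table, under the same counting assumption.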
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../tr-stats/dat
    ln -s ../tr-stats/tex
    ln -s /home/staff/stolfi/voynich/work 

    ln -s work/compute-freqs
    ln -s work/list-duplicate-words
    ln -s work/update-paper-include
    ln -s work/fix-words
    
    ln -s ../100/subsections.tags

LISTING REPEATED LEXEMES

  Language samples:
  
    set samples = ( \
      voyn/tak \
      \
      engl/wow \
      engl/cul \
      engl/twp \
      latn/ptt \
      latn/nwt \
      latn/ock \
      grek/nwt \
      span/qvi \
      ital/psp \
      fran/tal \
      port/csm \
      germ/sim \
      russ/pic \
      russ/ptt \
      arab/quf \
      arab/quv \
      arab/qud \
      arab/qph \
      arab/qcs \
      hebr/tav \
      hebr/tad \
      geez/gok \
      viet/ptt \
      viet/nwt \
      chin/ptt \
      chin/ptn \
      chin/red \
      chin/voa \
      chip/voa \
      tibe/vim \
      tibe/ccv \
      tibe/pmi \
      \
      enrc/wow \
      chrc/red \
      envt/wow \
      envg/wow \
      \
      voyp/grs \
      voyp/grm \
      viep/grs \
      viep/mky \
      \
      engl/wnm \
      engl/cpn \
    )    

  Generating the dup-lexeme listings:

    foreach sample ( $samples )
      set lang = ${sample:h}
      set book = ${sample:t}
      make LANG=${lang} BOOK=${book} -f dup-word-lists.make all
    end

  Printing summary:
  
    set sfile = ".summary"
    rm -f ${sfile}
    foreach sample ( $samples )
      set tfile = "tex/${sample}/tot.1/raw-dup-summary.tex"
      echo -n "${sample}" >> ${sfile}
      grep -h DupTotCt ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %5d", s; }' >> ${sfile}
      grep -h DupTotFr ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %.5f", s; }' >> ${sfile}
      grep -h DupMaxCt ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %5d", s; }' >> ${sfile}
      grep -h DupMaxFr ${tfile} \
        | gawk '//{ s = $0; gsub(/[\\A-Za-z{}]/, "", s); printf " %.5f", s; }' >> ${sfile}
      grep -h DupMaxWd ${tfile} \
        | gawk '//{ s = $0; gsub(/^[\\]def[\\][A-Za-z]*/, "", s); printf " %s\n", s; }' >> ${sfile}
    end
    sort -b -k3,3gr ${sfile}
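
  The extraction step can also be sketched in Python. This is only an
  illustration: the line shape "\def\DupTotCt{317}" is an assumption
  inferred from the gawk patterns above, parse_dup_summary is a
  hypothetical helper, and the sample values are the voyn/tak row of
  the output listed below.

```python
import re

# Assumed line shape, inferred from the gawk patterns: \def\DupTotCt{317}
DEF_RE = re.compile(r"^\\def\\([A-Za-z]+)\{(.*)\}\s*$")

def parse_dup_summary(text):
    """Return a dict mapping macro names to their brace-delimited bodies."""
    fields = {}
    for line in text.splitlines():
        m = DEF_RE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields

# Stand-in for the contents of one raw-dup-summary.tex file.
sample_text = "\n".join([
    r"\def\DupTotCt{317}",
    r"\def\DupTotFr{0.00802}",
    r"\def\DupMaxCt{22}",
    r"\def\DupMaxFr{0.00056}",
    r"\def\DupMaxWd{Chol}",
])
f = parse_dup_summary(sample_text)
# Same column formats as the printf calls in the csh loop.
row = "voyn/tak %5d %.5f %5d %.5f {%s}" % (
    int(f["DupTotCt"]), float(f["DupTotFr"]),
    int(f["DupMaxCt"]), float(f["DupMaxFr"]), f["DupMaxWd"])
print(row)
```

  The greedy match keeps nested braces intact, so a body such as
  "lao\tn{3}{1}" would come through whole.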

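  Output (manually re-sorted).  Judging by the macro names, the
  columns are: sample, total dup count, total dup rate, count and
  rate of the most frequent dup, and that lexeme: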

      voyn/tak   317 0.00802    22 0.00056 {Chol}

      chin/red   351 0.00995    44 0.00125 {lao\tn{3}{1}}
      
      chip/voa   151 0.00427    35 0.00099 {guo\tn{2}{}}
      
      viet/nwt   106 0.00294    18 0.00050 {nga`i}
      ital/psp    92 0.00258     8 0.00022 {no}
      tibe/ccv    90 0.00257    54 0.00154 {MA}
      chin/voa    91 0.00255    35 0.00098 {guo\tn{2}{}}
      chin/ptn    85 0.00238    12 0.00034 {ji\tn{4}{4}}
      russ/pic    80 0.00221     4 0.00011 {davaj}
      chin/ptt    67 0.00186     7 0.00019 {ji\tn{4}{4}}
      port/csm    60 0.00171     8 0.00023 {não}
      grek/nwt    63 0.00170    16 0.00043 {\alpha\mu\eta\nu}
      tibe/pmi    54 0.00154    10 0.00029 {KHANG}
      latn/nwt    57 0.00153    16 0.00043 {amen}
      engl/twp    54 0.00130     6 0.00014 {alas}
      viet/ptt    43 0.00119    10 0.00028 {ddo+`i}
      tibe/vim    27 0.00077    12 0.00034 {DE}
      hebr/tad    25 0.00066     3 0.00008 {b¤b¤qr}
      fran/tal    22 0.00061     3 0.00008 {de}
      geez/gok    17 0.00049     8 0.00023 {'alElene}
      engl/wow    16 0.00045     3 0.00008 {had}
      hebr/tav    16 0.00042     3 0.00008 {b¤äb¤öqêr}
      latn/ock    11 0.00031     4 0.00011 {et}
      russ/ptt    11 0.00031     3 0.00009 {ВЕСЬМА}
      engl/cul    11 0.00030     4 0.00011 {it}
      span/qvi    20 0.00028     4 0.00006 {el}
      latn/ptt     9 0.00024     2 0.00005 {septena}
      germ/sim     8 0.00023     2 0.00006 {halt}
      arab/qud     6 0.00016     1 0.00003 {alllh}
      arab/qcs     4 0.00011     1 0.00003 {allh}
      arab/quf     2 0.00005     1 0.00003 {allhî}
      arab/quv     2 0.00005     1 0.00003 {allhî}
      arab/qph     1 0.00003     1 0.00003 {fyhî}

      voyp/grm     4 0.00551     1 0.00138 {dal}
      voyp/grs     5 0.00256     1 0.00051 {Chedy}

      viep/mky    80 0.00222     7 0.00019 {cho}
      viep/grs    16 0.00051     4 0.00013 {nga}

      envg/wow     0 0.00000     0 0.00000 {-}
      envt/wow    39 0.00110    11 0.00031 {ngu+o+i}
      enrc/wow    16 0.00045     3 0.00008 {dlxxvi}

      chrc/red   351 0.00995    44 0.00125 {slkiy}

NOTES
  
  The VMS has 8.0 dups per 1000 tokens.  That is only the second
  highest rate, somewhat less than the Dream of the Red Chamber
  (10 dups per 1000).
  
  The next significant entries are the Voice of America broadcasts
  in Pinyin, with about half as many dups as the VMS (4.3/1000).
  Then comes a bunch of languages with closely spaced values,
  starting with Vietnamese (2.9/1000) and ending with Arabic
  (0.03 to 0.16/1000, virtually zero).
  
  Rugg's method hand-applied to Voynichese yields about 5.5 dups per
  1000 tokens, roughly 70% of the rate seen in the real VMS; the
  software version yields much less (about 2.6/1000).
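
  The arithmetic behind these figures, as a small Python sketch. The
  rates are copied from the third column of the output table; matching
  voyp/grm to the hand-applied run and voyp/grs to the software run is
  an inference from the quoted numbers, and the helper names are
  hypothetical.

```python
# Dup rates (dups per token) copied from the output table.
rates = {"voyn/tak": 0.00802, "voyp/grm": 0.00551, "voyp/grs": 0.00256}

def per_thousand(rate):
    """Convert a per-token rate to dups per 1000 tokens."""
    return 1000.0 * rate

def relative(sample, reference="voyn/tak"):
    """Dup rate of a sample as a fraction of the reference sample's rate."""
    return rates[sample] / rates[reference]

for name in ("voyp/grm", "voyp/grs"):
    print(name, round(per_thousand(rates[name]), 1), round(relative(name), 2))
```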
  
  When applied to Vietnamese, Rugg's method produces few duplicates
  (0.5/1000).  A second-order Markov monkey gives 2.2/1000, comparable
  to the true text (2.9/1000).
  
  Roman-coded English and Chinese, as expected, have the same dup
  counts as the plain versions. Viet-coded English has more dups than
  plain English, presumably because of the use of combinations of the
  form "X-Y" and "Y" for distinct English words; the sequence "X-Y Y"
  will be counted as a dup of Y.
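
  A minimal Python sketch of that presumed effect. It assumes that
  the tokenizer treats the hyphen as a word separator, which is a
  guess at the cause rather than a checked fact; the code words
  "xa-yo" and "yo" are made up, and both helper names are
  hypothetical.

```python
import re

def syllable_tokens(text):
    """Split on whitespace and hyphens (assumed tokenization)."""
    return [t for t in re.split(r"[-\s]+", text) if t]

def adjacent_dups(tokens):
    """Return the second member of every adjacent pair of equal tokens."""
    return [b for a, b in zip(tokens, tokens[1:]) if a == b]

coded = "xa-yo yo"   # made-up code words of the form "X-Y" and "Y"
whole = adjacent_dups(coded.split())           # code words kept whole
split = adjacent_dups(syllable_tokens(coded))  # hyphen treated as separator
print(whole, split)
```

  With whole code words there is no repeat; once the hyphen splits
  "xa-yo" in two, the pair "yo yo" shows up as a dup.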

  Wells's "War of the Worlds" has 16 dups (0.45 per 1000 tokens),
  but they are all destroyed by Vigenère encoding (the only file
  with zero dups).
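
  Why a Vigenère cipher destroys repeats can be seen in a small
  Python sketch. The details here are assumptions, not the encoding
  actually used for envg/wow: the key "lemon" is arbitrary, the key
  runs over letters only, and it is not reset at word boundaries.

```python
def vigenere(text, key):
    """Encrypt a-z letters with a running key; other characters pass through."""
    out = []
    i = 0  # position in the key stream, advanced only on letters
    for ch in text:
        if "a" <= ch <= "z":
            shift = ord(key[i % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1
        else:
            out.append(ch)
    return "".join(out)

plain = "he had had enough"
cipher = vigenere(plain, "lemon")  # "lemon" is an arbitrary example key
words = cipher.split()
print(words)
```

  The two copies of "had" start at different key positions, so they
  encrypt to different strings.  In this scheme a repeat generally
  survives only when the word length is a multiple of the key period.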