Hacking at the Voynich manuscript - Side notes
100 Preparing a clean Voynichese sample for analysis 

Last edited on 2002-01-17 21:12:13 by stolfi

SUMMARY

  We prepare clean Voynichese samples of prose and labels
  (without weirdos, unreadable characters, or contentious readings)
  for the statistical analyses that will go into the "word structure"
  technical report.
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../../basify-weirdos
    ln -s ../../compute-cum-cum-freqs
    ln -s ../../compute-cum-freqs
    ln -s ../../compute-freqs
    ln -s ../../combine-counts
    ln -s ../../totalize-fields
    ln -s ../../select-units
    ln -s ../../words-from-evt
    ln -s ../../format-counts-packed
    ln -s ../../update-paper-include

  Paper directories:

    set tbldir = "/home/staff/stolfi/papers/voynich-stats/techrep/tables/auto"
    set figdir = "/home/staff/stolfi/papers/voynich-stats/techrep/figures/auto"

REFERENCE DATA

  The source data will be the majority edition derived from interlinear
  release 1.6e6, already chopped into pages and sections.
  We will eventually exclude words with weirdos and contentious characters.

SECTION NAMES

  Get (sub)section names:

    ln -s ../045/subsecs-m text-subsecs
    ln -s ../045/only-m.evt text-all.evt

    set secs = ( `cat subsections.tags` )
    set secscm = `echo ${secs} | tr ' ' ','`
    echo ${secs}; echo ${secscm}
    
  Checking whether we missed anything:
 
    echo "${secs}" | tr ' ' '\012' | sort > .foo
    diff .foo text-subsecs/all.names

  Extracting the good subsections:
  
    cat subsections.tags \
      | egrep -v 'unk|xxx' \
      > subsections-ok.tags
    echo `cat subsections-ok.tags`

DIRECTORY STRUCTURE

  The data files (text, word counts, tables, etc.) for each sample
  text will live in the subdirectory sample/LANG/BUK/SEC.K/, where
  
     LANG  the sample's language. Two samples should have the same LANG
             only if they use the same spelling for shared words.
             Thus, English and Italian are different LANGs. Different
             encodings of Chinese (pinyin, GR, RomanNum) are different
             LANGs. Medieval French and Modern French are different
             LANGs. The Bible (with modern spelling) and War of the
              Worlds are the same LANG. After much analysis, it seems
             that we can assign a single LANG ("voyn") to all parts of
             the VMS. 
     
     BUK   the book. Two samples with the same LANG and BUK should
             be by the same author and part of the same book. For 
             Voynichese, BUK is 
             
               "maj" - the whole text, majority version.
               "tak" - the whole text, Takeshi's version.
               "pro" - running prose (parags, lines, etc.) from "maj".
               "lab" - labels, titles, word lists, etc. from "maj".
               "ini" - first word after each breaj, from "pro".
               "fin" - last word before each break, from "pro".
               "mid" - all of "pro" except "ini" and "fin".
               
             For the other languages, BUK is a book tag (e.g. "wow"
             for War of the Worlds, "ptt" for the Pentateuch).
     
     SEC   the major division within the book. The sections must
             be disjoint. One should divide a book into sections only 
             if their usage of common words is expected to be different 
             (due to differences in subject matter and/or style), 
             and those differences are considered relevant for the
             analysis. In any case, for each book there should be a
             section called "tot" which is the union of all other
             sections.
             
             For the VMS, each classical section is a separate SEC,
             except that we split the herbal section into "hea" and
             "heb". In the Culpeper herbal, the preamble, plant
             descriptions, and recipes could be in three separate
             SECs. In the Pentateuch, we could let each of the five
             books be a separate SEC.
             
     K     the subsection.  This subdivision is used only in 
             some studies; otherwise, all subsections of the same 
             section are handled together.  For Voynichese, a subsection 
             is a maximal string of *consecutive* pages that belong to 
             the same SEC.  For other languages, we usually don't need to 
              have more than one subsection.  In any case, the "tot" section
              has a single subsection, with K = "t" (hence the tag "tot.t").
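
  For example, under this scheme the main Voynichese samples prepared
  below live in directories such as

    sample/voyp/vms/hea.1/   prose tokens, first herbal-A page run
    sample/voyl/vms/zod.1/   labels, zodiac section
    sample/voyn/vms/tot.t/   all word-like text, whole book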
  
VOYNICHESE LANGUAGES

  Voynichese has six LANGs:
  
      voyn = all word-like text (prose and labels, but not letters).
      voyp = prose text only.
      voyl = labels only.
      voyi = line-initial prose tokens.
      voyf = line-final prose tokens.
      voym = prose text minus line-initial and line-final tokens.

    set langs = ( voy{n,p,l,i,f,m} )
    set langscm = "`echo ${langs} | tr ' ' ','`"
    echo ${langscm}

PREPARING THE VOYNICHESE SAMPLES

  Per-section data (only for Voynichese) will live in subdirectories
  sample/LANG/vms/SEC.K, where SEC.K is the section and subsection tag.
  Let's create the respective book directories:

    mkdir sample 
    foreach pd ( sample ${tbldir} ${figdir} )
      mkdir ${pd}/{${langscm}}
      mkdir ${pd}/{${langscm}}/vms
      mkdir ${pd}/{${langscm}}/vms/{${secscm},tot.t}
    end
    
  Link the list of section-subsection tags (except "tot.t") into handy places:

    foreach lang ( ${langs} )
      (cd sample/${lang}/vms/ && ln -s ../../../subsections.tags )
      (cd sample/${lang}/vms/ && ln -s ../../../subsections-ok.tags )
    end

EXTRACTING THE RAW EVT-FORMATTED TEXT

  Copy the raw EVT-formatted text files to the appropriate
  subdirectories of "voyn", converting all weirdos to basic
  EVA chars, or to "*" if impossible:
  
    set utypes = \
      'parags,starred-parags,circular-lines,circular-text,radial-lines,titles,labels,words'
    foreach sec ( ${secs} "tot.t" )
      set ifile = "text-subsecs/${sec}.evt"
      if ( "$sec" == "tot.t" ) set ifile = "text-all.evt"
      set ofile = "sample/voyn/vms/${sec}/raw.evt"
      echo "${ifile} -> ${ofile}"
      cat ${ifile} \
        | sed -e 's/[&][*\!][*\!][*\!][*\!;]/*\!\!\!\!/g' \
        | basify-weirdos \
        | select-units \
            -v types="${utypes}" \
            -v table=unit-to-type.tbl \
        > ${ofile}
    end
    dicio-wc sample/voyn/vms/{${secscm},tot.t}/raw.evt
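
  As a quick sanity check (a sketch only: it assumes that, after
  basify-weirdos, the only asterisks in the text are the "*"
  placeholders for unreadable or unconvertible glyphs), we can count
  how many such placeholders remain in each subsection:

    foreach sec ( ${secs} "tot.t" )
      printf "%-8s" "${sec}"
      cat sample/voyn/vms/${sec}/raw.evt | tr -cd '*' | wc -c
    end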

  Now separate out the EVT-formatted files for running prose ("voyp")
  and labels ("voyl"), for each subsection SEC.K, including "tot.t".

    ln -s ../019/unit-to-type.tbl

    foreach lang ( voyp voyl )
      if ( ${lang} == voyp ) then
        set utypes = \
          'parags,starred-parags,circular-lines,circular-text,radial-lines,titles'
      else
        set utypes = \
          'labels,words'
      endif
      echo "utypes = ${utypes}"
      foreach sec ( ${secs} "tot.t" )
        set ifile = "sample/voyn/vms/${sec}/raw.evt"
        set ofile = "sample/${lang}/vms/${sec}/raw.evt"
        echo "${ifile} -> ${ofile}"
        cat ${ifile} \
          | select-units \
              -v types="${utypes}" \
              -v table=unit-to-type.tbl \
          > ${ofile}
      end
      dicio-wc sample/${lang}/vms/{${secscm},tot.t}/raw.evt
    end

EXTRACTING THE RAW TOKEN LISTS

  Now we extract the raw token lists for each of these samples. We treat
  line breaks as spaces, but preserve paragraph breaks as dummy "=" words.
  
    foreach lang ( voyn voyp voyl )
      foreach sec ( ${secs} "tot.t" )
        set ifile = "sample/${lang}/vms/${sec}/raw.evt"
        set ofile = "sample/${lang}/vms/${sec}/raw.tks"
        echo "${ifile} -> ${ofile}"
        cat ${ifile} \
          | words-from-evt -v showParags=1 \
          | sed -e 's/^ *$/=/' \
          > ${ofile}
      end
      dicio-wc sample/${lang}/vms/{${secscm},tot.t}/raw.tks
    end

      lines   words     bytes file        
    ------- ------- --------- ------------
       7047    7047     40120 sample/voyn/vms/hea.1/raw.tks
        882     882      5335 sample/voyn/vms/hea.2/raw.tks
       2959    2959     17252 sample/voyn/vms/heb.1/raw.tks
        570     570      3318 sample/voyn/vms/heb.2/raw.tks
        205     205       688 sample/voyn/vms/cos.1/raw.tks
       2032    2032     11113 sample/voyn/vms/cos.2/raw.tks
       1123    1123      6210 sample/voyn/vms/cos.3/raw.tks
       7171    7171     42085 sample/voyn/vms/bio.1/raw.tks
       1674    1674      8987 sample/voyn/vms/zod.1/raw.tks
       1123    1123      6304 sample/voyn/vms/pha.1/raw.tks
       1763    1763      9951 sample/voyn/vms/pha.2/raw.tks
        763     763      4508 sample/voyn/vms/str.1/raw.tks
      11056   11056     68652 sample/voyn/vms/str.2/raw.tks
        220     220      1223 sample/voyn/vms/unk.1/raw.tks
        142     142       901 sample/voyn/vms/unk.2/raw.tks
         49      49       279 sample/voyn/vms/unk.3/raw.tks
        337     337      2007 sample/voyn/vms/unk.4/raw.tks
        351     351      2162 sample/voyn/vms/unk.5/raw.tks
        492     492      2943 sample/voyn/vms/unk.6/raw.tks
        392     392      2231 sample/voyn/vms/unk.7/raw.tks
          2       2        11 sample/voyn/vms/unk.8/raw.tks
      40372   40372    236314 sample/voyn/vms/tot.t/raw.tks

      lines   words     bytes file        
    ------- ------- --------- ------------
       7045    7045     40111 sample/voyp/vms/hea.1/raw.tks
        882     882      5335 sample/voyp/vms/hea.2/raw.tks
       2959    2959     17252 sample/voyp/vms/heb.1/raw.tks
        570     570      3318 sample/voyp/vms/heb.2/raw.tks
        186     186       588 sample/voyp/vms/cos.1/raw.tks
       1606    1606      9155 sample/voyp/vms/cos.2/raw.tks
        904     904      5222 sample/voyp/vms/cos.3/raw.tks
       6915    6915     40962 sample/voyp/vms/bio.1/raw.tks
       1015    1015      6051 sample/voyp/vms/zod.1/raw.tks
        942     942      5466 sample/voyp/vms/pha.1/raw.tks
       1455    1455      8612 sample/voyp/vms/pha.2/raw.tks
        763     763      4508 sample/voyp/vms/str.1/raw.tks
      11056   11056     68652 sample/voyp/vms/str.2/raw.tks
        220     220      1223 sample/voyp/vms/unk.1/raw.tks
        142     142       901 sample/voyp/vms/unk.2/raw.tks
         49      49       279 sample/voyp/vms/unk.3/raw.tks
        307     307      1895 sample/voyp/vms/unk.4/raw.tks
        351     351      2162 sample/voyp/vms/unk.5/raw.tks
        492     492      2943 sample/voyp/vms/unk.6/raw.tks
        392     392      2231 sample/voyp/vms/unk.7/raw.tks
          0       0         0 sample/voyp/vms/unk.8/raw.tks
      38269   38269    226898 sample/voyp/vms/tot.t/raw.tks

      lines   words     bytes file        
    ------- ------- --------- ------------
          1       1         7 sample/voyl/vms/hea.1/raw.tks
          0       0         0 sample/voyl/vms/hea.2/raw.tks
          0       0         0 sample/voyl/vms/heb.1/raw.tks
          0       0         0 sample/voyl/vms/heb.2/raw.tks
         18      18        98 sample/voyl/vms/cos.1/raw.tks
        425     425      1956 sample/voyl/vms/cos.2/raw.tks
        218     218       986 sample/voyl/vms/cos.3/raw.tks
        255     255      1121 sample/voyl/vms/bio.1/raw.tks
        658     658      2934 sample/voyl/vms/zod.1/raw.tks
        180     180       836 sample/voyl/vms/pha.1/raw.tks
        307     307      1337 sample/voyl/vms/pha.2/raw.tks
          0       0         0 sample/voyl/vms/str.1/raw.tks
          0       0         0 sample/voyl/vms/str.2/raw.tks
          0       0         0 sample/voyl/vms/unk.1/raw.tks
          0       0         0 sample/voyl/vms/unk.2/raw.tks
          0       0         0 sample/voyl/vms/unk.3/raw.tks
         29      29       110 sample/voyl/vms/unk.4/raw.tks
          0       0         0 sample/voyl/vms/unk.5/raw.tks
          0       0         0 sample/voyl/vms/unk.6/raw.tks
          0       0         0 sample/voyl/vms/unk.7/raw.tks
          2       2        11 sample/voyl/vms/unk.8/raw.tks
       2102    2102      9414 sample/voyl/vms/tot.t/raw.tks
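
  As a quick illustrative check that the paragraph breaks survived the
  extraction, we can count the dummy "=" tokens in one of the files
  (roughly one per paragraph):

    grep -c '^=$' sample/voyp/vms/tot.t/raw.tks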

  Now we do the same for the line-initial, -medial, and -final 
  sublanguages of "voyp":

    foreach lang ( voyi voym voyf )
      set omi = 1; set omm = 1; set omf = 1
      if ( "${lang}" == "voyi" ) set omi = 0
      if ( "${lang}" == "voym" ) set omm = 0
      if ( "${lang}" == "voyf" ) set omf = 0
      foreach sec ( ${secs} "tot.t" )
        set ifile = "sample/voyp/vms/${sec}/raw.evt"
        set ofile = "sample/${lang}/vms/${sec}/raw.tks"
        echo "${ifile} -> ${ofile}"
        cat ${ifile} \
          | words-from-evt \
              -v showParags=1 \
              -v omitInitial=${omi} \
              -v omitMedial=${omm} \
              -v omitFinal=${omf} \
          | sed -e 's/^ *$/=/' \
          > ${ofile}
      end
      dicio-wc sample/${lang}/vms/{${secscm},tot.t}/raw.tks
    end

      lines   words     bytes file        
    ------- ------- --------- ------------
       1516    1516      8719 sample/voyi/vms/hea.1/raw.tks
        199     199      1189 sample/voyi/vms/hea.2/raw.tks
        497     497      2975 sample/voyi/vms/heb.1/raw.tks
         90      90       510 sample/voyi/vms/heb.2/raw.tks
          4       4        12 sample/voyi/vms/cos.1/raw.tks
        316     316      1594 sample/voyi/vms/cos.2/raw.tks
        110     110       619 sample/voyi/vms/cos.3/raw.tks
        910     910      5381 sample/voyi/vms/bio.1/raw.tks
         35      35       190 sample/voyi/vms/zod.1/raw.tks
        128     128       711 sample/voyi/vms/pha.1/raw.tks
        190     190      1099 sample/voyi/vms/pha.2/raw.tks
         88      88       541 sample/voyi/vms/str.1/raw.tks
       1371    1371      7893 sample/voyi/vms/str.2/raw.tks
         31      31       145 sample/voyi/vms/unk.1/raw.tks
         34      34       234 sample/voyi/vms/unk.2/raw.tks
         15      15        82 sample/voyi/vms/unk.3/raw.tks
         38      38       227 sample/voyi/vms/unk.4/raw.tks
         44      44       259 sample/voyi/vms/unk.5/raw.tks
         48      48       294 sample/voyi/vms/unk.6/raw.tks
         44      44       261 sample/voyi/vms/unk.7/raw.tks
          0       0         0 sample/voyi/vms/unk.8/raw.tks
       5726    5726     32967 sample/voyi/vms/tot.t/raw.tks

      lines   words     bytes file        
    ------- ------- --------- ------------
       4223    4223     23632 sample/voym/vms/hea.1/raw.tks
        482     482      2912 sample/voym/vms/hea.2/raw.tks
       2058    2058     11874 sample/voym/vms/heb.1/raw.tks
        415     415      2426 sample/voym/vms/heb.2/raw.tks
        113     113       438 sample/voym/vms/cos.1/raw.tks
       1167    1167      6506 sample/voym/vms/cos.2/raw.tks
        715     715      4076 sample/voym/vms/cos.3/raw.tks
       5269    5269     31308 sample/voym/vms/bio.1/raw.tks
        954     954      5688 sample/voym/vms/zod.1/raw.tks
        715     715      4112 sample/voym/vms/pha.1/raw.tks
       1125    1125      6572 sample/voym/vms/pha.2/raw.tks
        603     603      3528 sample/voym/vms/str.1/raw.tks
       8886    8886     54891 sample/voym/vms/str.2/raw.tks
        163     163       911 sample/voym/vms/unk.1/raw.tks
         73      73       453 sample/voym/vms/unk.2/raw.tks
         23      23       128 sample/voym/vms/unk.3/raw.tks
        241     241      1487 sample/voym/vms/unk.4/raw.tks
        281     281      1709 sample/voym/vms/unk.5/raw.tks
        402     402      2418 sample/voym/vms/unk.6/raw.tks
        314     314      1734 sample/voym/vms/unk.7/raw.tks
          0       0         0 sample/voym/vms/unk.8/raw.tks
      28240   28240    166839 sample/voym/vms/tot.t/raw.tks

      lines   words     bytes file        
    ------- ------- --------- ------------
       1516    1516      7624 sample/voyf/vms/hea.1/raw.tks
        199     199      1132 sample/voyf/vms/hea.2/raw.tks
        497     497      2522 sample/voyf/vms/heb.1/raw.tks
         90      90       428 sample/voyf/vms/heb.2/raw.tks
          4       4         8 sample/voyf/vms/cos.1/raw.tks
        316     316      1396 sample/voyf/vms/cos.2/raw.tks
        110     110       550 sample/voyf/vms/cos.3/raw.tks
        910     910      4621 sample/voyf/vms/bio.1/raw.tks
         35      35       191 sample/voyf/vms/zod.1/raw.tks
        128     128       685 sample/voyf/vms/pha.1/raw.tks
        190     190      1003 sample/voyf/vms/pha.2/raw.tks
         88      88       471 sample/voyf/vms/str.1/raw.tks
       1371    1371      7001 sample/voyf/vms/str.2/raw.tks
         31      31       163 sample/voyf/vms/unk.1/raw.tks
         34      34       179 sample/voyf/vms/unk.2/raw.tks
         15      15        77 sample/voyf/vms/unk.3/raw.tks
         38      38       201 sample/voyf/vms/unk.4/raw.tks
         44      44       230 sample/voyf/vms/unk.5/raw.tks
         48      48       243 sample/voyf/vms/unk.6/raw.tks
         44      44       256 sample/voyf/vms/unk.7/raw.tks
          0       0         0 sample/voyf/vms/unk.8/raw.tks
       5726    5726     29017 sample/voyf/vms/tot.t/raw.tks
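
  Since each prose line should contribute exactly one line-initial and
  one line-final token (with the dummy "=" marks kept in both), the
  "voyi" and "voyf" files should have the same number of lines, as the
  totals above confirm (5726 each). A quick stand-alone check:

    wc -l sample/voy{i,f}/vms/tot.t/raw.tks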
    
COMPUTING WORD OCCURRENCE COUNTS

  Counting word occurrences by subset and section:

    foreach lang ( ${langs} )
      foreach sec ( ${secs} "tot.t" )
        set ifile = "sample/${lang}/vms/${sec}/raw.tks"
        set ofile = "sample/${lang}/vms/${sec}/raw.wfr"
        echo "${ifile} -> ${ofile}"
        cat ${ifile} \
          | egrep -v '=' \
          | sort | uniq -c | expand \
          | sort -b +0 -1nr +1 -2 \
          | compute-freqs \
          > ${ofile}
      end
      dicio-wc sample/${lang}/vms/{${secscm},tot.t}/raw.wfr \
        | gawk '/./{ printf "    %8s %s\n", $1,$4;}' 
    end
    
       lines file
     ------- ------------
        2132 sample/voyn/vms/hea.1/raw.wfr
         554 sample/voyn/vms/hea.2/raw.wfr
        1189 sample/voyn/vms/heb.1/raw.wfr
         331 sample/voyn/vms/heb.2/raw.wfr
          83 sample/voyn/vms/cos.1/raw.wfr
        1019 sample/voyn/vms/cos.2/raw.wfr
         620 sample/voyn/vms/cos.3/raw.wfr
        1597 sample/voyn/vms/bio.1/raw.wfr
         884 sample/voyn/vms/zod.1/raw.wfr
         561 sample/voyn/vms/pha.1/raw.wfr
         808 sample/voyn/vms/pha.2/raw.wfr
         483 sample/voyn/vms/str.1/raw.wfr
        3225 sample/voyn/vms/str.2/raw.wfr
         162 sample/voyn/vms/unk.1/raw.wfr
         103 sample/voyn/vms/unk.2/raw.wfr
          46 sample/voyn/vms/unk.3/raw.wfr
         239 sample/voyn/vms/unk.4/raw.wfr
         246 sample/voyn/vms/unk.5/raw.wfr
         297 sample/voyn/vms/unk.6/raw.wfr
         235 sample/voyn/vms/unk.7/raw.wfr
           2 sample/voyn/vms/unk.8/raw.wfr
        8591 sample/voyn/vms/tot.t/raw.wfr

       lines file
     ------- ------------
        2131 sample/voyp/vms/hea.1/raw.wfr
         554 sample/voyp/vms/hea.2/raw.wfr
        1189 sample/voyp/vms/heb.1/raw.wfr
         331 sample/voyp/vms/heb.2/raw.wfr
          73 sample/voyp/vms/cos.1/raw.wfr
         868 sample/voyp/vms/cos.2/raw.wfr
         533 sample/voyp/vms/cos.3/raw.wfr
        1536 sample/voyp/vms/bio.1/raw.wfr
         641 sample/voyp/vms/zod.1/raw.wfr
         485 sample/voyp/vms/pha.1/raw.wfr
         684 sample/voyp/vms/pha.2/raw.wfr
         483 sample/voyp/vms/str.1/raw.wfr
        3225 sample/voyp/vms/str.2/raw.wfr
         162 sample/voyp/vms/unk.1/raw.wfr
         103 sample/voyp/vms/unk.2/raw.wfr
          46 sample/voyp/vms/unk.3/raw.wfr
         226 sample/voyp/vms/unk.4/raw.wfr
         246 sample/voyp/vms/unk.5/raw.wfr
         297 sample/voyp/vms/unk.6/raw.wfr
         235 sample/voyp/vms/unk.7/raw.wfr
           0 sample/voyp/vms/unk.8/raw.wfr
        8105 sample/voyp/vms/tot.t/raw.wfr

       lines file
     ------- ------------
           1 sample/voyl/vms/hea.1/raw.wfr
           0 sample/voyl/vms/hea.2/raw.wfr
           0 sample/voyl/vms/heb.1/raw.wfr
           0 sample/voyl/vms/heb.2/raw.wfr
          10 sample/voyl/vms/cos.1/raw.wfr
         225 sample/voyl/vms/cos.2/raw.wfr
         112 sample/voyl/vms/cos.3/raw.wfr
         127 sample/voyl/vms/bio.1/raw.wfr
         303 sample/voyl/vms/zod.1/raw.wfr
          92 sample/voyl/vms/pha.1/raw.wfr
         155 sample/voyl/vms/pha.2/raw.wfr
           0 sample/voyl/vms/str.1/raw.wfr
           0 sample/voyl/vms/str.2/raw.wfr
           0 sample/voyl/vms/unk.1/raw.wfr
           0 sample/voyl/vms/unk.2/raw.wfr
           0 sample/voyl/vms/unk.3/raw.wfr
          15 sample/voyl/vms/unk.4/raw.wfr
           0 sample/voyl/vms/unk.5/raw.wfr
           0 sample/voyl/vms/unk.6/raw.wfr
           0 sample/voyl/vms/unk.7/raw.wfr
           2 sample/voyl/vms/unk.8/raw.wfr
         882 sample/voyl/vms/tot.t/raw.wfr

       lines file
     ------- ------------
         709 sample/voyi/vms/hea.1/raw.wfr
         150 sample/voyi/vms/hea.2/raw.wfr
         326 sample/voyi/vms/heb.1/raw.wfr
          68 sample/voyi/vms/heb.2/raw.wfr
           3 sample/voyi/vms/cos.1/raw.wfr
         183 sample/voyi/vms/cos.2/raw.wfr
          82 sample/voyi/vms/cos.3/raw.wfr
         387 sample/voyi/vms/bio.1/raw.wfr
          26 sample/voyi/vms/zod.1/raw.wfr
          95 sample/voyi/vms/pha.1/raw.wfr
         129 sample/voyi/vms/pha.2/raw.wfr
          76 sample/voyi/vms/str.1/raw.wfr
         675 sample/voyi/vms/str.2/raw.wfr
          23 sample/voyi/vms/unk.1/raw.wfr
          29 sample/voyi/vms/unk.2/raw.wfr
          13 sample/voyi/vms/unk.3/raw.wfr
          31 sample/voyi/vms/unk.4/raw.wfr
          34 sample/voyi/vms/unk.5/raw.wfr
          39 sample/voyi/vms/unk.6/raw.wfr
          34 sample/voyi/vms/unk.7/raw.wfr
           0 sample/voyi/vms/unk.8/raw.wfr
        2159 sample/voyi/vms/tot.t/raw.wfr

       lines file
     ------- ------------
         646 sample/voyf/vms/hea.1/raw.wfr
         156 sample/voyf/vms/hea.2/raw.wfr
         270 sample/voyf/vms/heb.1/raw.wfr
          69 sample/voyf/vms/heb.2/raw.wfr
           3 sample/voyf/vms/cos.1/raw.wfr
         167 sample/voyf/vms/cos.2/raw.wfr
          77 sample/voyf/vms/cos.3/raw.wfr
         397 sample/voyf/vms/bio.1/raw.wfr
          30 sample/voyf/vms/zod.1/raw.wfr
          85 sample/voyf/vms/pha.1/raw.wfr
         134 sample/voyf/vms/pha.2/raw.wfr
          74 sample/voyf/vms/str.1/raw.wfr
         678 sample/voyf/vms/str.2/raw.wfr
          25 sample/voyf/vms/unk.1/raw.wfr
          27 sample/voyf/vms/unk.2/raw.wfr
          12 sample/voyf/vms/unk.3/raw.wfr
          32 sample/voyf/vms/unk.4/raw.wfr
          35 sample/voyf/vms/unk.5/raw.wfr
          42 sample/voyf/vms/unk.6/raw.wfr
          35 sample/voyf/vms/unk.7/raw.wfr
           0 sample/voyf/vms/unk.8/raw.wfr
        2042 sample/voyf/vms/tot.t/raw.wfr

       lines file
     ------- ------------
        1261 sample/voym/vms/hea.1/raw.wfr
         300 sample/voym/vms/hea.2/raw.wfr
         809 sample/voym/vms/heb.1/raw.wfr
         236 sample/voym/vms/heb.2/raw.wfr
          71 sample/voym/vms/cos.1/raw.wfr
         618 sample/voym/vms/cos.2/raw.wfr
         413 sample/voym/vms/cos.3/raw.wfr
        1111 sample/voym/vms/bio.1/raw.wfr
         606 sample/voym/vms/zod.1/raw.wfr
         363 sample/voym/vms/pha.1/raw.wfr
         498 sample/voym/vms/pha.2/raw.wfr
         376 sample/voym/vms/str.1/raw.wfr
        2356 sample/voym/vms/str.2/raw.wfr
         124 sample/voym/vms/unk.1/raw.wfr
          53 sample/voym/vms/unk.2/raw.wfr
          21 sample/voym/vms/unk.3/raw.wfr
         177 sample/voym/vms/unk.4/raw.wfr
         192 sample/voym/vms/unk.5/raw.wfr
         243 sample/voym/vms/unk.6/raw.wfr
         187 sample/voym/vms/unk.7/raw.wfr
           0 sample/voym/vms/unk.8/raw.wfr
        5633 sample/voym/vms/tot.t/raw.wfr
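
  To eyeball the entries themselves, note that (judging from the gawk
  and select-good-words filters used below) each ".wfr" line carries
  the occurrence count in field 1 and the word in field 3; the middle
  field is presumably the relative frequency added by compute-freqs.
  For instance:

    head -5 sample/voyn/vms/tot.t/raw.wfr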

  Checking the sums:
  
    foreach lang ( voyp voyl voyn )
      printf "\n%-24s" "${lang}/vms/*/raw.wfr: "
      cat sample/${lang}/vms/{${secscm}}/raw.wfr \
        | gawk '/./{t += $1;} END{print t}' 
      printf "%-24s" "${lang}/vms/tot.t/raw.wfr: "
      cat sample/${lang}/vms/tot.t/raw.wfr \
        | gawk '/./{t += $1;} END{print t}' 
    end

      voyp/vms/*/raw.wfr:     37385
      voyp/vms/tot.t/raw.wfr: 37385

      voyl/vms/*/raw.wfr:     1171
      voyl/vms/tot.t/raw.wfr: 1171

      voyn/vms/*/raw.wfr:     38556
      voyn/vms/tot.t/raw.wfr: 38556

    foreach sec ( ${secs} )
      printf "\n%-32s" "{voyp,voyl}/vms/${sec}/raw.wfr: "
      cat sample/{voyp,voyl}/vms/${sec}/raw.wfr \
        | gawk '/./{t += $1;} END{print t}' 
      printf "%-32s" "voyn/vms/${sec}/raw.wfr: "
      cat sample/voyn/vms/${sec}/raw.wfr \
        | gawk '/./{t += $1;} END{print t}' 
    end

      {voyp,voyl}/vms/hea.1/raw.wfr:  6867
      voyn/vms/hea.1/raw.wfr:         6867

      {voyp,voyl}/vms/hea.2/raw.wfr:  868
      voyn/vms/hea.2/raw.wfr:         868

      {voyp,voyl}/vms/heb.1/raw.wfr:  2901
      voyn/vms/heb.1/raw.wfr:         2901

      {voyp,voyl}/vms/heb.2/raw.wfr:  557
      voyn/vms/heb.2/raw.wfr:         557

      {voyp,voyl}/vms/cos.1/raw.wfr:  195
      voyn/vms/cos.1/raw.wfr:         195

      {voyp,voyl}/vms/cos.2/raw.wfr:  1746
      voyn/vms/cos.2/raw.wfr:         1746

      {voyp,voyl}/vms/cos.3/raw.wfr:  1006
      voyn/vms/cos.3/raw.wfr:         1006

      {voyp,voyl}/vms/bio.1/raw.wfr:  6975
      voyn/vms/bio.1/raw.wfr:         6975

      {voyp,voyl}/vms/zod.1/raw.wfr:  1370
      voyn/vms/zod.1/raw.wfr:         1370

      {voyp,voyl}/vms/pha.1/raw.wfr:  1023
      voyn/vms/pha.1/raw.wfr:         1023

      {voyp,voyl}/vms/pha.2/raw.wfr:  1588
      voyn/vms/pha.2/raw.wfr:         1588

      {voyp,voyl}/vms/str.1/raw.wfr:  755
      voyn/vms/str.1/raw.wfr:         755

      {voyp,voyl}/vms/str.2/raw.wfr:  10768
      voyn/vms/str.2/raw.wfr:         10768

      {voyp,voyl}/vms/unk.1/raw.wfr:  213
      voyn/vms/unk.1/raw.wfr:         213

      {voyp,voyl}/vms/unk.2/raw.wfr:  140
      voyn/vms/unk.2/raw.wfr:         140

      {voyp,voyl}/vms/unk.3/raw.wfr:  47
      voyn/vms/unk.3/raw.wfr:         47

      {voyp,voyl}/vms/unk.4/raw.wfr:  317
      voyn/vms/unk.4/raw.wfr:         317

      {voyp,voyl}/vms/unk.5/raw.wfr:  342
      voyn/vms/unk.5/raw.wfr:         342

      {voyp,voyl}/vms/unk.6/raw.wfr:  489
      voyn/vms/unk.6/raw.wfr:         489

      {voyp,voyl}/vms/unk.7/raw.wfr:  387
      voyn/vms/unk.7/raw.wfr:         387

      {voyp,voyl}/vms/unk.8/raw.wfr:  2
      voyn/vms/unk.8/raw.wfr:         2

REMOVING BAD WORDS

  Next we separate out the "bad" words: those with unreadable characters
  and weirdos. Success percentages will be computed relative to the
  remaining "good" words. Most other files are derived from `gud.wfr',
  the frequency file for good words.
  
  Weirdos are defined as characters and combinations that are not part
  of the basic glyph set
  
    e i  a o  q y  d l  r s  n m   k t  f p  ch sh  ckh cth  cfh cph 

  Note that we exclude { g j u v x z } as well as any { c h }
  that are not part of the compound glyphs listed above.
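
  In other words, a word is "good" iff it can be parsed entirely into
  the glyphs above. The actual test is performed by the
  select-good-words script (used below); here is a rough stand-alone
  approximation, for illustration only:

    # sketch: strip the compound glyphs, then require only basic singles
    echo 'qokeedy cthy chckshy xar' | tr ' ' '\012' \
      | gawk '{w = $1; gsub(/ckh|cth|cfh|cph|ch|sh/, "", w); print $1, ((w ~ /^[eiaoqydlrsnmktfp]*$/) ? "good" : "bad");}'

  which classifies "qokeedy" and "cthy" as good, "chckshy" and "xar"
  as bad.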

  We believe that this selection will not introduce a significant bias
  in the grammar-fitting percentages. Tokens that contain weirdos are
  probably abbreviations or symbols, which should not be counted in
  the totals; or embellished words, which are likely to be chosen for
  embellishment independently of whether they fit the grammar.
  As for tokens that have discrepant readings, the divergence should
  not be strongly correlated with their fitness to the grammar.

    foreach lang ( ${langs} )
      foreach sec ( ${secs} tot.t )
        set ifile = "sample/${lang}/vms/${sec}/raw.wfr"
        set gfile = "sample/${lang}/vms/${sec}/gud.wfr"
        echo "${ifile} -> ${gfile}"
        cat ${ifile} \
          | select-good-words -v inField=3 -v writeBad=0 \
          > ${gfile}
        set bfile = "sample/${lang}/vms/${sec}/bad.wfr"
        echo "${ifile} -> ${bfile}"
        cat ${ifile} \
          | select-good-words -v inField=3 -v writeBad=1 \
          > ${bfile}
      end
    end
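
  The "gud" and "bad" files should partition the "raw" one; as a quick
  illustrative check, the line counts should add up (e.g. for
  voyn/vms/tot.t we expect 6883 + 1708 = 8591, in agreement with the
  tables below):

    wc -l sample/voyn/vms/tot.t/{gud,bad,raw}.wfr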

CHECKING FOR SAMPLE BIAS

  Computing the fraction of bad words/tokens that were excluded because 
  of not-so-weird weirdos like <b>, <g>, <j>, <v>, <u>, <z>,
  or nonstandard uses of <c> and <h>:
  
    foreach lang ( ${langs} )
      foreach sec ( tot.t )
        printf "\n%s" "${lang}/vms/${sec}"
        cat sample/${lang}/vms/${sec}/bad.wfr \
          | gawk '/./{w++;t+=$1;} END{printf " %5d%-7s", w, sprintf("(%d)",t);}'
        cat sample/${lang}/vms/${sec}/bad.wfr \
          | gawk '($3 ~ /[*?]/){ print; }' \
          | gawk '/./{w++;t+=$1;} END{printf " %5d%-7s", w, sprintf("(%d)",t);}'
        cat sample/${lang}/vms/${sec}/bad.wfr \
          | gawk '($3 ~ /^[a-z]*$/){ print; }' \
          | gawk '/./{w++;t+=$1;} END{printf " %5d%-7s", w, sprintf("(%d)",t);}'
        cat sample/${lang}/vms/${sec}/bad.wfr \
          | gawk '($3 ~ /^[a-z]*[ao][i]*([?][i]|[i][?])[i]*[n]$/){ print; }' \
          | gawk '/./{w++;t+=$1;} END{printf " %5d%-7s", w, sprintf("(%d)",t);}'
        printf "\n"
      end
    end

  Here are the numbers. Column "nbad" is the count of rejected
  words(tokens). Column "[?]" is the number of words(tokens) that are
  unreadable, contentious, or contain non-basic weirdos (non-lowercase
  EVA characters). Column "[bchv...]" is the number of words(tokens)
  that were rejected only because they contain some of the forbidden
  characters listed above. Column "[ai?n]" counts the words(tokens) that
  were rejected mainly because of disagreements of the "iin/iiin" type.
        
      type               nbad         [?]        [bchv...]      [ai?n]
      -------------- ------------ ------------ ------------ ------------
      voyn/vms/tot.t  1708(2526)   1612(2407)     96(119)     114(396)  
      voyp/vms/tot.t  1580(2358)   1501(2257)     79(101)     114(396)  
      voyl/vms/tot.t   161(168)     143(150)      18(18)        0(0)    
      voyi/vms/tot.t   246(282)     241(277)       5(5)        29(44)   
      voyf/vms/tot.t   294(335)     267(306)      27(29)       26(36)   
      voym/vms/tot.t  1147(1698)   1100(1646)     47(52)       86(316)  
                           
  Listing the bad words that were rejected only because of lowercase
  weirdos:
  
    foreach lang ( voyp voyl )
      foreach sec ( tot.t )
        printf "\nFrom %s:\n\n" "${lang}/vms/${sec}.wfr"
        cat sample/${lang}/vms/${sec}/bad.wfr \
          | gawk '($3 ~ /^[a-z]*$/){print $1, $3; }' \
          | sort -b +0 -1nr +1 -2 \
          | format-counts-packed \
          | sed -e 's/^/  /'
      end
    end

    From voyp/vms/tot.t.wfr:

      v(7) x(7) c(4) cheg(3) xar(3) amg(2) cto(2) g(2) aikhckhy(1)
      aithy(1) arg(1) arxor(1) axor(1) chckshy(1) chcpar(1)
      chcs(1) checta(1) chepchx(1) chocty(1) chodalg(1)
      choekchcey(1) choikhy(1) chokolg(1) cholxy(1) chxar(1)
      ckcho(1) ckchol(1) ckshy(1) cky(1) coy(1) cpheeg(1) cseo(1)
      ctar(1) ctchy(1) ctechy(1) ctoiin(1) ctos(1) dag(1) daing(1)
      dchog(1) dkeeeg(1) docodal(1) doithy(1) gaiin(1) kedarxy(1)
      lxor(1) ockey(1) ockhh(1) oetalchg(1) ogam(1) olgy(1) org(1)
      oxar(1) oxor(1) oxy(1) pchocty(1) qocky(1) qodaikhy(1)
      qokeefcy(1) qokg(1) rokaix(1) salxar(1) sarg(1)
      shecphhedy(1) shhy(1) shokog(1) shxam(1) soleeg(1) teyteg(1)
      todashx(1) vo(1) vr(1) vs(1) xoiin(1) xol(1) yhal(1)
      ykceol(1) ypcheg(1) ytcharg(1)

    From voyl/vms/tot.t.wfr:

      cfhhy(1) chockhhy(1) chodalg(1) ddsschx(1) docfhhy(1) gy(1)
      oalcheg(1) ocsesy(1) oecs(1) ofacfom(1) okaramog(1)
      okeeog(1) opalg(1) opchaldg(1) oteedyg(1) soshxar(1)
      ydashgarain(1) yskhy(1)

  Tabulating the fraction of good and bad words per section
  (ppt = parts per thousand):
  
    foreach lang ( ${langs} )
      set afile = ".raw-gud-bad-counts-${lang}-vms.txt"; 
      echo " "; echo "      Good/bad statistics for subset ${lang}/vms:"; echo " "
      count-raw-gud-bad-toks-wrds ${lang}/vms ${secs} / tot.t \
        > ${afile}
      cat ${afile} \
        | sed -e 's:/::g' -e 's/^/      /' 
    end
    
      Good/bad statistics for subset voyn/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    6867   6704  976    163   23   2132   1981  928    151   70
        hea.2     868    823  947     45   51    554    509  917     45   81
        heb.1    2901   2820  971     81   27   1189   1111  933     78   65
        heb.2     557    510  913     47   84    331    288  867     43  129
        cos.1     195    155  790     40  204     83     72  857     11  130
        cos.2    1746   1590  910    156   89   1019    868  850    151  148
        cos.3    1006    795  789    211  209    620    429  690    191  307
        bio.1    6975   6697  960    278   39   1597   1382  864    215  134
        zod.1    1370    988  720    382  278    884    555  627    329  371
        pha.1    1023    944  921     79   77    561    483  859     78  138
        pha.2    1588   1452  913    136   85    808    694  857    114  140
        str.1     755    670  886     85  112    483    402  830     81  167
        str.2   10768  10097  937    671   62   3225   2779  861    446  138
        unk.1     213    202  943     11   51    162    153  938      9   55
        unk.2     140    134  950      6   42    103     97  932      6   57
        unk.3      47     44  916      3   62     46     43  914      3   63
        unk.4     317    306  962     11   34    239    228  950     11   45
        unk.5     342    309  900     33   96    246    214  866     32  129
        unk.6     489    431  879     58  118    297    247  828     50  167
        unk.7     387    357  920     30   77    235    208  881     27  114
        unk.8       2      2  666      0    0      2      2  666      0    0
      
        tot.t   38556  36030  934   2526   65   8591   6883  801   1708  198

      Good/bad statistics for subset voyp/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    6866   6703  976    163   23   2131   1980  928    151   70
        hea.2     868    823  947     45   51    554    509  917     45   81
        heb.1    2901   2820  971     81   27   1189   1111  933     78   65
        heb.2     557    510  913     47   84    331    288  867     43  129
        cos.1     185    146  784     39  209     73     63  851     10  135
        cos.2    1491   1353  906    138   92    868    733  843    135  155
        cos.3     884    713  805    171  193    533    380  711    153  286
        bio.1    6828   6555  959    273   39   1536   1325  862    211  137
        zod.1    1010    701  693    309  305    641    379  590    262  408
        pha.1     926    858  925     68   73    485    418  860     67  137
        pha.2    1426   1309  917    117   81    684    587  856     97  141
        str.1     755    670  886     85  112    483    402  830     81  167
        str.2   10768  10097  937    671   62   3225   2779  861    446  138
        unk.1     213    202  943     11   51    162    153  938      9   55
        unk.2     140    134  950      6   42    103     97  932      6   57
        unk.3      47     44  916      3   62     46     43  914      3   63
        unk.4     302    292  963     10   33    226    216  951     10   44
        unk.5     342    309  900     33   96    246    214  866     32  129
        unk.6     489    431  879     58  118    297    247  828     50  167
        unk.7     387    357  920     30   77    235    208  881     27  114
        unk.8       0      0    0      0    0      0      0    0      0    0
      
        tot.t   37385  35027  936   2358   63   8105   6525  804   1580  194
 
      Good/bad statistics for subset voyl/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1       1      1  500      0    0      1      1  500      0    0
        hea.2       0      0    0      0    0      0      0    0      0    0
        heb.1       0      0    0      0    0      0      0    0      0    0
        heb.2       0      0    0      0    0      0      0    0      0    0
        cos.1      10      9  818      1   90     10      9  818      1   90
        cos.2     255    237  925     18   70    225    208  920     17   75
        cos.3     122     82  666     40  325    112     72  637     40  353
        bio.1     147    142  959      5   33    127    122  953      5   39
        zod.1     360    287  795     73  202    303    233  766     70  230
        pha.1      97     86  877     11  112     92     81  870     11  118
        pha.2     162    143  877     19  116    155    136  871     19  121
        str.1       0      0    0      0    0      0      0    0      0    0
        str.2       0      0    0      0    0      0      0    0      0    0
        unk.1       0      0    0      0    0      0      0    0      0    0
        unk.2       0      0    0      0    0      0      0    0      0    0
        unk.3       0      0    0      0    0      0      0    0      0    0
        unk.4      15     14  875      1   62     15     14  875      1   62
        unk.5       0      0    0      0    0      0      0    0      0    0
        unk.6       0      0    0      0    0      0      0    0      0    0
        unk.7       0      0    0      0    0      0      0    0      0    0
        unk.8       2      2  666      0    0      2      2  666      0    0
      
        tot.t    1171   1003  855    168  143    882    721  816    161  182
 
      Good/bad statistics for subset voyi/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    1339   1313  979     26   19    709    683  961     26   36
        hea.2     185    178  956      7   37    150    143  947      7   46
        heb.1     440    427  968     13   29    326    313  957     13   39
        heb.2      77     65  833     12  153     68     56  811     12  173
        cos.1       3      2  500      1  250      3      2  500      1  250
        cos.2     203    185  906     18   88    183    165  896     18   97
        cos.3      90     71  780     19  208     82     63  759     19  228
        bio.1     823    782  949     41   49    387    352  907     35   90
        zod.1      30     20  645     10  322     26     17  629      9  333
        pha.1     112    103  911      9   79     95     86  895      9   93
        pha.2     161    137  845     24  148    129    107  823     22  169
        str.1      80     73  901      7   86     76     69  896      7   90
        str.2    1083   1005  927     78   71    675    606  896     69  102
        unk.1      26     22  814      4  148     23     20  833      3  125
        unk.2      32     31  939      1   30     29     28  933      1   33
        unk.3      13     12  857      1   71     13     12  857      1   71
        unk.4      33     32  941      1   29     31     30  937      1   31
        unk.5      35     29  805      6  166     34     28  800      6  171
        unk.6      45     43  934      2   43     39     37  925      2   50
        unk.7      39     37  925      2   50     34     32  914      2   57
        unk.8       0      0    0      0    0      0      0    0      0    0
      
        tot.t    4849   4567  941    282   58   2159   1913  885    246  113
 
      Good/bad statistics for subset voyf/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    1339   1302  971     37   27    646    613  947     33   51
        hea.2     185    166  892     19  102    156    137  872     19  121
        heb.1     440    424  961     16   36    270    255  940     15   55
        heb.2      77     74  948      3   38     69     66  942      3   42
        cos.1       3      2  500      1  250      3      2  500      1  250
        cos.2     203    180  882     23  112    167    144  857     23  136
        cos.3      90     67  736     23  252     77     54  692     23  294
        bio.1     823    788  956     35   42    397    362  909     35   87
        zod.1      30     12  387     18  580     30     12  387     18  580
        pha.1     112    101  893     11   97     85     74  860     11  127
        pha.2     161    132  814     29  179    134    108  800     26  192
        str.1      80     73  901      7   86     74     67  893      7   93
        str.2    1083   1002  924     81   74    678    600  883     78  114
        unk.1      26     24  888      2   74     25     23  884      2   76
        unk.2      32     28  848      4  121     27     23  821      4  142
        unk.3      13     12  857      1   71     12     11  846      1   76
        unk.4      33     32  941      1   29     32     31  939      1   30
        unk.5      35     31  861      4  111     35     31  861      4  111
        unk.6      45     34  739     11  239     42     31  720     11  255
        unk.7      39     30  750      9  225     35     27  750      8  222
        unk.8       0      0    0      0    0      0      0    0      0    0
      
        tot.t    4849   4514  930    335   69   2042   1748  855    294  143

      Good/bad statistics for subset voym/vms:
 
      #                    tokens                         words             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    4055   3966  977     89   21   1261   1175  931     86   68
        hea.2     468    451  961     17   36    300    283  940     17   56
        heb.1    2002   1950  973     52   25    809    758  935     51   62
        heb.2     402    370  918     32   79    236    208  877     28  118
        cos.1     112     98  867     14  123     71     61  847     10  138
        cos.2    1077    981  910     96   89    618    524  846     94  151
        cos.3     697    573  820    124  177    413    300  724    113  272
        bio.1    5182   4985  961    197   38   1111    958  861    153  137
        zod.1     949    668  703    281  295    606    364  599    242  398
        pha.1     699    651  930     48   68    363    316  868     47  129
        pha.2    1096   1033  941     63   57    498    444  889     54  108
        str.1     595    524  879     71  119    376    309  819     67  177
        str.2    8599   8087  940    512   59   2356   2034  862    322  136
        unk.1     159    154  962      5   31    124    119  952      5   40
        unk.2      71     70  972      1   13     53     52  962      1   18
        unk.3      21     20  909      1   45     21     20  909      1   45
        unk.4     236    228  962      8   33    177    169  949      8   44
        unk.5     272    249  912     23   84    192    170  880     22  113
        unk.6     399    354  885     45  112    243    205  840     38  155
        unk.7     309    290  935     19   61    187    169  898     18   95
        unk.8       0      0    0      0    0      0      0    0      0    0
      
        tot.t   27400  25702  937   1698   61   5633   4486  796   1147  203

  Formatting the tables for the tech report:

    foreach lang ( ${langs} )
      set afile = ".raw-gud-bad-counts-${lang}-vms.txt"; 
      set tfile = "${lang}/vms/tw-counts-by-sect.tex"; 
      echo " "; echo " ${afile} -> ${tfile}"; echo " "
      cat ${afile} \
        | tex-format-raw-gud-bad-counts \
        > sample/${tfile}
      update-paper-include sample/${tfile} ${tbldir}/${tfile}
    end

  Extracting the main statistics for the tech report:

    foreach lang ( ${langs} )
      set afile = ".raw-gud-bad-counts-${lang}-vms.txt"; 
      set sfile = "${lang}/vms/tw-summary.tex"; 
      echo " "; echo " ${afile} -> ${sfile}"; echo " "
      cat ${afile} \
        | tex-format-raw-gud-bad-summary -v sample=${lang}Vms \
        > sample/${sfile}
      update-paper-include sample/${sfile} ${tbldir}/${sfile}
    end

# END
