Hacking at the Voynich manuscript - Side notes
100 Preparing a clean Voynichese sample for analysis 

Last edited on 2025-05-04 21:19:09 by stolfi

SUMMARY

  We prepare clean Voynichese samples of prose and labels
  (without weirdos, unreadable characters, or contentious readings)
  for the statistical analyses that will go into the "lexeme structure"
  technical report.
  
  Redid it on 2025-04-29 to fix a bug in "". It was inserting blank
  lines in the "raw.tlw" files AFTER the first token of a new
  line/paragraph, instead of BEFORE it.  Must check all other notes --
  did they depend on those blank lines?
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../.. work 
    
    ln -s work/basify_weirdos.gawk
    ln -s work/combine_counts.gawk
    ln -s work/compute_cum_cum_freqs.gawk
    ln -s work/compute_cum_freqs.sh
    ln -s work/compute_freqs.gawk
    ln -s work/extract_section_from_evt.sh
    ln -s work/format_words_filled.sh
    ln -s work/format_counts_packed.gawk
    ln -s work/select_units.gawk
    ln -s work/show_first_last_lines.sh
    ln -s work/totalize_fields.gawk
    ln -s work/update_paper_include.sh
    ln -s work/vms_wc.sh
    ln -s work/words_from_evt.gawk
    
REFERENCE DATA

  The source data will be the interlinear release 1.6e6, 
  already chopped into sections.
  
DIRECTORY STRUCTURE

  The data files generated by this note (text, word counts, tables,
  etc.) for each sample text will live in the subdirectories "gen/",
  "gen/LANG/", "gen/LANG/BUK/", and "gen/LANG/BUK/SEC.K/", where
  
     LANG  the sample's language. Two samples should have the same LANG
             only if they use the same spelling for shared words.
             Thus, English and Italian are different LANGs. Different
             encodings of Chinese (pinyin, GR, RomanNum) are different
             LANGs. Medieval French and Modern French are different
             LANGs. The Bible (with modern spelling) and War of the
             Worlds are the same LANG. After much analysis, it seems
             that we can assign a single LANG ("voyn") to all parts of
             the VMS. 
     
     BUK   the book. Two samples with the same LANG and BUK should
             be by the same author and part of the same book. 
             
             For Voynichese, BUK is 
             
               "maj" = whole text, only the "majority vote" transcription lines.
               "lab" = only lines of "maj" that have labels or single words.
               "prs" = only the lines of "maj" excluded from "lab".
               "ini" = from "prs", only the first token after each break.
               "fin" = from "prs", only the last token before each break.
               "mid" = from "prs", every line minus its "ini" and "fin" tokens.
               "tak" = whole text, only Takahashi's transcription lines.
               
             For the other languages, BUK is a book tag (e.g. "wow"
             for War of the Worlds, "ptt" for the Pentateuch).
     
     SEC   the major division within the book. The divisions must
             be disjoint. Partition of the book into divisions is
             worth the trouble only if the usage of common lexemes is
             expected to vary significantly between divisions (due to
             differences in subject matter and/or style), and those
             differences are considered relevant for the analysis.
             
             For the VMS, each classical section (Biological,
             Pharmaceutical, etc.) is a separate division, except that
             we split the Herbal section into two divisions "hea" and
             "heb". In the Culpeper herbal, the preamble, plant
             descriptions, and recipes could be in three separate
             divisions. In the Pentateuch, we could let each of the
             five books be a separate division. And so on.
             
         K the sub-division of SEC. For Voynichese, a subdivision is a
             maximal string of *consecutive* pages that belong to the
             same SEC; e.g. the Herbal-A consists of two separate sets
             of pages, "hea.1" and "hea.2". For other languages, we
             usually don't need to have more than one subdivision per 
             SEC. In this note, each sub-division will be called a "section".
             
  Whether a BUK is partitioned into sections or not,
  it always has a pseudo-section "tot.1" which is the entire sample
  (hence the union of all other sections).
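
  As an illustration of the "ini", "mid" and "fin" books defined above,
  here is a minimal Python sketch (not the script actually used by this
  note). It assumes that each "break" delimits one list of tokens, and
  that a one-token line contributes that token to both "ini" and "fin";
  both points are assumptions, and the function name is hypothetical.

```python
def split_ini_mid_fin(lines):
    """Split token lines into (ini, mid, fin) token lists.

    `lines` is a list of token lists, one per text line (assumed to be
    the unit delimited by breaks).  Empty lines are skipped.
    """
    ini, mid, fin = [], [], []
    for tokens in lines:
        if not tokens:
            continue
        ini.append(tokens[0])     # first token after the break
        fin.append(tokens[-1])    # last token before the next break
        mid.extend(tokens[1:-1])  # everything in between
    return ini, mid, fin


ini, mid, fin = split_ini_mid_fin([["a", "b", "c", "d"], ["e"], [], ["f", "g"]])
print(ini, mid, fin)
```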
  
  Some of these data files are formatted as LaTeX tables and commands,
  and placed in the folders "tex/", "tex/LANG/", "tex/LANG/BUK/" and
  "tex/LANG/BUK/SEC.K/" as appropriate.
  
  The folders "inp/", "inp/LANG/", "inp/LANG/BUK/" and "inp/LANG/BUK/SEC.K/"
  contain files or links created by hand that are inputs to various scripts.
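
  The folder naming scheme above can be sketched in Python as follows.
  This is only an illustration; the function name and its defaults are
  hypothetical and are not part of the scripts linked above.

```python
from pathlib import Path


def sample_dir(top, lang, buk, sec="tot", k=1):
    """Return the folder TOP/LANG/BUK/SEC.K for one section of a sample.

    TOP is "gen", "tex", or "inp"; the default SEC.K is the
    pseudo-section "tot.1" that covers the whole book.
    """
    if top not in ("gen", "tex", "inp"):
        raise ValueError(f"unknown top-level folder: {top!r}")
    return Path(top) / lang / buk / f"{sec}.{k}"


print(sample_dir("gen", "voyn", "maj", "hea", 2).as_posix())
print(sample_dir("tex", "voyn", "prs").as_posix())
```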
   
OUTPUT DATA FILES PER BOOK

  Each folder "gen/LANG/BUK/" contains the following files:

    "raw.evt" contains the text of that book according to some
      specific transcription, extracted or derived from the global VMS
      source EVT file. The file is in the EVT format, with each weirdo
      converted to an equivalent basic EVA char, or to "*" if impossible.

    "fnums.txt" contains the numbers of the logical pages
      (like "f11r", "f100v", "f86v5") that occur in that book.

    "sections-occ.tags" is the list of sections that occur in that book,
      in publishable order.

    "sections-use.tags" is the subset of the above that are worth 
      analyzing separately, in the same order.  It varies depending on
      the BUK.

    "raw-gud-bad-tw-counts.txt" is a table with counts and percentages 
      of tokens and lexemes in "raw.evt", for each of the sections
      in "sections-use.tags", including the invalid ("bad") and 
      valid ("gud") tokens and lexemes (see below). 
      
      Apart from '#'-comments and blank lines, each line of this file has the 
      format 
      
        "SEC.K  RAWTK GUDTK GUDTKPPM  BADTK BADTKPPM   RAWWD GUDWD GUDWDPPM  BADWD BADWDPPM"

      where 
      
        SEC.K is a section tag, like "hea.1", "cos.2", "unk.3".
        
        RAWTK,GUDTK,BADTK are the counts of total, good, and bad tokens in the section.
        
        RAWWD,GUDWD,BADWD are the counts of total, good, and bad lexemes in the section.
        
        {xx}TKPPM = 100*{xx}TK/RAWTK, with 1 decimal, where {xx} is "GUD" or "BAD".
        
        {xx}WDPPM = 100*{xx}WD/RAWWD, with 1 decimal.
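
      A minimal Python sketch of how one such line could be assembled,
      following the formulas exactly as written above. It assumes that
      the raw count is the sum of the good and bad counts, and that an
      empty section yields "0.0"; the function names are hypothetical.
      The "ppt" columns in the listings at the end of this note appear
      to use a different scale and rounding.

```python
def percent(part, total):
    """Return 100*part/total with 1 decimal, or "0.0" when total is 0."""
    return f"{100.0 * part / total:.1f}" if total > 0 else "0.0"


def counts_line(sec, gudtk, badtk, gudwd, badwd):
    """Build one line of the counts table for section tag `sec`."""
    rawtk = gudtk + badtk  # assumed: raw tokens = good + bad
    rawwd = gudwd + badwd  # assumed: raw lexemes = good + bad
    fields = [
        sec,
        rawtk, gudtk, percent(gudtk, rawtk), badtk, percent(badtk, rawtk),
        rawwd, gudwd, percent(gudwd, rawwd), badwd, percent(badwd, rawwd),
    ]
    return " ".join(str(f) for f in fields)


print(counts_line("hea.1", 6704, 163, 1981, 151))
```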

  Each folder "tex/LANG/BUK/" contains the following files:
  
    "raw-gud-bad-tw-counts.tex" the table "raw-gud-bad-tw-counts.txt"
      from "gen/LANG/BUK/", formatted as a LaTeX table.
      
    "raw-gud-bad-summary.tex", that defines the entries of that table
      as separate LaTeX macros.
 
OUTPUT DATA FILES PER SECTION

  The main output files are "DIR/raw.evt" and "DIR/XXX.EEE"
  where 
  
    DIR is any of the folders "gen/LANG/BUK/SEC.K",
    
    XXX is "raw", "gud", or "bad", and 
    
    EEE is "tlw", "wfr", or "wdf"
    
  The file "DIR/raw.evt" contains the 
  lines from "gen/LANG/BUK/raw.evt" that belong to section SEC.K.
  
  The file "DIR/raw.tlw" contains the raw sequence of tokens and
  paragraph delimiters that occur in the "DIR/raw.evt" file, one token per
  line, in the format "TYPE LOC STRING", where 
  
    LOC is the EVT-style line location code
    
    TYPE is the type of the token ("p" = punctuation, "s" = symbol,
      "a" = alpha word)
  
    STRING is the token in EVA encoding.
    
  There is a line "# =" wherever raw.evt has
  a "=" delimiter -- namely, between paragraphs, labels, titles, etc.
  (but not at line breaks within paragraphs).
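
  A minimal Python sketch of a reader for this format, grouping the
  tokens into units separated by the "# =" lines. It assumes that the
  LOC field contains no blanks and that other '#'-lines and blank lines
  can be ignored; the function name and the locators in the sample are
  made up for illustration.

```python
def read_tlw(lines):
    """Group (TYPE, LOC, STRING) tokens into units split at "# =" lines."""
    units, current = [], []
    for line in lines:
        line = line.strip()
        if line == "# =":
            # Delimiter between paragraphs, labels, titles, etc.
            if current:
                units.append(current)
            current = []
        elif not line or line.startswith("#"):
            continue  # blank line or other comment
        else:
            ttype, loc, string = line.split(None, 2)
            current.append((ttype, loc, string))
    if current:
        units.append(current)
    return units


sample = [
    "# =",
    "a f1r.1 fachys",
    "a f1r.1 ykal",
    "a f1r.2 sory",
    "# =",
    "a f1r.3 ydaraishy",
]
for unit in read_tlw(sample):
    print([string for _, _, string in unit])
```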
  
  The file "DIR/raw.wfr" contains the corresponding lexemes with occurrence
  counts and relative frequencies.
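
  The counting step can be sketched in Python as below. The order of
  the output (by decreasing count, then alphabetically) and the tuple
  layout (count, frequency, lexeme) are assumptions, not necessarily
  the layout of the actual ".wfr" files.

```python
from collections import Counter


def word_freqs(tokens):
    """Return (count, relative frequency, lexeme) for each distinct token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Most frequent first; ties broken alphabetically.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(n, n / total, word) for word, n in rows]


for n, freq, word in word_freqs(["daiin", "ol", "daiin", "chedy"]):
    print(f"{n:7d} {freq:7.5f} {word}")
```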
  
  The file "DIR/raw.wdf" contains the lexemes of "DIR/raw.tlw" as a running text,
  with ~72 characters per line, separated by simple spaces or line breaks,
  without locators, paragraph breaks, section breaks, etc.
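
  A minimal Python sketch of the filling step, using the standard
  "textwrap" module. It treats 72 as a hard upper bound on the line
  length, which is an assumption; the function name is hypothetical.

```python
import textwrap


def fill_words(words, width=72):
    """Pack words into lines of at most `width` characters, never splitting a word."""
    return textwrap.wrap(
        " ".join(words),
        width=width,
        break_long_words=False,
        break_on_hyphens=False,
    )


for line in fill_words(["daiin", "chedy", "qokeedy", "ol"] * 10):
    print(line)
```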
  
  The files "DIR/gud.EEE" and "DIR/bad.EEE", where
  EEE is "tlw", "wfr", or "wdf", are the corresponding subsets of "DIR/raw.EEE".

  List of Voynichese "books":

REMOVING BAD WORDS

  The "bad" tokens and lexemes are those with unreadable characters,
  weirdos, or combinations that are considered "invalid" for some
  reason. The excluded lexemes are saved in "DIR/bad.wfr", and the balance
  is saved in "DIR/gud.wfr". Most other files are derived from "gud.wfr",
  the frequency file for good lexemes.
  
  Weirdos are defined as characters and combinations that are not part
  of the basic glyph set
  
    e i  a o  q y  d l  r s  n m   k t  f p  ch sh  ckh cth  cfh cph 

  Note that we exclude  { g j u v x z } as well as any { c h }
  that are not part of the compound glyphs listed above.
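
  The glyph-set test can be sketched in Python with a regular
  expression. This covers only the basic-glyph criterion stated above;
  any other reasons for rejecting a token (such as discrepant readings)
  are not modeled, and the function name is hypothetical.

```python
import re

# One or more basic glyphs: compound glyphs first, then single letters.
BASIC_WORD = re.compile(r"(?:cfh|cph|ckh|cth|ch|sh|[eiaoqydlrsnmktfp])+")


def is_gud(word):
    """True if `word` consists entirely of basic glyphs."""
    return BASIC_WORD.fullmatch(word) is not None


for word in ["daiin", "chckhy", "cheg", "shhy", "ctar", "da*n"]:
    print(word, "gud" if is_gud(word) else "bad")
```

  Since "c" and "h" are not in the single-letter class, they are
  accepted only as part of the compound glyphs, as required.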

  We believe that this selection will not introduce a significant bias
  in the grammar-fitting percentages. Tokens that contain weirdos are
  probably abbreviations or symbols, which should not be counted in
  the totals; or embellished words, which are likely to be chosen for
  embellishment independently of whether they fit the grammar.
  As for tokens that have discrepant readings, the divergence should
  not be strongly correlated to their fitness to the grammar.

DO IT

  Do it. (See the output at end of this note.)
     
    make_all_data_and_tables.sh

>> OLD >>
  
  Here are the numbers.
        
      type               nbad         [?]        [bchv...]      [ai?n]
      -------------- ------------ ------------ ------------ ------------
      voyn/maj/tot.1  1708(2526)   1612(2407)     96(119)     114(396)  
      voyn/prs/tot.1  1580(2358)   1501(2257)     79(101)     114(396)  
      voyn/lab/tot.1   161(168)     143(150)      18(18)        0(0)    
      voyn/ini/tot.1   246(282)     241(277)       5(5)        29(44)   
      voyn/mid/tot.1  1147(1698)   1100(1646)     47(52)       86(316)  
      voyn/fin/tot.1   294(335)     267(306)      27(29)       26(36)   
      voyn/tak/tot.1   497(626)     127(154)     370(472)       2(2)    
                           
  Bad words that were rejected only because of lowercase
  weirdos:

    From voyp/vms/tot.1.wfr:

      v(7) x(7) c(4) cheg(3) xar(3) amg(2) cto(2) g(2) aikhckhy(1)
      aithy(1) arg(1) arxor(1) axor(1) chckshy(1) chcpar(1)
      chcs(1) checta(1) chepchx(1) chocty(1) chodalg(1)
      choekchcey(1) choikhy(1) chokolg(1) cholxy(1) chxar(1)
      ckcho(1) ckchol(1) ckshy(1) cky(1) coy(1) cpheeg(1) cseo(1)
      ctar(1) ctchy(1) ctechy(1) ctoiin(1) ctos(1) dag(1) daing(1)
      dchog(1) dkeeeg(1) docodal(1) doithy(1) gaiin(1) kedarxy(1)
      lxor(1) ockey(1) ockhh(1) oetalchg(1) ogam(1) olgy(1) org(1)
      oxar(1) oxor(1) oxy(1) pchocty(1) qocky(1) qodaikhy(1)
      qokeefcy(1) qokg(1) rokaix(1) salxar(1) sarg(1)
      shecphhedy(1) shhy(1) shokog(1) shxam(1) soleeg(1) teyteg(1)
      todashx(1) vo(1) vr(1) vs(1) xoiin(1) xol(1) yhal(1)
      ykceol(1) ypcheg(1) ytcharg(1)

    From voyl/vms/tot.1.wfr:

      cfhhy(1) chockhhy(1) chodalg(1) ddsschx(1) docfhhy(1) gy(1)
      oalcheg(1) ocsesy(1) oecs(1) ofacfom(1) okaramog(1)
      okeeog(1) opalg(1) opchaldg(1) oteedyg(1) soshxar(1)
      ydashgarain(1) yskhy(1)

OUTPUT OF "MAKE"

  Sample voyn/maj:
  
        lines   words     bytes file        
      ------- ------- --------- ------------
         1066    2132     64512 gen/voyn/maj/hea.1/raw.evt
          134     268      8660 gen/voyn/maj/hea.2/raw.evt
          316     632     24711 gen/voyn/maj/heb.1/raw.evt
           61     122      4644 gen/voyn/maj/heb.2/raw.evt
           13      26      1132 gen/voyn/maj/cos.1/raw.evt
          393     786     19115 gen/voyn/maj/cos.2/raw.evt
          186     372      9994 gen/voyn/maj/cos.3/raw.evt
          902    1804     62353 gen/voyn/maj/bio.1/raw.evt
          335     670     15343 gen/voyn/maj/zod.1/raw.evt
          174     348     10021 gen/voyn/maj/pha.1/raw.evt
          284     568     15718 gen/voyn/maj/pha.2/raw.evt
           80     160      6158 gen/voyn/maj/str.1/raw.evt
         1084    2168     90650 gen/voyn/maj/str.2/raw.evt
           28      56      1835 gen/voyn/maj/unk.1/raw.evt
           26      52      1801 gen/voyn/maj/unk.2/raw.evt
            7      14       461 gen/voyn/maj/unk.3/raw.evt
           48      96      2972 gen/voyn/maj/unk.4/raw.evt
           35      70      2844 gen/voyn/maj/unk.5/raw.evt
           45      90      3845 gen/voyn/maj/unk.6/raw.evt
           39      78      3002 gen/voyn/maj/unk.7/raw.evt
            1       2        67 gen/voyn/maj/unk.8/raw.evt
         5514   11901    360159 gen/voyn/maj/tot.1/raw.evt

        lines   words     bytes file        
      ------- ------- --------- ------------
         7047   20961    164869 gen/voyn/maj/hea.1/raw.tlw
          882    2632     21493 gen/voyn/maj/hea.2/raw.tlw
         2959    8819     70279 gen/voyn/maj/heb.1/raw.tlw
          570    1697     13835 gen/voyn/maj/heb.2/raw.tlw
          205     605      4403 gen/voyn/maj/cos.1/raw.tlw
         2032    5810     44962 gen/voyn/maj/cos.2/raw.tlw
         1123    3252     26380 gen/voyn/maj/cos.3/raw.tlw
         7171   21317    174100 gen/voyn/maj/bio.1/raw.tlw
         1674    4718     36687 gen/voyn/maj/zod.1/raw.tlw
         1123    3269     26971 gen/voyn/maj/pha.1/raw.tlw
         1763    5114     42370 gen/voyn/maj/pha.2/raw.tlw
          763    2281     19088 gen/voyn/maj/str.1/raw.tlw
        11056   32880    283328 gen/voyn/maj/str.2/raw.tlw
          220     653      5215 gen/voyn/maj/unk.1/raw.tlw
          142     424      3534 gen/voyn/maj/unk.2/raw.tlw
           49     145      1129 gen/voyn/maj/unk.3/raw.tlw
          337     991      7977 gen/voyn/maj/unk.4/raw.tlw
          351    1044      8939 gen/voyn/maj/unk.5/raw.tlw
          492    1473     12624 gen/voyn/maj/unk.6/raw.tlw
          392    1171      9897 gen/voyn/maj/unk.7/raw.tlw
            2       6        49 gen/voyn/maj/unk.8/raw.tlw
        40372  119300    978182 gen/voyn/maj/tot.1/raw.tlw

             lines file
           ------- ------------
              2132 gen/voyn/maj/hea.1/raw.wfr
               554 gen/voyn/maj/hea.2/raw.wfr
              1189 gen/voyn/maj/heb.1/raw.wfr
               331 gen/voyn/maj/heb.2/raw.wfr
                83 gen/voyn/maj/cos.1/raw.wfr
              1019 gen/voyn/maj/cos.2/raw.wfr
               620 gen/voyn/maj/cos.3/raw.wfr
              1597 gen/voyn/maj/bio.1/raw.wfr
               884 gen/voyn/maj/zod.1/raw.wfr
               561 gen/voyn/maj/pha.1/raw.wfr
               808 gen/voyn/maj/pha.2/raw.wfr
               483 gen/voyn/maj/str.1/raw.wfr
              3225 gen/voyn/maj/str.2/raw.wfr
               162 gen/voyn/maj/unk.1/raw.wfr
               103 gen/voyn/maj/unk.2/raw.wfr
                46 gen/voyn/maj/unk.3/raw.wfr
               239 gen/voyn/maj/unk.4/raw.wfr
               246 gen/voyn/maj/unk.5/raw.wfr
               297 gen/voyn/maj/unk.6/raw.wfr
               235 gen/voyn/maj/unk.7/raw.wfr
                 2 gen/voyn/maj/unk.8/raw.wfr
              8591 gen/voyn/maj/tot.1/raw.wfr

             lines file
           ------- ------------
              1981 gen/voyn/maj/hea.1/gud.wfr
               509 gen/voyn/maj/hea.2/gud.wfr
              1111 gen/voyn/maj/heb.1/gud.wfr
               288 gen/voyn/maj/heb.2/gud.wfr
                72 gen/voyn/maj/cos.1/gud.wfr
               868 gen/voyn/maj/cos.2/gud.wfr
               429 gen/voyn/maj/cos.3/gud.wfr
              1382 gen/voyn/maj/bio.1/gud.wfr
               555 gen/voyn/maj/zod.1/gud.wfr
               483 gen/voyn/maj/pha.1/gud.wfr
               694 gen/voyn/maj/pha.2/gud.wfr
               402 gen/voyn/maj/str.1/gud.wfr
              2779 gen/voyn/maj/str.2/gud.wfr
               153 gen/voyn/maj/unk.1/gud.wfr
                97 gen/voyn/maj/unk.2/gud.wfr
                43 gen/voyn/maj/unk.3/gud.wfr
               228 gen/voyn/maj/unk.4/gud.wfr
               214 gen/voyn/maj/unk.5/gud.wfr
               247 gen/voyn/maj/unk.6/gud.wfr
               208 gen/voyn/maj/unk.7/gud.wfr
                 2 gen/voyn/maj/unk.8/gud.wfr
              6883 gen/voyn/maj/tot.1/gud.wfr

             lines file
           ------- ------------
               151 gen/voyn/maj/hea.1/bad.wfr
                45 gen/voyn/maj/hea.2/bad.wfr
                78 gen/voyn/maj/heb.1/bad.wfr
                43 gen/voyn/maj/heb.2/bad.wfr
                11 gen/voyn/maj/cos.1/bad.wfr
               151 gen/voyn/maj/cos.2/bad.wfr
               191 gen/voyn/maj/cos.3/bad.wfr
               215 gen/voyn/maj/bio.1/bad.wfr
               329 gen/voyn/maj/zod.1/bad.wfr
                78 gen/voyn/maj/pha.1/bad.wfr
               114 gen/voyn/maj/pha.2/bad.wfr
                81 gen/voyn/maj/str.1/bad.wfr
               446 gen/voyn/maj/str.2/bad.wfr
                 9 gen/voyn/maj/unk.1/bad.wfr
                 6 gen/voyn/maj/unk.2/bad.wfr
                 3 gen/voyn/maj/unk.3/bad.wfr
                11 gen/voyn/maj/unk.4/bad.wfr
                32 gen/voyn/maj/unk.5/bad.wfr
                50 gen/voyn/maj/unk.6/bad.wfr
                27 gen/voyn/maj/unk.7/bad.wfr
                 0 gen/voyn/maj/unk.8/bad.wfr
              1708 gen/voyn/maj/tot.1/bad.wfr

      Good/bad statistics for voyn/maj:

      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    6867   6704  976    163   23   2132   1981  928    151   70
        hea.2     868    823  947     45   51    554    509  917     45   81
        heb.1    2901   2820  971     81   27   1189   1111  933     78   65
        heb.2     557    510  913     47   84    331    288  867     43  129
        cos.1     195    155  790     40  204     83     72  857     11  130
        cos.2    1746   1590  910    156   89   1019    868  850    151  148
        cos.3    1006    795  789    211  209    620    429  690    191  307
        bio.1    6975   6697  960    278   39   1597   1382  864    215  134
        zod.1    1370    988  720    382  278    884    555  627    329  371
        pha.1    1023    944  921     79   77    561    483  859     78  138
        pha.2    1588   1452  913    136   85    808    694  857    114  140
        str.1     755    670  886     85  112    483    402  830     81  167
        str.2   10768  10097  937    671   62   3225   2779  861    446  138
        unk.1     213    202  943     11   51    162    153  938      9   55
        unk.2     140    134  950      6   42    103     97  932      6   57
        unk.3      47     44  916      3   62     46     43  914      3   63
        unk.4     317    306  962     11   34    239    228  950     11   45
        unk.5     342    309  900     33   96    246    214  866     32  129
        unk.6     489    431  879     58  118    297    247  828     50  167
        unk.7     387    357  920     30   77    235    208  881     27  114
        unk.8       2      2  666      0    0      2      2  666      0    0
      
        tot.1   38556  36030  934   2526   65   8591   6883  801   1708  198

  Sample voyn/prs:

        lines   words     bytes file        
      ------- ------- --------- ------------
         1065    2130     64485 gen/voyn/prs/hea.1/raw.evt
          134     268      8660 gen/voyn/prs/hea.2/raw.evt
          316     632     24711 gen/voyn/prs/heb.1/raw.evt
           61     122      4644 gen/voyn/prs/heb.2/raw.evt
            4       8       870 gen/voyn/prs/cos.1/raw.evt
          206     412     13662 gen/voyn/prs/cos.2/raw.evt
           85     170      7150 gen/voyn/prs/cos.3/raw.evt
          775    1550     58885 gen/voyn/prs/bio.1/raw.evt
           36      72      6945 gen/voyn/prs/zod.1/raw.evt
           89     178      7635 gen/voyn/prs/pha.1/raw.evt
          135     270     11650 gen/voyn/prs/pha.2/raw.evt
           80     160      6158 gen/voyn/prs/str.1/raw.evt
         1084    2168     90650 gen/voyn/prs/str.2/raw.evt
           28      56      1835 gen/voyn/prs/unk.1/raw.evt
           26      52      1801 gen/voyn/prs/unk.2/raw.evt
            7      14       461 gen/voyn/prs/unk.3/raw.evt
           33      66      2563 gen/voyn/prs/unk.4/raw.evt
           35      70      2844 gen/voyn/prs/unk.5/raw.evt
           45      90      3845 gen/voyn/prs/unk.6/raw.evt
           39      78      3002 gen/voyn/prs/unk.7/raw.evt
            0       0         0 gen/voyn/prs/unk.8/raw.evt
         4540    9953    332777 gen/voyn/prs/tot.1/raw.evt

        lines   words     bytes file        
      ------- ------- --------- ------------
         7045   20956    164841 gen/voyn/prs/hea.1/raw.tlw
          882    2632     21493 gen/voyn/prs/hea.2/raw.tlw
         2959    8819     70279 gen/voyn/prs/heb.1/raw.tlw
          570    1697     13835 gen/voyn/prs/heb.2/raw.tlw
          186     557      4105 gen/voyn/prs/cos.1/raw.tlw
         1606    4703     37794 gen/voyn/prs/cos.2/raw.tlw
          904    2692     22690 gen/voyn/prs/cos.3/raw.tlw
         6915   20658    170013 gen/voyn/prs/bio.1/raw.tlw
         1015    3040     25966 gen/voyn/prs/zod.1/raw.tlw
          942    2810     24107 gen/voyn/prs/pha.1/raw.tlw
         1455    4336     37509 gen/voyn/prs/pha.2/raw.tlw
          763    2281     19088 gen/voyn/prs/str.1/raw.tlw
        11056   32880    283328 gen/voyn/prs/str.2/raw.tlw
          220     653      5215 gen/voyn/prs/unk.1/raw.tlw
          142     424      3534 gen/voyn/prs/unk.2/raw.tlw
           49     145      1129 gen/voyn/prs/unk.3/raw.tlw
          307     916      7559 gen/voyn/prs/unk.4/raw.tlw
          351    1044      8939 gen/voyn/prs/unk.5/raw.tlw
          492    1473     12624 gen/voyn/prs/unk.6/raw.tlw
          392    1171      9897 gen/voyn/prs/unk.7/raw.tlw
            0       0         0 gen/voyn/prs/unk.8/raw.tlw
        38269  113923    943994 gen/voyn/prs/tot.1/raw.tlw

       lines file
     ------- ------------
        2131 gen/voyn/prs/hea.1/raw.wfr
         554 gen/voyn/prs/hea.2/raw.wfr
        1189 gen/voyn/prs/heb.1/raw.wfr
         331 gen/voyn/prs/heb.2/raw.wfr
          73 gen/voyn/prs/cos.1/raw.wfr
         868 gen/voyn/prs/cos.2/raw.wfr
         533 gen/voyn/prs/cos.3/raw.wfr
        1536 gen/voyn/prs/bio.1/raw.wfr
         641 gen/voyn/prs/zod.1/raw.wfr
         485 gen/voyn/prs/pha.1/raw.wfr
         684 gen/voyn/prs/pha.2/raw.wfr
         483 gen/voyn/prs/str.1/raw.wfr
        3225 gen/voyn/prs/str.2/raw.wfr
         162 gen/voyn/prs/unk.1/raw.wfr
         103 gen/voyn/prs/unk.2/raw.wfr
          46 gen/voyn/prs/unk.3/raw.wfr
         226 gen/voyn/prs/unk.4/raw.wfr
         246 gen/voyn/prs/unk.5/raw.wfr
         297 gen/voyn/prs/unk.6/raw.wfr
         235 gen/voyn/prs/unk.7/raw.wfr
           0 gen/voyn/prs/unk.8/raw.wfr
        8105 gen/voyn/prs/tot.1/raw.wfr

       lines file
     ------- ------------
        1980 gen/voyn/prs/hea.1/gud.wfr
         509 gen/voyn/prs/hea.2/gud.wfr
        1111 gen/voyn/prs/heb.1/gud.wfr
         288 gen/voyn/prs/heb.2/gud.wfr
          63 gen/voyn/prs/cos.1/gud.wfr
         733 gen/voyn/prs/cos.2/gud.wfr
         380 gen/voyn/prs/cos.3/gud.wfr
        1325 gen/voyn/prs/bio.1/gud.wfr
         379 gen/voyn/prs/zod.1/gud.wfr
         418 gen/voyn/prs/pha.1/gud.wfr
         587 gen/voyn/prs/pha.2/gud.wfr
         402 gen/voyn/prs/str.1/gud.wfr
        2779 gen/voyn/prs/str.2/gud.wfr
         153 gen/voyn/prs/unk.1/gud.wfr
          97 gen/voyn/prs/unk.2/gud.wfr
          43 gen/voyn/prs/unk.3/gud.wfr
         216 gen/voyn/prs/unk.4/gud.wfr
         214 gen/voyn/prs/unk.5/gud.wfr
         247 gen/voyn/prs/unk.6/gud.wfr
         208 gen/voyn/prs/unk.7/gud.wfr
           0 gen/voyn/prs/unk.8/gud.wfr
        6525 gen/voyn/prs/tot.1/gud.wfr

       lines file
     ------- ------------
         151 gen/voyn/prs/hea.1/bad.wfr
          45 gen/voyn/prs/hea.2/bad.wfr
          78 gen/voyn/prs/heb.1/bad.wfr
          43 gen/voyn/prs/heb.2/bad.wfr
          10 gen/voyn/prs/cos.1/bad.wfr
         135 gen/voyn/prs/cos.2/bad.wfr
         153 gen/voyn/prs/cos.3/bad.wfr
         211 gen/voyn/prs/bio.1/bad.wfr
         262 gen/voyn/prs/zod.1/bad.wfr
          67 gen/voyn/prs/pha.1/bad.wfr
          97 gen/voyn/prs/pha.2/bad.wfr
          81 gen/voyn/prs/str.1/bad.wfr
         446 gen/voyn/prs/str.2/bad.wfr
           9 gen/voyn/prs/unk.1/bad.wfr
           6 gen/voyn/prs/unk.2/bad.wfr
           3 gen/voyn/prs/unk.3/bad.wfr
          10 gen/voyn/prs/unk.4/bad.wfr
          32 gen/voyn/prs/unk.5/bad.wfr
          50 gen/voyn/prs/unk.6/bad.wfr
          27 gen/voyn/prs/unk.7/bad.wfr
           0 gen/voyn/prs/unk.8/bad.wfr
        1580 gen/voyn/prs/tot.1/bad.wfr

 
      Good/bad statistics for voyn/prs:

      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1    6866   6703  976    163   23   2131   1980  928    151   70
        hea.2     868    823  947     45   51    554    509  917     45   81
        heb.1    2901   2820  971     81   27   1189   1111  933     78   65
        heb.2     557    510  913     47   84    331    288  867     43  129
        cos.1     185    146  784     39  209     73     63  851     10  135
        cos.2    1491   1353  906    138   92    868    733  843    135  155
        cos.3     884    713  805    171  193    533    380  711    153  286
        bio.1    6828   6555  959    273   39   1536   1325  862    211  137
        zod.1    1010    701  693    309  305    641    379  590    262  408
        pha.1     926    858  925     68   73    485    418  860     67  137
        pha.2    1426   1309  917    117   81    684    587  856     97  141
        str.1     755    670  886     85  112    483    402  830     81  167
        str.2   10768  10097  937    671   62   3225   2779  861    446  138
        unk.1     213    202  943     11   51    162    153  938      9   55
        unk.2     140    134  950      6   42    103     97  932      6   57
        unk.3      47     44  916      3   62     46     43  914      3   63
        unk.4     302    292  963     10   33    226    216  951     10   44
        unk.5     342    309  900     33   96    246    214  866     32  129
        unk.6     489    431  879     58  118    297    247  828     50  167
        unk.7     387    357  920     30   77    235    208  881     27  114
        unk.8       0      0    0      0    0      0      0    0      0    0
      
        tot.1   37385  35027  936   2358   63   8105   6525  804   1580  194

  Sample voyn/lab:

        lines   words     bytes file        
      ------- ------- --------- ------------
            1       2        27 gen/voyn/lab/hea.1/raw.evt
            0       0         0 gen/voyn/lab/hea.2/raw.evt
            0       0         0 gen/voyn/lab/heb.1/raw.evt
            0       0         0 gen/voyn/lab/heb.2/raw.evt
            9      18       262 gen/voyn/lab/cos.1/raw.evt
          187     374      5453 gen/voyn/lab/cos.2/raw.evt
          101     202      2844 gen/voyn/lab/cos.3/raw.evt
          127     254      3468 gen/voyn/lab/bio.1/raw.evt
          299     598      8398 gen/voyn/lab/zod.1/raw.evt
           85     170      2386 gen/voyn/lab/pha.1/raw.evt
          149     298      4068 gen/voyn/lab/pha.2/raw.evt
            0       0         0 gen/voyn/lab/str.1/raw.evt
            0       0         0 gen/voyn/lab/str.2/raw.evt
            0       0         0 gen/voyn/lab/unk.1/raw.evt
            0       0         0 gen/voyn/lab/unk.2/raw.evt
            0       0         0 gen/voyn/lab/unk.3/raw.evt
           15      30       409 gen/voyn/lab/unk.4/raw.evt
            0       0         0 gen/voyn/lab/unk.5/raw.evt
            0       0         0 gen/voyn/lab/unk.6/raw.evt
            0       0         0 gen/voyn/lab/unk.7/raw.evt
            1       2        67 gen/voyn/lab/unk.8/raw.evt
         1231    3335     37703 gen/voyn/lab/tot.1/raw.evt

        lines   words     bytes file        
      ------- ------- --------- ------------
            1       3        24 gen/voyn/lab/hea.1/raw.tlw
            0       0         0 gen/voyn/lab/hea.2/raw.tlw
            0       0         0 gen/voyn/lab/heb.1/raw.tlw
            0       0         0 gen/voyn/lab/heb.2/raw.tlw
           18      46       294 gen/voyn/lab/cos.1/raw.tlw
          425    1105      7164 gen/voyn/lab/cos.2/raw.tlw
          218     558      3687 gen/voyn/lab/cos.3/raw.tlw
          255     657      4084 gen/voyn/lab/bio.1/raw.tlw
          658    1676     10717 gen/voyn/lab/zod.1/raw.tlw
          180     457      2862 gen/voyn/lab/pha.1/raw.tlw
          307     776      4859 gen/voyn/lab/pha.2/raw.tlw
            0       0         0 gen/voyn/lab/str.1/raw.tlw
            0       0         0 gen/voyn/lab/str.2/raw.tlw
            0       0         0 gen/voyn/lab/unk.1/raw.tlw
            0       0         0 gen/voyn/lab/unk.2/raw.tlw
            0       0         0 gen/voyn/lab/unk.3/raw.tlw
           29      73       415 gen/voyn/lab/unk.4/raw.tlw
            0       0         0 gen/voyn/lab/unk.5/raw.tlw
            0       0         0 gen/voyn/lab/unk.6/raw.tlw
            0       0         0 gen/voyn/lab/unk.7/raw.tlw
            2       6        49 gen/voyn/lab/unk.8/raw.tlw
         2102    5375     34187 gen/voyn/lab/tot.1/raw.tlw

       lines file
     ------- ------------
           1 gen/voyn/lab/hea.1/raw.wfr
           0 gen/voyn/lab/hea.2/raw.wfr
           0 gen/voyn/lab/heb.1/raw.wfr
           0 gen/voyn/lab/heb.2/raw.wfr
          10 gen/voyn/lab/cos.1/raw.wfr
         225 gen/voyn/lab/cos.2/raw.wfr
         112 gen/voyn/lab/cos.3/raw.wfr
         127 gen/voyn/lab/bio.1/raw.wfr
         303 gen/voyn/lab/zod.1/raw.wfr
          92 gen/voyn/lab/pha.1/raw.wfr
         155 gen/voyn/lab/pha.2/raw.wfr
           0 gen/voyn/lab/str.1/raw.wfr
           0 gen/voyn/lab/str.2/raw.wfr
           0 gen/voyn/lab/unk.1/raw.wfr
           0 gen/voyn/lab/unk.2/raw.wfr
           0 gen/voyn/lab/unk.3/raw.wfr
          15 gen/voyn/lab/unk.4/raw.wfr
           0 gen/voyn/lab/unk.5/raw.wfr
           0 gen/voyn/lab/unk.6/raw.wfr
           0 gen/voyn/lab/unk.7/raw.wfr
           2 gen/voyn/lab/unk.8/raw.wfr
         882 gen/voyn/lab/tot.1/raw.wfr

       lines file
     ------- ------------
           1 gen/voyn/lab/hea.1/gud.wfr
           0 gen/voyn/lab/hea.2/gud.wfr
           0 gen/voyn/lab/heb.1/gud.wfr
           0 gen/voyn/lab/heb.2/gud.wfr
           9 gen/voyn/lab/cos.1/gud.wfr
         208 gen/voyn/lab/cos.2/gud.wfr
          72 gen/voyn/lab/cos.3/gud.wfr
         122 gen/voyn/lab/bio.1/gud.wfr
         233 gen/voyn/lab/zod.1/gud.wfr
          81 gen/voyn/lab/pha.1/gud.wfr
         136 gen/voyn/lab/pha.2/gud.wfr
           0 gen/voyn/lab/str.1/gud.wfr
           0 gen/voyn/lab/str.2/gud.wfr
           0 gen/voyn/lab/unk.1/gud.wfr
           0 gen/voyn/lab/unk.2/gud.wfr
           0 gen/voyn/lab/unk.3/gud.wfr
          14 gen/voyn/lab/unk.4/gud.wfr
           0 gen/voyn/lab/unk.5/gud.wfr
           0 gen/voyn/lab/unk.6/gud.wfr
           0 gen/voyn/lab/unk.7/gud.wfr
           2 gen/voyn/lab/unk.8/gud.wfr
         721 gen/voyn/lab/tot.1/gud.wfr

       lines file
     ------- ------------
           0 gen/voyn/lab/hea.1/bad.wfr
           0 gen/voyn/lab/hea.2/bad.wfr
           0 gen/voyn/lab/heb.1/bad.wfr
           0 gen/voyn/lab/heb.2/bad.wfr
           1 gen/voyn/lab/cos.1/bad.wfr
          17 gen/voyn/lab/cos.2/bad.wfr
          40 gen/voyn/lab/cos.3/bad.wfr
           5 gen/voyn/lab/bio.1/bad.wfr
          70 gen/voyn/lab/zod.1/bad.wfr
          11 gen/voyn/lab/pha.1/bad.wfr
          19 gen/voyn/lab/pha.2/bad.wfr
           0 gen/voyn/lab/str.1/bad.wfr
           0 gen/voyn/lab/str.2/bad.wfr
           0 gen/voyn/lab/unk.1/bad.wfr
           0 gen/voyn/lab/unk.2/bad.wfr
           0 gen/voyn/lab/unk.3/bad.wfr
           1 gen/voyn/lab/unk.4/bad.wfr
           0 gen/voyn/lab/unk.5/bad.wfr
           0 gen/voyn/lab/unk.6/bad.wfr
           0 gen/voyn/lab/unk.7/bad.wfr
           0 gen/voyn/lab/unk.8/bad.wfr
         161 gen/voyn/lab/tot.1/bad.wfr

 
      Good/bad statistics for voyn/lab:

      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
        hea.1       1      1  500      0    0      1      1  500      0    0
        hea.2       0      0    0      0    0      0      0    0      0    0
        heb.1       0      0    0      0    0      0      0    0      0    0
        heb.2       0      0    0      0    0      0      0    0      0    0
        cos.1      10      9  818      1   90     10      9  818      1   90
        cos.2     255    237  925     18   70    225    208  920     17   75
        cos.3     122     82  666     40  325    112     72  637     40  353
        bio.1     147    142  959      5   33    127    122  953      5   39
        zod.1     360    287  795     73  202    303    233  766     70  230
        pha.1      97     86  877     11  112     92     81  870     11  118
        pha.2     162    143  877     19  116    155    136  871     19  121
        str.1       0      0    0      0    0      0      0    0      0    0
        str.2       0      0    0      0    0      0      0    0      0    0
        unk.1       0      0    0      0    0      0      0    0      0    0
        unk.2       0      0    0      0    0      0      0    0      0    0
        unk.3       0      0    0      0    0      0      0    0      0    0
        unk.4      15     14  875      1   62     15     14  875      1   62
        unk.5       0      0    0      0    0      0      0    0      0    0
        unk.6       0      0    0      0    0      0      0    0      0    0
        unk.7       0      0    0      0    0      0      0    0      0    0
        unk.8       2      2  666      0    0      2      2  666      0    0
      
        tot.1    1171   1003  855    168  143    882    721  816    161  182
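
      The "ppt" columns are not defined in this listing. The numbers are
      consistent with parts per thousand taken relative to raw + 1 and
      truncated (for example, hea.1 has 1 good token out of 1 raw and is
      listed as 500, not 1000). A minimal sketch under that assumption;
      the function name "ppt" is hypothetical, and the formula is inferred
      from the table, not taken from the scripts used in this note:

```shell
# ppt N RAW: parts per thousand of N relative to RAW + 1, truncated.
# ASSUMPTION: the "+ 1" in the denominator is inferred from the table
# above (hea.1: 1 good out of 1 raw is listed as 500); it is not a
# documented property of the scripts that produced the table.
ppt() {
  awk -v n="$1" -v raw="$2" 'BEGIN { printf "%d\n", int(1000 * n / (raw + 1)) }'
}

ppt 237 255   # cos.2 good tokens: prints 925
ppt 18 255    # cos.2 bad tokens:  prints 70
```

      Under this reading the "+ 1" also keeps the ratio defined for the
      empty sections (raw = 0), which come out as 0.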

  Statistics for voyn/tak:

        lines   words     bytes file        
      ------- ------- --------- ------------
         5391   11713    361531 gen/voyn/tak/tot.1/raw.evt

        lines   words     bytes file        
      ------- ------- --------- ------------
        39548  116936    960572 gen/voyn/tak/tot.1/raw.tlw
        
       lines file
     ------- ------------
        8150 gen/voyn/tak/tot.1/raw.wfr

       lines file
     ------- ------------
        7653 gen/voyn/tak/tot.1/gud.wfr

       lines file
     ------- ------------
         497 gen/voyn/tak/tot.1/bad.wfr

 
      Good/bad statistics for voyn/tak:

      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
      
        tot.1   37840  37214  983    626   16   8150   7653  938    497   60

  Statistics for voyn/ini:

        lines   words     bytes file        
      ------- ------- --------- ------------
         5726   16301    126670 gen/voyn/ini/tot.1/raw.tlw
         
       lines file
     ------- ------------
        2159 gen/voyn/ini/tot.1/raw.wfr

       lines file
     ------- ------------
        1913 gen/voyn/ini/tot.1/gud.wfr

       lines file
     ------- ------------
         246 gen/voyn/ini/tot.1/bad.wfr

 
      Good/bad statistics for voyn/ini:
 
      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
      
        tot.1    4849   4567  941    282   58   2159   1913  885    246  113

  Statistics for voyn/fin:

        lines   words     bytes file        
      ------- ------- --------- ------------
         5726   16301    122721 gen/voyn/fin/tot.1/raw.tlw

       lines file
     ------- ------------
        2042 gen/voyn/fin/tot.1/raw.wfr

       lines file
     ------- ------------
        1748 gen/voyn/fin/tot.1/gud.wfr

       lines file
     ------- ------------
         294 gen/voyn/fin/tot.1/bad.wfr
 
      Good/bad statistics for voyn/fin:
 
      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
      
        tot.1    4849   4514  930    335   69   2042   1748  855    294  143

  Statistics for voyn/mid:

        lines   words     bytes file        
      ------- ------- --------- ------------
        28240   83880    694625 gen/voyn/mid/tot.1/raw.tlw

       lines file
     ------- ------------
        5633 gen/voyn/mid/tot.1/raw.wfr

       lines file
     ------- ------------
        4486 gen/voyn/mid/tot.1/gud.wfr

       lines file
     ------- ------------
        1147 gen/voyn/mid/tot.1/bad.wfr
 
      Good/bad statistics for voyn/mid:
 
      #                    tokens                         lexemes             
      #         -----------------------------  -----------------------------
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
      # ------  -----  ----- ----  ----- ----  -----  ----- ----  ----- ----
      
        tot.1   27400  25702  937   1698   61   5633   4486  796   1147  203
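
      In each data row of the good/bad tables, the raw count should be
      the sum of the gud and bad counts, for tokens and for lexemes. A
      minimal sketch of that sanity check, assuming the 11-column row
      layout shown above; "check_rows" is a hypothetical helper, not one
      of this note's scripts:

```shell
# check_rows: read good/bad table rows on stdin and verify, per row,
# that raw = gud + bad for tokens (columns 2, 3, 5) and for lexemes
# (columns 7, 8, 10). Header lines (first field starting with "#"),
# blank lines, and anything without exactly 11 fields are skipped.
check_rows() {
  awk '
    $1 ~ /^#/ || NF != 11 { next }
    {
      tok = ($2 == $3 + $5)  ? "ok" : "BAD"
      lex = ($7 == $8 + $10) ? "ok" : "BAD"
      print $1, "tokens:" tok, "lexemes:" lex
    }
  '
}

check_rows <<'EOF'
      # sec       raw    gud  ppt    bad  ppt    raw    gud  ppt    bad  ppt
        tot.1   27400  25702  937   1698   61   5633   4486  796   1147  203
EOF
```

      For the voyn/mid row above this prints "tot.1 tokens:ok lexemes:ok".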

  Comparison of raw.wfr for voyn/{prs,lab} and voyn/maj, per section:

      voyn/{prs,lab}/hea.1/raw.wfr:   6867
      voyn/maj/hea.1/raw.wfr:         6867

      voyn/{prs,lab}/hea.2/raw.wfr:   868
      voyn/maj/hea.2/raw.wfr:         868

      voyn/{prs,lab}/heb.1/raw.wfr:   2901
      voyn/maj/heb.1/raw.wfr:         2901

      voyn/{prs,lab}/heb.2/raw.wfr:   557
      voyn/maj/heb.2/raw.wfr:         557

      voyn/{prs,lab}/cos.1/raw.wfr:   195
      voyn/maj/cos.1/raw.wfr:         195

      voyn/{prs,lab}/cos.2/raw.wfr:   1746
      voyn/maj/cos.2/raw.wfr:         1746

      voyn/{prs,lab}/cos.3/raw.wfr:   1006
      voyn/maj/cos.3/raw.wfr:         1006

      voyn/{prs,lab}/bio.1/raw.wfr:   6975
      voyn/maj/bio.1/raw.wfr:         6975

      voyn/{prs,lab}/zod.1/raw.wfr:   1370
      voyn/maj/zod.1/raw.wfr:         1370

      voyn/{prs,lab}/pha.1/raw.wfr:   1023
      voyn/maj/pha.1/raw.wfr:         1023

      voyn/{prs,lab}/pha.2/raw.wfr:   1588
      voyn/maj/pha.2/raw.wfr:         1588

      voyn/{prs,lab}/str.1/raw.wfr:   755
      voyn/maj/str.1/raw.wfr:         755

      voyn/{prs,lab}/str.2/raw.wfr:   10768
      voyn/maj/str.2/raw.wfr:         10768

      voyn/{prs,lab}/unk.1/raw.wfr:   213
      voyn/maj/unk.1/raw.wfr:         213

      voyn/{prs,lab}/unk.2/raw.wfr:   140
      voyn/maj/unk.2/raw.wfr:         140

      voyn/{prs,lab}/unk.3/raw.wfr:   47
      voyn/maj/unk.3/raw.wfr:         47

      voyn/{prs,lab}/unk.4/raw.wfr:   317
      voyn/maj/unk.4/raw.wfr:         317

      voyn/{prs,lab}/unk.5/raw.wfr:   342
      voyn/maj/unk.5/raw.wfr:         342

      voyn/{prs,lab}/unk.6/raw.wfr:   489
      voyn/maj/unk.6/raw.wfr:         489

      voyn/{prs,lab}/unk.7/raw.wfr:   387
      voyn/maj/unk.7/raw.wfr:         387

      voyn/{prs,lab}/unk.8/raw.wfr:   2
      voyn/maj/unk.8/raw.wfr:         2
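
      The listing above pairs, for each section, the count obtained from
      voyn/{prs,lab} with the count obtained from voyn/maj; in every pair
      shown the two numbers agree. A minimal sketch for checking such a
      listing mechanically, assuming it is fed on stdin in the
      "path: count" layout shown; "check_pairs" is a hypothetical helper,
      not one of this note's scripts:

```shell
# check_pairs: read "path: count" lines on stdin, group them by the
# section name (third component of the path), and report whether the
# two counts given for each section agree.
check_pairs() {
  awk '
    NF == 2 && $1 ~ /:$/ {
      split($1, part, "/")
      sec = part[3]
      if (sec in seen) {
        status = (seen[sec] + 0 == $2 + 0) ? "ok" : "MISMATCH"
        print sec, seen[sec], $2, status
      } else {
        seen[sec] = $2
      }
    }
  '
}

check_pairs <<'EOF'
      voyn/{prs,lab}/hea.2/raw.wfr:   868
      voyn/maj/hea.2/raw.wfr:         868
EOF
```

      For the hea.2 pair this prints "hea.2 868 868 ok"; a disagreement
      would be reported with "MISMATCH" in the last column.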

# END