Hacking at the Voynich manuscript - Side notes
037 A concordance of the VMs

Last edited on 2001-03-29 09:02:23 by stolfi

  The goal of this note is to prepare a full concordance for 
  the VMs, showing every word or short phrase in 
  its context.

98-09-14 stolfi  (redone in part 98-12-31)
===============

I. STRUCTURE OF THE CONCORDANCE

  The concordance will be produced in two formats:
  a single machine-readable file, and a set of HTML pages
  (one per word pattern).

  The main intermediate file for a location map is a "raw concordance".
  Each line of this file represents one occurrence of some string 
  in the text, in the format

    LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM HNUM
    1   2     3     4      5    6      7    8    9    10   11

  where

    LOC        is a line locator, like "f1r.11", "f86v2.R1.12a" etc.

    TRANS      is a letter identifying a transcriber, e.g. "F" for Friedman.

    START      is the index of the first byte of the occurrence
               in the text line (counting from 1).

    LENGTH     is the original length of the occurrence in the text, 
               including fillers, comments, spaces, etc.

    LCTX       is a word or word sequence, the left context of STRING.

    STRING     is the non-empty string in question, without any fillers,
               comments, non-significant spaces, line breaks, etc.

    RCTX       is a word or word sequence, the right context of STRING.

    PATT       is a sorting pattern derived from STRING; see below.

    STAG       is a tag identifying a section of the VMS, e.g. "hea" or "bio".

    PNUM       is the page's p-number, a sequential number that sorts
               more conveniently than the folio number.
    
    HNUM       is an HTML page number; see below.
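
  For illustration, a complete raw-concordance entry might look like
  this (all field values invented for the example; the PATT shown is
  just STRING itself, since the real mapping is only defined by the
  add-*-match-key scripts used later):

    f1r.11 F 7 5 daiin. shedy .qokain shedy hea 1 000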

  For EVA format text, the START is relative to column 20.  In that
  case LENGTH does not include columns 1-19 and "#"-comments.  It does
  include "{}"-comments, "!" and "%" fillers, and any ASCII blanks
  beyond column 20.
  
  In this file the symbol "/" will be used instead of "-" to denote
  line breaks, in order to distinguish them from embedded "-" denoting
  gaps, vellum defects, intruding figures, etc.

  The STRING is a single, whole VMS word, as delimited by the EVA word
  separators [-/=,.], or a sequence of two or more consecutive words,
  up to a certain maximum length.
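
  For example (an invented line segment), with a sufficiently large
  maximum length, the text "daiin.shedy.qokain" contributes the single
  words daiin, shedy, and qokain, plus the phrases daiin.shedy,
  shedy.qokain, and daiin.shedy.qokain.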
 
  The LCTX and RCTX strings consist of zero or more words, including
  their word delimiters, that surround this occurrence of
  STRING.  Each context is extended until it includes a specified
  number of non-delimiter characters, or hits the boundary of the
  enclosing unit or paragraph. At least one word delimiter is always
  present.
  
  The delimiters that surrounded the STRING in the original text are
  *not* included in the string itself, but are included in the
  context.
  
  The STRING, LCTX, and RCTX fields may extend across comments,
  fillers, gaps and intrusions ("-") and ordinary line breaks ("/");
  but not across paragraph breaks ("=") or changes of textual unit.
  When the string extends across a line break, the NEWLINE character,
  the locators in columns 1-19, and any intervening #-comments are not
  included in the string, and not counted in the LENGTH. However, the
  EVA codes for word and line breaks [-/.,] are included in STRING
  and counted in LENGTH.
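
  For instance (invented lines), if one line ends with the word "dal"
  and the next begins with "shedy", a phrase spanning the break is
  recorded with STRING "dal/shedy"; the "/" counts toward LENGTH, but
  the NEWLINE, the second line's locator columns, and any intervening
  #-comments do not.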

II. THE INPUT TEXTS

II.1 THE VMS TEXT

  The VMS concordance is to be built from a majority version of the
  EVA interlinear (derived from release interln1.6e6.evt; see note
  045) and all major individual versions. Since the enum-text-phrases
  script can process only one version at a time, we must run it
  several times.
  
  First we must turn bad weirdos into "*":

    ln -s ../045/inter-cm.evt

    cat inter-cm.evt \
      | egrep -e '^(|## *)<[^;<>]*(|;[ACDFGHKLTUV])>' \
      | sed \
          -e 's/[&][*][^{}][^{}]/*\!\!\!/g' \
          -e 's/[&][^{}][*][^{}]/*\!\!\!/g' \
          -e 's/[&][^{}][^{}][*]/*\!\!\!/g' \
      | basify-weirdos \
      > vms.evt
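
  (Each of the three sed rules matches a four-character weirdo code
  with the "*" in a different position and rewrites it as "*!!!";
  e.g. "&*ab" becomes "*!!!".  The replacement has the same length as
  the original, presumably so that the START/LENGTH bookkeeping is
  unaffected; the "!" fillers are then ignored when words are
  extracted.)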

II.2 WAR OF THE WORLDS

  Creating a control text (Wells's War of the Worlds).  I took chapters
  1-16 (about 34000 words) and hacked the text into an EVMT-like
  format, as follows. 

  I added an "=" at the end of each paragraph, with emacs.  Then I
  regularized the spaces with the following script:

    gawk \
      ' \
        /^[#]/ { print; next; } \
        /^ *$/ { print; next; } \
        /^ *CHAPTER/ { print; next; }  \
        /./ {  \
          str = tolower($0); gsub(/[0-9]/, "n", str); \
          gsub(/[.;:\!?]/, ".", str); gsub(/[-,/"'"'"'()]/, " ", str); \
          gsub(/[ ]*$/, " ", str); \
          gsub(/[ ]*[.][. ]*/, "-", str); gsub(/[ ]+/, ".", str);  \
          gsub(/[-.]*[=][-.]*/, "=", str); gsub(/[.]*[-][.]*/, "-", str); \
          gsub(/^[- .]*/, "", str); print str; \
        } \
      ' 
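
  For example, an (illustrative) input line

    But who shall dwell in these worlds, if they be inhabited?

  comes out of the script above as

    but.who.shall.dwell.in.these.worlds.if.they.be.inhabited-

  with "." marking word spaces and "-" marking sentence ends.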

  Then I added location codes, with the following script:

    gawk \
      ' BEGIN { pg=0; sc=0; } \
        /^[#]/ { print; next; } \
        /^ *$/ { print "#"; \
          if(ln > 0) \
            { un++; if(un > 9) { pg++; un=1; } ln = 0; } \
          next; } \
        /^ *CHAPTER/ { gsub(/^ */, "# ", $0); print; pg++; sc++; un=1; ln=0; next; }  \
        /./ {  \
          ln++; \
          if((ln == 1) && (un == 1)) \
            { printf "f%03dr c%02d\n", pg, sc > "wow-fnum-to-sectag.tbl"; } \
          loc = sprintf("<f%03dr.P%1d.%02d;W>     ", pg, un, ln);  \
          gsub(/^[- .,=]*/, loc, $0); print; next; \
        } \
      ' 
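
  With pg = 1, un = 1, ln = 1 the printf above produces the locator
  "<f001r.P1.01;W>"; i.e. the control text is dressed up as a
  sequence of one-sided folios, with at most 9 units per page.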

II.3 LEWIS AND CLARK JOURNALS

  Creating another control text, with more erratic spelling (Lewis and
  Clark's expedition journals, by various participants, abridged and
  merged by Florentine Films).  With emacs, I deleted all editorial
  notes, prefixed dates with "@", author names with "%", and text
  lines with "|".  The result is lac.txt.

  Still with emacs, I added an "=" at the end of each paragraph. Then I
  converted the text to lowercase, and the spaces to [-=.,], with this 
  script:

    cat lac.txt \
      | gawk \
      ' \
        /^[#@%]/ { print; next; } \
        /^ *$/ { print; next; } \
        /^[|]/ {  \
          str = tolower($0); gsub(/^[|][ ]*/, "", str); gsub(/[0-9]/, "n", str); \
          gsub(/[.;:\!?]/, ".", str); gsub(/[-,/"'"'"'()]/, " ", str); \
          gsub(/[ ]*$/, " ", str); \
          gsub(/[ ]*[.][. ]*/, "-", str); gsub(/[ ]+/, ".", str);  \
          gsub(/[-.]*[=][-=.]*/, "=", str); gsub(/[.]*[-][-.]*/, "-", str); \
          gsub(/^[- .]*/, "", str); printf "|   %s\n", str; \
        } \
      ' 

  Then I made each entry into a page with the following script:

    gawk \
      ' BEGIN { pg=0; } \
        /^[#]/ {print; next;} \
        /^[@]/ {gsub(/^[@]/, "#", $0); print; next;} \
        /^[%]/ { \
          pg++; ln = 0; tr = substr($0,3); \
          gsub(/^[%]/, "#", $0); print; next; \
        } \
        /^ *$/ { print "#"; next; } \
        /^[|]/ { \
          ln++; gsub(/^[|][ ]*/, "", $0); \
          printf "<f%03dr.P.%02d;%s>     %s\n", pg, ln, tr, $0; \
        } \
      ' \
    | sed \
        -e 's/;Charles Floyd, Jr.>/;F>/g' \
        -e 's/;John Ordway>/;O>/g' \
        -e 's/;Joseph Whitehouse>/;W>/g' \
        -e 's/;Meriwether Lewis>/;L>/g' \
        -e 's/;Patrick Gass>/;G>/g' \
        -e 's/;William Clark>/;C>/g'

  Let's define the sections as the transcribers:

    cat lac.evt \
      | gawk \
          ' BEGIN{ tb["G"]="gas"; tb["L"]="lew"; tb["C"]="cla"; \
                   tb["O"]="ord"; tb["F"]="flo"; tb["W"]="whi"; \
            } \
            /[<]/ { \
              gsub(/[>].*/, "", $0); gsub(/[<]/, "", $0); \
              gsub(/[.].*[;]/, " ", $0); if($2 in tb) {$2 = tb[$2];} \
              print; next; \
            } \
          ' \
      | sort | uniq \
      > lac-fnum-to-sectag.tbl
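
  Each line of the resulting table should look like "f001r lew"
  (folio number invented for the example): the gawk strips the
  locator brackets, turns the ".unit.line;" part into a field
  separator, and maps each transcriber code to a three-letter
  section tag.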

II.4 BOOK OF ENOCH

  Another control text: The Book of Enoch in Ethiopian (Ge`ez, SERA
  encoding), edited by Michal Jerabek of Charles University.
  The starting text is 
  With emacs, I replaced each space by ".", then ":|:" by "=" plus a
  line break, and "::." by "-" (no line break); then I added line
  numbers with emacs, the script in ../../Texts/renumber-evt-lines,
  and more emacs editing.

  Handy variables: ${samples} is the list of samples to process,
  ${samvers} lists the versions to consider, and
  ${samtitmajs} lists the titles and majority versions (if any).
  
    set samples = ( vms wow lac eno )
    set samcomm = "`echo ${samples} | tr ' ' ','`"
    set vmsvers = "ACDFGHKLTUV"
    set wowvers = "W"
    set lacvers = "CLFGOW"
    set enovers = "J"
    set samvers = ( vms.${vmsvers} wow.${wowvers} lac.${lacvers} eno.${enovers} )
    set samtitmajs = ( \
      vms.VMS/A  \
      wow.War_of_the_Worlds/W \
      lac.Lewis_and_Clark/ \
      eno.1_Enoch/J \
    )
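
  (Each ${samtitmajs} entry packs three items as "sam.Title/maj";
  they are unpacked later with csh modifiers.  For vms.VMS/A:

    ${samtitmaj:h}  -> vms.VMS      ${samtitmaj:t}  -> A
    ${samtit:r}     -> vms          ${samtit:e}     -> VMS

  where samtit holds the ":h" part.)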

III. VALIDATION AND STATISTICS

  Checking all versions: 
  
    foreach sam ( ${samples} )
      echo ""; echo "sample = ${sam}"
      cat ${sam}.evt \
        | validate-new-evt-format \
            -v checkLineLenths=1 \
            -v chars="`cat ${sam}.chars`" \
            -v checkLineTerminators=1 \
            -v requireUnitHeaders=0 \
        >& ${sam}.bugs
      tail -3 ${sam}.bugs
    end

  Word length statistics (to select a suitable phrase length):
  
    foreach sam ( ${samples} )
      echo ""; echo "sample = ${sam}"
      cat ${sam}.evt \
        | words-from-evt \
        | egrep -v '[*?]' \
        | count-word-lengths \
        > ${sam}.wst
    end
    multicol -v titles="${samples}" `echo {${samcomm}}.wst` > .all.wst
    
vms                            wow                            lac                            eno                          
-----------------------------  -----------------------------  -----------------------------  -----------------------------
len nwords example             len nwords example             len nwords example             len nwords example           
--- ------ ------------------  --- ------ ------------------  --- ------ ------------------  --- ------ ------------------
  1   3379 y                     1   1632 a                     1   2036 a                     1      5 k                 
  2   9436 ol                    2   5266 my                    2   5936 an                    2    471 `1                
  3  15059 ary                   3   7873 had                   3   7470 the                   3    822 Sdq               
  4  28359 oror                  4   5928 come                  4   6452 city                  4   2022 lomu              
  5  40135 sheey                 5   3997 which                 5   4672 ruins                 5   3132 'Inze             
  6  31165 chckhy                6   2965 before                6   2583 appear                6   2810 teSHfe            
  7  18422 okeolan               7   2614 brother               7   2214 antient               7   3422 mewa`Il           
  8   7071 chedkaly              8   1588 stopping              8   1168 numerous              8   2272 we'azman          
  9   2525 lcheylchy             9   1003 direction             9    691 regularly             9   1775 weyeHewru         
 10    730 pchallarar           10    547 beginnings           10    401 supporting           10    867 bema`hdere        
 11    183 darailchedy          11    225 elphinstone          11    166 recollected          11    507 ytwe`hew`hu       
 12     93 lshedyoraiin         12    116 astonishment         12     84 extraodanary         12    256 weyrE'Iywomu      
 13     30 cheopolteeedy        13     54 indescribable        13     24 circumstances        13    120 weytwe`hew`hu     
 14      8 cheoltchedaiin       14     16 disintegrating       14      6 asstonishingly       14     64 'Im'Istnfasomu    
 15      4 ypchocpheosaiin      15      3 notwithstanding      15      8 notwithstanding      15     15 we'a`I`Smtihomu   
 16      1 chepchefyshdchdy     16      2 incomprehensible     16      3 counterballanced     16     16 we'itrE'Iywomunu  
 17      1 chckhoekeckhesshy                                   17      1 instancetaniously    17      4 webe`slTanatihomu 
                                                                                              18      2 weyastebeqWu`Iwomu

  Thus the longest word in the VMS is 17 characters. 
  Let's adopt that as the maximum phrase width.

    set maxlen = 17

  (Oops, the "eno" sample has an 18-character word. Must remember to allow for
  that in the formatting...)

IV. THE BASIC CONCORDANCE
  
  Creating the basic concordance, fields 1-7. We can discard all
  entries that have any "*"s in the STRING field (but "*"s in the
  context fields are OK).
  
    echo "maxlen = ${maxlen}"
    foreach samver ( ${samvers} )
      set sam = "${samver:r}"
      set ver = ( ` echo ${samver:e} | sed -e 's/./& /g'` )
      rm -f ${sam}-${maxlen}.roc 
      foreach v ( ${ver} )
        echo " "; echo "sample = ${sam} version = ${v}"
        cat ${sam}.evt \
          | egrep '^(|## *)<[^;]*(|;'"${v}"')>' \
          | enum-text-phrases -f eva2erg.gawk \
              -v maxLength=${maxlen} \
              -v leftContext=15 \
              -v rightContext=15 \
          | gawk '($6 !~ /[*]/){print;}' \
          | gzip \
          > ${sam}-${maxlen}-${v}.roc.gz
        echo "kept `zcat ${sam}-${maxlen}-${v}.roc.gz | wc -l` good phrases"
      end
    end

      sample = vms version = A
      read   38659 words
      wrote 114954 phrases
      kept  101945 good phrases

      sample = vms version = C
      read   18102 words
      wrote  54103 phrases
      kept   52849 good phrases

      sample = vms version = D
      read     284 words
      wrote    803 phrases
      kept     765 good phrases

      sample = vms version = F
      read   33316 words
      wrote  97933 phrases
      kept   96487 good phrases

      sample = vms version = G
      read    4177 words
      wrote  12097 phrases
      kept   11908 good phrases

      sample = vms version = H
      read   37919 words
      wrote 112114 phrases
      kept  110479 good phrases

      sample = vms version = K
      read     258 words
      wrote    319 phrases
      kept     317 good phrases

      sample = vms version = L
      read    1231 words
      wrote   3554 phrases
      kept    3169 good phrases

      sample = vms version = T
      read    5360 words
      wrote  14333 phrases
      kept   14323 good phrases

      sample = vms version = U
      read   11386 words
      wrote  35057 phrases
      kept   31739 good phrases

      sample = vms version = V
      read    9654 words
      wrote  30022 phrases
      kept   28869 good phrases

      sample = wow version = W
      read   33829 words
      wrote 119530 phrases
      kept  119530 good phrases

      sample = lac version = C
      read    7831 words
      wrote  28859 phrases
      kept   28859 good phrases

      sample = lac version = L
      read    5695 words
      wrote  20175 phrases
      kept   20175 good phrases

      sample = lac version = F
      read    1398 words
      wrote   5561 phrases
      kept    5561 good phrases

      sample = lac version = G
      read    4891 words
      wrote  17548 phrases
      kept   17548 good phrases

      sample = lac version = O
      read    9293 words
      wrote  36540 phrases
      kept   36540 good phrases

      sample = lac version = W
      read    4807 words
      wrote  18837 phrases
      kept   18837 good phrases

      sample = eno version = J
      read   18582 words
      wrote  39799 phrases
      kept   39799 good phrases

  Merging the VMS concordances. We sort by location, position and length,
  then transcriber code, for the sake of the context-replacement step below.
  
    zcat vms-${maxlen}-[$vmsvers].roc.gz \
      | sort +0 -1 +2 -4 +1 -2 \
      | gzip \
      > vms-${maxlen}-temp.roc.gz
      
  Removing redundant entries from the VMS concordance. We replace
  the context strings by those of the majority version, if available,
  in order to reduce the number of entries in later steps. Then we
  condense all entries that have the same location and contents, and
  differ only in transcriber code.

    zcat vms-${maxlen}-temp.roc.gz \
      | gawk \
          ' \
            ($2 == "A"){ oloc=$1; opos=$3; olen=$4; lc=$5; rc=$7; print; next;} \
            (($1==oloc)&&($3==opos)&&($4==olen)) { $5=lc; $7=rc; print; } \
          ' \
      | sort +0 -1 +2 -7 +1 -2 \
      | remove-redundant-roc-entries \
      | gzip \
      > vms-${maxlen}.roc.gz

     404905 records read
     267200 records ignored
     137705 records written
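
  To see what the replacement buys us, consider two hypothetical
  records (fields 1-7 only) that differ in one context word:

    f1r.1 A 7 5 daiin. shedy .qokain
    f1r.1 C 7 5 daiin. shedy .qokaiin

  After the gawk above, the C record inherits A's contexts; the two
  lines then differ only in the transcriber code, so
  remove-redundant-roc-entries can condense them into one entry.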

  Pretending to do the same for other samples:

    zcat wow-${maxlen}-[$wowvers].roc.gz \
      | sort +0 -1 +2 -7 +1 -2 \
      | gzip \
      > wow-${maxlen}.roc.gz

    zcat lac-${maxlen}-[$lacvers].roc.gz \
      | sort +0 -1 +2 -7 +1 -2 \
      | gzip \
      > lac-${maxlen}.roc.gz

    zcat eno-${maxlen}-[$enovers].roc.gz \
      | sort +0 -1 +2 -7 +1 -2 \
      | gzip \
      > eno-${maxlen}.roc.gz
      
  Checking the word length distribution, to make sure we didn't lose anything:
  
    foreach sam ( ${samples} )
      echo " "; echo "sample = ${sam}"
      zcat ${sam}-${maxlen}.roc.gz \
        | gawk '($6 \!~ /[-/.,=]/) { print $6; }' \
        | count-word-lengths \
        > .${sam}.wst
    end
    multicol -v titles="${samples}" `echo .{${samcomm}}.wst` > .new.wst
  
  The next step is to add to each accepted line of the concordance an
  8th field, PATT, which is obtained from STRING by identifying
  similar characters, removing spaces and q's, etc.
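
  I do not reproduce the add-*-match-key scripts here, but the
  following gawk function is a rough sketch of the kind of mapping
  involved (the letter classes are invented for illustration, not
  the real ones):

    # sketch only: derive a sorting pattern PATT from a STRING s
    function match_key(s) {
      gsub(/[-\/=,.]/, "", s);   # drop word and line-break codes
      gsub(/q/, "", s);          # drop q's
      gsub(/[kt]/, "k", s);      # identify "similar" letters
      gsub(/[pf]/, "p", s);      #   (classes invented for the example)
      return s;
    }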

    foreach sam ( ${samples} )
      echo " "; echo "sam = ${sam}"
      zcat ${sam}-${maxlen}.roc.gz \
        | add-${sam}-match-key -f eva2erg.gawk \
            -v inField=6 \
            -v outField=8 \
        | gzip \
        > ${sam}-${maxlen}.poc.gz
      echo "`zcat ${sam}-${maxlen}.poc.gz | wc` ${sam}-${maxlen}.poc"
    end

        lines   words    bytes file        
      ------- ------- -------- ------------
       137705 1101640 10955849 vms-17.poc
       119530  956240  9499018 wow-17.poc
       127520 1020160  9551708 lac-17.poc
        39799  318392  3107113 eno-17.poc

  Checking for empty patterns:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}.poc.gz \
        | gawk '(NF \!= 8){ print;}' 
    end

      (empty output)

  Let's collect the patterns and count their frequencies:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}.poc.gz \
        | gawk '/./{p=($1 ":" $3 ":" $4); if(p\!=op){print $8;} op=p;}' \
        | sort | uniq -c | expand \
        | sort +0 -1nr +1 -2 \
        > ${sam}-${maxlen}.pfr

      cat ${sam}-${maxlen}.pfr \
        | gawk '($1 > 1){print}' \
        > ${sam}-${maxlen}-nonu.pfr

      dicio-wc ${sam}-${maxlen}{,-nonu}.pfr
    end

        lines   words     bytes file        
      ------- ------- --------- ------------
        48539   97078   1028787 vms-17.pfr
         6778   13556    123522 vms-17-nonu.pfr

        lines   words     bytes file        
      ------- ------- --------- ------------
        73545  147090   1530446 wow-17.pfr
         8121   16242    140188 wow-17-nonu.pfr

        lines   words     bytes file        
      ------- ------- --------- ------------
        68984  137968   1171549 lac-17.pfr
         9978   19956    150836 lac-17-nonu.pfr

        lines   words     bytes file        
      ------- ------- --------- ------------
        23328   46656    464196 eno-17.pfr
         3602    7204     61830 eno-17-nonu.pfr

  Then we extract the FNUM temporarily as field 9, and use it to look
  up the section tag STAG (hea, heb, bio, etc.), which becomes the
  final field 9, and the page's p-number PNUM, which becomes
  field 10:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}.poc.gz \
        | gawk '/./{f=$1;gsub(/[.].*$/,"",f); print $0,f;}' \
        | map-field \
            -v inField=9 \
            -v outField=9 \
            -v table=${sam}-fnum-to-sectag.tbl \
        | map-field \
            -v inField=10 \
            -v outField=10 \
            -v table=${sam}-fnum-to-pnum.tbl \
        | gawk '/./{NF = 10;print;}' \
        | gzip \
        > ${sam}-${maxlen}.soc.gz
      echo "`zcat ${sam}-${maxlen}.soc.gz | wc` ${sam}-${maxlen}.soc"
    end

        lines   words     bytes file        
      ------- ------- --------- ------------
       137705 1377050  12195194 vms-17.soc
       119530 1195300  10574788 wow-17.soc
       127520 1275200  10699388 lac-17.soc
        39799  397990   3465304 eno-17.soc

  Next we sort the concordance by pattern, actual string (minus
  blanks), section, and page.  For this purpose we add a temporary
  11th field which is just STRING, RCTX and LCTX concatenated
  without blanks.  This field is removed after the sort.  (In the
  old sort syntax used below, "+m -n" selects fields m+1 through n,
  so "+7 -8 +10 -11 +8 -9 +9 -10" sorts by PATT, then the temporary
  key, then STAG, then PNUM.)

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}.soc.gz \
        | gawk \
            ' /./{ \
                key = gensub(/[-/=,. ]/,"","g",($6 $7 $5)); \
                print $0, key; } \
            ' \
        | sort +7 -8 +10 -11 +8 -9 +9 -10  \
        | gawk '/./{NF = 10; print}' \
        | gzip \
        > ${sam}-${maxlen}-srt.soc.gz
      echo "`zcat ${sam}-${maxlen}-srt.soc.gz | wc` ${sam}-${maxlen}-srt.soc"
    end

        lines   words     bytes file        
      ------- ------- --------- ------------
       137705 1377050  12195194 vms-17-srt.soc
       119530 1195300  10574788 wow-17-srt.soc
       127520 1275200  10699388 lac-17-srt.soc
        39799  397990   3465304 eno-17-srt.soc

  Let's count the number of entries according to the initial letter
  of the pattern:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"; echo " "
      zcat ${sam}-${maxlen}-srt.soc.gz \
        | gawk '//{print substr($8, 1, 1);}' \
        | sort | uniq -c | expand \
        > .${sam}.lstat
    end
    multicol  -v titles="${samples}" `echo .{${samcomm}}.lstat` > .all.lstat

       vms        wow        lac        eno      
    ---------  ---------  ---------  ---------
      14454 d    15973 d     2089 &     1528 0
      44363 e     6278 e     5941 b        2 a
         48 i    11239 i     3208 d     2722 b
       4921 l    18062 n     8500 e      723 d
         36 n    24588 o     5601 f     6871 e
      62935 o     5941 p     2155 g     4009 j
         15 q     2590 r     6323 h     4900 k
      10766 t     8958 s      758 j     2559 l
         90 v    25892 t     6133 k     2917 m
         77 x        9 z     2616 l      504 n
                             9537 n      573 r
                            37502 o     2097 s
                             3762 p     1179 t
                             2620 r       46 u
                             7148 s     9169 v
                            22227 t           
                             1400 x           

  There are too many strings to build a single HTML file with the concordance.
  So let's extract only those patterns that include at least two entries, or
  whose STRING is a single word:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}-srt.soc.gz \
        | mark-interesting-patterns \
        | gawk '($11 == "+"){NF = 10; print;}' \
        | gzip \
        > ${sam}-${maxlen}-ok.soc.gz
      echo "`zcat ${sam}-${maxlen}-ok.soc.gz | wc` ${sam}-${maxlen}-ok.soc"
    end

    vms
      77322 records marked "+"
      60383 records marked "-"
      77322  773220 6356963 vms-17-ok.soc

    wow
      56292 records marked "+"
      63238 records marked "-"
      56292  562920 4506071 wow-17-ok.soc

    lac
      69402 records marked "+"
      58118 records marked "-"
      69402  694020 5401726 lac-17-ok.soc

    eno
      23628 records marked "+"
      16171 records marked "-"
      23628  236280 1922173 eno-17-ok.soc
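
  For the record, here is a sketch of what mark-interesting-patterns
  is assumed to do (my reading of the rule above, not the actual
  script): buffer the entries of each pattern (the input is already
  sorted by PATT, field 8) and append "+" or "-" as an 11th field:

    function flush(  i) {
      for (i = 1; i <= n; i++) { print buf[i], mark; }
      n = 0; mark = "-";
    }
    ($8 != ppat) { flush(); ppat = $8; }
    { n++; buf[n] = $0;
      # keep if >= 2 entries, or STRING is a single word:
      if (n >= 2 || $6 !~ /[-\/=,.]/) { mark = "+"; }
    }
    END { flush(); }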

  Let's compute the max size of the fields, with dots and all:

    foreach field ( 1 2 5 6 7 8 9 )
      echo " "
      echo "=== field ${field} ==="
      foreach sam ( ${samples} )
        echo " "; echo "${sam}"
        zcat ${sam}-${maxlen}-ok.soc.gz \
          | gawk -v n=${field} '/./ {print $(n);}' \
          | count-word-lengths \
          > .${sam}-${field}.szs
      end
      echo "field ${field} ===" > .all-${field}.szs
      echo ' ' >> .all-${field}.szs
      multicol -v titles="${samples}" `echo .{${samcomm}}-${field}.szs` \
        >> .all-${field}.szs
    end
    
    field 1 - max 12
    field 2 - max  6
    field 5 - max 33, can be clipped at 20 or less.
    field 6 - max 24
    field 7 - max 32, can be clipped at 20 or less.
    field 8 - max 17

  I recorded these numbers as shell variables:

    set maxlft = 33
    set maxstr = 24
    set maxrht = 32

V. THE HTML CONCORDANCE

  Let's split the HTML-formatted concordance into pages of about
  1000 lines each.  We will add an 11th field, HNUM, which is the
  HTML page number (counting from 0):

    set pgsize = 1000;
    echo "pgsize = $pgsize"
    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}-ok.soc.gz \
        | gawk -v pgsize=${pgsize} \
            ' /./{ \
                pat=$8; \
                if ((pat \!= ppat) && (NR-ndone > pgsize)) \
                  { pg++; ndone = NR-1; } \
                printf "%s %03d\n", $0, pg; \
                ppat = pat; \
              } \
            ' \
        | gzip \
        > ${sam}-${maxlen}-ok.hoc.gz
      echo "`zcat ${sam}-${maxlen}-ok.hoc.gz | wc` ${sam}-${maxlen}-ok.hoc"
    end

        lines   words   bytes file        
      ------- ------- ------- ------------
        77322  850542 6666251 vms-17-ok.hoc
        56292  619212 4731239 wow-17-ok.hoc
        69402  763422 5679334 lac-17-ok.hoc
        23628  259908 2016685 eno-17-ok.hoc

  Checking the number of strings per page:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      zcat ${sam}-${maxlen}-ok.hoc.gz \
        | gawk '/./{ print $11; }' \
        | sort | uniq -c | expand \
        > .${sam}.pgsizes
    end
    multicol  -v titles="${samples}" `echo .{${samcomm}}.pgsizes`

    vms          wow          lac          eno        
    -----------  -----------  -----------  -----------
       1295 000     1014 000     1106 000     1037 000
       1002 001     1000 001     1000 001     1002 001
       1002 002     1075 002     1004 002     1010 002
       1244 003     1250 003     1052 003     1041 003
       1004 004     1022 004     1001 004     1006 004
       1003 005     1038 005     1009 005     1000 005
       1002 006     1001 006     1167 006     1000 006
       1001 007     1000 007     1092 007     1001 007
       1300 008     1001 008     1107 008     1000 008
       1001 009     1001 009     1123 009     1000 009
       1001 010     1378 010     1158 010     1003 010
       1006 011     1001 011     1013 011     1093 011
       1061 012     1000 012     1064 012     1007 012
       1146 013     1003 013     1000 013     1001 013
       1037 014     1033 014     1002 014     1002 014
       1008 015     1002 015     1004 015     1001 015
       1012 016     1215 016     1006 016     1001 016
       1024 017     1013 017     1052 017     1056 017
       1002 018     1021 018     1003 018     1000 018
       1058 019     1128 019     1036 019     1000 019
       1018 020     1303 020     1000 020     1000 020
       1000 021     1749 021     1001 021     1000 021
       1033 022     1027 022     1049 022     1000 022
       1019 023     2079 023     1014 023      367 023
       1005 024     1003 024     1008 024             
       1000 025     1001 025     1024 025             
       1000 026     1000 026     2714 026             
       1004 027     1133 027     2280 027             
       1002 028     2003 028     1008 028             
       1089 029     1479 029     1001 029             
       1002 030     1001 030     1002 030             
       1010 031     1009 031     1121 031             
       1001 032     1003 032     2477 032             
       1016 033     1020 033     1003 033             
       1576 034     1001 034     1005 034             
       1001 035     1108 035     1032 035             
       1019 036     1000 036     1004 036             
       1003 037     3720 037     1618 037             
       1002 038     1003 038     1000 038             
       1003 039     1000 039     1144 039             
       1588 040     1147 040     1000 040             
       1001 041     1001 041     1002 041             
       1000 042     1170 042     1016 042             
       1003 043     1061 043     1000 043             
       1007 044     1400 044     1001 044             
       1686 045     1001 045     1029 045             
       1018 046     1026 046     1050 046             
       1002 047      648 047     1001 047             
       1000 048                  3479 048             
       1021 049                  1063 049             
       1001 050                  1015 050             
       1330 051                  1102 051             
       1002 052                  1000 052             
       1000 053                  1011 053             
       1295 054                  1314 054             
       1135 055                  1005 055             
       1134 056                  1000 056             
       1595 057                  1001 057             
       1001 058                  1020 058             
       2680 059                   789 059             
       1010 060                                       
       1000 061                                       
       1067 062                                       
       1013 063                                       
       1003 064                                       
       1003 065                                       
       1150 066                                       
       1002 067                                       
       1000 068                                       
       1005 069                                       
        558 070                                       

  Now let's split the file into pages:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      set dir = "${sam}-pages-new"
      mkdir ${dir}
      rm -f /tmp/${sam}-???.hoc
      zcat ${sam}-${maxlen}-ok.hoc.gz \
        | gawk -v sam=${sam} \
            ' BEGIN { ppg = ""; } \
              /./{ \
                pg = $11; \
                if (pg \!= ppg) \
                  { if(ppg \!= "") { close(wr); } \
                    wr = ("/tmp/" sam "-" pg ".hoc"); ppg = pg; \
                    printf "%s\n", pg; \
                  } \
                print > wr; \
              } \
              END { if(ppg \!= "") { close(wr); } } \
            ' \
        > ${sam}-pages-new/all.nums
      cat ${sam}-pages-new/all.nums
    end

  Let's format the pages:            

    cat /tmp/vms-005.hoc \
      | format-new-concordance \
          -v title="Test Concordance - Section 005" \
          -v prevPage=004 -v nextPage=006 \
          -v maxLeft=20 -v maxString=${maxstr} -v maxRight=20 \
          -v majTran=A \
      > test.html

    foreach samtitmaj ( ${samtitmajs} )
      set samtit = "${samtitmaj:h}"
      set maj = "${samtitmaj:t}"
      set sam = "${samtit:r}"
      set tit = "`echo ${samtit:e} | tr '_' ' '`"
      echo " "; echo "${sam} (${tit}) maj = '${maj}'"
      rm -f ${sam}-pages-new/*.html
      set pgs = ( `cat ${sam}-pages-new/all.nums` )
      set nxs = ( $pgs[2-] )
      set prev = "";
      foreach page ( $pgs )
        if ( $#nxs > 0 ) then
          set next = "$nxs[1]"; shift nxs;
        else
          set next = ""
        endif
        echo "${page} prev = $prev next = $next"
        cat /tmp/${sam}-${page}.hoc \
          | format-new-concordance \
              -v title="${tit} Concordance - Section ${page}" \
              -v prevPage="$prev" -v nextPage="$next" \
              -v maxLeft=20 -v maxString=${maxstr} -v maxRight=20 \
              -v majTran="${maj}" \
          > ${sam}-pages-new/${page}.html
        set prev = "$page"
      end
    end

  Let's create the alphabetic index file:

    foreach samtitmaj ( ${samtitmajs} )
      set samtit = "${samtitmaj:h}"
      set sam = "${samtit:r}"
      set tit = "`echo ${samtit:e} | tr '_' ' '`"
      echo " "; echo "${sam} (${tit})"
      zcat ${sam}-${maxlen}-ok.hoc.gz \
        | gawk '/./{ print $6,$11,$8,gensub(/[-/=,.]/, "", "g", $6); }' \
        | sort +3 -4 +1 -2n \
        | uniq \
        | gawk '/./{ NF=3; print; }' \
        | format-string-index \
            -title "${tit} Concordance - Word and Phrase index" \
        > ${sam}-pages-new/index.html
    end

VI. PUBLISHING

  Preparing a compressed archive:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      rm -f ${sam}-pages-new/all.zip
      ( cd ${sam}-pages-new && zip -klv all index.html ???.html )
    end

  Preparing a zip-compressed version of the machine concordance
  for the benefit of Windows/DOS users:
 
    zcat vms-${maxlen}-ok.soc.gz | zip vms-${maxlen}-ok.zip -

  Installing the pages:

    foreach sam ( ${samples} )
      echo " "; echo "${sam}"
      mv ${sam}-pages ${sam}-pages-old
      mv ${sam}-pages-new ${sam}-pages
      touch ${sam}-pages/.www_browsable
    end

  See Analysis-dam-draft.txt