Hacking at the Voynich manuscript - Side notes
010 Word distribution maps

  This is partly a remake of work from Notebook-2.txt, originally done around
  97-07-05.

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]

    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.

    Then I went back and started redoing some of the previous tasks
    using the new encoding.

97-12-04 stolfi
===============

  I decided it was time to rebuild the word and label location maps, with
  EVA-based encoding, in the light of the three-way paradigm.

  The main intermediate file for a location map is a "raw concordance".
  Each line of this file represents one occurrence of some string 
  in the text, in the format

    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
    1    2    3    4    5     6     7      8   9      10 

  where

    PNUM       is a sequential page number, "001" to "234".
    
    FNUM       is the corresponding folio-based page number, 
               e.g. "f1r", "f86v5", "f106v".

    UNIT       is the code of a text unit within the page, e.g. "P" or "R1"
    
    LINE       is the code of a line within that unit, e.g. "27" or "10a"

    TRANS      is a letter identifying a transcriber, e.g. "F" for Friedman

    START      is the index of the first byte of the occurrence
               in the text line (counting from 1).

    LENGTH     is the original length of the occurrence in the text, 
               including fillers, comments, spaces, etc.

    POS        is a number giving the approximate position
               of the occurrence within the whole text; used
               for sorting, etc.

    STRING     is the non-empty string in question, without any fillers,
               comments, non-significant spaces, line breaks, etc.
    
    OBS        is an arbitrary non-empty string, without embedded blanks.
    
  The STRING field may extend across one or more line breaks.
  In that case the line breaks are not included in the string and
  not counted in the LENGTH.
  
  For EVA format text, the START is relative to column 20.  In that
  case LENGTH does not include columns 1-19 and "#"-comments.  It does
  include "{}"-comments, "!" and "%" fillers, and any ASCII blanks
  beyond column 20.
  
  The START and LENGTH fields are used only by programs that 
  list or highlight the occurrences in the original text. 
  They may be 0 if not known.
  
  Similarly the POS field is used only for computing positional
  correlations and building block-based (as opposed to page-based)
  occurrence maps.  It too can be 0 if not known.
  
  For word-based maps, the STRING is a single and whole Vms word,
  delimited by EVA word separators [-=,.]; or a sequence of two or
  more consecutive words.  The string may extend across comments,
  fillers, and ordinary line breaks ("-"); but not across paragraph
  breaks ("=") or changes of textual unit.  (To simplify processing,
  the words in the STRING field are always separated by a single ".",
  irrespective of the original separators used in the text.  The
  delimiters surrounding the STRING are *not* included.)
  
  In word-based concordances, a reasonable choice for the POS field is
  the number of non-empty words preceding the occurrence in the
  sample.
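
  For instance, a (made-up) entry for a two-word occurrence of
  "daiin.ol" might read

    043 f22r P 3 F 20 9 1234 daiin.ol -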

97-11-17 stolfi
===============

  First, I collected some "interesting" words, the ones whose
  distribution is worth mapping, and which may have non-trivial
  semantics attached.
  
  From a message by R. Zandbergen I got the labels of f67r2, and
  entered them as a new text unit, L16-eva/f67r2.L.  (Robert Firth once
  conjectured they were the Ptolemaic planets.)

  Using data posted by John Grove, I split several of my textual units
  (L16-eva/f*) into smaller units, distinguishing real "parags" from
  his so-called "titles". (Almost all of his "titles" are actually
  short lines placed at the *end* of a paragraph).

  The affected files are:

    f1r.P    -> f1r.P1 f1r.T1 f1r.P2 f1r.T2 f1r.P3 f1r.T3 f1r.P4 f1r.T4
    f8r.P    -> f8r.P1 f8r.T1 f8r.P2 f8r.T2 f8r.P3 f8r.T3
    f9r.P    -> f9r.P f9r.T
    f16r.P   -> f16r.P1 f16r.T1 f16r.P2
    f18r.P   -> f18r.P f18r.T
    f19v.P   -> f19v.P f19v.T
    f22v.P   -> f22v.P f22v.T
    f24r.P   -> f24r.P f24r.T
    f25r.P   -> f25r.P f25r.T
    f27r.P   -> f27r.P f27r.T
    f28v.P   -> f28v.P1 f28v.T1 f28v.P2 f28v.T2
    f31r.P   -> f31r.P f31r.T
    f39r.P   -> f39r.P f39r.T
    f40v.P   -> f40v.P f40v.T
    f41v.P   -> f41v.P f41v.T
    f42r.P   -> f42r.P1 f42r.T1 f42r.P2 f42r.T2 f42r.P3 f42r.T3
    f42v.P   -> f42v.P f42v.T
    (new)    -> f57v.T
    (new)    -> f58v.T
    (new)    -> f65r.L
    (old)    -> f66r.W  {entered months ago}
    f82r.P   -> f82r.P1 f82r.T1 f82r.P2
    (new)    -> f85r2.T
    f85r1.P  -> f85r1.P f85r1.T
    f86v5.P  -> f86v5.P f86v5.T
    f94r.P   -> f94r.P f94r.T
    f101v1.P -> f101v1.P f101v1.T
    f101v2.P -> f101v2.P f101v2.T
    f105r.P  -> f105r.P1 f105r.T1 f105r.P2 f105r.T2
    f108v.P  -> f108v.P f108v.T
    f114r.P  -> f114r.P1 f114r.T1 f114r.P2 f114r.T2

  I collected all the labels, titles, and isolated words in one big file:

    cat L16-eva/INDEX \
      | egrep -e '^[^:]*:[^:]*:[^:]*:[^:]*:(labels|words|titles):' \
      | sed -e 's/:.*$//g' \
      > lwt.units

    cat `cat lwt.units | sed -e 's/^/L16-eva\//g'` \
      > labtit.evt

  I then reformatted this data by hand, producing a
  small database of "interesting words and phrases", called
  labtit.idx.
  
  Each line of this file describes one line of a label or title.  There
  are six fields, separated by "|":
   
     LOCATION|LABEL|CLASS|MEANING|SECTION|COMMENTS
     1        2     3     4       5       6
  
  where

    LOCATION  a location code, e.g. "f86v5.T2.5a;C"

    LABEL     full label or title, in EVA, with EVA word separators

    CLASS     class of label/title, one of
                P  label of plant/vessel, mostly in pharma and herbal sections.
                T  short "title" line under a paragraph, various sections.
                I  "item" label in the list of f66r.
                B  label on "biological" illustration like f77v.
                N  conjectured planet name from f67r1.
                S  star label on astronomical maps.
                Z  label on "day" sectors of zodiac pictures.
                A  other label on astro/cosmo/zodiac diagram.

    MEANING    conjectured meaning, ending with "?"; or just "?" if unknown.

    SECTION    "herbal", "bio", etc. as in the L16-eva/INDEX file.

    COMMENTS   free format, with "_" instead of blanks, or just "-".

  Example:

    f100r.t.5;C|sar.chas-daiind|P|plant?|pharma|two_line_label

97-12-06 stolfi
===============

  Next I obtained the text proper.
  
  Since the frequency of occurrence of a label may depend on the
  transcriber, it is important that we use a single transcription to
  build the block-based map.

  It is not advisable to use a mechanical consensus version for that.
  For one thing, doing so requires mapping the text to an
  error-tolerant alphabet, which makes the resulting map less valuable
  and preempts some useful options, such as strict matching. More
  importantly, the consensus-builder will tend to eliminate certain
  easy-to-misread words (such as those ending with -i*n) in sections
  where there are two versions, and keep them where there is only one
  version---which only adds more noise to an already noisy map.

  I chose the Friedman version (first [|] alternative, code ";F")
  because it was the most complete.  
  
  I made a list of all normal-text units, in binding order:

    cat L16-eva/INDEX \
      | gawk \
          ' BEGIN{FS=":"} \
            ($5 \!~ /^(labels|letters|words|titles|[-?])/){print $1;}' \
      > vtx.units

  Then I concatenated all units together, keeping only
  Friedman's transcription "F":
  
    cat `cat vtx.units | sed -e 's/^/L16-eva\//g'` \
      | gawk '/^#/{next;} ($1 ~ /;F>/){print;}' \
      > vtx-f-eva.evt

    dicio-wc vtx-f-eva.evt

     lines   words     bytes file        
    ------ ------- --------- ------------
      3901    7867    281253 vtx-f-eva.evt
  
  Then I made a complete concordance from this file:
  
    cat vtx-f-eva.evt \
      | enum-text-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | map-field \
          -v inField=1 \
          -v outField=1 \
          -v table=fnum-to-pnum.tbl \
      | gawk '/./{print $0, "-";}' \
      > vtx-f-eva-15.roc
  
    read   33256 words
    wrote  85626 phrases
    
    cat vtx-f-eva-15.roc \
      | gawk '($9 \!~ /[.?*]/) {print;}' \
      | ( printf "good words = "; wc -l )

    good words =   33026
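
  (For reference: map-field is a local helper script.  As used in
  these pipelines, its effect is roughly that of the following gawk
  sketch, not the actual source:)

    # map-field (sketch): look up field inField in a two-column table,
    # store the result in field outField (appending fields if needed).
    # Run as: gawk -f map-field.gawk -v inField=1 -v outField=11 -v table=FILE
    BEGIN {
      while ((getline line < table) > 0)
        { split(line, fld, " "); map[fld[1]] = fld[2]; }
      close(table);
    }
    /./ { $outField = map[$inField]; print; }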

  Checking the length distribution of single words:
    
    cat vtx-f-eva-15.roc \
      | gawk '($9 \!~ /[.]/) {print $9;}' \
      | count-word-lengths

    len nwords example           
    --- ------ ------------------
      1    339 l
      2   1753 ol
      3   3239 ary
      4   6033 aiin
      5   8900 sodal
      6   6784 chckhy
      7   4167 okeolan
      8   1435 oqokaiin
      9    453 orchcthdy
     10    114 ykchedaiin
     11     25 chedyotaiin
     12      8 lshedyoraiin
     13      4 aiinaiiiriiir
     14      1 *******chedylo
     15      1 pchodolchopchal
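
  (count-word-lengths is likewise a local filter; a minimal gawk
  sketch of its apparent behavior, not the actual script:)

    # count-word-lengths (sketch): tally string lengths, keeping one
    # example per length; prints in the "len nwords example" layout.
    /./ { n = length($1); ct[n]++; ex[n] = $1; }
    END {
      for (n = 1; n <= 999; n++)
        if (n in ct) printf "%3d %6d %s\n", n, ct[n], ex[n];
    }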
     
  Checking length distribution with dots and all:
    
    cat vtx-f-eva-15.roc \
      | gawk '/./ {print $9;}' \
      | count-word-lengths
      
    len nwords example           
    --- ------ ------------------
      1    339 l
      2   1753 ol
      3   3249 ary
      4   6111 aiin
      5   9096 sodal
      6   7344 chckhy
      7   5440 ol.lkan
      8   3864 aiin.ary
      9   4228 cheey.qor
     10   4956 chckhy.qol
     11   5822 chal.chcthy
     12   6173 qol.aiin.ary
     13   5786 chcthy.chckhy
     14   5303 cheey.qor.aram
     15   4932 chckhy.qol.aiin
     16   4935 qor.aram.ol.lkan
     17   4905 chcthy.chckhy.qol
     18   1265 ol.lkan.sodal.chal
     19    119 qokam.cham.**.ar.al
     20      6 lol.tar.shr.r.ol.ols

  I recorded this maximum length as a shell variable:
  
    set maxlen = 20
  
  Next I made a similar raw concordance file for the words
  and phrases in "labtit.idx".  The space and possible-space codes ".",
  ",", "-" were interpreted as word breaks. (This policy maximizes the
  chance of finding the label in the text.)  The POS field was set to
  zero.  I considered multi-word phrases up to 15 characters long,
  which was the maximum length of any label or title word present in
  the database.  Once again
  
    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS
    1    2    3    4    5     6     7      8   9      10
    
  where the OBS field is formed from the CLASS and MEANING fields of
  the label/title database.
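
  (For the f100r example above, the OBS field would then be some
  single token combining "P" and "plant?", say "P:plant?"; the exact
  format is not important downstream, as long as it contains no blanks.)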

  I also eliminated identical phrases with the same location and offset,
  which originated from different but concordant transcriptions of the same 
  label.

    cat labtit.idx \
      | enum-label-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | map-field \
          -v inField=1 \
          -v outField=1 \
          -v table=fnum-to-pnum.tbl \
      | sort -b +8 -9 +1 -4 \
      | gawk \
          ' /./ { \
              f = $2; u = $3; l = $4; w = $9; \
              if ((w==wa)&&(f==fa)&&(u==ua)&&(l==la)) next; \
              print; wa = w; fa = f; ua = u; la = l; \
            } \
          ' \
      | sort -b +0 -1n +2 -5 +5 -6n \
      > labtit-def.roc

  Checking format:

    foreach f ( labtit-def vtx-f-eva-15 )
      echo " "; echo '=== '$f
      cat ${f}.roc \
        | egrep -v '^[0-9][0-9][0-9] f[0-9]+[rv][1-6]? [A-Za-z][0-9]* [0-9]+[a-z]? [A-Z] [0-9]+ [0-9]+ [0-9]+ [a-z.*?]+ [^ ]+$'
    end

  Just for the record, I extracted the good label and title words by themselves:

    cat labtit-def.roc \
      | gawk '($9 \!~ /[.*?]/) {print $9;}' \
      | sort | uniq \
      > labtit.dic
    dicio-wc labtit.dic

     lines   words     bytes file        
    ------ ------- --------- ------------
       526     526      3656 labtit.dic

  Just to make sure, I checked the size distribution of these 
  words:

    cat labtit.dic \
      | count-word-lengths

    len nwords example           
    --- ------ ------------------
      1      5 y
      2     12 yy
      3     28 sar
      4     80 ykdy
      5    110 ytain
      6    121 yteody
      7     84 qokeedy
      8     68 ychekchy
      9     27 ydaraishy
     10      4 sochorcfhy
     11      4 ilnirsireik
     12      2 saloiinsheol
     13      2 otcholcheaiin

  So the choice of maxlen=15 was quite reasonable.
      
  The next step was to assign each location in the text to a certain
  "bin".  Ideally each bin should contain the same number of good words,
  so that the number of matches in a bin is proportional to the
  local density of references, undisturbed by bin size.
 
  The results are hard to interpret if we mix different languages and
  subject matters in the same bin. Ideally a bin should contain a
  single language and section. I.e. we should split the text into
  "divisions", each containing a maximal set of pages from the same
  section and language; and then split each division into equal-sized
  bins, as evenly as possible.
  
  To try this idea, I created a file that maps textual unit to section
  and language.  I started from the L16-eva/INDEX file:

    cat L16-eva/INDEX \
      | gawk 'BEGIN{FS=":"} /./{print $1, ($2 "." $3)}' \
      > unit-to-division.tbl
 
  Then I counted the number of good words in each division:
  
    cat vtx-f-eva-15.roc \
      | gawk '($9 \!~ /[.?*]/){print ($2 "." $3);}' \
      | map-field \
          -v inField=1 \
          -v outField=1 \
          -v table=unit-to-division.tbl \
      | gawk '{print $1}' \
      | sort | uniq -c | expand

      words division
      ----- ---------
        687 ?.A
       1462 ?.B
        173 astro.?
       6690 bio.B
        170 cosmo.?
        139 cosmo.B
       7571 herbal.A
       3336 herbal.B
       2171 pharma.A
      10627 stars.B

  Unfortunately some divisions are too small, so this approach would
  be messy --- it would leave too many leftover blocks.
  
  A compromise is to separate the two languages only (ignoring
  sections), and then divide each group into blocks containing the
  same number of good words.
  
  To that end, I needed a table mapping page p-numbers to
  language:
  
    cat L16-eva/INDEX \
      | gawk 'BEGIN{FS=":"} ($6 \!= "-"){gsub(/p/,"",$6); print $6, $3;}' \
      | sort | uniq \
      > pnum-to-language.tbl
    
  Pages should be homogeneous with respect to language, but 
  let's check anyway:

    cat pnum-to-language.tbl \
      | gawk '/./{if($1==p) print; p=$1;}' 
  
  Let's count how many good words we got for each language:
  
    cat vtx-f-eva-15.roc \
      | gawk '($9 \!~ /[.?*]/){print $1;}' \
      | map-field \
          -v inField=1 \
          -v outField=1 \
          -v table=pnum-to-language.tbl \
      | gawk '{print $1}' \
      | sort | uniq -c | expand

    words lang
   ------ ----
      343 ?
    10429 A
    22254 B    
    
  (Note that 343 + 10429 + 22254 = 33026, matching the good-word count
  found above.)  We could have 10 blocks of A, 21 blocks of B, and one
  odd block of indeterminate language in between, for a total of 32
  blocks:
  
    10429/10 = 1042.9, i.e. A blocks of 1043 good words;
    22254/21 = 1059.7, i.e. B blocks of 1060 good words.
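
  These sizes indeed cover the counts, with the last block of each
  language coming out slightly short:

    10*1043 = 10430 >= 10429   (the last A block gets 10429 - 9*1043 = 1042)
    21*1060 = 22260 >= 22254   (the last B block gets 22254 - 20*1060 = 1054)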

  For the main text concordance, the block index can be computed by
  counting good single words per language (after sorting by position
  and length) and dividing the word count by the desired block sizes.
  I added the block number at the end of the line:
  
    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK
    1    2    3    4    5     6     7      8   9      10  11   12
  
  Here it is:

    set blNumA = 10
    set blNumB = 21
    @ nblocks = ${blNumA} + 1 + ${blNumB}
    cat vtx-f-eva-15.roc \
      | sort +7 -8n +6 -7n \
      | map-field \
          -v inField=1 \
          -v outField=11 \
          -v table=pnum-to-language.tbl \
      | gawk \
          -v blNumA=${blNumA} -v blNumB=${blNumB} \
          ' BEGIN { \
              lo["A"] = 00; lo["?"] = blNumA; lo["B"] = blNumA+1; \
              sz["A"] = 1043; sz["?"] = 1050; sz["B"] = 1060; \
            } \
            ($9 \!~ /[?*]/) { \
              word = $9; lang = $11; \
              if (word \!~ /[.]/) n[lang]++; \
              bnum = lo[lang] + int((n[lang]-1)/sz[lang]); \
              print $0, sprintf("%02d", bnum); \
            } \
          ' \
      > vtx-f-eva-15.boc
  
  For the label and title concordance, the block number must be
  estimated indirectly from the page number.  To do that we need a
  table that maps sequential page numbers to blocks.
  
  To create that file, I first extracted from the block-oriented text
  concordance above a file with one line per single word (not phrase),
  containing only
  
    PNUM   the sequential page number, e.g. "023"
    
    FNUM   the page f-number, e.g. "f103r2" (just in case).
    
    LANG   the language, "A", "?", or "B".
    
    BLOCK  the sequential block number.
    
  Here it is:
  
    cat vtx-f-eva-15.boc \
      | gawk \
          ' ($9 \!~ /[.?*]/) { \
              lang = $11; bnum = $12; pnum = $1; fnum = $2; \
              print pnum, fnum, lang, bnum; \
            } ' \
      > vtx-f-eva-15.blk
      
  As a sanity check, I verified that all blocks indeed had comparable
  numbers of good words, and were language-homogeneous:
  
    cat vtx-f-eva-15.blk \
      | gawk '/./{print $4, $3}' \
      | sort +0 -1n +1 -2 | uniq -c | expand
  
       1043 00 A
       1043 01 A
       1043 02 A
       1043 03 A
       1043 04 A
       1043 05 A
       1043 06 A
       1043 07 A
       1043 08 A
       1042 09 A
        343 10 ?
       1060 11 B
       1060 12 B
       1060 13 B
       1060 14 B
       1060 15 B
       1060 16 B
       1060 17 B
       1060 18 B
       1060 19 B
       1060 20 B
       1060 21 B
       1060 22 B
       1060 23 B
       1060 24 B
       1060 25 B
       1060 26 B
       1060 27 B
       1060 28 B
       1060 29 B
       1060 30 B
       1054 31 B

  From this file I extracted a table that gives the min, average, and
  max block number in each page (the min is tracked as the largest
  value of 1000-bn, since gawk array entries start out as zero):
  
    cat vtx-f-eva-15.blk \
      | gawk \
          ' /./ { \
              pn = $1; lang = $3; bn = ($4 + 0); bc = 1000-bn; \
              if (bc > lo[pn]) lo[pn] = bc; \
              if (bn > hi[pn]) hi[pn] = bn; \
              sb[pn] += bn; ct[pn] ++; \
              lg[pn,lang] = lang; \
            } \
            END {  \
              for(pn in ct) \
                { lang = (lg[pn,"A"] lg[pn,"?"] lg[pn,"B"]); \
                  bn=int(sb[pn]/ct[pn]+0.5); \
                  printf "%03d %s %2d %2d %2d\n", pn, lang, 1000-lo[pn], bn, hi[pn]; \
                } \
            } \
          ' \
      | sort +0 -1n \
      > pnum-block-ranges.tbl
      
  This list omits pages that have no transcribed plain text.  Since
  those pages may have labels, we need to assign them by proximity.
  The sort below places each page's 5-field block-range line just
  before its 2-field language line, so a page without text inherits
  the block of the nearest preceding page in the same language:
  
    cat pnum-block-ranges.tbl pnum-to-language.tbl \
      | sort -b +0 -1n +1 -2 +3 -4r \
      | gawk \
          -v blNumA=${blNumA} -v blNumB=${blNumB} \
          ' BEGIN { bn["A"] = 0; bn["?"] = blNumA; bn["B"] = blNumA+1; } \
            /./ { \
              pnum = $1; lang = $2; \
              if (NF > 2 ) { bn[lang] = $4; next; } \
              else { printf "%03d %02d\n", pnum, bn[lang]; }\
            } \
          ' \
      > pnum-to-block.tbl

  For the table headers, we also need a table that lists the f-number
  of the first page in each block:
  
    cat vtx-f-eva-15.blk \
      | sort +3 -4n +0 -1n \
      | gawk 'BEGIN{bo="?"} /./{b=$4;f=$2;if(b\!=bo){print b,f;} bo=b;}' \
      > block-to-first-fnum.tbl

    cat block-to-first-fnum.tbl \
      | gawk '/./{print $2;}' \
      | rotate-labels -v width=3 -v shift=0 \
      > block-headings.txt
    
  I also counted the number of good words per page:
  
    cat vtx-f-eva-15.blk \
      | gawk '/./{printf "p%03d %s\n", $1, $2}' \
      | sort +0 -1n | uniq -c | expand \
      > vtx-f-eva-words-per-page.cts

  With the pnum-to-language and pnum-to-block tables, I prepared a block-augmented
  concordance file, for the "defining" occurrences of all label and
  title words (in label/title units).  Like the text version, this one
  too has fields
  
    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK
    1    2    3    4    5     6     7      8   9      10  11   12

  Here it is:
  
    cat labtit-def.roc \
      | gawk '($9 \!~ /[?*]/) {print;}' \
      | map-field \
        -v inField=1 \
        -v outField=11 \
        -v table=pnum-to-language.tbl \
      | map-field \
        -v inField=1 \
        -v outField=12 \
        -v table=pnum-to-block.tbl \
      > labtit-def.boc
    dicio-wc labtit-def.boc
  
     lines   words     bytes file        
    ------ ------- --------- ------------
       894   10728     42471 labtit-def.boc

  Finally, I combined the two block-oriented concordance files
  (labels/titles and text) into a single file, and added a new
  field PATT, the "pattern" of the string: a mapping of STRING that
  deletes "unimportant" details.  I also added a TAG field that can be
  used by the map-building script to distinguish the "special" and
  "ordinary" occurrences.  So the resulting file has the format
  
    PNUM FNUM UNIT LINE TRANS START LENGTH POS STRING OBS LANG BLOCK PATT TAG
    1    2    3    4    5     6     7      8   9      10  11   12    13   14
  
    foreach f ( labtit-def/+ vtx-f-eva-15/- )
      set name = "${f:h}"
      set tag = "${f:t}"
      echo "name=${name} tag=${tag}"
      cat ${name}.boc \
        | add-match-key -f eva2erg.gawk \
            -v inField=9 \
            -v outField=13 \
            -v erase_ligatures=1 \
            -v erase_plumes=1 \
            -v ignore_gallows_eyes=1 \
            -v join_ei=1 \
            -v equate_aoy=1 \
            -v collapse_ii=1 \
            -v equate_eights=1 \
            -v erase_q=1 \
            -v erase_word_spaces=1 \
        | gawk -v tag="${tag}" '/./{print $0, tag;}' \
        > ${name}.moc
    end
    dicio-wc {labtit-def,vtx-f-eva-15}.moc

     lines   words     bytes file        
    ------ ------- --------- ------------
       894   11622     49267 labtit-def.moc
     84469 1098097   4538232 vtx-f-eva-15.moc
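
  (The PATT field is computed by add-match-key through the eva2erg
  mapping, with the options listed above.  As a rough illustration
  only, here are a few of those erasures in gawk; the real rules
  live in eva2erg.gawk:)

    # sketch of some requested normalizations, applied to one word
    # per line (illustrative; not the actual eva2erg.gawk mapping):
    /./ {
      w = $1;
      gsub(/q/, "", w);        # erase_q
      gsub(/ii+/, "i", w);     # collapse_ii
      gsub(/[aoy]/, "o", w);   # equate_aoy
      gsub(/[.]/, "", w);      # erase_word_spaces
      print $1, w;
    }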

  To build the occurrence maps, I then had to join these two files and
  sort them so as to bring all entries with the same PATT together, with
  the "special" entries (TAG="+") preceding the "ordinary" ones (TAG="-").
  
    echo "nblocks = ${nblocks}"
    cat {labtit-def,vtx-f-eva-15}.moc \
      | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \
      > f15-full.msc
    
    cat f15-full.msc \
      | make-word-location-map \
          -v nblocks=${nblocks} \
          -v omitSingles=1 \
          -v totOnly=0 \
      > f15-full.map

    cat f15-full.map \
      | format-word-location-map \
          -v nblocks=${nblocks} \
          -v html=1 \
          -v maxlen=${maxlen} \
          -v ctwd=3 \
          -v blockHeadings=block-headings.txt \
          -v title="Non-unique word patterns" \
          -v showProps=1 \
          -v showPattern=0 \
          -v showAbsCounts=1 \
          -v showRelCounts=0 \
          -v showAvgPos=0 \
      > f15-full.html

  Even with "omitSingles=1", this map was humongous (5 MBytes), so I manually 
  split it into sections, and wrote an index f15-full-index.html
  pointing to the pieces.
  
  I also built a smaller version of this map, deleting all ordinary
  occurrences which were not equivalent to some special occurrence:
  
    echo "nblocks = ${nblocks}"
    cat {labtit-def,vtx-f-eva-15}.moc \
      | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \
      | gawk '{pt=$13;tg=$14;if(tg\!="-")pts=pt; if(pt==pts) print;}' \
      | make-word-location-map \
          -v nblocks=${nblocks} \
          -v omitSingles=0 \
          -v totOnly=0 \
      > f15-spec.map

    echo "maxlen = ${maxlen}"
    echo "nblocks = ${nblocks}"
    cat f15-spec.map \
      | format-word-location-map \
          -v nblocks=${nblocks} \
          -v html=1 \
          -v maxlen=${maxlen} \
          -v ctwd=3 \
          -v blockHeadings=block-headings.txt \
          -v title="Label and title words" \
          -v showProps=1 \
          -v showPattern=0 \
          -v showAbsCounts=1 \
          -v showRelCounts=0 \
          -v showAvgPos=0 \
      > f15-spec.html
      
  I generated another map with only the totals per abstract pattern,
  for all simple words (no phrases) of the main text:
   
    echo "nblocks = ${nblocks}"
    cat {labtit-def,vtx-f-eva-15}.moc \
      | sort -b +12 -13 +13 -14 +8 -9 +0 -1n +7 -8n \
      | gawk '{st=$9;tg=$14;if((tg=="-")&&(st\!~/[.]/))print;}' \
      | make-word-location-map \
          -v nblocks=${nblocks} \
          -v totOnly=1 \
      > f-00-tot.map
    dicio-wc f-00-tot.map
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      2871  114840    404554 f-00-tot.map

  Let's sort the distributions by similarity and make a picture.
  There are too many of them, so let's omit the patterns that
  occur too rarely.  First, let's make a histogram of the 
  total counts:
  
    cat f-00-tot.map \
      | gawk '{print $1;}' \
      | sort +0 -1nr | uniq -c | expand \
      | compute-cum-freqs
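
  (compute-cum-freqs is another local filter; judging from its use
  here, it turns the "count value" pairs from uniq -c into a table
  with running totals, roughly like this sketch:)

    # compute-cum-freqs (sketch): append cumulative counts and
    # cumulative fractions to "count value" pairs.
    /./ { i++; n[i] = $1; v[i] = $2; tot += $1; }
    END {
      for (j = 1; j <= i; j++)
        { cum += n[j];
          printf "%7d %-12s %8d %6.4f\n", n[j], v[j], cum, cum/tot; }
    }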

  It seems we can get enough patterns by looking just at the 
  popular ones:

    echo "nblocks = ${nblocks}"
    cat f-00-tot.map \
      | gawk '($1 >= 10){print;}' \
      > f-00-tot-p.map

    cat f-00-tot-p.map \
      | sort-distr \
          --numValues ${nblocks} \
          --skip 6 \
          --discrete --geometric \
          --cluSort \
          --delReSort --repeat 1 \
          --verbose \
      > f-00-tot-s.map
      
  I edited the file f-00-tot-s.map by hand,
  in an attempt to cluster similar entries together.
  Not very successful, though.  I made a picture of it:

    cat f-00-tot-s.map \
      | egrep -v '^ *$' \
      | sort-distr \
          --numValues ${nblocks} \
          --skip 6 \
          --discrete --geometric \
          --repeat 0 \
          --showPicture f-00-tot-s.ppm \
          --verbose \
      > /dev/null
      
    cat f-00-tot-s.ppm \
      | ppmtogif \
      > f-00-tot-s.gif
    xv f-00-tot-s.gif &
    
    /bin/rm -f f-00-tot-s.ppm
      
  I also compared the distribution of words with and without their
  "o" prefix:
  
    cat f-00-tot-p.map \
      | gawk '/./ {k=($34 "~"); gsub(/^o*/, "", k); print k, $0;}' \
      | sort +0 -1 +34 -35 \
      | gawk '/./ {k=$1; if(k\!=ok){print "";} ok=k; print;}' \
      | sed -e 's/^[^ ]* //g' \
      > f-00-tot-o.map
  
  It looks like the distributions of these pairs are surprisingly 
  similar...

  Then I made HTML versions of those maps:

    echo "maxlen = ${maxlen}"
    echo "nblocks = ${nblocks}"
    foreach f ( p s o )
      cat f-00-tot-${f}.map \
        | format-word-location-map \
            -v html=1 \
            -v nblocks=${nblocks} -v maxlen=${maxlen} \
            -v ctwd=3 \
            -v blockHeadings=block-headings.txt \
            -v title="Occurrences of all word patterns per block" \
            -v showProps=0 \
            -v showLineNumber=0 \
            -v showPattern=0 \
            -v totOnly=1 \
            -v showAbsCounts=0 \
            -v showRelCounts=1 \
            -v showAvgPos=0 \
        > f-00-tot-${f}.html
    end
    dicio-wc f-*.html
    
     lines   words     bytes file        
    ------ ------- --------- ------------
       370   14659     65848 f-00-tot-o.html
       370   14659     65848 f-00-tot-p.html
       370   14659     65848 f-00-tot-s.html

  Looking at these maps, it seems that these patterns are both very common 
  and uniformly distributed throughout the manuscript:

     totn pattern    most common wd
     ---- ---------  -----------------
      221 toin       kaiin~                
      635 otoe       qokar~                
      563 oe         or~                   
      764 eeeo       chey~                 
       62 eeee       ches~                 
       33 oteee      qokees~               
       87 eeeteeo    chckhey~              
       53 oeo        ary~                  
      813 otol       qokal~                
      153 toe        kar~                  
      177 eoe        sar~                  
       37 eeeeol     sheeol~               
       33 ot         ot~                   
       48 opeeeo     opchey~               
       78 oeeeo      yshey~                
       46 eetoin     chkaiin~              
     1049 oteeo      qokeey~               
       88 od         am~                   
      484 oto        qoky~                 
       47 eeto       chky~                 
       24 todo       tody~                 
       24 tod        kam~                  
       64 odo        ody~                  
       45 odoe       odar~                 
       95 teeeo      kchey~                
       14 oeeee      oeees~                
       48 deeeo      dchey~                
      113 doie       dair~                 
       30 eeeeoe     cheeor~               
       44 ooe        qoar~                 
       57 otodo      qotody~               
       40 teo        key~                  
       43 teol       keol~                 
       62 oeteo      qockhy~               
      259 eeeoe      cheor~                
       55 eee        she~                  
       82 o          y~                    
      348 doe        dar~                  
       75 eo         sy~