Hacking at the Voynich manuscript
Notebook - volume 5

Warning: these notebooks aren't strictly chronological logs.
  Sometimes I go back and redo things, clarify comments,
  delete garbage, etc.

Summary of previous notebooks
=============================

  On 97-07-05 I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip
  
  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
  "biological" section, in Currier's Language B, hand 2.  This section
  includes Currier's and Friedman's transcriptions.  Currier's seems
  to be the most complete of them.
  
  The two versions have many differences (affecting 5-10% of the
  words), and often disagree even in the grouping of symbols: where
  one sees two words the other sees a single word, what is [A] for one
  may be [CI] for the other, and so on.
  
  So I decided to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.
  I called this new encoding "jsa" (Jorge's Super-Analytic). 
  
  After mapping to jsa, I generated a "consensus" version
  of the biological section 
  
    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt
      
    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt
  
    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
        -chars "qocilgysxju" \
        bio-j-jsa.evt \
        bio-j-jsa
        
     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

   Digraph counts:

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
      q     1     .  1229    18     .     1   154     .     .     .   700     .  2103
      o    21   486     1    63  1087  1071     .     .     .     .     .     .  2729
      c     4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
      i     4     1     1     8  1997     2     .     .   560  1616    37   457  4683
      l     .     .     .     .     .     .    16     .     .     .  1566     .  1582
      g    52     .    74  2150     4     4     .     .     .     .     .     .  2284
      y  2790    26     2    47    13    43     .     .     .     .     .     .  2921
      s   463     1    99  1013     1     2     .     .     .     .     .     .  1579
      x   827    24   105   488     5   167     .     .     .     .     .     .  1616
      j    46     .    76  2175     6     .     .     .     .     .     .     .  2303
      u   453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897
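
  A minimal sketch of how a digraph table like the one above can be
  tallied from a one-word-per-line file such as bio-j-jsa-gut.wds.  It
  assumes each word contributes one pair per adjacent character plus a
  word-boundary pair (written "." here) at each end, which agrees with
  the grand total being equal to the byte count of the .wds file.  The
  function name digraph_counts is made up for illustration; it is not
  one of the notebook's scripts.

```shell
# digraph_counts: read one word per line on stdin and print
# "prev next count" triples; "." stands for the word boundary.
digraph_counts() {
  awk '
    length($0) > 0 {
      w = "." $0 "."                      # add a boundary at both ends
      for (i = 1; i < length(w); i++)
        n[substr(w, i, 1) " " substr(w, i + 1, 1)]++
    }
    END { for (k in n) print k, n[k] }
  ' | LC_ALL=C sort
}

# Example: printf 'qoHam\nqoHa\n' | digraph_counts
```

  Each output line corresponds to one cell of the table: first the row
  symbol, then the column symbol, then the count.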

  Some conclusions we get from this and other data:
  
    The valid \i/ sequences are \ij/  \is/ \iis/ \iiu/ \iiiu/ \ix/;
    the others are likely to be scription or transcription errors.
    
    \ci/ and \o/ are lexically similar but distinct glyphs. 
    
    The suffixes \ij/, \iis/, \iiu/, and \iiiu/ are preceded 
    almost exclusively by \ci/ and strictly word-final.  It seems 
    plausible that these are errors:
       
       \oij/     (4 occurrences) should be \ciij/    ( 32 occurrences)    
       \oiiu/    (2 occurrences) should be \ciiiu/   (109 occurrences)    
       \ciiu/    (4 occurrences) should be \ciiiu/   (109 occurrences)    
       \oiiiu/   (9 occurrences) should be \ciiiiu/  (329 occurrences)   
       \ciiiiiu/ (4 occurrences) should be \ciiiiu/  (329 occurrences) 
       \ciiix/   (2 occurrences) should be \ciix/    (403 occurrences) 
       
    \ciiis/ (19 occurrences) may also be a misreading of \ciis/ (291 occurrences).

    \cg/ is always a glyph.
    
    \qo/ is a combination that occurs only in word-initial position.
    
    \qc/ is likely to be a misreading/miswriting of \qo/.
    
    \cy/ is always a glyph, almost certainly a final form of \ci/.
    
    \qj/, \lj/, \qg/, \lg/ are glyphs.
    
    \cs/ is a glyph closely related to (but distinct from) \c/.
    
    \ccg/ is almost always followed by \ci/ or \cy/.
    
  Here "glyph" means a group of strokes that can be treated as a single symbol
  for analysis; it may actually be part of a larger, still unrecognized symbol.
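
  The suffix corrections suggested above could be tried out on a
  one-word-per-line jsa file with a small sed pass.  This is only a
  sketch of a hypothetical normalization step, not something that was
  applied to the files in this notebook.  It assumes the \ij/ and \iu/
  suffixes are word-final (hence the "$" anchors), and it applies the
  \ciiix/ rule anywhere in the word, since \ix/ is not restricted to
  final position.  The name fix_suffixes is made up.

```shell
# fix_suffixes: rewrite the suspected misreadings listed above.
# Input and output: one jsa-encoded word per line.
fix_suffixes() {
  sed -e 's/oij$/ciij/' \
      -e 's/oiiu$/ciiiu/' \
      -e 's/ciiu$/ciiiu/' \
      -e 's/oiiiu$/ciiiiu/' \
      -e 's/ciiiiiu$/ciiiiu/' \
      -e 's/ciiix/ciix/g'
}

# Example: printf 'qooij\nciiu\n' | fix_suffixes
```

  Comparing word counts before and after such a pass would show how
  much of the vocabulary the suspected errors account for.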
  
  Summarizing again:
  
    \iiiu/, \iiu/, \iis/, \ij/  
    
        The ziggies: strictly final, preceded always by \ci/ or,
        more rarely, by \o/.
        
    \ix/ 
     
        Usually initial or preceded by \ci/ or \o/;
        followed by any letter except ziggies and \qo/,
        \ix/, \is/
        
    \is/ 
    
        Similar to \ix/ except that it cannot be
        followed by capitals or \cg/, either.

    \cy/ 
    
        Almost always final, but occasionally followed by other letters.
        Preceded by about the same letters as \ci/; indeed, it is 
        probably the final form of \ci/.
        
    \cg/ 
    
        May be followed by many letters, most often \cy/ and \ci/.
        Almost always preceded by \c/, or initial; rarely by \ix/
        or \o/.
        
    \cs/ 
    
        Most often followed by \c/, somewhat less often by \o/,
        \ci/, or word break.  Most often initial, but also 
        preceded by \ix/, gallows, \c/, \cy/, \cg/, \is/.
        
    \lj/, \qj/ 
    
        The H-gallows: Very similar to each other, different from the
        rest, but somewhat similar to the P-gallows.  They probably
        combine with \c/ on both sides to make glyphs.  It is very
        likely that \l/ and \q/ are exactly equivalent.
        
    \lg/, \qg/
    
        The P-gallows: Very similar to each other, different from the
        rest, but somewhat similar to the H-gallows.  They probably
        combine with \c/ on both sides to make glyphs.  It is very
        likely that \l/ and \q/ are exactly equivalent.  They may be
        merely ornate forms of some letter, or several letters (\cg/,
        perhaps), used mainly in the first line of each paragraph (and
        perhaps of each page?)
        
    \qo/ 
    
        Strictly initial, almost always followed by a capital.
        Sometimes misread as \qc/?
    
    \ci/
    
        May be followed only by the ziggies, \ix/, or \ir/.
        Often follows a capital, but also \cg/,
        \cs/, \c/, \ix/, \is/, or word break.
        
    \o/ 
    
        Similar to \ci/, but is very often word-initial.
                   

  Other conclusions:
  
    * The manuscript does not appear to use any hyphenation mark.  Either
      words are not broken across lines, which would be unusual, or they
      are broken without any extra marks.  Such word breaks may 
      result in statistical anomalies at the beginning and end of lines.
      Could this explain Currier's claim that lines are "functional units"?

    * Note that parsing sequences like \cij/, \ciis/, and \ciiis/ requires
      some care: the right parsings are c+ij, c+iis, ci+iis.  

    * The parsing of \ciis/ is ambiguous: ci+is or c+iis.  Declaring 
      \ciiis/ to be a misreading of \ciis/ would remove the ambiguity.

    * The parsing of \ciiiu/ is ambiguous, too; but since the \iu/
      series does not seem to follow a bare \c/, it seems safe to parse
      it as ci+iiu.
    
    * The gallows characters \qj/ and \lj/ appear to be closely related:
      for every common word with \lj/, there appears to be
      a word with \qj/ that occurs with about 1/4 the frequency.
      
    * There seems to be a kinship between the glyphs \cs/ 
      (when not attached to the following \c/s),
      \ir/, and the gallows \lj/ and \qj/ (also, when unattached).
      
    * The same phenomenon can be noted with respect to prefixes
      containing \cc/ and \csc/: for every word beginning with \cc/,
      there is a word where the first \cc/ is replaced by \csc/,
      with practically the same frequency.
      
    * There appears to be much confusion between the suffixes \iu/
      and \iiiu/. They are almost surely distinct letters, but in
      about one half of the cases, Currier sees \iiu/ where Friedman
      has \iiiu/.
      
    * There appears to be much confusion between \o/ and \ci/.
      
  The strings of \c/, \cs/, \lj/, \qj/, \lg/, \qg/ must be treated
  together, after collapsing the glyphs listed above, since there
  seem to be glyphs consisting of gallows preceded and followed by 
  \c/ or \cc/.  When this is taken into account, we can see that 
  a single \c/ is not a glyph, but \cs/ is.  In fact, after
  shrinking \ci/ to `a', \cs/ to `z', the gallows to `H' or `P', the 
  only possible glyphs of the form [czHP]* with length at most 3 are
  
       freq glyph    
       ---- -----
        795 H       
         52 P       
        152 z       
        138 cc      
         70 zc      
        482 Hc      
        484 ccc     
        439 zcc    ? 
        493 Hcc    ?
         19 cHc     
          4 cPc  
          
  The ones marked `?' may be composite, z+cc and H+cc, but this hypothesis
  does not seem very likely (perhaps they are *sometimes* composite?)
  
  The significant strings of length 4 that cannot be parsed into the glyphs above
  are 
          
         20 cHcc    
          4 cPcc    

  Strings with 4 or more [czHP]'s tend to be quite ambiguous.
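
  To check which [czHP] strings can be split into the glyphs listed
  above, a small dynamic-programming parse is enough.  The sketch below
  takes the glyph list from the table (H P z cc zc Hc ccc zcc Hcc cHc
  cPc) and prints one possible segmentation, or UNPARSED if none
  exists.  The function name parse_glyphs is made up, and the
  segmentation shown is not necessarily the only one.

```shell
# parse_glyphs: read one [czHP] string per line; print the string and
# one way to split it into the glyphs of the table, or UNPARSED.
parse_glyphs() {
  awk '
    BEGIN { ng = split("H P z cc zc Hc ccc zcc Hcc cHc cPc", g, " ") }
    length($0) == 0 { next }
    {
      s = $0; n = length(s)
      split("", ok); split("", back)     # clear per-line state
      ok[0] = 1                          # the empty prefix always parses
      for (i = 1; i <= n; i++)
        for (j = 1; j <= ng; j++) {
          len = length(g[j]); p = i - len
          if (p >= 0 && (p in ok) && substr(s, p + 1, len) == g[j]) {
            ok[i] = 1; back[i] = g[j]    # prefix of length i ends in g[j]
          }
        }
      if (!(n in ok)) { print s, "UNPARSED"; next }
      out = ""
      for (i = n; i > 0; i -= length(back[i]))
        out = back[i] (out == "" ? "" : "+") out
      print s, out
    }'
}

# Example: printf 'cccHc\ncHcc\n' | parse_glyphs
```

  With this glyph list, cHcc and cPcc come out as UNPARSED, because a
  single c is not a glyph, whereas cccHc parses as cc+cHc.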
  
97-08-14 stolfi
===============

  Hacked make-consensus-interlin to preserve word spaces (".")
  that appear in one text but not in the other.  Also changed
  the handling of paired but mismatched characters by not advancing
  a string if the matching had a low score and the string 
  is significantly shorter than the other.
  
  Looking at the raw texts, it seems that the main source of "?"s is
  the confusion between "M" and "N" by Currier and/or Friedman.  So I
  decided to (1) map both [N] and [M] (and other lookalikes) to "m",
  and (2) build the consensus with this encoding, rather than JSA.
  
  I christened the new encoding "hop":
  
    --- jsa2hop ------------------------
    #! /n/gnu/bin/sed -f
    # Yet another stroke-level, error-resistant encoding
    s/[ql]j/H/g
    s/[ql]g/P/g
    s/cs/z/g
    s/ij/k/g
    s/ix/e/g
    s/is/r/g
    s/iiu/n/g
    s/y/i/g
    s/ci/a/g
    s/cg/8/g
    s/ir/w/g
    s/i*n/m/g
    ------------------------------------
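
  To see what the rules above do, the same substitutions can be wrapped
  in a shell function and run on a few words.  The input words below
  are illustrative spellings chosen for the trace, not quotations from
  the transcription.

```shell
# jsa2hop: the substitutions from the listing above, in the same order.
jsa2hop() {
  sed -e 's/[ql]j/H/g' -e 's/[ql]g/P/g' -e 's/cs/z/g' -e 's/ij/k/g' \
      -e 's/ix/e/g'    -e 's/is/r/g'    -e 's/iiu/n/g' -e 's/y/i/g' \
      -e 's/ci/a/g'    -e 's/cg/8/g'    -e 's/ir/w/g'  -e 's/i*n/m/g'
}

echo qoljciiiu  | jsa2hop    # ci + iiu  (FSG "AN")  -> qoHam
echo qoljciiiiu | jsa2hop    # ci + iiiu (FSG "AM")  -> qoHam
echo qoqjccgcy  | jsa2hop    # final cy becomes ci   -> qoHc8a
```

  The first two lines show the intended lossiness: the last rule,
  s/i*n/m/g, absorbs any leftover \i/ strokes, so the readings with
  [N] and [M] end up as the same hop word.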

    --- fsg2hop ------------------------
    #! /n/gnu/bin/gawk -f

    # Recoding an interlinear file from the FSG alphabet to 
    # my Lossy Ad-hoc Semi-Analytic Fault-Tolerant encoding

    BEGIN {
      print "# Output of fsg2hop - Stolfi's Semi-Analytic Fault-Tolerant alphabet"
    }

    /^ *$/ { print; next }
    /^ *#/ { print; next }
    /^<[^>.;]*>/ { print; next }

    /^<[^>]*\.[^>]*;[A-Z]> / {
      curtxt = substr($0,20)

      # We discard  "%" and "!" since the conversion
      # will destroy synchronism anyway.
      gsub(/[%!]/, "", curtxt);

      # First, the conversion from FSG to JSA (Stolfi's super-analytic)
      gsub(/IIIK/, "iiiij",  curtxt);
      gsub(/IIIL/, "iiiiu",  curtxt);
      gsub(/IIIR/, "iiiis",  curtxt);
      gsub(/IIIE/, "iiiix",  curtxt);
      gsub(/IIE/,  "iiix",   curtxt);
      gsub(/IIR/,  "iiis",   curtxt);
      gsub(/IIK/,  "iiij",   curtxt);
      gsub(/HZ/,   "cqjc",   curtxt);
      gsub(/PZ/,   "cqgc",   curtxt);
      gsub(/DZ/,   "cljc",   curtxt);
      gsub(/FZ/,   "clgc",   curtxt);
      gsub(/IE/,   "iix",    curtxt);
      gsub(/IR/,   "iis",    curtxt);
      gsub(/IK/,   "iij",    curtxt);
      gsub(/2/,    "cs",     curtxt);
      gsub(/4/,    "q",      curtxt);
      gsub(/6/,    "cj",     curtxt);
      gsub(/7/,    "ig",     curtxt);
      gsub(/8/,    "cg",     curtxt);
      gsub(/A/,    "ci",     curtxt);
      gsub(/C/,    "c",      curtxt);
      gsub(/D/,    "lj",     curtxt);
      gsub(/E/,    "ix",     curtxt);
      gsub(/F/,    "lg",     curtxt);
      gsub(/G/,    "cy",     curtxt);
      gsub(/H/,    "qj",     curtxt);
      gsub(/I/,    "i",      curtxt);
      gsub(/K/,    "ij",     curtxt);
      gsub(/L/,    "iu",     curtxt);
      gsub(/M/,    "iiiu",   curtxt);
      gsub(/N/,    "iiu",    curtxt);
      gsub(/O/,    "o",      curtxt);
      gsub(/P/,    "qg",     curtxt);
      gsub(/R/,    "is",     curtxt);
      gsub(/S/,    "csc",    curtxt);
      gsub(/T/,    "cc",     curtxt);
      gsub(/V/,    "?",      curtxt);
      gsub(/Y/,    "?",      curtxt);

      # Now, the conversion from JSA to HOP:

      gsub(/[ql]j/, "H",     curtxt);
      gsub(/[ql]g/, "P",     curtxt);
      gsub(/cs/,    "z",     curtxt);
      gsub(/ij/,    "k",     curtxt);
      gsub(/ix/,    "e",     curtxt);
      gsub(/is/,    "r",     curtxt);
      gsub(/iiu/,   "n",     curtxt);
      gsub(/y/,     "i",     curtxt);
      gsub(/ci/,    "a",     curtxt);
      gsub(/cg/,    "8",     curtxt);
      gsub(/ir/,    "w",     curtxt);
      gsub(/i*n/,   "m",     curtxt);

      print (substr($0,1,19) curtxt);
      next
    }
    ------------------------------------
  
  Built the new encoded interlinear:
  
    cat bio-m-evt.evt \
      | fsg2hop \
      > bio-m-hop.evt

  Reran the consensus:
    
    cat bio-m-hop.evt \
      | make-consensus-interlin \
      > bio-x-hop.evt
      
    cat bio-x-hop.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-hop.evt
      
  The result was still not very good; an inserted word still 
  can cause the program to lose sync for the rest of the line.
  But it seems to be better than the previous text.
  
  I created by hand a file bio-j-hop.evj, which is like bio-j-hop.evt
  except that it has " " instead of "." as word-space, and " //"
  instead of "-" for end-of-line, and " =" instead of "=" for
  end-of-paragraph.
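
  The hand edit described above could also be scripted.  The sketch
  below is one possible way, not what was actually done: it assumes, as
  fsg2hop does, that data lines match /^<[^>]*\.[^>]*;[A-Z]> / and that
  the text starts at column 20, so the "." characters inside the line
  locator are left alone.  The name evt2evj and the sample line in the
  comment are made up.

```shell
# evt2evj: convert word/line/paragraph delimiters in the text field
# of each data line; all other lines are copied unchanged.
evt2evj() {
  awk '
    /^<[^>]*\.[^>]*;[A-Z]> / {
      loc = substr($0, 1, 19)        # locator field, kept verbatim
      txt = substr($0, 20)           # transcribed text
      gsub(/\./, " ",   txt)         # word space
      gsub(/-/,  " //", txt)         # end of line
      gsub(/=/,  " =",  txt)         # end of paragraph
      print loc txt
      next
    }
    { print }
  '
}

# Example (hypothetical line):
#   printf '%s\n' '<f75r.P1.1;J>      qoHam.oe.8am-' | evt2evj
```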

  Extracted the text files:

    extract-words-from-interlin \
        -chars "aocz8HPerqkmw" \
        bio-j-hop.evt \
        bio-j-hop
        
     lines   words     bytes file        
    ------ ------- --------- ------------
      7670    7670     41815 bio-j-hop.wds
      1510    1510      9982 bio-j-hop.dic
      5894    5894     33804 bio-j-hop-gut.wds
       949     949      6236 bio-j-hop-gut.dic
       843     843      2464 bio-j-hop-fun.wds
         5       5        24 bio-j-hop-fun.dic
       933     933      5547 bio-j-hop-bad.wds
       556     556      3722 bio-j-hop-bad.dic

  Compared to the previous version of the unencoded stroke-level (jsa) files:

    Note the increase in the number of words, from 7054 to 7670.
    Also increased the number of  good words, from 4661 to 5894,
    and the consequent decrease in bad words, from 1553 to  933.

    There was a decrease in the number of distinct words, from 2132 to 1510.
    Also decreased was the number of good distinct words, from  992 to  949,
    and bad distinct words, from 1138 to 556.


    Digraph counts:

                  a     o     c     z     8     H     P     e     r     q     k     m     w    TT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .   251  1235   757   912   472   276    86   313   103  1489     .     .     .  5894
      a  3196     2     4    19    26    14    78     2   491   345     5    39   802    23  5046
      o    28     5     1    39     6    21  1776    68  1173   240     6     5    19     1  3388
      c    10  1059   226  4047    44  1865   408    33    15     4     .     .     5     .  7716
      z    58   109    90   957    10     3     4     1     1     .     .     .     .     .  1233
      8    64  2245    50    45    32     1     5     .     5     1     .     .     .     .  2448
      H    12  1125    98  1479    47     5     .     .     9     .     .     .     1     .  2776
      P     2    20    43   116    17     3     .     .     .     .     .     .     .     .   201
      e  1121   130   117   216   122    61   227    10     4     2     1     .     .     .  2011
      r   514    90    48    24    15     3     1     .     .     .     .     .     .     .   695
      q     1     5  1474    17     2     .     1     1     .     .     .     .     .     .  1501
      k    43     .     1     .     .     .     .     .     .     .     .     .     .     .    44
      m   822     4     1     .     .     .     .     .     .     .     .     .     .     .   827
      w    23     1     .     .     .     .     .     .     .     .     .     .     .     .    24
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5894  5046  3388  7716  1233  2448  2776   201  2011   695  1501    44   827    24 33804

    Next-symbol probability (× 99):

            a  o  c  z  8  H  P  e  r  q  k  m  w TT
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
         .  4 21 13 15  8  5  1  5  2 25  .  .  . 99
      a 63  .  .  .  1  .  2  . 10  7  .  1 16  . 99
      o  1  .  .  1  .  1 52  2 34  7  .  .  1  . 99
      c  . 14  3 52  1 24  5  .  .  .  .  .  .  . 99
      z  5  9  7 77  1  .  .  .  .  .  .  .  .  . 99
      8  3 91  2  2  1  .  .  .  .  .  .  .  .  . 99
      H  . 40  3 53  2  .  .  .  .  .  .  .  .  . 99
      P  1 10 21 57  8  1  .  .  .  .  .  .  .  . 99
      e 55  6  6 11  6  3 11  .  .  .  .  .  .  . 99
      r 73 13  7  3  2  .  .  .  .  .  .  .  .  . 99
      q  .  . 97  1  .  .  .  .  .  .  .  .  .  . 99
      k 97  .  2  .  .  .  .  .  .  .  .  .  .  . 99
      m 98  .  .  .  .  .  .  .  .  .  .  .  .  . 99
      w 95  4  .  .  .  .  .  .  .  .  .  .  .  . 99
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    TOT 17 15 10 23  4  7  8  1  6  2  4  0  2  0 99

    Previous-symbol probability (× 99):

            a  o  c  z  8  H  P  e  r  q  k  m  w TT
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
         .  5 36 10 73 19 10 42 15 15 98  .  .  . 17
      a 54  .  .  .  2  1  3  1 24 49  . 88 96 95 15
      o  .  .  .  1  .  1 63 33 58 34  . 11  2  4 10
      c  . 21  7 52  4 75 15 16  1  1  .  .  1  . 23
      z  1  2  3 12  1  .  .  .  .  .  .  .  .  .  4
      8  1 44  1  1  3  .  .  .  .  .  .  .  .  .  7
      H  . 22  3 19  4  .  .  .  .  .  .  .  .  .  8
      P  .  .  1  1  1  .  .  .  .  .  .  .  .  .  1
      e 19  3  3  3 10  2  8  5  .  .  .  .  .  .  6
      r  9  2  1  .  1  .  .  .  .  .  .  .  .  .  2
      q  .  . 43  .  .  .  .  .  .  .  .  .  .  .  4
      k  1  .  .  .  .  .  .  .  .  .  .  .  .  .  0
      m 14  .  .  .  .  .  .  .  .  .  .  .  .  .  2
      w  .  .  .  .  .  .  .  .  .  .  .  .  .  .  0
        -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    TOT 99 99 99 99 99 99 99 99 99 99 99 99 99 99 99
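
  The two probability tables follow from the digraph counts by
  normalizing each row (next symbol) or each column (previous symbol).
  The entries appear consistent with scaling by 99 and rounding to the
  nearest integer; for instance, q followed by o gives 99 * 1474 / 1501
  = 97.2, shown as 97.  A minimal sketch under that assumption, reading
  "prev next count" triples (the function name is made up):

```shell
# next_symbol_prob: read "prev next count" triples and print
# "prev next P", where P = round(99 * count / row total).
next_symbol_prob() {
  awk '
    { c[$1 " " $2] = $3; row[$1] += $3 }
    END {
      for (k in c) {
        split(k, p, " ")
        printf "%s %s %d\n", p[1], p[2], int(99 * c[k] / row[p[1]] + 0.5)
      }
    }
  ' | LC_ALL=C sort
}

# Example, using the q row of the digraph table:
#   printf 'q . 1\nq a 5\nq o 1474\nq c 17\nq z 2\nq H 1\nq P 1\n' \
#     | next_symbol_prob
```

  Swapping the first two columns of the input before calling the
  function normalizes by column instead, which gives the
  previous-symbol figures.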

  Rebuilt .fix.wds, .fix.dic:
  
    cat bio-j-hop.wds \
      | sed -e '/?/s/^.*$/???/g' \
      > .fix.wds

    cat .fix.wds \
      | sort | uniq \
      > .fix.dic
      
    cat .fix.wds \
      | wfreq \
      > .fix.frq

     lines   words     bytes file        
    ------ ------- --------- ------------
       955     955      6264 .fix.dic
       957    2871     17757 .fix.frq
      7670    7670     40000 .fix.wds

  (Note the coincidence in the number of bytes of .fix.wds!)
  
  To do next: redo all statistics...
  
  Word location maps:
  
    cat bio-j-hop-gut.wds \
      | enum-words-in-blocks -v WPB=100 \
      | sort -k2,2 -k1,1n \
      | make-word-location-map -v CTWD=1 -v PERCENT=1 -v NBLOCKS=59 \
      > .baz
      
    cat bio-j-hop-gut.wds \
      | enum-words-in-blocks -v WPB=590 \
      | sort -k2,2 -k1,1n \
      | make-word-location-map -v CTWD=3 -v PERCENT=1 -v NBLOCKS=10 \
      > .bar
  
  Here is the coarse word location map of all the words with 10 or more occurrences:
  
  TOTAL   AVG   DEV WORD       ABS FREQ BY BLOCK               REL FREQ BY BLOCK             WORD     
  ----- ----- ----- ---------  -----------------------------   ----------------------------- ---------
     64   5.2   3.5 8ar        13  5  2  8  3  4  3  7  3 16   20  8  3 12  5  6  5 11  5 25 8ar      
     63   4.6   2.9 8ae         8  5  6 17  3  1  3  8  9  3   13  8  9 27  5  2  5 13 14  5 8ae      
    120   5.2   3.1 8am        10 15 13 15  6  6  8 16 12 19    8 12 11 12  5  5  7 13 10 16 8am      
     53   4.6   3.0 8a         10  5  3  7  4  2  6 10  2  4   19  9  6 13  7  4 11 19  4  7 8a       

     26   6.1   2.5 8oe         .  1  2  5  2  2  3  4  2  5    .  4  8 19  8  8 11 15  8 19 8oe      

     40   3.4   2.8 oHar       11  8  1  4  4  4  3  2  .  3   27 20  2 10 10 10  7  5  .  7 oHar     
     43   5.5   2.7 oHae        3  4  .  5  7  4  7  4  2  7    7  9  . 12 16  9 16  9  5 16 oHae     
    105   5.1   2.5 oHam        5 14  6  4 13 28 10  9 10  6    5 13  6  4 12 26  9  8  9  6 oHam     
     31   4.4   2.6 oHa         2  7  2  4  1  6  3  4  1  1    6 22  6 13  3 19 10 13  3  3 oHa      

     11   5.8   2.5 oHoe        .  1  1  .  2  3  1  .  1  2    .  9  9  . 18 27  9  .  9 18 oHoe     

     53   4.9   3.0 qoHar      10  4  2  .  9  9  3  6  6  4   19  7  4  . 17 17  6 11 11  7 qoHar    
    128   5.6   2.6 qoHae       9  7  5 22  8 15 14 20 19  9    7  5  4 17  6 12 11 15 15  7 qoHae    
    265   4.8   2.7 qoHam      32 11 34 30 25 41 28 21 36  7   12  4 13 11  9 15 10  8 13  3 qoHam    
     84   5.3   2.9 qoHa       12  4  6  5  6 12  9 13  9  8   14  5  7  6  7 14 11 15 11  9 qoHa     

     24   5.5   2.3 qoHoe       .  .  5  3  2  2  7  1  2  2    .  . 21 12  8  8 29  4  8  8 qoHoe    

      7   5.5   2.4 oeHar       1  .  .  .  1  1  3  .  1  .   14  .  .  . 14 14 42  . 14  . oeHar
      6   6.2   1.8 oeHae       .  .  .  .  2  1  2  .  .  1    .  .  .  . 33 17 33  .  . 17 oeHae
     31   5.8   1.7 oeHam       1  .  .  4  3  7 11  2  3  .    3  .  . 13 10 22 35  6 10  . oeHam    
     12   5.3   2.6 oeHa        1  1  .  1  2  2  3  .  .  2    8  8  .  8 17 17 25  .  . 17 oeHa     

     11   6.7   1.8 Har         .  .  .  1  1  3  .  3  2  1    .  .  .  9  9 27  . 27 18  9 Har      
     19   5.4   2.7 Hae         2  2  .  .  3  3  5  .  2  2   10 10  .  . 16 16 26  . 10 10 Hae      
     39   5.4   2.3 Ham         3  1  4  3  .  8 11  5  4  .    8  3 10  8  . 20 28 13 10  . Ham      
      6   3.5   2.8 Ha          2  1  .  .  .  2  .  1  .  .   33 17  .  .  . 33  . 17  .  . Ha

     14   4.8   3.0 Hoe         3  .  2  1  .  1  4  1  1  1   21  . 14  7  .  7 28  7  7  7 Hoe      
     11   6.2   1.9 Poe         .  .  1  .  2  1  4  1  1  1    .  .  9  . 18  9 36  9  9  9 Poe      

     19   4.8   2.5 ar          1  2  3  2  1  4  3  .  2  1    5 10 16 10  5 21 16  . 10  5 ar       
     20   3.6   2.5 ae          .  9  3  .  2  .  5  .  .  1    . 45 15  . 10  . 25  .  .  5 ae       
     53   4.6   2.7 am          2  8 10  6  4  6  6  3  4  4    4 15 19 11  7 11 11  6  7  7 am       
     13   4.3   2.2 a           1  2  .  2  3  2  2  .  1  .    8 15  . 15 23 15 15  .  8  . a        

    213   5.3   2.6 oe         22  6  6 25 38 26 41 16  9 24   10  3  3 12 18 12 19  7  4 11 oe       
     66   4.9   2.8 or          7  7  1  8 15  6  9  2  1 10   11 11  2 12 23  9 14  3  2 15 or       

     12   4.2   2.8 zar         1  2  4  .  .  .  3  .  2  .    8 17 33  .  .  . 25  . 17  . zar      
     19   4.1   2.9 zae         3  3  2  3  2  1  .  2  2  1   16 16 10 16 10  5  . 10 10  5 zae      
     57   4.4   2.9 zam         5 10 10  6  7  2  1  4  8  4    9 17 17 10 12  3  2  7 14  7 zam      

     40   5.5   2.9 zoe         4  2  4  3  5  4  3  2 10  3   10  5 10  7 12 10  7  5 25  7 zoe      
     13   5.0   2.9 zor         2  1  .  2  1  2  2  .  2  1   15  8  . 15  8 15 15  . 15  8 zor      

      5   6.7   2.2 rar         .  .  .  1  .  1  1  .  1  1    .  .  . 20  . 20 20  . 20 20 rar
      6   5.3   2.5 rae         .  1  .  1  .  2  1  .  .  1    . 17  . 17  . 33 17  .  . 17 rae
     17   4.7   2.7 ram         .  2  5  3  .  .  1  4  1  1    . 12 29 17  .  .  6 23  6  6 ram      
      5   5.7   3.1 ra          1  .  .  .  1  .  1  1  .  1   20  .  .  . 20  . 20 20  . 20 ra

     10   5.9   2.2 roe         1  .  .  1  .  2  1  5  .  .   10  .  . 10  . 20 10 50  .  . roe      

     13   3.3   3.0 z           2  4  3  1  .  1  .  .  .  2   15 30 23  8  .  8  .  .  . 15 z        

     11   3.6   2.5 r           1  2  4  .  1  1  .  1  1  .    9 18 36  .  9  9  .  9  9  . r        
  ----- ----- ----- ---------  -----------------------------   ----------------------------- ---------
     46   5.7   3.1 Hc8a        5  2  1 10  3  2  3  4  5 11   11  4  2 22  6  4  6  9 11 24 Hc8a     
     29   4.5   2.9 Hcc8a       4  2  4  5  3  2  4  .  .  5   14  7 14 17 10  7 14  .  . 17 Hcc8a    
     11   4.3   2.4 Hcca        1  1  2  .  2  4  .  .  .  1    9  9 18  . 18 36  .  .  .  9 Hcca     

     13   6.4   2.9 aHc8a       1  .  .  3  1  .  .  4  .  4    8  .  . 23  8  .  . 30  . 30 aHc8a    
     12   6.2   2.6 aHcc8a      .  1  .  2  1  2  1  2  .  3    .  8  . 17  8 17  8 17  . 25 aHcc8a   
     10   4.4   2.0 aHcca       1  1  .  1  2  4  .  1  .  .   10 10  . 10 20 40  . 10  .  . aHcca    

     88   5.3   3.5 oHc8a      11 13  8 12  2  2  3  5  5 27   12 15  9 14  2  2  3  6  6 30 oHc8a    
     59   4.7   3.0 oHcc8a      5 12  3  5  7  9  3  3  3  9    8 20  5  8 12 15  5  5  5 15 oHcc8a   
     35   5.1   2.8 oHcca       3  6  1  .  4  7  4  5  2  3    8 17  3  . 11 20 11 14  6  8 oHcca    

     21   4.9   3.2 oeHc8a      2  4  1  3  1  .  5  .  1  4    9 19  5 14  5  . 24  .  5 19 oeHc8a   
     15   7.0   2.3 oeHcc8a     1  .  .  .  1  2  2  3  4  2    7  .  .  .  7 13 13 20 26 13 oeHcc8a  
     20   5.3   2.6 oeHcca      1  2  1  2  .  8  2  .  1  3    5 10  5 10  . 40 10  .  5 15 oeHcca   

    204   5.1   3.1 qoHc8a     24 10 27 39 10 10 11 23 18 32   12  5 13 19  5  5  5 11  9 16 qoHc8a   
    193   4.6   2.9 qoHcc8a    23 15 43 15 14 18 11 22 16 16   12  8 22  8  7  9  6 11  8  8 qoHcc8a  
     90   5.4   2.8 qoHcca      5 10  9  7  5 15  6 13  8 12    6 11 10  8  6 17  7 14  9 13 qoHcca   

      5   3.3   3.1 ecc8a       2  .  1  .  1  .  .  .  1  .   40  . 20  . 20  .  .  . 20  . ecc8a
     53   4.5   2.7 eccc8a      3  7 11  7  5  .  4  8  8  .    6 13 21 13  9  .  7 15 15  . eccc8a   
     17   5.4   2.5 eccca       1  1  2  .  3  2  2  4  1  1    6  6 12  . 17 12 12 23  6  6 eccca    

      3   7.5   3.0 oecc8a      .  .  .  1  .  .  .  .  .  2    .  .  . 33  .  .  .  .  . 66 oecc8a
     24   4.5   3.3 oeccc8a     5  3  2  3  1  1  1  2  3  3   21 12  8 12  4  4  4  8 12 12 oeccc8a  
     11   4.6   3.6 oeccca      3  .  2  1  1  .  .  1  .  3   27  . 18  9  9  .  .  9  . 27 oeccca   

     24   5.9   3.1 cc8a        1  3  2  3  1  .  1  6  2  5    4 12  8 12  4  .  4 25  8 21 cc8a     
    211   4.9   3.0 ccc8a      19 25 29 25 18  9 18 23 22 23    9 12 14 12  8  4  8 11 10 11 ccc8a    
     89   5.2   3.0 ccca        8 15  8  2  4  9 10 12 16  5    9 17  9  2  4 10 11 13 18  6 ccca     

      6   3.8   3.3 zc8a        .  4  .  .  .  .  .  .  2  .    . 66  .  .  .  .  .  . 33  . zc8a
    233   5.3   3.1 zcc8a      18 29 24 26 17 14 19 18 34 34    8 12 10 11  7  6  8  8 14 14 zcc8a    
     86   4.7   3.1 zcca       16 10  4  5  9 10  9  3 10 10   18 12  5  6 10 12 10  3 12 12 zcca     

    233   5.3   3.1 zcc8a      18 29 24 26 17 14 19 18 34 34    8 12 10 11  7  6  8  8 14 14 zcc8a    
     42   4.4   2.7 zccc8a      5  4  6  5  7  2  5  2  4  2   12  9 14 12 17  5 12  5  9  5 zccc8a   
     34   4.1   2.3 zccca       2  5  5  3  9  5  1  2  .  2    6 15 15  9 26 15  3  6  .  6 zccca    

    211   4.9   3.0 ccc8a      19 25 29 25 18  9 18 23 22 23    9 12 14 12  8  4  8 11 10 11 ccc8a    
     23   3.8   2.3 cccc8a      1  3  9  1  4  .  1  2  2  .    4 13 39  4 17  .  4  9  9  . cccc8a   
     35   5.2   2.6 cccca       3  1  5  3  4  4  3  8  3  1    8  3 14  8 11 11  8 23  8  3 cccca    
  ----- ----- ----- ---------  -----------------------------   ----------------------------- ---------
     55   5.1   2.9 cccHca      4  8  4  3  6  7  7  6  3  7    7 14  7  5 11 13 13 11  5 13 cccHca   
     39   5.4   3.2 zccHca      3  8  1  1  4  4  4  2  4  8    8 20  3  3 10 10 10  5 10 20 zccHca   

     36   5.3   2.4 ccccHca     1  2  5  4  1 12  2  2  4  3    3  6 14 11  3 33  6  6 11  8 ccccHca  
     33   5.0   2.6 zcccHca     4  2  3  2  . 10  3  5  3  1   12  6  9  6  . 30  9 15  9  3 zcccHca  

     14   5.3   2.8 cccHa       2  .  .  4  .  2  1  2  2  1   14  .  . 28  . 14  7 14 14  7 cccHa    
     17   5.0   2.1 zccHa       1  1  2  .  1  9  1  1  .  1    6  6 12  .  6 52  6  6  .  6 zccHa    

     12   4.2   2.4 ccccHa      1  .  3  4  .  1  .  2  1  .    8  . 25 33  .  8  . 17  8  . ccccHa   
     13   3.5   2.9 zcccHa      4  1  2  1  1  1  1  .  2  .   30  8 15  8  8  8  8  . 15  . zcccHa   
  ----- ----- ----- ---------  -----------------------------   ----------------------------- ---------
     10   4.6   2.9 8ccc8a      2  1  1  .  .  1  2  3  .  .   20 10 10  .  . 10 20 30  .  . 8ccc8a   
     17   4.3   2.9 8zcc8a      3  2  1  3  1  1  1  4  .  1   17 12  6 17  6  6  6 23  .  6 8zcc8a   
     18   4.4   2.7 Pccc8a      3  2  1  1  4  .  4  2  .  1   17 11  6  6 22  . 22 11  .  6 Pccc8a   
     18   4.8   2.7 Hccc8a      2  1  2  3  2  2  .  3  3  .   11  6 11 17 11 11  . 17 17  . Hccc8a   
     16   4.2   2.3 oPccc8a     2  1  1  2  5  3  .  .  2  .   12  6  6 12 31 19  .  . 12  . oPccc8a  
     14   4.9   3.3 oezcc8a     3  1  .  1  3  .  2  1  .  3   21  7  .  7 21  . 14  7  . 21 oezcc8a  
     22   5.0   3.0 ezcc8a      2  2  4  2  1  1  2  4  1  3    9  9 18  9  5  5  9 18  5 14 ezcc8a   
     11   6.2   3.0 qoHccc8a    1  .  1  2  .  .  1  1  3  2    9  .  9 18  .  .  9  9 27 18 qoHccc8a 
     10   4.7   3.8 oe8a        4  .  .  1  .  .  .  2  2  1   40  .  . 10  .  .  . 20 20 10 oe8a     

     28   6.2   2.4 cccoe       .  1  4  2  1  2  6  3  7  2    .  4 14  7  4  7 21 11 25  7 cccoe    
     13   5.6   2.5 ccoe        .  1  2  .  3  1  2  1  2  1    .  8 15  . 23  8 15  8 15  8 ccoe     

     23   5.1   3.1 oHca        2  4  1  2  2  2  4  1  .  5    9 17  4  9  9  9 17  4  . 22 oHca     
     44   4.6   3.0 qoHca       5  9  2  3  5  3  6  5  .  6   11 20  5  7 11  7 14 11  . 14 qoHca    

  ----- ----- ----- ---------  -----------------------------   ----------------------------- ---------
     12   3.9   2.6 eHam        1  3  1  2  1  2  .  1  .  1    8 25  8 17  8 17  .  8  .  8 eHam     
     20   3.4   2.5 eoe         5  1  5  1  2  3  2  .  .  1   25  5 25  5 10 15 10  .  .  5 eoe      
     10   5.0   3.1 eor         2  1  .  .  1  2  1  1  1  1   20 10  .  . 10 20 10 10 10 10 eor      
     26   4.8   2.9 oea         4  3  .  1  6  3  4  1  1  3   15 11  .  4 23 11 15  4  4 11 oea      
     16   5.5   2.5 oeor        .  3  1  .  1  4  2  2  2  1    . 19  6  .  6 25 12 12 12  6 oeor     
    124   4.9   2.8 qoe        16  6 11 12 22  8 15 13 13  8   13  5  9 10 18  6 12 10 10  6 qoe      
     14   4.5   2.9 zca         3  2  .  .  1  2  3  2  1  .   21 14  .  .  7 14 21 14  7  . zca      
     11   4.4   2.6 zccor       1  2  .  2  1  2  2  .  .  1    9 18  . 18  9 18 18  .  .  9 zccor    
     24   5.1   2.8 zccoe       3  .  2  3  6  1  1  3  3  2   12  .  8 12 25  4  4 12 12  8 zccoe    
     15   5.5   2.2 zcoe        .  2  .  .  6  .  3  2  1  1    . 13  .  . 40  . 20 13  7  7 zcoe     
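
  For reference, here is a minimal sketch of how one row of such a map
  can be computed from a one-word-per-line file.  It is not the
  enum-words-in-blocks / make-word-location-map code; the function
  name and argument order are made up.  It assumes consecutive blocks
  of WPB words, takes AVG as the mean block midpoint (0-based block
  index plus 0.5), and scales the relative frequencies by 99 with
  rounding.  Both assumptions reproduce the AVG and REL FREQ figures of
  the rows I checked (e.g. oecc8a: 1 and 2 occurrences in blocks 4 and
  10 give AVG 7.5 and relative frequencies 33 and 66).  The DEV column
  is left out, since its exact definition is not stated here.

```shell
# word_location WORD WPB NBLOCKS < file.wds
# Prints: WORD TOTAL AVG, then NBLOCKS counts, then NBLOCKS
# relative frequencies (x 99, rounded).
word_location() {
  awk -v word="$1" -v wpb="$2" -v nb="$3" '
    {
      b = int((NR - 1) / wpb)        # 0-based block of this word
      if (b >= nb) b = nb - 1        # fold any overflow into last block
      if ($0 == word) { cnt[b]++; tot++; sum += b + 0.5 }
    }
    END {
      if (tot == 0) { print word, 0; exit }
      printf "%s %d %.1f", word, tot, sum / tot
      for (b = 0; b < nb; b++) printf " %d", cnt[b]
      for (b = 0; b < nb; b++) printf " %d", int(99 * cnt[b] / tot + 0.5)
      printf "\n"
    }'
}

# Example: word_location qoHam 590 10 < bio-j-hop-gut.wds
```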

97-08-16 stolfi
===============

  I found this in Jim Reeds's e-mail archives:

      As for tricky pages:  I suppose in the end we just have to make a
      diagram and whereever V text appears (be it a word, a line, or a para),
      define a ``locus'', with locus identifier entered on the diagram,
      and tag the transcribed text with page/locus/line-num.
      Thus, on the page shown on Kahn p865, in addition to the usual
      locus for lines (viz, the main body of text) we could define 8 more
      loci, call them N1, N2, N3, N4, N5, W1, W2, and E1,
      and have lines in the transcription like:

            152 N1 1        OFAN/AFOE       ; ladies with hands in tubing
            152 N2 1        OPOE/ZC89       ; under N1
            152 N3 1        OEFS8OE         ; center top
            152 N4 1        OPOEOR
            152 N5 1        ORSC8AE         ; under N4
            152 W1 1        2ORORAE         ; above lady's head
            152 W2 1        OECOC8N         ; on her vascular boat's hull
            152 E1 1        OFA