Hacking at the Voynich manuscript
Notebook - volume 3

Warning: these notebooks aren't strictly chronological logs.
  Sometimes I go back and redo things, clarify comments,
  delete garbage, etc.

Summary of previous notebooks
=============================

  On 97-07-05 I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip
  
  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
  "biological" section, in Currier's Language B, hand 2.  This section
  includes Currier's and Friedman's transcriptions.  Currier's seems
  to be the more complete of the two.
  
  The two versions have many differences (affecting 5-10% of the
  words), and often disagree even in the grouping of symbols: where
  one sees two words the other sees a single word, what is [A] for one
  may be [CI] for the other, and so on.
  
  So I decided to break all characters down to individual "logical"
  strokes, and use one (computer) character to encode each stroke.
  I called this new encoding "jsa" (Jorge's Super-Analytic). 
  
  After mapping to jsa, I generated a "consensus" version
  of the biological section:
  
    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt
      
    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt
  
    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
        -chars "qocilgysxju" \
        bio-j-jsa.evt \
        bio-j-jsa
        
     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

   Digraph counts:

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
      q     1     .  1229    18     .     1   154     .     .     .   700     .  2103
      o    21   486     1    63  1087  1071     .     .     .     .     .     .  2729
      c     4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
      i     4     1     1     8  1997     2     .     .   560  1616    37   457  4683
      l     .     .     .     .     .     .    16     .     .     .  1566     .  1582
      g    52     .    74  2150     4     4     .     .     .     .     .     .  2284
      y  2790    26     2    47    13    43     .     .     .     .     .     .  2921
      s   463     1    99  1013     1     2     .     .     .     .     .     .  1579
      x   827    24   105   488     5   167     .     .     .     .     .     .  1616
      j    46     .    76  2175     6     .     .     .     .     .     .     .  2303
      u   453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897
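
  For reference, a minimal Python sketch of how a table like this can be
  tallied.  It is only an illustration, not the script that produced the
  counts above; the function name and the sample words are made up.
  Each word is padded with `.' on both sides, so a word of n strokes
  contributes n+1 pairs, which is consistent with the grand total
  (40897) matching the byte count of bio-j-jsa-gut.wds.

```python
from collections import Counter

def digraph_counts(words):
    """Count adjacent stroke pairs, using '.' as the word-boundary marker."""
    counts = Counter()
    for w in words:
        padded = "." + w + "."
        # zip pairs each character with its successor
        for first, second in zip(padded, padded[1:]):
            counts[first, second] += 1
    return counts

# Hypothetical two-word sample, NOT taken from the transcription.
sample = ["qo", "ci"]
table = digraph_counts(sample)
for (first, second), n in sorted(table.items()):
    print(first, second, n)
```

  Rows of the table above correspond to the first element of each pair,
  columns to the second.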

  Some conclusions we get from this and other data:
  
    \ci/ and \o/ are lexically similar but distinct letters. 
    
    The valid \i/ sequences are \ij/  \is/ \iis/ \iiu/ \iiiu/ \ix/;
    the others are likely to be scription or transcription errors.
    
    \qo/ is a combination that occurs only in word-initial position.


97-07-26 stolfi
===============

  Back from the math colloquium, I decided to review the \c/ stroke in
  JSA encoding, to see if strings of \c/s could be parsed
  unambiguously into letters.
  
  First of all, let's review the grouping of \ci*/ into letters:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 20 \
          'ai*' \
          'oi*'
  
       710  0.59  ai         1642  0.60  o
       330  0.27  aiii       1075  0.39  oi
       130  0.11  aii           9  0.00  oiii
        35  0.03  a             3  0.00  oii
         4  0.00  aiiii     -----  ----  ----
     -----  ----  ----       2729  1.00  TOT
      1209  1.00  TOT      
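
  The compare-contexts script is not listed in this notebook.  Judging
  only from its output, it seems to tally every maximal match of each
  pattern, together with -lctx characters to the left and -rctx
  characters to the right.  The Python sketch below is a guess at that
  behavior under those assumptions; the function name, its arguments,
  and the sample words are hypothetical.

```python
import re
from collections import Counter

def compare_contexts(words, pattern, lctx=0, rctx=0):
    """Tally each non-empty match of `pattern`, widened by lctx/rctx characters."""
    rx = re.compile(pattern)
    tally = Counter()
    for w in words:
        for m in rx.finditer(w):
            if m.start() == m.end():
                continue  # ignore empty matches
            lo = max(0, m.start() - lctx)
            hi = min(len(w), m.end() + rctx)
            tally[w[lo:hi]] += 1
    return tally

# Hypothetical words, already wrapped in '_' and with ci -> a applied.
words = ["_qoaix_", "_oix_", "_aiiiu_"]
no_ctx = compare_contexts(words, "ai*")
right_ctx = compare_contexts(words, "ai*", rctx=1)
print(no_ctx.most_common())
print(right_ctx.most_common())
```

  Whether the real script clips contexts at the word edges in the same
  way is unknown.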

  Hm, it seems that \ci/ is not really a letter; it is
  most often attached to the following \i/ strings.
  Let's retry with some more context, and removing the \qo/
  combination:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/_qo/_A/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'ai*' \
          'oi*'



       403  0.33  aix         760  0.50  oix
       329  0.27  aiiiu       260  0.17  oq
       271  0.22  ais         250  0.17  ol
       109  0.09  aiiu        171  0.11  ois
        32  0.03  aij          35  0.02  oc
        19  0.02  aiis         17  0.01  o_
        15  0.01  ax            9  0.01  oiiiu
         8  0.01  ac            4  0.00  oij
         4  0.00  as            1  0.00  oiiu
         4  0.00  aiu           1  0.00  oiis
         4  0.00  aiiiiu    -----  ----  ----
         4  0.00  a_         1508  1.00  TOT
         2  0.00  al       
         2  0.00  aiix     
         1  0.00  aq       
         1  0.00  ao       
         1  0.00  aiiis    
     -----  ----  ----     
      1209  1.00  TOT      

  Let's look at the .dic file, instead of .wds, to lessen the effect of
  common words:

    cat bio-j-jsa-gut.dic \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/_qo/_A/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'ai*' \
          'oi*'

       126  0.35  aix         233  0.49  oix
        86  0.24  ais          64  0.13  oq
        48  0.13  aiiiu        64  0.13  ois
        22  0.06  aiiu         61  0.13  ol
        21  0.06  aij          32  0.07  oc
        14  0.04  aiis         13  0.03  o_
        11  0.03  ax            5  0.01  oiiiu
         8  0.02  ac            4  0.01  oij
         4  0.01  as            1  0.00  oiiu
         4  0.01  aiiiiu        1  0.00  oiis
         3  0.01  a_        -----  ----  ----
         2  0.01  al          478  1.00  TOT
         2  0.01  aiu      
         2  0.01  aiix     
         1  0.00  aq       
         1  0.00  ao       
         1  0.00  aiiis    
     -----  ----  ----     
       356  1.00  TOT      

  Hm, it seems quite possible that the \o/ in \oiiiu/,
  \oij/, \oiiu/, \oiis/ is actually a misreading of \ci/.

  Let's now compare the occurrences of \i/ strings after \c/ 
  against other letters:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 20 \
          'cii*' \
          '[^ci]ii*'


       710  0.59  cii        1075  0.73  oi
       330  0.27  ciiii       361  0.24  _i
       130  0.11  ciii         13  0.01  yi
        35  0.03  ci            9  0.01  oiii
         4  0.00  ciiiii        6  0.00  ji
     -----  ----  ----          5  0.00  xi
      1209  1.00  TOT           4  0.00  gi
                                3  0.00  oii
                                1  0.00  si
                            -----  ----  ----
                             1477  1.00  TOT

  Hm. According to this table, strings of two or more \i/s
  occur only after a \c/ stroke.  
  
  The exceptions \oiii/ (9 instances) and \oii/ (3 instances) can
  easily be explained as misreadings of \ciiii/ (330 instances) and
  \ciii/ (130 instances).  If the probability of misreading \ci/ as \o/ is
  independent of context, then we can expect that about 20 of the \oi/s
  are actually \cii/s.
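
  The arithmetic behind that estimate can be sketched as follows.  The
  counts are copied from the table above; pooling the two exceptional
  cases into a single rate is my own simplification, and the variable
  names are made up.

```python
# Counts from the table above (o stands in for a misread ci).
misread = {"oiii": 9, "oii": 3}        # presumed misreadings
correct = {"ciiii": 330, "ciii": 130}  # the corresponding correct readings
cii_count = 710                        # occurrences of cii

# Pooled ratio of misread to correctly read instances.
rate = sum(misread.values()) / sum(correct.values())

# Expected number of oi that are really cii, if the rate is context-independent.
expected_oi = rate * cii_count

print(f"rate = {rate:.3f}, expected misread cii = {expected_oi:.1f}")
```

  This gives a rate of about 2.6% and roughly 18.5 expected misreadings,
  the same order as the figure of 20 quoted above.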
  
  Let's retry with some additional context:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 1 -colw 20 \
          'cii*' \
          '[^ci]ii*'

       403  0.33  ciix        893  0.60  oix
       329  0.27  ciiiiu      282  0.19  _ix
       271  0.22  ciis        178  0.12  ois
       109  0.09  ciiiu        79  0.05  _is
        32  0.03  ciij          9  0.01  oiiiu
        19  0.02  ciiis         8  0.01  yix
        15  0.01  cix           6  0.00  jix
         8  0.01  cic           4  0.00  yis
         4  0.00  cis           4  0.00  oij
         4  0.00  ciiu          3  0.00  xix
         4  0.00  ciiiiiu       3  0.00  gix
         4  0.00  ci_           2  0.00  xis
         2  0.00  cil           2  0.00  oiiu
         2  0.00  ciiix         1  0.00  yij
         1  0.00  ciq           1  0.00  six
         1  0.00  cio           1  0.00  oiis
         1  0.00  ciiiis        1  0.00  gis
     -----  ----  ----      -----  ----  ----
      1209  1.00  TOT        1477  1.00  TOT

  According to this table, it seems that:
  
    The pairs \ix/, \is/, and \ij/ are letters: they 
    can appear after \o/, \ci/, and word-begin, but also after
    other strokes like \c/, \y/, \j/, \g/,  \s/, \x/.
    
    Strings of one or more \i/ ending with the \u/ plume can only 
    occur after \c/.  The exceptions \oiiiu/ (9) and \oiiu/ (2)
    can be explained as misreadings of \ciiiiu/ (329)
    and \ciiiu/ (109).  That would suggest that about 3% of the 
    \ci/s in this context were misread as \o/. 
    
    Strings of two or more \i/s with the \s/, \j/, and \x/
    plumes can appear only after \c/. The exception \oiis/ (1)
    can be explained as a misreading of \ciiis/ (19).
    
  Here is the \c/ column again, manually sorted by last letter:
  
         4  0.00  ci_
         8  0.01  cic
         2  0.00  cil
         1  0.00  cio
         1  0.00  ciq

        32  0.03  ciij

         4  0.00  cis
       271  0.22  ciis
        19  0.02  ciiis
         1  0.00  ciiiis

         4  0.00  ciiu
       109  0.09  ciiiu
       329  0.27  ciiiiu
         4  0.00  ciiiiiu

        15  0.01  cix
       403  0.33  ciix
         2  0.00  ciiix
     -----  ----  ----
      1209  1.00  TOT
  
  From all this data, it seems we can draw the following hypotheses:
  
    The strings \ij/, \is/, and \ix/ are letters.
  
    It is possible that \iis/ is a rare letter, too.
    
    The pair \ci/ is often a letter, but sometimes it is not.
    In particular, \cix/ is the letter \ix/ following a \c/.
    
    The only common strings that end with the \u/ plume are \ciiiu/
    and \ciiiiu/ (\ciiu/ and \ciiiiiu/ occur just 4 times each).
    
  The last observation has a number of possible explanations:
    
    (1) \ciiiu/ and \ciiiiu/ are letters; or
      
    (2) \iiu/ and \iiiu/ are letters that can occur only after \ci/; or 
      
    (3) \iiiu/ and \iiiiu/ are letters that can occur only after \c/; or 
      
    (4) \iiiu/ is a letter that can occur after \c/ or \ci/.
      
  The mixed hypotheses
  
    (5) \iiu/ and \iiiu/ are letters that can occur after \c/ or \ci/

    (6) \iiiu/ and \iiiiu/ are letters that can occur after \c/ or \ci/
      
  are rather unlikely, given the low frequency of \ciiu/ and \ciiiiiu/. 
  
  Hypothesis (2) has the merit that it provides an alternative
  explanation for the rare occurrences of \oiiu/ and \oiiiu/,
  not depending on transcription errors.
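
  To compare hypotheses (2) to (6) mechanically, one can list the
  \u/-final strings that each one predicts and look up their observed
  counts.  The sketch below does that with the counts from the sorted
  \c/ column; the cutoff of 10 instances for calling a string "rare" is
  an arbitrary choice of mine, and all names are hypothetical.

```python
# Observed counts of u-final strings after c, from the sorted column above.
observed = {"ciiu": 4, "ciiiu": 109, "ciiiiu": 329, "ciiiiiu": 4}

# Each hypothesis: (allowed left contexts, candidate letters).
hypotheses = {
    "(2)": (["ci"], ["iiu", "iiiu"]),
    "(3)": (["c"], ["iiiu", "iiiiu"]),
    "(4)": (["c", "ci"], ["iiiu"]),
    "(5)": (["c", "ci"], ["iiu", "iiiu"]),
    "(6)": (["c", "ci"], ["iiiu", "iiiiu"]),
}

def predicted(prefixes, letters):
    """All strings obtained by putting a candidate letter after an allowed context."""
    return {p + l for p in prefixes for l in letters}

RARE = 10  # arbitrary cutoff, an assumption of this sketch
for name, (prefixes, letters) in hypotheses.items():
    strings = sorted(predicted(prefixes, letters), key=len)
    rare = [s for s in strings if observed.get(s, 0) < RARE]
    print(name, strings, "rare:", rare)
```

  Hypotheses (2), (3), and (4) predict exactly the two common strings,
  while (5) and (6) each also predict a string that is seen only 4
  times.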
    
  Let's be conservative, and lump only \ix/, \ij/, \is/, \iiu/ 
  as letters, leaving out the first \i/ of \iiiu/ and \iis/.
  Here is a table of these \i/ letters and their left contexts:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 17 \
          'i*k' \
          'i*e' \
          'i*r' \
          'i*n'
          
    32  0.86  ak      893  0.55  oe      271  0.48  ar      329  0.72  ain
     4  0.11  ok      403  0.25  ae      178  0.32  or      109  0.24  an
     1  0.03  yk      282  0.17  _e       79  0.14  _r        9  0.02  oin
 -----  ----  ---      15  0.01  ce       19  0.03  air       4  0.01  cn
    37  1.00  TOT       8  0.00  ye        4  0.01  yr        4  0.01  aii
                        6  0.00  je        4  0.01  cr        2  0.00  on
                        3  0.00  ge        2  0.00  er    -----  ----  ---
                        3  0.00  e         1  0.00  oir     457  1.00  TOT
                        2  0.00  aie       1  0.00  gr   
                        1  0.00  se        1  0.00  aii  
                    -----  ----  ---   -----  ----  ---  
                     1616  1.00  TOT     560  1.00  TOT  
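
  The sed substitutions above are applied in order, and the order
  matters: because \ix/ is replaced before \ci/, the string \cix/ comes
  out as `ce' rather than `ax'.  A small Python equivalent of that
  substitution chain, for checking individual words by hand (the
  function name is made up; the rules are the ones in the sed script):

```python
# Same rules, same order, as the sed script above.
RULES = [("ij", "k"), ("ix", "e"), ("is", "r"), ("iiu", "n"), ("ci", "a")]

def to_letters(word):
    """Wrap the word in '_' and apply each substitution globally, in order."""
    s = "_" + word + "_"
    for old, new in RULES:
        s = s.replace(old, new)
    return s

for w in ["ciiiiu", "ciiiu", "ciix", "cix", "oix"]:
    print(w, "->", to_letters(w))
```

  For example, \ciiiiu/ becomes `ain' and \ciiiu/ becomes `an', which
  are the two top entries of the last column.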


  Note the occurrences of \y/ before the \i/ letters `k', `e', and `r'
  (1, 8, and 4 instances), though not before `n'.
  This data confirms our previous guess that the \cy/ group is merely the
  final form of \ci/.  Let's do this reduction, too:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/cy/a/g' \
          -e 's/ci/a/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 17 \
          'i*k' \
          'i*e' \
          'i*r' \
          'i*n'

        33  0.89  ak      893  0.55  oe      275  0.49  ar      329  0.72  ain
         4  0.11  ok      411  0.25  ae      178  0.32  or      109  0.24  an
     -----  ----  ---     282  0.17  _e       79  0.14  _r        9  0.02  oin
        37  1.00  TOT      15  0.01  ce       19  0.03  air       4  0.01  cn
                            6  0.00  je        4  0.01  cr        4  0.01  aii
                            3  0.00  ge        2  0.00  er        2  0.00  on
                            3  0.00  e         1  0.00  oir   -----  ----  ---
                            2  0.00  aie       1  0.00  gr      457  1.00  TOT
                            1  0.00  se        1  0.00  aii  
                        -----  ----  ---   -----  ----  ---  
                         1616  1.00  TOT     560  1.00  TOT  

  Now, let's check whether the combinations \cy/, \ci/ and \cg/ behave
  like \c/ on the left.  To reduce the number of distinct patterns, I
  will collapse \cs/ to \c/, and erase all gallows:

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]//' \
          -e 's/cs/c/g' \
          -e 's/ci/a/g' \
          -e 's/cy/a/g' \
          -e 's/cg/8/g' \
          -e 's/c\([^ca8o]\)/C\1/g' \
          -e 's/o\([^ca8o]\)/O\1/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 20 \
          '[ca8o]*C' \
          '[ca8o]*8' \
          '[ca8o]*a' \
          '[ca8o]*O'

        10  0.19  _cccC      456  0.22  _ccc8      443  0.11  _ccc8a     517  0.46  _O
         6  0.11  xC         322  0.16  _8         418  0.10  qoa        157  0.14  qO
         6  0.11  _C         207  0.10  qoc8       264  0.07  _8a        102  0.09  xO
         5  0.09  _ccC       201  0.10  qocc8      211  0.05  _ccca       65  0.06  _cccO
         4  0.07  _ccccC     150  0.07  xccc8      203  0.05  qoc8a       61  0.05  _cO
         3  0.06  xccccC      95  0.05  _oc8       196  0.05  qocc8a      41  0.04  _ccO
         3  0.06  OC          74  0.04  _cccc8     180  0.04  _oa         36  0.03  sO
         2  0.04  xcC         65  0.03  _occ8      175  0.04  _cccca      33  0.03  qoO
         2  0.04  qocccC      57  0.03  xcc8       174  0.04  xa          26  0.02  _8O
         2  0.04  qoaC        54  0.03  _cc8       140  0.03  xccc8a      18  0.02  _oO
         1  0.02  xccC        51  0.02  x8         108  0.03  _a           9  0.01  xcccO
         1  0.02  scccC       35  0.02  qoccc8      93  0.02  _oc8a        8  0.01  _ocO
         1  0.02  sccC        31  0.02  xc8         87  0.02  _ccccc       6  0.01  xccO
         1  0.02  qoccC       29  0.01  _occc8      85  0.02  qocca        6  0.01  qocO
         1  0.02  qocC        27  0.01  _c8         85  0.02  _ca          6  0.01  _ccccO
         1  0.02  qcc8aC      26  0.01  _8ccc8      72  0.02  _cccc8       6  0.01  _8ccO
         1  0.02  _occca      18  0.01  _ccccc      68  0.02  xccca        4  0.00  _occO
         1  0.02  _cC         16  0.01  _accc8      64  0.02  sa           4  0.00  _ccc8O
         1  0.02  _acccc      14  0.01  _acc8       62  0.02  _occ8a       4  0.00  _8cccO
         1  0.02  _aC         12  0.01  xcccc8      59  0.01  _cca         3  0.00  jO
         1  0.02  _8cccc      12  0.01  sccc8       55  0.01  xcc8a        3  0.00  _cc8O
     -----  ----  ----        12  0.01  _ac8        50  0.01  xcca         2  0.00  xcO
        54  1.00  TOT         11  0.01  _ccccc      49  0.01  _cc8a        2  0.00  qoccO
                               7  0.00  qo8         48  0.01  x8a          2  0.00  _occcO
                               6  0.00  _o8         45  0.01  qoca         2  0.00  _ccccc
                               5  0.00  qcc8        40  0.01  _occa        2  0.00  _acccO
                               5  0.00  _8cc8       34  0.01  qoccc8       1  0.00  xccccO
                               4  0.00  scccc8      29  0.01  xc8a         1  0.00  x8O
                               4  0.00  _a8         28  0.01  _occc8       1  0.00  uO
                               4  0.00  _8c8        27  0.01  _c8a         1  0.00  qoc8O
                               3  0.00  qocccc      26  0.01  _8ccc8       1  0.00  jcccO
                               3  0.00  qoa8        22  0.01  _oca         1  0.00  _oaO
                               3  0.00  qccc8       21  0.01  _occca       1  0.00  _coO
                               3  0.00  _8cccc      18  0.00  _ccccc       1  0.00  _c8aO
                               2  0.00  xoccc8      17  0.00  qoccca       1  0.00  _8cccc
                               2  0.00  xo8         16  0.00  xcccca   -----  ----  ----
                               2  0.00  xccccc      15  0.00  _accc8    1134  1.00  TOT
                               2  0.00  xa8         14  0.00  _acc8a  
                               2  0.00  s8          12  0.00  sccca   
                               2  0.00  _cocc8      12  0.00  _ac8a   
                               2  0.00  _acccc      12  0.00  _aa     
                               2  0.00  _8acc8      11  0.00  xcccc8  
                               2  0.00  _8ac8       11  0.00  xca     
                               1  0.00  xccccc      10  0.00  sccc8a  
                               1  0.00  x8ccc8       9  0.00  _ccccc  
                               1  0.00  x88          9  0.00  _acca   
                               1  0.00  scc8         8  0.00  _accca  
                               1  0.00  sa8          7  0.00  ja      
                               1  0.00  qoc8cc       6  0.00  xccccc  
                               1  0.00  qoc8c8       6  0.00  _o8a    
                               1  0.00  qoacc8       6  0.00  _ccccc  
                               1  0.00  qo8ccc       5  0.00  qo8a    
                               1  0.00  qo8cc8       5  0.00  _acccc  
                               1  0.00  qcccc8       5  0.00  _8cc8a  
                               1  0.00  jcc8         4  0.00  scccc8  
                               1  0.00  jc8          4  0.00  qcc8a   
                               1  0.00  j8           4  0.00  qca     
                               1  0.00  _occo8       4  0.00  _aca    
                               1  0.00  _oa8         4  0.00  _a8a    
                               1  0.00  _o8ccc       4  0.00  _8ccca  
                               1  0.00  _co8         4  0.00  _8c8a   
                               1  0.00  _ccocc       3  0.00  xoa     
                               1  0.00  _ccocc       3  0.00  ua      
                               1  0.00  _cco8        3  0.00  qocccc  
                               1  0.00  _8oc8        3  0.00  qoa8a   
                               1  0.00  _8cccc       3  0.00  qccc8a  
                           -----  ----  ----         3  0.00  jca     
                            2063  1.00  TOT          3  0.00  _occcc  
                                                     3  0.00  _ccoa   
                                                     3  0.00  _8cccc  
                                                     3  0.00  _8cccc  
                                                     3  0.00  _8cca   
                                                     2  0.00  xoccc8  
                                                     2  0.00  xccccc  
                                                     2  0.00  xccccc  
                                                     2  0.00  scca    
                                                     2  0.00  qcca    
                                                     2  0.00  jcca    
                                                     2  0.00  ga      
                                                     2  0.00  _cocc8  
                                                     2  0.00  _acccc  
                                                     2  0.00  _acccc  
                                                     2  0.00  _8acca  
                                                     2  0.00  _8acc8  
                                                     2  0.00  _8ac8a  
                                                     2  0.00  _8aa    
                                                     1  0.00  xocca   
                                                     1  0.00  xo8a    
                                                     1  0.00  xccccc  
                                                     1  0.00  xc8ca   
                                                     1  0.00  xa8a    
                                                     1  0.00  x8ccc8  
                                                     1  0.00  x88a    
                                                     1  0.00  scccca  
                                                     1  0.00  scc8a   
                                                     1  0.00  sca     
                                                     1  0.00  sacca   
                                                     1  0.00  sa8a    
                                                     1  0.00  s8a     
                                                     1  0.00  qocccc  
                                                     1  0.00  qoc8cc  
                                                     1  0.00  qoc8c8  
                                                     1  0.00  qoacc8  
                                                     1  0.00  qoaa    
                                                     1  0.00  qo8cca  
                                                     1  0.00  qo8cc8  
                                                     1  0.00  qo8aca  
                                                     1  0.00  qccca   
                                                     1  0.00  qcc8cc  
                                                     1  0.00  qaa     
                                                     1  0.00  qa      
                                                     1  0.00  jcc8a   
                                                     1  0.00  jc8a    
                                                     1  0.00  j8a     
                                                     1  0.00  _occo8  
                                                     1  0.00  _oc8cc  
                                                     1  0.00  _oaccc  
                                                     1  0.00  _oa8a   
                                                     1  0.00  _o8ccc  
                                                     1  0.00  _coccc  
                                                     1  0.00  _co8a   
                                                     1  0.00  _ccocc  
                                                     1  0.00  _ccocc  
                                                     1  0.00  _ccocc  
                                                     1  0.00  _cco8a  
                                                     1  0.00  _cccoa  
                                                     1  0.00  _ccccc  
                                                     1  0.00  _ccccc  
                                                     1  0.00  _cccaa  
                                                     1  0.00  _ccc8c  
                                                     1  0.00  _ccc8a  
                                                     1  0.00  _ccacc  
                                                     1  0.00  _caa    
                                                     1  0.00  _aoa    
                                                     1  0.00  _acccc  
                                                     1  0.00  _8occa  
                                                     1  0.00  _8oc8a  
                                                     1  0.00  _8cccc  
                                                     1  0.00  _8aca   
                                                 -----  ----  ----    
                                                  4017  1.00  TOT     
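
  A Python rendering of the reduction used for this table, for checking
  single words by hand.  It mirrors the sed script step by step,
  including one detail worth keeping in mind: the gallows substitution
  has no `g' flag, so sed removes only the first gallows in each word.
  The function name and the test words are hypothetical.

```python
import re

def reduce_word(word):
    """Apply the same substitutions, in the same order, as the sed script above."""
    s = "_" + word + "_"
    s = re.sub(r"[ql][jg]", "", s, count=1)  # first gallows only (no 'g' flag in sed)
    s = s.replace("cs", "c")
    s = s.replace("ci", "a")
    s = s.replace("cy", "a")
    s = s.replace("cg", "8")
    s = re.sub(r"c([^ca8o])", r"C\1", s)     # mark a c that ends a c-string
    s = re.sub(r"o([^ca8o])", r"O\1", s)     # mark an o that ends a c-string
    return s

# Hypothetical words, not taken from the transcription.
for w in ["qocccg", "cccx", "ox", "qjclgx"]:
    print(w, "->", reduce_word(w))
```

  In the last example the second gallows survives the reduction, so
  words with two gallows would show up with stray \l/ or \q/ strokes in
  the contexts.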

  Let's recount, with narrower left contexts (all the \c/s and one more letter):

    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]//' \
          -e 's/cs/c/g' \
          -e 's/ci/a/g' \
          -e 's/cy/a/g' \
          -e 's/cg/8/g' \
          -e 's/c\([^ca8o]\)/C\1/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 20 \
          'c*C' \
          'c*8' \
          'c*a' \
          'c*o'

    10  0.19  _cccC      486  0.23  _ccc8     1949  0.47  8a        1230  0.45  qo
     6  0.11  xC         367  0.17  _8         613  0.15  oa        1066  0.39  _o
     6  0.11  _C         305  0.14  oc8        221  0.05  _ccca      110  0.04  xo
     5  0.09  aC         269  0.13  occ8       216  0.05  _a          80  0.03  _co
     5  0.09  _ccC       150  0.07  xccc8      180  0.04  _cccca      68  0.02  _ccco
     4  0.07  _ccccC      77  0.04  _cccc8     175  0.04  xa          55  0.02  _cco
     3  0.06  xccccC      67  0.03  occc8      128  0.03  occa        37  0.01  8o
     3  0.06  oC          60  0.03  _cc8        92  0.02  _ca         36  0.01  so
     2  0.04  xcC         57  0.03  xcc8        89  0.02  _ccccc       9  0.00  xccco
     2  0.04  occcC       53  0.03  x8          73  0.02  _cca         6  0.00  xcco
     1  0.02  xccC        32  0.02  _c8         68  0.02  xccca        6  0.00  _cccco
     1  0.02  scccC       31  0.01  xc8         67  0.02  oca          6  0.00  8cco
     1  0.02  sccC        21  0.01  o8          66  0.02  sa           4  0.00  8ccco
     1  0.02  occC        19  0.01  _ccccc      50  0.01  xcca         3  0.00  jo
     1  0.02  ocC         17  0.01  acc8        39  0.01  occca        3  0.00  ao
     1  0.02  accccC      16  0.01  accc8       16  0.00  xcccca       2  0.00  xco
     1  0.02  _cC         14  0.01  ac8         12  0.00  sccca        2  0.00  accco
     1  0.02  8ccccC      12  0.01  xcccc8      11  0.00  xca          2  0.00  _ccccc
 -----  ----  ----        12  0.01  sccc8        8  0.00  8cca         1  0.00  xcccco
    54  1.00  TOT         11  0.01  a8           7  0.00  ja           1  0.00  uo
                          11  0.01  _ccccc       7  0.00  _ccccc       1  0.00  jccco
                           5  0.00  qcc8         6  0.00  xccccc       1  0.00  8cccco
                           4  0.00  scccc8       4  0.00  qca      -----  ----  ----
                           3  0.00  qccc8        4  0.00  occcca    2729  1.00  TOT
                           3  0.00  occcc8       4  0.00  8ccca   
                           2  0.00  xccccc       3  0.00  ua      
                           2  0.00  s8           3  0.00  jca     
                           2  0.00  acccc8       3  0.00  8cccca  
                           1  0.00  xccccc       2  0.00  xccccc  
                           1  0.00  scc8         2  0.00  scca    
                           1  0.00  qcccc8       2  0.00  qcca    
                           1  0.00  jcc8         2  0.00  qa      
                           1  0.00  jc8          2  0.00  jcca    
                           1  0.00  j8           2  0.00  ga      
                       -----  ----  ----         1  0.00  scccca  
                        2114  1.00  TOT          1  0.00  sca     
                                                 1  0.00  qccca   
                                                 1  0.00  _ccccc  
                                                 1  0.00  8ca     
                                             -----  ----  ----    
                                              4131  1.00  TOT     


97-07-27 stolfi
===============

  Again, ignoring repeated words:
  
    cat bio-j-jsa-gut.dic \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]//' \
          -e 's/cs/c/g' \
          -e 's/ci/a/g' \
          -e 's/cy/a/g' \
          -e 's/cg/8/g' \
          -e 's/c\([^ca8o]\)/C\1/g' \
      | compare-contexts -lctx 1 -rctx 0 -colw 20 \
          'c*C' \
          'c*8' \
          'c*a' \
          'c*o'

     5  0.14  xC          79  0.19  _8         331  0.35  8a         258  0.41  _o
     5  0.14  aC          37  0.09  _ccc8      108  0.11  _a         163  0.26  qo
     3  0.08  xccccC      34  0.08  x8          79  0.08  oa          56  0.09  xo
     3  0.08  oC          30  0.07  xccc8       75  0.08  xa          31  0.05  _cco
     3  0.08  _cccC       27  0.06  xcc8        39  0.04  sa          30  0.05  _co
     3  0.08  _ccC        26  0.06  occc8       38  0.04  _ccca       19  0.03  so
     2  0.05  xcC         23  0.05  oc8         37  0.04  _cca        19  0.03  _ccco
     2  0.05  occcC       21  0.05  occ8        32  0.03  _ca         18  0.03  8o
     2  0.05  _ccccC      21  0.05  _cc8        23  0.02  _cccca       7  0.01  xccco
     1  0.03  xccC        18  0.04  o8          22  0.02  xccca        5  0.01  8cco
     1  0.03  scccC       14  0.03  _cccc8      21  0.02  occca        4  0.01  xcco
     1  0.03  sccC        10  0.02  a8          18  0.02  xcca         4  0.01  _cccco
     1  0.03  occC         9  0.02  _ccccc      16  0.02  occa         3  0.00  jo
     1  0.03  ocC          9  0.02  _ccccc      14  0.01  _ccccc       3  0.00  ao
     1  0.03  accccC       8  0.02  xc8         10  0.01  xcccca       2  0.00  xco
     1  0.03  _cC          7  0.02  xcccc8       8  0.01  xca          2  0.00  accco
     1  0.03  _C           7  0.02  _c8          8  0.01  sccca        2  0.00  _ccccc
     1  0.03  8ccccC       6  0.01  accc8        7  0.01  oca          2  0.00  8ccco
 -----  ----  ----         6  0.01  acc8         7  0.01  ja           1  0.00  xcccco
    37  1.00  TOT          4  0.01  sccc8        7  0.01  8cca         1  0.00  uo
                           4  0.01  qcc8         6  0.01  _ccccc       1  0.00  jccco
                           4  0.01  ac8          4  0.00  qca          1  0.00  8cccco
                           3  0.01  scccc8       3  0.00  xccccc   -----  ----  ----
                           2  0.00  xccccc       3  0.00  occcca     632  1.00  TOT
                           2  0.00  s8           3  0.00  jca     
                           2  0.00  qccc8        3  0.00  8ccca   
                           2  0.00  occcc8       2  0.00  xccccc  
                           2  0.00  acccc8       2  0.00  ua      
                           1  0.00  xccccc       2  0.00  scca    
                           1  0.00  scc8         2  0.00  qcca    
                           1  0.00  qcccc8       2  0.00  qa      
                           1  0.00  jcc8         2  0.00  jcca    
                           1  0.00  jc8          2  0.00  ga      
                           1  0.00  j8           2  0.00  8cccca  
                       -----  ----  ----         1  0.00  scccca  
                         423  1.00  TOT          1  0.00  sca     
                                                 1  0.00  qccca   
                                                 1  0.00  _ccccc  
                                                 1  0.00  8ca     
                                             -----  ----  ----    
                                               943  1.00  TOT     

  These data suggest that, ignoring gallows and \s/-plumes, the 
  \c/ strings always end in \ci/, \cy/, \cg/ or \o/. 
  
  Let's look again at the \c/ strings and their relationship to
  \s/-plumes and gallows.  For simplicity, let's map all gallows to
  `H', and \cs/ to `z'; for consistency, let's map \cy/ and \ci/ to `a',
  \cg/ to `8'.
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 24 \
          '[czH][czH]*a' \
          '[czH][czH]*o' \
          '[czH][czH]*8' \
          '[czH][czH]*[^czHao8]'

       718 0.40 Ha               106 0.30 Ho               379 0.23 Hc8               18 0.20 z_
       169 0.09 Hcca              63 0.18 zo               319 0.19 Hcc8              11 0.12 cce
       133 0.07 ccca              35 0.10 zcco             317 0.19 ccc8               9 0.10 cccz_
       108 0.06 zcca              34 0.10 ccco             296 0.18 zcc8               7 0.08 H_
        79 0.04 Hca               24 0.07 zco               84 0.05 Hccc8              6 0.07 He
        70 0.04 za                23 0.07 cco               54 0.03 cc8                5 0.05 ccccz_
        54 0.03 cccHca            19 0.05 Hco               51 0.03 zccc8              4 0.04 zce
        48 0.03 cca               13 0.04 Hcco              28 0.02 cccc8              4 0.04 zcccz_
        40 0.02 ccccHca            6 0.02 Hccco             22 0.01 Hzcc8              3 0.03 zccH_
        39 0.02 zccHca             4 0.01 cHco              21 0.01 zc8                3 0.03 Hcn
        39 0.02 cccca              4 0.01 Hzcco             10 0.01 cccHcc8            2 0.02 zc_
        37 0.02 zcccHca            3 0.01 zccco              9 0.01 cccHc8             2 0.02 ccz_
        33 0.02 zccca              3 0.01 cccco              9 0.01 cHcc8              2 0.02 ccr
        33 0.02 Hccca              1 0.00 zzcco              9 0.01 Hzc8               2 0.02 cccH_
        21 0.01 zccHa              1 0.00 zccHo              6 0.00 zcccHcc8           1 0.01 ze
        18 0.01 zca                1 0.00 zcHo               6 0.00 zccHcc8            1 0.01 zcr
        18 0.01 cccHa              1 0.00 zcHco              6 0.00 cHc8               1 0.01 zccz_
        14 0.01 cHcca              1 0.00 czco               5 0.00 H8                 1 0.01 cq
        14 0.01 Hzcca              1 0.00 ccccco             4 0.00 zzcc8              1 0.01 cn
        13 0.01 zcccHa             1 0.00 ccccHco            4 0.00 c8                 1 0.01 cl
        12 0.01 ccccHa             1 0.00 cccHo              3 0.00 ccccHcc8           1 0.01 cccHc_
        12 0.01 cHca               1 0.00 cccHco             3 0.00 ccHc8              1 0.01 cc_
        11 0.01 zccHcca            1 0.00 ccHo               2 0.00 zcccHc8            1 0.01 cHccz_
        11 0.01 ccHa               1 0.00 cHcco              2 0.00 zccHc8             1 0.01 Hcz_
         6 0.00 cccHcca            1 0.00 Hzco               2 0.00 cccHccc8           1 0.01 Hcr
         6 0.00 cHa            ----- ---- ----               2 0.00 cHccc8             1 0.01 Hccz_
         5 0.00 ca               349 1.00 TOT                2 0.00 Hcccc8             1 0.01 Hccccl
         5 0.00 Hcccca                                       1 0.00 zzcHcc8        ----- ---- ----
         4 0.00 zcHa                                         1 0.00 zcz8              91 1.00 TOT
         4 0.00 ccccHcca                                     1 0.00 zccHccc8      
         3 0.00 Hzca                                         1 0.00 ccHzccc8      
         2 0.00 zcccHcca                                     1 0.00 ccHccc8       
         2 0.00 zHcca                                        1 0.00 ccHcc8        
         2 0.00 cccza                                        1 0.00 Hczc8         
         2 0.00 Hczcca                                       1 0.00 Hcz8          
         1 0.00 zzcccHca                                     1 0.00 Hccz8         
         1 0.00 zzcca                                    ----- ---- ----          
         1 0.00 zzccHa                                    1664 1.00 TOT           
         1 0.00 zcccca                                                            
         1 0.00 zccccHcca                                                         
         1 0.00 zccHccca                                                          
         1 0.00 zcHcca                                                            
         1 0.00 zHa                                                               
         1 0.00 ccHzccca                                                          
         1 0.00 ccHcca                                                            
         1 0.00 cHccca                                                            
         1 0.00 Hczca                                                             
         1 0.00 Hccza                                                             
     ----- ---- ----                                                              
      1798 1.00 TOT                                                               
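
  The recoding step of the pipeline above can be tried on its own.
  Below is a minimal sketch that wraps the same sed script in a shell
  function; the name recode_jsa and the three input words are made up
  for illustration and are not taken from the transcription.  Note that
  the gallows rule carries no `g' flag, so sed replaces only the first
  gallows in each word; the sketch keeps that behaviour unchanged.

```shell
# recode_jsa (hypothetical name): read jsa words, one per line, on stdin
# and write the recoded form used for the context tables.  The sed rules
# are copied from the pipeline above, including the gallows rule
# without a `g' flag (only the first gallows per word becomes `H').
recode_jsa() {
  sed \
    -e 's/^/_/g' \
    -e 's/$/_/g' \
    -e 's/[ql][jg]/H/' \
    -e 's/cs/z/g' \
    -e 's/ij/k/g' \
    -e 's/ix/e/g' \
    -e 's/is/r/g' \
    -e 's/iiu/n/g' \
    -e 's/y/i/g' \
    -e 's/ci/a/g' \
    -e 's/cg/8/g'
}

# Made-up sample words in the jsa alphabet (illustration only).
printf '%s\n' oqjci cscccy qjciqjcg | recode_jsa
```

  The three sample words come out as _oHa_, _zcca_ and _Haqj8_; in the
  last one the second gallows (qj) is left unmapped.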

  From this table it seems (again) that \ci/ and \o/ are equivalent,
  and that \cg/ is similar, but less so.  Also, virtually all \c/ strings
  end with one of these three letters; only a very few end with \ix/.   
  
  Below I have marked with `#/*', `@/&', `+', and `-' the contexts with 
  at least one frequency greater than 0.20, 0.08, 0.04, and 0.02,
  respectively:
  
     # 718 0.40 Ha             # 106 0.30 Ho             * 379 0.23 Hc8       
     @ 169 0.09 Hcca           &  63 0.18 zo             @ 319 0.19 Hcc8      
     & 133 0.07 ccca           &  35 0.10 zcco           & 317 0.19 ccc8      
     & 108 0.06 zcca           &  34 0.10 ccco           & 296 0.18 zcc8      
     *  79 0.04 Hca            +  24 0.07 zco            +  84 0.05 Hccc8     
     &  70 0.04 za             +  23 0.07 cco            +  54 0.03 cc8       
     -  54 0.03 cccHca         *  19 0.05 Hco            -  51 0.03 zccc8     
     +  48 0.03 cca            @  13 0.04 Hcco              28 0.02 cccc8     
        40 0.02 ccccHca        +   6 0.02 Hccco             22 0.01 Hzcc8     
        39 0.02 zccHca             4 0.01 cHco           +  21 0.01 zc8       
        39 0.02 cccca              4 0.01 Hzcco             10 0.01 cccHcc8   
        37 0.02 zcccHca        -   3 0.01 zccco          -   9 0.01 cccHc8    
     -  33 0.02 zccca              3 0.01 cccco              9 0.01 cHcc8     
     +  33 0.02 Hccca              1 0.00 zzcco              9 0.01 Hzc8      
        21 0.01 zccHa              1 0.00 zccHo              6 0.00 zcccHcc8  
     +  18 0.01 zca                1 0.00 zcHo               6 0.00 zccHcc8   
        18 0.01 cccHa              1 0.00 zcHco              6 0.00 cHc8      
        14 0.01 cHcca              1 0.00 czco           #   5 0.00 H8        
        14 0.01 Hzcca              1 0.00 ccccco             4 0.00 zzcc8     
        13 0.01 zcccHa             1 0.00 ccccHco            4 0.00 c8        
        12 0.01 ccccHa             1 0.00 cccHo              3 0.00 ccccHcc8  
        12 0.01 cHca           -   1 0.00 cccHco             3 0.00 ccHc8     
        11 0.01 zccHcca            1 0.00 ccHo               2 0.00 zcccHc8   
        11 0.01 ccHa               1 0.00 cHcco              2 0.00 zccHc8    
  
  Checking again, without repeated words:
  
    cat bio-j-jsa-gut.dic \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 24 \
          '[czH][czH]*a' \
          '[czH][czH]*o' \
          '[czH][czH]*8' \
          '[czH][czH]*[^czHao8]'  

       130 0.31 Ha                59 0.35 Ho                40 0.14 Hc8               12 0.18 z_
        28 0.07 cca               21 0.12 zo                37 0.13 ccc8               8 0.12 cce
        24 0.06 ccca              12 0.07 zco               33 0.12 Hcc8               6 0.09 H_
        23 0.06 Hcca              11 0.06 Hco               23 0.08 zcc8               5 0.08 He
        19 0.05 zcca              11 0.06 Hcco              23 0.08 cc8                4 0.06 zcccz_
        15 0.04 za                10 0.06 zcco              23 0.08 Hccc8              3 0.05 zce
        15 0.04 Hccca             10 0.06 cco               12 0.04 zc8                3 0.05 ccccz_
        15 0.04 Hca               10 0.06 ccco              11 0.04 zccc8              2 0.03 zc_
        11 0.03 cHca               4 0.02 cHco              11 0.04 Hzcc8              2 0.03 ccz_
        10 0.02 ccHa               4 0.02 Hzcco              9 0.03 cccc8              2 0.03 ccr
         9 0.02 cHcca              3 0.02 cccco              7 0.02 Hzc8               2 0.03 cccz_
         8 0.02 zccHa              2 0.01 Hccco              6 0.02 cHcc8              2 0.03 cccH_
         8 0.02 Hzcca              1 0.01 zzcco              5 0.02 cccHcc8            1 0.02 ze
         7 0.02 cccca              1 0.01 zccco              5 0.02 cHc8               1 0.02 zcr
         6 0.01 zca                1 0.01 zccHo              5 0.02 H8                 1 0.02 zccz_
         6 0.01 cccHa              1 0.01 zcHo               4 0.01 zcccHcc8           1 0.02 zccH_
         6 0.01 cHa                1 0.01 zcHco              3 0.01 ccccHcc8           1 0.02 cq
         5 0.01 zccca              1 0.01 czco               3 0.01 cccHc8             1 0.02 cn
         5 0.01 zcccHca            1 0.01 ccccco             3 0.01 c8                 1 0.02 cl
         5 0.01 ccccHca            1 0.01 ccccHco            2 0.01 zccHcc8            1 0.02 cccHc_
         5 0.01 cccHca             1 0.01 cccHo              2 0.01 cccHccc8           1 0.02 cc_
         5 0.01 ca                 1 0.01 cccHco             2 0.01 ccHc8              1 0.02 cHccz_
         4 0.01 zccHcca            1 0.01 ccHo               2 0.01 cHccc8             1 0.02 Hcz_
         4 0.01 zccHca             1 0.01 cHcco              1 0.00 zzcc8              1 0.02 Hcr
         4 0.01 ccccHcca           1 0.01 Hzco               1 0.00 zzcHcc8            1 0.02 Hcn
         4 0.01 Hcccca         ----- ---- ----               1 0.00 zcz8               1 0.02 Hccz_
         3 0.01 zcccHa           170 1.00 TOT                1 0.00 zcccHc8            1 0.02 Hccccl
         3 0.01 zcHa                                         1 0.00 zccHccc8       ----- ---- ----
         3 0.01 Hzca                                         1 0.00 zccHc8            66 1.00 TOT
         2 0.00 zHcca                                        1 0.00 ccHzccc8      
         2 0.00 cccza                                        1 0.00 ccHccc8       
         2 0.00 ccccHa                                       1 0.00 ccHcc8        
         2 0.00 cccHcca                                      1 0.00 Hczc8         
         2 0.00 Hczcca                                       1 0.00 Hcz8          
         1 0.00 zzcccHca                                     1 0.00 Hccz8         
         1 0.00 zzcca                                        1 0.00 Hcccc8        
         1 0.00 zzccHa                                   ----- ---- ----          
         1 0.00 zcccca                                     284 1.00 TOT           
         1 0.00 zccccHcca                                                         
         1 0.00 zcccHcca                                                          
         1 0.00 zccHccca                                                          
         1 0.00 zcHcca                                                            
         1 0.00 zHa                                                               
         1 0.00 ccHzccca                                                          
         1 0.00 ccHcca                                                            
         1 0.00 cHccca                                                            
         1 0.00 Hczca                                                             
         1 0.00 Hccza                                                             
     ----- ---- ----                                                              
       414 1.00 TOT                                                               
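
  compare-contexts is a local script whose source is not shown here.
  Judging only from its output, each column is a tally of the distinct
  strings matching one pattern, with counts and fractions of the column
  total.  A rough stand-in for a single column (no left or right
  context) might look like the sketch below.  The name tally_pattern is
  hypothetical, and the sketch assumes a grep that supports the `-o'
  option (GNU and BSD grep do; it is not in POSIX).

```shell
# tally_pattern PATTERN (hypothetical stand-in for one column of
# compare-contexts): read recoded words on stdin, extract every
# leftmost-longest match of PATTERN, and print "count fraction string"
# by decreasing count, followed by a TOT line.
tally_pattern() {
  grep -o -e "$1" \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2,2 \
    | awk '{ cnt[NR] = $1; str[NR] = $2; tot += $1 }
           END {
             for (i = 1; i <= NR; i++)
               printf "%7d %4.2f %s\n", cnt[i], cnt[i] / tot, str[i]
             printf "%7d %4.2f TOT\n", tot, 1
           }'
}

# Illustrative input only: three recoded words, two of them identical.
printf '%s\n' _oHa_ _oHa_ _zcca_ | tally_pattern '[czH][czH]*a'
```

  Feeding it the .wds file counts every occurrence, while the .dic file
  counts each distinct word once, which is the difference between the
  two sets of tables above.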

  Here is a summary of the most common \czH/ contexts.  The numbers are 
  the percentages from the tables above; the `cb' columns hold the
  figures for the \o/ contexts.  Due to an earlier mix-up, all
  percentages except those in the `H' row were computed relative to
  the totals minus the `H' entry.

             in .wds        in .dic
            ----------     ----------
            ci  cb  cg     ci  cb  cg
            --  --  --     --  --  --
     c       0   0   0      2   0   1

     z       6  26   0      5  19   0
     zc      2  10   1      2  11   4
                                     
     H      40  30   0     31  35   2
     
     Hc      7   8  23      5  10  14
     Hcc    16   5  19      8  10  12

     cc      4   9   3     10   9   8
     ccc    12  14  19      8   9  13
     zcc    10  14  18      7   9   8
                                     
     cccc    4   1   2      2   3   3
     zccc    3   1   3      2   1   4
                                     
     Hccc    3   2   5      5   2   8
                                     
     cccHc   5   0   1      2   1   1
     ccccHc  4   0   0      2   1   0
     zccHc   4   0   0      1   0   0
     zcccHc  3   0   0      3   0   0
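
  The normalization described before this table can be checked by hand
  against the .wds counts.  The sketch below is only a worked
  arithmetic example; the function name pct is made up, and the counts
  are read off the first compare-contexts table (`Ha' 718 of 1798,
  `za' 70, `Ho' 106 of 349, `zo' 63).

```shell
# pct COUNT TOTAL H_COUNT (hypothetical helper): percentage of COUNT
# relative to TOTAL minus the bare-`H' entry, rounded to an integer.
pct() {
  awk -v c="$1" -v t="$2" -v h="$3" \
    'BEGIN { printf "%.0f\n", 100 * c / (t - h) }'
}

pct 70 1798 718    # `za' in .wds: 70 / (1798 - 718)
pct 63 349 106     # `zo' in .wds: 63 / (349 - 106)
pct 718 1798 0     # `Ha' in .wds, relative to the full total
```

  These print 6, 26 and 40, which agree with the `z' and `H' rows of
  the .wds half of the table.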
  
  It seems that 
  
     \c/ alone is not a letter
     
     \z/,
     \zc/   are similar letters  
            (most common before \o/)
     
     \cc/, 
     \ccc/, 
     \zcc/  are similar letters 
            (equally common before \a/, \o/, \8/).
     
     \H/    is a common letter
            (most common before \a/ and \o/)
          
     
     \Hc/,
     \Hcc/  are similar letters  
            (most common before \o/ and \8/).
            The latter does not seem to be 
            a group \H/-\cc/.
     
     \Hccc/ may be a letter 
            (most common before \a/ and \o/)
            but may be a group \Hc/-\cc/
            or \H/-\ccc/

     \cccc/,
     \zccc/ may be rare letters
            (equally common before \a/, \o/, \8/),
            but may also be groups \cc/-\cc/ and
            \zc/-\cc/
            
  Let's enumerate these various patterns:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 17 \
          '[^czH]z[^czH]' \
          '[^czH]zc[^czH]' \
          '[^czH]cc[^czH]' \
          '[^czH]ccc[^czH]' \
          '[^czH]zcc[^czH]' \
          '[^czH]H[^czH]' \
          '[^czH]Hc[^czH]' \
          '[^czH]Hcc[^czH]' 

       66 0.44 _za    19 0.27 _zco    22 0.16 _cc8   192 0.40 _ccc8   219 0.50 _zcc8
       63 0.42 _zo    16 0.23 _zca    19 0.14 _cca    97 0.20 eccc8    77 0.18 _zcca
        6 0.04 _z_    13 0.19 ezc8    17 0.12 _cco    78 0.16 _ccca    43 0.10 ezcc8
        5 0.03 ez_     6 0.09 _zc8    15 0.11 ecca    44 0.09 eccca    30 0.07 _zcco
        5 0.03 az_     3 0.04 _zce    15 0.11 ecc8    25 0.05 _ccco    19 0.04 8zcc8
        1 0.01 qza     3 0.04 8zco     9 0.07 _cce    11 0.02 8ccc8    18 0.04 ezcca
        1 0.01 oza     2 0.03 ezco     7 0.05 occ8     7 0.01 rccca     8 0.02 azcc8
        1 0.01 oz_     2 0.03 ezca     6 0.04 8cca     6 0.01 rccc8     6 0.01 rzcc8
        1 0.01 eza     1 0.01 rzc8     5 0.04 8cc8     6 0.01 accc8     6 0.01 azcca
        1 0.01 aza     1 0.01 ezce     3 0.02 qcc8     5 0.01 eccco     4 0.01 ezcco
        1 0.01 _ze     1 0.01 ezc_     3 0.02 ecco     4 0.01 occc8     3 0.01 rzcca
     ---- ---- ----    1 0.01 _zcr     3 0.02 8cco     3 0.01 8ccco     2 0.00 ozcca
      151 1.00 TOT     1 0.01 _zc_     2 0.01 rcca     2 0.00 occca     2 0.00 8zcca
                       1 0.01 8zc8     2 0.01 occa     1 0.00 qccc8     1 0.00 ozcc8
                    ---- ---- ----     2 0.01 jcca     1 0.00 accco     1 0.00 8zcco
                      70 1.00 TOT      2 0.01 ecce     1 0.00 accca  ---- ---- ---- 
                                       1 0.01 rccr     1 0.00 8ccca   439 1.00 TOT  
                                       1 0.01 qcca  ---- ---- ----                  
                                       1 0.01 jcc8   484 1.00 TOT                   
                                       1 0.01 ecc_                                  
                                       1 0.01 acc8                                  
                                       1 0.01 _ccr                                  
                                    ---- ---- ----                                  
                                     138 1.00 TOT                                                                        


      604 0.72 oHa   305 0.63 oHc8   254 0.51 oHcc8       
       61 0.07 eHa    66 0.14 oHca   120 0.24 oHcca       
       51 0.06 oHo    31 0.06 eHc8    31 0.06 eHcca       
       49 0.06 _Ho    27 0.06 _Hc8    29 0.06 eHcc8       
       35 0.04 _Ha    14 0.03 oHco    20 0.04 _Hcc8       
       18 0.02 aHa    14 0.03 aHc8    16 0.03 aHcc8       
        6 0.01 oH_     7 0.01 eHca    10 0.02 aHcca       
        5 0.01 eHo     4 0.01 aHca     7 0.01 _Hcca       
        4 0.00 oHe     3 0.01 oHcn     6 0.01 oHcco       
        3 0.00 oH8     3 0.01 _Hco     6 0.01 _Hcco       
        2 0.00 eHe     2 0.00 eHco     1 0.00 eHcco       
        2 0.00 _H8     2 0.00 _Hca     1 0.00 8Hcca       
        1 0.00 qHo     2 0.00 8Hc8  ---- ---- ----        
        1 0.00 _H_     1 0.00 oHcr   501 1.00 TOT         
     ---- ---- ---- ---- ---- ----                        
      842 1.00 TOT   481 1.00 TOT                         

  Ditto, distinct words only:

    cat bio-j-jsa-gut.dic \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 17 \
          '[^czH]z[^czH]' \
          '[^czH]zc[^czH]' \
          '[^czH]cc[^czH]' \
          '[^czH]ccc[^czH]' \
          '[^czH]zcc[^czH]' \
          '[^czH]H[^czH]' \
          '[^czH]Hc[^czH]' \
          '[^czH]Hcc[^czH]' 

        21 0.44 _zo         8 0.22 ezc8       10 0.14 ecc8       15 0.21 eccc8       8 0.15 ezcc8
        11 0.23 _za         7 0.19 _zco        8 0.11 _cca       11 0.15 eccca       8 0.15 _zcc8
         5 0.10 az_         4 0.11 _zca        7 0.10 ecca       10 0.14 _ccc8       6 0.12 ezcca
         4 0.08 ez_         3 0.08 8zco        7 0.10 _cco        5 0.07 _ccca       6 0.12 _zcco
         1 0.02 qza         2 0.06 ezco        6 0.08 _cce        4 0.06 rccca       5 0.10 _zcca
         1 0.02 oza         2 0.06 ezca        5 0.07 _cc8        4 0.06 eccco       3 0.06 rzcca
         1 0.02 oz_         2 0.06 _zce        5 0.07 8cca        4 0.06 _ccco       3 0.06 ezcco
         1 0.02 eza         2 0.06 _zc8        2 0.03 rcca        3 0.04 occc8       3 0.06 azcca
         1 0.02 aza         1 0.03 rzc8        2 0.03 qcc8        3 0.04 accc8       3 0.06 8zcc8
         1 0.02 _ze         1 0.03 ezce        2 0.03 occa        3 0.04 8ccc8       2 0.04 rzcc8
         1 0.02 _z_         1 0.03 ezc_        2 0.03 occ8        2 0.03 rccc8       1 0.02 ozcca
     ----- ---- ----        1 0.03 _zcr        2 0.03 jcca        2 0.03 occca       1 0.02 ozcc8
        48 1.00 TOT         1 0.03 _zc_        2 0.03 ecce        1 0.01 qccc8       1 0.02 azcc8
                            1 0.03 8zc8        2 0.03 8cco        1 0.01 accco       1 0.02 8zcco
                        ----- ---- ----        2 0.03 8cc8        1 0.01 accca       1 0.02 8zcca
                           36 1.00 TOT         1 0.01 rccr        1 0.01 8ccco   ----- ---- ---- 
                                               1 0.01 qcca        1 0.01 8ccca      52 1.00 TOT  
                                               1 0.01 jcc8    ----- ---- ----                    
                                               1 0.01 ecco       71 1.00 TOT                     
                                               1 0.01 ecc_                                       
                                               1 0.01 acc8                                       
                                               1 0.01 _ccr                                       
                                           ----- ---- ----                                       
                                              71 1.00 TOT                                        



        72 0.35 oHa        23 0.34 oHc8       13 0.19 oHcc8
        34 0.17 _Ho         8 0.12 oHco        9 0.13 oHcca
        23 0.11 eHa         8 0.12 eHc8        9 0.13 eHcc8
        19 0.09 oHo         6 0.09 oHca        7 0.10 eHcca
        19 0.09 _Ha         4 0.06 eHca        6 0.09 oHcco
        16 0.08 aHa         4 0.06 aHca        6 0.09 _Hcc8
         5 0.02 oH_         4 0.06 aHc8        5 0.07 aHcc8
         5 0.02 eHo         4 0.06 _Hc8        4 0.06 _Hcco
         3 0.01 oHe         2 0.03 eHco        3 0.04 aHcca
         3 0.01 oH8         1 0.01 oHcr        3 0.04 _Hcca
         2 0.01 eHe         1 0.01 oHcn        1 0.01 eHcco
         2 0.01 _H8         1 0.01 _Hco        1 0.01 8Hcca
         1 0.00 qHo         1 0.01 _Hca    ----- ---- ---- 
         1 0.00 _H_         1 0.01 8Hc8       67 1.00 TOT  
     ----- ---- ----    ----- ---- ----                    
       205 1.00 TOT        68 1.00 TOT                     

  Obviously the subset of these letters that begins with `H'
  differs from the others in that the `H' is usually
  preceded by \o/, whereas the others are usually word-initial
  (in .wds) or preceded by \ix/ (in .dic).
                                                        
  Let's see what we have left out:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | enum-contexts -vPAT='[czH][czH]*' -vCTX=0 \
      | egrep -v -e '^(z|zc|cc|ccc|zcc|H|Hc|Hcc)$' \
      | wfreq
      
       123 0.15 Hccc
        87 0.11 zccc
        70 0.09 cccc
        65 0.08 cccHc
        41 0.05 zccHc
        41 0.05 ccccHc
        40 0.05 Hzcc
        39 0.05 zcccHc
        25 0.03 zccH
        24 0.03 cHcc
        22 0.03 cHc
        21 0.03 cccH
        17 0.02 zccHcc
        16 0.02 cccHcc
        13 0.02 zcccH
        13 0.02 Hzc
        12 0.02 ccccH
        12 0.02 ccH
        12 0.02 c
        11 0.01 cccz
         8 0.01 zcccHcc
         8 0.01 Hcccc
         7 0.01 ccccHcc
         6 0.01 zzcc
         6 0.01 cH
         5 0.01 zcH
         5 0.01 ccccz
         4 0.01 zcccz
         3 0.00 ccHc
         3 0.00 cHccc
         3 0.00 Hccz
         2 0.00 zccHccc
         2 0.00 zHcc
         2 0.00 ccz
         2 0.00 cccHccc
         2 0.00 ccHzccc
         2 0.00 ccHcc
         2 0.00 Hczcc
         2 0.00 Hczc
         2 0.00 Hcz
         1 0.00 zzcccHc
         1 0.00 zzccH
         1 0.00 zzcHcc
         1 0.00 zcz
         1 0.00 zccz
         1 0.00 zccccHcc
         1 0.00 zcccc
         1 0.00 zcHcc
         1 0.00 zcHc
         1 0.00 zH
         1 0.00 czc
         1 0.00 ccccc
         1 0.00 ccHccc
         1 0.00 cHccz
     ----- ---- ----
       794 1.00 TOT
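
  enum-contexts and wfreq are local tools, so the following is only a
  guess at the filtering step: a sketch that lists the maximal runs of
  `c', `z', `H' and drops the eight candidate letters, using standard
  utilities.  The name leftover_runs is hypothetical, the sample words
  are illustrative, and `grep -o' is assumed to be available (GNU and
  BSD grep have it).

```shell
# leftover_runs (hypothetical name): read recoded words on stdin and
# print every maximal run of `c', `z', `H' that is NOT one of the eight
# candidate letters z, zc, cc, ccc, zcc, H, Hc, Hcc.
leftover_runs() {
  grep -o -e '[czH][czH]*' \
    | grep -v -E -e '^(z|zc|cc|ccc|zcc|H|Hc|Hcc)$'
}

# Illustrative recoded words; `H' and `Hcc' are filtered out,
# `zccc' and `cccHc' are reported.
printf '%s\n' _oHa_ _zccc8_ _cccHca_ _oHcc8_ | leftover_runs | sort | uniq -c
```

  Because the filter is anchored with `^' and `$', a run is dropped only
  when the whole run equals one of the eight letters, not when it merely
  contains one.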


  Again, with some context:
    
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | enum-contexts -vPAT='[^czH][czH][czH]*[^czH]' -vCTX=0 \
      | egrep -v -e '[^czH](z|zc|cc|ccc|zcc|H|Hc|Hcc)[^czH]' \
      | wfreq

        51 0.06 _cccHca
        41 0.05 oHccc8
        38 0.05 _zccc8
        37 0.05 _zccHca
        35 0.04 _zcccHca
        35 0.04 _ccccHca
        34 0.04 _Hccc8
        32 0.04 _cccca
        23 0.03 _zccca
        21 0.03 oHccca
        20 0.03 _cccc8
        19 0.02 _zccHa
        16 0.02 oHzcc8
        16 0.02 _cccHa
        12 0.02 _zcccHa
        12 0.02 _ccccHa
        11 0.01 _zccHcca
        10 0.01 _ccHa
         9 0.01 _cccHc8
         8 0.01 eHccc8
         8 0.01 _cccz_
         8 0.01 _cccHcc8
         8 0.01 _Hccca
         7 0.01 oHzcca
         7 0.01 ezccc8
         7 0.01 _cHcca
         6 0.01 oHzc8
         6 0.01 _zccHcc8
         6 0.01 _cccHcca
         6 0.01 _Hccco
         5 0.01 ezccca
         5 0.01 ecccc8
         5 0.01 _zcccHcc8
         5 0.01 _Hzcca
         4 0.01 ocHcca
         4 0.01 ocHca
         4 0.01 ecccca
         4 0.01 eccccHca
         4 0.01 _zzcc8
         4 0.01 _cHcc8
         4 0.01 _cHca
         4 0.01 _cHc8
         4 0.01 _Hzcc8
         3 0.00 rzccc8
         3 0.00 oHcccca
         3 0.00 jca
         3 0.00 ecccHca
         3 0.00 eHccca
         3 0.00 azccca
         3 0.00 _zccco
         3 0.00 _zccH_
         3 0.00 _zcHa
         3 0.00 _ccccz_
         3 0.00 _ccHc8
         3 0.00 _cHco
         3 0.00 _cHa
         3 0.00 _Hzcco
         2 0.00 rcccHa
         2 0.00 qcHcc8
         2 0.00 qcHc8
         2 0.00 qcHa
         2 0.00 ocHcc8
         2 0.00 oHcccc8
         2 0.00 ezcccz_
         2 0.00 ezcccHca
         2 0.00 ezccHca
         2 0.00 ezccHa
         2 0.00 acHca
         2 0.00 _zcccHcca
         2 0.00 _zcccHc8
         2 0.00 _zccHc8
         2 0.00 _zHcca
         2 0.00 _cccza
         2 0.00 _ccccHcca
         2 0.00 _ccccHcc8
         2 0.00 _cccHccc8
         2 0.00 _cccH_
         2 0.00 _Hzc8
         2 0.00 8zccca
         2 0.00 8zccc8
         2 0.00 8c8
         1 0.00 rcn
         1 0.00 rccz_
         1 0.00 rcccz_
         1 0.00 rcccca
         1 0.00 rcccc8
         1 0.00 qca
         1 0.00 qcHccc8
         1 0.00 qcHcca
         1 0.00 qcHca
         1 0.00 ozccHo
         1 0.00 occHcc8
         1 0.00 ocHco
         1 0.00 ocHccz_
         1 0.00 ocHcco
         1 0.00 oHzca
         1 0.00 oHczcca
         1 0.00 oHczca
         1 0.00 oHczc8
         1 0.00 oHcz_
         1 0.00 oHcz8
         1 0.00 oHccza
         1 0.00 oHccz_
         1 0.00 oHccz8
         1 0.00 oHccccl
         1 0.00 jczco
         1 0.00 jc8
         1 0.00 ezcz8
         1 0.00 ezcccHcc8
         1 0.00 ezcccHa
         1 0.00 ezcHa
         1 0.00 eccz_
         1 0.00 eccccz_
         1 0.00 ecccco
         1 0.00 eccccHcca
         1 0.00 eccccHcc8
         1 0.00 ecccHcc8
         1 0.00 eccHa
         1 0.00 eHzcca
         1 0.00 eHzcc8
         1 0.00 eHczcca
         1 0.00 azcccca
         1 0.00 azccc8
         1 0.00 accccz_
         1 0.00 acccca
         1 0.00 accccHcca
         1 0.00 accccHca
         1 0.00 acccc8
         1 0.00 acHcca
         1 0.00 acHa
         1 0.00 aHzcco
         1 0.00 aHzcc8
         1 0.00 aHzca
         1 0.00 aHcccca
         1 0.00 aHccca
         1 0.00 aHccc8
         1 0.00 _zzcco
         1 0.00 _zzcccHca
         1 0.00 _zzcca
         1 0.00 _zzccHa
         1 0.00 _zzcHcc8
         1 0.00 _zccz_
         1 0.00 _zcccz_
         1 0.00 _zccccHcca
         1 0.00 _zccHccca
         1 0.00 _zccHccc8
         1 0.00 _zcHo
         1 0.00 _zcHco
         1 0.00 _zcHcca
         1 0.00 _zHa
         1 0.00 _cccco
         1 0.00 _ccccco
         1 0.00 _ccccHco
         1 0.00 _cccHo
         1 0.00 _cccHco
         1 0.00 _cccHc_
         1 0.00 _ccHzccc8
         1 0.00 _ccHo
         1 0.00 _ccHccc8
         1 0.00 _ccHcca
         1 0.00 _cHccca
         1 0.00 _cHccc8
         1 0.00 _Hzco
         1 0.00 _Hzca
         1 0.00 _Hcccca
         1 0.00 8zcccz_
         1 0.00 8cccco
         1 0.00 8cccca
         1 0.00 8cccc8
         1 0.00 8cccHcc8
         1 0.00 8Hzcca
     ----- ---- ----
       785 1.00 TOT

  These may be groups of the letters above.
  
  So here is the situation for maximal `czH' strings (after collapsing
  \ci/, \cy/ to `a', and \cg/ to `8', and all gallows to `H'):
  
    string  freq  plausible interpretations
    ------  ----  -------------------------
     
    c          8  invalid.
    z        151  a letter.
    H        842  a letter.
                  
    cc       138  a letter.
    zc        70  a letter.
    Hc       481  a letter.
    cz         -  invalid.
    zz         -  invalid.
    Hz         -  invalid.
    cH         6  invalid.
    zH         1  invalid, or z+H.
    HH         -  invalid.
                  
    ccc      484  a letter.
    zcc      439  a letter, or z+cc.
    Hcc      501  a letter, or H+cc.
    czc        1  invalid.
    zzc        -  invalid.
    Hzc       13  a letter, or H+zc.
    cHc       22  a letter (gallows with platform?).
    zHc        -  invalid.
    HHc        -  invalid.
    ccz        2  invalid, or cc+z.
    zcz        1  invalid, or zc+z.
    Hcz        2  invalid, or Hc+z.
    czz        -  invalid.
    zzz        -  invalid.
    Hzz        -  invalid.
    cHz        -  invalid.
    zHz        -  invalid.
    HHz        -  invalid.
    ccH       12  a letter, or cc+H.
    zcH        5  a letter, or zc+H.
    HcH        -  invalid.
    czH        -  invalid.
    zzH        -  invalid.
    HzH        -  invalid. 
    cHH        -  invalid.
    zHH        -  invalid.
    HHH        -  invalid. 
    
    cccc      70  a letter, or cc+cc.
    zccc      87  a letter, or zc+cc, or z+ccc.
    Hccc     123  a letter, or Hc+cc, or H+ccc.
    czcc       -  invalid.
    zzcc       6  invalid, or z+zcc, or z+z+cc.
    Hzcc      40  a letter, or H+zcc, or H+z+cc.
    cHcc      24  a letter.
    zHcc       2  invalid, or z+Hcc, or z+H+cc.
    HHcc       -  invalid.
    cczc       -  invalid.
    zczc       -  invalid.
    Hczc       2  invalid, or Hc+zc.
    czzc       -  invalid.
    zzzc       -  invalid.
    Hzzc       -  invalid.
    cHzc       -  invalid.
    zHzc       -  invalid.
    HHzc       -  invalid.
    ccHc       3  invalid, or cc+Hc.
    zcHc       1  invalid, or zc+Hc.
    HcHc       -  invalid.
    czHc       -  invalid.
    zzHc       -  invalid.
    HzHc       -  invalid.
    cHHc       -  invalid.
    zHHc       -  invalid.
    HHHc       -  invalid.
    cccz      11  a letter, or ccc+z.
    zccz       1  invalid, or zcc+z, or z+cc+z.
    Hccz       3  invalid, or Hcc+z, or H+cc+z.
    czcz       -  invalid.
    zzcz       -  invalid.
    Hzcz       -  invalid.
    cHcz       -  invalid.
    zHcz       -  invalid.
    HHcz       -  invalid.
    cczz       -  invalid.
    zczz       -  invalid.
    Hczz       -  invalid.
    czzz       -  invalid.
    zzzz       -  invalid.
    Hzzz       -  invalid.
    cHzz       -  invalid.
    zHzz       -  invalid.
    HHzz       -  invalid.
    ccHz       -  invalid.
    zcHz       -  invalid.
    HcHz       -  invalid.
    czHz       -  invalid.
    zzHz       -  invalid.
    HzHz       -  invalid.
    cHHz       -  invalid.
    zHHz       -  invalid.
    HHHz       -  invalid.
    cccH      21  a letter, or ccc+H.
    zccH      25  a letter, or zcc+H, or z+cc+H.
    HccH       -  invalid.
    czcH       -  invalid.
    zzcH       -  invalid.
    HzcH       -  invalid.
    cHcH       -  invalid.
    zHcH       -  invalid.
    HHcH       -  invalid.
    cczH       -  invalid.
    zczH       -  invalid.
    HczH       -  invalid.
    czzH       -  invalid.
    zzzH       -  invalid.
    HzzH       -  invalid.
    cHzH       -  invalid.
    zHzH       -  invalid.
    HHzH       -  invalid.
    ccHH       -  invalid.
    zcHH       -  invalid.
    HcHH       -  invalid.
    czHH       -  invalid.
    zzHH       -  invalid.
    HzHH       -  invalid.
    cHHH       -  invalid.
    zHHH       -  invalid.
    HHHH       -  invalid.
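
  The table above walks through every string of 1 to 4 symbols over
  `c', `z', `H'.  A minimal Python sketch that generates the same list
  in the same order, so that no row gets skipped (the function name is
  made up for illustration; it is not one of the tools used in this
  notebook):

```python
from itertools import product

def candidate_strings(alphabet="czH", max_len=4):
    """All strings over `alphabet` with 1..max_len symbols, shortest first."""
    out = []
    for n in range(1, max_len + 1):
        # product() varies the rightmost symbol fastest; reversing each
        # tuple makes the leftmost symbol vary fastest, as in the table.
        for tup in product(alphabet, repeat=n):
            out.append("".join(reversed(tup)))
    return out

cands = candidate_strings()
print(len(cands))     # 3 + 9 + 27 + 81 strings in all
print(cands[3:12])    # the two-symbol block
```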
    

97-07-28 stolfi
===============

  Let's look more closely at the tall letters:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | compare-contexts -lctx 0 -rctx 0 -colw 24 \
          '[czlqjg]*lj[czlqjg]*' \
          '[czlqjg]*qj[czlqjg]*' \
          '[czlqjg]*lg[czlqjg]*' \
          '[czlqjg]*qg[czlqjg]*'

       571 0.36 lj               230 0.33 qj                10 0.62 lgccc             53 0.34 qgccc
       376 0.24 ljcc             161 0.23 qjc                2 0.12 lg                50 0.32 qg
       322 0.21 ljc              118 0.17 qjcc               2 0.12 clgccc             8 0.05 qgzcc
        38 0.02 cccljc            28 0.04 qjccc              1 0.06 lgcc               8 0.05 qgcc
        32 0.02 ljccc             27 0.04 cccqjc             1 0.06 cclg               4 0.03 qgzc
        27 0.02 zccljc            18 0.03 ccccqjc        ----- ---- ----               4 0.03 cqgcc
        26 0.02 zcccljc           14 0.02 zccqjc            16 1.00 TOT                4 0.03 cqgc
        23 0.01 ccccljc           14 0.02 qjzcc                                        4 0.03 cccqgcc
        18 0.01 ljzcc             11 0.02 zcccqjc                                      3 0.02 cccqg
        17 0.01 zcclj              9 0.01 cqjcc                                        2 0.01 zccqgcc
        14 0.01 ccclj              8 0.01 zcccqj                                       2 0.01 zcccqgc
        10 0.01 zccljcc            7 0.01 zccqj                                        2 0.01 qgcccc
        10 0.01 cccljcc            7 0.01 cqjc                                         2 0.01 cqg
         9 0.01 cljcc              5 0.01 zccqjcc                                      2 0.01 ccqgzccc
         9 0.01 cljc               5 0.01 qjzc                                         1 0.01 zqgcc
         9 0.01 cccclj             5 0.01 ccqj                                         1 0.01 zccqgccc
         6 0.00 zcccljcc           4 0.01 cccqj                                        1 0.01 zccqg
         6 0.00 cclj               3 0.00 zcqj                                         1 0.01 ccqgccc
         5 0.00 zccclj             3 0.00 qcqjc                                        1 0.01 cccqgccc
         4 0.00 ljzc               3 0.00 ccccqj                                       1 0.01 ccccqgcc
         4 0.00 ljcccc             2 0.00 zcccqjcc                                 ----- ---- ----
         4 0.00 ccccljcc           2 0.00 qjccz                                      154 1.00 TOT
         2 0.00 zclj               2 0.00 qcqj                                    
         2 0.00 qcljcc             2 0.00 cccqjcc                                 
         2 0.00 ljczcc             2 0.00 ccccqjcc                                
         2 0.00 ljczc              1 0.00 zcqjcc                                  
         2 0.00 ccljcc             1 0.00 zccccqjcc                               
         2 0.00 ccljc              1 0.00 qjczc                                   
         1 0.00 zzcljcc            1 0.00 qjcz                                    
         1 0.00 zzcclj             1 0.00 qjcccc                                  
         1 0.00 zzcccljc           1 0.00 qcqjccc                                 
         1 0.00 zljcc              1 0.00 qcqjcc                                  
         1 0.00 zlj                1 0.00 cqjccz                                  
         1 0.00 zcljc              1 0.00 cqj                                     
         1 0.00 zccljccc           1 0.00 ccqjc                                   
         1 0.00 qlj            ----- ---- ----                                    
         1 0.00 ljcz             700 1.00 TOT                                     
         1 0.00 ljccz                                                             
         1 0.00 ljccccljc                                                         
         1 0.00 clj                                                               
         1 0.00 cccljccc                                                          
     ----- ---- ----                                                              
      1565 1.00 TOT                                                               
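
  For readers without the local tools, here is a rough Python sketch
  of the recoding and tallying step.  It assumes that compare-contexts,
  with zero context on both sides, simply counts each maximal match of
  the pattern; that is a guess from the output above, and the names
  recode and tally are made up for illustration.

```python
import re
from collections import Counter

# Same substitutions, in the same order, as the sed commands above.
RECODE = [("cs", "z"), ("ij", "k"), ("ix", "e"), ("is", "r"),
          ("iiu", "n"), ("y", "i"), ("ci", "a"), ("cg", "8")]

def recode(word):
    word = "_" + word + "_"            # mark the word boundaries
    for old, new in RECODE:
        word = word.replace(old, new)  # replaces every occurrence, like s///g
    return word

def tally(words, core):
    """Count maximal [czlqjg]* runs that contain `core` (e.g. 'lj')."""
    pat = re.compile("[czlqjg]*" + re.escape(core) + "[czlqjg]*")
    counts = Counter()
    for w in words:
        counts.update(pat.findall(recode(w)))
    return counts

print(tally(["qoljccy", "ljcg"], "lj"))
```

  Note that a final \cy/ is absorbed into `a' before the tally, so the
  jsa word `qoljccy' contributes to `ljc', not to `ljcc'.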

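  These statistics support identifying \l/ with \q/ in the gallows:
  \lj/ and \qj/ occur in much the same contexts, with similar relative
  frequencies (the bare form plus the forms followed by \c/ and \cc/
  make up about 81% of the \lj/ cases and 73% of the \qj/ cases).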
  
  The questions to decide now are whether \Hcc/ and \zcc/ are letters
  or composites \H/+\cc/ and \z/+\cc/.
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql][jg]/H/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      > .bar
    
    foreach f ( z H )
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}"'[^czH]' \
        | sed -e 's/.$//g' \
        | wfreq \
        > .${f}.L
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}cc"'[^czH]' \
        | sed -e 's/.$//g' \
        | wfreq \
        > .${f}cc.L
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"cc"'[^czH]' \
        | sed -e 's/^.//g' \
        | wfreq \
        > .cc.R
      cat .bar \
        | enum-contexts -vPAT='[^czH]'"${f}cc"'[^czH]' \
        | sed -e 's/^.//g' \
        | wfreq \
        > .${f}cc.R
      pr -m -s'  ' -t -i' '1 -w 96 .${f}cc.L .${f}.L .${f}cc.R  .cc.R  \
        | expand
    end
      
       326 0.74 _zcc       136 0.90 _z         296 0.67 zcc8       54 0.39 cc8
        65 0.15 ezcc         6 0.04 ez         108 0.25 zcca       47 0.34 cca
        22 0.05 8zcc         6 0.04 az          35 0.08 zcco       23 0.17 cco
        14 0.03 azcc         2 0.01 oz       ----- ---- ----       11 0.08 cce
         9 0.02 rzcc         1 0.01 qz         439 1.00 TOT         2 0.01 ccr
         3 0.01 ozcc     ----- ---- ----                            1 0.01 cc_
     ----- ---- ----       151 1.00 TOT                         ----- ---- ----
       439 1.00 TOT                                               138 1.00 TOT



       380 0.76 oHcc       668 0.79 oH         319 0.64 Hcc8       54 0.39 cc8
        61 0.12 eHcc        87 0.10 _H         169 0.34 Hcca       47 0.34 cca
        33 0.07 _Hcc        68 0.08 eH          13 0.03 Hcco       23 0.17 cco
        26 0.05 aHcc        18 0.02 aH       ----- ---- ----       11 0.08 cce
         1 0.00 8Hcc         1 0.00 qH         501 1.00 TOT         2 0.01 ccr
     ----- ---- ----     ----- ---- ----                            1 0.01 cc_
       501 1.00 TOT        842 1.00 TOT                         ----- ---- ----
                                                                  138 1.00 TOT
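
  A rough Python sketch of the context tallies above.  It assumes the
  words have already been recoded as in .bar (without the `_' marks,
  which are added here); the function name is made up, and this is not
  the enum-contexts tool itself.

```python
import re
from collections import Counter

def contexts(words, core, alphabet="czH"):
    """Left and right neighbours of `core` when it stands as a whole run."""
    other = "[^" + alphabet + "]"
    # The lookarounds require a non-[czH] symbol on each side without
    # consuming it, so neighbouring occurrences are all seen.
    pat = re.compile("(?<=" + other + ")" + re.escape(core) + "(?=" + other + ")")
    left, right = Counter(), Counter()
    for w in words:
        s = "_" + w + "_"
        for m in pat.finditer(s):
            left[s[m.start() - 1] + core] += 1
            right[core + s[m.end()]] += 1
    return left, right

left, right = contexts(["zcc8", "ozcca", "zccc8", "ezcc8"], "zcc")
print(left.most_common())
print(right.most_common())
```

  The third sample word is skipped on purpose: there `zcc' is followed
  by another `c', so it is not a maximal run.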


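  From these numbers, it seems plausible that `Hcc' and `zcc' are
  composites: their left contexts are dominated by the same symbols as
  those of `H' and `z' (`o' and `_', respectively), and their right
  contexts (`8', `a', `o', in that order) rank the same way as those
  of `cc'.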

  Collecting again all maximal [czH] strings, this time splitting the
  gallows into `H' (\lj/, \qj/) and `P' (\lg/, \qg/):
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
          -e 's/[ql]j/H/' \
          -e 's/[ql]g/P/' \
          -e 's/cs/z/g' \
          -e 's/ij/k/g' \
          -e 's/ix/e/g' \
          -e 's/is/r/g' \
          -e 's/iiu/n/g' \
          -e 's/y/i/g' \
          -e 's/ci/a/g' \
          -e 's/cg/8/g' \
      | enum-contexts -vPAT='[czHP][czHP]*' -vCTX=0 \
      | wfreq
  
       795 0.20 H
       493 0.13 Hcc
       484 0.12 ccc
       482 0.12 Hc
       439 0.11 zcc
       152 0.04 z
       138 0.04 cc
        87 0.02 zccc
        70 0.02 zc
        70 0.02 cccc
        65 0.02 cccHc
        63 0.02 Pccc
        60 0.02 Hccc
        52 0.01 P
        41 0.01 zccHc
        41 0.01 ccccHc
        37 0.01 zcccHc
        32 0.01 Hzcc
        24 0.01 zccH
        20 0.01 cHcc
        19 0.00 cHc
        18 0.00 cccH
        15 0.00 zccHcc
        13 0.00 zcccH
        12 0.00 ccccH
        12 0.00 cccHcc
        11 0.00 cccz
        11 0.00 ccH
         9 0.00 c
         9 0.00 Pcc
         9 0.00 Hzc
         8 0.00 zcccHcc
         8 0.00 Pzcc
         6 0.00 zzcc
         6 0.00 ccccHcc
         6 0.00 Hcccc
         5 0.00 zcH
         5 0.00 ccccz
         4 0.00 zcccz
         4 0.00 cccPcc
         4 0.00 cPcc
         4 0.00 cPc
         4 0.00 cH
         4 0.00 Pzc
         3 0.00 cccP
         3 0.00 ccHc
         3 0.00 Hczc
         3 0.00 Hccz
         2 0.00 zcccPc
         2 0.00 zccPcc
         2 0.00 ccz
         2 0.00 ccPzccc
         2 0.00 ccHcc
         2 0.00 cPccc
         2 0.00 cP
         2 0.00 Pcccc
         2 0.00 Hczcc
         2 0.00 Hcz
         1 0.00 zzcccHc
         1 0.00 zzccH
         1 0.00 zzcHcc
         1 0.00 zcz
         1 0.00 zccz
         1 0.00 zccccHcc
         1 0.00 zcccc
         1 0.00 zccPccc
         1 0.00 zccP
         1 0.00 zccHccc
         1 0.00 zcHcc
         1 0.00 zcHc
         1 0.00 zPcc
         1 0.00 zHcc
         1 0.00 zH
         1 0.00 ccccc
         1 0.00 ccccPcc
         1 0.00 cccPccc
         1 0.00 cccHccc
         1 0.00 ccPccc
         1 0.00 ccP
         1 0.00 cHccz
         1 0.00 cHccc
     ----- ---- ----
      3906 1.00 TOT
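
  A minimal Python sketch of this kind of tally: extract every maximal
  run of [czHP] symbols from already recoded words and list the runs
  by decreasing count, with relative frequencies.  The function name
  is made up; the output layout only imitates the listing above.

```python
import re
from collections import Counter

def run_frequencies(words, alphabet="czHP"):
    """(count, fraction, run) for each maximal run, most frequent first."""
    runs = Counter()
    for w in words:
        # "+" is greedy, so each match is a maximal run.
        runs.update(re.findall("[" + alphabet + "]+", w))
    total = sum(runs.values())
    return [(n, n / total, r) for r, n in runs.most_common()]

table = run_frequencies(["oHcc8", "qoHc8", "zcc8", "oHcc8"])
for n, frac, run in table:
    print("%7d %4.2f %s" % (n, frac, run))
```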


  Based on the analysis above, it seems that, after collapsing \ci/,
  \cy/, \cg/, \cs/, and the tall characters (\[lq]j/ to `H' and
  \[lq]g/ to `P'), we can parse most strings consisting of `c', `z',
  `H', and `P' into the following "letters" (the frequencies are for
  isolated occurrences only):
  
   freq letter    code  
   ---- --------  ----
    795 H         E
     52 P         P
    152 z         Z
    138 cc        M
     70 zc        R
    482 Hc        I
    484 ccc       O
    439 zcc       A
    493 Hcc       U
     19 cHc       X
      4 cPc       Y
     20 cHcc      K
      4 cPcc      G
      
  Here is my best guess for the parsing of those strings:

      freq      string     best parsing     alt parsings       
      ---- ---- ---------- ------------     --------------------
       795 0.20 H          (H)  
       493 0.13 Hcc        (Hcc)          
       484 0.12 ccc        (ccc)  
       482 0.12 Hc         (Hc)  
       439 0.11 zcc        (zcc)          
       152 0.04 z          (z)  
       138 0.04 cc         (cc)  
        87 0.02 zccc       (z)(ccc)         (zc)(cc)         
        70 0.02 zc         (zc)  
        70 0.02 cccc       (cc)(cc)         
        65 0.02 cccHc      (ccc)(Hc)        (cc)(cHc)       
        63 0.02 Pccc       (P)(ccc)         
        60 0.02 Hccc       (H)(ccc)         (Hc)(cc)
        52 0.01 P          (P)
        41 0.01 zccHc      (zcc)(Hc)        (zc)(cHc)
        41 0.01 ccccHc     (ccc)(cHc)       (cc)(cc)(Hc)
        37 0.01 zcccHc     (zcc)(cHc)       (zc)(cc)(Hc)     
        32 0.01 Hzcc       (H)(zcc)         (H)(z)(cc)  
        24 0.01 zccH       (zcc)(H)             
        20 0.01 cHcc       (cHcc)  
        19 0.00 cHc        (cHc)  
        18 0.00 cccH       (ccc)(H)  
        15 0.00 zccHcc     (zcc)(Hcc)   
        13 0.00 zcccH      (z)(ccc)(H)      (zc)(cc)(H)
        12 0.00 ccccH      (cc)(cc)(H)      
        12 0.00 cccHcc     (ccc)(Hcc)       (cc)(cHcc)
        11 0.00 cccz       (ccc)(z)
        11 0.00 ccH        (cc)(H)
         9 0.00 c          
         9 0.00 Pcc        (P)(cc)
         9 0.00 Hzc        (H)(zc)
         8 0.00 zcccHcc    (zcc)(cHcc)      (zc)(cc)(Hcc), (z)(cc)(cHcc)
         8 0.00 Pzcc       (P)(zcc)
         6 0.00 zzcc       (z)(zcc)
         6 0.00 ccccHcc    (ccc)(cHcc)      (cc)(cc)(Hcc)
         6 0.00 Hcccc      (Hcc)(cc)        (Hc)(ccc)
         5 0.00 zcH        (zc)(H)
         5 0.00 ccccz      (cc)(cc)(z)
         4 0.00 zcccz      (zc)(cc)(z)      (z)(ccc)(z)
         4 0.00 cccPcc     (ccc)(P)(cc)     (cc)(cPcc) 
         4 0.00 cPcc       (cPcc)
         4 0.00 cPc        (cPc)
         4 0.00 cH         
         4 0.00 Pzc        (P)(zc)
         3 0.00 cccP       (ccc)(P)
         3 0.00 ccHc       (cc)(Hc)
         3 0.00 Hczc       (Hc)(zc)
         3 0.00 Hccz       (H)(cc)(z)
         2 0.00 zcccPc     (zcc)(cPc)
         2 0.00 zccPcc     (zcc)(Pcc)       (z)(cc)(P)(cc)
         2 0.00 ccz        (cc)(z)
         2 0.00 ccPzccc    (cc)(P)(zc)(cc)  (cc)(P)(z)(ccc)
         2 0.00 ccHcc      (cc)(Hcc)
         2 0.00 cPccc      (cPc)(cc)
         2 0.00 cP         
         2 0.00 Pcccc      (Pc)(ccc)        (P)(cc)(cc)   
         2 0.00 Hczcc      (Hc)(zcc)
         2 0.00 Hcz        (Hc)(z)
         1 0.00 zzcccHc    (z)(zcc)(cHc)    (z)(zc)(cc)(Hc)
         1 0.00 zzccH      (z)(zcc)(H)    
         1 0.00 zzcHcc     (z)(z)(cHcc)     (z)(zc)(H)(cc)
         1 0.00 zcz        (zc)(z)
         1 0.00 zccz       (zcc)(z)
         1 0.00 zccccHcc   (zc)(cc)(cHcc)   (z)(cc)(cc)(H)(cc)
         1 0.00 zcccc      (zc)(ccc)        (z)(cc)(cc)
         1 0.00 zccPccc    (zc)(cPc)(cc)    (z)(cc)(P)(ccc),  (z)(cc)(Pc)(cc)
         1 0.00 zccP       (zcc)(P)
         1 0.00 zccHccc    (zc)(cHc)(cc)    (zcc)(H)(ccc),    (zcc)(Hc)(cc)
         1 0.00 zcHcc      (zc)(Hcc)        (z)(cHcc)
         1 0.00 zcHc       (zc)(Hc)         (z)(cHc)
         1 0.00 zPcc       (z)(Pcc)       
         1 0.00 zHcc       (z)(Hcc)
         1 0.00 zH         (z)(H)
         1 0.00 ccccc      (cc)(ccc)        (ccc)(cc)
         1 0.00 ccccPcc    (cc)(cc)(P)(cc)  (ccc)(cPcc)      
         1 0.00 cccPccc    (cc)(cPc)(cc)    (ccc)(P)(ccc)
         1 0.00 cccHccc    (cc)(cHc)(cc)    (ccc)(H)(ccc)
         1 0.00 ccPccc     (cc)(P)(ccc)     
         1 0.00 ccP        (cc)(P)
         1 0.00 cHccz      (cHcc)(z)
         1 0.00 cHccc      (cHc)(cc)
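
  The alternative parsings can be listed mechanically.  A minimal
  sketch, assuming the thirteen "letters" of the table above as the
  only building blocks (the function name is made up for
  illustration):

```python
LETTERS = ["H", "P", "z", "cc", "zc", "Hc", "ccc", "zcc", "Hcc",
           "cHc", "cPc", "cHcc", "cPcc"]

def parsings(s, letters=LETTERS):
    """Every way to split `s` into consecutive letters from `letters`."""
    if s == "":
        return [[]]                # one way to parse nothing
    out = []
    for letter in letters:
        if s.startswith(letter):
            # Try this letter first, then parse whatever is left.
            for rest in parsings(s[len(letter):], letters):
                out.append([letter] + rest)
    return out

for s in ["zccc", "cccHc", "cH"]:
    print(s, parsings(s))
```

  Strings such as `c' or `cH' come out with no parsing at all, which
  matches the blank entries in the table.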

  This still doesn't look quite right....

  Let's try it anyway.
  
    jsa2hip
    -------------------------------------------------
    #! /n/gnu/bin/sed -f
    # Recoding superanalytic to "hip" encoding:
    /^[^#]/s/ij/k/g
    /^[^#]/s/ix/e/g
    /^[^#]/s/is/r/g
    /^[^#]/s/iiu/n/g
    /^[^#]/s/y/i/g
    /^[^#]/s/ci/a/g
    /^[^#]/s/cg/8/g
    /^[^#]/s/cs/z/g
    /^[^#]/s/iin/m/g
    /^[^#]/s/in/m/g
    /^[^#]/s/ir/v/g
    /^[^#]/s/qj/E/g
    /^[^#]/s/qg/P/g
    /^[^#]/s/lj/E/g
    /^[^#]/s/lg/P/g
    # Parsing of [czPE] strings, longest patterns first;
    # runs of 7 or more symbols are left unparsed as `@':
    /^[^#]/s/[zcEP][zcEP][zcEP][zcEP][zcEP][zcEP][zcEP][zcEP]*/@/g
    /^[^#]/s/zccEcc/AU/g
    /^[^#]/s/ccccEc/OX/g
    /^[^#]/s/zcccEc/AX/g
    /^[^#]/s/ccccE/MME/g
    /^[^#]/s/zccEc/AI/g
    /^[^#]/s/cccEc/OI/g
    /^[^#]/s/zccc/RM/g
    /^[^#]/s/cccc/MM/g
    /^[^#]/s/Pccc/PO/g
    /^[^#]/s/Eccc/EO/g
    /^[^#]/s/Ezcc/EA/g
    /^[^#]/s/zccE/AE/g
    /^[^#]/s/cEcc/K/g
    /^[^#]/s/cccE/OE/g
    /^[^#]/s/cccz/OZ/g
    /^[^#]/s/Ecc/U/g
    /^[^#]/s/ccc/O/g
    /^[^#]/s/zcc/A/g
    /^[^#]/s/cEc/X/g
    /^[^#]/s/ccE/ME/g
    /^[^#]/s/Ec/I/g
    /^[^#]/s/cc/M/g
    /^[^#]/s/zc/R/g
    /^[^#]/s/E/E/g
    /^[^#]/s/z/Z/g
    /^[^#]/s/P/P/g
    -------------------------------------------------
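
  For experimenting with the rule order, here is a rough Python port
  of the script above.  It is a sketch only: it assumes that
  str.replace behaves like sed's s///g for these fixed strings, and it
  drops the two no-op rules (E to E, P to P).

```python
import re

# Stroke groups to letters, then gallows to E/P (same order as the script).
STROKES = [("ij", "k"), ("ix", "e"), ("is", "r"), ("iiu", "n"), ("y", "i"),
           ("ci", "a"), ("cg", "8"), ("cs", "z"), ("iin", "m"), ("in", "m"),
           ("ir", "v"), ("qj", "E"), ("qg", "P"), ("lj", "E"), ("lg", "P")]

# Parsing of [czPE] runs, in the order given in the script.
PARSE = [("zccEcc", "AU"), ("ccccEc", "OX"), ("zcccEc", "AX"),
         ("ccccE", "MME"), ("zccEc", "AI"), ("cccEc", "OI"),
         ("zccc", "RM"), ("cccc", "MM"), ("Pccc", "PO"), ("Eccc", "EO"),
         ("Ezcc", "EA"), ("zccE", "AE"), ("cEcc", "K"), ("cccE", "OE"),
         ("cccz", "OZ"), ("Ecc", "U"), ("ccc", "O"), ("zcc", "A"),
         ("cEc", "X"), ("ccE", "ME"), ("Ec", "I"), ("cc", "M"),
         ("zc", "R"), ("z", "Z")]

def jsa2hip(line):
    if line == "" or line.startswith("#"):
        return line                        # the sed rules skip these lines
    for old, new in STROKES:
        line = line.replace(old, new)
    line = re.sub("[zcEP]{7,}", "@", line)  # overlong runs stay unparsed
    for old, new in PARSE:
        line = line.replace(old, new)
    return line

for w in ["ljcc", "cscc", "cccljcc"]:
    print(w, jsa2hip(w))
```

  One thing this makes easy to check is how the rule order treats a
  given string.  For instance \cccljcc/ comes out as `OIc' here,
  because the `cccEc' rule fires before `Ecc' gets a chance.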

    extract-words-from-interlin \
        -recode jsa2hip \
        -chars "qoa8HPZAEIOUMXRKermnkvc@" \
        bio-j-jsa.evt \
        bio-j-hip

     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     36231 bio-j-hip.wds
      1967    1967     13458 bio-j-hip.dic
      4658    4658     22234 bio-j-hip-gut.wds
       862     862      4575 bio-j-hip-gut.dic
       843     843      2464 bio-j-hip-fun.wds
         5       5        24 bio-j-hip-fun.dic
      1553    1553     11533 bio-j-hip-bad.wds
      1100    1100      8859 bio-j-hip-bad.dic

    Digraph counts (edited):

                  q     o     a     8     R     M     A     O     P     E     Z     I     U     e     r     m     n     k     v     X     K     c     @   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1239   965   161   363   129   149   440   436    65    91   149    32    29   282    79     .     .     .     .     7     8    16    18  4658
      q     .     .  1227     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .  1247
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      o    21     .     .     .    18     .     9     .     .    61   714     .   394   380   893   178     9     .     .     .     7    10     .     .  2727
      a  2794     .     .     .    11     .     .    14     9     .    23     .    19    26   411   275   333   109    33    19     .     .     .     .  4104
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      R     .     .    26    22    31     .   107     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   199
      M     .     .    32   132   142     .    95     .     .     .    36    11     .     .    11     .     .     .     .     .     .     .     .     .   468
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      A     .     .    40   125   322     .     .     .     .     .    25     .    41    15     .     .     .     .     .     .    37     .     .     .   609
      O     .     .    40   167   404     .     .     .     .     7    18    11    77     .     .     .     .     .     .     .    41     .     .     .   765
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      e   825     .   105   114    53    36    49    71   154    10    76     7    43    61     .     .     .     .     .     .     .     .     .     .  1614
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      P     .     .    37    17     .     .    22     8    66     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   165
      8    50     .    37  1948     .     9    18    22    16     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .  2113
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      E    10     .    75   795     .     9     .    32    61     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   996
      r   401     .    36    64     .     .     .     9    16     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   539
      Z    42     .    63    73     .     .     .     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   196
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      I     .     .    20   173   391     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    12     .   608
      U     .     .    10   179   321     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   513
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      m   339     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   342
      n   114     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   115
      k    36     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    37
      v    19     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    20
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      X     .     .     .    86    10     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .   101
      K     .     .     .    14    10     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    26
      c     .     .     .    11    12     .     .     .     .    12     .     .     .     .     .     .     .     .     .     .     .     .     .     .    48
      @     .     .     .    11    13     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .    24
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4658  1247  2727  4104  2113   199   468   609   765   165   996   196   608   513  1614   539   342   115    37    20   101    26    48    24 22234
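
  A minimal sketch of how such digraph counts can be tallied from a
  word list.  It assumes, from the layout of the table, that rows are
  the first symbol, columns the second, and that the unlabeled row and
  column stand for the word boundary; the function name is made up.

```python
from collections import Counter

def digraph_counts(words):
    """Count adjacent symbol pairs; ' ' stands for a word boundary."""
    counts = Counter()
    for w in words:
        padded = " " + w + " "
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += 1      # key is the pair (first, second)
    return counts

counts = digraph_counts(["qo8a", "o8"])
print(counts["o", "8"], counts[" ", "q"], counts["a", " "])
```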


  There are some nice things in this table: \ccc/ = `O' and \cscc/ = `A'
  come out similar, and the same holds for \cc/ = `M' and \csc/ = `R'.
  
  There are some surprises, such as the similarity between
  \qg,lg/ = `P' and \cg/ = `8'; or between \lj,qj/ = `E', \cs/ = `Z',
  and \is/ = `r'. 

  The slight differences between members of the same class may be
  telling us something, too.
  
    \cc/ and \csc/   are similar, but 
                     only \cc/ is followed by \lj/, \qj/, \cs/, or \ix/
    
    \ccc/ and \cscc/ are similar, but 
                     only \ccc/ is followed by \lg/, \qg/, or \cs/
                     only \cscc/ is followed by \ljcc/ or \qjcc/
                     
    \ljc/ and \ljcc/ are similar, but 
                     only \ljc/ is followed by unparsed \c/
                     only \ljcc/ can be preceded by \cscc/
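
  One possible way to put a number on "similar" is the cosine
  similarity between successor counts.  This is only an illustration
  (it is not a measure used elsewhere in this notebook), and the
  counts are copied from the `A' and `O' rows of the edited digraph
  table, so entries removed there are missing here too.

```python
from math import sqrt

# Successor counts from the A (\cscc/) and O (\ccc/) rows of the table.
ROW_A = {"o": 40, "a": 125, "8": 322, "E": 25, "I": 41, "U": 15, "X": 37}
ROW_O = {"o": 40, "a": 167, "8": 404, "P": 7, "E": 18, "Z": 11,
         "I": 77, "X": 41}

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def only_in(u, v):
    """Successors listed for the first letter but not for the second."""
    return sorted(k for k in u if k not in v)

print(cosine(ROW_A, ROW_O))
print(only_in(ROW_O, ROW_A), only_in(ROW_A, ROW_O))
```

  The two only_in lists reproduce the differences noted above: `P' and
  `Z' follow only `O', and `U' follows only `A'.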
                     
  Also, \cgci/ is probably a letter; indeed \cg/ is followed by \ci/ 91% of the time, although \ci/
  occurs in other contexts too. 
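
  A share of this kind can be recomputed from a row of the digraph
  table.  The sketch below uses the `8' row; note that `a' in that
  table merges \ci/ and \cy/, so the result (about 0.92) is not
  necessarily the same quantity as the figure quoted above.

```python
# Successors of `8' (\cg/), copied from its row of the digraph table;
# " " stands for the word boundary column.
ROW_8 = {" ": 50, "o": 37, "a": 1948, "R": 9, "M": 18, "A": 22, "O": 16}
ROW_8_TOTAL = 2113    # TOT column of the same row

def share(row, successor, total):
    """Fraction of occurrences that are followed by `successor`."""
    return row.get(successor, 0) / total

print(round(share(ROW_8, "a", ROW_8_TOTAL), 2))
```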

  What can we conclude from these bits, really?