Hacking at the Voynich manuscript
Notebook - volume 2

Warning: these notebooks aren't strictly chronological logs.
  Sometimes I go back and redo things, clarify comments,
  delete garbage, etc.

Summary of previous notebooks
=============================

  On 97-07-05 I obtained Landini's interlinear transcription of the VMs, version 1.6
  (landini-interln16.evt) from
  http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip

  I manually extracted from it a homogeneous, full-text sample
  bio-m-evt.evt, consisting of pages 147-166 (f75r--f84v) of the
  "biological" section, in Currier's Language B, hand 2.  This section
  includes Currier's and Friedman's transcriptions.  Currier's seems
  to be the most complete of them.
  
  I played around with the file over the next three or four days.  The
  boring details are in Notebook-1.txt.  I decided it was time to
  start all over again from the beginning.


97-07-09 stolfi
===============

  From this preliminary hacking, I drew the following conclusions:

  The manuscript does not appear to use any hyphenation mark.  Either
  words are not broken across lines, which would be unusual, or they
  are broken without any extra marks.  Such word breaks may 
  result in statistical anomalies at the beginning and end of lines.
  Could this explain Currier's claim that lines are "functional units"?

  Comparing the two versions (Currier and Friedman), and looking at
  the word statistics, it seems that both are highly contaminated with
  errors (5-10% of the words).  This large amount of noise will
  mess up any statistical analysis based on either text alone.

  Therefore, before spending more time in the analysis, I must first
  prepare a "corrected" interlinear where discrepancies between FSG
  and Currier are resolved, taking into account the probabilities
  above.

  Looking at the actual shape of the characters, I realized that the
  FSG encoding was not very good for my purposes, since it assigns
  completely different codes to glyphs which may be just calligraphic
  variations of the same grapheme.  Thus I decided to do most
  processing using a more analytical encoding, which can be lumped later.

  I considered using Jacques Guy's "Neo-Frogguy" or "Gui2" encoding,
  but even that is a bit too synthetic --- for example, his <2> should
  be "i'", and his <9> should be `c)', for consistency. (The
  statistics on the occurrence of repeated <i>s apparently confirm
  this choice).

  Thus I decided to define my own "super-analytic" or "JSA" encoding.

  My super-analytic encoding
  --------------------------

    The idea is to break all characters down to individual "logical"
    strokes, and use one (computer) character to encode each stroke.

    There is some question as to what is a logical stroke, and when two
    strokes are different.  Obviously, the definition of a stroke must
    include not only its shape but also the way it connects to the
    neighboring strokes; and, given the irregularity of handwritten
    glyphs, that may be hard to decide.

    For instance, FSG's [A] character can be broken down into two
    strokes, shaped like the [C] and [I] glyphs.  Supposedly, the
    difference between an [A] and a [CI] is that in the former the
    strokes are connected into a closed shape.  Is this difference
    significant?

    I checked the occurrences of [CI], [CM], and [CN] in the interlinear
    file.  Two things are curious. First, these combinations are
    extremely rare.  Second, a good many of them are transcribed
    differently by Currier and the FSG: where one has [CIIR] the other
    often has [AIR], and vice-versa.  Same for [CM] versus [AN], etc.

    In light of these observations, I have decided to treat all
    occurrences of [A] as [CI]. If the two are indeed different, that
    will be just one more ambiguity added to the inherent ambiguity of
    natural language; so it cannot make the decipherment task more
    difficult.  Confusing the two will change the letter frequencies, it
    is true; but, since the language does not appear to be a
    standardized one, there is not much information we can extract from
    absolute letter frequencies.  The methods we hope to use --- such as
    automaton analysis --- are not significantly disturbed by collapsing
    letters.

    On the other hand, if [A] and [CI] are the same grapheme, using
    different encodings will seriously confuse statistics --- especially
    if the spacing depends on the immediate context.

    For similar reasons, it is best to ignore the distinction between
    [T] and [CC], or between [S] and [2C].  The ligature is often
    lost, and we don't know whether it is significant.

    Also, the characters that Currier transcribes as [6] are usually
    transcribed [K] by Friedman, and the two are very similar.  
    Strangely, [K] seems to occur mostly at the end of *lines*.

    The characters [7], [V], and [Y] do not occur in this corpus.

    Summarizing, the JSA encoding breaks down every character
    into strokes, which are cast into one of these types:

      1. "Body" strokes:

          q  same as FSG [4], Guy <4>; also part of [H], [P], [HZ], ...

          o  same as [O], <o>

          c  same as [C], <c>; also part of [A], [8], etc.

          i  same as [I], <i>; also part of [A], [M], [N], [R], etc.

          l  long vertical bar of [D], [F], [DZ], [FZ]

       2. "Limb" strokes ("flourishes", "plumes", ...)

          g  an 8-shaped loop with both ends attached to the previous letter,
             as the right three-fourths of [8] and [7]; and also
             the right-hand swirls of [P], [F], [PZ], [FZ].

          y  a curving descender shaped like a right-parenthesis,
             attached to the top of the preceding stroke; 
             the right-hand stroke of [G] = <9>

          s  a plume attached to the top of the preceding char,
             pointing NE and curving up, as in [2] = <s>, [R] = <2>,
             and [S]

          x  a hook attached to the top of an \i/ stroke,
             curving sharply down and crossing the \i/;
             half of [E] = <x>.

          j  a P-shaped loop with one end attached to the
             top of the previous stroke, and the other extending straight down;
             as in the right half of [H], [D], [HZ], [DZ],
             and [K].

          u  a plume similar to \s/, but attached to the *bottom* of the preceding 
             stroke; as in [L], [N], [M].


    The ligature in [T] is ignored, i.e. Guy's <t> and <e> are
    identified with his <c>, and denoted uniformly by \c/.  This
    identification is consistent with the digraph statistics.

    The character <a> is rendered \ci/. In fact, <a> is probably not a
    letter --- it appears to be a \c/ stroke (possibly half of the
    preceding letter) accidentally connected to an \i/ stroke
    (probably the beginning of the next letter).

    The weirdo symbols [Y], [V], etc. will be translated as \?/.

    The FSG -> JSA correspondence is, therefore

        IIIK -> iiiij   IE -> iix   A -> ci   N -> iiu
        IIIL -> iiiiu   IR -> iis   C -> c    O -> o
        IIIR -> iiiis   IK -> iij   D -> lj   P -> qg
        IIIE -> iiiix   2 -> cs     E -> ix   R -> is
        IIE -> iiix     4 -> q      F -> lg   S -> csc
        IIR -> iiis     6 -> cj     G -> cy   T -> cc
        IIK -> iiij     7 -> ig     H -> qj   V -> ?
        HZ -> cqjc      8 -> cg     I -> i    Y -> ?
        PZ -> cqgc                  K -> ij
        DZ -> cljc                  L -> iu
        FZ -> clgc                  M -> iiiu

    Note that the \i/ groups have one more \i/ in JSA than they have 
    in Guy's encoding.  This is redundant but makes it more evident
    that <v>, <x>, <2> are homologous members of their respective series.
    Also, this encoding fixes a minor discrepancy of Guy2, which
    uses one extra \i/ in the series <ig>, <iig>, ... <iiiig>.
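
    The correspondence can be written as a short list of sed rules.  The
    sketch below is not the fsg2jsa filter used in these notes (which is
    not listed here); the name fsg_to_jsa_sketch is made up, it expects
    bare FSG word text without locators or comments, and it writes [4] as
    \q/, following the stroke list.  The \i/ groups in the table are just
    the concatenation of the single-letter rules (IR = I + R = \iis/), so
    only the Z groups need rules of their own, applied first.

```shell
#!/bin/sh
# Hypothetical sketch of the FSG -> JSA table as sed rules
# (not the fsg2jsa used in these notes).
# Assumes bare FSG word text on stdin; [4] is written as q.
fsg_to_jsa_sketch() {
  sed \
    -e 's/HZ/cqjc/g' -e 's/PZ/cqgc/g' -e 's/DZ/cljc/g' -e 's/FZ/clgc/g' \
    -e 's/2/cs/g' -e 's/4/q/g' -e 's/6/cj/g' -e 's/7/ig/g' -e 's/8/cg/g' \
    -e 's/A/ci/g' -e 's/C/c/g' -e 's/D/lj/g' -e 's/E/ix/g' -e 's/F/lg/g' \
    -e 's/G/cy/g' -e 's/H/qj/g' -e 's/I/i/g' -e 's/K/ij/g' -e 's/L/iu/g' \
    -e 's/M/iiiu/g' -e 's/N/iiu/g' -e 's/O/o/g' -e 's/P/qg/g' -e 's/R/is/g' \
    -e 's/S/csc/g' -e 's/T/cc/g' -e 's/[VY]/?/g'
}

# A few made-up FSG strings, one per line.
printf '%s\n' '4ODC8G' 'AIR' 'CIIR' | fsg_to_jsa_sketch
```

    With these rules [AIR] and [CIIR] come out as the same JSA string,
    which is the point of treating [A] as [CI].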

  Ad-hoc encodings
  ----------------

    After mapping everything to the JSA encoding, and looking at the
    digraph frequency tables, I observed that:

    The stroke `l' is always followed by either `j' or `g', hence `lj'
    and `lg' should be single letters.

    Note also that there are two clearly different kinds of strokes, "body" B =
    {`c',`o',`t',`i',`q',`l'} and "limb" L = {`u',`x',`y',`j',`g',`s'}.  If we
    reduce the digraph count matrix to these two classes, plus word break W, we
    get

                    B     L
          ----- ----- -----
              .  6420     .
        B    59 19849 15616
        L  6361  9255     .
          ----- ----- -----

      Next-symbol probabilities (× 99):

                    B     L
          ----- ----- -----
              .    99     .
        B     .    55    44
        L    40    59     .
          ----- ----- -----

      Previous-symbol probabilities (× 99):

                    B     L
          ----- ----- -----
              .    18     .
        B     1    55    99
        L    98    26     .
          ----- ----- -----

    Note that every word begins with a body stroke; this was expected from the
    definition of the limb strokes (they can be recognized only by their
    relationship to a previous stroke).  Note also that a limb stroke cannot be
    followed by another limb stroke; this too is not wholly unexpected.

    The surprise is that almost no words *end* in a body stroke.  The least rare
    body stroke in word-final position is `o'.  The words that end
    in body strokes appear to be errors or the result of breaking a line in
    the middle of a word.

    An interesting observation from the body/limb frequency tables above
    is that the transition probabilities from body stroke to body and
    limb are roughly 56% and 44%, respectively (55 and 44 on the × 99
    scale).  Thus, if the limb strokes mark the end of a syllable (or
    letter?), then the average number of body strokes in a syllable is
    slightly over 2 (about 1/0.44 = 2.27).  (Considering that we are
    counting each \i/ as a body stroke, the correct number may well be
    precisely 2.)
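
    The arithmetic behind the reduced tables can be rechecked with a few
    lines of awk.  This is only a sketch: the function names are made up,
    the counts are copied by hand from the reduced matrix above, and the
    rounding rule (nearest integer on the × 99 scale) is inferred from the
    printed figures.

```shell
#!/bin/sh
# Reduced digraph counts copied from the table above: one row per current
# class, columns = count of next symbol in class W (word break), B, L.
reduced_counts() {
  printf '%s\n' 'W 0 6420 0' 'B 59 19849 15616' 'L 6361 9255 0'
}

# Next-symbol probabilities on the x99 scale, rounded to the nearest integer.
next_symbol_x99() {
  awk '{ tot = $2 + $3 + $4
         printf "%s %d %d %d\n", $1,
                int(99*$2/tot + 0.5), int(99*$3/tot + 0.5), int(99*$4/tot + 0.5) }'
}

# Mean number of body strokes per limb stroke: all body strokes divided by
# the body strokes that are followed by a limb stroke.
mean_body_run() {
  awk '$1 == "B" { printf "%.2f\n", ($2 + $3 + $4) / $4 }'
}

reduced_counts | next_symbol_x99
reduced_counts | mean_body_run
```

    The first command prints zeros where the table shows a dot; the second
    gives the "slightly over 2" figure directly from the counts.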



97-07-10 stolfi
===============

  Restarted everything from the beginning.
  
    cat bio-m-evt.evt \
      | fsg2jsa \
      > bio-m-jsa.evt
      
  Prepared a raw text file for training
  
    cat bio-m-jsa.evt \
      | egrep '^<.*;[FC]> ' \
      | sed \
          -e 's/<.*;[FC]> *//g' \
          -e 's/{[^}]*}//g' \
      > bio-m-jsa.txt

  Note that fsg2jsa removes the "%" and "!" characters,
  so the lines in the *-jsa.evt output files are not aligned
  (to align them we must run some dynamic programming). 

    cat bio-m-jsa.txt \
      | grep -v '[*]' \
      | sed -e 's/^/  /g' \
      > bio-m-jsa-trainset.txt

    cat bio-m-jsa-trainset.txt \
      | generate-fix-patterns -vMINOCC=10 \
      > bio-m-jsa-fixer.sed
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      1669    3486    106719 bio-m-evt.evt
      1530    1530     75404 bio-m-evt.txt
      1530    1530    117592 bio-m-jsa.txt
      1470    1470    115821 bio-m-jsa-trainset.txt
       596     716      9932 bio-m-jsa-fixer.sed

  Generate consensus:
  
    cat bio-m-jsa.evt \
      | make-consensus-interlin \
      > bio-x-jsa.evt
      
    cat bio-x-jsa.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-jsa.evt

    extract-words-from-interlin \
        -chars "qocilgysxju" \
        bio-j-jsa.evt \
        bio-j-jsa
        
     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     62690 bio-j-jsa.wds
      2132    2132     24925 bio-j-jsa.dic
      4661    4661     40897 bio-j-jsa-gut.wds
       992     992      9720 bio-j-jsa-gut.dic
       840     840      2445 bio-j-jsa-fun.wds
         2       2         5 bio-j-jsa-fun.dic
      1553    1553     19348 bio-j-jsa-bad.wds
      1138    1138     15200 bio-j-jsa-bad.dic

   Digraph counts:

                  q     o     c     i     l     g     y     s     x     j     u   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .  1398   965  1877   361    60     .     .     .     .     .     .  4661
      q     1     .  1229    18     .     1   154     .     .     .   700     .  2103
      o    21   486     1    63  1087  1071     .     .     .     .     .     .  2729
      c     4   167   176  6137  1209   232  2114  2921  1019     .     .     . 13979
      i     4     1     1     8  1997     2     .     .   560  1616    37   457  4683
      l     .     .     .     .     .     .    16     .     .     .  1566     .  1582
      g    52     .    74  2150     4     4     .     .     .     .     .     .  2284
      y  2790    26     2    47    13    43     .     .     .     .     .     .  2921
      s   463     1    99  1013     1     2     .     .     .     .     .     .  1579
      x   827    24   105   488     5   167     .     .     .     .     .     .  1616
      j    46     .    76  2175     6     .     .     .     .     .     .     .  2303
      u   453     .     1     3     .     .     .     .     .     .     .     .   457
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4661  2103  2729 13979  4683  1582  2284  2921  1579  1616  2303   457 40897

  Word length statistics:

    cat bio-j-jsa-gut.wds \
      | tr 'a-z0-9' '..........................................................................' \
      | sort | uniq -c

          2 .
         21 ..
        177 ...
        176 ....
        295 .....
        568 ......
        640 .......
       1021 ........
        793 .........
        627 ..........
        184 ...........
         91 ............
         40 .............
         12 ..............
         11 ...............
          1 ................
          2 .................

  Ditto, removing limb letters:
  
    cat bio-j-jsa-gut.wds \
      | tr -d 'gysxju' \
      | tr 'a-z0-9' '..........................................................................' \
      | sort | uniq -c
      
         19 .
        227 ..
        334 ...
        638 ....
       1115 .....
       1290 ......
        685 .......
        263 ........
         72 .........
         13 ..........
          3 ...........
          2 ............
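
  The same two histograms can be obtained without the long string of
  dots by letting awk measure each word.  A minimal sketch, with made-up
  function names and a handful of sample words rather than the real
  bio-j-jsa-gut.wds file:

```shell
#!/bin/sh
# Print "length count" pairs for the words on stdin, shortest first.
word_length_histogram() {
  awk '{ n[length($0)]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Same, after deleting the limb strokes g y s x j u.
body_length_histogram() {
  tr -d 'gysxju' | word_length_histogram
}

printf '%s\n' qolj cccg cy oix | word_length_histogram
printf '%s\n' qolj cccg cy oix | body_length_histogram
```

  The output has the length as a number in the first column instead of a
  row of dots, which is easier to feed to later steps.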
      
  Treating limbs as end-of-words:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/\([gysxju]\)/\1 /g' \
      | tr ' ' '\012' \
      | egrep '.' \
      | sort | uniq -c \
      | sort -k1,1nr \
      | pr -4 -t -s'    ' \
      | expand

     2058 cy              27 ciij             4 ciiu             1 ciiiis
      973 cs              24 ccccqj           4 cix              1 ciocs
      821 qolj            20 qoqg             4 ocqj             1 ciqj
      721 cccg            19 ciiis            4 oij              1 coclj
      628 oix             16 ccois            4 olg              1 cocqg
      454 ccccg           15 ccccs            4 qoclj            1 ij
      438 cg              15 cqj              3 c                1 occcciiiu
      436 ccg             13 ccciix           3 ccciij           1 occcg
      374 ciix            12 ccciis           3 ci               1 occcy
      353 cccy            12 o                3 colj             1 ocy
      327 ciiiiu          11 ccix             3 qcccg            1 oiis
      303 ix              11 clj              2 ccccois          1 oiiu
      272 ccy             10 cccqg            2 ccis             1 oqo
      269 lj               9 cois             2 ccocg            1 oqoix
      245 ciis             9 cqg              2 ccolj            1 oqolg
      237 olj              9 ocg              2 ccoqj            1 oqolj
      219 oqj              9 oiiiu            2 cicg             1 q
      205 qoqj             8 cccciix          2 ciiix            1 qccccg
      186 ccccy            8 cciis            2 cilj             1 qcccy
      138 ois              8 ccs              2 cis              1 qccy
      134 qj               7 cccs             2 clg              1 qci
      133 qoix             7 cciix            2 co               1 qcs
      108 ciiiu            7 ccqg             2 occccg           1 qcy
      101 ccclj            7 lg               2 qclj             1 qlj
       86 is               7 qcqj             2 qoccccg          1 qoccccy
       71 qg               7 qocg             2 qocqj            1 qocccy
       65 cclj             7 qois             1 cc               1 qocclj
       55 ccoix            6 cccois           1 ccccciiiiu       1 qociiiiu
       54 cccqj            6 oclj             1 cccccoix         1 qociis
       44 cccccy           6 qo               1 cccciij          1 qociix
       37 cccclj           6 qocccg           1 ccccoix          1 qocy
       37 cccoix           5 cccccs           1 ccccqg           1 qoiiu
       36 coix             5 cccciis          1 cciij            1 qolg
       35 oqg              5 ocs              1 cclg             1 qooix
       32 ccqj             4 cics             1 ciclj            1 qoqolj
       30 cccccg           4 ciiiiiu          1 cicqj
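
  The splitting step of this pipeline can be tried in isolation.  A
  small sketch with made-up function names and sample words; it cuts
  after every limb stroke exactly as the sed expression above does, and
  uses the POSIX key syntax of sort for the frequency ordering:

```shell
#!/bin/sh
# Cut each word after every limb stroke and print one chunk per line.
split_after_limbs() {
  sed -e 's/[gysxju]/& /g' | tr ' ' '\n' | grep .
}

# Chunk frequencies, most frequent first.
chunk_frequencies() {
  split_after_limbs | sort | uniq -c | sort -k1,1nr
}

printf '%s\n' qoljcccgcy ciiiiuo | split_after_limbs
```

  A trailing body stroke, as in the second sample word, ends up as a
  chunk of its own; that is where entries such as "o" and "c" in the
  table come from.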


  Obviously, not all "letters" or "syllables" end with limb strokes;
  some "cc" groups must be broken, too. (Many are probably the [T]
  and [S] characters.)
  
  Here is the table again, sorted by reverse-lex order:

    cat bio-j-jsa-gut.wds \
      | sed -e 's/\([gysxju]\)/\1 /g' \
      | tr ' ' '\012' \
      | egrep '.' \
      | revbytes | sort | revbytes | uniq -c \
      | pr -4 -t -s'    ' \
      | expand

          3 c                3 ccciij           2 co               4 ciiiiiu
          1 cc               1 cccciij          6 qo               9 oiiiu
        438 cg               4 oij              1 oqo              1 oiiu
        436 ccg            269 lj               1 q                1 qoiiu
        721 cccg            11 clj            973 cs             303 ix
        454 ccccg           65 cclj             8 ccs              4 cix
         30 cccccg         101 ccclj            7 cccs            11 ccix
          2 occccg          37 cccclj          15 ccccs          374 ciix
          2 qoccccg          1 qocclj           5 cccccs           7 cciix
          1 qccccg           1 ciclj            4 cics            13 ccciix
          1 occcg            6 oclj             5 ocs              8 cccciix
          6 qocccg           1 coclj            1 ciocs            1 qociix
          3 qcccg            4 qoclj            1 qcs              2 ciiix
          2 cicg             2 qclj            86 is             628 oix
          9 ocg              2 cilj             2 cis             36 coix
          2 ccocg          237 olj              2 ccis            55 ccoix
          7 qocg             3 colj           245 ciis            37 cccoix
          7 lg               2 ccolj            8 cciis            1 ccccoix
          2 clg            821 qolj            12 ccciis           1 cccccoix
          1 cclg             1 oqolj            5 cccciis          1 qooix
          4 olg              1 qoqolj           1 qociis         133 qoix
          1 qolg             1 qlj             19 ciiis            1 oqoix
          1 oqolg          134 qj               1 ciiiis        2058 cy
         71 qg              15 cqj              1 oiis           272 ccy
          9 cqg             32 ccqj           138 ois            353 cccy
          7 ccqg            54 cccqj            9 cois           186 ccccy
         10 cccqg           24 ccccqj          16 ccois           44 cccccy
          1 ccccqg           1 cicqj            6 cccois           1 qoccccy
          1 cocqg            4 ocqj             2 ccccois          1 occcy
         35 oqg              2 qocqj            7 qois             1 qocccy
         20 qoqg             7 qcqj             4 ciiu             1 qcccy
          3 ci               1 ciqj           108 ciiiu            1 qccy
          1 qci            219 oqj              1 occcciiiu        1 ocy
          1 ij               2 ccoqj          327 ciiiiu           1 qocy
         27 ciij           205 qoqj             1 ccccciiiiu       1 qcy
          1 cciij           12 o                1 qociiiiu

  Let's analyze the family {i,ii,iii,iiii}{u,s,j,x}, specifically:
  
    cat bio-j-jsa-gut.wds \
      | sed \
         -e 's/cs/z/g' \
         -e 's/\([^i]i\)/ \1/g' \
         -e 's/i\([^iusxj]\)/i\1 \1/g' \
         -e 's/\([usjx]\)/\1 \1/g' \
         -e 's/z/cs/g' \
      | tr ' ' '\012' \
      | egrep 'i' \
      | revbytes | sort | revbytes \
      | uniq -c | expand

  Results:

    |  32 ciij     |   1 gis      |   4 ciiu     | 281 ix       |   4 cic        
    |   4 oij      |  79 is       | 109 ciiiu    |  14 cix      |   1 i          
    |   1 yij      |   2 xis      |   2 oiiu     |   8 yix      |   5 ci         
    |              |   4 cis      | 329 ciiiiu   |   3 gix      |   3 oi         
    |              |   4 yis      |   9 oiiiu    |   1 csix     |   2 cil        
    |              | 271 ciis     |   4 ciiiiiu  |   6 jix      |   1 cio        
    |              | 178 ois      |              |   3 xix      |   1 ciq      
    |              |  19 ciiis    |              | 403 ciix     |   4 cics      
    |              |   1 oiis     |              | 890 oix      |              
    |              |   1 ciiiis   |              |   2 ciiix    |
                                                  
  Observations:  

    Note that these suffixes are almost always preceded by \c/ or \o/.
    The exceptions are \ix/ and \is/, which are often initial
    and occasionally preceded by \y/, \lj/, \qj/ or another \ix/.
    
    Except for the \iu/ family, it seems that \ci/ and \o/ are
    statistically equivalent.  The peculiarity of the \iu/ family
    could be explained by the fact that FSG has separate codes
    [M] and [N] for its members, whereas the other families
    are denoted [IR], [IIR], [IIIR], etc.

    Identifying \o/ with \ci/, the most frequent members of these
    families are
    
      \ciij/   (0.97 of the \ij/ family) 
               
      \is/     (0.16)
      \ciis/   (0.80)
      \ciiis/  (0.04)
               
      \ciiiu/  (0.24)
      \ciiiiu/ (0.74)
      
      \ix/     (0.19)
      \ciix/   (0.78)
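
    How such fractions arise can be checked from the Results table.  The
    sketch below (function name made up) takes "count token" lines for
    one family, rewrites a leading \o/ to \ci/ as in the tentative
    identification above, merges the counts, and prints each member's
    share.  Fed the \ij/ and \iu/ rows it gives 0.97 for \ciij/ and 0.24
    and 0.74 for \ciiiu/ and \ciiiiu/.

```shell
#!/bin/sh
# Input: "count token" lines for one family.  A leading o is rewritten to ci
# (the tentative o ~ ci identification), counts are merged, and each merged
# token is printed with its share of the family total.
family_shares() {
  awk '{ tok = $2; sub(/^o/, "ci", tok); n[tok] += $1; tot += $1 }
       END { for (t in n) printf "%s %.2f\n", t, n[t] / tot }' | sort
}

# The \ij/ family, copied from the Results table.
printf '%s\n' '32 ciij' '4 oij' '1 yij' | family_shares

# The \iu/ family, copied from the Results table.
printf '%s\n' '4 ciiu' '109 ciiiu' '2 oiiu' '329 ciiiiu' '9 oiiiu' '4 ciiiiiu' \
  | family_shares
```

    The \is/ and \ix/ families need one more decision, namely which
    preceding letters count as part of the family, so they are left out
    of the sketch.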
    
  Let's check the hypothesis that \o/ ~ \ci/, by looking at the 
  distribution of adjacent letters. I will exclude the words that
  begin with \qo/ since these appear to be special:
  
    cat bio-j-jsa-gut.wds \
      | egrep -v '^qo' \
      | sed \
          -e 's/[ql][jg]/p/g' \
          -e 's/ci/a/g' \
      | enum-trigraphs \
      | egrep '.[ao].' \
      > .ao.tri
      
    cat .ao.tri \
      | egrep '.a.' \
      > .a.tri
      
    cat .ao.tri \
      | egrep '.o.' \
      > .o.tri
       
     lines   words     bytes file        
    ------ ------- --------- ------------
       855     855      3420 .a.tri
      1460    1460      5840 .o.tri
      2315    2315      9260 .ao.tri
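
  The enum-trigraphs tool is local and not listed here.  The file sizes
  (4 bytes per line) suggest that it prints one three-character window
  per line, and the later `tr ' ' '_'` suggests that word boundaries
  show up as blanks.  Under those assumptions, a stand-in could look
  like this (the name enum_trigraphs_sketch is made up):

```shell
#!/bin/sh
# Hypothetical stand-in for enum-trigraphs: pad each word with one blank on
# each side and print every window of three consecutive characters.
enum_trigraphs_sketch() {
  awk '{ w = " " $0 " "
         for (i = 1; i + 2 <= length(w); i++) print substr(w, i, 3) }'
}

# Trigraphs with a or o in the middle position, blanks shown as "_".
printf '%s\n' 'oe' 'pae' \
  | enum_trigraphs_sketch \
  | grep '^.[ao].$' \
  | tr ' ' '_'
```

  On three-character lines the anchored pattern selects the same lines
  as the unanchored '.[ao].' used above.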

    cat .a.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2	\1a\2/g' \
      | sort | uniq -c \
      > .a.frq
   
    cat .o.tri \
      | tr ' ' '_' \
      | sed -e 's/\(.\).\(.\)/\1.\2	\1o\2/g' \
      | sort | uniq -c \
      > .o.frq
    
    join -t'	' -a1 -a2 -e '000' -j1 2 -j2 2 -o1.3,1.1,2.3,2.1 .a.frq .o.frq \
      | expand \
      | gawk ' { printf "      %3s %4.2f %3s %4.2f %4.2f\n", $1, ($2/855), $3, ($4/1460), ($2/855 + $4/1460) } ' \
      | sed -e 's/\b000/.../g' \
      | sort -k5,5nr

      _ai 0.07 _oi 0.31 0.39
      gai 0.35 goi 0.02 0.38
      pai 0.31 poi 0.05 0.36
      _ap 0.00 _op 0.33 0.33
      sai 0.13 soi 0.06 0.19
      cai 0.07 coi 0.10 0.17
      xai 0.03 xoi 0.06 0.09
      _ac 0.00 _oc 0.02 0.02
      ... 0.00 xo_ 0.01 0.01
      cax 0.01 ... 0.00 0.01

  Nothing clear emerges.  There may be confusion between \ci/ and \o/,
  but the two do not seem to be equivalent.
  
  In all, it seems that 
  
    \ci/ and \o/ are lexically similar but distinct letters. 
    
    The valid \i/ sequences are \ij/  \is/ \iis/ \iiu/ \iiiu/ \ix/;
    the others are likely to be scribal or transcription errors.
    
  So let's replace the \i/ sequences by distinct letters, then
  \ci/ by `a', and look at what we get.
  
  Let's use this ad-hoc encoding:
  
      jsa2hoc 
      -------------------------------------------------
      #! /n/gnu/bin/sed -f
      # Recoding superanalytic to ad-hoc encoding:
      s/ij/f/g
      s/ix/e/g
      s/ci/a/g
      s/iiiu/m/g
      s/iiu/n/g
      s/iis/v/g
      s/is/r/g
      s/cs/z/g
      s/cg/8/g
      s/cy/9/g
      s/qo/A/g
      s/qj/H/g
      s/qg/P/g
      s/lj/H/g
      s/lg/P/g
      -------------------------------------------------
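
    The order of the rules matters, and its effect is easy to check on a
    few words.  The sketch below inlines the same rules in the same order
    in a shell function (the name jsa_to_hoc_sketch and the sample words
    are made up for the check).  Because \ci/ -> `a' is applied before
    the \iu/ and \is/ rules, one \i/ is absorbed into the `a': \ciiiiu/
    becomes `am' and \ciis/ becomes `ar'.

```shell
#!/bin/sh
# The jsa2hoc rules above, inlined in the same order so that the effect of
# the ordering can be checked on a few words.
jsa_to_hoc_sketch() {
  sed -e 's/ij/f/g' -e 's/ix/e/g' -e 's/ci/a/g' \
      -e 's/iiiu/m/g' -e 's/iiu/n/g' -e 's/iis/v/g' -e 's/is/r/g' \
      -e 's/cs/z/g' -e 's/cg/8/g' -e 's/cy/9/g' \
      -e 's/qo/A/g' -e 's/qj/H/g' -e 's/qg/P/g' -e 's/lj/H/g' -e 's/lg/P/g'
}

printf '%s\n' qoljcccgcy ciiiiu ciis oix | jsa_to_hoc_sketch
```

    Note that \qj/ and \lj/ both map to `H', and \qg/ and \lg/ both map
    to `P', so the recoding cannot be inverted.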

    extract-words-from-interlin \
        -recode jsa2hoc \
        -chars "aocz89AHPermnvf" \
        bio-j-jsa.evt \
        bio-j-hoc

     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     41600 bio-j-hoc.wds
      1995    1995     15504 bio-j-hoc.dic
      4626    4626     26296 bio-j-hoc-gut.wds
       858     858      5513 bio-j-hoc-gut.dic
       875     875      2655 bio-j-hoc-fun.wds
        33      33       189 bio-j-hoc-fun.dic
      1553    1553     12649 bio-j-hoc-bad.wds
      1104    1104      9802 bio-j-hoc-bad.dic


    Digraph counts:

                  a     o     c     z     8     9     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .    66   965   621   727   362    95  1215   151    64   282    78     .     .     .     .  4626
      a     3     .     1     2     4     2     .     .     3     .   402   270   328   108    19    32  1174
      o    14     .     .    17     6    11     1     4   463    39   758   170     9     1     1     4  1498
      c     4    60   176  3509    35  1646   855     .   358    31    14     .     .     .     .     .  6688
      z    41    68    63   823     9     3     4     .     2     1     1     .     .     .     .     .  1015
      8    49   308    37    37    31     1  1631     .     4     .     3     1     .     .     .     .  2102
      9  2778     1     2    16    20     9     .     2    64     2     8     4     .     .     .     1  2907
      A     7     3     1    17     .     7     1     1  1022    22   134     7     .     1     .     .  1223
      H    10   583    74  1323    41     3   207     .     .     .     6     .     .     .     .     .  2247
      P     2    11    36    97    14     3     6     .     .     .     .     .     .     .     .     .   169
      e   824    32   105   205   115    53    81     1   180    10     3     2     .     .     .     .  1611
      r   396    42    36    21    13     2    22     .     .     .     .     .     .     .     .     .   532
      m   334     .     .     .     .     .     3     .     .     .     .     .     .     .     .     .   337
      n   109     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .   110
      v    19     .     .     .     .     .     1     .     .     .     .     .     .     .     .     .    20
      f    36     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .    37
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4626  1174  1498  6688  1015  2102  2907  1223  2247   169  1611   532   337   110    20    37 26296

    Next-symbol probability (× 99):

                  a     o     c     z     8     9     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     1    21    13    16     8     2    26     3     1     6     2     .     .     .     .    99
      a     .     .     .     .     .     .     .     .     .     .    34    23    28     9     2     3    99
      o     1     .     .     1     .     1     .     .    31     3    50    11     1     .     .     .    99
      A     1     .     .     1     .     1     .     .    83     2    11     1     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      e    51     2     6    13     7     3     5     .    11     1     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      c     .     1     3    52     1    24    13     .     5     .     .     .     .     .     .     .    99
      z     4     7     6    80     1     .     .     .     .     .     .     .     .     .     .     .    99
      8     2    15     2     2     1     .    77     .     .     .     .     .     .     .     .     .    99
      H     .    26     3    58     2     .     9     .     .     .     .     .     .     .     .     .    99
      P     1     6    21    57     8     2     4     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
      9    95     .     .     1     1     .     .     .     2     .     .     .     .     .     .     .    99
      r    74     8     7     4     2     .     4     .     .     .     .     .     .     .     .     .    99
      m    98     .     .     .     .     .     1     .     .     .     .     .     .     .     .     .    99
      n    98     .     1     .     .     .     .     .     .     .     .     .     .     .     .     .    99
      v    94     .     .     .     .     .     5     .     .     .     .     .     .     .     .     .    99
      f    96     .     3     .     .     .     .     .     .     .     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    17     4     6    25     4     8    11     5     8     1     6     2     1     0     0     0 26296

    Previous-symbol probability (× 99):

                  a     9     o     c     z     8     A     H     P     e     r     m     n     v     f   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     6     3    64     9    71    17    98     7    37    17    15     .     .     .     .    17
      a     .     .     .     .     .     .     .     .     .     .    25    50    96    97    94    86     4
      o     .     .     .     .     .     1     1     .    20    23    47    32     3     1     5    11     6
      A     .     .     .     .     .     .     .     .    45    13     8     1     .     1     .     .     5
      c     .     5    29    12    52     3    78     .    16    18     1     .     .     .     .     .    25
      z     1     6     .     4    12     1     .     .     .     1     .     .     .     .     .     .     4
      8     1    26    56     2     1     3     .     .     .     .     .     .     .     .     .     .     8
      9    59     .     .     .     .     2     .     .     3     1     .     1     .     .     .     3    11
      H     .    49     7     5    20     4     .     .     .     .     .     .     .     .     .     .     8
      P     .     1     .     2     1     1     .     .     .     .     .     .     .     .     .     .     1
      e    18     3     3     7     3    11     2     .     8     6     .     .     .     .     .     .     6
      r     8     4     1     2     .     1     .     .     .     .     .     .     .     .     .     .     2
      m     7     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     1
      n     2     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
      v     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
      f     1     .     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99    99 26296


97-07-11 stolfi
===============

  I took the list of good words and examined the contexts of all "gallows" letters:
  
    cat bio-j-jsa.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='[HP]' \
      | wfreq4
      
        710 AHc
        361 AHa
        320 cHc
        319 oHc
        170 oHa
        146 eHc
        108  Hc
         90 AH9
         80 ?Hc
         72 cH9
         57 9Hc
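
  The enum-contexts script is not reproduced in this notebook.  The
  following is a minimal sketch of what it appears to do, assuming it
  prints every match of PAT in each word together with a fixed number
  of context symbols on each side, and pads with blanks at the word
  edges (which would explain entries such as " Hc" above).  The name
  enum_contexts and the positional arguments are hypothetical.

```shell
# Hypothetical stand-in for enum-contexts (the real script is not shown).
# Usage: enum_contexts PAT LCTX RCTX < words, one word per line.
enum_contexts() {
  awk -v PAT="$1" -v LCTX="$2" -v RCTX="$3" '
    BEGIN {
      # Blank padding makes context at the word edges show up as spaces.
      for (i = 0; i < LCTX; i++) lpad = lpad " "
      for (i = 0; i < RCTX; i++) rpad = rpad " "
    }
    {
      w = lpad $0 rpad
      last = length(w) - RCTX            # last position a match may cover
      pos = LCTX + 1                     # first position a match may start
      while (pos <= last) {
        if (!match(substr(w, pos, last - pos + 1), PAT)) break
        start = pos + RSTART - 1
        print substr(w, start - LCTX, LCTX + RLENGTH + RCTX)
        pos = start + (RLENGTH > 0 ? RLENGTH : 1)
      }
    }'
}

printf 'oHcc\nHc\nAHa\n' | enum_contexts '[HP]' 1 1
```

  Matches are taken left to right without overlap, so this sketch
  may count differently from the real tool when a pattern can
  overlap itself.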
    
  Statistics of \c/ strings:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=0 -vPAT='cc*' \
      | wfreq4
      
        860 c
       1317 cc
        884 ccc
        145 cccc
          1 ccccc
          
  These numbers suggest that \cc/ may be a single letter.
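
  The run-length tally above can also be reproduced with plain awk,
  as a cross-check that does not depend on enum-contexts.  This is
  only a sketch: it assumes one word per line, already in the
  one-character-per-symbol encoding, and the name c_run_lengths is
  made up for the example.

```shell
# Tally maximal runs of c by length; expects one word per line.
c_run_lengths() {
  awk '{
      n = split($0, runs, /[^c]+/)       # pieces between non-c symbols
      for (i = 1; i <= n; i++)
        if (runs[i] != "") count[length(runs[i])]++
    }
    END { for (len in count) print len, count[len] }' | sort -n
}

printf 'Hcc8\nzccc9c\n' | c_run_lengths
```

  Each output line is a run length followed by the number of runs
  of that length.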
  
  Let's look at the frequencies of single \c/:
   
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4

        399 Hc8             9 qcH             4 zce             1 rca
        247 Hc9             8 zca             2 acH             1 qc9
         30 zc8             8 zcH             2 Pca             1 Pco
         25 zco             7 Hcz             2 Pc9             1 Pc8
         23 Hco             6 Hca             2 8c8             1 9cP
         13 zc9             6 AcH             1 zcz             1 9cH
         10 ocH

  It seems that nearly all single \c/s are adjacent to \z/,
  \H/, or \P/.  I will try to remove the \cHc/ and \cPc/ sequences and see 
  what happens:
  
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | sed -e 's/c[HP]c/X/g' \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4

        380 Hc8            25 zco             4 zce             1 qc9
         78 Hc9            19 Hco             2 qcH             1 Xcz
         60 zcX            13 zc9             2 Hca             1 Xco
         30 zc8             8 zca             2 8c8             1 AcX
         30 Xc9             7 Hcz             1 zcz             1 9cP
         29 Xc8             5 zcH             1 rca
    
  It seems that \c8/ and \c9/ are also major sources of single \c/s.
  Let's mask them too, together with \zc/:
      
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | sed \
          -e 's/c[HP]c/X/g' \
          -e 's/zc/Z/g' \
          -e 's/c8/K/g' \
          -e 's/c9/W/g' \
      | enum-contexts -vCTX=00 -vPAT='[^c]c[^c]' \
      | wfreq4
      
        315 HcK             9 Zca             2 rcW             1 Zcz
        166 HcW             6 XcK             2 qcH             1 ZcP
         52 ZcK             6 AcK             2 XcW             1 Xcz
         48 ZcX             5 PcK             2 Hcz             1 Xco
         40 Zco             5 HcZ             2 Hca             1 PcW
         34 ZcW             5 8cK             1 rca             1 AcX
         25 ZcH             3 qcK             1 qcW             1 AcW
         19 Hco             3 KcW             1 ocW             1 9cP
         15 ecK             3 8cW             1 ocK             1 9cK
         14 ecW
     
  Not very illuminating. Let's examine again the \c/ strings in context: 

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='cc*' \
      | wfreq4

     401 Hc8            11 8ccc8           3 Hcccc9          1 ocP
     347 Hcc8           11  ccca           3 8cco            1 eccz
     322 zcc8           10  cccz           3 8ccco           1 eccccz
     255 Hc9            10  cP             3 8c8             1 ecccco
     200 Hcc9            9 zcca            3  ccccz          1 ecccP
     192  ccc8           9 qcH             3  ccP            1 ecca
     116 zcc9            9 Pcc8            2 zcccP           1 eccP
      97 eccc8           9  cce            2 zc              1 eccH
      90  cccH           8 zca             2 rcccH           1 ecc
      82 zccH            8 zcH             2 rcc9            1 Pco
      67  ccc9           8 Pcc9            2 occc8           1 Pcca
      59 zcccH           7 Hcz             2 ecce            1 Pc8
      53 Pccc8           7 8cc9            2 eccca           1 HccccH
      52 zccc8           7  cccP           2 acH             1 Hc
      51  ccccH          6 rccc9           2 Pcccc9          1 Accc9
      42 eccc9           6 rccc8           2 Pca             1 AccH
      40 zcco            6 eccccH          2 Pc9             1 Acc9
      37 Hccc8           6 Hca             2 Hccco           1 9ccco
      34 zccc9           6 Acc8            2 Hcccc8          1 9ccccz
      31 zc8             6 AcH             2 Accc8           1 9cccc9
      31  cccc9          6 9ccc8           2 9ccccH          1 9cccc8
      26 zco             6  cc9            1 zcz             1 9ccc9
      25 Hco             5 eccco           1 zccz            1 9cc8
      25  ccco           5 ecccc8          1 zccccH          1 9cP
      24 Hccc9           5 8cc8            1 zcccc9          1 9cH
      22  cc8            4 zce             1 rccz            1 8cccco
      20  cccc8          4 zcccz           1 rcccz           1 8cccc9
      17  cco            4 zccP            1 rcccc9          1 8cccc8
      17  cH             4 ecccc9          1 rcccc8          1 8cccH
      15 ecc8            4 Pccco           1 rccca           1 8ccc9
      14 zc9             4 Hccz            1 rca             1 8c9
      14 ecc9            4 Hcca            1 qccc8           1  cccco
      14  ccH            3 zccco           1 qcc9            1  ccccco
      13  cca            3 qcc8            1 qc9             1  cccca
      11 ocH             3 ecco            1 occca           1  ccccP
      11 Pccc9           3 ecccH           1 occ9            1  ca
      11 Hcco            3 Pcco            1 occ8

  Let's look more closely at the \Hc*8/ strings:

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vCTX=1 -vPAT='Hcc*8' \
      | wfreq4
      
     198 AHc89          14  Hccc89         2 AHcccc89        1 cHc8
     184 AHcc89         14  Hcc89          2 9Hcc8a          1 AHccc8
      85 oHc89          11 AHccc89         2 8Hc89           1 AHcc8a
      59 oHcc89          9 oHc8a           2  Hcc8           1 AHc8z
      31 cHcc89          6 oHccc89         1 oHcc8a          1 AHc8o
      29 eHc89           5 AHcc8           1 oHc8c           1 AHc8c
      27 eHcc89          5 AHc8a           1 oHc8            1  Hccc8a
      26  Hc89           3 oHcc8           1 eHcc8           1  Hcc8o
      20 cHc89           3 AHc8            1 eHc8c           1  Hcc8a
      15 9Hc89           2 eHccc89         1 eHc8            1  Hc8a
      14 9Hcc89          2 cHccc8          1 cHcc8c

  Curiously, the string \Hc*8/ is almost always followed by \9/.
  I wonder if that is true of \c8/ in general:
  
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=2 -vPAT='c8' \
      | wfreq4

    1526 c89             5 c8oe            2 c8av            1 c89z
      45 c8              4 c8cc            1 c8zc            1 c89o
      24 c8ar            4 c8af            1 c8c9            1 c89H
      22 c8ae            3 c8or            1 c8c8            1 c89A
      10 c8am            3 c89e            1 c8an
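
  To put a number on this tally, the share of \c8/ occurrences that
  continue with \9/ can be computed directly from the counts.  A
  small sketch, assuming the multi-column wfreq4 layout shown above
  (alternating count and string columns); the helper name
  share_with_prefix is hypothetical.

```shell
# Share of tallied strings that begin with PREFIX.
# Input: wfreq4-style lines, i.e. alternating "count string" columns.
share_with_prefix() {
  awk -v PREFIX="$1" '
    {
      for (i = 1; i < NF; i += 2) {
        total += $i
        if (index($(i + 1), PREFIX) == 1) hits += $i
      }
    }
    END { if (total > 0) printf "%d of %d (%.1f%%)\n", hits, total, 100 * hits / total }'
}

share_with_prefix c89 <<'EOF'
    1526 c89             5 c8oe            2 c8av            1 c89z
      45 c8              4 c8cc            1 c8zc            1 c89o
      24 c8ar            4 c8af            1 c8c9            1 c89H
      22 c8ae            3 c8or            1 c8c8            1 c89A
      10 c8am            3 c89e            1 c8an
EOF
```

  On the table above this prints "1533 of 1656 (92.6%)".  It only
  works when the tallied strings contain no blanks, i.e. with no
  left context.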

  It does seem that \c8/ is almost always followed by \9/, and occasionally
  by end-of-word. Let's look at the left context:
  
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cc*8' \
      | wfreq4
  
     401 Hc8            31 zc8             6 9ccc8           1 rcccc8
     347 Hcc8           22  cc8            5 ecccc8          1 qccc8
     322 zcc8           20  cccc8          5 8cc8            1 occ8
     192  ccc8          15 ecc8            3 qcc8            1 Pc8
      97 eccc8          11 8ccc8           2 occc8           1 9cccc8
      53 Pccc8           9 Pcc8            2 Hcccc8          1 9cc8
      52 zccc8           6 rccc8           2 Accc8           1 8cccc8
      37 Hccc8           6 Acc8            2 8c8             1  c8

  Let's look at four \c/s:
  
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='cccc' \
      | wfreq4
  
      61 ccccH          30 cccc8           3 cccco           1 cccca
      44 cccc9           5 ccccz           1 ccccc           1 ccccP

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='cccc' \
      | wfreq4
  
     109  cccc           6 Hcccc           3 8cccc           2 rcccc
      17 ecccc           5 9cccc           2 zcccc           2 Pcccc
      
  Let's examine the contexts of \zc/:
    
    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=1 -vRCTX=0 -vPAT='zc' \
      | wfreq4

     580  zc            31 8zc            13 rzc             5 czc
     108 ezc            19 9zc             9 zzc             4 ozc
      41 Hzc            14 Pzc

    cat bio-j-jsa-gut.wds \
      | jsa2hoc \
      | enum-contexts -vLCTX=0 -vRCTX=1 -vPAT='zc' \
      | wfreq4

     730 zcc            14 zc9             8 zcH             2 zc
      31 zc8             8 zca             4 zce             1 zcz
      26 zco


  It is time to change my ad-hoc encoding to reflect the consonant/vowel
  theory (as suggested by Grove).
  
  We have identified several categories of symbols:
  
    \iiiu/, \iiu/, \iis/, \ij/  
    
        The ziggies: strictly final, preceded always by \ci/ or,
        more rarely, by \o/.
        
    \cy/ 
    
        Almost always final, but occasionally followed by other letters.
        Preceded by about the same letters as \ci/; indeed, it is 
        probably the final form of \ci/.
        
    \cg/ 
    
        May be followed by many letters, most often \cy/ and \ci/.
        Almost always preceded by \c/, or initial; rarely by \ix/
        or \o/.
        
    \cs/ 
    
        Most often followed by \c/, somewhat less often by \o/,
        \ci/, or word break.  Most often initial, but also 
        preceded by \ix/, gallows, \c/, \cy/, \cg/, \is/.
        
    \lg/, \qg/, \lj/, \qj/ 
    
        The capitals: very similar to each other, different from the rest.
        Probably to be combined with \c/ on both sides.
        
    \qo/ 
    
        Strictly initial, almost always followed by a capital.
    
    \ix/ 
     
        Usually initial or preceded by \ci/ or \o/;
        followed by any letter except the ziggies, \qo/,
        \ix/, and \is/.
        
    \is/ 
    
        Similar to \ix/ except that it cannot be
        followed by capitals or \cg/, either.

    \ci/
    
        May be followed only by the ziggies, \ix/, or \ir/.
        Often follows a capital, but also \cg/,
        \cs/, \c/, \ix/, \is/, or word break.
        
    \o/ 
    
        Similar to \a/, but is very often word-initial.
                   
  With these considerations, I defined a new encoding,
  "hic": 
  
      jsa2hic
      -------------------------------------------------
      #! /n/gnu/bin/sed -f
      # Recoding superanalytic to ad-hoc encoding:
      s/ij/J/g
      s/ix/I/g
      s/ci/S/g
      s/iiiu/M/g
      s/iiu/N/g
      s/iis/L/g
      s/is/C/g
      s/csc/Z/g
      s/cg/U/g
      s/cy/S/g
      s/qo/W/g
      s/qj/A/g
      s/qg/E/g
      s/lj/Y/g
      s/lg/O/g
      s/o/R/g
      s/cc/T/g
      -------------------------------------------------
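
  The order of the rules in a sed script like this one matters:
  sed applies them top to bottom, and several source codes are
  substrings of others (\iiu/ inside \iiiu/, \is/ inside \iis/).
  A small check, using only two rules copied from the script above,
  of what goes wrong if the shorter code is replaced first.  The
  function names are made up for the example.

```shell
# Both functions use only rules copied from jsa2hic; only the order differs.
long_first()  { sed -e 's/iiiu/M/g' -e 's/iiu/N/g'; }
short_first() { sed -e 's/iiu/N/g' -e 's/iiiu/M/g'; }

printf 'iiiu\n' | long_first     # M
printf 'iiiu\n' | short_first    # iN: the shorter rule ate part of the code
```

  The same reasoning applies to any rule whose left-hand side
  overlaps a later one, so new rules need to be placed with care.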

    extract-words-from-interlin \
        -recode jsa2hic \
        -chars "AEIOUYWSRTMNLc" \
        bio-j-jsa.evt \
        bio-j-hic

     lines   words     bytes file        
    ------ ------- --------- ------------
      7054    7054     41600 bio-j-hic.wds
      1934    1934     15117 bio-j-hic.dic
      4626    4626     26296 bio-j-hic-gut.wds
       800     800      5150 bio-j-hic-gut.dic
       875     875      2655 bio-j-hic-fun.wds
        33      33       189 bio-j-hic-fun.dic
      1553    1553     12649 bio-j-hic-bad.wds
      1101    1101      9778 bio-j-hic-bad.dic

    Symbol-pair counts (row: symbol, column: next symbol; blank: word boundary):

                  S     T     R     L     N     Z     H     Q     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .   161   965   282    78   215   727   362  1215     .     .     .     .   621  4626
      S  2781     1     3   410   274    69    24    11     2    33    19   108   328    18  4081
      T    14     1     .   758   170   502     6    11     4     4     1     1     9    17  1498
      Q     7     4     1   134     7  1044     .     7     1     .     .     1     .    17  1223
                                                             
      R   824   113   105     3     2   190   115    53     1     .     .     .     .   205  1611
      L   396    64    36     .     .     .    13     2     .     .     .     .     .    21   532
      N    12   807   110     6     .     .    55     6     .     .     .     .     .  1420  2416
      Z    41    72    63     1     .     3     9     3     .     .     .     .     .   823  1015
      H    49  1939    37     3     1     4    31     1     .     .     .     .     .    37  2102
                                                             
      A    36     .     1     .     .     .     .     .     .     .     .     .     .     .    37
      E    19     1     .     .     .     .     .     .     .     .     .     .     .     .    20
      O   109     .     1     .     .     .     .     .     .     .     .     .     .     .   110
      U   334     3     .     .     .     .     .     .     .     .     .     .     .     .   337
      c     4   915   176    14     .   389    35  1646     .     .     .     .     .  3509  6688
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4626  4081  1498  1611   532  2416  1015  2102  1223    37    20   110   337  6688 26296
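
  The program that produced this table is not shown.  A minimal
  sketch of an equivalent tally, assuming one word per line, one
  character per symbol, and a blank standing for the word boundary;
  the name pair_counts is hypothetical.

```shell
# Count adjacent symbol pairs; a blank stands for the word boundary.
# Expects one word per line, one character per symbol.
pair_counts() {
  awk '{
      w = " " $0 " "
      for (i = 1; i < length(w); i++) n[substr(w, i, 2)]++
    }
    END { for (p in n) printf "%d [%s]\n", n[p], p }' | LC_ALL=C sort -k1,1nr -k2
}

printf 'ST\nS\n' | pair_counts
```

  Under this reading a word of n symbols contributes n+1 pairs, so
  the grand total equals the number of symbols plus the number of
  words.  That is consistent with the table: its total, 26296, is
  also the byte count listed for bio-j-hic-gut.wds.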

    Next-symbol probability (× 99):

                  N     R     L     S     T     Z     Q     H     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     5     6     2     3    21    16    26     8     .     .     .     .    13    99
      N     .     .     .     .    33     5     2     .     .     .     .     .     .    58    99
      R    51    12     .     .     7     6     7     .     3     .     .     .     .    13    99
      L    74     .     .     .    12     7     2     .     .     .     .     .     .     4    99
      S    67     2    10     7     .     .     1     .     .     1     .     3     8     .    99
      T     1    33    50    11     .     .     .     .     1     .     .     .     1     1    99
      Z     4     .     .     .     7     6     1     .     .     .     .     .     .    80    99
      Q     1    85    11     1     .     .     .     .     1     .     .     .     .     1    99
      H     2     .     .     .    91     2     1     .     .     .     .     .     .     2    99
      A    96     .     .     .     .     3     .     .     .     .     .     .     .     .    99
      E    94     .     .     .     5     .     .     .     .     .     .     .     .     .    99
      O    98     .     .     .     .     1     .     .     .     .     .     .     .     .    99
      U    98     .     .     .     1     .     .     .     .     .     .     .     .     .    99
      c     .     6     .     .    14     3     1     .    24     .     .     .     .    52    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    17     9     6     2    15     6     4     5     8     0     0     0     1    25 26296

    Previous-symbol probability (× 99):

                  N     R     L     S     T     Z     Q     H     A     E     O     U     c   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     9    17    15     4    64    71    98    17     .     .     .     .     9    17
      N     .     .     .     .    20     7     5     .     .     .     .     .     .    21     9
      R    18     8     .     .     3     7    11     .     2     .     .     .     .     3     6
      L     8     .     .     .     2     2     1     .     .     .     .     .     .     .     2
      S    60     3    25    51     .     .     2     .     1    88    94    97    96     .    15
      T     .    21    47    32     .     .     1     .     1    11     5     1     3     .     6
      Z     1     .     .     .     2     4     1     .     .     .     .     .     .    12     4
      Q     .    43     8     1     .     .     .     .     .     .     .     1     .     .     5
      H     1     .     .     .    47     2     3     .     .     .     .     .     .     1     8
      A     1     .     .     .     .     .     .     .     .     .     .     .     .     .     0
      E     .     .     .     .     .     .     .     .     .     .     .     .     .     .     0
      O     2     .     .     .     .     .     .     .     .     .     .     .     .     .     0
      U     7     .     .     .     .     .     .     .     .     .     .     .     .     .     1
      c     .    16     1     .    22    12     3     .    78     .     .     .     .    52    25
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99    99 26296
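
  The "× 99" entries appear to be counts divided by the row total
  (next-symbol table) or column total (previous-symbol table), then
  scaled to 99 and rounded, presumably so that every entry fits in
  two digits.  A quick check of that assumption against a few
  entries of the count table; scale99 is a hypothetical helper.

```shell
# Scale a count to the 0..99 range used in the tables (rounded).
scale99() { awk -v c="$1" -v t="$2" 'BEGIN { printf "%.0f\n", 99 * c / t }'; }

scale99 2781 4081    # S followed by word end, next-symbol table: 67
scale99 1044 1223    # Q followed by N, next-symbol table: 85
scale99 2781 4626    # word-final symbol is S, previous-symbol table: 60
```

  The last two cases come out right only with rounding, not with
  truncation (84.5 and 59.5 before rounding).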


    cat bio-j-jsa.wds \
      | /n/gnu/bin/awk -f foo.awk \
      | sort | uniq -c | sort -nr | expand \
      > bio-j-jsa-wpairs.freq
      

97-07-14 stolfi
===============

  Over the weekend I pored over the list of most common words.
  They can be grouped into the following major sets, based on prefixes:
  
    qoljc*/qoqjc*
    
    cccc*/csccc*
    
    oljc*/oqjc*
    
    cgc*
    
    ixcccc*/ixcsccc*
    
    cccljc*/csccljc*/cccqjc*
    
  plus the following short connectives:
  
    oix qoix ois csoix oixcg ixoix
    
    csciiiu csciiiiu ciiiiu
    
  To confirm some hunches about gallows, I tabulated all \lj/ and \qj/
  gallows letters and their neighboring \c/ strings:
  
    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/c/g' \
      | enum-contexts -vPAT='c*qjc*' -vLCTX=0 -vRCTX=1 \
      | wfreq \
      > .foo
   
    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/c/g' \
      | enum-contexts -vPAT='c*ljc*' -vLCTX=0 -vRCTX=1 \
      | wfreq \
      > .bar
      
    pr -m -s'   ' -t -i' '1 .foo .bar > .baz


      \qj/ gallows              \lj/ gallows
    -----------------        -------------------
     129  0.18  qjccg         251  0.16  ljccg
      79  0.11  qjcccg        244  0.16  ljcccg
      44  0.06  qjcy          131  0.08  ljcccy
      38  0.05  cccqjccy      100  0.06  ljcy
      36  0.05  qjcccy         55  0.04  ljccy
      31  0.04  qjo            54  0.03  cccljccy
      29  0.04  ccccqjccy      44  0.03  ccccljccy
      26  0.04  qjccccg        42  0.03  ljo
      24  0.03  qjccy          26  0.02  ljccccg
      14  0.02  qjccccy        25  0.02  ljccccy
      10  0.01  cccqjcy        20  0.01  cccljcy
      10  0.01  ccccqjcy       15  0.01  ccccljcy
       8  0.01  qjco           12  0.01  cccljcccg
       6  0.01  cqjcccy        11  0.01  ljco
       5  0.01  qjcco           9  0.01  cccljcccy
       4  0.01  qjccco          8  0.01  cccljccg
       4  0.01  ccqjcy          7  0.00  cljccy
       4  0.01  cccqjcccy       7  0.00  cljcccy
       3  0.00  qjccci          7  0.00  cccljci
       3  0.00  qj              5  0.00  lji
       3  0.00  cqjco           5  0.00  ljcco
       3  0.00  cqjccy          5  0.00  cljcccg
       3  0.00  cqjccg          5  0.00  ccljci
       3  0.00  cqjcccg         5  0.00  ccccljcccy
       3  0.00  cccqjccg        5  0.00  ccccljcccg
       3  0.00  cccqjcccg       4  0.00  ljcccccy
       3  0.00  ccccqjcccg      4  0.00  ccclj
       2  0.00  qjcg            3  0.00  lj
       2  0.00  cqjcy           3  0.00  ccljcy
       2  0.00  ccqjo           2  0.00  ljcci
       2  0.00  ccqjci          2  0.00  ljccco
       1  0.00  qji             2  0.00  ljcccccg
       1  0.00  qjcccccy        2  0.00  cljci
       1  0.00  qjccc           2  0.00  cljccg
       1  0.00  qjcc            2  0.00  ccljccg
       1  0.00  cqjci           2  0.00  ccccljcci
       1  0.00  cqjcco          2  0.00  ccccljccg
       1  0.00  cqjcci          1  0.00  ljcg
       1  0.00  cqjccccg        1  0.00  ljccci
       1  0.00  cqjccc          1  0.00  ljccccl
       1  0.00  ccqjccg         1  0.00  ccljco
       1  0.00  ccqjcccy        1  0.00  ccljcccy
       1  0.00  cccqjci         1  0.00  ccljcccg
       1  0.00  ccccqjci        1  0.00  cccljco
       1  0.00  ccccqjcccy      1  0.00  cccljcci
       1  0.00  cccccqjccc      1  0.00  cccljccccy
   -----  ----  ----            1  0.00  cccljccccg
     700  1.00  TOT             1  0.00  cccljc
                                1  0.00  ccccljco
                                1  0.00  cccccljccy
                            -----  ----  ----
                             1566  1.00  TOT
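
  The wfreq script is not listed here either.  Judging from the
  columns above (count, fraction of the total, string, then a TOT
  line), it can be approximated with standard tools.  This is a
  sketch under that assumption, with a made-up name; ties may come
  out in a different order than in the tables.

```shell
# Approximation of wfreq: count, relative frequency, string, then a total.
# Expects one blank-free string per line.
wfreq_sketch() {
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2,2 | awk '
    { cnt[NR] = $1; str[NR] = $2; tot += $1 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%5d  %4.2f  %s\n", cnt[i], cnt[i] / tot, str[i]
      print "-----  ----  ----"
      printf "%5d  %4.2f  TOT\n", tot, 1
    }'
}

printf 'a\nb\na\na\n' | wfreq_sketch
```

  Strings with a leading blank (word-edge context) would need a
  different field split, so this only suits runs with no left
  context, as in the two tables above.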

  Now let's check the significance of the \s/ plume on \c/.
  First, let's list all initial \c/-strings that have plumes
  against all that don't have them:
  
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/z/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | enum-contexts -vPAT='_c*z[zc]*[^zc]' -vLCTX=0 -vRCTX=0 \
      | wfreq \
      > .foo
   
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/z/g' \
          -e 's/^/_/g' \
          -e 's/$/_/g' \
      | enum-contexts -vPAT='_c*c[^zc]' -vLCTX=0 -vRCTX=0 \
      | wfreq \
      > .bar
      
    pr -m -s'   ' -t -i' '1 .foo .bar > .baz

   "z"=\cs/ prefixes                    "c" prefixes
   ------------------                   ------------------
   219  0.30  _zcccg                    364  0.32  _cg
    71  0.10  _zcccy                    192  0.17  _ccccg
    69  0.09  _zci                       95  0.08  _cy
    63  0.08  _zo                        67  0.06  _ccccy
    54  0.07  _zccl                      66  0.06  _ci
    38  0.05  _zccccg                    60  0.05  _cccl
    34  0.05  _zcccl                     37  0.03  _cccq
    30  0.04  _zcco                      31  0.03  _cccccy
    26  0.04  _zccq                      29  0.03  _ccccl
    23  0.03  _zccccy                    25  0.02  _ccco
    22  0.03  _zcccq                     23  0.02  _ccccq
    19  0.03  _zco                       22  0.02  _cccg
     9  0.01  _zccy                      20  0.02  _cccccg
     8  0.01  _cccz_                     19  0.02  _cq
     7  0.01  _zcci                      17  0.01  _cco
     6  0.01  _zccg                      13  0.01  _ccci
     6  0.01  _zccci                     11  0.01  _cccci
     6  0.01  _z_                        10  0.01  _cci
     4  0.01  _zzcccg                     9  0.01  _ccl
     3  0.00  _zcq                        8  0.01  _cl
     3  0.00  _zcl                        8  0.01  _ccq
     3  0.00  _zccco                      6  0.01  _cccy
     3  0.00  _ccccz_                     1  0.00  _cccco
     2  0.00  _zl                         1  0.00  _ccccco
     1  0.00  _zzcl                       1  0.00  _ccccci
     1  0.00  _zzcco                  -----  ----  ----
     1  0.00  _zzccl                   1135  1.00  TOT
     1  0.00  _zzcccy                
     1  0.00  _zzcccl                
     1  0.00  _zq                    
     1  0.00  _zi                    
     1  0.00  _zcy                   
     1  0.00  _zccz_                 
     1  0.00  _zcccz_                
     1  0.00  _zccccq                
     1  0.00  _zc_                   
     1  0.00  _ccczcy                
     1  0.00  _ccczci                
 -----  ----  ----                   
   742  1.00  TOT
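
  The prefix extraction above can be sketched with sed and awk alone,
  assuming that enum-contexts with LCTX=0 and RCTX=0 simply prints
  the matched text.  The function name initial_prefix is
  hypothetical; the two patterns are the ones used above.

```shell
# Recode \cs/ as z, mark both word edges with _, and print the
# word-initial match of PAT (if any) for each word.
initial_prefix() {
  sed -e 's/cs/z/g' -e 's/^/_/' -e 's/$/_/' |
    awk -v PAT="$1" 'match($0, PAT) && RSTART == 1 { print substr($0, 1, RLENGTH) }'
}

printf 'cscccgcy\ncgcy\ncs\n' | initial_prefix '_c*z[zc]*[^zc]'   # plumed prefixes
printf 'cscccgcy\ncgcy\ncs\n' | initial_prefix '_c*c[^zc]'        # plain prefixes
```

  The first call keeps only words whose initial \c/ string contains
  a plume (here _zcccg and _z_); the second keeps only those without
  one (here _cg).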

  Let's do it again with whole words:
  
    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/z/g' \
      | egrep '^z' \
      | wfreq \
      > .foo
      
    cat bio-j-jsa-gut.wds \
      | sed -e 's/cs/z/g' \
      | egrep '^c' \
      | wfreq \
      > .bar
      
    pr -m -s'   ' -t -i' '1 .foo .bar > .baz
  
     "z"=\cs/ words                       "c" words   
   ------------------                   ------------------
   204  0.28  zcccgcy                   172  0.15  ccccgcy
    69  0.09  zcccy                      73  0.06  cgciiiiu
    36  0.05  zccccgcy                   67  0.06  ccccy
    31  0.04  zciiiiu                    51  0.04  cgciis
    25  0.03  zoix                       50  0.04  cgciix
    24  0.03  zccljccy                   33  0.03  cgcy
    23  0.03  zccccy                     31  0.03  cccccy
    20  0.03  zcccljccy                  29  0.03  cccljccy
    17  0.02  zccoix                     21  0.02  cccqjccy
    14  0.02  zciix                      20  0.02  ciiiiu
    13  0.02  zccqjccy                   19  0.02  cccccgcy
    11  0.02  zcoix                      18  0.02  ccccljccy
    11  0.02  zciis                      17  0.01  cgzcccgcy
    11  0.02  zcccqjccy                  17  0.01  cgoix
    10  0.01  zois                       17  0.01  cccoix
    10  0.01  zccljcy                    17  0.01  ccccqjccy
     9  0.01  zccy                       16  0.01  cccgcy
     9  0.01  zccois                     12  0.01  ciix
     7  0.01  zcccqjcy                   12  0.01  cgciiiu
     6  0.01  zcccgciix                  11  0.01  ccoix
     6  0.01  z                           9  0.01  cyljcccgcy
     5  0.01  zccljcccy                   9  0.01  cgccccgcy
     5  0.01  zccljcccgcy                 9  0.01  ccccljcy
     5  0.01  zcciis                      8  0.01  cyzcccgcy
     5  0.01  zccgcy                      8  0.01  cccz
     5  0.01  zcccljcy                    8  0.01  cccljcy
     4  0.01  zzcccgcy                    7  0.01  cyqjccgcy
     4  0.01  zccqjcy                     7  0.01  ciis
     3  0.00  zoixljcccy                  6  0.01  cyljcccy
     3  0.00  zoixljcccgcy                6  0.01  cgciixcy
     3  0.00  zoiiiu                      6  0.01  cccois
     3  0.00  zcoixcgcy                   5  0.00  cyljccgcy
     3  0.00  zciiis                      5  0.00  ciiiu
     3  0.00  zccqjcccy                   5  0.00  cgciij
     3  0.00  zcclj                       5  0.00  cccy
     3  0.00  zcccoix                     5  0.00  cccljccgcy
     3  0.00  zcccljcccgcy                5  0.00  cccljcccgcy
     3  0.00  zccciix                     5  0.00  ccccgciiiiu
     3  0.00  zccciis                     5  0.00  ccccg
     3  0.00  zcccgciiiiu                 4  0.00  cyzcccy
     2  0.00  zoljcccgcy                  4  0.00  cyccccgcy
     2  0.00  zoixccccy                   4  0.00  cgciiscy
     2  0.00  zoixccccgcy                 4  0.00  cgcccgcy
     2  0.00  zcljcy                      4  0.00  ccix
     2  0.00  zcix                        4  0.00  cccqjcy
     2  0.00  zccqgcccy                   4  0.00  cccljcccy
     2  0.00  zccljccgcy                  4  0.00  ccciix
     2  0.00  zcccqgccy                   4  0.00  ccciis
     2  0.00  zcccljcciix                 4  0.00  cccciix
     2  0.00  zcccljccgcy                 4  0.00  cccciis
     2  0.00  zcccljcccy                  3  0.00  cyzccccy
     2  0.00  zcccg                       3  0.00  cyqjcccy
     1  0.00  zzcljcccgcy                 3  0.00  cyqjcccgcy
     1  0.00  zzccoix                     3  0.00  cqjcccy
     1  0.00  zzccljcy                    3  0.00  cqjcccgcy
     1  0.00  zzcccy                      3  0.00  cqgcccy
     1  0.00  zzcccljccy                  3  0.00  ciixcy
     1  0.00  zqgcccy                     3  0.00  ciij
     1  0.00  zoljoix                     3  0.00  cgois
     1  0.00  zoixzcccy                   3  0.00  cgix
     1  0.00  zoixqjcccgcy                3  0.00  cgciixcgcy
     1  0.00  zoixois                     3  0.00  cgcccoix
     1  0.00  zoixljcy                    3  0.00  cccqjccgcy
     1  0.00  zoixljccy                   3  0.00  cccgciis
     1  0.00  zoixljccgcy                 3  0.00  ccccz
     1  0.00  zoixcy                      3  0.00  ccccqjcy
     1  0.00  zoixcgcy                    2  0.00  cyqjciiiiu
     1  0.00  zoixccljciix                2  0.00  cyljciiiiu
     1  0.00  zoixcccoix                  2  0.00  cljccgcy
     1  0.00  zocljcccy                   2  0.00  ciisoix
     1  0.00  zocgciis                    2  0.00  ciiiiucy
     1  0.00  zljciis                     2  0.00  cgzcccy
     1  0.00  zljcccy                     2  0.00  cgzccccy
     1  0.00  zixz                        2  0.00  cgljccgcy
     1  0.00  zcy                         2  0.00  cgcyljcccgcy
     1  0.00  zcqjoix                     2  0.00  cgciixzccgcy
     1  0.00  zcqjcy                      2  0.00  cgciixo
     1  0.00  zcqjcccy                    2  0.00  cgciiis
     1  0.00  zcoljzccgcy                 2  0.00  cgci
     1  0.00  zcoljcy                     2  0.00  cgccoix
     1  0.00  zcoljciiiiu                 2  0.00  cgccgcy
     1  0.00  zcocqgcccgcy                2  0.00  cgcccy
     1  0.00  zcocljccy                   2  0.00  ccqjcy
     1  0.00  zcljcoix                    2  0.00  ccljccgcy
     1  0.00  zcixcgcy                    2  0.00  cccqjcccgcy
     1  0.00  zcis                        2  0.00  cccqgcccy
     1  0.00  zciljciiiiu                 2  0.00  cccljciis
     1  0.00  zciixljcccy                 2  0.00  ccciij
     1  0.00  zciixcy                     2  0.00  cccciixcy
     1  0.00  zciixccccg                  2  0.00  ccccgoix
     1  0.00  zciisciix                   2  0.00  ccccgciix
     1  0.00  zciiiuo                     2  0.00  ccccgciis
     1  0.00  zccz                        1  0.00  cyzccciixcgcy
     1  0.00  zccqjciis                   1  0.00  cyzccccgcy
     1  0.00  zccqjcccyix                 1  0.00  cyzcccccy
     1  0.00  zccqjcccgcy                 1  0.00  cyqjcy
     1  0.00  zccqgccccgcy                1  0.00  cyqjciix
     1  0.00  zccoljcy                    1  0.00  cyqjciis
     1  0.00  zccoixoix                   1  0.00  cyqjciiiu
     1  0.00  zccoixo                     1  0.00  cyqjccy
     1  0.00  zccoixcgcy                  1  0.00  cyqjcccgciis
     1  0.00  zccljciix                   1  0.00  cyoljcy
     1  0.00  zccljciij                   1  0.00  cyljzccoix
     1  0.00  zccljciiiiu                 1  0.00  cyljzcccgcy
     1  0.00  zccljciiiiiu                1  0.00  cyljciix
     1  0.00  zccljccccy                  1  0.00  cyljciis
     1  0.00  zcciix                      1  0.00  cyljciiiu
     1  0.00  zcciij                      1  0.00  cyljcccgciis
     1  0.00  zccgciix                    1  0.00  cyljcccccy
     1  0.00  zcccz                       1  0.00  cylgccccy
     1  0.00  zcccyljcy                   1  0.00  cylgccccgcy
     1  0.00  zcccyis                     1  0.00  cyixois
     1  0.00  zcccqjcccgcy                1  0.00  cyixcccgcy
     1  0.00  zcccqjcccgcccy              1  0.00  cyiscy
     1  0.00  zcccgoix                    1  0.00  cycqgciiiiu
     1  0.00  zcccgciixcgcy               1  0.00  cycljcccy
     1  0.00  zcccgciis                   1  0.00  cyciis
     1  0.00  zcccgciij                   1  0.00  cycgcy
     1  0.00  zccccqjcccy                 1  0.00  cycgciiszcccy
     1  0.00  zccccgciiis                 1  0.00  cycgciisciix
     1  0.00  zccccg                      1  0.00  cycgciiiiu
     1  0.00  zc                          1  0.00  cycccoix
 -----  ----  ----                        1  0.00  cyccccz
   729  1.00  TOT                         1  0.00  cyccccy
                                          1  0.00  cyccccqjccy
                                          1  0.00  cyccccljcccy
                                          1  0.00  cyccccgciis
                                          1  0.00  cyccccg
                                          1  0.00  cycccccy
                                          1  0.00  cycccccgcy
                                          1  0.00  cy
                                          1  0.00  cqjcy
                                          1  0.00  cqjcoix
                                          1  0.00  cqjcois
                                          1  0.00  cqjcciix
                                          1  0.00  cqjccgcy
                                          1  0.00  cqgcoix
                                          1  0.00  cqgciixoiis
                                          1  0.00  cqgcciix
                                          1  0.00  cqgcciis
                                          1  0.00  cqgccgois
                                          1  0.00  cljciix
                                          1  0.00  cljccy
                                          1  0.00  cljcccy
                                          1  0.00  cljcccgcy
                                          1  0.00  clgccccy
                                          1  0.00  clgccccgcy
                                          1  0.00  cizcy
                                          1  0.00  ciqjccy
                                          1  0.00  ciixzcccz
                                          1  0.00  ciixoix
                                          1  0.00  ciixoiscy
                                          1  0.00  ciixciixcgcy
                                          1  0.00  ciixcccgcy
                                          1  0.00  ciixccccgcy
                                          1  0.00  ciisois
                                          1  0.00  ciiscy
                                          1  0.00  ciisciis
                                          1  0.00  ciiis
                                          1  0.00  cgzcoixcycg
                                          1  0.00  cgzcoix
                                          1  0.00  cgzcois
                                          1  0.00  cgzccoix
                                          1  0.00  cgzccgcy
                                          1  0.00  cgzcccz
                                          1  0.00  cgzccccgcy
                                          1  0.00  cgzccccgciix
                                          1  0.00  cgoqjcccy
                                          1  0.00  cgoljccgcy
                                          1  0.00  cgoixljccgcy
                                          1  0.00  cgoixlgccccgcy
                                          1  0.00  cgoixccccgcy
                                          1  0.00  cgoixcccccgcy
                                          1  0.00  cgoisciiiiu
                                          1  0.00  cgljzcccy
                                          1  0.00  cgljcccy
                                          1  0.00  cgisoix
                                          1  0.00  cgcyqjccy
                                          1  0.00  cgcyqjccgcy
                                          1  0.00  cgcyljzccy
                                          1  0.00  cgcyljcy
                                          1  0.00  cgcyljciiiiu
                                          1  0.00  cgcyljccgcy
                                          1  0.00  cgcyij
                                          1  0.00  cgciixzcccgcy
                                          1  0.00  cgciixoix
                                          1  0.00  cgciixljcy
                                          1  0.00  cgciixcyiscg
                                          1  0.00  cgciixciix
                                          1  0.00  cgciixciiscy
                                          1  0.00  cgciixciis
                                          1  0.00  cgciixciiiiu
                                          1  0.00  cgciixccccgcy
                                          1  0.00  cgciisoiscy
                                          1  0.00  cgciisoij
                                          1  0.00  cgciiscgcy
                                          1  0.00  cgciisccccy
                                          1  0.00  cgciiscccccgciix
                                          1  0.00  cgciiix
                                          1  0.00  cgciiiscycgcy
                                          1  0.00  cgciiiiiu
                                          1  0.00  cgcicljccy
                                          1  0.00  cgccois
                                          1  0.00  cgcccljcccgcy
                                          1  0.00  cgccccy
                                          1  0.00  cgccccoix
                                          1  0.00  cgcccccy
                                          1  0.00  cgcccccgcy
                                          1  0.00  ccqjoix
                                          1  0.00  ccqjciix
                                          1  0.00  ccqjciiis
                                          1  0.00  ccqjccgcy
                                          1  0.00  ccqgzccccgcy
                                          1  0.00  ccqgccccgcy
                                          1  0.00  ccoqjcy
                                          1  0.00  ccoixo
                                          1  0.00  ccoixljccccy
                                          1  0.00  ccoixcy
                                          1  0.00  ccoixccccy
                                          1  0.00  ccocgciiiiu
                                          1  0.00  ccljcy
                                          1  0.00  ccljciix
                                          1  0.00  ccljciisoix
                                          1  0.00  ccljciis
                                          1  0.00  ccljciiiiu
                                          1  0.00  ccljcccy
                                          1  0.00  cclgciiiiu
                                          1  0.00  ccixz
                                          1  0.00  ccixis
                                          1  0.00  ccixciiiiiu
                                          1  0.00  ccixcgciiiiu
                                          1  0.00  ccixccqgzccccy
                                          1  0.00  ccis
                                          1  0.00  ccczcy
                                          1  0.00  ccczciix
                                          1  0.00  cccyqjcccy
                                          1  0.00  cccqgoix
                                          1  0.00  cccqgcy
                                          1  0.00  cccqgcccgcy
                                          1  0.00  cccqgccccgcy
                                          1  0.00  cccqg
                                          1  0.00  cccoixcccgcy
                                          1  0.00  cccoixccccy
                                          1  0.00  cccljcois
                                          1  0.00  cccljciij
                                          1  0.00  cccljcciix
                                          1  0.00  cccljccg
                                          1  0.00  cccljccccg
                                          1  0.00  cccljc
                                          1  0.00  ccclj
                                          1  0.00  ccciixoixcy
                                          1  0.00  ccciisois
                                          1  0.00  ccciiscy
                                          1  0.00  cccgoix
                                          1  0.00  cccgciij
                                          1  0.00  cccgciiiu
                                          1  0.00  ccccqjcccy
                                          1  0.00  ccccqjcccgcy
                                          1  0.00  ccccqgcccgcy
                                          1  0.00  ccccois
                                          1  0.00  ccccljco
                                          1  0.00  ccccljcccy
                                          1  0.00  cccciixois
                                          1  0.00  ccccgois
                                          1  0.00  ccccgcyljciis
                                          1  0.00  ccccgcyix
                                          1  0.00  ccccgcccy
                                          1  0.00  cccccoix
                                          1  0.00  ccccciiiiu
                                          1  0.00  cccccgciis
                                      -----  ----  ----
                                       1148  1.00  TOT
  

  Some conclusions:
  
    * The gallows characters \qj/ and \lj/ appear to be closely related:
      for every common word with \lj/, there appears to be
      a word with \qj/ that occurs with about 1/4 the frequency.
      
    * The same phenomenon can be noted with respect to prefixes
      containing \cc/ and \csc/: for every word beginning with \cc/,
      there is a word where the first \cc/ is replaced by \csc/,
      with practically the same frequency.
      
    * There appears to be much confusion between the suffixes \iu/
      and \iiiu/. 
      
    * There appears to be much confusion between \o/ and \ci/.
      
  Recall also our previous guess that \cy/ is just the final form of
  \ci/.  Therefore, I have decided to do the following simplifications
  before recomputing the consensus file:
  
    * Ignore, for the time being, the difference between \qj/
      and \lj/, and between \qg/ and \lg/, replacing them
      by \H/ and \P/, respectively;
    
    * omit the \s/ plume after \c/;
    
    * replace all strings \iiu/, \iiiu/, \iiiiu/, etc. by \m/;
    
    * replace \cy/ by \ci/;
    
    * replace \o/ by \ci/.
    
  Needless to say, I don't mean that these differences are meaningless;
  it is just that there seems to be structure to be discovered that 
  does not depend on these features.
    
    cat bio-m-jsa.evt \
      | jsa2hec \
      | make-consensus-interlin \
      > bio-x-hec.evt
      
    cat bio-x-hec.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-hec.evt

    extract-words-from-interlin \
        -chars "mrcgiAeHP" \
        bio-j-hec.evt \
        bio-j-hec
        
      jsa2hec
      -------------------------------------------------
      #! /n/gnu/bin/sed -f
      # Recoding superanalytic to ad-hoc encoding:
      /^[^#]/s/ij/f/g
      /^[^#]/s/ix/e/g
      /^[^#]/s/cy/X/g
      /^[^#]/s/ci/X/g
      /^[^#]/s/iiiiu/m/g
      /^[^#]/s/iiiu/m/g
      /^[^#]/s/iiu/m/g
      /^[^#]/s/iis/v/g
      /^[^#]/s/is/r/g
      /^[^#]/s/X/ci/g
      /^[^#]/s/o/ci/g
      /^[^#]/s/cs/c/g
      /^[^#]/s/qci/A/g
      /^[^#]/s/qj/H/g
      /^[^#]/s/qg/P/g
      /^[^#]/s/lj/H/g
      /^[^#]/s/lg/P/g
      -------------------------------------------------
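
  For readers without sed at hand, the recoding above can be sketched
  in Python.  This is only an illustration, not the script that was
  actually run: the rule list is copied from jsa2hec, and the names
  JSA2HEC_RULES and recode_line are made up for this sketch.  It relies
  on every rule being a literal string, so that sed's s/old/new/g
  behaves like Python's str.replace.

```python
# Ordered (old, new) pairs copied from the jsa2hec sed script.
# Order matters: "cy" and "ci" are parked in the placeholder "X"
# before the i-strings are recoded, and restored as "ci" afterwards.
JSA2HEC_RULES = [
    ("ij", "f"), ("ix", "e"),
    ("cy", "X"), ("ci", "X"),
    ("iiiiu", "m"), ("iiiu", "m"), ("iiu", "m"),
    ("iis", "v"), ("is", "r"),
    ("X", "ci"),
    ("o", "ci"), ("cs", "c"),
    ("qci", "A"),
    ("qj", "H"), ("qg", "P"), ("lj", "H"), ("lg", "P"),
]


def recode_line(line, rules=JSA2HEC_RULES):
    """Apply the rules in order to one line.

    Mirrors the sed address /^[^#]/: empty lines and lines starting
    with '#' are returned unchanged.
    """
    if not line or line.startswith("#"):
        return line
    for old, new in rules:
        line = line.replace(old, new)  # global, left to right, like s/old/new/g
    return line
```

  For instance, recode_line("qoljcccgcy") goes through "qoljcccgci",
  "qciljcccgci" and "Aljcccgci" before ending up as "AHcccgci".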

     lines   words     bytes file        
    ------ ------- --------- ------------
      7069    7069     51059 bio-j-hec.wds
      1495    1495     14517 bio-j-hec.dic
      5081    5081     37045 bio-j-hec-gut.wds
       627     627      5347 bio-j-hec-gut.dic
       929     929      3085 bio-j-hec-fun.wds
        63      63       473 bio-j-hec-fun.dic
      1059    1059     10929 bio-j-hec-bad.wds
       805     805      8697 bio-j-hec-bad.dic

    Digraph counts:

                  m     r     c     g     i     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .    84  3090     .     .  1376   288   173    70  5081
      m   738     .     .     5     .     .     .     .     .     .   743
      r   449     .     .   154     .     .     .     .     2     .   605
      c    47     .     .  7573  2197  6164     .    16   382    33 16412
      g    50     .     1  2138     .     .     .     4     4     .  2197
      i  2889   742   509    97     .     2     8  1274   600    45  6166
      A     8     1     9    32     .     .     1   137  1174    24  1386
      e   888     .     2   614     .     .     1     3   209    11  1728
      H    10     .     .  2528     .     .     .     6     .     .  2544
      P     2     .     .   181     .     .     .     .     .     .   183
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  5081   743   605 16412  2197  6166  1386  1728  2544   183 37045

    Next-symbol probability (× 99):

                  m     r     c     g     i     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .     2    60     .     .    27     6     3     1    99
      m    98     .     .     1     .     .     .     .     .     .    99
      r    73     .     .    25     .     .     .     .     .     .    99
      c     .     .     .    46    13    37     .     .     2     .    99
      g     2     .     .    96     .     .     .     .     .     .    99
      i    46    12     8     2     .     .     .    20    10     1    99
      A     1     .     1     2     .     .     .    10    84     2    99
      e    51     .     .    35     .     .     .     .    12     1    99
      H     .     .     .    98     .     .     .     .     .     .    99
      P     1     .     .    98     .     .     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    14     2     2    44     6    16     4     5     7     0 37045

    Previous-symbol probability (× 99):

                  m     r     c     g     i     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .    14    19     .     .    98    17     7    38    14
      m    14     .     .     .     .     .     .     .     .     .     2
      r     9     .     .     1     .     .     .     .     .     .     2
      c     1     .     .    46    99    99     .     1    15    18    44
      g     1     .     .    13     .     .     .     .     .     .     6
      i    56    99    83     1     .     .     1    73    23    24    16
      A     .     .     1     .     .     .     .     8    46    13     4
      e    17     .     .     4     .     .     .     .     8     6     5
      H     .     .     .    15     .     .     .     .     .     .     7
      P     .     .     .     1     .     .     .     .     .     .     0
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99 37045
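
  The three tables above can be reproduced along the following lines.
  This is a sketch under assumptions, not the program that printed
  them: it assumes one character per symbol, treats the blank
  row/column as a word boundary, and assumes the "× 99" entries are
  99 times the relative frequency rounded to the nearest integer
  (which agrees with the entries I checked, e.g. 3090/5081 -> 60).
  The names digraph_counts and scaled_table are hypothetical.

```python
from collections import Counter

BOUNDARY = ""  # stands for the blank row/column (word start / word end)


def digraph_counts(words):
    """Count (previous, next) symbol pairs, including word boundaries."""
    counts = Counter()
    for w in words:
        symbols = [BOUNDARY] + list(w) + [BOUNDARY]
        for prev, nxt in zip(symbols, symbols[1:]):
            counts[(prev, nxt)] += 1
    return counts


def scaled_table(counts, given="prev", scale=99):
    """Scale each count by the total of its row (given="prev", the
    next-symbol table) or of its column (given="next", the
    previous-symbol table).  Note that Python's round() rounds exact
    halves to the even neighbour."""
    idx = 0 if given == "prev" else 1
    totals = Counter()
    for pair, n in counts.items():
        totals[pair[idx]] += n
    return {pair: round(scale * n / totals[pair[idx]])
            for pair, n in counts.items()}
```

  Each word of length n contributes n+1 pairs, which is why the grand
  total of the count table (37045) equals the byte count of
  bio-j-hec-gut.wds, newlines included.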


    cat bio-j-hec.wds \
      | /n/gnu/bin/awk -f foo.awk \
      | sort | uniq -c | sort -nr | expand \
      > bio-j-hec-wpairs.freq
  
  It seems that the gallows letters transected by \c..c/ are distinct letters. 
  To check that, let's look at the distribution of \c/ strings on their own
  and around the \lj/ and \qj/ gallows, ignoring the \s/ plumes on \c/: 
  
    cat bio-j-jsa.wds \
      | sed -e 's/cs/c/' \
      | enum-contexts -vPAT='[clqj]*c[clqj]*' -vLCTX=0 -vRCTX=1 \
      | wfreq4

    1848 cy             14 qjccccy         3 qcccg           1 qcqjccccg
     672 ccccg          14 ccccljcy        3 cs              1 qci
     469 ci             12 cccqg           3 cqjco           1 qccy
     453 cg             11 ljco            3 cqjcccg         1 qcci
     425 ljci           11 cqg             3 ccqg            1 qcccy
     251 ljccg          11 cccljcccg       3 cccqjccg        1 qccccg
     243 ljcccg         10 cccqjcy         3 cccqjcccg       1 ljcs
     227 ccccy          10 ccccqjcy        3 ccccqjcccg      1 ljcg
     149 qjci            9 ccs             3 ccccqg          1 ljccci
     131 ljcccy          9 cccljcccy       3 cc              1 ljccccljccy
     129 qjccg           9 cccc            2 qjcg            1 cqjcy
     100 ljcy            8 qjco            2 qcqjccg         1 cqjcco
      87 cci             8 cccljccg        2 qcljcccg        1 cqjcci
      86 cccg            7 cljccy          2 ljcci           1 cqjccg
      80 cccccg          7 cljcccy         2 ljccco          1 cqjccc
      79 qjcccg          7 cccljci         2 ljcccccg        1 ccqjccg
      75 cccccy          6 cccco           2 cqjccy          1 ccqjcccy
      74 ccco            5 qjcco           2 cljci           1 ccljco
      64 co              5 ljcco           2 cljccg          1 ccljcccy
      55 cccljccy        5 cqjcccy         2 clg             1 ccljcccg
      54 ljccy           5 ccy             2 ccqjo           1 cclg
      52 cco             5 ccljci          2 ccqjci          1 cccs
      52 cccy            5 ccg             2 ccljccg         1 cccqjci
      44 qjcy            5 ccccljcccy      2 ccccljcci       1 cccljco
      44 ccccljccy       5 ccccljcccg      2 ccccljccg       1 cccljcci
      38 cccqjccy        5 ccccc           2 ccccci          1 cccljccccy
      36 qjcccy          4 qjccco          2 ccc             1 cccljccccg
      29 ccccqjccy       4 ljcccccy        1 qjcccccy        1 cccljc
      26 qjccccg         4 cljcccg         1 qjccc           1 ccccs
      26 ljccccg         4 ccqjcy          1 qjcc            1 ccccqjci
      25 ljccccy         4 ccljcy          1 qcy             1 ccccqjcccy
      24 qjccy           4 cccqjcccy       1 qcqjcy          1 ccccljco
      24 cccci           4 ccclj           1 qcqjci          1 cccccqjcccy
      23 ccci            4 cccccs          1 qcqjccy         1 ccccco
      20 cccljcy         3 qjccci          1 qcqjcccy        1 ccccccy
      16 c            
    
  Separating into categories:
  
    No gallows    \lj/ gallows    \qj/ gallows

    1848 cy         425 ljci          149 qjci            
     469 ci         251 ljccg         129 qjccg           
     453 cg         243 ljcccg         79 qjcccg          
     672 ccccg      131 ljcccy         44 qjcy            
     227 ccccy      100 ljcy           36 qjcccy          
      87 cci         55 cccljccy       38 cccqjccy        
      86 cccg        54 ljccy          29 ccccqjccy       
      80 cccccg      44 ccccljccy      26 qjccccg         
      75 cccccy      26 ljccccg        24 qjccy           
      74 ccco        25 ljccccy        14 qjccccy         
      64 co          20 cccljcy        12 cccqg           
      52 cco         14 ccccljcy       11 cqg             
      52 cccy        11 ljco           10 cccqjcy         
      24 cccci       11 cccljcccg      10 ccccqjcy        
      23 ccci                           8 qjco         
      16 c            
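
  The tabulation and the split into categories can be sketched as
  follows.  The behaviour of enum-contexts is assumed here (each match
  of the pattern, extended by RCTX characters to its right, is tallied
  once); the function names and the "other" class are inventions of
  this sketch, which is not the original awk script.

```python
import re
from collections import Counter


def strip_first_plume(word):
    """Mirror sed 's/cs/c/' (no g flag): drop only the first \\s/ plume."""
    return word.replace("cs", "c", 1)


def enum_contexts(words, pattern, rctx=1):
    """Tally every match of `pattern` plus `rctx` characters of right context."""
    rx = re.compile(pattern)
    tally = Counter()
    for w in words:
        for m in rx.finditer(w):
            tally[m.group(0) + w[m.end():m.end() + rctx]] += 1
    return tally


def gallows_class(context):
    """Assign a tallied context to one of the columns of the table."""
    if "lj" in context:
        return "lj"
    if "qj" in context:
        return "qj"
    if not any(ch in context for ch in "lqj"):
        return "none"
    return "other"  # e.g. a stray q or l without its j


def by_class(tally):
    """Group the tally into {class: [(count, context), ...]}, sorted by count."""
    groups = {}
    for ctx, n in tally.most_common():
        groups.setdefault(gallows_class(ctx), []).append((n, ctx))
    return groups
```

  With pattern '[clqj]*c[clqj]*' the word "ljcccgcy" yields the two
  contexts "ljcccg" and "cy".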
    

97-07-15 stolfi
===============

  Created sample texts in English ("The Mysterious Affair at Styles")
  and Portuguese (Rober M. Rosi's master's thesis), with punctuation
  removed and all letters converted to lowercase.
  
  Extracted letter-in-context statistics for "t" and "d", "p" and "b"
  in those texts.  Results posted to the Voynich list, in reply to 
  Jacques Guy.
  
  Extracted comparative statistics for word-initial \c/ strings
  beginning with \cc/ and with \csc/:
  
    cat bio-j-jsa-gut.wds \
      | sed \
          -e 's/cs/S/g'  \
          -e 's/[ql][gj]/H/g'  \
          -e 's/cg/8/g'  \
          -e 's/cy/9/g'  \
          -e 's/ci/a/g'  \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts -rctx 1 '_cc[cS]*' '_Sc[cS]*' > .foo
  
       192  0.33  _ccc8                     219  0.38  _Scc8
        97  0.17  _cccH                      80  0.14  _SccH
        67  0.11  _ccc9                      71  0.12  _Scc9
        52  0.09  _ccccH                     56  0.10  _ScccH
        31  0.05  _cccc9                     38  0.07  _Sccc8
        25  0.04  _ccco                      30  0.05  _Scco
        22  0.04  _cc8                       23  0.04  _Sccc9
        20  0.03  _cccc8                     19  0.03  _Sco
        17  0.03  _cco                        9  0.02  _Sc9
        17  0.03  _ccH                        7  0.01  _Sca
        13  0.02  _cca                        6  0.01  _Scca
        11  0.02  _ccca                       6  0.01  _ScH
         8  0.01  _cccS_                      6  0.01  _Sc8
         6  0.01  _cc9                        3  0.01  _Sccco
         3  0.01  _ccccS_                     1  0.00  _SccccH
         1  0.00  _cccco                      1  0.00  _ScccS_
         1  0.00  _ccccco                     1  0.00  _SccS_
         1  0.00  _cccca                      1  0.00  _Sc_
         1  0.00  _cccSa                  -----  ----  ----
         1  0.00  _cccS9                    577  1.00  TOT
     -----  ----  ----                   
       586  1.00  TOT
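
  A rough Python equivalent of the comparison above, for illustration
  only.  The recoding steps are those of the sed commands in the
  pipeline; how compare-contexts works internally is assumed (tally
  each pattern match plus one character of right context, then report
  count and fraction of the column total).  The names recode and
  context_freqs are hypothetical.

```python
import re
from collections import Counter

# Same substitutions, in the same order, as the sed step of the pipeline.
RECODE = [(r"cs", "S"), (r"[ql][gj]", "H"), (r"cg", "8"), (r"cy", "9"), (r"ci", "a")]


def recode(word):
    """Recode one word and delimit it with '_' on both sides."""
    for pat, rep in RECODE:
        word = re.sub(pat, rep, word)
    return "_" + word + "_"


def context_freqs(words, pattern, rctx=1):
    """Return [(count, fraction, context), ...] sorted by decreasing count."""
    rx = re.compile(pattern)
    tally = Counter()
    for w in words:
        for m in rx.finditer(w):
            tally[m.group(0) + w[m.end():m.end() + rctx]] += 1
    total = sum(tally.values())
    return [(n, round(n / total, 2), ctx) for ctx, n in tally.most_common()]
```

  Calling context_freqs once with '_cc[cS]*' and once with '_Sc[cS]*'
  on the recoded words gives the two columns to be set side by side.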
  
  Trying yet another variant encoding, "huc":
  
    cat bio-m-jsa.evt \
      | jsa2huc \
      | make-consensus-interlin \
      > bio-x-huc.evt
      
    cat bio-x-huc.evt \
      | egrep '^<.*;J> ' \
      | sed \
          -e 's/{[^}]*}//g' \
      > bio-j-huc.evt

    extract-words-from-interlin \
        -chars "mnrfcgiaoAeHP" \
        bio-j-huc.evt \
        bio-j-huc
        
      jsa2huc
      -------------------------------------------------
      #! /n/gnu/bin/sed -f
      # Recoding superanalytic to ad-hoc encoding:
      /^[^#]/s/ij/f/g
      /^[^#]/s/ix/e/g
      /^[^#]/s/cy/X/g
      /^[^#]/s/ci/X/g
      /^[^#]/s/iiiiu/m/g
      /^[^#]/s/iiiu/m/g
      /^[^#]/s/iiu/n/g
      /^[^#]/s/iis/v/g
      /^[^#]/s/is/r/g
      /^[^#]/s/X/ci/g
      /^[^#]/s/cs/c/g
      /^[^#]/s/qo/A/g
      /^[^#]/s/qj/H/g
      /^[^#]/s/qg/P/g
      /^[^#]/s/lj/H/g
      /^[^#]/s/lg/P/g
      -------------------------------------------------

     lines   words     bytes file        
    ------ ------- --------- ------------
      7098    7098     48799 bio-j-huc.wds
      1705    1705     14931 bio-j-huc.dic
      4742    4742     33342 bio-j-huc-gut.wds
       737     737      5720 bio-j-huc-gut.dic
       892     892      2815 bio-j-huc-fun.wds
        42      42       296 bio-j-huc-fun.dic
      1464    1464     12642 bio-j-huc-bad.wds
       926     926      8915 bio-j-huc-bad.dic

    Digraph counts:

                  m     n     r     f     c     g     i     o     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .     .    81     .  1917     .     .   991  1243   284   158    68  4742
      m   350     .     .     .     .     3     .     .     .     .     .     .     .   353
      n   110     .     .     .     .     .     .     .     1     .     .     .     .   111
      r   415     .     .     .     .   106     .     .    37     .     .     .     .   558
      f    36     .     .     .     .     .     .     .     1     .     .     .     .    37
      c    47     .     .     .     .  7238  2147  4186   248     .    15   377    32 14290
      g    49     .     .     1     .  2051     .     .    37     .     5     4     .  2147
      i  2862   341   109   290    33    52     .     2     3     2   423    69     2  4188
      o    14    12     1   177     4    37     .     .     .     4   765   478    42  1534
      A     7     .     1     7     .    30     .     .     1     1   133  1047    24  1251
      e   840     .     .     2     .   489     .     .   104     1     4   185    10  1635
      H    10     .     .     .     .  2227     .     .    75     .     6     .     .  2318
      P     2     .     .     .     .   140     .     .    36     .     .     .     .   178
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT  4742   353   111   558    37 14290  2147  4188  1534  1251  1635  2318   178 33342

    Next-symbol probability (× 99):

                  m     n     r     f     c     g     i     o     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .     .     2     .    40     .     .    21    26     6     3     1    99
      m    98     .     .     .     .     1     .     .     .     .     .     .     .    99
      n    98     .     .     .     .     .     .     .     1     .     .     .     .    99
      r    74     .     .     .     .    19     .     .     7     .     .     .     .    99
      f    96     .     .     .     .     .     .     .     3     .     .     .     .    99
      c     .     .     .     .     .    50    15    29     2     .     .     3     .    99
      g     2     .     .     .     .    95     .     .     2     .     .     .     .    99
      i    68     8     3     7     1     1     .     .     .     .    10     2     .    99
      o     1     1     .    11     .     2     .     .     .     .    49    31     3    99
      A     1     .     .     1     .     2     .     .     .     .    11    83     2    99
      e    51     .     .     .     .    30     .     .     6     .     .    11     1    99
      H     .     .     .     .     .    95     .     .     3     .     .     .     .    99
      P     1     .     .     .     .    78     .     .    20     .     .     .     .    99
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    14     1     0     2     0    42     6    12     5     4     5     7     1 33342

    Previous-symbol probability (× 99):

                  m     n     r     f     c     g     i     o     A     e     H     P   TOT
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
            .     .     .    14     .    13     .     .    64    98    17     7    38    14
      m     7     .     .     .     .     .     .     .     .     .     .     .     .     1
      n     2     .     .     .     .     .     .     .     .     .     .     .     .     0
      r     9     .     .     .     .     1     .     .     2     .     .     .     .     2
      f     1     .     .     .     .     .     .     .     .     .     .     .     .     0
      c     1     .     .     .     .    50    99    99    16     .     1    16    18    42
      g     1     .     .     .     .    14     .     .     2     .     .     .     .     6
      i    60    96    97    51    88     .     .     .     .     .    26     3     1    12
      o     .     3     1    31    11     .     .     .     .     .    46    20    23     5
      A     .     .     1     1     .     .     .     .     .     .     8    45    13     4
      e    18     .     .     .     .     3     .     .     7     .     .     8     6     5
      H     .     .     .     .     .    15     .     .     5     .     .     .     .     7
      P     .     .     .     .     .     1     .     .     2     .     .     .     .     1
        ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT    99    99    99    99    99    99    99    99    99    99    99    99    99 33342


97-07-16 stolfi
===============

  I had already remarked that there were two main categories of words,
  \qoljc/-\qoqjc/ and \ccc/-\cscc/.  Comparing the two lists, it seems 
  that the latter can be split into two subclasses, with and without
  a gallows \lj/ (or \cljc/) in their "suffix".
  
  So I tabulated the three classes of words:
  
    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_AHc.*_' '_ccc[^HP]*_' '_ccc.*[HP].*_'
      
     201  0.20  AHccgci       387  0.46  ccccgci         90  0.29  cccHcci
     191  0.19  AHcccgci      139  0.16  cccci           68  0.22  ccccHcci
     116  0.11  AHcie          61  0.07  ccccci          31  0.10  cccHci
      94  0.09  AHcim          60  0.07  cccccgci        26  0.08  ccccHci
      87  0.09  AHccci         33  0.04  cccoe           15  0.05  cccHcccgci
      79  0.08  AHci           21  0.02  cccgci          11  0.04  cccHccci
      54  0.05  AHcin          17  0.02  cccor           10  0.03  cccHccgci
      50  0.05  AHcir          14  0.02  ccci             5  0.02  ccccHcccgci
      43  0.04  AHcci          10  0.01  cccir            5  0.02  cccPccci
      19  0.02  AHccccgci      10  0.01  cccc             4  0.01  ccccHccci
      12  0.01  AHcccci         9  0.01  ccccgcim         4  0.01  cccH
       7  0.01  AHcieci         9  0.01  ccccgcie         3  0.01  cccHcir
       5  0.00  AHcccg          8  0.01  ccccir           2  0.01  cccoHci
       4  0.00  AHcor           7  0.01  cccie            2  0.01  ccccPcci
       4  0.00  AHcif           7  0.01  ccccie           2  0.01  ccccHccie
       4  0.00  AHccgcir        7  0.01  ccccg            2  0.01  ccccHccgci
       3  0.00  AHciecgci       4  0.00  ccccoe           2  0.01  cccPccccgci
       3  0.00  AHcicgci        4  0.00  ccccgcir         2  0.01  cccHcim
       3  0.00  AHccg           4  0.00  ccccc            2  0.01  cccHcif
       2  0.00  AHcic           3  0.00  cccif            2  0.01  cccHc
       2  0.00  AHccgcie        3  0.00  cccgcir          1  0.00  ccciHccci
       2  0.00  AHcccgcie       3  0.00  ccccgoe          1  0.00  cccciHci
       2  0.00  AHcccccgci      2  0.00  ccccieci         1  0.00  ccccgciHcir
       1  0.00  AHcoeci         1  0.00  cccoeoe          1  0.00  cccccHcci
       1  0.00  AHcoe           1  0.00  cccoeo           1  0.00  cccccHccci
       1  0.00  AHcirci         1  0.00  cccoecgci        1  0.00  ccccPcccgci
       1  0.00  AHcircgci       1  0.00  cccoecccgci      1  0.00  ccccHco
       1  0.00  AHciie          1  0.00  cccoecccci       1  0.00  ccccHcccgccci
       1  0.00  AHcieoe         1  0.00  ccciror          1  0.00  cccPoe
       1  0.00  AHciecgcgci     1  0.00  cccirci          1  0.00  cccPci
       1  0.00  AHciecccci      1  0.00  cccieoeci        1  0.00  cccPcccgci
       1  0.00  AHciec          1  0.00  cccgoe           1  0.00  cccP
       1  0.00  AHcicccgci      1  0.00  cccgcin          1  0.00  cccHcor
       1  0.00  AHcgci          1  0.00  cccgcif          1  0.00  cccHcie
       1  0.00  AHccoe          1  0.00  cccgcie          1  0.00  cccHccie
       1  0.00  AHccir          1  0.00  ccccor           1  0.00  cccHccg
       1  0.00  AHccie          1  0.00  ccccieor         1  0.00  cccHcccie
       1  0.00  AHccgor         1  0.00  ccccgor          1  0.00  cccHcccci
       1  0.00  AHccgcim        1  0.00  ccccgcif         1  0.00  cccHccccg
       1  0.00  AHccgccgci      1  0.00  ccccgciecgci  ----  ----  ----
       1  0.00  AHccgccccgci    1  0.00  ccccgccci      307  1.00  TOT
       1  0.00  AHccccg         1  0.00  cccccoe         
       1  0.00  AHccccci        1  0.00  cccccim         
       1  0.00  AHccccHcci      1  0.00  cccccie         
       1  0.00  AHccc           1  0.00  cccccgcir       
       1  0.00  AHcc            1  0.00  cccccg          
    ----  ----  ----            1  0.00  cccccci         
    1010  1.00  TOT          ----  ----  ----              
                              846  1.00  TOT

  There are only 34 words that begin with "AH" but not "AHc".
  Of these, 21 are "AHoe", which may well be misreadings of "AHcie".
  
  Among the remaining words, there seem to be four additional major 
  classes:
    
    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oHc.*_' '_cgc.*_' '_oeHc.*_' '_Hc.*_'

      85  0.20  oHccgci      72  0.22  cgcim            28  0.21  Hccgci          20  0.19  oeHccgci   
      57  0.13  oHcccgci     51  0.16  cgcir            18  0.13  Hccccgci        19  0.18  oeHccci    
      41  0.09  oHcim        50  0.15  cgcie            16  0.12  Hcccgci         15  0.15  oeHcccgci  
      40  0.09  oHcie        36  0.11  cgci             10  0.07  Hcie            10  0.10  oeHci      
      36  0.08  oHcir        27  0.08  cgccccgci        10  0.07  Hcccci           7  0.07  oeHcir     
      35  0.08  oHccci       12  0.04  cgcin             8  0.06  Hcir             7  0.07  oeHcim     
      28  0.06  oHci          6  0.02  cgcif             7  0.05  Hcim             5  0.05  oeHcin     
      22  0.05  oHcci         6  0.02  cgcieci           5  0.04  Hccci            5  0.05  oeHcie     
      16  0.04  oHcin         5  0.02  cgcirci           4  0.03  Hccoe            4  0.04  oeHcci     
      13  0.03  oHcccci       5  0.02  cgcccgci          3  0.02  Hcoe             3  0.03  oeHcccci   
      10  0.02  oHccccgci     4  0.01  cgcccoe           3  0.02  Hcccoe           2  0.02  oeHccccgci 
       6  0.01  oHcoe         3  0.01  cgciecgci         2  0.01  Hcin             1  0.01  oeHcoe     
       4  0.01  oHcieor       3  0.01  cgccoe            2  0.01  Hci              1  0.01  oeHcif     
       4  0.01  oHcieci       3  0.01  cgcccci           2  0.01  Hcci             1  0.01  oeHcicgci  
       4  0.01  oHccgcir      3  0.01  cgccccci          2  0.01  Hcccg            1  0.01  oeHccg     
       3  0.01  oHcccg        2  0.01  cgcieo            1  0.01  Hcircie          1  0.01  oeHcccir   
       2  0.00  oHcieoe       2  0.01  cgciecccgci       1  0.01  Hcirci           1  0.01  oeHccccci  
       2  0.00  oHciecgci     2  0.01  cgcieccccgci      1  0.01  Hcif           ---  ----  ----       
       2  0.00  oHccgcie      2  0.01  cgciHccgci        1  0.01  Hcieci         103  1.00  TOT        
       1  0.00  oHcoeor       2  0.01  cgciHcccgci       1  0.01  Hcic                                 
       1  0.00  oHcoecgci     2  0.01  cgccor            1  0.01  HciAHci                              
       1  0.00  oHciroeof     2  0.01  cgccgci           1  0.01  Hccgcioe                             
       1  0.00  oHcircif      2  0.01  cgccci            1  0.01  Hccgcie                              
       1  0.00  oHcioc        2  0.01  cgcccccgci        1  0.01  HcccoHcccgci                         
       1  0.00  oHcieccci     1  0.00  cgcirorci         1  0.01  Hcccir                               
       1  0.00  oHciecccgci   1  0.00  cgcirof           1  0.01  Hcccie                               
       1  0.00  oHcieccccgci  1  0.00  cgcircccci        1  0.01  HcccgoeHcgci                         
       1  0.00  oHcieHci      1  0.00  cgcircccccgcie    1  0.01  Hcccgcif                             
       1  0.00  oHcicgci      1  0.00  cgciie            1  0.01  Hccccgcir                            
       1  0.00  oHcgcie       1  0.00  cgcieoe           1  0.01  Hccccci                              
       1  0.00  oHccor        1  0.00  cgciecirci      ---  ----  ----                                 
       1  0.00  oHccoe        1  0.00  cgciecircg      135  1.00  TOT                                  
       1  0.00  oHccocgci     1  0.00  cgciecir                                                        
       1  0.00  oHccoHcir     1  0.00  cgciecim                                                        
       1  0.00  oHccgcim      1  0.00  cgciecie                                                        
       1  0.00  oHccgcif      1  0.00  cgcieHci                                                        
       1  0.00  oHccgcieor    1  0.00  cgcicHcci                                                       
       1  0.00  oHccgccci     1  0.00  cgciHcim                                                        
       1  0.00  oHccg         1  0.00  cgciHci                                                         
       1  0.00  oHcccir       1  0.00  cgciHcci                                                        
       1  0.00  oHcccie       1  0.00  cgciHccci                                                       
       1  0.00  oHcccgcie     1  0.00  cgccoecicg                                                      
       1  0.00  oHccccic      1  0.00  cgccccoe                                                        
       1  0.00  oHccccci      1  0.00  cgcccccgcie                                                     
     ---  ----  ----          1  0.00  cgccccc                                                         
     435  1.00  TOT           1  0.00  cgcccHcccgci                                                    
                            ---  ----  ----                                                            
                            326  1.00  TOT                                                             

  The rest, again, seems to contain two major classes of words:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oec.*_' '_ec.*_'

      73  0.39  eccccgci            38  0.29  oeccccgci        
      20  0.11  ecccci              22  0.17  oeci             
       8  0.04  ecgci               19  0.15  oecccci          
       7  0.04  ecccgci              9  0.07  oecgci           
       6  0.03  eci                  6  0.05  oecccgci         
       6  0.03  eccccci              4  0.03  oeccci           
       6  0.03  ecccccgci            4  0.03  oeccccci         
       5  0.03  eccci                3  0.02  oecin            
       5  0.03  eccccg               3  0.02  oecim            
       5  0.03  eccccHcci            3  0.02  oecccoe          
       3  0.02  ecie                 3  0.02  oeccccg          
       3  0.02  eccor                2  0.02  oecie            
       3  0.02  ecccoe               2  0.02  oecccHcci        
       3  0.02  ecccHcci             2  0.02  oec              
       2  0.01  ecir                 1  0.01  oecir            
       2  0.01  ecim                 1  0.01  oecimci          
       2  0.01  ecce                 1  0.01  oeccieci         
       2  0.01  eccccHcccgci         1  0.01  oecccor          
       2  0.01  ecccHci              1  0.01  oecccgcir        
       2  0.01  ecc                  1  0.01  oeccccif         
       1  0.01  ecif                 1  0.01  oecccccgci       
       1  0.01  ecgoe                1  0.01  oecccccg         
       1  0.01  ecgcir               1  0.01  oeccccc          
       1  0.01  ecgcim               1  0.01  oeccccHcci       
       1  0.01  ecgcieor             1  0.01  oeccHci          
       1  0.01  ecg                ---  ----  ----               
       1  0.01  ecco               131  1.00  TOT                
       1  0.01  ecccir                                           
       1  0.01  ecccgciA                                         
       1  0.01  eccccor                                          
       1  0.01  eccccir                                          
       1  0.01  eccccif                                          
       1  0.01  eccccgcirci                                      
       1  0.01  eccccgcir                                        
       1  0.01  eccccgcie                                        
       1  0.01  eccccc                                           
       1  0.01  eccccHcie                                        
       1  0.01  eccccHccci                                       
       1  0.01  eccc                                             
       1  0.01  ecccPcccgci                                      
     ---  ----  ----                                               
     185  1.00  TOT                                                

  Yet some more:

    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_oPc.*_' '_Pc.*_' '_cic.*_' '_ciHc.*_'
      
    18  0.49  oPccccgci       17  0.40  Pccccgci      12  0.32  ciccccgci       12  0.23  ciHccgci
     4  0.11  oPcccci          4  0.10  Pcccoe         5  0.13  cicccci         12  0.23  ciHcccgci
     2  0.05  oPcir            4  0.10  Pccccgcir      4  0.11  ciccccci        10  0.19  ciHccci
     2  0.05  oPci             3  0.07  Pcccgci        2  0.05  cicccccgci       4  0.08  ciHcim
     2  0.05  oPcccgci         3  0.07  Pcccci         1  0.03  cicir            3  0.06  ciHcie
     2  0.05  oPccccci         2  0.05  Pccor          1  0.03  cicgcircie       2  0.04  ciHcir
     1  0.03  oPcieci          2  0.05  Pccoe          1  0.03  cicgcircccci     2  0.04  ciHcin
     1  0.03  oPcieccccgci     1  0.02  Pcir           1  0.03  cicgcim          2  0.04  ciHcci
     1  0.03  oPcie            1  0.02  PciHccgci      1  0.03  cicgci           2  0.04  ciHcccgcir
     1  0.03  oPcieHcim        1  0.02  Pcgoe          1  0.03  cicccoe          1  0.02  ciHci
     1  0.03  oPcccieci        1  0.02  Pcgcieccor     1  0.03  cicccciecgci     1  0.02  ciHcccoe
     1  0.03  oPccccgcie       1  0.02  Pcccoecgci     1  0.03  ciccccgcir       1  0.02  ciHccccgci
     1  0.03  oPccccg          1  0.02  Pccciroe       1  0.03  ciccccg          1  0.02  ciHccccci
   ---  ----  ----             1  0.02  Pccccgcie      1  0.03  cicccccci      ---  ----  ----
    37  1.00  TOT            ---  ----  ----           1  0.03  ciccccc         53  1.00  TOT
                              42  1.00  TOT            1  0.03  ciccccHcci   
                                                       1  0.03  ciccccHccci  
                                                       1  0.03  cicPcim      
                                                       1  0.03  cicHccci     
                                                     ---  ----  ----          
                                                      38  1.00  TOT           

  Yet some more (yawn):
  
    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_APc.*_' '_rc.*_' '_Aec.*_' '_eHc.*_'

     13  0.22  rcim         8  0.20  eHccgci     9  0.24  Aecccci      11  0.55  APccccgci   
     10  0.17  rccccgci     7  0.17  eHcccgci    7  0.19  Aeci          2  0.10  APcccgci    
      6  0.10  rcie         6  0.15  eHcim       7  0.19  Aeccccgci     2  0.10  APcccci     
      5  0.08  rcccci       4  0.10  eHcir       3  0.08  Aecccccgci    1  0.05  APci        
      3  0.05  rcir         3  0.07  eHccci      2  0.05  Aecim         1  0.05  APcgci      
      3  0.05  rcif         2  0.05  eHcin       2  0.05  Aecie         1  0.05  APccoe      
      2  0.03  rci          2  0.05  eHci        2  0.05  Aeccci        1  0.05  APcccoe     
      2  0.03  rccccg       2  0.05  eHccccgci   1  0.03  Aecir         1  0.05  APcccgcie   
      2  0.03  rcccccgci    1  0.02  eHcoecgci   1  0.03  Aecin       ---  ----  ----        
      2  0.03  rcccHci      1  0.02  eHcifo      1  0.03  Aecgci       20  1.00  TOT         
      1  0.02  rcirci       1  0.02  eHcieor     1  0.03  Aecccoe                            
      1  0.02  rcin         1  0.02  eHcci       1  0.03  Aecccgci                           
      1  0.02  rcieci       1  0.02  eHccgcci  ---  ----  ----                               
      1  0.02  rciecce      1  0.02  eHcccg     37  1.00  TOT                                
      1  0.02  rcicHcci     1  0.02  eHcccci                                                 
      1  0.02  rccci      ---  ----  ----                                                    
      1  0.02  rcccgci     41  1.00  TOT                                                     
      1  0.02  rcccciecg                                                                     
      1  0.02  rccccie                                                                       
      1  0.02  rccccci                                                                       
      1  0.02  rcccc                                                                         
      1  0.02  rccc                                                                          
    ---  ----  ----                                                                          
     60  1.00  TOT                                                                           

  On casual inspection, many of these classes seem to consist of two 
  superimposed classes, differing by a "c/cc" switch.  For instance,
  the '_ccc.*[HP].*_' class seems to be the union of '_cccHc.*_' and '_ccccHc.*_'.
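
  One rough way to probe this is to tally how many words of the broad
  class fall under each of the two narrower patterns, and how many
  are left over.  A sketch in plain sh/awk, assuming one word per
  line on stdin; the patterns are copied from the paragraph above
  with "^" standing in for the leading "_", and the name split_class
  is made up:

```shell
# Sketch only: split the '^ccc.*[HP]' class into the two narrower
# patterns named in the text, plus a residue.
# Reads one word per line on stdin.
split_class() {
  awk '
    /^ccc.*[HP]/ {
      total++
      if      ($0 ~ /^cccHc/)  three++   # three c, then Hc
      else if ($0 ~ /^ccccHc/) four++    # four c, then Hc
      else                     other++   # anything else in the class
    }
    END {
      printf "total=%d cccHc=%d ccccHc=%d other=%d\n", total, three, four, other
    }
  '
}

# Usage (file name taken from the commands in this notebook):
#   cat bio-j-huc-gut.wds | split_class
```

  A small "other" count relative to the total would be consistent
  with the union reading; a large one would argue against it.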
  
  Let's try to identify the suffixes:
  
    /bin/rm -f .title
    /bin/rm -f .table
    /bin/echo "_" > .table
    
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
      'cc[^H]'  'cc.*H' \
      'ci[^H]'  'ci.*H' \
      'cg[^H]'  'cg.*H' \
      'oe[^H]'  'oe.*H' \
      'Ae[^H]'  'Ae.*H' \
      'e[^H]'   'e.*H'  \
      'r[^H]'   'r.*H'  \
      'AH'      'oH'    \
      'AP'      'oP'    \
      'H'       'P'     \
    )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq

      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    
    /n/gnu/bin/printf "\n" >> .title
    
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    
    cat .title .tbsort \
      | format-suffix-table
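
  For spot-checking a single column of that table, the same
  selection can be sketched in one awk pass.  This is plain sh/awk,
  not the csh script above, and it is only my reading of what the
  egrep/sed pair does: keep words that start with the prefix pattern
  and have no "H" after it, strip the prefix, and count what
  remains.  The name suffix_counts is made up; an empty suffix is
  shown as "_", as in the table.

```shell
# Sketch only: suffix counts for one prefix pattern.
#   $1 = prefix pattern (ERE, without anchors), e.g. 'oH' or 'cc[^H]'
#   stdin = one word per line, repeats included
suffix_counts() {
  awk -v pat="$1" '
    BEGIN { re = "^" pat }
    match($0, re) {
      # Longest match of the prefix, then everything after it.
      suffix = substr($0, RSTART + RLENGTH)
      if (suffix ~ /H/) next          # a later H: word belongs elsewhere
      if (suffix == "") suffix = "_"  # bare prefix, empty suffix
      count[suffix]++
    }
    END { for (s in count) print count[s], s }
  ' | sort -k1,1nr -k2,2
}

# Usage (file name taken from the commands in this notebook):
#   cat bio-j-huc-gut.wds | suffix_counts 'oH'
```

  The figures in this notebook come from the csh script, not from
  this sketch.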
      
  Here are the results:
  
    SUFFIX       TOTAL  cc[^H]   cc.*H  ci[^H]   ci.*H  cg[^H]   cg.*H  oe[^H]   oe.*H
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    TOTALORUM     4262     966     322     103      58     345      18     158     115
    ------------ ----- ------- ------- ------- ------- ------- ------- ------- -------
    cccgci         503       0      21      14      12      27       3      38      15 
    ccgci          455      60      15       0      12       5       6       6      20 
    cgci           392     388       0       0       0       2       0       0       0 
    ci             353     139      68       7       2       0       2       0      13 
    cci            327      61     160       0       4       2       2       4       8 
    ccci           254       1      20       5      12       3       3      21      20 
    cie            187       7       4       0       3       0       0       0       5 
    cim            167       1       5       0       4       0       1       0       7 
    cir            125       8       5       1       2       0       0       0       8 
    ccccgci        125       1       0       4       1       3       0       5       3 
    im              92       0       0       0       0      72       0       3       0 
    oe              90      33       3       3       0       1       0       0       1 
    e               90      37       0       0       0      17       0       7       0 
    i               87      14       0       0       0      36       0      22       0 
    cin             81       0       0       0       2       0       0       0       5 
    _               78       7       4      47       0       4       0       4       0 
    cccci           71       0       2       5       0       3       1       4       3 
    ie              70       7       0       0       0      50       0       2       0 
    ir              69      10       0       1       0      51       0       1       0 
    r               42      13       0       0       0       3       0      13       0 
    gci             40      21       0       1       0       0       0       9       0 
    m               32      30       0       0       0       0       0       0       0 
    or              31      17       0       2       0       0       0       0       0 
    ccoe            22       1       0       1       0       4       0       3       0 
    cccg            22       0       0       1       0       0       0       3       0 
    coe             19       4       1       0       0       3       0       0       1 
    in              17       0       0       0       0      12       0       3       0 
    cieci           16       2       0       0       0       0       0       1       0 
    c               15      11       2       0       0       0       0       0       0 
    if              13       3       0       0       0       6       0       0       0 
    cor             11       1       1       0       0       2       0       0       0 
    cgcim           11      11       0       0       0       0       0       0       0 
    cgcie           10       9       0       0       0       0       0       0       0 
    ccgcir          10       1       0       0       0       0       0       1       0 
    cccoe           10       0       0       0       1       1       0       0       0 
    cif              9       0       2       0       0       0       0       0       1 
    cg               8       7       0       0       0       0       0       0       0 
    ccccci           8       0       0       1       1       0       0       0       1 
    irci             7       1       0       0       0       5       0       0       0 
    ieci             7       0       0       0       0       6       0       0       0 
    eci              7       2       0       0       0       0       0       0       0 
    ccg              7       1       1       0       0       0       0       0       1 
    cc               7       4       0       0       0       0       0       0       0 
    cieor            6       1       0       0       0       0       0       0       0 
    ciecgci          6       0       0       1       0       0       0       0       0 
    cicgci           5       0       0       0       0       0       0       0       1 
    cgcir            5       4       0       0       0       0       0       0       0 
    ccie             5       1       3       0       0       0       0       0       0 
    ccgcie           5       0       0       0       0       0       0       0       0 
    cccgcie          5       0       0       0       0       0       0       0       0 
    ccccgcir         5       0       0       0       0       0       0       0       0 
    Pccci            5       5       0       0       0       0       0       0       0 
    oecgci           4       1       0       0       0       0       0       0       0 
    oecccgci         4       1       0       0       0       0       0       0       0 
    gcir             4       3       0       0       0       0       0       0       0 
    ecgci            4       3       0       0       0       0       0       0       0 
    cgoe             4       3       0       0       0       0       0       0       0 
    ccor             4       0       0       0       0       0       0       1       0 
    cccir            4       0       0       0       0       0       0       0       1 
    cccie            4       0       1       0       0       0       0       0       0 
    cccgcir          4       0       0       1       2       0       0       0       0 
    ccccg            4       0       1       0       0       0       0       1       0 
    cccc             4       0       0       1       0       1       0       1       0 
    om               3       0       0       0       0       0       0       0       1 
    oeccccgci        3       0       0       0       0       0       0       0       0 
    iecgci           3       0       0       0       0       3       0       0       0 
    ecccci           3       1       0       0       0       0       0       1       0 
    circi            3       0       0       0       0       0       0       0       0 
    cieoe            3       0       0       0       0       0       0       0       0 
    cic              3       0       0       0       0       0       0       0       0 
    ccccgcie         3       0       0       0       0       1       0       0       0 
    cccccgci         3       1       0       0       0       0       0       0       0 
    roe              2       0       0       0       0       0       0       1       0 
    rcim             2       0       0       0       0       1       0       0       0 
    rcie             2       1       0       0       0       0       0       0       0 
    oeoe             2       1       0       0       0       0       0       0       0 
    oeci             2       0       0       0       0       0       0       0       0 
    oeccci           2       0       0       0       0       0       0       0       0 
    oecccci          2       1       0       0       0       0       0       0       0 
    ieo              2       0       0       0       0       2       0       0       0 
    iecccgci         2       0       0       0       0       2       0       0       0 
    ieccccgci        2       0       0       0       0       2       0       0       0 
    goe              2       1       0       0       0       0       0       0       0 
    gcim             2       0       0       1       0       0       0       0       0 
    eor              2       0       0       0       0       0       0       0       0 
    coecgci          2       0       0       0       0       0       0       0       0 
    co               2       0       1       0       0       0       0       0       0 
    cieccccgci       2       0       0       0       0       0       0       0       0 
    ce               2       0       0       0       0       0       0       0       0 
    ccir             2       0       0       0       0       0       0       0       0 
    ccgcim           2       0       0       0       0       0       0       0       0 
    cccif            2       0       0       0       0       0       0       1       0 
    ccc              2       0       0       0       0       0       0       0       0 
    cPcci            2       2       0       0       0       0       0       0       0 
    cPcccgci         2       2       0       0       0       0       0       0       0 
    Pccccgci         2       2       0       0       0       0       0       0       0 
    rcin             1       0       0       0       0       0       0       0       0 
    oroeccccgci      1       0       0       0       0       0       0       0       0 
    orci             1       0       0       1       0       0       0       0       0 
    orccci           1       0       0       0       0       0       0       0       0 
    of               1       0       0       0       0       0       0       1       0 
    oeo              1       1       0       0       0       0       0       0       0 
    oecircir         1       0       0       0       0       0       0       0       0 
    oecim            1       0       0       0       0       0       0       0       0 
    oeciecccgci      1       0       0       0       0       0       0       0       0 
    oecgcie          1       0       0       0       0       0       0       0       0 
    oecgccccgci      1       0       0       0       0       0       0       0       0 
    oecccgcie        1       0       0       0       0       0       0       0       0 
    oecccg           1       0       0       0       0       0       0       0       0 
    oeccccg          1       0       0       0       0       0       0       0       0 
    oePci            1       0       0       0       0       0       0       0       0 
    ocgcie           1       0       0       0       0       0       0       0       0 
    oPci             1       0       0       0       0       0       0       0       0 
    no               1       1       0       0       0       0       0       0       0 
    irorci           1       0       0       0       0       1       0       0       0 
    iror             1       1       0       0       0       0       0       0       0 
    irof             1       0       0       0       0       1       0       0       0 
    ircccci          1       0       0       0       0       1       0       0       0 
    ircccccgcie      1       0       0       0       0       1       0       0       0 
    imci             1       0       0       0       0       0       0       1       0 
    iie              1       0       0       0       0       1       0       0       0 
    ieoeci           1       1       0       0       0       0       0       0       0 
    ieoe             1       0       0       0       0       1       0       0       0 
    iecirci          1       0       0       0       0       1       0       0       0 
    iecircg          1       0       0       0       0       1       0       0       0 
    iecir            1       0       0       0       0       1       0       0       0 
    iecim            1       0       0       0       0       1       0       0       0 
    iecie            1       0       0       0       0       1       0       0       0 
    iecce            1       0       0       0       0       0       0       0       0 
    gcircie          1       0       0       1       0       0       0       0       0 
    gcircccci        1       0       0       1       0       0       0       0       0 
    gcin             1       1       0       0       0       0       0       0       0 
    gcif             1       1       0       0       0       0       0       0       0 
    gcieor           1       0       0       0       0       0       0       0       0 
    gcie             1       1       0       0       0       0       0       0       0 
    g                1       0       0       0       0       0       0       0       0 
    eof              1       0       0       0       0       0       0       0       0 
    eoe              1       0       0       0       0       0       0       0       0 
    eo               1       1       0       0       0       0       0       0       0 
    eccccgci         1       0       0       0       0       1       0       0       0 
    eccccg           1       1       0       0       0       0       0       0       0 
    ecccccgci        1       0       0       0       0       1       0       0       0 
    ePccccgci        1       0       0       0       0       1       0       0       0 
    coeor            1       0       0       0       0       0       0       0       0 
    coecicg          1       0       0       0       0       1       0       0       0 
    coeci            1       0       0       0       0       0       0       0       0 
    ciroeof          1       0       0       0       0       0       0       0       0 
    ciroe            1       0       1       0       0       0       0       0       0 
    circif           1       0       0       0       0       0       0       0       0 
    circie           1       0       0       0       0       0       0       0       0 
    circgci          1       0       0       0       0       0       0       0       0 
    cioc             1       0       0       0       0       0       0       0       0 
    ciie             1       0       0       0       0       0       0       0       0 
    cifo             1       0       0       0       0       0       0       0       0 
    ciecgcgci        1       0       0       0       0       0       0       0       0 
    cieccci          1       0       0       0       0       0       0       0       0 
    ciecccgci        1       0       0       0       0       0       0       0       0 
    ciecccci         1       0       0       0       0       0       0       0       0 
    ciec             1       0       0       0       0       0       0       0       0 
    cicccgci         1       0       0       0       0       0       0       0       0 
    cgor             1       1       0       0       0       0       0       0       0 
    cgcif            1       1       0       0       0       0       0       0       0 
    cgciecgci        1       1       0       0       0       0       0       0       0 
    cgcieccor        1       0       0       0       0       0       0       0       0 
    cgccci           1       1       0       0       0       0       0       0       0 
    ccoeci           1       0       0       0       0       0       0       0       0 
    ccocgci          1       0       0       0       0       0       0       0       0 
    ccim             1       1       0       0       0       0       0       0       0 
    ccgor            1       0       0       0       0       0       0       0       0 
    ccgcioe          1       0       0       0       0       0       0       0       0 
    ccgcif           1       0       0       0       0       0       0       0       0 
    ccgcieor         1       0       0       0       0       0       0       0       0 
    ccgciA           1       0       0       0       0       0       0       0       0 
    ccgcci           1       0       0       0       0       0       0       0       0 
    ccgccgci         1       0       0       0       0       0       0       0       0 
    ccgccci          1       0       0       0       0       0       0       0       0 
    ccgccccgci       1       0       0       0       0       0       0       0       0 
    cccor            1       0       0       0       0       0       0       0       0 
    cccoecgci        1       0       0       0       0       0       0       0       0 
    ccciroe          1       0       0       0       0       0       0       0       0 
    cccieci          1       0       0       0       0       0       0       0       0 
    ccciecgci        1       0       0       1       0       0       0       0       0 
    ccciecg          1       0       0       0       0       0       0       0       0 
    cccgcirci        1       0       0       0       0       0       0       0       0 
    cccgcif          1       0       0       0       0       0       0       0       0 
    cccgccci         1       0       1       0       0       0       0       0       0 
    ccccic           1       0       0       0       0       0       0       0       0 
    ccccc            1       0       0       1       0       0       0       0       0 
    ccPcccgci        1       0       0       0       0       0       0       0       0 
    ccPccccci        1       1       0       0       0       0       0       0       0 
    Poe              1       1       0       0       0       0       0       0       0 
    Pcim             1       0       0       1       0       0       0       0       0 
    Pci              1       1       0       0       0       0       0       0       0 
    Pcccgci          1       1       0       0       0       0       0       0       0 
    P                1       1       0       0       0       0       0       0       0 
  

    SUFFIX       TOTAL Ae[^H]  Ae.*H  e[^H]   e.*H  r[^H]   r.*H     AH     oH     AP     oP      H       P
    ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
    TOTALORUM     4262     38     13    219     61     73      3   1043    451     23     40    150      63  
    ------------ ----- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ -------
    cccgci         503      7      2     74      9     10      0    191     57      2      2     16       3 
    ccgci          455      1      0      7      8      1      0    201     85      0      0     28       0 
    cgci           392      0      0      0      0      0      0      1      0      1      0      0       0 
    ci             353      0      4      0      4      0      2     79     28      1      2      2       0 
    cci            327      2      1      5      9      1      1     43     22      0      0      2       0 
    ccci           254      9      3     21      4      5      0     87     35      0      0      5       0 
    cie            187      0      0      0      1      0      0    116     40      0      1     10       0 
    cim            167      0      0      0      7      0      0     94     41      0      0      7       0 
    cir            125      0      0      0      4      0      0     50     36      0      2      8       1 
    ccccgci        125      3      0      8      2      2      0     19     10     11     18     18      17 
    im              92      2      0      2      0     13      0      0      0      0      0      0       0 
    oe              90      0      0      1      1      0      0     21     10      1      1      6       8 
    e               90      1      0     16      1      9      0      2      0      0      0      0       0 
    i               87      7      0      6      0      2      0      0      0      0      0      0       0 
    cin             81      0      0      0      2      0      0     54     16      0      0      2       0 
    _               78      0      0      5      0      0      0      2      3      1      0      1       0 
    cccci           71      0      1      6      1      1      0     12     13      2      4     10       3 
    ie              70      2      0      3      0      6      0      0      0      0      0      0       0 
    ir              69      1      0      2      0      3      0      0      0      0      0      0       0 
    r               42      0      0     10      0      3      0      0      0      0      0      0       0 
    gci             40      1      0      8      0      0      0      0      0      0      0      0       0 
    m               32      0      0      0      0      2      0      0      0      0      0      0       0 
    or              31      0      0      0      0      0      0      4      2      1      1      4       0 
    ccoe            22      1      0      3      0      0      0      1      1      1      0      4       2 
    cccg            22      0      0      5      1      2      0      5      3      0      0      2       0 
    coe             19      0      0      0      0      0      0      1      6      0      0      3       0 
    in              17      1      0      0      0      1      0      0      0      0      0      0       0 
    cieci           16      0      0      0      0      0      0      7      4      0      1      1       0 
    c               15      0      0      2      0      0      0      0      0      0      0      0       0 
    if              13      0      0      1      0      3      0      0      0      0      0      0       0 
    cor             11      0      0      3      0      0      0      4      0      0      0      0       0 
    cgcim           11      0      0      0      0      0      0      0      0      0      0      0       0 
    cgcie           10      0      0      0      0      0      0      0      1      0      0      0       0 
    ccgcir          10      0      0      0      0      0      0      4      4      0      0      0       0 
    cccoe           10      0      0      0      0      0      0      0      0      1      0      3       4 
    cif              9      0      0      0      1      0      0      4      0      0      0      1       0 
    cg               8      0      0      1      0      0      0      0      0      0      0      0       0 
    ccccci           8      0      0      0      0      0      0      1      1      0      2      1       0 
    irci             7      0      0      0      0      1      0      0      0      0      0      0       0 
    ieci             7      0      0      0      0      1      0      0      0      0      0      0       0 
    eci              7      0      0      4      0      1      0      0      0      0      0      0       0 
    ccg              7      0      0      0      0      0      0      3      1      0      0      0       0 
    cc               7      0      0      1      0      1      0      1      0      0      0      0       0 
    cieor            6      0      0      0      1      0      0      0      4      0      0      0       0 
    ciecgci          6      0      0      0      0      0      0      3      2      0      0      0       0 
    cicgci           5      0      0      0      0      0      0      3      1      0      0      0       0 
    cgcir            5      0      0      1      0      0      0      0      0      0      0      0       0 
    ccie             5      0      0      0      0      0      0      1      0      0      0      0       0 
    ccgcie           5      0      0      0      0      0      0      2      2      0      0      1       0 
    cccgcie          5      0      0      1      0      0      0      2      1      1      0      0       0 
    ccccgcir         5      0      0      0      0      0      0      0      0      0      0      1       4 
    Pccci            5      0      0      0      0      0      0      0      0      0      0      0       0 
    oecgci           4      0      0      0      0      0      0      1      1      0      0      1       0 
    oecccgci         4      0      0      0      0      0      0      0      0      0      0      1       2 
    gcir             4      0      0      1      0      0      0      0      0      0      0      0       0 
    ecgci            4      0      0      1      0      0      0      0      0      0      0      0       0 
    cgoe             4      0      0      0      0      0      0      0      0      0      0      0       1 
    ccor             4      0      0      0      0      0      0      0      1      0      0      0       2 
    cccir            4      0      0      1      0      0      0      0      1      0      0      1       0 
    cccie            4      0      0      0      0      1      0      0      1      0      0      1       0 
    cccgcir          4      0      0      1      0      0      0      0      0      0      0      0       0 
    ccccg            4      0      0      0      0      0      0      1      0      0      1      0       0 
    cccc             4      0      0      1      0      0      0      0      0      0      0      0       0 
    om               3      0      0      0      0      0      0      1      0      0      0      0       1 
    oeccccgci        3      0      0      0      0      0      0      1      0      0      0      0       2 
    iecgci           3      0      0      0      0      0      0      0      0      0      0      0       0 
    ecccci           3      0      0      0      1      0      0      0      0      0      0      0       0 
    circi            3      0      1      0      0      0      0      1      0      0      0      1       0 
    cieoe            3      0      0      0      0      0      0      1      2      0      0      0       0 
    cic              3      0      0      0      0      0      0      2      0      0      0      1       0 
    ccccgcie         3      0      0      0      0      0      0      0      0      0      1      0       1 
    cccccgci         3      0      0      0      0      0      0      2      0      0      0      0       0 
    roe              2      0      0      0      0      1      0      0      0      0      0      0       0 
    rcim             2      0      0      1      0      0      0      0      0      0      0      0       0 
    rcie             2      0      0      1      0      0      0      0      0      0      0      0       0 
    oeoe             2      0      0      0      0      0      0      1      0      0      0      0       0 
    oeci             2      0      0      0      0      0      0      0      0      0      2      0       0 
    oeccci           2      0      0      0      0      0      0      0      0      0      0      0       2 
    oecccci          2      0      0      0      0      0      0      0      0      0      0      0       1 
    ieo              2      0      0      0      0      0      0      0      0      0      0      0       0 
    iecccgci         2      0      0      0      0      0      0      0      0      0      0      0       0 
    ieccccgci        2      0      0      0      0      0      0      0      0      0      0      0       0 
    goe              2      0      0      1      0      0      0      0      0      0      0      0       0 
    gcim             2      0      0      1      0      0      0      0      0      0      0      0       0 
    eor              2      0      0      1      0      0      0      0      1      0      0      0       0 
    coecgci          2      0      0      0      1      0      0      0      1      0      0      0       0 
    co               2      0      0      1      0      0      0      0      0      0      0      0       0 
    cieccccgci       2      0      0      0      0      0      0      0      1      0      1      0       0 
    ce               2      0      0      2      0      0      0      0      0      0      0      0       0 
    ccir             2      0      0      1      0      0      0      1      0      0      0      0       0 
    ccgcim           2      0      0      0      0      0      0      1      1      0      0      0       0 
    cccif            2      0      0      1      0      0      0      0      0      0      0      0       0 
    ccc              2      0      0      0      0      1      0      1      0      0      0      0       0 
    cPcci            2      0      0      0      0      0      0      0      0      0      0      0       0 
    cPcccgci         2      0      0      0      0      0      0      0      0      0      0      0       0 
    Pccccgci         2      0      0      0      0      0      0      0      0      0      0      0       0 
    rcin             1      0      0      1      0      0      0      0      0      0      0      0       0 
    oroeccccgci      1      0      0      0      0      0      0      0      0      0      0      1       0 
    orci             1      0      0      0      0      0      0      0      0      0      0      0       0 
    orccci           1      0      0      0      0      0      0      0      0      0      0      1       0 
    of               1      0      0      0      0      0      0      0      0      0      0      0       0 
    oeo              1      0      0      0      0      0      0      0      0      0      0      0       0 
    oecircir         1      0      0      0      0      0      0      0      0      0      0      0       1 
    oecim            1      0      0      0      0      0      0      0      0      0      0      0       1 
    oeciecccgci      1      0      0      0      0      0      0      0      0      0      0      0       1 
    oecgcie          1      0      0      0      0      0      0      0      0      0      0      1       0 
    oecgccccgci      1      0      0      0      0      0      0      0      0      0      0      0       1 
    oecccgcie        1      0      0      0      0      0      0      0      0      0      0      0       1 
    oecccg           1      0      0      0      0      0      0      0      0      0      0      1       0 
    oeccccg          1      0      0      0      0      0      0      0      0      0      0      0       1 
    oePci            1      0      0      0      0      0      0      0      0      0      0      1       0 
    ocgcie           1      0      0      0      1      0      0      0      0      0      0      0       0 
    oPci             1      0      0      0      0      0      0      1      0      0      0      0       0 
    no               1      0      0      0      0      0      0      0      0      0      0      0       0 
    irorci           1      0      0      0      0      0      0      0      0      0      0      0       0 
    iror             1      0      0      0      0      0      0      0      0      0      0      0       0 
    irof             1      0      0      0      0      0      0      0      0      0      0      0       0 
    ircccci          1      0      0      0      0      0      0      0      0      0      0      0       0 
    ircccccgcie      1      0      0      0      0      0      0      0      0      0      0      0       0 
    imci             1      0      0      0      0      0      0      0      0      0      0      0       0 
    iie              1      0      0      0      0      0      0      0      0      0      0      0       0 
    ieoeci           1      0      0      0      0      0      0      0      0      0      0      0       0 
    ieoe             1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecirci          1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecircg          1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecir            1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecim            1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecie            1      0      0      0      0      0      0      0      0      0      0      0       0 
    iecce            1      0      0      0      0      1      0      0      0      0      0      0       0 
    gcircie          1      0      0      0      0      0      0      0      0      0      0      0       0 
    gcircccci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    gcin             1      0      0      0      0      0      0      0      0      0      0      0       0 
    gcif             1      0      0      0      0      0      0      0      0      0      0      0       0 
    gcieor           1      0      0      1      0      0      0      0      0      0      0      0       0 
    gcie             1      0      0      0      0      0      0      0      0      0      0      0       0 
    g                1      0      0      1      0      0      0      0      0      0      0      0       0 
    eof              1      0      0      1      0      0      0      0      0      0      0      0       0 
    eoe              1      0      0      0      0      0      0      0      1      0      0      0       0 
    eo               1      0      0      0      0      0      0      0      0      0      0      0       0 
    eccccgci         1      0      0      0      0      0      0      0      0      0      0      0       0 
    eccccg           1      0      0      0      0      0      0      0      0      0      0      0       0 
    ecccccgci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    ePccccgci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    coeor            1      0      0      0      0      0      0      0      1      0      0      0       0 
    coecicg          1      0      0      0      0      0      0      0      0      0      0      0       0 
    coeci            1      0      0      0      0      0      0      1      0      0      0      0       0 
    ciroeof          1      0      0      0      0      0      0      0      1      0      0      0       0 
    ciroe            1      0      0      0      0      0      0      0      0      0      0      0       0 
    circif           1      0      0      0      0      0      0      0      1      0      0      0       0 
    circie           1      0      0      0      0      0      0      0      0      0      0      1       0 
    circgci          1      0      0      0      0      0      0      1      0      0      0      0       0 
    cioc             1      0      0      0      0      0      0      0      1      0      0      0       0 
    ciie             1      0      0      0      0      0      0      1      0      0      0      0       0 
    cifo             1      0      0      0      1      0      0      0      0      0      0      0       0 
    ciecgcgci        1      0      0      0      0      0      0      1      0      0      0      0       0 
    cieccci          1      0      0      0      0      0      0      0      1      0      0      0       0 
    ciecccgci        1      0      0      0      0      0      0      0      1      0      0      0       0 
    ciecccci         1      0      0      0      0      0      0      1      0      0      0      0       0 
    ciec             1      0      0      0      0      0      0      1      0      0      0      0       0 
    cicccgci         1      0      0      0      0      0      0      1      0      0      0      0       0 
    cgor             1      0      0      0      0      0      0      0      0      0      0      0       0 
    cgcif            1      0      0      0      0      0      0      0      0      0      0      0       0 
    cgciecgci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    cgcieccor        1      0      0      0      0      0      0      0      0      0      0      0       1 
    cgccci           1      0      0      0      0      0      0      0      0      0      0      0       0 
    ccoeci           1      0      1      0      0      0      0      0      0      0      0      0       0 
    ccocgci          1      0      0      0      0      0      0      0      1      0      0      0       0 
    ccim             1      0      0      0      0      0      0      0      0      0      0      0       0 
    ccgor            1      0      0      0      0      0      0      1      0      0      0      0       0 
    ccgcioe          1      0      0      0      0      0      0      0      0      0      0      1       0 
    ccgcif           1      0      0      0      0      0      0      0      1      0      0      0       0 
    ccgcieor         1      0      0      0      0      0      0      0      1      0      0      0       0 
    ccgciA           1      0      0      1      0      0      0      0      0      0      0      0       0 
    ccgcci           1      0      0      0      1      0      0      0      0      0      0      0       0 
    ccgccgci         1      0      0      0      0      0      0      1      0      0      0      0       0 
    ccgccci          1      0      0      0      0      0      0      0      1      0      0      0       0 
    ccgccccgci       1      0      0      0      0      0      0      1      0      0      0      0       0 
    cccor            1      0      0      1      0      0      0      0      0      0      0      0       0 
    cccoecgci        1      0      0      0      0      0      0      0      0      0      0      0       1 
    ccciroe          1      0      0      0      0      0      0      0      0      0      0      0       1 
    cccieci          1      0      0      0      0      0      0      0      0      0      1      0       0 
    ccciecgci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    ccciecg          1      0      0      0      0      1      0      0      0      0      0      0       0 
    cccgcirci        1      0      0      1      0      0      0      0      0      0      0      0       0 
    cccgcif          1      0      0      0      0      0      0      0      0      0      0      1       0 
    cccgccci         1      0      0      0      0      0      0      0      0      0      0      0       0 
    ccccic           1      0      0      0      0      0      0      0      1      0      0      0       0 
    ccccc            1      0      0      0      0      0      0      0      0      0      0      0       0 
    ccPcccgci        1      0      0      1      0      0      0      0      0      0      0      0       0 
    ccPccccci        1      0      0      0      0      0      0      0      0      0      0      0       0 
    Poe              1      0      0      0      0      0      0      0      0      0      0      0       0 
    Pcim             1      0      0      0      0      0      0      0      0      0      0      0       0 
    Pci              1      0      0      0      0      0      0      0      0      0      0      0       0 
    Pcccgci          1      0      0      0      0      0      0      0      0      0      0      0       0 
    P                1      0      0      0      0      0      0      0      0      0      0      0       0 

  Beware that there is some ambiguity in the "cc[^H]" column: words like "ccccic" could be
  parsed as "cc" + "ccic", as "c" + "cccic", or as "" + "ccccic".  Thus, we should probably
  exclude the "cc" prefix until we decide which suffixes are valid.
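
  To make the ambiguity concrete, here is a minimal sketch that lists every
  way a word can be cut into one of a set of candidate prefixes plus a
  remainder.  The helper name "splits" and the use of literal-string prefixes
  (rather than regular expressions) are assumptions made for illustration only.

```python
def splits(word, prefixes):
    """List every (prefix, suffix) pair such that word == prefix + suffix."""
    # A prefix qualifies whenever the word starts with it; the suffix is the rest.
    return [(p, word[len(p):]) for p in prefixes if word.startswith(p)]


# The three competing parses of "ccccic" mentioned in the text.
for prefix, suffix in splits("ccccic", ["", "c", "cc"]):
    print(repr(prefix), "+", repr(suffix))
```

  Any word that gets more than one parse is counted under a different suffix
  depending on which prefix is tried, which is why the "cc" prefix is set
  aside for now.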

  Also, the "[^H]" in some prefixes should be removed, since it forces the suffix
  to be non-empty.
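
  A small sketch of this effect, assuming the prefix pattern is matched at the
  start of the word and followed by a run of non-"H" characters.  The function
  "split_word" is hypothetical; it is not the code of the earlier run.

```python
import re


def split_word(prefix_pat, word):
    """Return (prefix, suffix) if word is prefix_pat plus non-H characters, else None."""
    m = re.fullmatch("(%s)([^H]*)" % prefix_pat, word)
    return m.groups() if m else None


# With the trailing "[^H]", the bare word "e" cannot match at all,
# and in longer words the prefix swallows the first letter of the suffix.
print(split_word("e[^H]", "e"))      # no match
print(split_word("e[^H]", "ecci"))   # prefix "ec", suffix "ci"

# Without it, the empty suffix is allowed and the suffix stays whole.
print(split_word("e", "e"))          # prefix "e", suffix ""
print(split_word("e", "ecci"))       # prefix "e", suffix "cci"
```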
  
  Let's try again, fixing these bugs, sorting the prefixes by importance,
  and also excluding the "r" prefix:
  
    /bin/rm -f .title
    /bin/rm -f .table
    /bin/echo "_" > .table
    
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
      'AH'  'oH'    'cg'    'cc.*H' \
      'e'   'oe'    'H'     'oe.*H' \
      'ci'  'P'     'e.*H'  'ci.*H' \
      'oP'  'Ae'    'AP'    'cg.*H' \
      'Ae.*H' \
    )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq

      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    
    /n/gnu/bin/printf "\n" >> .title
    
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    
    cat .title .tbsort \
      | format-suffix-table
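
  For readers without csh and the GNU tools at hand, the same tabulation can
  be sketched in Python.  This is an approximation under stated assumptions,
  not the code that produced the table: "suffix_table" is a hypothetical name,
  the word list is a stand-in for bio-j-huc-gut.wds (one word per line), and
  the column formatting done by format-suffix-table is left out.

```python
import re
from collections import Counter


def suffix_table(words, prefixes):
    """Count, for each prefix regex, the suffixes left after stripping it.

    Returns rows (suffix, total, [count per prefix]), largest total first.
    """
    # Seed the empty-suffix row "_", as the script does with `echo "_" > .table`.
    table = {"_": Counter()}
    for pat in prefixes:
        # Prefix, then a suffix containing no "H": meant to mirror the
        # egrep "_${pat}[^H]*_" filter followed by the sed prefix removal.
        rx = re.compile("(?:%s)([^H]*)" % pat)
        for word in words:
            m = rx.fullmatch(word)
            if m:
                suffix = m.group(1) or "_"  # "_" stands for the empty suffix
                table.setdefault(suffix, Counter())[pat] += 1
    rows = [(s, sum(c.values()), [c[p] for p in prefixes]) for s, c in table.items()]
    # Descending total, ties in reverse suffix order (as `sort -nr` appears to give).
    return sorted(rows, key=lambda r: (r[1], r[0]), reverse=True)


# Made-up words, only to exercise the function.
sample = ["AHcci", "AHcci", "AH", "oHci", "cgcim", "AHccHci"]
for suffix, total, counts in suffix_table(sample, ["AH", "oH", "cg"]):
    print("%-12s %5d" % (suffix, total), *("%5d" % n for n in counts))
```

  Note that "AHccHci" is dropped for the "AH" prefix because its remainder
  still contains an "H", just as the "[^H]*" in the egrep pattern demands.
  The figures reported next come from the csh run, not from this sketch.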

  Here are the results:
  
    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H    ci
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3437  1043   451   345   322   223   289   150   115   104
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20     0
    cccgci         352   191    57     5    21     7     6    16    15     0
    ci             276    79    28    36    68     6    22     2    13     0
    ccccgci        256    19    10    27     0    73    38    18     3    12
    cci            251    43    22     0   160     0     0     2     8     0
    cim            245    94    41    72     5     2     3     7     7     0
    cie            237   116    40    50     4     3     2    10     5     0
    _              228     2     3     0     4     4   131     1     0     1
    ccci           202    87    35     2    20     5     4     5    20     0
    cir            172    50    36    51     5     2     1     8     8     1
    cccci          108    12    13     3     2    20    19    10     3     5
    cin             97    54    16    12     0     0     3     2     5     0
    oe              93    21    10    17     3    16     7     6     1     0
    or              38     4     2     3     0    10    13     4     0     0
    ccccci          24     1     1     3     0     6     4     1     1     4
    m               21     0     0     0     0     0     0     0     0    21
    cgci            21     1     0     0     0     8     9     0     0     1
    cccoe           21     0     0     4     0     3     3     3     0     1
    e               19     2     0     4     0     0     0     0     0    12
    cieci           19     7     4     6     0     0     0     1     0     0
    cif             16     4     0     6     2     1     0     1     1     0
    cccccgci        16     2     0     2     0     6     1     0     0     2
    coe             12     1     6     0     1     0     0     3     1     0
    ccoe            12     1     1     3     0     0     0     4     0     0
    ccccg           12     1     0     0     1     5     3     0     0     1
    cccg            11     5     3     0     0     0     0     2     0     0
    r                8     0     0     0     0     1     0     0     0     7
    circi            8     1     0     5     0     0     0     1     0     0
    ciecgci          8     3     2     3     0     0     0     0     0     0
    ccor             8     0     1     2     0     3     0     0     0     0
    ccgcir           8     4     4     0     0     0     0     0     0     0
    ccccgcir         7     0     0     0     0     1     0     1     0     1
    oeci             6     0     0     0     0     4     0     0     0     0
    ccg              6     3     1     0     1     0     0     0     1     0
    Pccccgci         6     0     0     0     0     1     4     0     0     1
    o                5     0     0     0     0     4     1     0     0     0
    cor              5     4     0     0     1     0     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1     0
    ccgcie           5     2     2     0     0     0     0     1     0     0
    oecgci           4     1     1     0     0     1     0     1     0     0
    oeccccgci        4     1     0     1     0     0     0     0     0     0
    n                4     0     0     0     0     0     0     0     0     4
    cieoe            4     1     2     1     0     0     0     0     0     0
    cieccccgci       4     0     1     2     0     0     0     0     0     0
    ccie             4     1     0     0     3     0     0     0     0     0
    cccir            4     0     1     0     0     1     0     1     1     0
    cccgcie          4     2     1     0     0     0     0     0     0     0
    ccccc            4     0     0     1     0     1     1     0     0     1
    c                4     0     0     0     2     0     2     0     0     0
    roe              3     0     0     1     0     0     0     0     0     2
    om               3     1     0     0     0     0     0     0     1     0
    oecccgci         3     0     0     0     0     0     0     1     0     0
    f                3     0     0     0     0     0     0     0     0     3
    eci              3     0     0     0     0     0     0     0     0     3
    ciecccgci        3     0     1     2     0     0     0     0     0     0
    cic              3     2     0     0     0     0     0     1     0     0
    cccie            3     0     1     0     1     0     0     1     0     0
    cccgcir          3     0     0     0     0     0     1     0     0     0
    ccccgcie         3     0     0     0     0     1     0     0     0     0
    cc               3     1     0     0     0     2     0     0     0     0
    rci              2     0     0     0     0     0     0     0     0     2
    orcim            2     0     0     1     0     1     0     0     0     0
    oeccci           2     0     0     0     0     0     0     0     0     0
    oecccci          2     0     0     0     0     0     1     0     0     0
    mci              2     0     0     0     0     0     0     0     0     2
    eor              2     0     1     0     0     0     0     0     0     1
    eoe              2     0     1     0     0     0     0     0     0     1
    eccci            2     0     0     0     0     0     2     0     0     0
    ecccgci          2     0     0     0     0     0     0     0     0     2
    eccccgci         2     0     0     1     0     0     0     0     0     1
    coecgci          2     0     1     0     0     0     0     0     0     0
    ciie             2     1     0     1     0     0     0     0     0     0
    cieo             2     0     0     2     0     0     0     0     0     0
    cgoe             2     0     0     0     0     1     0     0     0     0
    cgcim            2     0     0     0     0     1     0     0     0     1
    ccgcim           2     1     1     0     0     0     0     0     0     0
    cce              2     0     0     0     0     2     0     0     0     0
    ccccif           2     0     0     0     0     1     1     0     0     0
    ccc              2     1     0     0     0     1     0     0     0     0
    ror              1     0     0     0     0     0     0     0     0     1
    rcir             1     0     0     0     0     0     0     0     0     1
    oroeccccgci      1     0     0     0     0     0     0     1     0     0
    oroe             1     0     0     0     0     0     1     0     0     0
    orcin            1     0     0     0     0     1     0     0     0     0
    orcie            1     0     0     0     0     1     0     0     0     0
    orccci           1     0     0     0     0     0     0     1     0     0
    oeor             1     0     0     0     0     1     0     0     0     0
    oeof             1     0     0     0     0     1     0     0     0     0
    oeoe             1     1     0     0     0     0     0     0     0     0
    oecircir         1     0     0     0     0     0     0     0     0     0
    oecim            1     0     0     0     0     0     0     0     0     0
    oeciecccgci      1     0     0     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     0     0     0     1     0     0
    oecgccccgci      1     0     0     0     0     0     0     0     0     0
    oecccgcie        1     0     0     0     0     0     0     0     0     0
    oecccg           1     0     0     0     0     0     0     1     0     0
    oeccccg          1     0     0     0     0     0     0     0     0     0
    oecccccgci       1     0     0     1     0     0     0     0     0     0
    oePci            1     0     0     0     0     0     0     1     0     0
    oePccccgci       1     0     0     1     0     0     0     0     0     0
    ocgcie           1     0     0     0     0     0     0     0     0     0
    ocg              1     0     0     0     0     1     0     0     0     0
    occci            1     0     0     0     0     1     0     0     0     0
    occccgci         1     0     0     0     0     1     0     0     0     0
    oPci             1     1     0     0     0     0     0     0     0     0
    eorci            1     0     0     0     0     0     0     0     0     1
    eof              1     0     0     0     0     0     1     0     0     0
    eciecgci         1     0     0     0     0     0     0     0     0     1
    ecgcir           1     0     0     0     0     1     0     0     0     0
    ecccci           1     0     0     0     0     0     0     0     0     0
    eccccc           1     0     0     0     0     0     0     0     0     1
    coeor            1     0     1     0     0     0     0     0     0     0
    coeci            1     1     0     0     0     0     0     0     0     0
    co               1     0     0     0     1     0     0     0     0     0
    cirorci          1     0     0     1     0     0     0     0     0     0
    cirof            1     0     0     1     0     0     0     0     0     0
    ciroeof          1     0     1     0     0     0     0     0     0     0
    ciroe            1     0     0     0     1     0     0     0     0     0
    circif           1     0     1     0     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     1     0     0
    circgci          1     1     0     0     0     0     0     0     0     0
    circccci         1     0     0     1     0     0     0     0     0     0
    circccccgcie     1     0     0     1     0     0     0     0     0     0
    cioc             1     0     1     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     1     0     0     0
    cifo             1     0     0     0     0     0     0     0     0     0
    ciecirci         1     0     0     1     0     0     0     0     0     0
    ciecircg         1     0     0     1     0     0     0     0     0     0
    ciecir           1     0     0     1     0     0     0     0     0     0
    ciecim           1     0     0     1     0     0     0     0     0     0
    ciecie           1     0     0     1     0     0     0     0     0     0
    ciecgcgci        1     1     0     0     0     0     0     0     0     0
    cieccci          1     0     1     0     0     0     0     0     0     0
    ciecccci         1     1     0     0     0     0     0     0     0     0
    ciec             1     1     0     0     0     0     0     0     0     0
    cicccgci         1     1     0     0     0     0     0     0     0     0
    cgcircie         1     0     0     0     0     0     0     0     0     1
    cgcircccci       1     0     0     0     0     0     0     0     0     1
    cgcir            1     0     0     0     0     1     0     0     0     0
    cgcieor          1     0     0     0     0     1     0     0     0     0
    cgcieccor        1     0     0     0     0     0     0     0     0     0
    cgcie            1     0     1     0     0     0     0     0     0     0
    cg               1     0     0     0     0     1     0     0     0     0
    ccoecicg         1     0     0     1     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     0
    ccocgci          1     0     1     0     0     0     0     0     0     0
    cco              1     0     0     0     0     1     0     0     0     0
    ccir             1     1     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     1     0     0     0
    ccgor            1     1     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     1     0     0
    ccgcif           1     0     1     0     0     0     0     0     0     0
    ccgcieor         1     0     1     0     0     0     0     0     0     0
    ccgcci           1     0     0     0     0     0     0     0     0     0
    ccgccgci         1     1     0     0     0     0     0     0     0     0
    ccgccci          1     0     1     0     0     0     0     0     0     0
    ccgccccgci       1     1     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     1     0     0     0
    cccoecgci        1     0     0     0     0     0     0     0     0     0
    ccciroe          1     0     0     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     0     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     1     0     0
    cccgciA          1     0     0     0     0     1     0     0     0     0
    cccgccci         1     0     0     0     1     0     0     0     0     0
    ccccor           1     0     0     0     0     1     0     0     0     0
    ccccoe           1     0     0     1     0     0     0     0     0     0
    ccccir           1     0     0     0     0     1     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0     1
    ccccic           1     0     1     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     1     0     0     0     0
    cccccgcie        1     0     0     1     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     1     0     0     0
    cccccci          1     0     0     0     0     0     0     0     0     1
    cccPcccgci       1     0     0     0     0     1     0     0     0     0
    cPcim            1     0     0     0     0     0     0     0     0     1
    Poe              1     0     0     0     0     1     0     0     0     0
    Pcccgci          1     0     0     0     0     1     0     0     0     0
    Pcccci           1     0     0     0     0     0     0     0     0     1
    A                1     0     0     0     0     0     1     0     0     0
  
  
    SUFFIX       TOTAL     P  e.*H ci.*H    oP    Ae    AP cg.*H Ae.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3437    63    61    58    40   119    23    18    13
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377     0     8    12     0     0     0     6     0
    cccgci         352     3     9    12     2     1     2     3     2
    ci             276     0     4     2     2     7     1     2     4
    ccccgci        256    17     2     1    18     7    11     0     0
    cci            251     0     9     4     0     0     0     2     1
    cim            245     0     7     4     0     2     0     1     0
    cie            237     0     1     3     1     2     0     0     0
    _              228     0     0     0     0    81     1     0     0
    ccci           202     0     4    12     0     2     0     3     3
    cir            172     1     4     2     2     1     0     0     0
    cccci          108     3     1     0     4     9     2     1     1
    cin             97     0     2     2     0     1     0     0     0
    oe              93     8     1     0     1     1     1     0     0
    or              38     0     0     0     1     0     1     0     0
    ccccci          24     0     0     1     2     0     0     0     0
    m               21     0     0     0     0     0     0     0     0
    cgci            21     0     0     0     0     1     1     0     0
    cccoe           21     4     0     1     0     1     1     0     0
    e               19     0     1     0     0     0     0     0     0
    cieci           19     0     0     0     1     0     0     0     0
    cif             16     0     1     0     0     0     0     0     0
    cccccgci        16     0     0     0     0     3     0     0     0
    coe             12     0     0     0     0     0     0     0     0
    ccoe            12     2     0     0     0     0     1     0     0
    ccccg           12     0     0     0     1     0     0     0     0
    cccg            11     0     1     0     0     0     0     0     0
    r                8     0     0     0     0     0     0     0     0
    circi            8     0     0     0     0     0     0     0     1
    ciecgci          8     0     0     0     0     0     0     0     0
    ccor             8     2     0     0     0     0     0     0     0
    ccgcir           8     0     0     0     0     0     0     0     0
    ccccgcir         7     4     0     0     0     0     0     0     0
    oeci             6     0     0     0     2     0     0     0     0
    ccg              6     0     0     0     0     0     0     0     0
    Pccccgci         6     0     0     0     0     0     0     0     0
    o                5     0     0     0     0     0     0     0     0
    cor              5     0     0     0     0     0     0     0     0
    cieor            5     0     1     0     0     0     0     0     0
    cicgci           5     0     0     0     0     0     0     0     0
    ccgcie           5     0     0     0     0     0     0     0     0
    oecgci           4     0     0     0     0     0     0     0     0
    oeccccgci        4     2     0     0     0     0     0     0     0
    n                4     0     0     0     0     0     0     0     0
    cieoe            4     0     0     0     0     0     0     0     0
    cieccccgci       4     0     0     0     1     0     0     0     0
    ccie             4     0     0     0     0     0     0     0     0
    cccir            4     0     0     0     0     0     0     0     0
    cccgcie          4     0     0     0     0     0     1     0     0
    ccccc            4     0     0     0     0     0     0     0     0
    c                4     0     0     0     0     0     0     0     0
    roe              3     0     0     0     0     0     0     0     0
    om               3     1     0     0     0     0     0     0     0
    oecccgci         3     2     0     0     0     0     0     0     0
    f                3     0     0     0     0     0     0     0     0
    eci              3     0     0     0     0     0     0     0     0
    ciecccgci        3     0     0     0     0     0     0     0     0
    cic              3     0     0     0     0     0     0     0     0
    cccie            3     0     0     0     0     0     0     0     0
    cccgcir          3     0     0     2     0     0     0     0     0
    ccccgcie         3     1     0     0     1     0     0     0     0
    cc               3     0     0     0     0     0     0     0     0
    rci              2     0     0     0     0     0     0     0     0
    orcim            2     0     0     0     0     0     0     0     0
    oeccci           2     2     0     0     0     0     0     0     0
    oecccci          2     1     0     0     0     0     0     0     0
    mci              2     0     0     0     0     0     0     0     0
    eor              2     0     0     0     0     0     0     0     0
    eoe              2     0     0     0     0     0     0     0     0
    eccci            2     0     0     0     0     0     0     0     0
    ecccgci          2     0     0     0     0     0     0     0     0
    eccccgci         2     0     0     0     0     0     0     0     0
    coecgci          2     0     1     0     0     0     0     0     0
    ciie             2     0     0     0     0     0     0     0     0
    cieo             2     0     0     0     0     0     0     0     0
    cgoe             2     1     0     0     0     0     0     0     0
    cgcim            2     0     0     0     0     0     0     0     0
    ccgcim           2     0     0     0     0     0     0     0     0
    cce              2     0     0     0     0     0     0     0     0
    ccccif           2     0     0     0     0     0     0     0     0
    ccc              2     0     0     0     0     0     0     0     0
    ror              1     0     0     0     0     0     0     0     0
    rcir             1     0     0     0     0     0     0     0     0
    oroeccccgci      1     0     0     0     0     0     0     0     0
    oroe             1     0     0     0     0     0     0     0     0
    orcin            1     0     0     0     0     0     0     0     0
    orcie            1     0     0     0     0     0     0     0     0
    orccci           1     0     0     0     0     0     0     0     0
    oeor             1     0     0     0     0     0     0     0     0
    oeof             1     0     0     0     0     0     0     0     0
    oeoe             1     0     0     0     0     0     0     0     0
    oecircir         1     1     0     0     0     0     0     0     0
    oecim            1     1     0     0     0     0     0     0     0
    oeciecccgci      1     1     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     0     0     0     0     0
    oecgccccgci      1     1     0     0     0     0     0     0     0
    oecccgcie        1     1     0     0     0     0     0     0     0
    oecccg           1     0     0     0     0     0     0     0     0
    oeccccg          1     1     0     0     0     0     0     0     0
    oecccccgci       1     0     0     0     0     0     0     0     0
    oePci            1     0     0     0     0     0     0     0     0
    oePccccgci       1     0     0     0     0     0     0     0     0
    ocgcie           1     0     1     0     0     0     0     0     0
    ocg              1     0     0     0     0     0     0     0     0
    occci            1     0     0     0     0     0     0     0     0
    occccgci         1     0     0     0     0     0     0     0     0
    oPci             1     0     0     0     0     0     0     0     0
    eorci            1     0     0     0     0     0     0     0     0
    eof              1     0     0     0     0     0     0     0     0
    eciecgci         1     0     0     0     0     0     0     0     0
    ecgcir           1     0     0     0     0     0     0     0     0
    ecccci           1     0     1     0     0     0     0     0     0
    eccccc           1     0     0     0     0     0     0     0     0
    coeor            1     0     0     0     0     0     0     0     0
    coeci            1     0     0     0     0     0     0     0     0
    co               1     0     0     0     0     0     0     0     0
    cirorci          1     0     0     0     0     0     0     0     0
    cirof            1     0     0     0     0     0     0     0     0
    ciroeof          1     0     0     0     0     0     0     0     0
    ciroe            1     0     0     0     0     0     0     0     0
    circif           1     0     0     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     0     0
    circgci          1     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0
    cioc             1     0     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     0     0     0
    cifo             1     0     1     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0
    cgcircie         1     0     0     0     0     0     0     0     0
    cgcircccci       1     0     0     0     0     0     0     0     0
    cgcir            1     0     0     0     0     0     0     0     0
    cgcieor          1     0     0     0     0     0     0     0     0
    cgcieccor        1     1     0     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0
    cg               1     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0
    cco              1     0     0     0     0     0     0     0     0
    ccir             1     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0
    ccgcci           1     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0
    cccoecgci        1     1     0     0     0     0     0     0     0
    ccciroe          1     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     1     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0
    cccgciA          1     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0
    cccccci          1     0     0     0     0     0     0     0     0
    cccPcccgci       1     0     0     0     0     0     0     0     0
    cPcim            1     0     0     0     0     0     0     0     0
    Poe              1     0     0     0     0     0     0     0     0
    Pcccgci          1     0     0     0     0     0     0     0     0
    Pcccci           1     0     0     0     0     0     0     0     0
    A                1     0     0     0     0     0     0     0     0
    
  Analysis:
  
    The prefixes "e", "oe", "ci" (without "H") do not appear
    to be equivalent to the other prefixes above.  However,
    "ec", "oec", and "cic" do appear to fit in.
    
    The empty string does not appear to be a valid suffix (yay!).
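
    The "equivalence" of two prefixes was judged above by eye, from the
    shape of their columns.  One way to put a number on it is sketched
    below.  The measure (total variation distance between normalized
    columns) and the names profile and tv_distance are assumptions made
    for this illustration, not something used elsewhere in these
    notebooks.  The counts are the eight most frequent suffix rows of
    the AH, oH and cc.*H columns of the redone table further down in
    this entry, so this is a rough indication at best.

```python
def profile(counts):
    """Turn raw suffix counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def tv_distance(a, b):
    """Total variation distance between two count vectors that list the
    same suffixes in the same order: 0 for identical profiles, 1 for
    profiles that share no suffix at all."""
    pa, pb = profile(a), profile(b)
    return 0.5 * sum(abs(x - y) for x, y in zip(pa, pb))


# Counts for the suffixes cccgci, ccgci, cci, ci, cim, ccci, cie, cir,
# copied from the AH, oH and cc.*H columns of the redone table.
AH = [191, 201, 43, 79, 94, 87, 116, 50]
oH = [57, 85, 22, 28, 41, 35, 40, 36]
ccH = [21, 15, 160, 68, 5, 20, 4, 5]

print("AH vs oH   :", round(tv_distance(AH, oH), 3))
print("AH vs cc.*H:", round(tv_distance(AH, ccH), 3))
```

    A smaller value means that the two prefixes take their suffixes in
    more similar proportions.  Where to draw the line between
    "equivalent" and "not equivalent" remains a judgment call.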
  

  Let's redo it:
  
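    # Start with an empty title line and an empty suffix table.
    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    
    # noglob keeps csh from expanding the * in the patterns; ofmt is the
    # join output format, extended by one column for each pattern.
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
      'AH'  'oH'    'cg'    'cc.*H' \
      'ec'  'oec'   'H'     'oe.*H' \
      'cic' 'P'     'e.*H'  'ci.*H' \
      'oP'  'Ae'    'AP'    'cg.*H' \
      'Ae.*H' \
    )
      # Append this pattern to the column-title line.
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      # Count the suffixes of the words that consist of this prefix
      # followed by one or more non-H characters.
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[^H][^H]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq

      # Add the counts as a new table column, with 0 where a suffix is missing.
      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    
    /n/gnu/bin/printf "\n" >> .title
    
    # Prepend the row total to each line and sort by decreasing total.
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    
    cat .title .tbsort \
      | format-suffix-table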
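
    For readers without csh at hand, the same computation can be
    sketched in Python.  This is only a sketch under assumptions: the
    function names suffix_counts and suffix_table are made up here, the
    input is taken to be a plain list of words (one per line of
    bio-j-huc-gut.wds, presumably), the words in the example are toy
    strings rather than real data, and the layout work done by
    format-suffix-table is not reproduced.

```python
import re
from collections import Counter


def suffix_counts(words, prefix_pattern):
    """Count what is left of each word after removing a prefix.

    Mirrors the egrep/sed step: a word is counted only if it is the
    prefix pattern followed by one or more non-'H' characters.
    """
    rx = re.compile("(?:" + prefix_pattern + ")([^H]+)")
    counts = Counter()
    for word in words:
        m = rx.fullmatch(word)
        if m:
            counts[m.group(1)] += 1
    return counts


def suffix_table(words, patterns):
    """Build rows (total, suffix, [count per pattern]), largest total first.

    Mirrors the join loop (one column per pattern, 0 where a suffix is
    missing) and the gawk + 'sort -nr' step.
    """
    columns = [suffix_counts(words, p) for p in patterns]
    suffixes = set().union(*columns)
    rows = []
    for s in suffixes:
        per_pattern = [c[s] for c in columns]
        rows.append((sum(per_pattern), s, per_pattern))
    # Ties come out in descending suffix order, as in the tables of this entry.
    rows.sort(key=lambda r: (r[0], r[1]), reverse=True)
    return rows


# Toy words, not taken from the transcription.
toy_words = ["AHcccgci", "AHccgci", "oHccgci", "oHcccgci",
             "AHcccgci", "cgcim", "AH"]
for total, suffix, per_pattern in suffix_table(toy_words, ["AH", "oH", "cg"]):
    print("%-12s %5d" % (suffix, total), per_pattern)
```

    Note that the bare word "AH" contributes nothing: like the egrep
    pattern, the sketch requires at least one character after the
    prefix, so the empty suffix never shows up as a row.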

    SUFFIX       TOTAL    AH    oH    cg cc.*H    ec   oec     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3060  1041   448   345   318   171   125   149   115
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    cccgci         462   191    57     5    21    73    38    16    15
    ccgci          390   201    85     2    15     7     6    28    20
    cci            260    43    22     0   160     5     4     2     8
    ci             248    79    28    36    68     0     0     2    13
    cim            240    94    41    72     5     0     0     7     7
    ccci           237    87    35     2    20    20    19     5    20
    cie            232   116    40    50     4     0     0    10     5
    cir            168    50    36    51     5     0     0     8     8
    ccccgci        142    19    10    27     0     6     1    18     3
    cin             94    54    16    12     0     0     0     2     5
    cccci           78    12    13     3     2     6     4    10     3
    oe              70    21    10    17     3     0     0     6     1
    i               28     0     0     0     0     6    22     0     0
    cieci           20     7     4     6     0     0     1     1     0
    cccg            20     5     3     0     0     5     3     2     0
    ccoe            19     1     1     3     0     3     3     4     0
    gci             18     0     0     0     0     8     9     0     0
    or              15     4     2     3     0     0     0     4     0
    cif             15     4     0     6     2     0     0     1     1
    cccoe           14     0     0     4     0     0     0     3     0
    coe             12     1     6     0     1     0     0     3     1
    ccccci          11     1     1     3     0     0     0     1     1
    ccgcir           9     4     4     0     0     0     1     0     0
    cor              8     4     0     0     1     3     0     0     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    e                7     2     0     4     0     0     0     0     0
    cccccgci         7     2     0     2     0     0     0     0     0
    ccor             6     0     1     2     0     0     1     0     0
    ccg              6     3     1     0     1     0     0     0     1
    im               5     0     0     0     0     2     3     0     0
    ie               5     0     0     0     0     3     2     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cccgcie          5     2     1     0     0     1     0     0     0
    ccccgcir         5     0     0     0     0     0     0     1     0
    oeccccgci        4     1     0     1     0     0     0     0     0
    ir               4     0     0     0     0     2     1     0     0
    cieoe            4     1     2     1     0     0     0     0     0
    cieccccgci       4     0     1     2     0     0     0     0     0
    ccie             4     1     0     0     3     0     0     0     0
    cccir            4     0     1     0     0     1     0     1     1
    cccgcir          4     0     0     0     0     1     0     0     0
    ccccg            4     1     0     0     1     0     1     0     0
    c                4     0     0     0     2     2     0     0     0
    om               3     1     0     0     0     0     0     0     1
    oecgci           3     1     1     0     0     0     0     1     0
    oecccgci         3     0     0     0     0     0     0     1     0
    in               3     0     0     0     0     0     3     0     0
    ciecccgci        3     0     1     2     0     0     0     0     0
    cic              3     2     0     0     0     0     0     1     0
    cgci             3     1     0     0     0     0     0     0     0
    cccie            3     0     1     0     1     0     0     1     0
    cccc             3     0     0     0     0     1     1     0     0
    oeci             2     0     0     0     0     0     0     0     0
    oeccci           2     0     0     0     0     0     0     0     0
    gcim             2     0     0     0     0     1     0     0     0
    coecgci          2     0     1     0     0     0     0     0     0
    co               2     0     0     0     1     1     0     0     0
    ciie             2     1     0     1     0     0     0     0     0
    cieo             2     0     0     2     0     0     0     0     0
    ce               2     0     0     0     0     2     0     0     0
    ccir             2     1     0     0     0     1     0     0     0
    ccgcim           2     1     1     0     0     0     0     0     0
    cccif            2     0     0     0     0     1     1     0     0
    ccccgcie         2     0     0     0     0     0     0     0     0
    cc               2     1     0     0     0     1     0     0     0
    roe              1     0     0     1     0     0     0     0     0
    oroeccccgci      1     0     0     0     0     0     0     1     0
    orcim            1     0     0     1     0     0     0     0     0
    orccci           1     0     0     0     0     0     0     1     0
    oeoe             1     1     0     0     0     0     0     0     0
    oecircir         1     0     0     0     0     0     0     0     0
    oecim            1     0     0     0     0     0     0     0     0
    oeciecccgci      1     0     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     0     0     0     1     0
    oecgccccgci      1     0     0     0     0     0     0     0     0
    oecccgcie        1     0     0     0     0     0     0     0     0
    oecccg           1     0     0     0     0     0     0     1     0
    oecccci          1     0     0     0     0     0     0     0     0
    oeccccg          1     0     0     0     0     0     0     0     0
    oecccccgci       1     0     0     1     0     0     0     0     0
    oePci            1     0     0     0     0     0     0     1     0
    oePccccgci       1     0     0     1     0     0     0     0     0
    ocgcie           1     0     0     0     0     0     0     0     0
    oPci             1     1     0     0     0     0     0     0     0
    imci             1     0     0     0     0     0     1     0     0
    if               1     0     0     0     0     1     0     0     0
    goe              1     0     0     0     0     1     0     0     0
    gcircie          1     0     0     0     0     0     0     0     0
    gcircccci        1     0     0     0     0     0     0     0     0
    gcir             1     0     0     0     0     1     0     0     0
    gcieor           1     0     0     0     0     1     0     0     0
    g                1     0     0     0     0     1     0     0     0
    eor              1     0     1     0     0     0     0     0     0
    eoe              1     0     1     0     0     0     0     0     0
    ecccci           1     0     0     0     0     0     0     0     0
    eccccgci         1     0     0     1     0     0     0     0     0
    coeor            1     0     1     0     0     0     0     0     0
    coeci            1     1     0     0     0     0     0     0     0
    cirorci          1     0     0     1     0     0     0     0     0
    cirof            1     0     0     1     0     0     0     0     0
    ciroeof          1     0     1     0     0     0     0     0     0
    ciroe            1     0     0     0     1     0     0     0     0
    circif           1     0     1     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     1     0
    circgci          1     1     0     0     0     0     0     0     0
    circccci         1     0     0     1     0     0     0     0     0
    circccccgcie     1     0     0     1     0     0     0     0     0
    cioc             1     0     1     0     0     0     0     0     0
    cifo             1     0     0     0     0     0     0     0     0
    ciecirci         1     0     0     1     0     0     0     0     0
    ciecircg         1     0     0     1     0     0     0     0     0
    ciecir           1     0     0     1     0     0     0     0     0
    ciecim           1     0     0     1     0     0     0     0     0
    ciecie           1     0     0     1     0     0     0     0     0
    ciecgcgci        1     1     0     0     0     0     0     0     0
    cieccci          1     0     1     0     0     0     0     0     0
    ciecccci         1     1     0     0     0     0     0     0     0
    ciec             1     1     0     0     0     0     0     0     0
    cicccgci         1     1     0     0     0     0     0     0     0
    cgoe             1     0     0     0     0     0     0     0     0
    cgcieccor        1     0     0     0     0     0     0     0     0
    cgcie            1     0     1     0     0     0     0     0     0
    ccoecicg         1     0     0     1     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0
    ccocgci          1     0     1     0     0     0     0     0     0
    ccgor            1     1     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     1     0
    ccgcif           1     0     1     0     0     0     0     0     0
    ccgcieor         1     0     1     0     0     0     0     0     0
    ccgciA           1     0     0     0     0     1     0     0     0
    ccgcci           1     0     0     0     0     0     0     0     0
    ccgccgci         1     1     0     0     0     0     0     0     0
    ccgccci          1     0     1     0     0     0     0     0     0
    ccgccccgci       1     1     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     1     0     0     0
    cccoecgci        1     0     0     0     0     0     0     0     0
    ccciroe          1     0     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     0     0     0     0
    ccciecgci        1     0     0     0     0     0     0     0     0
    cccgcirci        1     0     0     0     0     1     0     0     0
    cccgcif          1     0     0     0     0     0     0     1     0
    cccgccci         1     0     0     0     1     0     0     0     0
    ccccoe           1     0     0     1     0     0     0     0     0
    ccccic           1     0     1     0     0     0     0     0     0
    cccccgcie        1     0     0     1     0     0     0     0     0
    ccccc            1     0     0     1     0     0     0     0     0
    ccc              1     1     0     0     0     0     0     0     0
    ccPcccgci        1     0     0     0     0     1     0     0     0
    Pcim             1     0     0     0     0     0     0     0     0

    SUFFIX       TOTAL   cic     P  e.*H ci.*H    oP    Ae    AP cg.*H Ae.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     3060    35    63    61    58    40    38    22    18    13
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    cccgci         462    12     3     9    12     2     1     2     3     2
    ccgci          390     0     0     8    12     0     0     0     6     0
    cci            260     0     0     9     4     0     0     0     2     1
    ci             248     0     0     4     2     2     7     1     2     4
    cim            240     0     0     7     4     0     2     0     1     0
    ccci           237     5     0     4    12     0     2     0     3     3
    cie            232     0     0     1     3     1     2     0     0     0
    cir            168     0     1     4     2     2     1     0     0     0
    ccccgci        142     2    17     2     1    18     7    11     0     0
    cin             94     0     0     2     2     0     1     0     0     0
    cccci           78     4     3     1     0     4     9     2     1     1
    oe              70     0     8     1     0     1     1     1     0     0
    i               28     0     0     0     0     0     0     0     0     0
    cieci           20     0     0     0     0     1     0     0     0     0
    cccg            20     1     0     1     0     0     0     0     0     0
    ccoe            19     1     2     0     0     0     0     1     0     0
    gci             18     1     0     0     0     0     0     0     0     0
    or              15     0     0     0     0     1     0     1     0     0
    cif             15     0     0     1     0     0     0     0     0     0
    cccoe           14     0     4     0     1     0     1     1     0     0
    coe             12     0     0     0     0     0     0     0     0     0
    ccccci          11     1     0     0     1     2     0     0     0     0
    ccgcir           9     0     0     0     0     0     0     0     0     0
    cor              8     0     0     0     0     0     0     0     0     0
    circi            8     0     0     0     0     0     0     0     0     1
    ciecgci          8     0     0     0     0     0     0     0     0     0
    e                7     0     0     1     0     0     0     0     0     0
    cccccgci         7     0     0     0     0     0     3     0     0     0
    ccor             6     0     2     0     0     0     0     0     0     0
    ccg              6     0     0     0     0     0     0     0     0     0
    im               5     0     0     0     0     0     0     0     0     0
    ie               5     0     0     0     0     0     0     0     0     0
    cieor            5     0     0     1     0     0     0     0     0     0
    cicgci           5     0     0     0     0     0     0     0     0     0
    ccgcie           5     0     0     0     0     0     0     0     0     0
    cccgcie          5     0     0     0     0     0     0     1     0     0
    ccccgcir         5     0     4     0     0     0     0     0     0     0
    oeccccgci        4     0     2     0     0     0     0     0     0     0
    ir               4     1     0     0     0     0     0     0     0     0
    cieoe            4     0     0     0     0     0     0     0     0     0
    cieccccgci       4     0     0     0     0     1     0     0     0     0
    ccie             4     0     0     0     0     0     0     0     0     0
    cccir            4     0     0     0     0     0     0     0     0     0
    cccgcir          4     1     0     0     2     0     0     0     0     0
    ccccg            4     0     0     0     0     1     0     0     0     0
    c                4     0     0     0     0     0     0     0     0     0
    om               3     0     1     0     0     0     0     0     0     0
    oecgci           3     0     0     0     0     0     0     0     0     0
    oecccgci         3     0     2     0     0     0     0     0     0     0
    in               3     0     0     0     0     0     0     0     0     0
    ciecccgci        3     0     0     0     0     0     0     0     0     0
    cic              3     0     0     0     0     0     0     0     0     0
    cgci             3     0     0     0     0     0     1     1     0     0
    cccie            3     0     0     0     0     0     0     0     0     0
    cccc             3     1     0     0     0     0     0     0     0     0
    oeci             2     0     0     0     0     2     0     0     0     0
    oeccci           2     0     2     0     0     0     0     0     0     0
    gcim             2     1     0     0     0     0     0     0     0     0
    coecgci          2     0     0     1     0     0     0     0     0     0
    co               2     0     0     0     0     0     0     0     0     0
    ciie             2     0     0     0     0     0     0     0     0     0
    cieo             2     0     0     0     0     0     0     0     0     0
    ce               2     0     0     0     0     0     0     0     0     0
    ccir             2     0     0     0     0     0     0     0     0     0
    ccgcim           2     0     0     0     0     0     0     0     0     0
    cccif            2     0     0     0     0     0     0     0     0     0
    ccccgcie         2     0     1     0     0     1     0     0     0     0
    cc               2     0     0     0     0     0     0     0     0     0
    roe              1     0     0     0     0     0     0     0     0     0
    oroeccccgci      1     0     0     0     0     0     0     0     0     0
    orcim            1     0     0     0     0     0     0     0     0     0
    orccci           1     0     0     0     0     0     0     0     0     0
    oeoe             1     0     0     0     0     0     0     0     0     0
    oecircir         1     0     1     0     0     0     0     0     0     0
    oecim            1     0     1     0     0     0     0     0     0     0
    oeciecccgci      1     0     1     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     0     0     0     0     0     0
    oecgccccgci      1     0     1     0     0     0     0     0     0     0
    oecccgcie        1     0     1     0     0     0     0     0     0     0
    oecccg           1     0     0     0     0     0     0     0     0     0
    oecccci          1     0     1     0     0     0     0     0     0     0
    oeccccg          1     0     1     0     0     0     0     0     0     0
    oecccccgci       1     0     0     0     0     0     0     0     0     0
    oePci            1     0     0     0     0     0     0     0     0     0
    oePccccgci       1     0     0     0     0     0     0     0     0     0
    ocgcie           1     0     0     1     0     0     0     0     0     0
    oPci             1     0     0     0     0     0     0     0     0     0
    imci             1     0     0     0     0     0     0     0     0     0
    if               1     0     0     0     0     0     0     0     0     0
    goe              1     0     0     0     0     0     0     0     0     0
    gcircie          1     1     0     0     0     0     0     0     0     0
    gcircccci        1     1     0     0     0     0     0     0     0     0
    gcir             1     0     0     0     0     0     0     0     0     0
    gcieor           1     0     0     0     0     0     0     0     0     0
    g                1     0     0     0     0     0     0     0     0     0
    eor              1     0     0     0     0     0     0     0     0     0
    eoe              1     0     0     0     0     0     0     0     0     0
    ecccci           1     0     0     1     0     0     0     0     0     0
    eccccgci         1     0     0     0     0     0     0     0     0     0
    coeor            1     0     0     0     0     0     0     0     0     0
    coeci            1     0     0     0     0     0     0     0     0     0
    cirorci          1     0     0     0     0     0     0     0     0     0
    cirof            1     0     0     0     0     0     0     0     0     0
    ciroeof          1     0     0     0     0     0     0     0     0     0
    ciroe            1     0     0     0     0     0     0     0     0     0
    circif           1     0     0     0     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     0     0     0
    circgci          1     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0
    cioc             1     0     0     0     0     0     0     0     0     0
    cifo             1     0     0     1     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0     0
    cgoe             1     0     1     0     0     0     0     0     0     0
    cgcieccor        1     0     1     0     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0     0
    ccgciA           1     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     1     0     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     1     0     0     0     0     0     0     0
    ccciroe          1     0     1     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     1     0     0     0     0
    ccciecgci        1     1     0     0     0     0     0     0     0     0
    cccgcirci        1     0     0     0     0     0     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0
    ccccc            1     0     0     0     0     0     0     0     0     0
    ccc              1     0     0     0     0     0     0     0     0     0
    ccPcccgci        1     0     0     0     0     0     0     0     0     0
    Pcim             1     1     0     0     0     0     0     0     0     0

  Analysis:
  
    The prefix "cg" seems a bit anomalous.  It looks as if the "cg" 
    were actually the first "c" of the suffix.
    
    Most valid suffixes apparently begin with "c".  Thus 
    we should incorporate the "c" into the prefix.
    
    The "i" suffix is bogus; it entered only because of "eci" and
    "oeci". Note that "ec", "cic" and "oec" incorporate the "c" while
    the other prefixes don't.
    
    The prefixes "cc.*H", "oe.*H", etc. seem anomalous, probably
    because some "H"s are actually "cHc"s.  We should find out what
    the actual prefixes are, and see if we can tell the two classes apart.
    
    Most productive suffixes come in pairs that differ by an extra "c"
    at the beginning.
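
    A minimal sketch of how such pairs could be listed mechanically,
    assuming a two-column "SUFFIX TOTAL" listing on standard input.  The
    name "c_pairs" is hypothetical, not one of the tools used in this
    notebook; the sample counts are taken from the table above.

```shell
    # Hypothetical helper: list suffix pairs (s, "c" s) that both occur.
    # Input lines: SUFFIX TOTAL.  Output lines: s count(s) cs count(cs).
    c_pairs() {
        awk '{ cnt[$1] = $2 }
             END {
                 for (s in cnt) {
                     t = "c" s
                     if (t in cnt) print s, cnt[s], t, cnt[t]
                 }
             }' | sort
    }

    # Example with a few totals from the table above:
    printf '%s\n' 'ci 248' 'cci 260' 'ccci 237' 'cim 240' \
                  'ccgci 390' 'cccgci 462' | c_pairs
```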
    
    Redoing the tabulation with the extra "c"s:
  
    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
      'AH'    'oH'    'cg'    'cc.*H' \
      'e'     'oe'    'H'     'Ae'    \
      'oe.*H' 'ci'    'P'     'e.*H'  \
      'ci.*H' 'oP'    'AP'    'cg.*H' \
      'Ae.*H' \
    )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}c[^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq

      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    
    /n/gnu/bin/printf "\n" >> .title
    
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    
    cat .title .tbsort \
      | format-suffix-table
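
    For comparison, the same cross-tabulation can be sketched as a single
    awk pass, without the repeated join.  This is only an illustration,
    assuming one word per input line.  The name "suffix_table" is
    hypothetical; the output is unformatted (SUFFIX, TOTAL, then one
    count per pattern, in argument order) and has no TOTALORUM row.

```shell
    # Hypothetical helper: suffix_table PATTERN... < word-list
    # A word is counted under PATTERN when it matches ^PATTERNc[^HPA]*$;
    # the suffix is whatever follows the (longest) match of PATTERN.
    suffix_table() {
        awk -v pats="$*" '
            BEGIN { n = split(pats, pat, " ") }
            {
                for (k = 1; k <= n; k++) {
                    if ($1 ~ ("^" pat[k] "c[^HPA]*$")) {
                        match($1, "^" pat[k])
                        suf = substr($1, RLENGTH + 1)
                        count[suf, k]++
                        total[suf]++
                    }
                }
            }
            END {
                for (s in total) {
                    line = s " " total[s]
                    for (k = 1; k <= n; k++) line = line " " (count[s, k] + 0)
                    print line
                }
            }
        ' | sort -k2,2nr -k1,1
    }

    # Possible usage (patterns quoted, since sh has no "set noglob"):
    #   suffix_table AH oH cg 'cc.*H' e oe H < bio-j-huc-gut.wds
```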
    
  Results:

    SUFFIX       TOTAL    AH    oH    cg cc.*H     e    oe     H oe.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     2927  1009   433   315   315   169   127   132   113
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377   201    85     2    15     0     0    28    20
    cccgci         352   191    57     5    21     7     6    16    15
    ci             276    79    28    36    68     6    22     2    13
    ccccgci        256    19    10    27     0    73    38    18     3
    cci            251    43    22     0   160     0     0     2     8
    cim            245    94    41    72     5     2     3     7     7
    cie            237   116    40    50     4     3     2    10     5
    ccci           202    87    35     2    20     5     4     5    20
    cir            172    50    36    51     5     2     1     8     8
    cccci          108    12    13     3     2    20    19    10     3
    cin             97    54    16    12     0     0     3     2     5
    ccccci          24     1     1     3     0     6     4     1     1
    cgci            21     1     0     0     0     8     9     0     0
    cccoe           21     0     0     4     0     3     3     3     0
    cieci           19     7     4     6     0     0     0     1     0
    cif             16     4     0     6     2     1     0     1     1
    cccccgci        16     2     0     2     0     6     1     0     0
    coe             12     1     6     0     1     0     0     3     1
    ccoe            12     1     1     3     0     0     0     4     0
    ccccg           12     1     0     0     1     5     3     0     0
    cccg            11     5     3     0     0     0     0     2     0
    circi            8     1     0     5     0     0     0     1     0
    ciecgci          8     3     2     3     0     0     0     0     0
    ccor             8     0     1     2     0     3     0     0     0
    ccgcir           8     4     4     0     0     0     0     0     0
    ccccgcir         7     0     0     0     0     1     0     1     0
    ccg              6     3     1     0     1     0     0     0     1
    cor              5     4     0     0     1     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     0
    cicgci           5     3     1     0     0     0     0     0     1
    ccgcie           5     2     2     0     0     0     0     1     0
    cieoe            4     1     2     1     0     0     0     0     0
    cieccccgci       4     0     1     2     0     0     0     0     0
    ccie             4     1     0     0     3     0     0     0     0
    cccir            4     0     1     0     0     1     0     1     1
    cccgcie          4     2     1     0     0     0     0     0     0
    ccccc            4     0     0     1     0     1     1     0     0
    c                4     0     0     0     2     0     2     0     0
    ciecccgci        3     0     1     2     0     0     0     0     0
    cic              3     2     0     0     0     0     0     1     0
    cccie            3     0     1     0     1     0     0     1     0
    cccgcir          3     0     0     0     0     0     1     0     0
    ccccgcie         3     0     0     0     0     1     0     0     0
    cc               3     1     0     0     0     2     0     0     0
    coecgci          2     0     1     0     0     0     0     0     0
    ciie             2     1     0     1     0     0     0     0     0
    cieo             2     0     0     2     0     0     0     0     0
    cgoe             2     0     0     0     0     1     0     0     0
    cgcim            2     0     0     0     0     1     0     0     0
    ccgcim           2     1     1     0     0     0     0     0     0
    cce              2     0     0     0     0     2     0     0     0
    ccccif           2     0     0     0     0     1     1     0     0
    ccc              2     1     0     0     0     1     0     0     0
    coeor            1     0     1     0     0     0     0     0     0
    coeci            1     1     0     0     0     0     0     0     0
    co               1     0     0     0     1     0     0     0     0
    cirorci          1     0     0     1     0     0     0     0     0
    cirof            1     0     0     1     0     0     0     0     0
    ciroeof          1     0     1     0     0     0     0     0     0
    ciroe            1     0     0     0     1     0     0     0     0
    circif           1     0     1     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     1     0
    circgci          1     1     0     0     0     0     0     0     0
    circccci         1     0     0     1     0     0     0     0     0
    circccccgcie     1     0     0     1     0     0     0     0     0
    cioc             1     0     1     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     1     0     0
    cifo             1     0     0     0     0     0     0     0     0
    ciecirci         1     0     0     1     0     0     0     0     0
    ciecircg         1     0     0     1     0     0     0     0     0
    ciecir           1     0     0     1     0     0     0     0     0
    ciecim           1     0     0     1     0     0     0     0     0
    ciecie           1     0     0     1     0     0     0     0     0
    ciecgcgci        1     1     0     0     0     0     0     0     0
    cieccci          1     0     1     0     0     0     0     0     0
    ciecccci         1     1     0     0     0     0     0     0     0
    ciec             1     1     0     0     0     0     0     0     0
    cicccgci         1     1     0     0     0     0     0     0     0
    cgcircie         1     0     0     0     0     0     0     0     0
    cgcircccci       1     0     0     0     0     0     0     0     0
    cgcir            1     0     0     0     0     1     0     0     0
    cgcieor          1     0     0     0     0     1     0     0     0
    cgcieccor        1     0     0     0     0     0     0     0     0
    cgcie            1     0     1     0     0     0     0     0     0
    cg               1     0     0     0     0     1     0     0     0
    ccoecicg         1     0     0     1     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0
    ccocgci          1     0     1     0     0     0     0     0     0
    cco              1     0     0     0     0     1     0     0     0
    ccir             1     1     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     1     0     0
    ccgor            1     1     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     1     0
    ccgcif           1     0     1     0     0     0     0     0     0
    ccgcieor         1     0     1     0     0     0     0     0     0
    ccgcci           1     0     0     0     0     0     0     0     0
    ccgccgci         1     1     0     0     0     0     0     0     0
    ccgccci          1     0     1     0     0     0     0     0     0
    ccgccccgci       1     1     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     1     0     0
    cccoecgci        1     0     0     0     0     0     0     0     0
    ccciroe          1     0     0     0     0     0     0     0     0
    cccieci          1     0     0     0     0     0     0     0     0
    cccgcif          1     0     0     0     0     0     0     1     0
    cccgccci         1     0     0     0     1     0     0     0     0
    ccccor           1     0     0     0     0     1     0     0     0
    ccccoe           1     0     0     1     0     0     0     0     0
    ccccir           1     0     0     0     0     1     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0
    ccccic           1     0     1     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     1     0     0     0
    cccccgcie        1     0     0     1     0     0     0     0     0
    cccccg           1     0     0     0     0     0     1     0     0
    cccccci          1     0     0     0     0     0     0     0     0 

    SUFFIX       TOTAL    Ae    ci     P  e.*H ci.*H    oP    AP cg.*H Ae.*H
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOTALORUM     2927    37    34    41    57    58    36    20    18    13
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ccgci          377     0     0     0     8    12     0     0     6     0
    cccgci         352     1     0     3     9    12     2     2     3     2
    ci             276     7     0     0     4     2     2     1     2     4
    ccccgci        256     7    12    17     2     1    18    11     0     0
    cci            251     0     0     0     9     4     0     0     2     1
    cim            245     2     0     0     7     4     0     0     1     0
    cie            237     2     0     0     1     3     1     0     0     0
    ccci           202     2     0     0     4    12     0     0     3     3
    cir            172     1     1     1     4     2     2     0     0     0
    cccci          108     9     5     3     1     0     4     2     1     1
    cin             97     1     0     0     2     2     0     0     0     0
    ccccci          24     0     4     0     0     1     2     0     0     0
    cgci            21     1     1     0     0     0     0     1     0     0
    cccoe           21     1     1     4     0     1     0     1     0     0
    cieci           19     0     0     0     0     0     1     0     0     0
    cif             16     0     0     0     1     0     0     0     0     0
    cccccgci        16     3     2     0     0     0     0     0     0     0
    coe             12     0     0     0     0     0     0     0     0     0
    ccoe            12     0     0     2     0     0     0     1     0     0
    ccccg           12     0     1     0     0     0     1     0     0     0
    cccg            11     0     0     0     1     0     0     0     0     0
    circi            8     0     0     0     0     0     0     0     0     1
    ciecgci          8     0     0     0     0     0     0     0     0     0
    ccor             8     0     0     2     0     0     0     0     0     0
    ccgcir           8     0     0     0     0     0     0     0     0     0
    ccccgcir         7     0     1     4     0     0     0     0     0     0
    ccg              6     0     0     0     0     0     0     0     0     0
    cor              5     0     0     0     0     0     0     0     0     0
    cieor            5     0     0     0     1     0     0     0     0     0
    cicgci           5     0     0     0     0     0     0     0     0     0
    ccgcie           5     0     0     0     0     0     0     0     0     0
    cieoe            4     0     0     0     0     0     0     0     0     0
    cieccccgci       4     0     0     0     0     0     1     0     0     0
    ccie             4     0     0     0     0     0     0     0     0     0
    cccir            4     0     0     0     0     0     0     0     0     0
    cccgcie          4     0     0     0     0     0     0     1     0     0
    ccccc            4     0     1     0     0     0     0     0     0     0
    c                4     0     0     0     0     0     0     0     0     0
    ciecccgci        3     0     0     0     0     0     0     0     0     0
    cic              3     0     0     0     0     0     0     0     0     0
    cccie            3     0     0     0     0     0     0     0     0     0
    cccgcir          3     0     0     0     0     2     0     0     0     0
    ccccgcie         3     0     0     1     0     0     1     0     0     0
    cc               3     0     0     0     0     0     0     0     0     0
    coecgci          2     0     0     0     1     0     0     0     0     0
    ciie             2     0     0     0     0     0     0     0     0     0
    cieo             2     0     0     0     0     0     0     0     0     0
    cgoe             2     0     0     1     0     0     0     0     0     0
    cgcim            2     0     1     0     0     0     0     0     0     0
    ccgcim           2     0     0     0     0     0     0     0     0     0
    cce              2     0     0     0     0     0     0     0     0     0
    ccccif           2     0     0     0     0     0     0     0     0     0
    ccc              2     0     0     0     0     0     0     0     0     0
    coeor            1     0     0     0     0     0     0     0     0     0
    coeci            1     0     0     0     0     0     0     0     0     0
    co               1     0     0     0     0     0     0     0     0     0
    cirorci          1     0     0     0     0     0     0     0     0     0
    cirof            1     0     0     0     0     0     0     0     0     0
    ciroeof          1     0     0     0     0     0     0     0     0     0
    ciroe            1     0     0     0     0     0     0     0     0     0
    circif           1     0     0     0     0     0     0     0     0     0
    circie           1     0     0     0     0     0     0     0     0     0
    circgci          1     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0
    cioc             1     0     0     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     0     0     0     0
    cifo             1     0     0     0     1     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0
    ciecgcgci        1     0     0     0     0     0     0     0     0     0
    cieccci          1     0     0     0     0     0     0     0     0     0
    ciecccci         1     0     0     0     0     0     0     0     0     0
    ciec             1     0     0     0     0     0     0     0     0     0
    cicccgci         1     0     0     0     0     0     0     0     0     0
    cgcircie         1     0     1     0     0     0     0     0     0     0
    cgcircccci       1     0     1     0     0     0     0     0     0     0
    cgcir            1     0     0     0     0     0     0     0     0     0
    cgcieor          1     0     0     0     0     0     0     0     0     0
    cgcieccor        1     0     0     1     0     0     0     0     0     0
    cgcie            1     0     0     0     0     0     0     0     0     0
    cg               1     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0
    ccoeci           1     0     0     0     0     0     0     0     0     1
    ccocgci          1     0     0     0     0     0     0     0     0     0
    cco              1     0     0     0     0     0     0     0     0     0
    ccir             1     0     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     0     0     0     0
    ccgor            1     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     0     0     0     0     0     0
    ccgcif           1     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     0     1     0     0     0     0     0
    ccgccgci         1     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     0     0     0     0     0     0     0     0     0
    cccor            1     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     0     1     0     0     0     0     0     0
    ccciroe          1     0     0     1     0     0     0     0     0     0
    cccieci          1     0     0     0     0     0     1     0     0     0
    cccgcif          1     0     0     0     0     0     0     0     0     0
    cccgccci         1     0     0     0     0     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0     0
    ccccoe           1     0     0     0     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0     0
    cccciecgci       1     0     1     0     0     0     0     0     0     0
    ccccic           1     0     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0     0
    cccccci          1     0     1     0     0     0     0     0     0     0

  Analysis: to a first approximation, the "ordinary" words are
  
    { AH oH cg cc.*H e oe H oe.*H Ae ci P e.*H ci.*H oP AP cg.*H Ae.*H } × 
    { ccgci cccgci ci ccccgci cci cim cie ccci cir cccci cin } 
    
  Rarer suffixes are
  
    { ccccci cgci cccoe cieci cif cccccgci coe ccoe ... }
    
  Looking back, we see that the prefixes "cc.*H" are actually 
  "cccH" and "ccccH".
  
  There is also a "ciH" prefix.
  
  The "ccc[^HP]*" words that we were using before appear to be
  words with a "c" prefix.
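
  A rough way to see how much of a word list this first approximation
  covers is to turn the two sets above into one extended regular
  expression.  A minimal sketch, assuming one word per line on standard
  input; "count_ordinary" is a hypothetical name, and the ".*" inside
  patterns like "cc.*H" is left as loose as in the tabulation.

```shell
    # Prefix and suffix alternatives copied from the two sets above.
    PREFIXES='AH|oH|cg|cc.*H|e|oe|H|oe.*H|Ae|ci|P|e.*H|ci.*H|oP|AP|cg.*H|Ae.*H'
    SUFFIXES='ccgci|cccgci|ci|ccccgci|cci|cim|cie|ccci|cir|cccci|cin'

    # Hypothetical helper: print "matched total" for the words on stdin.
    count_ordinary() {
        awk -v re="^(${PREFIXES})(${SUFFIXES})\$" '
            { total++; if ($1 ~ re) hits++ }
            END { print hits + 0, total + 0 }'
    }

    # Possible usage:
    #   count_ordinary < bio-j-huc-gut.wds
```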
  

  Looked for more info on "H"-containing prefixes:
  
    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_ci.*H.*_' '_e.*H.*_' '_cg.*H.*_' '_Ae.*H.*_' '_oe.*H.*_'

         8  0.13  eHccgci             12  0.21  ciHccgci
         7  0.11  eHcccgci            12  0.21  ciHcccgci
         6  0.10  eHcim               10  0.17  ciHccci
         5  0.08  eccccHcci            4  0.07  ciHcim
         4  0.07  eHcir                3  0.05  ciHcie
         3  0.05  ecccHcci             2  0.03  ciHcir
         3  0.05  eHccci               2  0.03  ciHcin
         2  0.03  eccccHcccgci         2  0.03  ciHcci
         2  0.03  ecccHci              2  0.03  ciHcccgcir
         2  0.03  eHcin                1  0.02  ciroHcci
         2  0.03  eHci                 1  0.02  cioHci
         2  0.03  eHccccgci            1  0.02  ciccccHcci
         1  0.02  eoeHcim              1  0.02  ciccccHccci
         1  0.02  eoHcif               1  0.02  cicHccci
         1  0.02  eccccHcie            1  0.02  ciHci
         1  0.02  eccccHccci           1  0.02  ciHcccoe
         1  0.02  eHoe                 1  0.02  ciHccccgci
         1  0.02  eHocgcie             1  0.02  ciHccccci
         1  0.02  eHecccci         -----  ----  ----
         1  0.02  eHe                 58  1.00  TOT
         1  0.02  eHcoecgci
         1  0.02  eHcifo
         1  0.02  eHcieor
         1  0.02  eHcci
         1  0.02  eHccgcci
         1  0.02  eHcccg
         1  0.02  eHcccci
     -----  ----  ----
        61  1.00  TOT


       20  0.17  oeHccgci           2  0.11  cgciHccgci           4  0.31  AeHci
       19  0.17  oeHccci            2  0.11  cgciHcccgci          3  0.23  AeHccci
       15  0.13  oeHcccgci          2  0.11  cgHccgci             2  0.15  AeHcccgci
       10  0.09  oeHci              1  0.06  cgoeHccgci           1  0.08  AeHcirci
        7  0.06  oeHcir             1  0.06  cgoHccgci            1  0.08  AeHccoeci
        7  0.06  oeHcim             1  0.06  cgoHccci             1  0.08  AeHcci
        5  0.04  oeHcin             1  0.06  cgcieHci             1  0.08  AeHcccci
        5  0.04  oeHcie             1  0.06  cgcicHcci        -----  ----  ----
        4  0.03  oeHcci             1  0.06  cgciHcim            13  1.00  TOT
        3  0.03  oeHcccci           1  0.06  cgciHci
        2  0.02  oeoHci             1  0.06  cgciHcci
        2  0.02  oecccHcci          1  0.06  cgciHccci
        2  0.02  oeHccccgci         1  0.06  cgcccHcccgci
        1  0.01  oeoeHccci          1  0.06  cgHccci
        1  0.01  oeoHcir            1  0.06  cgHcccci
        1  0.01  oeoHccccgci    -----  ----  ----
        1  0.01  oeccccHcci        18  1.00  TOT
        1  0.01  oeccHci
        1  0.01  oePocHcci
        1  0.01  oeHom
        1  0.01  oeHoe
        1  0.01  oeHcoe
        1  0.01  oeHcif
        1  0.01  oeHcicgci
        1  0.01  oeHccg
        1  0.01  oeHcccir
        1  0.01  oeHccccci
    -----  ----  ----
      115  1.00  TOT

  So it seems we have found some new prefixes: 
  
    { eH eccccH ciH ciccccH oeH cgciH AeH }
    
  Now that we know that "ccgci" is the most common suffix, let's look for 
  all its prefixes:
  
    cat bio-j-huc-gut.wds \
      | egrep 'ccgci$' \
      | sed -e 's/ccgci$//g' \
      | wfreq

       387  0.25  cc
       201  0.13  AH
       191  0.12  AHc
        85  0.05  oH
        73  0.05  ecc
        60  0.04  ccc
        57  0.04  oHc
        38  0.02  oecc
        28  0.02  H
        27  0.02  cgcc
        21  0.01  c
        20  0.01  oeH
        19  0.01  AHcc
        18  0.01  oPcc
        18  0.01  Hcc
        17  0.01  Pcc
        16  0.01  Hc
        15  0.01  oeHc
        15  0.01  cccHc
        12  0.01  cicc
        12  0.01  ciHc
        12  0.01  ciH
        11  0.01  APcc
        10  0.01  rcc
        10  0.01  oHcc
        10  0.01  cccH
         8  0.01  eH
         7  0.00  ec
         7  0.00  eHc
         7  0.00  Aecc
         6  0.00  oec
         6  0.00  eccc
         6  0.00  Ac
         5  0.00  cgc
         5  0.00  ccccHc
         4  0.00  oePcc
         4  0.00  coeHc
         4  0.00  cHc
         3  0.00  ccH
         3  0.00  cH
         3  0.00  Pc
         3  0.00  Aeccc
         2  0.00  rccc
         2  0.00  oeHcc
         2  0.00  occ
         2  0.00  oPc
         2  0.00  eccccHc
         2  0.00  eHcc
         2  0.00  coecc
         2  0.00  coHc
         2  0.00  ciec
         2  0.00  ciccc
         2  0.00  cgciecc
         2  0.00  cgciec
         2  0.00  cgciHc
         2  0.00  cgciH
         2  0.00  cgccc
         2  0.00  cgH
         2  0.00  cg
         2  0.00  ccccH
         2  0.00  cccPcc
         2  0.00  Poecc
         2  0.00  Poec
         2  0.00  AeHc
         2  0.00  Acc
         2  0.00  APc
         2  0.00  AHccc
         1  0.00  rc
         1  0.00  orccc
         1  0.00  oeoHcc
         1  0.00  oeccc
         1  0.00  ocgcc
         1  0.00  ocHc
         1  0.00  oc
         1  0.00  oPciecc
         1  0.00  oHciecc
         1  0.00  oHciec
         1  0.00  oAPcc
         1  0.00  eocc
         1  0.00  ecccPc
         1  0.00  ePcc
         1  0.00  ePc
         1  0.00  coeH
         1  0.00  ciecc
         1  0.00  ciPcc
         1  0.00  ciHcc
         1  0.00  cgoeccc
         1  0.00  cgoecc
         1  0.00  cgoePcc
         1  0.00  cgoeH
         1  0.00  cgoH
         1  0.00  cgecc
         1  0.00  cgcccHc
         1  0.00  ccocPc
         1  0.00  ccoHc
         1  0.00  cccoec
         1  0.00  ccccPc
         1  0.00  cccPc
         1  0.00  ccPccc
         1  0.00  ccPcc
         1  0.00  cPcc
         1  0.00  Poeciec
         1  0.00  Poecgcc
         1  0.00  PciH
         1  0.00  Horoecc
         1  0.00  Hoec
         1  0.00  HcccoHc
         1  0.00  Aec
         1  0.00  Acgc
         1  0.00  AccHc
         1  0.00  AcHc
         1  0.00  AHoecc
         1  0.00  AHcic
         1  0.00  AHccgcc
         1  0.00  AHccg
     -----  ----  ----
      1562  1.00  TOT
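
  The "wfreq" tool itself is not listed in this notebook.  Judging
  only from its output (count, fraction of the total, word, most
  frequent first, then a TOT line), a minimal stand-in might look
  like the sketch below.  The name "wfreq_sketch", the column
  widths, and the ordering of ties are assumptions, not the real tool.

```shell
# wfreq_sketch: hypothetical stand-in for the notebook's `wfreq`.
# Reads one word per line; prints count, fraction of total, and word,
# most frequent first, followed by a total line.
wfreq_sketch() {
  LC_ALL=C sort | uniq -c | LC_ALL=C sort -k1,1nr -k2,2r |
  awk '{ n[NR] = $1; w[NR] = $2; tot += $1 }
       END {
         for (i = 1; i <= NR; i++)
           printf "%7d  %4.2f  %s\n", n[i], n[i] / tot, w[i]
         printf "%7s  %4s  %s\n", "-----", "----", "----"
         printf "%7d  %4.2f  %s\n", tot, 1, "TOT"
       }'
}

# Strip the "ccgci" ending from a few sample words and tally what is left.
printf '%s\n' ccccgci AHccgci ccccgci oHccgci AHcim |
  grep 'ccgci$' | sed -e 's/ccgci$//' | wfreq_sketch
```

  On the five sample words this leaves "cc" twice and "oH" and "AH"
  once each; "AHcim" is dropped by the grep.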

  The same for the suffix "cie": 
  
    cat bio-j-huc-gut.wds \
      | egrep 'cie$' \
      | sed -e 's/cie$//g' \
      | wfreq

       116  0.35  AH
        50  0.15  cg
        40  0.12  oH
        14  0.04  c
        12  0.04  
        10  0.03  H
         9  0.03  ccccg
         7  0.02  ccc
         7  0.02  cc
         6  0.02  r
         5  0.02  oeH
         3  0.01  e
         3  0.01  ciH
         2  0.01  oe
         2  0.01  oHccg
         2  0.01  ccccHc
         2  0.01  ccH
         2  0.01  Ae
         2  0.01  AHccg
         2  0.01  AHcccg
         1  0.00  rccc
         1  0.00  or
         1  0.00  ocg
         1  0.00  oc
         1  0.00  oPccccg
         1  0.00  oP
         1  0.00  oHcg
         1  0.00  oHcccg
         1  0.00  oHcc
         1  0.00  eor
         1  0.00  eccccg
         1  0.00  eccccH
         1  0.00  eHocg
         1  0.00  coeccH
         1  0.00  cicgcir
         1  0.00  cgcircccccg
         1  0.00  cgcie
         1  0.00  cgcccccg
         1  0.00  ccoH
         1  0.00  ccir
         1  0.00  cccg
         1  0.00  cccc
         1  0.00  cccHcc
         1  0.00  cccHc
         1  0.00  cccH
         1  0.00  cPc
         1  0.00  cHc
         1  0.00  cH
         1  0.00  Poecccg
         1  0.00  Pccccg
         1  0.00  Hoecg
         1  0.00  Hcir
         1  0.00  Hccg
         1  0.00  Hcc
         1  0.00  APcccg
         1  0.00  AHc
         1  0.00  A
     -----  ----  ----
       333  1.00  TOT

  Looking for suffixes that do not begin with "c":
  
    cat bio-j-huc-gut.wds \
      | sed -e 's/^/_/g' -e 's/$/_/g' \
      | compare-contexts '_AH.*_' '_cg.*_' '_cccH.*_'

         72  0.20  cgcim               201  0.19  AHccgci             90  0.51  cccHcci
         51  0.14  cgcir               191  0.18  AHcccgci            31  0.18  cccHci
         50  0.14  cgcie               116  0.11  AHcie               15  0.08  cccHcccgci
         36  0.10  cgci                 94  0.09  AHcim               11  0.06  cccHccci
         27  0.07  cgccccgci            87  0.08  AHccci              10  0.06  cccHccgci
         17  0.05  cgoe                 79  0.08  AHci                 4  0.02  cccH
         12  0.03  cgcin                54  0.05  AHcin                3  0.02  cccHcir
          6  0.02  cgcif                50  0.05  AHcir                2  0.01  cccHcim
          6  0.02  cgcieci              43  0.04  AHcci                2  0.01  cccHcif
          5  0.01  cgcirci              21  0.02  AHoe                 2  0.01  cccHc
          5  0.01  cgcccgci             19  0.02  AHccccgci            1  0.01  cccHcor
          4  0.01  cge                  12  0.01  AHcccci              1  0.01  cccHcie
          4  0.01  cgcccoe               7  0.01  AHcieci              1  0.01  cccHccie
          3  0.01  cgor                  5  0.00  AHcccg               1  0.01  cccHccg
          3  0.01  cgciecgci             4  0.00  AHor                 1  0.01  cccHcccie
          3  0.01  cgccoe                4  0.00  AHcor                1  0.01  cccHcccci
          3  0.01  cgcccci               4  0.00  AHcif                1  0.01  cccHccccg
          3  0.01  cgccccci              4  0.00  AHccgcir         -----  ----  ----
          2  0.01  cgcieo                3  0.00  AHciecgci          177  1.00  TOT
          2  0.01  cgciecccgci           3  0.00  AHcicgci         
          2  0.01  cgcieccccgci          3  0.00  AHccg            
          2  0.01  cgciHccgci            2  0.00  AHe              
          2  0.01  cgciHcccgci           2  0.00  AHcic            
          2  0.01  cgccor                2  0.00  AHccgcie         
          2  0.01  cgccgci               2  0.00  AHcccgcie        
          2  0.01  cgccci                2  0.00  AHcccccgci       
          2  0.01  cgcccccgci            2  0.00  AH               
          2  0.01  cgHccgci              1  0.00  AHom             
          1  0.00  cgroe                 1  0.00  AHoeoe           
          1  0.00  cgorcim               1  0.00  AHoecgci         
          1  0.00  cgoeccccgci           1  0.00  AHoeccccgci      
          1  0.00  cgoecccccgci          1  0.00  AHoPci           
          1  0.00  cgoePccccgci          1  0.00  AHcoeci          
          1  0.00  cgoeHccgci            1  0.00  AHcoe            
          1  0.00  cgoHccgci             1  0.00  AHcirci          
          1  0.00  cgoHccci              1  0.00  AHcircgci        
          1  0.00  cgeccccgci            1  0.00  AHciie           
          1  0.00  cgcirorci             1  0.00  AHcieoe          
          1  0.00  cgcirof               1  0.00  AHciecgcgci      
          1  0.00  cgcircccci            1  0.00  AHciecccci       
          1  0.00  cgcircccccgcie        1  0.00  AHciec           
          1  0.00  cgciie                1  0.00  AHcicccgci       
          1  0.00  cgcieoe               1  0.00  AHcgci           
          1  0.00  cgciecirci            1  0.00  AHccoe           
          1  0.00  cgciecircg            1  0.00  AHccir           
          1  0.00  cgciecir              1  0.00  AHccie           
          1  0.00  cgciecim              1  0.00  AHccgor          
          1  0.00  cgciecie              1  0.00  AHccgcim         
          1  0.00  cgcieHci              1  0.00  AHccgccgci       
          1  0.00  cgcicHcci             1  0.00  AHccgccccgci     
          1  0.00  cgciHcim              1  0.00  AHccccg          
          1  0.00  cgciHci               1  0.00  AHccccci         
          1  0.00  cgciHcci              1  0.00  AHccccHcci       
          1  0.00  cgciHccci             1  0.00  AHccc            
          1  0.00  cgccoecicg            1  0.00  AHcc             
          1  0.00  cgccccoe          -----  ----  ----               
          1  0.00  cgcccccgcie        1044  1.00  TOT                
          1  0.00  cgccccc                                           
          1  0.00  cgcccHcccgci                                      
          1  0.00  cgHccci                                           
          1  0.00  cgHcccci                                          
      -----  ----  ----                                                
        363  1.00  TOT                                                 
      
  So it seems that { oe or om e } are also valid suffixes.
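
  One way to check this more directly is to strip a known prefix and
  keep only the remainders that do not start with "c".  The helper
  below is a sketch; its name "non_c_suffixes" is hypothetical, and
  the sample words are copied from the "AH" column above.

```shell
# non_c_suffixes PREFIX: hypothetical helper. Reads one word per line and
# prints the remainder of each word that starts with PREFIX, but only when
# that remainder is non-empty and does not begin with "c".
non_c_suffixes() {
  awk -v p="$1" '
    index($0, p) == 1 {
      rest = substr($0, length(p) + 1)
      if (rest != "" && rest !~ /^c/) print rest
    }'
}

# Sample words taken from the "AH" column of the table above.
printf '%s\n' AHccgci AHoe AHcim AHor AHom AHe AH |
  non_c_suffixes AH | LC_ALL=C sort | uniq -c
```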

  OK, let's try to parse what we can into prefix:suffix:
  
    split-prefix-suffix
    ------------------------------------------------
    #! /n/gnu/bin/gawk -f

    # Attempts to split words into prefix/suffix inserting ":" in between.

    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|eccccH|oH|oP|oe|oeH|r)"
      SUFFS = "([co][^HP]*)$"
      SPLIT = ( PREFS SUFFS )
    }

    ( $0 ~ SPLIT ) {
      match($0, PREFS)
      k = RLENGTH
      $0 = (substr($0, 1, k) ":" substr($0, k + 1))
      print
      next
    }

    /./ { print; next }
    ------------------------------------------------
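
  To see what the script does to individual words, the same rule can
  be wrapped in a shell function and fed a few samples.  This is only
  a sketch: the function name "split_v1" and the use of plain "awk"
  instead of the gawk path above are assumptions.

```shell
# split_v1: the prefix/suffix rule from the listing above, wrapped in a
# shell function (function name and plain `awk` are assumptions).
split_v1() {
  awk '
    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|eccccH|oH|oP|oe|oeH|r)"
      SUFFS = "([co][^HP]*)$"
      SPLIT = ( PREFS SUFFS )
    }
    # The whole word must parse as prefix + suffix; match() then reports
    # the longest listed prefix found at the start of the word.
    ( $0 ~ SPLIT ) {
      match($0, PREFS)
      print substr($0, 1, RLENGTH) ":" substr($0, RLENGTH + 1)
      next
    }
    /./ { print }'
}

printf '%s\n' AHccgci oeHcim cgciHcci ccccgci Ae | split_v1
```

  The first three words are split after "AH", "oeH" and "cgciH".
  "ccccgci" and "Ae" come out unchanged, since no listed prefix
  leaves a valid suffix behind.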
    
    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep -v ':' \
      | wfreq

       387  0.24  ccccgci
       139  0.09  cccci
       131  0.08  oe
        81  0.05  Ae
        61  0.04  ccccci
        60  0.04  cccccgci
        41  0.03  or
        33  0.02  cccoe
        30  0.02  ccim
        25  0.02  coe
        23  0.01  ccoe
        21  0.01  cim
        21  0.01  cccgci
        17  0.01  cccor
        14  0.01  ccie
        14  0.01  ccci
        12  0.01  cie
        11  0.01  ccir
        10  0.01  cor
        10  0.01  cccir
        10  0.01  cccc
         9  0.01  ccccgcim
         9  0.01  ccccgcie
         8  0.01  ccccir
         7  0.00  cir
         7  0.00  cccie
         7  0.00  ccccie
         7  0.00  ccccg
         6  0.00  cce
         6  0.00  c
         6  0.00  Ar
         6  0.00  Acccgci
         5  0.00  r
         5  0.00  oroe
         5  0.00  orci
         5  0.00  com
         5  0.00  cccPccci
         4  0.00  oePccccgci
         4  0.00  e
         4  0.00  coeHcccgci
         4  0.00  cin
         4  0.00  cge
         4  0.00  ccccoe
         4  0.00  ccccgcir
         4  0.00  ccccc
         4  0.00  cccH
         4  0.00  cPccci
         4  0.00  AcHccci
         4  0.00  A
         3  0.00  orcim
         3  0.00  ocHcci
         3  0.00  oH
         3  0.00  ecccHcci
         3  0.00  coecccci
         3  0.00  coeHccci
         3  0.00  cif
         3  0.00  cieci
         3  0.00  ccoecgci
         3  0.00  cccif
         3  0.00  cccgcir
         3  0.00  ccccgoe
         3  0.00  Acgci
         2  0.00  rcccHci
         2  0.00  oroeci
         2  0.00  om
         2  0.00  oeoHci
         2  0.00  oeeccci
         2  0.00  oecccHcci
         2  0.00  ocgci
         2  0.00  occcci
         2  0.00  occccgci
         2  0.00  ocHccci
         2  0.00  ecccHci
         2  0.00  coeccccgci
         2  0.00  coHcccgci
         2  0.00  ciroe
         2  0.00  circi
         2  0.00  cimci
         2  0.00  ciecccgci
         2  0.00  cgHccgci
         2  0.00  ccoHci
         2  0.00  cccoHci
         2  0.00  ccccieci
         2  0.00  ccccPcci
         2  0.00  cccPccccgci
         2  0.00  HoHoe
         2  0.00  Accci
         2  0.00  Accccgci
         2  0.00  AHe
         2  0.00  AH
         1  0.00  rcicHcci
         1  0.00  orcir
         1  0.00  orcin
         1  0.00  orcie
         1  0.00  orcccci
         1  0.00  orcccccgci
         1  0.00  on
         1  0.00  oeoeHccci
         1  0.00  oeoHcir
         1  0.00  oeoHccccgci
         1  0.00  oeeof
         1  0.00  oeccccHcci
         1  0.00  oeccHci
         1  0.00  oePocHcci
         1  0.00  oeA
         1  0.00  ocicccci
         1  0.00  ocgcir
         1  0.00  ocgcim
         1  0.00  ocgcie
         1  0.00  ocgccccgci
         1  0.00  occie
         1  0.00  occcgci
         1  0.00  occccin
         1  0.00  occccci
         1  0.00  occcPoec
         1  0.00  ocHcor
         1  0.00  ocHccoe
         1  0.00  ocHcccgci
         1  0.00  oPcieHcim
         1  0.00  oHeor
         1  0.00  oHeoe
         1  0.00  oHcieHci
         1  0.00  oHccoHcir
         1  0.00  oAe
         1  0.00  oAPccccgci
         1  0.00  oAHci
         1  0.00  oA
         1  0.00  o
         1  0.00  er
         1  0.00  eoeHcim
         1  0.00  eoHcif
         1  0.00  eecgcir
         1  0.00  ecccPcccgci
         1  0.00  ePoe
         1  0.00  ePcccgci
         1  0.00  ePccccgci
         1  0.00  eHecccci
         1  0.00  eHe
         1  0.00  coeor
         1  0.00  coeci
         1  0.00  coecgci
         1  0.00  coecccoe
         1  0.00  coeccHcie
         1  0.00  coeHci
         1  0.00  coeHcci
         1  0.00  coeHccgci
         1  0.00  cocgcir
         1  0.00  cocHccci
         1  0.00  coHoe
         1  0.00  ciror
         1  0.00  ciroHcci
         1  0.00  circir
         1  0.00  cioHci
         1  0.00  cieorci
         1  0.00  cieor
         1  0.00  cieoe
         1  0.00  cieciecgci
         1  0.00  cieccccgci
         1  0.00  cieccccc
         1  0.00  ciccccHcci
         1  0.00  ciccccHccci
         1  0.00  cicPcim
         1  0.00  cicHccci
         1  0.00  ciPcccci
         1  0.00  ciPccccgci
         1  0.00  ci
         1  0.00  cgroe
         1  0.00  cgoePccccgci
         1  0.00  cgoeHccgci
         1  0.00  cgoHccgci
         1  0.00  cgoHccci
         1  0.00  cgeccccgci
         1  0.00  cgcieHci
         1  0.00  cgcicHcci
         1  0.00  cgcccHcccgci
         1  0.00  cgHccci
         1  0.00  cgHcccci
         1  0.00  cec
         1  0.00  ccor
         1  0.00  ccoeo
         1  0.00  ccoeci
         1  0.00  ccoecccci
         1  0.00  ccoeHcccci
         1  0.00  ccocgcim
         1  0.00  ccocPcccgci
         1  0.00  ccocHcci
         1  0.00  ccoHcim
         1  0.00  ccoHcie
         1  0.00  ccoHcccgci
         1  0.00  ccircie
         1  0.00  ccino
         1  0.00  ccieci
         1  0.00  ccieccccg
         1  0.00  ccieHccci
         1  0.00  cciHcim
         1  0.00  cci
         1  0.00  ccer
         1  0.00  ccecgcim
         1  0.00  ccecgci
         1  0.00  cceccPccccci
         1  0.00  ccec
         1  0.00  cccoeoe
         1  0.00  cccoeo
         1  0.00  cccoecgci
         1  0.00  cccoecccgci
         1  0.00  cccoecccci
         1  0.00  ccciror
         1  0.00  cccirci
         1  0.00  cccieoeci
         1  0.00  ccciHccci
         1  0.00  cccgoe
         1  0.00  cccgcin
         1  0.00  cccgcif
         1  0.00  cccgcie
         1  0.00  ccccor
         1  0.00  ccccieor
         1  0.00  cccciHci
         1  0.00  ccccgor
         1  0.00  ccccgcif
         1  0.00  ccccgciecgci
         1  0.00  ccccgciHcir
         1  0.00  ccccgccci
         1  0.00  cccccoe
         1  0.00  cccccim
         1  0.00  cccccie
         1  0.00  cccccgcir
         1  0.00  cccccg
         1  0.00  cccccci
         1  0.00  cccccHcci
         1  0.00  cccccHccci
         1  0.00  ccccPcccgci
         1  0.00  cccPoe
         1  0.00  cccPci
         1  0.00  cccPcccgci
         1  0.00  cccP
         1  0.00  ccPcim
         1  0.00  ccPccccgci
         1  0.00  ccPcccccgci
         1  0.00  cc
         1  0.00  cPcoe
         1  0.00  cPccir
         1  0.00  cPccie
         1  0.00  cPccgor
         1  0.00  cPcccci
         1  0.00  cPccccgci
         1  0.00  PoecgciHci
         1  0.00  PoeHcccoe
         1  0.00  PoeHccci
         1  0.00  PoHcin
         1  0.00  PciHccgci
         1  0.00  HoePci
         1  0.00  HoeHcci
         1  0.00  HocHccci
         1  0.00  HoHcieci
         1  0.00  HciAHci
         1  0.00  HcccoHcccgci
         1  0.00  HcccgoeHcgci
         1  0.00  H
         1  0.00  Arcim
         1  0.00  Aoe
         1  0.00  An
         1  0.00  Acir
         1  0.00  Acim
         1  0.00  Acie
         1  0.00  AciHci
         1  0.00  AcgciHcci
         1  0.00  Acgccci
         1  0.00  Acgcccgci
         1  0.00  Acgccccg
         1  0.00  Acccci
         1  0.00  AccHcccgci
         1  0.00  AcHcci
         1  0.00  AcHcccgci
         1  0.00  AcHccc
         1  0.00  AP
         1  0.00  AHoPci
         1  0.00  AHccccHcci
         1  0.00  AAHccci
     -----  ----  ----
      1585  1.00  TOT

  I must do something about the "c" prefix: most of the words left
  unsplit begin with a run of "c" (e.g. "ccccgci", "cccci", "ccccci").
  
  All suffixes recognized by the code:

    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/^.*://g' \
      | wfreq
  
       376  0.12  ccgci
       355  0.11  cccgci
       267  0.08  ci
       265  0.08  ccccgci
       255  0.08  cim
       243  0.08  cie
       240  0.08  cci
       201  0.06  ccci
       174  0.06  cir
       111  0.04  cccci
       102  0.03  oe
        98  0.03  cin
        41  0.01  or
        25  0.01  ccccci
        21  0.01  cgci
        21  0.01  cccoe
        20  0.01  cieci
        18  0.01  cif
        18  0.01  cccccgci
        14  0.00  ccccg
        13  0.00  coe
        12  0.00  ccoe
        11  0.00  cccg
         9  0.00  circi
         8  0.00  ciecgci
         8  0.00  ccor
         8  0.00  ccgcir
         7  0.00  oeci
         7  0.00  ccccgcir
         6  0.00  cor
         6  0.00  ccg
         5  0.00  om
         5  0.00  o
         5  0.00  cieor
         5  0.00  cicgci
         5  0.00  ccie
         5  0.00  ccgcie
         4  0.00  oecgci
         4  0.00  oeccccgci
         4  0.00  cieoe
         4  0.00  cieccccgci
         4  0.00  cccir
         4  0.00  cccgcie
         4  0.00  ccccc
         4  0.00  c
         3  0.00  oecccgci
         3  0.00  ciecccgci
         3  0.00  cic
         3  0.00  cccie
         3  0.00  cccgcir
         3  0.00  ccccgcie
         3  0.00  ccc
         3  0.00  cc
         2  0.00  oroe
         2  0.00  orcim
         2  0.00  oeccci
         2  0.00  oecccci
         2  0.00  coecgci
         2  0.00  ciie
         2  0.00  cieo
         2  0.00  cgoe
         2  0.00  cgcim
         2  0.00  ccgcim
         2  0.00  cce
         2  0.00  ccccif
         1  0.00  oroeccccgci
         1  0.00  orcin
         1  0.00  orcie
         1  0.00  orccci
         1  0.00  oeor
         1  0.00  oeof
         1  0.00  oeoe
         1  0.00  oecircir
         1  0.00  oecim
         1  0.00  oeciecccgci
         1  0.00  oecgcie
         1  0.00  oecgccccgci
         1  0.00  oecccgcie
         1  0.00  oecccg
         1  0.00  oeccccg
         1  0.00  oecccccgci
         1  0.00  ocgcie
         1  0.00  ocg
         1  0.00  occci
         1  0.00  occccgci
         1  0.00  coeor
         1  0.00  coeci
         1  0.00  co
         1  0.00  cirorci
         1  0.00  cirof
         1  0.00  ciroeof
         1  0.00  ciroe
         1  0.00  circif
         1  0.00  circie
         1  0.00  circgci
         1  0.00  circccci
         1  0.00  circccccgcie
         1  0.00  cioc
         1  0.00  cimci
         1  0.00  cifo
         1  0.00  ciecirci
         1  0.00  ciecircg
         1  0.00  ciecir
         1  0.00  ciecim
         1  0.00  ciecie
         1  0.00  ciecgcgci
         1  0.00  ciecce
         1  0.00  cieccci
         1  0.00  ciecccci
         1  0.00  ciec
         1  0.00  cicccgci
         1  0.00  cgcircie
         1  0.00  cgcircccci
         1  0.00  cgcir
         1  0.00  cgcieor
         1  0.00  cgcieccor
         1  0.00  cgcie
         1  0.00  cg
         1  0.00  ccoecicg
         1  0.00  ccoeci
         1  0.00  ccocgci
         1  0.00  cco
         1  0.00  ccir
         1  0.00  ccin
         1  0.00  ccieci
         1  0.00  ccgor
         1  0.00  ccgcioe
         1  0.00  ccgcif
         1  0.00  ccgcieor
         1  0.00  ccgcci
         1  0.00  ccgccgci
         1  0.00  ccgccci
         1  0.00  ccgccccgci
         1  0.00  cccor
         1  0.00  cccoecgci
         1  0.00  ccciroe
         1  0.00  cccieci
         1  0.00  cccgcif
         1  0.00  cccgciA
         1  0.00  cccgccci
         1  0.00  ccccor
         1  0.00  ccccoe
         1  0.00  ccccir
         1  0.00  cccciecgci
         1  0.00  cccciecg
         1  0.00  ccccie
         1  0.00  ccccic
         1  0.00  ccccgcirci
         1  0.00  cccccgcie
         1  0.00  cccccg
         1  0.00  cccccci
         1  0.00  cccc
     -----  ----  ----
      3157  1.00  TOT
    
  The same list, sorted by the reversed spelling of each suffix
  (so that suffixes with the same ending are adjacent):

         1  0.00  cccgciA
         4  0.00  c
         3  0.00  cc
         3  0.00  ccc
         1  0.00  cccc
         4  0.00  ccccc
         1  0.00  ciec
         3  0.00  cic
         1  0.00  ccccic
         1  0.00  cioc
         2  0.00  cce
         1  0.00  ciecce
       243  0.08  cie
         5  0.00  ccie
         3  0.00  cccie
         1  0.00  ccccie
         1  0.00  ciecie
         1  0.00  cgcie
         5  0.00  ccgcie
         4  0.00  cccgcie
         3  0.00  ccccgcie
         1  0.00  cccccgcie
         1  0.00  circccccgcie
         1  0.00  oecccgcie
         1  0.00  oecgcie
         1  0.00  ocgcie
         1  0.00  circie
         1  0.00  cgcircie
         1  0.00  orcie
         2  0.00  ciie
       102  0.03  oe
        13  0.00  coe
        12  0.00  ccoe
        21  0.01  cccoe
         1  0.00  ccccoe
         4  0.00  cieoe
         1  0.00  oeoe
         2  0.00  cgoe
         1  0.00  ccgcioe
         1  0.00  ciroe
         1  0.00  ccciroe
         2  0.00  oroe
        18  0.01  cif
         2  0.00  ccccif
         1  0.00  ccgcif
         1  0.00  cccgcif
         1  0.00  circif
         1  0.00  oeof
         1  0.00  ciroeof
         1  0.00  cirof
         1  0.00  cg
         6  0.00  ccg
        11  0.00  cccg
        14  0.00  ccccg
         1  0.00  cccccg
         1  0.00  oeccccg
         1  0.00  oecccg
         1  0.00  cccciecg
         1  0.00  ccoecicg
         1  0.00  ocg
         1  0.00  ciecircg
       267  0.08  ci
       240  0.08  cci
       201  0.06  ccci
       111  0.04  cccci
        25  0.01  ccccci
         1  0.00  cccccci
         1  0.00  ciecccci
         2  0.00  oecccci
         1  0.00  circccci
         1  0.00  cgcircccci
         1  0.00  cieccci
         2  0.00  oeccci
         1  0.00  ccgccci
         1  0.00  cccgccci
         1  0.00  occci
         1  0.00  orccci
         1  0.00  ccgcci
        20  0.01  cieci
         1  0.00  ccieci
         1  0.00  cccieci
         7  0.00  oeci
         1  0.00  coeci
         1  0.00  ccoeci
        21  0.01  cgci
       376  0.12  ccgci
       355  0.11  cccgci
       265  0.08  ccccgci
        18  0.01  cccccgci
         1  0.00  oecccccgci
         4  0.00  cieccccgci
         4  0.00  oeccccgci
         1  0.00  oroeccccgci
         1  0.00  ccgccccgci
         1  0.00  oecgccccgci
         1  0.00  occccgci
         3  0.00  ciecccgci
         1  0.00  oeciecccgci
         3  0.00  oecccgci
         1  0.00  cicccgci
         1  0.00  ccgccgci
         8  0.00  ciecgci
         1  0.00  cccciecgci
         4  0.00  oecgci
         2  0.00  coecgci
         1  0.00  cccoecgci
         1  0.00  ciecgcgci
         5  0.00  cicgci
         1  0.00  ccocgci
         1  0.00  circgci
         1  0.00  cimci
         9  0.00  circi
         1  0.00  ciecirci
         1  0.00  ccccgcirci
         1  0.00  cirorci
       255  0.08  cim
         1  0.00  ciecim
         1  0.00  oecim
         2  0.00  cgcim
         2  0.00  ccgcim
         2  0.00  orcim
         5  0.00  om
        98  0.03  cin
         1  0.00  ccin
         1  0.00  orcin
         5  0.00  o
         1  0.00  co
         1  0.00  cco
         2  0.00  cieo
         1  0.00  cifo
       174  0.06  cir
         1  0.00  ccir
         4  0.00  cccir
         1  0.00  ccccir
         1  0.00  ciecir
         1  0.00  cgcir
         8  0.00  ccgcir
         3  0.00  cccgcir
         7  0.00  ccccgcir
         1  0.00  oecircir
        41  0.01  or
         6  0.00  cor
         8  0.00  ccor
         1  0.00  cccor
         1  0.00  ccccor
         1  0.00  cgcieccor
         5  0.00  cieor
         1  0.00  cgcieor
         1  0.00  ccgcieor
         1  0.00  oeor
         1  0.00  coeor
         1  0.00  ccgor
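
  The command that produced this ordering is not recorded here.  One
  way to get it is to sort on the reversed spelling of each word, as
  in the sketch below.  The helper name "sort_by_ending" is
  hypothetical, and it works on bare words; for wfreq output the key
  would have to be built from the third field instead.

```shell
# sort_by_ending: hypothetical helper. Sorts words by their reversed
# spelling, so that words sharing an ending become neighbours.
sort_by_ending() {
  awk '{
         key = ""
         for (i = length($0); i >= 1; i--) key = key substr($0, i, 1)
         print key "\t" $0
       }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 | cut -f2
}

printf '%s\n' cim ccgci oe cie cccgci ccie | sort_by_ending
```

  The six sample suffixes come out in the same relative order as in
  the table above: cie, ccie, oe, ccgci, cccgci, cim.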

  Note the isolated peaks at suffixes that begin with "ci" or "o":
  for instance, "cie" occurs 243 times but "ccie" only 5.
  These sharp peaks are evidence that the prefixes recognized above 
  do NOT have alternates with a "c" appended.  That is, the suffixes

        21  0.01  cgci
       376  0.12  ccgci
       355  0.11  cccgci
       265  0.08  ccccgci
        18  0.01  cccccgci

  appear to be different suffixes, and not the same "ccgci"
  suffix attached to different prefixes.
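
  The peak argument can be made concrete by putting each suffix next
  to the same suffix with one extra "c" in front.  If the prefixes
  had variants ending in "c", the second count should be comparable
  to the first.  The helper name "with_extra_c" is hypothetical; the
  counts are copied from the suffix table above.

```shell
# with_extra_c: hypothetical helper. Reads wfreq-style lines
# (count, fraction, suffix) and, for each suffix named as an argument,
# prints its count next to the count of the same suffix preceded by "c".
with_extra_c() {
  awk -v list="$*" '
    { n[$3] = $1 }
    END {
      k = split(list, s, " ")
      for (i = 1; i <= k; i++)
        printf "%-4s %4d   c%-4s %4d\n", s[i], n[s[i]], s[i], n["c" s[i]]
    }'
}

# Counts copied from the suffix table above ("ccim" does not occur there).
printf '%s\n' '243 0.08 cie' '5 0.00 ccie' '255 0.08 cim' \
              '174 0.06 cir' '1 0.00 ccir' '102 0.03 oe' '13 0.00 coe' |
  with_extra_c cie cim cir oe
```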
  
  Here are the most significant suffixes:
  
       243  0.08  cie
       102  0.03  oe
        13  0.00  coe
        12  0.00  ccoe
        21  0.01  cccoe
        18  0.01  cif
        11  0.00  cccg
        14  0.00  ccccg
       267  0.08  ci
       240  0.08  cci
       201  0.06  ccci
       111  0.04  cccci
        25  0.01  ccccci
        20  0.01  cieci
        21  0.01  cgci
       376  0.12  ccgci
       355  0.11  cccgci
       265  0.08  ccccgci
        18  0.01  cccccgci
       255  0.08  cim
        98  0.03  cin
       174  0.06  cir
        41  0.01  or

  Removing the leading run of "c" from each suffix that begins with "c": 
  
    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':c' \
      | sed -e 's/^.*:c*//g' \
      | wfreq
      
      1035  0.35  gci
       845  0.29  i
       255  0.09  im
       252  0.09  ie
       180  0.06  ir
        99  0.03  in
        47  0.02  oe
        33  0.01  g
        22  0.01  ieci
        20  0.01  if
        19  0.01  gcir
        16  0.01  or
        15  0.01  
        14  0.00  gcie
         9  0.00  irci
         9  0.00  iecgci
         5  0.00  ieor
         5  0.00  icgci
         4  0.00  ieoe
         4  0.00  ieccccgci
         4  0.00  ic
         4  0.00  gcim
         3  0.00  oecgci
         3  0.00  iecccgci
         2  0.00  oeci
         2  0.00  o
         2  0.00  iroe
         2  0.00  iie
         2  0.00  ieo
         2  0.00  goe
         2  0.00  gcif
         2  0.00  gcieor
         2  0.00  gccci
         2  0.00  e
         1  0.00  oeor
         1  0.00  oecicg
         1  0.00  ocgci
         1  0.00  irorci
         1  0.00  irof
         1  0.00  iroeof
         1  0.00  ircif
         1  0.00  ircie
         1  0.00  ircgci
         1  0.00  ircccci
         1  0.00  ircccccgcie
         1  0.00  ioc
         1  0.00  imci
         1  0.00  ifo
         1  0.00  iecirci
         1  0.00  iecircg
         1  0.00  iecir
         1  0.00  iecim
         1  0.00  iecie
         1  0.00  iecgcgci
         1  0.00  iecg
         1  0.00  iecce
         1  0.00  ieccci
         1  0.00  iecccci
         1  0.00  iec
         1  0.00  icccgci
         1  0.00  gor
         1  0.00  gcircie
         1  0.00  gcirci
         1  0.00  gcircccci
         1  0.00  gcioe
         1  0.00  gcieccor
         1  0.00  gciA
         1  0.00  gcci
         1  0.00  gccgci
         1  0.00  gccccgci
     -----  ----  ----
      2958  1.00  TOT
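
  As a sanity check, the sed step can be tried on a few split words,
  and the two largest rows can be compared against the earlier suffix
  table.  The sample words are illustrative; the numbers in the sums
  are copied from that table.

```shell
# Strip everything up to the ":" and then the leading run of "c",
# as the sed command above does, on a few sample split words.
printf '%s\n' AH:ccgci oH:cccgci cg:cim e:ccccgci |
  sed -e 's/^.*:c*//'

# The "gci" row should be the sum of the cgci, ccgci, cccgci,
# ccccgci and cccccgci rows of the earlier table.
echo $(( 21 + 376 + 355 + 265 + 18 ))

# Likewise the "i" row: ci, cci, ccci, cccci, ccccci and cccccci.
echo $(( 267 + 240 + 201 + 111 + 25 + 1 ))
```

  The two sums agree with the "gci" and "i" rows of the table above.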

  I redefined the suffix pattern and added a few prefixes
  (ecccH, oecccH, coeH, and a bare "c"):
  
    split-prefix-suffix
    ------------------------------------------------
    #! /n/gnu/bin/gawk -f

    # Attempts to split words into prefix/suffix inserting ":" in between.

    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|ecccH|eccccH|oH|oP|oe|oeH|oecccH|coeH|r|c)"
      SUFFS = "([coe][cgiroe]*([mnrfe]|))$"
      SPLIT = ( PREFS SUFFS )
    }

    ( $0 ~ SPLIT ) {
      match($0, PREFS)
      k = RLENGTH
      $0 = (substr($0, 1, k) ":" substr($0, k + 1))
      print
      next
    }

    /./ { print; next }
    ------------------------------------------------
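
  A few sample words show what the new patterns change.  As before,
  this is a sketch: the function name "split_v2" and plain "awk" are
  assumptions, and the optional final letter is written "[mnrfe]?"
  on the assumption that it behaves like the empty alternative in
  the listing.

```shell
# split_v2: the revised prefix/suffix rule, wrapped in a shell function
# (function name and plain `awk` are assumptions).
split_v2() {
  awk '
    BEGIN {
      PREFS = "^(AH|AP|Ae|AeH|H|P|cH|ccH|cccH|ccccH|cg|cgciH|ci|ciH|e|eH|ecccH|eccccH|oH|oP|oe|oeH|oecccH|coeH|r|c)"
      SUFFS = "([coe][cgiroe]*[mnrfe]?)$"
      SPLIT = ( PREFS SUFFS )
    }
    # Same logic as before: require a full prefix + suffix parse, then
    # cut after the longest listed prefix at the start of the word.
    ( $0 ~ SPLIT ) {
      match($0, PREFS)
      print substr($0, 1, RLENGTH) ":" substr($0, RLENGTH + 1)
      next
    }
    /./ { print }'
}

printf '%s\n' ccccgci ccim cie coeHccci cim | split_v2
```

  With the bare "c" prefix, "ccccgci" now splits as "c:cccgci" and
  "ccim" as "c:cim".  The word "cie" is cut as "ci:e", while "cim"
  alone still has no valid split.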

  Here are the suffixes recognized by this code:
  
       746  0.18  cccgci
       398  0.09  ccgci
       343  0.08  ccci
       325  0.08  ccccgci
       285  0.07  cim
       271  0.06  ci
       260  0.06  cci
       257  0.06  cie
       185  0.04  cir
       172  0.04  cccci
       127  0.03  oe
        98  0.02  cin
        51  0.01  or
        45  0.01  ccoe
        36  0.01  coe
        26  0.01  ccccci
        25  0.01  ccor
        25  0.01  cccoe
        21  0.00  cieci
        21  0.00  cgci
        19  0.00  e
        18  0.00  cif
        18  0.00  cccg
        18  0.00  cccccgci
        15  0.00  ccccg
        13  0.00  cccgcie
        13  0.00  ccc
        12  0.00  ccie
        12  0.00  cccir
        11  0.00  ccir
        11  0.00  ccgcir
        10  0.00  om
        10  0.00  cccie
         9  0.00  circi
         9  0.00  cccgcim
         8  0.00  oeci
         8  0.00  ciecgci
         8  0.00  ccccgcir
         7  0.00  cor
         7  0.00  cccgcir
         6  0.00  oeccccgci
         6  0.00  ce
         6  0.00  ccgcie
         6  0.00  ccg
         5  0.00  oecgci
         5  0.00  oecccci
         5  0.00  o
         5  0.00  coecgci
         5  0.00  cieor
         5  0.00  cicgci
         5  0.00  cccc
         5  0.00  c
         4  0.00  cieoe
         4  0.00  cieccccgci
         4  0.00  ccccc
         3  0.00  oecccgci
         3  0.00  eci
         3  0.00  ciecccgci
         3  0.00  cic
         3  0.00  ccif
         3  0.00  cccieci
         3  0.00  cccgoe
         3  0.00  ccccgcie
         3  0.00  cc
         2  0.00  oroe
         2  0.00  orcim
         2  0.00  oeor
         2  0.00  oeccci
         2  0.00  eor
         2  0.00  eoe
         2  0.00  eccci
         2  0.00  ecccgci
         2  0.00  eccccgci
         2  0.00  coeci
         2  0.00  circie
         2  0.00  ciie
         2  0.00  cieo
         2  0.00  cgoe
         2  0.00  cgcim
         2  0.00  ccgcim
         2  0.00  ccgcif
         2  0.00  cce
         2  0.00  cccor
         2  0.00  cccgcif
         2  0.00  cccgccci
         2  0.00  ccccoe
         2  0.00  ccccif
         2  0.00  ccccie
         1  0.00  oroeccccgci
         1  0.00  orcin
         1  0.00  orcie
         1  0.00  orccci
         1  0.00  oeof
         1  0.00  oeoe
         1  0.00  oecircir
         1  0.00  oecim
         1  0.00  oeciecccgci
         1  0.00  oecgcie
         1  0.00  oecgccccgci
         1  0.00  oecccoe
         1  0.00  oecccgcie
         1  0.00  oecccg
         1  0.00  oeccccg
         1  0.00  oecccccgci
         1  0.00  ocgcir
         1  0.00  ocgcie
         1  0.00  ocg
         1  0.00  occci
         1  0.00  occccgci
         1  0.00  eorci
         1  0.00  eof
         1  0.00  eciecgci
         1  0.00  ecgcir
         1  0.00  ecccci
         1  0.00  eccccc
         1  0.00  ec
         1  0.00  coeor
         1  0.00  coeo
         1  0.00  coecccci
         1  0.00  cocgcim
         1  0.00  co
         1  0.00  cirorci
         1  0.00  cirof
         1  0.00  ciroeof
         1  0.00  ciroe
         1  0.00  circif
         1  0.00  circgci
         1  0.00  circccci
         1  0.00  circccccgcie
         1  0.00  cioc
         1  0.00  ciecirci
         1  0.00  ciecircg
         1  0.00  ciecir
         1  0.00  ciecim
         1  0.00  ciecie
         1  0.00  ciecgcgci
         1  0.00  ciecce
         1  0.00  cieccci
         1  0.00  ciecccci
         1  0.00  cieccccg
         1  0.00  ciec
         1  0.00  cicccgci
         1  0.00  cgcircie
         1  0.00  cgcircccci
         1  0.00  cgcir
         1  0.00  cgcieor
         1  0.00  cgcieccor
         1  0.00  cgcie
         1  0.00  cg
         1  0.00  cer
         1  0.00  cecgcim
         1  0.00  cecgci
         1  0.00  cec
         1  0.00  ccoeoe
         1  0.00  ccoeo
         1  0.00  ccoecicg
         1  0.00  ccoeci
         1  0.00  ccoecgci
         1  0.00  ccoecccgci
         1  0.00  ccoecccci
         1  0.00  ccocgci
         1  0.00  cco
         1  0.00  cciror
         1  0.00  ccirci
         1  0.00  ccin
         1  0.00  ccieoeci
         1  0.00  ccieci
         1  0.00  ccgor
         1  0.00  ccgoe
         1  0.00  ccgcioe
         1  0.00  ccgcin
         1  0.00  ccgcieor
         1  0.00  ccgcci
         1  0.00  ccgccgci
         1  0.00  ccgccci
         1  0.00  ccgccccgci
         1  0.00  cccoecgci
         1  0.00  ccciroe
         1  0.00  cccieor
         1  0.00  cccgor
         1  0.00  cccgciecgci
         1  0.00  ccccor
         1  0.00  ccccir
         1  0.00  ccccim
         1  0.00  cccciecgci
         1  0.00  cccciecg
         1  0.00  ccccic
         1  0.00  ccccgcirci
         1  0.00  cccccgcie
         1  0.00  cccccg
         1  0.00  cccccci
     -----  ----  ----
      4207  1.00  TOT
  
  In reverse lexicographic order (sorted on the reversed word, so that equal endings are adjacent):
  
         5  0.00  c
         3  0.00  cc
        13  0.00  ccc
         5  0.00  cccc
         4  0.00  ccccc
         1  0.00  eccccc
         1  0.00  ec
         1  0.00  cec
         1  0.00  ciec
         3  0.00  cic
         1  0.00  ccccic
         1  0.00  cioc
        19  0.00  e
         6  0.00  ce
         2  0.00  cce
         1  0.00  ciecce
       257  0.06  cie
        12  0.00  ccie
        10  0.00  cccie
         2  0.00  ccccie
         1  0.00  ciecie
         1  0.00  cgcie
         6  0.00  ccgcie
        13  0.00  cccgcie
         3  0.00  ccccgcie
         1  0.00  cccccgcie
         1  0.00  circccccgcie
         1  0.00  oecccgcie
         1  0.00  oecgcie
         1  0.00  ocgcie
         2  0.00  circie
         1  0.00  cgcircie
         1  0.00  orcie
         2  0.00  ciie
       127  0.03  oe
        36  0.01  coe
        45  0.01  ccoe
        25  0.01  cccoe
         2  0.00  ccccoe
         1  0.00  oecccoe
         2  0.00  eoe
         4  0.00  cieoe
         1  0.00  oeoe
         1  0.00  ccoeoe
         2  0.00  cgoe
         1  0.00  ccgoe
         3  0.00  cccgoe
         1  0.00  ccgcioe
         1  0.00  ciroe
         1  0.00  ccciroe
         2  0.00  oroe
        18  0.00  cif
         3  0.00  ccif
         2  0.00  ccccif
         2  0.00  ccgcif
         2  0.00  cccgcif
         1  0.00  circif
         1  0.00  eof
         1  0.00  oeof
         1  0.00  ciroeof
         1  0.00  cirof
         1  0.00  cg
         6  0.00  ccg
        18  0.00  cccg
        15  0.00  ccccg
         1  0.00  cccccg
         1  0.00  cieccccg
         1  0.00  oeccccg
         1  0.00  oecccg
         1  0.00  cccciecg
         1  0.00  ccoecicg
         1  0.00  ocg
         1  0.00  ciecircg
       271  0.06  ci
       260  0.06  cci
       343  0.08  ccci
       172  0.04  cccci
        26  0.01  ccccci
         1  0.00  cccccci
         1  0.00  ecccci
         1  0.00  ciecccci
         5  0.00  oecccci
         1  0.00  coecccci
         1  0.00  ccoecccci
         1  0.00  circccci
         1  0.00  cgcircccci
         2  0.00  eccci
         1  0.00  cieccci
         2  0.00  oeccci
         1  0.00  ccgccci
         2  0.00  cccgccci
         1  0.00  occci
         1  0.00  orccci
         1  0.00  ccgcci
         3  0.00  eci
        21  0.00  cieci
         1  0.00  ccieci
         3  0.00  cccieci
         8  0.00  oeci
         2  0.00  coeci
         1  0.00  ccoeci
         1  0.00  ccieoeci
        21  0.00  cgci
       398  0.09  ccgci
       746  0.18  cccgci
       325  0.08  ccccgci
        18  0.00  cccccgci
         1  0.00  oecccccgci
         2  0.00  eccccgci
         4  0.00  cieccccgci
         6  0.00  oeccccgci
         1  0.00  oroeccccgci
         1  0.00  ccgccccgci
         1  0.00  oecgccccgci
         1  0.00  occccgci
         2  0.00  ecccgci
         3  0.00  ciecccgci
         1  0.00  oeciecccgci
         3  0.00  oecccgci
         1  0.00  ccoecccgci
         1  0.00  cicccgci
         1  0.00  ccgccgci
         1  0.00  cecgci
         8  0.00  ciecgci
         1  0.00  cccciecgci
         1  0.00  eciecgci
         1  0.00  cccgciecgci
         5  0.00  oecgci
         5  0.00  coecgci
         1  0.00  ccoecgci
         1  0.00  cccoecgci
         1  0.00  ciecgcgci
         5  0.00  cicgci
         1  0.00  ccocgci
         1  0.00  circgci
         9  0.00  circi
         1  0.00  ccirci
         1  0.00  ciecirci
         1  0.00  ccccgcirci
         1  0.00  eorci
         1  0.00  cirorci
       285  0.07  cim
         1  0.00  ccccim
         1  0.00  ciecim
         1  0.00  oecim
         2  0.00  cgcim
         2  0.00  ccgcim
         9  0.00  cccgcim
         1  0.00  cecgcim
         1  0.00  cocgcim
         2  0.00  orcim
        10  0.00  om
        98  0.02  cin
         1  0.00  ccin
         1  0.00  ccgcin
         1  0.00  orcin
         5  0.00  o
         1  0.00  co
         1  0.00  cco
         2  0.00  cieo
         1  0.00  coeo
         1  0.00  ccoeo
         1  0.00  cer
       185  0.04  cir
        11  0.00  ccir
        12  0.00  cccir
         1  0.00  ccccir
         1  0.00  ciecir
         1  0.00  cgcir
        11  0.00  ccgcir
         7  0.00  cccgcir
         8  0.00  ccccgcir
         1  0.00  ecgcir
         1  0.00  ocgcir
         1  0.00  oecircir
        51  0.01  or
         7  0.00  cor
        25  0.01  ccor
         2  0.00  cccor
         1  0.00  ccccor
         1  0.00  cgcieccor
         2  0.00  eor
         5  0.00  cieor
         1  0.00  cccieor
         1  0.00  cgcieor
         1  0.00  ccgcieor
         2  0.00  oeor
         1  0.00  coeor
         1  0.00  ccgor
         1  0.00  cccgor
         1  0.00  cciror

  These are the significant ones (15 or more occurrences):
  
        19  0.00  e
       257  0.06  cie
       127  0.03  oe
        36  0.01  coe
        45  0.01  ccoe
        25  0.01  cccoe
        18  0.00  cif
        18  0.00  cccg
        15  0.00  ccccg
       271  0.06  ci
       260  0.06  cci
       343  0.08  ccci
       172  0.04  cccci
        26  0.01  ccccci
        21  0.00  cieci
        21  0.00  cgci
       398  0.09  ccgci
       746  0.18  cccgci
       325  0.08  ccccgci
        18  0.00  cccccgci
       285  0.07  cim
        98  0.02  cin
       185  0.04  cir
        51  0.01  or
        25  0.01  ccor
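
  A minimal sketch of how such a cut could be made mechanically,
  assuming the three-column "count fraction word" layout seen in the
  listings above.  The name keep_frequent is hypothetical; it is not
  one of the notebook's scripts, and the sample rows are copied from
  the table.

```shell
# keep_frequent MIN: read "count fraction word" lines on stdin and
# print, unchanged, those whose count is at least MIN.
# Separator lines ("-----") and the "TOT" line are skipped.
keep_frequent() {
  awk -v min="$1" '$1 ~ /^[0-9]+$/ && $3 != "TOT" && $1 + 0 >= min + 0'
}

# Sample rows copied from the reverse-order listing.
printf '%s\n' \
  '        19  0.00  e' \
  '         6  0.00  ce' \
  '       257  0.06  cie' \
  '        12  0.00  ccie' \
  '        15  0.00  ccccg' \
  | keep_frequent 15
# prints the rows for "e", "cie" and "ccccg"
```

  Since awk prints each selected line as it came in, the column
  layout of the listing is preserved.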

  Stripping the leading "[coe]c*" from every suffix:
  
    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/^.*:[coe]c*//g' \
      | wfreq

      1513  0.36  gci
      1080  0.26  i
       286  0.07  im
       281  0.07  ie
       209  0.05  ir
       135  0.03  e
       110  0.03  oe
        99  0.02  in
        56  0.01  
        51  0.01  r
        42  0.01  g
        37  0.01  or
        29  0.01  gcir
        25  0.01  ieci
        25  0.01  gcie
        23  0.01  if
        13  0.00  gcim
        10  0.00  m
        10  0.00  irci
        10  0.00  iecgci
         8  0.00  eci
         7  0.00  oecgci
         6  0.00  ieor
         6  0.00  goe
         6  0.00  ecgci
         6  0.00  eccccgci
         5  0.00  icgci
         5  0.00  ecccci
         4  0.00  ieoe
         4  0.00  ieccccgci
         4  0.00  ic
         4  0.00  gcif
         3  0.00  oeci
         3  0.00  iecccgci
         3  0.00  gccci
         3  0.00  ecccgci
         2  0.00  roe
         2  0.00  rcim
         2  0.00  oeo
         2  0.00  oecccci
         2  0.00  o
         2  0.00  iroe
         2  0.00  ircie
         2  0.00  iie
         2  0.00  ieo
         2  0.00  gor
         2  0.00  gcieor
         2  0.00  eor
         2  0.00  eccci
         1  0.00  roeccccgci
         1  0.00  rcin
         1  0.00  rcie
         1  0.00  rccci
         1  0.00  orci
         1  0.00  of
         1  0.00  oeor
         1  0.00  oeoe
         1  0.00  oecicg
         1  0.00  oecccgci
         1  0.00  ocgcim
         1  0.00  ocgci
         1  0.00  irorci
         1  0.00  iror
         1  0.00  irof
         1  0.00  iroeof
         1  0.00  ircif
         1  0.00  ircgci
         1  0.00  ircccci
         1  0.00  ircccccgcie
         1  0.00  ioc
         1  0.00  ieoeci
         1  0.00  iecirci
         1  0.00  iecircg
         1  0.00  iecir
         1  0.00  iecim
         1  0.00  iecie
         1  0.00  iecgcgci
         1  0.00  iecg
         1  0.00  iecce
         1  0.00  ieccci
         1  0.00  iecccci
         1  0.00  ieccccg
         1  0.00  iec
         1  0.00  icccgci
         1  0.00  gcircie
         1  0.00  gcirci
         1  0.00  gcircccci
         1  0.00  gcioe
         1  0.00  gcin
         1  0.00  gciecgci
         1  0.00  gcieccor
         1  0.00  gcci
         1  0.00  gccgci
         1  0.00  gccccgci
         1  0.00  er
         1  0.00  eof
         1  0.00  eoe
         1  0.00  ecircir
         1  0.00  ecim
         1  0.00  eciecccgci
         1  0.00  ecgcim
         1  0.00  ecgcie
         1  0.00  ecgccccgci
         1  0.00  ecccoe
         1  0.00  ecccgcie
         1  0.00  ecccg
         1  0.00  eccccg
         1  0.00  ecccccgci
         1  0.00  ec
     -----  ----  ----
      4207  1.00  TOT

  The significant ones (mostly more than 40 cases), with "_" standing for the empty string:
  
        56  0.01  _
       135  0.03  e
        42  0.01  g
      1513  0.36  gci
      1080  0.26  i
       281  0.07  ie
        23  0.01  if
       286  0.07  im
        99  0.02  in
       209  0.05  ir
       110  0.03  oe
        37  0.01  or
        51  0.01  r
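
  To make the effect of the sed step concrete, here is a small
  sketch run on a few sample lines.  It assumes that
  split-prefix-suffix writes each word as "prefix:suffix" (which is
  what the egrep ':' filter suggests); the function name strip_core
  and the sample pairs are hypothetical.

```shell
# strip_core: drop everything up to the colon, then one letter from
# [coe] and any run of "c" that follows it (the same substitution as
# in the pipeline above).
strip_core() {
  sed -e 's/^.*:[coe]c*//'
}

printf '%s\n' 'AH:cccgci' 'oH:cci' 'c:cie' 'e:oe' 'cg:ccc' | strip_core
# prints: gci, i, ie, e, and an empty line
```

  The last sample shows where the empty entry comes from: suffixes
  such as "ccc", "c", "o" or "e" are consumed entirely by "[coe]c*".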
  
  Redefined split-prefix-suffix accordingly:
  
    SUFFS = "([coe]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$"
  
  Let's see what prefixes we get now:
  
    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/:.*$//g' \
      | wfreq
      
      1000  0.25  AH
       921  0.23  c
       412  0.11  oH
       307  0.08  cg
       194  0.05  e
       173  0.04  cccH
       147  0.04  oe
       134  0.03  H
       107  0.03  ccccH
       103  0.03  oeH
        65  0.02  r
        51  0.01  ciH
        50  0.01  ci
        40  0.01  eH
        40  0.01  P
        38  0.01  Ae
        34  0.01  oP
        21  0.01  cH
        21  0.01  AP
        19  0.00  ccH
        11  0.00  AeH
        10  0.00  coeH
         9  0.00  eccccH
         8  0.00  cgciH
         5  0.00  ecccH
         2  0.00  oecccH
     -----  ----  ----
      3922  1.00  TOT
      
  Keeping only the significant ones (more than 20 occurrences):
  
      1000  0.25  AH
        21  0.01  AP
        38  0.01  Ae
       134  0.03  H
        40  0.01  P
       921  0.23  c
        21  0.01  cH
       173  0.04  cccH
       107  0.03  ccccH
       307  0.08  cg
        50  0.01  ci
        51  0.01  ciH
       194  0.05  e
        40  0.01  eH
       412  0.11  oH
        34  0.01  oP
       147  0.04  oe
       103  0.03  oeH
        65  0.02  r
  
  Fixing split-prefix-suffix:
  
    PREFS = "^(AH|AP|Ae|H|P|c|cH|cccH|ccccH|cg|ci|ciH|e|eH|oH|oP|oe|oeH|r)"
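
  The split-prefix-suffix script itself is not reproduced in this
  notebook.  As a rough sketch only, here is one way the two patterns
  could be combined, assuming a word is split only when it consists
  of exactly one listed prefix followed by one listed suffix.  The
  function name split_word is hypothetical, and the empty alternative
  in SUFFS is written as an optional group for portability.

```shell
# Alternatives copied from the PREFS and SUFFS definitions above.
PREFS='AH|AP|Ae|H|P|c|cH|cccH|ccccH|cg|ci|ciH|e|eH|oH|oP|oe|oeH|r'
SUFFS='[coe]c*(e|g|gci|i|ie|if|im|in|ir|oe|or|r)?'

# split_word: rewrite "prefixsuffix" as "prefix:suffix"; words that
# do not fit the pattern are passed through unchanged.
split_word() {
  sed -E "s/^(${PREFS})(${SUFFS})\$/\1:\2/"
}

printf '%s\n' 'AHcccgci' 'oHcci' 'cgcim' 'cir' | split_word
# prints: AH:cccgci, oH:cci, cg:cim, cir
```

  The last word stays unsplit because neither "c" + "ir" nor
  "ci" + "r" gives a suffix that starts with one of [coe].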
  
  Listing the suffixes again, with the leading "[coe]c*" stripped:
  
    cat bio-j-huc-gut.wds \
      | split-prefix-suffix \
      | egrep ':' \
      | sed -e 's/^.*:[coe]c*//g' \
      | wfreq

      1497  0.39  gci
      1042  0.27  i
       284  0.07  im
       278  0.07  ie
       208  0.05  ir
       132  0.03  e
       109  0.03  oe
        99  0.03  in
        56  0.01  
        51  0.01  r
        42  0.01  g
        37  0.01  or
        23  0.01  if
     -----  ----  ----
      3858  1.00  TOT

  The complete suffixes:
  
       736  0.19  cccgci
       392  0.10  ccgci
       333  0.09  ccci
       325  0.08  ccccgci
       283  0.07  cim
       257  0.07  ci
       254  0.07  cie
       247  0.06  cci
       184  0.05  cir
       171  0.04  cccci
       124  0.03  oe
        98  0.03  cin
        51  0.01  or
        45  0.01  ccoe
        35  0.01  coe
        26  0.01  ccccci
        25  0.01  ccor
        25  0.01  cccoe
        21  0.01  cgci
        19  0.00  e
        18  0.00  cif
        18  0.00  cccg
        18  0.00  cccccgci
        15  0.00  ccccg
        13  0.00  ccc
        12  0.00  ccie
        12  0.00  cccir
        11  0.00  ccir
        10  0.00  cccie
         7  0.00  cor
         6  0.00  ce
         6  0.00  ccg
         5  0.00  o
         5  0.00  cccc
         5  0.00  c
         4  0.00  ccccc
         3  0.00  eci
         3  0.00  ccif
         3  0.00  cc
         2  0.00  eor
         2  0.00  eoe
         2  0.00  eccci
         2  0.00  ecccgci
         2  0.00  eccccgci
         2  0.00  cce
         2  0.00  cccor
         2  0.00  ccccoe
         2  0.00  ccccif
         2  0.00  ccccie
         1  0.00  ocg
         1  0.00  occci
         1  0.00  occccgci
         1  0.00  ecccci
         1  0.00  eccccc
         1  0.00  ec
         1  0.00  cg
         1  0.00  ccin
         1  0.00  ccccor
         1  0.00  ccccir
         1  0.00  ccccim
         1  0.00  cccccg
         1  0.00  cccccci
     -----  ----  ----
      3858  1.00  TOT
  
  Removed the suffixes beginning with "e", by dropping "e" from the leading character class:
  
    SUFFS = "([co]c*(|e|g|gci|i|ie|if|im|in|ir|oe|or|r))$"
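
  A small sketch of what this change does to the suffix that gets
  recognized at the end of a word.  It looks only at the suffix
  pattern, ignoring PREFS; the function name longest_suffix is
  hypothetical, and "grep -o" (available in GNU and BSD grep) is used
  here as a convenient way to print the matched part.

```shell
# Suffix endings copied from the SUFFS definitions.
ENDS='e|g|gci|i|ie|if|im|in|ir|oe|or|r'

# longest_suffix CLASS: print the longest tail of each input word
# that matches CLASS c* ENDS.  grep reports the leftmost match, and
# with the "$" anchor the leftmost match is the longest tail.
longest_suffix() {
  grep -oE "${1}c*(${ENDS})?\$"
}

echo 'oeccccgci' | longest_suffix '[coe]'   # prints: eccccgci
echo 'oeccccgci' | longest_suffix '[co]'    # prints: ccccgci
```

  With the narrower class, the leading "e" is no longer part of the
  suffix; it has to be accounted for by the prefix, or the word goes
  unsplit.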
    
  Tabulating prefixes:
  
       998  0.26  AH
       920  0.24  c
       410  0.11  oH
       302  0.08  cg
       194  0.05  e
       173  0.05  cccH
       145  0.04  oe
       134  0.04  H
       107  0.03  ccccH
       103  0.03  oeH
        65  0.02  r
        51  0.01  ciH
        40  0.01  P
        38  0.01  eH
        38  0.01  Ae
        34  0.01  oP
        29  0.01  ci
        21  0.01  cH
        21  0.01  AP
     -----  ----  ----
      3823  1.00  TOT
  
  It looks like "P" is equivalent to "H"...
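
  If one wanted to act on this hunch, the P prefixes could be folded
  into their H counterparts.  The sketch below is only an
  illustration of that idea, reading "equivalent" as "can be counted
  together"; the function name merge_p_into_h is hypothetical, and
  the sample rows are copied from the prefix tabulation above.

```shell
# merge_p_into_h: read "count fraction prefix" lines, rewrite a
# final "P" in the prefix as "H", and add up the counts per prefix.
merge_p_into_h() {
  awk '{ key = $3; sub(/P$/, "H", key); sum[key] += $1 }
       END { for (k in sum) print sum[k], k }' \
    | sort -k1,1nr -k2,2
}

printf '%s\n' \
  '998  0.26  AH' \
  '410  0.11  oH' \
  '134  0.04  H' \
  ' 40  0.01  P' \
  ' 34  0.01  oP' \
  ' 21  0.01  AP' \
  | merge_p_into_h
# prints: 1019 AH, 444 oH, 174 H
```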

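  Recomputing the prefix/suffix table (one row per suffix, one column per prefix; each pass of the loop joins one more column of counts onto .table):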
  
    /bin/rm -f .title
    /bin/rm -f .table
    /bin/touch .table
    
    set noglob
    set ofmt = "0"
    set npat = 1
    foreach pat ( \
      'AH'    'c'     'oH'    'cg'    \
      'e'     'cccH'  'oe'    'H'     \
      'ccccH' 'oeH'   'r'     'ciH'   \
      'P'     'eH'    'Ae'    'oP'    \
      'ci'    'cH'    'AP'     \
    )
      /n/gnu/bin/printf " %7s" "${pat}" >> .title
      /bin/cat bio-j-huc-gut.wds \
        | /n/gnu/bin/sed -e 's/^/_/g' -e 's/$/_/g' \
        | /n/gnu/bin/egrep "_${pat}[co][^HPA]*_" \
        | /n/gnu/bin/sed -e "s/_${pat}//g" -e '/../s/_$//g' \
        | /n/gnu/bin/sort | uniq -c \
        | /n/gnu/bin/expand \
        > .suff.frq

      /n/gnu/bin/join -a 1 -a 2 -1 1 -2 2 -o"${ofmt},2.1" -e 0 .table .suff.frq > .tmp
      /bin/mv .tmp .table
      @ npat = ${npat} + 1
      set ofmt = "${ofmt},1.${npat}"
    end
    unset noglob
    
    /n/gnu/bin/printf "\n" >> .title
    
    cat .table \
      | /n/gnu/bin/gawk '/./ { s=0; for(i=2;i<=NF;i++) s+=$(i); print s, $0 }' \
      | sort -nr \
      > .tbsort
    
    cat .title .tbsort \
      | format-suffix-table
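
  The row-total step (the gawk line before the sort) can be tried on
  its own.  The sketch below uses plain awk and three sample rows
  whose counts are the AH, oH and c entries of the table that
  follows; the function name add_row_totals is hypothetical.

```shell
# add_row_totals: for each non-empty row "suffix n1 n2 ...", prepend
# the sum of the counts, then sort by that total, largest first.
add_row_totals() {
  awk '/./ { s = 0; for (i = 2; i <= NF; i++) s += $i; print s, $0 }' \
    | sort -nr
}

printf '%s\n' 'cci 43 22 14' 'ccgci 201 85 21' 'ci 79 28 1' \
  | add_row_totals
# prints:
#   307 ccgci 201 85 21
#   108 ci 79 28 1
#   79 cci 43 22 14
```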

                 TOTAL    AH    oH     c     H   oeH   ciH    cH    eH  cccH ccccH    cg     P     e    oe    Ae     r    oP    ci    AP
                 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    SUFFIX \ TOT  4104  1038   446   998   148   105    53    21    43   173   109   338    63   212   150    38    73    40    34    22
    ------------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    ci             257    79    28     1     2    10     1     1     2    31    26    36     0     6    22     7     2     2     0     1
    cci            247    43    22    14     2     4     2     1     1    90    68     0     0     0     0     0     0     0     0     0
    ccci           333    87    35   139     5    19    10     6     3    11     4     2     0     5     4     2     1     0     0     0
    cccci          171    12    13    61    10     3     0     0     1     1     0     3     3    20    19     9     5     4     5     2
    ccccci          26     1     1     1     1     1     1     0     0     0     0     3     0     6     4     0     1     2     4     0
    cccccci          1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    
    cgci            21     1     0     0     0     0     0     0     0     0     0     0     0     8     9     1     0     0     1     1
    ccgci          392   201    85    21    28    20    12     3     8    10     2     2     0     0     0     0     0     0     0     0
    cccgci         736   191    57   387    16    15    12     4     7    15     5     5     3     7     6     1     1     2     0     2
    ccccgci        325    19    10    60    18     2     1     0     2     0     0    27    17    73    38     7    10    18    12    11
    cccccgci        18     2     0     0     0     0     0     0     0     0     0     2     0     6     1     3     2     0     2     0
    
    cim            283    94    41    30     7     7     4     0     6     2     0    72     0     2     3     2    13     0     0     0
    ccccim           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    
    cie            254   116    40    14    10     5     3     1     0     1     0    50     0     3     2     2     6     1     0     0
    ccie            12     1     0     7     0     0     0     1     0     1     2     0     0     0     0     0     0     0     0     0
    cccie           10     0     1     7     1     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0
    ccccie           2     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    
    cir            184    50    36    11     8     7     2     1     4     3     0    51     1     2     1     1     3     2     1     0
    ccir            11     1     0    10     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccir           12     0     1     8     1     1     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    ccccir           1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    
    cin             98    54    16     0     2     5     2     0     2     0     0    12     0     0     3     1     1     0     0     0
    ccin             1     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
    
    cif             18     4     0     0     1     1     0     0     0     2     0     6     0     1     0     0     3     0     0     0
    ccif             3     0     0     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    
    oe             124    21    10    25     6     1     0     0     1     0     0    17     8    16     7     1     9     1     0     1
    coe             35     1     6    23     3     1     0     1     0     0     0     0     0     0     0     0     0     0     0     0
    ccoe            45     1     1    33     4     0     0     0     0     0     0     3     2     0     0     0     0     0     0     1
    cccoe           25     0     0     4     3     0     1     0     0     0     0     4     4     3     3     1     0     0     1     1

    or              51     4     2    10     4     0     0     0     0     0     0     3     0    10    13     0     3     1     0     1
    cor              7     4     0     1     0     0     0     1     0     1     0     0     0     0     0     0     0     0     0     0
    ccor            25     0     1    17     0     0     0     0     0     0     0     2     2     3     0     0     0     0     0     0
    cccor            2     0     0     1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    ccccor           1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    
    om              10     1     0     5     0     1     0     0     0     0     0     0     1     0     0     0     2     0     0     0
    
    cccg            18     5     3     7     2     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ccccg           15     1     0     1     0     0     0     0     0     1     0     0     0     5     3     0     2     1     1     0
    
    
    cieci           21     7     4     1     1     0     0     0     0     0     0     6     0     0     0     0     1     1     0     0
    cccgcie         13     2     1     9     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1
    ccc             13     1     0    10     0     0     0     0     0     0     0     0     0     1     0     0     1     0     0     0
    ccgcir          11     4     4     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgcim          9     0     0     9     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeci             8     0     0     1     0     0     0     0     0     0     0     0     0     4     0     0     1     2     0     0
    circi            8     1     0     0     1     0     0     0     0     0     0     5     0     0     0     0     1     0     0     0
    ciecgci          8     3     2     0     0     0     0     0     0     0     0     3     0     0     0     0     0     0     0     0
    ccccgcir         8     0     0     1     1     0     0     0     0     0     0     0     4     1     0     0     0     0     1     0
    cccgcir          7     0     0     4     0     0     2     0     0     0     0     0     0     0     1     0     0     0     0     0
    oeccccgci        6     1     0     2     0     0     0     0     0     0     0     1     2     0     0     0     0     0     0     0
    ccg              6     3     1     0     0     1     0     0     0     1     0     0     0     0     0     0     0     0     0     0
    ce               6     0     0     6     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcie           6     2     2     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecgci           5     1     1     1     1     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oecccci          5     0     0     3     0     0     0     0     0     0     0     0     1     0     1     0     0     0     0     0
    o                5     0     0     0     0     0     0     0     0     0     0     0     0     4     1     0     0     0     0     0
    coecgci          5     0     1     3     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    cieor            5     0     4     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    cicgci           5     3     1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccc             5     0     0     4     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    c                5     0     0     1     0     0     0     0     0     2     0     0     0     0     2     0     0     0     0     0
    cieoe            4     1     2     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cieccccgci       4     0     1     0     0     0     0     0     0     0     0     2     0     0     0     0     0     1     0     0
    ccccc            4     0     0     0     0     0     0     0     0     0     0     1     0     1     1     0     0     0     1     0
    oecccgci         3     0     0     0     1     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0
    ciecccgci        3     0     1     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0     0
    cic              3     2     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccieci          3     0     0     2     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0
    cccgoe           3     0     0     3     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccccgcie         3     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     1     0     0
    cc               3     1     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0
    oroe             2     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     1     0     0     0
    orcim            2     0     0     0     0     0     0     0     0     0     0     1     0     1     0     0     0     0     0     0
    oeor             2     0     0     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oeccci           2     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0
    coeci            2     1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circie           2     0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciie             2     1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cieo             2     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0     0     0
    cgoe             2     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     0     0     0
    cgcim            2     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     1     0
    ccgcim           2     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcif           2     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cce              2     0     0     0     0     0     0     0     0     0     0     0     0     2     0     0     0     0     0     0
    cccgcif          2     0     0     1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgccci         2     0     0     1     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0
    ccccoe           2     0     0     1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ccccif           2     0     0     0     0     0     0     0     0     0     0     0     0     1     1     0     0     0     0     0
    oroeccccgci      1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    orcin            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    orcie            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    orccci           1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeof             1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    oeoe             1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecircir         1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecim            1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oeciecccgci      1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecgcie          1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecgccccgci      1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccoe          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oecccgcie        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccg           1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    oeccccg          1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    oecccccgci       1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ocgcir           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ocgcie           1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ocg              1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    occci            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    occccgci         1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    coeor            1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    coeo             1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    coecccci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cocgcim          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    co               1     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0
    cirorci          1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cirof            1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciroeof          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circif           1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circgci          1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    circccci         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    circccccgcie     1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cioc             1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cino             1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cimci            1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    cifo             1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ciecirci         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecircg         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecir           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecim           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecie           1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ciecgcgci        1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciecce           1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    cieccci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciecccci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cieccccg         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ciec             1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cicccgci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cgcircie         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cgcircccci       1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cgcir            1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cgcieor          1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cgcieccor        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    cgcie            1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cg               1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cer              1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cecgcim          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cecgci           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cec              1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoeoe           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoeo            1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecicg         1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    ccoecgci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecccgci       1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccoecccci        1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccocgci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cco              1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cciror           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccirci           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccieoeci         1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccieci           1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0
    ccgor            1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgoe            1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcioe          1     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcin           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcieor         1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgcci           1     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0     0     0     0
    ccgccgci         1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgccci          1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccgccccgci       1     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccoecgci        1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    ccciroe          1     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0
    cccieor          1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgor           1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccgciecgci      1     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    cccciecgci       1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
    cccciecg         1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0
    ccccic           1     0     1     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    ccccgcirci       1     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0
    cccccgcie        1     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0     0     0     0
    cccccg           1     0     0     0     0     0     0     0     0     0     0     0     0     0     1     0     0     0     0     0

  Redefined split-prefix-suffix with only the significant suffixes:
  
    SUFFS = "(ccc3ci|cc3ci|ccci|cccc3ci|cim|ci|cix|cci|ci2|cccci|ox|cin|o2|ccox|cox|ccccci|cccox|cco2)$"
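
  A minimal Python sketch of how a pattern like SUFFS can split a word
  into prefix and suffix.  The actual split-prefix-suffix script is not
  reproduced in this notebook, so the function name and the handling of
  words with no listed suffix (empty suffix) are assumptions.

```python
import re

# Suffix pattern copied from the notebook entry above.
SUFFS = re.compile(
    r"(ccc3ci|cc3ci|ccci|cccc3ci|cim|ci|cix|cci|ci2|cccci|ox|cin|o2"
    r"|ccox|cox|ccccci|cccox|cco2)$"
)

def split_prefix_suffix(word):
    """Return (prefix, suffix); suffix is "" when no listed suffix matches.

    Hypothetical helper: re.search scans from the left, so the first
    position where some alternative reaches the end of the word gives
    the longest listed suffix.
    """
    m = SUFFS.search(word)
    if m is None:
        return word, ""
    return word[:m.start()], m.group(1)

# Words taken from the count table above.
for w in ["circccci", "cimci", "cocgcim", "coeor"]:
    print(w, split_prefix_suffix(w))
```

  Because the pattern is anchored with "$", the order of the
  alternatives does not matter here: the leftmost successful start
  position always wins.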


97-07-18 stolfi
===============

  It occurred to me that the \s/ plume may be a stress mark (one that
  shifts when the word is inflected).  So I decided to redo the
  analysis for Portuguese, comparing accented "o" with plain "o", etc.:
  
    cat port.wds \
      | sed -e 's/[óô]/ó/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[o][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[ó][aeiouáéíóúàâêôü]*'

      6143  0.93  o        125  0.99  ó                  
       135  0.02  io         1  0.01  eó                 
       112  0.02  ou     -----  ----  ----               
        75  0.01  oi       126  1.00  TOT                
        55  0.01  ao                                     
        36  0.01  eo                                     
        32  0.00  oo                                     
        12  0.00  aio                                    
         5  0.00  eio                                    
         4  0.00  oa                                     
         2  0.00  oá                                     
         2  0.00  oe                                     
     -----  ----  ----                                   
      6613  1.00  TOT                                    
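
  The compare-contexts tool is my own script and is not listed here.
  The following Python sketch is only a guess at the kind of tally it
  produces: it counts the vowel clusters that contain a given letter
  and prints each count with its relative frequency.  The names
  cluster_counts and fold, and the toy word list, are invented for
  illustration.

```python
import re
from collections import Counter

VOWELS = "aeiouáéíóúàâêôü"  # vowel class used in the notebook patterns

def cluster_counts(words, letter):
    """Count the vowel clusters containing `letter`, over all words."""
    pat = re.compile(f"[{VOWELS}]*{letter}[{VOWELS}]*")
    counts = Counter()
    for w in words:
        counts.update(m.group(0) for m in pat.finditer(w))
    return counts

def fold(word):
    """Merge ô into ó, like the sed step in the pipeline."""
    return word.replace("ô", "ó")

# Toy word list, invented for illustration (not taken from port.wds).
words = [fold(w) for w in ["rio", "ouro", "avó", "avô", "pouco", "só"]]
plain = cluster_counts(words, "o")
accented = cluster_counts(words, "ó")

for name, counts in (("plain", plain), ("accented", accented)):
    total = sum(counts.values())
    for cluster, n in counts.most_common():
        print(f"{name:9s} {n:5d} {n / total:5.2f}  {cluster}")
```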


    cat port.wds \
      | sed -e 's/[áâ]/á/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[a][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[á][aeiouáéíóúàâêôü]*'
          
  
      6878  0.87  a                     258  0.75  á
       448  0.06  ia                     84  0.24  iá
       224  0.03  ua                      2  0.01  uá
       134  0.02  ai                      2  0.01  oá
        69  0.01  ea                  -----  ----  ----
        55  0.01  ao                    346  1.00  TOT
        23  0.00  uai                
        23  0.00  au                 
        12  0.00  aio                
         8  0.00  iai                
         4  0.00  éia                 
         4  0.00  oa                 
         3  0.00  ía                  
         3  0.00  eia                
         3  0.00  eai                
         2  0.00  aí                  
         2  0.00  aue                
         2  0.00  ae                 
     -----  ----  ----               
      7897  1.00  TOT                

    cat port.wds \
      | sed -e 's/[êé]/é/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '[aeiouáéíóúàâêôü]*[e][aeiouáéíóúàâêôü]*' \
          '[aeiouáéíóúàâêôü]*[é][aeiouáéíóúàâêôü]*'
          
      7385  0.89  e                     651  0.97  é
       420  0.05  ue                      8  0.01  üé
       196  0.02  ie                      7  0.01  ié
       117  0.01  ei                      4  0.01  éia
        69  0.01  ea                      1  0.00  éi
        61  0.01  eu                  -----  ----  ----
        36  0.00  eo                    671  1.00  TOT
         5  0.00  eio
         3  0.00  eia
         3  0.00  eai
         2  0.00  üe
         2  0.00  oe
         2  0.00  aue
         2  0.00  ae
         1  0.00  eó
         1  0.00  eí
         1  0.00  eiú
         1  0.00  ee
     -----  ----  ----
      8307  1.00  TOT

  These tables are skewed by the one-letter words "a", "é", "e", "o", etc.
  So let's require at least one more letter after the vowel group, and
  show one character of left context ("_" marks a word boundary):

    cat port.wds \
      | sed \
         -e 's/[áâ]/á/g' \
         -e 's/^/_/g' \
         -e 's/$/_/g' \
      | compare-contexts -lctx 1 -rctx 0 \
          '[a][aeiouáéíóúàâêôü]*[a-záéíóúàâêôü]' \
          '[á][aeiouáéíóúàâêôü]*[a-záéíóúàâêôü]'

       284  0.06  par                    73  0.25  ián
       211  0.04  _ar                    36  0.12  _án
       151  0.03  _as                    29  0.10  _ár
       148  0.03  cad                    15  0.05  tán
       144  0.03  lad                    12  0.04  rár
       128  0.03  tas                    12  0.04  rám
       116  0.02  dad                    11  0.04  sár
       111  0.02  tan                    10  0.03  ráf
       110  0.02  _ap                    10  0.03  cál
       101  0.02  rad                     8  0.03  iáv
        95  0.02  das                     7  0.02  mát
        77  0.02  lar                     7  0.02  lás
        76  0.02  fac                     6  0.02  vár
        74  0.02  as                      5  0.02  jáv
        72  0.02  _al                     5  0.02  fác
        70  0.01  tam                     4  0.01  táv
        68  0.01  nal                     4  0.01  ráp
        66  0.01  mais                    4  0.01  pán
        61  0.01  ual                     4  0.01  nál
        58  0.01  tad                     3  0.01  máx
        56  0.01  cas                     3  0.01  lát
        55  0.01  ram                     3  0.01  iám
        48  0.01  car                     3  0.01  cáv
        47  0.01  tar                     2  0.01  uár
        46  0.01  nas                     2  0.01  tár
        46  0.01  mas                     2  0.01  rát
        46  0.01  ias                     1  0.00  tág
        46  0.01  _ao                     1  0.00  ráv
        45  0.01  cal                     1  0.00  oáv
        44  0.01  tal                     1  0.00  oác
        43  0.01  ian                     1  0.00  nár
        41  0.01  am                      1  0.00  láv
        40  0.01  nad                     1  0.00  háv
        39  0.01  ran                     1  0.00  dáv
        37  0.01  sam                     1  0.00  cán
        36  0.01  lag                     1  0.00  bás
        36  0.01  ial                     1  0.00  _át
        35  0.01  _ad                 -----  ----  ----
        34  0.01  val                   291  1.00  TOT
        34  0.01  ras                
       ...  ....  ..... 
         1  0.00  ab                 
     -----  ----  ----               
      4785  1.00  TOT                
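
  A rough Python sketch of how the "_" boundary marks and one character
  of left context give entries such as "_ar", "par" and "mais".  This
  is an assumption about what -lctx 1 does, not the actual
  compare-contexts code; the function name contexts is hypothetical.

```python
import re

VOWELS = "aeiouáéíóúàâêôü"
LETTERS = "a-záéíóúàâêôü"

def contexts(word, vowel, lctx=1):
    """List matches of vowel + vowel cluster + one letter, each
    extended by `lctx` characters of left context."""
    marked = f"_{word}_"  # "_" marks the word boundaries
    pat = re.compile(f"{vowel}[{VOWELS}]*[{LETTERS}]")
    found = []
    for m in pat.finditer(marked):
        start = max(0, m.start() - lctx)
        found.append(marked[start:m.end()])
    return found

for w in ["para", "arte", "mais", "ao", "a"]:
    print(w, contexts(w, "a"))
```

  The one-letter word "a" yields nothing, since "_" is not in the
  letter class; that is the point of requiring one more letter.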

  Too many cases.  Let's keep only word-initial vowels followed by a consonant:
  
    cat port.wds \
      | sed \
         -e 's/[áâ]/á/g' \
         -e 's/^/_/g' \
         -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '_[a][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]' \
          '_[á][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]'

       211  0.27  _ar      36  0.55  _án
       151  0.19  _as      29  0.44  _ár
       110  0.14  _ap       1  0.02  _át
        72  0.09  _al   -----  ----  ----
        35  0.05  _ad      66  1.00  TOT
        32  0.04  _at                
        31  0.04  _an                
        28  0.04  _ac                
        26  0.03  _ab                
        15  0.02  _am                
        13  0.02  _aj                
         9  0.01  _aos               
         8  0.01  _aut               
         7  0.01  _ain               
         6  0.01  _av                
         6  0.01  _aq                
         6  0.01  _af                
         5  0.01  _aum               
         4  0.01  _ag                
         2  0.00  _aux               
     -----  ----  ----               
       777  1.00  TOT  

    cat port.wds \
      | sed \
         -e 's/[óô]/ó/g' \
         -e 's/^/_/g' \
         -e 's/$/_/g' \
      | compare-contexts -lctx 0 -rctx 0 \
          '_[o][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]' \
          '_[ó][aeiouáéíóúàâêôü]*[bcdfghjklmnpqrstvwxyz]'

       154  0.39  _os        7  0.70  _ót
        68  0.17  _or        2  0.20  _ób
        54  0.14  _ob        1  0.10  _ór
        39  0.10  _out   -----  ----  ----
        26  0.07  _ot       10  1.00  TOT
        26  0.07  _on                
        16  0.04  _op                
         9  0.02  _oc                
         2  0.01  _oit               
     -----  ----  ----               
       394  1.00  TOT      

  Not very impressive.  That is not surprising, however, considering
  that most stresses (especially those that shift) fall where the
  default stress rules put them, and so are not written as accents.
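
  To make the remark about default stress concrete, here is a small
  Python sketch of the usual Portuguese rule for unaccented words:
  those ending in a, e, o (optionally plus s), or in em, ens, am, are
  stressed on the penultimate syllable, and the others on the last
  one.  It is a simplification (it ignores the tilde, monosyllables
  and words that already carry an accent), and the name default_stress
  is invented for this note.

```python
# Endings that make an unaccented word penultimate-stressed by default.
PENULT_ENDINGS = ("a", "as", "e", "es", "o", "os", "em", "ens", "am")

def default_stress(word):
    """Return "penultimate" or "final" for an unaccented word.

    Simplified sketch: looks only at the spelling of the word ending.
    """
    return "penultimate" if word.endswith(PENULT_ENDINGS) else "final"

for w in ["casa", "homem", "falam", "falar", "animal", "aqui"]:
    print(w, default_stress(w))
```

  A written accent is needed mainly when the stress departs from this
  default, which is why the accented forms are so few in the tables.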