98-03-09 stolfi
===============

  [ Lots of work on PGMTextFilter and PGMRankFilter omitted. ]
  
  Recreating the images in the norm-filter site:
  
    list .gifs

    --- .gifs ------------------------
    test1-4-05-05-4-95-95-nf
    test1-4-05-05-4-50-80-nf
    test1-6-05-05-6-95-95-nf
    test1-4-05-05-6-50-80-nf
    test1-6-05-05-6-50-80-nf
    test1-1-05-05-6-50-80-nf
    test1-1-05-05-1-50-80-nf
    test1-1-05-05-1-95-95-nf
    test1-3-05-05-3-50-80-nf
    test1-2-05-05-2-95-95-nf
    test1-2-05-05-2-50-80-nf
    test1-3-05-05-3-95-95-nf
    test1-4-05-05-9-50-80-nf
    test1-1-05-05-9-50-80-nf
    test1-4-05-05-6-50-80-hi
    test1-4-05-05-9-50-80-hi
    ----------------------------------

    --- test-nf ------------------------
    #! /bin/csh -f 

    set usage = "$0 DIR NAME LORAD LOFRAC LOCOLOR HIRAD HIFRAC HICOLOR OP"
    set notify

    if ( $#argv != 9 ) then
      echo "usage: ${usage}"; exit 1
    endif

    set dir = "$1"; shift;
    set name = "$1"; shift;
    set lorad = "$1"; shift;
    set lofrc = "$1"; shift
    set locol = "$1"; shift
    set hirad = "$1"; shift;
    set hifrc = "$1"; shift
    set hicol = "$1"; shift
    set op = "$1"; shift

    set otname = "${name}-${lorad}-${lofrc}-${locol}-${hirad}-${hifrc}-${hicol}-${op}"

    echo "${otname}.pgm"
    ( cat ${dir}/${name}.pgm \
      | nice PGMNormFilter \
         -write ${op} \
         -loRadius ${lorad} -loFraction 0.${lofrc} -loColor 0.${locol} \
         -hiRadius ${hirad} -hiFraction 0.${hifrc} -hiColor 0.${hicol} \
         -minWeight 1e-6 \
      > ${dir}/${otname}.pgm \
      && xv ${dir}/${otname}.pgm ) &
    ------------------------------------

    foreach f ( `cat .gifs` )
      echo $f
      if ( ! ( -r filter-gal/norm-filter/${f}.pgm ) ) then
        set parms = ( `echo $f | tr 'a-' 'a '` )
        test-nf filter-gal/norm-filter ${parms}
      endif
      cat filter-gal/norm-filter/${f}.pgm \
        | pnmdepth 255 \
        | ppmtogif \
        > filter-gal/norm-filter/${f}.gif
    end
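
  The loop above recovers the test-nf arguments from each image name
  with tr 'a-' 'a '.  Presumably the dummy "a" (which maps to itself)
  is only there so that the set does not begin with "-".  A minimal sh
  sketch of the same split; the function name split_parms is only
  illustrative and is not part of the original scripts:

```shell
#!/bin/sh
# split_parms: turn a gallery image name into the eight fields that
# test-nf expects after DIR (NAME LORAD LOFRAC LOCOLOR HIRAD HIFRAC
# HICOLOR OP).  "a" maps to "a", "-" maps to a blank.
split_parms() {
  echo "$1" | tr 'a-' 'a '
}

split_parms test1-4-05-05-4-95-95-nf
```

  This should print "test1 4 05 05 4 95 95 nf", which together with
  the directory gives the nine arguments that test-nf checks for.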

  In parallel I collected some EVA text in fine-structure.txt
  and separated its words by hand into "fine elements", 
  with these conventions:
  
    # Notation:
    #
    #   \     bogus line break
    #   /     legitimate word break
    #   -     bogus word break
    #   :     missing word break
    #   .     fine element separator
    #   (x)   missing "x" here
    #   (=y)  previous char(s) should be "y"
    #   w!    "w" is an anomalous word
    #   p{m}s "p" is prefix, "s" is suffix, "m" may be midfix
  
  I tried to use these fine elements:
  
    # Prefix elements:
    #
    #   q
    #   o     y
    #   ch    sh   
    #
    # Midfix elements:
    #
    #   d
    #   k     t
    #   ckh   cth 
    #   ke    te
    #   kee   tee
    #   che   she
    #   chee  shee 
    #   ch    ch
    #   sh    sh
    #   p     f     (polyvalent?)
    #   cph   cfh   (polyvalent?)
    #
    # Suffix elements:
    #
    #   o     y
    #   dy
    #   ar    or 
    #   al    ol
    #   ain   oin
    #   air   oir
    #   aiin  oiin
    #   aiiin oiiin
    #   aiir  oiir
    #   ch    ch
    #   sh    sh
    #   
    # Unsure still:
    #
    #   s     r

  I extracted the "words" into fine-structure.wds and
  tried to identify prefix, suffix, midfix:
  
    cat fine-structure.txt \
      | sed \
          -e '/^#/d' \
          -e 's/^<[^ ]*> *//g' \
      | tr -d ' .\!\\-' \
      | tr '/:' '\012\012' \
      | egrep '.' \
      > fine-structure.wds
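
  To see what the extraction does to the hand annotations, here is a
  minimal sh sketch of the same pipeline applied to one made-up line.
  The locator "<f0.1>", the sample text and the function name
  extract_words are all hypothetical; only the sed/tr steps come from
  the command above (the csh-only "\!" escape is written as plain "!").

```shell
#!/bin/sh
# extract_words: drop comment lines and the leading <locator>, delete
# blanks, element separators ".", "!" marks, bogus line breaks "\" and
# "-" marks, then break the text at "/" and ":" into one word per line.
extract_words() {
  sed \
    -e '/^#/d' \
    -e 's/^<[^ ]*> *//' \
  | tr -d ' .!\\-' \
  | tr '/:' '\n\n' \
  | grep '.'
}

# Hypothetical input, not copied from fine-structure.txt.
printf '%s\n' \
  '# a comment line' \
  '<f0.1> qo.ke.dy/ch.ol:d.aiin/o.t-ar\' \
  | extract_words
```

  On this input the sketch should print the four words qokedy, chol,
  daiin and otar, one per line.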

    factor-words fine-structure A
    
    factor-words bio B
    
    
  Checking distribution of word endings:
  
    cat hea-u.wds \
      | sed \
          -e 's/$/#/g' \
          -e 's/^/__/g' \
          -e 's/\(ii*[nr]\)#/-\1/g' \
          -e 's/\(.\)#/-\1/g' \
          -e 's/^.*-//g' \
      | sort | uniq -c | expand | sort +0 -1nr \
      > hea-u-ends.frq
      
      count ending
      ----- ------
        220 y
        158 r
        150 l
        101 iin
         62 s
         49 o
         39 m
         22 in
         10 d
          6 *
          6 a
          6 h
          5 ir
          4 iiin
          4 n
          3 iir
          2 e
          1 c
          1 i
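
  The ending rule used for this table can be tried on a few words in
  isolation.  The sketch below is a hedged sh restatement, not the
  original script: the function names and the sample words are made
  up, and the obsolete key syntax "sort +0 -1nr" is replaced by its
  modern spelling "-k1,1nr" (with the ending as a tie-breaker).

```shell
#!/bin/sh
# word_ending: reduce each input word to its ending with the same sed
# rules as above: a final run of "i" plus "n" or "r" is one ending,
# otherwise the ending is the last character.
word_ending() {
  sed \
    -e 's/$/#/' \
    -e 's/^/__/' \
    -e 's/\(ii*[nr]\)#/-\1/' \
    -e 's/\(.\)#/-\1/' \
    -e 's/^.*-//'
}

# ending_counts: print "count ending" lines, most frequent first.
ending_counts() {
  word_ending | LC_ALL=C sort | uniq -c | awk '{ print $1, $2 }' \
    | LC_ALL=C sort -k1,1nr -k2,2
}

# Hypothetical sample words, not taken from hea-u.wds.
printf '%s\n' daiin chedy dain okaiin shol dar | ending_counts
```

  For these six words the first output line should be "2 iin"; the
  endings "in", "l", "r" and "y" each occur once.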
    
  Last pair distribution:
  
    cat hea-u.wds \
      | egrep -v '[*]' \
      | egrep -v '(^s|es|c|i)$' \
      | sed \
          -e 's/[cs]h/C/g' \
          -e 's/c[ktpf]h/T/g' \
          -e 's/[ktpf]/t/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/^/_/g' -e 's/$/}3/g' \
          -e 's/\([ao]*i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\(i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\([ao]*[oy]\)}3/}2 \1/g' \
          -e 's/}3/}2 _/g' \
          -e 's/\([aoy]*[CSTKPFtkpfdrls]e*\)}2/}1 \1/g' \
          -e 's/\(.e*[aoy]*\)}2/}1 \1/g' \
          -e 's/^.*}1 //g' \
      | sort | uniq -c | expand | sort -b +2 -3 +0 -1nr \
      | gawk '/./{if($3!=o){printf "\n";o=$3;} print;}' \
      > hea-u-endp.frq
      
  It looks like the final -es is usually -ees or -eees, so it could be -sh with
  ligature omitted.  However, these groups are rather common and
  are not transcription errors.  Either ligatures are unimportant,
  or the IST is true...

  The other final -s could be separate words.
     
  The empty suffix:
  
    5 d _   
    5 od _  
    2 C _   
    2 T _   
    1 Ce _  
    1 oC _  
    1 ooT _ 
    1 te _  
  
  The "oi*[nrm]" family:

    42 d oiin      9 d oin    1 d oiiin     4 d oir    1 d oiim  1 d oim
    13 od oiin     5 od oin   1 od oiiin    1 od oir  
    10 ot oiin     3 ot oin   1 oot oiiin             
     8 t oiin      1 Ce oin   1 yt oiiin              
     7 C oiin      1 ol oin 
     3 Ce oiin     1 or oin 
     3 yt oiin     1 s oin  
     2 or oiin   
     2 s oiin    
     1 T oiin    
     1 yC oiin   
     1 yod oiin  
     1 yr oiin   

  The "y" and "o" suffixes.  Note that their distributions are neither
  very similar nor very different.  Perhaps the "o" suffixes are actually split words?
  
    49 C y    25 C o      2 C ooiin  
    33 od y    9 Ce o     2 t ooiin     
    19 T y     4 ol o     1 _ ooiin    
    17 d y     3 ot o     1 yt ooiin   
    17 ot y    2 T o      
    14 Ce y    2 d o      
    11 t y     2 oC o     
     9 ol y    2 od o     
     6 Cee y   1 l o      
     6 or y    1 oCe o    
     5 oT y    1 ote o    
     4 oC y    1 t o      
     3 Te y    1 yC o     
     3 tee y   1 yt o     
     3 yte y              
     2 ote y              
     2 r y                
     2 s y                
     2 yt y               
     1 _ y    
     1 de y   
     1 oTe y  
     1 ode y  
     1 oor y  
     1 yC y   
     1 yCe y  
     
  The "ol", "or", "om", "oy", and "os" suffixes.  Note that "os"
  doesn't belong. 
  
     57 C ol    67 C or     6 C om    5 C oy      2 C os    
     20 ot ol   14 ot or    5 ot om   1 ot oy     2 T os    
     19 T ol    13 Ce or    4 T om                2 _ os    
     17 d ol    13 d or     4 d om                2 yC os   
      5 Ce ol   11 T or     3 od om               1 Cee os  
      5 od ol    8 _ or     3 t om                1 l os    
      4 _ ol     5 t or     2 Ce om               1 ote os  
      4 ote ol   4 od or    1 oT om               1 s os    
      3 oC ol    3 s or     1 oor om          
      3 te ol    3 yC or    1 or om           
      2 s ol     3 yCe or   1 yC om           
      2 t ol     2 l or     1 yte om          
      1 Cee ol   2 oC or                      
      1 de ol    1 Cee or                     
      1 oT ol    1 oCe or                     
      1 ol ol    1 ol or                      
      1 otee ol  1 yt or                      
      1 r ol                                  
      1 yC ol                                 
  
  The double-"o"-"i" suffixes:
  
     1 _ ooiir  1 _ ooin  2 _ oor  
  
  And the complete misfits:
  
      1 or iin    1 C oiir   39 _ s    
                  1 _ oiir    2 Ce s   
      1 C l                
      1 _ l       2 C on   
                  1 Ce on  
      2 d m       1 T on   
      1 _ m                
      1 od m      1 C r    
                  1 _ r    
                  1 ot r   
  
  Distribution of last block ignoring suffixes:
  
    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/_/g' -e 's/$/}3/g' \
          -e 's/\([ao]*i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\(i*[rlmns]\)}3/}2 \1/g' \
          -e 's/\([ao]*[oy]\)}3/}2 \1/g' \
          -e 's/}3/}2 _/g' \
          -e 's/\([aoyc]*\)\([CSTKPFtkpfdrls]\)\(e*\)}2/}1 \1. \2 .\3/g' \
          -e 's/\([aoyc]*\)\(.\)\(e*\)}2/}1 \1. \2 .\3/g' \
          -e 's/^.*}1 //g' \
      | gawk '/./{print $1, $2, $3;}' \
      | sort | uniq -c | expand | sort -b +2 -3 +0 -1nr \
      | gawk '/./{if($3!=o){printf "\n";o=$3;} print;}' \
      | sed -e 's/[.] //' -e 's/ [.]//' \
      > hea-u-penp.frq
  
  Here is the distribution.  Note that "d" and "T" have basically the
  same [ao]*/e* environment.  Ditto for "C" and "S".  We can include
  "t": the patterns are the same, but the frequencies are very
  different.

       118 d      56 T     197 C     46 S      78 ot     5 P    5 op
        73 od      7 oT     39 Ce    14 Se     27 t      1 Pe   1 cp
         2 de      2 Te     20 oC     5 oS     12 ote  
         1 ode     1 oTe     7 oCe              4 te   
         1 ood     1 ooT                        4 ct   
                                                1 oot

        12 or     16 ol    11 s      63 _  
         2 oor     4 l             
         3 r     
     
  Let's now split the entire word:
  
    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\([ao]*i*[rlmns]\)#/}.\1/g' \
          -e 's/\(i*[rlmns]\)#/}.\1/g' \
          -e 's/\([ao]*[oy]\)#/}.\1/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([aoyc]*[CSTKPFtkpfdrls]e*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{}//g' \
      > hea-u-fact.fac

    cat hea-u-fact.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +0 -1nr \
      > hea-u-elem.frq
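
  The ":x ... tx" loop is what peels elements off the right end of the
  word, one per pass, until nothing is left between "{" and "}".  A
  minimal sh sketch that wraps the same sed script in a function so it
  can be tried on single words; the name factor_words and the sample
  words are made up for illustration:

```shell
#!/bin/sh
# factor_words: same sed script as above.  It normalizes the word,
# marks the termination, then repeatedly moves the rightmost element
# out of the {...} part until only "{}" remains.
factor_words() {
  sed \
    -e 's/eee/che/g' -e 's/ee/ch/g' \
    -e 's/ch/C/g' -e 's/sh/S/g' \
    -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
    -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
    -e 's/[ao]/o/g' \
    -e 's/^q//g' \
    -e 's/y\(.\)/o\1/g' \
    -e 's/^/{/' -e 's/$/#/' \
    -e 's/\([ao]*i*[rlmns]\)#/}.\1/g' \
    -e 's/\(i*[rlmns]\)#/}.\1/g' \
    -e 's/\([ao]*[oy]\)#/}.\1/g' \
    -e 's/#/}._/g' \
    -e ':x' \
    -e 's/\([aoyc]*[CSTKPFtkpfdrls]e*\)}/}.\1/g' \
    -e 'tx' \
    -e 's/{}//g'
}

# Hypothetical sample words, not taken from hea-u.wds.
printf '%s\n' qokeedy daiin chol | factor_words
```

  These three words should come out as ".ot.C.d.y", ".d.oiin" and
  ".C.ol".  Each factored word keeps a leading ".", so the later
  tr '.' '\012' step also emits one empty line per word.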
      
  Total element frequencies:

    136 d      2 de   66 T    3 Te    8 P    1 Pe 
     79 od     1 ode   8 oT   1 oTe   1 oP         
      1 ood            1 ooT
      
    289 C     47 Ce   87 S   19 Se   63 s            
     26 oC     7 oCe   6 oS          12 os           
     
     147 ot   14 ote  14 op   168 or   175 ol  32 om   4 on
      73 t     5 te    9 p      6 r      8 l    4 m        
       1 oot                    4 oor                     
       6 ct            1 cp   
    
     18 _  (no termination)

    214 y
     94 oiin
     55 o
     21 oin
      6 ooiin
      6 oy
      5 oir
      4 oiiin
      2 oiir
      1 iin
      1 oiim
      1 oim
      1 ooiir
      1 ooin
      
  This parsing has three groups of letters:
  
    { d de T Te P Pe C Ce S Se s }    prefixed by "o" only ~20% of the time
    { t te p (pe?) }                  prefixed by "o" half the time
    { r l m n }                       prefixed by "o" ~90% of the time  
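
  The percentages for the three groups can be checked from the "Total
  element frequencies" table.  The sketch below is only a hedged
  recount: the sums were added up by hand from that table, "oo" forms
  are counted as prefixed, "ct" and "cp" are left out, and the
  function name and group labels are made up.

```shell
#!/bin/sh
# o_prefix_share: read "label bare_count o_prefixed_count" lines and
# print the share of occurrences that carry an "o" (or "oo") prefix.
o_prefix_share() {
  awk '{ printf "%s %.0f%%\n", $1, 100 * $3 / ($2 + $3) }'
}

# Hand-summed counts (an assumption of this sketch, see above):
#   d de T Te P Pe C Ce S Se s : 721 bare, 143 with "o"
#   t te p                     :  87 bare, 176 with "o"
#   r l m n                    :  18 bare, 383 with "o"
o_prefix_share <<'EOF'
d-T-P-C-S-s 721 143
t-p 87 176
r-l-m-n 18 383
EOF
```

  With these sums the shares come out as 17%, 67% and 96%, so under
  this way of counting the middle group is nearer two-thirds than
  one-half.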
    
  There are many terminations, some beginning with "o" (or "y") and some with "oo"
  (or "oy").
  
  The "oo" group occurs once each before the letters { d t T }
  and 4 times before "r" (almost as often as the empty premodifier).
  
  The "c" (half-platform) modifier occurs before "t" and "p",
  to the exclusion of the "o" and "e" modifiers.

  Let's try again, attaching the [ao]* groups after the letter:
  
    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\(i*[rlmns]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[aoy]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fcto.fac

    cat hea-u-fcto.fac \
      | sed -e 's/y/o/g' \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmo.frq
  
  Element distributions:
  
    209 @o
      9 @oo

    266 Co     46 Ceo    79 So    18 Seo     9 Po   1 Peo
     42 C       8 Ce     14 S      1 Se 
      7 Coo                             
   
     112 to    18 teo    71 To     4 Teo   191 do   3 deo
     104 t      1 te      4 T               25 d         
       5 too  

       4 cto    1 cpo 
       2 ct         

    150 l_    157 r_     62 s_     36 m_     4 n_
     23 lo     17 ro     12 so 
     10 l       4 r       1 s  
     
     15 p 
      8 po
        
      4 iiin_
      1 iim_
    101 iin_
      3 iir_
      1 im_
     22 in_
      5 ir_

  This second alternative shows these groups of letters:
  
    { m n } only final
    
    { l r s }  can be final; when non-final, they behave like next group
  
    { C Ce S Se P Pe cp t te T Te ct d de }  followed by "o" about 60-90% of the time
    
  There are fewer terminations, and the letters seem more homogeneous.
  
  Perhaps if we treat final "y" as a termination, things get better:
  
    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/y\(.\)/o\1/g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\(i*[rlmns]\)#/}.\1_/g' \
          -e 's/\([y]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[ao]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fcty.fac

    cat hea-u-fcty.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmy.frq
  
  Element frequencies:

    208 @o
      9 @oo

    217 Co   70 So   35 Ceo  14 Seo   48 To  3 cto   13 teo  141 do   8 Po  1 cpo 
     96 C    23 S    19 Ce    5 Se    27 T   3 ct     6 te    75 d    1 P          
      2 Coo  
             
             
     131 t    3 Te   17 p     1 Pe    2 de 
      86 to   1 Teo   6 po            1 deo
       4 too                      

     73 _

    150 l_   157 r_          62 s_ 
     19 l     13 r           10 so 
     14 lo     8 ro           3 s  

     36 m_     4 n_

    220 y_
    101 iin_
     22 in_
      5 ir_
      4 iiin_
      3 iir_
      1 iim_
      1 im_
      
  This schema has the following groups of letters:
  
    { C Ce S Se T ct te d P cp }  most often followed by "o"
    { t Te p Pe de }              most often NOT followed by "o"
    { l r s }                     typically final, otherwise ~50% followed by "o"
    { m n }                       only final.
  
  This division seems less logical than the previous one.  Also "s" is
  rather different from "l" and "r" in its affinity for "o".
  
  If we treat final "y" as a termination, we should probably 
  include the "o" in "oiin" etc.  Let's try it:
  
    cat hea-u.wds \
      | egrep -v '[*]' | egrep -v '[ci]$' \
      | sed \
          -e 's/eee/che/g' -e 's/ee/ch/g' \
          -e 's/ch/C/g' -e 's/sh/S/g' \
          -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
          -e 's/ckh/K/g' -e 's/cth/T/g' -e 's/cph/P/g' -e 's/cfh/F/g' \
          -e 's/[ao]/o/g' \
          -e 's/^q//g' \
          -e 's/^/{/' -e 's/$/#/' \
          -e 's/\([ao]ii*[rmnsl]\)#/}.\1_/g' \
          -e 's/\(i*[rmnsl]\)#/}.\1_/g' \
          -e 's/\([y]\)#/}.\1_/g' \
          -e 's/#/}._/g' \
          -e ':x' \
          -e 's/\([c]*[CSTKPFtkpfdrls]e*[aoy]*\)}/}.\1/g' \
          -e 'tx' \
          -e 's/{\([oy][ao]*\)}/{}.@\1/g' \
          -e 's/{}//g' \
      > hea-u-fctz.fac

    cat hea-u-fctz.fac \
      | tr '.' '\012' \
      | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
      > hea-u-elmz.frq
  
  Let me extract my herbal-A transcription again:
  
    rm -f hea-u.txt
    foreach f ( `cat filter-gal/pages.dir` )
      echo "<${f}>" >> hea-u.txt
      cat filter-gal/$f/$f.P?.txt >> hea-u.txt
    end

    cat hea-u.txt \
      | tr -d '%\!' \
      | sed \
          -e '/^#/d' \
          -e 's/^<[^ ]*> *//g' \
          -e 's/{[^{}]*}//g' \
      | tr ' ' '.' \
      | tr '.,=-' '\012\012\012\012' \
      | egrep -e '.' \
      | egrep -v '[*?]' \
      > hea-u.wds
    dicio-wc hea-u.txt hea-u.wds
  
     lines   words     bytes file        
    ------ ------- --------- ------------
       144    1146      8209 hea-u.txt
       803     803      4681 hea-u.wds

  Let's try to make sense out of this.
  
  [ See Notes/017/Note-017.txt ]
  
  JUNK from now on:

  
    set letters = ( t te T Te p pe P Pe d de S Se C Ce r l s m n )
    
    cat ${file}.fac \
      | egrep -e '([*]|[ci]$|ee|[aoy][aoy]|c[ktpf]([^h]|$))' \
      > .${file}-fine.wds

    foreach k ( ${letters} )
      cat .${file}-fine.fac
        | egrep -v '[@]' \
        | sed -e 
              -e 's/[kt]/t/g' -e 's/[pf]/p/g' \
              -e 's/[ao]/o/g' \
              -e 's/^q//g' \
      cat hea-u-fctz.fac \
        | tr '.' '\012' \
        | sort | uniq -c | expand | sort -b +1 -2 +0 -1nr \
        > hea-u-elmz.frq
    ???
      
    cat hea-u-fact.fac \
      | sed -e 's/^/./' -e 's/[.][^.]*$/#/g' \
      | tr -d 'eo.' \
      | sort | uniq -c | expand | sort -b +0 -1nr \
      > hea-u-patt.frq