Hacking at the Voynich manuscript - Side notes
011 Trying to identify possible plant names in the herbal pages.
  
Last edited on 1999-07-28 01:49:30 by stolfi

UNDER WORK
READ AT YOUR OWN COST AND RISK
HARD HATS REQUIRED
WE APOLOGIZE FOR THE INCONVENIENCE

  Summary of previous relevant tasks:

    I obtained Landini's interlinear transcription of the VMs, version
    1.6 (landini-interln16.evt) from
    http://sun1.bham.ac.uk/G.Landini/evmt/intrln16.zip. [Notebook-1.txt]

    Around 97-11-01 I split landini-interln16.evt into many files, with one
    text unit per page. [Notebook-12.txt]

    On 97-11-05 I mapped those files from FSG and other ad-hoc
    alphabets to EVA.  [Notebook-12.txt] The files are
    L16-eva/fNNxx.YY, and a machine-readable description of their
    contents and logical order is in L16-eva/INDEX.
    
    These files were eventually superseded by L16+H-eva/*

I. LOCATING PAGE-SPECIFIC WORDS

  1998-01-25 [redone 1999-01-15]

  The idea is that a plant's name should appear once or twice on its own
  page, but only rarely on other pages.  Of course, if the language is
  Chinese-like, plant names will consist of two or more words, so
  we should consider phrases too.
  
  We can extract the information we need from the machine-readable
  concordance file Notes/037/vms-17-ok.hoc.gz. Recall that its format was
  
    LOC TRANS START LENGTH LCTX PHRASE RCTX PATT STAG PNUM HNUM
    1   2     3     4      5    6      7    8    9    10   11

  where

    LOC        is a line locator, like "f1r.11", "f86v2.R1.12a" etc.

    TRANS      is a letter identifying a transcriber, e.g. "F" for Friedman.

    START      is the index of the first byte of the occurrence
               in the text line (counting from 1).

    LENGTH     is the original length of the occurrence in the text,
               including fillers, comments, spaces, etc.

    LCTX       is a word or word sequence, the left context of PHRASE.

    PHRASE     is the non-empty phrase in question, without any fillers,
               comments, non-significant spaces, line breaks, etc.

    RCTX       is a word or word sequence, the right context of PHRASE.

    PATT       is a sorting pattern derived from PHRASE; see below.

    STAG       is a tag identifying a section of the VMS, e.g. "hea" or "bio".

    PNUM       is the page's p-number, which is sequential and better for sorting.

    HNUM       is the section number in the HTML-formatted concordance.
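
  For example, assuming the PNUM field looks like "p058", a one-liner
  like the following lists every occurrence locator and phrase attested
  on that page in the majority ("A") version (the page number is just
  an illustration):

      zcat ../037/vms-17-ok.hoc.gz \
        | gawk '(index($2,"A") && (substr($10,2,3) == "058")){print $1, $6;}'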

  First, let's make a list of the herbal text units and pages:
  
    cat L16+H-eva/INDEX \
      | gawk -v FS=':' '($3=="her" && $6=="parags"){print $0}' \
      > her.index
      
    cat her.index \
      | gawk -v FS=':' '/./{print $2;}' \
      > her.units

    cat her.index \
      | gawk -v FS=':' '/./{print substr($7,2,3);}' \
      | sort | uniq \
      > her.pages

  Checking the page sequence:
  
    seq 1 200 | gawk '//{printf "%03d\n",$1;}' > .foo
    diff her.pages .foo > .diff

  There are 128 herbal pages, of which 127 have text: p002-p095,
  p097-p111, p116, p118, p177-p178, and p185-p198. Page p115 (f65r)
  has an herbal-like drawing but contains only a "title" and no text.
    
  Pages 115 and 116 were mis-classified as "unk" instead of
  "hea"/"heb" in the L16+H-eva/INDEX file, release 1.6e6. Therefore we
  should remember to fix their STAG fields.
  
  Since the other half of the bifolio is language A, we can assume the
  same for pages 115 and 116.  (Besides, page 115 has only a title.)
  If we do that, then we have 96 herbal-A pages and 32 herbal-B pages.
  
  Which version shall we use? We could use the majority version ("A"),
  which is more reliable, or Takeshi's version ("H"), which is more
  complete (since the "A" version omits all words for which a majority
  was not achieved). Let's compare them:
  
    foreach v ( A H )
      zcat ../037/vms-17-ok.hoc.gz \
        | gawk \
            ' (index($2,"'"$v"'") && ($6 \!~ /[-=/.,]/)) { \
                print substr($10,2,3), $1, $3, $4; \
              } \
            ' \
        | select-herbal-pages \
        | sort \
        > .$v.locs
    end
    bool 1-2 .A.locs .H.locs > .A-H.locs
    bool 1-2 .H.locs .A.locs > .H-A.locs
    dicio-wc .[A-Z]*.locs
    
      lines   words     bytes file        
    ------- ------- --------- ------------
        457    1828      8329 .A-H.locs
      10934   43736    196935 .A.locs
        102     408      1883 .H-A.locs
      10579   42316    190489 .H.locs

  So there are only 457 word occurrences in the "A" version that are not
  in Takeshi's, and only 102 in Takeshi's that are not in "A".  We might
  as well use the "A" version.
  
  OK, so let's extract from the concordance the phrases from the
  herbal sections, keeping only 
  
      PNUM LOC PATT PHRASE STAG
      1    2   3    4      5
      
  Note that the PNUM is written without the "p".

    foreach f ( hea heb )
      zcat ../037/vms-17-ok.hoc.gz \
        | gawk \
              ' (index($2,"A") && ($6 \!~ /[-=/.,]/)) { \
                  print substr($10,2,3), $1, $8, $6, $9; \
                } \
              ' \
        | gawk -v section=${f} \
              ' (($1 == "115")||($1 == "116")){$5 = "hea";} \
                ($5 == section){print;} \
              ' \
        | sort +2 -4 +0 -1n +1 -2 \
        > ${f}-17.roc
    end
    dicio-wc {hea,heb}-17.roc

      lines   words     bytes file        
    ------- ------- --------- ------------
       7597   37985    216872 hea-17.roc
       3337   16685     96303 heb-17.roc
      
  Next we reduce the data to occurrence counts per page:

    COUNT PNUM STRING
    1     2    3   

  where STRING is either a pattern (PATT) or a phrase (PHRASE), and COUNT
  is the number of occurrences of STRING on page PNUM.
  
    foreach f ( hea heb )
      cat ${f}-17.roc \
        | gawk '/./{print $1, $3;}' \
        | sort +0 -1n +1 -2 | uniq -c | expand \
        | sort +1 -2n +0 -1nr +2 -3 \
        > ${f}-17-pat.spfr

      cat ${f}-17.roc \
        | gawk '/./{print $1, $4;}' \
        | sort +0 -1n +1 -2 | uniq -c | expand \
        | sort +1 -2n +0 -1nr +2 -3 \
        > ${f}-17-phr.spfr
    end
    dicio-wc {hea,heb}-17-{phr,pat}.spfr
     
      lines   words     bytes file        
    ------- ------- --------- ------------
       6102   18306    109817 hea-17-phr.spfr
       4805   14415     85885 hea-17-pat.spfr
       2673    8019     48277 heb-17-phr.spfr
       2060    6180     36920 heb-17-pat.spfr

  As a control experiment, we will take all the phrases/patterns (with
  their correct multiplicities) and assign them randomly to the 128
  pages, keeping the correct page sizes:
  
    foreach g ( a b )
      foreach f ( phr pat )
        cat he${g}-17-${f}.spfr \
          | randomize-distribution \
          > rn${g}-17-${f}.spfr
      end
    end
    dicio-wc {rna,rnb}-17-{phr,pat}.spfr
  
      lines   words     bytes file        
    ------- ------- --------- ------------
       6424   19272    115356 rna-17-phr.spfr
       5168   15504     92175 rna-17-pat.spfr
       2808    8424     50585 rnb-17-phr.spfr
       2176    6528     38957 rnb-17-pat.spfr
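
  The randomize-distribution filter is not listed in this note.
  Presumably it expands each COUNT PNUM STRING record into individual
  occurrences, shuffles the strings against the pages (so that both the
  string multiplicities and the page sizes are preserved), and then
  re-counts.  A rough stand-in along those lines, e.g. for the herbal-A
  phrases (the scratch file names are just illustrative):

      cat hea-17-phr.spfr \
        | gawk '/./{for(i=0;i<$1;i++){print $2, $3;}}' \
        > .occs
      cut -d' ' -f1 .occs > .pgs
      cut -d' ' -f2 .occs \
        | gawk 'BEGIN{srand();} /./{print rand(), $0;}' \
        | sort -n | gawk '/./{print $2;}' \
        > .strs
      paste -d' ' .pgs .strs \
        | sort | uniq -c | expand \
        > .rand.spfr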

  Let's compute the overall frequency of each pattern and phrase:

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | gawk '/./{print $1, $3;}' \
          | combine-counts \
          | sort +0 -1nr +1 -2 \
          > ${t}-17-${f}.sfr
      end
    end
    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.sfr

      lines   words     bytes file        
    ------- ------- --------- ------------
       2283    4566     33946 hea-17-phr.sfr
        996    1992     15004 hea-17-pat.sfr
       2283    4566     33946 rna-17-phr.sfr
        996    1992     15004 rna-17-pat.sfr
       1248    2496     18303 heb-17-phr.sfr
        622    1244      9213 heb-17-pat.sfr
       1248    2496     18303 rnb-17-phr.sfr
        622    1244      9213 rnb-17-pat.sfr

    foreach f ( phr pat )
      diff {hea,rna}-17-${f}.sfr
      diff {heb,rnb}-17-${f}.sfr
    end
    
    (no output, as expected: the randomization preserves each string's
    overall count)
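
  The combine-counts filter is not shown in this note; presumably it
  just merges records with the same key by adding up their counts,
  roughly equivalent to

      gawk '/./{ct[$2] += $1;} END{for(k in ct){print ct[k], k;}}'

  with the subsequent sort taking care of the ordering.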

  Page sizes:

    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | gawk '/./{print $1, $2;}' \
          | combine-counts \
          | sort +0 -1nr +1 -2 \
          > ${t}-17-${f}.pfr
      end
    end
    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.pfr

      lines   words     bytes file        
    ------- ------- --------- ------------
         96     192      1152 hea-17-phr.pfr
         96     192      1152 hea-17-pat.pfr
         96     192      1152 rna-17-phr.pfr
         96     192      1152 rna-17-pat.pfr
         32      64       384 heb-17-phr.pfr
         32      64       384 heb-17-pat.pfr
         32      64       384 rnb-17-phr.pfr
         32      64       384 rnb-17-pat.pfr

    foreach f ( phr pat )
      diff {hea,rna}-17-${f}.pfr
      diff {heb,rnb}-17-${f}.pfr
    end
    
    (no output: the page sizes are likewise preserved)

  Then we create a file that shows, for each string "w", the 
  shape of its distribution over the pages, defined as the
  multiset of the nonzero per-page counts of that word, sorted in
  decreasing order.  The file has one record per string in the 
  format 
  
      STRING TOTCT NPAGES NMISS SHAPE 
      1      2     3      4     5
      
  where TOTCT is the total occurrence count of the string, NPAGES is
  the number of pages where the string occurs, NMISS is the number of
  pages where it does not occur, and SHAPE is the list of nonzero counts,
  in decreasing order, enclosed in "()" and separated by ",".
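
  The compute-distr-shape script itself is not reproduced here.  A
  sketch of what it might look like, assuming the COUNT PNUM STRING
  input above (the file name and the "npages" parameter, 96 for
  herbal-A and 32 for herbal-B, are only illustrative):

    # compute-distr-shape.gawk -- illustrative reconstruction only
    # usage: gawk -v npages=96 -f compute-distr-shape.gawk hea-17-phr.spfr
    /./ {
      n[$3]++; tot[$3] += $1;
      cts[$3] = cts[$3] " " $1;
    }
    END {
      for (w in n) {
        m = split(cts[w], c, " ");
        # sort the nonzero per-page counts of w in decreasing order
        for (i = 1; i < m; i++)
          for (j = i+1; j <= m; j++)
            if (c[j]+0 > c[i]+0) { t = c[i]; c[i] = c[j]; c[j] = t; }
        s = "(" c[1];
        for (i = 2; i <= m; i++) { s = s "," c[i]; }
        print w, tot[w], n[w], npages - n[w], s ")";
      }
    }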
    
    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.spfr \
          | compute-distr-shape \
          | sort +1 -2n +3 -4n +4 -5 +0 -1 \
          > ${t}-17-${f}.sdsh
      end
    end
    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.sdsh

      lines   words     bytes file        
    ------- ------- --------- ------------
       2283   11415     48704 hea-17-phr.sdsh
        996    4980     25856 hea-17-pat.sdsh
       2283   11415     49355 rna-17-phr.sdsh
        996    4980     26570 rna-17-pat.sdsh
       1248    6240     24996 heb-17-phr.sdsh
        622    3110     14072 heb-17-pat.sdsh
       1248    6240     25270 rnb-17-phr.sdsh
        622    3110     14292 rnb-17-pat.sdsh

  Then we compute the number of words that have each histogram shape:
  
    foreach t ( hea rna heb rnb )
      foreach f ( phr pat )
        cat ${t}-17-${f}.sdsh \
          | gawk '//{print $2, $3, $5;}' \
          | sort | uniq -c | expand \
          | sort +1 -2n +0 -1nr +2 -3n  +3 -4 \
          > ${t}-17-${f}.shfr
      end
    end
    dicio-wc {hea,rna,heb,rnb}-17-{phr,pat}.shfr
    
      lines   words     bytes file        
    ------- ------- --------- ------------
        121     484      6594 hea-17-phr.shfr
        120     480      8060 hea-17-pat.shfr
        102     408      6347 rna-17-phr.shfr
        103     412      8074 rna-17-pat.shfr
         77     308      2875 heb-17-phr.shfr
         87     348      3583 heb-17-pat.shfr
         69     276      2821 rnb-17-phr.shfr
         79     316      3555 rnb-17-pat.shfr

STOPPED HERE
-----


  Then we produce, for each distinct STRING, one record with the
  format

    TOTFR MAXFR PMAX SPECF CTMAX SZMAX TOTCT PATT
    1     2     3    4     5     6     7     8

  where 

    TOTFR      is the overall frequency of PATT.

    MAXFR      is the maximum frequency of PATT in any page.
    
    PMAX       is the PNUM of a page where the freq of PATT is MAXFR.
    
    SPECF      is the ratio MAXFR/TOTFR.

    CTMAX      is the occurrence count of PATT in page PMAX.

    SZMAX      is the total word occurrence count in page PMAX.

    TOTCT      is the total occurrence count of PATT, as before.
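
  The compute-tot-max-freqs script and the format of her-17.wpfr are not
  shown in this note.  As a rough illustration of the computation, here
  is a gawk sketch that produces the fields above from COUNT PNUM PATT
  records, using plain relative frequencies (the real script may well
  estimate TOTFR and MAXFR differently):

    # compute-tot-max-freqs.gawk -- illustrative sketch only
    /./ {
      ct[$3] += $1; sz[$2] += $1; tot += $1;
      occ[$3, $2] = $1;
      pgs[$3] = pgs[$3] " " $2;
    }
    END {
      for (w in ct) {
        m = split(pgs[w], p, " ");
        maxfr = -1;
        for (i = 1; i <= m; i++) {
          f = occ[w, p[i]] / sz[p[i]];
          if (f > maxfr) { maxfr = f; pmax = p[i]; }
        }
        totfr = ct[w] / tot;
        printf "%7.5f %7.5f %s %7.4f %d %d %d %s\n", \
          totfr, maxfr, pmax, maxfr/totfr, occ[w, pmax], sz[pmax], ct[w], w;
      }
    }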

  Here it is:

    cat her-17.wpfr \
      | compute-tot-max-freqs \
      | sort -b +2 -3n +3 -4gr \
      > her-17.spc
    dicio-wc her-17.spc

      lines   words     bytes file        
    ------- ------- --------- ------------
       1306   10448     51730 her-17.spc
      
  Let's format it a bit:
  
    cat her-17.spc \
      | gawk \
          ' BEGIN {pg="";} \
            /./ { if($3!=pg){printf "=== page %s word count = %s\n", $3, $6; pg=$3;} \
                  printf "%7.5f %7.5f  %7.5f %4d %4d %s\n", $1, $2, $4, $5, $7, $8; \
                } \
          ' \
      > her-17.spf

  Note that each page has many words with TOTCT=1 and CTMAX=1 that end
  up with a high SPECF (around 8): a word that occurs only once is
  trivially "specific" to the one page where it occurs.

  
    
  
  
  
  Let's define a word or phrase as "super-specific" if it occurs two
  or more times on the same page, and zero times elsewhere.  Here are
  the super-specific phrases:
  
    cat her-17.spc \
      | gawk '(($5>=$7)&&($7 >= 2)){print}' \
      > her-17-super.spc
    dicio-wc her-17-super.spc

      lines   words     bytes file        
    ------- ------- --------- ------------
          4      32       162 her-17-super.spc

    0.0002 0.0022 058 8.9408 2  63 2 eeeoldo
    0.0002 0.0021 066 8.6319 2 112 2 oleedoin
    0.0003 0.0027 084 8.4124 3 149 3 otedol
    0.0002 0.0021 089 8.3664 2 157 2 dotedo

OLD VERSION

  Looking at the listing above, it seems that the first word of each
  page, as well as the last one, is characteristic of the page.

  Moreover, the first word often occurs again at the beginning of the
  second paragraph, sometimes with a slight modification.
  
  To check the first-word hunch, let's make a list of the first words
  and word clusters of each page.  We begin by extracting the first
  line of each page (from Friedman's transcription, which is the one
  we used to build the starting concordance in Note-010):
  
    cat her.units \
      | gawk \
          -v FS='.' \
          'BEGIN{pg="";} /./{if($1!=pg){print;pg=$1} next;}' \
      > her-first.units
    dicio-wc her-first.units
      
     lines   words     bytes file        
    ------ ------- --------- ------------
       127     127       884 her-first.units

  Next we extract the first lines (usually numbered "1") 
  from each of these units:

    cat `cat her-first.units | sed -e 's/^/L16-eva\//g'` \
      | egrep '[.][01][a]*;F*>' \
      > her-first-lines.txt
    dicio-wc her-first-lines.txt

     lines   words     bytes file        
    ------ ------- --------- ------------
       126     252      9093 her-first-lines.txt
       
  (The missing page is f65v = p116, which doesn't have 
  a Friedman version.)

  Next we output all initial word groups with up to 15 EVA chars
  (not counting spaces):
    
    cat her-first-lines.txt \
      | enum-text-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | gawk '($5==1){print $1, $8;}' \
      > her-first-phrases.tbl

    read    1036 words
    wrote   2332 phrases
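
  The enum-text-phrases script is not reproduced here.  The idea, in a
  much simplified form (assuming each input line has already been
  reduced to a locator followed by a single dot-separated word string,
  with no fillers or comments), is to print every initial word group
  whose total letter count is at most maxlen:

    # first-phrases.gawk -- idea only, not the real enum-text-phrases
    # usage: gawk -v maxlen=15 -f first-phrases.gawk her-first-lines.txt
    /./ {
      loc = $1; gsub(/[<>]/, "", loc); sub(/[.].*$/, "", loc);
      n = split($2, w, ".");
      phr = ""; len = 0;
      for (i = 1; i <= n; i++) {
        len += length(w[i]);
        if (len > maxlen) break;
        phr = (i == 1) ? w[i] : phr "." w[i];
        print loc, phr;
      }
    }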

  Now let's tabulate the specificity of each of those
  first-phrases:
  
    cat her-f-eva-15.spc \
      | gawk '/./ {printf "%s %5.3f\n", $6, $3;}' \
      > pat-to-specificity.tbl
    dicio-wc pat-to-specificity.tbl
      
     lines   words     bytes file        
    ------ ------- --------- ------------
     18750   37500    329436 pat-to-specificity.tbl

    cat her-f-eva-15.spc \
      | gawk '/./ {printf "%s %7d\n", $6, $1;}' \
      > pat-to-totct.tbl
      
    cat her-first-phrases.tbl \
      | gawk '/./{print $1,$2,$2;}' \
      | fix-words -f word-equiv.gawk \
          -v field=2 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      | map-field \
          -v inField=2 -v outField=2 \
          -v table=pat-to-specificity.tbl \
          -v default="9.999" \
      | map-field \
          -v inField=3 -v outField=3 \
          -v table=pat-to-totct.tbl \
          -v default="0" \
      | gawk \
          ' BEGIN {fnum="";} \
            /./ { \
              if (($1 != fnum) &&(fnum != "")) {printf "\n";} \
              printf "%-6s %5.3f %4d %-15s %-15s\n", $1, $2, $3, $4, $5; \
              fnum=$1; \
            } \
          ' \
      > her-first-pats.fmt
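
  The fix-words and map-field filters are local helper scripts, not
  listed in this note.  The key step, map-field, apparently replaces
  one field of each record by its value in a two-column lookup table,
  falling back on a default for missing keys, which is what the
  "9.999" check below relies on.  A minimal sketch of that behavior,
  with illustrative variable names:

    # map-field.gawk -- idea only, not the real script
    # usage: gawk -f map-field.gawk -v inField=2 -v outField=2 \
    #             -v table=pat-to-specificity.tbl -v dflt=9.999 data
    BEGIN {
      while ((getline line < table) > 0) {
        split(line, f, " "); map[f[1]] = f[2];
      }
      close(table);
    }
    /./ {
      key = $inField;
      $outField = (key in map) ? map[key] : dflt;
      print;
    }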
      
  Checking whether all first-phrases were found in the concordance
  (any line printed here got the default specificity 9.999, i.e. was
  not found):
  
    grep ' 9\.999' her-first-pats.fmt
  
  Let's now select the first unique (specificity=1) prefix in each page:
  
    cat her-first-pats.fmt \
      | gawk \
          ' BEGIN {prevf="";done=1;} \
            /./ { \
              if($1!=prevf) \
                { if (! done) {print prevlin;} \
                  done=0; \
                } \
              if (($2==1.0)&&(! done)) {print; done=1;} \
              prevf = $1; prevlin = $0; \
            } \
          ' \
      > her-first-unique-pats.fmt

  These are the shortest page-specific phrases at the 
  beginning of each page:
  
    page   spect totn pattern         phrase
    -----  ----- ---- --------------- ------------------
    f1v    1.000    1 kchro           kchry          
    f2r    1.000    1 kydaino         kydainy        
    f2v    1.000    1 kooiincheo      kooiin.cheo    
    f3r    1.000    1 ksheos          tsheos         
    f3v    1.000    1 koaiin          koaiin         
    f4r    1.000    1 kodalcho        kodalchy       
    f4v    1.000    1 pchooiin        pchooiin       
    f5r    1.000    1 kshodopchoo     kshody.pchoy   
    f5v    1.000    1 kocheor         k.o.cheor      
    f6r    1.000    1 poaro           foar.y         
    f6v    1.000    1 koaro           koary          
    f7r    0.200    5 pchodaiin       fchodaiin      
    f7v    1.000    1 polysho         polyshy        
    f8r    1.000    1 pshol           pshol          
    f8v    1.000    1 ckhodsoockh     cthod.soocth   
    f9r    1.000    1 kydlo           tydlo          
    f9v    1.000    1 pochor          fochor         
    f10r   1.000    1 pchockhoshor    pchocthy.shor  
    f10v   1.000    1 paiindaiin      paiin.daiin    
    f11r   1.000    1 ksholschoal     tshol.schoal   
    f11v   1.000    1 poldchodo       poldchody      
    f13r   1.000    1 korshor         torshor        
    f13v   1.000    1 koair           koair          
    f14r   1.000    1 pchodaiinchopol pcho.daiin.chopol
    f14v   1.000    1 pdychoiin       pdychoiin      
    f15r   1.000    1 kshorsheokchalo tshor.shey.tchaly
    f15v   1.000    1 poror           poror          
    f16r   1.000    1 pocheodo        pocheody       
    f16v   1.000    1 pchraiin        pchraiin       
    f17r   1.000    1 pshododaram     fshody.daram   
    f17v   1.000    1 pchodol         pchodol        
    f18r   1.000    1 pdrairdo        pdrairdy       
    f18v   1.000    1 kopd            tofd           
    f19r   1.000    1 pchorodcho      pchor.qodchy   
    f19v   1.000    1 pochaiinckhor   pochaiin.cthor 
    f20r   1.000    1 kdchodo         kdchody        
    f20v   1.000    1 paiis           faiis          
    f21r   1.000    1 pchorochockho   pchor.oeeockhy 
    f21v   1.000    1 koldsho         toldshy        
    f22r   1.000    1 pololsho        pololshy       
    f22v   1.000    1 pysaiinor       pysaiinor      
    f23r   1.000    1 pydchdom        pydchdom       
    f23v   1.000    1 podairol        podairol       
    f24r   1.000    1 pororo          porory         
    f24v   1.000    1 kchodarchocpho  tchodar.chocfhy
    f25r   1.000    1 pcholdososho    fcholdy.soshy  
    f25v   1.000    1 pochaiinoko     poeeaiin.qoky  
    f26r   1.000    1 psheoko         psheoky        
    f26v   1.000    1 pchedarodaro    pchedar.qodary 
    f27r   1.000    1 ksor            ksor           
    f27v   1.000    1 pochop          fochof         
    f28r   1.000    1 pchodar         pchodar        
    f28v   1.000    1 ksholooiiin     kshol.qooiiin  
    f29r   1.000    1 poraiin         poraiin        
    f29v   1.000    1 kooiinshor      kooiin.shor    
    f30r   1.000    1 okchesocheo     okchesy.chey   
    f30v   1.000    1 ckhsckhain      cthscthain     
    f31r   1.000    1 kchdeo          keedey         
    f31v   1.000    1 podair          podair         
    f32r   1.000    1 pchaiinshykeodo fchaiin.shykeody
    f32v   1.000    1 kcheodaiinchol  kcheodaiin.chol    (*)[1]
    f33r   1.000    1 kshdar          tshdar         
    f33v   1.000    1 karardaiin      tar.ar.daiin   
    f34r   1.000    1 pcheoepcho      pcheoepchy     
    f34v   1.000    1 kechdochdo      kechdy.chdy    
    f35r   1.000    1 ckhoorcholo     cthoo.rcholy   
    f35v   1.000    1 parchor         parchor        
    f36r   1.000    1 pchapdan        pchafdan       
    f36v   1.000    1 pcharaso        pcharasy       
    f37r   1.000    1 kocphol         tocphol        
    f37v   1.000    1 kshodoockho     kshody.qocthy  
    f38r   1.000    1 kolor           tolor          
    f38v   1.000    1 okchopchol      okchop.chol        (*)[2]
    f39r   1.000    1 kedochshd       tedo.chshd     
    f39v   1.000    1 pdair           pdair          
    f40r   1.000    1 pcheokeodar     pchey.keodar   
    f40v   1.000    1 pchedain        pchedain       
    f41r   1.000    1 p*eo            p*ey           
    f41v   0.500    2 pcheodo         pcheody        
    f42r   1.000    1 ckhsho          cthsho         
    f42v   1.000    1 pchockhoshcho   pcho.ctho.sheey    (*)[3]    
    f43r   1.000    1 karodaiin       tarodaiin      
    f43v   1.000    1 pdsairo         pdsairy        
    f44r   1.000    1 kshodpo         tshodpy        
    f44v   1.000    1 kshookshockhol  tsho.qotshy.cthol
    f45r   1.000    1 pykydal         pykydal        
    f45v   1.000    1 koraro          korary         
    f46r   1.000    1 pcheocpho       pcheocphy      
    f46v   1.000    1 podolshed       pody.lshed     
    f47r   1.000    1 pchair          pchair         
    f47v   1.000    1 pcheok          pcheot         
    f48r   1.000    1 pshdaiin        pshdaiin       
    f48v   1.000    1 pcheodcho       pcheodchy      
    f49r   1.000    1 pychol          pychol         
    f50r   1.000    1 psheor          psheor         
    f50v   1.000    1 kchodoldar      tchy.do.ldar   
    f51r   1.000    1 kcholdchookcho  tcholdchy.qotchy
    f51v   1.000    1 poshodo         poshody        
    f52r   1.000    1 kdokchcpho      tdokchcfhy     
    f52v   1.000    1 pchorchcphol    pchor.chcphol  
    f53r   1.000    1 kadam           kadam          
    f53v   0.500    2 kshorsheo       tshor.shey     
    f54r   1.000    1 podaiinshodal   podaiin.shodal 
    f54v   1.000    1 pcheodar        pcheodar       
    f55r   1.000    1 podaiinshekcho  podaiin.shekchy
    f55v   1.000    1 kchchdchdo      kcheedchdy     
    f56r   1.000    1 o*chal          o*chal         
    f56v   1.000    1 kcheak          kcheat         
    f57r   1.000    1 pocho           poeeo          
    f66v   1.000    1 okeodop         okeodof        
    f87r   1.000    1 poal            poal           
    f87v   1.000    1 pchchodaiin     pcheey.daiin   
    f90r1  1.000    1 polcholokeol    poleeol.qokeol 
    f90r2  1.000    1 koealchs        toealchs       
    f90v2  1.000    1 cphdacho        cphdachy       
    f90v1  1.000    1 pcheor          pcheor         
    f93r   1.000    1 kodshol         kodshol        
    f93v   1.000    1 porsheodo       porsheody      
    f94r   1.000    1 kchdoopainr     tchdy.opainr   
    f94v   1.000    1 kshedochedar    tshedy.chedar  
    f95r1  1.000    1 kshdor          kshdor         
    f95r2  1.000    1 kshedoor        kshedy.or      
    f95v2  1.000    1 kchodopodar     tchody.podar   
    f95v1  0.500    2 kolkchdo        toltchdy       
    f96r   1.000    1 korchchor       tor.cheeor     
    f96v   1.000    1 psheas          psheas    
    
  (*) Entries fixed since first release of this note.
  See NOTES section below.

  Now let's create a table that maps the specific
  word patterns to colors: 
  
    several occurrences, all on one page:    ff0000 red
    one occurrence:                          9900ff light purple
    several occurrences, >=1/2 on one page:  0000ff blue
    several occurrences, <1/2 on any page:   000000 black
  
    cat her-f-eva-15.spc \
      | gawk \
          ' ($1>1){ \
              if ($2>=$1)        {print $6, "ff0000";} \
              else if ($2>=$1/2) {print $6, "0000ff";} \
              else               {print $6, "000000";} \
            } \
          ' \
      > pat-to-color.tbl
    dicio-wc pat-to-color.tbl
      
     lines   words     bytes file        
    ------ ------- --------- ------------
      1795    3590     27107 pat-to-color.tbl

    colorize-pages \
      -v stripq=1 \
      -v equatekt=${equatekt} \
      -v equatepf=${equatepf} \
      `cat her.units` 
      
    make-page-index \
      `cat her.units`
      
  Rene questions whether the first paragraph is really more special than
  the other paragraphs.  So let's extract the first line of the
  second paragraph of each page:

    cat `cat her.units | sed -e 's/^/L16-eva\//g'` \
      | egrep ';F*>' \
      | gawk \
          ' BEGIN { curfn=""; parno=9999; done=0; } \
            /^[#]/ { next; } \
            /./ { \
              fn = $1; \
              gsub(/</, "", fn); gsub(/[.].*$/, "", fn); \
              if (fn != curfn) \
                { if ((! done) && (curfn != "")) \
                    { printf "%s only %d parags\n", curfn, parno > "/dev/stderr"; } \
                  curfn=fn; parno=1; lineno=1; done=0; \
                } \
              else \
                { if (lineno == 9999) { parno++; lineno=0; } \
                  lineno++; \
                } \
              if ((parno == 2) && (lineno == 1)) { print; done=1; } \
              if (match($0, /[=]/)) { lineno=9999; } \
              next; \
            } \
          ' \
      > her-second-lines.txt
    dicio-wc her-second-lines.txt
  
    cat her-second-lines.txt \
      | enum-text-phrases -f eva2erg.gawk \
          -v maxlen=15 \
      | gawk '($5==1){print $1, $8;}' \
      > her-second-phrases.tbl
      
    read     672 words
    wrote   1561 phrases

    cat her-second-phrases.tbl \
      | gawk '/./{print $1,$2,$2;}' \
      | fix-words -f word-equiv.gawk \
          -v field=2 \
          -v stripq=1 \
          -v equatekt=${equatekt} \
          -v equatepf=${equatepf} \
      | map-field \
          -v inField=2 -v outField=2 \
          -v table=pat-to-specificity.tbl \
          -v default="9.999" \
      | map-field \
          -v inField=3 -v outField=3 \
          -v table=pat-to-totct.tbl \
          -v default="0" \
      | gawk \
          ' BEGIN {fnum="";} \
            /./ { \
              if (($1 != fnum) &&(fnum != "")) {printf "\n";} \
              printf "%-6s %5.3f %4d %-15s %-15s\n", $1, $2, $3, $4, $5; \
              fnum=$1; \
            } \
          ' \
      > her-second-pats.fmt
      
    grep ' 9\.999' her-second-pats.fmt
  
    cat her-second-pats.fmt \
      | gawk \
          ' BEGIN {prevf="";done=1;} \
            /./ { \
              if($1!=prevf) \
                { if (! done) {print prevlin;} \
                  done=0; \
                } \
              if (($2==1.0)&&(! done)) {print; done=1;} \
              prevf = $1; prevlin = $0; \
            } \
          ' \
      > her-second-unique-pats.fmt

  So, here are the shortest page-specific phrases at 
  the beginning of the SECOND paragraph of each page:
  
    page   spect totn pattern         phrase
    -----  ----- ---- --------------- ------------------
    f1v    1.000    1 pokoo           potoy          
    f2r    1.000    1 kydain          kydain         
    f2v    1.000    1 kchorsho        kchor.shy      
    f3r    1.000    1 pcheolshol      pcheol.shol    
    f3v    1.000    1 kchorokcham     tchor.otcham   
    f4r    1.000    1 pydaiinokcho    pydaiin.qotchy 
    f4v    1.000    1 korchoshchor    torchy.sheeor  
    f5r    1.000    1 kshoshodo       tshy.shody     
    f6v    1.000    1 kchodoshockhol  tchody.shocthol
    f7r    1.000    1 ksholo          ksholo         
    f7v    1.000    1 kchorsheod      kchor.sheod    
    f8r    1.000    1 kchoep          tchoep         
    f8v    1.000    1 pcharcho        pchar.cho      
    f9r    1.000    1 pshoain         pshoain        
    f9v    1.000    1 pchoropchcho    pchor.ypcheey  
    f10r   1.000    1 ocheorckho      ycheor.cthy    
    f10v   1.000    1 okchokor        qotchy.tor     
    f11r   1.000    1 kcholshor       tchol.shor     
    f13r   1.000    1 shorodosho      shorodo.shy    
    f13v   1.000    1 poldaiin        fol.daiin      
    f14r   1.000    1 sosho*chol      soshy.*chol    
    f14v   1.000    1 ochododaiin     ychy.dy.daiin  
    f16r   1.000    1 kchorchorchs    tchor.chor.chs 
    f16v   1.000    1 pchockhochypcho pchocthy.chypchy
    f17r   1.000    1 kcho*           tcho*          
    f18r   1.000    1 kchorshor       tchor.shor     
    f19v   1.000    1 kookcheo        toy.tchey      
    f20r   0.250    4 pchockho        pchocthy       
    f20v   1.000    1 ksholpol        tshol.fol      
    f21r   1.000    1 pchopo          pchofy         
    f22r   1.000    1 pchaiinopcho    pchaiin.opchy  
    f22v   1.000    1 pshor           fshor          
    f23r   1.000    1 okoldookaiir    qokoldy.okaiir 
    f23v   1.000    1 ksholshor       tshol.shor     
    f24v   1.000    1 kocholchor      tochol.chor    
    f26r   1.000    1 pchookedo       pcho.qokedy    
    f27r   1.000    1 kcheocheo       kchey.chey     
    f28v   1.000    1 kshoiin         tshoiin        
    f29r   1.000    1 kcheolcheor     kcheol.cheor   
    f29v   1.000    1 kochon          tochon         
    f30r   1.000    1 opcholol        opchol.ol      
    f31r   1.000    1 kshokeodo       tshoteody      
    f31v   1.000    1 pchchodo        pcheeody       
    f32r   1.000    1 pchokcheychedo  fcho.tcheychedy
    f32v   1.000    1 kshocphor       ksho.cphor     
    f33v   1.000    1 kshdoshepchdo   tshdy.shefchdy 
    f34r   1.000    1 kcheoolchckho   tcheo.olchckhy 
    f34v   1.000    1 pchedarshear    pchedar.shear  
    f35r   1.000    1 paiinchear      paiin.chear    
    f36r   1.000    1 podaiir         podaiir        
    f36v   1.000    1 kchorckhoiin    tchor.ckhoiin  
    f37r   1.000    1 pchokcho        pchotchy       
    f37v   1.000    1 okorchoiin      qotor.choiin   
    f39r   1.000    1 pchdaiin        pchdaiin       
    f39v   1.000    1 pardo           pardy          
    f40r   1.000    1 ksheokcheo      ksheo.kchey    
    f40v   1.000    1 kochschedo      toees.chedy    
    f41r   1.000    1 shedeo          shedey         
    f42r   1.000    1 pchocho         pcho.chy       
    f42v   1.000    1 posheor         posheor        
    f43r   1.000    1 psheso          pshesy         
    f43v   1.000    1 oshedoockho     y.shedy.octhy  
    f44r   1.000    1 kooshysho       toy.shysho     
    f44v   1.000    1 ookalod         yokalod        
    f45r   1.000    1 kolshopchor     kolsho.pchor   
    f45v   1.000    1 okolchoiin      qotol.choiin   
    f46r   1.000    1 kedodokedo      tedy.dotedy    
    f47r   1.000    1 polr            folr           
    f47v   1.000    1 pchodaiindair   pchodaiin.dair 
    f48v   1.000    1 pchedarcheo     pchedar.chey   
    f49r   1.000    1 kshoodain       ksho.qodain    
    f52v   1.000    1 pcheolsholoiin  pcheol.sholoiin
    f54r   1.000    1 korari          korari         
    f55r   1.000    1 kchedar         tchedar        
    f55v   1.000    1 okchddaiin      okchd.daiin    
    f56r   1.000    1 kchokokchol     tchoky.kchol   
    f56v   1.000    1 kchocho         kcho.chy       
    f57r   1.000    1 kcheodarokam    tcheodar.okam  
    f66v   1.000    1 kchodsheodo     tchod.sheody   
    f87r   1.000    1 psheodsho       psheodshy      
    f87v   1.000    1 opchchockheo    opcheey.cthey  
    f90v2  1.000    1 kcheodocpheal   tcheody.cpheal 
    f90v1  1.000    1 ksheodal        ksheodal       
    f94v   1.000    1 kedaiinchedo    tedaiin.chedy  
    f95r1  1.000    1 kchdor          tchdor         
    f95r2  1.000    1 kshod           tshod          
    f95v1  1.000    1 kshdal          tshdal         
    

NOTES

  (*) In the original release of this file, the word equivalence
  function had a minor bug that caused it to miss a few equivalences
  between composite and single words.  Because of the bug, three
  pages had names that were not really specific:
    [1] f32v was "kcheodaiin" now is "kcheodaiin.chol"
    [2] f38v was "okchop"     now is "okchop.chol" 
    [3] f42v was "pcho.ctho"  now is "pcho.ctho.sheey"