# Last edited on 2026-01-16 02:01:44 by stolfi

APPENDIX - INTRODUCTION

This sub-note describes some experiments that led to the selection of
the element set.

TENTATIVE SET

First we tried to parse all the tokens of the parags text with the
following set of elements:

  {q} {o} {a} {y} {d} {r} {l} {s}

  {ch} {che} {sh} {she} {ee} {eee} {ih} {ihe}

  {k} {ke} {t} {te} {p} {pe} {f} {fe}

  {ckh} {ckhh} {ckhe} {ckhhe}  {cth} {cthh} {cthe} {cthhe}
  {cph} {cphh} {cphe} {cphhe}  {cfh} {cfhh} {cfhe} {cfhhe}

  {ikh} {ikhh} {ikhe} {ikhhe}  {ith} {ithh} {ithe} {ithhe}
  {iph} {iphh} {iphe} {iphhe}  {ifh} {ifhh} {ifhe} {ifhhe}

  {n} {in} {iin} {iiin}  {m} {im} {iim} {iiim}
  {ir} {iir} {iiir}  {id} {iid} {iiid}
  {il} {iil} {iiil}  {is} {iis} {iiis}

do_093_elem_stats.sh

  all           gud           bad         % gud  sec-type
  ------------  ------------  ----------  -----  ----------
   6075.000000   6043.500000   31.500000  99.48  bio-parags
    984.500000    956.500000   28.000000  97.16  cos-parags
   7564.000000   7470.000000   94.000000  98.76  hea-parags
   3298.500000   3252.500000   46.000000  98.61  heb-parags
   2191.500000   2145.500000   46.000000  97.90  pha-parags
  10581.750000  10480.750000  101.000000  99.05  str-parags
   2921.500000   2896.500000   25.000000  99.14  unk-parags
  33616.750000  33245.250000  371.500000  98.89  tot-parags

So almost 99% of all parags words that have only the "valid" chars
can be parsed into valid elements of the model, and at least 97% in
every main section.

Here are the element frequencies among the words that can be parsed
into elements:

  12667.000000  0.09991  {a}
  21638.750000  0.17067  {o}
  15585.125000  0.12293  {y}
   5193.000000  0.04096  {q}

  11518.250000  0.09085  {d}
   9270.250000  0.07312  {l}
   5824.000000  0.04594  {r}
   2100.500000  0.01657  {s}

   6006.000000  0.04737  {ch}
   3922.250000  0.03094  {che}
   3840.000000  0.03029  {ee}
    321.000000  0.00253  {eee}
   2212.250000  0.01745  {sh}
   1876.875000  0.01480  {she}
      1.000000  0.00001  {ih}
      1.000000  0.00001  {ihe}

   7318.250000  0.05772  {k}
   1526.000000  0.01204  {ke}
   4212.250000  0.03322  {t}
    786.500000  0.00620  {te}
   1211.750000  0.00956  {p}
      0.000000  0.00000  {pe}
    326.000000  0.00257  {f}
      0.000000  0.00000  {fe}

    605.000000  0.00477  {ckh}
    207.500000  0.00164  {ckhe}
     28.000000  0.00022  {ckhh}  # 0.12
     23.000000  0.00018  {ikh}   # 0.037
      3.000000  0.00002  {ikhe}
      1.000000  0.00001  {ikhh}

    688.500000  0.00543  {cth}
    166.000000  0.00131  {cthe}
     30.000000  0.00024  {cthh}  # 0.15
     23.000000  0.00018  {ith}   # 0.033
      3.000000  0.00002  {ithe}
      0.000000  0.00000  {ithh}

    124.500000  0.00098  {cph}
     56.000000  0.00044  {cphe}
      9.000000  0.00007  {cphh}  # 0.14
      7.000000  0.00006  {iph}   # 0.057
      1.000000  0.00001  {iphe}
      1.000000  0.00001  {iphh}

     47.000000  0.00037  {cfh}
     18.000000  0.00014  {cfhe}
      3.000000  0.00002  {cfhh}  # 0.16
      2.000000  0.00002  {ifh}   # 0.051
      1.000000  0.00001  {ifhe}
      2.000000  0.00002  {ifhh}

      7.000000  0.00006  {id}
     13.000000  0.00010  {iid}
      3.000000  0.00002  {iiid}

     30.000000  0.00024  {il}
      9.500000  0.00007  {iil}
      2.000000  0.00002  {iiil}

    116.500000  0.00092  {n}
   1673.500000  0.01320  {in}
   3780.500000  0.02982  {iin}
    158.000000  0.00125  {iiin}

    873.500000  0.00689  {m}
     41.000000  0.00032  {im}
     17.000000  0.00013  {iim}
      0.000000  0.00000  {iiim}

    489.500000  0.00386  {ir}
    131.500000  0.00104  {iir}
      0.000000  0.00000  {iiir}

     19.000000  0.00015  {is}
     12.000000  0.00009  {iis}
      0.000000  0.00000  {iiis}
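For concreteness, here is a minimal sketch, in Python, of one way to
do such a parse: longest-first match with backtracking over the
tentative element set.  It is only an illustration; the actual logic
of do_093_elem_stats.sh is not reproduced here and may differ.  Also,
the counts above are fractional (weighted over transcription
versions), not plain token counts.

  # The tentative element set, built from the groups listed above.
  ELEMENTS = sorted(
      ["q", "o", "a", "y", "d", "r", "l", "s",
       "ch", "che", "sh", "she", "ee", "eee", "ih", "ihe",
       "k", "ke", "t", "te", "p", "pe", "f", "fe"]
      # platform gallows {cXh} {cXhh} {cXhe} {cXhhe} and @I variants:
      + [c + g + h for c in "ci" for g in "ktpf"
         for h in ("h", "hh", "he", "hhe")]
      # final groups and i-codas:
      + [i + c for i in ("", "i", "ii", "iii") for c in "nm"]
      + [i + c for i in ("i", "ii", "iii") for c in "rdls"],
      key=len, reverse=True,       # try longer elements first
  )

  def parse(token):
      """Split a token into elements; None = unparseable residue."""
      if token == "":
          return []
      for e in ELEMENTS:
          if token.startswith(e):
              rest = parse(token[len(e):])
              if rest is not None:
                  return [e] + rest
      return None

  print(parse("daiin"))    # ['d', 'a', 'iin']
  print(parse("qokeedy"))  # ['q', 'o', 'k', 'ee', 'd', 'y']
  print(parse("xol"))      # None (contains an invalid char)

Note that plain greedy longest-match is not enough: in @'qokeedy' the
greedy choice {ke} leaves the unparseable remainder @'edy', so the
parser must back up and take {k} {ee} instead.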
Eliminating the I-benches

There are actually three tokens with {ih} in my transcription file,
but one failed to parse for another reason.  Many of the glyphs that
look like @'Ih' had been transcribed as @'Ch', because the
transcribers (me included) saw them as accidentally malformed @'Ch'.

There are 67 tokens with @I-platforms in the parags text, compared to
1982.5 tokens with @C-platforms.  The counts of the former are not
only small compared to those of the corresponding @C versions, but
the ratio for each pair is almost the same, 0.03--0.06.  So it seems
likely that many of those @I platforms are just malformed or misread
@C platforms.  Many of the @I-elements are in fact ambiguous on the
images.

Therefore I decided that keeping the @I-benches and @I-platform
gallows as separate elements was not worth the trouble.  Instead I
went through my transcription and replaced all those glyph groups by
the @C versions.  See note 074.

There is the risk that the @I variants of benches and/or platform
gallows are actually distinct elements, so this "fixing" of the
@I-elements actually created errors.  But the number of such errors
(~3% of all platform gallows) would be small compared to the number
of other errors that are expected to exist.

The long platform gallows

In the parags sections, there are 75 tokens (40 lexemes) with "long
platform gallows" -- elements of the form {cXhh}, where X is a
gallows letter -- compared to 447.5 tokens (174 lexemes) with {cXhe}
elements.

Of these 40 lexemes, 21 are such that, if the @'hh' end is replaced
by @'he', the result is a lexeme that already occurs in the parags
text.  For the most common of these lexemes, the @'hh' version
accounts for 15--25% of the combined frequency of the two versions,
@'hh' and @'he'.

The other 19 lexemes with @'hh' gallows are all hapaxes, except
@'shcphhy', which occurs 3 times.  Five of these 19 cannot be parsed
as elements with the current element set; three of those would become
parsable if the @'hh' were replaced by @'he'.

I am tempted to either (1) exclude the @'hh' elements from the
element set (so that the tokens that use them become part of the
unparseable residue), or (2) map them in the source file to the
corresponding @'he' versions (thus implicitly declaring them errors
and correcting them that way).  Either way the impact on the
statistics will be small.  However, alternative (1) seems more
prudent: the ratio of {cXhh} tokens to {cXhe} tokens (~17%) is rather
large to dismiss them all as transcription errors.

The d-codas

There are only 23 tokens (21 lexemes) whose parsing with the element
set above generates the elements {id}, {iid}, or {iiid}; only three
with the last one.  Only one of those lexemes, @'daiidy', occurs more
than once (7 times); all the others are hapaxes.

In all cases these elements are preceded by an {a} element.  In all
but one of these cases the element {id}, {iid}, or {iiid} is not at
the end of the word, suggesting that these are words that were run
together, with loss of the @n that should have followed the @i
string.  These may also be cases where an @n was wrongly written or
retraced as @d.

It does not seem worth keeping these elements in the set.  Let them
be relegated to the unparsable residue.

The l-codas

There are 46 tokens (49 lexemes) with the coda elements {il}, {iil},
or {iiil}; only two with the last one.  The lexemes that occur more
than once are @'ail' (4.5 tokens), @'aildy' (2), @'dail' (2), and
@'sail' (1.5).  All the other lexemes with these elements occur only
once or less.

In practically all cases these elements are preceded by an {a}
element.  In 19 tokens (25 lexemes) the element {il}, {iil}, or
{iiil} does not occur at the end of the word.  Thus these may be
cases of words run together.

I am tempted to exclude all these @l-codas from the element set.
Capturing those 46 tokens does not seem worth the hassle of including
those elements.
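As a sketch of the positional diagnostic used in these sections, the
fragment below counts, for each coda element, how often it occurs
word-finally versus word-internally.  It reuses the parse() function
from the earlier sketch; the sample tokens are lexemes quoted in this
note, and the counts are plain (unweighted), unlike the fractional
counts in the text.

  from collections import Counter

  CODAS = ("id", "iid", "iiid", "il", "iil", "iiil", "is", "iis", "iiis")

  def coda_positions(tokens):
      """Count word-final vs word-internal occurrences of each coda."""
      final, internal = Counter(), Counter()
      for t in tokens:
          elems = parse(t)      # parse() from the earlier sketch
          if elems is None:
              continue          # skip the unparseable residue
          for i, e in enumerate(elems):
              if e in CODAS:
                  tgt = final if i == len(elems) - 1 else internal
                  tgt[e] += 1
      return final, internal

  fin, inn = coda_positions(["dail", "aildy", "daiidy", "ais"])
  print(fin)  # Counter({'il': 1, 'is': 1})
  print(inn)  # Counter({'il': 1, 'iid': 1})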
The s-codas

The elements {is} and {iis} occur in 30 tokens (43 lexemes) of the
parags text.  The only lexeme that occurs more than once (twice only)
is @'ais'.  All the other lexemes with {is} or {iis} occur only once
or less.

Of these 30 tokens, 12.5 (18 lexemes) have other elements after the
{is}, {iis}, or {iiis}.  Thus they may be cases of words run
together.

Many of these elements may have been instances of {ir} and {iir} that
were badly written and/or incorrectly transcribed.  The glyphs @r and
@s are often written with ambiguous shapes that can be read either
way.  It seems best to remove these elements from the element set
too.
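To gauge the combined effect of these exclusions, one could re-parse
the tokens with a reduced element set and compare the parseable
fractions, along the lines of the sketch below.  The DROPPED list is
only illustrative, not the final element set (in particular, the @I
benches listed in it were remapped to @C in the transcription rather
than dropped); ELEMENTS is from the first sketch, and the four sample
tokens are a tiny made-up mix of parseable and residue words.

  DROPPED = set(
      # long platform gallows {cXhh} {cXhhe}, @C and @I versions:
      [c + g + "hh" + e for c in "ci" for g in "ktpf" for e in ("", "e")]
      + ["ih", "ihe"]                                       # @I benches
      + [i + c for i in ("i", "ii", "iii") for c in "dls"]  # d-, l-, s-codas
  )
  REDUCED = [e for e in ELEMENTS if e not in DROPPED]

  def parse_with(token, elems):
      """Like parse() above, but over an arbitrary element list."""
      if token == "":
          return []
      for e in elems:
          if token.startswith(e):
              rest = parse_with(token[len(e):], elems)
              if rest is not None:
                  return [e] + rest
      return None

  toks = ["daiin", "dail", "qokeedy", "okais"]
  for elems in (ELEMENTS, REDUCED):
      ok = sum(1 for t in toks if parse_with(t, elems) is not None)
      print(ok / len(toks))  # 1.0 with the full set, 0.5 with the reduced set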