# Last edited on 2026-01-16 02:01:44 by stolfi

APPENDIX - INTRODUCTION

This sub-note describes some experiments that led to the selection of
the element set.

TENTATIVE SET

First we tried to parse all the tokens of the parags text with the
following set of elements:

  {q} {o} {a} {y} {d} {r} {l} {s}

  {ch} {che} {sh} {she} {ee} {eee} {ih} {ihe}

  {k} {ke} {t} {te} {p} {pe} {f} {fe}

  {ckh} {ckhh} {ckhe} {ckhhe}  {cth} {cthh} {cthe} {cthhe}
  {cph} {cphh} {cphe} {cphhe}  {cfh} {cfhh} {cfhe} {cfhhe}

  {ikh} {ikhh} {ikhe} {ikhhe}  {ith} {ithh} {ithe} {ithhe}
  {iph} {iphh} {iphe} {iphhe}  {ifh} {ifhh} {ifhe} {ifhhe}

  {n} {in} {iin} {iiin}  {m} {im} {iim} {iiim}
  {ir} {iir} {iiir}  {id} {iid} {iiid}
  {il} {iil} {iiil}  {is} {iis} {iiis}

do_093_elem_stats.sh

  all           gud           bad         % gud  sec-type
  ------------  ------------  ----------  -----  ----------
   6075.000000   6043.500000   31.500000  99.48  bio-parags
    984.500000    956.500000   28.000000  97.16  cos-parags
   7564.000000   7470.000000   94.000000  98.76  hea-parags
   3298.500000   3252.500000   46.000000  98.61  heb-parags
   2191.500000   2145.500000   46.000000  97.90  pha-parags
  10581.750000  10480.750000  101.000000  99.05  str-parags
   2921.500000   2896.500000   25.000000  99.14  unk-parags
  33616.750000  33245.250000  371.500000  98.89  tot-parags

So almost 99% of all parags words that have only the "valid" chars
can be parsed into valid elements of the model, and at least 97% in
every main section.

Here are the element frequencies among the words that can be parsed
into elements:

  12667.000000  0.09991  {a}
  21638.750000  0.17067  {o}
  15585.125000  0.12293  {y}
   5193.000000  0.04096  {q}

  11518.250000  0.09085  {d}
   9270.250000  0.07312  {l}
   5824.000000  0.04594  {r}
   2100.500000  0.01657  {s}

   6006.000000  0.04737  {ch}
   3922.250000  0.03094  {che}
   3840.000000  0.03029  {ee}
    321.000000  0.00253  {eee}
   2212.250000  0.01745  {sh}
   1876.875000  0.01480  {she}
      1.000000  0.00001  {ih}
      1.000000  0.00001  {ihe}

   7318.250000  0.05772  {k}
   1526.000000  0.01204  {ke}
   4212.250000  0.03322  {t}
    786.500000  0.00620  {te}
   1211.750000  0.00956  {p}
      0.000000  0.00000  {pe}
    326.000000  0.00257  {f}
      0.000000  0.00000  {fe}

    605.000000  0.00477  {ckh}
    207.500000  0.00164  {ckhe}
     28.000000  0.00022  {ckhh}  # 0.12
     23.000000  0.00018  {ikh}   # 0.037
      3.000000  0.00002  {ikhe}
      1.000000  0.00001  {ikhh}

    688.500000  0.00543  {cth}
    166.000000  0.00131  {cthe}
     30.000000  0.00024  {cthh}  # 0.15
     23.000000  0.00018  {ith}   # 0.033
      3.000000  0.00002  {ithe}
      0.000000  0.00000  {ithh}

    124.500000  0.00098  {cph}
     56.000000  0.00044  {cphe}
      9.000000  0.00007  {cphh}  # 0.14
      7.000000  0.00006  {iph}   # 0.057
      1.000000  0.00001  {iphe}
      1.000000  0.00001  {iphh}

     47.000000  0.00037  {cfh}
     18.000000  0.00014  {cfhe}
      3.000000  0.00002  {cfhh}  # 0.16
      2.000000  0.00002  {ifh}   # 0.051
      1.000000  0.00001  {ifhe}
      2.000000  0.00002  {ifhh}

      7.000000  0.00006  {id}
     13.000000  0.00010  {iid}
      3.000000  0.00002  {iiid}

     30.000000  0.00024  {il}
      9.500000  0.00007  {iil}
      2.000000  0.00002  {iiil}

    116.500000  0.00092  {n}
   1673.500000  0.01320  {in}
   3780.500000  0.02982  {iin}
    158.000000  0.00125  {iiin}

    873.500000  0.00689  {m}
     41.000000  0.00032  {im}
     17.000000  0.00013  {iim}
      0.000000  0.00000  {iiim}

    489.500000  0.00386  {ir}
    131.500000  0.00104  {iir}
      0.000000  0.00000  {iiir}

     19.000000  0.00015  {is}
     12.000000  0.00009  {iis}
      0.000000  0.00000  {iiis}
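For concreteness, here is a minimal sketch, in Python, of one way to
do such a parse: longest-first match with backtracking over the
tentative element set.  It is only an illustration; the actual logic
of do_093_elem_stats.sh is not reproduced here and may differ.  Also,
the counts above are fractional (weighted over transcription
versions), not plain token counts.

  # The tentative element set, built from the groups listed above.
  ELEMENTS = sorted(
      ["q", "o", "a", "y", "d", "r", "l", "s",
       "ch", "che", "sh", "she", "ee", "eee", "ih", "ihe",
       "k", "ke", "t", "te", "p", "pe", "f", "fe"]
      # platform gallows {cXh} {cXhh} {cXhe} {cXhhe} and @I variants:
      + [c + g + h for c in "ci" for g in "ktpf"
         for h in ("h", "hh", "he", "hhe")]
      # final groups and i-codas:
      + [i + c for i in ("", "i", "ii", "iii") for c in "nm"]
      + [i + c for i in ("i", "ii", "iii") for c in "rdls"],
      key=len, reverse=True,       # try longer elements first
  )

  def parse(token):
      """Split a token into elements; None = unparseable residue."""
      if token == "":
          return []
      for e in ELEMENTS:
          if token.startswith(e):
              rest = parse(token[len(e):])
              if rest is not None:
                  return [e] + rest
      return None

  print(parse("daiin"))    # ['d', 'a', 'iin']
  print(parse("qokeedy"))  # ['q', 'o', 'k', 'ee', 'd', 'y']
  print(parse("xol"))      # None (contains an invalid char)

Note that plain greedy longest-match is not enough: in @'qokeedy' the
greedy choice {ke} leaves the unparseable remainder @'edy', so the
parser must back up and take {k} {ee} instead.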
Eliminating the I-benches

There are actually three tokens with {ih} in my transcription file,
but one failed to parse for another reason.  Many of the glyphs that
look like @'Ih' had been transcribed as @'Ch', because the
transcribers (me included) saw them as accidentally malformed @'Ch'.

There are 67 tokens with @I-platforms in the parags text, compared to
1982.5 tokens with @C-platforms.  The counts of the former are not
only small compared to those of the corresponding @C versions, but
the ratio for each pair is almost the same, 0.03--0.06.  So it seems
likely that many of those @I platforms are just malformed or misread
@C platforms.  Many of the @I-elements are in fact ambiguous on the
images.

Therefore I decided that keeping the @I-benches and @I-platform
gallows as separate elements was not worth the trouble.  Instead I
went through my transcription and replaced all those glyph groups by
the @C versions.  See note 074.

There is the risk that the @I variants of benches and/or platform
gallows are actually distinct elements, so this "fixing" of the
@I-elements actually created errors.  But the number of such errors
(~3% of all platform gallows) would be small compared to the number
of other errors that are expected to exist.

The long platform gallows

In the parags sections, there are 75 tokens (40 lexemes) with "long
platform gallows" -- elements of the form {cXhh}, where X is a
gallows letter -- compared to 447.5 tokens (174 lexemes) with {cXhe}
elements.

Of these 40 lexemes, 21 are such that, if the @'hh' end is replaced
by @'he', the result is a lexeme that already occurs in the parags
text.  For the most common of these lexemes, the @'hh' version
accounts for 15--25% of the combined frequency of the two versions,
@'hh' and @'he'.

The other 19 lexemes with @'hh' gallows are all hapaxes, except
@'shcphhy', which occurs 3 times.  Five of these 19 cannot be parsed
as elements with the current element set; three of those would become
parsable if the @'hh' were replaced by @'he'.

I am tempted to either (1) exclude the @'hh' elements from the
element set (so that the tokens that use them become part of the
unparseable residue), or (2) map them in the source file to the
corresponding @'he' versions (thus implicitly declaring them errors
and correcting them that way).  Either way the impact on the
statistics will be small.  However, alternative (1) seems more
prudent: the ratio of {cXhh} tokens to {cXhe} tokens (~17%) is rather
large to dismiss them all as transcription errors.

The d-codas

There are only 23 tokens (21 lexemes) whose parsing with the element
set above generates the elements {id}, {iid}, or {iiid}; only three
with the last one.  Only one of those lexemes, @'daiidy', occurs more
than once (7 times); all the others are hapaxes.

In all cases these elements are preceded by an {a} element.  In all
but one of these cases the element {id}, {iid}, or {iiid} is not at
the end of the word, suggesting that these are words that were run
together, with loss of the @n that should have followed the @i
string.  These may also be cases where an @n was wrongly written or
retraced as @d.

It does not seem worth keeping these elements in the set.  Let them
be relegated to the unparsable residue.

The l-codas

There are 46 tokens (49 lexemes) with the coda elements {il}, {iil},
or {iiil}; only two with the last one.  The lexemes that occur more
than once are @'ail' (4.5 tokens), @'aildy' (2), @'dail' (2), and
@'sail' (1.5).  All the other lexemes with these elements occur only
once or less.

In practically all cases these elements are preceded by an {a}
element.  In 19 tokens (25 lexemes) the element {il}, {iil}, or
{iiil} does not occur at the end of the word.  Thus these may be
cases of words run together.

I am tempted to exclude all these @l-codas from the element set.
Capturing those 46 tokens does not seem worth the hassle of including
those elements.
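As a sketch of the positional diagnostic used in these sections, the
fragment below counts, for each coda element, how often it occurs
word-finally versus word-internally.  It reuses the parse() function
from the earlier sketch; the sample tokens are lexemes quoted in this
note, and the counts are plain (unweighted), unlike the fractional
counts in the text.

  from collections import Counter

  CODAS = ("id", "iid", "iiid", "il", "iil", "iiil", "is", "iis", "iiis")

  def coda_positions(tokens):
      """Count word-final vs word-internal occurrences of each coda."""
      final, internal = Counter(), Counter()
      for t in tokens:
          elems = parse(t)      # parse() from the earlier sketch
          if elems is None:
              continue          # skip the unparseable residue
          for i, e in enumerate(elems):
              if e in CODAS:
                  tgt = final if i == len(elems) - 1 else internal
                  tgt[e] += 1
      return final, internal

  fin, inn = coda_positions(["dail", "aildy", "daiidy", "ais"])
  print(fin)  # Counter({'il': 1, 'is': 1})
  print(inn)  # Counter({'il': 1, 'iid': 1})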
The s-codas

The elements {is} and {iis} occur in 30 tokens (43 lexemes) of the
parags text.  The only lexeme that occurs more than once (twice only)
is @'ais'.  All the other lexemes with {is} or {iis} occur only once
or less.

Of these 30 tokens, 12.5 (18 lexemes) have other elements after the
{is}, {iis}, or {iiis}.  Thus they may be cases of words run
together.

Many of these elements may have been instances of {ir} and {iir} that
were badly written and/or incorrectly transcribed.  The glyphs @r and
@s are often written with ambiguous shapes that can be read either
way.  It seems best to remove these elements from the element set
too.
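To gauge the combined effect of these exclusions, one could re-parse
the tokens with a reduced element set and compare the parseable
fractions, along the lines of the sketch below.  The DROPPED list is
only illustrative, not the final element set (in particular, the @I
benches listed in it were remapped to @C in the transcription rather
than dropped); ELEMENTS is from the first sketch, and the four sample
tokens are a tiny made-up mix of parseable and residue words.

  DROPPED = set(
      # long platform gallows {cXhh} {cXhhe}, @C and @I versions:
      [c + g + "hh" + e for c in "ci" for g in "ktpf" for e in ("", "e")]
      + ["ih", "ihe"]                                       # @I benches
      + [i + c for i in ("i", "ii", "iii") for c in "dls"]  # d-, l-, s-codas
  )
  REDUCED = [e for e in ELEMENTS if e not in DROPPED]

  def parse_with(token, elems):
      """Like parse() above, but over an arbitrary element list."""
      if token == "":
          return []
      for e in elems:
          if token.startswith(e):
              rest = parse_with(token[len(e):], elems)
              if rest is not None:
                  return [e] + rest
      return None

  toks = ["daiin", "dail", "qokeedy", "okais"]
  for elems in (ELEMENTS, REDUCED):
      ok = sum(1 for t in toks if parse_with(t, elems) is not None)
      print(ok / len(toks))  # 1.0 with the full set, 0.5 with the reduced set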