Hacking at the Voynich manuscript - Side notes
025 Classifying OKOKOKO elements as word-initial, -medial, and -final
Last edited on 1999-02-05 05:31:28 by stolfi
Word and line breaks in the OKOKOKO model
-----------------------------------------
In Notes/017 we saw that (practically) every Voynichese word
can be parsed into the paradigm QOKOKO...KO where
Q, O, K are certain sets of letters and letter clusters.
It is instructive to analyze the immediate contexts of definite word
spaces (std), breaks due to figures in the text (fig),
intra-paragraph line breaks (lin), and adjacent element pairs inside
words, i.e. non-breaks (non), in terms of this paradigm.
The source text
---------------
For this study we will use the majority-vote and consensus
transcriptions, which include Takeshi's new full transcription. For
simplicity, let's discard all data containing weirdos, extra plumes,
unreadable characters, or the rare letters [buvxz]. Let's also map
the upper case EVA letters [SCIKTPF] to their lower case variants,
since the capitalization carries no information in those cases.
  foreach vt ( m.A c.Y )
    set v = ${vt:r}; set t = ${vt:e}
    cat ../045/only-${v}.evt \
      | egrep -e '^<[^<>]*;'"$t"'>' \
      | tr 'SCIKTPF' 'sciktpf' \
      | tr -d '\!' \
      | sed \
          -e 's/^<[^<>]*> *//g' \
          -e 's/[{][^{}]*[}]//g' \
          -e 's/[&][0-9*?][0-9*?]*[;]\?/*/g' \
          -e 's/[buxvz]/*/g' \
          -e 's/[.,]*-[-.,]*/-/g' \
          -e 's/[,]*[.][,.]*/./g' \
          -e 's/[,][,]*/,/g' \
          -e 's/.['"'"'"]/?/g' \
          -e 's/[^-,./= ]*[%?*][^-,./= ]*/?/g' \
      > base-${v}.txt
  end
Let's now separate the text into the OKOKOKO elements.
We delete empty elements and put {} around the O strings too:
  foreach v ( m c )
    cat base-${v}.txt \
      | factor-field-OK \
          -v inField=1 \
          -v erase=1 \
          -v outField=1 \
      | sed \
          -e 's/{_}//g' \
          -e 's/_//g' \
          -e 's/\([aoy][aoy]*\)/{\1}/g' \
      > base-${v}.elt
  end
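The last sed expression, which puts braces around the O strings, is easy to reproduce on its own. The sketch below assumes (this is not stated above) that factor-field-OK emits the K elements already in braces with the O strings bare between them; the function name `bracket_o_runs` is hypothetical.

```python
import re

def bracket_o_runs(text: str) -> str:
    # Same effect as sed 's/\([aoy][aoy]*\)/{\1}/g':
    # every maximal run of a, o, y becomes one braced element.
    return re.sub(r"[aoy]+", lambda m: "{" + m.group(0) + "}", text)

# Assumed intermediate form of the word "qokeedy".
print(bracket_o_runs("{q}o{k}{ee}{d}y"))   # {q}{o}{k}{ee}{d}{y}
```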
Besides these elements, we will use the following reduced alphabets:
the coarse set "clt"
O = <[aoy]+>
Q = [q]
I = [i]+
R = [djmg] and [rlsn]
X = <ee>, <ch>, <sh>, <ih>, <se>, [ci][ktpf][h], [ci][ktpf], [ktpf]
E = [ehc] not included in the X letters
We consider also the finer set "flt" where the R and X sets get split
further as follows
S = <ee>, <ch>, <sh>, <ih>, <se>
H = [ktpf]
G = [ci][ktpf][h], [c][ktpf]
L = [rlsn]
D = [djmg]
Finally we should consider the "simplified" elements
where possible errors and calligraphic variants are mapped to
likely "correct" versions:
{p} -> {t} (also in composites)
{f} -> {k} (also in composites)
{g} -> {m}
{j} -> {d}
{iXh} -> {cXh}
{cXhh} -> {cXhe}
{iXhh} -> {cXhe}
{iid} -> {ii} {d}
etc.
Let's call these the "simple" elements (slt).
The conversion is done by the sed scripts elt2clt, elt2flt, elt2slt.
The script elt2elt, a no-op, is also provided for uniformity.
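To make the coarse "clt" definitions concrete, here is a minimal Python sketch that assigns a single element to one of the classes O Q I R X E, following the list given earlier. It is not a transcription of the elt2clt sed script, whose contents are not shown here; the name `coarse_class` is hypothetical, and compound elements such as `che` or `iin`, which the real script presumably splits first, are simply reported as unclassified.

```python
import re

# Coarse "clt" classes as listed above. Order matters: E is defined as
# the [ehc] letters that are not part of an X element, so it comes last.
CLT_PATTERNS = [
    ("O", r"[aoy]+"),
    ("Q", r"q"),
    ("I", r"i+"),
    ("R", r"[djmgrlsn]"),
    ("X", r"ee|ch|sh|ih|se|[ci][ktpf]h|[ci][ktpf]|[ktpf]"),
    ("E", r"[ehc]"),
]

def coarse_class(element):
    """Return the class letter of `element`, or None if it matches none."""
    for name, pattern in CLT_PATTERNS:
        if re.fullmatch(pattern, element):
            return name
    return None

for e in ["oy", "q", "ii", "m", "cth", "e", "che"]:
    print(e, coarse_class(e))
```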
Converting:
  foreach v ( m c )
    foreach map ( clt flt slt )
      echo "map = ${map}"
      cat base-${v}.elt \
        | elt2${map} \
        > base-${v}.${map}
    end
  end
Checking the completeness of the conversion:
  foreach v ( m c )
    foreach ma ( elt.a-z clt.A-Z flt.A-Z slt.a-z )
      set map = "${ma:r}"; set alf = "${ma:e}"
      echo "map = ${map} alf = ${alf}"
      cat base-${v}.${map} \
        | egrep '[{][^{}]*[^{}'"${alf}"'?*][^{}]*[}]' \
        > .bugs-${v}.${map}
      cat base-${v}.${map} \
        | egrep '(^|[}])[^{}]*[^-,.=/ {}][^{}]*([{]|$)' \
        >> .bugs-${v}.${map}
      head -10 .bugs-${v}.${map}
    end
  end
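The two egrep tests above look for (a) braced elements that contain a character outside the expected alphabet and (b) stray text outside the braces that is not a separator. A rough Python equivalent, with hypothetical function names and operating on one line at a time, might look like this:

```python
import re
import string

def bad_elements(line, alphabet):
    """Return the {...} elements of `line` that use a character outside
    `alphabet` (the fillers '?' and '*' are always accepted)."""
    allowed = set(alphabet) | {"?", "*"}
    return [e for e in re.findall(r"\{([^{}]*)\}", line)
            if not set(e) <= allowed]

def stray_text(line):
    """Return whatever is left outside the braces once the separators
    '-', ',', '.', '=', '/' and blanks are removed."""
    outside = re.sub(r"\{[^{}]*\}", "", line)
    return re.sub(r"[-,.=/ ]", "", outside)

print(bad_elements("{O}{R}.{x}{O}", string.ascii_uppercase))  # ['x']
print(stray_text("{O}{R}.k{O}-{X}"))                          # k
```

An empty result from both functions on every line corresponds to an empty .bugs file.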
Element frequencies
-------------------
Computing the element frequencies:
  foreach v ( m c )
    foreach map ( elt clt flt slt )
      cat base-${v}.${map} \
        | tr '{}' '\012\012' \
        | egrep '[_A-Za-z?*%]' \
        | sort | uniq -c | expand \
        | sort -k1,1nr \
        > ${map}-${v}.frq
      cat ${map}-${v}.frq \
        | gawk '/./{print $2;}' \
        | sort \
        > ${map}-${v}.dic
    end
  end
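The same tally can be sketched in a few lines of Python, which may be handy for checking the .frq files independently. The function name `element_frequencies` is hypothetical, and the sketch assumes the input lines are already in the braced element form produced above.

```python
import re
from collections import Counter

def element_frequencies(lines):
    """Count every non-empty {...} element and return (element, count)
    pairs in order of decreasing count."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"\{([^{}]+)\}", line))
    return counts.most_common()

sample = ["{o}{k}{o}.{d}{y}", "{o}{l}"]
for element, count in element_frequencies(sample):
    print(count, element)
```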
Element frequencies
  Majority version:                                 Consensus version:
  count clt   count flt   count slt   count elt     count clt   count flt   count slt   count elt
  ----- ----  ----- ----  ----- ----  ----- ----    ----- ----  ----- ----  ----- ----  ----- ----
  53662 O     53662 O     23747 o     23193 o       45290 O     45290 O     19966 o     19593 o
  38745 R     25157 L     16878 y     16681 y       32418 R     20716 L     14831 y     14702 y
  38037 X     19283 S     13568 a     13260 a       32223 X     16632 S     10891 d     10862 d
   9927 E     16647 H     12490 d     12440 d        8506 E     13888 H     10856 a     10635 a
   6351 I     13585 D     10463 ch    10019 l        8052 ?     11702 D      9094 ch     8745 l
   5138 Q      9927 E     10075 l      7683 k        4766 I      8506 E      8771 l      8052 ?
   2487 ?      6367 I      9919 e      6392 r        4451 Q      8052 ?      8503 e      6314 k
               5138 Q      9756 k      6383 ch                   4774 I      8052 ?      5491 ch
               2487 ?      7114 r      5138 q                    4451 Q      8042 k      5400 r
               2110 G      6891 t      4570 t                    1703 G      5951 r      4451 q
                           5580 n      4103 ee                               5846 t      3870 t
                           5138 q      4069 che                              4451 q      3660 iin
                           4443 ee     4017 iin                              4215 n      3599 che
                           4385 sh     2487 ?                                3782 ee     3535 ee
                           4204 ii     2369 s                                3759 sh     1964 sh
                           2487 ?      2308 sh                               3758 ii     1774 she
                           2380 s      2029 she                              1776 s      1772 s
                           2058 i      1690 ke                                981 i      1459 ke
                           1138 cth    1326 in                                936 cth    1103 p
                           1095 m      1316 p                                 811 m       864 te
                            972 ckh     996 m                                 767 ckh     756 m
                            106 iii     993 te                                 35 iii     630 cth
                                        730 cth                                           541 ckh
                                        654 ckh                                           471 ir
                                        582 ir                                            433 in
                                        371 f                                             266 f
                                        340 eee                                           247 eee
                                        260 oa                                            194 oa
                                        231 e                                             181 ckhe
                                        229 ckhe                                          145 cthe
                                        185 cthe                                          135 e
                                        145 cph                                           112 cph
                                        133 n                                              88 oy
                                        ... ...                                            87 n
                                                                                          ... ...
Observe that the frequency of {?} increased in the consensus version,
while all other counts decreased. Note that the counts of {r} and
{s} decreased even relative to the other letters. Otherwise the
differences are minimal.
Plotting the histograms
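  # The here-document body and its EOF terminator are left in column 1,
  # since csh looks for a line consisting of exactly the word EOF.
  foreach v ( m c )
    foreach map ( clt flt slt elt )
      gnuplot <<EOF
set term x11
plot "${map}-${v}.frq" using :1 with steps
pause 120
quit
EOF
    end
  end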
Counting reduced character pairs
--------------------------------
Now let's count the reduced letter pairs inside words and around
all three kinds of word breaks:
  foreach v ( m c )
    foreach map ( clt flt elt slt )
      echo "map = ${map} v = ${v}"

      echo "non-breaks ..."
      cat base-${v}.${map} \
        | sed \
            -e 's/\({[^{}]*}\)/\1@\1\!/g' \
            -e 's/[\!:]*[-=:., ][-\!=:., ]*/@/g' \
        | tr '@' '\012' \
        | egrep -e '^{[^{}]*}\!{[^{}]*}$' \
        | compute-pair-freqs \
        > non-${map}-${v}.frq

      echo "simple word breaks ..."
      cat base-${v}.${map} \
        | sed \
            -e 's/[.]\({[^{}]*}\)[.]/.\1\1./g' \
            -e 's/\({[^{}]*}\)[.]\({[^{}]*}\)/@\1.\2@/g' \
        | tr '@' '\012' \
        | egrep -e '^{[^{}]*}[.]{[^{}]*}$' \
        | compute-pair-freqs \
        > std-${map}-${v}.frq

      echo "figure breaks ..."
      cat base-${v}.${map} \
        | sed \
            -e 's/-\({[^{}]*}\)-/-\1\1-/g' \
            -e 's/\({[^{}]*}\)-\({[^{}]*}\)/@\1-\2@/g' \
        | tr '@' '\012' \
        | egrep -e '^{[^{}]*}-{[^{}]*}$' \
        | compute-pair-freqs \
        > fig-${map}-${v}.frq

      echo "line breaks ..."
      cat base-${v}.${map} \
        | sed \
            -e 's/^[-=., ]*\({[^{}]*}\)[-=., ]*$/\1\1/g' \
            -e 's/^[-=., ]*\({[^{}]*}\)/\1@/g' \
            -e 's/\({[^{}]*}\)[-., ]*$/@\1\//g' \
        | tr -d '\012' \
        | tr '@' '\012' \
        | egrep -e '^{[^{}]*}[/]{[^{}]*}$' \
        | compute-pair-freqs \
        > lin-${map}-${v}.frq
    end
  end
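The idea behind the sed scripts (pick out every pair of braced elements that is separated by a given break marker) can be sketched in Python with a lookahead, so that an element that both ends one pair and starts the next is counted in both. This is an illustration under assumptions, not a replacement for compute-pair-freqs: the name `pairs_around` is hypothetical, and only the separators `.` (ordinary space), `-` (figure break) and the empty string (no break) are handled; pairs across line breaks would need the lines to be joined first.

```python
import re
from collections import Counter

def pairs_around(text, sep):
    """Count (before, after) element pairs separated by exactly `sep`.
    Use sep="" for adjacent elements inside a word."""
    pattern = (r"\{([^{}]*)\}" + re.escape(sep) +
               r"(?=\{([^{}]*)\})")   # lookahead keeps the 2nd element unconsumed
    return Counter(m.groups() for m in re.finditer(pattern, text))

line = "{d}{y}.{o}.{q}{o}-{l}"
print(pairs_around(line, "."))   # pairs (y,o) and (o,q)
print(pairs_around(line, "-"))   # pair (o,l)
print(pairs_around(line, ""))    # pairs (d,y) and (q,o)
```

The one-element word `{o}` contributes to both `.` pairs, which is what the element-doubling sed expression in the csh version is for.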
The files {clt,flt,slt,elt}.dic were created from the "-m" versions,
with the elements reordered (with hindsight) into the important classes.
Format element pair frequencies in matrix form:
  foreach v ( m c )
    foreach map ( clt flt slt )
      foreach brk ( lin fig std non )
        echo "map = ${map} brk = ${brk}"
        cat ${brk}-${map}-${v}.frq \
          | gawk '/./{s=$3; gsub(/[\!:./]/, " ", s); print $1,s;}' \
          | count-diword-freqs \
              -v rows=${map}.dic -v cols=${map}.dic \
              -v counted=1 -v digits=5 \
          > ${map}-${brk}-${v}.dwtbl
      end
    end
  end
Analysis in terms of the Q O I R X E classes
--------------------------------------------
Here are the counts for the mapping to the coarse { Q O I R X E } classes:
Pairs around ordinary word breaks (std):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
    TOT Q O R X E I      Q O R X E I
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
Q 0 . . . . . . Q . . . . . .
O 12535 3287 2786 2748 3703 11 . O 26 22 21 29 . .
R 15134 1068 5797 1879 6366 22 2 R 7 38 12 42 . .
X 174 10 59 19 85 1 . X 5 33 10 48 . .
E 44 5 16 8 15 . . E 11 36 18 34 . .
I 5 . 2 2 1 . . I . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
TOT 27892 4370 8660 4656 10170 34 2 TOT 15 31 16 36 . .
Pairs around figure breaks (fig):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
    TOT Q O R X E I      Q O R X E I
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
Q 0 . . . . . . Q . . . . . .
O 360 16 134 107 100 3 . O 4 37 29 27 . .
R 382 9 168 98 107 . . R 2 43 25 28 . .
X 19 . 4 6 9 . . X . 21 31 47 . .
E 1 . . 1 . . . E . . . . . .
I 0 . . . . . . I . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
TOT 762 25 306 212 216 3 . TOT 3 40 27 28 . .
Pairs around line breaks (lin):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
    TOT Q O R X E I      Q O R X E I
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
Q 1 . . 1 . . . Q . . . . . .
O 1244 199 377 415 248 5 . O 15 30 33 19 . .
R 1828 262 636 594 334 2 . R 14 34 32 18 . .
X 28 2 18 5 2 1 . X 7 64 17 7 3 .
E 2 . 1 1 . . . E . . . . . .
I 2 1 1 . . . . I . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- --
TOT 3105 464 1033 1016 584 8 . TOT 14 33 32 18 . .
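For reading the right-hand halves of these tables: each percentage is the count divided by the total of its row. Judging from the numbers, the values appear to be truncated rather than rounded (for example 2748/12535 = 21.9% is printed as 21), zeros are printed as ".", and rows with very small totals are left blank. The following Python sketch reproduces one row under that reading; the function name `row_percentages` is hypothetical.

```python
def row_percentages(counts):
    """Truncated integer percentage of each count relative to the row total."""
    total = sum(counts)
    if total == 0:
        return [0] * len(counts)
    return [100 * c // total for c in counts]

# Row "O" of the std table, columns Q O R X E I (row total 12535).
std_after_O = [3287, 2786, 2748, 3703, 11, 0]
print(row_percentages(std_after_O))   # [26, 22, 21, 29, 0, 0]
```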
So we can say that
(1) Line breaks occur almost only between { O R } and { Q O R X }
(with frequencies ranging from 6% to 20% of all line breaks);
rarely between X and { Q O R X }
(less than 0.9% of all line breaks);
and essentially never after { Q I E } or before { I E }
(less than 0.4% of all line breaks).
(2) Ordinary word breaks follow the same pattern:
the pairs between { O R } and { Q O R X }
have frequencies between 3.8% and 22%;
pairs between X and { Q O R X } have total
frequency of 0.6%; and all the remaining pairs
account for only 0.3% of the ordinary word breaks.
(3) Figure breaks too follow almost the same pattern:
the pairs between { O R } and { O R X }
have frequencies ranging from 12% to 22%,
but the pairs { R-Q and O-Q } are much rarer
than around line breaks and ordinary spaces,
about 1--2% each. Breaks between X and { Q O R X }
are slightly more common (2.5% total) and
all other pairs are almost absent (about 0.5%).
(4) The relative frequencies of { Q O R X }
are approximately 1:2:2:1 after a line break,
and 0:3:2:2 after a figure break, roughly
independently of the character before the break.
(5) The relative frequencies of { Q O R X }
after ordinary word breaks seem to depend on the
preceding letter: 1:1:1:1 after O, 1:4:1:4 after R.
However they are still of the same order of magnitude.
(6) Inside words, the valid pairs are
{ QO, OX, OI, IX, XX, XE, XO, EX, EO }
with frequencies ranging from 4.1% to 27%.
The remaining pairs have much lower frequencies
(OO accounts for 0.46% of all pairs, and OE
for only 0.13%).
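The percentages quoted in points (1) and (2) are shares of all breaks of the given kind, not per-row percentages. As a worked check, this Python sketch recomputes the figures for line breaks from the counts in the "lin" table; the helper name `share` is hypothetical.

```python
# Counts copied from the "lin" table above: rows are the element
# before the break, columns the element after it (Q O R X).
lin_total = 3105
lin = {
    "O": [199, 377, 415, 248],
    "R": [262, 636, 594, 334],
    "X": [2, 18, 5, 2],
}

def share(count, total):
    """Percentage of `total`, rounded to two decimals."""
    return round(100.0 * count / total, 2)

main = [c for row in ("O", "R") for c in lin[row]]
print(share(min(main), lin_total), share(max(main), lin_total))  # 6.41 20.48
print(share(sum(lin["X"]), lin_total))                           # 0.87
```

The same computation on the "std" table gives 3.83% to 22.82% for the main pairs and 0.62% for the pairs that start with X.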
These observations seem to imply that the "word spaces", line
breaks, and figure breaks are fairly similar when compared to
all inter-character pairs.
Their similarity, and the relative independence of the second letter
from the first, strongly suggest that those breaks are indeed word
boundaries. In that case we conclude that Voynichese words may end
in O or R (40-45% and 50-60%, respectively) or rarely X; and may
begin only with Q, O, R, or X.
(A more detailed analysis would show that the O at end of words is
almost always <y>. Also the last R in a line is most often EVA <m>,
which is only rarely seen at the other kinds of word breaks.)
Point (3) shows that figure breaks are more like line and word
breaks than like random inter-character breaks.
The main difference between line breaks and figure breaks is that
the probability of finding a Q is much higher after a line break (14%)
than after a figure break (3%).
The main difference between line and figure breaks on one side, and
the ordinary word breaks on the other, is that the probability of
the letter after an ordinary word break is visibly dependent on the
letter before the break. The difference can be described as an
enhancement of O.Q pairs at the expense of O.O pairs; and an
enhancement of R.O and R.X pairs at the expense of R.R pairs.
Distribution (in percent) of the letters after a break, depending on
the previous one, ignoring pairs that end in Q:

          after R:          after O:
          O   R   X         O   R   X
         --  --  --        --  --  --
  lin    40  37  21        36  39  23
  fig    45  26  28        39  31  29
  std    41  13  45        30  29  40
  non    78   4  16         .  48  34
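The lin, fig and std rows of this small table can be recovered from the coarse pair tables by dropping the Q column and renormalizing over O, R and X only. A minimal Python sketch follows, with a hypothetical function name and the same truncation to whole percentages that the printed numbers seem to use:

```python
def renormalize(counts, keep):
    """Truncated percentages of the entries named in `keep`,
    relative to their own sum."""
    kept = {k: counts[k] for k in keep}
    total = sum(kept.values())
    return {k: 100 * v // total for k, v in kept.items()}

# Row "R" of the lin table and row "O" of the std table.
lin_after_R = {"Q": 262, "O": 636, "R": 594, "X": 334}
std_after_O = {"Q": 3287, "O": 2786, "R": 2748, "X": 3703}

print(renormalize(lin_after_R, ["O", "R", "X"]))   # {'O': 40, 'R': 37, 'X': 21}
print(renormalize(std_after_O, ["O", "R", "X"]))   # {'O': 30, 'R': 29, 'X': 40}
```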
Point (6) is a partial restatement of the QOKOKOKO paradigm.
Note that the pairs { QO OI IX XE EX EO }, which are fairly
common inside words, are not legal places for word spaces,
line, or figure breaks.
Unfortunately this data does not shed much light on whether each O
(or E) is attached to the preceding X, the following X, or sometimes
both, or neither. There are (practically) no figure breaks adjacent
to an E.
Analysis for the Q O I G H S L D E classes
------------------------------------------
Here are the counts for the mapping to the finer classes { Q O I G H S L D E }:
Pairs around ordinary word breaks (std):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
    TOT Q O L D S H G E I      Q O L D S H G
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
Q 0 . . . . . . . . . Q . . . . . . .
O 12535 3287 2786 1493 1255 2359 1050 294 11 . O 26 22 11 10 18 8 2
L 14499 957 5570 604 1186 5127 568 465 21 1 L 6 38 4 8 35 3 3
D 635 111 227 51 38 183 13 10 1 1 D 17 35 8 5 28 2 1
S 57 5 23 4 5 7 12 1 . . S 8 40 7 8 12 21 1
H 112 5 34 4 4 63 1 . 1 . H 4 30 3 3 56 . .
G 5 . 2 1 1 1 . . . . G . . . . . . .
E 44 5 16 2 6 5 9 1 . . E 11 36 4 13 11 20 2
I 5 . 2 1 1 1 . . . . I . . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
TOT 27892 4370 8660 2160 2496 7746 1653 771 34 2 TOT 15 31 7 8 27 5 2
Pairs around figure breaks (fig):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
    TOT Q O L D S H G E I      Q O L D S H G
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
Q 0 . . . . . . . . . Q . . . . . . .
O 360 16 134 38 69 85 1 14 3 . O 4 37 10 19 23 . 3
L 315 9 133 28 53 86 . 6 . . L 2 42 8 16 27 . 1
D 67 . 35 8 9 12 3 . . . D . 52 11 13 17 4 .
S 2 . . . . 2 . . . . S . . . . . . .
H 9 . 2 2 2 3 . . . . H . . . . . . .
G 8 . 2 1 1 4 . . . . G . . . . . . .
E 1 . . . 1 . . . . . E . . . . . . .
I 0 . . . . . . . . . I . . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
TOT 762 25 306 77 135 192 4 20 3 . TOT 3 40 10 17 25 . 2
Pairs around line breaks (lin):
absolute counts per-row percentages
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
    TOT Q O L D S H G E I      Q O L D S H G
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
Q 1 . . 1 . . . . . . Q . . . . . . .
O 1244 199 377 172 243 124 123 1 5 . O 15 30 13 19 9 9 .
L 1160 187 385 165 206 121 89 5 2 . L 16 33 14 17 10 7 .
D 668 75 251 99 124 58 59 2 . . D 11 37 14 18 8 8 .
S 5 . 5 . . . . . . . S . . . . . . .
H 20 1 11 3 2 . 2 . 1 . H 4 54 14 9 . 9 .
G 3 1 2 . . . . . . . G . . . . . . .
E 2 . 1 1 . . . . . . E . . . . . . .
I 2 1 1 . . . . . . . I . . . . . . .
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- --- -- -- -- -- -- -- --
TOT 3105 464 1033 441 575 303 273 8 8 . TOT 14 33 14 18 9 8 .
These numbers can be summarized as follows: line breaks occur only
between { O L D } and { Q O L D S H }, and practically never after
{ Q S H G E I } or before { G E I }. The distribution of the
first letter of the line does not depend much on the last
letter of the previous line.
Ordinary word breaks have almost the same distribution, except that
they may also occur before G. Figure breaks are even more extreme in
that they occur before G but hardly ever before H.
Also the distribution of the letter after a word break depends
significantly on the letter before the break. In particular the
pairs L.S and D.S are far more common around word breaks than they
are around line breaks.
Analysis for the "corrected" elements
-------------------------------------
In tabular form:
Pairs around ordinary word breaks (std):
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT q a o y r l s n d m ch sh ee k t ckh cth
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
a 29 3 2 5 1 5 3 . . 4 . 3 . . 2 . 1 .
o 836 75 32 80 22 59 108 31 . 120 1 98 42 1 78 37 17 33
y 11670 3209 103 2250 291 228 726 332 1 1128 2 1519 689 7 501 432 61 182
r 4520 232 757 1205 188 15 48 58 . 218 3 965 594 8 54 33 39 94
l 4688 371 150 908 113 56 168 124 1 602 1 1049 600 15 294 118 26 85
s 763 28 199 204 44 2 13 9 . 16 1 136 69 3 6 6 12 14
n 4528 326 209 1383 210 11 30 69 . 342 3 1116 571 1 31 26 48 147
d 371 88 29 81 20 5 27 5 1 19 . 51 32 . 6 2 2 2
m 264 23 10 76 11 . 2 11 . 19 . 69 31 . 5 . . 6
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
TOT 27892 4370 1509 6241 910 383 1131 643 3 2485 11 5054 2656 36 996 657 206 565
Pairs around figure breaks (fig):
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT q a o y r l s n d m ch sh ee k t ckh cth
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
a 0 . . . . . . . . . . . . . . . . .
o 19 2 . 2 4 1 3 . . 1 . 2 2 . . 1 . 1
y 341 14 4 77 47 . 2 32 . 68 . 58 22 1 . . 2 11
r 56 3 2 11 10 . . 4 . 10 . 12 4 . . . . .
l 103 4 3 25 16 2 1 9 . 19 . 12 9 . . . . 3
s 55 . 5 16 5 . . 3 . 10 . 11 5 . . . . .
n 101 2 . 25 15 . . 9 . 14 . 21 12 . . . . 3
d 45 . 2 10 9 2 . 4 . 9 . 4 5 . . . . .
m 22 . . 11 3 . . 2 . . . 2 1 . 2 1 . .
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
TOT 762 25 17 179 110 7 6 64 . 133 2 130 61 1 2 2 2 18
Pairs around line breaks (lin):
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
    TOT q a o y r l s n d m ch sh ee k t ckh cth
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
a 7 4 . 1 1 . . . . 1 . . . . . . . .
o 47 8 1 9 6 2 3 6 . 6 . 2 1 . 1 2 . .
y 1190 187 2 168 189 2 25 134 . 236 . 59 62 . 12 108 . 1
r 276 41 1 55 48 3 6 19 . 49 . 16 17 . 2 18 . 1
l 364 57 . 54 59 2 15 53 . 70 . 10 16 . 6 21 . 1
s 109 21 . 19 18 1 2 10 . 14 . 3 8 . . 12 . .
n 411 68 1 66 64 1 4 49 . 73 . 16 35 . 5 25 . 3
d 108 19 1 21 22 1 4 8 . 17 . 6 3 . 1 4 . 1
m 560 56 3 89 115 1 8 77 . 107 . 17 31 1 2 52 . 1
--- ------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
TOT 3105 464 9 489 535 14 67 360 . 575 . 129 173 1 30 243 . 8
Percentage distribution of letters after each type of break:
--- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    q a o y r l s n d m ch sh ee k t ckh cth
--- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
non . 10 13 12 5 7 5 4 7 . 4 1 3 7 4 . .
std 15 5 22 3 1 4 2 . 8 . 18 9 . 3 2 . 2
fig 3 2 23 14 . . 8 . 17 . 17 8 . . . . 2
lin 14 . 15 17 . 2 11 . 18 . 4 5 . . 7 . .
Percentage distribution of letters before each type of break:
--- ---- ---- ---- ----
     non  std  fig  lin
--- ---- ---- ---- ----
q 4 0 0 0
a 11 0 0 0
o 19 2 2 1
y 1 41 44 38
r 1 16 7 8
l 3 16 13 11
s 1 2 7 3
n 0 16 13 13
d 10 1 5 3
m 0 0 2 18
ch 8 0 0 0
sh 3 0 0 0
ee 3 0 0 0
k 8 0 0 0
t 5 0 0 0
ckh 0 0 0 0
cth 0 0 0 0
e 8 0 0 0
i 1 0 0 0
ii 3 0 0 0
iii 0 0 0 0
--- ---- ---- ---- ----
From the analysis with "flt" classes, we would expect the following pairs to occur:
{ a o y r l s n d m } and { q a o y r l s n d m ch sh ee k t ckh cth }
Notable absences in all breaks:
  */m - the ratio d:m is 10:1 but the ratio */d : */m is over 200:1
  */n - compared to */r, */l, */s and considering the element frequencies.
        Of course that is because <n> only occurs after <i> whereas
        <r> <l> <s> also occur in other contexts.
  */ee - the ratio ch:sh:ee is 2:1:1 but */ch : */sh : */ee is 2:1:0
  a/*, o/* - while a:o:y is 3:6:4, a/*:o/*:y/* is 0:10:150 in ordinary breaks,
        and similarly absent in other breaks.
Notable absences in line and figure breaks:
  */a - ratio */a:*/o:*/y is 2:8:1 in std, 1:50:50 in lin, while a:o:y is 3:6:4
  */r - 1.3% of std, 1% of fig, 0.4% of lin. Also, the
        ratio r:s is 3:1 but */r:*/s is 1:2 in std, 1:25 in lin.
  */k - 3.5% of std, 0.3% of fig, 1% of lin
  */l - 4% of std, 1% of fig, 2% of lin
Notable absences in line breaks:
  */ckh, */cth - 2.7% of std, 2.6% of fig, 0.2% of lin
Notable absences in figure breaks:
  */q - 15% of std, 3% of fig, 15% of lin
  */k, */t - 6% of std, 0.5% of fig, 9% of lin (seen in flt)
  s/r, s/s, d/r, d/s
Notable anomalies:
  */s - 2% of std, 8% of fig, 12% of lin.
This said, the significant pairs found around line breaks
are mostly between
{ o y r l s n d m } and { q o y l s d ch sh t }
where o/* pairs are actually quite rare.
Pairs with second element <t> or <s> are in fact more common around
line breaks than around word breaks. Thus it may be that such
word breaks are preferentially omitted in transcription.
Pairs with first element <m> have the same discrepancy, but there
the likely explanation is that <m> is an abbreviation or a
calligraphic variant that is specifically used at end of line.
(Since <m> also occurs before figure breaks, the abbreviation
theory seems more likely.)