Hacking at the Voynich manuscript - Side notes
045 Computing consensus/majority editions of the EVA interlinear (OBSOLETE)

Last edited on 2025-12-08 19:29:56 by stolfi

The goal of this note is to condense the various transcriptions
present in the EVA interlinear into some sort of "consensus" or
"majority" version.

OBSOLETE - Superseded by Notes/074.

SETUP

  ln -s ../.. work
  ln -s work/fnum-to-subsec.tbl

JOINING THE INTERLINEAR INTO A SINGLE FILE

Making a list of all text units:

  ln -s ../../L16+H-eva

  cat L16+H-eva/INDEX \
    | gawk -v FS=':' '/./{print $2;}' \
    > .all.units

  set units = ( `cat .all.units` )

Safety check:

  ( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
  cat .all.units | sort > .bar
  diff .foo .bar

Concatenating all units, with basic uncapitalized EVA:

  ( cd L16+H-eva && cat ${units} ) \
    > inter.evt

Checking validity and synchronism:

  cat inter.evt \
    | validate-new-evt-format \
        -v checkTerminators=1 \
        -v checkLineLengths=1 \
    >& inter.bugs

EXTRACTING THE TUPLES OF VARIANT READINGS

Next we extract from the interlinear a list of all "reading tuples",
one for each character position in the text.  Each tuple is a string
of 26 characters, the readings of that character position in each of
the 26 potential variants.  Whenever a particular variant does not
cover a particular character position, the corresponding tuple element
is set to "%".  (Note that "%" is presently used in the interlinear
itself to mark lines or parts of lines that were skipped by a
particular transcriber.)
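The tuple-extraction idea can be pictured with a small sketch.  This
is a Python illustration only (the real extract-reading-tuples is a
gawk script), and the transcriber codes and readings in the example
are invented:

```python
def reading_tuples(readings, codes):
    """Build one reading tuple per character position.

    readings: dict mapping a transcriber code to that transcriber's
    reading of one (aligned) text line; codes: the fixed order of the
    tuple slots.  Positions a transcriber did not cover are filled
    with "%", as described in the note above.
    """
    width = max(len(r) for r in readings.values())
    # pad short or absent variants with "%"
    rows = [readings.get(c, "").ljust(width, "%") for c in codes]
    # one tuple (column of readings) per character position
    return ["".join(row[i] for row in rows) for i in range(width)]

# toy example with three invented transcriber codes C, F, U
print(reading_tuples({"C": "okoro", "F": "okcro", "U": "ok%ro"}, "CFU"))
# -> ['ooo', 'kkk', 'oc%', 'rrr', 'ooo']
```

Sorting these tuples through "sort | uniq -c", as in the pipeline
below, then yields the count of occurrences of each distinct tuple.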
  cat inter.evt \
    | unbasify-weirdos \
    | egrep -v ';[AY]>' \
    | extract-reading-tuples \
        -f tuple-procs.gawk \
    | sort | uniq -c | expand \
    | sort -k1nr -k2 \
    > inter.tfr

    5362 VMS text lines found
    17357 interlinear text lines read
    245136 tuples written
    17357 interlinear text lines read

  dicio-wc inter.tfr

      lines   words     bytes file
    ------- ------- --------- ------------
       6126   12252    214410 inter.tfr

  cat inter.tfr \
    | gawk '/./{s+=$1} END{print s;}'

    245136

We then compute a table that maps reading tuples to consensus
readings:

  cat inter.tfr \
    | compute-consensus-table \
    > tuple-to-consensus.stats

  cat tuple-to-consensus.stats \
    | gawk '/./{print $2,$3;}' \
    > tuple-to-consensus.tbl

Similarly, we compute a table that maps reading tuples to majority
readings (using equal weights on the first iteration):

  cat inter.tfr \
    | compute-majority-table \
        -v alternates="CD,FG,JI,KQ,LM" \
    > tuple-to-majority.stats

  cat tuple-to-majority.stats \
    | gawk '/./{print $2,$3;}' \
    > tuple-to-majority.tbl

Since the weights are unity, the total weight column is the number of
transcribers in that reading tuple (ignoring alternates and "%" or "*"
readings).
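The majority vote on one tuple can be sketched as follows.  This is a
Python illustration, not the actual compute-majority-table script, and
the treatment of the alternates= pairs is a guess (here the second
member of a pair is simply not counted when the first has a reading):

```python
def majority_reading(tup, codes, alternates=()):
    """Pick the most-voted character of one reading tuple.

    tup:        one character per transcriber; "%" = position not
                covered, "*" = illegible.
    codes:      the transcriber code of each tuple slot, e.g. "CDFU".
    alternates: pairs such as ("CD", "FG"); the second member of a
                pair is skipped whenever the first has a reading, so
                two copies of one transcription do not vote twice
                (guessed semantics of the alternates= option).

    Returns (majority reading, total weight), with unit weights as on
    the first iteration described above.
    """
    skip = set()
    for a, b in alternates:
        if a in codes and tup[codes.index(a)] not in "%*":
            skip.add(b)
    votes, weight = {}, 0
    for ch, code in zip(tup, codes):
        if ch in "%*" or code in skip:
            continue
        votes[ch] = votes.get(ch, 0) + 1
        weight += 1
    best = max(votes, key=votes.get) if votes else "%"
    return best, weight

# invented 4-slot tuple: C reads "o", its alternate D reads "a",
# F did not cover the position, U reads "o"
print(majority_reading("oa%o", "CDFU", alternates=("CD",)))
# -> ('o', 2)
```

Note that the returned weight is exactly the count described in the
note: the number of voting transcribers after alternates and "%"/"*"
readings are discarded.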
Let's count how many recorded positions have been read by 0, 1, 2,
..., 26 transcribers, excluding also positions that are all [!%=-],
since those positions are either fillers or were provided by me:

  cat tuple-to-majority.stats \
    | gawk \
        ' ($2 \!~ /^[-=%\!]*$/){ \
            nt = int($4+0.0001); ct[nt] += $1; tot += $1; \
            if (nt>zt+0){zt=nt} if (26-nt>at+0){at=26-nt} \
          }; \
          END{ \
            for(i=26-at;i<=zt;i++){printf "%3d %7d\n", i, ct[i]} \
            printf "tot %7d\n", tot \
          } \
        '

      0     268
      1    2509
      2   59253
      3   95048
      4   68280
      5    4055
      6     341
    tot  229754

Next we can compute some statistics about the accuracy of each
transcriber:

  cat inter.tfr \
    | compute-transcriber-correlations \
        -v alternates="CD,FG,JI,KQ,LM" \
    > inter.trcorrs

CREATING THE CONSENSUS AND MAJORITY VERSIONS

Now we reread the interlinear file and produce another file with two
extra variants: a "majority" version (first in each batch, transcriber
code "A") and a "consensus" version (last in each group, transcriber
code "Y").  See the scripts "compute-consensus-table" and
"compute-majority-table" for definitions of these terms.

  cat inter.evt \
    | egrep -v '^<.*;[AY]>' \
    | unbasify-weirdos \
    | combine-versions \
        -f tuple-procs.gawk \
        -v code=Y \
        -v position=last \
        -v table=tuple-to-consensus.tbl \
    > inter-c.evt

  cat inter-c.evt \
    | combine-versions \
        -f tuple-procs.gawk \
        -v ignore=Y \
        -v code=A \
        -v position=first \
        -v table=tuple-to-majority.tbl \
    | basify-weirdos \
    > inter-cm.evt

Extracting the bare text of the majority and consensus versions:

  cat inter-cm.evt \
    | sed -e '/^## <[^<>.]*>/s/^## *//g' \
    | egrep -v '^#' \
    | egrep -v '^<.*;[^A]>' \
    | unbasify-weirdos \
    > only-m.evt

  cat inter-cm.evt \
    | sed -e '/^## <[^<>.]*>/s/^## *//g' \
    | egrep -v '^#' \
    | egrep -v '^<.*;[^Y]>' \
    | unbasify-weirdos \
    > only-c.evt

Publishing:

  foreach f ( only-c only-m )
    cat $f.evt | gzip > $f.evt.gz
    zip $f $f.evt
  end

DISPLAYING THE DIFFERENCES BETWEEN VERSIONS

The file will show the majority line at the top and variants below, in
the format

  fNNN.UU LLL A EEEEEEEEEE...
              T E EE
              T E EE
              T E EE
              Y EEEEEEEEE..

where LLL is a line number, T is a transcriber code, and E an EVA
character.

  rm disc/*.html

  cat inter-cm.evt \
    | egrep -v '^## *<[^<>.]*[.][^<>.]*>' \
    | egrep -v '^#([ ]|$)' \
    | unbasify-weirdos \
    | show-discrepancies \
        -v title='EVA interlinear 1.6e6 - Discrepancies between versions' \
        -f tuple-procs.gawk \
        -v dir=disc

Publishing the concordance:

  ( cd disc && rm -f disc.zip && pkzip disc index.html legend.html f*.html )

CREATING PER-PAGE AND PER-SECTION FILES

Now let's produce the following files:

  pages-m/FNUM.evt      the majority version split into one file per
                        page.

  pages-m/all.names     the f-numbers of all existing pages, in
                        natural reading order.

  subsecs-m/TAG.evt     the majority version split into one file per
                        subsection.

  subsecs-m/all.names   the tags of all existing sections, in some
                        nice order.

  subsecs-m/TAG.fnums   the f-numbers of all existing pages in
                        section TAG, in natural reading order.

Gathering the page lists:

  set pages = ( `cat .all.units | egrep -v 'f0' | egrep -v '[.]'` )

  mkdir pages-m
  /bin/rm -f pages-m/all.names pages-m/*.evt .foo

  cat only-m.evt \
    | basify-weirdos \
    | sed -e 's/[&][*]/**/g' \
    | egrep -v '^<[^<>.]*>' \
    | split-pages \
        -v outdir=pages-m \
    > pages-m/all.names

Collecting the list of pages in each section:

  mkdir subsecs-m

  set subsecs = ( \
    `cat fnum-to-subsec.tbl | gawk '($2 \!~ /xxx/){print $2}' | sort | uniq` \
  )
  echo "subsecs = ( ${subsecs} )"

  /bin/rm -f subsecs-m/all.names subsecs-m/*.fnums subsecs-m/.foo

  foreach tag ( ${subsecs} )
    echo "${tag}"
    cat fnum-to-subsec.tbl \
      | grep -w ${tag} \
      | gawk '/./{print $1;}' \
      > subsecs-m/${tag}.fnums
    cat `cat subsecs-m/${tag}.fnums | sed -e 's@^\(.*\)$@pages-m/\1.evt@g'` \
      > subsecs-m/${tag}.evt
    echo ${tag} >> subsecs-m/all.names
  end

  dicio-wc subsecs-m/*.evt

      lines   words     bytes file
    ------- ------- --------- ------------
        916    1832     62661 subsecs-m/bio.1.evt
         13      26      1132 subsecs-m/cos.1.evt
        399     798     19260 subsecs-m/cos.2.evt
        186     372      9994 subsecs-m/cos.3.evt
       1066    2132     64512 subsecs-m/hea.1.evt
        134     268      8660 subsecs-m/hea.2.evt
        316     632     24711 subsecs-m/heb.1.evt
         61     122      4644 subsecs-m/heb.2.evt
        174     348     10021 subsecs-m/pha.1.evt
        284     568     15718 subsecs-m/pha.2.evt
         80     160      6158 subsecs-m/str.1.evt
       1084    2168     90650 subsecs-m/str.2.evt
         53     106      2535 subsecs-m/unk.1.evt
         52     104      2476 subsecs-m/unk.2.evt
          7      14       461 subsecs-m/unk.3.evt
         82     164      3762 subsecs-m/unk.4.evt
         35      70      2844 subsecs-m/unk.5.evt
         45      90      3845 subsecs-m/unk.6.evt
         39      78      3002 subsecs-m/unk.7.evt
          1       2        67 subsecs-m/unk.8.evt
        335     670     15343 subsecs-m/zod.1.evt

Let's list the pages in each section:

  ( cd pages-m && ls f*.evt ) \
    | sed -e 's/\.evt/ +/' \
    > /tmp/present.tbl

  /bin/rm -f pages-summary.txt
  foreach sec ( `cat subsecs-m/all.names` )
    echo "${sec}"
    echo "subsection ${sec}" \
      >> pages-summary.txt
    cat subsecs-m/${sec}.fnums \
      | map-field \
          -v table=/tmp/present.tbl \
          -v default='-' \
      | sed -e 's/[+] //' -e 's/- \(f[0-9vr]*\)/(\1)/' \
      | fmt -w 50 \
      | sed -e 's/^/ /' \
      >> pages-summary.txt
    echo " " \
      >> pages-summary.txt
  end
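The grouping and marking done above with grep -w, map-field, and sed
can be sketched in Python.  This is an illustration only: the table
lines and the present-page set in the example are invented, and the
parenthesized marking of missing pages mimics the sed step that builds
pages-summary.txt:

```python
def pages_by_section(table_lines, present):
    """Group f-numbers by section tag, marking absent pages.

    table_lines: "fnum subsec" pairs, as in fnum-to-subsec.tbl.
    present: set of f-numbers that actually have a pages-m/FNUM.evt
    file.  Pages with no file are wrapped in parentheses, as in
    pages-summary.txt; pages tagged "xxx" are skipped, as in the
    gawk filter above.
    """
    secs = {}
    for line in table_lines:
        fnum, tag = line.split()
        if tag == "xxx":
            continue
        mark = fnum if fnum in present else "(%s)" % fnum
        secs.setdefault(tag, []).append(mark)
    return secs

# invented table: two pages in bio.1, one untagged page, f1v missing
print(pages_by_section(["f1r bio.1", "f1v bio.1", "f2r xxx"], {"f1r"}))
# -> {'bio.1': ['f1r', '(f1v)']}
```

Each resulting list corresponds to one "subsection TAG" paragraph of
pages-summary.txt, before the fmt -w 50 reflow.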