Hacking at the Voynich manuscript - Side notes
044 Adding line numbers to Takahashi's transcription
Last edited on 2006-11-27 22:50:11 by stolfi
The goal of this note is to merge Takeshi Takahashi's full
transcription of the VMS into the interlinear file.
FETCHING TAKESHI'S FILES
Takeshi Takahashi announced to the Voynich list a new full
transcription of the VMS in HTML format, without line numbers. I
fetched his single-file version with HTTP on 1998-11-25.
Some time later (1998-12-28) I fetched again the version formatted
as separate pages, which Takeshi says are more up-to-date than the
full file.
ln -s ../../../docs/Takahashi
wget \
--non-verbose \
--input-file=Takahashi/pages/all.urls \
--directory-prefix=Takahashi/pages/ \
--output-file=Takahashi/pages/wget.log
Had to manually edit f66r.htm because it was formatted as a
.
INITIAL CONVERSION FROM HTML TO EVT FORMAT
Removing irrelevant HTML formatting:
cat Takahashi/full.html \
| cleanup-takeshi-html-1 \
> tak-1.iso
( cd Takahashi/pages/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
| cleanup-takeshi-html-1 \
| sed \
-e 's/([^()]*)//g' \
-e '/^ *$/d' \
> kat-1.iso
Replaced newline and " " by ".",
by "-" and newline,
by "=" and newline; then cleaned up the spaces.
foreach f ( tak kat )
echo "$f"
cat ${f}-1.iso \
| sed \
-e '/^[##]/s/\(f[0-9a-z]*\)/<\1> {'"${f}"'}@/' \
-e '/^[#]/b' \
-e 's/
/-@/' \
-e 's/
/=@/' \
-e 's@<[/]*B>@@g' \
-e 'y/ /./' \
| tr '\012@' '.\012' \
| sed \
-e 's/[.][.]*_/__/g' \
-e 's/_[.][.]*/__/g' \
-e 's/^[-._=]*//g' \
-e 's/[-._]*[=][-._=]*/=/g' \
-e 's/[.]*[-_][-._]*/-/g' \
-e 's/[.][.][.]*/./g' \
> ${f}-2.iso
end
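The separator clean-up done by the second sed stage above can be restated as a minimal Python sketch. The six substitutions are transcribed from the sed expressions in the same order; the function name normalize_separators and the sample strings are my own and purely illustrative.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the six sed expressions
# of the second clean-up stage. The order matters, as it does in sed.
CLEANUP_RULES = [
    (r"\.+_", "__"),            # s/[.][.]*_/__/g
    (r"_\.+", "__"),            # s/_[.][.]*/__/g
    (r"^[-._=]*", ""),          # s/^[-._=]*//g        (strip leading separators)
    (r"[-._]*=[-._=]*", "="),   # s/[-._]*[=][-._=]*/=/g
    (r"\.*[-_][-._]*", "-"),    # s/[.]*[-_][-._]*/-/g
    (r"\.\.+", "."),            # s/[.][.][.]*/./g     (collapse runs of dots)
]


def normalize_separators(line: str) -> str:
    """Apply the clean-up substitutions to a single line of text."""
    for pattern, replacement in CLEANUP_RULES:
        line = re.sub(pattern, replacement, line)
    return line


if __name__ == "__main__":
    # Made-up sample line, not taken from the transcription.
    print(normalize_separators("..daiin...chol.-.shol=."))
```

Like sed, this works one line at a time, so the "^" anchor refers to the start of each line.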
INSERTING PRELIMINARY LOCATION CODES IN TAKESHI'S FILE
The next step is to insert preliminary locators into Takeshi's file.
This processing was done on the single-file version (tak-2.iso);
later the differences between tak-2.iso and kat-2.iso were checked
and fixed manually.
Using transcriber "H" = Takahashi.
The preliminary locators have empty unit tag and 2-digit line
numbers that increase sequentially per page (i.e. ,
, etc.).
cat tak-2.iso \
| insert-loc-codes \
> tak-3.evt
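The script insert-loc-codes is not reproduced in this note. As a rough illustration of the numbering scheme only, here is a Python sketch under stated assumptions: page headers are "##" lines containing the page's f-number in angle brackets, the locator layout is page, empty unit tag, 2-digit line number, then ";" and the transcriber code, and the function name is hypothetical.

```python
import re

# Assumed page-header shape: "## <f1r> ..." (an assumption, not the real spec).
PAGE_HEADER = re.compile(r"^##\s*<(f[0-9a-z]+)>")


def insert_preliminary_locators(lines, transcriber="H"):
    """Prefix each text line with a locator numbered sequentially per page."""
    out = []
    page = None
    count = 0
    for line in lines:
        header = PAGE_HEADER.match(line)
        if header:
            # A new page starts: remember its f-number and restart the count.
            page, count = header.group(1), 0
            out.append(line)
        elif line.startswith("#") or not line.strip() or page is None:
            # Comments, blank lines, and anything before the first page pass through.
            out.append(line)
        else:
            count += 1
            # Assumed locator layout with an empty unit tag between the two dots.
            out.append(f"<{page}..{count:02d};{transcriber}> {line}")
    return out
```

The sample text used to try this out should be treated as filler; only the numbering behaviour is the point.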
BUILDING THE TAKESHI-TO-STOLFI LOCATOR DICTIONARY
Then we have to produce a dictionary that maps the preliminary
locators to those used in the interlinear. For that purpose we take
a "best pick" version of the interlinear, chopping the text to 20
bytes, and equalizing all pages to 170 lines:
ln -s ../../L16-eva
cat L16-eva/UNITS \
| gawk -v FS=':' '/./{print $2;}' \
> .all.units
set units = ( `cat .all.units` )
( cd L16-eva && cat $units ) \
| best-pick \
> sto-3.evt
cat sto-3.evt \
| sync-clip-evt -v pageSize=170 \
> sto.clp
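To make the clipping and equalizing step concrete, here is a small Python sketch of the idea. It is not the sync-clip-evt script itself: the input structure, the choice to count the page header as one of the page's lines, and the use of empty lines as filler are all assumptions of mine.

```python
def clip_and_pad(pages, text_width=20, page_size=170):
    """Truncate each text to text_width characters and pad every page
    to exactly page_size output lines (header included, by assumption).

    pages is a list of (header, [(locator, text), ...]) tuples.
    """
    out = []
    for header, entries in pages:
        if len(entries) + 1 > page_size:
            raise ValueError(f"page too long for page_size={page_size}: {header}")
        out.append(header)
        for locator, text in entries:
            out.append(f"{locator} {text[:text_width]}")
        # Filler lines keep the same page at the same offset in both files.
        out.extend([""] * (page_size - 1 - len(entries)))
    return out
```

With every page occupying the same number of lines in both files, a plain side-by-side paste puts matching pages next to each other, which is what makes the later manual alignment feasible.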
Then we do the same to Takahashi's file (chop the text to 20 bytes
and equalize the page sizes). We also fix some page numbering
bugs:
cat tak-3.evt \
| sed \
-e 's//d' \
-e '//i\' \
-e '## {}\' \
-e ' {not transcribed}' \
| sync-clip-evt -v pageSize=170 \
> tak.clp
Let's check whether we have the same set of pages, in the
same order, with the same number of lines:
dicio-wc sto.clp tak.clp
grep '##' sto.clp > sto.pages
grep '##' tak.clp > tak.pages
diff sto.pages tak.pages
OK, now we paste these two files side-by-side:
/n/gnu/bin/paste -d' ' tak.clp sto.clp > tak-sto.clp
We then edit tak-sto.clp manually, shifting and
permuting the right half of each page (locators included) until
the two truncated texts on each line are two versions of the same
VMS line. Unmatched half-lines are discarded (they will require
fixing the line breaks in the file anyway).
Then we delete the truncated text columns, leaving only the
locators (preliminary and interlinear). The resulting file
is saved as tak-sto-locs.tbl.
MAPPING PRELIMINARY LOCATORS TO STOLFI'S
Next, we create a preliminary file tak-4.evt with Takeshi's text and
(mostly) Stolfi's locators:
cat tak-3.evt \
| map-locations \
-v table=tak-sto-locs.tbl \
> tak-4.evt
The script map-locations will leave untouched any locations
that are not listed in the dictionary. (These are identified by
their empty unit tag, i.e. locators of the form "").
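The behaviour described here, replacing listed locators and leaving the rest untouched, can be sketched in a few lines of Python. This is only an illustration: the function name, the in-memory dictionary standing in for tak-sto-locs.tbl, and the locator strings in the usage note are assumptions.

```python
import re

# A locator is assumed to be the angle-bracketed token at the start of a line.
LOCATOR = re.compile(r"^<[^<>]*>")


def map_locations(lines, table):
    """Replace each leading locator found in table; leave other lines as they are."""
    out = []
    for line in lines:
        found = LOCATOR.match(line)
        if found and found.group(0) in table:
            line = table[found.group(0)] + line[found.end():]
        out.append(line)
    return out
```

For example, with a table mapping a hypothetical "<f1r..01;H>" to "<f1r.P.1;H>", a line starting with the former comes out starting with the latter, while comment lines and unlisted locators pass through unchanged and can then be fixed by hand.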
We edit those locators by hand. We make a copy of the file
cp -p tak-4.evt tak-5.evt
and edit tak-5.evt with emacs.
[ 1998-12-02 ]
Eventually I finished editing all location numbers in tak-5.evt.
MERGING TAKESHI'S VERSION INTO THE INTERLINEAR
Collecting the units making up the interlinear file (see note 024):
ln -s ../../L16-eva
/bin/rm -f .all.units
cat L16-eva/UNITS \
| gawk -v FS=':' '/^[^#]/{printf "L16-eva/%s\n", $2;}' \
> .all.units
Merging tak-5.evt into the current interlinear:
cat `cat .all.units` \
> inter.evt
merge-version-into-interlin \
-v sourceFile=tak-5.evt \
-v trashFile=tak-5-unmatched.evt \
-v transCodes='H' \
< inter.evt \
> inter+tak.evt
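The merge script is not listed in this note, so the following Python sketch only illustrates one plausible reading of its parameters: each line of the new version is placed after the existing versions of the same text line, and lines whose locator has no counterpart are set aside (presumably what the trashFile collects). The locator pattern and function name are assumptions.

```python
import re

# Assumed locator shape: "<position;T>" where T is a one-letter transcriber code.
LOCATOR = re.compile(r"^<([^;<>]*);([A-Z])>")


def merge_version(interlinear, source):
    """Return (merged_lines, unmatched_source_lines)."""
    # Index the new version's lines by position (locator minus transcriber code).
    pending = {}
    for line in source:
        found = LOCATOR.match(line)
        if found:
            pending.setdefault(found.group(1), []).append(line)

    merged = []
    prev_key = None
    for line in interlinear:
        found = LOCATOR.match(line)
        key = found.group(1) if found else None
        # When a group of versions of one text line ends, append the new version.
        if prev_key is not None and key != prev_key:
            merged.extend(pending.pop(prev_key, []))
        merged.append(line)
        prev_key = key
    if prev_key is not None:
        merged.extend(pending.pop(prev_key, []))

    unmatched = [line for group in pending.values() for line in group]
    return merged, unmatched
```

Keeping the unmatched lines separate, rather than dropping them, is what allows the remaining locator mistakes to be found and fixed afterwards.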
Splitting back into separate files:
mkdir ../../L16+H-eva
ln -s ../../L16+H-eva
cat inter+tak.evt \
| split-evt-into-units \
-v outdir=L16+H-eva \
> .new.units
EDITING AND SYNCHRONIZING THE MERGED FILE
At this point we have created a new interlinear, partitioned one
unit per file. We must still edit the units L16+H-eva/f* by hand,
adding "-{plant}" breaks and synchronizing "!"s.
This step was done on the individual text unit files in L16+H-eva/*
REMOVING NEEDLESS CAPITALIZATION
In his transcription, Takeshi used capitalized EVA for proper display
with Gabriel's fonts (i.e. Sh/Ch/cTh etc. instead of sh/ch/cth etc.).
To simplify comparison and statistical analysis, it is better
to convert those codes to lower case EVA. After all, they can be
re-capitalized with Rene's VTT.
So let's get a list of all unit files, in reading order:
cat L16+H-eva/UNITS \
| gawk -v FS=':' '/./{print $2;}' \
> .all.units
set units = ( `cat .all.units` )
Safety check:
( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
cat .all.units | sort > .bar
diff .foo .bar
Concatenating all units:
( cd L16+H-eva && cat ${units} ) \
> inter+tak-ed.evt
cat inter+tak-ed.evt \
| remove-needless-capitalization \
> inter+tak-ed-noc.evt
diff inter+tak-ed.evt inter+tak-ed-noc.evt | head -3000 > .foo
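For reference, the lower-casing of the "aesthetic" capitals amounts to a handful of fixed substitutions. The Python sketch below is not the remove-needless-capitalization script; the pairs for Sh, Ch and cTh come from the description above, while cKh, cPh and cFh are my assumption about what the "etc." covers.

```python
# Aesthetic capitalizations and their lower-case EVA equivalents.
# Sh, Ch, cTh are named in the note; cKh, cPh, cFh are assumed by analogy.
AESTHETIC_CAPITALS = {
    "cTh": "cth",
    "cKh": "ckh",
    "cPh": "cph",
    "cFh": "cfh",
    "Sh": "sh",
    "Ch": "ch",
}


def lower_aesthetic_capitals(text: str) -> str:
    """Lower-case only the listed groups; other capitals are left alone."""
    for capitalized, plain in AESTHETIC_CAPITALS.items():
        text = text.replace(capitalized, plain)
    return text
```

A real script would also have to skip locators and comments; this sketch assumes it is given transcription text only.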
Let's see whether we lost any information, compared to Takahashi's
version:
cat inter+tak-ed-noc.evt \
| vtt -b1 -l4 -c3 -tH -o0 -s0 -f1 \
| sed -e 's/[-]\(.\)/.\1/g' \
> .bar
diff tak-2.iso .bar \
| prettify-diff-output \
> .foo
GETTING THE WEIRDOS OUT OF THE WAY
Besides the "aesthetic" capitals, Takeshi used significant
characters from the full EVA character set, including plumes ['"],
ligated capital letters such as "I" and "O", and 8-bit characters
above decimal 127 for "weirdos".
All my statistical scripts were written for basic EVA, and it seems
silly to modify and re-debug them all just for the sake of a few
characters that occur only a few times in the document. Moreover
the weirdos would only contribute distracting noise to the
statistics, and hog the tables with useless entries.
We could keep the full EVA codes in the file, and use VTT or some AWK
script to convert the file to basic EVA before each processing step
where they would be a nuisance. However I anticipate that most of
my uses of the file will fall in this category. So I think it is
better to map all weirdos to "*" (or to a similar basic-EVA letter),
and retain the correct full-EVA code as a stylized comment; and
only convert these groups to full EVA when needed.
Specifically, non-basic EVA characters will be denoted by a
construct of the form "C{&XXX}" where "C" is a lower-case basic-EVA
letter or "*" (to be used in "normal" processing), and "XXX" is
either the 3-digit decimal code of the weirdo, or the extended EVA
notation for that character. Here are some examples, with the
precise meaning and how they will be handled in statistics
and indexing:
*{&252} = weirdo with decimal code 252 ("V" underbar); treat as "*".
k{&K} = a "k" with stem crossed by a ligature; treat as "k".
o{&o'} = "o" with plume; treat as "o".
s{&S} = an "s" with a ligature, or half a "sh"; treat as "s".
q{&q"} = a "q" with a plume above the connector, not touching it;
treat as "q".
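The two readings of this notation can be sketched in Python as follows. This is an illustration, not the basify-weirdos or unbasify-weirdos scripts: the function names are mine, and decoding a 3-digit code with chr() assumes the 8-bit characters are ISO Latin-1 code points.

```python
import re

# One basic-EVA letter (or "*") followed by the full-EVA code in {&...}.
WEIRDO = re.compile(r"([a-z*])\{&([^{}]+)\}")


def to_basic(text: str) -> str:
    """Keep only the basic-EVA stand-in, as used for statistics and indexing."""
    return WEIRDO.sub(lambda m: m.group(1), text)


def to_full(text: str) -> str:
    """Replace each construct by the full-EVA code it records."""
    def expand(m):
        code = m.group(2)
        if re.fullmatch(r"[0-9]{3}", code):
            # 3-digit decimal code of an 8-bit character (assumed Latin-1).
            return chr(int(code))
        return code
    return WEIRDO.sub(expand, text)


if __name__ == "__main__":
    sample = "qo{&o'}k{&K}eey.*{&252}ol"  # made-up word using the examples above
    print(to_basic(sample))
    print(to_full(sample))
```

Since both readings are recoverable from the same file, nothing is lost by storing the basic letter first and the full code as a bracketed annotation.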
I have temporarily used some non-EVA codes after the "&", for
weirdos that I could not decide how to encode in EVA. For example on
I used "r{&r}" for an "r" glued to the preceding
character (an "e"); and on and I used "*{&^}" for a
character that looks like the upper half of an "y". These non-EVA
codes are defined in #-comments at the top of each unit file,
and will hopefully be replaced by official EVA in the near future.
So let's do the conversion on the file (after having already removed
the "aesthetic" capitalizations):
cat inter+tak-ed-noc.evt \
| basify-weirdos \
> inter+tak-ed-basic.evt
diff inter+tak-ed-noc.evt inter+tak-ed-basic.evt | head -3000 > .foo
cat inter+tak-ed-basic.evt \
| validate-new-evt-format \
-v checkTerminators=1 \
-v checkLineLengths=1 \
>& inter+tak-ed-basic.bugs
Splitting back into separate files:
mkdir ../../L16+H-b-eva
ln -s ../../L16+H-b-eva
cat inter+tak-ed-basic.evt \
| split-evt-into-units \
-v outdir=L16+H-b-eva \
> .new.units
Checking the result:
cat .new.units \
| sed -e 's@L16[+]H-b-eva/@@g' \
> .foo
diff .all.units .foo
cat `cat .new.units` \
> .bar
diff inter+tak-ed-basic.evt .bar \
| prettify-diff-output
All seems OK, we can replace the interlinear directory by
the new one:
mv ../../L16+H-eva ../../L16+H-eva-junk
mv ../../L16+H-b-eva ../../L16+H-eva
rm -i L16+H-b-eva
The auxiliary files (UNITS, scripts, X-.fnums, etc.)
were moved by hand from L16+H-eva-junk to the new L16+H-eva.
Checking whether we have lost any lines:
( cd L16+H-eva && cat ${units} ) \
| gawk '/^<.*;[A-GI-Z]/{print $1;}' \
> .new.locs
( cd L16-eva && cat `cat UNITS | gawk -v FS=':' '/./{print $2;}'` ) \
| gawk '/^<.*;[A-GI-Z]/{print $1;}' \
> .old.locs
diff .old.locs .new.locs \
| prettify-diff-output \
> .diff
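The same check can be expressed as a small Python sketch: collect the locators of all lines by transcribers other than "H" in the old and new files and report what differs. The locator pattern is a slightly stricter version of the gawk pattern above, and the function names are hypothetical.

```python
import re

# Locator whose transcriber code is any capital letter except "H".
OTHER_TRANSCRIBER = re.compile(r"^<[^<>]*;[A-GI-Z]>")


def locators(lines):
    """First whitespace-separated field of every non-H transcription line."""
    return [line.split()[0] for line in lines if OTHER_TRANSCRIBER.match(line)]


def lost_and_gained(old_lines, new_lines):
    """Locators present only in the old file, and only in the new file."""
    old, new = locators(old_lines), locators(new_lines)
    old_set, new_set = set(old), set(new)
    lost = [loc for loc in old if loc not in new_set]
    gained = [loc for loc in new if loc not in old_set]
    return lost, gained
```

Unlike diff, this ignores ordering, so it answers only whether a line vanished or appeared, not whether lines were moved.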
ADDING LATE CHANGES
Takeshi says that the separate HTML pages are more reliable
than the full file, and he also added last-minute corrections
to the separate pages only. So I compared the two versions:
diff tak-2.iso kat-2.iso \
| prettify-diff-output \
> .diff2
Processed all differences by hand and added them to the interlinear.
RECREATING TAKESHI'S VERSION
Let's see how closely we can recreate Takeshi's version
from the interlinear.
Let's get again a list of all unit files, in reading order:
cat L16+H-eva/UNITS \
| gawk -v FS=':' '/./{print $2;}' \
> .all.units
set units = ( `cat .all.units` )
Safety check:
( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
cat .all.units | sort > .bar
diff .foo .bar
Concatenating all units:
( cd L16+H-eva && cat ${units} ) \
| unbasify-weirdos \
> inter+tak.evt
cat inter+tak.evt \
| egrep -v '^#($|[ ])' \
| gawk \
' \
/^<[^;<>]*[;]/{ \
match($0,/<[^;<>]*[;]/); \
s=substr($0,2,RLENGTH-2); \
if(s!=ps){print "#"; ps=s;} \
} \
//{print;} \
' \
> inter+tak-noc.evt
dicio-wc inter+tak.evt
  lines   words     bytes file
------- ------- --------- -------------
  39428  129301   1681652 inter+tak.evt
cat inter+tak.evt \
| validate-new-evt-format \
-v checkTerminators=1 \
-v checkLineLengths=1 \
>& inter+tak.bugs
Re-extracting Takahashi's version with Rene's VTT:
ln -s /proj/dicio/staff/stolfi/programs/c/vtt-rene/vtt
cat inter+tak.evt \
| sed -e '/^## *<[^;.<>]*>/s/## *//' \
| gawk \
' /^##.*<[^.]*>.*/{print substr($0,index($0,"<"));} \
//{print;} \
' \
| vtt -b1 -c2 -f1 -l4 -o0 -s0 -tH \
| sed \
-e 's/[-]\(.\)/.\1/g' \
-e '/^#$/d' \
-e '/^# /d' \
> inter+tak.iso
39685 lines read in
12143 lines de-selected
0 hash comment lines suppressed
265 empty lines suppressed
27277 lines written to output
Checking for processing errors:
foreach f ( kat-2 inter+tak )
echo $f
cat $f.iso \
| basify-weirdos \
| unbasify-weirdos \
| sed \
-e '/^##/s/## *<\(.*\)>.*$/<'"$f"':\1>@/;t' \
-e '/^#/d' \
-e 's/[%\!]//g' \
-e 's/[-=]/./g' \
-e 's/[.][.][.]*/./g' \
-e 's/^[. ]*//g' \
-e 's/[. ]*$//g' \
-e 's/\([=.]\)/@/g' \
| tr '@' '\012' \
| egrep -v '^ *$' \
> $f.xxx
end
diff {kat-2,inter+tak}.xxx \
| prettify-diff-output \
> .foo
Converting to HTML, quick and dirty:
cat inter+tak.evt \
| vtt -a1 -b1 -c2 -f0 -l4 -o0 -s0 -tH \
| sed \
-e '/^<[^;]*>/d' \
-e 's/[-]\(.\)/.\1/g' \
-e '/^#$/d' \
-e '/^# /d' \
> tak-rebuilt.evt
39685 lines read in
12143 lines de-selected
0 hash comment lines suppressed
8 empty lines suppressed
27534 lines written to output
cat tak-rebuilt.evt \
| egrep -v '## *' \
| numbered-eva-to-html \
'VMS transcription by Takeshi Takahashi'\
> tak-rebuilt.html
[1998-12-28 stolfi]
Deleted L16-eva; the official directory is L16+H-eva from now on.
[2000-05-08 stolfi]
CHECKING FOR ANY CHANGES TO TAKAHASHI'S VERSION
mkdir Takahashi/pages-2
Fetching again the full file:
wget \
'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/PagesH.txt' \
--non-verbose \
--output-document 'Takahashi/pages-2/full-2.evt'
Fetching again the page files for comparison:
wget \
'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/index.htm' \
--non-verbose \
--output-document 'Takahashi/pages-2/index.htm'
Manually extracted the links, saved them to files
"Takahashi/pages-2/all.fnums" (f-numbers only) and
"Takahashi/pages-2/all.urls" (complete URLs).
Fetching the pages:
wget \
--non-verbose \
--input-file=Takahashi/pages-2/all.urls \
--directory-prefix=Takahashi/pages-2/ \
--output-file=Takahashi/pages-2/wget.log
Checking for completeness:
( cd Takahashi/pages-2/ && grep -L -i -e '