Hacking at the Voynich manuscript - Side notes
044 Adding line numbers to Takahashi's transcription

Last edited on 2006-11-27 22:50:11 by stolfi

The goal of this note is to merge Takeshi Takahashi's full
transcription of the VMS into the interlinear file.

FETCHING TAKESHI'S FILES

  Takeshi Takahashi announced to the Voynich list a new full
  transcription of the VMS in HTML format, without line numbers. I
  fetched his single-file version with HTTP on 1998-11-25.
  
  Some time later (1998-12-28) I fetched again the version formatted
  as separate pages, which Takeshi says are more up-to-date than the
  full file.
  
    ln -s ../../../docs/Takahashi
    
    wget \
      --non-verbose \
      --input-file=Takahashi/pages/all.urls \
      --directory-prefix=Takahashi/pages/ \
      --output-file=Takahashi/pages/wget.log
      
  Had to manually edit f66r.htm because it was formatted as a <TABLE>.
        
INITIAL CONVERSION FROM HTML TO EVT FORMAT

  Removing irrelevant HTML formatting:
  
    cat Takahashi/full.html \
      | cleanup-takeshi-html-1 \
      > tak-1.iso
      
    ( cd Takahashi/pages/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | cleanup-takeshi-html-1 \
      | sed \
          -e 's/([^()]*)//g' \
          -e '/^ *$/d' \
      > kat-1.iso
      
  Replaced newline and " " by ".", <BR> by "-" and newline, 
  <P> by "=" and newline; then cleaned up the spaces.
  
    foreach f ( tak kat )
      echo "$f"
      cat ${f}-1.iso \
        | sed \
            -e '/^[##]/s/\(f[0-9a-z]*\)/<\1> {'"${f}"'}@/' \
            -e '/^[#]/b' \
            -e 's/<BR>/-@/' \
            -e 's/<P>/=@/' \
            -e 's@<[/]*B>@@g' \
            -e 'y/ /./' \
        | tr '\012@' '.\012' \
        | sed \
            -e 's/[.][.]*_/__/g' \
            -e 's/_[.][.]*/__/g' \
            -e 's/^[-._=]*//g' \
            -e 's/[-._]*[=][-._=]*/=/g' \
            -e 's/[.]*[-_][-._]*/-/g' \
            -e 's/[.][.][.]*/./g' \
        > ${f}-2.iso
    end
    
INSERTING PRELIMINARY LOCATION CODES IN TAKESHI'S FILE

  The next step is to insert preliminary locators into Takeshi's file.
  This processing was done on the single-file version (tak-2.iso);
  later the differences between tak-2.iso and kat-2.iso were checked
  and fixed manually.
  
  Using transcriber "H" = Takahashi.
  
  The preliminary locators have empty unit tag and 2-digit line
  numbers that increase sequentially per page (i.e. <f1r..01;H>,
  <f1r..02;H>, etc.).
  
    cat tak-2.iso \
      | insert-loc-codes \
      > tak-3.evt
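
  The numbering scheme described above can be illustrated with a
  minimal Python sketch. This is not the insert-loc-codes script
  itself (which is not reproduced in this note); the function name,
  the page-header pattern, and the sample text are assumptions made
  for illustration only.

```python
import re

# Assumption: page headers look like "## <f1r> {tak}", as produced by the
# sed step that writes tak-2.iso.
PAGE_HEADER = re.compile(r"^## *<([^<>.;]+)>")


def insert_preliminary_locators(lines, transcriber="H"):
    """Prefix each text line with a locator of the form <page..NN;T>.

    The unit tag is left empty and the line counter restarts at 01
    on every page header.
    """
    out = []
    page = None
    count = 0
    for line in lines:
        m = PAGE_HEADER.match(line)
        if m:
            page = m.group(1)
            count = 0
            out.append(line)
        elif line.startswith("#") or not line.strip() or page is None:
            # Comments, blank lines, and anything before the first header
            # are passed through unchanged.
            out.append(line)
        else:
            count += 1
            out.append(f"<{page}..{count:02d};{transcriber}> {line}")
    return out


if __name__ == "__main__":
    # Hypothetical sample text, not actual transcription data.
    sample = ["## <f1r> {tak}", "aaa.bbb-", "ccc.ddd=", "## <f1v> {tak}", "eee-"]
    print("\n".join(insert_preliminary_locators(sample)))
```

  The empty unit tag (the ".." in the locator) is what later lets
  unmapped lines be recognized at a glance.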
  
BUILDING THE TAKESHI-TO-STOLFI LOCATOR DICTIONARY

  Then we have to produce a dictionary that maps the preliminary
  locators to those used in the interlinear. For that purpose we take
  a "best pick" version of the interlinear, chopping the text to 20
  bytes, and equalizing all pages to 170 lines:
  
    ln -s ../../L16-eva
    
    cat L16-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
      
    set units = ( `cat .all.units` )
    
    ( cd L16-eva && cat $units ) \
      | best-pick \
      > sto-3.evt
      
    cat sto-3.evt \
      | sync-clip-evt -v pageSize=170 \
      > sto.clp
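
  For reference, here is a minimal Python sketch of the clipping and
  page equalization idea. It is not the sync-clip-evt script; it
  assumes that a page starts at each "##" header, that the header
  counts toward the page size, that the text is plain ASCII (so 20
  bytes equal 20 characters), and that short pages are padded with
  empty lines. All names are hypothetical.

```python
def clip_line(line, width):
    """Keep the locator whole and truncate the text after it to `width` chars."""
    if line.startswith("<") and ">" in line:
        end = line.index(">") + 1
        locator, text = line[:end], line[end:].strip()
        return f"{locator} {text[:width]}"
    return line  # headers and comments are left alone


def clip_and_pad(lines, page_size=170, width=20, filler=""):
    """Truncate text and pad every page to exactly `page_size` lines."""
    pages = []
    for line in lines:
        if line.startswith("##") or not pages:
            pages.append([])  # a "##" header opens a new page
        pages[-1].append(line)

    out = []
    for page in pages:
        if len(page) > page_size:
            raise ValueError(f"page {page[0]!r} has more than {page_size} lines")
        out.extend(clip_line(line, width) for line in page)
        out.extend([filler] * (page_size - len(page)))
    return out
```

  With every page padded to the same length, line N of page P sits at
  the same offset in both files, so a plain side-by-side paste keeps
  the pages aligned even when the line counts differ.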
      
  Then we do the same to Takahashi's file (chop the text to 20 bytes
  and equalize the page sizes.)  We also fix some page numbering
  bugs:

    cat tak-3.evt \
      | sed \
          -e 's/<f86r2/<f85r2/g' \
          -e '/^## <f101v1>/d' \
          -e '/<f86v5>/i\' \
          -e '## <f86r4> {}\' \
          -e '<f86r4..01;H>      {not transcribed}' \
      | sync-clip-evt -v pageSize=170 \
      > tak.clp

  Let's check whether we have the same set of pages, in the
  same order, with the same number of lines:

    dicio-wc sto.clp tak.clp

    grep '##' sto.clp > sto.pages
    grep '##' tak.clp > tak.pages
    diff sto.pages tak.pages

  OK, now we paste these two files side-by-side:
  
    /n/gnu/bin/paste -d' ' tak.clp sto.clp > tak-sto.clp
    
  We then edit tak-sto.clp manually, shifting and 
  permuting the right half of each page  (locators included) until  
  the two truncated texts on each line are two versions of the same
  VMS line.  Unmatched half-lines are discarded (they will require
  fixing the line breaks in the file anyway). 
  
  Then we delete the truncated text columns, leaving only the 
  locators (preliminary and interlinear). The resulting file
  is saved as tak-sto-locs.tbl.
  
MAPPING PRELIMINARY LOCATORS TO STOLFI'S

  Next, we create a preliminary file tak-4.evt with Takeshi's text and
  (mostly) Stolfi's locators:
  
    cat tak-3.evt \
      | map-locations \
          -v table=tak-sto-locs.tbl \
      > tak-4.evt
      
  The script map-locations will leave untouched any locations
  that are not listed in the dictionary.  (These are identified by
  their empty unit tag, i.e. locators of the form "<f1r..03;H>").
  We edit those locators by hand.  We make a copy of the file
  
    cp -p tak-4.evt tak-5.evt
 
  and edit tak-5.evt with emacs.
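
  The mapping behaviour described above (replace the locator when it
  is in the dictionary, leave it alone otherwise) can be sketched in
  Python as follows. This is not the map-locations script; it assumes
  that each row of the table holds the preliminary locator followed
  by the final locator, separated by blanks, and the sample locators
  are hypothetical.

```python
import re

LOCATOR = re.compile(r"^<[^<>]*>")
# A locator whose unit tag is empty, e.g. <f1r..03;H>
EMPTY_UNIT = re.compile(r"^<[^<>.]*\.\.")


def load_table(rows):
    """Build {preliminary locator: final locator} from two-column rows."""
    table = {}
    for row in rows:
        fields = row.split()
        if len(fields) >= 2:
            table[fields[0]] = fields[1]
    return table


def map_locators(lines, table):
    """Replace leading locators found in `table`; leave all others untouched."""
    out = []
    for line in lines:
        m = LOCATOR.match(line)
        if m and m.group(0) in table:
            line = table[m.group(0)] + line[m.end():]
        out.append(line)
    return out


def unmapped(lines):
    """List the lines that still carry a preliminary (empty-unit) locator."""
    return [line for line in lines if EMPTY_UNIT.match(line)]
```

  Running unmapped() on the result gives the list of lines that still
  need hand editing.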
  
  [ 1998-12-02 ]

  Eventually I finished editing all location numbers in tak-5.evt.
  
MERGING TAKESHI'S VERSION INTO THE INTERLINEAR

  Collecting the units making up the interlinear file (see note 024):
  
    ln -s ../../L16-eva

    /bin/rm -f .all.units
    cat L16-eva/UNITS \
      | gawk -v FS=':' '/^[^#]/{printf "L16-eva/%s\n", $2;}' \
      > .all.units
  
  Merging tak-5.evt into the current interlinear:
  
    cat `cat .all.units` \
      > inter.evt
      
    merge-version-into-interlin \
        -v sourceFile=tak-5.evt \
        -v trashFile=tak-5-unmatched.evt \
        -v transCodes='H' \
      < inter.evt \
      > inter+tak.evt

  Splitting back into separate files:
  
    mkdir ../../L16+H-eva
    
    ln -s ../../L16+H-eva
    
    cat inter+tak.evt \
      | split-evt-into-units \
          -v outdir=L16+H-eva \
      > .new.units
      
EDITING AND SYNCHRONIZING THE MERGED FILE  
  
  At this point we have created a new interlinear, partitioned one
  unit per file. We must still edit the units L16+H-eva/f* by hand,
  adding "-{plant}" breaks and synchronizing "!"s.
  
  This step was done on the individual text unit files in L16+H-eva/*.
  
REMOVING NEEDLESS CAPITALIZATION

  In his transcription, Takeshi used capitalized EVA for proper display
  with Gabriel's fonts (i.e. Sh/Ch/cTh etc. instead of sh/ch/cth etc.).
  To simplify comparison and statistical analysis, it is better
  to convert those codes to lower case EVA.  After all, they can be
  re-capitalized with Rene's VTT.
  
  So let's get a list of all unit files, in reading order:
  
    cat L16+H-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units

    set units = ( `cat .all.units` )
      
  Safety check:

    ( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
    cat .all.units | sort > .bar
    diff .foo .bar
    
  Concatenating all units:

    ( cd L16+H-eva && cat ${units} ) \
      > inter+tak-ed.evt
  
    cat inter+tak-ed.evt \
      | remove-needless-capitalization \
      > inter+tak-ed-noc.evt
      
    diff inter+tak-ed.evt inter+tak-ed-noc.evt  | head -3000 > .foo
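
  As an illustration of the idea only (this is not the
  remove-needless-capitalization script), a Python sketch could
  lower-case a fixed list of capitalized groups and leave every other
  capital alone. The list below is an assumption: only Sh, Ch and cTh
  are named in this note; cKh, cPh and cFh are my guess at what
  "etc." covers.

```python
import re

# Assumed list of purely "aesthetic" capitalized groups.
AESTHETIC = re.compile(r"Sh|Ch|cTh|cKh|cPh|cFh")


def lower_aesthetic(text):
    """Lower-case the listed groups; other capitals are kept as they are."""
    return AESTHETIC.sub(lambda m: m.group(0).lower(), text)


if __name__ == "__main__":
    print(lower_aesthetic("Shol.Chor.cThy.I"))
```

  A real filter would have to apply this to the text field only, so
  that locators and comments are not touched.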
    
  Let's see whether we lost any information, compared to Takahashi's 
  version:
    
    cat inter+tak-ed-noc.evt \
      | vtt -b1 -l4 -c3 -tH -o0 -s0 -f1 \
      | sed -e 's/[-]\(.\)/.\1/g' \
      > .bar
      
    diff tak-2.iso .bar \
      | prettify-diff-output \
      > .foo
      
GETTING THE WEIRDOS OUT OF THE WAY

  Besides the "aesthetic" capitals, Takeshi used significant
  characters from the full EVA character set, including plumes ['"],
  ligated capital letters such as "I" and "O", and 8-bit characters
  above decimal 127 for "weirdos".
  
  All my statistical scripts were written for basic EVA, and it seems
  silly to modify and re-debug them all just for the sake of a few
  characters that occur only a few times in the document.  Moreover
  the weirdos would only contribute distracting noise to the
  statistics, and hog the tables with useless entries.
  
  We could keep the full EVA codes in the file, and use VTT or some AWK
  script to convert the file to basic EVA before each processing
  where they would be a nuisance.  However I anticipate that most of
  my uses of the file will fall in this category.  So I think it is
  better to map all weirdos to "*" (or to a similar basic-EVA letter),
  and retain the correct full-EVA code as a stylized comment; and
  only convert these groups to full EVA when needed.
  
  Specifically, non-basic EVA characters will be denoted by a
  construct of the form "C{&XXX}" where "C" is a lower-case basic-EVA
  letter or "*" (to be used in "normal" processing), and "XXX" is
  either the 3-digit decimal code of the weirdo, or the extended EVA
  notation for that character. Here are some examples, with the
  precise meaning and how they will be handled in statistics
  and indexing:
  
    *{&252}  =  weirdo with decimal code 252 ("V" underbar); treat as "*".
                
    k{&K}    =  a "k" with stem crossed by a ligature; treat as "k".
    
    o{&o'}   =  "o" with plume; treat as "o".
    
    s{&S}    =  an "s" with a ligature, or half a "sh"; treat as "s".
  
    q{&q"}   =  a "q" with a plume above the connector, not touching it;
                treat as "q".
                
  I have temporarily used some non-EVA codes after the "&", for
  weirdos that I could not decide how to encode in EVA. For example on
  <f68v1.X.2;V> I used "r{&r}" for an "r" glued to the preceding
  character (an "e"); and on <f57v.R4.1;V> I used "*{&^}" for a
  character that looks like the upper half of a "y".  These non-EVA
  codes are defined in #-comments at the top of each unit file,
  and will hopefully be replaced by official EVA in the near future.
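
  To make the "C{&XXX}" notation concrete, here is a minimal Python
  sketch that reads the construct in both directions: dropping the
  comment to get basic EVA, or expanding it to the full code. It is
  not the basify-weirdos / unbasify-weirdos pair; the function names
  are hypothetical, and mapping a 3-digit code to a character with
  chr() assumes the 8-bit codes are ISO Latin-1 code points.

```python
import re

# One basic-EVA letter or "*", followed by a {&...} weirdo comment.
WEIRDO = re.compile(r"([a-z*])\{&([^{}]+)\}")


def to_basic(text):
    """Keep only the basic-EVA stand-in: 'k{&K}' -> 'k', '*{&252}' -> '*'."""
    return WEIRDO.sub(r"\1", text)


def to_full(text):
    """Replace each construct by the code after the '&'.

    A 3-digit decimal code becomes the character with that code
    (assumption: Latin-1); anything else is taken literally.
    """
    def expand(m):
        code = m.group(2)
        if re.fullmatch(r"\d{3}", code):
            return chr(int(code))
        return code

    return WEIRDO.sub(expand, text)
```

  For example, to_basic() turns o{&o'}k{&K}*{&252} into ok*, which is
  what the statistical scripts would see.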
                
  So let's do the conversion on the file (after having already removed
  the "aesthetic" capitalizations):

    cat inter+tak-ed-noc.evt \
      | basify-weirdos \
      > inter+tak-ed-basic.evt

    diff inter+tak-ed-noc.evt inter+tak-ed-basic.evt  | head -3000 > .foo
    
    cat inter+tak-ed-basic.evt \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& inter+tak-ed-basic.bugs

  Splitting back into separate files:
  
    mkdir ../../L16+H-b-eva
    
    ln -s ../../L16+H-b-eva
    
    cat inter+tak-ed-basic.evt \
      | split-evt-into-units \
          -v outdir=L16+H-b-eva \
      > .new.units

  Checking the result:
  
    cat .new.units \
      | sed -e 's@L16[+]H-b-eva/@@g' \
      > .foo
      
    diff .all.units .foo
    
    cat `cat .new.units` \
      > .bar
      
    diff inter+tak-ed-basic.evt .bar \
      | prettify-diff-output 

  All seems OK, so we can replace the interlinear directory with
  the new one:
  
    mv ../../L16+H-eva ../../L16+H-eva-junk
    mv ../../L16+H-b-eva ../../L16+H-eva
    rm -i L16+H-b-eva
    
  The auxiliary files (UNITS, scripts, X-<section>.fnums, etc.)
  were moved by hand from L16+H-eva-junk to the new L16+H-eva.
  
  Checking whether we have lost any lines:
  
    ( cd L16+H-eva && cat ${units} ) \
      | gawk '/^<.*;[A-GI-Z]/{print $1;}' \
      > .new.locs
      
    ( cd L16-eva && cat `cat UNITS | gawk -v FS=':' '/./{print $2;}'` ) \
      | gawk '/^<.*;[A-GI-Z]/{print $1;}' \
      > .old.locs
      
    diff .old.locs .new.locs \
      | prettify-diff-output \
      > .diff
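
  The same check can be sketched in Python: collect the locators of
  all lines whose transcriber code is a capital other than "H" (the
  same pattern as in the gawk filter above) and report those present
  in the old interlinear but missing from the new one. The function
  names are hypothetical; unlike diff, this sketch ignores ordering.

```python
import re

# Locator lines whose transcriber code is any capital letter except H.
NON_H = re.compile(r"^<.*;[A-GI-Z]")


def locators(lines):
    """First field of every non-H locator line."""
    return [line.split()[0] for line in lines if NON_H.match(line)]


def lost_locators(old_lines, new_lines):
    """Locators that occur in the old file but not in the new one."""
    kept = set(locators(new_lines))
    return [loc for loc in locators(old_lines) if loc not in kept]
```

  An empty result means that merging Takahashi's version did not drop
  any line of the other transcribers.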

ADDING LATE CHANGES

  Takeshi says that the separate HTML pages are more reliable
  than the full file, and he also added last-minute corrections
  to the separate pages only.  So I compared the two versions:

    diff tak-2.iso kat-2.iso \
      | prettify-diff-output \
      > .diff2
      
  Processed all differences by hand and added them to the interlinear.

RECREATING TAKESHI'S VERSION

  Let's see how closely we can recreate Takeshi's version
  from the interlinear.

  Let's get again a list of all unit files, in reading order:
  
    cat L16+H-eva/UNITS \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units

    set units = ( `cat .all.units` )
      
  Safety check:

    ( cd L16+H-eva && ls f[0-9]* | egrep -v '[~]$' ) | sort > .foo
    cat .all.units | sort > .bar
    diff .foo .bar
    
  Concatenating all units:

    ( cd L16+H-eva && cat ${units} ) \
      | unbasify-weirdos \
      > inter+tak.evt
      
    cat inter+tak.evt \
      | egrep -v '^#($|[ ])' \
      | gawk \
          ' \
            /^<[^;<>]*[;]/{ \
              match($0,/<[^;<>]*[;]/); \
              s=substr($0,2,RLENGTH-2); \
              if(s!=ps){print "#"; ps=s;} \
            } \
            //{print;} \
          ' \
      > inter+tak-noc.evt
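
  The gawk filter above appears intended to print a bare "#" line
  whenever the locator, taken without its transcriber code, changes,
  so that the versions of one VMS line form a visually separate
  group. A minimal Python sketch of that reading (the function name
  is hypothetical):

```python
import re

# Captures the part of the locator between "<" and ";", e.g. "f1r.P.1".
LOC = re.compile(r"^<([^;<>]*);")


def separate_line_groups(lines):
    """Insert a '#' line before each run of lines sharing the same locator."""
    out = []
    prev = None
    for line in lines:
        m = LOC.match(line)
        if m and m.group(1) != prev:
            out.append("#")
            prev = m.group(1)
        out.append(line)
    return out
```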
      
    dicio-wc inter+tak.evt

      lines   words     bytes file        
    ------- ------- --------- ------------
      39428  129301   1681652 inter+tak.evt
  
    cat inter+tak.evt \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& inter+tak.bugs

  Re-extracting Takahashi's version with Rene's VTT:
  
    ln -s /proj/dicio/staff/stolfi/programs/c/vtt-rene/vtt
    
    cat inter+tak.evt \
      | sed -e '/^## *<[^;.<>]*>/s/## *//' \
      | gawk \
          ' /^##.*<[^.]*>.*/{print substr($0,index($0,"<"));} \
            //{print;} \
          ' \
      | vtt -b1 -c2 -f1 -l4 -o0 -s0 -tH \
      | sed \
          -e 's/[-]\(.\)/.\1/g' \
          -e '/^#$/d' \
          -e '/^# /d' \
      > inter+tak.iso
      
      39685 lines read in
      12143 lines de-selected
          0 hash comment lines suppressed
        265 empty lines suppressed
      27277 lines written to output

  Checking for processing errors:

    foreach f ( kat-2 inter+tak )
      echo $f
      cat $f.iso \
        | basify-weirdos \
        | unbasify-weirdos \
        | sed \
            -e '/^##/s/## *<\(.*\)>.*$/<'"$f"':\1>@/;t' \
            -e '/^#/d' \
            -e 's/[%\!]//g' \
            -e 's/[-=]/./g' \
            -e 's/[.][.][.]*/./g' \
            -e 's/^[. ]*//g' \
            -e 's/[. ]*$//g' \
            -e 's/\([=.]\)/@/g' \
        | tr '@' '\012' \
        | egrep -v '^ *$' \
        > $f.xxx
    end
    
    diff {kat-2,inter+tak}.xxx \
      | prettify-diff-output \
      > .foo
      
  Converting to HTML, quick and dirty:
  
    cat inter+tak.evt \
      | vtt -a1 -b1 -c2 -f0 -l4 -o0 -s0 -tH \
      | sed \
          -e '/^<[^;]*>/d' \
          -e 's/[-]\(.\)/.\1/g' \
          -e '/^#$/d' \
          -e '/^# /d' \
      > tak-rebuilt.evt
      
      39685 lines read in
      12143 lines de-selected
          0 hash comment lines suppressed
          8 empty lines suppressed
      27534 lines written to output

    cat tak-rebuilt.evt \
      | egrep -v '## *<f0' \
      | egrep -v '## *<.*[.].*>' \
      | numbered-eva-to-html \
          'VMS transcription by Takeshi Takahashi'\
      > tak-rebuilt.html

[1998-12-28 stolfi]
  
  Deleted L16-eva; the official directory is L16+H-eva from now on.
  
[2000-05-08 stolfi]

CHECKING FOR ANY CHANGES TO TAKAHASHI'S VERSION
  
    mkdir Takahashi/pages-2
      
  Fetching again the full file:

    wget \
      'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/PagesH.txt' \
      --non-verbose \
      --output-document 'Takahashi/pages-2/full-2.evt'      
      
  Fetching again the page files for comparison:
  
    wget \
      'http://www3.justnet.ne.jp/~ttakahashike/voynich/pages/index.htm' \
      --non-verbose \
      --output-document 'Takahashi/pages-2/index.htm'
      
  Manually extracted the links, saved them to files
  "Takahashi/pages-2/all.fnums" (f-numbers only) and
  "Takahashi/pages-2/all.urls" (complete URLs).
  
  Fetching the pages:
  
    wget \
      --non-verbose \
      --input-file=Takahashi/pages-2/all.urls \
      --directory-prefix=Takahashi/pages-2/ \
      --output-file=Takahashi/pages-2/wget.log
      
  Checking for completeness:
  
    ( cd Takahashi/pages-2/ && grep -L -i -e '</html>' *.htm )
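
  The same completeness test, as a minimal Python sketch: list the
  page files that lack a closing </html> tag (case-insensitive),
  which usually means a truncated download. The function name is
  hypothetical, and reading the files as Latin-1 is an assumption.

```python
from pathlib import Path


def incomplete_pages(directory):
    """Names of *.htm files in `directory` that contain no '</html>' tag."""
    missing = []
    for path in sorted(Path(directory).glob("*.htm")):
        # Assumption: the pages are 8-bit text; Latin-1 decodes any byte.
        text = path.read_text(encoding="latin-1")
        if "</html>" not in text.lower():
            missing.append(path.name)
    return missing


if __name__ == "__main__":
    for name in incomplete_pages("Takahashi/pages-2"):
        print(name)
```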
      
  Removing spurious breaks:
       
    ( cd Takahashi/pages-2/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | tr -d '\015' \
      | unsplit-fontified-lines \
      > kat-new-1.html
      
  Checking for unexpected format:

    cat kat-new-1.html \
      | remove-junk-html \
      | grep -v '<font face="EVA Hand 1">' \
      | egrep -v '^[#][#]' \
      | sort | uniq \
      > .trash
      
  Now do it for real:

    ( cd Takahashi/pages-2/ && cat `cat all.fnums | sed -e 's/$/.htm/'` ) \
      | cleanup-takeshi-html-2 \
      > kat-new-1.iso
     
  Insert locator codes:

    cat kat-new-1.iso \
      | insert-loc-codes \
      | egrep '^(<.*;H>|[#][#])' \
      > kat-new-3.evt

  Map locator codes for maximum compatibility, and 
  reduce character set:
  
    cat kat-new-3.evt \
      | map-locations \
          -v table=tak-sto-locs-new.tbl \
      | egrep '^<' \
      | remove-needless-capitalization \
      | basify-weirdos \
      | sed \
          -e 's/{[^{}&][^{}]*}//g' \
          -e 's/{}//g' \
          -e 's/[-=,]/./g' \
          -e 's/[.] *$/-/g' \
      | sort \
      > kat-new-4.evt
      
  Compare to previous version:
  
    ln -s ../../L16+H-eva
    
    cat L16+H-eva/INDEX \
      | gawk -v FS=':' '/./{print $2;}' \
      > .all.units
      
    set units = ( `cat .all.units` )

    ( cd L16+H-eva && cat $units ) \
      | egrep '^<.*;H>' \
      | tr -d '%\!' \
      | sed \
          -e 's/{[^{}&][^{}]*}//g' \
          -e 's/{}//g' \
          -e 's/[-=,]/./g' \
          -e 's/[.] *$/-/g' \
      | sort \
      > kat-old-4.evt

    diff kat-old-4.evt kat-new-4.evt \
      | prettify-diff-output \
      > .diffs2
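
  Both sides of this comparison go through the same four sed
  substitutions. Here they are as a minimal Python sketch, to spell
  out what is kept and what is discarded; the function name is
  hypothetical and the sample line in the comment is made up.

```python
import re


def normalize(line):
    """Reduce a line to a canonical form for diffing.

    Example (hypothetical text):
      '<f1r.P.1;H> abc,def{plant}-ghi='  becomes  '<f1r.P.1;H> abc.def.ghi-'
    """
    # Drop inline {comments}, except weirdo constructs that start with "&".
    line = re.sub(r"\{[^{}&][^{}]*\}", "", line)
    # Drop empty braces.
    line = line.replace("{}", "")
    # Treat every kind of break (line, paragraph, uncertain space) as a word break.
    line = re.sub(r"[-=,]", ".", line)
    # Mark the end of the line uniformly with "-".
    line = re.sub(r"\. *$", "-", line)
    return line
```

  After this reduction, the remaining differences should be actual
  changes in the text rather than changes in line breaks, paragraph
  marks or comments.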
      
  Checked .diffs2 manually, folded Takeshi's updates
  into L16+H-eva/*.

  Format check:
  
    ( cd L16+H-eva && cat $units ) \
      | validate-new-evt-format \
          -v checkTerminators=1 \
          -v checkLineLengths=1 \
      >& .bugs