Hacking at the Voynich manuscript - Side notes
074 Revising the "U" (Stolfi) transcription

Last edited on 2025-07-30 04:09:20 by stolfi

INTRO 

  This note began as an attempt to prepare a new version of the
  EVMT ("European Voynich Multi-Transcription") file, to be version
  25e1 (meaning 2.5 release 1). It would pick up from the abandoned
  attempt to build version 20e1 described in Notes/072 and
  Notes/073.
  
  However, Rene Zandbergen has created a much improved version
  (IVTFF) of the transcription format and software. So this
  note is now redirected at preparing my own transcriptions
  (";U" in the interlinear) in a format compatible with that new standard.
  This requires, among other things, changing the format of all locators
  and mapping the encoding I have been using to the one he chose.
  
  One reason for this effort is to investigate the theory that
  one-leg gallows with hooks are distinct from those without hooks.

LINKS
  
    ln -s ../.. work
    ln -s work/Notes
    
    ln -s work/ivtff_frac_word_counts.py
    ln -s work/ivtff_format.py
    ln -s work/process_frac_words.py 
    
OLD EVT INPUT FILES
 
  We start from the file Notes/072/text20e1-03.evt, which is the version
  of EVMT 20e1 that was prepared on 2005-02-03 but never released. It
  probably needs extensive checking.
    
    chmod a-w Notes/072/text20e1-03.evt
    ln -s Notes/072/text20e1-03.evt 
   
  We must massage the ".evt" file a bit before automatic conversion.
  For one thing, it has too many groups like "(cht)", and it is not
  clear what they should really look like. They are probably not consistent.
  
  The plan is to inspect the groups /[(][ci][ktfpzw]*h+[)]/ and
  /[(]sh+[)]/. If they ARE ligated in the usual way, just remove the
  parens, as in the EVMT 16e6 format, since they will be converted to
  /[CI][KTFPZW]*H*h/ and /SH*h/ by the automatic EVA-XEVA conversion
  scripts. If they are NOT ligated, convert them to symbolic weirdos
  "*{&...}" where "..." is the XEVA code for the ligated combination.
  
    cp text20e1-03.evt text20e1-30.evt
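
  To see which parenthesized groups need inspection, and how often each
  one occurs, a tally like the following can help. This is only a
  sketch: the function name "list_paren_groups" is hypothetical (it is
  not one of the scripts used in this note), and it assumes a grep that
  supports the -o, -h and -E options (GNU and BSD grep do).

```shell
# Tally the parenthesized ligature candidates in the given files (or stdin).
# The pattern is the union of the two patterns quoted above.
# Output: one "COUNT GROUP" line per distinct group, most frequent first.
list_paren_groups() {
  grep -ohE '[(]([ci][ktfpzw]*h+|sh+)[)]' "$@" \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}
```

  Possible use: "list_paren_groups text20e1-30.evt". Groups that do not
  fit either pattern are not listed and have to be found separately.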
    
  Inspected and edited text20e1-30.evt by hand, replacing
  weirdos and weird ligatures by codes like &{OPr} or &{310}.
     
  !! NOTE: my weirdo codes differ from Rene's. Mapping will be necessary
  at some point.
 
EXTRACTING MY TRANSCRIPTION

  Removing from "text20e1-30.evt" all transcriptions except ";U":
  
    cat text20e1-30.evt \
      | remove_non_stolfi_transcriptions.gawk \
      > text25e1-50.evt
      
  Saved the version "text20e1-30.evt" just in case:
   
    now=2025-05-31-095000
    mkdir SAVE/${now}
    chmod a-w text20e1-30.evt
    mv -vi text20e1-30.evt SAVE/${now}/
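
  The extraction step at the start of this section can be sketched as
  below. This is NOT the actual "remove_non_stolfi_transcriptions.gawk";
  it is a minimal illustration that assumes every data line starts with
  a locator of the form "<...;X>", where X is a one-character
  transcriber code, and that all other lines (comments, page headers,
  blank lines) should be kept. The name "keep_transcriber" is
  hypothetical.

```shell
# Keep only the data lines of one transcriber code; pass everything else through.
keep_transcriber() {
  awk -v code="$1" '
    match($0, /^<[^<>;]*;[^<>;]>/) {
      # The transcriber code is the character just before the closing ">".
      if (substr($0, RLENGTH - 1, 1) == code) print
      next
    }
    { print }   # comments, page headers, blank lines
  '
}
```

  Possible use: "keep_transcriber U < text20e1-30.evt > text25e1-50.evt".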
  
  (It seems that a tentative IVTFF version "text25e1-01.xev" was created
  in 2025-05 from text20e1-30.evt but then editing continued in
  "text25e1-51.evt". Saved "text25e1-01.xev" just in case.)

  Proceeded to manual edit of "text25e1-51.evt".
  
  Split off:
  
    "text25e1-weirdos.txt" - weirdo code definitions.
    
    "text25e1-intro.txt" - the introductory comments.
    
    "../076/starps-U.eva" - the Starred Parags section (SPS, f103r to f116r line 30).
    
  Also extracted from "text20e1-30.evt" the versions by 
  Takeshi (';H') and Rene (';Z') of the SPS:
  
    "../076/starps-H.eva"
    "../076/starps-Z.eva"
   
  The files "../076/starps-U.eva" and "../076/starps-Z.eva" were
  made uniform until the TEXT of the former was a subset of that of
  the latter. The #-COMMENTS, however, were not. Then
  "../076/starps-U.eva" was saved to
  "SAVE/2025-07-15-200047/starps-U.eva" and "../076/starps-Z.eva"
  was renamed "../076/starps-U.evt", with ";U" codes replacing the ";Z"
  codes. See "../076/Note-076.txt".
  
    ln -s ../076/starps-U.evt
    
  See also:
    
    "../073/desc25e1-51.txt" - per-page verbal descriptions.

  Continued editing "text25e1-51.evt"
  
  Replaced all weirdo code notations "&\{{NNN}\}" by "&{NNN};".

  Replaced all EVA font notations in comments, from
  "<{CHAR}>" to "@{CHAR}", and from "<{TEXT}>" to "@'{TEXT}'".
  
  Replaced all inline comments "{...}" by "<!...>".
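
  The last replacement can be sketched with sed. The edits above were
  made by hand or with other tools; this is only an illustration. It
  ASSUMES that a brace group directly preceded by "&" or "@" is a
  weirdo code or an EVA font notation and must be left alone, and that
  every other "{...}" is an inline comment. The function name is
  hypothetical.

```shell
# Rewrite inline comments "{...}" as "<!...>", leaving alone any brace group
# that directly follows "&" or "@" (assumed to be a weirdo code or an EVA
# font notation). The ":a ... ta" loop repeats the substitution until
# nothing changes, so adjacent comments like "{x}{y}" are both converted.
convert_inline_comments() {
  sed -E -e ':a' -e 's/(^|[^&@])\{([^{}]*)\}/\1<!\2>/' -e 'ta'
}
```

  The result still needs to be checked by eye, since the rule cannot
  tell a comment from any other use of braces.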
    
OBTAINING THE LATEST IVTFF FILE

  Downloaded the latest version of Rene's "reference" transcription, 
  "RF1b-e.txt", from "https://voynich.nu/data/RF1b-e.txt".
  Renamed it "text25rz-40.txt", made it readonly.
  
  Had to rename the rosettes page from "fRos" to "f85v2", since otherwise
  all my scripts would break. So made a writable copy:
  
    cp text25rz-40.txt text25rz-41.txt
    chmod u+w text25rz-41.txt
    
  Edited replacing "fRos" by "f85v2".
  
  Also replaced page "f101v" by "f101v2" for consistency.

MAPPING EVMT LOCATORS TO IVTFF LOCATORS

  Rene sent a table "rene-loci-table.xlsx" with the mapping from old
  EVMT 1.6e6 locators to his new IVTFF locators. Extracted it as
  "rene-loci-table-orig.csv" then edited by hand to
  "loci-evmp16e6-ivtff.tbl". Applied the page number fixes above.
  
  To check, extracted all locators from the two files:
  
    ./match_locators_25e1_25ez.sh
 
  There were lots of old locators that did not match, concentrated on
  several pages - such as the "nine rosettes" page (f85v2) and the "ages
  of man" page (f67v2). Rene's table mapping old locators to new ones
  was based on the real 1.6e6 version of the interlinear. However
  "text25e1-51.evt" was based on the aborted update of that interlinear
  (Notes/072). That upgrade implied adding many units not previously
  transcribed, and renaming dozens of units.
  
  Edited the files and re-ran the script above until all old
  locators and all new locators were in 1:1 correspondence.
    
      lines   words     bytes file        
    ------- ------- --------- ------------
       3513    3513     45249 .locators_old
       3513    7026     80634 .locators_old_mapped_to_new
          0       0         0 .old_unmapped
       5388    5388     54207 .locators_new
       5388   10776    123832 .locators_new_mapped_to_old
          0       0         0 .new_unmapped
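
  A consistency check of this kind can be sketched as follows. This is
  not the actual "match_locators_25e1_25ez.sh". It ASSUMES that the
  mapping table is a plain text file with two whitespace-separated
  columns, "OLD NEW", one pair per line; the real layout of
  "loci-evmp16e6-ivtff.tbl" may differ. Both function names are
  hypothetical.

```shell
# Report locators that appear more than once in either column of a
# two-column "OLD NEW" table; no output means the table is one-to-one.
check_one_to_one() {
  awk '
    NF >= 2 {
      if (++seen_old[$1] == 2) print "duplicate old:", $1
      if (++seen_new[$2] == 2) print "duplicate new:", $2
    }
  ' "$1"
}

# Print the locators listed in file $2 (one per line) that have no entry
# in column 1 of the table $1.
unmapped_locators() {
  awk '
    FILENAME == ARGV[1]    { known[$1] = 1; next }
    NF > 0 && !($1 in known) { print $1 }
  ' "$1" "$2"
}
```

  An empty output from both functions corresponds to the zero-length
  ".old_unmapped" and ".new_unmapped" files in the table above.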

  Wrote a script that reads {stdin}, maps all old locators (with two
  '.') on data lines to new locators (with a single '.') and writes to
  stdout. It does not touch locators in comments; those will have to be
  fixed by hand, since sometimes they should remain old-style.
  
    cat text25e1-51.evt | map_locators.sh > text25e1-52.evt
    now="`yyyy-mm-dd-hhmmss`"
    mkdir -p SAVE/${now}
    chmod a-w text25e1-51.evt
    mv -vi text25e1-51.evt SAVE/${now}/
    
    ln -s ../073/desc25e1-52.txt
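
  The locator mapping described in this section can be sketched as
  below. This is not the actual "map_locators.sh". It ASSUMES a
  two-column "OLD NEW" table (hypothetical layout) whose entries omit
  the "<", the ";X" transcriber code and the ">", and it rewrites only
  a locator at the very start of a line. The function name is
  hypothetical.

```shell
# Rewrite the locator of each data line using a two-column "OLD NEW" table.
# Only a locator at the very start of a line is touched, so "#" comment
# lines and locators mentioned inside comments are left alone.
map_locators_sketch() {
  awk -v table="$1" '
    BEGIN {
      while ((getline line < table) > 0) {
        n = split(line, f, " ")
        if (n >= 2) map[f[1]] = f[2]
      }
      close(table)
    }
    match($0, /^<[^<>;]*/) {
      old = substr($0, 2, RLENGTH - 1)      # locator without "<" and ";X>"
      if (old in map) $0 = "<" map[old] substr($0, RLENGTH + 1)
    }
    { print }
  '
}
```

  Locators missing from the table pass through unchanged, so the
  result should still be checked for leftover old-style locators.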
    
CHANGING THE PARAGRAPH MARKERS

  In the old EVMT format, parags were marked only by "=" at the end of the last line
  and "-" at the end of every other line.

  In preparation for fixing the paragraph markers:
  
    Temporarily replaced the line-final "-" with "<|>".
  
    Prefixed every text line that follows a "<|>" line with "<:>".
    
    Replaced every final "=" with "<$>".
    
    Looked for lines that do not end with "<$>" or "<|>" and fixed them:
    
      cat text25e1-52.evt \
        | sed \
           -e 's:^[ ]*\([#]\|[@][@]\|$\).*::g' \
           -e 's:^<f[0-9][0-9]*[rv][0-9]*>.*::g' \
           -e 's:^.*<[|$]> *$::g' \
        | egrep --color=auto -nH --null -e '.' \
        | sed -e 's:[(]standard input[)]:text25e1-52.evt:g' \
        > .bugs
        
    Adding start-of-parag markers:
   
      ./add_parag_markers.gawk text25e1-52.evt \
        > .tmp
      prdiff -Bb text25e1-52.evt .tmp | head -n 200 > .diff

      # now="`yyyy-mm-dd-hhmmss`"
      now="2025-07-15-171634"
      mkdir -p SAVE/${now}
      mv -vi text25e1-52.evt SAVE/${now}/
      chmod a-w SAVE/${now}/text25e1-52.evt
      mv -vi add_parag_markers.gawk SAVE/${now}/

      mv .tmp text25e1-53.evt
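
  The first three preparatory steps of this section (the "<|>", "<:>"
  and "<$>" substitutions) can be sketched in awk. It is not known
  whether they were originally done by script or by hand, so this is
  only an illustration. It ASSUMES that a data line is one whose
  leading locator contains a "."; the function name is hypothetical.

```shell
# On data lines (locator containing a "."), turn a line-final "-" into "<|>"
# and a line-final "=" into "<$>", and prefix the text of any line that
# follows a "<|>" line with "<:>". All other lines pass through unchanged.
mark_line_ends() {
  awk '
    /^<[^<>]*[.][^<>]*>/ {
      p = index($0, ">")
      loc = substr($0, 1, p)
      txt = substr($0, p + 1)
      body = txt
      sub(/^[ \t]+/, "", body)                      # text without padding
      pad = substr(txt, 1, length(txt) - length(body))
      if (cont) body = "<:>" body
      cont = 0
      if (sub(/-[ \t]*$/, "<|>", body)) cont = 1
      else sub(/=[ \t]*$/, "<$>", body)
      print loc pad body
      next
    }
    { print }
  '
}
```

  Data lines that end with neither "-" nor "=" are left as they are,
  so they still show up in the ".bugs" check.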

SIMPLIFYING THE NAMES

      mv -vi text25e1-53.evt text25e1.evt
      mv -vi desc25e1-53.evt desc25e1.evt
      mv -vi star25e1-53.evt star25e1.evt

REPLACING IMPLICIT LIGATURES BY EXPLICIT ONES

  We want to replace implicit ligatures like @'ycthhey' by explicit ones
  like @'y{CTHh}ey'.
  
  First let's save the current state of things:
  
    now="`yyyy-mm-dd-hhmmss`"; echo "now = ${now}"
    # now=2025-07-17-140500
    mkdir -p SAVE/${now}
    cp -av \
        text25e1.evt star25e1.evt \
        text25e1-weirdos.txt  text25e1-intro.txt \
      SAVE/${now}
    cp -av ../073/desc25e1.txt SAVE/${now}
    chmod a-w SAVE/${now}/*.{evt,txt}

  Piped all four files through "convert_ligatures.sed".
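
  The intended rewrite can be illustrated with a small awk sketch. This
  is NOT the contents of "convert_ligatures.sed". It ASSUMES that the
  implicit ligatures are exactly the strings matching
  /[ci][ktfpzw]*h+/ or /sh+/, and that the explicit form is the same
  string in braces with every letter but the final "h" in upper case,
  generalizing the single example @'ycthhey' -> @'y{CTHh}ey'. The
  function name is hypothetical.

```shell
explicit_ligatures() {
  awk '
    # Wrap each implicit ligature in braces, with every letter but the
    # final "h" in upper case: "cthh" -> "{CTHh}", "sh" -> "{Sh}".
    function ligate(s,    out, grp) {
      out = ""
      while (match(s, /[ci][ktfpzw]*h+|sh+/)) {
        grp = substr(s, RSTART, RLENGTH)
        out = out substr(s, 1, RSTART - 1) "{" toupper(substr(grp, 1, length(grp) - 1)) "h}"
        s = substr(s, RSTART + RLENGTH)
      }
      return out s
    }
    /^#/ { print; next }                  # comment lines are not text
    /^</ {                                # data line: keep the locator as is
      p = index($0, ">")
      print substr($0, 1, p) ligate(substr($0, p + 1))
      next
    }
    { print }
  '
}
```

  The sketch does not protect inline comments or weirdo codes inside
  the text, so any English words or code names in them would be
  mangled; a real conversion has to skip those spans.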

>>> STOPPED HERE 2025-05-27

TO DO

  ??? REPLACE <|> <:> BY [=»«]

LISTING ALL LOCATORS WITH ONE-LEG GALLOWS

  Making a list of all line locators in the full EVT that contain [fp]
  gallows (that may need to be converted to [zw]):
  
    ./list_one_gallows_loci.sh text20e1-30.evt text25e1-51.evt 0
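
  The core of such a listing can be sketched as below. This is not the
  actual "list_one_gallows_loci.sh", whose arguments and output are not
  reproduced here. Since every locator itself starts with "f", only
  the text after the locator may be searched. The sketch ASSUMES that
  data lines have a locator containing a ".", and that inline comments
  appear as "{...}" (old style) or "<!...>" (new style); it drops
  both before searching. The function name is hypothetical.

```shell
# Print the locator of every data line whose text contains an "f" or "p".
list_fp_loci() {
  awk '
    /^<[^<>]*[.][^<>]*>/ {
      p = index($0, ">")
      txt = substr($0, p + 1)
      gsub(/<![^<>]*>|[{][^{}]*[}]/, "", txt)   # drop inline comments
      if (txt ~ /[fp]/) print substr($0, 1, p)
    }
  ' "$@"
}
```

  Note that old-style weirdos "*{&...}" are dropped along with the
  comments, so a one-leg gallows hidden inside a weirdo is not
  reported.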
    
CONVERTING MORE
      
  First stab at converting the old EVMT to Rene's IVTFF format
  
    ./convert_evmt_20e1_to_evmt_25e1.sh

  After many ad-hoc tweaks in the input "text20e1-50.evt", we got the
  above script to process without errors.

    * Each /glyph/ occurrence in the text is defined as a maximal set of strokes that are
      (or presumably were intended to be) connected by contact
      or ligatures.  In the XEVT format, each glyph is encoded as a pair of parens '()'
      enclosing a string of one or more XEVA /simple glyph/ codes, like "(v)" or "(Sh)" or
      "(AKPIHO)".

    * The XEVA simple glyph codes include the basic lowercase EVA letters [adefik-ty]
      and the two combinations @{Ch}, @{Sh}, and the platform gallows @{CKh}, @{CTh}, @{CFh},
      and @{CPh}.  They also include new lowercase codes @b, @g, @u, @j (@e, @a, or @i with
      plumes or tails), @v (the caret), @x (the picnic table), and @z and @w (versions
      of @f and @p with an @e-hook at the end of the horizontal arm); thus completing all
      lowercase letters.  They also include
      @c and capital letters [ACHIOQRSY] denoting the same simple glyphs as the lower case
      versions with a ligature line added at top right; @E which is an @e that can connect to
      the bottom of the next glyph; and [KTFPZW] which are the gallows [ktfpzw] with a
      stroke forming the floor of the platform.

    * The XEVA simple codes also include weirdo codes like &NNN; where NNN is a 3-digit number.
      The previous EVMT notation like "s{&123}" or "*{&o'}" is replaced by codes "&123" in XEVA
      that function as simple letters.

  Thus the line "<f72v3.S1.8;V> ol*{&ol}.ofaiin=" from EVMT 16e6 would become
  "<f72v3.S1.8;V> (o)(ll)(&312).(o)(f)(a)(i)(i)(n)=" in the new EVMT file.

LISTING WEIRDO USES AND DEFINITIONS

  The weirdos and non-basic glyphs are encoded as "&{...}".

  Listing weirdo uses in both files and definitions in the text file:

    ./list_weirdo_uses_and_defs.sh
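
  One useful check on such a listing is to find codes that are used
  but never defined. The sketch below is not the actual
  "list_weirdo_uses_and_defs.sh". It ASSUMES only that each definition
  mentions its code in the same "&{...}" form that is used in the
  text, that codes contain no braces, and that grep supports the -o,
  -h and -E options. Both function names are hypothetical.

```shell
# Print every distinct "&{...}" code found in the given files (or stdin).
weirdo_codes() {
  grep -ohE '&[{][^{}]*[}]' "$@" | LC_ALL=C sort -u
}

# Print the codes used in the text files $2... that are not mentioned
# anywhere in the definitions file $1.
undefined_weirdos() {
  defs="$1"; shift
  {
    weirdo_codes "$defs" | sed 's/^/D /'    # tag definitions
    weirdo_codes "$@"    | sed 's/^/U /'    # tag uses
  } | awk '
    /^D / { known[substr($0, 3)] = 1; next }
    !(substr($0, 3) in known) { print substr($0, 3) }
  '
}
```

  Possible use: "undefined_weirdos text25e1-weirdos.txt text25e1.evt".
  Swapping the arguments lists codes that are defined but never used.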

TO DO