Hacking at the Voynich manuscript - Side notes
076 Getting a clean transcription of the Recipes section

Last edited on 2025-07-17 17:17:20 by stolfi

INTRODUCTION

  This note is about creating a transcription of the Recipes or Starred
  Parags (SPS) section of the VMS, with re-checked text and carefully
  marked paragraph breaks.

  For this note, the SPS is defined as all the prose text between page
  f103r line 01 and page f116r line 30. That is all of quire 20, minus
  the last 19 lines of page f116r (which do not seem to consist of
  "recipes" like all previous lines) and page f116v (which has only some
  extraneous writing and a few words of Voynichese).
  
  Note that the central bifolio of the quire (f109+f110) was removed
  after the folios were numbered, probably after the book was bound.
  Presumably it had four pages (f109r, f109v, f110r, f110v).
    
SETUP

    ln -s ../work

    ln -s work/error_funcs.gawk
    ln -s work/compute-freqs.sh

    ln -s work/insert_blank_lines.gawk
    ln -s work/read_table.gawk

    ln -s work/error_funcs.py
    ln -s work/ivtff_align.py
    ln -s work/ivtff_format.py
    ln -s work/compare_ivtff_files.py
    
    ln -s ../074/map_locators.sh
    ln -s ../074/loci-evmt16e6-ivtff.tbl
    
    ln -s ${HOME}/lib/read_table.gawk

    ln -s ${HOME}/ttf

PREPARING THE TRANSCRIPTION FILE

  Created "starps-H.eva" with the SPS part of the VMS, with only the
  lines from the Takeshi Takahashi transcription (";H>") from page f103r
  to page f116r line 30.
  
  Created a similar file "starps-U.eva" with the SPS part of the VMS
  with only the lines from the Stolfi transcription (";U>").
  
  Created a similar file "starps-Z.eva" with Rene Zandbergen's
  transcription (";Z>"). Checked and revised it thoroughly by reference
  to the Beinecke 2014 online scans as of 2025-06.  (Thus it is no longer 
  "Rene's"!)
  
  See "report/report.html" for the glossary of terms used here,
  including "parag head" and "tail", "starlet", "long" and "short line",
  etc.

  The format of these transcription files was initially somewhat
  similar to that of the EVMT interlinear. In particular:
  
    * The locus indicators were <{PAGE}.{LINE};{TRANS}> where {PAGE} is
      "f103r", "f103v", ... "f116r", and {LINE} is the line number,
      starting from 01. The numbering skipped the "titles".
  
  It was then converted to the new IVTFF format, mostly.  In particular:
  
    * Assumed parags were marked with a "<%>" before the text of the
      head line and with "<$>" after the text of the tail.
  
    * After the <%> of a parag head, <!S{nn}> was inserted to mean that
      star "S{nn}" is the starlet assigned to this parag head; or
      "<!NoS>" to mean that this parag has no assigned starlet.
    
    * The text of each line is prefixed with one of [=»]. The '=' means
      that the line starts at or before the left rail, and '»' means
      that it starts after it. This comes after the "<%><!S{nn}>" or
      "<%><!NoS>" marks, if any.
    
    * Each line is suffixed with one of [=«]. The "=" means that the
      line ends at or past the right rail, and "«" means that it ends
      before that rail. This comes before the "<$>" mark, if any.
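
  As an illustration of the markup just described, here is a minimal
  Python sketch that splits the text field of one line into its
  markers. The function name "parse_line_text" and the returned keys
  are hypothetical (they are not part of the scripts used in this
  note), and the sketch assumes that no other inline markup surrounds
  the rail flags.

```python
import re

# Hypothetical helper, not one of the note's scripts. It assumes the
# marker order described above: <%>, <!S{nn}> or <!NoS>, left flag,
# text, right flag, <$>.
LINE_RE = re.compile(
    r"^(?:(?P<head><%>)(?P<star><!(?:S\d+|NoS)>)?)?"  # parag head and starlet
    r"(?P<left>[=»])"                                 # left-rail flag
    r"(?P<body>.*)"                                   # transcribed text
    r"(?P<right>[=«])"                                # right-rail flag
    r"(?P<tail><\$>)?$"                               # parag tail mark
)

def parse_line_text(text):
    """Split the text field of one line into its markers and body."""
    m = LINE_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized line markup: {text!r}")
    star = m.group("star")
    return {
        "is_head": m.group("head") is not None,
        "star": None if star is None else star[2:-1],  # "S07" or "NoS"
        "starts_at_left_rail": m.group("left") == "=",
        "body": m.group("body"),
        "ends_at_right_rail": m.group("right") == "=",
        "is_tail": m.group("tail") is not None,
    }
```

  For example, parse_line_text("<%><!S07>=sairy.ore«<$>") describes a
  one-line parag with starlet S07 that starts at the left rail and
  ends before the right rail.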
    
  For this conversion, the following line numbers had to be changed:
  
    f103v: 28a,29-36,36a,37-44  to 29-46.
    f108v: 24a,25-51            to 25-52.
    f112v: 44a,45-47            to 45-48.
    f115r: 36a,37-44            to 37-45.
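
  The renumbering above can be checked with a small Python sketch. It
  assumes that the old labels are simply renumbered consecutively in
  the order listed; the function names and the RENUMBERINGS table are
  hypothetical, introduced only for this check.

```python
def expand_old_labels(spec):
    """Expand a spec like '28a,29-36,36a,37-44' into a list of labels."""
    labels = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            labels.extend(str(n) for n in range(int(lo), int(hi) + 1))
        else:
            labels.append(part)
    return labels

def renumber(spec, first_new):
    """Map each old label to consecutive new numbers from first_new on."""
    return {lab: first_new + i
            for i, lab in enumerate(expand_old_labels(spec))}

# The four cases listed above: page -> (old labels, first new, last new).
RENUMBERINGS = {
    "f103v": ("28a,29-36,36a,37-44", 29, 46),
    "f108v": ("24a,25-51", 25, 52),
    "f112v": ("44a,45-47", 45, 48),
    "f115r": ("36a,37-44", 37, 45),
}

def check_renumberings(table=RENUMBERINGS):
    """True if every old range fills its new range exactly."""
    return all(max(renumber(spec, first).values()) == last
               for spec, first, last in table.values())
```

  Under that assumption the old and new ranges have the same number of
  lines on all four pages, so check_renumberings() returns True.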
  
  Also, the existing transcriptions of the SPS have four "titles",
  short lines with anomalous justification:

    <f105r.T1.9a>    =sairy.ore.daiindy.ytam=
    <f105r.T2.36>    =otoiis.chedaiin.otair.otaly=
    <f108v.T1.52>    =olchar.olchedy.lshy.otedy=
    <f114r.T1.34>    =ytain.olkaiin.ykar.chdar.alkam=
    
  The "title" <f105r.T1.9a> was assumed to be a part of the following
  line <f105r.P1.10> that the Scribe skipped at first and then added
  above that line.  In the conversion, it was appended to line <f105r.10>.
  
  The other three "titles" were kept as such.  They must be 
  excluded by special tests when analyzing paragraphs.
  
UPDATING THE LOCATORS

  A major step in converting the various files to the IVTFF format was
  replacing the old-style EVMT 1.6e6 locators
  <{PAGE}.{UNIT}.{OSEQ};{TRANS}> with the new-style IVTFF locators
  <{PAGE}.{NSEQ};{TRANS}>. The existing files were first saved as
  read-only copies:
  
    now=`yyyy-mm-dd-hhmmss`
    mkdir -p SAVE/${now}
    for ifile in starps-{U,H,Z} ; do 
      chmod a-w ${ifile}.eva
      mv -vi ${ifile}.eva SAVE/${now}/   
    done
    
  The "-Z" version had already been upgraded, so:
 
    cp -av SAVE/${now}/starps-Z.eva ./
    
  As for the other two:
  
    for ifile in starps-{U,H} ; do 
      ./map_locators.sh < SAVE/${now}/${ifile}.eva > ${ifile}.eva
    done
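
  The internals of "map_locators.sh" and the format of its table are
  not described in this note. The following Python sketch only
  illustrates the idea of the locator rewrite, assuming a hypothetical
  lookup table keyed by (page, unit, old sequence); the function name
  "map_locator" and the table layout are inventions for this example.

```python
import re

# Old-style locator at the start of a line: <PAGE.UNIT.OSEQ;TRANS>,
# for example <f105r.P1.10;H>.
OLD_LOC_RE = re.compile(r"^<(f\d+[rv]\d*)\.([A-Za-z]\w*)\.(\w+);(\w)>")

def map_locator(line, table):
    """Rewrite an old-style locator at the start of `line`.

    `table` is a hypothetical dict {(page, unit, oseq): nseq}; it is
    NOT the format of the real table file. Lines that do not start
    with an old-style locator (comments, blank lines, lines already
    converted) are returned unchanged. A locator that is missing from
    the table raises KeyError.
    """
    m = OLD_LOC_RE.match(line)
    if m is None:
        return line
    page, unit, oseq, trans = m.groups()
    nseq = table[(page, unit, oseq)]
    return f"<{page}.{nseq};{trans}>" + line[m.end():]
```

  Returning unmatched lines unchanged makes the filter safe to apply
  to a file that was already converted, like the "-Z" version.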

STAR PROPERTY TABLE

  Created a star property table "star-pros.txt". See comments in the
  file for the format.
  
  Initial stab at inserting star numbers "<!S{nn}>" and "<!NoS>" in parag
  head lines:
  
    ./replace_star_ids.sh < starps-Z.eva > .temp-Z.eva

  Edited the file "starps-Z.eva" checking and reassigning all parag breaks.

    make -f parag-stats-Z.make
    
  See output in "out/
  
COMPARING VERSIONS

  Wrote a Python 3 program "compare_ivtff_files.py" to compare two files,
  line by line, using an optimal alignment algorithm:
  
    make -f compare_ivtff_files.make

    tag0="Z"
    for tag1 in U H ; do 
      file0="starps-Z.eva"
      file1="starps-${tag1}.eva"
      ofile=".cmp-${tag0}${tag1}.edf"
      ./compare_ivtff_files.py ${file0} ${file1} > ${ofile}
    done
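
  The alignment method used inside "compare_ivtff_files.py" is not
  shown in this note. As an illustration of what an optimal alignment
  of two sequences of loci can look like, here is a minimal Python
  sketch based on the standard longest-common-subsequence dynamic
  program. The function name "align" is hypothetical, and the real
  program may use a different cost function.

```python
def align(seq0, seq1):
    """Align two sequences, maximizing the number of equal pairs (LCS).

    Returns a list of pairs (a, b). The element b is None for an item
    that occurs only in seq0, and a is None for an item that occurs
    only in seq1.
    """
    n, m = len(seq0), len(seq1)
    # lcs[i][j] = length of the LCS of seq0[i:] and seq1[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seq0[i] == seq1[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    # Trace the table forward to recover one optimal alignment.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if seq0[i] == seq1[j]:
            pairs.append((seq0[i], seq1[j]))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            pairs.append((seq0[i], None))
            i += 1
        else:
            pairs.append((None, seq1[j]))
            j += 1
    pairs.extend((a, None) for a in seq0[i:])
    pairs.extend((None, b) for b in seq1[j:])
    return pairs
```

  The table needs time and memory proportional to the product of the
  two lengths, which is small for files of a few thousand lines.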
    
  First run:

    read 2414 lines from file 0 = starps-Z.eva
    read 1313 lines from file 1 = starps-U.eva
    there were 587 loci from file0 missing in file1

    read 2414 lines from file 0 = starps-Z.eva
    read 1655 lines from file 1 = starps-H.eva
    there were 1 loci from file0 missing in file1
    
  Edited the files starps-Z.eva and starps-U.eva until the
  latter became a subset of the former (apart from comments).

  Final run:

    # read  2421 lines ( 1064 data,  23 pages) from file 0 = starps-Z.eva
    # read   921 lines (  476 data,  23 pages) from file 1 = starps-U.eva
    #   588 loci from file0 missing in file1
    #   476 perfectly matching line pairs
    #     0 imperfectly matching line pairs
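
  The subset condition can also be checked independently. The Python
  sketch below assumes that each data line starts with a locator of
  the form <{PAGE}.{NSEQ};{TRANS}> and that loci are compared without
  the transcriber code; the function names are hypothetical.

```python
import re

# Data line: locator, blanks, text. The transcriber code after ';' is
# left out of the key, so that ";Z" and ";U" lines can be compared.
LOCUS_RE = re.compile(r"^<([^;>]+);\w>\s*(.*)$")

def read_data_lines(lines):
    """Return {locus: text}; comments, headers and blanks are skipped."""
    data = {}
    for line in lines:
        m = LOCUS_RE.match(line.rstrip("\n"))
        if m:
            data[m.group(1)] = m.group(2)
    return data

def compare_subset(lines0, lines1):
    """Return (missing in file 1, perfect matches, imperfect matches)."""
    d0, d1 = read_data_lines(lines0), read_data_lines(lines1)
    missing = sum(1 for loc in d0 if loc not in d1)
    perfect = sum(1 for loc in d1 if loc in d0 and d0[loc] == d1[loc])
    imperfect = sum(1 for loc in d1 if loc in d0 and d0[loc] != d1[loc])
    return missing, perfect, imperfect
```

  In the final run above the counts are consistent: 588 missing loci
  plus 476 perfectly matching pairs give the 1064 data lines of
  file 0.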
    
  Saving the current files:
  
    now="`yyyy-mm-dd-hhmmss`"; echo "now = ${now}"
    mkdir -p SAVE/${now}
    mv -vi starps-U.eva SAVE/${now}  
    cp -av starps-Z.eva SAVE/${now}/starps-Z-actually-U.eva
    chmod a-w SAVE/${now}/*.eva
    
      now = 2025-07-15-200047
      renamed 'starps-U.eva' -> 'SAVE/2025-07-15-200047/starps-U.eva'
      'starps-Z.eva' -> 'SAVE/2025-07-15-200047/starps-Z-actually-U.eva'

  Renaming "starps-Z.eva" as "starps-U.evt" to reflect the true culprit
  and uniformize with note 074. Renaming "starps-H.eva" to
  "starps-H.evt" for the same reason:
  
    mv -vi starps-Z.eva starps-U.evt 
    mv -vi starps-H.eva starps-H.evt
    chmod u+w starps-{U,H}.evt 

  Replacing ";Z" by ";U" with emacs.
  
  Moving starps-U.evt to note 074 since further editing will take place there:
  
    mv -vi starps-U.evt ../074/star25e1.evt
    ln -s ../074/star25e1.evt starps-U.evt

TO DO
  
  ??? Move the '#'-comments to the page description file whenever possible.

COUNTING LINES AND PARAGS

    join starps-H.evt, starps-U.evt
    
           lines  parags stars
    page    H  Z   H  Z  Z
    f103r  54 54  19 21 19
    f103v  46 46  14 16 14
    f104r  45 45  13 13 13
    f104v  44 44  13 13 13
    f105r  35 35  10 16 10
    f105v  38 38  10 20 10
    f106r  47 47  15 17 16
    f106v  47 47  15 16 14
    f107r  51 51  15 16 15
    f107v  49 49  16 16 15
    f108r  50 50  16 17 16
    f108v  52 52  17 16 16
    f111r  54 54  17 19 17
    f111v  51 51  18 20 19
    f112r  45 45  12 15 12
    f112v  47 47  14 17 13
    f113r  51 51  17 19 16
    f113v  49 49  15 15 15
    f114r  44 44  13 14 13
    f114v  41 41  12 13 12
    f115r  45 45  13 14 13
    f115v  45 45  13 14 13
    f116r  30 30  10 11 10
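
  A table of this kind can be produced with a few lines of Python. The
  sketch below is not the procedure actually used for the table; it
  assumes that a parag is counted at each "<%>" mark and a star at
  each "<!S{nn}>" mark that follows one, and the name
  "count_per_page" is hypothetical.

```python
import re

# Data line: <PAGE.NSEQ;TRANS> followed by blanks and the text.
LOC_RE = re.compile(r"^<(f\d+[rv]\d*)\.\w+;\w>\s*(.*)$")
STAR_RE = re.compile(r"^<%><!S\d+>")

def count_per_page(lines):
    """Return {page: [n_lines, n_parags, n_stars]} in file order."""
    counts = {}
    for line in lines:
        m = LOC_RE.match(line)
        if m is None:
            continue  # comment, blank line, or page header
        page, text = m.groups()
        c = counts.setdefault(page, [0, 0, 0])
        c[0] += 1                      # one more text line on this page
        if text.startswith("<%>"):
            c[1] += 1                  # parag head
            if STAR_RE.match(text):
                c[2] += 1              # head with an assigned starlet
    return counts
```

  Running it on both files and joining the results by page gives one
  row per page.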

PARAG BREAKING RULES

  ??? Revised the text (STILL DOING) and all parag breaks.

  See "report/report.html" for nomenclature ("parag", "head", "tail",
  "starlet", "puff", "right rail", "left rail", "short line", "long
  line", "linegap", etc.) and for the parag breaking rules.

  The title <f114r.T1.34> is a right-justified line after a parag that
  ends with a full line. It had been assumed to be the tail of the
  previous parag that the Scribe skipped and then inserted in that
  non-standard position. However, the first line of the next parag
  <f114r.P1.35> bends down to avoid that title. Thus, if that conjecture
  is true, the Scribe must have realized the omission after writing the
  first 4 lines of <f114r.P1.35>. I have now re-interpreted
  <f114r.T1.34> as a title.

  It is possible that other section headers were not recognized as such
  and were joined with adjacent parags.

  After commenting out the subsection titles in both files, I counted
  again the number of words and parags, and computed basic statistics
  (min, max, average, and standard deviation) of the number of words
  per paragraph (nwp):
  
    ./count_recipes_and_words.sh starps.evt

    !!! OLD: !!!
    statistic   |  bencao |  starps
    ------------+---------+--------
    parags      |     354 |     330
    words       |   10874 |   10457
    min nwp     |       7 |      11
    max nwp     |      76 |      72
    avg nwp     |    30.8 |    31.7
    dev nwp     |     8.5 |    11.2
    !!! !!!
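
  The statistics in a table like this one can be computed as in the
  Python sketch below. It is not the code of
  "count_recipes_and_words.sh"; it assumes that words are separated
  only by "." and that "dev" is the population standard deviation,
  and the name "nwp_stats" is hypothetical.

```python
import statistics

def nwp_stats(parags):
    """Statistics of the number of words per paragraph (nwp).

    `parags` is a list of paragraphs, each a list of line texts with
    the rail flags and other markup already removed. Words are assumed
    to be separated by '.' only; empty pieces are ignored.
    """
    nwp = [sum(len([w for w in body.split(".") if w]) for body in parag)
           for parag in parags]
    return {
        "parags": len(nwp),
        "words": sum(nwp),
        "min": min(nwp),
        "max": max(nwp),
        "avg": statistics.mean(nwp),
        "dev": statistics.pstdev(nwp),  # population deviation (assumed)
    }
```

  If the sample deviation is wanted instead, statistics.stdev can be
  used in place of statistics.pstdev.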

CREATING THE PARAG SPLITTING REPORT

  Creating the file "report/report.html" with a description of how the 
  parags were chosen.
  
  First let's create the raw page images:

    ln -s ../../../FromBeinecke
    mkdir -p report/images/raw
    for bf in `cat beinecke_SPS_images.txt` ; do
      fnum="${bf/*-/}"
      convert FromBeinecke/${bf}.jpg -resize 'x800' report/images/raw/${fnum}.png
    done
    eom report/images/raw/*.png
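
  The file naming in the loop above can be previewed without running
  ImageMagick. The Python sketch below only builds the argument lists
  of the "convert" commands; the function name "resize_commands" is
  hypothetical, and the image names used to try it out are made up.

```python
def resize_commands(names, src_dir="FromBeinecke",
                    dst_dir="report/images/raw"):
    """Build the 'convert' argument lists for the given image names."""
    cmds = []
    for bf in names:
        # Same effect as the shell expansion ${bf/*-/}: keep the text
        # after the last '-' (the whole name if there is no '-').
        fnum = bf.rsplit("-", 1)[-1]
        cmds.append(["convert", f"{src_dir}/{bf}.jpg",
                     "-resize", "x800",
                     f"{dst_dir}/{fnum}.png"])
    return cmds
```

  Each list can be passed to subprocess.run once the names look right.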

IMAGES

  ??? Make the images showing parag splitting.
  ??? Remove comments that are in the description file.
  ??? Write script to automatically assign parag breaks.