Hacking at the Voynich manuscript - Side notes
060 Complete and strict label concordance

Last edited on 2004-07-15 17:32:39 by stolfi

INTRODUCTION

  This note attempts to build a concordance of all figure
  labels that is strict (i.e. without any letter-mapping)
  and complete (i.e. looks for every label throughout the
  whole text).
  
SETTING UP THE ENVIRONMENT

  ln -s ../../L16+H-eva
  ln -s ../037/vms-17.roc.gz
  
  ln -s /home/staff/stolfi/voynich/work 
  ln -s /home/staff/stolfi/projects/langbank
  
  ln -s work/basify-weirdos
  
OBTAINING THE LABEL UNITS LIST

  Extracting the list of units that contain labels, from the 
  current (internal) version of the interlinear:

    cat L16+H-eva/INDEX \
      | grep ':labels:' \
      > label-units.idx

  Now we prepare a file in the following format:
  
    UNUM PNUM STAG FNUM UTAG LINE VTAG LABEL UCMT
    1    2    3    4    5    6    7    8     9
    
  where 
  
    UNUM     is the sequential number of the text unit, e.g. "0352",
             as in the INDEX file.
    
    PNUM     is the reading-order number of the logical page, e.g.
             "p123", as in the INDEX file.
    
    STAG     is the major section tag, e.g. "hea" or "pha".
    
    FNUM     is the standard name of the page, e.g. "f69v".
    
    UTAG     is a tag identifying the text unit in the page, e.g. "L".
    
    LINE     is the line number, e.g. "20" or "0a".
    
    VTAG     is a letter identifying text version (transcriber), e.g. "F".

    LABEL    is the label, with "." between words, free from fillers,
             line/paragraph delimiters, and inline comments, with weirdos
             replaced by "*". Each line of a multiline label is treated
             as a separate label. Variant spaces "," and "-" are mapped
             to ".".
             
    UCMT     is the comment field that applies to this entire text unit,
             taken from the INDEX file, with blanks mapped to "_".
    
  The fields FNUM, UTAG, VTAG, and LINE use the same convention as in the 
  interlinear file.
  
    /bin/rm -f all.lbs
    foreach f ( `cat label-units.idx | tr ' ' '_'` )
      set fld = ( `echo "$f" | tr ':' ' '` )
      set file = "L16+H-eva/${fld[2]}"
      echo "${file}"
      cat ${file} \
        | basify-weirdos \
        | extract-lines-as-labels \
            -v unum="${fld[1]}" \
            -v unit="${fld[2]}" \
            -v pnum="${fld[7]}" \
            -v ucmt="${fld[8]}" \
        | map-field \
            -v table=L16+H-eva/fnum-to-sectag.tbl \
            -v inField=3 -v outField=3 \
            -v default='???' \
        >> all.lbs
    end

  Make a table that maps every label to "+", leaving out the labels
  that contain weirdos ("*"):
    
    cat all.lbs \
      | gawk '//{ print $8, "+"; }' \
      | egrep -v -e '[*]' \
      | sort \
      | uniq \
      > mark-labels.tbl
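
  As a small illustration of this step, the sketch below runs the same
  pipeline on four invented records in the all.lbs layout. The record
  values and the variable names (sample_lbs, mark_labels) are made up
  for the example, and it is written for a POSIX sh with plain awk
  rather than csh and gawk.

```shell
# Invented sample records in the all.lbs layout
# (UNUM PNUM STAG FNUM UTAG LINE VTAG LABEL UCMT); not real data.
sample_lbs='0352 p123 pha f69v L 20 F otol.shey Sample_comment
0352 p123 pha f69v L 21 F okar Sample_comment
0353 p124 pha f70r L 1 F otol.shey Other_comment
0353 p124 pha f70r L 2 F ot*l Other_comment'

# Same steps as above: take field 8, drop labels containing "*",
# and keep one "LABEL +" line per distinct label.
mark_labels=$(printf '%s\n' "$sample_lbs" \
  | awk '{ print $8, "+" }' \
  | grep -v '[*]' \
  | LC_ALL=C sort -u)

printf '%s\n' "$mark_labels"
```

  The repeated label appears only once in the result, and the label
  with a weirdo is left out.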

  Extract the label occurrences from the previously built concordance,
  even though it is slightly out of date. This must be redone when
  the new interlinear is released.
  
    set maxlen = 17
    
    zcat vms-${maxlen}.roc.gz \
      | sort -b -k6,6 \
      | map-field \
          -v table=mark-labels.tbl \
          -v inField=6 -v outField=8 \
          -v default="-" \
      | gawk '($8 == "+"){ $8 = ""; print; }' \
      > labels.ocs
      
  The format of this file is 
   
    LOC VTAGS START LENGTH LCTX STRING RCTX
    1   2     3     4      5    6      7   
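
  The map-field script itself is not listed in this note. Judging only
  from the way it is called here, it seems to look up field inField of
  each record in a two-column table and insert the result (or the
  default) as a new field at position outField. The function below,
  map_field_sketch, is a hypothetical stand-in written under that
  assumption, in POSIX sh and plain awk; it is not the real script, and
  the sample records are invented.

```shell
# Hypothetical stand-in for map-field, inferred only from its call sites.
# Arguments: table file, inField, outField, default value.
map_field_sketch() {
  awk -v table="$1" -v inField="$2" -v outField="$3" -v dflt="$4" '
    BEGIN {
      inField += 0; outField += 0
      # Load "KEY VALUE" pairs from the table file.
      while ((getline line < table) > 0) {
        split(line, kv, " ")
        map[kv[1]] = kv[2]
      }
      close(table)
    }
    {
      val = ($inField in map) ? map[$inField] : dflt
      out = ""
      # Rebuild the record with val inserted at position outField
      # (assumes outField <= NF + 1, as in every call in this note).
      for (i = 1; i <= NF + 1; i++) {
        if (i < outField)       piece = $i
        else if (i == outField) piece = val
        else                    piece = $(i - 1)
        out = out (i > 1 ? " " : "") piece
      }
      print out
    }
  '
}

table=$(mktemp)
printf 'otol.shey +\nokar +\n' > "$table"

# Invented occurrence records: LOC VTAGS START LENGTH LCTX STRING RCTX
marked=$(printf '%s\n' \
  'f69v.L.20 F 0 9 _ otol.shey _' \
  'f1r.P.3 HF 12 4 qokal dain chedy' \
  | map_field_sketch "$table" 6 8 '-')
rm -f "$table"

printf '%s\n' "$marked"
```

  Under this reading, the gawk filter that follows the map-field call
  keeps only the records whose new field 8 is "+".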
    
  We now add some information, creating a file in the following format:

    FNUM UTAG LINE VTAGS START LENGTH LCTX STRING RCTX DEF UNUM UCMT PNUM STAG
    1    2    3    4     5     6      7    8      9    10  11   12   13   14
    
  where VTAGS is a lump of version (transcriber) codes, START and LENGTH
  give the position of the occurrence in the line, LCTX and RCTX are the
  left and right contexts (from the majority version), and DEF is "+"
  if this occurrence lies in a label-containing unit (i.e. occurs as a
  label), "-" otherwise.
  
  For that we need a table that maps the names of label-containing units to "+":
  
    cat label-units.idx \
      | gawk -v FS=':' '// { print $2, "+"; }' \
      | sort \
      > mark-label-units.tbl
      
  We also need a table that maps units to their comments (with blanks mapped to "_"):
    
    cat L16+H-eva/INDEX \
      | tr ' ' '_' \
      | gawk -v FS=':' '//{print $2, $8; }' \
      > unit-to-ucmt.tbl
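
  To make the two tables concrete, here is a sketch that builds both
  from two hypothetical INDEX lines. Only the roles of fields 1, 2, 7
  and 8 (unit number, unit name, page number, comment) are taken from
  the commands in this note; the contents of fields 3 to 6, including
  the position of the "labels" keyword, are placeholders and not the
  real INDEX layout. It uses POSIX sh and plain awk.

```shell
# Hypothetical INDEX lines. Fields 3-6 are placeholders; only fields
# 1, 2, 7 and 8 follow the usage seen in this note.
index_sample='0001:f1r.P:x:x:parags:x:p001:Main text
0352:f69v.L:x:x:labels:x:p123:Labels near the figure'

# Units that contain labels, each mapped to "+".
mark_units=$(printf '%s\n' "$index_sample" \
  | grep ':labels:' \
  | awk -F: '{ print $2, "+" }' \
  | LC_ALL=C sort)

# Every unit mapped to its comment, with blanks turned into "_"
# so that the comment stays a single blank-delimited field.
unit_ucmt=$(printf '%s\n' "$index_sample" \
  | tr ' ' '_' \
  | awk -F: '{ print $2, $8 }')

printf '%s\n' "$mark_units" "$unit_ucmt"
```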

  Now we expand the label occurrence file:

    cat labels.ocs \
      | gawk '//{ $1 = gensub(/[.]([A-Za-z0-9]*)$/, " \\1", "g", $1); print; }' \
      | map-field \
          -v table=mark-label-units.tbl \
          -v inField=1 -v outField=9 -v default='-' \
      | map-field \
          -v table=L16+H-eva/unit-to-useq.tbl \
          -v inField=1 -v outField=10 -v default='????' \
      | map-field \
          -v table=unit-to-ucmt.tbl \
          -v inField=1 -v outField=11 -v default='_' \
      | gawk '//{ $1 = gensub(/[.]([A-Za-z0-9]*)$/, " \\1", "g", $1); print; }' \
      | map-field \
          -v table=L16+H-eva/fnum-to-pnum.tbl \
          -v inField=1 -v outField=13 -v default='p???' \
      | map-field \
          -v table=L16+H-eva/fnum-to-sectag.tbl \
          -v inField=1 -v outField=14 -v default='???' \
      > labels.xoc
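
  The gawk command that appears twice in this pipeline peels one
  dot-separated component off the end of field 1 each time: the first
  pass turns the locator into unit name plus LINE, the second turns the
  unit name into FNUM plus UTAG. The sketch below shows this on one
  invented record. The helper name split_last is made up, and it uses
  match and substr from plain awk in place of gensub, so it should be
  read as an equivalent illustration rather than the command itself.

```shell
# Portable awk version of the step that splits the last ".xxx"
# component off field 1 (match/substr instead of gensub).
split_last() {
  awk '{
    n = match($1, /[.][A-Za-z0-9]*$/)
    if (n > 0) $1 = substr($1, 1, n - 1) " " substr($1, n + 1)
    print
  }'
}

# Invented occurrence record: LOC VTAGS START LENGTH LCTX STRING RCTX
once=$(printf 'f69v.L.20 F 0 9 _ otol.shey _\n' | split_last)
twice=$(printf '%s\n' "$once" | split_last)

printf '%s\n' "$once" "$twice"
```

  This is why the map-field calls between the two passes use field 1
  as the unit name, while those after the second pass use it as FNUM.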
          
  Sort by string, unit number, and line, and format the concordance:
  
    cat labels.xoc \
      | sort -b -k8,8 -k11,11 -k3,3n -k5,5n -k6,6n -k4,4 \
      | format-label-concordance \
      > labels-conc.html
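
  The effect of the sort keys can be checked on a few invented records
  in the labels.xoc layout. The sketch below is written with the "-k"
  key notation of current sort(1), in POSIX sh; the records and the
  variable names are made up. It prints only STRING, UNUM and LINE of
  each sorted record.

```shell
# Invented records in the labels.xoc layout (14 fields); not real data.
xoc_sample='f70r L 2 F 0 4 _ okar _ + 0353 c p124 pha
f69v L 20 F 0 4 _ okar _ + 0352 c p123 pha
f69v L 3 F 0 4 _ okar _ + 0352 c p123 pha
f1r P 1 F 5 4 x dain y - 0001 c p001 hea'

# Keys: STRING (8), UNUM (11), then LINE (3), START (5) and LENGTH (6)
# compared numerically, then VTAGS (4).
sorted=$(printf '%s\n' "$xoc_sample" \
  | LC_ALL=C sort -b -k8,8 -k11,11 -k3,3n -k5,5n -k6,6n -k4,4 \
  | awk '{ print $8, $11, $3 }')

printf '%s\n' "$sorted"
```

  Within one string and unit, line 3 comes before line 20 because the
  LINE key is compared numerically rather than as text.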