Hacking at the Voynich manuscript - Side notes
062 Occurrences of Grove words

Last edited on 2002-03-05 03:43:38 by stolfi

INTRODUCTION

  John Grove observed that many text lines begin with what he called a
  "detachable gallows": a gallows that, if removed, leaves a valid
  word that occurs elsewhere in the text. Let's call the initial word
  of such lines a "Grove word".
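
  As a rough illustration of this definition, the sketch below lists
  the line-initial words whose remainder, after the first gallows is
  detached, occurs somewhere in the text. It is not one of the scripts
  used in this note: the function name "grove_words" is made up, the
  input is assumed to have one text line per input line with
  blank-separated EVA words, and "gallows" is taken to mean the EVA
  letters t, k, p, f. It is written as a POSIX sh function around
  plain awk so that it can be tried on its own.

```shell
# Hypothetical helper, not part of the original tool set.
# Input: one text line per input line, EVA words separated by blanks.
# Output: LINE-NUMBER WORD RESIDUE for each candidate Grove word.
grove_words() {
  awk '
    {
      for (i = 1; i <= NF; i++) seen[$i] = 1   # vocabulary of the whole text
      first[NR] = $1                           # line-initial word
    }
    END {
      for (n = 1; n <= NR; n++) {
        w = first[n]
        r = substr(w, 2)                       # word with initial letter detached
        # Assumption: the gallows are exactly the EVA letters t, k, p, f.
        if (w ~ /^[tkpf]./ && (r in seen)) print n, w, r
      }
    }
  '
}
```

  Note that the residue is looked up in the vocabulary of the whole
  input, including the line where the candidate itself occurs.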
  
  It has also been observed that the first word on each page is
  usually unique, at least in the herbal section; that the first word
  of a paragraph often begins with a "p" or "f" gallows; and that the
  rare words with two gallows are usually Grove words.

  This note is motivated by the hunch that all those special words
  may be names of important things --- plants, remedies, illnesses,
  whatever.  Here we analyze whether such words occur elsewhere
  in the text, either exactly or with simple changes (e.g. removal
  of the detachable gallows).
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../../L16+H-eva
    ln -s ../../basify-weirdos
    ln -s ~/PUB/bin/map-field

COLLECTING THE TEXT

  cat L16+H-eva/INDEX \
    | gawk -v FS=':' \
        ' (($6 == "parags") || ($6 == "starred-parags")){ \
            printf "%s\n", $2; \
          } ' \
    > .files
    
  set files = ( `cat .files` )
  ( cd L16+H-eva && cat $files ) \
    | basify-weirdos \
    | reformat-evt-file \
    > main.evr

EXTRACTING THE LOCATED TOKEN LIST
  
  Create a file with one record per token, in the format 

    SEC USEQ FNUM UNIT LINE TRAN FPOS RPOS PFRST PLAST WORD
    1   2    3    4    5    6    7    8    9     10    11
  
  where WORD is the word in question; SEC is a book section code; USEQ
  is the nominal position index of the UNIT in the text; FNUM, UNIT,
  LINE and TRAN are fields of the line locator (the page f-number; the
  text unit; the line number; and the transcriber code); FPOS is the
  sequential number of the word in the line; RPOS is the same,
  counting backwards from the end of the line; PFRST is a boolean (0 or 1)
  identifying the first token of a paragraph; and PLAST is analogous
  for the last token.
  
  The SEC field is a three-letter code for the section ("bio", "pha", etc.),
  except that "hea" and "heb" are collapsed into "her".

    cat main.evr \
      | extract-tokens \
      | map-field \
          -v inField=1 -v outField=1 \
          -v table=L16+H-eva/fnum-to-sectag.tbl \
      | gawk ' /./{ gsub(/he[ab]/, "her", $1); $3 = ($2 "." $3); print; } ' \
      | map-field \
          -v inField=3 -v outField=2 \
          -v table=L16+H-eva/unit-to-useq.tbl \
      | gawk ' /./{ gsub(/^f.*[.]/, "", $4); print; } ' \
      | sort -b -k1,1 -k2,2n -k5,5g -k6,6 -k7,7n \
      > main.lts
    dicio-wc main.lts

      lines   words     bytes file        
    ------- ------- --------- ------------
     114329 1257619   4037393 main.lts
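
  To show how the fields of this record format can be used, here is a
  small sketch that counts the line-initial tokens (FPOS = 1) in each
  section. The function name "count_line_initial" is made up for this
  example; the counts include every transcription, since each
  transcriber's reading is a separate record.

```shell
# Hypothetical helper: reads records in the 11-field format
#   SEC USEQ FNUM UNIT LINE TRAN FPOS RPOS PFRST PLAST WORD
# and prints "SEC COUNT" for the line-initial tokens of each section.
count_line_initial() {
  awk '
    $7 == 1 { n[$1]++ }                  # $7 is FPOS, $1 is SEC
    END { for (s in n) print s, n[s] }
  ' | sort                               # awk array order is unspecified
}
```

  Typical use would be "count_line_initial < main.lts".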

COLLECTING THE SPECIAL TOKENS

  We define a word to be "special" if it is a line-initial word
  of paragraph or starred-paragraph text (in any transcription),
  has at most one "*", has at least four letters, and either
  
    is the first word of the paragraph and starts with any gallows, or 
    
    contains a "p" or "f" gallows, or
    
    contains two or more gallows.
    
  Note that this definition excludes any Grove word that begins
  with a "t" or "k" gallows, is not paragraph-initial, and contains
  no other gallows.  Such a word may be an ordinary word that merely
  resembles a Grove word and fell in line-initial position by
  accident.
  
    cat main.lts \
      | select-special-tokens \
      | sort -b -k11,11 -k1,1 -k2,2n -k5,5g -k7,7n -k8,8n -k6,6 \
      > main.spw
    dicio-wc main.spw

        lines   words     bytes file        
      ------- ------- --------- ------------
         2256   24816     83569 main.spw
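
  The selection rule above can be sketched as follows. This is only a
  guess at what a filter like select-special-tokens might look like,
  not its actual code: the function name "select_special_sketch" is
  made up, "at least four letters" is read as a word length of at
  least 4 characters, and the gallows are taken to be the EVA letters
  t, k, p, f. The input is assumed to contain only paragraph and
  starred-paragraph text, as main.lts does.

```shell
# Hypothetical sketch of the "special token" filter.
# Input and output records use the 11-field format of main.lts.
select_special_sketch() {
  awk '
    $7 == 1 {                              # FPOS = 1: line-initial token
      w = $11
      stars = gsub(/[*]/, "*", w)          # gsub returns the match count
      gals  = gsub(/[tkpf]/, "&", w)       # number of gallows letters
      if (stars > 1 || length(w) < 4) next
      if (($9 == 1 && w ~ /^[tkpf]/) ||    # paragraph-initial, gallows first
          w ~ /[pf]/ ||                    # contains a "p" or "f" gallows
          gals >= 2)                       # two or more gallows
        print
    }
  '
}
```

  Both gsub calls replace each match by itself, so the word is left
  unchanged and only the counts are used.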

COMPUTING THE REPORT HEADINGS

  Extract the "headings" for the output report. The headings are
  pairwise disjoint sets of words, each as small as possible, 
  that satisfy the following rules:
  
    1. every word that occurs as a special token belongs to some
       heading.
  
    2. two words that occur at the start of the same line 
       (in different transcriptions) belong to the same heading.  
       
    cat main.spw \
      | gawk ' /./{ print ($3 "." $4 "." $5), $11; } ' \
      | sort -b -k1,1n -k2,2 | uniq \
      | collect-headings \
      | sort -b -k2,2n \
      > main.hds
    dicio-wc main.hds

        lines   words     bytes file        
      ------- ------- --------- ------------
          772    1544     12164 main.hds
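
  The two rules amount to finding the connected components of a graph
  whose nodes are the special words, with an edge between two words
  whenever they start the same line. The sketch below does this with a
  simple union-find structure. It is an illustration only, not the
  actual collect-headings script: the function name and the output
  format (WORD followed by a heading number) are assumptions, and the
  input is assumed to be "LOCATOR WORD" pairs like those produced by
  the gawk step above.

```shell
# Hypothetical sketch: group words into headings with union-find.
# Input: one "LOCATOR WORD" pair per line.
# Output: "WORD HEADING-NUMBER", in order of first appearance.
collect_headings_sketch() {
  awk '
    # Follow parent links up to the representative word of a set.
    function find(x) {
      while (parent[x] != x) x = parent[x]
      return x
    }
    {
      loc = $1; w = $2
      if (!(w in parent)) { parent[w] = w; order[++nw] = w }
      if (loc in rep) {
        a = find(rep[loc]); b = find(w)
        if (a != b) parent[b] = a          # merge the two sets
      } else {
        rep[loc] = w                       # first word seen at this line start
      }
    }
    END {
      for (i = 1; i <= nw; i++) {
        r = find(order[i])
        if (!(r in id)) id[r] = ++nh       # number headings as they appear
        print order[i], id[r]
      }
    }
  '
}
```

  The sets are as small as possible because two words are merged only
  when some line start forces them together.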

BUILDING THE REPORT

  Now read the token list again, selecting all occurrences, strong and
  weak, of every special word `w'.  (A strong occurrence is an
  occurrence of `w' itself; a weak occurrence is an occurrence of a
  word `w1' obtained from `w' by deleting an initial gallows.)  Output
  each occurrence with two extra fields: HEAD ($12), which is the
  heading of `w', and TAG ($13), which is 2 for a strong occurrence in
  column 1, 1 for a strong occurrence elsewhere, and 0 for a weak
  occurrence.
    
    cat main.lts \
      | assign-headings -v table=main.hds \
      | sort -b -k12,12 -k13,13r -k1,1 -k2,2n -k5,5g -k6,6 \
      | collapse-versions-in-soc \
      > main.soc
    dicio-wc main.soc
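
  The tagging step can be sketched as below. This is not the actual
  assign-headings script: the function name is made up, the table is
  assumed to hold "WORD HEAD" pairs, "column 1" is taken to mean
  FPOS = 1, and the gallows are taken to be the EVA letters t, k, p,
  f. A token that matches more than one special word is written once
  per match.

```shell
# Hypothetical sketch of the heading and tag assignment.
# Usage: assign_headings_sketch TABLE-FILE < token-records
# Appends HEAD ($12) and TAG ($13) to each matching 11-field record.
assign_headings_sketch() {
  awk -v table="$1" '
    BEGIN {
      # Assumption: each table line is "WORD HEAD".
      while ((getline line < table) > 0) {
        split(line, f, " ")
        head[f[1]] = f[2]
      }
      split("t k p f", gal, " ")
    }
    {
      w = $11
      # Strong occurrence: the special word itself.
      if (w in head) print $0, head[w], ($7 == 1 ? 2 : 1)
      # Weak occurrence: a special word minus its initial gallows.
      for (i = 1; i <= 4; i++) {
        x = gal[i] w
        if (x in head) print $0, head[x], 0
      }
    }
  '
}
```

  Tokens that match no special word, strongly or weakly, are dropped.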
  
  Format result:
  
    cat main.soc \
      | sort -b -k12,12 -k13,13r -k1,1 -k2,2n -k5,5g -k6,6 \
      | format-soc \
          -v title='Occurrences of special words' \
          -v showWords=1 \
          -v showWeak=1 \
      > main.html