Hacking at the Voynich manuscript - Side notes
112 Analyzing occurrence and context of one-leg gallows

Last edited on 2025-05-02 17:15:08 by stolfi

INTRODUCTION

  In this note we locate all the one-legged gallows (EVA "f", "p", 
  and platformed versions) and analyze their locations in the text
  (sections, line positions in paragraph) and collect n-gram 
  statistics, aiming to decide which other glyphs or glyph combinations they 
  may stand for.
  
SETTING UP THE ENVIRONMENT

  Links:
    
    ln -s ../tr-stats/dat  # Main folder with data files.
    ln -s ../tr-stats/tex  # Main folder for exported tables etc.
    ln -s ../../work 

    # ln -s work/basify-weirdos
    # ln -s work/capitalize-ligatures
    # ln -s work/compute-cum-cum-freqs
    # ln -s work/compute-cum-freqs
    # ln -s work/compute-freqs
    # ln -s work/combine-counts
    # ln -s work/remove-freqs
    # ln -s work/totalize-fields
    # ln -s work/select-units
    # ln -s work/words-from-evt
    # ln -s work/format-counts-packed
    # ln -s work/factor-field-general
    # ln -s work/update-paper-include
    # ln -s work/factor_text_eva_to_basic.gawk
    
GLYPH AND GLYPH PAIR FREQUENCIES

  Extracting data on one-leg gallows:
  
    ./extract_one_leg_gallows.sh
 
  Extracting and analyzing one-leg gallows:
 
    ./analyze-one-leg-gallows.sh

TABULATING GLYPH COUNTS PER SECTION 

    set secs = ( `cat dat/voyn/maj/sections.tags` )
    set secscm = `echo ${secs} | tr ' ' ','`
    echo ${secs}; echo ${secscm}

    set tfile = "voyn/maj/glyph-counts-by-section.txt"
    /bin/rm -f dat/${tfile}
    foreach sec ( ${secs} tot.1 )
      foreach book ( prs lab )
        set dir = "voyn/${book}/${sec}"
        set ifile = "${dir}/raw.wfr"
        set ofile = ".glyphs-${book}"
        echo "dat/${ifile} -> ${ofile}"
        cat dat/${ifile} \
          | capitalize-ligatures -v field=3 \
          | factor-field-general \
              -f factor_text_eva_to_basic.gawk \
              -v inField=3 -v outField=4 \
          | gawk \
              ' BEGIN{ s = 0; } \
                //{ ct = $1; w = $4; \
                    gsub(/}{/, "} {", w); \
                    m = split(w, els); \
                    s += ct * m; \
                  } \
                END{ print s; } \
              ' \
          > ${ofile}
      end
      set nprs = "`cat .glyphs-prs`"
      set nlab = "`cat .glyphs-lab`"
      @ ntot = $nprs + $nlab
      printf "%s %7d %7d %7d\n" "${sec}" "$nprs" "$nlab" "$ntot" >> dat/${tfile}
    end
    cat dat/${tfile}
  
       sec      prs     lab     tot
      -----  ------  ------  ------
      hea.1   27925       6   27931
      hea.2    3783       0    3783
      heb.1   12755       0   12755
      heb.2    2471       0    2471
      cos.1     385      69     454
      cos.2    6714    1252    7966
      cos.3    3969     628    4597
      bio.1   30694     721   31415
      zod.1    4669    1893    6562
      pha.1    4044     537    4581
      pha.2    6354     835    7189
      str.1    3438       0    3438
      str.2   52179       0   52179
      unk.1     833       0     833
      unk.2     623       0     623
      unk.3     195       0     195
      unk.4    1404      67    1471
      unk.5    1621       0    1621
      unk.6    2261       0    2261
      unk.7    1707       0    1707
      unk.8       0       8       8
      tot.1  168020    6016  174036
      -----  ------  ------  ------
      tot.n  168020    6016  174036

>>> STOPPED HERE <<<

SORTING THE BASIC GLYPHS BASED ON DIGRAPH PROBABILITIES

  Let's try to find an optimum sequence for the glyphs --- one that
  brings glyphs with similar context statistics close together.  
  
  Let G be the set of glyphs, and d(u,v) be some penalty for placing
  glyphs u and v next to each other. We want to find a permutation
  u[0..n-1] of G that minimizes
  
    W(u) = sum{ d(u[i-1],u[i]) : i in [1..n-1] }
  
  First, let's compute the pairwise glyph distances d(u,v):
  
    set bglyphs = "q,y,l,r,s,n,m,i,a,o,d,e,Ch,Sh,k,t,f,p,CKh,CTh,CFh,CPh"
    
    foreach tw ( t w )
      set ifile = "voyn-vms-glyph-pair-${tw}.gpf"
      set ofile = "voyn-vms-glyph-distances-${tw}.dst"
      echo "${ifile} -> ${ofile}"
      cat ${ifile} \
        | gawk '/./{ ct=$1; w=$5; gsub(/[:]/, " ", w); print ct,w; }' \
        | compute_elem_distances.gawk -f parse_elem_list.gawk \
            -v elemList="${bglyphs}" \
            -v exponent=1.0 \
        > ${ofile}
    end
 
  #  d(u,v) to a fractional power, so that
  # keeping similar elements together is more important that rearranging
  # dissimilar ones.