Hacking at the Voynich manuscript - Side notes
103 Statistics of basic glyphs, strokes, and pairs thereof

Last edited on 2002-01-17 20:38:43 by stolfi

INTRODUCTION

  In this note we decompose the Voynichese words into their basic glyphs,
  and compute glyph, glyph-pair (digraph), repeated-glyph, and inter-glyph
  stroke-pair frequencies, as well as glyph counts per section.
  
SETTING UP THE ENVIRONMENT

  Links:
  
    ln -s ../../basify-weirdos
    ln -s ../../capitalize-ligatures
    ln -s ../../compute-cum-cum-freqs
    ln -s ../../compute-cum-freqs
    ln -s ../../compute-freqs
    ln -s ../../combine-counts
    ln -s ../../remove-freqs
    ln -s ../../totalize-fields
    ln -s ../../select-units
    ln -s ../../words-from-evt
    ln -s ../../format-counts-packed
    ln -s ../../factor-field-general
    ln -s ../../update-paper-include
    
    ln -s ../101/sample
    
    ln -s ../100/subsections.tags

  Paper directories:

    set tbldir = "/home/staff/stolfi/papers/voynich-stats/techrep/tables/auto"
    set figdir = "/home/staff/stolfi/papers/voynich-stats/techrep/figures/auto"

GLYPH AND GLYPH PAIR FREQUENCIES

  Tabulating "basic" and "rare" glyph frequencies in the text and lexicon:

    make -f glyph-freqs.make all

  Computing the "basic" glyph pair counts and next/prev
  frequencies in the text and lexicon:
  
    make -f glyph-pair-freqs.make all
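
  The makefile itself is not reproduced in this note. As a rough
  illustration of what a pair tabulation of this kind involves, here is
  a minimal Python sketch. It assumes each word is already split into
  glyphs and carries a count; the name pair_freqs and the sample words
  are hypothetical. The "next" frequency of a pair (u,v) is taken to be
  count(u,v) divided by the total count of pairs starting with u, and
  the "prev" frequency is count(u,v) divided by the total count of pairs
  ending with v. Whether the real scripts treat word boundaries as
  pseudo-glyphs is not stated here; the sketch ignores them.

```python
from collections import Counter

def pair_freqs(words):
    """words: iterable of (count, [glyph, ...]).
    Returns (pairs, nxt, prv): weighted pair counts, P(v | u first), P(u | v second)."""
    pairs = Counter()
    for count, glyphs in words:
        # Each adjacent pair inside a word is weighted by the word count.
        for u, v in zip(glyphs, glyphs[1:]):
            pairs[(u, v)] += count
    first = Counter()   # total weight of pairs whose first glyph is u
    second = Counter()  # total weight of pairs whose second glyph is v
    for (u, v), c in pairs.items():
        first[u] += c
        second[v] += c
    nxt = {(u, v): c / first[u] for (u, v), c in pairs.items()}
    prv = {(u, v): c / second[v] for (u, v), c in pairs.items()}
    return pairs, nxt, prv

# Hypothetical sample: 3 occurrences of "d a i n" and 1 of "d a r".
pairs, nxt, prv = pair_freqs([(3, ["d", "a", "i", "n"]), (1, ["d", "a", "r"])])
```

  Under this sketch, the "text" statistics would presumably use the
  token counts as given, and the "lexicon" statistics would use a count
  of 1 for every distinct word.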
    
  Tabulating repeated glyphs in the text and lexicon:

    make -f glyph-rep-freqs.make all
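
  What exactly the makefile counts as a "repeated glyph" is not spelled
  out in this note. One plausible reading is a run of two or more
  identical glyphs inside a word. The Python sketch below tabulates such
  runs under that assumption; run_counts and the sample words are
  hypothetical names and data.

```python
from collections import Counter
from itertools import groupby

def run_counts(words):
    """words: iterable of (count, [glyph, ...]).
    Returns a Counter mapping (glyph, run_length) to the weighted number
    of runs, for runs of length 2 or more."""
    runs = Counter()
    for count, glyphs in words:
        # groupby collapses consecutive equal glyphs into one group.
        for glyph, group in groupby(glyphs):
            length = len(list(group))
            if length >= 2:
                runs[(glyph, length)] += count
    return runs

# Hypothetical sample words with their counts.
runs = run_counts([
    (5, ["d", "a", "i", "i", "n"]),
    (2, ["Ch", "e", "e", "e", "y"]),
])
```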

  Tabulating inter-glyph stroke pairs (last stroke of a glyph against
  the first stroke of the next glyph):

    make -f stroke-pair-freqs.make all

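TABULATING GLYPH COUNTS PER SECTION

  Counting the basic glyphs in the text (voyp samples) and in the labels
  (voyl samples) of each section:
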
    set secs = ( `cat subsections.tags` )
    set secscm = `echo ${secs} | tr ' ' ','`
    echo ${secs}; echo ${secscm}

    set tfile = ".glyphs-by-section"
    /bin/rm -f ${tfile}
    foreach sec ( ${secs} tot.t )
      foreach lang ( voyp voyl )
        set ifile = "sample/${lang}/vms/${sec}/raw.wfr"
        set ofile = ".glyphs-${lang}"
        echo "${ifile} -> ${ofile}"
        cat ${ifile} \
          | capitalize-ligatures -v field=3 \
          | factor-field-general \
              -f factor-text-basic.gawk \
              -v inField=3 -v outField=4 \
          | gawk \
              ' BEGIN{ s = 0; } \
                //{ ct = $1; w = $4; \
                    gsub(/[}][{]/, "} {", w); \
                    m = split(w, els); \
                    s += ct * m; \
                  } \
                END{ print s; } \
              ' \
          > ${ofile}
      end
      set ntext = "`cat .glyphs-voyp`"
      set nlabs = "`cat .glyphs-voyl`"
      @ ntots = $ntext + $nlabs
      printf "%s %7d %7d %7d\n" "${sec}" "$ntext" "$nlabs" "$ntots" >> ${tfile}
    end
    cat ${tfile}
    
  
       sec     text    labs    both
      -----  ------  ------  ------
      hea.1   27925       6   27931
      hea.2    3783       0    3783
      heb.1   12755       0   12755
      heb.2    2471       0    2471
      cos.1     385      69     454
      cos.2    6714    1252    7966
      cos.3    3969     628    4597
      bio.1   30694     721   31415
      zod.1    4669    1893    6562
      pha.1    4044     537    4581
      pha.2    6354     835    7189
      str.1    3438       0    3438
      str.2   52179       0   52179
      unk.1     833       0     833
      unk.2     623       0     623
      unk.3     195       0     195
      unk.4    1404      67    1471
      unk.5    1621       0    1621
      unk.6    2261       0    2261
      unk.7    1707       0    1707
      unk.8       0       8       8
      -----  ------  ------  ------
      tot.n  168020    6016  174036
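
  For readers who do not have the helper scripts at hand, the counting
  step of the gawk filter above can be paraphrased in Python as follows.
  This is only a sketch: it assumes that field 4 holds the factored word
  in the form {x}{y}{z} (as the gsub in the filter suggests), and the
  sample lines, including the meaning of fields 2 and 3, are
  hypothetical.

```python
import re

def count_glyphs(lines):
    """Sum count * (number of factored elements) over all input lines.
    Field 1 is the word count, field 4 the factored word."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # such a line contributes 0 in the gawk version too
        count = int(fields[0])
        # Separate adjacent elements with a blank, then split on blanks.
        elems = re.sub(r"\}\{", "} {", fields[3]).split()
        total += count * len(elems)
    return total

# Hypothetical sample lines: count, frequency, word, factored word.
sample = [
    "12 0.010 daiin {d}{a}{i}{i}{n}",
    "3 0.002 Chey {Ch}{e}{y}",
]
total = count_glyphs(sample)
```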

>>> STOPPED HERE <<<

SORTING THE BASIC GLYPHS BASED ON DIGRAPH PROBABILITIES

  Let's try to find an optimum sequence for the glyphs --- one that
  brings glyphs with similar context statistics close together.  
  
  Let G be the set of glyphs, and d(u,v) be some penalty for placing
  glyphs u and v next to each other. We want to find a permutation
  u[0..n-1] of G that minimizes
  
    W(u) = sum{ d(u[i-1],u[i]) : i in [1..n-1] }
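
  The objective is that of a minimum-cost open path through all glyphs
  (a travelling-salesman path without the return edge). A small Python
  sketch, with hypothetical penalties and function names, shows how W(u)
  is evaluated and minimized by exhaustive search:

```python
from itertools import permutations

def path_cost(order, d):
    """W(u) = sum of d(u[i-1], u[i]) for i in 1..n-1."""
    return sum(d[order[i - 1]][order[i]] for i in range(1, len(order)))

def best_order(glyphs, d):
    """Exhaustive search over all permutations; feasible only for a
    handful of glyphs, since the number of permutations is n!."""
    return min(permutations(glyphs), key=lambda p: path_cost(p, d))

# Hypothetical symmetric penalties for four glyphs.
d = {
    "a": {"a": 0, "o": 1, "k": 5, "t": 6},
    "o": {"a": 1, "o": 0, "k": 4, "t": 5},
    "k": {"a": 5, "o": 4, "k": 0, "t": 1},
    "t": {"a": 6, "o": 5, "k": 1, "t": 0},
}
order = best_order(["a", "o", "k", "t"], d)
cost = path_cost(order, d)
```

  Exhaustive search is out of the question for the full set of basic
  glyphs, so in practice some heuristic (for example nearest-neighbour
  construction followed by local improvement) would be needed; this
  note does not say which method was intended.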
  
  First, let's compute the pairwise glyph distances d(u,v):
  
    set bglyphs = "q,y,l,r,s,n,m,i,a,o,d,e,Ch,Sh,k,t,f,p,CKh,CTh,CFh,CPh"
    
    foreach tw ( t w )
      set ifile = "voyn-vms-glyph-pair-${tw}.gpf"
      set ofile = "voyn-vms-glyph-distances-${tw}.dst"
      echo "${ifile} -> ${ofile}"
      cat ${ifile} \
        | gawk '/./{ ct=$1; w=$5; gsub(/[:]/, " ", w); print ct,w; }' \
        | compute-elem-distances -f parse-elem-list.gawk \
            -v elemList="${bglyphs}" \
            -v exponent=1.0 \
        > ${ofile}
    end
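
  The formula used by compute-elem-distances is not given in this note.
  Purely as an illustration, the Python sketch below uses one plausible
  choice: each glyph gets a context profile (its distribution of next
  glyphs followed by its distribution of previous glyphs), and d(u,v) is
  the L1 distance between the two profiles, raised to an exponent. The
  function names, the choice of L1 distance, and the sample counts are
  all assumptions, not a description of the actual script.

```python
from collections import defaultdict

def context_profiles(pairs, elems):
    """pairs: iterable of (count, u, v), meaning u followed by v.
    Returns {g: [P(next=x | g) for x in elems] + [P(prev=x | g) for x in elems]}."""
    nxt = {g: defaultdict(float) for g in elems}
    prv = {g: defaultdict(float) for g in elems}
    for ct, u, v in pairs:
        if u in nxt and v in prv:   # ignore glyphs outside the element list
            nxt[u][v] += ct
            prv[v][u] += ct

    def normalize(row):
        total = sum(row.values())
        return [row[x] / total if total else 0.0 for x in elems]

    return {g: normalize(nxt[g]) + normalize(prv[g]) for g in elems}

def distances(pairs, elems, exponent=1.0):
    """Assumed distance: L1 distance between context profiles, to a power."""
    prof = context_profiles(pairs, elems)
    return {
        (u, v): sum(abs(a - b) for a, b in zip(prof[u], prof[v])) ** exponent
        for u in elems for v in elems
    }

# Hypothetical pair counts: "o" and "a" occur in similar contexts.
pairs = [(8, "o", "k"), (2, "o", "t"), (7, "a", "k"), (3, "a", "t"),
         (10, "k", "o"), (10, "k", "a")]
d = distances(pairs, ["o", "a", "k", "t"])
```

  In this toy data "o" and "a" come out much closer to each other than
  either is to "k". The exponent argument mirrors the -v exponent=1.0
  option passed above; how the real script applies it is not shown.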
 
  # Note: one could raise d(u,v) to a fractional power (exponent < 1), so
  # that keeping similar elements together is more important than
  # rearranging dissimilar ones. The run above uses exponent=1.0.



