Hacking at the Voynich manuscript - Side notes
055 Occurrences of the OL/OR words

Last edited on 1999-11-25 17:11:21 by stolfi

INTRODUCTION

  This note studies the occurrences and contexts of the short
  words "ol", "or", and their look-alikes ("al", "ar", "as", "os").

DATA COLLECTION  
  
  The starting data was the machine-readable
  concordance (see Notes/037), which has one entry for each word
  occurrence (and many short phrases), in the format
  
    LOC TRANS START LENGTH LCTX STRING RCTX PATT STAG PNUM HNUM
    1   2     3     4      5    6      7    8    9    10   11

  We extracted all occurrences of "al",
  "ol", "ar", "or", "as", "os" in the majority version ("A"):
  
    ln -s ../../Notes/037/vms-17-ok.hoc.gz
    
    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[ao][lrs]$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-sep.hoc
  
    dicio-wc orol-sep.hoc
    
      lines   words     bytes file        
    ------- ------- --------- ------------
       1589   17479    122346 orol-sep.hoc
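
  The "+pos -pos" sort keys used above are an obsolete syntax that
  many current "sort" implementations reject.  As a hedged sketch,
  the same ordering (STRING, RCTX, LCTX, numeric PNUM, LOC, numeric
  START) can be written with POSIX "-k" keys.  The function name
  "sort_hoc" and the use of LC_ALL=C are assumptions made here for
  illustration; they are not part of the original scripts.

```shell
# Sort concordance records read from stdin by STRING ($6), RCTX ($7),
# LCTX ($5), PNUM ($10, numeric), LOC ($1), START ($3, numeric).
# "+5 -6" in the obsolete syntax corresponds to "-k6,6", and so on.
# ASSUMPTION: LC_ALL=C is set only to make the ordering reproducible.
sort_hoc() {
  LC_ALL=C sort -b -k6,6 -k7,7 -k5,5 -k10,10n -k1,1 -k3,3n
}
```

  It would replace the "sort" stage of the pipeline, e.g.
  "... | sort_hoc > orol-sep.hoc".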
    
      
  We also extracted all words that begin or end with those target strings:
  
    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[ao][lrs][*a-z]+$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-wds-r.hoc

    zcat vms-17-ok.hoc.gz \
      | gawk '((substr($2,1,1)=="A")&&($6 ~ /^[*a-z]+[ao][lrs]$/)){print;}' \
      | sort -b +5 -6 +6 -7 +4 -5 +9 -10n +0 -1 +2 -3n \
      > orol-wds-l.hoc
  
    dicio-wc orol-wds-{l,r}.hoc

      lines   words     bytes file        
    ------- ------- --------- ------------
       9234  101574    746851 orol-wds-l.hoc
       1684   18524    137013 orol-wds-r.hoc
          
  We then extracted the preceding and following one-word contexts from
  the separated-words file.  To allow a fair comparison with the
  joined-words set, we deleted any leading and trailing runs of the
  letters [qaoy] from each context.
  
    cat orol-sep.hoc \
      | gawk \
          ' /./{ \
              str = $6; \
              lnw = split($5, lw, /[-=/.,]/); \
              lc = lw[lnw-1]; \
              gsub(/^[qaoy]*/, "", lc); gsub(/[qaoy]*$/, "", lc); \
              if (lc == "") { lc = "_"; } \
              printf "%s %s\n", str, lc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-sep.lfr
          
    cat orol-sep.hoc \
      | gawk \
          ' /./{ \
              str = $6; \
              rnw = split($7, rw, /[-=/.,]/); \
              rc = rw[2]; \
              gsub(/^[qaoy]*/, "", rc); gsub(/[qaoy]*$/, "", rc); \
              if (rc == "") { rc = "_"; } \
              printf "%s %s\n", str, rc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-sep.rfr

    cat orol-wds-l.hoc \
      | gawk \
          ' /./{ \
              str = substr($6,length($6)-1,2); \
              lc = substr($6,1,length($6)-2); \
              gsub(/^[qaoy]*/, "", lc); gsub(/[qaoy]*$/, "", lc); \
              if (lc == "") { lc = "_"; } \
              printf "%s %s\n", str, lc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-wds.lfr
          
    cat orol-wds-r.hoc \
      | gawk \
          ' /./{ \
              str = substr($6,1,2); \
              rc = substr($6,3,length($6)-2); \
              gsub(/^[qaoy]*/, "", rc); gsub(/[qaoy]*$/, "", rc); \
              if (rc == "") { rc = "_"; } \
              printf "%s %s\n", str, rc; \
            } \
          ' \
      | sort | uniq -c | expand \
      | sort -b +1 -2 +0 -1nr +2 -3 \
      > orol-wds.rfr
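
  The affix stripping applied in the four scripts above can be
  checked in isolation.  The sketch below applies the same two
  substitutions to a single word; the function name "strip_affixes"
  is hypothetical and introduced only for this illustration.

```shell
# Print WORD with any leading and trailing run of the letters q, a, o, y
# removed, or "_" if nothing is left (same rule as the gawk scripts above).
strip_affixes() {
  printf '%s\n' "$1" | awk '
    {
      w = $0
      sub(/^[qaoy]*/, "", w)   # drop leading run of q/a/o/y
      sub(/[qaoy]*$/, "", w)   # drop trailing run of q/a/o/y
      if (w == "") w = "_"     # empty context is shown as "_"
      print w
    }'
}
```

  For example, "strip_affixes qokedy" prints "ked", "strip_affixes
  chedy" prints "ched", and "strip_affixes y" prints "_".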

STATISTICS

  Next we tabulated the counts, frequencies, and anomalies. 
  
  The anomaly for a string S and a context C is the ratio of the
  observed frequency of C next to S to the expected value of that
  frequency (the product of the frequencies of C and of S). The
  frequencies are computed by the "stabilized" estimator
  freq[case] = (count[case] + 1)/(totalCount + nCases).
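
  As a minimal sketch of this definition, the function below
  computes anomalies from "COUNT STRING CONTEXT" lines, the layout
  of the .lfr and .rfr files.  Several details are assumptions made
  here and may differ from the actual format-context-anomalies
  script: the stabilized estimator is applied to the pair counts
  (with nCases = number of strings times number of contexts) as well
  as to both marginals, and the ratio is scaled so that 10 means
  "observed equals expected".  The name "context_anomalies" is
  hypothetical.

```shell
# Read "COUNT STRING CONTEXT" lines from stdin; print "STRING CONTEXT ANOMALY".
# ASSUMPTION: all three frequencies use (count + 1)/(total + nCases),
# and the anomaly is reported on a scale where 10 = normal.
context_anomalies() {
  awk '
    {
      n = $1; s = $2; c = $3
      if (!(s in ns)) { sl[++nS] = s }   # remember strings in input order
      if (!(c in nc)) { cl[++nC] = c }   # remember contexts in input order
      ns[s] += n; nc[c] += n; np[s, c] += n; tot += n
    }
    END {
      for (i = 1; i <= nS; i++) {
        for (j = 1; j <= nC; j++) {
          s = sl[i]; c = cl[j]
          fs = (ns[s] + 1) / (tot + nS)           # frequency of string S
          fc = (nc[c] + 1) / (tot + nC)           # frequency of context C
          fp = (np[s, c] + 1) / (tot + nS * nC)   # frequency of the pair
          printf "%s %s %.0f\n", s, c, 10 * fp / (fs * fc)
        }
      }
    }'
}
```

  With the toy input "3 al x", "1 al y", "1 ol x", "3 ol y" (one
  per line), the pair (al, x) has frequency 4/12 against an expected
  0.5 * 0.5, giving an anomaly of 13, while (al, y) gives 7.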
  
  We omitted the strings "as" and "os" because they are too rare
  and generated mostly noise.
  
    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        foreach sy ( counts.c freqs.f types.t anomalies.a )
          set s = "${sy:r}"
          set y = "${sy:e}"
          echo "${s} of ${t} contexts - ${r}"
          cat orol-${w}.${x}fr \
            | gawk '($2 ~ /s$/){next;} /./{print;}' \
            | format-context-${s} -v ncmin=12 \
            > orol-${w}.${x}t${y}
        end
      end
    end
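
  The loops above rely on the csh modifiers ":r" (root) and ":e"
  (extension) to split each "label.suffix" item.  The following
  sketch, in POSIX sh, only lists which formatter reads which file
  and which table it writes; it runs nothing.  The function name
  "list_table_files" is hypothetical.

```shell
# Enumerate the input and output file names generated by the nested loops.
# ${v%.*} plays the role of csh ":r", and ${v##*.} the role of ":e".
list_table_files() {
  for rw in freestanding.sep attached.wds; do
    w=${rw##*.}
    for tx in left.l right.r; do
      x=${tx##*.}
      for sy in counts.c freqs.f types.t anomalies.a; do
        s=${sy%.*}; y=${sy##*.}
        echo "format-context-${s}: orol-${w}.${x}fr -> orol-${w}.${x}t${y}"
      done
    done
  done
}
```

  The listing has 16 lines, one per table; for instance the
  frequencies of left contexts of freestanding words go to
  orol-sep.ltf.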

  Computing the same tables with equivalence mappings:
  
    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        foreach sy ( counts.c freqs.f types.t anomalies.a )
          set s = "${sy:r}"
          set y = "${sy:e}"
          echo "${s} of ${t} contexts - ${r}"
          /bin/rm -f orol-${w}-eqv.${x}t${y}
          cat orol-${w}.${x}fr \
            | map-field \
                -v forgiving=1 \
                -v inField=3 \
                -v outField=3 \
                -v table=eqv-${x}.tbl \
            | gawk '($2 ~ /s$/){next;} /./{print $1, $2, $3;}' \
            | format-context-${s} -v ncmin=8 \
            > orol-${w}-eqv.${x}t${y}
        end
      end
    end
    
  Computing the distance matrix between the four target words:
  
    foreach rw ( freestanding.sep attached.wds )
      set r = "${rw:r}"
      set w = "${rw:e}"
      foreach tx ( left.l right.r )
        set t = "${tx:r}"
        set x = "${tx:e}"
        echo "distance matrix based on ${t} contexts - ${r}"
        cat orol-${w}.${x}fr \
          | gawk '($2 ~ /s$/){next;} /./{print $1, $2, $3;}' \
          | compute-distance-matrix \
          > orol-${w}.${x}dm
      end
    end    
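
  This note does not say which metric compute-distance-matrix uses.
  Purely as an illustration of the idea, the sketch below compares
  target strings by the L1 distance between their context frequency
  distributions; that choice of metric, and the function name
  "context_distances", are assumptions and not the original method.

```shell
# Read "COUNT STRING CONTEXT" lines from stdin; for each pair of strings
# print "S1 S2 DIST", where DIST is the L1 distance between their
# normalized context distributions (0 = identical, 2 = disjoint).
# ASSUMPTION: L1 is used here only as an example metric.
context_distances() {
  awk '
    {
      n = $1; s = $2; c = $3
      if (!(s in ns)) { sl[++nS] = s }
      if (!(c in seen)) { seen[c] = 1; cl[++nC] = c }
      ns[s] += n; np[s, c] += n
    }
    END {
      for (i = 1; i <= nS; i++)
        for (j = i + 1; j <= nS; j++) {
          d = 0
          for (k = 1; k <= nC; k++) {
            a = np[sl[i], cl[k]] / ns[sl[i]]   # P(context | string i)
            b = np[sl[j], cl[k]] / ns[sl[j]]   # P(context | string j)
            d += (a > b) ? a - b : b - a
          }
          printf "%s %s %.2f\n", sl[i], sl[j], d
        }
    }'
}
```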

  Comparing the context frequencies of joined vs. separated words:
  
    foreach tx ( left.l right.r )
      set t = "${tx:r}"
      set x = "${tx:e}"
      foreach rw ( freestanding.sep attached.wds )
        set r = "${rw:r}"
        set w = "${rw:e}"
        cat orol-${w}.${x}tf \
          | egrep '[0-9] *$' \
          | sort +0 -1 \
          > .${w}
      end
      /n/gnu/bin/join \
          -1 1 -2 1 \
          -a 1 -a 2 \
          -e '.' \
          -o'0,1.2,1.3,1.4,1.5,1.6,2.2,2.3,2.4,2.5,2.6' \
          .sep .wds \
        | gawk \
            ' /./{ \
                printf "%-20s %4s %3s %3s %3s %3s  %4s %3s %3s %3s %3s  %4d\n", \
                  $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$2+$7; \
              } \
            ' \
        | sort -b +11 -12nr \
        > orol-cmp.${x}tf
    end
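
  The "join" call above is a full outer join: "-a 1 -a 2" keeps
  contexts that appear in only one of the two tables, and "-e '.'"
  fills their missing columns with a dot (which gawk then treats as
  0 in the sum $2+$7).  A reduced sketch with one data column per
  file follows; the function name "outer_join" and the two-column
  layout are assumptions for illustration only.

```shell
# Full outer join of two files on field 1; both must be sorted on it.
# Keys present in only one file get "." in the other file's column.
# ASSUMPTION: LC_ALL=C so that join and the earlier sort agree on order.
outer_join() {
  LC_ALL=C join -1 1 -2 1 -a 1 -a 2 -e '.' -o '0,1.2,2.2' "$1" "$2"
}
```

  Given a file with lines "a 1" and "b 2" and another with "b 3"
  and "c 4", it prints "a 1 .", "b 2 3" and "c . 4".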


ANALYSIS

  The most common words before each string are:
  
    al: ar(29) aiin(9) or(9) s(8) dair(7) al(6) daiin(6) dar(6) sar(6)
    ol: qokain(11) aiin(10) or(10) qokaiin(10) ar(9) daiin(9) dar(9) ol(9)
    ar: okar(12) otar(10) al(9) ar(9) dar(9) ol(8) or(8) s(8)
    or: ar(9) daiin(9) s(7) chol(5) dal(5) or(5) aiin(4) dar(4) ol(4)
    
    as: chtar(1) otalef(1)
    os: aiin(2) cheor(2) 

  The most common words after each string are:
  
    al: ar(9) al(6) chedy(6) s(6) aiin(5) dar(4) ol(4)
    ol: chedy(23) shedy(23) daiin(17) aiin(13) cheey(11) sheey(10) 
    ar: al(29) aiin(23) ar(9) ol(9) or(9) =(7) am(7) y(6)
    or: aiin(56) ol(10) al(9) ar(9) shedy(9) air(7) cheey(7) chey(7) chol(7)
    
    as: ainam(1) kai*n(1)
    os: al(3) aiin(2) 
  
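  Anomalous frequencies (10 = normal) before the target strings.
  The first block lists the most over-represented contexts, the
  second block the most under-represented ones: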
  
    al: os~(27) dair(21) sar(21) ched(20) otaiin(20) ar(19) cheor(14) aiin(13) s(13)
    ol: shedy(22) chey(18) qokal(18) sain(16) qokedy(15) dain(14) okain~(14)      
    ar: tar(24) otain(19) al(16) okar~(16) chear(15) dain(14)
    or: dal(22) chear(21) qokal(21) saiin(20) daiin(17) ched(16) otal(16) qokedy(16)
  
    al: shedy(4) qokal(4) qokain(4) chear(4) shey(5) sain(5) okain~(5) ol(6) okar~(6)
    ol: s(2) ched(3) ar(5) dal(6) dair(6) al(6) otain(7) okar~(7)
    ar: qokal(4) daiin(5) aiin(5) sar(6) otal(6) ar(6)
    or: tar(5) otaiin(5) al(5) okar~(6)

  Anomalous frequencies after the target strings (over-represented
  contexts on the left, under-represented ones on the right):
  
    al: s(33) ar(22) shey(18) chedy(15)      shedy(5) aiin(5) cheey(7)
    ol: daiin(19) chedy(17) shedy(17) s(15)  al(2) aiin(4) or(7) ar(7)
    ar: al(25) or(20) ol(12)                 shedy(3) chedy(3) s(5) 
    or: aiin(18) chey(12)                    s(2) shey(5) daiin(5) chedy(5)


  Among the 273 occurrences of "al", only 11 (4%) are preceded 
  by a word that ends with "y":
    
    airody aldy chedacphy chey okchey oty qokedy qoteedy shefeeedy
    shkchody ykeedy
  
  These words occur once each, and do not occur before the other
  four target words.
  
  Among the 370 occurrences of "ar", only 32 (8.6%) are preceded by words 
  that end with "y":
  
    otedy(2) shey(2) *aly chckhy chedy cheeoy cheody chey ckheeody daly
    dshdy kedy o*eey ody okcheey ota*ky oteey otoy oty qokedy qokeey
    qokey shedy sheedy tchdy teodeey typchey ykeshy ytedy yteeey
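
  Counts of this kind can be recomputed from orol-sep.hoc with a
  short filter.  The sketch below is an assumption about how such a
  tally could be made, not the command actually used for the figures
  above; the function name "count_y_before" is hypothetical.

```shell
# Usage: count_y_before TARGET < orol-sep.hoc
# Prints: occurrences of TARGET, how many of them follow a word ending
# in "y", and that share as a percentage.
# ASSUMPTION: LCTX ($5) ends with a word delimiter, so the nearest
# preceding word is the next-to-last piece of the split, as in the
# context extraction script earlier in this note.
count_y_before() {
  awk -v tgt="$1" '
    $6 == tgt {
      n++
      k = split($5, lw, "[-=/.,]")
      if (k > 1 && lw[k - 1] ~ /y$/) m++
    }
    END { if (n > 0) printf "%d %d %.1f\n", n, m, 100 * m / n }'
}
```

  For reference, the percentages quoted above are 11/273 = 4.0%
  for "al" and 32/370 = 8.6% for "ar".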
  
DISCUSSION

  The most common pairs, such as "ol chedy" and variants, also occur
  as common single words (e.g. "olchedy").  Thus it is possible that
  "al", "ar", "ol" and "or" are not independent words but rather the
  initial part of the following word.