Hacking at the Voynich manuscript - Side notes
064 Looking for word quadruples with skew frequencies

Last edited on 2012-05-03 20:41:08 by stolfilocal

INTRODUCTION

  Schemas for generating pseudo-Voynichese, like Gordon Rugg's grille
  and my Voynichese probabilistic word grammar, generally produce word
  frequency distributions that are rather regular, because each choice
  made by the model generally depends on only a small subset of the
  preceding choices.
  
  In particular, in my Voynichese grammar and in low-order Markov
  chains, the choice of suffix is generally independent of the choice
  of prefix. Thus, if A,B are two possible prefixes that can be
  attached to a midfix M, and X,Y are two possible suffixes, then the
  four words AMX, AMY, BMX, BMY will be generated with probabilities
  of the form 
  
      p(AMX) = p(A)·p(M)·p(X)          |
      p(AMY) = p(A)·p(M)·p(Y)          |
      p(BMX) = p(B)·p(M)·p(X)          | (1)
      p(BMY) = p(B)·p(M)·p(Y)          |
      
  Here p(M) is the total probability of the four words, and p(A),p(B)
  are the relative probabilities of A and B, normalized so that
  p(A)+p(B) = 1; ditto for p(X) and p(Y), with p(X)+p(Y) = 1. Note that
  the four word probabilities are not independent, because they satisfy
  the equation
  
    p(AMX)/p(AMY) = p(BMX)/p(BMY)         (2)  
  
  Thus, one way of discrediting the mechanical gibberish theory is 
  to find quadruples of words with long M and very short A,B,X,Y,
  for which equation (2) is grossly violated.  
  
  If we let z(E) = lg p(E) (lg = logarithm to base 2), then the
  equations become z(AMX) = z(A) + z(M) + z(X), etc., and
      
    z(AMX) - z(AMY) = z(BMX) - z(BMY)     (3)
    
  This is a linear equation, so it is natural to measure the 
  "skewness" of the four probabilities by the square of the 
  mismatch, i.e.
  
    S(M,A,B,X,Y) = (z(AMX) - z(AMY) - z(BMX) + z(BMY))^2    (4)
    
  Note that p(M) cancels out in formulas (2)--(4).  I.e. the 
  skewness S does not depend on the frequency of the four words
  as a group, but only on their relative frequencies within the 
  group.
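
  As a quick sanity check of formulas (1)--(4), here is a minimal Python
  sketch. The function names and the numeric probabilities are
  illustrative assumptions, not part of the original analysis. It builds
  the four probabilities of (1) from independent prefix and suffix
  choices, and evaluates the skewness (4) for them and for a rescaled
  and a deliberately skewed quadruple.

```python
from math import log2

def skewness(p_amx, p_amy, p_bmx, p_bmy):
    """Squared mismatch of equation (3), i.e. formula (4)."""
    d = log2(p_amx) - log2(p_amy) - log2(p_bmx) + log2(p_bmy)
    return d * d

def independent_quad(p_m, p_a, p_x):
    """Probabilities (1) when prefix and suffix are chosen independently.

    p_a and p_x are the relative probabilities of A and X;
    p(B) = 1 - p(A) and p(Y) = 1 - p(X).
    """
    p_b, p_y = 1.0 - p_a, 1.0 - p_x
    return (p_m * p_a * p_x, p_m * p_a * p_y,
            p_m * p_b * p_x, p_m * p_b * p_y)

# Illustrative numbers only.
quad = independent_quad(p_m=0.01, p_a=0.7, p_x=0.2)
s_indep = skewness(*quad)                        # zero up to rounding error
s_scaled = skewness(*(5 * p for p in quad))      # unchanged: p(M) cancels
s_skewed = skewness(0.004, 0.001, 0.001, 0.004)  # (2 * lg 4)^2 = 16

print(s_indep, s_scaled, s_skewed)
```

  The first two values should be zero up to floating-point rounding,
  while the third quadruple, where AMX and BMY dominate, violates
  equation (2) by a factor of 16 in the ratio and so has S = 16.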
  
  Some things that we must take care of: 
  
    * Some letter pairs or groups may be meaningless calligraphic 
      variants, e.g. {k,t}, {ch,ee}.  
      
    * There may be many transcription errors between pairs of similar
      letters, e.g. {r,s}, {o,a}.
      
    * If the count n(AMX) of AMX is too small, the logarithm z(AMX) is tricky
      to estimate. In particular, if the count is zero, we cannot let
      p(AMX) = 0 because then z(AMX) = -oo.
      
  To alleviate the last problem, we will use the estimate
  p(word) = (n(word)+b)/(N+b), where N is the total number of tokens
  and b is a positive parameter, say b = 1.
  I.e. z(word) = lg(n(word)+b) - lg(N+b).  The lg(N+b) terms cancel
  in (4), so

    S(M,A,B,X,Y) = ( k(AMX) - k(AMY) - k(BMX) + k(BMY) )^2  (5)

  where k(word) = lg(n(word)+b). For b = 1 we have
  
    n(W)  0   1   2   3   4   5   7
    k(W)  0.0 1.0 1.6 2.0 2.3 2.6 3.0  
  
  The following table shows the value of S(M,A,B,X,Y) as a function of
  the counts (n(AMX), n(AMY), n(BMX), n(BMY)) of the four words.  The
  four columns correspond to bias b = 1, 2, 4, 8; the rows are in
  decreasing order of the skewness for b = 8.

    test-skewness < test-skewness-in.txt

    counts               b=1     b=2     b=4     b=8
    ( 7, 0, 0, 7) =    36.00   18.83    8.52    3.29
    ( 7, 0, 0, 0) =     9.00    4.71    2.13    0.82
    ( 8, 4, 2, 5) =     3.42    2.38    1.37    0.63
    ( 2, 0, 0, 2) =    10.05    4.00    1.37    0.41
    ( 2, 0, 0, 1) =     6.68    2.51    0.82    0.24
    ( 8, 4, 8, 1) =     1.75    1.00    0.46    0.17
    ( 1, 0, 0, 1) =     4.00    1.37    0.41    0.12
    ( 2, 0, 0, 0) =     2.51    1.00    0.34    0.10
    ( 2, 1, 0, 1) =     2.51    1.00    0.34    0.10
    ( 8, 4, 8, 2) =     0.54    0.34    0.17    0.07
    ( 1, 0, 0, 0) =     1.00    0.34    0.10    0.03
    ( 1, 1, 1, 0) =     1.00    0.34    0.10    0.03
    ( 2, 1, 0, 0) =     0.34    0.17    0.07    0.02
    ( 2, 1, 1, 1) =     0.34    0.17    0.07    0.02
    ( 3, 3, 3, 4) =     0.10    0.07    0.04    0.02
    ( 2, 1, 1, 0) =     0.17    0.03    0.00    0.00
    ( 8, 4, 5, 2) =     0.02    0.00    0.00    0.00
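
  The entries of this table can be recomputed directly from formula
  (5). The following Python sketch is not the test-skewness program
  itself (which is not shown here); the function names are assumptions
  chosen for illustration. It evaluates k(word) = lg(n(word)+b) and the
  smoothed skewness for a few of the count quadruples above.

```python
from math import log2

def k(n, b):
    """Smoothed log-count k(word) = lg(n(word) + b)."""
    return log2(n + b)

def smoothed_skewness(counts, b):
    """Formula (5) for counts = (n(AMX), n(AMY), n(BMX), n(BMY))."""
    n_amx, n_amy, n_bmx, n_bmy = counts
    d = k(n_amx, b) - k(n_amy, b) - k(n_bmx, b) + k(n_bmy, b)
    return d * d

# A few rows of the table, for bias b = 1, 2, 4, 8.
for counts in [(7, 0, 0, 7), (8, 4, 2, 5), (2, 1, 1, 0)]:
    row = " ".join(f"{smoothed_skewness(counts, b):7.2f}" for b in (1, 2, 4, 8))
    print(counts, "=", row)
```

  Note how a larger bias b damps the skewness of quadruples with small
  counts much more strongly than that of quadruples with large counts.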

LINK SETUP

  We need the good word frequencies per section:
  
    ln -s ../tr-stats/dat
    ln -s ../.. work
    
    ln -s work/factor-field-general

    ln -s work/factor-text-trivial.gawk
    ln -s work/factor-text-viqr-to-phon.gawk
    ln -s work/factor-text-eva-to-basic.gawk
    
COMPUTING SKEWNESS OF QUADRUPLES

  Define the sample texts to analyze, the orthographic elements
  (glyphs, letters, etc.), and the limits on the number of elements in
  the prefix, midfix, and suffix (maximum prefix length, minimum
  midfix length, maximum suffix length):

  For tests: 
  
    set sampelems = ( \
      voyn/maj/hea.1/bgly/2.2.4 \
    )
    
  For real: 
  
    set sampelems = ( \
      voyn/maj/{hea.1,heb.1,bio.1}/bgly/2.2.4 \
      voyp/{grs,grm}/tot.1/bgly/2.2.4 \
      engl/wow/tot.1/lets/1.5.3 \
      viet/ptt/tot.1/phon/1.1.1 \
    )

  Create the output directories, and check for necessary files:

    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      mkdir -p res/${sample}
      echo " "
      ls -lL dat/${opus}/factor-text-to-${elem}.gawk
      ls -lL dat/${sample}/gud.wfr
    end
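
  The loop above takes each entry of sampelems apart with the csh
  modifiers :h (head) and :t (tail). The following Python sketch
  mirrors that decomposition, to make explicit which part of the path
  plays which role. The function name split_spec and the dictionary
  keys are illustrative assumptions; only the path layout and the two
  file names come from the commands above.

```python
import posixpath

def split_spec(spec):
    """Split e.g. 'voyn/maj/hea.1/bgly/2.2.4' the way the csh loop does."""
    su, limits = posixpath.split(spec)      # like ${sun:h} and ${sun:t}
    sample, elem = posixpath.split(su)      # like ${su:h} and ${su:t}
    opus = posixpath.dirname(sample)        # like ${sample:h}
    maxpre, minmid, maxsuf = (int(x) for x in limits.split("."))
    return {
        "sample": sample, "elem": elem, "opus": opus,
        "maxpre": maxpre, "minmid": minmid, "maxsuf": maxsuf,
        # Files checked by the two `ls -lL` commands:
        "factor_script": f"dat/{opus}/factor-text-to-{elem}.gawk",
        "word_freqs": f"dat/{sample}/gud.wfr",
    }

info = split_spec("voyn/maj/hea.1/bgly/2.2.4")
for key, value in info.items():
    print(f"{key:14s} {value}")
```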
      
  Generate all prefix/midfix/suffix combinations from the good words:

    set bias = 8
    
    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      set nums = ( `echo ${sun:t} | tr '.' ' '` )
      cat dat/${sample}/gud.wfr \
        | factor-field-general \
            -f dat/${opus}/factor-text-to-${elem}.gawk \
            -v inField=3 -v outField=4 \
        | gawk '//{ print $1, $4; }' \
        | generate-all-mid-pre-suf-combs \
            -v bias=${bias} \
            -v maxpre=${nums[1]} -v minmid=${nums[2]} -v maxsuf=${nums[3]} \
        | sort -b -k2,2 -k1,1nr \
        > res/${sample}/gud.mps
    end
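
  The script generate-all-mid-pre-suf-combs is not reproduced in this
  note. As a rough sketch of what such a step might do, the following
  Python function lists every way of splitting one word, already
  factored into elements, into prefix, midfix, and suffix under the
  limits maxpre, minmid, maxsuf. The function name, the assumption that
  empty prefixes and suffixes are allowed, and the example word split
  into single letters are all hypothetical; the real script may differ.

```python
def mid_pre_suf_splits(elems, maxpre, minmid, maxsuf):
    """Return all (midfix, prefix, suffix) splits of a list of elements.

    Assumes the prefix has at most maxpre elements, the suffix at most
    maxsuf elements, the midfix at least minmid elements, and that
    empty prefixes and suffixes are allowed.
    """
    n = len(elems)
    splits = []
    for npre in range(0, min(maxpre, n) + 1):
        for nsuf in range(0, min(maxsuf, n - npre) + 1):
            if n - npre - nsuf >= minmid:
                pre = tuple(elems[:npre])
                mid = tuple(elems[npre:n - nsuf])
                suf = tuple(elems[n - nsuf:])
                splits.append((mid, pre, suf))
    return splits

# Hypothetical example: one word split into single letters.
word = ["q", "o", "k", "e", "e", "d", "y"]
for mid, pre, suf in mid_pre_suf_splits(word, maxpre=2, minmid=2, maxsuf=4):
    print("".join(pre) or "-", "".join(mid), "".join(suf) or "-")
```

  In the pipeline above, each split would also carry the count of the
  word it came from, so that splits with the same midfix can later be
  grouped into quadruples.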

  Enumerate all quadruples:
    
    foreach sun ( ${sampelems} )
      set su = "${sun:h}"
      set sample = "${su:h}"
      set elem = "${su:t}"
      set opus = "${sample:h}"
      cat res/${sample}/gud.mps \
        | evaluate-all-word-quads \
            -v bias=${bias} \
            -v minskew=0.10 -v minhits=7 \
        | sort -b -k1,1gr \
        > res/${sample}/gud.skq
    end
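
  The script evaluate-all-word-quads is likewise not shown here. The
  Python sketch below illustrates one plausible reading of this step,
  under loudly stated assumptions: input records are (count, midfix,
  prefix, suffix) tuples, a missing word has count zero, minhits is a
  lower bound on the total count of the four words, and minskew is a
  lower bound on the skewness (5). The function name, the record
  layout, and the sample counts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def word_quads(records, bias, minskew, minhits):
    """Enumerate quadruples AMX, AMY, BMX, BMY and their skewness (5).

    records: iterable of (count, mid, pre, suf) tuples (assumed layout).
    Returns (S, mid, A, B, X, Y, counts) tuples, most skewed first.
    """
    by_mid = defaultdict(dict)
    for count, mid, pre, suf in records:
        by_mid[mid][(pre, suf)] = count
    quads = []
    for mid, table in by_mid.items():
        pres = sorted({p for p, _ in table})
        sufs = sorted({s for _, s in table})
        for a, b in combinations(pres, 2):
            for x, y in combinations(sufs, 2):
                # Counts of AMX, AMY, BMX, BMY; absent words count as 0.
                n = tuple(table.get(ps, 0)
                          for ps in ((a, x), (a, y), (b, x), (b, y)))
                if sum(n) < minhits:      # assumed meaning of minhits
                    continue
                d = (log2(n[0] + bias) - log2(n[1] + bias)
                     - log2(n[2] + bias) + log2(n[3] + bias))
                if d * d >= minskew:
                    quads.append((d * d, mid, a, b, x, y, n))
    return sorted(quads, key=lambda q: q[0], reverse=True)

# Made-up counts, for illustration only.
records = [(7, "kee", "o", "dy"), (7, "kee", "qo", "y")]
for quad in word_quads(records, bias=8, minskew=0.10, minhits=7):
    print(quad)
```

  With these made-up records the single quadruple has counts
  (7, 0, 0, 7), which corresponds to the first row of the skewness
  table in the introduction.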