Hacking at the Voynich manuscript - Side notes
027 Applying Jim Reeds's iterative digraph compression

Last edited on 1998-07-15 03:00:25 by stolfi

Jim Reeds suggested on 1998-07-14:

  > Here is a primitive cut at an "anti dain daiin" filter:
  > Read a text and tabulate all digraphs.  Create a new symbol
  > and replace all instances of the most frequent digraph with that
  > new symbol.  The new text will be somewhat shorter, will have
  > a character set with one extra symbol, and will have somewhat
  > higher entropy.
  > 
  > The resulting text can be run through the filter again.
  > And again.  And again...  

Let's try it. I can reuse the "extract-signif-chars" and "gather-tuples"
scripts that I wrote for the entropy-colored pages (Note 026).

  cat voyn-bio/full.txt \
    | extract-signif-chars \
        -v normal="-'" \
        -v errors='_' \
    > bio-00.sig

I wrote another simple gawk script "replace-signif-digraph" that
does the replacement (skipping over decoration and breaks),
and a shell script "reeds-compress" that does the outer loop
and manages the file names.
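
The scripts themselves are not reproduced in this note. As a rough
stand-in, here is a minimal sketch of the same loop in POSIX sh and
awk. It is an assumption-laden simplification, not the actual
"replace-signif-digraph" or "reeds-compress": it reads plain text
(blanks as word breaks, one paragraph per line) instead of ".sig"
files, and the names top_digraph, replace_digraph and compress_passes
are hypothetical.

```shell
#!/bin/sh
# Stand-in sketch, NOT the author's scripts: plain text on stdin.

# Print the most frequent digraph and its count, e.g. "dy 5".
# Digraphs containing a blank are skipped, digraphs never span lines,
# overlapping occurrences are all counted, and ties go to the digraph
# that sorts first.
top_digraph() {
  awk '
    {
      for (i = 1; i < length($0); i++) {
        d = substr($0, i, 2)
        if (index(d, " ") == 0) ct[d]++
      }
    }
    END {
      for (d in ct)
        if (ct[d] > max || (ct[d] == max && d < best)) { max = ct[d]; best = d }
      if (max > 0) print best, max
    }'
}

# Replace non-overlapping occurrences of digraph $1 (scanning left to
# right) with the single symbol $2.
replace_digraph() {
  awk -v old="$1" -v new="$2" '
    {
      out = ""; i = 1; n = length($0)
      while (i <= n) {
        if (substr($0, i, 2) == old) { out = out new; i += 2 }
        else { out = out substr($0, i, 1); i++ }
      }
      print out
    }'
}

# Outer loop: one pass per symbol given as argument. The compressed
# text goes to stdout, the dictionary lines to stderr.
compress_passes() {
  text=$(cat)
  for sym in "$@"; do
    top=$(printf '%s\n' "$text" | top_digraph)
    [ -n "$top" ] || break            # no digraphs left
    dg=${top% *}; ct=${top##* }
    printf '# %s = %s (%7d)\n' "$sym" "$dg" "$ct" >&2
    text=$(printf '%s\n' "$text" | replace_digraph "$dg" "$sym")
  done
  printf '%s\n' "$text"
}
```

For instance,

  printf 'qokedy qokeedy chedy shedy ody\n' | compress_passes A B

prints "qokB qokeB chB shB oA", and reports "# A = dy (      5)" and
"# B = eA (      4)" on stderr. The real runs follow.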

  reeds-compress bio 00 10 A B C D E F G H I J
  reeds-compress bio 10 20 K L M N O P Q R S T 

    # A = dy (   1908)
    # B = he (   1846)
    # C = qo (   1533)
    # D = ol (   1087)
    # E = Ck (   1025)
    # F = cB (    955)
    # G = ai (    860)
    # H = eA (    852)
    # I = sB (    808)
    # J = al (    524)
    # K = Gn (    429)
    # L = in (    425)
    # M = FA (    415)
    # N = ey (    394)
    # O = ar (    378)
    # P = GL (    374)
    # Q = ch (    364)
    # R = eH (    352)
    # S = IA (    334)
    # T = ok (    307)
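
Since later symbols are defined in terms of earlier ones, the table is
easier to read once each entry is expanded back into plain letters.
The sketch below does that, assuming the dictionary lines have exactly
the "# SYM = XY (count)" layout shown above and appear in the order
they were created. The name expand_dic is hypothetical.

```shell
#!/bin/sh
# Expand each dictionary entry into the original letters.
# Assumes input lines like "# F = cB (    955)", in creation order,
# so that every symbol on the right-hand side is already defined.
expand_dic() {
  awk '
    $1 == "#" && $3 == "=" {
      sym = $2; rhs = $4; full = ""
      for (i = 1; i <= length(rhs); i++) {
        c = substr(rhs, i, 1)
        # Substitute the stored expansion if c is a known symbol.
        full = full ((c in def) ? def[c] : c)
      }
      def[sym] = full
      printf "%s = %s\n", sym, full
    }'
}
```

Applied to the table above, this gives for example E = qok, F = che,
H = edy, I = she, K = ain, M = chedy, P = aiin, R = eedy, S = shedy.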

Trying again, with word breaks treated as letters (but not line breaks:
handling the decoration would be too messy).

First, we create a new version of the ".sig" file where:
  
  * word-break (class 1) entries whose external rep is " "
    are turned into significant chars (class 3) with rep "-";
    
  * word breaks with other external reps are turned into 
    significant chars with rep "-", followed by a deco
    (class 0) entry with the same rep (minus a leading blank, if any);
    
  * a few "-"s are inserted before and after each parag break.
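
On plain text, where there are no class codes or decorations, the
same idea reduces to something like the sketch below. It is only an
analogue of the ".sig" conversion: it assumes one paragraph per line
and blanks as the only word breaks, and the padding of three "-"s is
an assumption modeled on the "3-3-3-" in the sed command. The name
breaks_as_letters is hypothetical.

```shell
#!/bin/sh
# Turn word breaks into the ordinary letter "-" and pad each line
# (taken to be one paragraph) with three "-"s on either side.
breaks_as_letters() {
  sed -e 's/^ *//' \
      -e 's/ *$//' \
      -e 's/  */-/g' \
      -e 's/^/---/' \
      -e 's/$/---/'
}
```

For example, the line "daiin  dain ol" becomes "---daiin-dain-ol---",
so that digraphs such as "n-" and "-d" can be counted like any other.
The conversion actually used on the ".sig" file is: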

  cat bio-00.sig \
    | sed \
        -e 's/^1[ ]$/3-/' \
        -e 's/^1[ ]/3-0/' \
        -e 's/^1/3-0/' \
        -e '2,$s/^\(2.*\)/3-3-3-\1/' \
    | tr '\013' '\012' \
    > bio-sp-00.sig 

  cat bio-sp-00.sig \
    | sed -e 's/^.//' \
    | tr -d '\012' \
    | tr '\015' '\012' \
    > bio-sp-00.txt
    
Now let's compress again:

  reeds-compress bio-sp 00 10 A B C D E F G H I J
  reeds-compress bio-sp 10 20 K L M N O P Q R S T 

    # A = y. (   3358)
    # B = dA (   1886)
    # C = he (   1846)
    # D = qo (   1533)
    # E = l. (   1270)
    # F = Dk (   1025)
    # G = cC (    955)
    # H = ai (    860)
    # I = n. (    860)
    # J = eB (    851)
    # K = sC (    808)
    # L = r. (    661)
    # M = oE (    652)
    # N = ol (    435)
    # O = HI (    426)
    # P = iI (    423)
    # Q = aE (    416)
    # R = GB (    412)
    # S = eA (    391)
    # T = HP (    373)
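
Reeds's description says each pass should raise the entropy somewhat,
but the tables above record only counts. A minimal sketch for checking
that, assuming he means the single-character (first-order) entropy in
bits per symbol, and assuming a plain-text rendering of each stage is
available. The name char_entropy is hypothetical.

```shell
#!/bin/sh
# First-order entropy, in bits per character, of the text on stdin.
# Blanks and line ends are ignored; every other character counts,
# including "-" or "." if word breaks were turned into letters.
char_entropy() {
  awk '
    {
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c != " ") { ct[c]++; n++ }
      }
    }
    END {
      if (n == 0) exit                # empty input: print nothing
      for (c in ct) {
        p = ct[c] / n
        h -= p * log(p) / log(2)      # awk log() is the natural log
      }
      printf "%.4f\n", h
    }'
}
```

It could be run as "char_entropy < bio-sp-00.txt" and again on each
later stage, to compare the values from pass to pass.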

Let's try the same with Herbal-A and Herbal-B:

  cat ../026/voyn-hea/full.txt > hea-00.txt
  cat ../026/voyn-heb/full.txt > heb-00.txt
  
  foreach sec ( hea heb )
    cat ${sec}-00.txt \
      | extract-signif-chars \
          -v normal="-'" \
          -v errors='_' \
      > ${sec}-00.sig
    reeds-compress ${sec} 00 10 A B C D E F G H I J 
    reeds-compress ${sec} 10 20 K L M N O P Q R S T
  end
  
  cat hea-*.dic

    # A = ch (   2982)
    # B = Ao (   1433)
    # C = ai (   1265)
    # D = in (   1076)
    # E = sh (    981)
    # F = CD (    973)
    # G = ol (    886)
    # H = qo (    697)
    # I = or (    604)
    # J = Ae (    558)
    # K = dF (    544)
    # L = Ay (    519)
    # M = ct (    513)
    # N = Mh (    506)
    # O = dy (    504)
    # P = Bl (    429)
    # Q = ar (    385)
    # R = ot (    383)
    # S = ey (    376)
    # T = Br (    372)

Now with English:

  cat ../026/engl-wow/full.txt > wow-00.txt
  
  foreach sec ( wow )
    cat ${sec}-00.txt \
      | extract-signif-chars \
          -v normal="-'" \
          -v errors='_' \
      > ${sec}-00.sig
    reeds-compress ${sec} 00 10 A B C D E F G H I J 
    reeds-compress ${sec} 10 20 K L M N O P Q R S T
  end