Hacking at the Voynich manuscript - Side notes
053 Reordering the pages by word and element frequencies

Last edited on 1999-07-22 17:21:57 by stolfi

INTRODUCTION

  The VMS has apparently ben bound by someone who did not know the
  correct order of the pages. For one thing, the biological bifolio
  f78+f81 has a double-page figure that cannot be seen in its entirey
  as bound. Also the mixing of languages in the herbal section is
  highly suspicious.
  
  In this note we consider the problem of finding the correct order
  of the pages by looking at the distribution of words or other
  linguistic elements.  (Rene Zandbergen and Gabriel Landini did
  the same by looking at the distribution of digraphs; see the
  Teddington meeting report.) 

THE BASIC IDEA

  For each word w, let F[p][w] be its estimated frequency in the
  language of page p. The vector F[p] is then a representation of page
  p in a space of large dimension.
  
  Ideally the frequency of usage of a word in a text depends on the
  topic at hand.  In a single prose-like text that spans multiple pages
  (as seems to be the case with the biological section), we expect
  th same topic to extend across page boundaries; therefore,
  the frequencies of use of a given word on two adjacent pages should
  be more similar than the frequencies on two non-adjacent pages.
  
  Therefore, we should look for orderings of the VMS pages that make
  the word distributions change gradually from each page to the next.
  For a single word, we want a permutation p[0..P-1] of the pages that
  minimizes the sum
  
    D2(w,p) = Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : k = 1..P-1 } 
     
  For a single word, the solution is to sort the pages by frequency.
  In general, we want to minimize the sum
  of D2(w,p) over all words w.  That is we want to minimize
   
    Sum { Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : k = 1..P-1 } : w in W }
       
  which is the same as 
  
    Sum { Sum { (F[p[k]][w] - F[p[k-1]][w])^2 : w in W } : k = 1..P-1 }
    
  which is 
  
    Sum { dist(F[p[k]] - F[p[k-1]])^2 : k = 1..P-1 }
     
  In other words we want a path connecting the points F[p] that
  minimizes the *square* of the Euclidean edge lengths, d2(p,q) =
  dist(p,q)^2. This is a special case of the general traveling
  salesman problem. Note that d2(a,b) does not satisfy the usual
  triangle inequality since d2(a,b) can be twice d2(a,c)+d2(b,c). But
  the squared distance should be nicer to work with than the plain
  distance.
  
  Note that in order to prepare the data we must compute a matrix of
  size P^2, each entry of it being a a sum of |W| terms. Since P>200,
  it seems unfeasible at the moment to compute the whole matrix. We
  should be content with computing partial matrices for small
  sections, e.g. bio or str.  In fact we only have hope of finding
  the best solution (by exhaustive search) for sections with 5
  bifolios or less.
  
OBTAINING THE DATA 
  
  The word counts per page (majority version) were obtained in 
  Note 019 "Analyzing word frequencies per section". 
  
    mkdir -p RAW EQV
    
    ( cd RAW && ln -s ../../019/RAW/wfreqs )
    ( cd EQV && ln -s ../../019/EQV/wfreqs )
    
    ln -s ../019/fnum-to-subsec.tbl

  As a warm-up exercise, let's compute the distance matrix for the 
  *folios* of the biological section.  Let's first extract a list of the
  f-numbers of pages in the bio section:
  
  For realistic test:
  
    set subsec = "bio.1"; 
    set ddir = RAW/wfreqs
    
  For toy test:

    set subsec = "tst.1"; 
    set ddir = "TST/wfreqs";

  In any case:

    mkdir subsecs
    
    set sdir = "subsecs/${subsec}"; echo ${sdir}
    
    mkdir ${sdir}
    
    cat fnum-to-subsec.tbl \
      | gawk -v subsec="$subsec" '($2 == subsec){ print $1; }' \
      > ${sdir}/all.pages
      
    dicio-wc ${sdir}/all.pages
    
    set pages = ( `cat ${sdir}/all.pages` ); echo $pages
      
  Now let's assign letter codes A-Z to the bifolios of the 
  bio section, to simplify formatting of matrices and permutations:
  Likewise let's assign letters A-Za-z to the *folios* (not pages)
  of the bio section.  Namely, the lowest-numbered half of each bifolio
  gets the same (uppercase) letter as the bifolio, and the other half 
  gets the corresponding lowercase letter.
  
    foreach utype ( bifolio folio )  
      cat fnum-to-${utype}.tbl \
        | fgrep -f ${sdir}/all.pages \
        | assign-${utype}-letters \
        > ${sdir}/fnum-to-${utype}.tbl
    end
  
    dicio-wc ${sdir}/fnum-to-*.tbl
  
      lines   words     bytes file        
    ------- ------- --------- ------------
          8      16        64 subsecs/tst.1/fnum-to-bifolio.tbl
          8      16        64 subsecs/tst.1/fnum-to-folio.tbl

      lines   words     bytes file        
    ------- ------- --------- ------------
         20      40       140 subsecs/bio.1/fnum-to-folio.tbl
         20      40       140 subsecs/bio.1/fnum-to-bifolio.tbl

  Now let's combine the counts for folios and bifolios:      
   
    mkdir ${sdir}/{folio,bifolio}-cts
    mkdir ${tdir}/{folio,bifolio}-cts
    
  Create lists of folios and bifolios in their current order:

    foreach utype ( bifolio folio )  
      cat ${sdir}/fnum-to-${utype}.tbl \
        | gawk '(\!($2 in p)){print $2; p[$2]=1;}' \
        > ${sdir}/all.${utype}s
    end
    
    set folios = ( `cat ${sdir}/all.folios` ); echo $folios 
    set bifolios = ( `cat ${sdir}/all.bifolios` ); echo $bifolios 

KEY WORD SELECTION

  Select the subset of "key words" to consider in the distance
  matrix computation.

  Keyword set 0: take all words that occur in the section.
  
    set wset = 0
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '/./{ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  (The distances between units were too small with this keyword set.)
  
  Keyword set 1: the words that occur at least twice in the section:
  
    set wset = 1
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '($1 >= 2){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
  
  Keyword set 2: words that occur at least 4 times.
  
    set wset = 2
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '($1 >= 4){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
  
  Keyword set 3: words that occur at least 5 and at most 20 times.
  
    set wset = 3
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '(($1 >= 5)&&($1 <= 20)){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  Keyword set 4: words that occur at at least 21 times.
  
    set wset = 4
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '($1 >= 21){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  Keyword set 5: words that occur 4 to 6 times.

    set wset = 5
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '(($1 >= 4)&&($1 <= 6)){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  Keyword set 6: words that occur more than 40 times.

    set wset = 6
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '($1 >= 41){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  Keyword set 7: words that occur more than 100 times.

    set wset = 7
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '($1 >= 101){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic
      
  Keyword set 8: words that occur between 41 and 100 times.

    set wset = 8
    cat ${ddir}/subsecs/${subsec}.frq \
      | egrep -v '[?*]' \
      | gawk '(($1 >= 41)&&($1 <= 100)){ print $3; }' \
      > ${sdir}/key-words-${wset}.dic

    dicio-wc ${sdir}/key-words-*.dic
  
      lines   words     bytes file        
    ------- ------- --------- ------------
       1247    3741     26754 subsecs/bio.1/key-words-0.dic
        478     478      2827 subsecs/bio.1/key-words-1.dic
        261     261      1465 subsecs/bio.1/key-words-2.dic
        149     149       854 subsecs/bio.1/key-words-3.dic
         72      72       383 subsecs/bio.1/key-words-4.dic
         98      98       577 subsecs/bio.1/key-words-5.dic
         36      36       180 subsecs/bio.1/key-words-6.dic
         10      10        52 subsecs/bio.1/key-words-7.dic
         26      26       128 subsecs/bio.1/key-words-8.dic

DISTANCE MATRICES

  Collect the keyword frequencies per folio and bifolio:

    echo "frequencies computed for keyword set ${wset}"
    set nwords = "`cat ${sdir}/key-words-${wset}.dic | wc -l`"; echo $nwords
    foreach utype ( bifolio folio )
      echo "${wset}"  > ${sdir}/${utype}-cts/.current-wset
      set units = ( `cat ${sdir}/all.${utype}s` )
      foreach unit ( ${units} )
        echo "$unit"
        set files = ( \
          ` cat ${sdir}/fnum-to-${utype}.tbl | gawk -v unit="${unit}" '($2 == unit){ printf "%s.frq\n", $1;}' ` \
        )
        echo "${utype} ${unit} = ${files}"
        ( cd ${ddir}/pages && cat ${files} ) \
          | gawk '/./{print $1, $3}' \
          | combine-counts \
          | fgrep -w -f ${sdir}/key-words-${wset}.dic \
          | estimate-freqs -v nItems="${nwords}" \
          > ${sdir}/${utype}-cts/${unit}.frq
      end
    end

  Now we compute the distance matrices:
  
    foreach utype ( bifolio folio )
      set wset = "`cat ${sdir}/${utype}-cts/.current-wset`"
      echo "wset = ${wset}"
      set nwords = "`cat ${sdir}/key-words-${wset}.dic | wc -l`"; echo $nwords
      set units = ( `cat ${sdir}/all.${utype}s` )
      compute-dist-matrix \
          -v nItems="${nwords}" \
          -v countDir="${sdir}/${utype}-cts/" \
          -v units="${units}" \
        > ${sdir}/dist-${utype}-${wset}.matrix
      echo " "
      echo "With keyword set ${wset}:"
      echo " "
      cat ${sdir}/dist-${utype}-${wset}.matrix
    end
    

  With keyword set 0:
  
              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.001940 0.002088 0.001521 0.002563
    B  0.001940 0.000000 0.001486 0.002373 0.003180
    C  0.002088 0.001486 0.000000 0.002334 0.003441
    D  0.001521 0.002373 0.002334 0.000000 0.003062
    E  0.002563 0.003180 0.003441 0.003062 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.001841 0.002170 0.001716 0.001767 0.002128 0.001779 0.001786 0.001960 0.001816
    B  0.001841 0.000000 0.001914 0.002059 0.002257 0.002641 0.002032 0.001656 0.001552 0.001901
    C  0.002170 0.001914 0.000000 0.001969 0.002895 0.003462 0.002114 0.001707 0.001642 0.002096
    D  0.001716 0.002059 0.001969 0.000000 0.002237 0.002950 0.001542 0.002054 0.002011 0.001633
    E  0.001767 0.002257 0.002895 0.002237 0.000000 0.001574 0.001939 0.002001 0.002437 0.002309
    e  0.002128 0.002641 0.003462 0.002950 0.001574 0.000000 0.002565 0.002253 0.002682 0.002841
    d  0.001779 0.002032 0.002114 0.001542 0.001939 0.002565 0.000000 0.001947 0.002006 0.001757
    c  0.001786 0.001656 0.001707 0.002054 0.002001 0.002253 0.001947 0.000000 0.001424 0.002053
    b  0.001960 0.001552 0.001642 0.002011 0.002437 0.002682 0.002006 0.001424 0.000000 0.002009
    a  0.001816 0.001901 0.002096 0.001633 0.002309 0.002841 0.001757 0.002053 0.002009 0.000000

  With keyword set 1:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.003681 0.004026 0.002760 0.004920
    B  0.003681 0.000000 0.002776 0.004960 0.006266
    C  0.004026 0.002776 0.000000 0.004910 0.006868
    D  0.002760 0.004960 0.004910 0.000000 0.006076
    E  0.004920 0.006266 0.006868 0.006076 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.003989 0.005066 0.003678 0.003827 0.004672 0.003828 0.003828 0.004370 0.003986
    B  0.003989 0.000000 0.004341 0.004758 0.005240 0.006074 0.004681 0.003477 0.003176 0.004258
    C  0.005066 0.004341 0.000000 0.004704 0.007331 0.008557 0.005230 0.003846 0.003649 0.004982
    D  0.003678 0.004758 0.004704 0.000000 0.005312 0.007026 0.003356 0.004978 0.004888 0.003534
    E  0.003827 0.005240 0.007331 0.005312 0.000000 0.003159 0.004361 0.004538 0.005902 0.005482
    e  0.004672 0.006074 0.008557 0.007026 0.003159 0.000000 0.005827 0.004903 0.006167 0.006715
    d  0.003828 0.004681 0.005230 0.003356 0.004361 0.005827 0.000000 0.004722 0.004977 0.003903
    c  0.003828 0.003477 0.003846 0.004978 0.004538 0.004903 0.004722 0.000000 0.002962 0.004804
    b  0.004370 0.003176 0.003649 0.004888 0.005902 0.006167 0.004977 0.002962 0.000000 0.004694
    a  0.003986 0.004258 0.004982 0.003534 0.005482 0.006715 0.003903 0.004804 0.004694 0.000000

  With keyword set 2:
  
              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.004968 0.005525 0.003686 0.006683
    B  0.004968 0.000000 0.003745 0.007070 0.008611
    C  0.005525 0.003745 0.000000 0.007072 0.009575
    D  0.003686 0.007070 0.007072 0.000000 0.008451
    E  0.006683 0.008611 0.009575 0.008451 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.005774 0.007721 0.005387 0.005573 0.006838 0.005611 0.005545 0.006485 0.005889
    B  0.005774 0.000000 0.006536 0.007282 0.007884 0.009061 0.007128 0.004979 0.004509 0.006366
    C  0.007721 0.006536 0.000000 0.007349 0.011615 0.013381 0.008306 0.005798 0.005459 0.007720
    D  0.005387 0.007282 0.007349 0.000000 0.008221 0.010820 0.005019 0.007952 0.007775 0.005231
    E  0.005573 0.007884 0.011615 0.008221 0.000000 0.004400 0.006529 0.006809 0.009171 0.008419
    e  0.006838 0.009061 0.013381 0.010820 0.004400 0.000000 0.008749 0.007114 0.009276 0.010249
    d  0.005611 0.007128 0.008306 0.005019 0.006529 0.008749 0.000000 0.007503 0.007948 0.005864
    c  0.005545 0.004979 0.005798 0.007952 0.006809 0.007114 0.007503 0.000000 0.004299 0.007418
    b  0.006485 0.004509 0.005459 0.007775 0.009171 0.009276 0.007948 0.004299 0.000000 0.007223
    a  0.005889 0.006366 0.007720 0.005231 0.008419 0.010249 0.005864 0.007418 0.007223 0.000000
  
  With keyword set 3:
  
              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.008596 0.009378 0.008075 0.008009
    B  0.008596 0.000000 0.008759 0.009005 0.008243
    C  0.009378 0.008759 0.000000 0.009327 0.009156
    D  0.008075 0.009005 0.009327 0.000000 0.009229
    E  0.008009 0.008243 0.009156 0.009229 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.010003 0.011975 0.010437 0.010274 0.011059 0.010005 0.010812 0.011028 0.011051
    B  0.010003 0.000000 0.011561 0.010772 0.010249 0.011223 0.010639 0.010630 0.011077 0.011661
    C  0.011975 0.011561 0.000000 0.012168 0.012396 0.012247 0.011954 0.011850 0.011979 0.011900
    D  0.010437 0.010772 0.012168 0.000000 0.011007 0.012675 0.011028 0.011394 0.012156 0.011482
    E  0.010274 0.010249 0.012396 0.011007 0.000000 0.011205 0.010552 0.010752 0.011707 0.011044
    e  0.011059 0.011223 0.012247 0.012675 0.011205 0.000000 0.011803 0.011403 0.011294 0.011352
    d  0.010005 0.010639 0.011954 0.011028 0.010552 0.011803 0.000000 0.010654 0.011822 0.011166
    c  0.010812 0.010630 0.011850 0.011394 0.010752 0.011403 0.010654 0.000000 0.011303 0.011848
    b  0.011028 0.011077 0.011979 0.012156 0.011707 0.011294 0.011822 0.011303 0.000000 0.012095
    a  0.011051 0.011661 0.011900 0.011482 0.011044 0.011352 0.011166 0.011848 0.012095 0.000000

  With keyword set 4:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.008980 0.010107 0.006642 0.013395
    B  0.008980 0.000000 0.006549 0.014435 0.017926
    C  0.010107 0.006549 0.000000 0.014241 0.019875
    D  0.006642 0.014435 0.014241 0.000000 0.017491
    E  0.013395 0.017926 0.019875 0.017491 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.011570 0.016294 0.010815 0.011492 0.014477 0.011866 0.010841 0.013405 0.011531
    B  0.011570 0.000000 0.013552 0.016007 0.017792 0.020305 0.016454 0.009723 0.008449 0.012631
    C  0.016294 0.013552 0.000000 0.015768 0.027317 0.031561 0.019072 0.011600 0.010617 0.016251
    D  0.010815 0.016007 0.015768 0.000000 0.018689 0.024852 0.010022 0.017925 0.017325 0.010038
    E  0.011492 0.017792 0.027317 0.018689 0.000000 0.007961 0.014376 0.014815 0.021252 0.018602
    e  0.014477 0.020305 0.031561 0.024852 0.007961 0.000000 0.019778 0.015033 0.021080 0.023070
    d  0.011866 0.016454 0.019072 0.010022 0.014376 0.019778 0.000000 0.017850 0.018800 0.011940
    c  0.010841 0.009723 0.011600 0.017925 0.014815 0.015033 0.017850 0.000000 0.007854 0.015405
    b  0.013405 0.008449 0.010617 0.017325 0.021252 0.021080 0.018800 0.007854 0.000000 0.014961
    a  0.011531 0.012631 0.016251 0.010038 0.018602 0.023070 0.011940 0.015405 0.014961 0.000000

  With keyword set 5:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.015171 0.016168 0.015575 0.014795
    B  0.015171 0.000000 0.016379 0.017969 0.016674
    C  0.016168 0.016379 0.000000 0.018949 0.016653
    D  0.015575 0.017969 0.018949 0.000000 0.018493
    E  0.014795 0.016674 0.016653 0.018493 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.017364 0.019513 0.019282 0.018359 0.018517 0.018728 0.018984 0.019835 0.019621
    B  0.017364 0.000000 0.018675 0.019837 0.018863 0.018509 0.018413 0.018638 0.018085 0.017973
    C  0.019513 0.018675 0.000000 0.020098 0.018858 0.019433 0.019346 0.019391 0.019297 0.018965
    D  0.019282 0.019837 0.020098 0.000000 0.020348 0.019997 0.018991 0.020518 0.019628 0.018151
    E  0.018359 0.018863 0.018858 0.020348 0.000000 0.018856 0.019700 0.020379 0.019835 0.018750
    e  0.018517 0.018509 0.019433 0.019997 0.018856 0.000000 0.018748 0.018377 0.018793 0.018165
    d  0.018728 0.018413 0.019346 0.018991 0.019700 0.018748 0.000000 0.019889 0.019097 0.018054
    c  0.018984 0.018638 0.019391 0.020518 0.020379 0.018377 0.019889 0.000000 0.019047 0.019023
    b  0.019835 0.018085 0.019297 0.019628 0.019835 0.018793 0.019097 0.019047 0.000000 0.018554
    a  0.019621 0.017973 0.018965 0.018151 0.018750 0.018165 0.018054 0.019023 0.018554 0.000000


  With keyword set 6:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.012940 0.013690 0.010134 0.020768
    B  0.012940 0.000000 0.008796 0.022844 0.029096
    C  0.013690 0.008796 0.000000 0.021595 0.031861
    D  0.010134 0.022844 0.021595 0.000000 0.028200
    E  0.020768 0.029096 0.031861 0.028200 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.016976 0.022803 0.015724 0.016672 0.022572 0.018364 0.014523 0.018987 0.016886
    B  0.016976 0.000000 0.019258 0.024475 0.027819 0.033093 0.027238 0.012924 0.010811 0.018447
    C  0.022803 0.019258 0.000000 0.022286 0.042816 0.051345 0.029645 0.015924 0.014177 0.023027
    D  0.015724 0.024475 0.022286 0.000000 0.029168 0.040905 0.014113 0.027564 0.026453 0.014939
    E  0.016672 0.027819 0.042816 0.029168 0.000000 0.010641 0.022294 0.022562 0.033872 0.028841
    e  0.022572 0.033093 0.051345 0.040905 0.010641 0.000000 0.032962 0.024785 0.034788 0.037844
    d  0.018364 0.027238 0.029645 0.014113 0.022294 0.032962 0.000000 0.029105 0.030201 0.018716
    c  0.014523 0.012924 0.015924 0.027564 0.022562 0.024785 0.029105 0.000000 0.010611 0.023144
    b  0.018987 0.010811 0.014177 0.026453 0.033872 0.034788 0.030201 0.010611 0.000000 0.021925
    a  0.016886 0.018447 0.023027 0.014939 0.028841 0.037844 0.018716 0.023144 0.021925 0.000000


  With keyword set 7:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.024697 0.021902 0.022253 0.047342
    B  0.024697 0.000000 0.016077 0.067946 0.074634
    C  0.021902 0.016077 0.000000 0.054178 0.080114
    D  0.022253 0.067946 0.054178 0.000000 0.072549
    E  0.047342 0.074634 0.080114 0.072549 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.033450 0.043694 0.037565 0.030930 0.047851 0.036868 0.024454 0.032704 0.031666
    B  0.033450 0.000000 0.044591 0.075141 0.074467 0.077829 0.078277 0.023577 0.017538 0.039497
    C  0.043694 0.044591 0.000000 0.054296 0.110691 0.132447 0.070696 0.039988 0.026165 0.040654
    D  0.037565 0.075141 0.054296 0.000000 0.072766 0.107786 0.026442 0.073902 0.067858 0.026341
    E  0.030930 0.074467 0.110691 0.072766 0.000000 0.016514 0.049695 0.054983 0.083623 0.068331
    e  0.047851 0.077829 0.132447 0.107786 0.016514 0.000000 0.080771 0.057655 0.085966 0.086974
    d  0.036868 0.078277 0.070696 0.026442 0.049695 0.080771 0.000000 0.073186 0.078391 0.040357
    c  0.024454 0.023577 0.039988 0.073902 0.054983 0.057655 0.073186 0.000000 0.019658 0.047499
    b  0.032704 0.017538 0.026165 0.067858 0.083623 0.085966 0.078391 0.019658 0.000000 0.038134
    a  0.031666 0.039497 0.040654 0.026341 0.068331 0.086974 0.040357 0.047499 0.038134 0.000000

    According to this matrix, the two halves of each bifolio are more similar to each other
    than to other folios. The only exception is bifolio A, whose two leaves A-a
    are about as far apart as A-b, A-B, A-D, A-d, and farther apart than A-c or A-E.

    We should compute the matrix for pages...

  With keyword set 8:

              A        B        C        D        E
       -------- -------- -------- -------- --------
    A  0.000000 0.025477 0.032770 0.014678 0.026348
    B  0.025477 0.000000 0.017877 0.017311 0.031356
    C  0.032770 0.017877 0.000000 0.028090 0.035054
    D  0.014678 0.017311 0.028090 0.000000 0.031945
    E  0.026348 0.031356 0.035054 0.031945 0.000000

              A        B        C        D        E        e        d        c        b        a
       -------- -------- -------- -------- -------- -------- -------- -------- -------- --------
    A  0.000000 0.033273 0.046463 0.024895 0.031992 0.028727 0.029565 0.032668 0.040787 0.035185
    B  0.033273 0.000000 0.031515 0.022852 0.032934 0.042998 0.022895 0.027224 0.025879 0.033263
    C  0.046463 0.031515 0.000000 0.034602 0.055337 0.054171 0.040839 0.022761 0.028957 0.050304
    D  0.024895 0.022852 0.034602 0.000000 0.041506 0.042909 0.026610 0.036616 0.036847 0.031611
    E  0.031992 0.032934 0.055337 0.041506 0.000000 0.023721 0.035939 0.031024 0.046460 0.043361
    e  0.028727 0.042998 0.054171 0.042909 0.023721 0.000000 0.040227 0.027364 0.038896 0.051080
    d  0.029565 0.022895 0.040839 0.026610 0.035939 0.040227 0.000000 0.034880 0.034214 0.027948
    c  0.032668 0.027224 0.022761 0.036616 0.031024 0.027364 0.034880 0.000000 0.020169 0.044718
    b  0.040787 0.025879 0.028957 0.036847 0.046460 0.038896 0.034214 0.020169 0.000000 0.048712
    a  0.035185 0.033263 0.050304 0.031611 0.043361 0.051080 0.027948 0.044718 0.048712 0.000000

BEST ORDERING OF BIFOLIOS

  Now we look for the mest ordering of bifolios.

    echo "wset = ${wset}"
    enum-perms \
        -v elems="`echo ${bifolios}|tr -d ' '`" \
      | find-cheapest-path \
         -v matrixFile=${sdir}/dist-bifolio-${wset}.matrix
         
  Result with the bifolio distance matrix 0 (all words):

      best score = 0.007729
      BCDAE
      EADCB

  With bifolio distance matrix 1 (words that occur at least twice):
  
      best score = 0.015293
      CBADE
      EDABC

  With bifolio distance matrix 2 (words that occur at least 4 times):

      natural score = 0.024236
      natural perm = ABCDE

      best score = 0.020850
      best perms =
      CBADE
      EDABC
      
  With bifolio matrix 3 (words with count in [5..20]):

      natural score = 0.035911
      natural perm = ABCDE

      best score = 0.033086
      best perms =
      CBEAD
      DAEBC

BEST PHYSICALLY VALID ORDERING OF FOLIOS

  let's look for optimal physically valid orderings of the folios.
  Note that the "elems" list must be an UNNESTED permutation.
  I will retain only the solutions that have D before d.

    echo "wset = ${wset}"
    nice enum-folio-perms \
        -v elems="AaBbCcDdEe" \
      | nice find-cheapest-path \
         -v matrixFile=${sdir}/dist-folio-${wset}.matrix

  Result with the folio distance matrix 0 (all words)

      natural score = 0.017480
      natural perm = ABCDEedcba

      best score = 0.014879
      best perms =
      CcbBaDdAEe

    Note that the bifolios are all unnested except for A wrapped around D.
    Also D is central as it must.
    
    The permutation BbcCDdaAEe, all unnested, has comparable score
    (0.015108).

  With folio distance matrix 1 (words that occur at least twice):

      natural score = 0.039710
      natural perm = ABCDEedcba

      best score = 0.031946
      best perms =
      CcbBaDdAEe
      ()()(())()
      
    Note that it is the same as the best solution for matrix 0.
  
  With folio distance matrix 2 (words that occur at least 4 times):

      natural score = 0.060054
      natural perm = ABCDEedcba

      best score = 0.046806
      best perms =
      CcbBaDdAEe
      ()()(())()

    Again the same solution, with A nested around D.
    The unnested solution BbcCDdaAEe has score 0.048700 in this 
    matrix.

  With folio matrix 3 (words with count in [5..20]):

      natural score = 0.101799
      natural perm = ABCDEedcba

      best score = 0.098164
      best perms =
      beaDdAEBcC
      (((())))()
   
    This solution has everything nested except D (good!) and C.
    But the score is only a trifle lower than the current solution.
    
  With folio matrix 4 (the 72 words with count > 20):

      natural score = 0.127983
      natural perm = ABCDEedcba

      best score = 0.091913
      best perms =
      CcbBaDdAEe
      ()()(())()

    Again the same solution as with matrices 0-2, with A nested around D.

  With folio matrix 5 (words with count in [4..6]):

      natural score = 0.171579
      natural perm = ABCDEedcba

      best score = 0.165447
      best perms =
      ceaDdbBAEC
      (((()())))
      
    At least bifolio D is OK.

  With folio matrix 6 (the 36 words with count > 40):

      natural score = 0.192932
      natural perm = ABCDEedcba

      best score = 0.130247
      best perms =
      eEAcCbBaDd
      ()(()())()

    This one is different... but D is still OK.
  
  With folio matrix 7 (the 10 words with count > 100):

      natural score = 0.433366
      natural perm = ABCDEedcba

      best score = 0.232615
      best perms =
      eEAcBbCaDd
      ()((()))()
      
    This one is only a bit different from the previous one: C and B got nested.
    Note that the cos difference is substantial, but still the points are
    far from being a unidimensional cluster...

  With folio matrix 8 (the 26 words that occur between 41 and 100 times).

      natural score = 0.308605
      natural perm = ABCDEedcba

      best score = 0.231597
      best perms =
      EeADCcbBda
      ()((()()))

    This one is no good because it has C and B nested in D.