Hacking at the Voynich manuscript - Side notes
023 Plotting QOKOKOKO element frequencies per page

Last edited on 1999-01-31 06:42:31 by stolfi

1998-06-20 stolfi
=================

  [ First version done on 1998-05-04, now redone with fresher data. ]
  [ Removed element counting part to Notes/022 on 1999-01-31. ]

  In Note 021, I tried to classify pages according to the
  frequencies of certain keywords.  John Grove pointed out that the
  transcription which I used (Friedman's) has inconsistencies which may
  masquerade as language differences, e.g. "dain" in place of "daiin" or
  vice-versa.  Also, it seems that spacing (word division) is quite
  inconsistent.
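
  As a toy illustration of the transcription problem (a sketch only;
  the two "readings" below are invented lines, not manuscript text,
  and word_freqs is a hypothetical helper), the same line read by two
  transcribers gives different word frequencies:

```python
from collections import Counter

def word_freqs(text):
    """Relative frequency of each blank-delimited word in text."""
    words = text.split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

# Invented example lines: the readings differ only in "dain" vs "daiin".
reading_a = "daiin chol daiin shol"
reading_b = "dain chol daiin shol"

fa = word_freqs(reading_a)
fb = word_freqs(reading_b)

# The frequency of "daiin" halves, although the underlying text is the same.
print(fa["daiin"], fb["daiin"], fb["dain"])
```

  A page classifier based on the frequency of "daiin" would see these
  two readings as different, even though only the transcription changed.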
  
  Some of these problems were reduced on 1999-01-30 by redoing the
  analysis with the majority edition instead of a best pick.
  To reduce them further, I thought of using, instead of words, the
  "elements" of the QOKOKOKO paradigm (see Notes 017 and 018).
  
I. EXTRACTING AND COUNTING ELEMENTS

  I will use the majority edition of the interlinear, release 1.6e6:

    ln -s ../045/pages-m text-pages
    ln -s ../045/sections-m text-sections

  I will borrow the element counts computed in Notes/022:
  
    mkdir -p RAW EQV

    ( cd RAW && ln -s ../../022/RAW/efreqs )
    ( cd EQV && ln -s ../../022/EQV/efreqs )

II. PAGE SCATTER-PLOTS

  See Notes/021 for explanation of these plots.

  Let's now compute, for each page and section, the frequencies of the keywords listed in keys.dic:

    foreach dic ( vald )
      foreach etag ( RAW EQV )
        foreach utype ( pages sections )
          set frdir = "${etag}/efreqs/${utype}"
          set ptdir = "${etag}/plots/${dic}/${utype}"
          echo "${frdir}" "${ptdir}"
          /bin/rm -rf ${ptdir}
          mkdir -p ${ptdir}
          cp -p ${frdir}/all.names ${ptdir}
          foreach fnum ( tot `cat ${frdir}/all.names` )
            printf "%30s/%-7s " "${ptdir}" "${fnum}:"
            cat ${frdir}/${fnum}.frq \
              | gawk '/./{print $1, $3;}' \
              | est-dic-probs -v dic=${etag}/plots/${dic}/keys.dic \
              > ${ptdir}/${fnum}.pos
          end
        end
      end
    end
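
  The inner step of the loop above can be sketched in Python. This is
  only a guess at the general idea: the est-dic-probs script is not
  shown in this note, so the estimator below (plain relative frequency
  over the keyword list) is an assumption, as are the "count freq word"
  layout of the .frq lines, the function name keyword_probs, and the
  toy data.

```python
from collections import Counter

def keyword_probs(freq_lines, keywords):
    """Estimate keyword probabilities from 'count freq word' lines.

    Like the gawk filter, it uses fields 1 and 3 of each line.
    The estimator is an assumed stand-in, not est-dic-probs itself.
    """
    counts = Counter()
    for line in freq_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        count, word = int(fields[0]), fields[2]
        if word in keywords:
            counts[word] += count
    total = sum(counts.values())
    if total == 0:
        # No keyword occurs in this unit: report zeros.
        return {k: 0.0 for k in keywords}
    return {k: counts[k] / total for k in keywords}

# Toy input, invented for illustration only.
page = ["12 0.400 o", "9 0.300 k", "", "9 0.300 ch"]
probs = keyword_probs(page, ["o", "k"])
print(probs)
```

  Each page or section then becomes one point whose coordinates are the
  estimated keyword probabilities, which is what the .pos files are
  presumed to hold.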

  Let's plot them:

    set sys = "tot-hea"
    foreach dic ( vald )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/sections"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,tot,hea,heb,bio}.pos
      end      
    end

  Again, trying to separate Herbal-A from Pharma:

    set sys = "hea-pha"
    foreach dic ( vald )
      foreach etag ( RAW EQV )
        set ptdir = "${etag}/plots/${dic}/pages"
        set scdir = "${etag}/plots/${dic}/sections"
        set fgdir = "${etag}/plots/${dic}/${sys}"
        /bin/rm -rf ${fgdir}
        mkdir -p ${fgdir}
        cp -p ${ptdir}/all.names ${fgdir}/all.names
        make-3d-scatter-plots \
          ${ptdir} \
          ${fgdir} \
          ${scdir}/{tot,hea,pha,heb,bio}.pos
      end      
    end

  The scatter plots made with collapsed letters still show the main
  sections as separate clusters, but touching each other.