Last edited on 1998-07-04 11:34:50 by stolfi

Scatter-plots of VMs pages

Plots based on OKOKOKO element frequencies

Projection on "Herbal" axes

Projection on "Pharma" axes


For these plots, I deleted all word breaks (but not line breaks) in the sample text, and then factored the resulting strings into the character groups, or "elements", of my OKOKOKO paradigm:

    a   y   o

    ch      sh      ee      
    che     she     eee     

    t       k       p       f
    te      ke      pe      fe 
    cth     ckh     cph     cfh
    cthe    ckhe    cphe    cfhe
    ct      ck      cp      cf
    ith     ikh     iph     ifh 

    d   id   iid   iiid
    de

    n   in   iin   iiin

    r   ir   iir   iiir

    l   il   iil   iiil

    m   im   iim   iiim 

    s   is   iis   iiis 

    j   ij   iij   iiij  

    g   ig   iig   iiig 

    q 

    x   b   u

The elements in boldface occur more than 15 times in the whole text; those in italics do not occur at all. There were about 300 failures of the OKOKOKO paradigm (mostly extra "e"s and "i"s), or about 3 per 1000 elements.
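The factoring step can be sketched roughly as follows. This is only an illustration: it assumes a greedy longest-match scan, which may not be the rule actually used for these plots, and the names `ELEMENTS` and `factor` are hypothetical. Characters that cannot start any element are collected as paradigm failures.

```python
# Element inventory copied from the table above.
ELEMENTS = """
a y o
ch sh ee che she eee
t k p f te ke pe fe
cth ckh cph cfh cthe ckhe cphe cfhe
ct ck cp cf ith ikh iph ifh
d id iid iiid de
n in iin iiin
r ir iir iiir
l il iil iiil
m im iim iiim
s is iis iiis
j ij iij iiij
g ig iig iiig
q
x b u
""".split()

ELEMENT_SET = set(ELEMENTS)
MAX_LEN = max(len(e) for e in ELEMENTS)


def factor(line):
    """Split one line (word breaks already deleted) into elements.

    Assumption: greedy longest match from left to right. A character
    that starts no element is recorded as a failure and skipped.
    """
    elements, failures = [], []
    i = 0
    while i < len(line):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(MAX_LEN, len(line) - i), 0, -1):
            piece = line[i:i + size]
            if piece in ELEMENT_SET:
                elements.append(piece)
                i += size
                break
        else:
            failures.append(line[i])
            i += 1
    return elements, failures


# Word breaks are deleted before factoring; line breaks are kept,
# so each line is factored separately.
elements, failures = factor("daiin chedy".replace(" ", ""))
print(elements, failures)
```

Note that the greedy rule is a real limitation: it splits "keedy" as "ke" plus a stray "e", whereas a scan with backtracking could read it as "k" followed by "ee". A stricter implementation would have to choose between such alternative parses.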

The initial coordinates of each page are the relative frequencies of these elements in that page. To produce the plots, I picked three mutually orthogonal unit-length vectors X, Y, and Z, in 50-dimensional element frequency space, and projected the page points onto them.
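As a minimal sketch of the two steps just described, the code below computes a page's relative element frequencies and then takes dot products with a set of axes. The function names are hypothetical, the axes are assumed to be orthonormal already, and no origin is subtracted before projecting, since the text does not say whether one was.

```python
from collections import Counter


def relative_frequencies(page_elements, element_order):
    """Coordinates of a page: count of each element divided by the total."""
    counts = Counter(page_elements)
    total = sum(counts[e] for e in element_order)
    if total == 0:
        return [0.0] * len(element_order)
    return [counts[e] / total for e in element_order]


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(point, axes):
    """Coordinates of `point` along each axis (axes assumed orthonormal)."""
    return [dot(point, axis) for axis in axes]


# Toy example with a four-element inventory instead of the full one.
order = ["o", "k", "y", "d"]
page = relative_frequencies(["o", "k", "o", "y"], order)
xyz = project(page, [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])
print(page, xyz)
```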

For the "Herbal" projection, the X, Y, and Z axes are the result of Gram-Schmidt orthogonalization applied to the vectors HEA-TOT, HEB-TOT, and BIO-TOT, where TOT is the vector of element frequencies for the whole sample, and HEA, HEB, and BIO are the frequencies for the Herbal-A, Herbal-B, and biological sections, respectively. Here are the X, Y, and Z coordinates of each page in this projection.
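The construction of the axes can be sketched as below. This is an illustration under stated assumptions rather than the code used for the plots: `gram_schmidt` and `herbal_axes` are hypothetical names, the vectors are plain Python lists, and the toy frequencies are invented for the example only. The loop subtracts each earlier axis from the running remainder (the "modified" form of Gram-Schmidt), which gives the same result as the textbook form in exact arithmetic.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, eps=1e-12):
    """Return orthonormal vectors spanning the same space, in order."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components along the axes found so far.
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < eps:
            raise ValueError("vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def herbal_axes(hea, heb, bio, tot):
    """X, Y, Z axes from HEA-TOT, HEB-TOT, and BIO-TOT, in that order."""
    diffs = [[s - t for s, t in zip(sec, tot)] for sec in (hea, heb, bio)]
    return gram_schmidt(diffs)


# Made-up four-element frequencies, for illustration only.
tot = [0.25, 0.25, 0.25, 0.25]
hea = [0.40, 0.20, 0.20, 0.20]
heb = [0.20, 0.40, 0.20, 0.20]
bio = [0.20, 0.20, 0.40, 0.20]
X, Y, Z = herbal_axes(hea, heb, bio, tot)
```

Because the vectors are processed in order, X points exactly along HEA-TOT, while Y and Z keep only the parts of HEB-TOT and BIO-TOT that are not already explained by the earlier axes.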

The "Pharma" projection is an attempt to discover a separation of Herbal-A pages from Pharma pages, by using a different projection to three-space. The X, Y, and Z axes were obtained by orthogonalization of PHA-HEA, HEB-HEA, and BIO-HEA, where PHA is the vector of element frequencies for the "Pharma" section. Here are the X, Y, and Z coordinates of each page in this projection.

You may want to check another version of these plots where similar characters have been identified.