Last edited on 1998-07-04 11:34:50 by stolfi
For these plots, I deleted all word breaks (but not line breaks) in the sample text, and then factored the resulting strings into the character groups, or "elements", of my OKOKOKO paradigm:
a y o ch sh ee che she eee t k p f te ke pe fe cth ckh cph cfh cthe ckhe cphe cfhe ct ck cp cf ith ikh iph ifh d id iid iiid de n in iin iiin r ir iir iiir l il iil iiil m im iim iiim s is iis iiis j ij iij iiij g ig iig iiig q x b u
The elements in boldface occur more than 15 times in the whole text; those in italics do not occur at all. There were about 300 failures of the OKOKOKO paradigm (mostly extra "e"s and "i"s), or about 3 in every 1000 elements.
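The factoring step can be sketched roughly as follows. This page does not say how the factoring was actually done, so the approach here (a small dynamic program that picks the split with the fewest unmatched characters) is only an assumption, and the names `ELEMENTS` and `factor` are made up for the illustration. Any character that cannot be covered by an element is reported as a failure.

```python
# Element list copied from the text above; everything else is hypothetical.
ELEMENTS = """a y o ch sh ee che she eee t k p f te ke pe fe
cth ckh cph cfh cthe ckhe cphe cfhe ct ck cp cf ith ikh iph ifh
d id iid iiid de n in iin iiin r ir iir iiir l il iil iiil
m im iim iiim s is iis iiis j ij iij iiij g ig iig iiig q x b u""".split()


def factor(text):
    """Split text into elements, leaving as few unmatched characters as possible.

    Returns (parts, failures): the elements found, in order, and the stray
    characters that no element could cover.
    """
    n = len(text)
    cost = [0] * (n + 1)     # cost[i] = fewest failures when factoring text[i:]
    step = [None] * (n + 1)  # step[i] = element chosen at i, or None for a failure
    for i in range(n - 1, -1, -1):
        # Fallback: treat text[i] as a stray character.
        cost[i], step[i] = cost[i + 1] + 1, None
        for el in ELEMENTS:
            if text.startswith(el, i) and cost[i + len(el)] < cost[i]:
                cost[i], step[i] = cost[i + len(el)], el
    parts, failures, i = [], [], 0
    while i < n:
        if step[i] is None:
            failures.append(text[i])
            i += 1
        else:
            parts.append(step[i])
            i += len(step[i])
    return parts, failures


print(factor("qokeedy"))  # (['q', 'o', 'k', 'ee', 'd', 'y'], [])
print(factor("oiy"))      # (['o', 'y'], ['i'])
```

A plain greedy longest-match scan would be simpler, but it would split "keedy" as ke + e + d + y and count a stray "e", which is why the sketch minimizes failures instead.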
The initial coordinates of each page are the relative frequencies of these elements in that page. To produce the plots, I picked three mutually orthogonal unit-length vectors X, Y, and Z, in 50-dimensional element frequency space, and projected the page points onto them.
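As an illustration of the two steps just described, here is a minimal sketch that turns a page's element list into a relative-frequency vector and projects it onto three given axes. The function names, the three-element toy alphabet, and the trivial axes are all assumptions made for the example; the real axes live in the full element frequency space.

```python
from collections import Counter


def relative_frequencies(page_elements, element_order):
    """Relative frequency of each element on one page, in a fixed order."""
    counts = Counter(page_elements)
    total = len(page_elements)
    if total == 0:
        return [0.0] * len(element_order)
    return [counts[el] / total for el in element_order]


def project(point, axes):
    """Coordinates of a point along each axis (axes assumed orthonormal)."""
    return tuple(sum(p * a for p, a in zip(point, axis)) for axis in axes)


# Toy data, invented for the example: a 3-element alphabet and one short page.
element_order = ["a", "o", "y"]
page = ["o", "a", "o", "y"]
point = relative_frequencies(page, element_order)

# Hypothetical orthonormal axes X, Y, Z (here simply the coordinate axes).
axes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(project(point, axes))
```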
For the "Herbal" projection, the X, Y, and Z axes are the result of Gram-Schmidt orthogonalization applied to the vectors HEA-TOT, HEB-TOT, and BIO-TOT, where TOT is the vector of element frequencies for the whole sample, and HEA, HEB, and BIO are the frequencies for the Herbal-A, Herbal-B, and biological sections, respectively. Here are the X, Y, and Z coordinates of each page in this projection.
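The axis construction might look like the sketch below. It uses the "modified" variant of Gram-Schmidt (each earlier axis is subtracted from the running remainder), which is an assumption about the details; the four-component frequency vectors are invented numbers, not data from the sample.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors):
    """Return orthonormal vectors spanning the same space as the inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)  # component of w along an earlier axis
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


def diff(u, v):
    return [a - b for a, b in zip(u, v)]


# Made-up frequency vectors, for illustration only.
TOT = [0.4, 0.3, 0.2, 0.1]
HEA = [0.5, 0.2, 0.2, 0.1]
HEB = [0.3, 0.3, 0.2, 0.2]
BIO = [0.4, 0.2, 0.3, 0.1]

X, Y, Z = gram_schmidt([diff(HEA, TOT), diff(HEB, TOT), diff(BIO, TOT)])
```

With this ordering, X points along HEA-TOT, and Y and Z pick up only the parts of HEB-TOT and BIO-TOT not already accounted for by the earlier axes.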
The "Pharma" projection is an attempt to separate the Herbal-A pages from the Pharma pages by using a different projection to three-space. The X, Y, and Z axes were obtained by orthogonalization of PHA-HEA, HEB-HEA, and BIO-HEA, where PHA is the vector of element frequencies for the "Pharma" section. Here are the X, Y, and Z coordinates of each page in this projection.
You may want to check another version of these plots where similar characters have been identified.