Last edited on 1998-07-04 11:36:19 by stolfi
For these plots, the coordinates of each page are computed from the relative frequencies of the following 50 words:
aiin al ar chckhy chdy chedy cheey cheol chey chol chor chy daiin dain dal dar dy lchedy okaiin okain okal okar okedy okeedy okeey ol or otaiin otal otar otedy oteedy oteey oty qokaiin qokain qokal qokar qokedy qokeedy qokeey qokey qoky qol s saiin shedy sheey shey shol
These are the 50 most frequent words from Rene's word frequency list. (Another set of plots was prepared using instead the words whose frequencies showed the most variation from section to section, but the differences were hardly noticeable.)
The initial coordinates of each page are the relative frequencies of these words in that page. To produce the plots, I picked three mutually orthogonal unit-length vectors X, Y, and Z, in 50-dimensional word frequency space, and projected the page points onto them.
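As a rough illustration of the two steps just described, here is a minimal Python sketch. The function names, the three-word vocabulary, and the sample page are hypothetical and are not the data used for the plots. It also assumes that "relative frequency" means the word's count divided by the total number of words on the page, which the text above does not specify.

```python
from collections import Counter

def relative_frequencies(page_words, vocabulary):
    """Relative frequency of each vocabulary word in one page.

    Assumption: frequencies are taken relative to ALL words on the page,
    not just those in the vocabulary.
    """
    counts = Counter(page_words)
    total = len(page_words)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]

def project(point, axes):
    """Coordinates of a point along each axis vector (plain dot products)."""
    return [sum(p * a for p, a in zip(point, axis)) for axis in axes]

# Toy data: a 3-word vocabulary instead of the 50 words listed above.
vocabulary = ["daiin", "chedy", "ol"]
page = ["daiin", "ol", "daiin", "qokey"]

freqs = relative_frequencies(page, vocabulary)

# Hypothetical orthonormal axes; here simply the coordinate axes.
axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
xyz = project(freqs, axes)
```

With the real data, each point would have 50 components and `axes` would hold the three unit vectors X, Y, and Z.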
For the "Herbal" projection, the X, Y, and Z axes are the result of Gram-Schmidt orthogonalization applied to the vectors HEA-TOT, HEB-TOT, and BIO-TOT, where TOT is the vector of word frequencies for the whole sample, and HEA, HEB, and BIO are the frequencies for the Herbal-A, Herbal-B, and biological sections, respectively. Here are the X, Y, and Z coordinates of each page in this projection.
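The orthogonalization step for the "Herbal" axes could be sketched as follows. This is only an illustration: the four-dimensional frequency vectors below are made-up placeholders, not the actual section frequencies, and the helper names are hypothetical. The sketch uses the modified Gram-Schmidt variant, which gives the same result as the classical one in exact arithmetic.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, eps=1e-12):
    """Return orthonormal vectors spanning the same space as `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        # Remove the components along the axes found so far.
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < eps:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis

def diff(u, v):
    return [a - b for a, b in zip(u, v)]

# Placeholder frequency vectors (4 words instead of 50); NOT real data.
TOT = [0.25, 0.25, 0.25, 0.25]
HEA = [0.40, 0.30, 0.20, 0.10]
HEB = [0.10, 0.40, 0.30, 0.20]
BIO = [0.30, 0.10, 0.20, 0.40]

X, Y, Z = gram_schmidt([diff(HEA, TOT), diff(HEB, TOT), diff(BIO, TOT)])
```

The order of the input vectors matters: X points along HEA-TOT, Y is the part of HEB-TOT orthogonal to X, and Z is the part of BIO-TOT orthogonal to both.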
The "Pharma" projection is an attempt to separate the Herbal-A pages from the Pharma pages by using a different projection to three-space. The X, Y, and Z axes were obtained by orthogonalization of PHA-HEA, HEB-HEA, and BIO-HEA, where PHA is the vector of word frequencies for the "Pharma" section. Here are the X, Y, and Z coordinates of each page in this projection.
You may want to check another version of these plots where similar characters have been identified.