Last edited on 1998-07-04 11:36:19 by stolfi

Scatter-plots of VMs pages

Plots based on word frequencies

Projection on "Herbal" axes

         

Projection on "Pharma" axes

         


For these plots, the coordinates of each page are computed from the relative frequencies of the following 50 words:

  aiin al ar chckhy chdy chedy cheey cheol chey chol chor
  chy daiin dain dal dar dy lchedy okaiin okain okal okar
  okedy okeedy okeey ol or otaiin otal otar otedy oteedy
  oteey oty qokaiin qokain qokal qokar qokedy qokeedy qokeey
  qokey qoky qol s saiin shedy sheey shey shol

These are the 50 most frequent words from Rene's word frequency list. (Another set of plots was prepared using instead the words whose frequencies show the most variation from section to section, but the differences were hardly noticeable.)
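As a rough illustration of how a page can be turned into a 50-component frequency vector, here is a minimal Python sketch. The function name `relative_frequencies` and the sample page are hypothetical, and the sketch assumes that frequencies are taken relative to the total number of word tokens on the page; the text above does not say which denominator was actually used.

```python
from collections import Counter

# The 50 reference words listed above.
WORDS = """
aiin al ar chckhy chdy chedy cheey cheol chey chol chor
chy daiin dain dal dar dy lchedy okaiin okain okal okar
okedy okeedy okeey ol or otaiin otal otar otedy oteedy
oteey oty qokaiin qokain qokal qokar qokedy qokeedy qokeey
qokey qoky qol s saiin shedy sheey shey shol
""".split()


def relative_frequencies(page_words, vocabulary=WORDS):
    """Return one relative frequency per vocabulary word.

    Assumption (not stated in the text): the denominator is the total
    number of word tokens on the page, including words outside the
    vocabulary.
    """
    counts = Counter(page_words)
    total = len(page_words)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]


# Hypothetical page, invented for illustration; "xyz" stands in for
# any token that is not among the 50 reference words.
page = "daiin chol daiin shol xyz chor".split()
freqs = relative_frequencies(page)
```

With this choice of denominator the components of a page vector need not sum to 1, since words outside the list still count toward the total.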

The initial coordinates of each page are the relative frequencies of these words on that page. To produce the plots, I picked three mutually orthogonal unit-length vectors X, Y, and Z in the 50-dimensional word frequency space, and projected the page points onto them.
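Since X, Y, and Z are orthonormal, projecting a page onto them amounts to taking three dot products. The sketch below shows this in four dimensions instead of 50 to keep it readable. The helper names `dot` and `project`, the axes, and the sample point are all made up for illustration, and the sketch does not subtract any origin from the page vector because the text does not say whether that was done.

```python
def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def project(point, axes):
    """Coordinates of `point` along each vector in `axes`.

    Assumes the axes are mutually orthogonal and of unit length, so
    that each coordinate is simply a dot product.
    """
    return tuple(dot(point, axis) for axis in axes)


# Hypothetical orthonormal axes in a 4-dimensional frequency space.
X = (1.0, 0.0, 0.0, 0.0)
Y = (0.0, 1.0, 0.0, 0.0)
Z = (0.0, 0.0, 0.6, 0.8)

# Hypothetical page vector (relative frequencies of four words).
page_point = (0.1, 0.2, 0.3, 0.4)

coords = project(page_point, (X, Y, Z))  # approximately (0.1, 0.2, 0.5)
```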

For the "Herbal" projection, the X, Y, and Z axes are the result of Gram-Schmidt orthogonalization applied to the vectors HEA-TOT, HEB-TOT, and BIO-TOT, where TOT is the vector of word frequencies for the whole sample, and HEA, HEB, and BIO are the frequencies for the Herbal-A, Herbal-B, and biological sections, respectively. Here are the X, Y, and Z coordinates of each page in this projection.
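The orthogonalization step can be sketched as follows. This is a plain-Python illustration of Gram-Schmidt, not the code that produced the plots; the four-component vectors standing in for TOT, HEA, HEB, and BIO are invented numbers, and the helper names `dot`, `sub`, and `gram_schmidt` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def gram_schmidt(vectors, eps=1e-12):
    """Return orthonormal vectors spanning the same space as `vectors`.

    Each input vector has its components along the earlier basis
    vectors removed, and the remainder is scaled to unit length.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < eps:
            raise ValueError("input vectors are linearly dependent")
        basis.append([wi / norm for wi in w])
    return basis


# Invented frequency vectors over four words, for illustration only.
TOT = [0.25, 0.25, 0.25, 0.25]
HEA = [0.40, 0.30, 0.20, 0.10]
HEB = [0.10, 0.40, 0.30, 0.20]
BIO = [0.20, 0.10, 0.30, 0.40]

X, Y, Z = gram_schmidt([sub(HEA, TOT), sub(HEB, TOT), sub(BIO, TOT)])
```

Because the vectors are processed in order, X points exactly along HEA-TOT, Y lies in the plane of HEA-TOT and HEB-TOT, and Z carries whatever part of BIO-TOT is left over.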

The "Pharma" projection is an attempt to discover a separation of Herbal-A pages from Pharma pages by using a different projection to three-space. The X, Y, and Z axes were obtained by orthogonalization of PHA-HEA, HEB-HEA, and BIO-HEA, where PHA is the vector of word frequencies for the "Pharma" section. Here are the X, Y, and Z coordinates of each page in this projection.

You may want to check another version of these plots where similar characters have been identified with each other (that is, treated as the same character).