Let's compute the distribution of letters at start, middle, and 
  end of the pseudo-"words" (delimited by spaces).  
  I will ignore line-start and line-end as they may
  be special.

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^  *//g' \
          -e 's/  *$//g' \
      | enum-ngraphs -v n=2 \
      > .voyn.dig

    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^ *//g' \
          -e 's/ *$//g' \
      | enum-ngraphs -v n=3 \
      > .voyn.trg

    cat .voyn.dig \
      | egrep '^ .$' \
      | sed -e 's/^.\(.\)$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-ws.frq

    cat .voyn.dig \
      | egrep '^. $' \
      | sed -e 's/^\(.\).$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-we.frq

    cat .voyn.trg \
      | egrep '^[^ ][^ ][^ ]$' \
      | sed -e 's/^.\(.\).$/\1/' \
      | sort | uniq -c | expand \
      > .voyn-wm.frq

    join \
        -a 1 -a 2  -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \
        .voyn-wm.frq \
        .voyn-we.frq \
      > .tmp

    join \
        -a 1 -a 2  -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \
        .voyn-ws.frq \
        .tmp \
      | gawk ' {printf "%s   %5d %5d %5d\n", $1, $2, $3, $4}' \
      > .voyn-wsme.frq
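
  A portability note: the `-j1 FIELD -j2 FIELD` options above are the
  historical join syntax; modern (POSIX/GNU) join spells them
  `-1 FIELD -2 FIELD`.  A self-contained example of the same merge,
  on two hypothetical count files:

```shell
# Merge two "count letter" tables on the letter field, keeping
# unpaired letters from either file with 0 for the missing count.
d=$(mktemp -d)
printf '%s\n' '10 A' '20 B' > "$d/wm.frq"    # hypothetical counts
printf '%s\n' '3 A'  '7 C'  > "$d/we.frq"
join -a 1 -a 2 -e 0 -1 2 -2 2 -o '0,1.1,2.1' "$d/wm.frq" "$d/we.frq"
# -> A 10 3
#    B 20 0
#    C 0 7
rm -r "$d"
```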

      let   ini   mid   fin
      --- ----- ----- -----
      4    1456    21     5

      O    1317  2558    27
      S     670   383     4
      T     746   689     1
      8     412  2154    61
      D     102  2071    14
      H     101   816     4

      P      28   112     4
      F       6    27     1
      A     126  1826     0
      C      23  4235     4
      I       1    71     0
      Z       0   343     1

      G      79   120  3126
      K       0     2     5
      L       2     3     3
      M       0    10   395
      N       0    16   458
      6       1     1     3

      *      11    15    10

      E     340   909   947

      2     140    28    57
      R     130   176   561

  Note that "E", "2", and "R" are the only letters that
  occur in significant numbers at all three positions.
  
  Note also that "2" and "R" are easily confused with each other, so
  the numbers are consistent with "2" being exclusively word-initial,
  "R" being exclusively word-final, and there being substantial
  misreadings in both directions (10% of the "R"s misread as "2"s,
  40% of the "2"s misread as "R"s).

  Here is an attempt to recreate the blanks in the VMs according to
  simple rules.  First, prepare a reference file in which every two
  consecutive characters are separated by ":" (no space) or " " (an
  original space).  Then build a synthetic version from the de-spaced
  text: join every two characters with ":", and turn ":" into " "
  before every "4" or "2" and after every "G", "K", "L", "M", "N",
  "6", or "R".
  
    cat .voyn.fsg \
      | tr -d '/=' \
      | sed -e 's/^  *//g' -e 's/  *$//g' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/: :/ /g' \
          -e 's/:$//g' \
      > .voyn-sp-org.fsg
      
    cat .voyn.fsg \
      | tr -d '/= ' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/:$//g' \
          -e 's/:\([42]\)/ \1/g' \
          -e 's/\([GKLMN6R]\):/\1 /g' \
      > .voyn-sp-syn.fsg
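
  Spelled out on a one-line example, this second pipeline joins the
  de-spaced characters with ":" and then opens a space before every
  "4" or "2" and after every "G", "K", "L", "M", "N", "6", or "R":

```shell
printf '%s\n' '8AMSCG4ODAR' \
  | sed -e 's/\(.\)/\1:/g' \
        -e 's/:$//g' \
        -e 's/:\([42]\)/ \1/g' \
        -e 's/\([GKLMN6R]\):/\1 /g'
# -> 8:A:M S:C:G 4:O:D:A:R
```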

    compare-spaces \
      .voyn-sp-syn.fsg \
      .voyn-sp-org.fsg \
      | tr -d ':' \
      > .voyn-sp.cmp
      
                  :
        ----- -----
      |  4707   676
    : |   984 22311


    cat .voyn-sp.cmp \
      | tr -dc '+\- ' \
      | sed -e 's/\(.\)/\1@/g' \
      | tr '@ ' '\012_' \
      | egrep '.' \
      | sort | uniq -c | expand \
      > .voyn-sp-o-s.frq
      
        676 +
        984 -
       4707 _

    cat .voyn-sp.cmp \
      | tr ' ' '_' \
      | enum-ngraphs -v n=3 \
      | egrep '.[-_+].' \
      | sort | uniq -c | expand \
      | sort -b +1.0 -1.2 +0 -1nr \
      > .foo
  
   It seems that many of the errors made by these space-prediction
   rules are due to confusion between "2" and "R" by the transcriber.
   Let's try to "correct" these mistakes by changing, in the original,

     word-initial "R" to "2"
     non-word-initial "2" to "R"
     
    cat .voyn.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^  *//g' \
          -e 's/  *$//g' \
          -e 's/ R/ 2/g' \
          -e 's/\([^ ]\)2/\1R/g' \
      | tr -d ' ' \
      | sed \
          -e 's/\(.\)/\1:/g' \
          -e 's/:$//g' \
          -e 's/:\([42]\)/ \1/g' \
          -e 's/\([GKLMN6R]\):/\1 /g' \
      > .voyn-sp-fix.fsg

    compare-spaces \
      .voyn-sp-fix.fsg \
      .voyn-sp-org.fsg \
      | tr -d ':' \
      > .voyn-sp-fix.cmp

            R     2           :
        ----- ----- ----- -----
    R |   792    87     0     0
    2 |   130   278     0     0
      |     0     0  4759   507
    : |     0     0   932 22480


    cat .voyn-sp-fix.cmp \
      | tr -dc '+\- ' \
      | sed -e 's/\(.\)/\1@/g' \
      | tr '@ ' '\012_' \
      | egrep '.' \
      | sort | uniq -c | expand \
      > .voyn-sp-o-s-fix.frq

        507 +
        932 -
       4759 _

    cat .voyn-sp-fix.cmp \
      | tr ' ' '_' \
      | enum-ngraphs -v n=3 \
      | egrep '.[-_+].' \
      | sort | uniq -c | expand \
      | sort -b +1.0 -1.2 +0 -1nr \
      > .foo

  Let's compute what would be the initial/medial/final statistics
  with these R/2 changes but with the original spaces:
  
    cat .voyn-sp-fix.cmp \
      | tr -d '+' \
      | tr '\-' ' ' \
      > .voyn-sp-fixr2.fsg
      
    cat .voyn-sp-fixr2.fsg \
      | tr -d '/=' \
      | sed \
          -e 's/^  *//g' \
          -e 's/  *$//g' \
      | egrep ' .* ' \
      | sed \
          -e 's/^[^ ][^ ]* //g' \
          -e 's/ [^ ][^ ]*$//g' \
      | tr ' ' '\012' \
      | egrep '.' \
      > .voyn-sp-fixr2-nonend.wds

    cat .voyn-sp-fixr2-nonend.wds \
      | sed -e 's/^\(.\).*$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-sp-fixr2-ws.frq

    cat .voyn-sp-fixr2-nonend.wds \
      | sed -e 's/^.*\(.\)$/\1/g' \
      | sort | uniq -c | expand \
      > .voyn-sp-fixr2-we.frq

    cat .voyn-sp-fixr2-nonend.wds \
      | egrep '...' \
      | sed \
          -e 's/^.\(.*\).$/\1/' \
          -e 's/\(.\)/\1@/g' \
      | tr '@' '\012' \
      | egrep '.' \
      | sort | uniq -c | expand \
      > .voyn-sp-fixr2-wm.frq

    join \
        -a 1 -a 2  -e 0 -j1 2 -j2 2 -o '0,1.1,2.1' \
        .voyn-sp-fixr2-wm.frq \
        .voyn-sp-fixr2-we.frq \
      > .tmp

    join \
        -a 1 -a 2  -e 0 -j1 2 -j2 1 -o '0,1.1,2.2,2.3' \
        .voyn-sp-fixr2-ws.frq \
        .tmp \
      | gawk ' {printf "%s   %5d %5d %5d\n", $1, $2, $3, $4}' \
      > .voyn-sp-fixr2-wsme.frq

      let   ini   mid   fin
      --- ----- ----- -----
      2     189     8     2
      R      15    79   524

============= PARTIALLY CLEANED-UP REPORT FOLLOWS =============

Overview

  This is a summary of my attempts to discover the "true" character,
  syllable, and word boundaries in the VMs text.

Basic assumptions

  I will assume a priori that the text is mostly prose in some natural
  language, using a peculiar invented alphabet; possibly with
  abbreviations and calligraphic embellishments, but without any
  "hard" encription.  In other words, the I assume that a person who
  spoke the language could learn to read the VMs in "real time", with
  no more effort than it takes to learn any other phonetic alphabet.

  One thing that must be kept in mind is that the text is contaminated
  by transcription errors.  It is hard to estimate their frequency,
  because their distribution is strongly non-uniform. The Friedman and
  Currier transcriptions often differ in the grouping of strokes: what
  is "CI" for one is "A" for the other.  also there are differences in
  the counting of "I" strokes, so that "M"s turn into "N"s and
  vice-versa.  These kinds of errors probably affect a few percent of
  certain letters.

  To further complicate matters, there is strong evidence that the
  Voynich manuscript we have is a copy made by two or more persons
  who did not understand the original.  If this hypothesis is true,
  then the manuscript itself is probably full of copying errors.  It
  is quite possible that the copyists had to make their own guesses
  as to alphabet, word spacing, etc.  

Source material

  To minimize distracting artifacts from scribal and topic variation,
  all analysis will be done on the "language B" part of the Biological
  section.  Specifically, I will use Currier's transcription,
  converted to FSG notation, as extracted from Landini's EVMT 1.6
  interlinear edition.

Frequency anomalies around line breaks

  The key observation is that the character and n-gram
  statistics adjacent to line boundaries are very different
  from those within the line.  These differences were already
  observed by Currier, who interpreted them as evidence that the
  line was a functional unit.  

  The difference is real, but the way Currier put it is rather
  misleading.  A better way to say it is "the context of line breaks
  is statistically different from that of word spaces".

  Below I argue that character counts near line breaks seem anomalous
  because the latter were made mostly at "true" word boundaries; and
  line breaks seem different from spaces because the latter are *not*
  true word boundaries.

  The hypothesis that the VMs blanks are not word spaces was explored
  by Robert Firth in his note #??.  He specifically assumed that the
  blanks marked phonetic units, such as stressed syllables.  There are
  other plausible explanations, however, so it seems more prudent to
  avoid making unnecessary assumptions about the causes of the
  difference.

Unfavorable line breaking contexts

  One specific peculiarity of line breaks is that there are
  combinations of FSG characters that occur very often in the text,
  but rarely or never occur straddling a line break. Here are some
  extreme examples:

     tot occurs   at newline  pattern
    -----------  -----------  -----------
     1899 0.064      1 0.001  C:8
     1347 0.046      0 0.000  O:E
     1436 0.049      0 0.000  O:D
     1629 0.055      0 0.000  4:O
     2056 0.070      0 0.000  8:G

  The first entry of this table says that the FSG character pair
  "C8" occurs 1899 times in the sample text (6.4% of all character),
  ignoring all spaces and line breaks; but there is only one line
  that ends with "C" and is followed by a line that begins with "8"
  (about 0.1% of all line breaks).  

  Note that the average line contains 40 characters; so, if lines
  were broken completely at random, we would expect "C8" to occur
  about 1899/40 = 47 times across a line break.

  Similar anomalies can be seen in the tetragram frequencies:

     tot occurs   at newline  pattern
    -----------  -----------  -----------
      275 0.009      0 0.000  DC:C8
      162 0.006      0 0.000  OE:SC

  The first line says that the tetragram "DCC8" occurs 275 times
  in the text (0.9% of all tetragrams), if we delete all
  line and word spaces; but not once do we find a line
  that ends with "DC" followed by a line that begins with
  "C8".  If lines were broken at random, we would expect
  about 275/40 = 7 line breaks between a "DC" and a "C8".

Favorable line breaking contexts

  Conversely, certain character combinations are far more common
  around line breaks than within lines, as these examples show:

     tot occurs   at newline  pattern
    -----------  -----------  -----------
       18 0.001     17 0.022  K:2
       15 0.001     11 0.014  R:P
       62 0.002     43 0.057  G:P

  The first line says that the digraph "K2" occurs 18 times in the 
  sample, and 17 of those occurrences are split across a line break.
  If line breaks were random, the expected number of occurrences
  at line breaks would be 0.5 or so.

     tot occurs   at newline  pattern
    -----------  -----------  -----------
       74 0.003     20 0.026  OE:4O
       30 0.001     16 0.021  8G:2O
       15 0.001     11 0.015  8G:HT
       16 0.001      6 0.008  8G:PT
       12 0.000     11 0.015  8G:PO
        9 0.000      5 0.007  8G:HO
       11 0.000      5 0.007  8G:GH
       10 0.000      6 0.008  RG:4O
       29 0.001     14 0.018  AR:4O
       54 0.002     18 0.024  AE:4O
       23 0.001      7 0.009  AM:4O
       12 0.000      9 0.012  AE:2O
        7 0.000      6 0.008  EG:2A

  The favorable and unfavorable line-breaking contexts are strong
  evidence that the lines were not broken at random, but only in
  certain "allowed" contexts.  Presumably, the favorable contexts
  correspond to boundaries between natural linguistic
  segments---characters, syllables, or words.

Other factors

  Of course, there are several other possible causes for those
  statistical anomalies.  For instance, if lines are broken at word
  boundaries by the trivial "greedy" algorithm (break before the
  first word that would not fit in the line), then line breaks will
  be more likely before long words than short ones.  Thus, line
  break statistics will be biased towards contexts of the form 
  (end of short word):(start of long word).

  On the other hand, a human scribe will probably avoid (consciously
  or unconsciously) breaking a line between an article or
  preposition and the following noun.  In that case, the line break
  statistics will be biased towards contexts of the form (end of
  long word):(start of short word).

The "words" of the Voynich manuscript

  Text in the VMs is clearly broken by spaces into 
  groups of characters that superficially look like 
  words of the language. Here is a sample (in FSG notation):

    FTC8GDARG ODCCG 4ODAR SGDTC8G 4ODAR SC8G
    8AN SCG EG 2SCOE 4OETC8G TC8GDAR TCDCC8G RAR
    4ODAN TAK ODTCG 4CG DAN SCCDG EHAN OEDAR OR
    8ODZG EDAKO GDCCG ESCG DAE 8G SCG OD SCG 4ODCC8G
    SCGDAR SCG DZCG R AN OE OESCC8G 4ORCCG 4ODG
    PTC8G 4ODS8G GHAN TC8G 4ODAR TG EOE TC8G 4ODG
    2ARTCG 4OHAR8G 8SCDZG 4ODAN TDZG ESC8G ODCC8G
    4ODT8G TCHCG EO 4ODC8G 4ODAN TCCDCG 4ODAP OETC8G 2AE
    8SOR 4OHAR T8G SCG 4ODAN C*DZG8G OHCG HC8G ETC8G
    4ODC*8G 4ODAIR OEG 4ODCC8G 8G 4ODAE ODAR SC8G 8OR TCDAK
    2SCDZG 4ODAE OEG SCG R OE TCCG SCG 8G ESC8G 4ODG
    PTC8G DCC8G 4ODC8G 4ODC8G 4ODC8G 4ODC8G 4ODAN OESC8G

Are the "words" really words?

  Unfortunately, there is evidence that the groups of letters
  delimited by spaces are not words in the ordinary linguistic sense.

  For one thing, the average length of those groups is rather short,
  particularly if we consider that some letters of the "true"
  Voynich alphabet must be combinations of two or more FSG
  codes. (For instance, it is very likely that the combinations
  "SC", "SCC", "TC", and "TCC" are single letters.)

  Moreover, the sequence of "words" is highly repetitive: we often
  see the same word repeated two or more times over a few lines, as
  exemplified by "4ODAN" in the above sample.  Sometimes the same
  word is repeated several times in consecutive positions, as
  "4ODC8G" in the last line.

  Finally, as noted by Robert Firth, the "words" have a rather rigid
  structure; or, said another way, the spaces are strongly
  correlated with the adjacent letters.  If we consider the letter
  frequencies at the beginning, middle, and end of words, we get:

    FSG   ini   med   fin
    --- ----- ----- -----
     4   1328    16     2
     O   1107  1817    24
     S    639   242     3
     T    678   442     1
     8    325  1685    50
     A     89  1350     0
     C     22  3442     4
     D     91  1694    11
     H     91   674     3
     F      5    24     1
     P     26    96     4
     I      1    47     0
     Z      0   306     1

     G     65    66  2756
     M      0     1   330
     N      0     3   390
     6      1     0     3
     K      0     1     5
     L      1     2     3

     2    112    17    47
     R     92    70   479
     E    245   560   801

  Except for "E", "2", and "R", we can hardly find 
  a letter that occurs in significant numbers at all three
  positions.  In fact, it looks like the blanks have been
  inserted into a continuous text according to very simple
  rules:

    after  every "6", "G", "K", "L", "M", "N", "R", or
    before every "2" or "4".

  These trivial rules correctly reproduce 83% of the original spaces
  in the sample text, and insert only 12% additional ones.  Said
  another way, if we try to predict whether there is a blank between
  two consecutive non-blank letters, these rules will give the right
  answer 94% of the time.  
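
  For reference, the two rules can also be applied directly to
  de-spaced text; this is equivalent to the ":"-separator sed step
  used in the pipelines above:

```shell
printf '%s\n' '4ODAR8AMSCG' \
  | sed -e 's/\([GKLMN6R]\)/\1 /g' \
        -e 's/\([42]\)/ \1/g' \
        -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
# -> 4ODAR 8AM SCG
```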

  (Incidentally, these scores could be improved to 84%, 9%, and 95%,
  respectively, if we assumed that all 92 "R"s preceded by space are
  actually "2"s, and all 64 "2"s not preceded by space are actually
  "R"s.  Moreover, it must be kept in mind that many spaces were
  omitted or inserted during the transcription.)

  Here are the first few lines of the text, showing the effect of
  these rules: " " is a correctly predicted space, "+" is a
  predicted but non-existing space, "-" an existing but
  non-predicted space.

    FTC8G+DAR+G ODCCG 4ODAR SG+DTC8G 4ODAR SC8G
    8AN SCG EG 2SCOE 4OETC8G TC8G+DAR TCDCC8G R+AR
    4ODAN TAK ODTCG 4CG DAN SCCDG EHAN OEDAR OR
    8ODZG EDAK+O-G+DCCG ESCG DAE-8G SCG OD-SCG 4ODCC8G
    SCG+DAR SCG DZCG R AN OE-OESCC8G 4OR+CCG 4ODG
    PTCG+DCCAR OEDG 8AR ODCG 4ODAN THZG 4ODCC8G 4ODG
    PTC8G 4ODS8G G+HAN TC8G 4ODAR TG EOE-TC8G 4ODG
    2AR+TCG 4OHAR+8G 8SCDZG 4ODAN TDZG ESC8G ODCC8G
    4ODT8G TCHCG EO 4ODC8G 4ODAN TCCDCG 4ODAP-OETC8G 2AE
    8SOR 4OHAR T8G SCG 4ODAN C*DZG+8G OHCG HC8G ETC8G
    4ODC*8G 4ODAIR OEG 4ODCC8G 8G 4ODAE-ODAR SC8G 8OR TCDAK
    2SCDZG 4ODAE-OEG SCG R OE-TCCG SCG 8G ESC8G 4ODG
    PTC8G DCC8G 4ODC8G 4ODC8G 4ODC8G 4ODC8G 4ODAN OESC8G

  If the space-delimited groups are words of the language,
  it follows that either the phonetics of the 
  language is uncommonly constrained, or that the writing
  system uses different symbols for the same phonetic
  element depending on its position in the word.  

  Another possible explanation is that the letter groups are 
  syllables, or other linguistic elements smaller than a
  word.  This alternative would explain not only the
  predictability of spaces, but also the shortness of the
  words and the repetitiveness of the text. 

Word spaces are unlike line breaks

  As further evidence that the spaces do not define words, we note
  that there are combinations of FSG characters that often occur
  adjacent to or across space characters, but not across line
  breaks; and vice-versa. For example

    at NL or SP   at NL only  pattern
    -----------  -----------  -----------
      196 0.030      2 0.003  E:T
       63 0.010      0 0.000  R:A
      111 0.017      0 0.000  N:T
      110 0.017      1 0.001  M:T

  The first line here says that there are 196 occurrences of "ET"
  straddling a blank (a space and/or a line break), amounting to 3%
  of all blanks; but only 2 of those occurrences straddle a line
  break, which is 0.3% of all line breaks.

  Note that there are 6420 blanks in the sample, of which 765 are
  line breaks; i.e. 8.4 gaps per line on the average. So, if lines
  were broken randomly at the same places where blanks are inserted,
  we would expect 196/8.4 = 23 occurrences of "ET" straddling a line
  break.

  There are many possible explanations for these anomalies.  For
  instance, some characters may be written differently at
  line-start or line-end than in the middle of a line (this
  is probably the case of some of the "P" and "F" gallows).  Some
  characters may be abbreviations that are only used at line-end,
  for lack of space (FSG "6" and "K" may belong to this class.)
  Certain characters and character groups may be used only at
  sentence-start or sentence-end, which tend to coincide with line
  breaks.  

The "word/syllabe" theory

  A particularly simple and likely explanation for the space
  vs. newline anomalies is that the latter occur mostly at "true"
  word boundaries, whereas the spaces define smaller units, such as
  syllables.

  However, this "word/syllabe" explanation is not entirely
  satisfactory.  Assuming that the VMs author choose to break the
  text into syllabes rather than words, why would he bother to break
  lines only between words, rather than between syllabes?

  Moreover, if we accept that explanation, we must also assume that
  a fair fraction of the spaces would be unbreakable. In that case,
  it seems that the right margin should look more ragged than it is.
  Specifically, if "|" represents the ideal margin,
  we should see line breaks of the form

     | xxx xxx xxxxxx xxx xxxx xxxxxx xxxxxxx  |
     | xxxxx xxxx xxxxxx xxxx xxxxxx xxx       |
     | yyy xxx xxxxxx xxx xxxx xxxxxx xxxxxxx  |

  where a short word "yyy" that would have fit on one line got
  pushed to the next line, because the space following was not a
  word boundary.  In fact, such "premature breaks" seem nonexistent,
  at least in the few pages I have seen.

Positive line breaking anomalies

  A stronger argument against the word/syllabe theory is the 
  existence of patterns that are *more* likely around 
  line breaks than could be explained by it. Consider for
  example

    at NL or SP   at NL only  pattern
    -----------  -----------  -----------
       53 0.008     28 0.037  E:2
       20 0.003     12 0.016  E:G
       70 0.011     36 0.047  G:G

  The first line says that there are 53 occurrences of "E" and "2"
  separated by a space or newline; and 28 of them are actually line
  breaks.  If line breaks were just a random sample of the blanks,
  we would expect the second count to be 53/8.4 = 6.3.  But even if
  the line breaks were word boundaries, and the spaces were syllable
  boundaries, to get 28 line breaks between "E" and "2" we would
  have to assume an average of four syllables per word, or
  about two words per line.

  Thus, we seem forced to assume that the true word
  boundaries are not necessarily a subset of the spaces. 

The "bogus spaces" theory

  If the spaces are neither word boundaries, nor a superset thereof,
  what could they be?  

  Robert Firth proposed in one of his notes that the spaces mark
  phonetic units that may span true word boundaries; specifically,
  units defined by stress or quantity patterns, like the "poetic
  syllabes", or the "feet" of metric poetry.

  A variant of this theory is that the spaces have some purely
  phonetic function: they may indicate glottal stops, vowel
  lengthening, stress marks, etc.

  As we observed, the spaces can be accurately predicted from the
  adjacent letters. In fact, their occurrences seem too regular even
  for syllabic boundaries.  Thus, we must consider the hypothesis
  that the spaces are meaningless; specifically, that they were
  inserted mechanically, without regard for word boundaries,
  according to simple local rules like the ones we described above.

  Why would the writer choose to do so?  One obvious possibility 
  is to make the "cypher" harder to break.  

  Another possibility is that the VMs may be a copy made by someone
  who did not understand the text, from an original where the word
  spacing was absent or rather subtle.  Such a copyist could easily
  have mistaken the visually wider gap after certain
  characters (such as FSG "M") for word breaks.  Or he may have
  identified (consciously or unconsciously) certain characters
  such as "G" and "L" with similar-looking medieval Latin
  abbreviations, and hence assumed that they marked the end of
  words.

  Because of these observations, it seems that we should not take
  the blank characters of the VMs as word separators in the standard
  sense.  If we want to do semantics-oriented analysis (word
  location maps, label hunting, word correlations) we should either
  discard the spaces, and work with n-grams in arbitrary position; 
  or try to discover the "true" word boundaries by looking at those
  patterns that are common at line breaks.

Recovering the true word structure from line breaks

  Unfortunately, this second path is not very promising.  In most
  natural languages, the words and morphemes are usually defined by
  a lexicon---essentially, a large table of valid words, numbering
  in the thousands. We cannot expect to find a reasonably small set
  of letter patterns that can recognize the boundaries of words.

  What we *can* get out of the line break statistics, with reasonable
  confidence, is the structure of the syllables of the language---assuming,
  of course, that the line breaks are mostly true word boundaries,
  and these are a subset of the syllable boundaries.

  Here is the basic idea. Let's define a `split pattern' as an
  expression like .G:[24]O where "." is a wildcard and ":" is the place
  where we are considering inserting a word break (it matches an
  empty string).

  Let's assume that the line breaks in the VMs are mostly a random
  sample of the "true" word breaks. That is, each "true" word break
  in the text has basically the same chance of becoming a line
  break.  Let's assume furthermore that line breaks occur only at
  word breaks, i.e. a word is almost never split across lines.  (The
  paragraph structure skews this distribution somewhat, but since
  the average paragraph seems to have five lines or more, this
  particular bias cannot change the line-breaking statistics by more
  than 20%.)

STOPPED HERE

  Suppose we take the VMs and remove all spaces, newlines, and
  paragraph marks.  Let NT be the total number of occurrences of the
  split pattern in that compressed text; and let NL be the number of
  occurrences where the ":" falls where a line break existed in the
  VMs.

  Finally, let's assume that each VMs line contains mw "true" words,
  on the average.  In that case, if the split pattern describes a
  mandatory word break (i.e., if the pattern only matches the
  compressed VMs in such a way that the ":" matches a "true" word
  boundary), then we expect the ratio NL/NT to be about 1/mw.

  On the other hand, if the split pattern can occur also inside words,
  then the ratio NL/NT will be less than 1/mw.  If the pattern can
  occur at any position in the word, indifferently, then NL/NT should
  be close to 1/mc, where mc is the average number of characters per
  line (~40 in the Biological section). 

  Finally, if the pattern never occurs at a true word boundary, then 
  its frequency at line boundaries should be close to zero.
  The only occurrences should be "accidents" - words split
  across lines, scription/transcription errors, non-text 
  uses, etc.  If the probability of a line break occurring
  inside a word is less than 0.05, then such a "forbidden"
  pattern should not occur at line boundaries with any
  significant frequency.

  Thus, we can classify a split pattern into one of four classes:

    * mandatory word breaks: those for which NL > 1 and NL/NT > 1/mw

    * mandatory word NON-break: if NT > mc and NL/NT < 

    * possible word breaks: those with NL > 1, NT > 2mw, NL/NT <  ???

    * uncertain: those with NT < 2mc and NL < 2, or with NT < 2mw

Global digram frequencies, ignoring line breaks and word spaces: