Hacking at the Voynich manuscript - Side notes
077 Further comparisons of Recipes to the Shennong Bencao

Last edited on 2026-03-09 19:39:28 by stolfi

INTRODUCTION

This note makes further analyses comparing the Starred Paragraphs Section (SPS, aka Recipes Section) of the VMS to the Chinese medical classic Shennong Bencao Jing (SBJ).

SETUP

  ln -s ${HOME}/ttf
  ln -s ../work
  ln -s work/langbank
  ln -s ${HOME}/bin/gawkf.sh
  ln -s work/faf.sh
  ln -s work/chinese_funcs.py
  ln -s work/error_funcs.gawk
  ln -s work/error_funcs.py
  ln -s work/make_funcs.py
  ln -s work/process_funcs.py
  ln -s work/vms_linear_gray_image_funcs.py
  ln -s ../076/common_funcs_076.gawk
  ln -s work/compute_freqs_from_counts.py
  ln -s work/make_histogram.gawk
  ln -s work/turn_histogram_into_polygonal_line.gawk

LOOKING FOR LINES WITH DAIIN NEAR BEGINNING

??? REVISE EVERYTHING ???

CLEANED AND REGULARIZED TRANSCRIPTIONS

The first step is to produce transcriptions of the SBJ ("res/bencao-*.ivt") and of the SPS ("res/starps-*.ivt") that are (1) cleaned of complicated artifacts like weirdo codes, inline comments, and other markup, and (2) in compatible formats that can be processed by the same scripts.

Transcription line format

These files all have the same format: each line has fields "<{LOC}> {TEXT}" where {LOC} is a locus identifier in the original book, and {TEXT} is a chunk of transcribed text.

For the SBJ the field {LOC} has the form "{SSEC}.{SSUB}.{SLIN}" where {SSEC} is "b1" to "b3", {SSUB} is one digit 1..6, and {SLIN} is a 3-digit line number starting from "001" that is unique within each {SSEC}. (The {SLIN} was supposed to be sequential, but some lines of the original file had to be split and thus got out-of-sequence {SLIN}s.) An example is "b1.2.023".

For the SPS the field {LOC} is the standard VMS locus ID, namely "f{PAGE}.{LSEQ}" where {PAGE} is a 3-digit number 103..116 followed by "r" or "v", and {LSEQ} is a line number in the page, from 1. An example is "f105v.32".
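The two locus formats above can be parsed with a few lines of Python. This helper is illustrative only (it is not one of the scripts symlinked above):

```python
import re

# Parse a locus ID of either book into a tuple, following the formats
# described above: "{SSEC}.{SSUB}.{SLIN}" for the SBJ ("b1.2.023"),
# "f{PAGE}{r|v}.{LSEQ}" for the SPS ("f105v.32").
SBJ_LOC = re.compile(r'^(b[123])\.([1-6])\.(\d{3})$')
SPS_LOC = re.compile(r'^f(10[3-9]|11[0-6])([rv])\.(\d+)$')

def parse_locus(loc):
    m = SBJ_LOC.match(loc)
    if m:
        ssec, ssub, slin = m.groups()
        return ('bencao', ssec, int(ssub), int(slin))
    m = SPS_LOC.match(loc)
    if m:
        page, side, lseq = m.groups()
        return ('starps', int(page), side, int(lseq))
    raise ValueError('unrecognized locus: %r' % loc)
```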
Four pages of the SPS, "f109r" to "f110v", are missing in the physical manuscript. One line, "f105r.10", is omitted because in my transcription of the SPS its {TEXT} was merged into the next line. Three other lines are omitted because they are believed to be titles. The last page is "f116r" (not "f116v"), and only lines 1 to 30 are assumed to be part of the SPS.

Transcription file names

The basic naming scheme for all transcription files is "res/{ivt_name}.ivt" where {ivt_name} is "{book}-{bsub}-{utype}-{ltype}", and

  {book} is the source book, "bencao" (SBJ) or "starps" (SPS).

  {bsub} is a subset of the book. For "bencao" the only {bsub} of interest is "fu", the full book. For "starps" there are two subsets: "fu", the full book, and "gd", a subset that consists of the lines which seem to be correctly grouped into parags -- excluding those blocks of lines that seem to be two or more parags run together.

  {utype} specifies the nature of the text and what constitutes a unit for statistical purposes. The {utype} can be "ch" or "ps" for the SBJ, and "ec", "wp", or "wc" for the SPS. See {cleanup_starps_raw_text}, {cleanup_raw_bencao_text}, and {}.

  {ltype} specifies the amount of text that is in each line of the file. For "bencao" files it must be "par", meaning each line of the file is one parag ("recipe", entry) of the SBJ. For "starps" files it may be "par", meaning that each line of the file is a parag of the SPS, or "pag", meaning that each line of the file is a whole page. The latter only makes sense for the "fu" subset.

UNITS OF COUNTING

The {utype} field in file names specifies the nature of the raw and cleaned transcription {TEXT} and the unit used to measure parag sizes and word positions. See {} in {size_position_funcs.py} for a detailed explanation.

TWO-FILE COMPARATIVE HISTOGRAMS

The procedures below produce various ".upp" files for different source files (SBJ and SPS), different source formats (pinyin with split or joined compounds, Unicode Chinese characters, etc.)
and different filtering policies (all parags or only some, etc.).

Each line of a ".upp" file contains the location {loc[ip]} of a parag {ip} and its size {W[ip]}, defined as the number of /units/ (which may be syllables, characters, words, etc.). With each ".upp" file we make a histogram, where each bin {kb} has a range {(wlo[kb] _ whi[kb]]} and a count {P[kb]} of parags {ip} which have {wlo[kb] < W[ip] <= whi[kb]}.

To meaningfully compare two ".upp" files "res/{file0}.upp" and "res/{file1}.upp", we draw their histograms on the same plot, scaling the number of words {W[ip]} of each parag in dataset {file1} by a factor {W1_mag = W0_avg/W1_avg}, where {W0_avg} and {W1_avg} are the averages of {W[ip]} in the two datasets. Then we scale the parag counts {P[kb]} in each bin of histogram 1 by {P1_mag = P0_num/P1_num}, where {P0_num} and {P1_num} are the numbers of parags in the two datasets. These adjustments make the average word count and the total parag count of {file1} the same as those of {file0}.

VOYNICH STARRED PARAGS FILE

For the SPS of the Voynich manuscript, I used the transcription that I created in note 076 as the source. It has been moved to note 074 and included in the joint file "../074/join25e1.ivt". Actually I used the extracted file "../074/st_files/str-parags.ivt", which is similar but cleaner -- with no '{}', weirdo codes, inline comments, page headers, titles, etc. In particular, that file should not contain these lines, assumed to be section titles:

  »okeeey.qokeeey.okeey.okeey«
  »otaiir.{Ch}edaiin.otair.otaly«
  »ol{Ch}ar.ol,{Ch}edy.l{Sh}y.otedy«

It is possible that other section headers were not recognized as such and were joined with adjacent parags.

We still need to explicitly exclude the line , that in the "U" transcription is just "?", since its contents were considered "overflow" due to scribal error and were moved to the end of the next line.
From that ".ivt" file I created a new ".ivp" file where all the lines of each parag are joined together with a "-" separator. As part of this conversion I deleted all alignment markers [«=»], parag markers "<%>" and "<$>", and inline comments "", including "", "", "", "". In this file there are no '{}' around ligatures, and all EVA letters are lowercase. See "convert_starps_lin_ivt_to_par_ivt.py" and the rules for "res/starps-par.ivt"in "make_rules_starps.py". 1061 total file lines 1061 non-comment, non-blank lines 23 distinct pages seen on input 23 pages with parags 327 parags 1061 parag data lines 0 data lines not in parags 327 data lines written COUNTING ENTRIES AND WORDS IN THE STARRED PARAGS SECTION When counting words or chars per paragraph in the SPS, we used the file "res/starps-par.ivt" created above. It should contain iny parag text, one parag per line. We assumed that "-" was a word break, the same as ".", and that each "," had a certain probability {pr_comma} of being a word break instead of being nothing. We then counted words in each parag as one plus the expected number of word breaks, rounded to integer. There are two pages in that section where several parags seem to be run together and the marginal stars seem to be placed at random. There are also parags and parag breaks that are dubious for other reasons. We tried two analyses, "starps-fu" with all parags, and "starps-gd" with those problematic parags excluded. See the script "count_units_per_parag.py" and the make rules for "res/starps-fu-{utype}.upp" and "res/starps-gd-{utype}.upp" where {utype} is "ec", "wc", or "wp". Good parags only version: ??? 240 total file lines 240 non-comment, non-blank lines 240 parags 8098 total est. words 33.74 average words per parag 944 total commas 3.93 average commas per parag 22 pages Full (all parags) version: ??? 327 total file lines 327 non-comment, non-blank lines 327 parags 10579 total est. 
words 32.35 average words per parag 1196 total commas 3.66 average commas per parag 23 pages After scaling for the different parag count and total word count the histograms of each {utype} ("wc" or "wp") for both subsets are almost identical. They are both strongly unimodal, with a split at ~22 words per parag. The shortest and longest parags are the same in both version. The unique shortest parag is with only 6 words, "pched.shedain.qokaiin.okar.chedy.checkhy". The second smallest one one is with 11 words, also unique. The longest parag is with 75 words, "lteedy.okeeddl.sheokedy.qokedy.shol.kol.aiirol.qopchedy.daiin.okedy...", but there is also with 72 and with 70. Only the last one is in the "good" subset. The fourth largest is with 64. BENCAO FILE For the SBJ, I created a file "in/bencao-3.chu" by merging and reordering two versions of the Shennong Bencao Jing obtained from the net, and reformatted the result to look sort of like an EVT file. The text is in Chinese characters (mixed traditional and simplified), in the Unicode UTF-8 encoding. See the comments in the files for details of the cleanup. It has #-comments before sections and subsections. It uses Chinese punctuation, apparently just ':' (Unicode FULLWIDTH COLON, between recipe name and body, replacing ' ' IDEOGRAPHIC SPACE from the internet files), ',' (FULLWIDTH COMMA, replacing IDEOGRAPHC COMMA '、' from one of the internet files), and '。' (IDEOGRAPHIC FULL STOP). The locus ID codes are <{SEC}.{SUB}.{LSEQ}> where the section code {SEC} is "b1" ("High Grade"), "b2" (Medium Grade"), or "b3" ("Low Grade"), {SUB} is a digit 1-6 (Minerals, Herbs, Trees, Insects and Animals, Fruits and Vegetables, Rice and Grains), and {LSEQ} is a three-digit recipe number, sequential from "001" bu section (NOT by subsection). From that file I created a pinyin version "in/bencao-3.pyj" by feeding it in chunks through Google Translate and copy-pasting its pinyin readings. It has joined compounds (hence the "j" in ".pyj"). 
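An aside on the SPS word counts described above: the comma-break rule ('.' and '-' are always word breaks, each ',' is a break with probability {pr_comma}) can be sketched as follows. The function name and default probability are illustrative, not those of "count_units_per_parag.py":

```python
# Estimated word count of one parag line from "res/starps-par.ivt":
# one plus the expected number of word breaks, rounded to an integer.
# '.' and '-' are certain breaks; each ',' breaks with prob. pr_comma.
def est_word_count(text, pr_comma=0.5):
    sure = text.count('.') + text.count('-')
    maybe = text.count(',')
    return round(1 + sure + pr_comma * maybe)
```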
The encoding is Unicode UTF-8 for accented letters like 'ā' and 'ǚ'. All letters are lowercase. It uses '.' and ','. NOTE: the ':' after each recipe name was lost.

Then from that file I created "in/bencao-3.pys" by piping it through "split_pinyin_compounds.py".

    441 total lines
    359 total data lines
  11220 total words/compounds in
  13267 total syllables out
  11231 total punctuation and spaces

Excluding the section titles, the shortest remaining entry seemed to be normal:

  鼯鼠:主墮胎,令易產。
  wú shǔ zhǔ duò tāi, lìng yì chǎn.
  Flying squirrel: causes abortion, makes childbirth easier.

(If that can be called "normal"...)

COUNTING ENTRIES AND WORDS IN THE SHENNONG BENCAO JING

We have two versions of the SBJ: "original" with Chinese characters in Unicode UTF-8, and the same converted to pinyin by Google Translate.

The program that does the counting of words per recipe in the "original" SBJ (Unicode Chinese chars) excludes all blank lines and #-comments (including the subsection titles) and the introduction (part "s0"). It also excludes the Chinese ideographic punctuation marks [。,:]. It counts each Chinese character (syllable) as one word. See the Makefile for "res/bencao-fu-chu.upp".

    435 total lines
    364 recipes
   3230 ignored chars of type 'punct'
    361 ignored chars of type 'blank'
  13144 total syllables
  36.11 avg syllables/recipe

The script that does the counting for the pinyin version of the SBJ is "count_bencao_pinyin_units_per_parag.py". It excludes all blank lines and #-comments (including the subsection titles) and the introduction (part "s0"). It also excludes punctuation marks [.,;()*]. The program has an option "-join" to count each multi-syllable compound as a single word, or "-split" to try to split every such compound into syllables and count syllables.
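The character-based count for the Unicode Chinese version (each character one syllable, punctuation ignored) can be sketched in a few lines. This is illustrative only, not the actual counting script:

```python
# Per-recipe syllable count for the Unicode Chinese text: every
# non-blank character counts as one syllable, except punctuation.
# PUNCT covers both the fullwidth marks named above and their ASCII
# look-alikes, since the transcript may use either form.  (The real
# script also skips blank lines, #-comments, and part "s0".)
PUNCT = set('。，：:,')

def count_syllables(recipe_text):
    return sum(1 for ch in recipe_text
               if ch not in PUNCT and not ch.isspace())
```

For the "Flying Squirrel" entry quoted above this gives 8, matching the stated 8 syllables.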
See the Makefile for "res/bencao-fu-pys.upp" and "res/bencao-fu-pyj.upp".

Splitting compounds:

    450 total lines
    364 recipes
  16408 total tokens
  13127 total words
  36.06 avg words/recipe

Without splitting compounds:

    450 total lines
    364 recipes
  14128 total tokens
  10882 total words
  29.90 avg words/recipe

The counts from "bencao-fu-pys" and the Unicode Chinese version "bencao-fu-chu" agree perfectly.

The counts for "bencao-fu-pys" and "bencao-fu-pyj" have slightly different histograms even after scaling for the different average word count (scaling factor = 1.2063). The latter has gaps at 26, 32, 38, 44, 50 words because of rounding. Maybe we should use wider bins for these comparisons.

The histograms have notable minima at ~24-25 words per parag and at ~35 words per parag.

The unique shortest SBJ parag is (the "Flying Squirrel" one, "wú shǔ : zhǔ duò tāi, lìng yì chǎn"). It has 8 syllables (and 7 words in "-pyj"). The second shortest, also unique, is with 12 syllables (9 words in "-pyj"). The unique longest parag is (the "Red-Crowned Rooster") with 92 syllables, "dān xióng jī wèi gān wēi wēn. zhǔzhì nǚzǐ bēng..." (76 words in "-pyj"). The second and third longest, unique but close, are with 73 syllables and with 72 (60 and 62 words -- reversed -- in "-pyj"). Then there are two other parags with 70 syllables, then a gap to 64 syllables.

COMPARATIVE SBJ-SPS HISTOGRAMS

We created two-histogram plots comparing various combinations of SBJ and SPS ".upp" files.

The plots of "bencao-fu-ch" and "starps-fu-wc" (scaled) were surprisingly similar. The scale factors for the latter were {W1_mag} = 1.1462, {P1_mag} = 1.0826. Both had a big hump from {W=26} to {W=48}, with a small but significant drop around {W=35}. Both had similar shortest parags (with word counts {W0=8} and scaled {W1=7}), and similar second-shortest ones ({W0=12} and scaled {W1=13}). Both had somewhat similar longest parags, with {W0=92} and scaled {W1=85}.
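For reference, the factors {W1_mag} and {P1_mag} quoted here follow the recipe given in the TWO-FILE COMPARATIVE HISTOGRAMS section. A minimal sketch (not the actual plotting script):

```python
# Scaling of dataset 1 to match dataset 0, as described under
# "TWO-FILE COMPARATIVE HISTOGRAMS".  W0 and W1 are the lists of
# parag sizes W[ip] from the two ".upp" files.
def scale_factors(W0, W1):
    W0_avg = sum(W0) / len(W0)
    W1_avg = sum(W1) / len(W1)
    W1_mag = W0_avg / W1_avg    # multiplies each parag size of dataset 1
    P1_mag = len(W0) / len(W1)  # multiplies each bin count of histogram 1
    return W1_mag, P1_mag
```

After applying both factors, dataset 1 has the same average parag size and the same total parag count as dataset 0, so the two histograms are directly comparable.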
We also did the same comparison with "starps-gd-wc" instead of "starps-fu-wc", but the plots were almost identical, even though the scaling factors were different ({W1_mag} = 1.1484, {P1_mag} = 1.1569).

Notable differences were an excess of SPS parags with scaled counts {W1} below 25, notably 15..19, 21, 22, and 24. Unscaled, those seem to be 14..19, and 21, respectively, with 18, 19, and 21 being the worst. Also the second, third, and fourth largest parags in the SPS, with unscaled {W} 72, 70, and 64, do not seem to have a match in the SBJ.

We isolated from "res/starps-fu-wc.upp" the parags with counts 18, 19, and 21, and checked their pages:

  list_pages_with_parags_of_given_sizes.sh "18,19,21" \
    < res/starps-fu-wc.upp \
    > .badpages-18

Those offending parags were distributed all over the SPS, but there were 4 each on f106r and f106v. Also 2 each on f105r and f105v, and 2 each on f108r and f108v. They may have been pages where the writing was more cramped and so more words were run together, resulting in fewer words per parag. Must revise the commas in those two pages.

Adding the smallest batch we get

  list_pages_with_parags_of_given_sizes.sh "14,15,16,17,18,19,21" \
    < res/starps-fu-wc.upp \
    > .badpages-14

The worst pages are still f106r and f106v, with five anomalous parags each. Then f103r with 4, then f107v, f108r, f112v, f113r with 3 each.

Let's list the pages with the three anomalous big parags:

  list_pages_with_parags_of_given_sizes.sh "64,70,72" \
    < res/starps-fu-wc.upp \
    > .badpages-64

Two of these are on f105v, one on f115r.

STATISTICS

??? REVISE

The basic statistics for the number of words per parag (upp) of the two files are:

  statistic   | bencao  | starps
  ------------+---------+--------
  parags      |     364 |    327
  words       |   13144 |  10699
  min upp     |       8 |      6
  max upp     |      92 |     77
  avg upp     |    37.1 |   32.7
  dev upp     |     9.7 |   12.0

LOOKING FOR SIMILAR SEQUENCES OF PARAGS

We create a grayscale image where the pixel in row i and column j is proportional to the similarity of the word counts of parag i from the SPS and recipe j from the SBJ.

  ./make_similarity_image.sh ???NAME1 UNIT! NAME2 UNIT2

>>> TO REDO >>

LOOKING FOR REPEATED WORD PATTERNS

Challenged by the internet, will try to look for repeated patterns in the two files.

  ./count_repeated_tuples.sh 5 bencao-fu-pys.ivt
  ./count_repeated_tuples.sh 3 starps.ivt

======================================================================

Here are some advances in the comparison between the Starred Parags (SPS) section and the Shennong Bencao Jing (SBJ). Recall that the files are:

[quote="Jorge_Stolfi" pid='67750' dateline='1750041874']

starps.eva
  The Starred Paragraphs section (SPS) from Takeshi's transcription in the 1.6e6 interlinear file, from page f102r to line 30 of f116r. With one parag per line, in the EVA encoding, with all alignment fillers and comments removed, all weirdos and missing chars mapped to '*', one "=" at start and end of each line (= parag).

bencao-fu-pys.ivt
  The SBJ from the webpage posted by @oshfdk, minus the introduction 《上卷》 and section headers (see below), converted to pinyin by Google Translate, mapped to lowercase.

Both files are in UTF-8 encoding. Again, if you just click on those links you will see gibberish, because the server at my Univ expects plain text files to be in ISO-Latin-1 and thus messes up the formatted HTML that it sends to your browser. You will have to download the files and look at them with any text editor or viewer that understands UTF-8.

[/quote]

Here is the histogram of the word counts upp:

At first sight the histograms are different, but there are some intriguing similarities.
Note that both files have 23 entries with 27 words (the most common entry length in both files), six entries with 23 words, 8 entries with 37 words, 2 entries with 47 words, one entry with 53 words, one entry with 59 words, and one entry with 62 words. In both files, there are anomalously few entries with 23, 37, and 43 words.

Considering the missing bifolio in the SPS quire, we have 6 surprising near coincidences: the number of entries, and the mode, min, max, average, and deviation of the number of words per paragraph. (The total number of words is not an extra coincidence, since it is the average npw times the number of entries.)

Compared to the SBJ, the SPS has a somewhat broader npw histogram, as implied by the standard deviation. It has more entries with 10-20 words and 35-70 words, and fewer with 21-34 words. In particular, the SBJ has a second mode, 23 parags of 34 words, whereas the SPS has only 11.

These discrepancies could be the result of some word spaces being incorrectly inserted or omitted in the SPS as it was digitized, somewhat at random, with almost the same probability. Alternatively, some parag breaks in the SPS may be wrong, causing, for example, two consecutive parags that should have 22 and 32 words to become parags of 16 and 38 words, and two parags that should have 7 and 76 words to become parags with 13 and 70 words. Both kinds of errors would have little effect on the average npw, but would increase its standard deviation, as observed.

There is also the bonus coincidence of both files originally having subsection titles with ~4 words each, although the number of such titles is vastly different. More on that later.

Now for the bad news. As @oshfdk observed, there are hundreds of multiword sequences that occur many times in the SBJ.
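Such repeats can be found with a simple sliding-window count. This sketch is a guess at the spirit of "count_repeated_tuples.sh", not the script itself:

```python
from collections import Counter

# Count every n-word window over a token list and report the tuples
# that occur more than once, with their occurrence counts.
def repeated_tuples(words, n):
    wins = Counter(tuple(words[i:i + n])
                   for i in range(len(words) - n + 1))
    return {t: k for t, k in wins.items() if k > 1}
```

Running it with n = 10 over the pinyin SBJ would flag phrases like the one quoted below.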
In particular, there is a 10-word phrase that occurs six times, on six consecutive lines:

[code]
久食輕身不老,延年神仙。一名
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
jiǔ.shí.qīng.shēn.bùlǎo.yán.nián.shénxiān.yī.míng
Eating it for a long time will make you light and immortal. It is also called
[/code]

On the other hand, in the SPS the longest phrases that occur more than once have only 3 words, and the most common occurs only three times:

[code]
chedy.qokeey.qokeey
chedy.qokeey.qokeey
chedy.qokeey.qokeey
[/code]

I will discuss the implications of this difference in another post.

======================================================================

>> REDO THIS >>
>> REDO THIS >>

THE DATA

Raw data files, with comments prefixed with "#", recipe numbers in the form S-NNN prefixed with "##", and each kanji surrounded by ASCII spaces:

  ln -s ~/IMPORT/texts/chinese/ShennongBencao/text.big5 bencao-raw.big5
  ln -s ~/IMPORT/texts/chinese/ShennongBencao/text.jis bencao-raw.jis

Data files without punctuation:

  cat bencao-raw.big5 \
    | gawk \
        ' /^#/ {print; next;} \
          // { \
            gsub(/[ ]+[{}][ ]+/, " ", $0); \
            gsub(/[ ]+[\241][\264]/, "", $0); \
            gsub(/[ ]+[\241][D]/, "", $0); \
            print; \
          } ' \
    > bencao.big5

  cat bencao-raw.jis \
    | gawk \
        ' /^#/ {print; next;} \
          // { \
            gsub(/[ ]+[{}][ ]+/, " ", $0); \
            gsub(/[ ]+[\201][\234]/, "", $0); \
            gsub(/[ ]+[\201][D]/, "", $0); \
            print; \
          } ' \
    > bencao.jis

  dicio-wc bencao{-raw,}.{big5,jis} vstars.eva

    lines   words     bytes  file
  ------- ------- ---------  ------------
     2008   17532     57611  bencao-raw.big5
     1510   19003     70177  bencao-raw.jis
     2008   13705     46183  bencao.big5
     1510   15229     57427  bencao.jis
     1742   13642     86751  vstars.eva *
     1734   13354     85052  vstars.eva **

  *  = as of sometime before 2004-05-30.
  ** = as of 2004-05-30. Unknown why it changed since the original run.
Extracted the Voynichese "stars" section from the Majority version, reformatted to be comparable to the Bencao (line numbers as NNNV-U-LL, recipe numbers as "## S-NNN", all words surrounded by ASCII space). Fixed many errors by hand, against KHE's images (also in the interlinear file).

BASIC STATISTICS

Checking whether each VMS page has been split into the correct number of recipes:

  cat vstars.eva \
    | count-recipes-per-page \
    > vstars.rpp

  diff true.rpp vstars.rpp

  total 328 recipes

Note that the total has changed. This is because, during the 05/2004 round of edits, some long recipes were split at paragraph breaks, even though there were no stars there. This is not too unreasonable, because the stars seem to have been placed without much care, as if the scribe did not understand that they were associated with the paragraphs.

Basic statistics - total tokens, words, recipes:

  foreach f ( bencao.big5 vstars.eva )
    printf "\n%-10s" "${f:r}:"
    cat $f \
      | print-tk-wd-counts \
      > ${f:r}.twct
    cat ${f:r}.twct \
      | sort -b -k3nr -k1n \
      | egrep -v '^000 ' \
      > ${f:r}.twsr
  end

  bencao: total 357 recipes, 12826 tokens ( 0 bad), 35.93 tokens/recipe, 1113 good words
  vstars: total 328 recipes, 10491 tokens ( 38 bad), 31.98 tokens/recipe, 2996 good words

Note that these counts have changed since 02/2002. They used to be

  vstars: total 323 recipes, 10542 tokens ( 595 bad), 32.64 tokens/recipe, 2767 good words

During the 05/2004 round of edits, many tokens became joined with their neighbors, because the spaces were entered as faithfully as possible. However, if we believe the word structure paradigm, then many of those joined words should have been kept separate. Also note that over 550 "bad" tokens were fixed by those edits.
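The "count-recipes-per-page" step above might look like this in Python. This is a guess at its behavior, assuming the reformatted file layout just described (recipe headers "## S-NNN", data lines starting with a locus "NNNV-U-LL"); it charges each recipe to the page of its first data line:

```python
# Count recipes per page in the reformatted vstars.eva layout described
# above (hypothetical reimplementation, not the original script).
def recipes_per_page(lines):
    counts = {}
    new_recipe = False
    for ln in lines:
        ln = ln.strip()
        if ln.startswith('##'):          # "## S-NNN" starts a recipe
            new_recipe = True
        elif ln and not ln.startswith('#'):
            if new_recipe:
                page = ln.split('-')[0]  # "103r" from "103r-U-01 ..."
                counts[page] = counts.get(page, 0) + 1
                new_recipe = False
    return counts
```

The result, one count per page, is what would be diffed against "true.rpp".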
RECIPE LENGTH HISTOGRAMS

Plotting the recipe length histograms:

  foreach tw ( tk.3 wd.4 )
    foreach f ( bencao vstars )
      printf "\n%s (%s): " "${f}" "${tw:r}"
      cat ${f}.twct \
        | gawk -v fld="${tw:e}" '/./{ print $(fld); }' \
        | compute-tk-wd-histogram -v quantum=5 \
        > ${f}.${tw:r}h
    end
    foreach fmt ( png )
      plot-twhi -format ${fmt} \
        bencao.${tw:r}h Bencao 1 \
        vstars.${tw:r}h Voynich 2 \
        > recipe-${tw:r}-hist.${fmt}
    end
  end

RECIPE LENGTH PLOTS

Plotting the recipe lengths as a function of position in the text:

  foreach fmt ( png )
    foreach f ( bencao.Bencao vstars.Voynich )
      plot-recipe-attr \
        -format ${fmt} \
        ${f:r}.twct "${f:e} (tk)" 3 1 1.0 \
        > ${f:r}-tk-counts.${fmt}
    end
  end

Ditto, smoothed:

  foreach width ( 09 )
    foreach fmt ( png )
      foreach f ( bencao.Bencao vstars.Voynich )
        foreach type ( avg dif )
          cat ${f:r}.twct \
            | gawk '/./{ print $1, $2, $3; }' \
            | filter-recipe-data -v ${type}=1 -v width=${width} \
            > ${f:r}-${type}${width}.tct
        end
        plot-recipe-attr \
          -format ${fmt} \
          ${f:r}-avg${width}.tct "${f:e} avg${width}" 3 1 1.0 \
          ${f:r}-dif${width}.tct "${f:e} dif${width}" 3 2 60.0 \
          > ${f:r}-tk-counts-dif${width}.${fmt}
      end
    end
  end

COINCIDENCE IMAGES

Computing the coincidence image:

  foreach width ( 09 )
    foreach et ( 0.5/0.05/avg 0.01/0.01/dif )
      set err = "${et:h}"
      set type = "${et:t}"
      compute-coincidence-image \
        -v absErr=${err:h} -v relErr=${err:t} \
        -v xFile=bencao-${type}${width}.tct -v xField=3 \
        -v yFile=vstars-${type}${width}.tct -v yField=3 \
        | pgmnorm | pnmdepth 255 \
        > recipe-tk-counts-${type}${width}.pgm
      display recipe-tk-counts-${type}${width}.pgm
    end
  end

REPEATED WORDS

Checking for repeats:

  foreach f ( bencao.big5 vstars.eva )
    printf "\n%s: " "${f:r}"
    cat ${f} \
      | list-repeats \
      > ${f:r}.reps
    cat ${f:r}.reps | wc -l
    cat ${f:r}.reps \
      | gawk '/./{ print $2; }' \
      | sort | uniq -c | expand \
      | map-field \
          -v table=big5-to-html.tbl \
          -v inField=2 -v outField=3 \
          -v forgiving=1 \
      | map-field \
          -v table=html-to-py.tbl \
          -v inField=3 -v outField=4 \
          -v forgiving=1 \
      | gawk '//{ print $1, ($3 "=" $4); }' \
      | sort -b -k1nr -k2 \
      > ${f:r}.rtop
    head -3 ${f:r}.rtop
  end

  bencao: 41
     8 洗=(xi3,xian3)
     6 血=(xue4,xie3)
     5 寒=(han2)

  vstars: 81
    10 qokeedy=qokeedy
    10 qokeey=qokeey
     7 ar=ar

Build the word-paragraph occurrence map:

  foreach f ( bencao.big5 vstars.eva )
    cat ${f} \
      | sed \
          -e 's/^[#][#] */@/' \
          -e 's/[#].*$//' \
          -e 's/^[0-9][-A-Za-z0-9]*[ ]/ /' \
          -e '/^[ ]*$/d' \
      | tr ' ' '\012' \
      | gawk \
          ' BEGIN{ \
              split("", map); \
              split("", wd); nwd=0; split("", wdct); \
              split("", pg); npg = 0; p = "???"; \
            } \
            /^[@]/ { \
              p = $1; gsub(/[@]/, "", p); \
              pg[npg] = p; npg++; next; \
            } \
            /./ { \
              w = $1; \
              if (! (w in wdct)) \
                { wd[nwd] = w; nwd++; wdct[w] = 0; } \
              wdct[w]++; map[p,w]++; \
            } \
            END { \
              for (w in wdct) \
                { printf "%-20s %5d ", w, wdct[w]; \
                  for (i = 0; i < npg; i++) \
                    { p = pg[i]; \
                      if ((p,w) in map) \
                        { printf "%d", map[p,w]; } \
                      else \
                        { printf "."; } \
                    } \
                  printf "\n"; \
                } \
            } \
          ' \
      | sort -b -k2nr -k1 \
      > ${f:r}.wpm
  end

LULZ

According to Google Translate, the subsection titles are:

  1.1.001 玉石部上品   yùshí bù shàngpǐn        Top grade jade
  1.2.019 玉石部中品   yùshí bù zhōng pǐn       Jade department middle grade
  1.3.033 玉石部下品   yùshí bùxià pǐn          jade subordinate product
  1.4.044 草部上品     cǎo bù shàngpǐn          Top grade grass
  1.5.102 草部中品     cǎo bù zhōng pǐn         Kusanabe middle grade
  1.6.162 草部下品     cǎo bùxià pǐn            The lowest grade of grass
  1.7.219 木部上品     mù bù shàngpǐn           Top grade wood
  1.8.234 木部中品     mù bù zhōng pǐn          Kibe middle grade
  1.9.253 木部下品     mù bùxià pǐn             Kibe inferior grade
  2.1.001 蟲獸部上品   chóng shòu bù shàngpǐn   Top quality insects and beasts
  2.2.017 蟲獸部中品   chóng shòu bù zhōng pǐn  Insect and animal department medium quality
  2.3.042 蟲獸部下品   chóng shòu bùxià pǐn     Insect Beast Subordinates
  2.4.069 果菜部上品   guǒcài bù shàngpǐn       Top quality fruits and vegetables department
  2.5.080 果菜部中品   guǒcài bù zhōng pǐn      Medium range of fruits and vegetables
  2.6.087 果菜部下品   guǒcài bùxià pǐn         Fruit and vegetable products
  2.7.091 米穀部上品   mǐgǔ bù shàngpǐn         Top grade rice cereals
  2.8.094 米穀部中品   mǐgǔ bù zhōng pǐn        Mid-grade rice
  2.9.098 米穀部下品   mǐgǔ bùxià pǐn           The inferior product of Rice Valley