# Last edited on 2004-02-07 10:25:49 by stolfi
# A collection of sample texts in several languages, 
# formatted for statistical analysis.

CURRENT DIRECTORY LAYOUT

  Directories are named LLLL/DDD where LLLL is the language (engl, latn, span,
  ital) and DDD is the document.
  
  Within each directory there MAY be files with the following
  names and meanings:
    
    remove-html-junk-xx = filter that extracts usable text from 
      original HTML files.
  
    main.htm = the original source for edit.txt, as obtained from the
      net, with minimal clean-up, in the original encoding.

    main.raw = intermediate file with some mechanical code conversion
      and editing.
      
    main.org = final source file, with all manual edits
      and transformations, and with standardized @-directives 
      as expected by make-evt-from-org.
      
    main.evt = the main text in EVT format, generated mechanically
      from main.org.
    
    main.wds = list of words from main.evt.
    
    main.cts = frequency counts for main.wds.
    
    raw-to-org = converts main.raw into a first draft of main.org.
    
    preprocess-org = optional mechanical transformations of main.org 
      before feeding it to org-to-evt.
      
    raw-to-wds = converts main.raw into a word list, for 
      consistency checking.
      
    cook-words-for-diff = preprocesses words extracted from main.evt
      for a consistency check against the words produced by raw-to-wds.
    
    Makefile = generates main.evt and the other files from main.org.
    
  If the Makefile is not present, the main.evt file was generated 
  by hand-editing.
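
  As an illustration of the main.wds -> main.cts step, here is a
  minimal Python sketch. It is not the script actually used here.
  It assumes that main.wds holds one word per line and that a
  "count word" line layout is acceptable for main.cts; the names
  count_words and format_counts are hypothetical.

```python
from collections import Counter

def count_words(words):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order.

    Assumption: `words` is an iterable of strings, one per line of main.wds;
    empty strings are ignored.
    """
    counts = Counter(w for w in words if w)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def format_counts(pairs):
    """Format pairs as 'count word' lines (an assumed layout, not a documented one)."""
    return ["%7d %s" % (n, w) for w, n in pairs]
```

  Typical use would be to read main.wds, strip the newline from each
  line, pass the list to count_words, and write the formatted lines
  to main.cts.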

  The master texts are stored in EVT-like format, extended with
  language-specific charset declarations (see wds-from-evt).
  Capitalization should be retained when it is significant in the
  original text. E.g., yes for English, Vietnamese, and modern Latin;
  no for Greek, Arabic, Tibetan, and some ancient Latin texts.
       
  The line locators should have the form <CCC.PP.NNN> where CCC is a
  chapter/section identifier, PP is a text unit ID, and NNN is a line
  number (increasing within each CCC.PP combination). Each text unit
  must be homogeneous in nature (normal text, title, label, quotation, etc.).
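
  The locator rule can be checked mechanically. The Python sketch
  below is only an illustration under stated assumptions: the
  locator sits at the start of the line, CCC and PP contain no dots,
  spaces, or angle brackets, and NNN is a decimal number. The names
  parse_locator and check_line_numbers are hypothetical.

```python
import re

# Assumed shape: <CCC.PP.NNN> at the start of a line.
LOCATOR_RE = re.compile(r"^<([^.<>\s]+)\.([^.<>\s]+)\.(\d+)>")

def parse_locator(line):
    """Return (ccc, pp, nnn) for a line with a leading locator, else None."""
    m = LOCATOR_RE.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2), int(m.group(3))

def check_line_numbers(lines):
    """Return 1-based indices of lines whose NNN does not increase within CCC.PP.

    Comment lines ("#" in column 1) and lines without a locator are skipped.
    """
    last = {}   # (ccc, pp) -> last line number seen for that text unit
    bad = []
    for i, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue
        loc = parse_locator(line)
        if loc is None:
            continue
        ccc, pp, nnn = loc
        key = (ccc, pp)
        if key in last and nnn <= last[key]:
            bad.append(i)
        last[key] = nnn
    return bad
```

  Running check_line_numbers over the lines of a main.evt file
  returns an empty list when every CCC.PP unit is numbered in
  strictly increasing order.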
  
  Comments start with "#" in column 1. There should be a para-comment
  "## <CCC>" at the beginning of each new chapter.

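  The chapter-comment convention above can also be checked with a
  short script. This Python sketch is only an illustration. It
  assumes that the para-comment is written literally as "## <CCC>"
  (angle brackets included), alone on its line, and that text lines
  begin with a <CCC.PP.NNN> locator. The function name
  chapters_missing_marker is hypothetical.

```python
import re

# Assumed forms: '## <CCC>' for the chapter comment, '<CCC.' starting a locator.
CHAPTER_COMMENT_RE = re.compile(r"^## <([^<>\s]+)>\s*$")
LOCATOR_CHAPTER_RE = re.compile(r"^<([^.<>\s]+)\.")

def chapters_missing_marker(lines):
    """Return chapter IDs whose first text line is not preceded by '## <CCC>'."""
    announced = set()   # chapters announced by a para-comment so far
    seen = set()        # chapters already met in a locator
    missing = []
    for line in lines:
        m = CHAPTER_COMMENT_RE.match(line)
        if m:
            announced.add(m.group(1))
            continue
        if line.startswith("#"):
            continue    # ordinary comment
        m = LOCATOR_CHAPTER_RE.match(line)
        if m:
            ccc = m.group(1)
            if ccc not in seen:
                seen.add(ccc)
                if ccc not in announced:
                    missing.append(ccc)
    return missing
```

  An empty result means that every chapter found in the locators was
  announced by a para-comment before its first line.
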
TO DO

  This structure is inconsistent and too baroque.
  It is being converted to a simpler scheme in 
  /home/staff/stolfi/projects/langbank/

  Old sub-directories are in 
  ~/voynich/work/Texts-Old/