# Last edited on 2004-02-17 01:35:05 by stolfi
# A collection of sample texts in several languages, 
# formatted for statistical analysis.

CURRENT DIRECTORY LAYOUT

  Directories are named {LLLL}/{DDD} where {LLLL} is the language (e.g.
  engl, latn, span, ital) and {DDD} is the document.  
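
  As an illustration only, the naming rule could be checked with a
  small Python helper like the one below. The function name
  "split_sample_dir" and the document name "wow" are hypothetical, and
  the pattern assumes that language tags are always four lowercase
  letters, as in the examples above; no rule is stated here for
  document names, so none is enforced.

```python
import re
from pathlib import PurePosixPath

# Assumption: a language tag is exactly four lowercase ASCII letters,
# as in the examples engl, latn, span, ital.
LANG_RE = re.compile(r"[a-z]{4}")


def split_sample_dir(path):
    """Return (language, document) for a path like 'engl/wow', else None."""
    parts = PurePosixPath(path).parts
    if len(parts) != 2:
        return None  # not of the form {LLLL}/{DDD}
    lang, doc = parts
    if not LANG_RE.fullmatch(lang):
        return None  # language tag does not look like {LLLL}
    return lang, doc
```

  For example, split_sample_dir("engl/wow") gives ("engl", "wow"),
  while "english/wow" and "engl" alone are rejected.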
  
  "Language" is defined here from the broad computational/statistical
  viewpoint. Minor differences in punctuation, null characters, and the
  like (see "file encoding" below) do not make a different language.
  Also acceptable are moderate differences in transliteration and
  spelling, such as different Romanization systems (e.g. Japanese
  "mitubisi" vs. "mitsubishi"), or classical vs. modern Latin.
  
  On the other hand, Chinese ideograms and Chinese pinyin are
  considered to be two different languages. Ditto for plain English
  and English in Vigenère or codebook cipher.
  
  Documents that differ by the systematic omission of certain symbols,
  such as Arabic/Hebrew with and without vowel marks, should be
  classed as different languages. However, if the remaining letters
  are encoded in the same way, we make an exception and use the same
  directory {LLLL}, for editing convenience.
  
DIRECTORY CONTENTS

  Within each directory, there should be files with the following
  names and meanings:

    * "main.src" (mandatory): the main source file, with all the
      manual cleanup, reformatting, error fixes, etc.  See the 
      detailed description below.

    * "main.raw" (optional): the original file from the external source, for
      documentation purposes only, with *purely mechanical* cleanup such
      as:

        * delete irrelevant HTML headers, markup and line breaks.

        * replace significant HTML tags (e.g. start of chapter or verse)
        by ad-hoc "@set{{VAR}}{[=|+]{VALUE}}" directives.

        * mark mechanically recognizable special text (e.g. English
        chapter titles, verse numbers, editorial notes) with
        textual type @-directives (see below).

      This file must be in the original encoding, with no hand-edits
      and no hand comments. The commands used to produce this file and
      other notes should be in Notebook.txt.

    * "Notebook.txt": technical description of the document's 
      processing.  Note that information that matters to users of 
      the document should be placed as comments in the "main.src"
      file itself.
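
  The file list above could be verified mechanically. The following
  is a minimal Python sketch, not one of the collection's tools; the
  function name "check_document_dir" is hypothetical. It also assumes
  that a directory containing "main.raw" should contain
  "Notebook.txt", since the commands used to produce "main.raw" are
  supposed to be recorded there.

```python
from pathlib import Path


def check_document_dir(doc_dir):
    """Return a list of problems found in one {LLLL}/{DDD} directory."""
    doc_dir = Path(doc_dir)
    problems = []
    # "main.src" is the only file described as mandatory.
    if not (doc_dir / "main.src").is_file():
        problems.append("missing mandatory main.src")
    # Assumption: "main.raw" without "Notebook.txt" is worth flagging,
    # because the notebook should record how "main.raw" was produced.
    if (doc_dir / "main.raw").is_file() and not (doc_dir / "Notebook.txt").is_file():
        problems.append("main.raw present but Notebook.txt missing")
    return problems
```

  An empty result means that the directory passes both checks.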
  
FORMAT OF THE MAIN SOURCE FILE 

  The file "main.src" must have the following features:

    * identification of any significant structural divisions of
    the text (volumes, books, parts, chapters, sections and
    subsections, verses) with @-directives. 
    See tools/src-to-wds for details.

    * segregation of linguistically homogeneous textual types
    (prose, poetry, foreign-language inserts, verse and chapter
    numbers and titles, lists and tables, figure labels and
    captions etc.) into separate textual units, marked with
    textual type @-directives.

    * markup of significant paragraph and text-line breaks by
    distinctive characters. Note that a paragraph or a line may
    span several textual units, e.g. foreign-language phrases or
    tables.

    * regularization of spelling (such as removal of
    non-significant capitalization, capitalization of proper
    names, adding or deleting hyphens, fixes to punctuation,
    numbers and symbols, etc.).

    * replacement of linguistically significant font markup by
    punctuation marks or symbols, e.g. "/.../" for
    italic.

    * preamble comments with document description, external
    sources and credits.

    * comments describing all edits and transformations
    made to the original document after it was obtained from
    the external source.

    * a short table-of-contents summarizing the textual
    divisions and their nature.
    
    * directives "@include{FILENAME}" to include auxiliary files
    (encoding tables, parts of a multi-file document, etc.).

  This file may use a language-specific lossless encoding into
  printable ISO-8859-1 characters that is suitable for
  hand-editing. The main features of the encoding should be
  specified by the directives "@null{...}", "@alpha{...}",
  "@symbol{...}", "@punct{...}", "@blank{...}", "@parag{...}", and
  "@break{...}". These character classes must be disjoint and must
  not include the characters "@#{}*".
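
  The two constraints on the character classes (pairwise disjoint, and
  free of the reserved characters "@#{}*") could be tested as in the
  Python sketch below. It assumes that the classes have already been
  extracted from the directives into a dictionary mapping a class name
  to a string of its characters; that representation and the name
  "check_classes" are illustrative assumptions, not part of the
  actual tools.

```python
from itertools import combinations

# Characters that no class may contain, per the rule above.
RESERVED = set("@#{}*")


def check_classes(classes):
    """classes: dict mapping class name -> string of its characters.

    Return a list of error messages (empty when both rules hold).
    """
    errors = []
    # Rule 1: no class may include a reserved character.
    for name, chars in classes.items():
        bad = RESERVED & set(chars)
        if bad:
            errors.append(f"{name} contains reserved characters: {''.join(sorted(bad))}")
    # Rule 2: every pair of classes must be disjoint.
    for (a, ca), (b, cb) in combinations(classes.items(), 2):
        shared = set(ca) & set(cb)
        if shared:
            errors.append(f"{a} and {b} overlap on: {''.join(sorted(shared))}")
    return errors
```

  For instance, check_classes({"alpha": "abc", "punct": "c."})
  reports one overlap, on the character "c".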

  In this file, words should be separated by blank spaces and/or
  newlines (implicitly included in the "@blank{}" directive). Note
  therefore that there is no relationship between file-lines (ASCII
  NL) and text-lines (marked by encoding-specific characters).
  Except for ASCII SP and NL, non-printable characters --
  including ASCII HT, VT, CR, or ISO-8859-1 non-breaking space and
  soft hyphen -- are strictly forbidden.
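
  The ban on non-printable characters can be enforced with a simple
  scan. The Python sketch below is one possible reading of the rule,
  not an official checker: it assumes that the file has been decoded
  as ISO-8859-1 into a string, accepts SP, NL, and the printable
  ranges 0x21-0x7E and 0xA1-0xFF other than the soft hyphen (0xAD),
  and reports everything else. The name "forbidden_chars" is
  hypothetical.

```python
def forbidden_chars(text):
    """Yield (line, column, codepoint) for each forbidden character.

    Lines and columns are 1-based; lines are split on ASCII NL only,
    so a stray CR is reported like any other control character.
    """
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            code = ord(ch)
            printable = (0x21 <= code <= 0x7E) or (
                0xA1 <= code <= 0xFF and code != 0xAD  # exclude soft hyphen
            )
            # ASCII SP is the only non-graphic character allowed in a line.
            if ch != " " and not printable:
                yield lineno, col, code
```

  A clean file yields nothing; a tab in column 2 of line 1 yields
  (1, 2, 9).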