Last edited on 2004-02-02 04:36:47 by stolfi
Voice of America Chinese sample in GB2312

INTRODUCTION

  The Voice of America corpus is a set of short texts 
  in Mandarin Chinese, with interleaved ideographic and Pinyin versions.
  According to the headers, the encoding used is GB2312.
  Here we extract the GB text, leaving the pinyin as {}-comments.

FETCHING

    set imp = "/home/staff/stolfi/IMPORT/texts/chinese/VoiceOfAmerica"

  Files were fetched from "http://www.ocrat.com/ocrat/voa/" and renamed in
  a consistent format:
   
     960430/txtpin.html   -> 1996-04-30-a.html
     971212na/txtpin.html -> 1997-12-12-n.html
  
  etc. The suffix "na" means "no audio"; such files get the tag "-n",
  the others "-a". See the script ${imp}/fetch-it.
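
  The renaming rule can be sketched as follows. This is only an
  illustration, not the fetch-it script itself: the function name
  local_name is hypothetical, the century "19" is assumed from the two
  examples above, and reading "-a" as "audio available" is an inference.

```python
import re

# Hypothetical helper; NOT the author's fetch-it script.
# Assumption: two-digit years belong to the 1900s (true for 96 and 97).
_DIR_RE = re.compile(r"^(\d{2})(\d{2})(\d{2})(na)?/txtpin\.html$")


def local_name(remote_path: str) -> str:
    """Map e.g. '960430/txtpin.html' to '1996-04-30-a.html'."""
    m = _DIR_RE.match(remote_path)
    if m is None:
        raise ValueError(f"unexpected path: {remote_path!r}")
    yy, mm, dd, no_audio = m.groups()
    # "na" directories get the tag "n"; all others get "a".
    tag = "n" if no_audio else "a"
    return f"19{yy}-{mm}-{dd}-{tag}.html"


if __name__ == "__main__":
    for path in ("960430/txtpin.html", "971212na/txtpin.html"):
        print(path, "->", local_name(path))
```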
  
HTML CLEANUP

  Concatenate HTML files:
  
    rm -f main.htm
    foreach f ( `cd ${imp} && ls ????-??-??-?.html ` ) 
      echo "${f}"
      echo "# === ${f} =====================================" >> main.htm
      cat ${imp}/${f}  >> main.htm
    end

  Remove HTML tags, keeping text and titles:

    cat main.htm \
      | remove-html-junk-v1 \
      | translate-punctuation \
      | fix-item-headers \
      > main.raw

  Output format:
     
     # === HTMLFILE =====================================
     @item{NUMBER}{DATE}{TITLE}
     @chinword{GBCODES}{PINYIN}
     @=
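
  A minimal reader for this format might look like the sketch below.
  It assumes that fields never contain braces and that the file has
  already been decoded from GB2312 into text; neither is guaranteed by
  the notes above. The meaning of the "@=" line is not stated here, so
  the sketch only labels it as a separator. The sample word and its
  pinyin spelling in the usage lines are made up.

```python
import re

# Assumption: fields never contain '{' or '}'.
_ENTRY_RE = re.compile(r"^@(item|chinword)((?:\{[^{}]*\})+)\s*$")
_FIELD_RE = re.compile(r"\{([^{}]*)\}")


def parse_line(line: str):
    """Classify one line of main.raw and split out its fields.

    Returns (kind, fields), where kind is one of
    'file', 'item', 'chinword', 'sep', or 'other'.
    """
    line = line.strip()
    if line.startswith("# ==="):
        # File separator: "# === HTMLFILE ====="
        parts = line.split()
        return "file", [parts[2] if len(parts) > 2 else ""]
    if line == "@=":
        return "sep", []
    m = _ENTRY_RE.match(line)
    if m:
        return m.group(1), _FIELD_RE.findall(m.group(2))
    return "other", [line]


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    for sample in (
        "# === 1996-04-30-a.html =====",
        "@item{1}{1996-04-30}{Sample title}",
        "@chinword{中国}{zhong1guo2}",
        "@=",
    ):
        print(parse_line(sample))
```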

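  Variant of the concatenation loop that applies remove-html-junk-v1
  to each file before appending it to main.htm:

    rm -f main.htm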
    foreach f ( `cd ${imp} && ls ????-??-??-?.html ` ) 
      echo "${f}"
      echo "# === ${f} =====================================" >> main.htm
      cat ${imp}/${f} \
        | remove-html-junk-v1 \
        >> main.htm
    end

EXTRACTING THE PINYIN PRONUNCIATIONS
  
    cat main.raw \
      | extract-pinyin-readings \
      | sort -b -k1,1nr \
      > .py.cts
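
  The extract-pinyin-readings script is not shown here. Since the
  output is sorted numerically on its first field, each line presumably
  starts with an occurrence count. Under that assumption, the counting
  step could be sketched like this (count_readings is a hypothetical
  name, and the input is taken to be a list of (word, pinyin) pairs):

```python
from collections import Counter


def count_readings(pairs):
    """Count (word, pinyin) pairs and sort them by decreasing count.

    Returns a list of (count, word, pinyin) rows.
    """
    counts = Counter(pairs)
    # Ties are broken by word, then pinyin, so the order is deterministic.
    return sorted(
        ((n, word, py) for (word, py), n in counts.items()),
        key=lambda row: (-row[0], row[1], row[2]),
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [("中", "zhong1"), ("国", "guo2"), ("中", "zhong1")]
    for n, word, py in count_readings(sample):
        print(n, word, py)
```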

CHECKING THE PINYIN AGAINST THE EXISTING TABLE

    set tbl = "/home/staff/stolfi/IMPORT/texts/chinese/tables"
    
    cat .py.cts \
      | map-field \
          -v inField=2 -v outField=3 \
          -v table=${tbl}/chr-to-hex.tbl \
          -v default="????" \
      | map-field \
          -v inField=3 -v outField=4 \
          -v table=${tbl}/gb-to-uc.tbl  \
          -v default="????" \
      | map-field \
          -v inField=4 -v outField=5 \
          -v table=${tbl}/uc-to-py.tbl  \
          -v default="(none0)" \
      | sort -b -k4,4 \
      > .py-uc.cts
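
  The formats of chr-to-hex.tbl and gb-to-uc.tbl are not shown here.
  Judging by their names, the first two lookups go from character to GB
  code in hex, and from GB code to Unicode. As a cross-check only, the
  same two steps can be sketched with Python's built-in "gb2312" codec
  instead of the tables. The function names are hypothetical, and the
  "????" fallback just mirrors the default used above.

```python
def gb_hex(ch: str, default: str = "????") -> str:
    """Return the GB2312 code of ch as uppercase hex, or default."""
    try:
        return ch.encode("gb2312").hex().upper()
    except UnicodeEncodeError:
        return default


def gb_hex_to_unicode(code: str, default: str = "????") -> str:
    """Return the Unicode code point (hex) for one GB2312 hex code."""
    try:
        ch = bytes.fromhex(code).decode("gb2312")
    except ValueError:
        # Covers malformed hex and undecodable byte sequences
        # (UnicodeDecodeError is a subclass of ValueError).
        return default
    if len(ch) != 1:
        return default
    return f"{ord(ch):04X}"


if __name__ == "__main__":
    code = gb_hex("中")
    print(code, gb_hex_to_unicode(code))
```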

CHECKING THE NESTING OF QUOTES, PARENS, ETC.

    cat main.raw \
      | check-brackets
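
  The check-brackets script is not shown here. A stack-based nesting
  check of the usual kind is sketched below. The set of delimiter pairs
  is an assumption, and symmetric quotes such as '"' are left out
  because they need separate handling.

```python
# Assumed delimiter pairs (closer -> opener); the real set is not stated.
PAIRS = {")": "(", "]": "[", "}": "{", "”": "“"}


def check_brackets(text: str):
    """Return a list of (line, column, message) nesting problems."""
    openers = set(PAIRS.values())
    stack = []      # entries are (opener, line, column)
    problems = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in openers:
                stack.append((ch, line_no, col))
            elif ch in PAIRS:
                if stack and stack[-1][0] == PAIRS[ch]:
                    stack.pop()
                else:
                    problems.append((line_no, col, f"unexpected {ch!r}"))
    # Anything still on the stack was never closed.
    for ch, line_no, col in stack:
        problems.append((line_no, col, f"unclosed {ch!r}"))
    return problems


if __name__ == "__main__":
    for line_no, col, msg in check_brackets("a (b [c] d\ne}"):
        print(f"{line_no}:{col}: {msg}")
```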