Last edited on 2004-02-02 04:36:47 by stolfi

Voice of America Chinese sample in GB2312

INTRODUCTION

  The Voice of America corpus is a set of short texts in Mandarin
  Chinese, with interleaved ideographic and Pinyin versions.
  According to the headers, the encoding used is GB2312.

  Here we extract the GB text, leaving the pinyin as {}-comments.

FETCHING

    set imp = "/home/staff/stolfi/IMPORT/texts/chinese/VoiceOfAmerica"

  Files were fetched from "http://www.ocrat.com/ocrat/voa/"
  and renamed in a consistent format:

    960430/txtpin.html   -> 1996-04-30-a.html
    971212na/txtpin.html -> 1997-12-12-n.html
    etc.

  The suffix "na" means "no audio".

  See the script ${imp}/fetch-it

HTML CLEANUP

  Concatenate HTML files:

    rm -f main.htm
    foreach f ( `cd ${imp} && ls ????-??-??-?.html ` )
      echo "${f}"
      echo "# === ${f} =====================================" >> main.htm
      cat ${imp}/${f} >> main.htm
    end

  Remove HTML tags, keep text and titles.

    cat main.htm \
      | remove-html-junk-v1 \
      | translate-punctuation \
      | fix-item-headers \
      > main.raw

  Output format:

    # === HTMLFILE =====================================
    @item{NUMBER}{DATE}{TITLE}
    @chinword{GBCODES}{PINYIN}
    @=

    rm -f main.htm
    foreach f ( `cd ${imp} && ls ????-??-??-?.html ` )
      echo "${f}"
      echo "# === ${f} =====================================" >> main.htm
      cat ${imp}/${f} \
        | remove-html-junk-v1 \
        >> main.htm
    end

EXTRACTING THE PINYIN PRONUNCIATIONS

    cat main.raw \
      | extract-pinyin-readings \
      | sort -b +0 -1nr \
      > .py.cts

CHECKING THE PINYIN AGAINST THE EXISTING TABLE

    set tbl = "/home/staff/stolfi/IMPORT/texts/chinese/tables"

    cat .py.cts \
      | map-field \
          -v inField=2 -v outField=3 \
          -v table=${tbl}/chr-to-hex.tbl \
          -v default="????" \
      | map-field \
          -v inField=3 -v outField=4 \
          -v table=${tbl}/gb-to-uc.tbl \
          -v default="????" \
      | map-field \
          -v inField=4 -v outField=5 \
          -v table=${tbl}/uc-to-py.tbl \
          -v default="(none0)" \
      | sort -b +3 -4 \
      > .py-uc.cts

CHECKING THE NESTING OF QUOTES, PARENS, ETC:

    cat main.raw \
      | check-brackets
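
  The source of the check-brackets script is not shown in these notes.
  As a rough illustration of what a nesting check over main.raw might
  do, here is a minimal Python sketch based on a stack of open
  delimiters. Everything in it is an assumption made for illustration:
  the function name check_brackets, the set of delimiter pairs in
  PAIRS, the wording of the messages, and the policy of discarding the
  innermost opener when a closer does not match it. It is not the
  original tool.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a bracket-nesting check (not the original check-brackets)."""
import sys

# Assumed delimiter pairs; the set checked by the original script is unknown.
# The last pair is the curly double quotes U+201C / U+201D.
PAIRS = {"(": ")", "[": "]", "{": "}", "\u201c": "\u201d"}


def check_brackets(lines, pairs=PAIRS):
    """Return a list of (line_number, message), one per nesting problem found."""
    closers = set(pairs.values())
    problems = []
    stack = []  # each entry is (opening_char, line_number_where_opened)
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if ch in pairs:
                stack.append((ch, lineno))
            elif ch in closers:
                if not stack:
                    problems.append((lineno, f"unmatched '{ch}'"))
                    continue
                opener, where = stack.pop()
                if pairs[opener] != ch:
                    # Mismatch: report it and drop the innermost opener.
                    problems.append(
                        (lineno, f"'{ch}' closes '{opener}' opened on line {where}")
                    )
    # Anything still on the stack was never closed.
    for opener, where in stack:
        problems.append((where, f"'{opener}' never closed"))
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python3 this_script.py main.raw
    # The file is decoded as GB2312, as stated in the introduction.
    with open(sys.argv[1], encoding="gb2312", errors="replace") as f:
        for lineno, message in check_brackets(f):
            print(f"{sys.argv[1]}:{lineno}: {message}")
```

  The stack records the line on which each delimiter was opened, so an
  unclosed "{" in a @chinword record is reported at the line where it
  starts rather than at the end of the file. Symmetric quotes such as
  the ASCII '"' cannot be handled this way, since the same character
  both opens and closes; they would need a separate toggle per quote
  character.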