Last edited on 2004-01-30 22:34:13 by stolfi

Cleaning up the Quran in Unicode

FETCHING

  Fetched the Unicode Quran from the Sacred Texts site, pages
  http://www.sacred-texts.com/isl/uq/001.htm through 114.htm
  (one page per sura).

HTML CLEANUP

  Concatenated all the suras and removed headers, spurious
  formatting, etc. The gawk filter drops any line beginning with "@"
  that is identical to the last such line seen.

    set impu = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-Unicode"
    rm main.htm
    foreach f ( `cd ${impu} && ls ???.htm` )
      echo $f
      echo "# === $f ========================================" >> main.htm
      cat ${impu}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-u1 \
        | remove-html-u2 \
        | egrep -v '^ *$' \
        | gawk '/^ *[@]/{if ($0 == o) { next; } o = $0; } //{ print; }' \
        >> main.htm
    end

  Did the same for the second copy (in Quran-19-Submission):

    set imps = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-19-Submission"
    rm subm.htm
    foreach f ( `cd ${imps} && ls sura???.html` )
      echo $f
      echo "# === $f ========================================" >> subm.htm
      cat ${imps}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-s1 \
        >> subm.htm
    end

CONVERSION TO UNICODE

  Converted the Arabic HTML decimal entity codes to "«xx»" codes:

    * Arabic chars (HTML decimal entities 1548..1790, i.e.
      U+060C..U+06FE) were mapped to "«{xx}»", where {xx} is the
      lower byte of the Unicode code point U+06{xx}, in hexadecimal.

    * The single occurrence of the entity "&#65271;" (Unicode U+FEF7 =
      ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE ISOLATED FORM),
      in verse 20.71, was mapped to "«44»«23»{&65271}".

    * Spaces between Arabic chars were converted to "«__»".

    fix-codes-1

DEBUGGING

  Corrected some apparent typos and removed additional formatting
  chars. See main.org for details.

  Compared this file to another Unicode file (by the
  www.submission.org group) and found a large number of differences,
  typos and bugs, such as doubled kasras, omitted hamzas, etc.

  A spot check against a handwritten edition,
  http://www.islamicity.com/mosque/ArabicScript/sindex.htm,
  suggested that both files contain many errors.
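
SKETCH OF THE ENTITY MAPPING

  The three mapping rules listed under CONVERSION TO UNICODE can be
  sketched in Python as below. This is only an illustration, not the
  actual fix-codes-1 script. The function names, the use of uppercase
  hex digits, the "&#NNNN;" input form, and the choice to collapse a
  run of spaces into a single "«__»" are all assumptions.

```python
import re

# Assumed input form: HTML decimal character references such as "&#1576;".
ENTITY = re.compile(r"&#(\d+);")

# Arabic range given in the notes: decimal 1548..1790 (U+060C..U+06FE).
ARABIC_LO, ARABIC_HI = 1548, 1790

# The one ligature mentioned in the notes (U+FEF7), expanded to
# LAM (U+0644) + ALEF WITH HAMZA ABOVE (U+0623), followed by a tag
# that records the original code.
LIGATURES = {65271: "«44»«23»{&65271}"}


def encode_entity(match):
    """Map one decimal entity to its «xx» code (hypothetical helper)."""
    code = int(match.group(1))
    if ARABIC_LO <= code <= ARABIC_HI:
        # Keep only the low byte of U+06xx; uppercase hex is an assumption.
        return "«%02X»" % (code & 0xFF)
    if code in LIGATURES:
        return LIGATURES[code]
    return match.group(0)  # leave any other entity untouched


def fix_codes(line):
    """Apply the three mapping rules to one line of text."""
    line = ENTITY.sub(encode_entity, line)
    # A run of spaces between two encoded items becomes one "«__»"
    # (treating a run as a single separator is an assumption).
    return re.sub(r"(?<=[»}]) +(?=«)", "«__»", line)


# Example call:
#   fix_codes("&#1576;&#1587;&#1605; &#1575;&#1604;&#1604;&#1607;")
#   returns "«28»«33»«45»«__»«27»«44»«44»«47»"
```

  Entities outside the Arabic range, and spaces that are not between
  two encoded items, pass through unchanged, so the function can be
  applied line by line to a file that still contains other markup.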