Last edited on 2004-01-30 02:27:44 by stolfi

Cleaning up the Quran in UTF-8 (R. Khalifa's version)

FETCHING

  Fetched a UTF-8 Quran from R. Khalifa's site

    http://www.submission.org/quran/reader/arabic/

  Files "sura001.html" to "sura114.html".

HTML CLEANUP

  Concatenated all the suras, removing headers, spurious formatting, etc.:

    set imps = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-19-Submission"
    # Start from an empty output file, since the loop below only appends.
    rm -f main.htm
    foreach f ( `cd ${imps} && ls sura???.html` )
      echo $f
      # Separator line marking where each source file begins.
      echo "# === $f ========================================" >> main.htm
      # Strip carriage returns, then isolate and remove the HTML tags.
      cat ${imps}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-s1 \
        >> main.htm
    end

  Copied "main.htm" to "main.raw". Edited it manually, removing
  the "@begin-span..." and "@end-span" markers and adding "@chapter",
  "@verse", and "@=" markers.

  After the first iteration, fixed some invalid characters in the text.
  Recorded the changes as "@fix" lines in "main.raw".

  Used "raw-to-org" to convert the UTF-8 codes to hex bytes, then to
  JSAR, thus generating the initial "main.org" file.
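
  The internals of "raw-to-org" are not recorded in these notes. As a rough sketch of its first stage only (UTF-8 codes to hex bytes), something like the following could be used. It is written in POSIX sh rather than csh so that a function can be defined; the name "utf8_to_hex" is hypothetical, and the later conversion to JSAR is not covered.

```shell
# Hypothetical helper, not the actual raw-to-org script:
# prints every byte read from stdin as a two-digit hex pair,
# with the pairs separated by single spaces on one line.
utf8_to_hex() {
  od -A n -v -t x1 \
    | tr -s ' \n' '  ' \
    | sed 's/^ //; s/ $//'
}
```

  For example, the Arabic letter BEH (U+0628) is encoded in UTF-8 as the two bytes D8 A8, so `printf '\330\250' | utf8_to_hex` prints `d8 a8`.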