Last edited on 2004-01-30 02:27:44 by stolfi
Cleaning up the Quran in UTF-8 (R. Khalifa's version)

FETCHING

  Fetched a UTF-8 Quran from R. Khalifa's site
  http://www.submission.org/quran/reader/arabic/
  Files sura001.html to sura114.html
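
  The file names follow a regular three-digit pattern, so the fetch
  can be scripted. The notes do not record how the files were actually
  downloaded; the following is only a minimal Python sketch, assuming
  each file sits directly under the URL above. The names "sura_urls"
  and "fetch_all" are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumption: every sura page lives directly under this base URL.
BASE = "http://www.submission.org/quran/reader/arabic/"

def sura_urls(first=1, last=114):
    """Return (filename, url) pairs for sura001.html .. sura114.html."""
    names = [f"sura{n:03d}.html" for n in range(first, last + 1)]
    return [(name, BASE + name) for name in names]

def fetch_all(dest_dir):
    """Download every sura page into dest_dir (needs network access)."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name, url in sura_urls():
        with urllib.request.urlopen(url) as resp:
            (dest / name).write_bytes(resp.read())
```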

HTML CLEANUP

  Concatenated all the suras, removing headers, spurious formatting, etc.
  
    set imps = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-19-Submission"
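    # Start from an empty output file ("-f": no error if it is missing).
    rm -f main.htm
    foreach f ( `cd ${imps} && ls sura???.html` )
      echo $f
      # Separator line marking the start of each source file.
      echo "# === $f ========================================" >> main.htm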
      cat ${imps}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-s1 \
        >> main.htm
    end
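
  For readers without csh, the concatenation step alone can be
  sketched in Python as below. This is an illustration, not the
  procedure actually used: the "extract-html-tags" and
  "remove-html-s1" filters are not reproduced, and the hypothetical
  "filters" argument only marks where such text-to-text functions
  would plug in. The name "concat_suras" is also hypothetical.

```python
from pathlib import Path

def concat_suras(src_dir, out_path, filters=()):
    """Concatenate sura???.html files into out_path, one separator line per file."""
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for path in sorted(Path(src_dir).glob("sura???.html")):
            # Separator line naming the source file.
            out.write(f"# === {path.name} " + "=" * 40 + "\n")
            # Read raw bytes so that CRs are deleted rather than translated,
            # matching the effect of tr -d '\015'.
            text = path.read_bytes().decode("utf-8").replace("\r", "")
            # Placeholder for the HTML filters (assumed text -> text functions).
            for f in filters:
                text = f(text)
            out.write(text)
```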

  Copied "main.htm" to "main.raw".
  
  Edited "main.raw" manually, removing the "@begin-span..." and "@end-span"
  markers and adding "@chapter", "@verse", and "@=" markers.
  
  After the first iteration, fixed some invalid chars in the text.
  Recorded the changes as "@fix" lines in "main.raw".
  
  Used "raw-to-org" to convert the UTF-8 codes to hex bytes and then
  to JSAR, thus generating the initial "main.org" file.
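
  The "raw-to-org" script is not shown here. As a rough illustration
  of the first step only (UTF-8 codes to hex bytes), here is a minimal
  Python sketch. The output format (lowercase, space-separated) is an
  assumption, the name "utf8_hexbytes" is hypothetical, and the JSAR
  step is not reproduced.

```python
def utf8_hexbytes(text):
    """Return the UTF-8 encoding of text as space-separated two-digit hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))

# The Arabic letters ba, sin, mim (U+0628, U+0633, U+0645),
# each of which takes two bytes in UTF-8.
print(utf8_hexbytes("\u0628\u0633\u0645"))  # prints: d8 a8 d8 b3 d9 85
```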