Last edited on 2004-01-30 22:34:13 by stolfi
Cleaning up the Quran in Unicode

FETCHING

  Fetched the Unicode Quran from the Sacred Texts site, one page per
  sura: http://www.sacred-texts.com/isl/uq/001.htm through 114.htm

HTML CLEANUP

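  Concatenated all the Surat of each source into a single file
  (main.htm for the Sacred Texts version, subm.htm for the Submission
  version), removing headers, spurious formatting, etc.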
  
    # Sacred Texts version: one file per sura, concatenated into main.htm.
    set impu = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-Unicode"
    rm -f main.htm
    foreach f ( `cd ${impu} && ls ???.htm` )
      echo $f
      echo "# === $f ========================================" >> main.htm
      cat ${impu}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-u1 \
        | remove-html-u2 \
        | egrep -v '^ *$' \
        | gawk '/^ *[@]/{if ($0 == o) { next; } o = $0; } //{ print; }' \
        >> main.htm
    end
  
    # Submission version: same procedure, concatenated into subm.htm.
    set imps = "/home/staff/stolfi/IMPORT/texts/arabic/Quran-19-Submission"
    rm -f subm.htm
    foreach f ( `cd ${imps} && ls sura???.html` )
      echo $f
      echo "# === $f ========================================" >> subm.htm
      cat ${imps}/${f} \
        | tr -d '\015' \
        | extract-html-tags \
        | remove-html-s1 \
        >> subm.htm
    end
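
  The gawk filter in the first loop can be read as follows: a line that
  starts with "@" (after optional blanks) is dropped when it is identical
  to the most recent "@" line seen; every other line is printed. Below is
  a minimal Python sketch of that logic only. The function name is
  hypothetical, and what the "@" lines contain depends on the output of
  extract-html-tags, which is not shown in these notes.

```python
import re

# Lines that start with optional blanks followed by "@".
AT_LINE = re.compile(r"^ *@")

def drop_repeated_at_lines(lines):
    """Yield lines, skipping any '@' line equal to the previous '@' line.

    Mirrors the gawk one-liner: the remembered line is updated only on
    '@' lines, so intervening non-'@' lines do not reset the comparison.
    """
    last_at = None
    for line in lines:
        if AT_LINE.match(line):
            if line == last_at:
                continue  # same '@' line as last time: drop it
            last_at = line
        yield line
```

  For instance, list(drop_repeated_at_lines(["@a", "@a", "x", "@a", "@b"]))
  keeps only "@a", "x", "@b".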

CONVERSION TO UNICODE

  Converted Arabic HTML decimal entity codes to "«xx»":

    * Arabic chars (HTML decimal 1548..1790, i.e. U+060C..U+06FE) were
    mapped to "«{xx}»", where {xx} is the low byte, in hexadecimal, of
    the Unicode code point U+06{xx}.

    * One occurrence of the entity "&#65271;" (Unicode U+FEF7 = ARABIC
    LIGATURE LAM WITH ALEF WITH HAMZA ABOVE ISOLATED FORM) in verse
    20.71 was mapped to "«44»«23»{&65271}".

    * Spaces between Arabic chars were converted to "«__»".

    fix-codes-1
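
  The fix-codes-1 script itself is not reproduced here. The three rules
  above can be sketched in Python as below. This is only an illustration
  under assumptions: the names are hypothetical, the hex digits are taken
  to be lower case, and a run of spaces is assumed to become a single
  "«__»" (the notes do not say which).

```python
import re

ENTITY = re.compile(r"&#(\d+);")   # HTML decimal character reference
LAM_ALEF_HAMZA = 65271             # U+FEF7, the ligature seen in verse 20.71

def encode_entity(match):
    """Map one decimal entity to the «xx» notation described above."""
    code = int(match.group(1))
    if 1548 <= code <= 1790:
        # Arabic block: keep only the low byte of U+06xx, as two hex digits.
        return "«%02x»" % (code - 0x0600)
    if code == LAM_ALEF_HAMZA:
        # Ligature split into LAM (U+0644) + ALEF WITH HAMZA ABOVE (U+0623),
        # with the original code kept as a marker.
        return "«44»«23»{&65271}"
    return match.group(0)          # anything else is left untouched

def fix_codes(line):
    line = ENTITY.sub(encode_entity, line)
    # Spaces directly between two encoded Arabic chars become «__».
    return re.sub(r"(?<=») +(?=«)", "«__»", line)
```

  With these assumptions, fix_codes("&#1604;&#1575; &#1604;") gives
  "«44»«27»«__»«44»". A space that follows the "{&65271}" marker is not
  converted by this sketch.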

DEBUGGING 

  Corrected some apparent typos and removed additional formatting chars.
  See main.org for details.

  Compared this file to another Unicode file (by the www.submission.org
  group) and noticed a large number of differences, typos and bugs,
  such as doubled kasras, omitted hamzas, etc. A spot check against
  a handwritten edition
  
    http://www.islamicity.com/mosque/ArabicScript/sindex.htm
  
  showed that both files were probably full of errors.
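
  How the two files were compared is not recorded in these notes. As a
  rough sketch of the kind of check involved, the Python below lists the
  positions where two verse lists differ and flags doubled kasras in the
  "«xx»" notation, where KASRA (U+0650) appears as "«50»". The function
  names are hypothetical, and the sketch assumes that both files have
  already been split into aligned lists with one string per verse.

```python
import re

# Two or more consecutive KASRA codes (U+0650 is written «50»).
DOUBLED_KASRA = re.compile(r"(?:«50»){2,}")

def differing_verses(verses_a, verses_b):
    """Return (index, a, b) for each aligned pair of verses that differ."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(verses_a, verses_b))
        if a != b
    ]

def find_doubled_kasras(verse):
    """Return the start offsets of doubled (or longer) kasra runs."""
    return [m.start() for m in DOUBLED_KASRA.finditer(verse)]
```

  Omitted hamzas cannot be found this way without a trusted reference
  text, which is why a spot check against a printed or handwritten
  edition is still needed.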