Hi again...

First of all, some important warnings:

(1) Your Unicode Quran file (QUV) is missing most of sura 13 (verses 13.11--13.43). Since that file (003.htm) ends with "", it is not a downloading problem; the verses must have been lost in the conversion to HTML, or earlier. Those verses are present in R. Khalifa's version from www.submission.org (QUS), which is otherwise very similar to yours.

(2) Both your Unicode file QUV and Khalifa's version QUS are missing verses 9.128 and 9.129. Those two verses are present in the phonetic version also available at your site (QPH), in a consonant-only file I got somewhere else (QCS), as well as in a graphic edition of the Quran (QIM). Apparently those two verses are praises to Muhammad; they were added posthumously by his followers when the Quran was first compiled, and have been included in official Qurans since then. However, according to R. Khalifa (see reference below), those two verses break his "19-miracle" counts, and thus should be deleted. (I suspect that his views on this and other points conflict with mainstream Islamic doctrine.) Since your Unicode file appears to have been derived from his, that would explain the gap.

(3) In your phonetic transliteration file (QPH), the last three words of verse 46.4 are actually a separate verse, 47.5.

(4) In the same file (QPH), there are two verses numbered "85.18". Actually, the last four verses should be numbered 85.19--85.22, instead of 85.18--85.21.

You say:

> I've looked into the Arabic script, and it is indeed pretty
> hairy. So hairy, as a matter of fact, that I couldn't with
> confidence attempt to map from the phonetic transcription that
> I also have at the site
> ( http://www.sacred-texts.com/isl/quran/index.htm )
> into Unicode.

Yes, I found that file (QPH) about a year ago, and I have been using it in my tests. Thanks for making it available.
I don't recall why I chose QPH over the Unicode version (QUV); either I was not aware of QUV, or the task of mapping from Unicode to something minimally pronounceable seemed too hard at the time.

Apparently, the QPH file was mechanically generated from some Unicode-like file with all the vowel signs. Unfortunately the mapping was not one-to-one; some information was lost, and it is not possible to recover the original file from the transcription. Nevertheless I was able to extract from QPH a rough approximation of the original text, and used that file for my project.

I am comparing certain statistics of various languages (actually, of their writing systems). Specifically, I am measuring such things as the number of different words with a given length, the number of times that the nth most common word occurs in a sample of given size, etc. I was about to post some results of these experiments, but realized that they were not convincing because my Arabic sample, above, was not a one-to-one mapping from the native script. (For instance, the encoding may have erased the difference between two distinct words, or the same word may have been rendered in two different ways.)

I am telling you all this only to explain why I am interested in your Unicode file QUV, and why I am so worried about its reliability, consistency, etc.

For your record, I have attached below a summary of my analysis of the Quran files I have found so far. I wrote it for my own use (as documentation of my sample files), but perhaps you will find it useful too. For my research, I'll probably be using your Unicode file (QUV), with the holes patched.

ARABIC QURAN FILES

At the moment, I have five electronic versions of the Quran in Arabic:

[QUV]: Arabic text WITH vowel marks. Each letter or diacritic is represented by its Unicode value in the range u0600 - u06FF, written as a decimal HTML entity "&#nnnn;".
This is the Unicode file from your site, completed with verses 9.128--9.129 (by hand) and 13.11--13.43 (from QUS).

[QUS]: Arabic text WITH vowel marks. Each letter or mark is represented by its Unicode value in the range u0600 - u06FF, encoded in the UTF-8 format. From the site devoted to R. Khalifa and the "19-miracle", "http://www.submission.org/quran/reader/arabic/sura001.html".

[QPH]: Semi-phonetic transcription into the Roman alphabet. Each Arabic letter is represented by one or two Roman letters with HTML bold/underline/italic markings. From your site (the one mentioned above).

[QCS]: Arabic text WITHOUT vowel marks. Each letter is represented by a single byte in the ISO-8859-6 encoding. From somewhere on the net. (Sorry, I didn't note the source, and the file itself has no indication of authorship. However it seems to be very popular; the file name is "quran.zip" or "quran.txt".)

[QIM]: Arabic text with vowel marks and other auxiliary reading signs. Each page is represented as an electronically typeset GIF image. From the IslamiCity site, "http://www.islamicity.com/mosque/ArabicScript/sindex.htm". (Used only for spot checks.)

The UTF-8 format, used in the QUS file, maps the 2-byte Unicode values into 1 to 3 bytes. The Arabic segment u0600-u06FF of Unicode, in particular, gets mapped to two bytes where the first one is either xD8, xD9, xDA, or xDB ("Ø","Ù","Ú","Û") and the second is in the range x80-xBF. ASCII characters denote themselves; i.e. bytes x00 to x7F represent Unicode u0000 to u007F.

ISO-8859-6, used in the QCS file, is essentially a proper subset of the Unicode Arabic segment u0600 - u065F, without the high-order "06" byte and shifted by xA0 (so that alef, u0627, becomes byte xC7). (The only exceptions are a couple of punctuation marks, which are not used in the Quran.) In this system, too, ASCII characters stand for themselves.
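To make the three machine-readable representations described above concrete, here is a minimal Python sketch. It is not the set of scripts used for this analysis; the function names and the sample word (beh, seen, meem) are chosen here for illustration only. It decodes the same three letters from QUV-style decimal entities, QUS-style UTF-8 bytes, and QCS-style ISO-8859-6 bytes, and checks the two-byte UTF-8 pattern stated for the Arabic segment.

```python
import html

# The three consonants beh, seen, meem as Unicode code points
# (sample word chosen for illustration).
BSM = "\u0628\u0633\u0645"


def from_entities(text: str) -> str:
    """Decode QUV-style decimal HTML entities such as '&#1576;'."""
    return html.unescape(text)


def from_utf8(data: bytes) -> str:
    """Decode QUS-style UTF-8 bytes."""
    return data.decode("utf-8")


def from_iso_8859_6(data: bytes) -> str:
    """Decode QCS-style single-byte ISO-8859-6 text."""
    return data.decode("iso8859_6")


def is_arabic_utf8_pair(pair: bytes) -> bool:
    """True if 'pair' is two bytes: a lead byte xD8..xDB and a trail byte x80..xBF."""
    return len(pair) == 2 and 0xD8 <= pair[0] <= 0xDB and 0x80 <= pair[1] <= 0xBF


if __name__ == "__main__":
    samples = {
        "entities": from_entities("&#1576;&#1587;&#1605;"),
        "utf-8": from_utf8(b"\xd8\xa8\xd8\xb3\xd9\x85"),
        "iso-8859-6": from_iso_8859_6(b"\xc8\xd3\xe5"),
    }
    for name, decoded in samples.items():
        # Show the code points so the output is readable without an Arabic font.
        print(name, [f"u{ord(ch):04X}" for ch in decoded], decoded == BSM)
```

Once all files are decoded to the same code points in this way, any further re-encoding (such as a single-byte ad-hoc mapping) can be applied uniformly to every file.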

COMPARISON OF THE FILES

The following table summarizes the differences that I found between your Unicode file (QUV) and the other files, after mapping them all to the same encoding (see below).

The verse counts do not include the repeated "bismillah" invocation ("verse 0") but include its original occurrence in verse 1.1. The word counts include all repetitions of the "bismillah". Chapter titles and verse numbers are not counted. Each sequence of `mystic initials', when present, is counted as one word. Words that are not counted are excluded from the comparisons.

A "difference" means one word added, deleted, or replaced. Before each comparison, the encoding conventions of the two files were reduced to the `least common denominator' by (1) expanding the doubling mark "»" (shaddah) or leaving it alone, and (2) deleting certain marks altogether. The choices for each column are:

  diff0: keep "»", ignore ()
  diff1: expand "»", ignore (°) = sukun
  diff2: expand "»", ignore (âäîïûü¡!~°) = short vowels, sub/sup-hamza, sup-madda, sukun
  diff3: expand "»", ignore (âäîïûü¡!~°') = short vowels, sub/sup-hamza, sup-madda, sukun, hamza
  diff4: ignore (âäîïûü¡!~°»') = short vowels, sub/sup-hamza, sup-madda, sukun, shaddah, hamza

         [QUV]                 [QUS]                [QPH]           [QCS]
     --------------   ------------------------   ------------   -------------
Sura #verses #words | #words diff0 diff1 diff2 | #words diff3 | #words  diff4
---- ------- ------ | ------ ----- ----- ----- | ------ ----- | ------ ------
   1       7     29 |     29     0     0     0 |     29     6 |     28      6
   2     286   6117 |   6118   993   992     0 |   6148  3240 |   5826   1125
   3     200   3487 |   3486   574   574     1 |   3507  1753 |   3335    629
   4     176   3759 |   3751   136   132   120 |   3766  2114 |   3613    547
   5     120   2839 |   2809   347   345   344 |   2841  1569 |   2691    326
   6     165   3065 |   3058   283   283   280 |   3060  1570 |   2901    361
   7     206   3322 |   3326   600   598     8 |   3351  1746 |   3193    590
   8      75   1238 |   1238     5     5     4 |   1247   638 |   1197    159
   9     129   2497 |   2470    28    28    28 |   2506  1266 |   2389    455
  10     109   1839 |   1837   263   263     6 |   1843   866 |   1742    345
  11     123   1921 |   1921   400   399     0 |   1950  1059 |   1827    335
  12     111   1783 |   1783   411   411     0 |   1799   884 |   1724    274
  13      43    857 |    857    30    30     0 |    859   425 |    820    154
  14      52    837 |    837   148   147     0 |    835   432 |    805    130
  15      99    659 |    659   139   136     0 |    662   346 |    634    123
  16     128   1848 |   1848     4     4     4 |   1849   879 |   1744    367
  17     111   1560 |   1560     4     4     1 |   1563   815 |   1508    238
  18     110   1583 |   1583     5     5     5 |   1587   826 |   1526    229
  19      98    965 |    965     0     0     0 |    975   483 |    937    162
  20     135   1339 |   1339     1     1     1 |   1356   701 |   1275    231
  21     112   1173 |   1173     0     0     0 |   1179   545 |   1117    256
  22      78   1278 |   1278     0     0     0 |   1283   666 |   1230    203
  23     118   1054 |   1054     0     0     0 |   1056   494 |   1015    185
  24      64   1320 |   1320     0     0     0 |   1323   664 |   1272    234
  25      77    897 |    897     0     0     0 |    900   503 |    863    133
  26     227   1322 |   1322     0     0     0 |   1327   632 |   1264    209
  27      93   1155 |   1155     0     0     0 |   1163   562 |   1118    166
  28      88   1433 |   1434     2     0     0 |   1445   683 |   1370    255
  29      69    981 |    980   149   148     2 |    986   502 |    948    169
  30      60    822 |    822   117   116     0 |    823   339 |    795    128
  31      34    552 |    552    96    96     0 |    555   275 |    526    103
  32      30    377 |    377    64    62     0 |    378   168 |    364     59
  33      73   1291 |   1291     0     0     0 |   1308   759 |   1230    252
  34      54    887 |    887     0     0     0 |    888   425 |    836    159
  35      45    779 |    779     0     0     0 |    784   419 |    741    137
  36      83    729 |    729     0     0     0 |    737   367 |    689    145
  37     182    865 |    865     0     0     0 |    870   433 |    844    139
  38      88    737 |    737     0     0     0 |    739   407 |    711    115
  39      75   1175 |   1175     0     0     0 |   1183   538 |   1133    193
  40      85   1225 |   1225     0     0     0 |   1231   564 |   1180    195
  41      54    798 |    798     0     0     0 |    799   416 |    760    136
  42      53    864 |    864     0     0     0 |    864   411 |    816    161
  43      89    833 |    833     1     1     0 |    841   408 |    814    127
  44      59    350 |    350     0     0     0 |    350   169 |    336     58
  45      37    492 |    492     0     0     0 |    492   238 |    461    117
  46      35    647 |    647     0     0     0 |    650   320 |    621    114
  47      38    543 |    543     0     0     0 |    546   275 |    531     80
  48      29    564 |    564     0     0     0 |    564   292 |    551     66
  49      18    351 |    351     0     0     0 |    357   191 |    335     68
  50      45    377 |    377     0     0     0 |    377   187 |    361     53
  51      60    364 |    364     0     0     0 |    364   181 |    349     59
  52      49    316 |    316     0     0     0 |    316   145 |    302     57
  53      62    364 |    364     0     0     0 |    364   201 |    341     70
  54      55    346 |    346     0     0     0 |    346   188 |    344     22
  55      78    355 |    355     0     0     0 |    356   265 |    347     43
  56      96    384 |    384     0     0     0 |    384   200 |    365     89
  57      29    578 |    578     0     0     0 |    579   292 |    557    101
  58      22    476 |    476     0     0     0 |    479   244 |    462     80
  59      24    449 |    449     0     0     0 |    451   229 |    424     95
  60      13    352 |    352     0     0     0 |    356   185 |    330     76
  61      14    225 |    225     0     0     0 |    230   116 |    218     40
  62      11    179 |    179     0     0     0 |    181    79 |    174     31
  63      11    184 |    184     0     0     0 |    185    85 |    175     31
  64      18    245 |    245     0     0     0 |    246   136 |    237     38
  65      12    291 |    291     0     0     0 |    293   143 |    282     45
  66      12    253 |    253     0     0     0 |    258   136 |    246     55
  67      30    337 |    337     0     0     0 |    338   139 |    331     34
  68      52    304 |    304     0     0     0 |    305   138 |    294     50
  69      52    262 |    262     0     0     0 |    264   155 |    246     54
  70      44    221 |    221     0     0     0 |    221   105 |    218     28
  71      28    230 |    230     0     0     0 |    231   122 |    219     38
  72      28    289 |    289     0     0     0 |    289   158 |    274     44
  73      20    203 |    203     0     0     0 |    204   106 |    197     26
  74      56    259 |    259     0     0     0 |    260   125 |    245     43
  75      40    168 |    168     0     0     0 |    168    88 |    163     23
  76      31    247 |    247     0     0     0 |    247   161 |    241     32
  77      50    185 |    185     0     0     0 |    185   109 |    177     33
  78      40    177 |    177     0     0     0 |    178   104 |    169     36
  79      46    183 |    183     0     0     0 |    183   117 |    182     32
  80      42    137 |    137     0     0     0 |    137    88 |    133     14
  81      29    108 |    108     0     0     0 |    108    56 |    103     15
  82      19     84 |     84     0     0     0 |     85    41 |     74     26
  83      36    173 |    173     0     0     0 |    173    87 |    164     27
  84      25    111 |    111     0     0     0 |    112    61 |    106     18
  85      22    113 |    113     0     0     0 |    113    50 |    111     11
  86      17     65 |     65     0     0     0 |     65    34 |     60     12
  87      19     76 |     76     0     0     0 |     76    41 |     72     10
  88      26     96 |     96     0     0     0 |     96    62 |     93     11
  89      30    141 |    141     0     0     0 |    143    76 |    135     22
  90      20     86 |     86     0     0     0 |     86    46 |     82     15
  91      15     58 |     58     0     0     0 |     58    39 |     54     21
  92      21     75 |     75     0     0     0 |     75    51 |     71     10
  93      11     44 |     44     0     0     0 |     44    25 |     42      5
  94       8     31 |     31     0     0     0 |     31     8 |     31      0
  95       8     38 |     38     0     0     0 |     38    17 |     38      5
  96      19     76 |     76     0     0     0 |     76    40 |     74     11
  97       5     34 |     34     0     0     0 |     34    16 |     32      7
  98       8     98 |     98     0     0     0 |     98    49 |     95     17
  99       8     40 |     40     0     0     0 |     40    26 |     39      5
 100      11     44 |     44     0     0     0 |     44    23 |     42      8
 101      11     39 |     39     0     0     0 |     40    22 |     35     10
 102       8     32 |     32     0     0     0 |     32    11 |     32      2
 103       3     18 |     18     0     0     0 |     18     9 |     18      2
 104       9     37 |     37     0     0     0 |     37    20 |     35      5
 105       5     27 |     27     0     0     0 |     27    10 |     27      1
 106       4     21 |     21     0     0     0 |     21    12 |     21      2
 107       7     29 |     29     0     0     0 |     29     9 |     28      4
 108       3     14 |     14     0     0     0 |     14     4 |     14      1
 109       6     30 |     30     0     0     0 |     31    14 |     22     20
 110       3     23 |     23     0     0     0 |     23    11 |     23      0
 111       5     27 |     27     0     0     0 |     27    16 |     25      4
 112       4     19 |     19     0     0     0 |     19     6 |     18      2
 113       5     27 |     27     0     0     0 |     27     6 |     26      3
 114       6     24 |     24     0     0     0 |     24     9 |     24      0
TOTAL   6236  77937 |  77867  4800  4780   804 |  78294 40017 |  74656  12932
BOGUS1              |            0     0     7 |            7 |             7
BOGUS2              |            0     4    10 |            ? |             ?

PRELIMINARY CONCLUSIONS

(1) Your Unicode file (QUV) and R. Khalifa's (QUS) obviously have a common origin. There are hundreds of differences in suras 2--7, 9--15, 29--32; a few scattered differences in suras 8, 16, 17, 18, 20, 28, 43; and exactly zero in all other suras. This pattern suggests that his file is later than yours, and that his alef-spelling revision (see below) was limited to a few sections, presumably those with an alef initial letter. Therefore, QUS probably contains more spelling inconsistencies than QUV.

(2) On the other hand, your file apparently has some typos that are absent from QUS, e.g. the "bîîsmî" (with a spurious extra "î") on 2.0, and the gap in sura 13 mentioned above. Both QUV and QUS lack verses 9.128 and 9.129.

(3) A spot check of a few verses against the graphic version (QIM) suggests that both QUV and QUS are slightly incorrect. A common difference between the two files is that QUV writes "aû" (alef with u-vowel mark) while QUS uses "a!" (alef with hamza-above). In some of those cases QIM shows "a!û" (alef with hamza-above and u-vowel mark). Ditto for "aî"/"a¡"/"a¡î". I believe that these differences in spelling are phonetically irrelevant, but...

(4) There are many discrepancies between QUV/QUS and the other two versions. This was expected for QPH, the file reconstructed from the phonetic transcription, since the latter uses "a" to denote both alef and certain occurrences of the short vowel mark "â". It is worrisome to see similar differences between QUV/QUS and the consonant-only file QCS, even after removing the vowel marks. Apparently some vowel marks "â" in the former are written as alefs "a" in the latter.
Also, QCS seems to have many spurious word breaks.

KHALIFA'S NOTES ON QURAN REVISION

  http://www.submission.org/miracle/alif.html
    The Updated Count of the Quranic initial ALM and ALR in the Quran.
    Omar E. Sherif and Mohamed A. Elhennawy

  http://www.submission.org/miracle/alif2.html
    The Updated Count of the Quranic initial ALMR and ALMS in the Quran.
    Omar E. Sherif and Mohamed A. Elhennawy

  http://www.submission.org/tampering.html
    Two False Verses Removed from the Quran.
    Appendix 24 of the Authorized English Translation of the Quran
    Dr. Rashad Khalifa

APPENDIX - MY AD-HOC ENCODING

Since I can't read the Arabic script, in my project I use an ad-hoc mapping of Unicode into single bytes. My encoding tries to preserve the Unicode character counts, while being vaguely phonetic when viewed with an ISO-Latin-1 font: alef -> "a", meem -> "m", fatha (short "a") -> "â", etc. Here is what the files look like, after conversion to this encoding:

Reverse-mapped phonetic text (QPH)

  1.1. bîsmî allâhî alrrâµmânî alrrâµymî
  1.2. alµâmdû lîllâhî râbbî al¿âlâmynâ
  1.3. alrrâµmânî alrrâµymî
  1.4. mâlîkî yâwmî alddynî
  1.5. aîyyâkâ nâ¿bûdû wâ-aîyyâkâ nâstâ¿ynû
  1.6. aîhdînâ alßßîrâ±â almûstâqymâ
  1.7. ßîrâ±â allâ£ynâ an¿âmtâ ¿âlâyhîm ¤âyrî almâ¤ðwbî ¿âlâyhîm wâlâ alððâllynâ

Text with vowel marks (QUV and QUS)

  1.1. bîs°mî all»âhî alr»âµ°mânî alr»âµîymî
  1.2. al°µâm°dû lîl»âhî râb»î al°¿âlâmîynâ
  1.3. alr»âµ°mânî alr»âµîymî
  1.4. mâlîkî yâw°mî ald»îynî
  1.5. a¡îy»âakâ nâ¿°bûdû wâa¡îy»âakâ nâs°tâ¿îynû
  1.6. ah°dînâa alß»îrâ±â al°mûs°tâqîymâ
  1.7. ßîrâ±â al»â£îynâ a!ân°¿âm°tâ ¿âlây°hîm° ¤ây°rî al°mâ¤°ðûwbî ¿âlây°hîm° wâlâa alð»âal»îynâ

Consonant-only text (QCS)

  1.1. bsm allh alrµmn alrµym
  1.2. alµmd llh rb al¿almyn
  1.3. alrµmn alrµym
  1.4. malk ywm aldyn
  1.5. ayak n¿bd wayak nst¿yn
  1.6. ahdna alßra± almstqym
  1.7. ßra± al£yn an¿mt ¿lyhm ¤yr alm¤ðwb ¿lyhm wlaalðalyn

Here, "»" is the Arabic diacritic that means "the previous consonant is doubled" (shaddah), and "°" is the vowel sign that means "the preceding consonant has no vowel" (sukun). The three vowel diacritics (fatha, damma, kasra) are written "âîû", the doubled variants as "äïü". The `consonantal' long vowels alef, yeh, waw are written "a", "y", "w". The following "accented" variants are represented by two-letter codes:

  u0622 ARABIC_LETTER_ALEF_WITH_MADDA_ABOVE -> "a~"
  u0623 ARABIC_LETTER_ALEF_WITH_HAMZA_ABOVE -> "a!"
  u0624 ARABIC_LETTER_WAW_WITH_HAMZA_ABOVE  -> "w!"
  u0625 ARABIC_LETTER_ALEF_WITH_HAMZA_BELOW -> "a¡"
  u0626 ARABIC_LETTER_YEH_WITH_HAMZA_ABOVE  -> "y!"

Other funny-looking characters are consonants, e.g. "¿" = 'ain, "¤" = ghain. The apostrophe "'" is a free-standing hamza, "¨" is the mostly-silent final teh (teh-marbuta). See the full table at the end.

APPENDIX - SAMPLES OF THE DIFFERENCES FOUND

diff0:

  bîîs°mî | bîs°mî
  aûn°zîlâ | a!n°zîlâ
  aûn°zîlâ | a!n°zîlâ
  aîn»â | a¡n»â
  'âan£âr°tâhûm° | 'âa!n£âr°tâhûm°
  am° | a!m°
  ab°ßârîhîm° | a!b°ßârîhîm°
  aîl»âa | a¡l»âa
  anfûsâhûm° | a!nfûsâhûm°
  alîymü | a!lîymü
  wâaî£âa | wâa¡£âa
  aîn»âmâa | a¡n»âmâa
  alâa | a!lâa
  aîn»âhûm° | a¡n»âhûm°
  wâaî£âa | wâa¡£âa
  anûw!°mînû | a!nûw!°mînû
  alâa | a!lâa
  aîn»âhûm° | a¡n»âhûm°
  wâaî£âa | wâa¡£âa
  wâaî£âa | wâa¡£âa
  ...
  al°ar°ðî | al°a!r°ðî
  tâa°kûlû | tâa!kûlû
  an°¿âmûhûm° | a!n°¿âmûhûm°
  wâanfûsûhûm° | wâa!nfûsûhûm°
  afâlâa | a!fâlâa
  an° | a¡n°
  aymânûhûm° | a¡ymânûhûm°
  fâa¿°rîð° | fâa!¿°rîð°
  an»âhûm° | a¡n»âhûm°
  wâmâlâa¡îyhî | wâmâlâa!âyhî

diff1:

  bîîsmî | bîsmî
  aûnzîlâ | a!nzîlâ
  aûnzîlâ | a!nzîlâ
  aînnâ | a¡nnâ
  'âan£ârtâhûm | 'âa!n£ârtâhûm
  am | a!m
  abßârîhîm | a!bßârîhîm
  aîllâa | a¡llâa
  anfûsâhûm | a!nfûsâhûm
  alîymü | a!lîymü
  wâaî£âa | wâa¡£âa
  aînnâmâa | a¡nnâmâa
  alâa | a!lâa
  aînnâhûm | a¡nnâhûm
  wâaî£âa | wâa¡£âa
  anûw!mînû | a!nûw!mînû
  alâa | a!lâa
  aînnâhûm | a¡nnâhûm
  wâaî£âa | wâa¡£âa
  wâaî£âa | wâa¡£âa
  ...
  alarðî | ala!rðî
  tâakûlû | tâa!kûlû
  an¿âmûhûm | a!n¿âmûhûm
  wâanfûsûhûm | wâa!nfûsûhûm
  afâlâa | a!fâlâa
  an | a¡n
  aymânûhûm | a¡ymânûhûm
  fâa¿rîð | fâa!¿rîð
  annâhûm | a¡nnâhûm
  wâmâlâa¡îyhî | wâmâlâa!âyhî

diff2:

  ann | an
  hnyya | hny'a
  mryya | mry'a
  wall£n | wall£an
  xyya | xy'a
  ala©r | ala'©r
  ya | yayyha
  ayyha <
  la'tynahm | la'tynhm
  wltat <
  ±ayf¨ <
  arak | ark
  haantm | hantm
  a£an | 'a£an
  alßßlµat | alßßlµt
  alssmawat | alssmwt
  ytamå | ytmå
  alllaty | allty
  alwldan | alwldn
  llytamå | llytmå
  ...
  wayta'y | waytay
  alnnbyyyn | alnnbyyn
  'ayat | 'ayt
  lxay | lxaå'
  yhdyny | yhdyn
  walbqyat | walbqyt
  mkknny | mkkny
  wlaßllbnnkm | wßllbnnkm
  a | arjlhm
  rjlhm <

diff3:

  alllh | allh
  ayyak | ayyk
  wayyak | w-ayyk
  ahdna | ahdn
  wla | wl
  alððallyn | alððllyn
  alllh | allh
  a/l/m | alf-lm-mym
  la | l
  hdå | hdn
  ywmnwn | ymnwn
  alßßlw¨ | alßßlt
  wmmma | wmmm
  wall£yn | wll£yn
  ywmnwn | ymnwn
  bma | bm
  wma | wm
  wbala©r¨ | wbal-a©rt
  awlyk | al-ak
  ¿lå | ¿l
  ...
  a£a | a£
  alllh | allh
  alnnas | alnns
  alnnas | alnns
  alnnas | alnns
  alwswas | alwsws
  al©nnas | al©nns
  alnnas | alnns
  aljnn¨ | aljnnt
  walnnas | wlnns

diff4:

  al¿lmyn | al¿almyn
  mlk | malk
  alßr± | alßra±
  ßr± | ßra±
  wla | wlaalðalyn
  alðalyn <
  a/l/m | alm
  alktb | alktab
  la | laryb
  ryb <
  alßlw¨ | alßla¨
  rzqnhm | rzqnahm
  wma | wmaanzl
  anzl <
  la | laywmnwn
  ywmnwn <
  abßrhm | abßarhm
  ¤xw¨ | ¤xaw¨
  wma | wmahm
  hm <
  ...
  a¿bd <
  ma | maa¤nå
  a¤nå <
  wma | wmaksb
  ksb <
  kfwa | kfwaaµd
  aµd <
  ma | ma©lq
  ©lq <
  alnfþt | alnfaþat

Last edited on 2004-01-31 07:48:19 by stolfi