Wednesday 28 December 2011

The Quran's Statistics


Shadow Caster. 17-04-2010
This is version 2 of The Quran's Statistics. The old article has been removed because it is inaccurate.
Last updated 29-12-2011, to version 2.0.2
Updated 22-04-2010, to version 2.0.1 - Thanks to brother Ali Adams for helping identify a missed diacritic.
Updated 29-12-2011, to version 2.0.2 - Thanks to brother Ayub Hamid for noticing a numerical discrepancy that led to the identification of four more diacritics, a missed letter and a word counting bug.


Introduction


The Quran is the holy book of the Muslims, who consider its text to be the unparalleled, unadulterated and perfect word of God. The Quran was revealed about 1400 years ago, over a period spanning 23 years. It was revealed in the Arabic language, and scribes wrote it down mainly on parchment; these writings were collected as volumes and then compiled into one book. Originally there were no harakat (diacritics) or nuqat (dots) on the letters, because there was no definitive, standardized set of rules for applying them to written Arabic. After some time and some minor evolution, the Arabic script incorporated diacritics and dots, which were subsequently applied to the text of the Quran.

Today, when we open a copy of the Quran we see it is ornately adorned with all manner of symbols in addition to the 28 letters of the Arabic alphabet. Arabic on computers used to be a big problem, but with the improvements to the Unicode standard in the noughties it is possible to display all the required Arabic letters and their diacritics, though editing and typography still have their issues.

Arabic is best represented in Unicode, encoded as UTF-8. Each Arabic letter and each diacritic is a separate Unicode character (code point), although some letter-and-mark combinations are also available as a single precomposed character. Because the diacritics sit between the letters in the stored text, it is difficult to analyze the letters individually.
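To make this concrete, here is a minimal Python sketch (not part of the article's PHP scripts) that lists the code points in one vocalized word. The sample word and the helper name `describe` are chosen only for illustration.

```python
import unicodedata

# First word of the Bismillah, spelled with escapes so each code point is
# visible: beh, kasra, seen, sukun, meem, kasra.
word = "\u0628\u0650\u0633\u0652\u0645\u0650"

def describe(text):
    """Return (code point, Unicode name, is_combining_mark) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch) != 0)
        for ch in text
    ]

rows = describe(word)
base_letters = [row for row in rows if not row[2]]  # the three letters
marks = [row for row in rows if row[2]]             # the three diacritics

# Each of these code points takes two bytes in UTF-8.
utf8_length = len(word.encode("utf-8"))

# A precomposed character: alef with hamza above is one code point, but its
# canonical decomposition (NFD) is plain alef plus a combining hamza.
decomposed = unicodedata.normalize("NFD", "\u0623")

for row in rows:
    print(row)
```

The three-letter word is stored as six characters, which is why a letter count cannot simply take the length of the raw text.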

A single word can carry a large number of diacritic variations, so you cannot do a word search, or even search for a phrase, the way you can in English. One approach to this problem (admittedly not a great solution) is to expunge the diacritics from the text. Arabic text is very difficult to process computationally unless all the diacritics are removed and some characters are modified.
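The search problem can be illustrated with a short Python sketch. This is an illustration only, not the author's code: the table names `HARAKAT` and `FOLD_ALEF` are hypothetical, and the code point range used for the marks is an assumption that covers only the basic harakat.

```python
# Hypothetical translation tables (assumptions, not the author's list).
HARAKAT = {cp: None for cp in range(0x064B, 0x0653)}  # fathatan .. sukun: delete
FOLD_ALEF = {0x0671: 0x0627}                          # alef wasla -> plain alef

# "bismi allahi" as it might appear in a vocalized text.
verse = (
    "\u0628\u0650\u0633\u0652\u0645\u0650 "
    "\u0671\u0644\u0644\u0651\u064E\u0647\u0650"
)
# The bare word "Allah" as a user would type it into a search box.
query = "\u0627\u0644\u0644\u0647"

found_raw = query in verse                    # diacritics break the match

marks_removed = verse.translate(HARAKAT)
found_without_marks = query in marks_removed  # alef wasla still differs

folded = marks_removed.translate(FOLD_ALEF)
found_folded = query in folded                # now the bare letters line up

print(found_raw, found_without_marks, found_folded)
```

In this sample, removing the marks alone is not enough: the word only matches once the alef variant has also been reduced to a plain alef, which is the kind of character modification mentioned above.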

The aim of this article is to find out some statistics about the Quran using quick computational methods. Doing so manually would take too long and is subject to human error.

Code and Text


In the analysis below, the scripts ran on a diacritic-free rendering of the minimal "Uthmani" copy of the Quran (v1.0.2), obtained from the Tanzil project. The XML copy of the Uthmani Quran, licensed under Creative Commons BY-ND 3.0 Unported, was downloaded and parsed with a PHP script to produce a diacritic-free XML file, which is what the other scripts use. The new file carries the same license.

The scripts in the downloadable archive were written in PHP 5 (5.3) running on WAMP on Windows XP, and are meant to be run from a web browser. If you wish to run them, either upload them to a PHP 5 enabled server, or install WAMP or a similar package on your own system, and then visit the pages in your web browser. The diacritic removal code was originally written by the author in Perl, but it has been modified and ported to PHP for this purpose. The PHP code files are released under the Apache 2.0 License; the new diacritic-free Quran XML file is the exception, released under Creative Commons BY-ND 3.0 Unported, which it inherits from the Tanzil project's XML file. If you see a bug or logical flaw in the PHP, please inform me. If you download the development folder you can access all the code, data files and results.

General Details


This is all common knowledge:
  • There are 114 surahs (chapters) in the Quran.
  • There are 30 ajza' (volumes/parts) to the Quran.
  • There are 6236 verses in the Quran (7 verses in the first chapter with Bismillah included but with the initial Bismillah not included for other surahs, otherwise it is: 112 + 6236 = 6348).
  • The "Bismillah" opening phrase is mentioned at the beginning of 113 Surahs and once in the text of Surat al-Naml, so 114 times in total in the whole Quran.
  • The most common print of the Arabic Quran contains approximately 604 pages.
  • The longest chapter, Surat al-Baqarah, contains 286 verses.
  • The shortest chapter, Surat al-Kawthar, contains 3 verses.
  • The longest ayah (verse) is in Surat al-Baqarah, verse 2.282.
  • The shortest ayat (verses) are two letters long and are present in numerous surahs, such as Taha (20.1).
  • The shortest ayah (verse) with an actual word is in Surat al-Rahman, verse 55.1.
The things that I independently discovered are listed in the Results section near the bottom of the page.

Creating the Diacritic-free XML Quran File


The minimal "Uthmani" soft copy of the Quran, version 1.0.2, was obtained from the Tanzil project. A PHP script was written to read the text from this file, remove the diacritics, reduce the various forms of the aleph character to their simplest representation (ا), and write the result to another XML file.

The original file from Tanzil is called quran-uthmani-min.xml. The PHP script, create_clear_uthmani_xml.php, should be run in the browser to create the diacritic-free XML file, quran_uthmani_clear.xml.
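The transformation can be sketched in Python as follows. This is not a port of create_clear_uthmani_xml.php. It assumes a layout of `sura` elements containing `aya` elements with a `text` attribute, treats every Unicode nonspacing mark (category Mn) as a diacritic, and folds an assumed set of four aleph variants; the function names and the inline sample are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Tiny stand-in for the source file; the element and attribute names are
# an assumption about the XML layout.
SAMPLE_XML = (
    '<quran><sura index="1" name="sample">'
    '<aya index="1" text="\u0628\u0650\u0633\u0652\u0645\u0650 '
    '\u0671\u0644\u0644\u0651\u064E\u0647\u0650"/>'
    '</sura></quran>'
)

# Assumed set of aleph variants, each reduced to plain alef (U+0627).
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda above
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0671": "\u0627",  # alef wasla
})

def clear_text(text):
    """Drop nonspacing marks, then fold the aleph variants."""
    no_marks = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return no_marks.translate(ALEF_VARIANTS)

def clear_quran(xml_string):
    """Return the XML with every aya's text attribute cleared."""
    root = ET.fromstring(xml_string)
    for aya in root.iter("aya"):
        aya.set("text", clear_text(aya.get("text", "")))
    return ET.tostring(root, encoding="unicode")

cleared = clear_quran(SAMPLE_XML)
print(cleared)
```

Filtering on the Mn category is a broad rule chosen for brevity. A script that needs to keep or drop specific signs would list the code points explicitly instead.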

The Number of Verses in the Quran


Mind-boggling as it sounds, most Muslims have no idea how many verses there are in the Quran, and some actually debate the subject with no knowledge. Some people even claim the Quran has 6666 verses, which is a malicious lie meant to discredit Islam in the eyes of Christians, who see this number as the mark of the Antichrist, "The Number of the Beast"! In reality, the Quran contains 6236 verses.

The PHP script number_of_verses.php confirms that the Quran has 6236 verses, and it also shows how many verses there are in each surah (chapter). Run the PHP script in your browser to view the results. It will also write the results to the files number_of_verses.csv and number_of_verses.html in the same directory.
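The counting itself is simple once the XML is parsed. The Python sketch below shows the idea on a three-verse stand-in document; it is not the author's script, the `sura`/`aya` layout is an assumption about the XML file, and the CSV column names are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Stand-in document with two tiny "surahs"; layout is assumed.
SAMPLE_XML = """<quran>
  <sura index="1" name="A"><aya index="1" text="x"/><aya index="2" text="y"/></sura>
  <sura index="2" name="B"><aya index="1" text="z"/></sura>
</quran>"""

def verses_per_sura(xml_string):
    """Map each sura index to the number of aya elements it contains."""
    root = ET.fromstring(xml_string)
    return {int(s.get("index")): len(s.findall("aya")) for s in root.iter("sura")}

def to_csv(counts):
    """Render the counts as CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sura", "verses"])
    for index in sorted(counts):
        writer.writerow([index, counts[index]])
    return buffer.getvalue()

counts = verses_per_sura(SAMPLE_XML)
total = sum(counts.values())
print(total)
print(to_csv(counts))
```

Run against the full file, the sum over all surahs is the figure to compare with 6236.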

The Number of Arabic Letters in the Quran


The Arabic language has 28 base letters and a few extra letters/representations. A script, letter_frequency_uthmani.php, was written to analyze the characters in the Quran. The script finds the total number of letters in the Quran, the number of letters in each surah (chapter), the frequency of each letter in the Quran and the frequency of each letter in each surah. The Arabic letters we're counting are:

(ى ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء)

For the character count, every instance of taa marbuta (ة) is converted to haa (ه) and counted as haa. There are also two compound characters (ئ and ؤ) that are made up of more than one letter. Unicode stores each of these as a single character; for this analysis, however, they are separated into their constituents and counted as two characters.
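These counting rules can be sketched in Python as below. This is an illustration rather than the logic of letter_frequency_uthmani.php: in particular, the article does not say which constituents the compound characters are split into, so the split used here (yeh plus hamza, and waw plus hamza) is an assumption, and the names `LETTERS`, `REWRITE` and `count_letters` are hypothetical.

```python
from collections import Counter

# The 30 characters listed above.
LETTERS = set(
    "ى ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي ء".split()
)

# Rewrites applied before counting. The two-letter splits are an assumption.
REWRITE = {
    "\u0629": "\u0647",        # taa marbuta -> haa
    "\u0626": "\u064A\u0621",  # yeh with hamza -> yeh + hamza (assumed)
    "\u0624": "\u0648\u0621",  # waw with hamza -> waw + hamza (assumed)
}

def count_letters(text):
    """Count listed letters in diacritic-free text, applying the rewrites."""
    counts = Counter()
    for ch in text:
        for piece in REWRITE.get(ch, ch):
            if piece in LETTERS:  # spaces and anything unlisted are skipped
                counts[piece] += 1
    return counts

# Two sample words: one ending in taa marbuta, one containing waw with hamza.
sample = "\u0631\u062D\u0645\u0629 \u064A\u0624\u0645\u0646\u0648\u0646"
counts = count_letters(sample)
total = sum(counts.values())
print(total, counts.most_common())
```

The sample holds ten stored letters but yields a count of eleven, because the compound character contributes two letters. Changing `REWRITE` is the place to apply different counting rules.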

Run letter_frequency_uthmani.php in your browser and it will display the results. In addition, it will produce these CSV files with the results: number_of_chars_per_surah.csv, quran_letter_frequency.csv and letter_frequency_uthmani.csv. An HTML file of the results, letter_frequency_uthmani.html, is also made. If you want to modify the script to count characters according to your own rules, you are most welcome to.

The Number of Words in the Quran


As far as this computational analysis is concerned, a word is a group of consecutive letters separated by a space from the next word. It doesn't try to find the meaning or root of the words in order to identify whether they are essentially the same word. The script considers words with 'huruf' (grammatical letters) joined to them to be different words, so wa-allah is counted separately from Allah. An analogy in English is the pair betwixt and twixt, which would be treated as separate words even though their root and meaning are the same.

Also, in Arabic a word may have one form (the same base character sequence) and yet count as two different words, because the diacritics determine the different meanings; this script does not differentiate between them. The only way to accurately calculate the number of words and how many times they occur in the Quran is by laborious human analysis.

If you run the quran_words.php script it will display the number of words in the Quran, the number of unique words in the Quran, the average length of the unique words, the number of words in each surah (chapter) and the frequency with which each word occurs in the Quran. The script creates two CSV results files, words_in_surahs.csv and word_frequencies.csv, and an HTML file of the results, quran_words.html.
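The word definition used here (letters separated by spaces, no root analysis) can be sketched in a few lines of Python. This is a sketch under that definition only, not the code of quran_words.php; the function name `word_stats` and the two sample verses are hypothetical, and word length is measured in stored characters.

```python
from collections import Counter

def word_stats(verses):
    """Count space-separated words across a list of diacritic-free verses."""
    words = [word for verse in verses for word in verse.split()]
    freq = Counter(words)
    unique = list(freq)
    average = sum(len(w) for w in unique) / len(unique) if unique else 0.0
    return {
        "total": len(words),
        "unique": len(unique),
        "avg_unique_len": average,
        "freq": freq,
    }

# Sample: "bsm allh" and "wallh allh" in bare letters. The form with the
# joined waw counts as a different word from the plain form.
verses = [
    "\u0628\u0633\u0645 \u0627\u0644\u0644\u0647",
    "\u0648\u0627\u0644\u0644\u0647 \u0627\u0644\u0644\u0647",
]
stats = word_stats(verses)
print(stats["total"], stats["unique"], stats["avg_unique_len"])
print(stats["freq"].most_common(1))
```

Because the comparison is on exact character sequences, any change to the diacritic or aleph rules applied beforehand will shift the unique word count.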

Results


The script number_of_verses.php tells us that:
  • There are 6236 verses in the Quran, and
  • it lists the number of verses in each surah (chapter).
  • The largest surah (chapter), (2) Al-Baqarah (The Cow), contains 286 verses, and
  • the shortest surah, (108) Al-Kawthar (The Abundance), contains 3 verses.
The script letter_frequency_uthmani.php tells us that:
  • The total number of letters in the Quran is 327293, and
  • it lists the number of letters in each surah (chapter), and
  • it lists the frequency of each alphabetic letter in the whole Quran, and
  • finally it lists the frequency with which each alphabetic letter appears in each surah.
  • The largest surah in the Quran, (2) Al-Baqarah (The Cow), contains 25986 letters, and
  • the smallest surah, (108) Al-Kawthar (The Abundance), contains 43 letters.
  • The three most common letters in the Quran are aleph, laam, and noon ( ا ل ن ) and they occur 52655, 38102 and 27268 times respectively.
The script quran_words.php tells us that:
  • The number of words in the Quran is 77430, and
  • the number of unique words in the Quran is 14716, and
  • the average length of those unique words is 5.30 letters.
  • It shows a table with the number of words in each surah, and
  • it also shows a table with the frequency with which words occur in the Quran.
  • The longest surah, (2) Al-Baqarah (The Cow), contains 6116 words, and
  • the shortest surah, (108) Al-Kawthar (The Abundance), contains 10 words.
  • The three most common words in the Quran are min/men (from/who), Allah (God) and inna (is, if, etc.) ( من, الله, ان ) and they occur 2763, 2153 and 1604 times respectively.

Summary


You will find all the files mentioned above in this zipped archive (download).

I do not claim that everything is perfect, but I have made a solid attempt to be accurate and have double-checked most things. If you happen to notice anything that I missed, then please contact me so I may correct it.

If you are looking for a specific thing, you can use your browser's find feature (click Edit > Find from the menu) to search the results pages produced by the scripts for the values you are looking for. Mozilla Firefox is recommended.

It has not escaped our notice that the data we present here can be used to discredit the beliefs of heretical cults who base their faith on certain false ideas of the numeric basis of the Quran. We foresee the data being shared with all honesty and being used to ascertain facts, denounce spurious claims and to vanquish myths.

If you discover something interesting then please contact us and we might publish it on this site.
