By Nirmaldasan
(nirmaldasan@hotmail.com)
Of the two factors in the New Dale-Chall Readability Index of 1995, the number of complete sentences (S) in a sample of 100 words is arguably the simplest of all variables. It accounts for the syntactic difficulty of any text. I used it as the sole variable in the ‘Simplicity Score of Business Writing’ (October 2014).
A readability formula also needs a factor to measure semantic complexity. This may be ASW (average syllables per word), ALW (average letters per word), or the percentage of unfamiliar or polysyllabic words. Without committing to any one of these for the time being, let semantic complexity be measured by the percentage of difficult words (D).
To formulate a Generalised Readability Index (GRI), we also need a readability constant (r). If by some means we know the expected percentage of difficult words (EPD), then r = 50 – EPD. If EPD is 0, then r = 50; and if EPD is 50, then r = 0.
Then, GRI = (D + r) / S
This formula measures the grade level of texts on a scale of 1 to 17+ years of schooling.
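The two definitions above, r = 50 – EPD and GRI = (D + r) / S, can be written as a minimal Python sketch. The function names are illustrative and not from the article, and the values D = 20 and S = 5 in the last line are hypothetical numbers chosen only to show the arithmetic.

```python
def readability_constant(epd):
    """Return r = 50 - EPD, where EPD is the expected percentage of difficult words."""
    return 50 - epd


def gri(d, r, s):
    """Return GRI = (D + r) / S for a 100-word sample.

    d: percentage of difficult words, r: readability constant,
    s: number of complete sentences per 100 words.
    """
    if s <= 0:
        raise ValueError("S must be positive: the sample needs at least one sentence")
    return (d + r) / s


print(readability_constant(0))   # 50, as stated in the text
print(readability_constant(50))  # 0, as stated in the text

# Hypothetical sample: D = 20, EPD = 25, S = 5 gives (20 + 25) / 5
print(gri(20, readability_constant(25), 5))  # 9.0
```

Since S is the divisor, the sketch rejects a sample with no complete sentence rather than dividing by zero; that guard is an addition of the sketch, not part of the formula.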
I have not offered any proof. But, as they say, the proof of the pudding is in the eating. A generalised formula is useless unless we know exactly what D and r are. We now need to become specific. I’ll demonstrate the utility of GRI in deriving new readability formulas.
Example 1: Let the number of uncommon words U be a measure of semantic complexity. The expected percentage of difficult (uncommon) words EPD = 50 since the 100 commonest words account for 50% of any text. For clarification, do take a look at my article ‘The Lemma Readability Index’ (January 2016). So, r = 50 – EPD = 50 – 50 = 0. Therefore, in a sample of 100 words, the Index = (U + 0) / S = U / S. But this is simply the average number of uncommon words in a sentence. Thus the formula is just another form of the Lemma Readability Index.
Example 2: The average syllable has three characters (one vowel letter and two consonants). A disyllabic word may have six characters; and a polysyllabic word, more than six characters. So any word with more than six characters may be called a long word. Let the percentage of long words L(>6) be a measure of semantic complexity. The EPD may be calculated from the distribution of word lengths presented in Peter Norvig’s article titled ‘English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU’, which is available at http://norvig.com/mayzner.html
Interestingly, words of six or fewer characters account for 75% of all words. So EPD = 100 – 75 = 25. Now, r = 50 – EPD = 50 – 25 = 25. Therefore, in a sample of 100 words, the Index = (L(>6) + 25) / S.
Example 3: The average word has five characters (two vowel letters and three consonants). Any word with more than five characters may be called a long word. Let the percentage of long words L(>5) be a measure of semantic complexity. Consulting Peter Norvig’s distribution of word lengths, we notice that words of five or fewer characters account for 67% of all words. So EPD = 100 – 67 = 33. Now, r = 50 – EPD = 50 – 33 = 17. Therefore, in a sample of 100 words, the Index = (L(>5) + 17) / S.
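The readability constants in Examples 2 and 3 follow the same two steps: subtract the share of short words from 100 to get EPD, then subtract EPD from 50. A small Python sketch of that derivation is given here. The function name is illustrative, and the shares of 75% and 67% are simply the figures quoted above, not values recomputed from Norvig's data.

```python
def constant_from_short_word_share(short_share):
    """Derive r from the percentage of words at or below the length cut-off.

    EPD = 100 - short_share (expected percentage of long words),
    r   = 50 - EPD.
    """
    epd = 100 - short_share
    return 50 - epd


# (cut-off length, share of words at or below it), as quoted in the examples
for max_len, share in [(6, 75), (5, 67)]:
    r = constant_from_short_word_share(share)
    print(f"L(>{max_len}): r = {r}")
# L(>6): r = 25
# L(>5): r = 17
```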
Application: Let us test the three new indices on the following text:
“The first batch of students of the Certificate in Online Journalism programme announced that they are online at the viva voce examination on Saturday (29 November 2014). They have created for themselves a website, a blog and a twitter account.”
Words = 40 and Sentences = 2
Uncommon words = 17 (batch, students, Certificate, Online, Journalism, programme, announced, online, viva, voce, examination, created, themselves, website, blog, twitter, account)
Words of more than six characters = 13 (students, Certificate, Journalism, programme, announced, examination, Saturday, November, created, themselves, website, twitter, account)
Words of more than five characters = 15 (students, Certificate, Online, Journalism, programme, announced, online, examination, Saturday, November, created, themselves, website, twitter, account)
From the above data, we get:
U% = (17/40) * 100 = 42.5
S% = (2/40) * 100 = 5
U-Index = (U / S) = 42.5 / 5 = 8.5 years of schooling.
L(>6)% = (13/40) * 100 = 32.5
L(>6)-Index = (L(>6) + 25) / S = (32.5 + 25) / 5 = 11.5 years of schooling.
L(>5)% = (15/40) * 100 = 37.5
L(>5)-Index = (L(>5) + 17) / S = (37.5 + 17) / 5 = 10.9 years of schooling.
Note: It is better to take a sample of 100 words and thus avoid the calculation of percentages.
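The two long-word indices can be checked mechanically with a short Python sketch. It rests on assumptions that are not spelled out in the article: a word is any run of letters or digits (so "29" and "2014" count as words, which matches the total of 40), and every run of full stops, question marks or exclamation marks ends one sentence. The U-Index is left out because it needs the list of the 100 commonest words, which is not reproduced here.

```python
import re

SAMPLE = (
    "The first batch of students of the Certificate in Online Journalism "
    "programme announced that they are online at the viva voce examination "
    "on Saturday (29 November 2014). They have created for themselves a "
    "website, a blog and a twitter account."
)


def tokenize(text):
    # Assumption: a word is any run of letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def count_sentences(text):
    # Assumption: each run of '.', '!' or '?' closes one sentence.
    return len(re.findall(r"[.!?]+", text))


def per_hundred(count, total_words):
    # Scale a raw count to a 100-word sample.
    return 100 * count / total_words


def long_word_index(text, max_short_len, r):
    """Return (L + r) / S, where long words have more than max_short_len characters."""
    words = tokenize(text)
    long_count = sum(1 for w in words if len(w) > max_short_len)
    l_pct = per_hundred(long_count, len(words))
    s_pct = per_hundred(count_sentences(text), len(words))
    return (l_pct + r) / s_pct


print(len(tokenize(SAMPLE)), count_sentences(SAMPLE))  # 40 2
print(round(long_word_index(SAMPLE, 6, 25), 1))        # 11.5
print(round(long_word_index(SAMPLE, 5, 17), 1))        # 10.9
```

With a sample of exactly 100 words, `per_hundred` returns the raw counts unchanged, which is the shortcut the note recommends.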