The Dale-Chall readability formula uses a list of 3000 familiar words, and its scores correlate very highly with text difficulty. However, readability formulae that do not use a word list, such as Robert Gunning’s Fog Index, are more popular because they are easier to apply. Still, there is no reason to discard the word-list approach, since it tests every word of a text. Let us look at a much shorter list: the 100 commonest words, which together account for about 50% of the more than two billion words in the Oxford English Corpus. The list, in rank order, appears in an article titled ‘The OEC Facts About The Language’: http://www.oxforddictionaries.com/words/the-oec-facts-about-the-language
The list is built on lemmas, ‘a lemma being the base form of a word’. Arranging the lemmas alphabetically makes the list easier to use for measuring readability.
Commonest Lemma List
a about after all also an and any as at (10 lemmas)
back be because but by (5 lemmas)
can come could (3 lemmas)
day do (2 lemmas)
even (1 lemma)
first for from (3 lemmas)
get give go good (4 lemmas)
have he her him his how (6 lemmas)
I if in into it its (6 lemmas)
just (1 lemma)
know (1 lemma)
like look (2 lemmas)
make me most my (4 lemmas)
new no not now (4 lemmas)
of on one only or other our out over (9 lemmas)
people (1 lemma)
say see she so some (5 lemmas)
take than that the their them then there these they think this time to two (15 lemmas)
up us use (3 lemmas)
want way we well what when which who will with work would (12 lemmas)
year you your (3 lemmas)
The Lemma Readability Index (LRI) measures texts on a scale of 1 to 17 years of schooling. The LRI is the average number of words per sentence that are not in the Commonest Lemma List. Take a sample of n sentences from a text and count the Words Not in List (WNL) using the guidelines that follow. Then LRI = WNL/n.
- Do not count proper names (names of people, places, days, months, organisations … )
- Do not count numerals, symbols, abbreviations, acronyms
- Do not count lemmas that are in the list
- Do not count words that are grammatically associated with the lemmas in the list. Some examples:
  - Since be is in the list, do not count being, am, are, is, was, were
  - Since take is in the list, do not count taken, taker, takers, takes, taking, took
  - Since new is in the list, do not count newer, newest, newly, news, newsy
  - Since time is in the list, do not count timed, timely, timer, times, time’s, timing
- Do not count compound words if every part is in the list. Some examples:
  - Since some and how are in the list, do not count somehow
  - Since any and way are in the list, do not count anyway
  - Since an and other are in the list, do not count another
  - Since good and will are in the list, do not count goodwill
- Count compound words, as many times as they appear, if even one part is not in the list. Some examples:
  - Since how is in the list but ever is not, count however
  - Since will is in the list but free is not, count freewill
- Count every single word (even repetitions) which is neither in the list nor grammatically associated with the lemmas in the list
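The guidelines above can be sketched in code. What follows is only a minimal illustration, not a definitive implementation. The names `COMMON_LEMMAS`, `ASSOCIATED_FORMS`, `EXEMPT_COMPOUNDS`, `words_not_in_list` and `lemma_readability_index` are hypothetical. The associated forms and exempt compounds are hand-made lookup tables holding only the examples given in the guidelines, so a real text would need far larger tables. Proper names, abbreviations and acronyms are assumed to be supplied by the reader through the `skip` argument.

```python
import re

# The 100 lemmas of the Commonest Lemma List ("I" is stored in lower case).
COMMON_LEMMAS = set("""
a about after all also an and any as at
back be because but by
can come could
day do
even
first for from
get give go good
have he her him his how
i if in into it its
just
know
like look
make me most my
new no not now
of on one only or other our out over
people
say see she so some
take than that the their them then there these they think this time to two
up us use
want way we well what when which who will with work would
year you your
""".split())

# Assumption: only the example forms from the guidelines are listed here.
ASSOCIATED_FORMS = {
    "being", "am", "are", "is", "was", "were",
    "taken", "taker", "takers", "takes", "taking", "took",
    "newer", "newest", "newly", "news", "newsy",
    "timed", "timely", "timer", "times", "time's", "timing",
}

# Assumption: compounds are looked up by hand rather than split automatically.
EXEMPT_COMPOUNDS = {"somehow", "anyway", "another", "goodwill"}


def words_not_in_list(text, skip=()):
    """Return the words of `text` that the guidelines say should be counted."""
    skip = {s.lower() for s in skip}
    counted = []
    for token in re.findall(r"[A-Za-z0-9'’]+", text):
        word = token.lower().replace("’", "'").strip("'")
        if not word or any(ch.isdigit() for ch in word):
            continue  # empty after stripping quotes, or a numeral
        if word in skip:
            continue  # proper names, abbreviations, acronyms given by the caller
        if word in COMMON_LEMMAS or word in ASSOCIATED_FORMS:
            continue  # listed lemma or grammatically associated form
        if word in EXEMPT_COMPOUNDS:
            continue  # compound whose parts are all in the list
        counted.append(token)  # everything else counts, repetitions included
    return counted


def lemma_readability_index(text, n_sentences, skip=()):
    """LRI = WNL / n, where n is the number of sentences in the sample."""
    return len(words_not_in_list(text, skip)) / n_sentences


sentence = "They took another look at the news, however, on 3 May."
print(words_not_in_list(sentence, skip={"May"}))           # ['however']
print(lemma_readability_index(sentence, 1, skip={"May"}))  # 1.0
```

Lookup tables are used instead of automatic splitting because a naive splitter would wrongly exempt words such as soon (so + on), which are not compound words.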
These guidelines solve most of the counting problems, but one is likely to come across a number of deceptive words. For instance, a and do are both in the list, so ado seems to qualify for exclusion under the fifth guideline. However, ado has to be counted because it is not a compound word. Again, take the words better and more. Neither is in the list, yet one is tempted to exclude them from the count for semantic reasons: better is related to good, and more is related to some and most. Resist the temptation and count every deceptive word. Remember that if we did not count more, we could not count moreover either. Let’s not quibble.
Let us apply the formula to the following paragraph:
“The first batch of students of the Certificate in Online Journalism programme announced that they are online at the viva voce examination on Saturday (29 November 2014). They have created for themselves a website, a blog and a twitter account.”
Let us follow the counting guidelines.
- Proper names are not counted (Saturday, November)
- Numerals are not counted (29, 2014)
- Lemmas in the list are not counted (The, first, of, in, that, they, at, on, have, for, a, and)
- Words grammatically associated with the lemmas are not counted (are)
- Compound words in which every part is in the list are not counted (none in this sample)
- Compound words in which even one part is not in the list are counted (Online, online, themselves, website)
- Every single word which is neither in the list nor grammatically associated with the lemmas in the list is counted (batch, students, Certificate, Journalism, programme, announced, viva, voce, examination, created, blog, twitter, account)
In order of appearance, the words not in the list are: batch, students, Certificate, Online, Journalism, programme, announced, online, viva, voce, examination, created, themselves, website, blog, twitter, account. WNL = 17.
Since the number of sentences in the sample is 2, LRI = WNL/n = 17/2 = 8.5 years of schooling.
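The tally can be checked mechanically. The sketch below is illustrative only: it copies the categories from the worked example into a hypothetical `CATEGORIES` table, classifies every token of the sample paragraph, and recomputes WNL and the LRI. The tokenising pattern and the names `SAMPLE`, `classify` and `tally` are assumptions made for this example.

```python
import re
from collections import Counter

SAMPLE = (
    "The first batch of students of the Certificate in Online Journalism "
    "programme announced that they are online at the viva voce examination "
    "on Saturday (29 November 2014). They have created for themselves a "
    "website, a blog and a twitter account."
)
N_SENTENCES = 2

# Word lists copied from the worked example, lower-cased for matching.
CATEGORIES = {
    "proper name": {"saturday", "november"},
    "numeral": {"29", "2014"},
    "listed lemma": {"the", "first", "of", "in", "that", "they", "at",
                     "on", "have", "for", "a", "and"},
    "associated form": {"are"},
    "counted": {"batch", "students", "certificate", "online", "journalism",
                "programme", "announced", "viva", "voce", "examination",
                "created", "themselves", "website", "blog", "twitter",
                "account"},
}


def classify(token):
    """Return the category of a token, or fail if the example missed it."""
    word = token.lower()
    for label, words in CATEGORIES.items():
        if word in words:
            return label
    raise ValueError(f"unclassified word: {token}")


tokens = re.findall(r"[A-Za-z0-9]+", SAMPLE)
tally = Counter(classify(t) for t in tokens)

wnl = tally["counted"]   # repetitions count, so "online" is tallied twice
lri = wnl / N_SENTENCES

print(len(tokens), wnl, lri)  # 40 17 8.5
```

Every one of the 40 words falls into exactly one category, which confirms that nothing in the sample was left out of the count.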
Let us compare the LRI with the Fog Index (FI).
Average Words per Sentence (AWS) = 40/2 = 20
Percentage of hard words (P) = (1/40)*100 = 2.5 [Not all polysyllables are hard. In this example, Certificate and Journalism are not counted as hard because they are part of the name of a programme. The only hard word is examination]
FI = 0.4*(AWS+P) = 0.4*(20+2.5) = 0.4*22.5 = 9 years of schooling.
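The same arithmetic can be written as a small function, which makes it easy to repeat the comparison on other samples. This is a sketch only: the function name `fog_index` and its parameters are hypothetical, and the number of hard words is passed in by hand because deciding which polysyllables are hard is a judgement call, as the note above shows.

```python
def fog_index(total_words, sentences, hard_words):
    """Fog Index as computed above: 0.4 * (AWS + P)."""
    aws = total_words / sentences        # average words per sentence
    p = hard_words / total_words * 100   # percentage of hard words
    return 0.4 * (aws + p)


fi = fog_index(total_words=40, sentences=2, hard_words=1)
lri = 17 / 2  # WNL / n from the worked example

print(f"FI = {fi:.1f}, LRI = {lri:.1f}, difference = {fi - lri:.1f}")
# FI = 9.0, LRI = 8.5, difference = 0.5
```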
In this example the LRI (8.5) compares very well with the FI (9). A single sample proves little, however: one needs to test the validity of the LRI on at least 100 samples. Please go ahead and put the LRI to the test. Thank you.
Direct Dale-Chall Grading: https://strainindex.wordpress.com/2008/03/10/direct-dale-chall-grading/
Plain Fog Index: https://strainindex.wordpress.com/2010/05/11/the-plain-fog-index/
Readability Conjectures: https://strainindex.wordpress.com/2008/05/16/readability-conjectures/