The Lemma Readability Index

January 17, 2016

By Nirmaldasan


The Dale-Chall readability formula uses a list of 3000 familiar words, and its scores correlate very highly with text difficulty. However, readability formulae that do not use a list, such as Robert Gunning’s Fog Index, are more popular because they are easier to apply. Still, there is no reason to discard the list approach, since it tests each word of a text. Let us look at a shorter list of the 100 commonest words, which typically cover 50% of the more than two billion words in the Oxford English Corpus. This list, in rank order, is found in an article titled ‘The OEC Facts About The Language’.

The list uses the idea of lemmas, ‘a lemma being the base form of a word’.  An alphabetical arrangement of the words would help us use the list for measuring readability.

Commonest Lemma List

a  about  after  all  also  an  and  any  as  at  (10 lemmas)

back  be  because  but  by  (5 lemmas)

can  come  could  (3 lemmas)

day  do  (2 lemmas)

even  (1 lemma)

first  for  from  (3 lemmas)

get  give  go  good  (4 lemmas)

have  he  her  him  his  how  (6 lemmas)

I  if  in  into  it  its  (6 lemmas)

just  (1 lemma)

know  (1 lemma)

like  look  (2 lemmas)

make  me  most  my  (4 lemmas)

new  no  not  now  (4 lemmas)

of  on  one  only  or  other  our  out  over  (9 lemmas)

people (1 lemma)

say  see  she  so  some  (5 lemmas)

take  than  that  the  their  them  then  there  these  they  think  this  time  to  two  (15 lemmas)

up  us  use (3 lemmas)

want  way  we  well  what  when  which  who  will  with  work  would  (12 lemmas)

year  you  your (3 lemmas)
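For anyone who would rather test words by machine than by eye, the list can be held as a set. This is only a sketch: the names COMMONEST_LEMMAS and in_list are my own, and the pronoun I is stored in lower case so that lookups can ignore capitals.

```python
# The 100 commonest lemmas, transcribed from the alphabetical list above.
# "I" is stored as "i" so that lookups can be case-insensitive.
COMMONEST_LEMMAS = frozenset("""
a about after all also an and any as at
back be because but by
can come could
day do
even
first for from
get give go good
have he her him his how
i if in into it its
just
know
like look
make me most my
new no not now
of on one only or other our out over
people
say see she so some
take than that the their them then there these they think this time to two
up us use
want way we well what when which who will with work would
year you your
""".split())


def in_list(word):
    """Return True if the word, lower-cased, is one of the listed lemmas."""
    return word.lower() in COMMONEST_LEMMAS


print(len(COMMONEST_LEMMAS))             # 100
print(in_list("The"), in_list("more"))   # True False
```

The length check is a quick guard against a lemma being dropped or typed twice while copying the list.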

New Formula

The Lemma Readability Index (LRI) measures texts on a scale of 1 to 17 years of schooling. The LRI is the average number of words per sentence that are not in the Commonest Lemma List. Take a sample of n sentences from a text and count the Words Not in List (WNL). Then LRI = WNL/n.
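The arithmetic is simple enough to write as a one-line function. The function name and the sample figures below are hypothetical and serve only to illustrate the division.

```python
def lemma_readability_index(words_not_in_list, sentences):
    """LRI = WNL / n, read as years of schooling."""
    if sentences <= 0:
        raise ValueError("the sample must contain at least one sentence")
    return words_not_in_list / sentences


# Hypothetical sample: 30 counted words spread over 5 sentences.
print(lemma_readability_index(30, 5))  # 6.0
```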

Counting Guidelines

  1. Do not count proper names (names of people, places, days, months, organisations … )
  2. Do not count numerals, symbols, abbreviations, acronyms
  3. Do not count lemmas that are in the list
  4. Do not count words that are grammatically associated with the lemmas in the list. Some examples:
       a. Since be is in the list, do not count being, am, are, is, was, were
       b. Since take is in the list, do not count taken, taker, takers, takes, taking, took
       c. Since new is in the list, do not count newer, newest, newly, news, newsy
       d. Since time is in the list, do not count timed, timely, timer, times, time’s, timing
  5. Do not count compound words, if each part is in the list. Some examples:
       a. Since some and how are in the list, do not count somehow
       b. Since any and way are in the list, do not count anyway
       c. Since an and other are in the list, do not count another
       d. Since good and will are in the list, do not count goodwill
  6. Count compound words as many times as they appear if even one part is not in the list. Some examples:
       a. Since how is in the list but ever is not, count however
       b. Since will is in the list but free is not, count freewill
  7. Count every single word (even repetitions) which is neither in the list nor grammatically associated with the lemmas in the list
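The rules about listed lemmas, associated forms and compound words can be roughly sketched in code. Several things here are assumptions of mine: LEMMAS is only an excerpt of the full list, ASSOCIATED holds just the forms given as examples above, and NOT_COMPOUNDS is a hand-kept set for words that merely look like compounds. Proper names, numerals and the like are assumed to have been removed by hand.

```python
# Excerpt of the Commonest Lemma List: only the lemmas these examples need.
LEMMAS = {"a", "an", "any", "be", "do", "good", "how", "new", "other",
          "some", "take", "time", "way", "will"}

# Forms grammatically associated with a lemma, copied from the examples in
# the guidelines. A real tool would need a far more complete table.
ASSOCIATED = {
    "being", "am", "are", "is", "was", "were",
    "taken", "taker", "takers", "takes", "taking", "took",
    "newer", "newest", "newly", "news", "newsy",
    "timed", "timely", "timer", "times", "time's", "timing",
}

# Words that only look like compounds of listed lemmas ("a" + "do").
# Hypothetical exception set, to be extended by hand.
NOT_COMPOUNDS = {"ado"}


def is_listed_compound(word):
    """True if the word splits into two parts that are both listed lemmas."""
    if word in NOT_COMPOUNDS:
        return False
    return any(word[:i] in LEMMAS and word[i:] in LEMMAS
               for i in range(1, len(word)))


def is_counted(word):
    """True if the word adds to WNL under the rules sketched here."""
    w = word.lower().replace("’", "'")
    if w in LEMMAS or w in ASSOCIATED or is_listed_compound(w):
        return False
    return True


for w in ["somehow", "anyway", "another", "goodwill", "taking",
          "however", "freewill", "ado"]:
    print(w, is_counted(w))
```

The first five words print False (not counted) and the last three print True (counted). A naive split would wrongly let ado through as a plus do, which is why the exception set is there.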

These guidelines solve most of the counting problems. But one is likely to come across a number of deceptive words. For instance, a and do are in the list, so ado begs not to be counted (fifth guideline). However, ado has to be counted because it is not a compound word. Again, take the words better and more. Though neither is in the list, one is tempted to exclude them from the count for semantic reasons. Better is related to good, and more is related to some and most. Resist the temptation and count every deceptive word. Remember that if we do not count more, then we cannot count moreover either. Let’s not quibble.


Let us apply the formula to the following paragraph:

“The first batch of students of the Certificate in Online Journalism programme announced that they are online at the viva voce examination on Saturday (29 November 2014). They have created for themselves a website, a blog and a twitter account.”

Let us follow the counting guidelines.

  1. Proper names are not counted (Saturday, November)
  2. Numerals are not counted (29, 2014)
  3. Lemmas in the list are not counted (The, first, of, in, that, they, at, on, have, for, a, and)
  4. Words grammatically associated with the lemmas are not counted (are)
  5. Compound words, if each part is in the list, are not counted (— )
  6. Compound words with even one part not in the list are counted (Online, online, themselves, website)
  7. Every single word which is neither in the list nor grammatically associated with the lemmas in the list is counted (batch, students, Certificate, Journalism, programme, announced, viva, voce, examination, created, blog, twitter, account)

In order of appearance, here are the words not in the list: batch, students, Certificate, Online, Journalism, programme, announced, online, viva, voce, examination, created, themselves, website, blog, twitter, account. WNL = 17.

Since the number of sentences in the sample is 2, LRI = WNL/n = 17/2 = 8.5 years of schooling.
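As a cross-check on the tally, every word of the sample can be placed in exactly one group. The grouping follows the worked example; the repeated occurrences of listed lemmas (of, the, a) are written out by me from the paragraph itself.

```python
# Each word of the sample paragraph, grouped by the guideline that decides it.
tally = {
    "proper names": ["Saturday", "November"],
    "numerals": ["29", "2014"],
    "lemmas in list": ["The", "first", "of", "of", "the", "in", "that", "they",
                       "at", "the", "on", "They", "have", "for", "a", "a",
                       "and", "a"],
    "associated forms": ["are"],
    "counted": ["batch", "students", "Certificate", "Online", "Journalism",
                "programme", "announced", "online", "viva", "voce",
                "examination", "created", "themselves", "website", "blog",
                "twitter", "account"],
}

total_words = sum(len(words) for words in tally.values())
wnl = len(tally["counted"])
sentences = 2

print(total_words)      # 40
print(wnl)              # 17
print(wnl / sentences)  # 8.5
```

The groups add up to all 40 words of the paragraph, so no word has been missed or counted twice.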


Let us compare the LRI with the Fog Index (FI).

Average Words per Sentence (AWS) = 40/2 = 20

Percentage of hard words (P) = (1/40)*100 = 2.5 [Not all polysyllables are hard. In this example, Certificate and Journalism are not counted as hard because they are part of the name of a programme. The only hard word is examination.]

FI = 0.4*(AWS+P) = 0.4*(20+2.5) = 0.4*22.5 = 9 years of schooling.
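The same figures can be reproduced with a short script. It is a sketch under stated assumptions: sentences are split on full stops, question marks and exclamation marks, numerals count as words (which gives the 40 used above), and the hard words are supplied by hand rather than detected, since the judgement above is a human one.

```python
import re

SAMPLE = ("The first batch of students of the Certificate in Online Journalism "
          "programme announced that they are online at the viva voce "
          "examination on Saturday (29 November 2014). They have created for "
          "themselves a website, a blog and a twitter account.")

# Hard words are chosen by hand, following the judgement made in the text.
HARD_WORDS = {"examination"}


def fog_index(text, hard_words):
    """FI = 0.4 * (average words per sentence + percentage of hard words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    aws = len(words) / len(sentences)
    hard = sum(1 for w in words if w.lower() in hard_words)
    p = 100 * hard / len(words)
    return 0.4 * (aws + p)


print(round(fog_index(SAMPLE, HARD_WORDS), 1))  # 9.0
```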

On this sample the LRI compares very well with the FI: 8.5 against 9 years of schooling. One sample proves little, though, and the validity of the LRI needs to be tested on at least 100 samples. Please go ahead and put the LRI to the test. Thank you.


