A mathematical approach to selecting Japanese reading material
The motivation
Educators use metrics like Lexile and Flesch-Kincaid scores to help children select suitable reading material for their age and grade level. Many of us benefitted from such a system when we were younger. Is it possible to implement something similar for the languages we’re studying as adults?
Flesch-Kincaid formulas
# larger asl == more difficult
asl = average_sentence_length(text)
# more syllables == more difficult
asw = average_syllables_per_word(text)
# people smarter than me figured out the
# magic constants used below
# Flesch-Kincaid grade level formula
fkgl = (0.39 * asl) + (11.8 * asw) - 15.59
# Flesch-Kincaid reading ease formula
fkre = 206.835 - (1.015 * asl) - (84.6 * asw)
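Here's what those helper functions could look like for English text: a minimal Ruby sketch that tokenizes on whitespace and estimates syllables by counting vowel runs. That heuristic is crude (it isn't the dictionary-based counting behind the published constants), but it's enough to experiment with; the sample text at the bottom is my own.

# crude helpers for English text -- good enough to experiment with
def average_sentence_length(text)
  sentences = text.split(/[.!?]+/).map(&:strip).reject(&:empty?)
  words = sentences.flat_map(&:split)
  words.length.to_f / sentences.length
end

def average_syllables_per_word(text)
  words = text.split(/[^a-zA-Z']+/).reject(&:empty?)
  total = words.sum do |word|
    # count runs of vowels as syllables, with a floor of one per word
    [word.downcase.scan(/[aeiouy]+/).length, 1].max
  end
  total.to_f / words.length
end

text = "The cat sat on the mat. It was not amused."
fkgl = (0.39 * average_sentence_length(text)) +
       (11.8 * average_syllables_per_word(text)) - 15.59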
Here is how the reading ease score is interpreted (the grade level score is self-explanatory):
Score | School Grade (US) | Notes |
---|---|---|
100.0–90.0 | 5th grade | Very easy to read. Easily understood by an average 11-year-old student. |
90.0–80.0 | 6th grade | Easy to read. Conversational English for consumers. |
80.0–70.0 | 7th grade | Fairly easy to read. |
70.0–60.0 | 8th & 9th grade | Plain English. Easily understood by 13- to 15-year-old students. |
60.0–50.0 | 10th to 12th grade | Fairly difficult to read. |
50.0–30.0 | College | Difficult to read. |
30.0–10.0 | College graduate | Very difficult to read. Best understood by university graduates. |
10.0–0.0 | Professional | Extremely difficult to read. Best understood by university graduates. |
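Since we'll eventually want to use this table from scripts, here's one way it could be turned into code (the band labels are my own shorthand for the notes above):

# map a reading ease score to the bands in the table above;
# scores can fall outside 0-100, so the else branch catches the floor
def reading_ease_band(fkre)
  if fkre >= 90 then "5th grade (very easy)"
  elsif fkre >= 80 then "6th grade (easy)"
  elsif fkre >= 70 then "7th grade (fairly easy)"
  elsif fkre >= 60 then "8th-9th grade (plain English)"
  elsif fkre >= 50 then "10th-12th grade (fairly difficult)"
  elsif fkre >= 30 then "college (difficult)"
  elsif fkre >= 10 then "college graduate (very difficult)"
  else "professional (extremely difficult)"
  end
end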
Will the original formula work?
I would like to see how accurate this formula is for grading Japanese reading / listening difficulty. For speech, I think this could still work quite well, but for reading, there might be additional factors to consider:
- Japanese text often contains Chinese characters, called Kanji in Japanese, which obscure the pronunciation of a word for readers who have not yet learned the readings of the characters used to write it.
- Following on from the previous point, Kanji often have multiple “readings” (pronunciations). The correct reading to use is often dependent on the adjacent characters and context.
- Rather than memorize each distinct possible reading in isolation, it is often more effective to learn the reading of the characters in context. A natural consequence of this is that there will be some contexts / situations that are more common than others, and thus some readings will be easier to recognize.
- It’s probably fair to say that older readers & speakers will use rarer kanji compounds & words more often than younger readers & speakers.
To give you something a bit more concrete to work with, here are some examples:
Kanji | Japanese Reading | Romanization | English Meaning |
---|---|---|---|
月 | つき | tsuki | moon |
1月 | いち・がつ | ichi gatsu | January |
1ヶ月 | いっか・げつ | ikka getsu | One month |
So with those points in mind, it might be reasonable to supplement the original Flesch-Kincaid approach with a modified variant specifically for reading languages with complex writing systems like Japanese:
- shorter sentences are easier (same)
- words with fewer syllables are easier (same)
- sentences with more common words are easier (new)
Modified Flesch-Kincaid Formula for Japanese
asl = average_sentence_length(text)
asw = average_syllables_per_word(text)
# We can try using the reciprocal of the percentile
# as a multiplier against a constant penalty
# value to increase the difficulty score whenever
# the text being scored uses a lot of uncommon words
awfp = average_word_frequency_percentile(text)
# Modified Flesch-Kincaid grade level
mfkgl = (x * asl) + (y * asw * (1 / awfp)) - a
# Modified Flesch-Kincaid reading ease
mfkre = 206.835 - (1.015 * asl) - (z * asw * (1 / awfp))
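And here's a hedged sketch of what average_word_frequency_percentile could look like. Unlike the one-argument pseudocode above, this version takes a pre-tokenized word list plus a corpus-derived list of words ordered from most to least common; building that ranked list is exactly the corpus problem discussed later in this post.

# hypothetical sketch: percentile is 1.0 for the most common word and
# approaches 0.0 for the rarest; words missing from the corpus are
# treated as maximally rare
def average_word_frequency_percentile(words, ranked_corpus_words)
  rank_of = ranked_corpus_words.each_with_index.to_h
  percentiles = words.map do |word|
    rank = rank_of.fetch(word, ranked_corpus_words.length - 1)
    1.0 - (rank.to_f / ranked_corpus_words.length)
  end
  percentiles.sum / percentiles.length
end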
So now that we have a basic strategy for determining the age / grade level of a particular Japanese text, we can start doing a bit of testing to find what constant values for x, y, z, and a get us to a reasonably accurate score. If anything, we could forego the constants and use the uncorrected value to get an idea of what kind of effect the newly introduced variable is having on the original Flesch-Kincaid score.
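One low-tech way to do that testing is a brute-force grid search: score a handful of texts whose grade levels we’ve assigned by hand, then keep whichever constants minimize the total error. A sketch, with made-up sample values and step ranges:

# brute-force search for (x, y, a) in the modified grade level formula;
# the samples and step ranges below are made up for illustration
samples = [
  # [asl, asw, awfp, hand-assigned grade level]
  [5.25, 2.64, 0.80, 2.0],
  [9.10, 3.10, 0.55, 6.0],
]

best = nil
0.1.step(1.0, 0.1) do |x|
  1.0.step(12.0, 0.5) do |y|
    5.0.step(20.0, 0.5) do |a|
      error = samples.sum do |asl, asw, awfp, expected|
        mfkgl = (x * asl) + (y * asw * (1 / awfp)) - a
        (mfkgl - expected).abs
      end
      best = [error, x, y, a] if best.nil? || error < best[0]
    end
  end
end
error, x, y, a = best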
A major consideration with this approach is that you need a high-quality corpus of Japanese that is representative of the media the learner will be exposed to in real life. It’s not a trivial problem to solve, so we’ll put a pin in this strategy and come back to it in a later post.
The tests
Original Flesch-Kincaid Formula
To start, I figured we’d remove the additional variable of Kanji readings from the equation and go with the original Flesch-Kincaid approach so we can set a baseline. Fortunately, when you write sentences without Kanji you end up using one of the Japanese syllabaries, hiragana or katakana. The good news is in the name: we don’t need to figure out the syllables for each word, as each character in a syllabary generally represents a distinct vowel or consonant-vowel pair. Based on this knowledge, we can take a naive approach and count the number of characters to estimate the number of syllables.
The sample text we’ll be analyzing is the skit from NHK Easy Japanese Lesson 48:
はる: タムさんがきて、もうすぐいちねんですね。しょうらいはなにがしたいですか。
タム: そつぎょうしたら、にほんではたらきたいです。りょこうがいしゃではたらきたいです。
かいと: いいね!
はる: にほんのみりょくをいっぱいつたえてくださいね。
ミーヤー: おうえんしてるよ。
タム: はい。がんばります!
The provided translation is as follows:
Haru: Tam-san, it’s been almost a year since you came here. What do you want to do in the future?
Tam: When I graduate, I want to work in Japan. I want to work at a travel agency.
Kaito: That sounds good!
Haru: Please tell people all about Japan’s attractions.
Mi Ya: I’ll be rooting for you.
Tam: Thank you. I’ll do my best!
The English version of the skit has an FK Grade Level of 1.79 and an FK Reading Ease of 93.54. In other words, your last lesson in NHK Easy Japanese is roughly equivalent to second-grade reading material in English. Let’s see if the Japanese text is scored similarly using the original formulas.
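One note before the session: it assumes `sentences` and `words` arrays are already in scope. Here’s a sketch of how that setup could look; the word segmentation is hand-rolled for the first two sentences only, just to show the shape of the data, since kana text has no spaces (a morphological analyzer like MeCab is the proper tool for this):

# the pry session below assumes these are already defined for the full
# skit; shown here for the first two sentences only
text = "タムさんがきて、もうすぐいちねんですね。しょうらいはなにがしたいですか。"
sentences = text.split(/(?<=[。？！])/).map(&:strip).reject(&:empty?)
# hand-segmented words (illustrative only -- use a tokenizer in practice)
words = %w[タムさん が きて もうすぐ いちねん です ね しょうらい は なに が したい です か]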
# I'm using Ruby, but you should be able to do this in any programming language
# average sentence length (average number of words in sentence)
[8] pry(main)> asl = words.length / sentences.length
=> 5 # oops, forgot to convert to floating point math
[9] pry(main)> asl = words.length / sentences.length.to_f
=> 5.25 # that's better
# average number of syllables per word
[10] pry(main)> asw = words.map { |word| word.length }.sum.to_f / words.length
=> 2.642857142857143
# FK Grade Level
[11] pry(main)> fkgl = (0.39 * asl) + (11.8 * asw) - 15.59
=> 17.64321428571429
# FK Reading Ease
[12] pry(main)> fkre = 206.835 - (1.015 * asl) - (84.6 * asw)
=> -22.079464285714238
So, using the original formulas, it seems that our elementary-level Japanese has been graded as “very difficult”, even though it isn’t. Let’s compare things sentence by sentence to see if we can get a rough idea of why:
Sentence | Word Count | Syllable Count |
---|---|---|
Tam-san, it’s been almost a year since you came here. | 11 | 12? |
タムさんがきて、もうすぐいちねんですね | 10 | 18 |
What do you want to do in the future? | 9 | 10 |
しょうらいはなにがしたいですか。 | 7 | 15 |
When I graduate, I want to work in Japan. | 9 | 11 |
そつぎょうしたら、にほんではたらきたいです。 | 6 | 20 |
Just a few sentences in and the problem is evident: if we count each hiragana character as a syllable, the average syllable count for a sentence will be much greater than that of its English counterpart. I know that technically each hiragana character isn’t actually a syllable, but is instead a mora (even though they call it a syllabary…). Moras are smaller units of sound; a single syllable can consist of several moras. Since converting hiragana compounds to syllables does not sound like my idea of fun, I will instead continue counting moras and try adjusting the constant that the average syllables per word is multiplied by, since that’s where our scores are really getting penalized.
[42] pry(main)> fkgl = (0.39 * asl) + (5.8 * asw) - 15.59
=> 1.7860714285714288
# A few iterations later
[56] pry(main)> fkre = 206.835 - (1.015 * asl) - (40.85 * asw)
=> 93.54553571428573
Tada! 🎉
Now, I’m not saying that these constants are in any way going to be reliable going forward, as I have zero idea how the original constants for the Flesch-Kincaid formulas were determined. I plan to test these new constants against more data to see if they sufficiently adjust for how the Japanese language differs from English. The goal isn’t necessarily to get the values to match one-to-one, but to get them back into a reasonable range that allows me to analyze Japanese sentences using the Flesch-Kincaid method.
Conclusion
I think after experimenting with the values, we now have an interesting baseline to start working from, but the only way to really prove this out is to analyze a lot more data. In the next post in this series, we’ll look at creating a data pipeline to do bulk analysis of Japanese media available on the web.