An over-engineered approach to learning Japanese using the internet
What are we talking about today?
In our previous post in this series, we looked at how we could use mathematics to estimate the difficulty level of Japanese text. Now we need a practical way to do this quickly and at scale. Analyze all the things!
Problems we need to tackle at a high level:
Locate a high quality collection of source material
This will be used to test our FK formulas w/ the modified constants from the first post, and will also serve as our corpus for developing an enhanced FK formula.
- Should have consistent formatting and quality
- Should have a diverse collection of examples of native Japanese
- Should be widely accessible to users
Implement a service that prepares source material for processing
No matter the source material, most of it will contain junk that needs to be cleaned out: formatting characters, HTML tags, wiki labels, the list goes on. We need something that can do this really quickly, without using a lot of resources (see the sketch after this list).
- Remove invalid characters / text
- Conform data to a data structure that is easy to process
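As a rough sketch of what this cleanup might look like in Python (the regexes and the `clean` helper here are hypothetical placeholders, not a final ruleset):

```python
import re

# Hypothetical cleanup rules; a real pipeline would need a ruleset
# tuned to whichever source material we end up choosing.
HTML_TAG = re.compile(r"<[^>]+>")                           # e.g. <br/>, <ref>
WIKI_LABEL = re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]")  # [[target|label]] -> label
JUNK_CHARS = re.compile(r"[\u200b\ufeff]")                  # zero-width spaces, BOMs

def clean(raw: str) -> str:
    """Strip markup and junk characters, leaving plain Japanese text."""
    text = HTML_TAG.sub("", raw)
    text = WIKI_LABEL.sub(r"\1", text)
    text = JUNK_CHARS.sub("", text)
    return text.strip()

print(clean("彼は<br/>[[東京|東京都]]に住んでいる。"))
# -> 彼は東京都に住んでいる。
```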
Implement a service that processes the prepared source material
Japanese sentences don't include spaces between words. This is also true of some other languages. There are libraries designed to read Japanese text and identify where the word boundaries are (otherwise, we couldn't search Japanese text using search engines). We'll use one of these libraries to process our Japanese sentences; a sketch follows the list below.
- Convert sentences w/ kanji to a collection of moras / kana
- Group moras / kana by word
- Group words by sentence
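As a concrete sketch, here's what this could look like using fugashi, a Python wrapper around the MeCab morphological analyzer, assuming the unidic-lite dictionary is installed (an assumption on my part; the available feature fields, like `kana` and `pos1`, depend on which dictionary you use):

```python
# pip install fugashi unidic-lite
from fugashi import Tagger

tagger = Tagger()

SMALL_KANA = "ャュョァィゥェォ"  # small kana attach to the preceding kana

def count_morae(katakana: str) -> int:
    """Roughly one mora per kana, except small kana, which combine
    with the sound before them (e.g. キョ is a single mora)."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)

sentence = "日本語の文章には単語の区切りがありません。"

# Each token carries its surface form plus dictionary features,
# including a katakana reading we can use for mora counting.
for word in tagger(sentence):
    if word.feature.pos1 == "補助記号":  # skip punctuation
        continue
    kana = word.feature.kana or word.surface
    print(word.surface, kana, count_morae(kana))
```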
Generate Flesch-Kincaid scores for each sentence
Once our sentences are processed, grading them should be fairly straightforward. If they have been processed properly, we should be able to apply a similar algorithm to the one used in the first post (sketched after this list).
- Apply Flesch-Kincaid grade level formula with updated constants
- Apply Flesch-Kincaid reading ease formula with updated constants
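A minimal sketch of the scoring step, counting morae in place of syllables. The constants below are the standard English Flesch-Kincaid values, used here as placeholders; the real pipeline would substitute the modified constants derived in the first post:

```python
def grade_level(words: int, sentences: int, morae: int,
                a: float = 0.39, b: float = 11.8, c: float = 15.59) -> float:
    """Flesch-Kincaid grade level, with morae standing in for syllables.
    a, b, c default to the standard English constants."""
    return a * (words / sentences) + b * (morae / words) - c

def reading_ease(words: int, sentences: int, morae: int,
                 a: float = 206.835, b: float = 1.015, c: float = 84.6) -> float:
    """Flesch reading ease, with morae standing in for syllables."""
    return a - b * (words / sentences) - c * (morae / words)
```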
Store results in an easy-to-explore database
The primary reason we're doing all of this is so we can get lots of quality data to explore and test our theories on. We want to be able to easily query the data and gather meaningful insights quickly (a possible schema is sketched after this list).
- DB should allow rapid storage and retrieval
- DB should facilitate basic, useful data analysis strategies
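As one possibility, SQLite checks both boxes for early exploration (the schema and column names here are hypothetical, and swapping in PostgreSQL or similar later wouldn't change the shape much):

```python
import sqlite3

conn = sqlite3.connect("sentences.db")

# Hypothetical minimal schema: one row per scored sentence,
# indexed on grade level for quick difficulty-range queries.
conn.executescript("""
CREATE TABLE IF NOT EXISTS sentences (
    id           INTEGER PRIMARY KEY,
    text         TEXT NOT NULL,
    word_count   INTEGER,
    mora_count   INTEGER,
    grade_level  REAL,
    reading_ease REAL
);
CREATE INDEX IF NOT EXISTS idx_grade ON sentences (grade_level);
""")

# Example analysis query: pull beginner-friendly sentences.
rows = conn.execute(
    "SELECT text, grade_level FROM sentences "
    "WHERE grade_level BETWEEN ? AND ? ORDER BY grade_level LIMIT 10",
    (1.0, 3.0),
).fetchall()
```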
Experiment with enhanced Flesch-Kincaid formulas
This is the pinnacle of the project. If the prior steps worked well enough, we should have reasonable scores for all of our content. However, it would be really cool if we were able to create other useful metrics to compare against, specifically something that takes how common a word is into consideration (one possible shape of that calculation is sketched after this list).
- Retrieve word frequency percentiles from DB
- Calculate average word frequency percentile for sentence
- Apply Enhanced Flesch-Kincaid grade level formula
- Apply Enhanced Flesch-Kincaid reading ease formula
- Store in DB for exploration
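Here's one hypothetical shape for that calculation. The frequency table, the treatment of unknown words, and the `d` weighting term are all invented placeholders; finding values that actually work is the experiment:

```python
from statistics import mean

# Hypothetical lookup: word -> frequency percentile (0 = rare, 100 = very
# common). In practice this would be retrieved from the DB populated earlier.
FREQ_PERCENTILE = {"私": 99.0, "学校": 95.0, "行く": 98.0}

def avg_frequency_percentile(words: list[str]) -> float:
    """Average frequency percentile across a sentence's words;
    unknown words are treated as rare (0th percentile)."""
    return mean(FREQ_PERCENTILE.get(w, 0.0) for w in words)

def enhanced_grade_level(words: int, sentences: int, morae: int,
                         avg_percentile: float,
                         a: float = 0.39, b: float = 11.8,
                         c: float = 15.59, d: float = 0.05) -> float:
    """Sketch of an 'enhanced' grade level: the usual FK terms plus a
    penalty that grows as the sentence's words get rarer. The d weight
    is a made-up placeholder to be tuned experimentally."""
    fk = a * (words / sentences) + b * (morae / words) - c
    return fk + d * (100.0 - avg_percentile)
```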
Conclusion
Now that we have a general plan of attack, let’s start hacking! The plan is to keep this page updated as we knock out each step in the process. Tune in next time as we locate a good data source and start prepping the data.