Next-Word Predictor: Overview and Initial Config
(Coursera Data Science Capstone project)
[App display placeholder: rows of predicted-word slots based on the previous 4, 3, 2, and 1 words of input, shown empty in the initial configuration]
This is a first pass at a next-word prediction app based on N-gram analysis of text corpora.
It builds on an initial exploratory analysis in R
undertaken as part of the Coursera / Johns Hopkins Specialization in Data Science.
The calculation approach has developed significantly since that initial exploration: the data analysis
underlying this first-pass app uses Hadoop's MapReduce framework, which will facilitate the use
of much larger raw data sets and more extensive exploration of the configuration parameters
(including the production of learning curves and trade-off maps) to improve the app's performance.
The calculation process, which uses Amazon's Elastic MapReduce implementation, is described in this IPython notebook. As an overview, it carries out the following two MapReduce steps in sequence for each N (i.e. for each level of N-gram):
- The input text is mapped to a series of N-grams, filtered against a vocabulary list so that tokens (words) not on the list are overwritten with the value "##unkn##"; these N-grams are then aggregated by frequency (a sketch of this step follows the list)
- The aggregated N-grams are then filtered and summarized so that each (N-1)-gram is coupled with an array of the top X most frequent following words (see the second sketch below)
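
To make the first step concrete, here is a minimal sketch of it in Python. This is not the notebook's actual code: the vocabulary file name (vocab.txt), the choice N = 3, and the single-script form are all assumptions for illustration. In a real Hadoop Streaming run, the mapper and reducer would be separate scripts reading from stdin, with Hadoop performing the shuffle between them.

```python
#!/usr/bin/env python
# Sketch of the first MapReduce step: tokenize input lines into N-grams,
# mask out-of-vocabulary words with "##unkn##", and count frequencies.
# Stand-alone illustration only; vocab.txt and N = 3 are assumptions.
import sys
from collections import Counter

N = 3  # assumed N-gram order; the real pipeline runs once per value of N

def load_vocab(path="vocab.txt"):
    """Load the vocabulary list (one word per line) - file name is assumed."""
    with open(path) as f:
        return set(line.strip() for line in f)

def mapper(lines, vocab):
    """Emit (ngram, 1) pairs, overwriting out-of-vocabulary tokens."""
    for line in lines:
        tokens = [t if t in vocab else "##unkn##" for t in line.lower().split()]
        for i in range(len(tokens) - N + 1):
            yield " ".join(tokens[i:i + N]), 1

def reducer(pairs):
    """Aggregate the mapped pairs into a total count per N-gram."""
    counts = Counter()
    for ngram, count in pairs:
        counts[ngram] += count
    return counts

if __name__ == "__main__":
    vocab = load_vocab()
    for ngram, total in reducer(mapper(sys.stdin, vocab)).items():
        print(f"{ngram}\t{total}")
```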
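
The second step can be sketched in the same spirit: the aggregated N-gram counts from the first step are re-keyed on their (N-1)-gram prefix, and each prefix is reduced to its most frequent following words. The tab-separated input format and X = 5 are assumptions, the latter chosen to match the five prediction slots in the app display above.

```python
#!/usr/bin/env python
# Sketch of the second MapReduce step: re-key aggregated N-gram counts on
# their (N-1)-gram prefix and keep the top X following words per prefix.
# Input format ("ngram<TAB>count") and X = 5 are assumptions.
import sys
from collections import defaultdict

TOP_X = 5  # assumed; matches the five prediction slots shown in the app

def mapper(lines):
    """Split each 'ngram<TAB>count' record into (prefix, (last_word, count))."""
    for line in lines:
        ngram, count = line.rstrip("\n").split("\t")
        *prefix, last = ngram.split(" ")
        yield " ".join(prefix), (last, int(count))

def reducer(pairs):
    """For each (N-1)-gram prefix, keep the TOP_X most frequent next words."""
    followers = defaultdict(list)
    for prefix, item in pairs:
        followers[prefix].append(item)
    for prefix, items in followers.items():
        top = sorted(items, key=lambda wc: wc[1], reverse=True)[:TOP_X]
        yield prefix, [word for word, _ in top]

if __name__ == "__main__":
    for prefix, top_words in reducer(mapper(sys.stdin)):
        print(f"{prefix}\t{','.join(top_words)}")
```

Keeping only the top X followers per prefix is what makes the final lookup table small enough to serve interactively: the app only ever needs the handful of words it will display, not the full frequency distribution.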
The next major step in development will be to set aside a testing dataset and devise appropriate performance metrics, which can then be used to direct further model improvement.