Exploratory Analysis of Language Corpora in Preparation for Development of a Word Prediction Algorithm

Synopsis

This report outlines some exploratory analysis with language corpora along with a description of plans for a word prediction algorithm.

As described in the Coursera / Johns Hopkins Data Science Specialization Capstone Project course materials, the motivation for this project is to:

Demonstrate that I have downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings that you amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.

Preliminaries

Step 0a: Load the R libraries we will use herein:

library(dplyr)
library(ggplot2)
library(gridExtra)

Step 0b: Set knitr options to use throughout

options(width=120)

Resource Summary: English Language Corpora from Blogs, News and Twitter

Various language files have been provided for this project. The three English language ones, which we will be using in this exploration, are for word corpuses from blogs, news feeds and Twitter. The three files are summarized below.

Calculations for file sizes, number of lines per file, and number of words per file:

# file sizes
fSize_blogs <- file.info("data/en_US/en_US.blogs.txt")['size']
fSize_news <- file.info("data/en_US/en_US.news.txt")['size']
fSize_twitter <- file.info("data/en_US/en_US.twitter.txt")['size']

# number of lines
nLines_blogs <- length(readLines("data/en_US/en_US.blogs.txt"))
nLines_news <- length(readLines("data/en_US/en_US.news.txt"))
nLines_twitter <- length(readLines("data/en_US/en_US.twitter.txt"))

# number of words
nWords_blogs <- sapply(gregexpr("\\W+", 
                                readChar("data/en_US/en_US.blogs.txt",fSize_blogs)), length) + 1
nWords_news <- sapply(gregexpr("\\W+", 
                               readChar("data/en_US/en_US.news.txt",fSize_news)), length) + 1
nWords_twitter <- sapply(gregexpr("\\W+", 
                                  readChar("data/en_US/en_US.twitter.txt",fSize_twitter)), length) + 1

File Name	Size (MB)	Number of Lines	Number of Words
en_US.blogs.txt	210.160014	899288	3.830842210^{7}
en_US.news.txt	205.811889	1010242	3.562444910^{7}
en_US.twitter.txt	167.105338	2360148	2.19189810^{6}

Data Sampling

The following function will be used to filter out lines containing profanity. The file ‘profanity.txt’ contains one profane word per line, with a total of 7 lines.

profanityList <- readLines("profanity.txt")
profanityCheck <- function(str) {
    p <- FALSE
    if (is.character(str)) {
        if (nchar(str) > 0) {
            for (x in profanityList) {
                if (grepl(x, tolower(str))) {
                    p <- TRUE
                }
            }
        }
    }
    p
}

For exploration purposes, we will start by creating three separate sub-sample datasets and saving them to separate files so that we will not have to read in the large files each time we run a new exploratory script. Note the use of the profanity filter in this process.

filenames <- c('blogs','news','twitter')
selectRatio <- 0.001
set.seed(42)
for (f in filenames) {
    if (!file.exists(paste("en_US.",f,".sample.txt",sep=""))) {
        conIn <- file(paste("data/en_US/en_US.",f,".txt",sep=""), "r")
        conOut <- file(paste("en_US.",f,".sample.txt",sep=""), "w")
        for(i in 1:nLines_twitter) {
            if (rbinom(1,1,selectRatio) > 0 ) {
                l <- readLines(conIn, 1)
                if (length(l) > 0) {
                    if (!profanityCheck(l)) {
                        writeLines(l,con=conOut)
                    }
                }
            } else {
                readLines(conIn, 1)
            }
        }
        close(conIn)
        close(conOut)
    }
}

We can then read in the sub-samples for the exploration and analysis described hereafter.

# read from files
sampleData_blogs <- readLines("en_US.blogs.sample.txt")
sampleData_news <- readLines("en_US.news.sample.txt")
sampleData_twitter <- readLines("en_US.twitter.sample.txt")

# take a quick look at the dataframes
head(sampleData_blogs)

## [1] "So now a short break in writing while I make coffee. 'Scuse me."                                                                                                                                                                    
## [2] "Herman and Eva Althaus moved their family across country from Illinois to settle in Roseburg, Oregon in about 1909. Herman had a plumbing shop."                                                                                    
## [3] "I have business idea which involves fabric which I am very excited about. I am currently putting together a business plan to see if I can take it further. That is really exciting. Even my other half thinks its a good idea!"     
## [4] "I have a tune for you all as well."                                                                                                                                                                                                 
## [5] "Look at Bruno over here. With his brown skin pairs the light to dark scheme of clothing. From flesh, which is his short sleeves polo, to the breath-taking dark orange pants goes his dark green loafers. Love Bruno in the summer."
## [6] "And she pulls of the short haircut that I wish I could!"

head(sampleData_news)

## [1] "Others are afraid that if they have already had breast cancer, a mammogram could spread new cancer to the other breast, she said. Again, not true."                                                                                                                                 
## [2] "The teacher \"never comes to work, leaves no lesson plan for the sub, and has a file as thick as everything,\" Miller said, her eyes welling. \"Students are falling behind. . . . Colleagues are upset. They come to me and say, 'Why aren't you doing anything?' \""              
## [3] "Damon is a lifetime .286 hitter. He is 277 hits shy of 3,000."                                                                                                                                                                                                                      
## [4] "Roofing contractors all tell me the same thing; municipal roof inspections are anything but thorough. One roofer told me that city inspectors barely get out of their vehicles. I have a high level of respect for municipal inspectors, so I decided to get to the bottom of this."
## [5] "\"We have no further comment at this time.\""                                                                                                                                                                                                                                       
## [6] "Azi, a 3-year-old German shepherd, bit the officer at about 4 p.m. April 19 at the department's headquarters, 20495 S.W. Borchers Drive, Reed said. The officer, who is not the dog's handler, was treated and then released from the hospital."

head(sampleData_twitter)

## [1] "- Sounds good bro"                                                                                                               
## [2] "Word to Blimpy."                                                                                                                 
## [3] "Remember that you are never alone. God is always with you. (a2ib4ue) Let's get it :-)"                                           
## [4] "I'd still let him have it."                                                                                                      
## [5] "Nice to meet you, fellow writers"                                                                                                
## [6] "Well there's multiple.I'm going to be doing a couple of serious pencil drawings of actual people and need your help locating pix"

String Tokenization

For our n-gram prediction model, we will consider whole words (rather than shingles), with numbers considered as words, and we will add the token ‘##s##’ at the beginning of sentences (or sentence fragments) and the token ‘##es##’ at the end of sentences (or sentence fragments). (As discussed below, our prediction model will also include a token ‘##unkn##’ for unknown words, but we will not deal with that quite yet.)

As such, we need to be able to split text into words and also add the ‘##s##’ and ‘##es##’ tokens in the appropriate places. The following function takes a string as input and returns a tokenized version of it, in all lower case letters.

wordList <- function(str) {
    str = gsub("[.!?;]", " ##es## ##s## ", str)
    str = paste("##s##",str,"##es##")
    str = gsub("##s##[ ]+##es##", "", str)
    regex = "[^[:space:],:=<>/\\)\\(]+"
    regmatches(tolower(str), gregexpr(regex, tolower(str)))
}

A example usage of the wordList function:

sampleData_twitter[4]

## [1] "I'd still let him have it."

wordList(sampleData_twitter[4])

## [[1]]
## [1] "##s##"  "i'd"    "still"  "let"    "him"    "have"   "it"     "##es##"

We can then apply this function to each line of each of our data samples:

wordsSeqs_blogs <- sapply(sampleData_blogs,wordList,USE.NAMES=FALSE)
wordsSeqs_news <- sapply(sampleData_news,wordList,USE.NAMES=FALSE)
wordsSeqs_twitter <- sapply(sampleData_twitter,wordList,USE.NAMES=FALSE)

A quick look at the results:

wordsSeqs_blogs[1:2]

## [[1]]
##  [1] "##s##"   "so"      "now"     "a"       "short"   "break"   "in"      "writing" "while"   "i"       "make"   
## [12] "coffee"  "##es##"  "##s##"   "'scuse"  "me"      "##es##" 
## 
## [[2]]
##  [1] "##s##"    "herman"   "and"      "eva"      "althaus"  "moved"    "their"    "family"   "across"   "country" 
## [11] "from"     "illinois" "to"       "settle"   "in"       "roseburg" "oregon"   "in"       "about"    "1909"    
## [21] "##es##"   "##s##"    "herman"   "had"      "a"        "plumbing" "shop"     "##es##"

wordsSeqs_news[1:2]

## [[1]]
##  [1] "##s##"     "others"    "are"       "afraid"    "that"      "if"        "they"      "have"      "already"  
## [10] "had"       "breast"    "cancer"    "a"         "mammogram" "could"     "spread"    "new"       "cancer"   
## [19] "to"        "the"       "other"     "breast"    "she"       "said"      "##es##"    "##s##"     "again"    
## [28] "not"       "true"      "##es##"   
## 
## [[2]]
##  [1] "##s##"      "the"        "teacher"    "\"never"    "comes"      "to"         "work"       "leaves"    
##  [9] "no"         "lesson"     "plan"       "for"        "the"        "sub"        "and"        "has"       
## [17] "a"          "file"       "as"         "thick"      "as"         "everything" "\""         "miller"    
## [25] "said"       "her"        "eyes"       "welling"    "##es##"     "##s##"      "\"students" "are"       
## [33] "falling"    "behind"     "##es##"     "##s##"      "colleagues" "are"        "upset"      "##es##"    
## [41] "##s##"      "they"       "come"       "to"         "me"         "and"        "say"        "'why"      
## [49] "aren't"     "you"        "doing"      "anything"   "##es##"     "##s##"      "'"          "\""        
## [57] "##es##"

wordsSeqs_twitter[1:2]

## [[1]]
## [1] "##s##"  "-"      "sounds" "good"   "bro"    "##es##"
## 
## [[2]]
## [1] "##s##"  "word"   "to"     "blimpy" "##es##"

Exploratory Analysis

Individual Tokens

One may wish to use different corpora for different applications, but for our purposes of trying to construct a general-use next-word predictor, we will combine data from all three corpora.

Our first step in analysis will be to consider occurances of individual tokens, which we will call the N1 case. For this, let’s combine all of the lines from all of the data samples into one list of token occurances (noting that they are still in order, so we can use this later for our consideration of two-word sequences and three-word sequences).

wordsN1 <- c(unlist(wordsSeqs_blogs), unlist(wordsSeqs_news), unlist(wordsSeqs_twitter))
totalWords = length(wordsN1)
totalWords

## [1] 115299

We can then use the following to calculate the frequency of each word (token) in this list:

wordCountsN1 <- as.data.frame(table(wordsN1))
wordCountsN1$Freq <- wordCountsN1$Freq / totalWords

A quick look at the most frequent tokens:

wordCountsN1_desc <- wordCountsN1 %>% arrange(desc(Freq))
head(wordCountsN1_desc, n=10)

##    wordsN1        Freq
## 1   ##es## 0.077346725
## 2    ##s## 0.077346725
## 3      the 0.038907536
## 4       to 0.021968968
## 5      and 0.019367037
## 6        a 0.019176229
## 7       of 0.016418182
## 8       in 0.013634117
## 9        i 0.012706095
## 10      is 0.009228181

For our application, we will want to consider reducing the number of tokens in our collection by replacing some relatively uncommon ones with the single token ##unkn##. This could significantly decrease the memory requirements for the application with just a relatively small decrease in accuracy, and also allows for the handling of cases where users type words that are not in the corpora. The following calculates the number of tokens required to capture 50% or 90% of the token counts, and the figure show the distributions in greater detail.

wordsReq50 = min(which( cumsum(wordCountsN1_desc$Freq) > 0.5 ))
wordsReq90 = min(which( cumsum(wordCountsN1_desc$Freq) > 0.9 ))
wordsReq50

## [1] 78

wordsReq90

## [1] 5405

wordCountsN1_asc <- wordCountsN1 %>% arrange(Freq)
wordCountsN1_asc$wordsN1 <- factor(wordCountsN1_asc$wordsN1,
                                       levels = wordCountsN1_asc$wordsN1, 
                                       ordered = TRUE)

y1 <- ggplot(wordCountsN1_asc, 
             aes(x=wordsN1,y=Freq)) +
        geom_bar(stat="identity") + coord_flip() + 
        xlab("all tokens, ordered by prevalence") + ylab("Frequency") +
        theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
        theme(text = element_text(size=9))

y2 <- ggplot(tail(wordCountsN1_asc,n=30), 
             aes(x=wordsN1,y=Freq)) +
        geom_bar(stat="identity") + coord_flip() + 
        xlab("30 most commonly used tokens") + ylab("Frequency") +
        theme(text = element_text(size=9))

y3 <- ggplot(tail(wordCountsN1_asc,n=30), 
             aes(x=1:30,y=(sum(Freq)-cumsum(Freq)))) +
        geom_line() + coord_flip() + 
        xlab("30 most commonly used tokens") + ylab("Cummulative Sum of Frequencies") +
        scale_x_discrete(breaks=1:30, labels=tail(wordCountsN1_asc$wordsN1,n=30)) +
        theme(text = element_text(size=9))

y4 <- ggplot(wordCountsN1_asc, 
             aes(x=1:length(wordsN1),y=(1-cumsum(Freq)))) +
        geom_line() + coord_flip() + 
        xlab("all tokens, ordered by prevalence") + ylab("Cummulative Sum of Frequencies") +
        theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
        theme(text = element_text(size=9)) + 
        geom_vline(aes(xintercept=length(wordCountsN1_asc$wordsN1)-wordsReq50,
                       colour='blue')) + 
        geom_vline(aes(xintercept=length(wordCountsN1_asc$wordsN1)-wordsReq90,
                       colour='red')) +
        geom_text(aes(length(wordCountsN1_asc$wordsN1)-wordsReq50, 1,
                      label = paste("50%: ", wordsReq50, " tokens"), 
                      hjust = 1, vjust = -1, size=9, colour='blue')) +
        geom_text(aes(length(wordCountsN1_asc$wordsN1)-wordsReq90, 1,
                      label = paste("90%: ", wordsReq90, " tokens"), 
                      hjust = 1, vjust = -1, size=9, colour='red')) +
        theme(legend.position = 'none')

grid.arrange(y1, y2, y3, y4, ncol=4, main="Figure 1: Frequency of Individual Tokens")

N-grams - Groups of Tokens

The following function takes a word list (e.g. our wordsN1 list) and produces a list of pairs (2-grams), triples (3-grams), or, more generally, n-grams. These n-grams will be used in our prediction app, and could also be used for various other applications.

makeNlist <- function(x,n) {
    l <- list()
    l[[1]] <- x
    for (i in 2:n) {
        l[[i]] <- l[[i-1]]
        l[[i]] <- l[[i]][-1]
        l[[i]][length(l[[i]])+1] <- "##es##"
    }
    df <- data.frame(l)
    colnames(df) <- 1:n
    do.call("paste", c(df[1:n], sep=" "))
}

With such a list, we can repeat the previous N1 analysis for any N-gram list by using the following function:

analyseNgramList <- function(wordsNgram) {
    
    totalWords = length(wordsNgram)
    print(paste("total N-grams = ", totalWords, sep=""))

    wordCountsNgram <- as.data.frame(table(wordsNgram))
    wordCountsNgram$Freq <- wordCountsNgram$Freq / totalWords
    distinctWords = length(wordCountsNgram$Freq)
    print(paste("distinct N-grams = ", distinctWords, sep=""))

    wordCountsNgram_desc <- wordCountsNgram %>% arrange(desc(Freq))

    wordsNgramReq50 = min(which( cumsum(wordCountsNgram_desc$Freq) > 0.5 ))
    wordsNgramReq90 = min(which( cumsum(wordCountsNgram_desc$Freq) > 0.9 ))
    print(paste("num N-grams required for 50% coverage= ", wordsNgramReq50, sep=""))
    print(paste("num N-grams required for 90% coverage= ", wordsNgramReq90, sep=""))
    
    wordCountsNgram
}

graphNgramList <- function(wordsNgram,wordCountsNgram) {

    wordCountsNgram_asc <- wordCountsNgram %>% arrange(Freq)
    wordCountsNgram_asc$wordsNgram <- factor(wordCountsNgram_asc$wordsNgram,
                                             levels = wordCountsNgram_asc$wordsNgram, 
                                             ordered = TRUE)

    y1 <- ggplot(wordCountsNgram_asc, 
                 aes(x=wordsNgram,y=Freq)) +
            geom_bar(stat="identity") + coord_flip() + 
            xlab("all N-grams, ordered by prevalence") + ylab("Frequency") +
            theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
            theme(text = element_text(size=9))

    y2 <- ggplot(tail(wordCountsNgram_asc,n=30), 
                 aes(x=wordsNgram,y=Freq)) +
            geom_bar(stat="identity") + coord_flip() + 
            xlab("30 most commonly used N-grams") + ylab("Frequency") +
            theme(text = element_text(size=9))

    y3 <- ggplot(tail(wordCountsNgram_asc,n=30), 
                 aes(x=1:30,y=(sum(Freq)-cumsum(Freq)))) +
            geom_line() + coord_flip() + 
            xlab("30 most commonly used N-grams") + ylab("Cummulative Sum of Frequencies") +
            scale_x_discrete(breaks=1:30, labels=tail(wordCountsNgram_asc$wordsNgram,n=30)) +
            theme(text = element_text(size=9))

    y4 <- ggplot(wordCountsNgram_asc, 
                 aes(x=1:length(wordsNgram),y=(1-cumsum(Freq)))) +
            geom_line() + coord_flip() + 
            xlab("all N-grams, ordered by prevalence") + ylab("Cummulative Sum of Frequencies") +
            theme(axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
            theme(text = element_text(size=9)) 
    
   list(y1,y2,y3,y4)
}

2-Grams

What are the frequencies of 2-grams in the dataset?

wordsN2 <- makeNlist(wordsN1,2)
wordsN2[1:10]

##  [1] "##s## so"      "so now"        "now a"         "a short"       "short break"   "break in"      "in writing"   
##  [8] "writing while" "while i"       "i make"

wordCountsN2 <- analyseNgramList(wordsN2)

## [1] "total N-grams = 115299"
## [1] "distinct N-grams = 67535"
## [1] "num N-grams required for 50% coverage= 10068"
## [1] "num N-grams required for 90% coverage= 56006"

y <- graphNgramList(wordsN2,wordCountsN2)
grid.arrange(y[[1]], y[[2]], y[[3]], y[[4]], ncol=4, main="Figure 2: Frequency of 2-Gram Tokens")

3-Grams

What are the frequencies of 3-grams in the dataset?

wordsN3 <- makeNlist(wordsN1,3)
wordsN3[1:10]

##  [1] "##s## so now"     "so now a"         "now a short"      "a short break"    "short break in"   "break in writing"
##  [7] "in writing while" "writing while i"  "while i make"     "i make coffee"

wordCountsN3 <- analyseNgramList(wordsN3)

## [1] "total N-grams = 115299"
## [1] "distinct N-grams = 96708"
## [1] "num N-grams required for 50% coverage= 39059"
## [1] "num N-grams required for 90% coverage= 85179"

y <- graphNgramList(wordsN3,wordCountsN3)
grid.arrange(y[[1]], y[[2]], y[[3]], y[[4]], ncol=4, main="Figure 3: Frequency of 3-Gram Tokens")

How do you evaluate how many of the words come from foreign languages?

Report any interesting findings that you amassed so far.

Development Plan for Creating a Prediction Algorithm and Shiny App

Basic n-gram prediction algorithm, assuming that the given (n-1)-gram is in the corpora:

predict <- function(given2gram,wordCounts,n) {
    found <- grepl(paste("^",given2gram," ",sep=""),wordCounts$wordsNgram)
    df <- wordCounts[found,]
    df$Freq = df$Freq / sum(df$Freq)
    df$wordsNgram = gsub(paste("^",given2gram," ",sep=""),"",df$wordsNgram)
    head(df[ order(-df[,2]), ], n=n)
}

w <- predict("now a",wordCountsN3,5)
w

##       wordsNgram Freq
## 59798       card 0.25
## 59799   jamaican 0.25
## 59800    regular 0.25
## 59801      short 0.25

This algorithm will be extended to use “backoff” (recursively considering (n-1)-gram prediction if there are not enough cases in the n-gram prediction). The “wordCount” tables will also be modified to include unknown tokens (##unkn##), in order to limit the size of the tables in memory and to deal with words not in the corpora. Smoothing will also be considered.

The model will thus have various parameters that can be tuned, including:

The maximum size n to consider with the n-grams
For each n, the number of n-grams to keep (or, equivalently, a cut-off frequency)
The number of individual tokens to consider in the language (the remainder being labeled as ##unkn##)

To evaluate the performance of any given configuration, and thus to aid in the model tuning, the text corpora will be divided into training and testing sets, and the predictor algorithm will called on each word in the training set (given its previous n tokens, does it predict the given word?), with the percentage correct used as a performance metric. This performance metric will be considered alongside the memory and computation time required by the model, and the model will be tuned to provide the best possible performance within the memory and computation time constraints posed by the application.