Tuesday, December 25, 2012

Common words in the Gathering Storm

The Wheel of Time is a series of books started by Robert Jordan. Unfortunately, he died too early. Like all fans of the series, I feel very lucky that Brandon Sanderson was able to continue these books. The first book Sanderson wrote was the Gathering Storm; the last one is due in January 2013. This post examines how common words can be used to differentiate between books written by Sanderson and those written by Jordan.

Data

The training data are some of the books by Sanderson and Jordan. They form three categories: Robert Jordan's Wheel of Time, Robert Jordan's other work, and Brandon Sanderson's various books.
  • the Eye of the World (Wheel of Time) by Robert Jordan
  • the Fires of Heaven (Wheel of Time) by Robert Jordan
  • Elantris by Brandon Sanderson
  • Warbreaker by Brandon Sanderson
  • Prince of the Blood (other) by Robert Jordan
  • Conan the Defender (other) by Robert Jordan
The test set consists of three books:
  • Knife of Dreams (Wheel of Time) by Robert Jordan
  • Mistborn by Brandon Sanderson
  • the Gathering Storm (Wheel of Time) by Brandon Sanderson and Robert Jordan
All books were acquired via darknet and read into R as a vector with one element per chapter. Prologue and epilogue count as separate chapters. For each chapter, the relative frequency of common words is computed. Here, common words are defined as the stopwords from the tm package. For example:
tm::stopwords("English")[1:5]
[1] "a"      "about"  "above"  "across" "after" 
Two functions were devised to count the relative occurrence of these words per chapter:
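# Count how often the word `what` occurs in the text `where`.
# A hit is the word surrounded by whitespace, with optional punctuation on either side.
numwords <- function(what,where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*',what,'[[:punct:]]*[[:blank:]]+',sep=''),where,perl=TRUE,ignore.case=TRUE)
  if (g1[[1]][1]==-1) 0L
  else length(g1[[1]])
}
# For each chapter of `book`, divide the count of every stopword by the
# number of whitespace runs (a proxy for the chapter's word count).
# Result: a matrix with one row per chapter and one column per stopword.
countwords <- function(book) {
  sw <- tm::stopwords("English")
  la <- lapply(book,function(where) {
        sa <- sapply(sw,function(what) numwords(what,where))
        ntot <- length(gregexpr('[[:blank:]]+',
                       where,perl=TRUE,ignore.case=TRUE)[[1]])
        sa/ntot
      } )
  t(do.call(cbind,la))
}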
# words are counted
wtEotW <- countwords(tEotW)
wElantris <- countwords(Elantris)
wtFoH <- countwords(tFoH)
wWarbreaker <- countwords(Warbreaker)
wPotB <- countwords(PotB)
wConan <- countwords(Conan)
wtGS <- countwords(tGS)
wMistborn <- countwords(Mistborn)
wKoD <- countwords(KoD)
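To make the counting concrete, here is a minimal base R sketch of the same idea on a single made-up sentence. The names `chapter`, `words` and `count_word` are illustrative only and not part of the analysis; the regular expression is the one used in numwords above.

```r
# Toy illustration only: `chapter` and `words` are made-up stand-ins.
# The text is padded with a space on each side because the pattern
# requires whitespace before and after a word.
chapter <- " The wheel turns, and the ages come and pass. "
words <- c("the", "and", "of")

# Same pattern as numwords: whitespace, optional punctuation, the word,
# optional punctuation, whitespace.
count_word <- function(what, where) {
  pattern <- paste0("[[:blank:]]+[[:punct:]]*", what, "[[:punct:]]*[[:blank:]]+")
  hits <- gregexpr(pattern, where, perl = TRUE, ignore.case = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

counts <- sapply(words, count_word, where = chapter)

# Number of whitespace runs, used as the denominator as in countwords
n_tokens <- length(gregexpr("[[:blank:]]+", chapter, perl = TRUE)[[1]])

rel_freq <- counts / n_tokens
print(rel_freq)
```

Because matches of one pattern cannot overlap, the same word repeated back to back (sharing a single space) would be counted only once. For stopwords this should be rare.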

Model

A random forest is used because the number of variables (one relative frequency per stopword) is larger than the number of objects (chapters).
library(randomForest)
# combine the training counts and fit the random forest
all <- rbind(wElantris,wWarbreaker,wtEotW,wtFoH,wPotB,wConan)
cats <-  factor(c(
        rep('BS',nrow(wElantris)),
        rep('BS',nrow(wWarbreaker)),
        rep('WoT',nrow(wtEotW)),
        rep('WoT',nrow(wtFoH)),
        rep('RJ',nrow(wPotB)),
        rep('RJ',nrow(wConan))
    ),levels=c('BS','WoT','RJ'))
rf1 <- randomForest(y=cats,x=all,importance=TRUE)
rf1

Call:
 randomForest(x = all, y = cats, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 22

        OOB estimate of  error rate: 3.93%
Confusion matrix:
     BS WoT RJ class.error
BS  124   0  1 0.008000000
WoT   0 110  1 0.009009009
RJ    5   4 35 0.204545455
varImpPlot(rf1)
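As a quick check on the printed output, the OOB error rate and the per-class errors can be recomputed from the confusion matrix. This is a small base R sketch; the matrix is typed in by hand from the printout above and the object names are my own.

```r
# Confusion matrix copied by hand from the randomForest printout
# (rows: actual category, columns: OOB prediction)
conf <- matrix(c(124,   0,  1,
                   0, 110,  1,
                   5,   4, 35),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = c("BS", "WoT", "RJ"),
                               predicted = c("BS", "WoT", "RJ")))

# Per-class error: share of each row that lies off the diagonal
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall OOB error: share of all chapters that are misclassified
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

Almost all of the error comes from the 'RJ' category, which is also by far the smallest with 44 chapters.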


Words that discriminate between the three categories include 'and', 'didn't' and 'not'. The next figure shows the typical usage of the nine most important words. Note that the data have been scaled at this point to make the display easier to read.
library(lattice)  # provides bwplot() and densityplot()
im <- importance(rf1)
toshow <- rownames(im)[order(-im[,'MeanDecreaseGini'])][1:9]
tall <- as.data.frame(scale(all[,toshow]))
tall$chapters <- rownames(tall)
tall$cats <- cats
rownames(tall) <- 1:nrow(tall)
propshow <- reshape(tall,direction='long',
    timevar='Word',
    v.names='ScaledScore',
    times=toshow,
    varying=list(toshow))
bwplot(  cats ~ScaledScore  | Word,data=propshow)
Based on this, it seems Sanderson uses contractions such as 'didn't', which Jordan did not. Jordan used 'not', 'and' and 'or' more often. 'However' is very much a Sanderson word.
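For readers unfamiliar with the scaling step used for the plot: scale() centers each column on its mean and divides by its standard deviation, so words with very different base frequencies end up on a comparable axis. A toy sketch with made-up frequencies (the numbers are not from the books):

```r
# Made-up relative frequencies for two words in three chapters
m <- cbind(and = c(0.02, 0.03, 0.04),
           not = c(0.001, 0.002, 0.006))

# Column-wise standardization: (x - mean(x)) / sd(x)
z <- scale(m)

print(round(z, 3))
print(colMeans(z))       # approximately 0 for every column
print(apply(z, 2, sd))   # 1 for every column
```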

Predictions

For the predictions I took the proportion of trees voting for each category, as this shows some of the uncertainty in the categorization, which I find of interest. Density plots are used to display the predictions. Each panel in the plot shows the strength of the association between a book and a category. The higher the values, the stronger the association. Each row represents a book, each column a category.
ptGS <- predict(rf1,wtGS,type='prob')
pMistborn <- predict(rf1,wMistborn,type='prob')
pKoD <- predict(rf1,wKoD,type='prob')
preds <- as.data.frame(rbind(ptGS,pMistborn,pKoD))
preds$Book <- c(rep('the Gathering Storm',nrow(ptGS)),
    rep('Mistborn',nrow(pMistborn)),rep('Knife of Dreams',nrow(pKoD)))
predshow <- reshape(preds,direction='long',
    timevar='Prediction',v.names='Score',times=c('BS','WoT','RJ'),
    varying=list(w=c('BS','WoT','RJ')))
densityplot(~Score | Prediction + Book,data=predshow)
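To illustrate what these scores mean, here is a small base R sketch, assuming that type='prob' reports for each chapter the fraction of trees voting for each category (which is how I read the randomForest documentation). The votes below are made up: ten hypothetical trees and two hypothetical chapters.

```r
# Hypothetical votes of 10 trees for two chapters (made-up data)
lev <- c("BS", "WoT", "RJ")
votes <- rbind(
  chapter1 = c("BS", "BS", "BS", "BS", "BS", "BS", "BS", "WoT", "WoT", "RJ"),
  chapter2 = c("WoT", "WoT", "WoT", "WoT", "WoT", "WoT", "BS", "BS", "BS", "BS")
)

# Fraction of trees voting for each category, one row per chapter
vote_fraction <- t(apply(votes, 1, function(v) {
  tabulate(match(v, lev), nbins = length(lev)) / length(v)
}))
colnames(vote_fraction) <- lev

print(vote_fraction)
```

Each row sums to one, so a chapter that scores high on one category necessarily scores low on the others. The three columns of panels in the density plot are therefore not independent of each other.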

Interpretation

Knife of Dreams is correctly categorized as Wheel of Time, and Mistborn is correctly categorized as Sanderson. This shows the predictions are performing well, so the item of interest can be examined: the Gathering Storm. It sits solidly in the Sanderson category. Interestingly, it sits a little less in Sanderson than Mistborn does, and a bit more in Wheel of Time than Mistborn does.

2 comments:

  1. This was a super cool analysis! Can you post a link to the data you used to do it?

    Thanks!

    1. Hi Inkhorn,

      The data (books) are obviously copyrighted. That means I cannot. Sorry about that.
