Wiekvoet: Common words in the Gathering Storm

The Wheel of Time is a series of books started by Robert Jordan. Unfortunately, he died too early. Like all fans of the series, I feel very lucky that Brandon Sanderson was able to continue these books. The first book Sanderson wrote was the Gathering Storm; the last one is due in January 2013. This post examines how common words can be used to differentiate between books written by Sanderson and those written by Jordan.

Data

The training data are some of the books by Sanderson and Jordan. They form three categories: Robert Jordan Wheel of Time, Robert Jordan other, and Brandon Sanderson various.

the Eye of the World (Wheel of Time) by Robert Jordan
the Fires of Heaven (Wheel of Time) by Robert Jordan
Elantris by Brandon Sanderson
Warbreaker by Brandon Sanderson
Prince of the Blood (other) by Robert Jordan
Conan the Defender (other) by Robert Jordan

The test set consists of three books:

Knife of Dreams (Wheel of Time) by Robert Jordan
Mistborn by Brandon Sanderson
the Gathering Storm (Wheel of Time) by Brandon Sanderson and Robert Jordan

All books were acquired via darknet and read into R as a character vector with one element per chapter. The prologue and epilogue count as separate chapters. The relative frequency of common words is counted in each chapter. In this case, common words are defined as the stopwords from the tm package. For example:

tm::stopwords("English")[1:5]

[1] "a" "about" "above" "across" "after"
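The post does not show how the books were split into chapters. A minimal sketch, assuming a plain-text file in which every chapter starts with a heading line such as PROLOGUE, CHAPTER 1 or EPILOGUE, could look like the block below. The heading pattern, the variable names and the toy lines are assumptions for illustration, not the author's code.

```r
# Toy stand-in for readLines() on a book file; the text is made up.
raw <- c("PROLOGUE", "A wind rose.",
         "CHAPTER 1", "The road was empty.", "It was cold.",
         "EPILOGUE", "The wind died.")

# Heading lines mark the start of a chapter (assumed pattern).
is_head <- grepl("^(PROLOGUE|EPILOGUE|CHAPTER [0-9]+)$", raw)
chapter_id <- cumsum(is_head)

# Paste the lines of each chapter into one string, dropping the headings.
body_lines <- split(raw[!is_head], chapter_id[!is_head])
book <- unname(vapply(body_lines, paste, character(1), collapse = " "))

length(book)  # 3: prologue, one chapter, epilogue
```

The result has the shape the rest of the post relies on: a character vector with one element per chapter.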

Two functions were devised to count the relative occurrence of these words per chapter:

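# Count how often the word 'what' occurs in the text 'where'.
# A match needs blanks on both sides (punctuation next to the word is allowed),
# so a word at the very start or end of the text is not counted.
numwords <- function(what,where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*',what,'[[:punct:]]*[[:blank:]]+',sep=''),
                 where,perl=TRUE,ignore.case=TRUE)
  # gregexpr returns -1 when there is no match
  if (g1[[1]][1]==-1) 0L
  else length(g1[[1]])
}

# For each chapter of 'book', count every stopword and divide by the
# approximate number of words (the number of runs of blanks).
# Returns a matrix with one row per chapter and one column per stopword.
countwords <- function(book) {
  sw <- tm::stopwords("English")
  la <- lapply(book,function(where) {
    sa <- sapply(sw,function(what) numwords(what,where))
    ntot <- length(gregexpr('[[:blank:]]+',where,perl=TRUE,ignore.case=TRUE)[[1]])
    sa/ntot
  } )
  t(do.call(cbind,la))
}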
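To see what the counting does, here is a small sketch on a made-up sentence. The sentence and the three-word list sw are hypothetical; numwords is repeated from the post only so that the block runs on its own.

```r
# Same definition as in the post, repeated so the block is self-contained.
numwords <- function(what, where) {
  g1 <- gregexpr(paste('[[:blank:]]+[[:punct:]]*', what, '[[:punct:]]*[[:blank:]]+', sep = ''),
                 where, perl = TRUE, ignore.case = TRUE)
  if (g1[[1]][1] == -1) 0L else length(g1[[1]])
}

chapter <- "He saw the cat and the dog, and then the cat ran."  # made-up text
sw <- c("the", "and", "he")                                      # hypothetical mini stopword list

counts <- sapply(sw, function(what) numwords(what, chapter))
ntot <- length(gregexpr('[[:blank:]]+', chapter, perl = TRUE)[[1]])

counts         # the: 3, and: 2, he: 0
ntot           # 11 runs of blanks
counts / ntot  # relative occurrence per word
```

Note that 'He' is not counted because it opens the text and has no blank in front of it, and 'then' does not count as 'the'. With whole chapters, missing the first or last word hardly matters.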

# words are counted
wtEotW <- countwords(tEotW)
wElantris <- countwords(Elantris)
wtFoH <- countwords(tFoH)
wWarbreaker <- countwords(Warbreaker)
wPotB <- countwords(PotB)
wConan <- countwords(Conan)
wtGS <- countwords(tGS)
wMistborn <- countwords(Mistborn)
wKoD <- countwords(KoD)

Model

A random forest is used because the number of variables (one per stopword) is larger than the number of objects (280 chapters in the training set).

library(randomForest)

# combine the counts and fit the random forest
all <- rbind(wElantris,wWarbreaker,wtEotW,wtFoH,wPotB,wConan)
cats <- factor(c(
  rep('BS',nrow(wElantris)),
  rep('BS',nrow(wWarbreaker)),
  rep('WoT',nrow(wtEotW)),
  rep('WoT',nrow(wtFoH)),
  rep('RJ',nrow(wPotB)),
  rep('RJ',nrow(wConan))
),levels=c('BS','WoT','RJ'))
rf1 <- randomForest(y=cats,x=all,importance=TRUE)
rf1

Call:
 randomForest(x = all, y = cats, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 22

        OOB estimate of  error rate: 3.93%
Confusion matrix:
     BS WoT RJ class.error
BS  124   0  1 0.008000000
WoT   0 110  1 0.009009009
RJ    5   4 35 0.204545455
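The error rates in this output follow directly from the confusion matrix. As a check, the sketch below retypes the matrix by hand (the names cm, class_error and oob_error are mine) and recomputes them.

```r
# Confusion matrix copied from the randomForest output
# (rows: true category, columns: out-of-bag prediction).
cm <- matrix(c(124,   0,  1,
                 0, 110,  1,
                 5,   4, 35),
             nrow = 3, byrow = TRUE,
             dimnames = list(true = c('BS', 'WoT', 'RJ'),
                             predicted = c('BS', 'WoT', 'RJ')))

class_error <- 1 - diag(cm) / rowSums(cm)   # misclassified share per true category
oob_error   <- 1 - sum(diag(cm)) / sum(cm)  # misclassified share overall

sum(cm)                    # 280 chapters in the training set
round(class_error, 3)      # BS 0.008, WoT 0.009, RJ 0.205
round(100 * oob_error, 2)  # 3.93
```

Only 11 of the 280 chapters are misclassified, and 9 of those are in the small 'Robert Jordan other' category.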

varImpPlot(rf1)

Words that discriminate between the three categories include 'and', 'didn't' and 'not'. The next figure shows the typical usage of the nine most important words. Note that the data have been scaled at this point to make the display easier to read.

library(lattice)

# the nine words with the largest mean decrease in Gini
im <- importance(rf1)
toshow <- rownames(im)[order(-im[,'MeanDecreaseGini'])][1:9]
# scale each word to mean 0 and standard deviation 1
tall <- as.data.frame(scale(all[,toshow]))
tall$chapters <- rownames(tall)
tall$cats <- cats
rownames(tall) <- 1:nrow(tall)
# reshape to long format: one row per chapter and word
propshow <- reshape(tall,direction='long',
  timevar='Word',
  v.names='ScaledScore',
  times=toshow,
  varying=list(toshow))
bwplot(cats ~ ScaledScore | Word,data=propshow)
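The scaling and the wide-to-long reshape are easier to follow on a tiny example. This is a sketch with made-up frequencies for three chapters and two words; only base R is used, and the numbers are not taken from the books.

```r
# Made-up relative frequencies for three chapters and two words.
m <- cbind(and = c(0.02, 0.03, 0.04),
           not = c(0.010, 0.012, 0.014))

# scale() centres every column and divides it by its standard deviation,
# so words with very different frequencies end up on a comparable axis.
tall <- as.data.frame(scale(m))
tall$cats <- factor(c('BS', 'WoT', 'RJ'))

# Wide to long: the word columns are stacked into one ScaledScore column,
# with the word itself stored in the Word column.
long <- reshape(tall, direction = 'long',
                timevar = 'Word', v.names = 'ScaledScore',
                times = c('and', 'not'),
                varying = list(c('and', 'not')))

round(long$ScaledScore, 6)  # -1 0 1 -1 0 1
long$Word                   # "and" "and" "and" "not" "not" "not"
```

The long format is what the lattice formula needs: one panel per value of Word, with the category on the vertical axis.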

Based on this, it seems Sanderson uses contractions such as 'didn't', which Jordan did not. Jordan used 'not', 'and' and 'or' more often. 'However' is very much a Sanderson word.

Predictions

For the predictions I took the predicted proportion of trees voting for each category, as this shows a bit of the uncertainty in the categorization, which I find of interest. Density plots are used to display the predictions. Each panel in the plot shows the strength of the association between a book and a category. The higher the values, the stronger the association. Each row represents a book, each column a category.
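What a proportion of trees means can be shown without a forest at all. The sketch below uses hypothetical votes from five trees for a single chapter; with type='prob', predict on a randomForest classifier returns this kind of vote fraction for every chapter and category.

```r
# Hypothetical votes of a five-tree forest for one chapter.
votes <- factor(c('BS', 'BS', 'WoT', 'BS', 'RJ'),
                levels = c('BS', 'WoT', 'RJ'))

# Proportion of trees voting for each category.
prop <- table(votes) / length(votes)
prop  # BS 0.6, WoT 0.2, RJ 0.2
```

A chapter with a proportion near 1 for one category is classified with little doubt, while values spread over the categories signal an uncertain chapter.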

# proportion of trees voting for each category, per chapter
ptGS <- predict(rf1,wtGS,type='prob')
pMistborn <- predict(rf1,wMistborn,type='prob')
pKoD <- predict(rf1,wKoD,type='prob')
preds <- as.data.frame(rbind(ptGS,pMistborn,pKoD))
preds$Book <- c(rep('the Gathering Storm',nrow(ptGS)),
  rep('Mistborn',nrow(pMistborn)),rep('Knife of Dreams',nrow(pKoD)))
# reshape to long format: one row per chapter and category
predshow <- reshape(preds,direction='long',
  timevar='Prediction',v.names='Score',times=c('BS','WoT','RJ'),
  varying=list(w=c('BS','WoT','RJ')))
densityplot(~Score | Prediction + Book,data=predshow)
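The density plots can be complemented with a numerical summary per book. The sketch below is not part of the original analysis: the matrix p is made up and stands in for the output of predict with type='prob' for one book of three chapters.

```r
# Made-up probabilities for three chapters of one book (rows sum to 1).
p <- rbind(c(0.80, 0.15, 0.05),
           c(0.60, 0.35, 0.05),
           c(0.30, 0.65, 0.05))
colnames(p) <- c('BS', 'WoT', 'RJ')

# Average association of the book with each category.
round(colMeans(p), 3)  # BS 0.567, WoT 0.383, RJ 0.050

# Most likely category per chapter, then tallied over the book.
winner <- factor(colnames(p)[max.col(p, ties.method = 'first')],
                 levels = colnames(p))
table(winner)          # BS 2, WoT 1, RJ 0
```

Applied to ptGS, pMistborn and pKoD, the column means would put a single number on each panel of the plot.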

Interpretation

Knife of Dreams is correctly categorized as Wheel of Time, and Mistborn is correctly categorized as Sanderson. This shows that the predictions perform well, so the item of interest can be examined: the Gathering Storm. It sits solidly in the Sanderson category. Interestingly, it sits a little less in Sanderson than Mistborn does, and a bit more in Wheel of Time than Mistborn does.

Wiekvoet

Tuesday, December 25, 2012
