Monday, June 16, 2014

Stone flakes II

Continuing from last week, the aim is now to classify the stone flakes based on their various properties. Three methods are used. LDA is an obvious standard. A classification tree is both simple and visually appealing. Random forest as a complex method, where more complex relations can easily be captured. Surprising with these data is that the classification tree is doing better than random forest with respect to predicting the input data.

Data

This is the same as last week. However, I now opted to make the group labels a bit more short.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)
r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',
    'Levallois technique',
    'Middle Paleolithic',
    'Homo Sapiens'))
r12c <- r12[complete.cases(r12),]

LDA

LDA is part of the MASS package. The settings are default. The plot function has been adapted, to retain proper labeling and colors.
ld1 <- lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12c)
ld1
Call:
lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, 
    data = r12c)

Prior probabilities of groups:
  Lower Paleolithic Levallois technique  Middle Paleolithic        Homo Sapiens 
         0.15492958          0.32394366          0.47887324          0.04225352 

Group means:
                         LBI      RTI      WDI      FLA       PSF       FSF
Lower Paleolithic   1.105455 33.82727 2.690000 118.8182 31.881818  8.372727
Levallois technique 1.254348 28.74348 2.768261 121.2609 15.021739 14.900000
Middle Paleolithic  1.195000 25.88235 3.384412 113.9706  9.497059 23.294118
Homo Sapiens        1.453333 22.96333 3.150000 117.0000  7.666667 27.900000
                        ZDF1    PROZD
Lower Paleolithic   15.86364 57.81818
Levallois technique 30.46087 69.17391
Middle Paleolithic  62.31471 84.23529
Homo Sapiens        56.36667 87.00000

Coefficients of linear discriminants:
              LD1           LD2         LD3
LBI    2.63864596 -7.3053369000  4.96921267
RTI    0.03735899  0.0683327644  0.03955854
WDI    0.64616126 -0.3086473881  0.06507385
FLA   -0.04798439 -0.0591106950 -0.05229559
PSF   -0.01917272  0.0515112771  0.10178299
FSF    0.02488002 -0.0001579818  0.07080551
ZDF1   0.02896112  0.0396366883 -0.01625858
PROZD  0.03036974 -0.0079845555  0.04335520

Proportion of trace:
   LD1    LD2    LD3 
0.7026 0.2585 0.0389 
There are eight objects incorrect classified. Out of 71 complete cases, that is 10% wrong.
mydf <- data.frame(ID=r12c$ID,
    Group=r12c$Group,
    predict=predict(ld1)$class)
mydf[mydf$Group!=mydf$pred,]

    ID               Group             predict
9    c  Middle Paleolithic Levallois technique
10  cl   Lower Paleolithic Levallois technique
37  ms Levallois technique   Lower Paleolithic
38   n  Middle Paleolithic Levallois technique
45 roe  Middle Paleolithic Levallois technique
55  sz Levallois technique  Middle Paleolithic
61  va Levallois technique        Homo Sapiens
68 woe   Lower Paleolithic Levallois technique

LDA Plot

This is a small adaptation from the plot.lda function in MASS. At visual examination, it seems that the groups are not completely separated. Especially, I wonder if they are much better separated than last weeks biplot.
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
myldaplot <- function (x, panel = panel.lda, ..., cex = 0.7, dimen, abbrev = FALSE, 
    xlab = "LD1", ylab = "LD2",indata,cols) 
{
  panel.lda <- function(x, y, ...) text(x, y, g, 
        cex = cex, ...)
  if (!is.null(Terms <- x$terms)) {
    data <- model.frame(x)
    X <- model.matrix(delete.response(Terms), data)
    g <- indata$ID
    mr <- cols[as.numeric(model.response(data))]
    xint <- match("(Intercept)", colnames(X), nomatch = 0L)
    if (xint > 0L) 
      X <- X[, -xint, drop = FALSE]
  }
  else {
    xname <- x$call$x
    gname <- x$call[[3L]]
    if (!is.null(sub <- x$call$subset)) {
      X <- eval.parent(parse(text = paste(deparse(xname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  ",]")))
      g <- eval.parent(parse(text = paste(deparse(gname, 
                      backtick = TRUE), "[", deparse(sub, backtick = TRUE), 
                  "]")))
    }
    else {
      X <- eval.parent(xname)
      g <- eval.parent(gname)
    }
    if (!is.null(nas <- x$call$na.action)) {
      df <- data.frame(g = g, X = X)
      df <- eval(call(nas, df))
      g <- df$g
      X <- df$X
    }
  }
  if (abbrev) 
    levels(g) <- abbreviate(levels(g), abbrev)
  means <- colMeans(x$means)
  X <- scale(X, center = means, scale = FALSE) %*% x$scaling
  if (!missing(dimen) && dimen < ncol(X)) 
    X <- X[, 1L:dimen, drop = FALSE]
  if (ncol(X) > 2L) {
    pairs(X, panel = panel, ...,col=mr)
  }
  else if (ncol(X) == 2L) {
    eqscplot(X[, 1L:2L], xlab = xlab, ylab = ylab, type = "n", 
        ...)
    panel(X[, 1L], X[, 2L], ...)
  }
  else ldahist(X[, 1L], g, xlab = xlab, ...)
  invisible(NULL)
}
myldaplot(ld1,indata=r12c,cols=cols)

Classification tree

Rpart is my favorite classification tree implementation. The only tuning is setting minsplit to 10, the default of 20 seems a bit large for 79 objects and four categories. The printed output is skipped, since we have the plot. The interpretation is pretty simple. First split on ZDF1, to distinguish old from young (less old?). The young can be split on LBI to Middle paleolithic and homo sapiens. The old by PSF to Levoillas and Lower Paleolithic. A final split between middle and lower paleolithic shows that the differences are not clear cut.
library(rpart)
rp1 <- rpart(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12,minsplit=10)
plot(rp1,margin=.1)
text(rp1, use.n = TRUE,cex=.8)
There are ten incorrect predicted objects. Note that it is difficult to compare this with LDA, since objects with missing data (e.g. sk misses FLA, PSF, FSF) are predicted with rpart, while they were removed prior to LDA analysis.

mydf <- data.frame(ID=r12$ID,
    Group=r12$Group,
    predict=predict(rp1,type='class'))
mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict
6  bie Levallois technique  Middle Paleolithic
10   c  Middle Paleolithic Levallois technique
11  cl   Lower Paleolithic Levallois technique
42  ms Levallois technique   Lower Paleolithic
43   n  Middle Paleolithic Levallois technique
59  sk Levallois technique  Middle Paleolithic
62  sz Levallois technique  Middle Paleolithic
65  ta Levallois technique  Middle Paleolithic
76 woe   Lower Paleolithic Levallois technique
77 wol Levallois technique  Middle Paleolithic
Note that cross validation makes for a much worse model assessment; about one third are incorrectly predicted with leave one out.
xmat <- xpred.rpart(rp1,xval=79)
colSums(xmat!=do.call(cbind,lapply(1:5,function(x) as.numeric(r12$Group) )))
0.71428571 0.30304576 0.12371791 0.05832118 0.02182179 
        42         26         23         24         25 

Randomforest

Randomforest seems to predict slightly worse than the more simple methods, with OOB error around 21%. As might be expected, Homo Sapiens, with only 3 rows, is particularly difficult to classify. Similar to the classification tree ZDF1 is an important variable, but FLA and PROZD were not important in the classification tree but are in random forest.

library(randomForest)

rf1 <- randomForest(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
   data=r12c,
   importance=TRUE,
   nodesize=10)
rf1
Call:
 randomForest(formula = Group ~ LBI + RTI + WDI + FLA + PSF +      FSF + ZDF1 + PROZD, data = r12c, importance = TRUE, nodesize = 10)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 23.94%
Confusion matrix:
                    Lower Paleolithic Levallois technique Middle Paleolithic
Lower Paleolithic                   8                   3                  0
Levallois technique                 3                  17                  3
Middle Paleolithic                  1                   4                 29
Homo Sapiens                        0                   0                  3
                    Homo Sapiens class.error
Lower Paleolithic              0   0.2727273
Levallois technique            0   0.2608696
Middle Paleolithic             0   0.1470588
Homo Sapiens                   0   1.0000000
Wrong predicted:
mydf <- data.frame(ID=r12c$ID,
    Group=r12c$Group,
    predict=predict(rf1))
mydf[mydf$Group!=mydf$predict,]
    ID               Group             predict
6  bie Levallois technique  Middle Paleolithic
8   bo Levallois technique  Middle Paleolithic
9   by Levallois technique   Lower Paleolithic
10   c  Middle Paleolithic Levallois technique
11  cl   Lower Paleolithic Levallois technique
14  e2  Middle Paleolithic   Lower Paleolithic
21  g5  Middle Paleolithic Levallois technique
28 gue Levallois technique   Lower Paleolithic
40  ml   Lower Paleolithic Levallois technique
42  ms Levallois technique   Lower Paleolithic
43   n  Middle Paleolithic Levallois technique
50 roe  Middle Paleolithic Levallois technique
55 sa1        Homo Sapiens  Middle Paleolithic
56 sa2        Homo Sapiens  Middle Paleolithic
57 sa3        Homo Sapiens  Middle Paleolithic
62  sz Levallois technique  Middle Paleolithic


importanceplot


varImpPlot(rf1)

No comments:

Post a Comment