Wiekvoet: Stone flakes II

Continuing from last week, the aim is now to classify the stone flakes based on their various properties. Three methods are used. LDA is an obvious standard. A classification tree is both simple and visually appealing. Random forest as a complex method, where more complex relations can easily be captured. Surprising with these data is that the classification tree is doing better than random forest with respect to predicting the input data.

Data

This is the same as last week. However, I now opted to make the group labels a bit more short.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)
r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',
'Levallois technique',
'Middle Paleolithic',
'Homo Sapiens'))
r12c <- r12[complete.cases(r12),]

LDA

LDA is part of the MASS package. The settings are default. The plot function has been adapted, to retain proper labeling and colors.
ld1 <- lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
data=r12c)
ld1
Call:
lda(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
data = r12c)

Prior probabilities of groups:
Lower Paleolithic Levallois technique Middle Paleolithic Homo Sapiens
0.15492958 0.32394366 0.47887324 0.04225352

Group means:
LBI RTI WDI FLA PSF FSF
Lower Paleolithic 1.105455 33.82727 2.690000 118.8182 31.881818 8.372727
Levallois technique 1.254348 28.74348 2.768261 121.2609 15.021739 14.900000
Middle Paleolithic 1.195000 25.88235 3.384412 113.9706 9.497059 23.294118
Homo Sapiens 1.453333 22.96333 3.150000 117.0000 7.666667 27.900000
ZDF1 PROZD
Lower Paleolithic 15.86364 57.81818
Levallois technique 30.46087 69.17391
Middle Paleolithic 62.31471 84.23529
Homo Sapiens 56.36667 87.00000

Coefficients of linear discriminants:
LD1 LD2 LD3
LBI 2.63864596 -7.3053369000 4.96921267
RTI 0.03735899 0.0683327644 0.03955854
WDI 0.64616126 -0.3086473881 0.06507385
FLA -0.04798439 -0.0591106950 -0.05229559
PSF -0.01917272 0.0515112771 0.10178299
FSF 0.02488002 -0.0001579818 0.07080551
ZDF1 0.02896112 0.0396366883 -0.01625858
PROZD 0.03036974 -0.0079845555 0.04335520

Proportion of trace:
LD1 LD2 LD3
0.7026 0.2585 0.0389
There are eight objects incorrect classified. Out of 71 complete cases, that is 10% wrong.
mydf <- data.frame(ID=r12c$ID,
Group=r12c$Group,
predict=predict(ld1)$class)
mydf[mydf$Group!=mydf$pred,]

ID Group predict
9 c Middle Paleolithic Levallois technique
10 cl Lower Paleolithic Levallois technique
37 ms Levallois technique Lower Paleolithic
38 n Middle Paleolithic Levallois technique
45 roe Middle Paleolithic Levallois technique
55 sz Levallois technique Middle Paleolithic
61 va Levallois technique Homo Sapiens
68 woe Lower Paleolithic Levallois technique

LDA Plot

This is a small adaptation from the plot.lda function in MASS. At visual examination, it seems that the groups are not completely separated. Especially, I wonder if they are much better separated than last weeks biplot.

cols <- colorRampPalette(c('violet','gold','seagreen'))(4)

myldaplot <- function (x, panel = panel.lda, ..., cex = 0.7, dimen, abbrev = FALSE,

xlab = "LD1", ylab = "LD2",indata,cols)

{

panel.lda <- function(x, y, ...) text(x, y, g,

cex = cex, ...)

if (!is.null(Terms <- x$terms)) {

data <- model.frame(x)

X <- model.matrix(delete.response(Terms), data)

g <- indata$ID

mr <- cols[as.numeric(model.response(data))]

xint <- match("(Intercept)", colnames(X), nomatch = 0L)

if (xint > 0L)

X <- X[, -xint, drop = FALSE]

}

else {

xname <- x$call$x

gname <- x$call[[3L]]

if (!is.null(sub <- x$call$subset)) {

X <- eval.parent(parse(text = paste(deparse(xname,

backtick = TRUE), "[", deparse(sub, backtick = TRUE),

",]")))

g <- eval.parent(parse(text = paste(deparse(gname,

backtick = TRUE), "[", deparse(sub, backtick = TRUE),

"]")))

}

else {

X <- eval.parent(xname)

g <- eval.parent(gname)

}

if (!is.null(nas <- x$call$na.action)) {

df <- data.frame(g = g, X = X)

df <- eval(call(nas, df))

g <- df$g

X <- df$X

}

if (abbrev)

levels(g) <- abbreviate(levels(g), abbrev)

means <- colMeans(x$means)

X <- scale(X, center = means, scale = FALSE) %*% x$scaling

if (!missing(dimen) && dimen < ncol(X))

X <- X[, 1L:dimen, drop = FALSE]

if (ncol(X) > 2L) {

pairs(X, panel = panel, ...,col=mr)

}

else if (ncol(X) == 2L) {

eqscplot(X[, 1L:2L], xlab = xlab, ylab = ylab, type = "n",

...)

panel(X[, 1L], X[, 2L], ...)

}

else ldahist(X[, 1L], g, xlab = xlab, ...)

invisible(NULL)

}

myldaplot(ld1,indata=r12c,cols=cols)

Classification tree

Rpart is my favorite classification tree implementation. The only tuning is setting minsplit to 10, the default of 20 seems a bit large for 79 objects and four categories. The printed output is skipped, since we have the plot. The interpretation is pretty simple. First split on ZDF1, to distinguish old from young (less old?). The young can be split on LBI to Middle paleolithic and homo sapiens. The old by PSF to Levoillas and Lower Paleolithic. A final split between middle and lower paleolithic shows that the differences are not clear cut.

library(rpart)

rp1 <- rpart(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12,minsplit=10)

plot(rp1,margin=.1)

text(rp1, use.n = TRUE,cex=.8)

There are ten incorrect predicted objects. Note that it is difficult to compare this with LDA, since objects with missing data (e.g. sk misses FLA, PSF, FSF) are predicted with rpart, while they were removed prior to LDA analysis.

mydf <- data.frame(ID=r12$ID,
Group=r12$Group,
predict=predict(rp1,type='class'))
mydf[mydf$Group!=mydf$predict,]

ID Group predict
6 bie Levallois technique Middle Paleolithic
10 c Middle Paleolithic Levallois technique
11 cl Lower Paleolithic Levallois technique
42 ms Levallois technique Lower Paleolithic
43 n Middle Paleolithic Levallois technique
59 sk Levallois technique Middle Paleolithic
62 sz Levallois technique Middle Paleolithic
65 ta Levallois technique Middle Paleolithic
76 woe Lower Paleolithic Levallois technique
77 wol Levallois technique Middle Paleolithic

Note that cross validation makes for a much worse model assessment; about one third are incorrectly predicted with leave one out.

xmat <- xpred.rpart(rp1,xval=79)

colSums(xmat!=do.call(cbind,lapply(1:5,function(x) as.numeric(r12$Group) )))

0.71428571 0.30304576 0.12371791 0.05832118 0.02182179

42 26 23 24 25

Randomforest

Randomforest seems to predict slightly worse than the more simple methods, with OOB error around 21%. As might be expected, Homo Sapiens, with only 3 rows, is particularly difficult to classify. Similar to the classification tree ZDF1 is an important variable, but FLA and PROZD were not important in the classification tree but are in random forest.

library(randomForest)

rf1 <- randomForest(Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,

data=r12c,

importance=TRUE,

nodesize=10)

rf1

Call:
randomForest(formula = Group ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, data = r12c, importance = TRUE, nodesize = 10)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2

OOB estimate of error rate: 23.94%
Confusion matrix:
Lower Paleolithic Levallois technique Middle Paleolithic
Lower Paleolithic 8 3 0
Levallois technique 3 17 3
Middle Paleolithic 1 4 29
Homo Sapiens 0 0 3
Homo Sapiens class.error
Lower Paleolithic 0 0.2727273
Levallois technique 0 0.2608696
Middle Paleolithic 0 0.1470588
Homo Sapiens 0 1.0000000

Wrong predicted:

mydf <- data.frame(ID=r12c$ID,
Group=r12c$Group,
predict=predict(rf1))
mydf[mydf$Group!=mydf$predict,]

ID Group predict
6 bie Levallois technique Middle Paleolithic
8 bo Levallois technique Middle Paleolithic
9 by Levallois technique Lower Paleolithic
10 c Middle Paleolithic Levallois technique
11 cl Lower Paleolithic Levallois technique
14 e2 Middle Paleolithic Lower Paleolithic
21 g5 Middle Paleolithic Levallois technique
28 gue Levallois technique Lower Paleolithic
40 ml Lower Paleolithic Levallois technique
42 ms Levallois technique Lower Paleolithic
43 n Middle Paleolithic Levallois technique
50 roe Middle Paleolithic Levallois technique
55 sa1 Homo Sapiens Middle Paleolithic
56 sa2 Homo Sapiens Middle Paleolithic
57 sa3 Homo Sapiens Middle Paleolithic
62 sz Levallois technique Middle Paleolithic

importanceplot

varImpPlot(rf1)

Wiekvoet

Monday, June 16, 2014

Stone flakes II

Data

LDA

LDA Plot

Classification tree

Randomforest

importanceplot

No comments:

Post a Comment