Sunday, May 26, 2013

More bubble sort tuning

After last week's post bubble sort tuning I got an email from Berend Hasselman noting that my 'best' function did not protect against cases n<=2 (I sketch a possible guard below, after his code) and that a speed improvement was possible. That made me realize I should have been profiling the functions rather than just looking for inspiration in the code. So that's what I wanted to do this week. Unfortunately the examples were a bit too simple to show the real power of profiling, so after writing a too long post I removed a lot again.

Simple Improvements 

The first thing I did was compare Berend's code, bubble_sort6, with my own, bubble_sort5. The difference is, I think, in the indexing: the repeated [ ][ ] subsetting. His version was 10% to 20% faster.
library(microbenchmark)

bubble_sort5 = function(vec) {
  wm <- which.max(vec)
  vec <- c(vec[-wm],vec[wm])
  for (iend in ((length(vec)-1):2)) {
    wm <- which.max(vec[1:iend])
    vec <- c(vec[1:iend][-wm],vec[1:iend][wm],vec[(iend+1):length(vec)])
  }
  vec
}

bubble_sort6 = function(vec) {
  wm <- which.max(vec)
  vec <- c(vec[-wm],vec[wm])
  for (iend in ((length(vec)-1):2)) {
    tmp <- vec[1:iend]
    wm <- which.max(tmp)
    vec <- c(tmp[-wm],tmp[wm],vec[(iend+1):length(vec)])
  }
  vec
}
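Berend's other remark, that the function does not protect against vectors of length two or less, is easy to address with an early return. A minimal sketch of such a guard on top of bubble_sort6 (the name bubble_sort6a and the exact guard are my own, not Berend's):
bubble_sort6a = function(vec) {
  # handle the degenerate cases before entering the loop
  if (length(vec) < 2) return(vec)
  if (length(vec) == 2) return(c(min(vec), max(vec)))
  wm <- which.max(vec)
  vec <- c(vec[-wm], vec[wm])
  for (iend in ((length(vec)-1):2)) {
    tmp <- vec[1:iend]
    wm <- which.max(tmp)
    vec <- c(tmp[-wm], tmp[wm], vec[(iend+1):length(vec)])
  }
  vec
}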
So how about removing even more indexing? I can copy the vector, push the maximum into its correct place in the copy and eliminate it from the original vector.
bubble_sort7 = function(vec) {
  final <- vec
  for (iend in (length(vec):1)) {
    wm <- which.max(vec)
    final[iend] <- vec[wm]
    vec <- vec[-wm]
  }
  final
}
Why not eliminate the removal from the vector and place a low number there instead, so which.max skips those positions?
bubble_sort8 = function(vec) {
  final <- vec
  smallest <- min(vec)
  for (iend in (length(vec):1)) {
    wm <- which.max(vec)
    final[iend] <- vec[wm]
    vec[wm] <- smallest
  }
  final
}
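Before timing them, a quick sanity check that the new variants actually sort; a small snippet of my own, not part of the benchmark below:
x <- sample(1000)
identical(bubble_sort7(x), sort(x))  # should be TRUE
identical(bubble_sort8(x), sort(x))  # should be TRUE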
Numbers 7 and 8 are the fastest, with number 8 quickest at vector_size = 1000. (To make the display more compact, the expressions have been slightly shortened.)
vector_size = 1000
mbm <- microbenchmark(bubble_sort5(sample(vector_size)),
    bubble_sort6(sample(vector_size)),
    bubble_sort7(sample(vector_size)),
    bubble_sort8(sample(vector_size)),
    sort(sample(vector_size)),
    sample(vector_size))  
su <- summary(mbm)
levels(su$expr) <- gsub('sample(vector_size)','',levels(su$expr),fixed=TRUE)
su

            expr       min         lq     median         uq        max neval
1 bubble_sort5() 58382.570 69967.8920 70629.8475 71088.0120  85883.413   100
2 bubble_sort6() 42090.254 43020.5110 54414.1360 55058.8650  74350.140   100
3 bubble_sort7() 27909.892 28862.5065 29183.9545 39707.0685 241396.777   100
4 bubble_sort8() 24635.301 25061.5775 25356.6345 36344.5100  40485.581   100
5         sort()   321.081   503.2475   522.6735   534.7695    641.430   100
6                   31.521    38.8525    48.3830    50.5815     53.514   100
but number 7 is fastest at vector_size = 100.
            expr      min       lq    median       uq       max neval
1 bubble_sort5() 1792.338 1836.322 1880.6720 1928.322  2313.546   100
2 bubble_sort6() 1392.819 1416.644 1454.7630 1536.499 28460.422   100
3 bubble_sort7() 1364.230 1408.579 1433.5045 1507.177  1894.967   100
4 bubble_sort8() 1724.896 1773.279 1808.8320 1881.038 26898.999   100
5         sort()  177.401  258.405  336.8425  364.332   599.645   100
6                   5.864    9.530   10.9960   13.562    20.526   100

Profiling

The classic way to profile is using Rprof(). I checked R-bloggers and found improved R profiling summaries by Noam Ross, from which I concluded that the improvements are mainly in the output, so I will use the classic output. That seems to work best for me; maybe that is because I work from Eclipse, maybe because I still use R 2.15.3, but the other summaries did not work so well. It also seems about 17% of the time is spent in which.max and about 80% inside bubble_sort7 itself, in operations the profiler does not attribute to named functions. Not very informative, but that may be because the example is too simple.
Rprof("file.out",interval=.002)
for(i in 1:100) bubble_sort7(sample(1000))
Rprof(NULL)
summaryRprof("file.out")
$by.self
             self.time self.pct total.time total.pct
bubble_sort7     2.704    81.30      3.322     99.88
which.max        0.594    17.86      0.594     17.86
-                0.020     0.60      0.020      0.60
sample           0.004     0.12      0.004      0.12
$<-              0.002     0.06      0.002      0.06
parse            0.002     0.06      0.002      0.06

$by.total
             total.time total.pct self.time self.pct
<Anonymous>       3.326    100.00     0.000     0.00
bubble_sort7      3.322     99.88     2.704    81.30
eval              3.322     99.88     0.000     0.00
which.max         0.594     17.86     0.594    17.86
-                 0.020      0.60     0.020     0.60
sample            0.004      0.12     0.004     0.12
doTryCatch        0.004      0.12     0.000     0.00
tryCatchList      0.004      0.12     0.000     0.00
tryCatchOne       0.004      0.12     0.000     0.00
$<-               0.002      0.06     0.002     0.06
parse             0.002      0.06     0.002     0.06
srcfilecopy       0.002      0.06     0.000     0.00

$sample.interval
[1] 0.002

$sampling.time
[1] 3.326

Saturday, May 18, 2013

Bubble sort tuning

I was reading Paul Hiemstra's blog post Much more efficient bubble sort in R using the Rcpp and inline packages, went back to his first post Bubble sort implemented in pure R and thought: surely we can do better in pure R. So I cleaned the inner loops, removed a recursion, made some variations and finally got a big improvement by vectorizing. I don't need to try Rcpp to know that it is and will remain faster, but I surely closed the gap a bit.

Original code

This is how it appeared originally in the blog, in pure R. Any changes are in layout only.

larger = function(pair) {
  if(pair[1] > pair[2]) return(TRUE) else return(FALSE)
}
swap_if_larger = function(pair) {
  if(larger(pair)) {
    return(rev(pair)) 
  } else {
    return(pair)
  }
}
swap_pass = function(vec) { 
  for(i in seq(1, length(vec)-1)) {
    vec[i:(i+1)] = swap_if_larger(vec[i:(i+1)])
  }
  return(vec)
}
bubble_sort = function(vec) {
  new_vec = swap_pass(vec)
  if(isTRUE(all.equal(vec, new_vec))) { 
    return(new_vec) 
  } else {
    return(bubble_sort(new_vec))
  }
}

Improve inside of loops

This is essentially the same code, but with the first two functions cleaned up.
larger1 = function(pair) pair[1] > pair[2]

swap_if_larger1 = function(pair) {
  if(larger1(pair)) rev(pair) 
  else pair
}

swap_pass1 = function(vec) { 
  for(i in seq(1, length(vec)-1)) {
    vec[i:(i+1)] = swap_if_larger1(vec[i:(i+1)])
  }
  return(vec)
}

bubble_sort1 = function(vec) {
  new_vec = swap_pass1(vec)
  if(isTRUE(all.equal(vec, new_vec))) { 
    return(new_vec) 
  } else {
    return(bubble_sort1(new_vec))
  }
}

Improve outside loops

I cleaned the outer loop, then decided that the two inner functions were no longer needed, because the swap reduces to a single statement.
swap_pass2 = function(vec) { 
  for(i in 1:(length(vec)-1)) {
    if (vec[i] > vec[i+1] ) vec[c(i,i+1)] <- vec[c(i+1,i)]
  }
  return(vec)
}

bubble_sort2 = function(vec) {
  new_vec = swap_pass2(vec)
  if(identical(vec, new_vec)) new_vec 
  else bubble_sort2(new_vec)
}

No recursion

This was actually not intended as a speed update, but R was complaining about too deep recursion. The same swap_pass2 as before is used.
bubble_sort3 = function(vec) {
  new_vec = swap_pass2(vec)
  while (!identical(vec, new_vec)) {
    vec <- new_vec
    new_vec = swap_pass2(vec)
  } 
  new_vec 
}

A different setup

The way I understood bubble sort is different from what has been programmed so far. So far the sort is complete when no more improvements are made, which is checked via a vector comparison. As I always understood it, you don't check: the first bubble goes to the end, at which point the last element is the maximum. The second bubble then stops at end-1, and so on. The final bubble covers only the first two elements, after which the algorithm is finished.
swap_pass4 = function(vec,iend) { 
  for(i in 1:iend) {
    if (vec[i] > vec[i+1] ) vec[c(i,i+1)] <- vec[c(i+1,i)]
  }
  return(vec)
}
bubble_sort4 = function(vec) {
  for (iend in (length(vec)-1):1) vec <- swap_pass4(vec,iend)
  vec
}

Tuning Paul's original algorithm

I was a bit disappointed with the improvement from using the true bubble sort. Apparently there is some gain in checking whether the process is complete, which is logical, since the bubbles do some sorting of the intermediate elements. So, what about bubbles that go up and down, combined with a check for no improvements?
swap_pass3b = function(vec) { 
  for(i in length(vec):2) {
    if (vec[i] < vec[i-1] ) vec[c(i,i-1)] <- vec[c(i-1,i)]
  }
  return(vec)
}

bubble_sort3b = function(vec) {
  new_vec = swap_pass2(vec)
  while (!identical(vec, new_vec)) {
    new_vec = swap_pass2(vec)
    vec <- swap_pass3b(new_vec)
  } 
  vec 
}

Vectorizing the bubbles

When you look at the bubble sort without checking, the only thing the first pass really guarantees is that the highest element ends up in the last position. This can be achieved by just pulling the highest element and placing it at the end, without the intermediate swaps.
bubble_sort5 = function(vec) {
  wm <- which.max(vec)
  vec <- c(vec[-wm],vec[wm])
  for (iend in ((length(vec)-1):2)) {
    wm <- which.max(vec[1:iend])
    vec <- c(vec[1:iend][-wm],vec[1:iend][wm],vec[(iend+1):length(vec)])
  }
  vec
}

Results

First a demonstration that they work:
test_vec = round(runif(10, 0, 100))
cbind(
    bubble_sort(test_vec),
    bubble_sort1(test_vec),
    bubble_sort2(test_vec),
    bubble_sort3(test_vec),
    bubble_sort4(test_vec),
    bubble_sort3b(test_vec),
    bubble_sort5(test_vec)
)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]   28   28   28   28   28   28   28
 [2,]   41   41   41   41   41   41   41
 [3,]   42   42   42   42   42   42   42
 [4,]   46   46   46   46   46   46   46
 [5,]   52   52   52   52   52   52   52
 [6,]   68   68   68   68   68   68   68
 [7,]   89   89   89   89   89   89   89
 [8,]   93   93   93   93   93   93   93
 [9,]   98   98   98   98   98   98   98
[10,]  100  100  100  100  100  100  100
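The same check can be done programmatically; a small extra snippet, assuming the functions above are all defined:
all(sapply(list(bubble_sort(test_vec), bubble_sort1(test_vec), bubble_sort2(test_vec),
                bubble_sort3(test_vec), bubble_sort4(test_vec), bubble_sort3b(test_vec),
                bubble_sort5(test_vec)),
           identical, sort(test_vec)))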

Speed

Now the timing. Cleaning the inside of the loops saves about a third of the time, and cleaning the outer loop brings it down much further. The bubble sort as I remembered it and the up-and-down bubble improve things a bit more; we are now at about 13% of the original time. The final version, which uses the vectorization, blows them all out of the water.
test_vec = runif(1000, 0, 100)
system.time(bubble_sort(test_vec))
#   user  system elapsed 
#  25.48    0.02   25.69 
system.time(bubble_sort1(test_vec))
#   user  system elapsed 
#  17.88    0.00   17.91 
system.time(bubble_sort2(test_vec))
#   user  system elapsed 
#   4.75    0.00    4.75 
system.time(bubble_sort3(test_vec))
#   user  system elapsed 
#   4.75    0.00    4.74 
system.time(bubble_sort4(test_vec))
#   user  system elapsed 
#   3.45    0.00    3.44 
system.time(bubble_sort3b(test_vec))
#   user  system elapsed 
#   3.57    0.00    3.59 
system.time(bubble_sort5(test_vec))
#   user  system elapsed 
#   0.06    0.00    0.06 

How about sort()?

sort() is much faster and scales much better, even more so after subtracting the time for the sample() call. But I don't need to do the exercise to know that a suboptimal algorithm in an interpreted language cannot compete with modern algorithms in a compiled language.
library(microbenchmark)
vector_size = 100                                                                           
print(microbenchmark(bubble_sort5(sample(vector_size)),                                                                          
        sort(sample(vector_size)),
        sample(vector_size))  )
Unit: microseconds
                              expr      min       lq    median       uq
 bubble_sort5(sample(vector_size)) 1730.041 1776.957 1806.2795 1866.758
         sort(sample(vector_size))  104.096  114.359  126.4540  180.334
               sample(vector_size)    5.865    8.797    9.8965   11.729
      max neval
 9978.521   100
  233.849   100
   21.992   100
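A rough way to correct for the sampling overhead is to subtract the median of the bare sample() call from the other medians; a sketch only, ignoring the spread of the timings:
mbm100 <- microbenchmark(bubble_sort5(sample(vector_size)),
    sort(sample(vector_size)),
    sample(vector_size))
su100 <- summary(mbm100)
su100$median - su100$median[nrow(su100)]  # last row is the bare sample() call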

Sunday, May 12, 2013

Reshaping data

Preparing and reshaping data is the never-ending task of a data analyst. Luckily we have many tools for it. The default tool in R would be reshape(), although this is so user friendly that a reshape package has been added too. I try to use reshape() (the function) because I feel it is a good tool, though with a somewhat cryptic manual. The latter may be because it is written in terms of longitudinal data, whereas my experience is mostly converting data from a layout that is easy to enter in Excel into one that is suitable for analysis in R.
To exercise myself a bit more I have taken all examples from the SAS transpose procedure and implemented them in R.

Examples 1 to 3

These examples are so simple, the best tool is the t() function.

score <- read.table(textConnection('
Student StudentID  Section  Test1 Test2 Final
Capalleti 0545 1  94 91 87
Dubose    1252 2  51 65 91
Engles    1167 1  95 97 97
Grant     1230 2  63 75 80
Krupski   2527 2  80 76 71
Lundsford 4860 1  92 40 86
McBane    0674 1  75 78 72'),header=TRUE,
  colClasses=rep(c('character','numeric'),each=3))
# id, only numerical data
# this case - t(score[,-1:-3])
# general
t(score[,sapply(score,class)=='numeric'])

      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
Test1   94   51   95   63   80   92   75
Test2   91   65   97   75   76   40   78
Final   87   91   97   80   71   86   72

#example 2: studentid
score2 <- score
rownames(score2) <- paste('sn',score2$StudentID,sep='')
t(score2[,-1:-3])

      sn0545 sn1252 sn1167 sn1230 sn2527 sn4860 sn0674
Test1     94     51     95     63     80     92     75
Test2     91     65     97     75     76     40     78
Final     87     91     97     80     71     86     72
#example 3 student names

rownames(score2) <- score2$Student
t(score2[,-1:-3])

      Capalleti Dubose Engles Grant Krupski Lundsford McBane
Test1        94     51     95    63      80        92     75
Test2        91     65     97    75      76        40     78
Final        87     91     97    80      71        86     72

Example 4 by groups

The first 'real' example. Some columns are used to transpose on, and in addition the data are ragged: not every row has all measurements. Two unexpected problems appeared here. Location contains a space, which I solved by splitting that part off and adding it back later. Date is in a compact DDMONYY format (e.g. 2JUN95), not in my locale and not a default layout. Since I need to sort in order to get the data ordered like SAS does, I converted it to a proper date variable.
The transpose itself is not so difficult. I chose to extract the variables to transpose via grep(), so I could reuse that part for the times argument. To make it easy to see in the result what came from where, I used lower case for the times ('length1', ...) and upper case for the value column ('LENGTH'). Only Length is transposed, so Weight is dropped from the data.
# example 4 by groups

fishdata1 <- readLines(textConnection(
'Cole Pond   2JUN95 31 .25 32 .3  32 .25 33 .3
Cole Pond   3JUL95 33 .32 34 .41 37 .48 32 .28
Cole Pond   4AUG95 29 .23 30 .25 34 .47 32 .3
Eagle Lake  2JUN95 32 .35 32 .25 33 .30
Eagle Lake  3JUL95 30 .20 36 .45
Eagle Lake  4AUG95 33 .30 33 .28 34 .42'))
fishdata <- read.table(text=c('Date Length1 Weight1 Length2 Weight2 Length3 Weight3 Length4 Weight4',
           substring(fishdata1,11)),
       fill=TRUE,
       header=TRUE)
fishdata$Location <- gsub(' $','',substring(fishdata1,1,10))
Sys.setlocale(category = "LC_TIME", locale = "C")
fishdata$Date <- as.Date(fishdata$Date,'%d%b%y')
lengthfishdata <- reshape(fishdata,
    varying=list(grep('Length',names(fishdata),value=TRUE)),
    direction='long',
    idvar=c('Date','Location'),
    drop=grep('Weight',names(fishdata),value=TRUE),
    v.names='LENGTH',
    timevar='NAME',
    times=tolower(grep('Length',names(fishdata),value=TRUE)))
rownames(lengthfishdata)  <- 1:nrow(lengthfishdata)
lengthfishdata[order(lengthfishdata$Location,
        lengthfishdata$Date,
        lengthfishdata$NAME),]

         Date   Location    NAME LENGTH
1  1995-06-02  Cole Pond length1     31
7  1995-06-02  Cole Pond length2     32
13 1995-06-02  Cole Pond length3     32
19 1995-06-02  Cole Pond length4     33
2  1995-07-03  Cole Pond length1     33
8  1995-07-03  Cole Pond length2     34
14 1995-07-03  Cole Pond length3     37
20 1995-07-03  Cole Pond length4     32
3  1995-08-04  Cole Pond length1     29
9  1995-08-04  Cole Pond length2     30
15 1995-08-04  Cole Pond length3     34
21 1995-08-04  Cole Pond length4     32
4  1995-06-02 Eagle Lake length1     32
10 1995-06-02 Eagle Lake length2     32
16 1995-06-02 Eagle Lake length3     33
22 1995-06-02 Eagle Lake length4     NA
5  1995-07-03 Eagle Lake length1     30
11 1995-07-03 Eagle Lake length2     36
17 1995-07-03 Eagle Lake length3     NA
23 1995-07-03 Eagle Lake length4     NA
6  1995-08-04 Eagle Lake length1     33
12 1995-08-04 Eagle Lake length2     33
18 1995-08-04 Eagle Lake length3     34
24 1995-08-04 Eagle Lake length4     NA
In practice my call would be different: I would keep both length and weight, and put the fish number in a more logically named variable than NAME. I think SAS can only achieve this with two transpose calls and a subsequent merge, though I may be mistaken.
reshape(fishdata,
    varying=list(grep('Length',names(fishdata),value=TRUE),
        grep('Weight',names(fishdata),value=TRUE)),
    direction='long',
    idvar=c('Date','Location'),
    v.names=c('Length','Weight'),
    timevar='Fish number',
    new.row.names=1:24
)

         Date   Location Fish number Length Weight
1  1995-06-02  Cole Pond           1     31   0.25
2  1995-07-03  Cole Pond           1     33   0.32
3  1995-08-04  Cole Pond           1     29   0.23
4  1995-06-02 Eagle Lake           1     32   0.35
5  1995-07-03 Eagle Lake           1     30   0.20
6  1995-08-04 Eagle Lake           1     33   0.30
7  1995-06-02  Cole Pond           2     32   0.30
8  1995-07-03  Cole Pond           2     34   0.41
9  1995-08-04  Cole Pond           2     30   0.25
10 1995-06-02 Eagle Lake           2     32   0.25
11 1995-07-03 Eagle Lake           2     36   0.45
12 1995-08-04 Eagle Lake           2     33   0.28
13 1995-06-02  Cole Pond           3     32   0.25
14 1995-07-03  Cole Pond           3     37   0.48
15 1995-08-04  Cole Pond           3     34   0.47
16 1995-06-02 Eagle Lake           3     33   0.30
17 1995-07-03 Eagle Lake           3     NA     NA
18 1995-08-04 Eagle Lake           3     34   0.42
19 1995-06-02  Cole Pond           4     33   0.30
20 1995-07-03  Cole Pond           4     32   0.28
21 1995-08-04  Cole Pond           4     32   0.30
22 1995-06-02 Eagle Lake           4     NA     NA
23 1995-07-03 Eagle Lake           4     NA     NA
24 1995-08-04 Eagle Lake           4     NA     NA

Example 5

Example 5 is named 'Naming Transposed Variables When the ID Variable Has Duplicate Values'. I am not sure what the naming part is about; it seems that only the closing price is needed, which boils down to taking only the last observation in each category. The data here come in a fixed-width format, so I used read.fwf() for the first time, plus some calls to strip the spaces from the factor levels.
To get the transpose I borrowed the idea of a 'last.' variable from SAS, which basically flags that the current record's value differs from the next record's. This is very important in SAS because it is fundamentally row (record) organized, whereas R is column (variable) organized. Proper R would just process conditional on Time == 'closing', but what I used is functionally closer to the SAS approach.
stocks <- read.fwf(textConnection(
'Horizon Kites jun11 opening 29
Horizon Kites jun11 noon    27
Horizon Kites jun11 closing 27
Horizon Kites jun12 opening 27
Horizon Kites jun12 noon    28
Horizon Kites jun12 closing 30
SkyHi Kites   jun11 opening 43
SkyHi Kites   jun11 noon    43
SkyHi Kites   jun11 closing 44
SkyHi Kites   jun12 opening 44
SkyHi Kites   jun12 noon    45
SkyHi Kites   jun12 closing 45'
),col.names=c('Company','Date','Time','Price'),
widths=c(14,5,8,3))
levels(stocks$Company) <- gsub('(^ +)|( +$)','',levels(stocks$Company))
levels(stocks$Time) <- gsub('(^ +)|( +$)','',levels(stocks$Time))
# only last observation
islast <- function(x)  c(x[-length(x)]!=x[-1],TRUE)
reshape(stocks[islast(stocks$Date),],direction='wide',
    timevar='Date',idvar='Company',drop='Time')

        Company Price.jun11 Price.jun12
3 Horizon Kites          27          30
9   SkyHi Kites          44          45
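The 'proper R' version mentioned above, selecting the closing prices directly instead of mimicking SAS's last-record flag, would look something like this (a sketch along the lines of the call above):
reshape(stocks[stocks$Time=='closing',],direction='wide',
    timevar='Date',idvar='Company',drop='Time')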

Example 6 

Transposing for statistical analysis. We have 5 subjects, 3 programs and 7 strength assessments. Why subject is not in the original data baffles me. In SAS this is not run through proc transpose but through a data step; that doesn't stop me from using reshape(), and the subject variable is added afterwards. In the next step SAS rebuilds the data to get one file with both formats, which seems silly in an R context, so I am just transposing back, now using subject.

weights <- read.table(textConnection(
'Program s1 s2 s3 s4 s5 s6 s7
CONT  85 85 86 85 87 86 87
CONT  80 79 79 78 78 79 78
CONT  78 77 77 77 76 76 77
CONT  84 84 85 84 83 84 85
CONT  80 81 80 80 79 79 80
RI    79 79 79 80 80 78 80
RI    83 83 85 85 86 87 87
RI    81 83 82 82 83 83 82
RI    81 81 81 82 82 83 81
RI    80 81 82 82 82 84 86
WI    84 85 84 83 83 83 84
WI    74 75 75 76 75 76 76
WI    83 84 82 81 83 83 82
WI    86 87 87 87 87 87 86
WI    82 83 84 85 84 85 86
'),header=TRUE)
weights1 <- reshape(weights,direction='long',timevar='time',idvar='row',
    varying=list(paste('s',1:7,sep='')),v.names='Strength')
weights1$subject <- 1+ ((weights1$row-1) %% 5)
weights1 <- weights1[order(weights1$Program,weights1$subject,weights1$time),]
(Weights1 <- weights1[,c(1,5,2,3)])[1:15,]

    Program subject time Strength
1.1    CONT       1    1       85
1.2    CONT       1    2       85
1.3    CONT       1    3       86
1.4    CONT       1    4       85
1.5    CONT       1    5       87
1.6    CONT       1    6       86
1.7    CONT       1    7       87
2.1    CONT       2    1       80
2.2    CONT       2    2       79
2.3    CONT       2    3       79
2.4    CONT       2    4       78
2.5    CONT       2    5       78
2.6    CONT       2    6       79
2.7    CONT       2    7       78
3.1    CONT       3    1       78
reshape(Weights1,direction='wide',timevar='time',
    idvar=c('Program','subject'))

     Program subject Strength.1 Strength.2 Strength.3 Strength.4 Strength.5
1.1     CONT       1         85         85         86         85         87
2.1     CONT       2         80         79         79         78         78
3.1     CONT       3         78         77         77         77         76
4.1     CONT       4         84         84         85         84         83
5.1     CONT       5         80         81         80         80         79
6.1       RI       1         79         79         79         80         80
7.1       RI       2         83         83         85         85         86
8.1       RI       3         81         83         82         82         83
9.1       RI       4         81         81         81         82         82
10.1      RI       5         80         81         82         82         82
11.1      WI       1         84         85         84         83         83
12.1      WI       2         74         75         75         76         75
13.1      WI       3         83         84         82         81         83
14.1      WI       4         86         87         87         87         87
15.1      WI       5         82         83         84         85         84
     Strength.6 Strength.7
1.1          86         87
2.1          79         78
3.1          76         77
4.1          84         85
5.1          79         80
6.1          78         80
7.1          87         87
8.1          83         82
9.1          83         81
10.1         84         86
11.1         83         84
12.1         76         76
13.1         83         82
14.1         87         86
15.1         85         86


Sunday, May 5, 2013

Simulation shows gain of clmm over ANOVA is small

After last post's setting up of a simulation, it is now time to look at how the models compare. To my disappointment, with my simple simulations of assessor behavior the gain is minimal. Unfortunately, the simulation took much more time than I expected, so I will not expand it.

Background

I have been looking at the ordered logistic model in a number of postings. The reason is that in general I find people use ANOVA for analyzing data on a nine point scale, whereas you would think an ordered logistic model works better. Two posts (first and second) showed that the current methods in R and JAGS are easy to use and, with some tweaking, provide suitable output to present. Subsequently I made the simulated data generator. Now it is time to make the final comparison.

Simulations

The core of the simulator is explained elsewhere, so I won't explain it here again. I did however notice a small error, so the corrected code is given here. Some new parts are added: wrappers around the data generator and the analysis. To my big disappointment I could not even build those as desired. The call anova(Res.clmm2,Res.clmm) with subsequent extraction has been replaced by the uglier pchisq(2*(Res.clmm$logLik-Res.clmm2$logLik),2,lower.tail=FALSE). The 2 is the degrees of freedom for product in my simulations (three products, hence two parameters). Somehow that call to anova() did not run within a function; after trying too many variations I chose the shortcut.

library(ordinal) 
library(ggplot2)

num2scale <- function(x,scale) {
  cut(x,c(-Inf,scale,Inf),1:(length(scale)+1))
}
pop.limits2ind.limits <- function(scale,sd=.5) {
  newscale <- scale+rnorm(length(scale),sd=sd)
  newscale[order(newscale)]
}

obs.score <- function(obs,pop.score,pop.limits,
    sensitivity.sd=.1, precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
  # individual sensitivity (multiplicative)
  obs.errorfreeintensity <- center.scale + 
      (pop.score-center.scale)*rlnorm(1,sd=sensitivity.sd)
  #individual (additive) 
  obs.errorfreeintensity <- obs.errorfreeintensity +
      rnorm(1,mean=0,sd=additive.sd)
  # individual observation error 
  obs.intensity <- obs.errorfreeintensity+
      rnorm(length(pop.score),sd=precision.sd)
  
  # individual cut offs between categories  
  obs.limits <- pop.limits2ind.limits(pop.limits)
  obs.score <- num2scale(obs.intensity,obs.limits)
  data.frame(obs=obs,
      score = obs.score,
      product=labels
  )
}

panel.score <- function(n=100,pop.score,pop.limits,
    sensitivity.sd=.1,precision.sd=1,
    additive.sd=2,center.scale=5,
    labels=LETTERS[1:length(pop.score)]) {
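  # note: center.scale is accepted as an argument but not passed on to obs.score
  # below, so obs.score falls back to its own default of center.scale = 5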
  la <- lapply(1:n,function(x) {
        obs.score(x,pop.score,pop.limits,sensitivity.sd=sensitivity.sd,
            precision.sd=precision.sd,additive.sd=additive.sd,labels=labels)
      })
  dc <- do.call(rbind,la)
  dc$obs <- factor(dc$obs)
  dc
}

overallP <- function(scores) {
  Res.aov <- aov( numresponse ~ obs +  product , data=scores)
  paov <- anova(Res.aov)['product','Pr(>F)']
  Res.clmm <- clmm(score  ~ product + (1|obs),data=scores)
  Res.clmm2 <- clmm(score ~ 1 + (1|obs),data=scores)
  c(ANOVA=paov,
    clmm = pchisq(2*(Res.clmm$logLik-Res.clmm2$logLik),2,lower.tail=FALSE ) #.1687
   )
}
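For reference, the anova() based version I intended would have looked roughly as below; it is commented out because, as noted above, it refused to run inside the function, and the name of the p-value column in the anova table is an assumption on my part:
# an <- anova(Res.clmm2, Res.clmm)
# an[2, 'Pr(>Chisq)']   # column name assumed; extract the LR test p-value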

onesim <- function(prodmeans,pop.limits,center.scale,additive.sd=2) {
  scores <- panel.score(40,prodmeans,pop.limits,
      center.scale=center.scale,additive.sd=additive.sd)
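  # convert the factor scores to numeric via their levels (the categories are
  # labelled 1, 2, ... by num2scale)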
  scores$numresponse <- as.numeric(levels(scores$score))[scores$score]
  overallP(scores)
}

Simulation I, 5 categories

The first simulation is with 5 categories. This represents the Just About Right (JAR) scale. The final plot shows the difference between ANOVA and clmm is minimal.

pop.limits <- c(1,2.5,4.5,6)

nsim <- 250
sim5cat1 <- lapply(seq(0,.6,.05),function(dif) {
    prodmeans=rep(3.5,3)+c(-dif,0,dif)
    sa <- sapply(1:nsim,function(x) onesim(prodmeans,pop.limits))
    data.frame(dif=dif,nreject=as.numeric(rowSums(sa<0.05)),
        method=rownames(sa))
  }
)

sim5cat1tb <- do.call(rbind,sim5cat1)
ggplot(sim5cat1tb, aes(dif,nreject/nsim ,colour=method)) +
     geom_line() + xlab('Difference between products') + 
     ylab('Proportion significant (at 5% Test)') +
     theme(legend.position = "bottom") + ylim(c(0,1)) +
     guides(colour = guide_legend('Analysis Method'))

Simulation 2, 9 categories

This simulation represents the intensity and liking scales. Again the difference between ANOVA and clmm is minimal.

pop.limits <- c(1,3:8,10)
prodmeans <- c(7,7,7)
scores <- panel.score(40,prodmeans,pop.limits,additive.sd=1)
scores$numresponse <- as.numeric(levels(scores$score))[scores$score]
nsim <- 250
sim9cat <- lapply(seq(0.,.6,.05),function(dif) {
      prodmeans=c(7,7,7)+c(-dif,0,dif)
      sa <- sapply(1:nsim,function(x) onesim(prodmeans,pop.limits,
                additive.sd=1))
      data.frame(dif=dif,nsign=as.numeric(rowSums(sa>0.05)),
          method=rownames(sa))
    }
)
sim9cattb <- do.call(rbind,sim9cat)
ggplot(sim9cattb, aes(dif,(nsim-nsign)/nsim ,colour=method)) +
    geom_line() + xlab('Difference between products') + 
    ylab('Proportion significant (at 5% Test)') +
    theme(legend.position = "bottom") + ylim(c(0,1)) +
    guides(colour = guide_legend('Analysis Method'))

Conclusion

The differences between clmm and ANOVA seem to be minimal. I had not expected large differences, but especially at 5 categories my expectation was to find real differences, since the assumption of continuous data is violated more strongly there. Obviously, many more simulations would be needed to draw final conclusions; that is beyond the scope of my blog.
To conclude, in theory clmm is much more suitable than ANOVA for ordinal data. There are no reasons in terms of presentation to prefer ANOVA over clmm. But in practice the difference may be minimal.