Sunday, March 30, 2014

Looking at Measles Data in Project Tycho

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I wanted to play around with the data a bit, so I registered.

Measles

Measles are in level 2 data. These are standardized data for immediate use and include a large number of diseases, locations, and years. These data are not complete because standardization is ongoing. The data are retrieved as measles cases and while I know I should convert to cases per 100 000, I have not done so here.
The data come in wide format, so the first step is conversion to long format. The Northern Mariana Islands variable was created as logical, so I removed it. In addition, data from before 1927 seemed very sparse, so those are removed too.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r1 <- subset(r1,,-NORTHERN.MARIANA.ISLANDS)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='State',
    direction='long')
r2$State=factor(r2$State)
r3 <- r2[r2$YEAR>1927,]

Plotting

The first plot is total cases by year. It shows the drop in cases from vaccine (Licensed vaccines to prevent the disease became available in 1963. An improved measles vaccine became available in 1968.)
qplot(x=Year,
        y=x,
        data=with(r3,aggregate(Cases,list(Year=YEAR),function(x) sum(x,na.rm=TRUE))),


        ylab='Sum registered Measles Cases')+
    theme(text=element_text(family='Arial'))


Occurrence within a year by week

Winter and spring seems to be the periods in which most cases occur. The curve seems quite smooth, with a few small fluctuations. The transfer between week 52 and week 1 is a bit steep, which may be because I removed week 53 (only present in part of the years).

qplot(x=Week,
        y=x,
        data=with(r3[r3$WEEK!=53 & r3$YEAR<1963,],
            aggregate(Cases,list(Week=WEEK),
                function(x) sum(x,na.rm=TRUE))),
        ylab='Sum Registered Measles Cases',
        main='Measles 1928-1962')+
    theme(text=element_text(family='Arial'))

A more detailed look

Trying to understand why the week plot was not smooth, I made that plot with year facets. This revealed an interesting number of zeros, which are an artefact of processing method (remember, sum(c(NA,NA),na.rm=TRUE)=0). I do not know if the data distinguishes between 0 and '-'. There are 872 occurrences of 0 which suggests 0 is used. On the other hand, week 6 and 9 in 1980 in Arkansas each have one case, the other weeks from 1 to 22 are '-', which suggests 0 is not used. My feeling for that time part is that registration became lax after measles was under control and getting reliable data from the underlying documentation is a laborious task. 

References


Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

3 comments:

  1. The dip and spike at the end of the year are quite clear in the UK data (see ), generally attributed to reporting delays (cases pile up on doctors' desks during the holiday season and get reported a week later) -- I think this is referred to by Fine and Clarkson (International journal of epidemiology 1982) as well.

    ReplyDelete
  2. Hello - could you tell me from where I can get the data dump 'MEASLES_Cases_1909-1982_20140323140631.csv' . Or could you share this to sekharajith@gmail.com

    ReplyDelete
    Replies
    1. The data is in project Tycho. You have to register there first. The measles data are in level 2 data. The last part of the filename seems to be date time when I downloaded the data, so that would be different by now..

      Delete