Alluvial graph: NZAVS sample retention by wave and sample frame
The fundamental challenge of longitudinal research is participant retention. This graph presents sample retention for the New Zealand Attitudes and Values Study (NZAVS). Figure 1 presents the sample frames within a single graph.
For more information about the NZAVS see: here and here
Show the code
# First we import the data, and create an indicator for measured/lost/withdrawn.library("ggalluvial") # alluvial graphs# data sent by Chris Sibley Aug 2021sampling_data <- haven::read_sav(here::here("data", "GraphData2.sav"))# remove SPSS labelssampling_data <- haven::zap_formats(sampling_data)sampling_data <- haven::zap_label(sampling_data)sampling_data <- haven::zap_widths(sampling_data)sampling_data <- haven::zap_labels(sampling_data)# rename Id variablesampling_data <- sampling_data %>%rename(Id = Questionnaire.Num)## transform data into long form, required for ggalluvial, and rename the wavesinner_join( sampling_data %>% dplyr::select(c( Id,c('T01',"T02",'T03',"T035","T04",'T05',"T06",'T07',"T08","T09","T10","T11","T12" ) )) %>%#~~~~ Left side = date component ~~~~~~~~gather(Wave, YearMeasured, -Id) %>%#~ long form = 1 row per prod per seq ~mutate(Wave =recode_factor( Wave,T01 ="2009",T02 ="2010",T03 ="2011",T035 ="2011.5",T04 ="2012",T05 ="2013",T06 ="2014",T07 ="2015",T08 ="2016",T09 ="2017",T10 ="2018",T11 ="2019",T12 ="2020" ) ) %>%arrange(Id), sampling_data %>% dplyr::select(c(Id, starts_with('SampleOriginYear'))) %>%arrange(Id),) -> sdf# define sample origin yearsdf <- sdf %>%mutate(SampleOriginYear = SampleOriginYear +2008)# Make sample frame into a factorsdf$SampleOriginYear <-as.factor(sdf$SampleOriginYear)# Created levels for "missing", "not yet measured" and "deceased"# then filter "not yet measured"sdf.0<- sdf %>%mutate(Wave =as.numeric(as.character(Wave))) %>%arrange(Wave, Id) %>%group_by(Id) %>%mutate(first = { YearMeasured ==1 } %>% { . *!duplicated(.) }) %>%mutate(value_tmp =if_else(first ==1, Wave, NA_real_),firstwave =mean(value_tmp, na.rm =TRUE) # this is a hack, but works ) %>%mutate(state =ifelse( YearMeasured ==-1,"deceased",ifelse( YearMeasured ==0& Wave < firstwave,"notyetmeasured",ifelse(YearMeasured ==0& Wave > firstwave, "missing","measured") ) )) %>% dplyr::mutate(Wave =as.factor(Wave)) %>%# return Wave to a factor dplyr::select(Wave, Id, state, YearMeasured, SampleOriginYear) %>% dplyr::filter(state !="notyetmeasured") %>%droplevels() %>%arrange(Id, Wave)# This is to create a "recovered" state. Not used but could be useful for modelling recovery:sdf.01<- sdf.0%>%group_by(Id) %>%mutate(lag_state = dplyr::lag(state, n =1, default ="init"))sdf.1<- sdf.01%>%group_by(Id, Wave) %>%mutate(recovered =ifelse(lag_state =="missing"& state =="measured", "recovered", state))# We need to back fill the missing values for our alluvial graph.# This is OK because a participants original sampling frame does not change.ssdf0 <- sdf.1%>%# Fill missing values for Sample.Framearrange(Id) %>%group_by(Id) %>%fill(SampleOriginYear) %>%ungroup()# now we create the "state_frame" variabletf <- ssdf0 %>%mutate(SampleOriginYear =as.character(SampleOriginYear))ssdf <- tf %>%mutate(state_frame =ifelse( state =="missing","missing",ifelse(state =="deceased", "deceased", SampleOriginYear) ))ssdf.0<- ssdf %>%mutate(state_frame =factor(state_frame))# next we order the levels of the factor to create a pretty graphssdf.01<- ssdf.0%>%mutate(state_frame = forcats::fct_relevel( state_frame,c("2009","2010","2011","2011.5","2012","2013","2014","2015","2016","2017","2018","2019","2020","missing","deceased" ) )) %>%rename(Recovered = recovered) # for pretty table
Alluvial graph: NZAVS sample retention by wave, faceted by sample frame