Consolidation of skills

Joseph Bulbulia https://josephbulbulia.netlify.app (Victoria University of Wellington) https://www.wgtn.ac.nz
2021-MAR-16

Get data

#libraries
if (!require(skimr)) install.packages('skimr')
if (!require(lubridate)) install.packages('lubridate')

if (!requireNamespace("devtools")) {
  install.packages("devtools")
}
if (!require(easystats)) devtools::install_github("easystats/easystats")
if (!require(ggthemes)) install.packages('ggthemes')
if (!require(pmdplyr)) install.packages("pmdplyr")
if (!require(kableExtra)) install.packages("kableExtra")
# this should be part of easystats but in case not:
if (!require(report)) install.packages('report')
if (!require(brms)) install.packages('brms')
if (!require(lme4)) install.packages('lme4')
if (!require(table1)) install.packages('table1')
if (!require(modelsummary)) install.packages("modelsummary")
if (!require(naniar)) install.packages("naniar")
if (!require(ggraph)) install.packages("ggraph")
if (!require(gtsummary)) install.packages("gtsummary")


# load tidyverse
library("tidyverse")

# theme set
theme_set(theme_classic())

# uncomment below and run this code
# easystats::install_easystats_latest()
nz_0 <- readr::read_csv2(url("https://raw.githubusercontent.com/go-bayes/psych-447/main/data/nz/nz.csv"))

# take all characters and make them factors
# also get rid of duplicate rows
# Note the convention of renaming dataframe when creating a new one:
# ` nz <-nz_0 %>%... `

f<-c("None Of The Time","A Little Of The Time", "Some Of The Time",  "Most Of The Time", "All Of The Time")

nz <-nz_0 %>%
  dplyr::mutate_if(is.character, factor) %>%
  select(-c(SWB.Kessler01,SWB.Kessler02,SWB.Kessler03,SWB.Kessler04,SWB.Kessler05, SWB.Kessler06))%>%
  dplyr::mutate(Wave = as.factor(Wave))%>%
  mutate(FeelHopeless = forcats::fct_relevel(FeelHopeless,f))%>%
  mutate(FeelDepressed = forcats::fct_relevel(FeelDepressed,f))%>%
  mutate(FeelRestless = forcats::fct_relevel(FeelRestless,f))%>%
  mutate(EverythingIsEffort = forcats::fct_relevel(EverythingIsEffort,f))%>%
  mutate(FeelWorthless = forcats::fct_relevel(FeelWorthless,f))%>%
  mutate(FeelNervous = forcats::fct_relevel(FeelNervous,f))

# not used
# nz <- haven::zap_formats(nz)
# nz <- haven::zap_label(nz)
# nz <- haven::zap_widths(nz)
# nz <- haven::zap_labels(nz)

Preamble

One of the advantages of R is that it allows us to create highly effective workflows. Today, we’ll reinforce and extend the workflow skills that you’ve started to develop in previous weeks. Below we’ll be working with the nz dataset, which is a reduced, truncated, and jittered version of waves 10 and 11 of the New Zealand Attitudes and Values Study. This dataset is for teaching only; if you’d like to learn more about the study to which it belongs, go here or here.

Data carpentry continued

Different methods for selecting columns

Suppose we want to select all variables that start with Believe. We can do this in a number of ways.

First there is explicit selection:

# explicit selection 
nz %>%
  select("Believe.God", "Believe.Spirit")%>%
    glimpse()
Rows: 4,126
Columns: 2
$ Believe.God    <fct> Not Believe God, Not Believe God, Believe God…
$ Believe.Spirit <fct> Not Believe Spirit, Not Believe Spirit, Belie…

We can select all columns whose names start with a certain string. We do this by using starts_with:

nz %>%
  select(starts_with("Believe"))%>%
    glimpse()
Rows: 4,126
Columns: 2
$ Believe.God    <fct> Not Believe God, Not Believe God, Believe God…
$ Believe.Spirit <fct> Not Believe Spirit, Not Believe Spirit, Belie…

By the same token, we can select all variables whose names end with a certain string by using ends_with:

nz %>%
  select(ends_with("conditions"))%>%
    glimpse()
Rows: 4,126
Columns: 2
$ NZ.Social.Conditions   <dbl> 7, 7, 2, 6, 5, 5, 2, 0, 9, 7, 3, 2, 2…
$ NZ.Business.Conditions <dbl> 7, 8, 2, 6, 5, 5, 6, 5, 9, 7, 9, 5, 5…

We can cast a broader net and select all variables whose names contain a certain string by using contains:

nz %>%
  select(contains("Believe"))%>%
    glimpse()
Rows: 4,126
Columns: 3
$ Religion.Believe.Cats <dbl> 4, 4, 1, 1, 1, 1, 1, NA, 3, 1, 4, 4, 4…
$ Believe.God           <fct> Not Believe God, Not Believe God, Beli…
$ Believe.Spirit        <fct> Not Believe Spirit, Not Believe Spirit…

As we can see, the net that we cast using contains was too broad. We don’t want the Religion.Believe.Cats.

In R, you can programme your way out of this corner as follows:

nz %>%
  select(contains("Believe") &  -  Religion.Believe.Cats)%>%
   glimpse()
Rows: 4,126
Columns: 2
$ Believe.God    <fct> Not Believe God, Not Believe God, Believe God…
$ Believe.Spirit <fct> Not Believe Spirit, Not Believe Spirit, Belie…

However, that’s inelegant; better to drop contains altogether and revert to another method.
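For instance, we can simply go back to starts_with. Or, if we do want to keep contains, a negated helper avoids naming the unwanted column explicitly. Here is a sketch, assuming a reasonably recent dplyr/tidyselect that supports boolean operators inside select():

# revert to the earlier helper
nz %>%
  select(starts_with("Believe")) %>%
  glimpse()

# or keep contains, but pair it with a negated helper
nz %>%
  select(contains("Believe") & !contains("Cats")) %>%
  glimpse()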

Re-leveling a factor

Death, taxes, and factors are consequences of living. Let’s look at the BigDoms variable in the nz dataset, which is a factor identifying large religious denominations:

nz %>%
  dplyr::select(BigDoms)%>%
  table(useNA ="ifany")
.
 Buddhist Christian    Muslim   Not_Rel TheOthers      <NA> 
       36      1204         9      2655       128        94 

Note the use of ifany to print the NAs in this table. It’s almost never sensible to ignore missing values!

Suppose we wanted to make “Not_Rel” our base category for this factor. We could do so as follows:

## suppose we want "Not_Rel" as the base category, and rearrange the other levels
library(forcats) # this is part of the tidyverse package. 
nz1<-nz %>%
  dplyr::select(BigDoms, KESSLER6sum) %>%
  dplyr::mutate(BigDoms =  
                  forcats::fct_relevel(BigDoms, c("Not_Rel","Christian","Buddhist","Muslim","TheOthers")))

#inspect data
nz1%>%
  group_by(BigDoms)%>%
  count()
# A tibble: 6 x 2
# Groups:   BigDoms [6]
  BigDoms       n
  <fct>     <int>
1 Not_Rel    2655
2 Christian  1204
3 Buddhist     36
4 Muslim        9
5 TheOthers   128
6 <NA>         94

The reordering makes for a more sensible model because the base category is now Not_Rel, or not-religious. Hence comparisons are made relative to this category.

m0<- glm( KESSLER6sum ~ BigDoms, data = nz1 )
parameters::model_parameters(m0)  %>%
  print_html(caption = "Model of Distress by Denomination with `No Religion' as the base category")
Model of Distress by Denomination with `No Religion' as the base category
Parameter Coefficient SE 95% CI t(3991) p
(Intercept) 5.38 0.08 (5.23, 5.54) 69.49 < .001
BigDoms (Christian) -0.48 0.14 (-0.76, -0.21) -3.47 < .001
BigDoms (Buddhist) -0.97 0.67 (-2.28, 0.34) -1.45 0.147
BigDoms (Muslim) 1.17 1.33 (-1.43, 3.78) 0.88 0.378
BigDoms (TheOthers) 0.18 0.36 (-0.53, 0.88) 0.49 0.621

We can see the results better using a coefficient graph, which visualises the information presented in the table.

plot(parameters::model_parameters(m0) ) + labs(title = "Comparison of Religious groups to secular people", 
subtitle = "Christians are a little more chilled out, \n Other denoms are less chilled out")

The base category is the comparison class. Should we infer that “The Others” denomination causes greater distress? We’ll return to this, and related questions, in the upcoming weeks. For now let’s just leave it at “probably not.”

Creating factors from numerical indicators

It is almost never a good idea to transform continuous data into categorical data. However, occasionally, you will need to do so. For example, we might want to break the KESSLER6 distress indicator into its medically diagnostic categories of “low distress,” “moderate distress,” and “serious distress.” We may achieve this task using the cut function as follows:

nz <-nz %>%
  dplyr::mutate(k6cats = cut(
    KESSLER6sum,
    breaks = c(-Inf, 5, 13, Inf),   # create Kessler 6 diagnostic categories
    labels = c("Low Distress", "Moderate Distress", "Serious Distress"), 
    right = TRUE
  ))
table(nz$k6cats, useNA = "ifany")

     Low Distress Moderate Distress  Serious Distress 
             2510              1400               177 
             <NA> 
               39 

Using ifelse to create factors

I prefer to maintain control over how I make the categories. For example, in the previous example, I didn’t remember whether cut includes values to the left or to the right of a break; I had to look this up. However, I can use the ifelse function to explicitly create the relevant categories:

nz %>%
  dplyr::mutate(k6cats1 =  as.factor(ifelse(
    KESSLER6sum <= 5,
    "Low Distress",
    ifelse(KESSLER6sum <= 13,  "Moderate Distress", "Serious Distress")
  ))) %>%
  group_by(k6cats1) %>%
  count()
# A tibble: 4 x 2
# Groups:   k6cats1 [4]
  k6cats1               n
  <fct>             <int>
1 Low Distress       2510
2 Moderate Distress  1400
3 Serious Distress    177
4 <NA>                 39
#check this is the same as the previous method
nz %>%
  group_by(k6cats) %>%
  count()
# A tibble: 4 x 2
# Groups:   k6cats [4]
  k6cats                n
  <fct>             <int>
1 Low Distress       2510
2 Moderate Distress  1400
3 Serious Distress    177
4 <NA>                 39

We can see that this method returns the same values as the cut method above.
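If nested ifelse calls become hard to read, dplyr::case_when offers a flatter alternative. Here is a sketch that should reproduce the same categories (k6cats2 is just an illustrative name):

nz %>%
  dplyr::mutate(k6cats2 = factor(
    dplyr::case_when(
      KESSLER6sum <= 5  ~ "Low Distress",        # scores 0-5
      KESSLER6sum <= 13 ~ "Moderate Distress",   # scores 6-13
      !is.na(KESSLER6sum) ~ "Serious Distress"   # scores 14 and above
    ),
    levels = c("Low Distress", "Moderate Distress", "Serious Distress")
  )) %>%
  group_by(k6cats2) %>%
  count()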

Transformations of indicators: scaling, centering, and logs

Throughout this course, we’ll be standardising and centering indicators. Occasionally, we’ll need to perform log transformations. You’ll need to know how to do this.

Suppose we want to standardise the Relid indicator. This will transform the Relid indicator into standard deviation units. In later seminars, we’ll explain why this transformation is useful. For now, this is how you do it:

nz1 <- nz %>%
  select(Relid)%>%
  mutate(religousid_s = scale(Relid, scale = TRUE, center  = TRUE))
head(nz1)
# A tibble: 6 x 2
  Relid religousid_s[,1]
  <dbl>            <dbl>
1     0           -0.632
2     0           -0.632
3     0           -0.632
4     0           -0.632
5     0           -0.632
6     0           -0.632

What happened? The variable name for our standardised variable looks weird: religousid_s[,1]

This isn’t a worry. We use the variable as we would any other and all is fine.1

sjPlot::tab_model(lm(religousid_s ~ 1 , data = nz1))
  religousid_s
Predictors Estimates CI p
(Intercept) 0.00 -0.03 – 0.03 1.000
Observations 3963
R2 / R2 adjusted 0.000 / 0.000
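If the matrix column bothers you, you can wrap scale() in as.numeric() to drop the matrix structure that scale() returns. A minimal sketch (nz1b is just an illustrative name):

nz1b <- nz %>%
  select(Relid) %>%
  mutate(religousid_s = as.numeric(scale(Relid)))  # a plain numeric vector, no [,1]
head(nz1b)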

Pro tip 1:

Transform your data as the last step in your pipe workflow.

This is because if you filter cases after scaling, you’ll end up with a variable that isn’t measured in standard deviation units for the sample you retain:

nza <- nz %>%
  select(Relid, BigDoms)%>%
  mutate(religousid_s = scale(Relid, scale = TRUE, center  = TRUE)) 
nzb <- nz %>%
  select(Relid, BigDoms)%>%
  mutate(religousid_s = scale(Relid, scale = TRUE, center  = TRUE)) %>%
  filter(BigDoms !="Not_Rel")

# compare
summary(nza$religousid_s)
       V1         
 Min.   :-0.6315  
 1st Qu.:-0.6315  
 Median :-0.6315  
 Mean   : 0.0000  
 3rd Qu.: 0.9223  
 Max.   : 2.0877  
 NA's   :163      
# with
summary(nzb$religousid_s)
       V1         
 Min.   :-0.2431  
 1st Qu.: 0.9223  
 Median : 1.3107  
 Mean   : 1.2819  
 3rd Qu.: 2.0877  
 Max.   : 2.0877  
 NA's   :69       

When we scale first and filter afterwards, the mean value in the filtered dataset is 1.28 – everything has changed! To keep the indicator in standard deviation units, do the scaling as the final step:

nz1 <- nz1 %>%
  select(Relid)%>%
  mutate(religousid_s = scale(Relid, scale = TRUE, center  = TRUE))
head(nz1)
# A tibble: 6 x 2
  Relid religousid_s[,1]
  <dbl>            <dbl>
1     0           -0.632
2     0           -0.632
3     0           -0.632
4     0           -0.632
5     0           -0.632
6     0           -0.632

or simply:

nz1 <- nz1 %>%
  select(Relid) %>%
  mutate(religousid_s = scale(Relid))

head(nz1)
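To see the pro tip in action, here is a sketch that filters first and scales last, reusing the filter from nzb above (nzc is just an illustrative name); the standardised variable should again have a mean of zero within the retained cases:

nzc <- nz %>%
  select(Relid, BigDoms) %>%
  filter(BigDoms != "Not_Rel") %>%     # filter first
  mutate(religousid_s = scale(Relid))  # transform last
summary(nzc$religousid_s)              # mean should now be 0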

To center a variable without scaling it, we set scale = FALSE, center = TRUE:

nz1 <- nz %>%
  mutate(religousid_c = scale(Relid, scale = FALSE, center  = TRUE))

# inspect new indicator
nz1%>%
  select(Relid,religousid_c)%>%
    glimpse()
Rows: 4,126
Columns: 2
$ Relid        <dbl> 0, 0, 0, 0, 0, 0, 7, 7, 2, 2, 0, 0, 0, 0, 0, 0,…
$ religousid_c <dbl[,1]> <matrix[23 x 1]>

We use the log transformation for extreme values. We can create a new indicator by combining mutate and log as follows:

nz1 <- nz %>%
  mutate(charitydonate_log = log(CharityDonate + 1))

# inspect new indicator
nz1 %>%
  select(CharityDonate,charitydonate_log)%>%
    glimpse()
Rows: 4,126
Columns: 2
$ CharityDonate     <dbl> 180, 80, 300, 100, 4200, 3500, 400, 350, 5…
$ charitydonate_log <dbl> 5.198497, 4.394449, 5.707110, 4.615121, 8.…

Note that we have to add +1 inside the log transformation because, as you will recall, the log of zero is undefined: you cannot obtain zero by raising a base to the power of any number.
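As a quick sanity check, you can confirm this in the console; a minimal sketch:

log(0)      # -Inf: zero donations would be unusable
log(0 + 1)  # 0: with the offset, zero donations map to zero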

Create and work with dates

nz <- nz %>%
  dplyr::mutate(date = make_date(year = 2009, month = 6, day = 30) + TSCORE)  # TSCORE counts days from the first date of data collection in this study

We can analyze dates. For example, over how many minutes were data collected?

nz %>%
  select(date)%>%
  summary()
      date           
 Min.   :2018-01-02  
 1st Qu.:2018-08-09  
 Median :2019-10-03  
 Mean   :2019-05-18  
 3rd Qu.:2019-12-07  
 Max.   :2020-10-06  
int<-lubridate::interval(ymd("2018-01-02"), ymd("2020-10-06"))

#time in years
time_length(int, "year")
[1] 2.759563
#time in minutes
time_length(int, "minutes")
[1] 1451520

Fun! So much so that you have some homework that will work with dates.

Create a timeline

Here we’re going to graph the number of responses each day for the years of data collection.

library(lubridate)
library(ggplot2)


datrep <- nz %>%
  count(day = floor_date(date, "day"))%>%
  dplyr::mutate(Year = factor(ifelse(
    day < "2018-01-01",
    2017,
    ifelse(day < "2019-01-01", 2018,
           ifelse(day < "2020-01-01", 2019, 2020))
  ))) %>%
  arrange(day)

# create the graph
ggplot(datrep, aes(day, n)) +
  geom_col(aes(fill = Year)) +
  scale_x_date(date_labels = "%b/%Y")  +
  xlab("Days") + ylab("Count of Responses") +
  ggtitle("Our Dataset's Daily Counts")  +
  theme_classic()  +
  scale_fill_viridis_d()

Note that we can use the datrep dataframe that we created to explore aspects of data collection. For example, we can arrange the dataset by day, in descending order of the number of participants sampled:

datrep%>%
  arrange(desc(n))
# A tibble: 607 x 3
   day            n Year 
   <date>     <int> <fct>
 1 2018-06-21   112 2018 
 2 2018-06-22    93 2018 
 3 2018-06-24    80 2018 
 4 2018-06-20    67 2018 
 5 2018-06-23    59 2018 
 6 2018-06-26    58 2018 
 7 2019-12-03    54 2019 
 8 2018-06-25    52 2018 
 9 2019-10-04    47 2019 
10 2019-12-02    46 2019 
# … with 597 more rows

Take note of that code; you might need it for your workbook.

What might we do with dates? Well, we might ask: were there any inherently stressful days?

To see this, we can take average stress levels by day, and then see where the high average stress days fall.

tn<-nz %>%
  select(date,KESSLER6sum,Id) %>%
  group_by(date)%>%
  summarise(
   av_distress =  mean(KESSLER6sum, na.rm = TRUE),
   n = n_distinct(Id)
  ) %>%
  arrange(desc(av_distress))%>%
  glimpse()
Rows: 607
Columns: 3
$ date        <date> 2018-12-16, 2020-07-06, 2018-11-13, 2019-01-05,…
$ av_distress <dbl> 21.00000, 16.00000, 15.50000, 15.00000, 15.00000…
$ n           <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, …

Graphing the daily averages reveals the following:

tn%>%
  ggplot(., aes(date, av_distress)) + 
  geom_col(aes(fill =(n))) + scale_x_date(date_labels = "%b/%Y")  + theme_classic() + scale_fill_viridis_c()

Clearly the “stressful days” are an artifact of days with low numbers of participant respondents.
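One way to check this, reusing the tn dataframe, is to restrict attention to days with a reasonable number of respondents, say ten or more, and see whether any extreme averages survive; a sketch (the threshold of 10 is arbitrary):

tn %>%
  filter(n >= 10) %>%              # keep only days with at least 10 respondents
  arrange(desc(av_distress)) %>%
  head()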

Let’s see whether there are any stressful days of the week. We do this by creating a weekday variable using the wday function in the lubridate package. Let’s graph our results using a pipe %>% workflow:

nz %>%
  select(Id, date, KESSLER6sum) %>%
  mutate(weekdays = wday(date, label = TRUE)) %>%
  group_by(weekdays) %>%
  summarise(
    mn_k6 =  mean(KESSLER6sum, na.rm = TRUE),
    sd_k6 =  sd(KESSLER6sum, na.rm = TRUE),
    n_k6w = n()
  ) %>%
  mutate(
    se_k6 = sd_k6 / sqrt(n_k6w),
    lw_ci = mn_k6 - qt(1 - (0.05 / 2), n_k6w - 1) * se_k6,
    up_ci = mn_k6 + qt(1 - (0.05 / 2), n_k6w - 1) * se_k6
  ) %>%
  ggplot(., aes(x = weekdays, y = mn_k6, colour = mn_k6)) +
  geom_errorbar(aes(ymin = lw_ci, ymax = up_ci), width = .1) +
  geom_point(size = 3)  +
  scale_y_continuous(limits = c(0,7)) + 
  theme_classic() + scale_colour_viridis_c()

Despite the variability over the two years of data collection, the error bars of the graph overlap: we don’t find differences in distress by day of the week.

“Ok Boomer,” you ask, “what if we were to calculate distress by generational cohorts?”

My reply: I’m not a boomer, I’m a GenX-er. I’m keen to check it out:

nz %>%
  select(GenCohort, KESSLER6sum) %>%
  group_by(GenCohort) %>%
  summarise(
    mn_k6 =  mean(KESSLER6sum, na.rm = TRUE),
    sd_k6 =  sd(KESSLER6sum, na.rm = TRUE),
    n_k6w = n()
  ) %>%
  mutate(
    se_k6 = sd_k6 / sqrt(n_k6w),
    lw_ci = mn_k6 - qt(1 - (0.05 / 2), n_k6w - 1) * se_k6,
    up_ci = mn_k6 + qt(1 - (0.05 / 2), n_k6w - 1) * se_k6
  ) %>%
  ggplot(., aes(x = GenCohort, y = mn_k6, colour = GenCohort)) +
  geom_errorbar(aes(ymin = lw_ci, ymax = up_ci), width = .1) +
  geom_point(size = 3)  +
  theme_classic() +
  geom_hline(yintercept = 5,
             colour = "red",
             linetype = "dashed") +
  scale_y_continuous(limits = c(0, 10)) +
  theme(
    legend.text = element_text(size = 6),
    legend.title = element_text(size = 8),
    axis.text.x = element_blank()
  ) +
  xlab("Birth Generation Cohort") +
  ylab("Kessler 6 Distress") +
  labs(title = "Average Distress by Birth Cohort",
       subtitle = "Red line indicates clinically moderate distress threshold") +
  scale_colour_viridis_d() 

Later, we’ll ask why you’re so stressed out.

Slice

Dplyr’s slice function can be handy. Say we only want the first four rows

datrep%>%
  arrange(desc(n)) %>%
  slice(1:4)
# A tibble: 4 x 3
  day            n Year 
  <date>     <int> <fct>
1 2018-06-21   112 2018 
2 2018-06-22    93 2018 
3 2018-06-24    80 2018 
4 2018-06-20    67 2018 

Say we only want the 1st row, the 3rd row, and the 20th row

datrep%>%
  dplyr::arrange(desc(n)) %>%
  dplyr::slice(c(1,3,20))
# A tibble: 3 x 3
  day            n Year 
  <date>     <int> <fct>
1 2018-06-21   112 2018 
2 2018-06-24    80 2018 
3 2020-03-18    38 2020 
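Recent versions of dplyr (1.0 or later) also provide dedicated helpers such as slice_head and slice_max; a sketch, assuming you have such a version installed:

# the first four rows after arranging
datrep %>%
  arrange(desc(n)) %>%
  slice_head(n = 4)

# or let slice_max do the sorting: the days with the most responses (ties kept by default)
datrep %>%
  slice_max(n, n = 4)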

Lags and leads using timeseries data

Create a difference variable for change in Kessler 6

library("pmdplyr")
df <-nz %>%
  dplyr::filter(!is.na(KESSLER6sum))%>%
  mutate(wave = as.numeric(Wave))%>%
   mutate(lag_k6 = tlag(KESSLER6sum,
    .i = Id, # id variable
    .t = wave # time series variable, needs to be numeric
  ))%>%
  mutate(diff_k6 = lag_k6 - KESSLER6sum) %>%
  select(Id,Wave,KESSLER6sum,diff_k6,Emp.JobSecure,Employed)%>%
  arrange(desc(diff_k6)) 
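If you prefer to stay within dplyr, a grouped lag gives a similar result. This is a sketch (df2 is just an illustrative name), assuming each Id contributes at most one row per wave and there are no gaps between waves:

df2 <- nz %>%
  dplyr::filter(!is.na(KESSLER6sum)) %>%
  mutate(wave = as.numeric(Wave)) %>%
  group_by(Id) %>%
  arrange(wave, .by_group = TRUE) %>%
  mutate(lag_k6 = dplyr::lag(KESSLER6sum)) %>%  # previous wave's score within each Id
  ungroup() %>%
  mutate(diff_k6 = lag_k6 - KESSLER6sum)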

What should we do with this new variable? Well, we might explore whether employment security relates to distress change:

df %>%
  filter(Wave == 2019) %>%
  mutate(employed_employsecurity = as.factor(ifelse(Employed ==1, Emp.JobSecure,0)))%>%
  ggplot(data = ., aes(x = diff_k6, fill = employed_employsecurity) )+
   geom_histogram() + 
  xlab("Difference in K6 eleveation (cases above 5)") + 
  ylab("Counts of cases") + 
  labs(subtitle ="No clear relationship between unemployment insecurity and distress change")+
  scale_fill_viridis_d(name = "Employment Security 1-7") + theme_classic() + 
  theme(legend.position = "bottom")

And, remarkably, we don’t see much evidence of a relationship in this cross-sectional analysis.

# create data frame with new variable; zero codes the unemployed
dfnew <- df %>%
  filter(Wave == 2019) %>%
  mutate(employed_employsecurity = as.numeric(ifelse(Employed == 1, Emp.JobSecure, 0)))%>%
filter(!is.na(employed_employsecurity))

head(dfnew)
# A tibble: 6 x 7
     Id Wave  KESSLER6sum diff_k6 Emp.JobSecure Employed
  <dbl> <fct>       <dbl>   <dbl>         <dbl>    <dbl>
1   165 2019            0      20            NA        0
2  1936 2019            3      15            NA        0
3  1037 2019           10      14             4        1
4   722 2019            2      12             7        1
5  1568 2019            8      12            NA        0
6  1606 2019            7      12             7        1
# … with 1 more variable: employed_employsecurity <dbl>
# Graph
ggplot(dfnew, aes(y = diff_k6, employed_employsecurity)) +
  geom_jitter(alpha = .2) +
  geom_smooth(method = lm) +
  xlab("employed_employsecurity") +
  ylab("Kessler 6 distress jumps over 5") +
  ggtitle("Jumps in distress change not related to employement insecurity") +
  scale_fill_viridis_d() + theme_classic()

However, perhaps our indicator is misleading us. We can formally model the relationship between employment security and Kessler6 distress across two years

# create dataframe with the variables we need
dfnew2 <- df %>%
  mutate(employed_employsecurity = as.numeric(ifelse(Employed == 1, Emp.JobSecure, 0))) %>%
  filter(!is.na(employed_employsecurity)) %>%
  dplyr::mutate(employsecurity_s = scale(employed_employsecurity))

# multi-level model 

m00a<-lme4::lmer(KESSLER6sum ~  employsecurity_s * Wave + (1|Id), data = dfnew2)
plot(ggeffects::ggpredict(m00a, terms=c("employsecurity_s", "Wave")),
     add.data = TRUE, jitter = 0.2, dot.alpha =.05) + geom_hline(yintercept = 5,
             colour = "red",
             linetype = "dashed") + 
  labs(title = "There is a relationship between employment security and Kessler6 distress")

This suggests a stable negative relationship between employment security and distress. So is there a causal relationship? Not necessarily. Again, we return to causal inference in the upcoming weeks. For now, we want to alert you to an important lesson:

Pro tip 2

Do not read too much into your descriptive analysis!

This is especially true when creating new variables. Just because you can make a variable doesn’t mean you should use it, or interpret it!

Put differently, our workflow will require much more than descriptive statistics.

Data summary

Summarise all your data

The skimr package

The skimr package can be helpful in detecting problems (a drawback is that it interprets all factors as numbers).

For example (I won’t run the following code; you will do so for your homework):

library("skimr")
nz %>%
  select(-date) %>% #not useful
  dplyr::group_by(Wave) %>%
  skim()

However, I want to point out that skimr works with individual columns, and it accepts a tidy workflow.

nz %>%
  dplyr::group_by(Wave) %>%
  select(KESSLER6sum,HLTH.SleepHours)  %>%
  skim() 
Table 1: Data summary
Name Piped data
Number of rows 4126
Number of columns 3
_______________________
Column type frequency:
numeric 2
________________________
Group variables Wave

Variable type: numeric

skim_variable Wave n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
KESSLER6sum 2018 28 0.99 5.14 4.00 0 2 4 7 24 ▇▆▂▁▁
KESSLER6sum 2019 11 0.99 5.35 3.97 0 2 5 7 24 ▇▆▂▁▁
HLTH.SleepHours 2018 103 0.95 6.92 1.07 3 6 7 8 11 ▁▆▇▅▁
HLTH.SleepHours 2019 78 0.96 6.88 1.09 2 6 7 8 12 ▁▃▇▁▁

Table1 & other canned table packages

In earlier seminars, we encountered the table1 package, which makes really great html tables:

library(table1)

table1::table1(~Age  +
                 GenCohort +
                 Male + 
                 Edu +
                 Pol.Orient + 
                 Relid + 
                 BigDoms   | Wave, data = nz,
               overall = FALSE)
2018
(N=2063)
2019
(N=2063)
Age
Mean (SD) 50.2 (13.4) 50.9 (13.4)
Median [Min, Max] 52.0 [18.0, 92.0] 53.0 [19.0, 92.0]
GenCohort
Gen Boombers: born >= 1946 & b.< 1961 676 (32.8%) 676 (32.8%)
Gen_Silent: born< 1946 51 (2.5%) 51 (2.5%)
GenX: born >=1961 & b.< 1980 915 (44.4%) 915 (44.4%)
GenY: born >=1980 & b.< 1996 368 (17.8%) 368 (17.8%)
GenZ: born >= 1996 53 (2.6%) 53 (2.6%)
Male
Male 755 (36.6%) 752 (36.5%)
Not_Male 1305 (63.3%) 1305 (63.3%)
Missing 3 (0.1%) 6 (0.3%)
Edu
Mean (SD) 5.54 (2.73) 5.71 (2.66)
Median [Min, Max] 7.00 [0, 10.0] 7.00 [0, 10.0]
Missing 77 (3.7%) 50 (2.4%)
Pol.Orient
Mean (SD) 3.58 (1.40) 3.58 (1.37)
Median [Min, Max] 4.00 [1.00, 7.00] 4.00 [1.00, 7.00]
Missing 119 (5.8%) 69 (3.3%)
Relid
Mean (SD) 1.69 (2.58) 1.56 (2.57)
Median [Min, Max] 0 [0, 7.00] 0 [0, 7.00]
Missing 51 (2.5%) 112 (5.4%)
BigDoms
Buddhist 17 (0.8%) 19 (0.9%)
Christian 624 (30.2%) 580 (28.1%)
Muslim 4 (0.2%) 5 (0.2%)
Not_Rel 1301 (63.1%) 1354 (65.6%)
TheOthers 75 (3.6%) 53 (2.6%)
Missing 42 (2.0%) 52 (2.5%)

Unfortunately, the table1 package only prints html tables.

For publications, I might use the modelsummary package

library("modelsummary")
nnz<-nz %>%
  dplyr::select(Age, 
                Male,
                BigDoms,
                Edu,
                GenCohort,
                Relid,
                Pol.Orient,
                Wave)
modelsummary::datasummary_balance( ~ Wave, data=nnz, dinm=FALSE, output = 'table.tex')

I’ll put the LaTeX output into my document because I generally prefer to write in LaTeX.

However, if you want to print the table inline, you can simply use:

library("modelsummary")
nnz<-nz %>%
  dplyr::select(Age, 
                Male,
                BigDoms,
                Edu,
                GenCohort,
                Relid,
                Pol.Orient,
                Wave)
modelsummary::datasummary_balance( ~ Wave, data=nnz, dinm=FALSE)
2018 (N=2063)
2019 (N=2063)
Mean Std. Dev. Mean Std. Dev.
Age 50.2 13.4 50.9 13.4
Edu 5.5 2.7 5.7 2.7
Relid 1.7 2.6 1.6 2.6
Pol.Orient 3.6 1.4 3.6 1.4
N % N %
Male Male 755 36.6 752 36.5
Not_Male 1305 63.3 1305 63.3
BigDoms Buddhist 17 0.8 19 0.9
Christian 624 30.2 580 28.1
Muslim 4 0.2 5 0.2
Not_Rel 1301 63.1 1354 65.6
TheOthers 75 3.6 53 2.6
GenCohort Gen Boombers: born >= 1946 & b.< 1961 676 32.8 676 32.8
Gen_Silent: born< 1946 51 2.5 51 2.5
GenX: born >=1961 & b.< 1980 915 44.4 915 44.4
GenY: born >=1980 & b.< 1996 368 17.8 368 17.8
GenZ: born >= 1996 53 2.6 53 2.6

Create a table using pipe functions

Above we saw how to create a clunky table using table(x). However, R has lots of functionality for making better tables.

library(kableExtra)
nz %>%
  select(k6cats, Wave) %>%
  filter(!is.na(k6cats))%>%
  group_by( Wave, k6cats) %>%
  summarise(n = n())%>%
  kbl(caption = "Distress by Year") %>%
   kable_classic_2(c("striped", "hover"), full_width = TRUE)%>%
  collapse_rows()
Table 2: Distress by Year
Wave k6cats n
2018 Low Distress 1272
2018 Moderate Distress 675
2018 Serious Distress 88
2019 Low Distress 1238
2019 Moderate Distress 725
2019 Serious Distress 89

Note that we can use the pivot_wider function to spread the dataframe to enable a table that is easier to interpret.

Credit where credit is due: I just learned about pivot_wider from Johannes and Thorven. I’m keen to get pivot_longer and pivot_wider into my vocabulary, and to do more things, like this:

nz %>%
  select(k6cats, Wave) %>%
  filter(!is.na(k6cats))%>%
  group_by( Wave, k6cats) %>%
  summarise(n = n())%>%
 pivot_wider(names_from = Wave, values_from = n) %>%
   kbl(caption = "Distress counts by year") %>%
   kable_classic_2(c("striped", "hover"), full_width = TRUE)
Table 3: Distress counts by year
k6cats 2018 2019
Low Distress 1272 1238
Moderate Distress 675 725
Serious Distress 88 89

Nice!
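For completeness, pivot_longer reverses the operation; a sketch that takes the wide table back to one row per Wave-by-category count:

nz %>%
  select(k6cats, Wave) %>%
  filter(!is.na(k6cats)) %>%
  group_by(Wave, k6cats) %>%
  summarise(n = n()) %>%
  pivot_wider(names_from = Wave, values_from = n) %>%
  pivot_longer(cols = -k6cats, names_to = "Wave", values_to = "n")  # back to long format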

Bar graphs

For categorical data, in place of tables we can use bar graphs

Here’s the table:

table(nz$BigDoms)

 Buddhist Christian    Muslim   Not_Rel TheOthers 
       36      1204         9      2655       128 

Here’s the bar graph:

ggplot(nz) + 
  geom_bar(mapping = aes(x = BigDoms))

Note that we can re-order the factor levels to produce a nicer output, using fct_infreq

ggplot(nz) + 
  geom_bar(mapping = aes(x = fct_infreq(BigDoms))  )

Missing data graphs

What do you notice about the patterns of missingness in this graph?

library("naniar")
naniar::vis_miss(nz) #What do you notice? 

Here, we find all the problem cases:
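One way to list the most affected variables is with naniar’s summary helpers; a sketch:

naniar::miss_var_summary(nz) %>%
  arrange(desc(pct_miss)) %>%  # most incomplete variables first
  head(10)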

What explains these features of missingness?

Boxplots

A box plot provides visual information for the following statistics: the median, the lower and upper quartiles (the box), the range of typical values (the whiskers), and outlying points.

There’s a simple explanation here

We can use base R to investigate differences in distress among big denominations:

# using base R
boxplot(KESSLER6sum ~ BigDoms, data = nz, notch = TRUE, col = c("cadetblue1","orange","red","darkblue","brown"))

Here’s a ggplot boxplot:

ggplot(data = nz, aes(x = KESSLER6sum, y = BigDoms, fill = BigDoms)) + 
  geom_boxplot(notch=TRUE) + scale_fill_viridis_d() + 
  ggtitle("If the notches don't overlap, there's likely a difference") + 
  geom_jitter(alpha = .05)

Here’s a ggplot2 boxplot with points overlaid, and jittered. This allows us to see the differences in sample sizes:

ggplot(data = nz, aes(x = KESSLER6sum, y = BigDoms, fill = BigDoms)) + 
  geom_boxplot(notch=TRUE) + scale_fill_viridis_d() + 
  ggtitle("If the notches don't overlap, there's likely a difference") + 
  geom_jitter(alpha = .07)

We could look at differences by wave:

ggplot(data = nz, aes(x = KESSLER6sum, y = BigDoms, fill = BigDoms)) + 
  geom_boxplot(notch=TRUE) + scale_fill_viridis_d() + 
   geom_jitter(alpha = .07) + 
  facet_grid(Wave ~ .) +
  ggtitle("If the notches don't overlap, there's likely a difference") 

Correlation graphs

Johannes will describe a method for making a correlation plot. Here is another method.

bzsec<-nz%>%
  select(
    Your.Future.Security,
    Standard.Living,
    NZ.Economic.Situation,
    NZ.Business.Conditions,
    Emp.JobSecure,
    CharityDonate,
    Id
  ) %>%
  mutate_all(., as.numeric) %>% # make all selected columns numeric
  mutate(Id = as.factor(Id), # make Id a factor for the multilevel correlation below
         CharityDonate = log(CharityDonate + 1)) # log-transform donations

# make a correlation plot using the "correlation" package from easystats

library(correlation)
p1<-bzsec %>%
  correlation(partial = FALSE, multilevel = TRUE ) %>%
  plot()

Print summary

bzsec %>%
  correlation(partial = FALSE, multilevel = TRUE ) %>%
  summary()
Parameter              | CharityDonate | Emp.JobSecure | NZ.Business.Conditions | NZ.Economic.Situation | Standard.Living
-------------------------------------------------------------------------------------------------------------------------
Your.Future.Security   |         0.05* |       0.24*** |                0.44*** |               0.31*** |         0.34***
Standard.Living        |       0.09*** |       0.18*** |                0.22*** |               0.28*** |                
NZ.Economic.Situation  |          0.03 |       0.11*** |                0.43*** |                       |                
NZ.Business.Conditions |          0.03 |       0.14*** |                        |                       |                
Emp.JobSecure          |         0.05* |               |                        |                       |                

Let’s set multilevel to FALSE.

library(correlation)
p2<-bzsec %>%
  select(-Id)%>% # get rid of grouping variable
  correlation(partial = FALSE, multilevel = FALSE ) %>%
  plot()
#print summary
bzsec %>%
  select(-Id)%>% # get rid of grouping variable
  correlation(partial = FALSE, multilevel = FALSE ) %>%
  summary()
# Correlation Matrix (pearson-method)

Parameter              | CharityDonate | Emp.JobSecure | NZ.Business.Conditions | NZ.Economic.Situation | Standard.Living
-------------------------------------------------------------------------------------------------------------------------
Your.Future.Security   |       0.19*** |       0.35*** |                0.52*** |               0.41*** |         0.52***
Standard.Living        |       0.22*** |       0.28*** |                0.31*** |               0.36*** |                
NZ.Economic.Situation  |       0.07*** |       0.15*** |                0.52*** |                       |                
NZ.Business.Conditions |       0.08*** |       0.21*** |                        |                       |                
Emp.JobSecure          |       0.09*** |               |                        |                       |                

p-value adjustment method: Holm (1979)
library(patchwork)
# create a two panel plot
p1 / p2 + 
  plot_annotation(title = "Plot of multilevel (a) and single-level (b) correlation", tag_levels = 'a')

We can see even greater correlations between the variables. This is because the model does not adjust for the repeated measures, which create dependencies in the data.

The report package

The report package from the easystats group is a powerful tool for saving time. Before extolling its virtues, I’d like to point out two major limitations.

First, the package is in development. Currently, it has lots of bugs.

Second, the package uses terminology that won’t work for all contexts and purposes. For example, it uses the term “significant” to describe p values that are below the traditional p = .05 threshold.

If you learn nothing else from this course, you should learn never to use “significant” to describe a p value. You may, if you like, use “statistically significant”; however, it would be better altogether if you simply dropped p-values from data analysis. We’ll show you how. With those provisos in mind, consider some useful functionality from the report package.

# create a demographic dataframe
nz_demagraphics <- nz %>%
  select(Age, GenCohort, Male, Edu, Pol.Orient, Relid, BigDoms, Wave)

# now a nice way to save you time when reporting
paste(
  report::report_participants(
    nz_demagraphics, 
    group = "Wave", 
    age = "Age",
    sex = "Male",
    education = "Edu",
    spell_n = TRUE),
  "were recruited in the study by through enticement by lollipops. Those who did not volunteer were coerced."
  )
[1] "For the 'Wave - 2018' group: Two Thousand, Sixty Three participants (Mean age = 50.2, SD = 13.4, range: [18, 92]; 0.0% females; Mean education = 5.5, SD = 2.7, range: [0, 10]) and for the 'Wave - 2019' group: Two Thousand, Sixty Three participants (Mean age = 50.9, SD = 13.4, range: [19, 92]; 0.0% females; Mean education = 5.7, SD = 2.7, range: [0, 10]) were recruited in the study by through enticement by lollipops. Those who did not volunteer were coerced."

The table function of report isn’t great yet. However, it has some nice features. For example, you should always report your session information, and presenting it in tabular form clarifies the elements.

Try running the following code on your own:
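For example, report can summarise your session information directly; a sketch, assuming a recent version of report that exports these helpers:

library("report")
report::report(sessionInfo())  # a summary of the R version and loaded packages
report::report_system()        # just the R version / platform details
report::report_packages()      # just the package versions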

Here’s another method you can try on your own: a demographic table made with report_table.

report_table(nz_demagraphics)

Here’s a data summary

library("report")
nz %>%
  group_by(Wave)%>%
  select(
    "Wave", 
    "Age",
    "Male",
    "Edu",
    "Relid",
    "Pol.Orient",
    "KESSLER6sum",
    "FeelHopeless",
    "FeelDepressed",
    "FeelRestless",
    "EverythingIsEffort",
    "FeelWorthless",
    "FeelNervous"
    )%>%
  report() %>% 
  summary()
The data contains 4126 observations, grouped by Wave, of the following variables:

- 2018 (n = 2063):
  - Age: Mean = 50.19, SD = 13.38, range: [18, 92]
  - Male: 2 levels, namely Male (n = 755), Not_Male (n = 1305) and missing (n = 3)
  - Edu: Mean = 5.54, SD = 2.73, range: [0, 10], 3.73% missing
  - Relid: Mean = 1.69, SD = 2.58, range: [0, 7], 2.47% missing
  - Pol.Orient: Mean = 3.58, SD = 1.40, range: [1, 7], 5.77% missing
  - KESSLER6sum: Mean = 5.14, SD = 4.00, range: [0, 24], 1.36% missing
  - FeelHopeless: 5 levels, namely None Of The Time (n = 1012), A Little Of The Time (n = 627), Some Of The Time (n = 325), Most Of The Time (n = 58), All Of The Time (n = 11) and missing (n = 30)
  - FeelDepressed: 5 levels, namely None Of The Time (n = 1481), A Little Of The Time (n = 345), Some Of The Time (n = 156), Most Of The Time (n = 35), All Of The Time (n = 12) and missing (n = 34)
  - FeelRestless: 5 levels, namely None Of The Time (n = 520), A Little Of The Time (n = 751), Some Of The Time (n = 585), Most Of The Time (n = 143), All Of The Time (n = 29) and missing (n = 35)
  - EverythingIsEffort: 5 levels, namely None Of The Time (n = 523), A Little Of The Time (n = 827), Some Of The Time (n = 490), Most Of The Time (n = 159), All Of The Time (n = 32) and missing (n = 32)
  - FeelWorthless: 5 levels, namely None Of The Time (n = 1469), A Little Of The Time (n = 348), Some Of The Time (n = 147), Most Of The Time (n = 48), All Of The Time (n = 19) and missing (n = 32)
  - FeelNervous: 5 levels, namely None Of The Time (n = 537), A Little Of The Time (n = 804), Some Of The Time (n = 509), Most Of The Time (n = 148), All Of The Time (n = 28) and missing (n = 37)

- 2019 (n = 2063):
  - Age: Mean = 50.94, SD = 13.38, range: [19, 92]
  - Male: 2 levels, namely Male (n = 752), Not_Male (n = 1305) and missing (n = 6)
  - Edu: Mean = 5.71, SD = 2.66, range: [0, 10], 2.42% missing
  - Relid: Mean = 1.56, SD = 2.57, range: [0, 7], 5.43% missing
  - Pol.Orient: Mean = 3.58, SD = 1.37, range: [1, 7], 3.34% missing
  - KESSLER6sum: Mean = 5.35, SD = 3.97, range: [0, 24], 0.53% missing
  - FeelHopeless: 5 levels, namely None Of The Time (n = 961), A Little Of The Time (n = 660), Some Of The Time (n = 348), Most Of The Time (n = 67), All Of The Time (n = 8) and missing (n = 19)
  - FeelDepressed: 5 levels, namely None Of The Time (n = 1399), A Little Of The Time (n = 403), Some Of The Time (n = 195), Most Of The Time (n = 47), All Of The Time (n = 7) and missing (n = 12)
  - FeelRestless: 5 levels, namely None Of The Time (n = 501), A Little Of The Time (n = 794), Some Of The Time (n = 577), Most Of The Time (n = 158), All Of The Time (n = 17) and missing (n = 16)
  - EverythingIsEffort: 5 levels, namely None Of The Time (n = 470), A Little Of The Time (n = 798), Some Of The Time (n = 572), Most Of The Time (n = 179), All Of The Time (n = 29) and missing (n = 15)
  - FeelWorthless: 5 levels, namely None Of The Time (n = 1434), A Little Of The Time (n = 359), Some Of The Time (n = 193), Most Of The Time (n = 51), All Of The Time (n = 12) and missing (n = 14)
  - FeelNervous: 5 levels, namely None Of The Time (n = 540), A Little Of The Time (n = 824), Some Of The Time (n = 539), Most Of The Time (n = 122), All Of The Time (n = 25) and missing (n = 13)

Notes:

More about the report package: here

This package is brought to you by easystats

Measures

When reporting your study, it is extremely important to include information about your measure. For example:

We measure psychological distress using the Kessler-6 scale (R. C. Kessler et al. 2002), which exhibits strong diagnostic concordance for moderate and severe psychological distress in large, cross-cultural samples (Prochaska et al. 2012). Participants rated, during the past 30 days, how often did… (a) “… you feel hopeless”; (b) “… you feel so depressed that nothing could cheer you up”; (c) “… you feel restless or fidgety”; (d) “… you feel that everything was an effort”; (e) “… you feel worthless”; (f) “… you feel nervous?” Ordinal response alternatives for the Kessler-6 are: “None of the time”; “A little of the time”; “Some of the time”; “Most of the time”; “All of the time.”

We report sample descriptive statistics for indicators of personal Kessler-6 distress below in Table 1.

Table 1
library(gtsummary)
tb1 <-nz %>%
  dplyr::select(
    KESSLER6sum,
    FeelHopeless,
    FeelDepressed,
    FeelRestless,
    EverythingIsEffort,
    FeelWorthless,
    FeelNervous,
    Wave,
  ) %>%
  gtsummary::tbl_summary(
    by = Wave,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    missing_text = "(Missing)"
  )%>%
  bold_labels() 
tb1
Characteristic 2018, N = 2,063¹ 2019, N = 2,063¹
KESSLER6sum 5.14 (4.00) 5.35 (3.97)
(Missing) 28 11
FeelHopeless
None Of The Time 1,012 / 2,033 (50%) 961 / 2,044 (47%)
A Little Of The Time 627 / 2,033 (31%) 660 / 2,044 (32%)
Some Of The Time 325 / 2,033 (16%) 348 / 2,044 (17%)
Most Of The Time 58 / 2,033 (2.9%) 67 / 2,044 (3.3%)
All Of The Time 11 / 2,033 (0.5%) 8 / 2,044 (0.4%)
(Missing) 30 19
FeelDepressed
None Of The Time 1,481 / 2,029 (73%) 1,399 / 2,051 (68%)
A Little Of The Time 345 / 2,029 (17%) 403 / 2,051 (20%)
Some Of The Time 156 / 2,029 (7.7%) 195 / 2,051 (9.5%)
Most Of The Time 35 / 2,029 (1.7%) 47 / 2,051 (2.3%)
All Of The Time 12 / 2,029 (0.6%) 7 / 2,051 (0.3%)
(Missing) 34 12
FeelRestless
None Of The Time 520 / 2,028 (26%) 501 / 2,047 (24%)
A Little Of The Time 751 / 2,028 (37%) 794 / 2,047 (39%)
Some Of The Time 585 / 2,028 (29%) 577 / 2,047 (28%)
Most Of The Time 143 / 2,028 (7.1%) 158 / 2,047 (7.7%)
All Of The Time 29 / 2,028 (1.4%) 17 / 2,047 (0.8%)
(Missing) 35 16
EverythingIsEffort
None Of The Time 523 / 2,031 (26%) 470 / 2,048 (23%)
A Little Of The Time 827 / 2,031 (41%) 798 / 2,048 (39%)
Some Of The Time 490 / 2,031 (24%) 572 / 2,048 (28%)
Most Of The Time 159 / 2,031 (7.8%) 179 / 2,048 (8.7%)
All Of The Time 32 / 2,031 (1.6%) 29 / 2,048 (1.4%)
(Missing) 32 15
FeelWorthless
None Of The Time 1,469 / 2,031 (72%) 1,434 / 2,049 (70%)
A Little Of The Time 348 / 2,031 (17%) 359 / 2,049 (18%)
Some Of The Time 147 / 2,031 (7.2%) 193 / 2,049 (9.4%)
Most Of The Time 48 / 2,031 (2.4%) 51 / 2,049 (2.5%)
All Of The Time 19 / 2,031 (0.9%) 12 / 2,049 (0.6%)
(Missing) 32 14
FeelNervous
None Of The Time 537 / 2,026 (27%) 540 / 2,050 (26%)
A Little Of The Time 804 / 2,026 (40%) 824 / 2,050 (40%)
Some Of The Time 509 / 2,026 (25%) 539 / 2,050 (26%)
Most Of The Time 148 / 2,026 (7.3%) 122 / 2,050 (6.0%)
All Of The Time 28 / 2,026 (1.4%) 25 / 2,050 (1.2%)
(Missing) 37 13

1 Mean (SD); n / N (%)

Note that you can use the gtsummary package to create in-line referencing. For example: Average Kessler-6 distress in 2018 was 5.14 (4.00) and in 2019 was 5.35 (3.97).
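Under the hood, sentences like that can be generated with gtsummary::inline_text(), which pulls a cell out of the table object; a sketch (argument names may vary slightly across gtsummary versions):

# the 2018 summary cell for KESSLER6sum from the tb1 object created above
gtsummary::inline_text(tb1, variable = KESSLER6sum, column = "2018")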

Order of your Method section

The following is a brief guide to describing your method. We’ll be returning to report writing in future weeks. For now, I just want to put this on the table for you. The advice is just a guide.

Participants
  • Participant or subject characteristics
  • Sampling procedures
  • Sample size and power

Materials
  • Primary and secondary measures
  • Quality of measurements

Procedure
  • Data collection methods
  • Research design (e.g., experimental, correlational, or descriptive)
  • Data processing and diagnostics (e.g., outlier removal)
  • Data analysis strategy (e.g., comparison or regression tests)

Below are the sampling procedures from the New Zealand Attitudes and Values Study, from which the nz teaching dataset was drawn.

Appendix 1A Sampling Procedure – NZAVS Time 10 (2018; conducted from 18.06.2018-28.09.2019)

The Time 10 (2018) NZAVS contained responses from 47,951 participants (18,010 retained from one or more previous waves). The sample retained 2,964 participants from the Time 1 (2009) sample (a retention rate of 45.5%). The sample retained 14,049 participants from Time 9 (2017; a retention rate of 82.3% from the previous year). Participants who provided an email address were first emailed and invited to complete an online version if they preferred. Participants who did not complete the online version (or did not provide an email) were then posted a copy of the questionnaire, with a second postal follow-up two months later. We staggered the time of contact, so that participants who had completed the previous wave were contacted approximately one year after they last completed the questionnaire. We offered a prize draw for participation (five draws each for $1000 grocery vouchers, $5000 total prize pool). All participants were posted a Season’s Greetings card from the NZAVS research team and informed that they had been automatically entered into a bonus seasonal grocery voucher prize draw. Participants were also emailed an eight-page newsletter about the study.

To boost sample size and increase sample diversity for subsequent waves, a booster sample was conducted by selecting people from the New Zealand electoral roll. As with previous booster samples, sampling was conducted without replacement (i.e., people included in previous sample frames were identified and removed from the 2018 roll). The sample frame consisted of 325,000 people aged from 18-65 randomly selected from the 2018 New Zealand Electoral Roll, who were currently residing in New Zealand (one can be registered to vote in New Zealand but be living overseas). The electoral roll contained ~3,250,000 registered voters. The New Zealand Electoral Roll contains participants’ date of birth (within a one-year window), and we limited our frame to people who were 65 or younger, due to our aim of retaining participants longitudinally. We concurrently advertised the survey on Facebook via a $5000 paid promotion of a link to a YouTube video describing the NZAVS and the large booster sample we were conducting. The advertisement targeted men and women aged 18-65+ who lived in New Zealand and ran for 14 days. This paid promotion reached 147,296 people, with 4,721 link clicks (i.e., clicking to watch the video), according to Facebook. The goal of the paid promotion was twofold: (a) to increase name recognition of the NZAVS during the period in which questionnaires were being posted, and (b) to help improve retention by potentially reaching previous participants who happened to see the advertisement. A total of 29,293 participants who were contained in our sample frame completed the questionnaire (response rate = 9.2% when adjusting for the 98.2% accuracy of the 2018 electoral roll). A further 648 participants completed the questionnaire, but were unable to be matched to our sample frame (for example, due to a lack of contact information) or were unsolicited opt-ins. Informal analysis indicates that unsolicited opt-ins were often the partners of existing participants.

Appendix 1B Sampling Procedure – NZAVS Time 11 (2019; conducted from 29.09.2019-17.10.2020)

The Time 11 wave was conducted during the COVID-19 pandemic. Procedures thus differed in that there was an increased focus on online delivery using email reminders and extensive Facebook advertising, no Christmas card, and incomplete phoning of non-respondents.

The Time 11 (2019) NZAVS contained responses from 42,684 participants (36,522 retained from one or more previous waves). The sample retained 2,506 participants from the Time 1 (2009) sample (a retention rate of 38.4%). The sample retained 34,782 participants from Time 10 (2018; a retention rate of 72.5% from the previous year). Participants who provided an email address were first emailed and invited to complete an online version if they preferred. Participants who did not complete the online version (or did not provide an email) were then posted a copy of the questionnaire, with a second postal follow-up two months later. We staggered the time of contact, so that participants who had completed the previous wave were contacted approximately one year after they last completed the questionnaire. A second reminder email was sent approximately four months following the first email attempt. We offered a prize draw for participation (five draws each for $1000 grocery vouchers, $5000 total prize pool). Participants were also emailed an eight-page newsletter about the study. As in past years, three attempts were made to phone non-respondents using each available cell and landline number. However, due to the university closure during COVID-19 lockdowns, phoning attempts were made for only 54% of the phoning pool (11,687 from a total of 21,636 non-respondents who provided at least one phone number).

Two additional forms of recruitment were also introduced during Time 11. The first was a large information box in the questionnaire (taking a full page on the paper version), which asked people: ‘Do you have a partner who would also like to join the NZAVS?’ with additional details for how partners might join the study (see questionnaire for the full text). The second was a Facebook advertisement. The advertisement targeted men and women aged 18-65+ who lived in New Zealand and ran from 4th April 2020 – 4th July 2020 (overlapping with New Zealand’s first lockdown period and recovery), and again from 18th August 2020 – 4th September (during the second Auckland lockdown). Given the unprecedented nature of the COVID-19 lockdowns, we thought it important to maximise sampling during these periods. The goal of the Facebook advertisement was threefold: (a) to increase name recognition of the NZAVS and remind people to complete the paper/online version already posted/emailed to them, (b) to help improve retention by potentially reaching previous lost participants who happened to see the advertisement, and (c) to recruit new participants (and also the partners of existing participants) while people were at home with some possibly having more free time during lockdown. This last goal was indirect and not explicitly stated in the advertisement.

The Facebook advertisement read as follows: “Participate in the New Zealand Attitudes and Values Study. Complete the 2020 Questionnaire online” with the body of text: “If you are part of the NZAVS, but have not heard from us in the last year, then please consider completing the 2020 questionnaire online. The study is more important than ever as we aim to understand the impact of COVID-19 on mental health, wellbeing and resilience in our communities. We wish you all the best at this time and hope you keep well and stay safe.” This paid promotion reached 883,969 people, with 37,850 link clicks (i.e., clicking the link for the Qualtrics survey) according to Facebook. A total of 6106 people went on to complete the questionnaire and provide full contact details, and were thus included in the dataset (4734 were new participants opting in to the study, and 1372 were previously ‘lost’ participants).

Appendix 2 Johannes’s mini-lecture on the papaja package

Lecture

Papaja R markdown template

Appendix 3 Style advice about research methods

APA style advice here

References

Kessler, R C, G Andrews, L J Colpe, E Hiripi, D K Mroczek, S L T Normand, E E Walters, and A M Zaslavsky. 2002. “Short Screening Scales to Monitor Population Prevalences and Trends in Non-Specific Psychological Distress.” Psychol. Med. 32 (6): 959–76.
Kessler, Ronald C, Jennifer Greif Green, Michael J Gruber, Nancy A Sampson, Evelyn Bromet, Marius Cuitan, Toshi A Furukawa, et al. 2010. “Screening for Serious Mental Illness in the General Population with the K6 Screening Scale: Results from the WHO World Mental Health (WMH) Survey Initiative.” Int. J. Methods Psychiatr. Res. 19 Suppl 1 (June): 4–22.
Prochaska, Judith J, Hai-Yen Sung, Wendy Max, Yanling Shi, and Michael Ong. 2012. “Validity Study of the K6 Scale as a Measure of Moderate Mental Distress Based on Mental Health Treatment Need and Utilization.” Int. J. Methods Psychiatr. Res. 21 (2): 88–97.

  1. Notice that the intercept here is zero. This is because we centered the new indicator at zero, and we wrote a model that estimates the population average for this outcome (an intercept-only model). Don’t worry if you don’t know what an intercept is; we’ll get to regression in a few weeks.↩︎


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bulbulia (2021, March 16). Psych 447: Consolidation of skills. Retrieved from https://vuw-psych-447.netlify.app/posts/4_1/

BibTeX citation

@misc{bulbulia2021consolidation,
  author = {Bulbulia, Joseph},
  title = {Psych 447: Consolidation of skills},
  url = {https://vuw-psych-447.netlify.app/posts/4_1/},
  year = {2021}
}