Missing data, measurement error, and course wrap-up

Johannes Karl (https://johannes-karl.com), Victoria University of Wellington (https://www.wgtn.ac.nz); Joseph Bulbulia (https://josephbulbulia.netlify.app), Victoria University of Wellington (https://www.wgtn.ac.nz)
2021-MAY-25

Some thoughts on measurement

Psychology is dominated by the latent variable perspective. Most constructs we are interested in cannot be measured directly, but rather have to be inferred from a range of indicators (for an example, see Figure 1).

Figure 1: Standard latent variable model.


bfi <- psych::bfi

# reverse-score the negatively keyed BFI items (responses are on a 1-6 scale)
reverse_bfi <- c("A1", "C4", "C5", "E1", "E2", "O2", "O5")
bfi[reverse_bfi] <- lapply(bfi[reverse_bfi], function(x) {
  car::recode(x, "1 = 6; 2 = 5; 3 = 4; 4 = 3; 5 = 2; 6 = 1")
})

What is reliability?

Many students learn what reliability is, but by the time they reach research praxis they have often forgotten the details, and what remains is the \(\alpha > .70\) rule. A first correction: this rule was not suggested by Cronbach. The cut-off was suggested by Nunnally and Bernstein (1994). Let's see what the original source says:

“A satisfactory level of reliability depends on how a measure is being used. In the early stages of predictive or construct validation research, time and energy can be saved using instruments that have only modest reliability, e.g., .70. If significant correlations are found, corrections for attenuation will estimate how much the correlations will increase when reliabilities of measures are increased. If these corrected values look promising, it will be worth the time and effort to increase the number of items and reduce measurement error. It can be argued that increasing reliabilities much beyond .80 in basic research is often wasteful of time and money. Measurement error attenuates correlations very little at that level. Strenuous and unnecessary efforts at standardization in addition to increasing the number of items might be required to obtain a reliability of, say, .90. In contrast to the standards used to compare groups, a reliability of .80 may not be nearly high enough in making decisions about individuals. Group research is often concerned with the size of correlations and with mean differences among experimental treatments, for which a reliability of .80 is adequate. However, a great deal hinges on the exact test scores when decisions are made about individuals. If, for example, children with IQs below 70 are to be placed in special classes, it may make a great deal of difference whether a child has an IQ of 65 or 75 on a particular test. When selection standards are quite rigorous, decisions depend on very small score differences, and so it is difficult to accept any measurement error. We have noted that the standard error of measurement is almost one-third as large as the overall standard deviation of test scores even when the reliability is .90. If important decisions are made with respect to specific test scores, a reliability of .90 is the bare minimum, and a reliability of .95 should be considered the desirable standard. However, never switch to a less valid measure simply because it is more reliable.” (p. 264)

To boil down what Nunnally said: reliability can never replace validity, and the reliability your measure should achieve depends on the context of application. The commonly applied rule of .70 is fitting for some contexts, specifically contexts in which one wants to save time and effort. Lance, Butts, and Michels (2006) make the important point that this is not the context of most published papers. In most contexts we would want \(\alpha\) to be above .80, and above .90 or .95 if we make judgments about individuals based on the test, such as personality questionnaires used for hiring.
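
Nunnally's "corrections for attenuation" refer to Spearman's classic formula: the correlation between two error-free (true-score) variables equals the observed correlation divided by the square root of the product of the two reliabilities. A minimal sketch, using made-up numbers purely for illustration:

# Spearman's correction for attenuation: r_true = r_obs / sqrt(rel_x * rel_y)
r_obs <- 0.30  # hypothetical observed correlation between two scales
rel_x <- 0.70  # hypothetical reliability of scale x
rel_y <- 0.80  # hypothetical reliability of scale y

r_obs / sqrt(rel_x * rel_y)  # roughly 0.40: the estimated correlation free of measurement error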

In the discussion on the applicability of \(\alpha\) Nunnally makes statements such as:

“Coefficient alpha usually provides a good estimate of reliability because sampling of content is usually the major source of measurement error for static constructs and also because it is sensitive to the ‘sampling’ of situational factors as well as item content.” (p. 252)

But what is the “usual” case which Nunnally assumes? We could take it to be the case in which the underlying assumptions of \(\alpha\) are met. Let's have a look at these assumptions to better appreciate Nunnally's usual case. In this section I am paraphrasing the excellent article by McNeish (2018), and I highly recommend reading it for a more detailed discussion.

  1. The scale is uni-dimensional.
  2. Scale items are on a continuous scale and normally distributed.
  3. The scale adheres to tau equivalence, i.e. all items load equally on the underlying factor (see the lavaan sketch after this list).
  4. The errors of the items do not covary.
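
One way to probe the tau-equivalence assumption is to compare a congeneric model, in which every item has its own loading, against a tau-equivalent model, in which the loadings are constrained to be equal. A minimal sketch using the lavaan package (lavaan is not used elsewhere in this post; treat this as an illustration rather than part of the analysis below):

library(lavaan)

# congeneric model: loadings freely estimated
cong_mod <- 'E =~ E1 + E2 + E3 + E4 + E5'
# tau-equivalent model: all loadings constrained to be equal
tau_mod  <- 'E =~ a*E1 + a*E2 + a*E3 + a*E4 + a*E5'

fit_cong <- cfa(cong_mod, data = bfi)
fit_tau  <- cfa(tau_mod, data = bfi)

# a significant chi-square difference suggests the equal-loadings assumption does not hold
anova(fit_tau, fit_cong)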

Let's look at the reliability of our Extraversion scale:

print(psych::alpha(bfi[paste0("E", 1:5)]))

Reliability analysis   
Call: psych::alpha(x = bfi[paste0("E", 1:5)])

  raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd
      0.76      0.76    0.73      0.39 3.2 0.007  4.1 1.1
 median_r
     0.38

 lower alpha upper     95% confidence boundaries
0.75 0.76 0.78 

 Reliability if an item is dropped:
   raw_alpha std.alpha G6(smc) average_r S/N alpha se
E1      0.73      0.73    0.67      0.40 2.6   0.0084
E2      0.69      0.69    0.63      0.36 2.3   0.0095
E3      0.73      0.73    0.67      0.40 2.7   0.0082
E4      0.70      0.70    0.65      0.37 2.4   0.0091
E5      0.74      0.74    0.69      0.42 2.9   0.0078
    var.r med.r
E1 0.0044  0.38
E2 0.0028  0.35
E3 0.0071  0.40
E4 0.0033  0.38
E5 0.0043  0.42

 Item statistics 
      n raw.r std.r r.cor r.drop mean  sd
E1 2777  0.72  0.70  0.59   0.52  4.0 1.6
E2 2784  0.78  0.76  0.69   0.61  3.9 1.6
E3 2775  0.68  0.70  0.58   0.50  4.0 1.4
E4 2791  0.75  0.75  0.66   0.58  4.4 1.5
E5 2779  0.64  0.66  0.52   0.45  4.4 1.3

Non missing response frequency for each item
      1    2    3    4    5    6 miss
E1 0.09 0.13 0.16 0.15 0.23 0.24 0.01
E2 0.09 0.14 0.22 0.12 0.24 0.19 0.01
E3 0.05 0.11 0.15 0.30 0.27 0.13 0.01
E4 0.05 0.09 0.10 0.16 0.34 0.26 0.00
E5 0.03 0.08 0.10 0.22 0.34 0.22 0.01

This all looks fine, but when we check the assumptions of \(\alpha\) the picture is less rosy. The items do not load equally on the factor: E2 shows a high loading while E5 loads substantially lower, which violates tau equivalence.

psych::fa(bfi[paste0("E", 1:5)], nfactors = 1)
Factor Analysis using method =  minres
Call: psych::fa(r = bfi[paste0("E", 1:5)], nfactors = 1)
Standardized loadings (pattern matrix) based upon correlation matrix
    MR1   h2   u2 com
E1 0.60 0.36 0.64   1
E2 0.72 0.52 0.48   1
E3 0.59 0.34 0.66   1
E4 0.69 0.47 0.53   1
E5 0.52 0.28 0.72   1

                MR1
SS loadings    1.98
Proportion Var 0.40

Mean item complexity =  1
Test of the hypothesis that 1 factor is sufficient.

The degrees of freedom for the null model are  10  and the objective function was  1.11 with Chi Square of  3112.54
The degrees of freedom for the model are 5  and the objective function was  0.03 

The root mean square of the residuals (RMSR) is  0.03 
The df corrected root mean square of the residuals is  0.05 

The harmonic number of observations is  2767 with the empirical chi square  67.5  with prob <  3.4e-13 
The total number of observations was  2800  with Likelihood Chi Square =  86.38  with prob <  3.9e-17 

Tucker Lewis Index of factoring reliability =  0.948
RMSEA index =  0.076  and the 90 % confidence intervals are  0.063 0.091
BIC =  46.7
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy             
                                                   MR1
Correlation of (regression) scores with factors   0.88
Multiple R square of scores with factors          0.78
Minimum correlation of possible factor scores     0.55

So what are the alternatives to alpha? First, I do not think that we need to replace alpha; Raykov and Marcoulides (2017) make some interesting points on its continued usage. So instead of replacing it, we could try supplementing it. Alternatives suggested by McNeish (2018) are omega total and the greatest lower bound (GLB). So how would we compute these reliabilities? We have a number of options from different packages, such as the psych package.


psy_omega <- psych::omega(bfi[paste0("E", 1:5)], nfactors = 3, poly = F)
psy_omega
Omega 
Call: omegah(m = m, nfactors = nfactors, fm = fm, key = key, flip = flip, 
    digits = digits, title = title, sl = sl, labels = labels, 
    plot = plot, n.obs = n.obs, rotate = rotate, Phi = Phi, option = option, 
    covar = covar)
Alpha:                 0.76 
G.6:                   0.73 
Omega Hierarchical:    0.66 
Omega H asymptotic:    0.83 
Omega Total            0.79 

Schmid Leiman Factor loadings greater than  0.2 
      g  F1*  F2*   F3*   h2   u2   p2
E1 0.49 0.36            0.37 0.63 0.65
E2 0.59 0.50            0.60 0.40 0.57
E3 0.66                 0.47 0.53 0.92
E4 0.58 0.38       0.22 0.53 0.47 0.64
E5 0.59                 0.40 0.60 0.87

With eigenvalues of:
   g  F1*  F2*  F3* 
1.70 0.53 0.04 0.11 

general/max  3.22   max/min =   13.29
mean percent general =  0.73    with sd =  0.15 and cv of  0.21 
Explained Common Variance of the general factor =  0.72 

The degrees of freedom are -2  and the fit is  0 
The number of observations was  2800  with Chi Square =  0  with prob <  NA
The root mean square of the residuals is  0 
The df corrected root mean square of the residuals is  NA

Compare this with the adequacy of just a general factor and no group factors
The degrees of freedom for just the general factor are 5  and the fit is  0.11 
The number of observations was  2800  with Chi Square =  312.5  with prob <  2.1e-65
The root mean square of the residuals is  0.09 
The df corrected root mean square of the residuals is  0.13 

RMSEA index =  0.148  and the 10 % confidence intervals are  0.135 0.162
BIC =  272.81 

Measures of factor score adequacy             
                                                 g   F1*
Correlation of scores with factors            0.83  0.61
Multiple R square of scores with factors      0.68  0.38
Minimum correlation of factor score estimates 0.37 -0.25
                                                F2*   F3*
Correlation of scores with factors             0.20  0.41
Multiple R square of scores with factors       0.04  0.17
Minimum correlation of factor score estimates -0.92 -0.66

 Total, General and Subset omega for each subset
                                                 g  F1*
Omega total for total scores and subscales    0.79 0.74
Omega general for total scores and subscales  0.66 0.47
Omega group for total scores and subscales    0.13 0.27
                                               F2*  F3*
Omega total for total scores and subscales    0.45 0.38
Omega general for total scores and subscales  0.43 0.35
Omega group for total scores and subscales    0.02 0.03

I personally prefer the ufs package, which contains the nifty scaleStructure function. Importantly, there is a wide range of packages, and a substantial number of papers continue to advance research in this area.


print(ufs::scaleStructure(bfi[paste0("E", 1:5)]))

Information about this analysis:

                 Dataframe: bfi[paste0("E", 1:5)]
                     Items: all
              Observations: 2713
     Positive correlations: 10 out of 10 (100%)

Estimates assuming interval level:

             Omega (total): 0.8
      Omega (hierarchical): 0.65
   Revelle's omega (total): 0.8
Greatest Lower Bound (GLB): 0.8
             Coefficient H: 0.78
         Coefficient alpha: 0.76

(Estimates assuming ordinal level not computed, as the polychoric correlation matrix has missing values.)

Note: the normal point estimate and confidence interval for omega are based on the procedure suggested by Dunn, Baguley & Brunsden (2013) using the MBESS function ci.reliability, whereas the psych package point estimate was suggested in Revelle & Zinbarg (2008). See the help ('?scaleStructure') for more information.

Some alternatives to the latent variable perspective

In recent years, network models have found increased use in psychology as conceptual alternatives to the latent variable model (even though they are mathematically equivalent). In the network approach we assume that communities of items arise from the interactions among the items, not from an underlying latent variable. The network approach commonly models the associations between variables as regularized partial correlations. This means that variables are first correlated pairwise while controlling for all other variables in the dataset. Subsequently, a regularization step iteratively shrinks correlations that are close to 0 to exactly 0, reducing the complexity of the structure.

# estimate a regularized partial correlation network (graphical lasso tuned with EBIC)
net_mdl <- bootnet::estimateNetwork(select(bfi, -age, -education, -gender), "EBICglasso")
# plot the estimated network
qgraph::qgraph(net_mdl$graph, layout = "spring")

# bootstrap exploratory graph analysis: identifies item communities across resamples
ega_out <-
  EGAnet::bootEGA(
    select(bfi, -age, -education, -gender),
    plot.typicalStructure = T,
    iter = 1000
  )

Measurement and validity

Do your measures describe facts about how people are? This is a mission-critical question, yet it has received very little attention.

We find this advice from Jessica Flake is helpful: https://www.youtube.com/watch?v=Cq6n7AS_r8w

To summarise:

Include measurement in your research plan & pre-register that plan. Ask yourself:

  1. What is your construct? (What are you trying to measure?)
  2. Why and how did you select your measures?
  3. How do you use and operationalise your construct?
  4. How do you quantify your measure?
  5. Did you modify your measure? If so, why? Report…
  6. Justify all measure modifications…

Points developed in this paper: https://psyarxiv.com/c45t9/

See also the work of Eiko Fried: https://eiko-fried.com/

Validity, measurement error, and causal confounding.

Measurement error and recall bias:

Measurement error is the degree to which we mismeasure a variable, which can lead to bias in a number of ways. If the error is dependent on the exposure or the outcome, (e.g. we measure the exposure with less accuracy for a group without a disease than with it), it is called differential measurement error. If the error has nothing to do with the exposure or outcome, it’s called non-differential measurement error. Under most conditions, non-differential error will bias the estimate of effect towards the null. In this way, it’s at least predictable, but for small effects or inadequately powered studies, it can make a true effect disappear. If there is error in both the exposure and outcome, the errors themselves can also be associated, opening a back-door path between the exposure and outcome. - Malcolm Barrett: https://ggdag.malco.io/articles/bias-structures.html
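
To see the attenuation Barrett describes, here is a small simulation with made-up data: classical, non-differential error is added to the exposure, and the observed correlation with the outcome shrinks towards zero.

# simulate a true exposure-outcome relationship
set.seed(2021)
n <- 10000
x <- rnorm(n)            # true exposure
y <- 0.5 * x + rnorm(n)  # outcome caused by the exposure

# add error to the exposure that is unrelated to either variable (non-differential)
x_noisy <- x + rnorm(n, sd = 1)

cor(x, y)        # about 0.45
cor(x_noisy, y)  # about 0.32: attenuated towards the null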

Barrett's recall-of-vitamins example: does taking vitamins in childhood protect against bladder cancer? If people who have bladder cancer are more likely to recall taking vitamins, then this can support the inference that vitamins cause bladder cancer even if the opposite is true:

library(ggdag)
coords <- tibble::tribble(
  ~name,            ~x,  ~y,
  "bladder_cancer",    1,   0,
  "vitamins",          0,   0,
  "diagnosed_bc",      1,   1,
  "recalled_vits",     0,   1,
  "bc_error",          1,   2,
  "vits_error",        0,   2,
)


bladder_dag <- dagify(
  diagnosed_bc ~ bc_error + bladder_cancer,
  recalled_vits ~ vitamins + vits_error,
  vits_error ~ bladder_cancer,
  exposure = "recalled_vits",
  outcome = "bladder_cancer",
  latent = c("bc_error","vits_error"),
  labels = c(
    bladder_cancer = "Bladder Cancer",
    vitamins = "Childhood Vitamin \nIntake",
    diagnosed_bc = "Diagnosed \nBladder Cancer",
    recalled_vits = "Memory of \nTaking Vitamins",
    bc_error = "Measurement Error, \nDiagnosis",
    vits_error = "Measurement Error, \nVitamins"
  ),
  coords = coords
)
ggdag(bladder_dag, text = FALSE, use_labels = "label", node_size = 20) + theme_dag_blank()

We can ask ggdag which variables we would need to adjust for in this DAG to obtain an unbiased estimate of the effect of recalled vitamin intake on bladder cancer:

ggdag_adjustment_set(bladder_dag,
                     text = FALSE, use_labels = "label") + theme_dag_blank()

Measurement error in panel studies

The Center for Epidemiologic Studies Depression (CES-D) scale assesses depression, but the scale is known to have measurement error.

Question: does completing Psych 447 affect change in depression after the course is finished? Assume there is no relationship. Nevertheless, conditioning on the baseline measure of CES-D will create bias in our model that arises from measurement error.

Here is Barrett’s example:

# set coordinates
coords <- tibble::tribble(
  ~name,              ~x,  ~y,
  "honors",            1,   3,
  "depression",        2,   3,
  "cesd",              2,   2,
  "baseline_error",    2,   1,
  "depression_change", 3,   3,
  "cesd_change",       3,   2,
  "followup_error",    3,   1
)

cesd_dag <- dagify(
  depression ~ honors,
  cesd ~ depression + baseline_error,
  cesd_change ~ depression_change + followup_error + baseline_error,
  outcome = "depression_change",
  exposure  = "honors",
  latent = c("followup_error","baseline_error"),
  labels = c(
    honors = "447 Completion",
    depression = "Depression",
    cesd = "CES-D",
    cesd_change = "Change CES-D",
    depression_change = "Change in \nDepression",
    baseline_error = "Measurement Error, \nBaseline",
    followup_error = "Measurement Error, \nFollow-up"
  ),
  coords = coords
)# %>%
 # control_for(c("baseline_error","followup_error"))

cesd_dag %>%
  ggdag_dconnected(
    from = "honors",
    to = "cesd_change",
    controlling_for = "cesd",
    text = FALSE,
    use_labels = "label",
    collider_lines = FALSE
  ) + theme_dag_blank()

The DAG reveals we can obtain an unbiased estimate when we do not adjust for baseline CES-D scores:

ggdag_adjustment_set(cesd_dag,
                     text = FALSE, use_labels = "label") + theme_dag_blank()

Of course, the DAG might not be correct: depressed people might be drawn to Psych 447.

Missing data is a form of measurement error

Missing data is an extreme form of measurement error; see McElreath (2020).

Multiple imputation as a way of handling missing data

Methods for multiple imputation:

Honaker, King, and Blackwell (2011)

Bhaskaran and Smeeth (2014)

Blackwell, Honaker, and King (2017)

McElreath (2020)

https://gking.harvard.edu/category/research-interests/methods/missing-data

Missing data poses a problem: missingness biases inference. However, missingness need not be intractable. We can use regression to predict missing values when the missing data are random conditional on known features of the data.1 Where the MAR assumption is satisfied, we predict the missing observations using regression. Because regression predictions are uncertain, we generate multiple datasets containing the predicted observations. This process of generating multiple datasets is called multiple imputation. Before proceeding with an explanation of the mechanics of multiple imputation, we take a moment to recollect two fundamental principles of data science that we have been discussing throughout this course:

  1. The practice of statistical inference is educated guesswork. We never observe population parameters; we only observe samples. We assume that our samples are drawn randomly from an unobserved population. When we move from measured features of our sample to unmeasured features of a population, we are projecting with uncertainty. Multiply imputing missing data is no different: multiple imputation is continuous with a data scientist's larger mission of guessing wisely.

  2. Regression is a tool for prediction, and regression is also a tool for causal inference. How we go about using regression differs depending on whether we want to predict or whether we want to explain (i.e. causal inference). When we regress to predict, it is generally useful to include all relevant covariates in our dataset that might be related to the outcome and that are not collinear with other indicators in our dataset.2

Prediction: we have discussed strategies for assessing improvements in predictive model fit using criteria such as BIC, WAIC, and LOO; a brief reminder follows.
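
As a quick reminder of what such a comparison looks like, here is a sketch with base R's BIC() on two toy linear models (the data are simulated for illustration; WAIC and LOO play the analogous role for Bayesian models):

# toy data: y depends on x1 but not on x2
set.seed(123)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.5 * d$x1 + rnorm(100)

m1 <- lm(y ~ x1, data = d)
m2 <- lm(y ~ x1 + x2, data = d)

BIC(m1, m2)  # the model with the lower BIC is expected to predict better out of sample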

Gelman and Hill offer the following advice about inclusion of variables when using regression to predict (Gelman and Hill 2006, 69):

Our general principles for building regression models for prediction are as follows:

  1. Include all input variables that, for substantive reasons, might be expected to be important in predicting the outcome.
  2. It is not always necessary to include these inputs as separate predictors—for example, sometimes several inputs can be averaged or summed to create a “total score” that can be used as a single predictor in the model.
  3. For inputs that have large effects, consider including their interactions as well.
  4. We suggest the following strategy for decisions regarding whether to exclude a variable from a prediction model based on expected sign and statistical significance (typically measured at the 5% level; that is, a coefficient is “statistically significant” if its estimate is more than 2 standard errors from zero):
     a. If a predictor is not statistically significant and has the expected sign, it is generally fine to keep it in. It may not help predictions dramatically but is also probably not hurting them.
     b. If a predictor is not statistically significant and does not have the expected sign (for example, incumbency having a negative effect on vote share), consider removing it from the model (that is, setting its coefficient to zero).
     c. If a predictor is statistically significant and does not have the expected sign, then think hard if it makes sense. (For example, perhaps this is a country such as India in which incumbents are generally unpopular; see Linden, 2006.) Try to gather data on potential lurking variables and include them in the analysis.
     d. If a predictor is statistically significant and has the expected sign, then by all means keep it in the model.

These strategies do not completely solve our problems but they help keep us from making mistakes such as discarding important information. They are predicated on having thought hard about these relationships before fitting the model. It’s always easier to justify a coefficient’s sign after the fact than to think hard ahead of time about what we expect. On the other hand, an explanation that is determined after running the model can still be valid. We should be able to adjust our theories in light of new information (Gelman and Hill 2006, 69).

Causal inference: do not include all predictors that might be associated with the outcome. Doing so is an invitation to confounded causal inference. To obtain an unbiased estimate of an exposure's causal effect on the outcome(s) of interest we must close all backdoor paths from the exposure to the outcome, and we must avoid conditioning on any colliders. As Richard McElreath writes (McElreath 2020, 46):

But the approach which dominates in many parts of biology and the social sciences is instead CAUSAL SALAD. Causal salad means tossing various “control” variables into a statistical model, observing changes in estimates, and then telling a story about causation. Causal salad seems founded on the notion that only omitted variables can mislead us about causation. But included variables can just as easily confound us. When tossing a causal salad, a model that makes good predictions may still mislead about causation. If we use the model to plan an intervention, it will get everything wrong (McElreath 2020, 46).

To assist with the problem of unbiased causal inference, we have discussed the utility of building causal DAGs, noting that inference is always conditional on the DAG we have drawn. Frequently, the data under-determine the DAG: there are many DAGs consistent with our observations. Regression is a powerful tool for causal inference, but by itself it is not sufficient for reliable inference.

So which approach to regression should we use for multiple imputation? The answer is the predictive approach. When we impute, we are predicting features of a population. We should therefore follow Gelman and Hill's advice and stratify across all indicators that might improve predictive accuracy.

Visualising missingness

When you are examining your data, make sure to assess missing responses. A handy tool for assessing missingness is a missingness graph.

Let's consider a subsample of the nz-jitter longitudinal dataset consisting of individuals who responded to multiple waves. Today we will ask whether employment insecurity is causally related to charitable donations. To predict missingness we should probably include other indicators besides those included here, but these indicators will be sufficient for our purposes.


df <- nz12 %>%
  select(
    Id,
    CharityDonate,
    Emp.JobSecure,
    Household.INC,
    Hours.Work,
    Male,
    Employed,
    Pol.Orient,
    Relid,
    Wave,
    yearS,
    KESSLER6sum,
    Partner,
    Age
  ) %>%
  dplyr::mutate(Partner = factor(Partner))

# always inspect your data frame
glimpse(df)
Rows: 4,140
Columns: 14
$ Id            <fct> 15, 15, 15, 15, 15, 15, 15, 15, 15, …
$ CharityDonate <dbl> 20, 0, 5, 10, 70, 0, 170, 160, 80, 1…
$ Emp.JobSecure <dbl> 4, 6, 6, NA, 7, 5, NA, 7, NA, 7, NA,…
$ Household.INC <dbl> 80000, 70000, 56000, 65000, 85000, 8…
$ Hours.Work    <dbl> NA, NA, 0, 0, 15, 25, 0, 18, 0, 14, …
$ Male          <fct> Male, Male, Male, Male, Male, Male, …
$ Employed      <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, …
$ Pol.Orient    <dbl> 3, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 4, …
$ Relid         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, …
$ Wave          <fct> 2010, 2011, 2012, 2013, 2014, 2015, …
$ yearS         <dbl> 27, 347, 834, 1200, 1608, 2037, 2336…
$ KESSLER6sum   <int> 4, 4, 4, 4, 3, 4, 5, 4, 3, 5, 1, 2, …
$ Partner       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Age           <dbl> 38.82820, 39.70431, 41.03765, 42.036…

How many individuals?

nz12 %>%
  group_by(Wave) %>%
  summarise(Unique_Id = n_distinct(Id))
# A tibble: 10 x 2
   Wave  Unique_Id
   <fct>     <int>
 1 2010        414
 2 2011        414
 3 2012        414
 4 2013        414
 5 2014        414
 6 2015        414
 7 2016        414
 8 2017        414
 9 2018        414
10 2019        414

We can visualise the missingness using the naniar package:
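
Two standard naniar calls give this kind of overview (a sketch: gg_miss_var summarises per-variable missingness, vis_miss shows the overall pattern):

library(naniar)

# percentage of missing values for each variable
naniar::gg_miss_var(df, show_pct = TRUE)

# overall pattern of missingness across the data frame
naniar::vis_miss(df)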

We can see substantial missingness for Emp.JobSecure.

Let’s explore this:

df%>%
  select(Wave, Emp.JobSecure) %>%
  group_by(Wave)%>%
  tally(is.na(Emp.JobSecure))
# A tibble: 10 x 2
   Wave      n
   <fct> <int>
 1 2010     89
 2 2011    106
 3 2012    118
 4 2013    117
 5 2014    129
 6 2015    133
 7 2016    414
 8 2017    167
 9 2018    172
10 2019    183

Lots of missingness in Emp.JobSecure, and the question was not included in 2016.

table1::table1(~ Wave|Emp.JobSecure, data = df, overall = FALSE)
(Columns are Emp.JobSecure responses, 1–7.)

Wave     1 (N=90)     2 (N=96)     3 (N=139)    4 (N=268)    5 (N=403)    6 (N=731)    7 (N=785)
2010     12 (13.3%)   12 (12.5%)   18 (12.9%)   31 (11.6%)   68 (16.9%)   90 (12.3%)   94 (12.0%)
2011     16 (17.8%)   14 (14.6%)   15 (10.8%)   45 (16.8%)   43 (10.7%)   84 (11.5%)   91 (11.6%)
2012     11 (12.2%)   10 (10.4%)   17 (12.2%)   41 (15.3%)   43 (10.7%)   86 (11.8%)   88 (11.2%)
2013     11 (12.2%)   11 (11.5%)   17 (12.2%)   35 (13.1%)   44 (10.9%)   90 (12.3%)   89 (11.3%)
2014     10 (11.1%)   10 (10.4%)   16 (11.5%)   28 (10.4%)   40 (9.9%)    91 (12.4%)   90 (11.5%)
2015      5 (5.6%)    10 (10.4%)   18 (12.9%)   27 (10.1%)   46 (11.4%)   78 (10.7%)   97 (12.4%)
2016      0 (0%)       0 (0%)       0 (0%)       0 (0%)       0 (0%)       0 (0%)       0 (0%)
2017      7 (7.8%)     9 (9.4%)    12 (8.6%)    21 (7.8%)    46 (11.4%)   71 (9.7%)    81 (10.3%)
2018      8 (8.9%)    12 (12.5%)   13 (9.4%)    23 (8.6%)    35 (8.7%)    73 (10.0%)   78 (9.9%)
2019     10 (11.1%)    8 (8.3%)    13 (9.4%)    17 (6.3%)    38 (9.4%)    68 (9.3%)    77 (9.8%)

There are various methods for multiple imputation. First, let's look at the Amelia package:

library(Amelia)
# set seed
set.seed(1234)

# we need to pass a data frame to Amelia
prep <- as.data.frame(df) # tibble won't run in amelia !!


# this is the key code
prep2 <- Amelia::amelia(
  prep,
  #dataset to impute
  m = 10,
  # number of imputations
  cs = c("Id"),
  # the cross sectional variable
  ts = c("yearS"),
  # Time series, allowing polynomials
  #ords =  none in this dataset, but use this command for ordinal data
  #logs = ,  # big numbers better to use the natural log
  sqrt = c("KESSLER6sum", "CharityDonate"),
  # skewed positive data such as K6
  noms = c("Male",  # nominal vars
           "Employed",
           "Partner"),
  idvars = c("Wave"),
  # not imputing outcomes
  polytime = 3
) #https://stackoverflow.com/questions/56218702/missing-data-warning-r
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
 101 102 103 104 105

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71

-- Imputation 9 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52


# Impute again but do not impute the outcome


# this is the key code
prep2.1 <- Amelia::amelia(
  prep,
  #dataset to impute
  m = 10,
  # number of imputations
  cs = c("Id"),
  # the cross sectional variable
  ts = c("yearS"),
  # Time series, allowing polynomials
  #ords =  none in this dataset, but use this command for ordinal data
  #logs = ,  # big numbers better to use the natural log
  sqrt = c("KESSLER6sum"),
  # skewed positive data such as K6
  noms = c("Male",  # nominal vars
           "Employed",
           "Partner"),
  idvars = c("Wave","CharityDonate"), # We do not impute the outcome this time
  # not imputing outcomes
  polytime = 3
) #https://stackoverflow.com/questions/56218702/missing-data-warning-r
-- Imputation 1 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95

-- Imputation 2 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22

-- Imputation 3 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46

-- Imputation 4 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77

-- Imputation 5 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71

-- Imputation 6 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78

-- Imputation 7 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74

-- Imputation 8 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
 81 82 83 84

-- Imputation 9 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78

-- Imputation 10 --

  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53

We can center and scale our variables using the following code:


# prepare data with imputed outcome
prep3 <- Amelia::transform.amelia(
  prep2,
  Id = as.factor(Id),
  # redundant
  Age.10yrs = (Age / 10),
  years_s = scale(yearS, center = TRUE, scale = TRUE),
  years = yearS,
  KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale =TRUE),
  Employed = factor(Employed),
  Relid = scale(Relid, scale = TRUE, center = TRUE),
  Male = as.factor(Male),
  Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE),
  CharityDonate = as.integer(CharityDonate)
)

# center and scale age
out <- Amelia::transform.amelia(prep3, Age_in_Decades_C = scale(Age.10yrs,scale =FALSE, center=TRUE))


prep3.1 <- Amelia::transform.amelia(
  prep2.1,
  Id = as.factor(Id),
  # redundant
  Age.10yrs = (Age / 10),
  years_s = scale(yearS, center = TRUE, scale = TRUE),
  years = yearS,
  KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale =TRUE),
  Employed = factor(Employed),
  Relid = scale(Relid, scale = TRUE, center = TRUE),
  Male = as.factor(Male),
  Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE),
  CharityDonate = as.integer(CharityDonate)
)

# center and scale age
out.1 <- Amelia::transform.amelia(prep3.1, Age_in_Decades_C = scale(Age.10yrs,scale =FALSE, center=TRUE))

We can plot the imputed values over time to examine what Amelia imputed and with how much uncertainty:

Here are the imputations for Job Security (a random selection)


Amelia::tscsPlot(
  prep2,
  cs = c("15", "19", "20", "39", "549", "861","1037","1078","1253"),
  main = "Imputation of Job security",
  var = "Emp.JobSecure",
  ylim = c(0, 30)
)

Here are the imputations for Charity (another random selection). Note that there is a fair amount of uncertainty here:

Amelia::tscsPlot(
  prep2,
  cs = c("394", "1039", "1082",  "1149", "1238" , "1253", "1229","15", "19", "20","1037","1078"),
  main = "Imputation of Charity",
  var = "CharityDonate",
  ylim = c(0, 10000)
)

Which variables do we need to include to estimate the causal effect of job security on charity?

Write your DAG!

library(ggdag)
dg <-
  dagify(
    charity ~ jobsecure + employed + age + male + relid + years + polconserve,
    jobsecure ~ employed + distress + male + years + age,
    distress ~ male  + employed + years + age + partner,
    relid ~ male + years,
    age ~ years,
    labels = c(
      "charity" = "charity",
      "jobsecure" = "job security",
      "employed" = "employed",
      "polconserve"  = "political conservative",
      "partner" = "partner",
      "age" = "age",
      "male" = "male",
      "relid" = "religious identity",
      "years" = "years"
    ),
    exposure = "jobsecure",
    outcome = "charity"
  ) %>%
  tidy_dagitty(layout = "nicely")

ggdag::ggdag_adjustment_set(dg) + theme_dag_blank()

To obtain an unbiased estimate of the effect of job security on charity we must condition on employed, male, age, and years.

We can write the model using the lme4 package, which is fast. I wrote a little function; recall that we have 10 imputed datasets.

# first write out the model equation
library(lme4)
mod_eq <-  'CharityDonate  ~  Emp.JobSecure_S  + Employed + Age_in_Decades_C +  Male  +  years_s +  (1|Id)' 

# run models iterating over imputed data
loop_glmer_model <-
  function(x, y) {
    # x is the mod equation, y is the data
    m <- 10
    mod <- NULL
    for (i in 1:m) {
      mod[[i]] <- lme4::glmer(x, data = y$imputations[[i]], family = "poisson")
    }
    return(mod)
  }

m_list <- loop_glmer_model(mod_eq, out)

# try with out the outcome imputed

m_list1 <- loop_glmer_model(mod_eq, out.1)

Here is a function for obtaining the results:

# table of effects
loop_lmer_model_tab <- function(x) {
  mp <- lapply(x, model_parameters)
  out <- parameters::pool_parameters(mp)
  return(out)
}

# create table
tab_impute <- loop_lmer_model_tab(m_list)
tab_impute
# Fixed Effects

Parameter         | Coefficient |   SE |         95% CI | Statistic |      p
----------------------------------------------------------------------------
(Intercept)       |        6.34 | 0.59 | [ 5.18,  7.50] |     10.72 | < .001
Emp.JobSecure_S   |        0.01 | 0.03 | [-0.05,  0.07] |      0.41 | 0.678 
Employed [1]      |        0.15 | 0.02 | [ 0.10,  0.20] |      5.95 | < .001
Age_in_Decades_C  |       -4.79 | 1.29 | [-7.31, -2.26] |     -3.72 | < .001
Male [Not_Male]   |       -1.48 | 0.80 | [-3.04,  0.08] |     -1.86 | 0.063 
years_s           |        1.38 | 0.38 | [ 0.63,  2.12] |      3.63 | < .001
SD (Intercept)    |        6.79 |      |                |           |       
SD (Observations) |        1.00 |      |                |           |       

# create graph
plot_impute <- plot(tab_impute)
plot_impute
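
Behind the scenes, pool_parameters() combines the estimates from the ten imputed datasets following Rubin's rules: the pooled coefficient is the mean of the per-imputation coefficients, and the pooled variance is the average within-imputation variance plus an inflated between-imputation variance. A minimal sketch for a single coefficient, using made-up estimates and standard errors (not taken from the models above):

# hypothetical per-imputation estimates and standard errors for one coefficient
est <- c(0.012, 0.015, 0.009, 0.011, 0.014, 0.010, 0.013, 0.012, 0.016, 0.008)
se  <- c(0.030, 0.031, 0.029, 0.030, 0.032, 0.030, 0.031, 0.029, 0.030, 0.031)
m   <- length(est)

q_bar <- mean(est)                   # pooled point estimate
w_var <- mean(se^2)                  # average within-imputation variance
b_var <- var(est)                    # between-imputation variance
t_var <- w_var + (1 + 1/m) * b_var   # total variance under Rubin's rules

c(pooled_estimate = q_bar, pooled_se = sqrt(t_var))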

We can plot the predictions:

library(ggeffects)
library(gghighlight) # not used here, useful for interactions 
graph_predictions_imputed <- function(x, y) {
  # x = model objects
  m <- 10
  out <- NULL
  for (i in 1:m) {
    out[[i]] <-
      ggeffects::ggpredict(x[[i]], terms = c("Emp.JobSecure_S"))
  }
  plots <- NULL
  for (i in 1:m) {
    plots[[i]] <-
      plot(out[[i]], facets = T) # + scale_y_continuous(limits=c(6.35,6.85) )
  }
  plots[[10]] +
    gghighlight::gghighlight() +
    ggtitle(y)
}

# graph
graph_predictions_imputed(m_list, "Prediction of job security on charity (not reliable)")

What if we do not impute the outcome? Does that make a difference? Not much:

loop_lmer_model_tab <- function(x) {
  mp <- lapply(x, model_parameters)
  out <- parameters::pool_parameters(mp)
  return(out)
}

# create table
tab_impute1 <- loop_lmer_model_tab(m_list1)
tab_impute1
# Fixed Effects

Parameter         | Coefficient |   SE |         95% CI | Statistic |      p
----------------------------------------------------------------------------
(Intercept)       |        6.04 | 0.47 | [ 5.12,  6.97] |     12.84 | < .001
Emp.JobSecure_S   |        0.02 | 0.04 | [-0.06,  0.10] |      0.48 | 0.629 
Employed [1]      |        0.11 | 0.03 | [ 0.06,  0.17] |      3.86 | < .001
Age_in_Decades_C  |       -4.58 | 0.38 | [-5.34, -3.83] |    -11.92 | < .001
Male [Not_Male]   |       -1.50 | 0.59 | [-2.65, -0.35] |     -2.55 | 0.011 
years_s           |        1.29 | 0.11 | [ 1.08,  1.50] |     12.10 | < .001
SD (Intercept)    |        6.65 |      |                |           |       
SD (Observations) |        1.00 |      |                |           |       

# create graph
plot_impute1 <- plot(tab_impute1)

library(patchwork)

plot_impute / plot_impute1 + plot_annotation(title = "(a) Multiply imputed outcome vs (b) No-multiple imputation for the outcome", tag_levels = "a")

If you want a LaTeX table, you can use this code:

library(huxtable)

huxtable::as_hux( your_model_here ) %>%
  select("Parameter", "Coefficient", "CI_low", "CI_high", "p") %>%
  set_number_format(3) %>%
  set_left_padding(20) %>%
  set_bold(1, everywhere) %>%
  quick_latex()

Compare multiple imputation results with row-wise deleted results

When you run a regression with missing data, R silently drops the incomplete cases (row-wise deletion); the quick check below shows how many rows this discards.
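
A quick sketch, counting complete cases on the raw variables that enter the model (the count used by glmer should be essentially the same, since the transformed variables inherit the same missingness):

# raw variables that enter the model equation
mod_vars <- c("CharityDonate", "Emp.JobSecure", "Employed", "Age", "Male", "yearS", "Id")

sum(complete.cases(df[mod_vars]))   # rows with complete data on the model variables
sum(!complete.cases(df[mod_vars]))  # rows that row-wise deletion will drop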

Let’s look at the results from the row-wise deleted data:

# prepare data as we did for the imputed dataset

df2 <- df %>%
  dplyr::mutate(
    Age.10yrs = (Age / 10),
    Age_in_Decades_C = scale(Age.10yrs, scale = FALSE, center = TRUE),
    years_s = scale(yearS, center = TRUE, scale = TRUE),
    years = yearS,
    KESSLER6sum_S = scale(KESSLER6sum, center = TRUE, scale = TRUE),
    Employed = factor(Employed),
    Relid = scale(Relid, scale = TRUE, center = TRUE),
    Male = as.factor(Male),
    Emp.JobSecure_S = scale(Emp.JobSecure, center = TRUE, scale = FALSE)
  )

# run the row-wise deletion model
m_no_impute <- glmer(mod_eq, data = df2, family = "poisson")

# create table
tab_no <-
  parameters::model_parameters(m_no_impute, effects = c("all"))
tab_no
# Fixed Effects

Parameter        |  Log-Mean |       SE |         95% CI |      z |      p
--------------------------------------------------------------------------
(Intercept)      |      5.26 |     0.08 | [ 5.09,  5.42] |  62.00 | < .001
Emp.JobSecure_S  | -5.65e-03 | 5.87e-04 | [-0.01,  0.00] |  -9.61 | < .001
Employed [1]     |      0.19 | 6.46e-03 | [ 0.18,  0.20] |  29.18 | < .001
Age_in_Decades_C |     -0.89 |     0.04 | [-0.96, -0.81] | -21.86 | < .001
Male [Not_Male]  |     -0.64 |     0.11 | [-0.86, -0.43] |  -5.79 | < .001
years_s          |      0.18 |     0.01 | [ 0.16,  0.21] |  15.41 | < .001

# Random Effects

Parameter          | Coefficient
--------------------------------
SD (Intercept: Id) |        1.00
SD (Residual)      |        1.00

# create graph
plot_no <- plot(tab_no)
library(patchwork)

# compare models graphically
plot_impute / plot_no +
  plot_annotation(title = "Comparison of regressions using (a) multiple-imputed  and (b) row-wise deleted datasets", tag_levels = 'a')

When we compare the graphs, we see that the multiply imputed datasets shrink estimates towards zero.

Multiple imputation is sometimes avoided because people don't like to "invent" data. However, creating multiply imputed datasets and integrating over their uncertainty during model fitting tends to increase the uncertainty in a model. That's generally a good thing when we want to predict features of the population.

However, it would be a mistake to think that multiple imputation is always sufficient. Missingness might not be at random conditional on the variables in your model. In that case, multiple imputation cannot protect your estimates from bias.

FAQ for Multiple Imputation

Q: Is multiple imputation inventing data? A: MI is educated guesswork about the population from which your data are drawn. It is no different in kind from ordinary statistical inference.

Q: Do I need to multiply impute missing values? A: If your data satisfy the MAR assumption, you should multiply impute missing values. However, in many cases it makes no practical difference. Whether MI makes a difference is something you cannot know in advance of trying, so to be safe, use MI.

Q: How many imputed datasets should I create? A: The Amelia documentation suggests m = 5 datasets is usually sufficient. Here we used 10.

Q: How do I handle time-series data? A: Here we used the cs and ts arguments in the Amelia package.

Q: Can I use multiple imputation with Bayesian estimation packages? A: Yes, see me.

Q: Where can I go for help? A: See the resources provided at the introduction to this section.

Course wrap-up: 22 skills you’ve acquired

You’ve come very far in only twelve weeks. Among other things, you have acquired the following skills:

  1. You can use Rstudio, GitHub, and R-markdown to do data-science.
  2. You can get your data into shape through data wrangling in R.
  3. You can create publication-ready descriptive graphs and tables for your samples.
  4. You can do the same for your regression models.
  5. You can search for anomalies in your samples and in your regression outputs.
  6. You can compare model fits.
  7. You can select family distributions that are appropriate for your data.
  8. You can model ordinal predictors and ordinal outcomes.
  9. You can model outcomes that have an abundance of zeros.
  10. You can write non-linear models.
  11. You can simulate data.
  12. You can perform ANOVA, ANCOVA, and MANOVA in a regression setting.
  13. You can perform model checks.
  14. You can write multilevel models and interpret them.
  15. You understand meta-analysis as a form of multi-level modelling.
  16. You understand the difference between regression for prediction and regression for inference.
  17. You understand how to reduce the dimensions of your data.
  18. You understand how to check for confounding by writing a DAG.
  19. You can create a distill website for free on GitHub.
  20. You’ve learned about measurement reliability and measurement validity, and you understand how to reduce highly dimensional data.
  21. You can multiply impute datasets to handle missing data, and you understand why this is a good thing.
  22. Most of all, we hope you’ve learned how to continue learning: this is the most important skill for data science.
References

Bhaskaran, Krishnan, and Liam Smeeth. 2014. “What Is the Difference Between Missing Completely at Random and Missing at Random?” International Journal of Epidemiology 43 (4): 1336–39. https://doi.org/10.1093/ije/dyu080.
Blackwell, Matthew, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods and Research 46 (3): 303–41. http://journals.sagepub.com/doi/full/10.1177/0049124115585360.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
Honaker, James, Gary King, and Matthew Blackwell. 2011. “Amelia II: A Program for Missing Data.” Journal of Statistical Software 45 (7): 1–47. http://www.jstatsoft.org/v45/i07/.
Lance, C. E., Marcus M. Butts, and Lawrence C. Michels. 2006. “The Sources of Four Commonly Reported Cutoff Criteria: What Did They Really Say?” Organizational Research Methods 9 (2): 202–20. https://doi.org/10.1177/1094428105284919.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
McNeish, Daniel. 2018. “Thanks Coefficient Alpha, We’ll Take It from Here.” Psychological Methods 23 (3): 412–33. https://doi.org/10.1037/met0000144.
Nunnally, J. C., and I. H. Bernstein. 1994. “The Assessment of Reliability.” In Psychometric Theory, 3rd ed., 248–92.
Raykov, Tenko, and George A. Marcoulides. 2017. “Thanks Coefficient Alpha, We Still Need You!” Educational and Psychological Measurement, August, 001316441772512. https://doi.org/10.1177/0013164417725127.

  1. Statisticians call the assumption of conditional randomness in missingness "MAR: Missing at Random." The language is admittedly confusing – what statisticians mean is "missing conditional on known predictors."↩︎

  2. Although it generally does no harm to include collinear indicators if our task is prediction↩︎


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Karl & Bulbulia (2021, May 25). Psych 447: Missing data, measurement error, and course wrap-up. Retrieved from https://vuw-psych-447.netlify.app/posts/12_1/

BibTeX citation

@misc{karl2021missing,
  author = {Karl, Johannes and Bulbulia, Joseph},
  title = {Psych 447: Missing data, measurement error, and course wrap-up},
  url = {https://vuw-psych-447.netlify.app/posts/12_1/},
  year = {2021}
}