Elements of a linear model

Joseph Bulbulia https://josephbulbulia.netlify.app (Victoria University of Wellington)https://www.wgtn.ac.nz
2021-MAR-23
Show code
knitr::include_graphics("op.png")

Show code
# libraries
library("tidyverse")
library("patchwork")
library("brms")
library("lubridate")
library("splines")
if (!require(equatiomatic)) {
  remotes::install_github("datalorax/equatiomatic")
  }
# set theme
# theme set
theme_set(theme_classic())
Show code
# Import data
# read data
nz_0 <- readr::read_csv2(url("https://raw.githubusercontent.com/go-bayes/psych-447/main/data/nz/nz.csv"))

# to relevel kessler 6 variables
f<-c("None Of The Time","A Little Of The Time","Some Of The Time",  "Most Of The Time", "All Of The Time")

# get data into shape
library("tidyverse")
nz <- nz_0 %>%
  dplyr::mutate_if(is.character, factor) %>%
  select(
    -c(
      SWB.Kessler01,
      SWB.Kessler02,
      SWB.Kessler03,
      SWB.Kessler04,
      SWB.Kessler05,
      SWB.Kessler06
    )
  ) %>%
  dplyr::mutate(Wave = as.factor(Wave)) %>%
  dplyr::mutate(FeelHopeless = forcats::fct_relevel(FeelHopeless, f)) %>%
  dplyr::mutate(FeelDepressed = forcats::fct_relevel(FeelDepressed, f)) %>%
  dplyr::mutate(FeelRestless = forcats::fct_relevel(FeelRestless, f)) %>%
  dplyr::mutate(EverythingIsEffort = forcats::fct_relevel(EverythingIsEffort, f)) %>%
  dplyr::mutate(FeelWorthless = forcats::fct_relevel(FeelWorthless, f)) %>%
  dplyr::mutate(FeelNervous = forcats::fct_relevel(FeelNervous, f)) %>%
  dplyr::mutate(Wave = as.factor(Wave)) %>%
  dplyr::mutate(date = make_date(year = 2009, month = 6, day = 30) + TSCORE) %>%
  dplyr::mutate(height_m = HLTH.Height * 100,
         weight_kg =  HLTH.Weight) # better height vars

Overview

This week we introduce regression.

Learning outcomes

By learning regression, you will be better equipped to do psychological science and to evaluate psychological research.

What is regression?

Broadly speaking, a regression model is method for inferring the expected average features of a population, and the variance of a population, conditional on other features of the population as measured in a sample.

We’ll see that regression encompasses more than this definition, however, this definition makes a start.

To understand regression, then, we need to understand the following jargon words: population, sample, measurement, and inference.

What is a population?

In science, a population is a hypothetical construct. It is the set of all potential members of a set of things. In psychological science that set is typically a collection of individuals. We want to understand “The population of all human beings?” or “The New Zealand adult population”; or “The population of undergraduates who may be recruited for IPRP in New Zealand.”

What is a sample?

A sample is a randomly realised sub-population from the larger abstract population that a scientific community hopes to generalise about.

Think of selecting balls randomly from an urn. If pulled at random, the balls may inform us about the contents of the urn. For example, if we select one white ball and one black ball, we may infer that the balls in the urn are not all white or all black.

What is “measurement?”

A measure is tool or method for obtaining numerical descriptions of a sample. We often call measures “scales.” We can think of a bathroom weight scale as a tool and method for tracking body weight.

A measurement is the numerical description we obtain from sensors such as statistical surveys, census data, twitter feeds, & etc.

In the course, we have encountered numerical scales, ordinal scales, and factors. The topic of measurement in psychological is, to say the least, very broad.

For now, it is important to keep in mind that, similar to bathroom scales, measures can be prone to error.

Also similar to bathroom scales, error prone scales may nevertheless be useful. We need to investigate the utility of error prone scales against the backdrop of specific interests and purposes.

What is a parameter?

In regression, we combine measurements on samples with probability theory to guess about the properties of a population we will never observe. We call these properties “parameters.”

What is statistical inference?

The bulk of statistical inference consists of educated guessing about population parameters.

Probability distributions and statistical guessing

Inference is possible because the parameters of naturally occurring populations are structured by data generating processes that are approximated by probability distributions. A probability distribution is a mathematical function that describes the probability of a random event. Today we will be focusing on height.1

Today we will be talking about the “normal” or “Gaussian distribution.” A very large number of data-generating processes in nature conform the normal distribution.

Let’s consider some examples of randomly generated samples, which we will obtain using R’s rnorm function.

100-person sample of heights

Show code
set.seed(123)
sm<-rnorm(100, mean = 170, sd = 20)
ggplot2::qplot(sm, binwidth = 10)

10-person sample of heights

Show code
set.seed(123)
subsm <-rnorm(10, mean = 170, sd = 20)

ggplot2::qplot(
  subsm, binwidth = 10
  )

10000-person sample of heights

Show code
set.seed(123)
largesm <-rnorm(1e5, mean = 170, sd = 20)
ggplot2::qplot(
  largesm, binwidth = 1
  )

How can I use regression to infer a population parameter?

We can use R to investigate the average height of our imaginary population from which the preceding samples were randomly drawn. We do this in R by writing an “intercept-only” model as follows:

Show code
model <- lm(outcome ~ 1, data = datset)
summary(model)

Using the previous simulations:

N = 10 random draws

#write the model and get a nice table for it
sjPlot::tab_model(
  lm(sm ~ 1)
)
  sm
Predictors Estimates CI p
(Intercept) 171.81 168.19 – 175.43 <0.001
Observations 100
R2 / R2 adjusted 0.000 / 0.000

N = 100 random draws

sjPlot::tab_model(
  lm(subsm ~ 1)
)
  subsm
Predictors Estimates CI p
(Intercept) 171.49 157.85 – 185.14 <0.001
Observations 10
R2 / R2 adjusted 0.000 / 0.000

N = 10,000 random draws

Show code
sjPlot::tab_model(
  lm(largesm ~ 1)
)
  largesm
Predictors Estimates CI p
(Intercept) 170.02 169.90 – 170.14 <0.001
Observations 100000
R2 / R2 adjusted 0.000 / 0.000

What do we notice about the relationship between sample size the estimated population average?

Show code
sjPlot::tab_model(
   lm(sm ~ 1),
   lm(subsm ~ 1),
   lm(largesm ~ 1)
)
  sm subsm largesm
Predictors Estimates CI p Estimates CI p Estimates CI p
(Intercept) 171.81 168.19 – 175.43 <0.001 171.49 157.85 – 185.14 <0.001 170.02 169.90 – 170.14 <0.001
Observations 100 10 100000
R2 / R2 adjusted 0.000 / 0.000 0.000 / 0.000 0.000 / 0.000

Regression with a single co-variate

Does mother height predict daughter height? It seems so. By what how close are is the relationship?

Francis Galton is credited with inventing regression. Galton observed that the height of offspring tends to fall between parental height and the population average, what Galton termed: “regression to the mean.” Galton sought a method for educated guessing about heights, and this led to fitting a line of regression by a method called “least squares” (For a history see: here).

This following dataset is from “The heredity of height,” Karl Pearson and Alice Lee (1903)(Pearson and Lee 1903). I obtained the dataset from (Gelman, Hill, and Vehtari 2020a). Let’s use this dataset to investigate the relationship between a mother’s height and a daughter’s height.

Show code
md_df <- data.frame(read.table(url("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"), header=TRUE))
# Center mother's height for later example
md_df <- md_df %>%
  dplyr::mutate(mother_height_c = as.numeric(scale(mother_height, center = TRUE, scale = FALSE)))
dplyr::glimpse(md_df)
Rows: 5,524
Columns: 3
$ daughter_height <dbl> 52.5, 52.5, 53.5, 53.5, 55.5, 55.5, 55.5, 55…
$ mother_height   <dbl> 59.5, 59.5, 59.5, 59.5, 59.5, 59.5, 59.5, 59…
$ mother_height_c <dbl> -2.9987328, -2.9987328, -2.9987328, -2.99873…

Pearson and Lee collected 5,524 observations from mother/daughter height pairs. Let’s examine the data, first by plotting the relationship.

What what is happening here?

Show code
explore_md <-ggplot2::ggplot(data = md_df, aes(y = daughter_height, x = mother_height)) + 
  geom_jitter(alpha = .2) + 
  labs(title = "The relationship between mothers height and daughter's height") +
       ylab("Daughter's height") +
       xlab("Mother's height") + theme_classic()
explore_md

Is there a linear predictive relationship between these two parameters? In regression we examine the line of best fit.

Show code
m1 <- lm(daughter_height ~ mother_height, data = md_df)
sjPlot::tab_model(m1)
  daughter_height
Predictors Estimates CI p
(Intercept) 29.80 28.25 – 31.35 <0.001
mother_height 0.54 0.52 – 0.57 <0.001
Observations 5524
R2 / R2 adjusted 0.252 / 0.252

We can plot the coefficient; in a model with one predictor isn’t too informative. As we continue in the course, however, we’ll see that plotting coefficients can be easier than deciphering the numbers in tables. Here are two methods for plotting.

Show code
t_m1<-parameters::model_parameters(m1,  
                                   ci = 0.95)
method1 <- plot(t_m1) +
  labs(title = "The relationship between mothers height and daughter's height") + 
  ylab("Daughter's height") 

method2 <-sjPlot::plot_model(m1)

library(patchwork)
method1 / method2 + plot_annotation(title = "Comparision of two coefficeint plots",
                                    subtitle = "a: parameters see; b: sjPlot", 
                                    tag_levels = "a")

How do we interpret the regression model?

Let’s write the equation out in mathematics. How do we read this?2

Show code
library("equatiomatic")
extract_eq(m1,  use_coefs = FALSE)

\[ \operatorname{daughter\_height} = \alpha + \beta_{1}(\operatorname{mother\_height}) + \epsilon \]

The math says that the expected daughter’s height in a population is predicted by the average height of the population when mother’s height is set to zero units (note, this is impossible - we’ll come back to this) plus \(\beta ~\times\) units of daughter’s height (inches) for each additional unit of mother’s height (inches)

We can plug the output of the model directly into the equation as follows:

Show code
library("equatiomatic")
extract_eq(m1,  use_coefs = TRUE)

\[ \operatorname{\widehat{daughter\_height}} = 29.8 + 0.54(\operatorname{mother\_height}) \]

Graph the relationship between mother’s and daughter’s heights

library(ggeffects)
toplot<-ggeffects::ggpredict(m1, terms = "mother_height")

heightplot<-plot(toplot, add.data = TRUE, dot.alpha = .1, jitter = TRUE) +   theme_classic()
heightplot + labs(title = "Predicted values of daughter's height from the Pearson/Fox 1903 dataset")

Regression to predict beyond the range of a dataset

Joyte Amge is the world’s shortest woman at 25 inches. Sandy Allen was the world’s tallest woman at 91 inches. What is be the expected heights of their daughter, and of every intermediary woman in between?

# use the `expand.grid` command to create a sequence of points for mother's height
ndat<-expand.grid(mother_height = c(25:91)) 

# use the `predict` function to create a new response 
pr<- predict(m1, type = "response", interval = "confidence", newdata =ndat)

# have a look at the object
dplyr::glimpse(pr)
 num [1:67, 1:3] 43.4 44 44.5 45.1 45.6 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:67] "1" "2" "3" "4" ...
  ..$ : chr [1:3] "fit" "lwr" "upr"
# create a new dataframe for the new sequence of points for mother's height and the predicted data
newdata<-data.frame(ndat,pr)
head(newdata)
  mother_height      fit      lwr      upr
1            25 43.42183 42.49099 44.35266
2            26 43.96676 43.06065 44.87288
3            27 44.51170 43.63030 45.39310
4            28 45.05664 44.19995 45.91332
5            29 45.60157 44.76960 46.43355
6            30 46.14651 45.33924 46.95378

Graph the predicted results

Show code
# graph the expected results
predplot<-ggplot(data = newdata, 
       aes(x= mother_height, y = fit))  + 
  geom_point() +  geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) + 
   expand_limits(x = c(20,91), y = c(0,81))  + theme_classic() + 
  labs(title = "Predicted values for a broader population")

# plot the two graphs together (making the x and y axis at the same scale )
library("patchwork")
# rescale heightplot

# old plot with the new axis and y axis scales, and remove points

heightplot2<-plot(toplot, add.data = FALSE) +   theme_classic()

nhp <- heightplot2 +  expand_limits(x = c(20,91), y = c(0,81) ) +  labs(title = "Predicted values of daughter's height from the Pearson/Fox 1903 dataset")

# double graph
 nhp /predplot  + plot_annotation(title = "What do you notie about these relationships?", tag_levels = "a")

A simple method for obtaining the predicted values form your fitted model is to obtain the ggeffects output without producing a graph.

library(ggeffects)

toplot<-ggeffects::ggpredict(m1, terms = "mother_height")
toplot
# Predicted values of daughter_height
# x = mother_height

    x | Predicted |         95% CI
----------------------------------
52.50 |     58.41 | [58.15, 58.66]
54.50 |     59.50 | [59.29, 59.70]
57.50 |     61.13 | [60.99, 61.27]
59.50 |     62.22 | [62.13, 62.32]
61.50 |     63.31 | [63.25, 63.38]
63.50 |     64.40 | [64.34, 64.47]
65.50 |     65.49 | [65.40, 65.59]
70.50 |     68.22 | [68.01, 68.42]

Non-linear relationships

Linear regression assumes linearity conditional on a model. Often your data will not be linear!

Consider the following example:

# Simulate nonlinear relationship between x and y
b <- c(2, 0.75)
set.seed(12)
x <- rnorm(100)
set.seed(12)
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)

ot1 <-lm(y ~ x, data  = dat1)
# performance::check_model(ot1)

# Plot linear effect
plot(ggeffects::ggpredict(ot1, terms = "x"), add.data =TRUE, dot.alpha = .4)

Non-linear relationship as modelled by a polynomial regression:

library(splines)
ot2 <-lm(y ~ x + I(x^2), data  = dat1)
plot(ggeffects::ggpredict(ot2, terms = "x"), add.data =TRUE, dot.alpha = .4)

Here is another approach:

library(splines)
ot2.b <-lm(y ~ x + poly(x, 2), data  = dat1)
plot(ggeffects::ggpredict(ot2.b, terms = "x"), add.data =TRUE, dot.alpha = .4)

Non-linear relationship as modeled by a general additive model (spline)

library(splines)

ot3 <-lm(y ~ bs(x), data  = dat1)

#performance::check_model(ot2)
plot(ggeffects::ggpredict(ot3, terms = "x"), add.data =TRUE, dot.alpha = .4)

Centering

Any linear transformation of a predictor is OK. Often we center (or center and scale) all indicators, which gives us an intercept that is meaninful (the expected population average when the other indicators are set their average).

Show code
library(ggeffects)
# original model
m1 <- lm(daughter_height ~ mother_height, data = md_df)
mc <-lm(daughter_height ~ mother_height_c, data=md_df)
sjPlot::tab_model(m1,mc)
  daughter_height daughter_height
Predictors Estimates CI p Estimates CI p
(Intercept) 29.80 28.25 – 31.35 <0.001 63.86 63.80 – 63.92 <0.001
mother_height 0.54 0.52 – 0.57 <0.001
mother_height_c 0.54 0.52 – 0.57 <0.001
Observations 5524 5524
R2 / R2 adjusted 0.252 / 0.252 0.252 / 0.252

Graph model

Show code
plot(ggeffects::ggpredict(mc, terms = "mother_height_c"), add.data =TRUE, dot.alpha = .4)

Note: when fitting a polynomial or any interaction, it is important to center your indicators. We’ll come back to this point in later lectures.

Model evaluation

A simple way to assess your model fit is to compare a model with one covariate with a simple intercept-only model and to assess improvement in either the AIC statistic or the BIC statistic. The BIC is similar to the AIC but adds a penalty for extra predictors. An absolute improvement in either statistic of n > 10 is considered to be a better model.

We can use the performance package to generate a table that compares fits.

library(performance)
# intercept only
ionly <- lm(daughter_height ~ 1, data = md_df)

# covariate added
covadded <- lm(daughter_height ~ mother_height, data = md_df)

# evaluate
performance::compare_performance(ionly, covadded)
# Comparison of Model Performance Indices

Name     | Model |       AIC |       BIC |    R2 | R2 (adj.) |  RMSE | Sigma
----------------------------------------------------------------------------
ionly    |    lm | 26299.969 | 26313.203 | 0.000 |     0.000 | 2.615 | 2.615
covadded |    lm | 24698.514 | 24718.365 | 0.252 |     0.252 | 2.262 | 2.262

What was the model improvement?

Show code
# improved fit
BIC(ionly)- BIC(covadded)
[1] 1594.839

Generate a report

This is easy with the report package

For example:

report::report_statistics(covadded)
beta = 29.80, 95% CI [28.25, 31.35], t(5522) = 37.70, p < .001; Std. beta = 1.37e-14, 95% CI [-0.02, 0.02]
beta = 0.54, 95% CI [0.52, 0.57], t(5522) = 43.12, p < .001; Std. beta = 0.50, 95% CI [0.48, 0.52]

Or if you want a longer report

report::report(covadded)

Though use statistically significant in place of significant. This will avoid misleading your audience into thinking your result is important, when what you intend to communicate is that it is reliable.

Assumptions of regression

From Gelman and Hill (Gelman and Hill 2006)

  1. Validity

    The most important is that the data you are analyzing should map to the research question you are trying to answer (Gelman, Hill, and Vehtari 2020b, 152)

  2. Representativeness

    A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population. (Gelman, Hill, and Vehtari 2020b, 153)

  3. Linearity*

    The most important mathematical assumption of the linear regression model is that its deterministic component is a linear function of the separate predictors: y = β0 + β1x1 + β2x2 +···. If additivity is violated, it might make sense to transform the data (for example, if y = abc, then log y = log a + log b + log c) or to add interactions. If linearity is violated, perhaps a predictor should be put in as 1/x or log(x) instead of simply linearly. Or a more complicated relationship could be expressed using a nonlinear function such as a spline or Gaussian process, (Gelman, Hill, and Vehtari 2020b, 153)

  4. Independence of errors

    The simple regression model assumes that the errors from the prediction line are independent, an assumption that is violated in time series, spatial, and multilevel settings (Gelman, Hill, and Vehtari 2020b, 153)

  5. Equal variance of errors

…unequal variance does not affect the most important aspect of a regression model, which is the form of the predictors (Gelman, Hill, and Vehtari 2020b, 153)

  1. Normality of errors (statistical independence)

The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of re-gression residuals. (Gelman and Hill 2006, 46)

A good way to diagnose violations of some of the assumptions just considered (importantly, linearity) is to plot the residuals versus fitted values or simply individual predictors.(Gelman and Hill 2006, 46)

Common confusions

Causal inference is tricky

People use the work “effect” but that is not what regression gives us (by default)

“Normality assumption”

As Gelman and Hill note, the “normality” assumption is the least important. And the assumption pertains to the normality of residuals

Statistical independence

This will be the main reason we do multi-level modelling: to condition on dependencies in the data.

Levels (wrong population)

We sample from undergraduates, but infer about the human population.

Acknowledgments

  1. Richard Mcelreath’s Statistical Rethinking (McElreath 2020)
  2. Regression and other stories (Gelman, Hill, and Vehtari 2020a)

Appendix 1: Conceptual Background

Some preliminaries about science.

Science begins with a question

Science begins with a question about the world. The first step in science, then, is to clarify what you want to know.

Because science is a social practice, you will also need to clarify why your question is interesting: so what?

In short, know your question.

Scientific model (or theory)

Sometimes scientists are interested in specific features of the world: how did virus x originate? Such a question might have a forensic interest: what constellation of events gave rise to a novel infectious disease?

More typically, scientists seek generalisations. How do infectious diseases evolve? How do biological organisms evolve? Such questions have applied and fundamental interests. How can we better prevent infectious disease? How did life originate?

A scientific model is a proposal for how nature is structured (and unstructured). For example, the theory of evolution by natural selection proposes that life emerges from variation, inheritance, and differential reproduction/survival.

To evaluate a scientific model, scientists must make generalisations beyond individual cases. This is where statistics shines.

What is statistics?

Mathematics is a logic of certainty.

Statistics is a logic of uncertainty.

A statistical model uses the logic of probability to make better guesses.

Applications of statistical models in science

Scientific models seek to explain how nature is structured. Where scientific models conflict, we can combine statistical models with data-collection to evaluate the credibility of of one theoretical model over others. To do this, a scientific model must make distinct, non-trivial predictions about the world.

If the predictions are not distinct, the observations will not enable a shift in credibility for one theory over another. Consider the theory that predicts any observation. Such a theory would be better classified as a conspiracy theory; it is compatible with any evidence whatsoever.

Today we introduce a statistical method called regression. We will focus on how regression helps both to evaluate, and to make, informed predictions about the structures of the world.

Appendix 2: How your computer sees your data

Under the hood

Under the hood, your computer sees your data as consisting of vectors and matrices. An algorithm searches to estimate (or in the case of Bayesian inference, to “solve”) an optimization problem to obtain a location for the unobserved parameter; in the case we examined above, this parameter is the relationship between daughter heights and mother heights.

\[\begin{bmatrix} \textbf{y}\\ \textit{daughter's height}\\ 51.5\\ 52.5 \\ 53.0 \\ \vdots \end{bmatrix} = \begin{bmatrix} \textbf{intercept} \\ 1 \\ 1 \\ 1 \\ 1 \\ \vdots \\ \end{bmatrix} \begin{bmatrix} \textbf{x}\\ \textit{mother's height}\\ 59.5\\ 57.5 \\ 60.0 \\ \vdots \end{bmatrix} \begin{bmatrix} \textbf{b}\\ 29.8 \\ 0.54 \end{bmatrix}\]

Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge university press.
Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020a. Regression and Other Stories. Cambridge University Press.
———. 2020b. Regression and Other Stories. Cambridge University Press.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in r and Stan. CRC press.
Pearson, Karl, and Alice Lee. 1903. “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters.” Biometrika 2 (4): 357–462.

  1. The relationship of probability distributions and data-generating processes is complex, intriguing, and both historically and philosophically rich \(\dots\). Because our interests are applied, we will hardly touch up this richness in this course, alas.↩︎

  2. Later, we’ll prefer a different way of writing regression equations in math. (Note: writing math isn’t math - it’s just encoding the model that we’ve written).↩︎

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-NC-SA 4.0. Source code is available at https://go-bayes.github.io/psych-447/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Bulbulia (2021, March 23). Psych 447: Elements of a linear model. Retrieved from https://vuw-psych-447.netlify.app/posts/5_1/

BibTeX citation

@misc{bulbulia2021elements,
  author = {Bulbulia, Joseph},
  title = {Psych 447: Elements of a linear model},
  url = {https://vuw-psych-447.netlify.app/posts/5_1/},
  year = {2021}
}