knitr::include_graphics("op.png")
# libraries
library("tidyverse")
library("patchwork")
library("brms")
library("lubridate")
library("splines")
if (!require(equatiomatic)) {
remotes::install_github("datalorax/equatiomatic")
}
# set ggplot theme
theme_set(theme_classic())
# Import data
# read data
nz_0 <- readr::read_csv2(url("https://raw.githubusercontent.com/go-bayes/psych-447/main/data/nz/nz.csv"))
# to relevel kessler 6 variables
f<-c("None Of The Time","A Little Of The Time","Some Of The Time", "Most Of The Time", "All Of The Time")
# get data into shape
library("tidyverse")
nz <- nz_0 %>%
dplyr::mutate_if(is.character, factor) %>%
select(
-c(
SWB.Kessler01,
SWB.Kessler02,
SWB.Kessler03,
SWB.Kessler04,
SWB.Kessler05,
SWB.Kessler06
)
) %>%
dplyr::mutate(Wave = as.factor(Wave)) %>%
dplyr::mutate(FeelHopeless = forcats::fct_relevel(FeelHopeless, f)) %>%
dplyr::mutate(FeelDepressed = forcats::fct_relevel(FeelDepressed, f)) %>%
dplyr::mutate(FeelRestless = forcats::fct_relevel(FeelRestless, f)) %>%
dplyr::mutate(EverythingIsEffort = forcats::fct_relevel(EverythingIsEffort, f)) %>%
dplyr::mutate(FeelWorthless = forcats::fct_relevel(FeelWorthless, f)) %>%
dplyr::mutate(FeelNervous = forcats::fct_relevel(FeelNervous, f)) %>%
dplyr::mutate(date = make_date(year = 2009, month = 6, day = 30) + TSCORE) %>%
dplyr::mutate(height_m = HLTH.Height * 100,
weight_kg = HLTH.Weight) # better height vars
This week we introduce regression.
By learning regression, you will be better equipped to do psychological science and to evaluate psychological research.
Broadly speaking, a regression model is a method for inferring the expected average features of a population, and the variance of a population, conditional on other features of the population as measured in a sample.
We'll see that regression encompasses more than this definition; however, it makes a start.
To understand regression, then, we need to understand the following jargon words: population, sample, measurement, and inference.
In science, a population is a hypothetical construct. It is the set of all potential members of a set of things. In psychological science that set is typically a collection of individuals. We might want to understand "the population of all human beings," "the New Zealand adult population," or "the population of undergraduates who may be recruited for IPRP in New Zealand."
A sample is a randomly realised sub-population from the larger abstract population that a scientific community hopes to generalise about.
Think of selecting balls randomly from an urn. If pulled at random, the balls may inform us about the contents of the urn. For example, if we select one white ball and one black ball, we may infer that the balls in the urn are not all white or all black.
A measure is a tool or method for obtaining numerical descriptions of a sample. We often call measures "scales." We can think of a bathroom weight scale as a tool and method for tracking body weight.
A measurement is the numerical description we obtain from sensors such as statistical surveys, census data, Twitter feeds, etc.
In the course, we have encountered numerical scales, ordinal scales, and factors. The topic of measurement in psychological science is, to say the least, very broad.
For now, it is important to keep in mind that, similar to bathroom scales, measures can be prone to error.
Also similar to bathroom scales, error prone scales may nevertheless be useful. We need to investigate the utility of error prone scales against the backdrop of specific interests and purposes.
In regression, we combine measurements on samples with probability theory to guess about the properties of a population we will never observe. We call these properties “parameters.”
The bulk of statistical inference consists of educated guessing about population parameters.
Inference is possible because the parameters of naturally occurring populations are structured by data generating processes that are approximated by probability distributions. A probability distribution is a mathematical function that describes the probability of a random event. Today we will be focusing on height.1
Today we will be talking about the "normal" or "Gaussian" distribution. A very large number of data-generating processes in nature conform to the normal distribution.
Let's consider some examples of randomly generated samples, which we will obtain using R's rnorm function.
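For instance, a minimal sketch (the population mean and standard deviation here, 170 cm and 20 cm, are assumed values chosen for illustration; they are not from the original code):

```r
# draw three random samples from a hypothetical height distribution
set.seed(123) # for reproducibility
height_100    <- rnorm(n = 100,    mean = 170, sd = 20)  # 100 draws
height_10     <- rnorm(n = 10,     mean = 170, sd = 20)  # 10 draws
height_100000 <- rnorm(n = 100000, mean = 170, sd = 20)  # 100,000 draws

# the sample means wobble around the true mean of 170
mean(height_100); mean(height_10); mean(height_100000)
```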
We can use R to investigate the average height of our imaginary population from which the preceding samples were randomly drawn. We do this in R by writing an “intercept-only” model as follows:
Using the previous simulations:
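The chunk that fits these models is not shown; a minimal sketch using the draws above (the model names sm, subsm, and largesm match the tables that follow):

```r
# intercept-only models: the estimated intercept is the sample mean,
# and its confidence interval is our guess about the population mean
sm      <- lm(height_100 ~ 1)     # 100 observations
subsm   <- lm(height_10 ~ 1)      # 10 observations
largesm <- lm(height_100000 ~ 1)  # 100,000 observations

# sjPlot::tab_model() produces summary tables like those below
sjPlot::tab_model(sm, subsm, largesm)
```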
N = 100 random draws
Model: sm

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 171.81 | 168.19 – 175.43 | <0.001 |
| Observations | 100 | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | |
N = 10 random draws
Model: subsm

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 171.49 | 157.85 – 185.14 | <0.001 |
| Observations | 10 | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | |
N = 100,000 random draws
Model: largesm

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 170.02 | 169.90 – 170.14 | <0.001 |
| Observations | 100000 | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | |
What do we notice about the relationship between sample size and the estimated population average?
| Predictors | sm: Estimates | CI | p | subsm: Estimates | CI | p | largesm: Estimates | CI | p |
|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 171.81 | 168.19 – 175.43 | <0.001 | 171.49 | 157.85 – 185.14 | <0.001 | 170.02 | 169.90 – 170.14 | <0.001 |
| Observations | 100 | | | 10 | | | 100000 | | |
| R2 / R2 adjusted | 0.000 / 0.000 | | | 0.000 / 0.000 | | | 0.000 / 0.000 | | |
Does mother's height predict daughter's height? It seems so. But how close is the relationship?
Francis Galton is credited with inventing regression. Galton observed that the height of offspring tends to fall between parental height and the population average, a phenomenon he termed "regression to the mean." Galton sought a method for making educated guesses about heights, and this led to fitting a line of regression by a method called "least squares" (for a history, see: here).
The following dataset is from "The heredity of height" by Karl Pearson and Alice Lee (Pearson and Lee 1903). I obtained the dataset from (Gelman, Hill, and Vehtari 2020a). Let's use it to investigate the relationship between a mother's height and a daughter's height.
md_df <- data.frame(read.table(url("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/PearsonLee/data/MotherDaughterHeights.txt"), header=TRUE))
# Center mother's height for later example
md_df <- md_df %>%
dplyr::mutate(mother_height_c = as.numeric(scale(mother_height, center = TRUE, scale = FALSE)))
dplyr::glimpse(md_df)
Rows: 5,524
Columns: 3
$ daughter_height <dbl> 52.5, 52.5, 53.5, 53.5, 55.5, 55.5, 55.5, 55…
$ mother_height <dbl> 59.5, 59.5, 59.5, 59.5, 59.5, 59.5, 59.5, 59…
$ mother_height_c <dbl> -2.9987328, -2.9987328, -2.9987328, -2.99873…
Pearson and Lee collected 5,524 observations from mother/daughter height pairs. Let’s examine the data, first by plotting the relationship.
What is happening here?
explore_md <-ggplot2::ggplot(data = md_df, aes(y = daughter_height, x = mother_height)) +
geom_jitter(alpha = .2) +
labs(title = "The relationship between mothers height and daughter's height") +
ylab("Daughter's height") +
xlab("Mother's height") + theme_classic()
explore_md
Is there a linear predictive relationship between these two variables? In regression, we examine the line of best fit.
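The chunk that fits this model is not shown above; a minimal version, using the name m1 that the later code refers to, would be:

```r
# regress daughter's height on mother's height (simple linear regression)
m1 <- lm(daughter_height ~ mother_height, data = md_df)

# a summary table like the one below can be produced with sjPlot
sjPlot::tab_model(m1)
```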
Outcome: daughter_height

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 29.80 | 28.25 – 31.35 | <0.001 |
| mother_height | 0.54 | 0.52 – 0.57 | <0.001 |
| Observations | 5524 | | |
| R2 / R2 adjusted | 0.252 / 0.252 | | |
We can plot the coefficients; in a model with one predictor this isn't too informative. As we continue in the course, however, we'll see that plotting coefficients can be easier than deciphering the numbers in tables. Here are two methods for plotting.
t_m1<-parameters::model_parameters(m1,
ci = 0.95)
method1 <- plot(t_m1) +
labs(title = "The relationship between mothers height and daughter's height") +
ylab("Daughter's height")
method2 <-sjPlot::plot_model(m1)
library(patchwork)
method1 / method2 + plot_annotation(title = "Comparison of two coefficient plots",
subtitle = "a: parameters; b: sjPlot",
tag_levels = "a")
Let’s write the equation out in mathematics. How do we read this?2
library("equatiomatic")
extract_eq(m1, use_coefs = FALSE)
\[ \operatorname{daughter\_height} = \alpha + \beta_{1}(\operatorname{mother\_height}) + \epsilon \]
The math says that a daughter's expected height in the population is the average height when mother's height is set to zero units (note: this is impossible; we'll come back to this), plus \(\beta\) units of daughter's height (inches) for each additional unit of mother's height (inches).
We can plug the output of the model directly into the equation as follows:
library("equatiomatic")
extract_eq(m1, use_coefs = TRUE)
\[ \operatorname{\widehat{daughter\_height}} = 29.8 + 0.54(\operatorname{mother\_height}) \]
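As a quick check on how to read the fitted equation, we can plug a mother's height into the coefficients by hand (63.5 inches is just an illustrative value):

```r
# expected daughter's height for a mother who is 63.5 inches tall,
# using the unrounded coefficients stored in m1
coef(m1)["(Intercept)"] + coef(m1)["mother_height"] * 63.5
# roughly 64.4 inches
```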
Jyoti Amge is the world's shortest woman at 25 inches. Sandy Allen was the world's tallest woman at 91 inches. What are the expected heights of their daughters, and of the daughters of every woman of intermediate height?
# use the `expand.grid` command to create a sequence of points for mother's height
ndat<-expand.grid(mother_height = c(25:91))
# use the `predict` function to create a new response
pr<- predict(m1, type = "response", interval = "confidence", newdata =ndat)
# have a look at the object
dplyr::glimpse(pr)
num [1:67, 1:3] 43.4 44 44.5 45.1 45.6 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:67] "1" "2" "3" "4" ...
..$ : chr [1:3] "fit" "lwr" "upr"
# create a new dataframe for the new sequence of points for mother's height and the predicted data
newdata<-data.frame(ndat,pr)
head(newdata)
mother_height fit lwr upr
1 25 43.42183 42.49099 44.35266
2 26 43.96676 43.06065 44.87288
3 27 44.51170 43.63030 45.39310
4 28 45.05664 44.19995 45.91332
5 29 45.60157 44.76960 46.43355
6 30 46.14651 45.33924 46.95378
Graph the predicted results
# graph the expected results
predplot<-ggplot(data = newdata,
aes(x= mother_height, y = fit)) +
geom_point() + geom_errorbar(aes(ymin = lwr, ymax = upr), width = .1) +
expand_limits(x = c(20,91), y = c(0,81)) + theme_classic() +
labs(title = "Predicted values for a broader population")
# plot the two graphs together (making the x and y axis at the same scale )
library("patchwork")
# rescale heightplot
# redraw the earlier plot with the new x and y axis scales, and remove the raw points
# `toplot` is the ggeffects prediction object for m1 created earlier (not shown here)
heightplot2 <- plot(toplot, add.data = FALSE) + theme_classic()
nhp <- heightplot2 + expand_limits(x = c(20,91), y = c(0,81)) + labs(title = "Predicted values of daughter's height from the Pearson/Lee 1903 dataset")
# double graph
nhp / predplot + plot_annotation(title = "What do you notice about these relationships?", tag_levels = "a")
A simple method for obtaining the predicted values from your fitted model is to obtain the ggeffects output without producing a graph.
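The call that produces output like the following is along these lines:

```r
# tabulate predicted daughter heights at representative mother heights
ggeffects::ggpredict(m1, terms = "mother_height")
```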
# Predicted values of daughter_height
# x = mother_height
x | Predicted | 95% CI
----------------------------------
52.50 | 58.41 | [58.15, 58.66]
54.50 | 59.50 | [59.29, 59.70]
57.50 | 61.13 | [60.99, 61.27]
59.50 | 62.22 | [62.13, 62.32]
61.50 | 63.31 | [63.25, 63.38]
63.50 | 64.40 | [64.34, 64.47]
65.50 | 65.49 | [65.40, 65.59]
70.50 | 68.22 | [68.01, 68.42]
Linear regression assumes linearity conditional on a model. Often your data will not be linear!
Consider the following example:
# Simulate nonlinear relationship between x and y
b <- c(2, 0.75)
set.seed(12)
x <- rnorm(100)
set.seed(12)
y <- rnorm(100, mean = b[1] * exp(b[2] * x))
dat1 <- data.frame(x, y)
ot1 <-lm(y ~ x, data = dat1)
# performance::check_model(ot1)
# Plot linear effect
plot(ggeffects::ggpredict(ot1, terms = "x"), add.data =TRUE, dot.alpha = .4)
Non-linear relationship as modelled by a polynomial regression:
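The polynomial fit itself is not shown; a minimal sketch (the quadratic degree and the name ot2 are illustrative choices, not from the original code):

```r
# fit a second-degree (quadratic) polynomial to the simulated data
ot2 <- lm(y ~ poly(x, 2), data = dat1)

# plot the fitted curve over the raw points, as above
plot(ggeffects::ggpredict(ot2, terms = "x"), add.data = TRUE, dot.alpha = .4)
```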
Here is another approach:
Non-linear relationship as modelled by a generalized additive model (spline):
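Again the code is not shown; one possible sketch, assuming the mgcv package is installed (the model name ot3 is illustrative):

```r
# generalized additive model with a smooth (spline) term for x
library(mgcv)
ot3 <- mgcv::gam(y ~ s(x), data = dat1)

# plot the smooth fit over the raw points
plot(ggeffects::ggpredict(ot3, terms = "x"), add.data = TRUE, dot.alpha = .4)
```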
Any linear transformation of a predictor is OK. Often we center (or center and scale) all indicators, which gives us an intercept that is meaningful (the expected population average when the other indicators are set to their averages).
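For example, we can re-fit the Pearson and Lee model using the centred predictor created earlier (the name m2 is illustrative); the second set of columns in the table below corresponds to this centred model:

```r
# same regression, but with mother's height centred at its mean
m2 <- lm(daughter_height ~ mother_height_c, data = md_df)

# compare the two parameterisations side by side
sjPlot::tab_model(m1, m2)
```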
Outcome: daughter_height (left: raw predictor; right: centred predictor)

| Predictors | Estimates | CI | p | Estimates | CI | p |
|---|---|---|---|---|---|---|
| (Intercept) | 29.80 | 28.25 – 31.35 | <0.001 | 63.86 | 63.80 – 63.92 | <0.001 |
| mother_height | 0.54 | 0.52 – 0.57 | <0.001 | | | |
| mother_height_c | | | | 0.54 | 0.52 – 0.57 | <0.001 |
| Observations | 5524 | | | 5524 | | |
| R2 / R2 adjusted | 0.252 / 0.252 | | | 0.252 / 0.252 | | |
Graph model
Note: when fitting a polynomial or any interaction, it is important to center your indicators. We’ll come back to this point in later lectures.
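A minimal sketch of that advice in practice (the quadratic term and the name m3 are purely illustrative):

```r
# centre first, then fit the polynomial on the centred predictor
m3 <- lm(daughter_height ~ poly(mother_height_c, 2), data = md_df)
```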
A simple way to assess model fit is to compare a model with one covariate against a simple intercept-only model and to assess the improvement in either the AIC statistic or the BIC statistic. The BIC is similar to the AIC but applies a heavier penalty for extra predictors. By a common rule of thumb, an absolute improvement of more than 10 in either statistic indicates a better-fitting model.
We can use the performance package to generate a table that compares fits.
library(performance)
# intercept only
ionly <- lm(daughter_height ~ 1, data = md_df)
# covariate added
covadded <- lm(daughter_height ~ mother_height, data = md_df)
# evaluate
performance::compare_performance(ionly, covadded)
# Comparison of Model Performance Indices
Name | Model | AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma
----------------------------------------------------------------------------
ionly | lm | 26299.969 | 26313.203 | 0.000 | 0.000 | 2.615 | 2.615
covadded | lm | 24698.514 | 24718.365 | 0.252 | 0.252 | 2.262 | 2.262
What was the model improvement?
This is easy with the report package. For example:
report::report_statistics(covadded)
beta = 29.80, 95% CI [28.25, 31.35], t(5522) = 37.70, p < .001; Std. beta = 1.37e-14, 95% CI [-0.02, 0.02]
beta = 0.54, 95% CI [0.52, 0.57], t(5522) = 43.12, p < .001; Std. beta = 0.50, 95% CI [0.48, 0.52]
Or, if you want a longer report:
report::report(covadded)
Note: use "statistically significant" in place of "significant". This will avoid misleading your audience into thinking your result is important, when what you intend to communicate is that it is reliable.
From Gelman and Hill (Gelman and Hill 2006)
Validity
The most important is that the data you are analyzing should map to the research question you are trying to answer (Gelman, Hill, and Vehtari 2020b, 152)
Representativeness
A regression model is fit to data and is used to make inferences about a larger population, hence the implicit assumption in interpreting regression coefficients is that the sample is representative of the population. (Gelman, Hill, and Vehtari 2020b, 153)
Linearity*
The most important mathematical assumption of the linear regression model is that its deterministic component is a linear function of the separate predictors: y = β0 + β1x1 + β2x2 +···. If additivity is violated, it might make sense to transform the data (for example, if y = abc, then log y = log a + log b + log c) or to add interactions. If linearity is violated, perhaps a predictor should be put in as 1/x or log(x) instead of simply linearly. Or a more complicated relationship could be expressed using a nonlinear function such as a spline or Gaussian process, (Gelman, Hill, and Vehtari 2020b, 153)
Independence of errors
The simple regression model assumes that the errors from the prediction line are independent, an assumption that is violated in time series, spatial, and multilevel settings (Gelman, Hill, and Vehtari 2020b, 153)
Equal variance of errors
…unequal variance does not affect the most important aspect of a regression model, which is the form of the predictors (Gelman, Hill, and Vehtari 2020b, 153)
The regression assumption that is generally least important is that the errors are normally distributed. In fact, for the purpose of estimating the regression line (as compared to predicting individual data points), the assumption of normality is barely important at all. Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals. (Gelman and Hill 2006, 46)
A good way to diagnose violations of some of the assumptions just considered (importantly, linearity) is to plot the residuals versus fitted values or simply individual predictors.(Gelman and Hill 2006, 46)
People use the word "effect," but that is not what regression gives us (by default).
As Gelman and Hill note, the "normality" assumption is the least important, and it pertains to the normality of the residuals.
This will be the main reason we do multi-level modelling: to condition on dependencies in the data.
We sample from undergraduates, but infer about the human population.
Some preliminaries about science.
Science begins with a question about the world. The first step in science, then, is to clarify what you want to know.
Because science is a social practice, you will also need to clarify why your question is interesting: so what?
In short, know your question.
Sometimes scientists are interested in specific features of the world: how did virus x originate? Such a question might have a forensic interest: what constellation of events gave rise to a novel infectious disease?
More typically, scientists seek generalisations. How do infectious diseases evolve? How do biological organisms evolve? Such questions have applied and fundamental interests. How can we better prevent infectious disease? How did life originate?
A scientific model is a proposal for how nature is structured (and unstructured). For example, the theory of evolution by natural selection proposes that life emerges from variation, inheritance, and differential reproduction/survival.
To evaluate a scientific model, scientists must make generalisations beyond individual cases. This is where statistics shines.
Mathematics is a logic of certainty.
Statistics is a logic of uncertainty.
A statistical model uses the logic of probability to make better guesses.
Scientific models seek to explain how nature is structured. Where scientific models conflict, we can combine statistical models with data collection to evaluate the credibility of one theoretical model over others. To do this, a scientific model must make distinct, non-trivial predictions about the world.
If the predictions are not distinct, the observations will not enable a shift in credibility for one theory over another. Consider the theory that predicts any observation. Such a theory would be better classified as a conspiracy theory; it is compatible with any evidence whatsoever.
Today we introduce a statistical method called regression. We will focus on how regression helps both to evaluate, and to make, informed predictions about the structures of the world.
Under the hood, your computer sees your data as vectors and matrices. An algorithm searches for an estimate (or, in the case of Bayesian inference, a "solution") of an optimization problem to obtain a location for the unobserved parameter; in the case we examined above, this parameter is the relationship between daughters' heights and mothers' heights.
\[\begin{bmatrix} \textbf{y}\\ \textit{daughter's height}\\ 51.5\\ 52.5 \\ 53.0 \\ \vdots \end{bmatrix} = \begin{bmatrix} \textbf{intercept} & \textbf{x}\\ & \textit{mother's height}\\ 1 & 59.5\\ 1 & 57.5 \\ 1 & 60.0 \\ \vdots & \vdots \end{bmatrix} \begin{bmatrix} \textbf{b}\\ 29.8 \\ 0.54 \end{bmatrix} + \boldsymbol{\epsilon}\]
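We can inspect this machinery directly in R: model.matrix() returns the design matrix that lm() builds from the formula, and multiplying it by the coefficient vector reproduces the fitted values.

```r
# the design matrix: a column of 1s (intercept) and a column of mother's heights
head(model.matrix(m1))

# matrix multiplication by the coefficient vector gives the fitted values
head(model.matrix(m1) %*% coef(m1))
head(fitted(m1))  # the same numbers
```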
The relationship of probability distributions and data-generating processes is complex, intriguing, and both historically and philosophically rich \(\dots\). Because our interests are applied, we will hardly touch upon this richness in this course, alas.↩︎
Later, we’ll prefer a different way of writing regression equations in math. (Note: writing math isn’t math - it’s just encoding the model that we’ve written).↩︎