A primer on Mixed-Effects Models: Theory and practice

class: center, middle, section

## A primer on Mixed-Effects Models:<br>Theory and practice

Gonzalo García-Castro

[![](img/github.png)](https://github.com/gongcastro)
[![](img/twitter.png)](https://twitter.com/gongcastro)

25th March 2020

---
class: center, middle, section

# First of all...

---
# Disclaimer

I'm not a trained statistician

Don't trust me (too much)

Mistakes may (will) be made

I'm not 100% sure about anything

Probably, none of us will ever be

So let's get to it!

.pull-right-normal[
![](img/meme_noidea.gif)
]

???

Sometimes we need to get fancy to get the most out of out (little and precious) data

---
# Notation

.centered[Linear Mixed-Effects Models = LMM]

**Mixed Models** (aka. Mixed-Effects Models, aka. Multilevel Models, aka. Hierarchical Models, aka. Nested Data Models, aka. Random Parameter Models, aka. Split-Plot Designs)

<br>

If by the end of this presentation you have an intuition about why all **these labels refer to the same thing**

Yay! You have made so much progress.

---
## Disclaimer (bonus 1)

We need to learn a bit about programming.

---
## Disclaimer (bonus 2)

Most statistical literature on LMM uses **R** (e.g., `lmer()` function of the `lme4` package)

---
## Disclaimer (bonus 2)

Many other programming languages support LMM

* Python: `statsmodels` library, `Pymer4`
* Matlab: `Statistics and Machine Learning Toolbox`
* Julia: `MixedModels`
* Stan: `Bayesian modelling`, with interfaces with R, Python and Matlab)

I don't assume you are familiar with R, but I will use some R code for illustration purposes.

---
class: section, center, middle

# Materials

Some recommended reads.

---
### Materials: Books and book chapters

.small[
Navarro, D. J. (2015). Learning statistics with r: A tutorial for psychology students and other beginners.

Field, A., Miles, J., & Field, Z. (2012). 19. Multilevel linear models. In Discovering statistics using r. SAGE Publications.

Winter, B. (2019). Statistics for linguists: An introduction using R. Routledge.

Mirman, D. (2016). 4. Structuring random effects. In Growth curve analysis and visualisation using r. CRC press.

Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University press.

Fox, J., & Weisberg, S. (2018). An r companion to applied regression. SAGE publications.

McElreath, R. (2020). Statistical rethinking: A bayesian course with examples in r and stan. CRC press.
]

---

### Materials: Articles

.small[
Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. arXiv Preprint arXiv:1308.5499. https://arxiv.org/abs/1308.5499

DeBruine, L., & Barr, D. J. (2019). Understanding mixed effects models through data simulation. https://doi.org/10.31234/osf.io/xp5cy

Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278.

Barr, D. J. (2013). Random effects structure for testing interactions in linear mixed-effects models. Frontiers in Psychology, 4, 328. https://www.frontiersin.org/articles/10.3389/fpsyg.2013.00328/full

Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015a). Parsimonious mixed models. arXiv Preprint arXiv:1506.04967. https://arxiv.org/pdf/1506.04967.pdf

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. https://doi.org/10.1016/j.jml.2007.12.005

Barr, D. J. (2008). Analyzing “visual world” eyetracking data using multilevel logistic regression. Journal of Memory and Language, 59(4), 457–474. https://doi.org/10.1016/j.jml.2007.09.002

]

---

### Materials: Articles (special mentions)

.small[
Meteyard, L., & Davies, R. A. (2020). Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language, 112, 104092. https://doi.org/10.1016/j.jml.2020.104092

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015b). Fitting linear mixed-effects models using lme4. Journal of Statistial Software, 67(1). https://doi.org/10.18637/jss.v067.i01
]

---
class: section, middle, center

# Modelling

A reminder.

---
## Modelling: What for?

To estimate a **parameter** that characterises a population using data from samples of that population.

This unknown **parameter** can be whatever we want:

* Central tendency measures (e.g., mean, median)
* Dispersion measures (e.g., standard deviation)
* Measures of association between variables (e.g., coefficients)

We usually estimate all of them when analysing data and perform some kind of **statistical inference** from them.

---
## Modelling: What for?

In confirmatory analyses, we:

1) Hypothesise that the parameter is within a range of values

2) Collect data

3) Perform statistical inference

<br>

* **Frequentist approach**: what is the probability of our data, assuming our hypothesis is true?
* **Bayesian approach**: what is the probability of our hypothesis, given the data?

---
## Modelling: What for?

In experimental confirmatory research, we are usually interesed in **estimating the association between two or more variables**:

* *Age* (predictor) and *vocabulary size* (outcome)
* *Bilingualism* (predictor) and *novelty preference* (outcome)
* *Native linguistic rythmic class* (predictor) and *entrainment to speech signal* (outcome)

We try draw a shape (e.g., line, curve) that defines this relationship.

For now, we will stick to óne shape: **lines**.

---
## Modelling: What for

The three (four, sometimes) steps of modelling:

1) **Model especification**: what variables am I going to include in the model?

2) **Model fitting**: what line fits data the best?

3) **Statistical inference**: Does my model fit data good enough? What is the contribution of each predictor to the goodness of fit?

4) (Bonus track) **Model validation**: Does my model predict new outcomes correctly?

---
## Modelling: 1) Model especification

We need to define what **outcomes** and **predictors** I am interested in.

> e.g., what am I trying to predict? What variables am I using to predict it?

What are my **assumptions** about how they relate to the outcome?

> e.g., linearity, normality of residuals

What are my assumptions about **how each predictor relates to the other predictors**?+

> e.g., do I expect interactions between predictors?

---
## Modelling: 1) Model especification

We will work with **linear** models

This means that we will try to draw a **line** that defines the relationship between predictors and outcome.

But bear in mind that **non-linear models** exist as well

For instance, we may need/want to fit a **curve** or an **exponential** function

> e.g., Growth Curve Analysis

---
## Modelling: 1) Model especification

Every line is defined by the following equation, named the **General Linear Model (GLM)**:

$$
Y = \beta_0+\beta_j \times X + \varepsilon
$$

Where:

* `$Y$` is the value that the outcome variable takes (we know this value)
* `$\beta_0$` is the intercept
* `$\beta_j$` is the coefficient
* `$X_j$` is the value that the predictor `$X$` takes (we know this value)
* `$\varepsilon$`: residual (error the model makes)

---
class: center

## Modelling: 1) Model especification

$$
Y = \beta_0+\beta_1 \times X
$$

This formula underlies most statistical techinques we use normally:

---
class: center

## Modelling: 1) Model especification

Most of them are generalised or special cases of the the **GLM**.

.centered[
[![](img/linear_models.png)](https://lindeloev.github.io/tests-as-linear/)
]

---
## Modelling: 1) Model specification

We can extend the GLM to include more predictors:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... +  \beta_j  X_j
$$

Each coefficient (e.g., `$\beta_1$`, `$\beta_2$`) will contribute to the model by "adjusting" the line.

But how much should each coefficient contribute?

We need to take a look at the data.

---
## Modelling: 2) Model fitting

There are infinite lines we can draw with these parameters.

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... +  \beta_j \times X_j
$$

We need to find the combination of values of the coefficients `$\beta_0$` and `$\beta_j$` that make the line fit data the best.

We need to find the **Least Squares Regression Line**

This line **minimises the overall distance between the lines and the data points**

How to find it? We use built-in **algorithms** that try a lot of combinations until finding the optimal one.

---
class: center

## Modelling: 2) Model fitting

Play with regression lines:

https://antoinesoetewey.shinyapps.io/statistics-202/

---
## Modelling: 3) Statistical inference

We have obtained **coefficients** that define the slope and position of the line that fits data the best.

But **"best"** is relative.

Even the best model could be fitting the data very badly

We need to assess how wuch of the variance our model accounts for.

Compute compute some measure of goodness of fit

> e.g., `$R^2$`: proportion of variance accounted for the model

If the model shows nice fit, let's smove on to **interpret the coefficients**!

---
## Modelling: 3) Statistical inference

We have a bunch of **coefficients**

Each coefficient tells us **how much each predictor contributes to the slope** of the line

All coefficients contribute to some degree. `$\beta_j = 0$` is almost impossible.

**What coefficients contribute significantly?**

---
## Modelling: 3) Statistical inference

Imagine that the **true (population) value of coefficient** `$\beta_1$` is 0

This means that it has **no predictive value** regarding the outcome variable

How what is the probability of `$\beta_1$` taking the value it takes in our sample, **assuming that `$\beta_1 = 0$`** is true?

If it is very **unlikely**, we should **reject** the assumption that `$\beta_1 = 0$`

If its **likely enough**, we **don't reject** the assumption that `$\beta_1 = 0$`

---
## Modelling: 3) Statistical inference

How can we know the **likelihood of our coefficient**?

We need to map it onto a **probability distribution**.

A probability distribution tells us the probability of each value a variable can take.

Different variables can follow different distributions:

* Gaussian (aka. normal, aka. exponential)
* Student's *t*
* Snedecor's *F*
* Poisson
* ...

---
## Modelling: 3) Statistical inference

We know that **coefficients** in linear models tend to follow a **Gaussian distribution**. This distribution is known.

It tells us the **probability** of each value of our coefficient, **assuming that its true value is 0**.

---
## Modelling: 3) Statistical inference

---
## Modelling: 3) Statistical inference

We can map the probability of our coefficient onto this distribution to find its **associated probability**

If the probability is lower than our **significance threshold** (e.g., `$\alpha = .05$`), we **reject the hypothesis** that the true value of the coefficient is 0!

We **interpret** the model goodness of fit, the coefficients, and draw (or do not draw) conclusions from them.

---
## Modelling: Assumptions!

We assume many things, one of the, being that observations are independent from each other.

This means that the **probability** of one observation taking one value is independent from the value other observations have taken.

---
class: section, center, middle

# Introducing random effects in our linear model

---
## What is a LMM

An extension of the GLM that allows to account for systematic sources of variability beyond our effects of interest.

All models are unnacurate to some degree.

Sometimes, part of the error the model makes is systematic.

---
## Why to use a LMM

* To **avoid aggregating data** (more statistical power)
* To account for **non-independence** of scores
* To account for **hierarchical structures** in our data
* To account for **cross-observational** unit variability

---
## Why to use a LMM

Effect of word frequency on reaction time task in a lexical decision task.

* Word condition:
- High frequency: **CAR** > Yes/No
- Low frequency: **FLAG** > Yes/No
* Non-word condition: **FOIR** > Yes/No

Does *frequency* affect *reaction times* (*RT*)?

---
## Why to use a LMM

Trial-wise data:

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Participant </th>
   <th style="text-align:center;"> Trial </th>
   <th style="text-align:center;"> Frequency </th>
   <th style="text-align:center;"> RT </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.24 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.12 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 5 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.04 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 6 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.47 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 7 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.36 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 8 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.18 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 9 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.61 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 10 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.23 </td>
  </tr>
</tbody>
</table>

...

---
## Why to use a LMM

Data-points from all participants are pooled together:

---
## Why to use a LMM

Assumption of non-indenpendence is had to assume here.

---
## Why to use a LMM

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Participant </th>
   <th style="text-align:center;"> Condition </th>
   <th style="text-align:center;"> RT (s) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.22 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.37 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 2 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.27 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 2 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 1.01 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 3 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.47 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 3 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 1.71 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 4 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.89 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 4 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 2.43 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 5 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 1.52 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 5 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 3.40 </td>
  </tr>
</tbody>
</table>

---
## Why to use a LMM

---
## Why to use a LMM

There is another way of dealing with no independence and *not* losing data.

<table>
 <thead>
  <tr>
   <th style="text-align:center;"> Participant </th>
   <th style="text-align:center;"> Trial </th>
   <th style="text-align:center;"> Condition </th>
   <th style="text-align:center;"> RT (s) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.24 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 2 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 3 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.12 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 4 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 5 </td>
   <td style="text-align:center;"> High </td>
   <td style="text-align:center;"> 0.04 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 6 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.47 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 7 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.36 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 8 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.18 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 9 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.61 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Participant 1 </td>
   <td style="text-align:center;"> 10 </td>
   <td style="text-align:center;"> Low </td>
   <td style="text-align:center;"> 0.23 </td>
  </tr>
</tbody>
</table>

---
## Why to use a LMM

---
## Why to use a LMM

---
class: section, middle, center

# A real-life example

---
## How to use a LMM

Does previous experience with the Head-turn Preference Procedure (HPP) impact performance in the task?

.citation[
Santolin, C., García-Castro, G., Zettersten, M., Sebastian-Galles, N., & Saffran, J. (2020, March 4). Experience with research paradigms relates to infants' direction of preference. https://doi.org/10.31234/osf.io/xgvbh
]

You can find data and code here: https://osf.io/g95ub/

---
## How to use a LMM

An example using real data:

Does previous experience with the Head-turn Preference Procedure (HPP) impact performance in the task?

---
## The Head-turn Preference Procedure

Infants can't tell us whether they **discriminate** between stimuli.

Nor can they be instructed in an experimental task.

We use their **preferential looking** behaviour (how long infants turn their head towards the source of stimulation).

---
## The Head-turn Preference Procedure

1) We **familiarise** infants with one set of stimuli.

2) In each **test** trial, infants are presented with one token from the familiar set, or from the un familiar set.

If infants **look systematically** longer to one set of stimuli than to others, it means that they are able to **discriminate** between them.

---
### The data

We gathered data from **6 experiments** run two different locations:

* Laboratori de Recerca en Infància (Universitat Pompeu Fabra, Barcelona)
* Waisman Center & Department of Psychology (University of Wisconsin-Madinson, Wisconsin)

All experiments were **broadly similar** in design, age of participants, and stimuli type.

---
### The data

However, there were some **differences**:

* Each experiment included different participants
* Participants lived in different countries (i.e., different cultures, lifestyle)
* Partcipants were exposed to different languages
* Participants had different language profile (monolinguals in Wisconsin, bilinguals in Barcelona)

---
### The data

`flip_data <- read.csv("flip_data.csv")`

|Study                                       |HPP 1th |HPP 2nd |HPP 3th |HPP 4th |HPP 5th |HPP 6th |
|:-------------------------------------------|:-------|:-------|:-------|:-------|:-------|:-------|
|Saffran & Wilson (2003)                     |26      |24      |24      |4       |2       |-       |
|Saffran et al. (2008)                       |4       |10      |4       |4       |-       |2       |
|Santolin & Saffran (2019)                   |18      |30      |-       |4       |-       |-       |
|Santolin, Saffran & Sebastian-Galles (2019) |44      |4       |-       |-       |-       |-       |

---
### The data

Long vs. wide format data.

| Participant |                    Study                    | HPP | Item | Looking Time (ms) |
|:-----------:|:-------------------------------------------:|:---:|:----:|:-----------------:|
|  baby50996  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  0   |      6541.00      |
|  baby50996  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  1   |      4770.83      |
|  baby51010  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  0   |      7654.67      |
|  baby51010  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  1   |      6785.17      |
|  baby51274  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  0   |      3580.25      |
|  baby51274  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |  1   |      4501.50      |

---

| Participant |                    Study                    | HPP | Familiar LT (ms) | Novel LT (ms) |
|:-----------:|:-------------------------------------------:|:---:|:----------------:|:-------------:|
|  baby50996  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     6541.00      |    4770.83    |
|  baby51010  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     7654.67      |    6785.17    |
|  baby51274  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     3580.25      |    4501.50    |
|  baby51689  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     7520.67      |    5052.83    |
|  baby52110  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     8974.83      |   11885.67    |
|  baby53657  | Santolin, Saffran & Sebastian-Galles (2019) |  1  |     9755.17      |    5585.33    |

???

Difference between wide and long format data

---
### Preparing our predictors

We use `Looking time` (in ms) as the **outcome**

We will use `Item` (familiar/novel) and number of previous `HPP` (numeric, 1-6) as predictors

We expect an `HPP` by `Item` **interaction**

> Previous experience with HPP should impact novel and familiar preference differently.

---
### Preparing our predictors

Coding variables: **dumy** vs. **effect** coding.

Model fitting functions like `lmer` don't care about how you call your **factor levels**.

It assigns a **numeric value** to each level.

In our case, *Familiar* = 0 and *Novel* = 1 (alphabetical order).

---
### Preparing our predictors

How you code your variables changes the **information** the model provides you.

This type of coding (0s and 1s) is named ***dumy coding***.

If *Familiar* =  0, and *Novel* = 1:

* **Model intercept**: Looking time when item is familiar.
* **Model slopes**: Change in looking time when item is novel.

---
### Preparing our predictors

We are interested in **interpreting changes in *novelty* preference**. Familiar trials should be the baseline.

The default coding happens to suit us.

But we should should always **explicitely recode our predictors** based on what information we want them to provide.

`flip <- mutate(flip_recorded, Item = ifelse(Item=="Familiar", 0, 1))`

---
### Preparing our predictors

Dummy coding `Item` makes sense because we have a **clear baseline**.

Alternatively, we could have *effect-coded* the predictor `Item`.

*Familiar* = -0.5, *Novel* = 0.5

In this case, the **intercept** (the looking time when `Item` = 0) would inform us about the **average looking time across conditions**.

How we have coded our variables impacts our conclusions especially when interpreting intercepts and main effects if we include interactions in our model.

.footer[We'll come back to this later.]

---
### Only fixed effects: Model fitting

**General Linear Model**:

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_jX_j + \varepsilon$$`

For participant `$i$` and condition `$j$`

Fixed effects:

`$$LookingTime_{ij} = \beta_0 + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

---
### Only fixed effects: Model fitting

Implementing our model in R:

```
model_fixed <- lm(formula = LookingTime ~ Item*HPP, data = flip_data)
```

---
### Only fixed effects: Statistical inference

Let's take a closer look at the estimates of the model:

```
summary_fixed <- summary(model_fixed)
summary_fixed$coefficients
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Coefficient </th>
   <th style="text-align:center;"> SE </th>
   <th style="text-align:center;"> t </th>
   <th style="text-align:center;"> p </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 8013.475 </td>
   <td style="text-align:center;"> 490.447 </td>
   <td style="text-align:center;"> 16.339 </td>
   <td style="text-align:center;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:center;"> -1398.775 </td>
   <td style="text-align:center;"> 693.597 </td>
   <td style="text-align:center;"> -2.017 </td>
   <td style="text-align:center;"> 0.045 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:center;"> -603.079 </td>
   <td style="text-align:center;"> 229.702 </td>
   <td style="text-align:center;"> -2.625 </td>
   <td style="text-align:center;"> 0.009 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:center;"> 667.113 </td>
   <td style="text-align:center;"> 324.848 </td>
   <td style="text-align:center;"> 2.054 </td>
   <td style="text-align:center;"> 0.041 </td>
  </tr>
</tbody>
</table>

---
### Only fixed effects: Model interpretation

---
### Introducing random effects

Our data is **hierarchical**:

* Every participant provides two **data points**: one for familiar items and one for novel items.
* There are strong reasons to consider that **data points from the same participant are correlated**.

> Some infants are **long lookers**: high looking times in both conditions
> Some infants are **short lookers**: low looking times in both conditions

We should add **random intercepts** by participant.

---
### Random intercepts: Model especification

**Fixed effects**:

Previously: `$$LookingTime_{ij} = \beta_0 + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

**Fixed effects and random intercepts**:

`$$LookingTime_{ij} = \beta_0 + P_{0i} + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the intercept of participant `$i$`

---
### Random intercepts by participant: Model fitting

```r
library(lme4)
model_random <- lmer(formula = LookingTime ~ Item*HPP +
                       (1 | Participant), data = flip_data)
```

---
### Random intercepts by participant: Statistical inference

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Coefficient </th>
   <th style="text-align:center;"> SE </th>
   <th style="text-align:center;"> t </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 8013.48 </td>
   <td style="text-align:center;"> 490.45 </td>
   <td style="text-align:center;"> 16.34 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:center;"> -1398.77 </td>
   <td style="text-align:center;"> 411.32 </td>
   <td style="text-align:center;"> -3.40 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:center;"> -603.08 </td>
   <td style="text-align:center;"> 229.70 </td>
   <td style="text-align:center;"> -2.63 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:center;"> 667.11 </td>
   <td style="text-align:center;"> 192.64 </td>
   <td style="text-align:center;"> 3.46 </td>
  </tr>
</tbody>
</table>

Where are my ***p*-values**? I want my *p*-values :(

---
### Random intercepts: Statistical inference

Computing *p*-values in LMM is *not straightfoward*.

It's difficult to come map estimates into **probability distributions**.

Not clear how to pool parameters (many models have been run).

---
### Random intercepts by participant: Statistical inference

Several alternatives:

1) Assume estimated coefficients follow a **normal distribution**: map estandardised coefficients onto the normal distribution (*Mean* = 0, *SD* = 1) to get their probability.

2) Assume estimated coefficients follow a ***t*** or a ***F* distribution*** with *approximated* degrees of freedom:

* *t* distribution: **Satterthwaite**'s approximation to degrees of freedom
* `$\chi^2$` distribution: **Wald**'s `$\chi^2$` test
* *F* distribution: **Kenward-Roger** ANOVA

3) Shift to **Bayesian** statistical inference :)

---

---
### Random intercepts by participant: Statistical inference

```
library(car)
Anova(model_random, type = "III", test.statistic = "F")
```
<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> F </th>
   <th style="text-align:center;"> Df </th>
   <th style="text-align:center;"> Df den. </th>
   <th style="text-align:center;"> p </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 266.967 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 140.814 </td>
   <td style="text-align:center;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:center;"> 11.565 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 100.000 </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:center;"> 6.893 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 140.814 </td>
   <td style="text-align:center;"> 0.010 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:center;"> 11.992 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 100.000 </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
</tbody>
</table>

---

### Random effects by participant: Model interpretation

---
### Introducing random slopes

* There are also reasons to think that the **effect** of `Item` varies across infants.
* Long-lookers may be show **higher preferences** than others

> Some infants may discriminate better than others between familiar and novel items

The slope of `Item` may vary across infants.

---
### Random intercepts and item slopes by participants: Model especification

Previously: `$$LookingTime_ij = \beta_0 + P_{0i} + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the **intercept** of participant `$i$`

`$$LookingTime_{ij} = \beta_0 + P_{0i} + (\beta_1 + P_{1i})Item+ \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{1i}$` is the **slope** of `Item` for participant `$i$`

---
### Random intercepts and item slopes by participants: Model fitting

```r
model_random_slopes <- lmer(formula = LookingTime ~ Item*HPP + (1 + Item*HPP | Participant),
                            data = flip_data)
```

```
## Error: number of observations (=204) <= number of random effects (=408) for term (1 + Item * HPP | Participant); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable
```

Problem: there are a lot of parameters to estimate, and not enough variability.

May be related to some levels of HPP just having one participant.

Check this [link](https://stackoverflow.com/questions/19713228/lmer-error-grouping-factor-must-be-number-of-observations) out.

---
### Random intercepts and item slopes by participants: Model fitting

We can (somehow irresponsibly) force the model to go on.

```r
model_random_slopes <- lmer(formula = LookingTime ~ Item*HPP + (1 + Item*HPP | Participant),
                            data = flip_data,
                            control = lmerControl(check.nobs.vs.nRE = "ignore"))
```

The model fails to converge (i.e., can't find the most likely coefficients given the data).

---

---
### Random intercepts and item slopes by participants: Model fitting

So, including `Item` random slopes by participant seems like a bad idea.

Let's stick to by-participants **random intercepts** only.

But haven't finished getting fancy!

---
### Adding a second random effect

Our data is missing an important source of **correlation between data points**: `Study`

Participants are more **similar within study** than between study.

We can account for this new source of variation by adding `Study` as a **random effect**.

---
### Random intercepts by participant and by study: Model especification

Fixed effects and random intercepts by participant:

Previously: `$$LookingTime_ij = \beta_0 + P_{0i} + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the **intercept** of participant `$i$`

`$$LookingTime_{ij} = \beta_0 + P_{0i} + S_{0k} + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the **intercept** of participant `$i$` and `$S_{0k}$` is the **intercept** of study `$k$`.

---
### Random intercepts and item slopes by participants and by study: Model fitting

```r
model_random_intercepts2 <- lmer(formula = LookingTime ~ Item*HPP + (1 | Participant) + (1 | Study),
                                 data = flip_data)
```

---
### Random intercepts by participant and by study: Statistical inference

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> F </th>
   <th style="text-align:center;"> Df </th>
   <th style="text-align:center;"> Df den. </th>
   <th style="text-align:center;"> p </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:center;"> 124.638 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 9.060 </td>
   <td style="text-align:center;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:center;"> 11.565 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 100.000 </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:center;"> 4.800 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 133.117 </td>
   <td style="text-align:center;"> 0.030 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:center;"> 11.992 </td>
   <td style="text-align:center;"> 1 </td>
   <td style="text-align:center;"> 100.000 </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
</tbody>
</table>

---
### Random intercepts by participant and by study: Model interpretation

---
### Introducing random slopes by study

Finally, we can also expect the effect of `HPP` to vary **across studies**.

Infants in some studies could have been **more sensitive** to the effect of previous experience with HPP.

---
### Random intercepts by participant and study, and random `Item` slopes by study: Model especification

Previously: `$$LookingTime_ij = \beta_0 + P_{0i} + S_{0k} + \beta_1Item + \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the **intercept** of participant `$i$` and `$S_{0k}$` is the **intercept** of study `$k$`.

`$$LookingTime_{ij} = \beta_0 + P_{0i} + (\beta_1 + S_{1k})Item+ \beta_2HPP + \beta_3Item \times HPP + \varepsilon_{ij}$$`

Where `$P_{0i}$` is the **intercept** of participant `$i$`, `$S_{0k}$` is the **intercept** of study `$k$`, and `$S_{1k}$` is the **slope** of `Item` on study `k`.

---
### Random intercepts by participant and study, and random `HPP` slopes by study: Model fitting

```r
model_random_slope2 <- lmer(formula = LookingTime ~ Item*HPP + (1 | Participant) + (1 + HPP | Study),
                            data = flip_data)
```

---
### Random intercepts by participant and study, and random `Item` slopes by study: Statistical inference

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Df den. </th>
   <th style="text-align:right;"> p </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 77.738 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 3.156 </td>
   <td style="text-align:right;"> 0.003 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:right;"> 11.565 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 100.000 </td>
   <td style="text-align:right;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:right;"> 2.918 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 2.620 </td>
   <td style="text-align:right;"> 0.199 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:right;"> 11.992 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 100.000 </td>
   <td style="text-align:right;"> 0.001 </td>
  </tr>
</tbody>
</table>

---
### Random intercepts by participant and study, and random `Item` slopes by study: Model interpretation

---
### Random intercepts by participant and study, and random `Item` slopes by study: Model interpretation

However, we missed something in the `lmer` output...

---
### Random intercepts by participant and study, and random `HPP` slopes by study: Model interpretation

Our fit is singular! But what does this mean? It has to do with something we haven't talked about yet.

Random intercepts and slopes are not the only different thing about LMMs.

There is an additional parameter we haven't talked about yet:

[*Dark deep voice*]: The **correlation** parameter, `$\rho$`.

---
### (Parenthesis: Introducing the correlation parameter)

LMMs allow intercepts to vary across observational units.

> Each participant has her own intercept

The amount of **variation across intercepts** impacts how fixed effects are estimated, and how we perform inferences from them.

The intercept **standard deviation**, `$\sigma$` reflects this variability.

Same goes for **random slopes**.

---
### (Parenthesis: Introducing the correlation parameter)

For each random effect (i.e., study), random intercepts and slopes may vary together:

They may be **correlated**.

> Participants that were **slower** in our lexical decission task may be **more sensitive** to the frequency manipulation.

---
<img src="img/task_plot_slo.png" width="100%" />

---
### (Parenthesis: Introducing the correlation parameter)

> Studies where participants looked longer on average may be those where participants discriminated better between familiar and novel trials.

---
### (Parenthesis: Introducing the correlation parameter)

Together with `$\sigma$`, `$\rho$` is calculated for a given random effect, both intercepts and slopes are especified.

These parameters are accessible from the `lmer` output, in the for of a matrix:

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Term </th>
   <th style="text-align:left;"> SD1 </th>
   <th style="text-align:left;"> SD2 </th>
   <th style="text-align:right;"> Cov. </th>
   <th style="text-align:right;"> Correlation </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Participant </td>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:right;"> 3116294.67 </td>
   <td style="text-align:right;"> 1765.30 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Study </td>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:right;"> 1607461.98 </td>
   <td style="text-align:right;"> 1267.86 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Study </td>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:right;"> 32828.32 </td>
   <td style="text-align:right;"> 181.19 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Study </td>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:right;"> -229717.82 </td>
   <td style="text-align:right;"> -1.00 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Residual </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:left;"> - </td>
   <td style="text-align:right;"> 1991854.67 </td>
   <td style="text-align:right;"> 1411.33 </td>
  </tr>
</tbody>
</table>

---
### (Parenthesis: Introducing the correlation parameter)

The **variance-covariance matrix** is **singular**.

Correlation parameter is near 1 and/or SD are close to 0.

Our model is **overparameterised**.

We should be asking a bit less from the data.

We should make our model more **parsimonious** (Bates et al., 2015).

**Simplify** the random effects structure.

More about failure to converge and singular fit later...

---
### Our final model:

Includes:

* Main effects of `Item` and `HPP`
* `Item` `$\times$` `HPP` interaction
* Crossed random effects for participants and study (in contrast to nested random effects; more about this later)
- Random by-participant intercepts
- Random by-study intercepts

---
### Our final model:

Accounts for:

* The **fixed effect** of `Item` on looking times.
* The **fixed effect** of `HPP` on looking times.
* The **differential effect** of `HPP` on each test `Item` (interaction).
* **Cross-participant variability** in overall looking time.
* **Cross-study variability** in overall looking time.
* **Hierarchical structure** of our data

---
### Our final model:

<table>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> F </th>
   <th style="text-align:right;"> Df </th>
   <th style="text-align:right;"> Df.res </th>
   <th style="text-align:right;"> Pr(&gt;F) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 266.967 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 140.814 </td>
   <td style="text-align:right;"> 0.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item </td>
   <td style="text-align:right;"> 11.565 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 100.000 </td>
   <td style="text-align:right;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> HPP </td>
   <td style="text-align:right;"> 6.893 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 140.814 </td>
   <td style="text-align:right;"> 0.010 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Item:HPP </td>
   <td style="text-align:right;"> 11.992 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 100.000 </td>
   <td style="text-align:right;"> 0.001 </td>
  </tr>
</tbody>
</table>

---
### Random intercepts by participant: Statistical inference

---
class: section, center, middle

# Some additional points

---
## Power analysis in Mixed Models

LMM involved the estimation of many parameters.

To perform power analysis, we need to fix some of these parameters

It's difficult to come up with **sensible parameters** *a priori*

* How much variability am I expecting for each random effect?
* How much correlation am I expecting between random efects?

---
## Power analysis in Mixed Models

Also, estimating power analytically (mapping expected parameters onto known distributions) is not straightforward in LMM.

It's not quite clear what parameters we are referring to.

Alternative: shift to simulation-based power analysis. Already available for non-mixed models (e.g., ANOVA). See Lakens and Caldwell ([2019](https://psyarxiv.com/baxsf/)) and it's accompanying R package, `ANOVApower`.

Some methods are already available for mixed-models:

* Zhang and Wang ([2009](https://link.springer.com/article/10.3758/BRM.41.4.1083)): Includes non-linear models, but it's implemented in commercial software (SAS)
* DeBruine and Barr (2019)

---
## When our model fails to converge

Sometimes, our model can't figure out **what parameters are most likely** given the data.

This can be because **different values** of the same coefficient are **equally likely**.

How to avoid this:

* The **larger** the data, the easier to converge.
* The larger values of the coefficients the more difficult to converge. Consider:
- Changing units of measurement: use **seconds** instead of milliseconds (Barr, 2008).
- **Standardising** your predictors.
* Don't get too fancy with your model: the more **parsimonious**, the better (less parameters to estimate).

---
## When our model is overly complex: singular fit

See Bates et al. (2015)