
Mixed Effects Models

In many cases, the data points that we collect may be related in some way, thus violating the assumption of independence present in linear regression, generalised linear models, etc.

Sometimes these relationships will be direct and obvious. For example, we could take four soil samples from the same forest plot, or many different measurements from the same individual sparrow; or interview lots of folks in the same college or organisation.

Or, the individual measurements may be related in a slightly more distant way. For example, we could measure the same tree every year, or interview the same people every year; or take measurements from many trees in the same plot.

Finally, the data points could be related even more distantly, for example, if we take multiple measurements from the same species, or from closely-related species, or from closely-related individuals.

In all these examples, the relatedness (in time, space, or direct relatedness) means that the measurements from one individual are more likely to be similar to measurements from closely-related individuals. E.g., trees growing close together in a forest are likely to experience similar growing conditions; individuals from the same species are likely to have similar trait values; and so on.

Previously, there were only a few options for dealing with this non-independence. The main one was to average across the grouping variable that was likely having an effect (e.g., averaging the four soil samples, or taking the average tree growth per plot), or to analyse only a subset of all the data you had collected.

Both of these approaches are valid. However, using a mixed effects model gives you more flexibility and allows you to use all the data.

So, instead of averaging over these grouping variables (losing data and information) or treating everything as independent (violating assumptions), we can treat these grouping variables as random effects.

The model essentially creates a different baseline value for each level of the random effect (e.g., plot 1, 2, 3, … n), modelling a different intercept for each level.
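To make that concrete, here is a minimal simulated sketch (hypothetical data, not from this lesson) of what a random intercept means: each plot gets its own baseline, drawn from one shared normal distribution.

set.seed(42)
n_plot <- 5
baseline <- rnorm(n_plot, mean = 0, sd = 1)      # one intercept per plot
plot_id <- rep(seq_len(n_plot), each = 20)
x <- runif(100)
y <- baseline[plot_id] + 2 * x + rnorm(100, sd = 0.5)
plot(x, y, col = plot_id)                        # parallel clouds, one per plot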

We can then model the variables—the fixed effects—that we are interested in, accounting for the various random effects that we need to account for, but are not explicitly interested in.

For example, we know that seedling survival varies among plots, because many of the things that affect survival vary in space: a tree falls over and kills all the seedlings underneath, a patch of soil has high nitrogen, etc.

Generalized Linear Mixed Effects Model: Example using seedling survival data

Exploring the data

Examine the top few rows of the seedling dataset loaded with this lesson.

head(seedling)
##   PLOT SPECIES HEIGHT STATUS LIGHT CONS        CONBA
## 1    1  PSYBRA     52      1 0.205    1   18.0164841
## 2    1  PIPGLA     40      1 0.205    1    0.8992024
## 3    1  PIPGLA     33      0 0.205    1    0.8992024
## 4    1  ANDINE     41      0 0.205    0    0.0000000
## 5    1  PREMON     90      1 0.205    6 3633.1026780
## 6    1  SLOBER    130      1 0.205    1   67.5457343

Here we have a dataset with seven columns: $PLOT, the 1 m^2 plot in which the seedling is located; $SPECIES, a six-letter code identifying the species; $HEIGHT, the height of the seedling in mm; $STATUS, whether the seedling is alive (1) or dead (0); $LIGHT, a measure of the light availability for each plot; $CONS, the density of conspecific seedlings in the same plot; and $CONBA, the summed basal area of all conspecific trees >1 cm in diameter.

We might expect survival to be linked to these variables as follows: taller seedlings, and seedlings with access to more light, should survive better, but seedlings with more conspecifics around them might suffer lower survival, because pests and pathogens can build up in areas of high density (i.e., negative density dependence).

Let’s explore the data a bit more. Return the number of seedlings that lived and died.

table(seedling$STATUS)
## 
##    0    1 
## 2718 1967

Make a histogram of the number of conspecific seedlings.

hist(seedling$CONS)

Make a histogram of seedling height.

hist(seedling$HEIGHT)

Make a histogram of light availability.

hist(seedling$LIGHT)

Note that our predictors have very different ranges, so if you want to directly compare their effects or test for interactions, you should rescale them by subtracting the mean and dividing by the standard deviation.

We could do that manually (e.g., seedling$LIGHTadj <- (seedling$LIGHT - mean(seedling$LIGHT)) / sd(seedling$LIGHT)), or use the scale() function. Use scale() to standardize light, height, and conspecific density, starting with light. Call the new column LIGHTadj.

seedling$LIGHTadj <- scale(seedling$LIGHT)

Now center and rescale $HEIGHT.

seedling$HEIGHTadj <- scale(seedling$HEIGHT)

Now center and rescale $CONS.

seedling$CONSadj <- scale(seedling$CONS)
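As an aside, scale() returns a one-column matrix that stores the centering values as attributes, which is handy if you ever need to back-transform:

attr(seedling$LIGHTadj, "scaled:center")   # the mean that was subtracted
attr(seedling$LIGHTadj, "scaled:scale")    # the sd that was divided by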

Double-check the data, using head().

head(seedling)
##   PLOT SPECIES HEIGHT STATUS LIGHT CONS        CONBA   LIGHTadj HEIGHTadj
## 1    1  PSYBRA     52      1 0.205    1   18.0164841 -0.9953247 0.7204926
## 2    1  PIPGLA     40      1 0.205    1    0.8992024 -0.9953247 0.3469580
## 3    1  PIPGLA     33      0 0.205    1    0.8992024 -0.9953247 0.1290628
## 4    1  ANDINE     41      0 0.205    0    0.0000000 -0.9953247 0.3780858
## 5    1  PREMON     90      1 0.205    6 3633.1026780 -0.9953247 1.9033521
## 6    1  SLOBER    130      1 0.205    1   67.5457343 -0.9953247 3.1484674
##      CONSadj
## 1 -0.6056147
## 2 -0.6056147
## 3 -0.6056147
## 4 -0.6191610
## 5 -0.5378831
## 6 -0.6056147

Let’s construct a simple binomial GLM of seedling survival as a function of the rescaled height, light, and conspecific seedling density. Call this model m0.

m0 <- glm(STATUS ~ HEIGHTadj + LIGHTadj + CONSadj, data = seedling, family = binomial)

Look at the summary of this model.

summary(m0)
## 
## Call:
## glm(formula = STATUS ~ HEIGHTadj + LIGHTadj + CONSadj, family = binomial, 
##     data = seedling)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1052  -1.0299  -0.7141   1.2620   1.9176  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.32368    0.03118 -10.380  < 2e-16 ***
## HEIGHTadj    0.56452    0.04656  12.125  < 2e-16 ***
## LIGHTadj    -0.14061    0.03189  -4.409 1.04e-05 ***
## CONSadj     -0.31647    0.03501  -9.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6373.9  on 4684  degrees of freedom
## Residual deviance: 6009.7  on 4681  degrees of freedom
## AIC: 6017.7
## 
## Number of Fisher Scoring iterations: 4

It appears that all three variables have a significant effect: taller seedlings survive better, while seedlings in higher light actually survive worse, as do those with more conspecifics in the same plot.
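These coefficients are on the logit scale. A quick way to make them more interpretable is to exponentiate them into odds ratios; for example, exp(0.56) ≈ 1.76, so one standard deviation of extra height multiplies the odds of survival by about 1.76.

exp(coef(m0))   # intercept: baseline odds; slopes: odds ratios per 1 sd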

However, we know that this simple model is likely to be wrong, because we are not yet accounting for the relationships between individual seedlings. Those in the same plot will likely be affected by the same things (e.g., tree falls), and different species will likely have different inherent survival rates (e.g., there is a clear spectrum of life-history strategies, from live fast-die young to live slow-die old).

To include random effects in the model, we need to use a new package that includes this kind of model. There are a few that do so. We will use the package lme4, loaded with this lesson.

To specify a mixed effects model, we use the same syntax as for a regular linear model, except that we add the random effects in parentheses, as follows: + (1|RANDOM), where RANDOM is the variable that you want to treat as a random effect.

For example, we know that $PLOT should likely be one random grouping variable, so we would use + (1|PLOT).

Specifically, this syntax means that we want the intercept of the model to vary among plots, i.e., we think that each plot has a (slightly) different mean survival rate.
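In symbols (a sketch, writing u_j for the offset of plot j and p_ij for the survival probability of seedling i in plot j):

logit(p_ij) = (b0 + u_j) + b1 * HEIGHTadj_ij + b2 * LIGHTadj_ij + b3 * CONSadj_ij,   where u_j ~ Normal(0, sigma^2_plot)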

Using the function for a generalised linear mixed effects model, glmer(), add in a random effect for plot to our previous model, and call it m1.

m1 <- glmer(STATUS ~ HEIGHTadj + LIGHTadj + CONSadj + (1|PLOT), data = seedling, family = binomial)

Look at the summary of this model.

summary(m1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: STATUS ~ HEIGHTadj + LIGHTadj + CONSadj + (1 | PLOT)
##    Data: seedling
## 
##      AIC      BIC   logLik deviance df.resid 
##   5648.6   5680.9  -2819.3   5638.6     4680 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -11.6586  -0.7219  -0.4321   0.8487   3.7094 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  PLOT   (Intercept) 0.6225   0.789   
## Number of obs: 4685, groups:  PLOT, 148
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.29735    0.08480  -3.506 0.000454 ***
## HEIGHTadj    0.65030    0.05389  12.067  < 2e-16 ***
## LIGHTadj    -0.12066    0.07895  -1.528 0.126464    
## CONSadj     -0.10955    0.07512  -1.458 0.144738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) HEIGHT LIGHTd
## HEIGHTadj  0.037              
## LIGHTadj   0.121 -0.111       
## CONSadj    0.366  0.082 -0.047

For lmer() and glmer(), the output is similar to that for lm() and glm(), except that we have two new sections.

First, as usual, we have what kind of model was run and the formula. Then, some measures of fit and the residuals.

Then, the Random effects: section describes how much of the variation in survival is explained by $PLOT. Here, plots vary quite a bit relative to the fixed effects: the standard deviation of the plot intercepts (0.789) is similar in size to the coefficient for height.
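If you want to extract just these variance components, lme4 provides the VarCorr() function:

VarCorr(m1)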

Then we have the usual table of coefficients and p-values (labelled Fixed effects:). This table suggests that there is no effect of light or conspecifics once we account for variation among plots. Never mind all those papers that provide evidence for strong and pervasive negative density dependence. Oh well.

Don’t worry … we will come back to this model in the lab.
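In the meantime, one quick way to see how much the plot effect matters is to compare the two models’ AIC values (lower is better); from the outputs above, m0 sits at 6017.7 and m1 at 5648.6.

AIC(m0, m1)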

Finally, we have the Correlation of Fixed Effects: section. This is not usually that useful, because it does not mean what you might think: it is not the correlation of the predictor variables, but the expected correlation of the estimated regression coefficients. In other words, if the study were repeated many times, coefficients with a strong positive correlation here would tend to err in the same direction. As I said, not that useful in most circumstances, and it can safely be ignored.

Ok, moving on …

Linear Mixed Effects Model: Example using fruit production data with repeated measures

We just worked through an example with a binary response variable. For that kind of model, obtaining p-values is relatively straightforward (they are based on z values), and you will have noticed that the output looked essentially the same as the output from non-mixed effects models.

The situation for linear mixed effects models is not so straightforward, as the function lmer() does not return p-values (if you are interested in those …).

lmer() does not return p-values because of uncertainties in calculating the appropriate degrees of freedom in a model with random effects. Statistics is a developing science, like any other, and mixed effect models are right at the frontier.

For more details and the latest recommendations, see this website: http://glmm.wikidot.com/

Ok, let’s consider a problem where we want to understand how tree size affects the number of fruit that tree produces.

Easy? It would be, but we have up to three years of data for each individual tree … any analysis that does not account for this repeated-measures set-up will not be entirely correct.

As before, let’s look at the data and some plots before we start modelling.

We have a dataset called fruit loaded with the lesson. Look at the top 6 rows of data.

head(fruit)
##   TAG DBH FRUIT YEAR
## 1  86 454    45    1
## 2  86 454   662    2
## 3 101 369     6    1
## 4 101 369     7    2
## 5 155 450   700    1
## 6 155 450   225    2

We have four columns of data: $TAG is the unique tag number of the tree; $DBH is the diameter at breast height of the tree (mm); $FRUIT is the number of fruit counted in each tree; and $YEAR is the year.

Look at the structure of the data to get a better idea of how many trees and years we have, and to check that every column is the correct class.

str(fruit)
## 'data.frame':    428 obs. of  4 variables:
##  $ TAG  : int  86 86 101 101 155 155 155 160 160 192 ...
##  $ DBH  : int  454 454 369 369 450 450 450 413 413 486 ...
##  $ FRUIT: int  45 662 6 7 700 225 96 67 43 61 ...
##  $ YEAR : int  1 2 1 2 1 2 3 1 3 1 ...

Now use the summary() function to return some simple statistics about the data.

summary(fruit)
##       TAG              DBH            FRUIT             YEAR      
##  Min.   :    86   Min.   :207.0   Min.   :   1.0   Min.   :1.000  
##  1st Qu.:  5125   1st Qu.:406.2   1st Qu.:  24.0   1st Qu.:1.000  
##  Median :  7080   Median :499.0   Median :  85.0   Median :2.000  
##  Mean   : 50884   Mean   :514.1   Mean   : 290.1   Mean   :1.942  
##  3rd Qu.: 50975   3rd Qu.:620.8   3rd Qu.: 251.0   3rd Qu.:3.000  
##  Max.   :409554   Max.   :916.0   Max.   :9190.0   Max.   :3.000

Now look at the distribution of the fruit data. Plot a histogram of $FRUIT.

hist(fruit$FRUIT)

The distribution is definitely skewed, with lots of low values. One option, given that we have count data, would be to run a GLM with Poisson errors. However, we want to illustrate the use of lmer(), and in the ‘good ol’ days’ a regular log transformation would do just fine. The log transform is also one of the easier-to-interpret transformations out there.
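For reference, the Poisson version would look something like this (a sketch only, with a hypothetical name m_pois; we stick with the log transform below):

m_pois <- glm(FRUIT ~ DBH, family = poisson, data = fruit)   # not used further here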

So, plot another histogram, this time of log($FRUIT).

hist(log(fruit$FRUIT))

That looks much better … so, let’s try that in our model. First, we need to make a new column, $LOGFRUIT. In the fruit data, make this column the log of $FRUIT.

fruit$LOGFRUIT <- log(fruit$FRUIT)

Whoa! Wait a second … what if there are years with no (0) fruit? What would that do? Check out what the log of 0 is.

log(0)
## [1] -Inf

So the log of 0 is negative infinity, not a number that we can deal with all that easily. Check that there are no 0s in the $FRUIT column.

table(fruit$FRUIT < 1)
## 
## FALSE 
##   428

Phew, we do not have any. However, remember that one way to deal with 0s, if you ever need to log-transform data, is to add 1 to the data before you transform it (i.e., log(x + 1)).
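Base R even has a convenience function for this trick: log1p(x) computes log(1 + x), and is more accurate than log(x + 1) for very small x.

log1p(0)   # = log(0 + 1) = 0, a finite number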

Ok, so now let’s finally get round to plotting the relationship we are interested in and going to model. Plot $LOGFRUIT as a function of $DBH.

plot(LOGFRUIT ~ DBH, data = fruit)

We also have three years of data. Can you colour each point as a function of $YEAR?

plot(LOGFRUIT ~ DBH, data = fruit, col = YEAR)

It looks like there are certainly differences in fruit production among years. We might want to include that explicitly in the model.

However, $YEAR in the data is numeric, but we actually want it to be a factor. Create a new column in the data called $YEARFAC which is $YEAR as a factor.

fruit$YEARFAC <- as.factor(fruit$YEAR)

Now let’s run a regular linear regression on these data. Use lm() to model log fruit as a function of DBH and year; call it m2.

m2 <- lm(LOGFRUIT ~ DBH + YEARFAC, data = fruit)

Look at the summary output of m2.

summary(m2)
## 
## Call:
## lm(formula = LOGFRUIT ~ DBH + YEARFAC, data = fruit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6020 -1.0561  0.0375  1.1837  4.9041 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.2755122  0.2727943   8.341 1.04e-15 ***
## DBH          0.0039720  0.0004772   8.324 1.18e-15 ***
## YEARFAC2     0.6462962  0.1834929   3.522 0.000474 ***
## YEARFAC3    -0.5267316  0.1804221  -2.919 0.003694 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.554 on 424 degrees of freedom
## Multiple R-squared:  0.2062, Adjusted R-squared:  0.2006 
## F-statistic: 36.71 on 3 and 424 DF,  p-value: < 2.2e-16

We actually have the same individuals measured in each year (i.e., repeated measures). We can account for this issue by including individual (i.e., $TAG) as a random effect within a linear mixed effects model.
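You can see this repeated-measures structure directly by counting how many rows each tag contributes:

table(table(fruit$TAG))   # how many trees appear in 1, 2, or 3 years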

We used glmer() to run a generalised mixed effects model on the binomial data. We can use lmer() to run a linear mixed effects model on this fruit data. The same package, lme4, includes this function.

However, lme4 does not calculate p-values, because the package authors (eminently sensible people) do not think there is a reliable way to do so: there are various issues with obtaining the appropriate degrees of freedom, and thence the F values and p-values.

Others (also sensible people) disagree, and have developed various approaches to approximate the degrees of freedom and calculate p-values. One of these can now be done using the lmer() function in the package lmerTest.

When the lme4 and lmerTest packages are both loaded, the lmer() function in lmerTest masks the one in lme4 and adds approximate degrees of freedom and p-values to the summary. (lmerTest could not be loaded here, so the summary below comes from plain lme4 and shows t values only.)
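If you ever need to be explicit about which version you are calling, you can use the package namespace directly (assuming the relevant package is installed; the object names here are just illustrative):

# m3_test <- lmerTest::lmer(LOGFRUIT ~ DBH + YEARFAC + (1|TAG), data = fruit)
# m3_lme4 <- lme4::lmer(LOGFRUIT ~ DBH + YEARFAC + (1|TAG), data = fruit)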

Anyway, now you can write the model. We construct it in a similar way to before: include + (1|TAG) in the formula to represent the random effect of individual tree. Call this model m3.

m3 <- lmer(LOGFRUIT ~ DBH + YEARFAC + (1|TAG), data = fruit)

The summary output is similar to that for glmer(). Have a look.

summary(m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LOGFRUIT ~ DBH + YEARFAC + (1 | TAG)
##    Data: fruit
## 
## REML criterion at convergence: 1568.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -3.10042 -0.60147  0.05388  0.64267  2.32086 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  TAG      (Intercept) 0.9051   0.9513  
##  Residual             1.5399   1.2409  
## Number of obs: 428, groups:  TAG, 179
## 
## Fixed effects:
##               Estimate Std. Error t value
## (Intercept)  2.1300736  0.3251750   6.551
## DBH          0.0041716  0.0005979   6.977
## YEARFAC2     0.5415140  0.1499008   3.612
## YEARFAC3    -0.5668612  0.1485199  -3.817
## 
## Correlation of Fixed Effects:
##          (Intr) DBH    YEARFAC2
## DBH      -0.927                
## YEARFAC2 -0.180 -0.020         
## YEARFAC3 -0.213  0.009  0.446

We have the model formula, residuals, and random effects (showing that there is considerable variation among individual trees, as well as a lot of residual variation in fruit production left unexplained) …

… the fixed effect coefficient estimates: for every 1 mm increase in DBH, a tree produced about 0.004 more log fruits on average; because the response is on the log scale, this multiplies expected fruit production by exp(0.0042) ≈ 1.004, roughly a 0.4% increase per mm.
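To check that on the response scale:

exp(0.0041716)         # ≈ 1.0042, about a 0.4% increase in fruit per mm of DBH
exp(0.0041716 * 100)   # ≈ 1.52, about 52% more fruit per 10 cm of DBH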

Year two had generally more fruit than year one, and year three had generally fewer fruit than year one …

Great! Now you know more than most people about mixed effects models … but a lot less than you really should if you are interested in working with them ;)

So, please do read more on the lme4 FAQ page: http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html, and the glmm wiki: http://glmm.wikidot.com/, and the references therein.

And finally: Great work on completing FES720! You are a total R wizard!

Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.

That’s all folks, thanks for the ride!