In many cases, the data points that we collect may be related in some way, thus violating the assumption of independence present in linear regression, generalised linear models, etc.
Sometimes these relationships will be direct and obvious. For example, we could take four soil samples from the same forest plot, or many different measurements from the same individual sparrow; or interview lots of folks in the same college or organisation.
Or, the individual measurements may be related in a slightly more distant way. For example, we could measure the same tree every year, or interview the same people every year; or take measurements from many trees in the same plot.
Finally, the data points could be related even more distantly, for example, if we take multiple measurements from the same species, or from closely-related species, or from closely-related individuals.
In all these examples, the relatedness (in time, space, or ancestry) means that the measurements from one individual are more likely to be similar to measurements from closely-related individuals. E.g., trees growing close together in a forest are likely to experience similar growing conditions, individuals from the same species are likely to have similar trait values, and so on.
Previously, there were only a limited number of options to resolve these issues of non-independence. The main way would be to average across the grouping variable that was likely having an effect (e.g., averaging the four soil samples, taking average tree growth per plot) or by only using a subset of all the data you had collected.
Both of these approaches are valid. However, using a mixed effects model gives you more flexibility and allows you to use all the data.
So, instead of averaging over these grouping variables (losing data and information) or treating everything as independent (violating assumptions), we can treat these grouping variables as random effects.
The model essentially creates a different baseline value for each level of the random effect (e.g., plot 1, 2, 3, … n), modelling a different intercept for each level.
We can then model the variables—the fixed effects—that we are interested in, accounting for the various random effects that we need to account for, but are not explicitly interested in.
For example, we know that seedling survival varies among plots, because we know that many of the things that affect survival vary in space: a tree falls over and kills all the seedlings underneath, a patch of soil has high nitrogen, etc.
Examine the top few rows of the seedling dataset loaded with this lesson.
head(seedling)
## PLOT SPECIES HEIGHT STATUS LIGHT CONS CONBA
## 1 1 PSYBRA 52 1 0.205 1 18.0164841
## 2 1 PIPGLA 40 1 0.205 1 0.8992024
## 3 1 PIPGLA 33 0 0.205 1 0.8992024
## 4 1 ANDINE 41 0 0.205 0 0.0000000
## 5 1 PREMON 90 1 0.205 6 3633.1026780
## 6 1 SLOBER 130 1 0.205 1 67.5457343
Here we have a dataset with seven columns: $PLOT, the 1 m^2 plot in which the seedling is located; $SPECIES, a 6-letter code identifying the species; $HEIGHT, the height of the seedling in mm; $STATUS, whether the seedling is alive (1) or dead (0); $LIGHT, a measure of the light availability in each plot; $CONS, the density of conspecific seedlings in the same plot; and $CONBA, the sum of the basal area of all conspecific trees >1 cm diameter.
We might expect survival to be linked to these variables as follows. Taller seedlings with access to more light should survive better, but seedlings with more conspecifics around them might suffer lower survival, because pests and pathogens can build up in areas of high density (i.e., negative density dependence).
Let’s explore the data a bit more. Return the number of seedlings that lived and died.
table(seedling$STATUS)
##
## 0 1
## 2718 1967
Make a histogram of the number of conspecific seedlings.
hist(seedling$CONS)
Make a histogram of seedling height.
hist(seedling$HEIGHT)
Make a histogram of light availability.
hist(seedling$LIGHT)
Note that our predictors have very different ranges, so if you want to directly compare the effects or test for interactions, you should rescale them by subtracting the mean and dividing by the standard deviation.
We could do that manually (e.g., seedling$LIGHTadj <- (seedling$LIGHT - mean(seedling$LIGHT)) / sd(seedling$LIGHT)), or use the scale() function. Use scale() to standardize light, height, and conspecifics, starting with light. Call the new column LIGHTadj.
seedling$LIGHTadj <- scale(seedling$LIGHT)
Now center and rescale $HEIGHT.
seedling$HEIGHTadj <- scale(seedling$HEIGHT)
Now center and rescale $CONS.
seedling$CONSadj <- scale(seedling$CONS)
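As a quick sanity check on what scale() does, here is a toy sketch (the vector x is made up for illustration) showing that it matches the manual centre-and-divide recipe:

```r
# Sketch: scale() is equivalent to centring and dividing by the SD
x <- c(2, 4, 6, 8, 10)              # toy data, invented for illustration
manual <- (x - mean(x)) / sd(x)     # the manual recipe from above
auto   <- as.numeric(scale(x))      # drop the matrix attributes scale() returns
all.equal(manual, auto)             # TRUE
mean(auto)                          # ~0 after centring
sd(auto)                            # ~1 after rescaling
```

Note that scale() returns a one-column matrix with attributes recording the centre and scale used; as.numeric() strips these if you just want a plain vector.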
Double-check the data, using head().
head(seedling)
## PLOT SPECIES HEIGHT STATUS LIGHT CONS CONBA LIGHTadj HEIGHTadj
## 1 1 PSYBRA 52 1 0.205 1 18.0164841 -0.9953247 0.7204926
## 2 1 PIPGLA 40 1 0.205 1 0.8992024 -0.9953247 0.3469580
## 3 1 PIPGLA 33 0 0.205 1 0.8992024 -0.9953247 0.1290628
## 4 1 ANDINE 41 0 0.205 0 0.0000000 -0.9953247 0.3780858
## 5 1 PREMON 90 1 0.205 6 3633.1026780 -0.9953247 1.9033521
## 6 1 SLOBER 130 1 0.205 1 67.5457343 -0.9953247 3.1484674
## CONSadj
## 1 -0.6056147
## 2 -0.6056147
## 3 -0.6056147
## 4 -0.6191610
## 5 -0.5378831
## 6 -0.6056147
Let’s construct a simple binomial glm of seedling survival as a function of height, light, and seedling conspecific density. Call this model m0.
m0 <- glm(STATUS ~ HEIGHTadj + LIGHTadj + CONSadj, data = seedling, family = binomial)
Look at the summary of this model.
summary(m0)
##
## Call:
## glm(formula = STATUS ~ HEIGHTadj + LIGHTadj + CONSadj, family = binomial,
## data = seedling)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1052 -1.0299 -0.7141 1.2620 1.9176
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32368 0.03118 -10.380 < 2e-16 ***
## HEIGHTadj 0.56452 0.04656 12.125 < 2e-16 ***
## LIGHTadj -0.14061 0.03189 -4.409 1.04e-05 ***
## CONSadj -0.31647 0.03501 -9.039 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6373.9 on 4684 degrees of freedom
## Residual deviance: 6009.7 on 4681 degrees of freedom
## AIC: 6017.7
##
## Number of Fisher Scoring iterations: 4
It appears that all three variables have a significant effect. Taller seedlings survive better, seedlings in higher light actually survive worse, as do those with more conspecifics in the same plot.
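Because the model is on the logit (log-odds) scale, the raw coefficients are hard to read directly. A small sketch of back-transforming the estimates printed above (the numbers are copied from summary(m0)):

```r
# Back-transform logit-scale estimates from summary(m0) above
b_height    <- 0.56452    # HEIGHTadj estimate
b_intercept <- -0.32368   # intercept estimate
exp(b_height)          # odds ratio: the odds of survival multiply by ~1.76
                       # for each 1 SD increase in height
plogis(b_intercept)    # inverse logit: ~0.42 predicted survival probability
                       # for an average seedling (all predictors at their means)
```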
However, we know that this simple model is likely to be wrong, because we are not yet accounting for the relationships between individual seedlings. Those in the same plot will likely be affected by the same things (e.g., tree falls), and different species will likely have different inherent survival rates (e.g., there is a clear spectrum of life history strategy from live fast-die young to live slow-die old).
To include random effects in the model, we need a package that implements this kind of model. There are a few to choose from; we will use lme4, loaded with this lesson.
To specify a mixed effects model, we follow the same syntax as for a regular linear model, except that we add the random effects in parentheses, as follows: + (1|RANDOM), where RANDOM is the variable that you want to set as your random effect.
For example, we know that $PLOT should likely be one random grouping variable, so we would use + (1|PLOT).
Specifically, this syntax means that we want the intercept of the model to vary among plots, i.e., we think that each plot has a (slightly) different mean survival rate.
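As a concrete sketch of what "a different intercept for each plot" means, here is simulated data (all names and values are invented for illustration) generated exactly the way a random-intercept model assumes: each plot gets its own baseline drawn from a common normal distribution, plus a shared height effect.

```r
# Sketch: the data-generating process a random intercept assumes
set.seed(42)
n_plots  <- 10
per_plot <- 20
plot_id  <- rep(seq_len(n_plots), each = per_plot)
# each plot gets its own baseline (intercept), drawn from one normal distribution
plot_baseline <- rnorm(n_plots, mean = 0, sd = 0.8)
# survival on the logit scale: shared height slope, plot-specific intercept
height <- rnorm(n_plots * per_plot)
eta    <- plot_baseline[plot_id] + 0.6 * height
status <- rbinom(length(eta), size = 1, prob = plogis(eta))
# plots differ in mean survival, even at the same height:
round(tapply(status, plot_id, mean), 2)
```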
Using glmer(), the function for a generalised linear mixed effects model, add a random effect for plot to our previous model, and call it m1.
m1 <- glmer(STATUS ~ HEIGHTadj + LIGHTadj + CONSadj + (1|PLOT), data = seedling, family = binomial)
Look at the summary of this model.
summary(m1)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: STATUS ~ HEIGHTadj + LIGHTadj + CONSadj + (1 | PLOT)
## Data: seedling
##
## AIC BIC logLik deviance df.resid
## 5648.6 5680.9 -2819.3 5638.6 4680
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -11.6586 -0.7219 -0.4321 0.8487 3.7094
##
## Random effects:
## Groups Name Variance Std.Dev.
## PLOT (Intercept) 0.6225 0.789
## Number of obs: 4685, groups: PLOT, 148
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.29735 0.08480 -3.506 0.000454 ***
## HEIGHTadj 0.65030 0.05389 12.067 < 2e-16 ***
## LIGHTadj -0.12066 0.07895 -1.528 0.126464
## CONSadj -0.10955 0.07512 -1.458 0.144738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) HEIGHT LIGHTd
## HEIGHTadj 0.037
## LIGHTadj 0.121 -0.111
## CONSadj 0.366 0.082 -0.047
For lmer() and glmer(), the output is similar to that for lm() and glm(), except that we have two new sections.
First, as usual, we have what kind of model was run and the formula. Then, some measures of fit and the residuals.
Then, the Random effects: section describes how much of the variation in survival is explained by $PLOT. Plots vary quite a bit relative to the fixed effects: the standard deviation of the plot intercepts (0.79) is similar in magnitude to the largest fixed effect, that of height (0.65).
Then we have the usual table of coefficients and p-values (called Fixed effects:). This table suggests that there is no effect of light or conspecifics once we account for variation among plots. Never mind all those papers that provide evidence for strong and pervasive negative density dependence. Oh well.
Don’t worry; we will come back to this model in the lab.
Finally, we have the Correlation of Fixed Effects: section. This is not usually that useful, because it does not mean what you might think: it is not about the correlation of the predictor variables, but about the expected correlation of the estimated regression coefficients. Roughly, if the study were repeated many times, pairs of coefficients with a strong correlation here would tend to deviate from their true values together. In most circumstances it can safely be ignored.
Ok, moving on …
We just worked through an example with a binary response variable. You will notice that the output reported p-values much like a non-mixed model: for generalised models, glmer() uses z-statistics (Wald tests), which do not require degrees of freedom.
The situation for linear mixed effects models is not so straightforward: the function lmer() does not return p-values (if you are interested in those …), because of uncertainties in calculating the appropriate degrees of freedom in a model with random effects. Statistics is a developing science, like any other, and mixed effects models are right at the frontier.
For more details and the latest recommendations, see this website: http://glmm.wikidot.com/
Ok, let’s consider a problem where we want to understand how tree size affects the number of fruit that tree produces.
Easy? It would be, but we have three years of data for each individual tree … any analysis that does not account for this repeated measures set-up will not be entirely correct.
As before, let’s look at the data and some plots before we start modelling.
We have a dataset called fruit
loaded with the lesson. Look at the top 6 rows of data.
head(fruit)
## TAG DBH FRUIT YEAR
## 1 86 454 45 1
## 2 86 454 662 2
## 3 101 369 6 1
## 4 101 369 7 2
## 5 155 450 700 1
## 6 155 450 225 2
We have four columns of data: $TAG, the unique tag number of the tree; $DBH, the diameter at breast height of the tree (mm); $FRUIT, the number of fruit counted in each tree; and $YEAR, the year.
Look at the structure of the data to get a better idea of how many trees and years there are, and to check that every column is the correct class.
str(fruit)
## 'data.frame': 428 obs. of 4 variables:
## $ TAG : int 86 86 101 101 155 155 155 160 160 192 ...
## $ DBH : int 454 454 369 369 450 450 450 413 413 486 ...
## $ FRUIT: int 45 662 6 7 700 225 96 67 43 61 ...
## $ YEAR : int 1 2 1 2 1 2 3 1 3 1 ...
Now use the summary()
function to return some simple statistics about the data.
summary(fruit)
## TAG DBH FRUIT YEAR
## Min. : 86 Min. :207.0 Min. : 1.0 Min. :1.000
## 1st Qu.: 5125 1st Qu.:406.2 1st Qu.: 24.0 1st Qu.:1.000
## Median : 7080 Median :499.0 Median : 85.0 Median :2.000
## Mean : 50884 Mean :514.1 Mean : 290.1 Mean :1.942
## 3rd Qu.: 50975 3rd Qu.:620.8 3rd Qu.: 251.0 3rd Qu.:3.000
## Max. :409554 Max. :916.0 Max. :9190.0 Max. :3.000
Now look at the distribution of the fruit data. Plot a histogram of $FRUIT
.
hist(fruit$FRUIT)
The distribution is definitely skewed, with lots of low values. One option, given that we have count data, would be to run a glm with Poisson errors. However, we want to illustrate the use of lmer(), and in the ‘good ol’ days’ a regular log-transformation would do just fine. The log transform is also one of the easier-to-interpret transformations out there.
So, plot another histogram of log(fruit).
hist(log(fruit$FRUIT))
That looks much better… so, let’s try that in our model. First, we need to make a new column, $LOGFRUIT. In the fruit data, make this column the log of $FRUIT.
fruit$LOGFRUIT <- log(fruit$FRUIT)
Whoa! Wait a second … what if there are years with no (0) fruit? What would that do? Check out what the log of 0 is.
log(0)
## [1] -Inf
So the log of 0 is negative infinity… not a number that we can deal with all that easily. Check that there are no 0’s in the $FRUIT column.
table(fruit$FRUIT < 1)
##
## FALSE
## 428
Phew, we do not have any. However, remember that a way to deal with 0’s if you ever need to log-transform data is to add 1 (+1) to the data before you transform it.
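A short sketch of the zero problem and the add-one fix (base R also provides log1p(), which computes log(x + 1) directly and accurately for small x):

```r
# Sketch: handling zeros before a log transform
log(0)        # -Inf: zero is undefined on the log scale
log(0 + 1)    # 0: adding 1 first keeps zeros finite
log1p(0)      # same thing, via the built-in log1p()
all.equal(log1p(0.5), log(0.5 + 1))   # TRUE
```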
Ok, so now let’s finally get round to plotting the relationship we are interested in and are going to model. Plot $LOGFRUIT as a function of $DBH.
plot(LOGFRUIT ~ DBH, data = fruit)
We also have three years of data. Can you colour each point as a function of $YEAR?
plot(LOGFRUIT ~ DBH, data = fruit, col = YEAR)
It looks like there are certainly differences in fruit production among years. We might want to include that explicitly in the model.
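One quick way to put numbers on those year differences is to compare the mean log fruit per year. A sketch with made-up values (with the real data you would pass fruit$LOGFRUIT and fruit$YEAR to tapply()):

```r
# Sketch: group means by year (toy values, invented for illustration)
logfruit <- c(3.1, 4.0, 2.2, 4.8, 5.5, 3.9, 1.8, 2.4, 2.0)
year     <- c(1,   1,   1,   2,   2,   2,   3,   3,   3)
tapply(logfruit, year, mean)   # mean log fruit per year
```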
However, $YEAR in the data is numeric, but we actually want it to be a factor. Create a new column in the data called $YEARFAC, which is $YEAR as a factor.
fruit$YEARFAC <- as.factor(fruit$YEAR)
Now let’s run a regular linear regression on these data. Use lm() to model log fruit as a function of DBH and year, and call this model m2.
m2 <- lm(LOGFRUIT ~ DBH + YEARFAC, data = fruit)
Look at the summary output of m2.
summary(m2)
##
## Call:
## lm(formula = LOGFRUIT ~ DBH + YEARFAC, data = fruit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6020 -1.0561 0.0375 1.1837 4.9041
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2755122 0.2727943 8.341 1.04e-15 ***
## DBH 0.0039720 0.0004772 8.324 1.18e-15 ***
## YEARFAC2 0.6462962 0.1834929 3.522 0.000474 ***
## YEARFAC3 -0.5267316 0.1804221 -2.919 0.003694 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.554 on 424 degrees of freedom
## Multiple R-squared: 0.2062, Adjusted R-squared: 0.2006
## F-statistic: 36.71 on 3 and 424 DF, p-value: < 2.2e-16
We actually have the same individuals measured in each year (i.e., repeated measures). We can account for this by including individual (i.e., $TAG) as a random effect within a linear mixed effects model.
We used glmer() to run a generalised mixed effects model on the binomial data. We can use lmer(), from the same lme4 package, to run a linear mixed effects model on the fruit data.
However, lme4 does not calculate p-values, because the package authors (eminently sensible people) do not think that there is a reliable way to do this: there are various issues with obtaining the degrees of freedom, and thence the F and p-values.
Others (also sensible people) disagree, and have developed various approaches to approximate the degrees of freedom and calculate p-values. One of these is available through the lmer() function in the package lmerTest.
When we loaded the lme4 and lmerTest packages at the beginning of the lesson, the lmer() function in lmerTest masked the function of the same name in lme4.
Anyway, now you can write the model. We construct it in a similar way to before: include + (1|TAG) in the formula to represent the random effect of individual tree. Call this model m3.
m3 <- lmer(LOGFRUIT ~ DBH + YEARFAC + (1|TAG), data = fruit)
The summary output is similar to that for glmer(). Have a look.
summary(m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: LOGFRUIT ~ DBH + YEARFAC + (1 | TAG)
## Data: fruit
##
## REML criterion at convergence: 1568.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.10042 -0.60147 0.05388 0.64267 2.32086
##
## Random effects:
## Groups Name Variance Std.Dev.
## TAG (Intercept) 0.9051 0.9513
## Residual 1.5399 1.2409
## Number of obs: 428, groups: TAG, 179
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 2.1300736 0.3251750 6.551
## DBH 0.0041716 0.0005979 6.977
## YEARFAC2 0.5415140 0.1499008 3.612
## YEARFAC3 -0.5668612 0.1485199 -3.817
##
## Correlation of Fixed Effects:
## (Intr) DBH YEARFAC2
## DBH -0.927
## YEARFAC2 -0.180 -0.020
## YEARFAC3 -0.213 0.009 0.446
We have the model formula, residuals, and random effects (showing that individual trees vary considerably in their baseline fruit production, and that there is also a lot of unexplained year-to-year variation) …
… the fixed effect coefficient estimates: the coefficients are on the log scale, so each 1 mm increase in DBH increased log fruit count by about 0.004; equivalently, each extra mm multiplies the expected fruit count by exp(0.0042) ≈ 1.004, i.e., about a 0.4% increase.
Year two had generally more fruit than year one, and year three generally fewer …
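Because the response is log fruit, exponentiating a coefficient gives a multiplicative effect on the original count scale. A sketch using the estimates printed in summary(m3) above:

```r
# Back-transform log-scale fixed effects from summary(m3) above
b_dbh   <- 0.0041716   # DBH estimate
b_year2 <- 0.5415140   # YEARFAC2 estimate
b_year3 <- -0.5668612  # YEARFAC3 estimate
exp(b_dbh)         # ~1.004: each extra mm of DBH gives ~0.4% more fruit
exp(100 * b_dbh)   # ~1.52: a tree 100 mm larger produces ~52% more fruit
exp(b_year2)       # ~1.72: year 2 had ~72% more fruit than year 1
exp(b_year3)       # ~0.57: year 3 had ~43% less fruit than year 1
```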
Great! Now you know more than most people about mixed effects models… but a lot less than you really should if you are interested in working with them ;)
So, please do read more on the lme4 FAQ page: http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html, the GLMM wiki: http://glmm.wikidot.com/, and the references therein.
And finally: Great work on completing FES720! You are a total R wizard!
Please submit the log of this lesson to Google Forms so that Simon may evaluate your progress.
That’s all folks, thanks for the ride!