Generalis(z)ed Linear Models

Most data are not normal

Remember our general formula:

Y ~ beta0 + beta1X1 + beta2X2 + ... + betanXn + error

Both linear models and ANOVA have the standard assumptions of independent and identically distributed (i.i.d.) normal random variables.

More explicitly, these standard assumptions are:

- independence: the observations are independent of one another
- normality: the errors are normally distributed
- identical distribution: the variance is constant across observations

The key assumptions that many response variables do not meet are those of normality and identical distribution.

Assuming that your data are normal makes the statistical modelling much easier, but normal distributions are actually uncommon despite their ubiquity in statistics courses. In the past, people would transform their response variable to try to shoehorn it into a more normal distribution (e.g., by taking the log). However, you are then modelling the transformed variable rather than the actual variable, so interpretation can be tricky.
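A minimal illustration of this interpretation problem, using simulated right-skewed data: the mean of log(y) is not the log of the mean of y, so a model fitted to log(y) does not directly predict the average of y.

set.seed(1)
y <- rlnorm(1000, meanlog = 0, sdlog = 1)  # positive, right-skewed data
mean(log(y))  # ~0: what a model of log(y) estimates
log(mean(y))  # ~0.5: the log of the quantity we usually care about, E[y]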

Many kinds of data do not meet the assumption of identical distribution because of unequal variance (i.e., the variance is not constant between treatments, among groups, or along a continuous variable). For example, in count data, where the response variable is an integer and there are often many zeros in the dataframe, the variance may increase linearly with the mean. With proportion data, where we have a count of the number of failures of an event as well as the number of successes, the variance will be an inverted U-shaped function of the mean. Where the response variable follows a gamma distribution (as in time-to-death data), the variance increases faster than linearly with the mean.
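As a minimal check of the count-data case, we can simulate Poisson counts and watch the variance track the mean:

set.seed(42)
for (m in c(1, 5, 20)) {
  x <- rpois(10000, lambda = m)  # simulated counts with mean m
  cat("mean =", round(mean(x), 2), " variance =", round(var(x), 2), "\n")
}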

Generalized linear models (GLMs) are excellent at dealing with these two issues.

Generalized linear models

The use of GLMs is recommended when either:

- the variance is not constant, and/or
- the errors are not normally distributed.

Specifically, you should use a GLM when the response variable is:

- count data (a non-negative integer count of occurrences)
- proportion data (a count of successes out of a known number of trials)
- binary (a single yes/no outcome)
- a time to an event, such as time to death

A generalized linear model has three important properties:

- an error structure (probability distribution)
- a linear predictor
- a link function

The error structure, or probability distribution

Up to this point, we have dealt with the statistical analysis of data with normal errors. In practice, however, many kinds of data have non-normal errors, for example:

- count data (integers, often with many zeros)
- proportion data (bounded between 0 and 1)
- binary data (yes/no outcomes)
- time-to-death data (strongly skewed, with variance increasing with the mean)

In the past, the only tools available to deal with these problems were transformation of the response variable or the adoption of non-parametric methods. A GLM allows the specification of a variety of different error distributions:

- normal (Gaussian) errors
- Poisson errors
- binomial errors
- gamma errors

In R, the error structure is defined by means of the family argument, supplied alongside the model formula in the call to glm(). Examples are glm(y ~ z, family = poisson), which means that the response variable y has Poisson errors, and glm(y ~ z, family = binomial), which means that the response is binary and the model has binomial errors.

As with previous models, the explanatory variable z can be continuous (leading to a regression analysis) or categorical (leading to an ANOVA-like procedure called analysis of deviance).
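A minimal worked example with simulated count data (the variable names are illustrative), fitting a Poisson GLM with a continuous predictor and then running an analysis of deviance:

set.seed(1)
z <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.5 + 0.8 * z))  # counts whose log-mean is linear in z

fit <- glm(y ~ z, family = poisson)
summary(fit)                # coefficients are reported on the link (log) scale
anova(fit, test = "Chisq")  # analysis of deviance for the effect of z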

The linear predictor

The linear predictor is the quantity which incorporates the information about the independent (explanatory, predictor) variables into the model. It is related to the expected value of the data (thus, "predictor") through the link function.

The link function provides the relationship between the linear predictor and the mean of the distribution function. In general, the link is particular to the probability distribution.
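For example, a binomial GLM uses the logit link by default, and predictions can be requested on either scale. A sketch with simulated data:

set.seed(2)
z <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.3 + 1.2 * z))  # logit of P(y = 1) is linear in z

fit <- glm(y ~ z, family = binomial(link = "logit"))
head(predict(fit, type = "link"))      # linear predictor: log-odds scale
head(predict(fit, type = "response"))  # mean scale: probabilities between 0 and 1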

Common distributions with typical uses

Probability distribution   Typical uses                                             Link name
Normal                     linear-response data                                     Identity
Poisson                    count of occurrences in a fixed amount of time/space     Log
Bernoulli                  outcome of a single yes/no occurrence                    Logit
Binomial                   count of "yes" occurrences out of N yes/no occurrences   Logit
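In R, each family object carries its default link, which matches the table above:

gaussian()$link  # "identity"
poisson()$link   # "log"
binomial()$link  # "logit"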

Interpretation and comparison of effect sizes

Y ~ beta0 + beta1X1 + beta2X2 + ... + betanXn + error

The betas in a GLM are coefficient estimates assigned to the predictor variables.

beta0

Beta0 is a constant. It is the predicted value of Y when all Xs are 0. It is the intercept of the fitted line.

betan

The other betas are each associated with a variable. Each beta weights the contribution of its X to variation in Y: it gives the predicted change in Y for a one-unit change in that X, holding everything else constant.

It is important to remember that the actual magnitude of a beta is a function of the units of measurement of its X. It is therefore possible to make a beta larger or smaller by simply changing the scale of its variable.

Thus: never compare the raw betas across variables to determine the importance of the variables in prediction.
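A small simulated demonstration of this scale dependence (the names are illustrative): expressing the same predictor in kilometres instead of metres makes its beta 1000 times larger without changing the underlying relationship.

set.seed(3)
x_m <- runif(50, 0, 100)       # predictor measured in metres
y <- 2 + 0.05 * x_m + rnorm(50)
x_km <- x_m / 1000             # the same predictor in kilometres

coef(glm(y ~ x_m))["x_m"]      # ~0.05
coef(glm(y ~ x_km))["x_km"]    # ~50: same effect, different units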

standardized betas

To compare effect sizes, we must standardize the betas.

To standardize a variable, subtract its mean and divide by its standard deviation.

In R:

x1 <- 1:10
scale(x1)  # centres x1 to mean 0 and scales it to standard deviation 1

A standardized variable has a mean of 0 and a standard deviation of 1.

Interpretation is the same as with raw or unstandardized coefficients, except that each beta now gives the predicted change in Y for a change of one standard deviation in its X (rather than a raw one-unit change).
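A sketch with simulated data: two predictors with very different raw scales but equally strong effects have raw betas that differ 100-fold, while their standardized betas are comparable.

set.seed(4)
x1 <- rnorm(100, mean = 50, sd = 10)    # measured on a large scale
x2 <- rnorm(100, mean = 0.5, sd = 0.1)  # measured on a small scale
y <- 1 + 0.2 * x1 + 20 * x2 + rnorm(100)

coef(glm(y ~ x1 + x2))                  # raw betas: ~0.2 and ~20, not comparable
coef(glm(y ~ scale(x1) + scale(x2)))    # standardized betas: both ~2, comparable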


Note on terminology

GeneralIZED linear models are not the same as general linear models.

The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

The general linear model is a generalization of the multiple linear regression model to the case of more than one dependent variable. The errors are usually assumed to be uncorrelated across measurements and to follow a multivariate normal distribution. Thus, ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, and the t-test and F-test are all special cases of the general linear model.


Updated: 2016-11-05