Remember our general formula:
Y ~ beta0 + beta1X1 + beta2X2 + ... + betanXn + error
Both linear models and ANOVA have the standard assumptions of independent and identically distributed (i.i.d.) normal random variables.
More explicitly, these standard assumptions are:
independent: The samples are not autocorrelated and do not suffer from some other form of non-independence (i.e., the value of one sample does not influence the value of the next).
identically distributed: The samples come from the same probability distribution (in this case, normal).
normal: The samples come from a normal probability distribution.
random variable: i.e., the response variable.
The key assumptions that many response variables do not meet are those of normality and identical distribution.
Assuming that your data are normal makes the statistical models much easier, but normal distributions are actually uncommon despite their ubiquity in statistics courses. In the past, people would transform the response variable to shoehorn it into a more normal distribution (e.g., by taking the log). However, you are then modelling the transformed variable rather than the actual variable, so interpretation can be tricky.
Much data do not meet the assumption of identical distribution because of unequal variance (i.e., the variance is not constant between treatments, among groups, or along a continuous variable). For example, in count data, where the response variable is an integer and there are often lots of zeros in the dataframe, the variance may increase linearly with the mean. With proportion data, where we have a count of the number of failures of an event as well as the number of successes, the variance will be an inverted U-shaped function of the mean. Where the response variable follows a gamma distribution (as in time-to-death data), the variance increases faster than linearly with the mean.
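A quick sketch of the first case (simulated data, not from a real dataset): for Poisson-distributed counts, the variance tracks the mean, so groups with larger means necessarily have larger variances and the constant-variance assumption fails.

```r
# Simulate count data at three different means and compare mean vs. variance.
set.seed(42)
for (m in c(1, 5, 20)) {
  y <- rpois(1000, lambda = m)  # Poisson counts with true mean m
  cat("true mean =", m,
      " sample mean =", round(mean(y), 2),
      " sample variance =", round(var(y), 2), "\n")
}
```

In each case the sample variance is close to the sample mean, which is exactly the variance-increases-with-the-mean pattern described above.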
Generalized linear models (GLMs) are excellent at dealing with these two issues.
The use of GLMs is recommended either when:
the variance is not constant, and/or
the errors are not normally distributed.
Specifically, you should use a GLM when the response variable is:
binary, often at the individual level (e.g., dead or alive)
binary or count data expressed as proportions, often at the group level (e.g., sex ratio)
count data that are not proportions (e.g., number of insects per leaf, number of bars per street)
data on time to death where the variance increases faster than linearly with the mean (e.g., time data with gamma errors).
A generalized linear model has three important properties:
the error structure
the linear predictor
the link function
Up to this point, we have dealt with the statistical analysis of data with normal errors. In practice, however, many kinds of data have non-normal errors, for example:
errors that are strongly skewed;
errors that are kurtotic;
errors that are strictly bounded (as in proportions);
errors that cannot lead to negative fitted values (as in counts).
In the past, the only tools available to deal with these problems were transformation of the response variable or the adoption of non-parametric methods. A GLM allows the specification of a variety of different error distributions:
binomial errors, useful with data on proportions
Poisson errors, useful with count data
gamma errors, useful with data showing a constant coefficient of variation
exponential errors, useful with data on time to death (survival analysis)
In R, the error structure is defined by means of the family argument, used as part of the model call. For example, glm(y ~ z, family = poisson) means that the response variable y has Poisson errors, and glm(y ~ z, family = binomial) means that the response is binary and the model has binomial errors.
As with previous models, the explanatory variable z can be continuous (leading to a regression analysis) or categorical (leading to an ANOVA-like procedure called analysis of deviance).
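Putting this together, here is a minimal worked sketch with simulated data (the variable names y and z are placeholders, as above; the true coefficients 0.5 and 1.2 are made up for illustration):

```r
# Simulate count data whose log-mean depends linearly on a continuous z,
# then fit a Poisson GLM with the family argument.
set.seed(1)
z <- runif(100, 0, 2)
y <- rpois(100, lambda = exp(0.5 + 1.2 * z))  # true log-linear relationship

fit <- glm(y ~ z, family = poisson)
summary(fit)    # coefficients are reported on the log (link) scale
exp(coef(fit))  # back-transform to multiplicative effects on the mean
```

Because the coefficients are on the link scale, exponentiating them gives the multiplicative change in the expected count per unit change in z.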
The linear predictor is the quantity which incorporates the information about the independent (explanatory, predictor) variables into the model. It is related to the expected value of the data (thus, "predictor") through the link function.
The link function provides the relationship between the linear predictor and the mean of the distribution function. In general, the link is particular to the probability distribution.
Probability distribution | Typical uses | Link name
---|---|---
Normal | linear-response data | Identity
Poisson | count of occurrences in fixed amount of time/space | Log
Bernoulli | outcome of single yes/no occurrence | Logit
Binomial | count of # of "yes" occurrences out of N yes/no occurrences | Logit
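The link-function idea can be seen directly in R (a sketch with simulated data, using the Poisson/log pair from the table): predict() can return either the linear predictor or the mean, and the two are connected by the inverse link.

```r
# Fit a Poisson GLM and compare the link scale to the response scale.
set.seed(2)
z <- runif(50, 0, 3)
y <- rpois(50, lambda = exp(1 + 0.5 * z))
fit <- glm(y ~ z, family = poisson)

eta <- predict(fit, type = "link")      # linear predictor: beta0 + beta1*z
mu  <- predict(fit, type = "response")  # fitted mean of y
all.equal(mu, exp(eta))                 # the log link: mean = exp(linear predictor)
```

The same pattern holds for the other rows of the table; for binomial models, for example, the inverse logit maps the linear predictor onto a probability between 0 and 1.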
Y ~ beta0 + beta1X1 + beta2X2 + ... + betanXn + error
The betas in a GLM are coefficient estimates assigned to the predictor variables.
Beta0 is a constant. It is the predicted value of Y when all Xs are 0. It is the intercept of the fitted line.
The other betas are each associated with a variable. Each beta weights the contribution of its X to variation in Y: it gives the predicted change in Y for a 1-unit change in that X, keeping everything else constant.
It is important to remember that the actual magnitude of a beta is a function of the units of measurement of its X. It is therefore possible to make a beta larger or smaller by simply changing the scale of its variable.
Thus: never compare the raw betas across variables to determine the importance of the variables in prediction.
To compare effect sizes, we must Standardize The Betas ...
To standardize, subtract the mean and divide by the standard deviation of the variable.
In R:
x1 <- 1:10
scale(x1)  # centers (subtracts the mean) and scales (divides by the SD)
Standardized coefficients have a mean of 0 and a standard deviation of 1.0.
Interpretation is the same as with raw or unstandardized coefficients, except that the value of each beta corresponds to a unit change in the standard deviation of each X (rather than a simple raw unit change).
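A short sketch of why this matters (simulated data; the metres/millimetres framing is just an illustrative choice): rescaling a predictor changes its raw beta by the same factor, but the standardized beta is unaffected.

```r
# The same predictor in two different units gives wildly different raw betas,
# but identical standardized betas.
set.seed(3)
x_m  <- runif(100, 0, 5)   # predictor measured in metres
x_mm <- x_m * 1000         # the same predictor in millimetres
y <- 2 + 3 * x_m + rnorm(100)

coef(lm(y ~ x_m))          # raw slope near 3
coef(lm(y ~ x_mm))         # raw slope near 0.003: same variable, 1000x smaller beta

coef(lm(y ~ scale(x_m)))[2]   # standardized slope...
coef(lm(y ~ scale(x_mm)))[2]  # ...is identical regardless of the units
```

This is the concrete reason raw betas should never be compared across variables to judge importance: they carry the units of their X.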
GeneralIZED linear models are not the same as general linear models.
The generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.
The general linear model is a generalization of the multiple linear regression model to the case of more than one dependent variable. The errors are usually assumed to be uncorrelated across measurements and to follow a multivariate normal distribution. Thus, ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test, and the F-test are all special cases of the general linear model.
Updated: 2016-11-05