FES 720 Introduction to R

The Importance of Missing Data



The witch tapped her broomstick, and _whoosh!_ they were gone

Julia Donaldson. 2003. Room on the Broom


Many reasons for missing data …


… but only 4 mechanisms of missing data

  1. Missingness completely at random

  2. Missingness at random

  3. Missingness that depends on unobserved predictors

  4. Missingness that depends on the missing value itself


1. Missingness Completely At Random (MCAR)


The reason for missingness is totally independent of the predictors and response.

i.e., the probability of missingness is the same for each unit in your sample.


The data sample with complete cases remains an unbiased sample of the population.

The complete cases can be analysed on their own, but with a loss of power.

Data are rarely MCAR.
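The MCAR case can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the variable names, the sample size, and the 30% missingness rate are all assumptions chosen for illustration.

```r
# Minimal MCAR simulation (all names and values are illustrative assumptions)
set.seed(1)
n <- 10000
y <- rnorm(n, mean = 50, sd = 10)      # the full sample, before any loss

# every unit has the same 30% chance of being missing,
# regardless of its own value or of any predictor
missing <- runif(n) < 0.3
y_obs <- ifelse(missing, NA, y)

mean(y)                       # mean of the full data
mean(y_obs, na.rm = TRUE)     # complete-case mean: close to the full-data mean
sum(!is.na(y_obs))            # but fewer observations remain, hence less power
```

Because missingness is unrelated to `y`, the two means differ only by sampling noise.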


2. Missingness At Random (MAR)


Missingness depends only on other available information (e.g., other predictors).

E.g., different social groups have different response rates to a survey:

if sex, race, education, and age are recorded for all people in the survey, then “earnings” is missing at random if the probability of nonresponse to this question depends only on these other, fully recorded variables.


The data sample with complete cases is a biased sample of the population.

This is acceptable, however, if the regression controls for all the variables that affect the probability of missingness.

E.g., need to include sex, race, education, and age in the model.
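A rough simulation of the MAR case follows. It is a sketch under stated assumptions: a single standardised predictor called `education` stands in for the fully recorded variables, and the coefficients of the earnings model and of the nonresponse model are invented for illustration.

```r
# Minimal MAR simulation (names and coefficients are illustrative assumptions)
set.seed(2)
n <- 10000
education <- rnorm(n)                                # fully observed predictor
earnings  <- 30 + 5 * education + rnorm(n, sd = 2)   # response

# the probability of nonresponse depends only on the observed predictor
p_missing    <- plogis(-1 + 1.5 * education)
earnings_obs <- ifelse(runif(n) < p_missing, NA, earnings)

mean(earnings)                      # full-data mean
mean(earnings_obs, na.rm = TRUE)    # complete-case mean is biased downward

# a regression that controls for the variable driving missingness
fit <- lm(earnings_obs ~ education) # lm() drops rows with NA by default
coef(fit)                           # close to the true values 30 and 5
```

The raw complete-case mean is biased because highly educated people respond less often, yet the regression coefficients are recovered because `education` is in the model.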


3. Missingness Not At Random I (MNAR): depends on unobserved predictors



E.g., if a particular treatment causes discomfort, a patient is more likely to drop out of the study (and ‘discomfort’ is not measured).


The data sample is biased, so the missingness needs to be modelled explicitly (or the bias accepted).


4. Missingness Not At Random II (MNAR): depends on the missing value itself


Missingness depends on the (potentially missing) variable itself.

E.g., people with higher earnings are less likely to reveal them.

Censoring occurs when a particular value of the variable leads to missingness.



Model the missing data.

Include more predictors.
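The effect of value-dependent missingness can be sketched with a short simulation. The earnings distribution, the nonresponse model, and the censoring threshold below are assumptions made up for illustration.

```r
# Minimal MNAR simulation (distribution and threshold are illustrative assumptions)
set.seed(3)
n <- 10000
earnings <- rlnorm(n, meanlog = 10, sdlog = 0.5)   # hypothetical earnings

# higher earners are less likely to report:
# the probability of missingness depends on the value itself
p_missing    <- plogis((log(earnings) - 10) / 0.5)
earnings_obs <- ifelse(runif(n) < p_missing, NA, earnings)

mean(earnings)                      # full-data mean
mean(earnings_obs, na.rm = TRUE)    # complete-case mean underestimates it

# censoring as a special case: values above a threshold are never recorded
threshold     <- 40000
earnings_cens <- ifelse(earnings > threshold, NA, earnings)
max(earnings_cens, na.rm = TRUE) <= threshold
```

No observed variable explains the missingness here, so controlling for predictors alone does not remove the bias.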


Illustration of the classification for the mechanism of missing data

Red is missing data in the y-variable.

Blue is observed data.

Source: Nakagawa & Freckleton (2008)

(a) Missing At Random

(b) Missing Not At Random

(c) Missingness Completely At Random


Missing data is also a philosophical problem

We cannot be sure whether data are MCAR, MAR, or MNAR,

because unobserved predictors (lurking variables) are, by definition, unobserved.

So we can never rule them out.



The best solution to missing data

The best solution to handle missing data is to have none.

R.A. Fisher


Other solutions to missing data

  1. Complete-case analysis

  2. Available-case analysis

  3. Missingness weighting

  4. Imputation

See references for specific details.
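Three of these approaches can be sketched in base R. The data frame below is made up for illustration, missingness weighting is not shown, and the mean imputation is only the crudest form of imputation (it shrinks the variance of the imputed variable), so treat this as a sketch rather than a recommended workflow.

```r
# small data frame with scattered NAs (values are made up for illustration)
d <- data.frame(
  height = c(1.2, 1.5, NA, 1.9, 2.1, 1.7),
  weight = c(10, NA, 14, 19, 22, 16),
  age    = c(3, 4, 5, NA, 8, 6)
)

# 1. complete-case analysis: keep only rows with no missing values
d_cc <- na.omit(d)
nrow(d_cc)

# 2. available-case analysis: each statistic uses all rows available for it
colMeans(d, na.rm = TRUE)
cor(d, use = "pairwise.complete.obs")

# 4. imputation (simplest form): replace each NA with its column mean
d_imp <- d
for (v in names(d_imp)) {
  d_imp[[v]][is.na(d_imp[[v]])] <- mean(d_imp[[v]], na.rm = TRUE)
}
anyNA(d_imp)
```

Complete-case analysis discards half of the rows in this example, whereas the available-case statistics are each based on a different subset of rows.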


What value should I use to indicate missing data?


Use a value that is:

  1. Compatible with (most) software.

  2. Unlikely to cause errors in analysis.




| Null values | Problems | Compatibility | Recommendation |
|---|---|---|---|
| 0 | Indistinguishable from a true zero. | | Never use |
| Blank | Hard to distinguish missing values from those overlooked on data entry. Hard to distinguish blanks from spaces, which behave differently. | R, Python, SQL | Best option |
| -999, 999 | Not recognized as null by many programs without user input. Can be inadvertently entered into calculations. | | Avoid |
| NA, na | Also an abbreviation (e.g., North America). Can cause problems with data type (turn a numerical column into a text column). NA is more common than na. | R | Good option |
| N/A | An alternate form of NA, but often not compatible with software. | | Avoid |
| NULL | Can cause problems with data type. Variable use in R. | SQL | Good option |
| None | Uncommon. Can cause problems with data type. | Python | Avoid |
| No data | Uncommon. Can cause problems with data type. Contains a space. | | Avoid |
| Missing | Uncommon. Can cause problems with data type. | | Avoid |
| -, +, . | Uncommon. Can cause problems with data type. | | Avoid |

Table 1. Commonly used null values, their limitations, and their compatibility with common software. From: White et al. 2013
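When a file already uses one of these null values, the `na.strings` argument of `read.csv()` tells R which codes to convert to NA on import. The CSV contents below are hypothetical and are supplied inline through the `text` argument so that the sketch is self-contained.

```r
# hypothetical CSV contents mixing several null codes (blank, -999, NA)
csv_text <- "site,count,mass
A,12,3.4
B,,2.9
C,-999,NA
D,7,
"

# declare which strings should be read as missing
d <- read.csv(text = csv_text, na.strings = c("", "NA", "-999"))

d
colSums(is.na(d))   # number of missing values per column
str(d)              # count and mass stay numeric rather than becoming text
```

Without `"-999"` in `na.strings`, that value would be read as a real count and could silently enter calculations.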


R uses ‘NA’ to indicate missing data

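x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA, 10)
x
#  [1]  1  2  3  4 NA  6  7  8 NA 10

anyNA(x)         # e.g., does x contain any missing values?
# [1] TRUE

all(is.na(x))    # e.g., are all values of x missing?
# [1] FALSE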
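A few base R idioms for locating and counting missing values are sketched below, using the same vector. These are standard functions, but the selection shown here is illustrative rather than taken from the slides.

```r
x <- c(1, 2, 3, 4, NA, 6, 7, 8, NA, 10)

is.na(x)          # logical vector, TRUE where a value is missing
sum(is.na(x))     # number of missing values
# [1] 2
which(is.na(x))   # positions of the missing values
# [1] 5 9
x[!is.na(x)]      # keep only the observed values

# comparisons with NA are themselves NA, so `x == NA` cannot find missing values
NA == NA
# [1] NA
```

This is why `is.na()` rather than `==` is used to test for missingness.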


Many functions in R have an ‘na.rm =’ argument

Some functions return NA (or fail) when the data contain missing values.

# generate vector of integers with an NA
x <- c(1:10, NA)

# calculate mean of x
mean(x)
# [1] NA

Include ‘na.rm = TRUE’

mean(x, na.rm = TRUE)
# [1] 5.5
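The same pattern applies to other summary functions. The sketch below uses the same vector; note that some functions handle missing values through differently named arguments, such as `useNA` in `table()`.

```r
x <- c(1:10, NA)

sum(x)                    # NA: the missing value propagates
sum(x, na.rm = TRUE)      # NAs are dropped before summing
# [1] 55
sd(x, na.rm = TRUE)       # standard deviation of the observed values
range(x, na.rm = TRUE)    # minimum and maximum of the observed values
# [1]  1 10
summary(x)                # reports the number of NAs separately

# not every function uses na.rm; table() has useNA instead
table(x, useNA = "ifany")
```

Checking a function's help page for its missing-data argument before relying on a default is a reasonable habit.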


More details

Gelman, A. and Hill, J. 2006. Missing data imputation. Chapter 25, In: Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, UK. pp 529–543.

Nakagawa, S. 2015. Missing data: mechanisms, methods, and messages. In: Fox, Negrete-Yankelevich, and Sosa (eds). Ecological Statistics: Contemporary theory and application. Oxford University Press, UK.

Nakagawa, S. and Freckleton, R.P. 2008. Missing inaction: The dangers of ignoring missing data. Trends in Ecology & Evolution 23, 592–596.