Random Data and Sampling

Random number generators in R

R can create lots of different types of random numbers ranging from familiar families of distributions to specialized ones. As we know, random numbers are described by a distribution. That is, some function which specifies the probability that a random number is in some range. For example P(a < X <= b). Often this is given by a probability density (in the continuous case) or by a function P(X = k) = f(k) in the discrete case. R will give numbers drawn from lots of different distributions. In order to use them, you only need familiarize yourselves with the parameters that are given to the functions such as a mean, or a rate. Here are examples of the most common ones. For each, a histogram is given for a random sample of size 100, and density is superimposed as appropriate.

Uniform

function: runif()

arguments:

Uniform numbers are ones that are "equally likely" to be in the specified range. Often these numbers are in [0,1] for computers, but in practice can be between [a,b]. The general form is runif(n, min = 0, max = 1) which allows you to decide how many uniform random numbers you want (n), and the range they are chosen from (min, max)

# generate 1 random number between 0 and 2
runif(n = 1, min = 0, max = 2)
## [1] 1.301

# generate 5 random numbers between 0 and 10
runif(n = 5, min = 0, max = 10)
## [1] 8.52982 7.61161 0.07449 8.28692 5.63048

# if you do not set the min and max, the default is min = 0, max = 1
runif(n = 5)
## [1] 0.059946 0.314155 0.941851 0.315093 0.001383

To see the distribution with min = 0 and max = 1 (the default) we have

# get 100 random numbers between 0 and 1
x = runif(100)

# plot histogram
hist(x, probability = TRUE, col = gray(0.9), main = "uniform on [0,1]")
curve(dunif(x, 0, 1), add = TRUE)
Fig. Histogram of uniform distribution

Fig. Histogram of uniform distribution

Normal distribution

function: rnorm()

arguments:

Normal numbers are the backbone of classical statistical theory due to the central limit theorem The normal distribution has two parameters a mean and a standard deviation. These are the location and spread parameters. For example, IQs may be normally distributed with mean 100 and standard deviation 16; Human gestation may be normal with mean 280 and standard deviation of 10 (approximately). The family of normals can be standardized to normal with mean 0 (centered) and variance 1. This is achieved by "standardizing" the numbers, i.e. Z = (X - mu) / sigma.

For example

# an IQ score
rnorm(n = 1, mean = 100, sd = 16)
## [1] 88.46

# gestation
rnorm(n = 1, mean = 280, sd = 10)
## [1] 279.6

Too see the shape for the defaults (mean 0, standard deviation 1).

x = rnorm(n = 100)
hist(x, probability = TRUE, col = gray(0.9), main = "normal mu=0,sigma=1")
curve(dnorm(x), add = TRUE)
Fig. Histogram of normal distribution

Fig. Histogram of normal distribution

Binomial.

function: rbinom()

arguments:

The binomial random numbers are discrete random numbers. They have the distribution of the number of successes in n independent Bernoulli trials where a Bernoulli trial results in success or failure, with the probability of success = p

A single Bernoulli trial (e.g. a coin flip) is given with size = 1.

rbinom(n = 1, size = 1, prob = 0.5)
## [1] 0

# 10 different numbers
rbinom(n = 10, size = 1, prob = 0.5)
##  [1] 0 0 1 1 1 0 1 0 0 0

A binomially distributed number is the same as the number of 1's in n such Bernoulli numbers.

To generate binomial numbers, we simply change the value of size from 1 to the desired number of trials. For example, with 10 trials:

# how many successes in 10 trials
rbinom(n = 1, size = 10, p = 0.5)
## [1] 5

# 5 separate series of 10 trials
rbinom(n = 5, size = 10, p = 0.5)
## [1] 4 5 4 5 6

The number of successes is of course discrete, but as n gets large, the number starts to look quite normal. This is a case of the central limit theorem which states in general that (X - mu) / \sigma is normal in the limit.

These graphs show 100 binomially distributed random numbers for size = 5 trials and for p = 0.25.

# 100 random numbers
x = rbinom(100, size = 5, p = 0.25)
hist(x, probability = TRUE)

# use points, not curve as dbinom() wants integers only for x
xvals = 0:5
points(xvals, dbinom(xvals, 5, 0.25), type = "h", lwd = 3)
points(xvals, dbinom(xvals, 5, 0.25), type = "p", lwd = 3)
Fig. Histograms of binomial distributions

Fig. Histograms of binomial distributions

Exponential

function: rexp()

arguments:

The exponential distribution is important for theoretical work. It is used to describe lifetimes of electrical components (to first order). For example, if the mean life of a light bulb is 2500 hours one may think its lifetime is random with exponential distribution having mean 2500. The one parameter is the rate = 1/mean. We specify it as follows rexp(n, rate = 1).

Here is an example with the rate being 1/2500.

x = rexp(n = 100, rate = 1/2500)
hist(x, probability = TRUE, col = gray(0.9), main = "exponential mean=2500")
curve(dexp(x, 1/2500), add = TRUE)
Fig. Histogram of exponential distribution

Fig. Histogram of exponential distribution

There are many other distributions of interest in statistics. Common ones are the Poisson, the Student t-distribution, the F distribution, the beta distribution and the Chi squared distribution. All are available in R and follow similar format to these outlined above.

Random Sampling / Bootstrapping

We can use the sample() function to randomly pull values, with or without replacement, from a vector. b is the vector that you want to sample from, size is the number of times to randomly pull a value from b.

b <- 1:6
sample(b, size = 10, replace = TRUE)
##  [1] 2 5 2 5 2 4 4 1 3 6

# or with a function
RollDie = function(n) sample(1:6, n, replace = TRUE)

# Roll five dice three times...
RollDie(5)
## [1] 5 6 1 4 3
RollDie(5)
## [1] 5 2 4 2 2
RollDie(5)
## [1] 3 4 5 6 6

Sampling with and without replacement using sample()

R has the ability to sample with and without replacement. That is, choose at random from a collection of things such as the numbers 1 through 6 in the dice rolling example.

The sampling can be done with replacement (like dice rolling) or without replacement (like a lottery). By default sample() samples without replacement each object having equal chance of being picked. You need to specify replace = TRUE if you want to sample with replacement. Furthermore, you can specify separate probabilities for each if desired.

Roll a die ten times

sample(1:6, 10, replace = TRUE)
##  [1] 1 2 6 4 1 3 2 3 5 2

Toss a coin

sample(c("H", "T"), 10, replace = TRUE)
##  [1] "T" "T" "T" "H" "H" "T" "H" "T" "T" "T"

Pick 6 of 54 (a lottery)

sample(1:54, 6)  # no replacement
## [1] 47 25 18 13 22 36

Pick a card. (Fancy! Uses paste, rep)

cards <- paste(rep(c("A", 2:10, "J", "Q", "K"), 4), c("H", "D", "S", "C"))
sample(cards, 5)
## [1] "J D" "8 D" "8 H" "3 C" "J H"

Roll 2 dice. Even fancier

dice <- as.vector(outer(1:6, 1:6, paste))
sample(dice, 5, replace = TRUE)
## [1] "5 4" "2 1" "3 2" "5 6" "6 6"

The last two illustrate things that can be done with a little typing and a lot of thinking using the fun commands paste() for pasting together strings, rep() for repeating things and outer() for generating all possible products.

A bootstrap sample

Bootstrapping is a method of sampling from a data set to make statistical inference. The intuitive idea is that by sampling, one can get an idea of the variability in the data. The process involves repeatedly selecting samples and then forming a statistic. Here is a simple illustration on obtaining a sample.

The built-in data set faithful has a variable "eruptions" that measures the time between eruptions at Old Faithful. It has an unusual distribution. A bootstrap sample is just a sample with replacement from the given values.

First, examine the dataset

data(faithful)  # part of R's base
names(faithful)
## [1] "eruptions" "waiting"
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55

eruptions <- faithful$eruptions
sample(eruptions, 10, replace = TRUE)  # a sample of 10 values
##  [1] 2.800 4.350 4.133 1.867 4.533 2.017 2.133 2.617 4.233 4.250

# Plot the full dataset
hist(eruptions, breaks = 25)
plot of chunk unnamed-chunk-11

plot of chunk unnamed-chunk-11

Second, obtain the bootstrap sample

hist(sample(eruptions, 100, replace = TRUE), breaks = 25)
Fig. Histogram of eruptions.

Fig. Histogram of eruptions.

Notice that the bootstrap sample has a similar histogram, but it is different:

par(mfrow = c(2, 1))
hist(eruptions, breaks = 25)
hist(sample(eruptions, 100, replace = TRUE), breaks = 25)
Fig. Histogram of (top) eruptions, and (bottomw) bootstrapped sample.

Fig. Histogram of (top) eruptions, and (bottomw) bootstrapped sample.


Exercises