Accessing Variables and Managing Subsets of Data

It is best to import the full dataset and subset within R, rather than pull out different parts of data sets in spreadsheet software.

A. Examining the data structure: names(), head(), and str()

The function names() returns the column names of the data.

The function head() returns the top 6 rows of the data (unless you specify a different number of rows).

The function str() returns the status of each variable in the data frame

# Let's use the sparrow data again ... 
dat <- read.table(file = "http://www.simonqueenborough.info/R/data/sparrows.txt", header = TRUE)
names(dat)

##  [1] "Species"  "Sex"      "Wingcrd"  "Tarsus"   "Head"     "Culmen"  
##  [7] "Nalospi"  "Wt"       "Observer" "Age"

head(dat)

##   Species    Sex Wingcrd Tarsus Head Culmen Nalospi   Wt Observer Age
## 1    SSTS   Male    58.0   21.7 32.7   13.9    10.2 20.3        2   0
## 2    SSTS Female    56.5   21.1 31.4   12.2    10.1 17.4        2   0
## 3    SSTS   Male    59.0   21.0 33.3   13.8    10.0 21.0        2   0
## 4    SSTS   Male    59.0   21.3 32.5   13.2     9.9 21.0        2   0
## 5    SSTS   Male    57.0   21.0 32.5   13.8     9.9 19.8        2   0
## 6    SSTS Female    57.0   20.7 32.5   13.3     9.9 17.5        2   0

str(dat)

## 'data.frame':    979 obs. of  10 variables:
##  $ Species : Factor w/ 2 levels "SESP","SSTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Sex     : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ...
##  $ Wingcrd : num  58 56.5 59 59 57 57 57 57 53.5 56.5 ...
##  $ Tarsus  : num  21.7 21.1 21 21.3 21 20.7 22 20.8 20.1 22.2 ...
##  $ Head    : num  32.7 31.4 33.3 32.5 32.5 32.5 33.1 32.1 32.1 32.2 ...
##  $ Culmen  : num  13.9 12.2 13.8 13.2 13.8 13.3 13.8 13.5 13.7 13.6 ...
##  $ Nalospi : num  10.2 10.1 10 9.9 9.9 9.9 9.9 9.8 9.8 9.8 ...
##  $ Wt      : num  20.3 17.4 21 21 19.8 17.5 19.6 21.2 18.5 20.5 ...
##  $ Observer: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Age     : int  0 0 0 0 0 0 0 0 0 0 ...

Once the data are in R, it is best practice to examine the data graphically, by plotting various relationships, checking for outliers etc. See presentation

B. Accessing data and calling columns

There are various ways to call or access data once it is in R. Some ways are better than others. In order of best practice:

1. Within functions: data =

Many functions have a data = argument...

# linear regression 
m1 <- lm(Wingcrd ~ factor(Species) + factor(Sex), data = dat)

... although many do not

mean(Wingcrd, data = dat) # returns an error!

## Error in mean(Wingcrd, data = dat): object 'Wingcrd' not found

2. The $ sign

Columns within data frames can be called using $

head(dat$Wingcrd)

## [1] 58.0 56.5 59.0 59.0 57.0 57.0

mean(dat$Wingcrd)

## [1] 57.86599

3. Indices: []

Use square brackets to index [row, col]. Although, if your column order changes, you may end up not accessing the correct column!

head(dat[,6])

## [1] 13.9 12.2 13.8 13.2 13.8 13.3

mean(dat[,6])

## [1] 13.15536

4. Worst practice: attach()

The attach() argument directly stores the data object in R, allowing you to dispense with the dat$ prefix to columns.

attach(dat)
Wingcrd
Sex

mean(Wingcrd)
boxplot(Wingcrd)

This option appears initially very attractive; but it can lead to problems

if the attached dataset has variable names that also exist outside the data frame, or in another data frame, or
if variable names match other R functions or example data sets

If you simply must use attach(), always detach() the data after use.

It is far better never to use it.

detach(dat)

C. Accessing subsets of data

1. What different levels/codes/values exist in the data: unique()

Suppose we want to know what the codes are for sex in the dat data

unique(dat$Sex)

## [1] Male   Female
## Levels: Female Male

2. Pulling out only one sex: []

Again, a number of ways to do this.

# 1.
# 1a. create vector of TRUE/FALSE (boolean)
Sel <- dat$Sex == 'Female'
head(Sel)

## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE

# 1b. select rows of dat that = TRUE
#     using [] for indices:
#     we want rows, therefore before the ','
datF <- dat[Sel, ]

# 2.
# or, all in one go...
datF <- dat[dat$Sex == 'Female', ]

3. Extracting the same data, many ways

Observers 1, 2, and 3

dat123 <- dat[dat$Observer == 1 | dat$Observer == 2 | dat$Observer == 3, ]
dat123 <- dat[dat$Observer != 4, ]
dat123 <- dat[dat$Observer < 4, ]
dat123 <- dat[dat$Observer <= 3, ]
dat123 <- dat[dat$Observer >= 1 & dat$Observer <= 3, ]

Females from Observer 1

datF.and.1 <- dat[dat$Sex == 1 & dat$Observer == 1,]

Doing the above sequentially will result in NAs in the data

datF <- dat[dat$Sex == 1, ]           # extracts female data, and assigns it to datF, which has fewer rows than dat
datF1 <- datF[dat$Observer == 1, ]    # so, will add NAs to datF1

4. Sorting the data: order()

Use the function order() to sort the whole data frame by one or more columns

Ord1 <- order(dat$Wt)
dat[Ord1, ]

5. Add more columns to the data frame

New columns can be added to a dataframe simply by calling a new column name after the $: ```dat$new```.

dat$fObserver <- factor(dat$Observer)
dat$fAge <- factor(dat$Sex, labels = c("Zero", "One")) # These labels need to be in the same order as the data they are 'replacing'.
dat$fAge[1:35]

##  [1] One  Zero One  One  One  Zero One  One  Zero One  Zero Zero One  One 
## [15] One  Zero One  Zero One  One  One  One  One  Zero One  One  One  One 
## [29] One  One  One  One  One  One  Zero
## Levels: Zero One

names(dat)

##  [1] "Species"   "Sex"       "Wingcrd"   "Tarsus"    "Head"     
##  [6] "Culmen"    "Nalospi"   "Wt"        "Observer"  "Age"      
## [11] "fObserver" "fAge"

head(dat)

##   Species    Sex Wingcrd Tarsus Head Culmen Nalospi   Wt Observer Age
## 1    SSTS   Male    58.0   21.7 32.7   13.9    10.2 20.3        2   0
## 2    SSTS Female    56.5   21.1 31.4   12.2    10.1 17.4        2   0
## 3    SSTS   Male    59.0   21.0 33.3   13.8    10.0 21.0        2   0
## 4    SSTS   Male    59.0   21.3 32.5   13.2     9.9 21.0        2   0
## 5    SSTS   Male    57.0   21.0 32.5   13.8     9.9 19.8        2   0
## 6    SSTS Female    57.0   20.7 32.5   13.3     9.9 17.5        2   0
##   fObserver fAge
## 1         2  One
## 2         2 Zero
## 3         2  One
## 4         2  One
## 5         2  One
## 6         2 Zero

6. Combining two datasets with common identifier: merge()

dat.meta <- data.frame(Species = c('SSTS', 'SESP'), 
                       Fave.Color = c('red', 'blue'),
                       Diet = c('seeds', 'insects') 
                      )

dat.meta

##   Species Fave.Color    Diet
## 1    SSTS        red   seeds
## 2    SESP       blue insects

datMerged <- merge(dat, dat.meta, by = "Species")
datMerged[1:10, ]

##    Species    Sex Wingcrd Tarsus Head Culmen Nalospi   Wt Observer Age
## 1     SESP Female    61.0   22.5 33.7   14.4    10.4 22.4        2   0
## 2     SESP   Male    60.0   20.6 32.7   14.3    10.4 22.0        2   0
## 3     SESP   Male    66.0   22.4 31.7   15.1    10.5 22.3        2   0
## 4     SESP Female    56.5   23.6 33.9   14.2    10.4 21.5        2   0
## 5     SESP   Male    63.0   22.7 33.1   14.3    10.6 23.1        2   0
## 6     SESP Female    57.0   22.5 33.3   14.2    10.6 21.5        2   0
## 7     SESP   Male    62.0   23.5 34.6   15.0    11.1 23.2        4   0
## 8     SESP   Male    64.0   23.9 33.7   14.6    10.5 22.5        2   1
## 9     SESP Female    61.0   23.4 35.2   15.5    10.8 25.5        4   1
## 10    SESP   Male    61.5   23.1 33.3   14.2    10.3 22.0        2   0
##    fObserver fAge Fave.Color    Diet
## 1          2 Zero       blue insects
## 2          2  One       blue insects
## 3          2  One       blue insects
## 4          2 Zero       blue insects
## 5          2  One       blue insects
## 6          2 Zero       blue insects
## 7          4  One       blue insects
## 8          2  One       blue insects
## 9          4 Zero       blue insects
## 10         2  One       blue insects

# 'all' argument. FALSE = rows with NAs omitted; TRUE = NAs filled in
datMerged <- merge(dat, dat.meta, by = "Species", all = TRUE)
datMerged[1:10,]

##    Species    Sex Wingcrd Tarsus Head Culmen Nalospi   Wt Observer Age
## 1     SESP Female    61.0   22.5 33.7   14.4    10.4 22.4        2   0
## 2     SESP   Male    60.0   20.6 32.7   14.3    10.4 22.0        2   0
## 3     SESP   Male    66.0   22.4 31.7   15.1    10.5 22.3        2   0
## 4     SESP Female    56.5   23.6 33.9   14.2    10.4 21.5        2   0
## 5     SESP   Male    63.0   22.7 33.1   14.3    10.6 23.1        2   0
## 6     SESP Female    57.0   22.5 33.3   14.2    10.6 21.5        2   0
## 7     SESP   Male    62.0   23.5 34.6   15.0    11.1 23.2        4   0
## 8     SESP   Male    64.0   23.9 33.7   14.6    10.5 22.5        2   1
## 9     SESP Female    61.0   23.4 35.2   15.5    10.8 25.5        4   1
## 10    SESP   Male    61.5   23.1 33.3   14.2    10.3 22.0        2   0
##    fObserver fAge Fave.Color    Diet
## 1          2 Zero       blue insects
## 2          2  One       blue insects
## 3          2  One       blue insects
## 4          2 Zero       blue insects
## 5          2  One       blue insects
## 6          2 Zero       blue insects
## 7          4  One       blue insects
## 8          2  One       blue insects
## 9          4 Zero       blue insects
## 10         2  One       blue insects

7. Recoding categorical variables

dat$fSex <- factor(dat$Sex, labels = c("F", "M"))
dat$fObserver <- factor(dat$Observer)
str(dat)

## 'data.frame':    979 obs. of  13 variables:
##  $ Species  : Factor w/ 2 levels "SESP","SSTS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Sex      : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ...
##  $ Wingcrd  : num  58 56.5 59 59 57 57 57 57 53.5 56.5 ...
##  $ Tarsus   : num  21.7 21.1 21 21.3 21 20.7 22 20.8 20.1 22.2 ...
##  $ Head     : num  32.7 31.4 33.3 32.5 32.5 32.5 33.1 32.1 32.1 32.2 ...
##  $ Culmen   : num  13.9 12.2 13.8 13.2 13.8 13.3 13.8 13.5 13.7 13.6 ...
##  $ Nalospi  : num  10.2 10.1 10 9.9 9.9 9.9 9.9 9.8 9.8 9.8 ...
##  $ Wt       : num  20.3 17.4 21 21 19.8 17.5 19.6 21.2 18.5 20.5 ...
##  $ Observer : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Age      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ fObserver: Factor w/ 7 levels "2","3","4","5",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ fAge     : Factor w/ 2 levels "Zero","One": 2 1 2 2 2 1 2 2 1 2 ...
##  $ fSex     : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 2 2 1 2 ...

# we can change the order ...
dat$fObserver <- factor(dat$fObserver,
                    levels = c(8, 7, 2, 3, 4, 5, 6))
table(dat$fObserver)

## 
##   8   7   2   3   4   5   6 
## 109  50 297 241  83  45 154

8. Exporting data

datM <- dat[dat$Sex == 'Male', ]

# different options for writing the data.. see what happens with each one.
write.table(datM,
     file = "Maledata.txt",
     sep=" ", quote = FALSE, append = FALSE, na = "NA")
     
write.table(datM,
     file = "Maledata.txt",
     sep=",", quote = TRUE, append = FALSE, na = "NA")

write.table(datM,
     file = "Maledata.txt",
     sep=" ", quote = TRUE, append = TRUE, na = "NA")

Exercises

A. Using the first small sparrow dataset ...

dat <- data.frame(Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
                  Tarsus = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
                  Head = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA),
                  Wt = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)
                 )
dat

Use names(), head(), and str() to examine the data.
Add a column for species. Make two (2) species, using a vector of 1's and 2's.
Change this species column to a factor.
Change the order of the factor, so that species 2 is the first one and species 1 is the second.
Write two ways to extract the Head column.
Sort the whole data.frame by Tarsus.
Write this new data.frame to your computer.

B. Using results from the 2016 New Haven Road Race!

Find the data here: http://www.newhavenroadrace.org/wp-content/uploads/2015/03/NH16-5k-Overall.txt.

This data shows the times for runners in the race in early September 2016. It shows the runners, their age class and gender, and their times.

Times are logged via chips on each runner's bib. The 'time' column is the time since the starting gun fired. The 'netime' is the actual time from when each runner crossed the start line to when they crossed the finish line. 'Pace' is the average time per mile.

Import the data into R.
How many runners were there?
How many different places did they come from?
How many were from New Haven?
What was the slowest pace for males?
What was the fastest pace for females?
What was the median pace?
What was the difference between the mean pace for males and the mean pace for females?
What pace did your professor run? Was this above or below the mean for males in his age class?