It is best to import the full dataset and subset within R, rather than pull out different parts of data sets in spreadsheet software.
The function names()
returns the column names of the data.
The function head()
returns the top 6 rows of the data (unless you specify a different number of rows).
The function str()
returns the status of each variable in the data frame
# Let's use the sparrow data again ...
dat <- read.table(file = "http://www.simonqueenborough.info/R/data/sparrows.txt", header = TRUE)
names(dat)
## [1] "Species" "Sex" "Wingcrd" "Tarsus" "Head" "Culmen"
## [7] "Nalospi" "Wt" "Observer" "Age"
head(dat)
## Species Sex Wingcrd Tarsus Head Culmen Nalospi Wt Observer Age
## 1 SSTS Male 58.0 21.7 32.7 13.9 10.2 20.3 2 0
## 2 SSTS Female 56.5 21.1 31.4 12.2 10.1 17.4 2 0
## 3 SSTS Male 59.0 21.0 33.3 13.8 10.0 21.0 2 0
## 4 SSTS Male 59.0 21.3 32.5 13.2 9.9 21.0 2 0
## 5 SSTS Male 57.0 21.0 32.5 13.8 9.9 19.8 2 0
## 6 SSTS Female 57.0 20.7 32.5 13.3 9.9 17.5 2 0
str(dat)
## 'data.frame': 979 obs. of 10 variables:
## $ Species : Factor w/ 2 levels "SESP","SSTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ...
## $ Wingcrd : num 58 56.5 59 59 57 57 57 57 53.5 56.5 ...
## $ Tarsus : num 21.7 21.1 21 21.3 21 20.7 22 20.8 20.1 22.2 ...
## $ Head : num 32.7 31.4 33.3 32.5 32.5 32.5 33.1 32.1 32.1 32.2 ...
## $ Culmen : num 13.9 12.2 13.8 13.2 13.8 13.3 13.8 13.5 13.7 13.6 ...
## $ Nalospi : num 10.2 10.1 10 9.9 9.9 9.9 9.9 9.8 9.8 9.8 ...
## $ Wt : num 20.3 17.4 21 21 19.8 17.5 19.6 21.2 18.5 20.5 ...
## $ Observer: int 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 0 0 0 0 0 0 0 0 0 0 ...
Once the data are in R, it is best practice to examine the data graphically, by plotting various relationships, checking for outliers etc. See presentation
There are various ways to call or access data once it is in R. Some ways are better than others. In order of best practice:
Many functions have a data =
argument...
# linear regression
m1 <- lm(Wingcrd ~ factor(Species) + factor(Sex), data = dat)
... although many do not
mean(Wingcrd, data = dat) # returns an error!
## Error in mean(Wingcrd, data = dat): object 'Wingcrd' not found
Columns within data frames can be called using $
head(dat$Wingcrd)
## [1] 58.0 56.5 59.0 59.0 57.0 57.0
mean(dat$Wingcrd)
## [1] 57.86599
Use square brackets to index [row, col]. Although, if your column order changes, you may end up not accessing the correct column!
head(dat[,6])
## [1] 13.9 12.2 13.8 13.2 13.8 13.3
mean(dat[,6])
## [1] 13.15536
The attach()
argument directly stores the data object in R, allowing you to dispense with the dat$
prefix to columns.
attach(dat)
Wingcrd
Sex
mean(Wingcrd)
boxplot(Wingcrd)
This option appears initially very attractive; but it can lead to problems
If you simply must use attach()
, always detach()
the data after use.
It is far better never to use it.
detach(dat)
Suppose we want to know what the codes are for sex in the dat data
unique(dat$Sex)
## [1] Male Female
## Levels: Female Male
Again, a number of ways to do this.
# 1.
# 1a. create vector of TRUE/FALSE (boolean)
Sel <- dat$Sex == 'Female'
head(Sel)
## [1] FALSE TRUE FALSE FALSE FALSE TRUE
# 1b. select rows of dat that = TRUE
# using [] for indices:
# we want rows, therefore before the ','
datF <- dat[Sel, ]
# 2.
# or, all in one go...
datF <- dat[dat$Sex == 'Female', ]
Observers 1, 2, and 3
dat123 <- dat[dat$Observer == 1 | dat$Observer == 2 | dat$Observer == 3, ]
dat123 <- dat[dat$Observer != 4, ]
dat123 <- dat[dat$Observer < 4, ]
dat123 <- dat[dat$Observer <= 3, ]
dat123 <- dat[dat$Observer >= 1 & dat$Observer <= 3, ]
Females from Observer 1
datF.and.1 <- dat[dat$Sex == 1 & dat$Observer == 1,]
Doing the above sequentially will result in NAs in the data
datF <- dat[dat$Sex == 1, ] # extracts female data, and assigns it to datF, which has fewer rows than dat
datF1 <- datF[dat$Observer == 1, ] # so, will add NAs to datF1
Use the function order()
to sort the whole data frame by one or more columns
Ord1 <- order(dat$Wt)
dat[Ord1, ]
New columns can be added to a dataframe simply by calling a new column name after the $: ```dat$new```.
dat$fObserver <- factor(dat$Observer)
dat$fAge <- factor(dat$Sex, labels = c("Zero", "One")) # These labels need to be in the same order as the data they are 'replacing'.
dat$fAge[1:35]
## [1] One Zero One One One Zero One One Zero One Zero Zero One One
## [15] One Zero One Zero One One One One One Zero One One One One
## [29] One One One One One One Zero
## Levels: Zero One
names(dat)
## [1] "Species" "Sex" "Wingcrd" "Tarsus" "Head"
## [6] "Culmen" "Nalospi" "Wt" "Observer" "Age"
## [11] "fObserver" "fAge"
head(dat)
## Species Sex Wingcrd Tarsus Head Culmen Nalospi Wt Observer Age
## 1 SSTS Male 58.0 21.7 32.7 13.9 10.2 20.3 2 0
## 2 SSTS Female 56.5 21.1 31.4 12.2 10.1 17.4 2 0
## 3 SSTS Male 59.0 21.0 33.3 13.8 10.0 21.0 2 0
## 4 SSTS Male 59.0 21.3 32.5 13.2 9.9 21.0 2 0
## 5 SSTS Male 57.0 21.0 32.5 13.8 9.9 19.8 2 0
## 6 SSTS Female 57.0 20.7 32.5 13.3 9.9 17.5 2 0
## fObserver fAge
## 1 2 One
## 2 2 Zero
## 3 2 One
## 4 2 One
## 5 2 One
## 6 2 Zero
dat.meta <- data.frame(Species = c('SSTS', 'SESP'),
Fave.Color = c('red', 'blue'),
Diet = c('seeds', 'insects')
)
dat.meta
## Species Fave.Color Diet
## 1 SSTS red seeds
## 2 SESP blue insects
datMerged <- merge(dat, dat.meta, by = "Species")
datMerged[1:10, ]
## Species Sex Wingcrd Tarsus Head Culmen Nalospi Wt Observer Age
## 1 SESP Female 61.0 22.5 33.7 14.4 10.4 22.4 2 0
## 2 SESP Male 60.0 20.6 32.7 14.3 10.4 22.0 2 0
## 3 SESP Male 66.0 22.4 31.7 15.1 10.5 22.3 2 0
## 4 SESP Female 56.5 23.6 33.9 14.2 10.4 21.5 2 0
## 5 SESP Male 63.0 22.7 33.1 14.3 10.6 23.1 2 0
## 6 SESP Female 57.0 22.5 33.3 14.2 10.6 21.5 2 0
## 7 SESP Male 62.0 23.5 34.6 15.0 11.1 23.2 4 0
## 8 SESP Male 64.0 23.9 33.7 14.6 10.5 22.5 2 1
## 9 SESP Female 61.0 23.4 35.2 15.5 10.8 25.5 4 1
## 10 SESP Male 61.5 23.1 33.3 14.2 10.3 22.0 2 0
## fObserver fAge Fave.Color Diet
## 1 2 Zero blue insects
## 2 2 One blue insects
## 3 2 One blue insects
## 4 2 Zero blue insects
## 5 2 One blue insects
## 6 2 Zero blue insects
## 7 4 One blue insects
## 8 2 One blue insects
## 9 4 Zero blue insects
## 10 2 One blue insects
# 'all' argument. FALSE = rows with NAs omitted; TRUE = NAs filled in
datMerged <- merge(dat, dat.meta, by = "Species", all = TRUE)
datMerged[1:10,]
## Species Sex Wingcrd Tarsus Head Culmen Nalospi Wt Observer Age
## 1 SESP Female 61.0 22.5 33.7 14.4 10.4 22.4 2 0
## 2 SESP Male 60.0 20.6 32.7 14.3 10.4 22.0 2 0
## 3 SESP Male 66.0 22.4 31.7 15.1 10.5 22.3 2 0
## 4 SESP Female 56.5 23.6 33.9 14.2 10.4 21.5 2 0
## 5 SESP Male 63.0 22.7 33.1 14.3 10.6 23.1 2 0
## 6 SESP Female 57.0 22.5 33.3 14.2 10.6 21.5 2 0
## 7 SESP Male 62.0 23.5 34.6 15.0 11.1 23.2 4 0
## 8 SESP Male 64.0 23.9 33.7 14.6 10.5 22.5 2 1
## 9 SESP Female 61.0 23.4 35.2 15.5 10.8 25.5 4 1
## 10 SESP Male 61.5 23.1 33.3 14.2 10.3 22.0 2 0
## fObserver fAge Fave.Color Diet
## 1 2 Zero blue insects
## 2 2 One blue insects
## 3 2 One blue insects
## 4 2 Zero blue insects
## 5 2 One blue insects
## 6 2 Zero blue insects
## 7 4 One blue insects
## 8 2 One blue insects
## 9 4 Zero blue insects
## 10 2 One blue insects
dat$fSex <- factor(dat$Sex, labels = c("F", "M"))
dat$fObserver <- factor(dat$Observer)
str(dat)
## 'data.frame': 979 obs. of 13 variables:
## $ Species : Factor w/ 2 levels "SESP","SSTS": 2 2 2 2 2 2 2 2 2 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 2 2 1 2 ...
## $ Wingcrd : num 58 56.5 59 59 57 57 57 57 53.5 56.5 ...
## $ Tarsus : num 21.7 21.1 21 21.3 21 20.7 22 20.8 20.1 22.2 ...
## $ Head : num 32.7 31.4 33.3 32.5 32.5 32.5 33.1 32.1 32.1 32.2 ...
## $ Culmen : num 13.9 12.2 13.8 13.2 13.8 13.3 13.8 13.5 13.7 13.6 ...
## $ Nalospi : num 10.2 10.1 10 9.9 9.9 9.9 9.9 9.8 9.8 9.8 ...
## $ Wt : num 20.3 17.4 21 21 19.8 17.5 19.6 21.2 18.5 20.5 ...
## $ Observer : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Age : int 0 0 0 0 0 0 0 0 0 0 ...
## $ fObserver: Factor w/ 7 levels "2","3","4","5",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ fAge : Factor w/ 2 levels "Zero","One": 2 1 2 2 2 1 2 2 1 2 ...
## $ fSex : Factor w/ 2 levels "F","M": 2 1 2 2 2 1 2 2 1 2 ...
# we can change the order ...
dat$fObserver <- factor(dat$fObserver,
levels = c(8, 7, 2, 3, 4, 5, 6))
table(dat$fObserver)
##
## 8 7 2 3 4 5 6
## 109 50 297 241 83 45 154
datM <- dat[dat$Sex == 'Male', ]
# different options for writing the data.. see what happens with each one.
write.table(datM,
file = "Maledata.txt",
sep=" ", quote = FALSE, append = FALSE, na = "NA")
write.table(datM,
file = "Maledata.txt",
sep=",", quote = TRUE, append = FALSE, na = "NA")
write.table(datM,
file = "Maledata.txt",
sep=" ", quote = TRUE, append = TRUE, na = "NA")
dat <- data.frame(Wingcrd = c(59, 55, 53.5, 55, 52.5, 57.5, 53, 55),
Tarsus = c(22.3, 19.7, 20.8, 20.3, 20.8, 21.5, 20.6, 21.5),
Head = c(31.2, 30.4, 30.6, 30.3, 30.3, 30.8, 32.5, NA),
Wt = c(9.5, 13.8, 14.8, 15.2, 15.5, 15.6, 15.6, 15.7)
)
dat
Use names()
, head()
, and str()
to examine the data.
Add a column for species. Make two (2) species, using a vector of 1's and 2's.
Change this species column to a factor.
Change the order of the factor, so that species 2 is the first one and species 1 is the second.
Write two ways to extract the Head column.
Sort the whole data.frame by Tarsus.
Write this new data.frame to your computer.
Find the data here: http://www.newhavenroadrace.org/wp-content/uploads/2015/03/NH16-5k-Overall.txt.
This data shows the times for runners in the race in early September 2016. It shows the runners, their age class and gender, and their times.
Times are logged via chips on each runner's bib. The 'time' column is the time since the starting gun fired. The 'netime' is the actual time from when each runner crossed the start line to when they crossed the finish line. 'Pace' is the average time per mile.
Import the data into R.
How many runners were there?
How many different places did they come from?
How many were from New Haven?
What was the slowest pace for males?
What was the fastest pace for females?
What was the median pace?
What was the difference between the mean pace for males and the mean pace for females?
What pace did your professor run? Was this above or below the mean for males in his age class?