FES 720 Introduction to R

Statistical Approaches

Three distinct goals for which one might use statistics:

Hypothesis testing

in general ANOVA -> “factor X influences factor Y”,
falsifying null hypothesis (often obviously wrong).
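
As a minimal sketch of the ANOVA-style question, the example below asks whether a factor (treatment group) influences a response (plant weight). It uses the PlantGrowth data set that ships with R purely as a stand-in; the data set and the object names are assumptions for illustration, not part of these notes.

```r
# One-way ANOVA: does treatment group (factor X) influence plant weight (Y)?
# PlantGrowth is a built-in R data set, used here only as an example.
fit_aov <- aov(weight ~ group, data = PlantGrowth)
summary(fit_aov)

# Pull the p-value for the group term out of the ANOVA table.
p_group <- summary(fit_aov)[[1]][["Pr(>F)"]][1]
p_group
```

A small p-value here only says that the null hypothesis of identical group means is unlikely; it says nothing about how large the differences are.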

Prediction

precise enough to be wrong,
probabilistic,
prolific data,
proper scales,
place specific,
make public.

Exploration

forefront of knowledge,
no p-values,
hard to publish now,
so, often dressed up as hypothesis-testing.

Decide which of these goals you are pursuing before you start collecting and analyzing the data.

These goals line up, loosely, with the summary statistics reported for statistical models: the p-value, the coefficient estimate, and R².

The same statistical techniques can serve more than one of these approaches, but some techniques are not appropriate for some approaches.

p-value: used in a hypothetico-deductive framework; it tells us the probability of observing a signal at least this strong by chance alone if the null hypothesis were true (e.g., does nitrogen increase crop yield?)
coefficient estimate: the biological significance, i.e. the size of the effect (e.g., how much does crop yield increase given a level of nitrogen addition?)
R²: how much of the variation is explained by what we are studying versus other sources of variation (e.g., how much of the variation in crop yield is due to nitrogen?). How well/completely do we understand the system?
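
A minimal sketch of where these three numbers live in a fitted linear model. The nitrogen and yield values are simulated, and the units, true slope, and noise level are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical experiment: nitrogen added (kg/ha) and crop yield (t/ha).
nitrogen <- rep(c(0, 50, 100, 150), each = 10)
yield    <- 2 + 0.01 * nitrogen + rnorm(length(nitrogen), sd = 0.5)
crop     <- data.frame(nitrogen, yield)

fit <- lm(yield ~ nitrogen, data = crop)
s   <- summary(fit)

p_value  <- s$coefficients["nitrogen", "Pr(>|t|)"]  # is there a signal at all?
estimate <- s$coefficients["nitrogen", "Estimate"]  # change in yield per unit of nitrogen
r2       <- s$r.squared                             # share of yield variation explained

c(p_value = p_value, estimate = estimate, r2 = r2)
```

All three come from the same summary() call, so reporting only the p-value throws away information that is already available.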

Much of science is focussed on p-values to the detriment of other information: a p < 0.05 with an effect size of 0.1% and an R² of 3% is not that useful!
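
The point can be illustrated with a small simulation, sketched here under assumed values: with a very large sample, a tiny effect is "significant" even though it explains almost none of the variation. The sample size and effect size are arbitrary choices.

```r
set.seed(42)  # arbitrary seed

n <- 100000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)  # true effect is tiny relative to the noise

s_big  <- summary(lm(y ~ x))
p_big  <- s_big$coefficients["x", "Pr(>|t|)"]
r2_big <- s_big$r.squared

# Compare the two: a very small p-value alongside a very small R-squared.
c(p_value = p_big, r2 = r2_big)
```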

Many ‘exploratory’ analyses are portrayed as hypothesis testing, e.g.,

do some data mining (also called data dredging) until you find a statistical relationship, then test it for statistical significance.
if you do variable selection first and then run significance tests on the selected variables, your p-values will be much lower than they should be.
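
The inflation can be seen in a simulation, sketched below with assumed settings. The response and all 20 candidate predictors are pure noise, so every "significant" result is a false positive. Each run picks the predictor most correlated with the response and then tests it as if it had been chosen in advance.

```r
set.seed(123)  # arbitrary seed

n_sims <- 500  # number of simulated studies
n_obs  <- 50   # observations per study
n_pred <- 20   # candidate predictors, all unrelated to y

selected_p <- replicate(n_sims, {
  y <- rnorm(n_obs)
  X <- matrix(rnorm(n_obs * n_pred), nrow = n_obs)
  best <- which.max(abs(cor(X, y)))  # "variable selection" step
  # Significance test on the selected variable only.
  summary(lm(y ~ X[, best]))$coefficients[2, "Pr(>|t|)"]
})

# Share of runs declared significant at 0.05; honest testing would give about 0.05.
false_positive_rate <- mean(selected_p < 0.05)
false_positive_rate
```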

Example techniques

1. Hypothesis testing

present full model
model selection

2. Prediction

present full model
model selection
regression over ANOVA
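
One reading of "regression over ANOVA" for prediction is sketched below, using simulated data with assumed values. Treating nitrogen as a continuous predictor gives a function that can predict at levels that were never applied, whereas treating it as a factor only estimates a mean for each observed level.

```r
set.seed(7)  # arbitrary seed

# Hypothetical data: four nitrogen levels, ten plots each.
nitrogen <- rep(c(0, 50, 100, 150), each = 10)
yield    <- 2 + 0.01 * nitrogen + rnorm(length(nitrogen), sd = 0.5)
crop     <- data.frame(nitrogen, yield, nitrogen_f = factor(nitrogen))

# Regression: nitrogen is continuous, so we can predict at an unobserved level.
fit_reg  <- lm(yield ~ nitrogen, data = crop)
pred_reg <- predict(fit_reg, newdata = data.frame(nitrogen = 75),
                    interval = "prediction")
pred_reg

# ANOVA-style model: nitrogen is a factor, so level 75 is unknown to the model.
fit_anova  <- lm(yield ~ nitrogen_f, data = crop)
pred_anova <- tryCatch(
  predict(fit_anova, newdata = data.frame(nitrogen_f = factor(75))),
  error = function(e) NA_real_  # predict() stops on a factor level it has not seen
)
pred_anova
```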

3. Exploration

regression trees
spline regression
principal component analysis
variable selection
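
Two of the exploratory techniques listed above can be tried with packages that ship with R, as in the rough sketch below. The built-in iris and cars data sets are stand-ins chosen for illustration, and the spline's degrees of freedom are an arbitrary assumption. Regression trees, commonly fit with the rpart package, are left out to keep the sketch to the standard installation.

```r
library(splines)  # distributed with R; provides ns() for natural splines

# Principal component analysis on the four numeric iris measurements.
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # proportion of variance per component
round(var_explained, 3)

# Spline regression: let the shape of the speed-distance relationship be flexible.
fit_line   <- lm(dist ~ speed, data = cars)
fit_spline <- lm(dist ~ ns(speed, df = 3), data = cars)

# Compare the straight line and the spline informally; no p-values needed.
AIC(fit_line, fit_spline)
```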