# Statistical Approaches

Three distinct goals for which one might use statistics:

- Hypothesis testing
  - in general ANOVA -> “factor X influences factor Y”,
  - falsifying a null hypothesis (often obviously wrong).
- Prediction
  - precise enough to be wrong,
  - probabilistic,
  - prolific data,
  - proper scales,
  - place specific,
  - make public.
- Exploration
  - forefront of knowledge,
  - no p-values,
  - hard to publish now,
  - so, often dressed up as hypothesis testing.

Decide which of these goals you are pursuing *before* you start (collecting and) analyzing the data.

These goals overlap with the summary statistics provided for statistical models.

The same statistical techniques can serve more than one of these approaches, but some techniques are not appropriate for some approaches.

- **p-value**: used in a *hypothetico-deductive* framework; the probability of observing a signal at least this strong by chance alone if the null hypothesis were true (e.g., does nitrogen increase crop yield?).
- **coefficient estimate**: the *biological significance*, i.e., the size of the effect (e.g., how much does crop yield increase for a given level of nitrogen addition?).
- **R²**: how much of the variation is explained by what we are studying versus other sources of variation (e.g., how much of the variation in crop yield is due to nitrogen?). How well and how completely do we understand the system?

Much of science is focussed on p-values to the detriment of other information: a result with p < 0.05, an effect size of 0.1%, and an R² of 3% is not that useful!
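To make the distinction concrete, here is a minimal sketch in Python (standard library only) that simulates a weak nitrogen effect on crop yield and reports the p-value, the coefficient estimate, and R² side by side. The data, the effect size of 0.003, the noise level, and the helper `simple_regression` are illustrative assumptions rather than values from any real study. The p-value uses a normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math
import random


def simple_regression(x, y):
    """Least-squares fit of y = a + b*x.

    Returns the slope, R^2, and an approximate two-sided p-value
    for the null hypothesis that the slope is zero.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)

    # t statistic for the slope; with large n the t distribution is
    # close to a standard normal, so erfc gives the two-sided p-value.
    t = math.sqrt(r_squared * (n - 2) / (1 - r_squared))
    p_value = math.erfc(t / math.sqrt(2))
    return slope, r_squared, p_value


random.seed(1)
n = 2000

# Hypothetical nitrogen additions (arbitrary units between 0 and 100).
nitrogen = [random.uniform(0, 100) for _ in range(n)]

# Simulated yield: a weak true effect buried in much larger noise.
crop_yield = [5.0 + 0.003 * x + random.gauss(0, 0.5) for x in nitrogen]

slope, r_squared, p_value = simple_regression(nitrogen, crop_yield)

print(f"p-value:              {p_value:.2e}")
print(f"coefficient estimate: {slope:.4f} yield units per unit nitrogen")
print(f"R^2:                  {r_squared:.3f}")
```

With a large enough sample, even an effect that explains only a few percent of the variation will typically produce a very small p-value, which is why the coefficient estimate and R² deserve to be reported alongside it.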

Many ‘exploratory’ analyses are portrayed as hypothesis testing, e.g.,

- doing some data mining (also called data dredging) until you find a statistical relationship, then testing it for statistical significance;
- doing variable selection first and then significance testing on the selected variables, in which case your p-values will be much lower than they should be.
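The inflation caused by selecting first and testing afterwards can be shown with a small simulation. The sketch below (standard library only) generates a response and 20 candidate predictors that are all pure noise, keeps the predictor most correlated with the response, and then computes its naive p-value as if it had been chosen in advance. The sample size, the number of predictors, the number of simulations, and the helper names are all assumptions chosen for illustration. The p-value again uses a large-sample normal approximation.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for H0: no correlation (normal approximation)."""
    t = abs(r) * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(t / math.sqrt(2))


def dredge_once(rng, n_obs=100, n_predictors=20):
    """Select the noise predictor most correlated with a noise response
    and return the naive p-value that ignores the selection step."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    best_r = 0.0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        r = pearson_r(x, y)
        if abs(r) > abs(best_r):
            best_r = r
    return approx_p_value(best_r, n_obs)


rng = random.Random(42)
n_sims = 300
naive_p_values = [dredge_once(rng) for _ in range(n_sims)]

# Every predictor is noise, so any "significant" result is a false positive.
false_positive_rate = sum(p < 0.05 for p in naive_p_values) / n_sims
print(f"Share of simulations with naive p < 0.05: {false_positive_rate:.2f}")
```

Because there is no real relationship anywhere in the simulated data, an honest test should reject at about the nominal 5% rate. With 20 independent candidates, the chance that at least one reaches p < 0.05 is roughly 1 - 0.95^20 ≈ 0.64, so the selected predictor looks significant far more often than it should.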

## Example techniques

### 1. Hypothesis testing

- present full model
- model selection

### 2. Prediction

- present full model
- model selection
- regression over ANOVA

### 3. Exploration

- regression trees
- spline regression
- principal component analysis
- variable selection