# Installing R

R is a free, open-source programming language. You can find download instructions here:

- Try to install it on your personal computer

Note: If you are using a Debian-based Linux distribution, installing R is simple:

sudo apt-get install r-base r-base-core

# Using R for the course

- If R is installed on the university computers, you can use it there.
- You can also log in to:

139.91.162.50:8787

# Statistics for Bioinformatics

So just read and try: it’s the best way to learn how to apply statistics to bioinformatics problems.

# Why Statistics in Biology

- Many students choose biology to avoid anything related to math.
- So why do you now have to learn statistics and apply it to biological data?

# Statistics is the Science of Learning from Data

You use a small collection (a sample) to learn something about the whole (the population).

Sample (Wikipedia):

In statistics and quantitative research methodology, a **data sample** is a set of data collected and/or selected from a statistical population by a defined procedure.

A sample can be **random**: The chance of sampling an object does not depend on the properties of the object.

- The sample is taken from a population.
- The population can be assumed to be large, perhaps infinite, and unknown.
- We are interested in understanding parameters of the population.
- We study the sample in order to understand the population.
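The idea of learning about a population from a sample can be sketched in R. This is only an illustration: the "population" below is simulated with made-up parameters (mean 10, standard deviation 2), whereas in practice the population is unknown and only the sample is available.

```r
# Simulated "population" of 100,000 values (made-up parameters, for illustration only)
set.seed(1)
population <- rnorm(100000, mean = 10, sd = 2)

# Random sample: every element has the same chance of being selected
smp <- sample(population, size = 50)

# The sample mean is used to estimate the (normally unknown) population mean
sample_mean <- mean(smp)
population_mean <- mean(population)
c(sample = sample_mean, population = population_mean)
```

The two means should be close but not identical. Repeating the sampling with a different seed gives a slightly different estimate each time.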

# Let’s start with a real example: Diet-Induced Obesity in Mice

You can click on the dataset. Then, go to the bottom of the page:

**Experiment design and value distribution**

Click on the experimental design. You will see something similar to this:

This represents the design of the study. Let’s try to understand it:

- 3 protocols: baseline, normal diet, high-fat diet
- several time-points (after birth)

**Notes**

- When we compare the diets, the baseline can be neglected (there is no diet there)
- Expression values of the same gene differ between samples due to:
  - different time-point
  - different treatment (diet)
  - stochasticity

# Understanding our data

## Descriptive quantities for the gene expression values

In R, it is very easy to calculate descriptive statistics:

- mean, variance, standard deviation, percentiles

It is also very easy to summarize the data using:

- boxplots, histograms, densities
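As a minimal sketch, the quantities and plots listed above can be obtained with base R functions. The vector `expr` is simulated here and only stands in for the expression values of one gene; it is not taken from the mouse dataset.

```r
set.seed(42)
# Hypothetical expression values of one gene in 20 samples (simulated)
expr <- rnorm(20, mean = 8, sd = 1.5)

# Descriptive quantities
m <- mean(expr)                                   # mean
v <- var(expr)                                    # variance
s <- sd(expr)                                     # standard deviation
q <- quantile(expr, probs = c(0.25, 0.5, 0.75))   # percentiles

# Graphical summaries
boxplot(expr, main = "Boxplot", ylab = "expression")
hist(expr, main = "Histogram", xlab = "expression")
plot(density(expr), main = "Density", xlab = "expression")
```

`summary(expr)` prints several of these quantities at once.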

## Visualization

With R it is easy to visualize the data:

- Principal Component Analysis
- Multidimensional Scaling
- Heatmaps
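The three techniques above are all available in base R. The sketch below uses a simulated expression matrix (50 hypothetical genes, 6 hypothetical samples) with an artificial shift in three samples; the matrix, its dimensions, and the shift are assumptions made for illustration only.

```r
set.seed(7)
# Hypothetical expression matrix: 50 genes (rows) x 6 samples (columns)
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50),
                               paste0("sample", 1:6)))
# Shift the first 10 genes in samples 4-6 to mimic a treatment effect
expr[1:10, 4:6] <- expr[1:10, 4:6] + 3

# Principal Component Analysis: prcomp expects samples in rows, so transpose
pca <- prcomp(t(expr))
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")

# Classical Multidimensional Scaling on Euclidean distances between samples
mds <- cmdscale(dist(t(expr)), k = 2)
plot(mds[, 1], mds[, 2], xlab = "Coordinate 1", ylab = "Coordinate 2")

# Heatmap of the expression matrix (rows and columns are clustered by default)
heatmap(expr)
```

In the PCA and MDS plots each point is one sample, so samples with similar expression profiles are expected to lie close to each other.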

# Hypothesis testing

Now, it’s time to perform some hypothesis testing:

For example,

- the expression of geneA is higher in high-fat diet mice than in normal diet mice

In general, to perform a hypothesis test we need:

- A hypothesis to test, that **is formed prior to seeing the data**
- Data
- A statistic
- A theoretical distribution of the statistic under the null hypothesis
- A significance threshold
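The ingredients above can be put together in a short R sketch. The choice of test is an assumption: a one-sided Welch t-test is used here as one common option, and the data for the hypothetical geneA are simulated rather than taken from the mouse study.

```r
set.seed(123)
# Data: simulated expression of a hypothetical geneA (made-up values)
normal_diet  <- rnorm(8, mean = 5, sd = 1)
highfat_diet <- rnorm(8, mean = 7, sd = 1)

# Significance threshold, fixed before looking at the data
alpha <- 0.05

# Hypothesis: geneA is higher in high-fat diet mice (null: the means are equal)
# Statistic: Welch's t; under the null it approximately follows a t distribution
res <- t.test(highfat_diet, normal_diet, alternative = "greater")

res$statistic   # observed value of the statistic
res$parameter   # degrees of freedom of the null distribution
res$p.value     # probability of a value at least this extreme under the null

# Decision: reject the null hypothesis if the p-value is below the threshold
reject_null <- res$p.value < alpha
reject_null
```

Each line maps onto one item of the list: the hypothesis is encoded in `alternative`, the statistic and its null distribution are handled by `t.test`, and `alpha` is the significance threshold.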