Installing R
R is a free and open source programming language. You can find download instructions here:
- Try to install it on your personal computer
Note: If you are using Linux (a Debian-based distribution), installing R is simple:
sudo apt-get install r-base r-base-core
Using R for the course
- If R is installed on the university computers, you can use R there.
- You can also log in to:
139.91.162.50:8787
Statistics for Bioinformatics
So, just read and try: it is the best way to learn how to apply statistics to bioinformatics problems.
Why Statistics in Biology
- Many students go to Biology to avoid anything related to math.
- So why do you now have to learn statistics and apply it to biological data?
Statistics is the Science of Learning from Data
You use a small collection (a sample) to learn something about the whole (the population).
Sample (Wikipedia):
In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure.[1]
A sample can be random: The chance of sampling an object does not depend on the properties of the object.
- The sample is taken from a population.
- The population can be assumed to be large, perhaps infinite, and unknown.
- We are interested in understanding parameters of the population.
- We study the sample in order to understand the population.
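The relation between sample and population can be tried out directly in R. The sketch below is only an illustration: the "population" is simulated with `rnorm`, and its size, mean and standard deviation are arbitrary assumptions, not values from any real dataset. In practice the population is unknown and only the sample is available.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical population: 100,000 simulated values (assumed mean 10, sd 2).
population <- rnorm(100000, mean = 10, sd = 2)

# Random sample: every value has the same chance of being picked,
# independently of its properties.
sample_values <- sample(population, size = 50)

# The sample mean is our estimate of the (normally unknown) population mean.
pop_mean <- mean(population)
sample_mean <- mean(sample_values)

cat("Population mean:", round(pop_mean, 2), "\n")
cat("Sample mean:    ", round(sample_mean, 2), "\n")
```

Re-running the last lines with a different `size` shows how the estimate tends to get closer to the population value as the sample grows.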
Let’s start with a real example: Diet-Induced Obesity in Mice
You can click on the dataset. Then, go to the bottom of the page:
Experiment design and value distribution
Click on the experimental design. You will see something similar to this:
This represents the design of the study. Let’s try to understand it:
- 3 protocols: baseline, normal diet, high-fat diet
- several time-points (after birth)
Notes
- When we compare the diets, the baseline can be ignored (no diet is applied there)
- Expression values of the same gene differ between samples due to:
- different time-point
- different treatment (diet)
- stochasticity
Understanding our data
Descriptive quantities for the gene expression values
In R, it is very easy to calculate descriptive statistics:
- mean, variance, standard deviation, percentiles
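As a minimal sketch, these quantities can be computed with base R functions. The vector `expr` below is simulated and stands in for the expression values of one hypothetical gene; it is not taken from the mouse dataset.

```r
set.seed(42)

# Simulated expression values of one hypothetical gene across 20 samples.
expr <- rnorm(20, mean = 8, sd = 1.5)

expr_mean <- mean(expr)   # arithmetic mean
expr_var  <- var(expr)    # sample variance (divides by n - 1)
expr_sd   <- sd(expr)     # standard deviation = sqrt(variance)

# Percentiles: 25th, 50th (the median) and 75th.
expr_q <- quantile(expr, probs = c(0.25, 0.5, 0.75))

print(c(mean = expr_mean, variance = expr_var, sd = expr_sd))
print(expr_q)

# summary() reports minimum, quartiles, mean and maximum in one call.
print(summary(expr))
```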
It is also very easy to summarize the data using:
- boxplots, histograms, densities
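The three kinds of plot can be drawn with base R graphics. The following is a sketch on simulated values for two hypothetical diet groups; the group names, sizes and means are assumptions chosen only for illustration.

```r
set.seed(7)

# Simulated expression values of one gene in two hypothetical groups.
normal_diet <- rnorm(30, mean = 8, sd = 1)
high_fat    <- rnorm(30, mean = 9, sd = 1)

# Draw three panels side by side and remember the previous settings.
old_par <- par(mfrow = c(1, 3))

# Boxplot: median, quartiles and outliers for each group.
b <- boxplot(list(normal = normal_diet, high_fat = high_fat),
             ylab = "Expression", main = "Boxplot")

# Histogram: counts of values falling in each bin.
h <- hist(high_fat, xlab = "Expression", main = "Histogram")

# Density: a smoothed estimate of the distribution.
d <- density(high_fat)
plot(d, main = "Density")

par(old_par)  # restore the previous graphics settings
```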
Visualization
With R it is easy to visualize the data:
- Principal Component Analysis
- Multidimensional Scaling
- Heatmaps
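A possible way to produce these three views with base R is sketched below, using `prcomp` for PCA, `cmdscale` for classical multidimensional scaling and `heatmap`. The expression matrix is simulated (50 hypothetical genes in 6 samples), and the gene names, sample names and the shift applied to the high-fat samples are assumptions for illustration only.

```r
set.seed(3)

# Simulated matrix: 50 hypothetical genes (rows) x 6 samples (columns).
sample_names <- c(paste0("normal", 1:3), paste0("highfat", 1:3))
expr <- matrix(rnorm(50 * 6), nrow = 50,
               dimnames = list(paste0("gene", 1:50), sample_names))

# Shift the first 20 genes upwards in the high-fat samples,
# so that the two groups should separate in the plots.
expr[1:20, 4:6] <- expr[1:20, 4:6] + 3

group_col <- rep(c("blue", "red"), each = 3)

# PCA on the samples: prcomp expects observations in rows, so transpose.
pca <- prcomp(t(expr))
plot(pca$x[, 1], pca$x[, 2], col = group_col, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "PCA")

# Classical MDS on the Euclidean distances between samples.
mds <- cmdscale(dist(t(expr)), k = 2)
plot(mds[, 1], mds[, 2], col = group_col, pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "MDS")

# Heatmap with hierarchical clustering of both genes and samples.
heatmap(expr)
```

With Euclidean distances, classical MDS gives the same configuration as PCA up to the sign of the axes, so the first two plots should look alike here.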
Hypothesis testing
Now, it’s time to perform some hypothesis testing:
For example,
- The expression of geneA is higher in high-fat diet mice than in normal diet mice
In general, to perform a hypothesis test we need:
- A hypothesis to test, formed before seeing the data
- Data
- A statistic
- A theoretical distribution of the statistic under the null hypothesis
- A significance threshold
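The ingredients listed above can be put together in a short sketch. It uses Welch's two-sample t-test (`t.test`) as one possible choice of statistic; the data for geneA are simulated, and the group sizes, means and the 0.05 threshold are assumptions for illustration.

```r
set.seed(11)

# Hypothesis (fixed before looking at the data):
#   H0: mean expression of geneA is the same in both diets
#   H1: mean expression of geneA is higher in the high-fat diet

# Data: simulated geneA expression in two hypothetical groups of 10 mice.
normal_diet <- rnorm(10, mean = 8,   sd = 1)
high_fat    <- rnorm(10, mean = 9.5, sd = 1)

# Significance threshold, also chosen in advance.
alpha <- 0.05

# Statistic: Welch's t statistic, one-sided test (high_fat > normal_diet).
result <- t.test(high_fat, normal_diet, alternative = "greater")
t_stat <- unname(result$statistic)
df     <- unname(result$parameter)

# Theoretical distribution under H0: Student's t with df degrees of freedom.
# The p-value is the probability of a statistic at least this large.
p_manual <- pt(t_stat, df = df, lower.tail = FALSE)

reject_null <- result$p.value < alpha

cat("t =", round(t_stat, 3), " df =", round(df, 2),
    " p-value =", signif(result$p.value, 3), "\n")
cat("Reject H0 at alpha =", alpha, ":", reject_null, "\n")
```

The manual p-value `p_manual` matches the one reported by `t.test`, which makes the role of the null distribution explicit.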