Statistics in Bioinformatics using R

Installing R

R is a free, open-source programming language. You can find download instructions here:

https://www.r-project.org/

  • Try to install it on your personal computer

Note: If you are using Linux (a Debian-based distribution), installing R is really simple:

sudo apt-get install r-base r-base-core

Using R for the course

  • If R is installed on the university computers, you can use it there.
  • You can also log in to:
    139.91.162.50:8787

 

Statistics for Bioinformatics

So just read and try. It’s the best way to learn how to apply statistics to bioinformatics problems.

 

Why Statistics in Biology

  • Many students choose Biology to avoid anything related to math.
  • Why do you now have to learn statistics and apply it to biological data?

Statistics is the Science of Learning from Data

You use a small collection (a sample) to learn something about the whole (the population).

Sample (Wikipedia):
In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure.[1]

A sample can be random: The chance of sampling an object does not depend on the properties of the object.

  • The sample is taken from a population.
  • The population can be assumed to be large, perhaps infinite, and unknown.
  • We are interested in understanding parameters of the population.
  • We study the sample in order to understand the population.
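The idea of learning about a population from a random sample can be sketched in R. This is only an illustration: the "population" below is simulated with arbitrary, assumed parameters (mean 8, standard deviation 2) and is not real expression data.

```r
set.seed(1)  # make the simulation reproducible

# Hypothetical population: 100,000 simulated values (assumption, not real data)
population <- rnorm(100000, mean = 8, sd = 2)

# Simple random sample: every element has the same chance of being drawn,
# regardless of its value
smp <- sample(population, size = 50)

# The sample mean is an estimate of the (normally unknown) population mean
mean(smp)
mean(population)

# Standard error of the sample mean: the typical size of the estimation error
sd(smp) / sqrt(length(smp))
```

In a real study only the sample is available; the population mean is printed here just to show how close the estimate is.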

 

Let’s start with a real example: Diet-Induced Obesity in Mice

[Screenshot: the dataset]

You can click on the dataset. Then, go to the bottom of the page:
Experiment design and value distribution

Click on the experimental design. You will see something similar to this:

[Screenshot: the experimental design]

This represents the design of the study. Let’s try to understand it:

  • 3 protocols: baseline, normal diet, high-fat diet
  • several time-points (after birth)

Notes

  • When we compare the diets, the baseline can be ignored (there is no diet there)
  • Expression values of the same gene differ between samples due to:
    • different time-point
    • different treatment (diet)
    • stochasticity

 

Understanding our data

Descriptive quantities for the gene expression values

In R, it is very easy to calculate descriptive statistics:

  • mean, variance, standard deviation, percentiles
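As a minimal sketch, the quantities listed above can be computed with base R functions. The vector `expr` is a hypothetical stand-in for the expression values of one gene; it is simulated here and not taken from the dataset.

```r
set.seed(2)  # make the simulation reproducible

# Hypothetical expression values of one gene in 20 samples (simulated)
expr <- rnorm(20, mean = 7, sd = 1.5)

mean(expr)                                   # arithmetic mean
var(expr)                                    # sample variance (divides by n - 1)
sd(expr)                                     # standard deviation = sqrt(variance)
quantile(expr, probs = c(0.25, 0.5, 0.75))   # 25th, 50th and 75th percentiles
summary(expr)                                # min, quartiles, mean and max at once
```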

It is also very easy to summarize the data using:

  • boxplots, histograms, densities
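A short sketch of these three summaries using base R graphics follows. The values and the two diet labels are simulated assumptions, chosen only so that the plots have something to show.

```r
set.seed(3)  # make the simulation reproducible

# Hypothetical expression values for two diet groups (simulated, 30 samples each)
expr <- c(rnorm(30, mean = 7, sd = 1), rnorm(30, mean = 8, sd = 1))
diet <- factor(rep(c("normal", "high_fat"), each = 30))

# Boxplot: median, quartiles and potential outliers for each group
boxplot(expr ~ diet, xlab = "Diet", ylab = "Expression")

# Histogram: number of values falling in each bin
hist(expr, breaks = 15, main = "Histogram of expression", xlab = "Expression")

# Kernel density estimate: a smoothed version of the histogram
plot(density(expr), main = "Density of expression")
```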

Visualization

With R it is easy to visualize the data:

  • Principal Component Analysis
  • Multidimensional Scaling
  • Heatmaps
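The three techniques above are available in base R as `prcomp`, `cmdscale` and `heatmap`. The sketch below assumes a hypothetical expression matrix with samples in rows and genes in columns; the data are simulated, and the first 10 genes are shifted in one group so that there is some structure to find.

```r
set.seed(4)  # make the simulation reproducible

# Hypothetical expression matrix: 12 samples (rows) x 50 genes (columns), simulated
expr <- matrix(rnorm(12 * 50), nrow = 12,
               dimnames = list(paste0("sample", 1:12), paste0("gene", 1:50)))
group <- rep(c("normal", "high_fat"), each = 6)

# Shift the first 10 genes in the high-fat group (arbitrary, assumed effect)
expr[group == "high_fat", 1:10] <- expr[group == "high_fat", 1:10] + 2

cols <- ifelse(group == "high_fat", "red", "blue")

# Principal Component Analysis on centered and scaled genes
pca <- prcomp(expr, center = TRUE, scale. = TRUE)
plot(pca$x[, 1], pca$x[, 2], col = cols, pch = 19, xlab = "PC1", ylab = "PC2")

# Classical Multidimensional Scaling on Euclidean distances between samples
mds <- cmdscale(dist(expr), k = 2)
plot(mds[, 1], mds[, 2], col = cols, pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")

# Heatmap with genes as rows and samples as columns, colored by group
heatmap(t(expr), ColSideColors = cols, scale = "row")
```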

Hypothesis testing

Now, it’s time to perform some hypothesis testing:

For example,

  • The expression of geneA is higher in high-fat diet mice than in normal diet mice

In general, to perform a hypothesis test we need:

  • A hypothesis to test, formed before seeing the data
  • Data
  • A statistic
  • A theoretical distribution of the statistic under the null hypothesis
  • A significance threshold
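The ingredients listed above can be put together in a small R sketch for the geneA example. Welch's t-test is used here only as one possible choice of statistic, not necessarily the test used in the course; the expression values are simulated and the threshold of 0.05 is an assumption.

```r
set.seed(5)  # make the simulation reproducible

# Data: hypothetical geneA expression values (simulated, not from the dataset)
normal   <- rnorm(8, mean = 7.0, sd = 1)
high_fat <- rnorm(8, mean = 8.5, sd = 1)

# Significance threshold, fixed before looking at the data (assumed value)
alpha <- 0.05

# Hypothesis: H0: mean(high_fat) <= mean(normal)  vs  H1: mean(high_fat) > mean(normal)
# Statistic: Welch's t; under H0 it approximately follows a Student's t distribution
res <- t.test(high_fat, normal, alternative = "greater")

res$statistic        # observed value of the t statistic
res$p.value          # probability, under H0, of a statistic at least this large
res$p.value < alpha  # TRUE means H0 is rejected at level alpha
```

The one-sided alternative (`alternative = "greater"`) matches the directional hypothesis that geneA is higher in the high-fat group; a two-sided test would be the choice if no direction had been stated in advance.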