Analysis of Microarray Data

Gene Expression Analysis from microarray data

Introduction / Biology Background

Chronic high-level alcohol consumption effect on brain

Chronic high-level alcohol consumption, as seen in alcoholism, has dramatic effects on several organs and tissues of the human body.

A more specific question is what effect chronic alcohol consumption has on the human brain. Chronic high-level alcohol consumption leads to decreased white matter and inhibition of neurogenesis.

The hippocampus (Figure 1, in light blue) is a very important region of the human brain.

To obtain insights into the effects of alcoholism on the human brain, researchers analyzed gene expression in post-mortem hippocampus from alcoholic men and women and from control (non-alcoholic) men and women.



  • Find and download the dataset from NCBI entitled
    “Chronic high-level alcohol consumption effect on brain: post-mortem hippocampus”. Please download the .SOFT file (NOT the full SOFT file).
  • Remove the unnecessary lines at the beginning and the end of the dataset. These lines start with one of the symbols: # ! ^
  • Read the dataset into R. Both “NA” and “null” should be treated as NA strings (hint: na.strings).
  • Choose only a subset of lines for the final dataset:
    all lines from 1 – 10000 that contain no “NA” or “null” (hint: complete.cases and which).
  1. The first two columns are not necessary for the analysis. The second column, however, contains the gene names, which you will need later for the presentation of the results. Save it in a variable called gene.names
  2. The final dataset that we will analyze should consist only of the following subset of 24 samples:



“GSM1085677” “GSM1085681” “GSM1085685” “GSM1085689” “GSM1085695” “GSM1085698”

“GSM1085673” “GSM1085679” “GSM1085694” “GSM1085696” “GSM1085699” “GSM1085701”

“GSM1085666” “GSM1085668” “GSM1085670” “GSM1085671” “GSM1085674” “GSM1085678”

“GSM1085665” “GSM1085667” “GSM1085669” “GSM1085672” “GSM1085675” “GSM1085676”

In the order listed, these are:

1 – 6 (first row): normal female

7 – 12 (second row): alcoholic female

13 – 18 (third row): normal male

19 – 24 (fourth row): alcoholic male
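The preparation steps above could be sketched in R roughly as follows. Since the name and exact contents of the downloaded file are not given here, the sketch first writes a tiny made-up SOFT-like file to a temporary location. The file contents, the column names ID_REF and IDENTIFIER, the toy sample names GSM1 to GSM3 and all variable names except gene.names are assumptions for illustration only. With the real data, soft.file would point at the downloaded file and samples would hold the 24 sample IDs listed above.

```r
# Toy stand-in for the downloaded .soft file (contents are made up).
soft.file <- tempfile(fileext = ".soft")
writeLines(c(
  "^DATABASE = Geo",
  "!dataset_title = toy example",
  "#ID_REF = Platform reference identifier",
  "!dataset_table_begin",
  "ID_REF\tIDENTIFIER\tGSM1\tGSM2\tGSM3",
  "p1\tgeneA\t5.1\t6.2\t5.9",
  "p2\tgeneB\tnull\t7.0\t6.8",
  "p3\tgeneC\t4.4\tNA\t4.9",
  "p4\tgeneD\t8.3\t8.1\t8.0",
  "!dataset_table_end"
), soft.file)

# Drop every line that starts with #, ! or ^.
all.lines  <- readLines(soft.file)
data.lines <- all.lines[!grepl("^[#!^]", all.lines)]

# Read the remaining tab-separated table; "NA" and "null" both become NA.
raw <- read.delim(text = data.lines, na.strings = c("NA", "null"),
                  stringsAsFactors = FALSE)

# Among the first 10000 rows, keep those without any missing value.
first.rows <- seq_len(min(10000, nrow(raw)))
keep       <- which(complete.cases(raw[first.rows, , drop = FALSE]))
clean      <- raw[keep, ]

# Save the gene names (second column), then drop the first two columns.
gene.names <- clean[[2]]
expr       <- as.matrix(clean[, -(1:2)])

# Keep only the wanted samples, in the wanted order.
samples <- c("GSM1", "GSM3")   # with the real data: the 24 GSM IDs
expr    <- expr[, samples, drop = FALSE]
```

Selecting the columns by name rather than by position keeps the sample order identical to the list above, which the later group assignments (1 – 12 female, 13 – 24 male) rely on.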

Make a boxplot of the samples and answer the following question:

a) Is the dataset normalized? How did you figure out if it is or it is not normalized?
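One possible way to draw the boxplot is sketched below. The matrix expr here is simulated and only stands in for the real 24-sample matrix; the colours, labels and the helper variables group, cols and sample.medians are assumptions for illustration. As a rule of thumb, samples from a normalized dataset tend to show boxes with similar medians and similar spread, whereas clearly shifted boxes would suggest that normalization is still needed.

```r
set.seed(1)
# Simulated stand-in for the 24-sample expression matrix (values are made up).
expr <- matrix(rnorm(200 * 24, mean = 8, sd = 1.5), nrow = 200,
               dimnames = list(NULL, paste0("S", 1:24)))

# One colour per group of six samples, in the order used in this exercise.
group <- rep(c("normal female", "alcoholic female",
               "normal male", "alcoholic male"), each = 6)
cols  <- rep(c("pink", "red", "lightblue", "blue"), each = 6)

# One box per sample (column); las = 2 turns the sample labels vertical.
boxplot(expr, las = 2, col = cols, ylab = "expression value",
        main = "Per-sample distributions")

# Numeric companion to the plot: how far apart are the sample medians?
sample.medians <- apply(expr, 2, median)
range(sample.medians)
```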


Obtaining an overview of the dataset

1. Construct a heatmap of all samples

Explain the heatmap
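A minimal sketch of a heatmap with the base R function heatmap is given below, assuming this is the function meant in step 1. The data are simulated, with a made-up shift added to ten genes in samples 13 – 24 so that some structure is visible; with the real data, expr would be the prepared 24-sample matrix. Note that heatmap clusters both rows and columns and, by default, scales each row for display (scale = "row").

```r
set.seed(2)
# Simulated stand-in for the expression matrix (values are made up).
expr <- matrix(rnorm(50 * 24, mean = 8), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("S", 1:24)))
# Made-up group effect: ten genes are higher in samples 13-24.
expr[1:10, 13:24] <- expr[1:10, 13:24] + 3

# Rows are genes, columns are samples; both get a dendrogram.
hm <- heatmap(expr, margins = c(5, 5))

# Sample order along the x-axis after clustering.
colnames(expr)[hm$colInd]
```

When explaining the plot, the column dendrogram shows which samples are most similar to each other, and the row dendrogram groups genes with similar expression patterns.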

2. Construct a heatmap of all samples using the function heatmap.2 from the library gplots. Use scale = ‘none’ and trace = ‘none’.
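The call to heatmap.2 might look like the sketch below. The data are again simulated and stand in for the real matrix. The check with requireNamespace is an addition so that the sketch does not fail where gplots is not installed; with gplots available, library(gplots) followed by heatmap.2(...) works the same way.

```r
set.seed(2)
# Simulated stand-in for the expression matrix (values are made up).
expr <- matrix(rnorm(50 * 24, mean = 8), nrow = 50,
               dimnames = list(paste0("gene", 1:50), paste0("S", 1:24)))
expr[1:10, 13:24] <- expr[1:10, 13:24] + 3   # made-up group effect

if (requireNamespace("gplots", quietly = TRUE)) {
  # scale = "none": colours show the raw values, no row/column scaling.
  # trace = "none": suppress the trace lines drawn over the cells.
  gplots::heatmap.2(expr, scale = "none", trace = "none")
} else {
  message("Package 'gplots' is not installed; skipping heatmap.2.")
}
```

Because no scaling is applied here, the colours reflect absolute expression levels, so this picture can differ noticeably from the row-scaled default of heatmap.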

Perform a PCA using the function prcomp (hint: prcomp expects the samples in rows, so it needs the transposed matrix).
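A PCA along these lines could be sketched as follows. The data are simulated with a made-up shift in samples 13 – 24, and the variable names pca, var.explained and sex are assumptions for illustration. The expression matrix has genes in rows and samples in columns, so it is transposed before being passed to prcomp.

```r
set.seed(3)
# Simulated stand-in for the expression matrix (values are made up).
expr <- matrix(rnorm(300 * 24, mean = 8), nrow = 300,
               dimnames = list(paste0("gene", 1:300), paste0("S", 1:24)))
expr[1:30, 13:24] <- expr[1:30, 13:24] + 2   # made-up group effect

# prcomp treats rows as observations, so samples must be in rows.
pca <- prcomp(t(expr))

# Fraction of the total variance captured by each principal component.
var.explained <- pca$sdev^2 / sum(pca$sdev^2)

# Plot the samples on the first two components, coloured by sex.
sex <- rep(c("female", "male"), each = 12)
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col  = ifelse(sex == "female", "red", "black"),
     xlab = sprintf("PC1 (%.1f%%)", 100 * var.explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var.explained[2]))
legend("topright", legend = c("female", "male"),
       col = c("red", "black"), pch = 19)
```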


Task 5: Obtaining differentially expressed genes for the classes men vs. women

Samples 1-12 are females, and samples 13-24 are males.

  1. Write a function that applies a t-test to a single gene (one row of the data matrix), comparing males vs. females, and returns only the p-value. Apply this function to all rows of the data matrix and save the results (p-values) in a vector.
  2. Obtain a list of the 100 genes that have the lowest p-values regarding the separation of the classes men vs. women. Write this list to a file (hint: write.table).
  3. What kind of genes would you expect to find in this list?
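Steps 1 and 2 could be sketched as follows. The data are simulated, with five genes given a made-up shift in samples 13 – 24 so that the ranking has something to find; the names row.pvalue, p.values, top, result and out.file, as well as the output file name, are assumptions for illustration. Note that t.test uses the Welch variant (unequal variances) by default.

```r
set.seed(4)
# Simulated stand-in for the data (values and gene names are made up).
n.genes    <- 500
expr       <- matrix(rnorm(n.genes * 24, mean = 8), nrow = n.genes)
gene.names <- paste0("gene", seq_len(n.genes))
expr[1:5, 13:24] <- expr[1:5, 13:24] + 4   # five genes with a made-up shift

female <- 1:12    # columns holding the female samples
male   <- 13:24   # columns holding the male samples

# t-test for one gene (one row); keep only the p-value.
row.pvalue <- function(x) t.test(x[female], x[male])$p.value

# Apply the function to every row of the matrix.
p.values <- apply(expr, 1, row.pvalue)

# Indices of the 100 genes with the lowest p-values, best first.
top    <- order(p.values)[1:100]
result <- data.frame(gene = gene.names[top], p.value = p.values[top])

# Write the list to a tab-separated text file.
out.file <- file.path(tempdir(), "top100_male_vs_female.txt")
write.table(result, out.file, sep = "\t", quote = FALSE, row.names = FALSE)
```

Using order rather than sort keeps the row indices, so the gene names saved earlier in gene.names can be looked up for the selected rows.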