combine SweeD with OmegaPlus to detect common outliers

I will demonstrate how to run the R script combined_plot.R which combines the results of OmegaPlus with SweeD and detects the common outliers. Additionally, given a gff file with gene annotations it prints out the outlier genes.

You will need to download the following zip file in order to run the example.
combined_analysis.zip
Please download it, then unzip it. Then compile the new versions of OmegaPlus and SweeD and PUT THE EXECUTABLES in your PATH; then you can study and execute the run.sh file

The run.sh contains commands to compute the SweeD and Omega statistics and combine them in R. Below, I present the process in more detail and starting from generating the data with ms.

  1. Let’s generate a dataset with Hudson’s ms. We will pretend that this is the dataset that we wish to analyze. Analysis of FASTA, or VCF files will be equivalent. The dataset comprises 100 sequences and 1000 SNPs. It is saved in the file ms1.out

    ms 100 1 -s 1000 > ms1.out
  2. Then, we can run SweeD and OmegaPlus on the ms1.out
    • OmegaPlus
       OmegaPlus -name ms1 -grid 1000 -length 100000 -input ms1.out -minwin 100 -maxwin 10000
    • SweeD
       SweeD -name ms1 -grid 1000 -length 100000 -input ms1.out
  3. Results have beens saved in the files OmegaPlus_Report.ms1 and SweeD_Report.ms1, respectively.

  4. If we view the results, we realize that there is a problem
    //1
    75.00	0.000000
    175.00	0.000000
    275.00	0.000000
    374.99	0.000000
    474.99	1.357743
    574.99	1.099995

    and

    //1
    Position	Likelihood	Alpha
    75	0.000000e+00	1.200000e+03
    174	0.000000e+00	1.212121e-01
    274	0.000000e+00	1.333334e-01
    374	0.000000e+00	1.200000e+00
    474	0.000000e+00	9.230772e-01
    574	0.000000e+00	2.400000e+00
    674	0.000000e+00	3.000001e+00

    The problem is that data files begin with //1. Thus, the files cannot be imported automatically in R because they cannot be recognized as tables.

  5. There are various solutions to this problem:
      • Edit the file to remove all lines from the beginning of the file up to the beginning of the results (i.e. including the line //1)
      • In Linux, use the command:
         cp OmegaPlus_Report.ms1 OmegaPlus_Report.BKUP && grep -v // OmegaPlus_Report.BKUP > OmegaPlus_Report.ms1
         cp SweeD_Report.ms1 SweeD_Report.BKUP && grep -v // SweeD_Report.BKUP > SweeD_Report.ms1

        This will copy the OmegaPlus_Report.ms1 to OmegaPlus_Report.BKUP; then, it will remove the lines contain the ‘//’ and print the results to OmegaPlus_Report.ms1.

      • Use the newest version of OmegaPlus (>= 2.2.4) and SweeD (TODO) with options -noSeparator which does not print the ‘//’ lines.

    For the moment, let’s assume the first solution (just edit the file) to remove the first lines. Resulting files should like like:

    75.00	0.000000
    175.00	0.000000
    275.00	0.000000
    374.99	0.000000
    474.99	1.357743
    574.99	1.099995

    and

    Position	Likelihood	Alpha
    75	0.000000e+00	1.200000e+03
    174	0.000000e+00	1.212121e-01
    274	0.000000e+00	1.333334e-01
    374	0.000000e+00	1.200000e+00
    474	0.000000e+00	9.230772e-01
    574	0.000000e+00	2.400000e+00
    674	0.000000e+00	3.000001e+00

    Now we can start with the combined_plot.R script.

  6. Running the script: the best way to run the script is to use either the RStudio (http://www.rstudio.com/) or emacs with ess. The benefit of these approaches is that you can run the script line-by-line and see the results of the line execution immediatelyAlternatively, you can use:
     R CMD BATCH ./combined_plot.R

    In any case DO NOT FORGET to change the thr variable in order to change the significance cutoff value. Default is 0.1 (in the paper I had used a much lower value).

  7. IMPORTANT: The script needs an annotation gff file in order to print out the outlier genes. You can download this file from here: Genes_July_2010_hg19.gff. If you have another annotation file, please modify the script accordingly.