Valid XHTML     Valid CSS2    

 

R and bioinformatics in 15 minutes

                gilles.hunault "at" univ-angers.fr

 

non su          non su          non su

 

There is a French version available here.

 

Clickable table of contents

  1. Very very short presentation of R

  2. R and bioinformatics including metagenomics

  3. References about R, limited and bioinformatics oriented (on purpose)

  4. Demos with examples via R using RStudio (live)

  5. Short summary : pros and cons of R

 

       As a conclusion, a few good news (how to survive R !)

 

Introduction

After «my PhD in 3 seconds», «all the Britannica Encyclopedia in 20 lines» and «the foundations of rationality of the human individual by Schopenhauer and Nietzsche in less than 120 signs», here is a very very short presentation in 10 minutes plus 5 minutes of interactive session using R with a strong inclination towards bioinformatics.

It is, of course, neither educational nor falsely useless, event if it ressembles a little bit to a theatrical show.

rideauxR15.jpg

1. Very very short presentation of R

R is a free scriptable and programmable software available for Windows, MacOs and Linux. It is exhaustively complete for everything that refers to statistics, which includes, in our conception, biostatistics and bioinformatics.

In R, you will find :

  • hundreds of base commands;

  • the possibility to concatenate commands and to automate the computations within scripts;

  • thousands of ready to use data, even for genomics and metagenomics;

  • data structures adapted to statistics, biostatistics and to bioinformatics: vectors, arrays, matrices, lists, "data frames", time series, statistical models...;

  • execution environnements with or without menus, with automated or partially automated production of HTML, Word, LaTeX or PDF documents;

  • thousands of functions that are already tailored to classical computations and the possibility to define (program) as many new more functions as you want or need;

  • high quality graphics designed for articles, posters and all kinds of publications;

  • a structured organization in packages and in "tasks" to [try to] find your way around;

  • references manuals, users' guides documentations in PDF named vignettes and "real" books (Springer, CRC...);

  • millions of users and professionals, blogs, reference web sites, moocs, videos, lectures, conferences...

2. R and bioinformatics including metagenomics

Bioinformatics analyses spread over a very large spectrum of computations, methods and techniques, whether it is building sequences alignments, the creation of phylogenetic trees, the vizualisation of the results using heatmaps, the differential expression of genes, the analysis of interactions networks or genes networks, the prediction of structures, the genetics of populations, the simultaneous representation as bigraphs of genes and PubMed articles, the computation of genomic distances, euclidian or not...

           ../Introduction_R/heatmap01.png ../Introduction_R/bigraph.png http://mixomics.org/wp-content/uploads/2015/03/network_spls.png
           Heatmap (code) Bigraph (code) Network (code)

R suits really well this kind of computations and analyses because R is a rather open language that can be easily interfaced with command line executable programs. Moreover, the graphical possibilities of R and its ability to read files over the Internet allows it to access dynamically to the huge world wide biological data bases of NCBI, EBI, UNIPROT... without forgetting its dedicated site to bioinformatics named Bioconductor.

            ncbi  pubmed workflows clinical trials

Moreover R is up to date for everything that deals with what is called «Big Data», whether it is False Big Data, also named Statistical Big Data (let's say less than 100Gb of data, but with data tables of dimension (n,p) with p>>n) or True Big Data, for which data can not be put in the main memory of only one computer, and that requires several computers, several servers organized in a cluster or a grid.

                     Disease Model           R Hadoop
          SRAdb metadata 15 Gb            non su data 90 Tb     MGRASTer

At last, centralization and readiness of computations, data, examples and help files make R an unrivalled interactive laboratory full of bioinformatics tools event if the Python language, being more general, is a serious and efficient competitor, especially if you already know how to program...

  cRan 7 400 packages 150 000 fonctions 52 000 graphiques

3. References about R, limited and bioinformatics oriented (on purpose)

If, on the Web, you look for expressions such as bioinformatics, "with R", "using R" and books or filetype:pdf you will find numerous books, sometimes published by well known editors, other times only available on the Web, as you can check below.

   

   

Here are my favorites :

non su non su non su
Bioconductor Case Studies Bioinformatics[...] using R[...] Statistical Bioinformatics with R
(2008, 284 p.) (2005, 474 p.) (2010, 336 p.)
non su non su non su
Bioinformatics with R Cookbook Genomic Data with R Microarays with R
(2014, 300 p.) (2015, 280 p.) (2011, 1036 p.)

4. Demos with examples via R using RStudio (live)

Here is an easy to read R script to read and count and then analyze the number of nucleotides for the gene X94991. It is of course R that reads at the NCBI the sequence of the gene via Internet (all that begins with # is a comment and is therefore ignored by R):


     #############################################
     #                                           #
     # comptage des nucléotides du gène X94991   #
     #                                           #
     #############################################
     #
     # pour le GBSA          (gH). novembre 2015 #
     
     library(ape)                                        # chargement de la librairie
     geneX <- read.GenBank("X94991.1",as.character=TRUE) # lecture du gène demandé
     cnt <- table(geneX)                                 # comptages par type de nucléotide
     print( cnt )                                        # affichage
     print(chisq.test(cnt))                              # calcul du chi-deux d'adéquation
     
     

Hereis the result of the execution 


     geneX
       a   c   g   t
     410 789 573 394
     
     Chi-squared test for given probabilities
     data:  cnt X-squared = 187.07, df = 3, p-value < 2.2e-16
     
     

With the help of a few more commands, one can have a better display and a graphical output for the counts:


     #############################################
     #                                           #
     # comptage des nucléotides du gène X94991   #
     # puis pourcentages et graphiques           #
     #                                           #
     #############################################
     #
     # pour le GBSA          (gH). novembre 2015 #
     
     # tri à plat des nucléotides du gène
     
     tap <- table(geneX)
     
     # répartition détaillée des nucléotides
     
     props        <- as.numeric(prop.table(tap))
     pcts         <- paste(round(100*props,1),"%")
     dfRes        <- data.frame(tap,props,pcts)
     names(dfRes) <- c("Nucléotides","Comptages","Proportions","Pourcentages")
     
     print(dfRes)
     
     # visualisation graphique
     
     lng   <- sum(tap)
     coul  <- rainbow(dim(tap)) # mieux que c("red","green","blue","yellow")
     titre <- paste("Nucléotides du gène X94991\n (longueur=",lng,")",sep="")
     
     barplot(tap,main=titre,col=coul,legend.text=pcts,xlim=c(0,6))
     

This is what we get:


       Nucléotides Comptages Proportions Pourcentages
     1           a       410   0.1892890       18.9 %
     2           c       789   0.3642659       36.4 %
     3           g       573   0.2645429       26.5 %
     4           t       394   0.1819021       18.2 %
     

          genex02.png

Then, if you know a little bit of R programming, you can add some explanations and automatize the computations with some general functions for students and basic users:


     #############################################
     #                                           #
     # comptage des nucléotides du gène X94991   #
     # calcu du chi2 d'équirépartition           #
     #                                           #
     #############################################
     #
     # pour le GBSA          (gH). novembre 2015 #
     
     # un chi-deux détaillé en français avec conclusion automatique
     # issu des fonctions du "p'titgros"
     # disponibles à l'adresse
     #   http://www.info.univ-angers.fr/~gh/wstat/statghfns.php
     
     source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r",encoding="latin1")
     
     # réalisation du chi-deux et explications en français
     
     # il s'agit ici du chi-deux d'adéquation à une distribution théorique
     # (de même effectif global) soit ici la loi uniforme discrète,
     # autrement dit, la loi sous hypothèse d'équirépartition
     
     chi2Adeq(vth=rep(length(geneX[[1]])/4,4),vobs=tap,graph=TRUE)
     
     

      CALCUL DU CHI-DEUX D'ADEQUATION
     
      Valeurs théoriques   541.5 541.5 541.5 541.5
      Valeurs observées    410   789   573   394
      Valeur du chi-deux   187.0674
     
      Chi-deux max (table) à 5 %  7.814728 pour   3  degrés de liberté ; p-value  2.62465e-40
     
      décision : au seuil de  5 % on peut rejeter l'hypothèse
      que les données observées correspondent aux valeurs théoriques.
     
     

Atlas, with the help of well chosen Markdown commands -- for example if you add a the beginning of the  


     ---
     title: bRavo !
     output: all
     ---
     # un exemple vite fait
     ```{r}
     
     
     [...] ici le code R précédent
     

then with only one clic you get the following files :

Word genex04.docx
HTML genex04.html
PDF genex04.pdf

You can check all this by running the following file genex04.rmd 


     ---
     title: "demoR15"
     output: all
     ---
     
     # Gène X94991
     
     ```{r}
     library(ape)                                        # chargement de la librairie
     geneX <- read.GenBank("X94991.1",as.character=TRUE) # lecture du gène demandé
     cnt <- table(geneX)                                 # comptages par type de nucléotide
     print( cnt )                                        # affichage
     print( cnt )                                        # affichage
     ```
     
     # Histogramme des fréquences
     
     ```{r, echo=FALSE}
     # répartition détaillée des nucléotides
     
     tap <- table(geneX)     
     props        <- as.numeric(prop.table(tap))
     pcts         <- paste(round(100*props,1),"%")
     dfRes        <- data.frame(tap,props,pcts)
     names(dfRes) <- c("Nucléotides","Comptages","Proportions","Pourcentages")
     print(dfRes)
          
     # visualisation graphique
          
     lng   <- sum(tap)
     coul  <- rainbow(dim(tap)) # mieux que c("red","green","blue","yellow")
     titre <- paste("Nucléotides du gène X94991\n (longueur=",lng,")",sep="")
          
     barplot(tap,main=titre,col=coul,legend.text=pcts,xlim=c(0,6))
     ```
     # Avec les fonctions (gH)
     ```{r,echo=FALSE,print=FALSE}
     source("http://forge.info.univ-angers.fr/~gh/wstat/statgh.r",encoding="latin1")
     ```
     ```{r}
     # réalisation du chi-deux et explications en français
          
     # il s'agit ici du chi-deux d'adéquation à une distribution théorique
     # (de même effectif global) soit ici la loi uniforme discrète,
     # autrement dit, la loi sous hypothèse d'équirépartition
          
     chi2Adeq(vth=rep(length(geneX[[1]])/4,4),vobs=tap,graph=TRUE)
     ```
     

5. Short summary : pros and cons of R

Pros of R

Very nice statistical tool, free, complete, well documented and well suitable for bioinformatics, research publications and for reproducible research because you don't have to cut and paste results into your articles.

Cons of R

Not so easy to master in a short time. Needs to (but is it really a con?) program to be very efficient.

A few good news
(to survive R!)

To end this presentation, four good news to avoid being all in the same galley with R:

                         ../../Phenotic/galere.jpg

  • The site names Datajoy allows to use R via a Web interface, without any installation.

  • The sites try.codeschool and datacamp deliver self-learning R courses, if the numeros videos and tutorials about R on the Web are not sufficient for your needs.

  • Since many years (precisely from 2011 on), Statistica and SPSS have integrated R in their software; SAS even uses R via SAS/IML :

    non su non su non su
    SAS and R R and Statistica SPSS and R
  • My teachings include R courses for engineers, PhD and post PhDstudents with more than 50 people per year.

    For example you can visit the (French) pages EDA biostats and progR.

    Feel free to see also the page  R for biostatistics in just 5 minutes (!) .

 

 

retour gH    Retour à la page principale de   (gH)