R warm up session

R warm up session for the Bioinformatics Summer School

        gilles.hunault@univ-angers.fr

Remember that with more than 2 million functions, distributed as packages grouped in task views, R is a fantastic tool... but not the only one.

There is even a site dedicated to R and bioinformatics called Bioconductor.

Official documentation (many languages) is here.

Very very short lectures

About R and bioinformatics

in English

in French

About R and [bio]statistics

in English

in French

You can follow this link to get a short reference card for R and RStudio. A local copy is here.

Choose the section that suits you best:

  1. I don't know R

  2. I think I know R for statistics and bioinformatics

  3. I think I know how to program in R

  4. I know Python better than R

1. I don't know R

OK, you need a gentle introduction to R, and practice is the best solution. Read the following document r-intro, based on Frédéric PROIA's warm up document, and type the code in RStudio to see how R answers.

You can load the file warmup.r with all its R code if you don't want to type the code -- but writing things out usually helps you remember them. You can also copy and paste the following lines, one instruction at a time:

#############################################################################################################
#
# warmup.r
#
#############################################################################################################

# For unitary commands, you can work directly in the console (after the automatic symbol >).
# As soon as you deal with a list of commands, it is better to create a script file.

help(mean)
?mean
example(mean)

# Do not hesitate to comment your script (using symbol #), especially when it becomes complicated.

# Now I apply Einstein's relativity to compute the curvature of my space-time
c <- 1 # Well, maybe I'm wrong...

# Try to use R as a calculator, all basic mathematical symbols exist.

(1+2)*(5-3)
(5/2)^3
sqrt(25)
abs(-5)
cos(pi/3) # Degrees or Radians ?
19%%3 # What does it mean ?

# Use an arrow to assign a value to a variable. Look at the variable's content directly
# by its name or through the print function.

a <- 2
print(a)
b <- -3.2
b

# Standard types are automatically handled : integer, float, complex, boolean, string, etc.
# Use class to check the type of a variable. Belonging to the same class provides access
# to the same functions (just like object-oriented programming).

exInt <- 2
class(exInt)
exFlo <- pi
class(exFlo) # Both floats and integers are numeric
exCom <- 1+1i
class(exCom) # But complex numbers are not numeric...
exBoo <- TRUE
class(exBoo)
exStr <- "hello"
class(exStr)

# Try to manipulate and combine them, using the comparison operators and their specific functions.

deg360 <- exInt*exFlo
Mod(exCom) == sqrt(2)
deg360 < pi/2
exCom - Re(exCom) - 1i*Im(exCom)
!exBoo
paste(exStr, "everybody", sep=" ")
exStr + exInt # Hmm, what was I thinking ?

# R has the advantage (or the defect) of carrying out mathematically questionable operations.
# Special values exist to handle such results, do not hesitate to check your calculations.

bigValue <- 1/0
is.finite(bigValue)
is.infinite(bigValue)
undefValue <- 0/0
is.nan(undefValue) # What is NaN ?
sqrt(-1)
is.nan(1i)
1i == sqrt(-1) # What is NA ?
0+1i == sqrt(-1+0i)

# A vector in R is treated as a column of values. Some shortcut functions exist to deal
# with values having a logical progression, and usual operations and comparisons of vectors are available.

V1 <- c(1, 2, 5, -1, 2)
V2 <- 1:5
c(V1, V2)
V1*V2 # Is it a scalar product ?
length(V1)
V1[3]
V2[10] # Why ?
V3 <- seq(-5, 5, by=2)
10:0
t(10:0) # What's the difference ?
V4 <- rep(1, 6)
V3 >= V4
V3%*%V4
which(V3 < 2)
sort(V3)
sort(V3, decreasing=TRUE)

# Sometimes we may be required to change dynamically the length of a vector,
# because we have no prior information on the amount of data to be stored.

EmptyVec <- c() # Empty vector
length(EmptyVec)
EmptyVec <- c(EmptyVec, 0)
length(EmptyVec) # Note that () =/= (0)
EmptyVec <- c(EmptyVec, 1)
EmptyVec <- EmptyVec[-1]
length(EmptyVec)

# We create a matrix from a vector, specifying the number of rows or columns needed.

M <- matrix(c(2, 3, 5, 7, 11, 13), ncol=2)
M
dim(M)
nrow(M)
ncol(M)
N <- matrix(c(2, 3, 5, 7, 11, 13), ncol=2, byrow=TRUE)
N
Z5 <- matrix(0, nrow=5, ncol=5)
I5 <- diag(5)
diag(1:5)
diag(I5) # What's the difference ?
M[1,2]
M[3,4] # Why ?
M[3,]
M[2:3,2]
M[-2,]

# We add rows or columns using rbind and cbind.
# As for vectors, this makes it possible to change the dimensions of the matrix dynamically.

rbind(M, N)
cbind(M, N)

# Like vectors, usual operations and comparisons of matrices are available.
# As shown in the examples below, they have to be used carefully.

M+N
M-N
M/N # What is this strange division between matrices ?
M*N
t(M)%*%N # What's the difference ?
M^2 # Is it a matrix product ?
A <- matrix(c(1, 3, 2, -4), nrow=2)
eigen(A) # How to access values and vectors separately ?
det(A)
solve(A) # Why `solve' to invert ?
A == 1

# A list is a generic vector that may contain different objects having a label.

V <- c(158, 124, 182)
a <- 22
s <- 1.85
n <- "Jon Snow"
Indiv <- list(Name = n, Size = s, Age = a, KilledEnemies = V, isAStark = TRUE)
Indiv
Indiv[[1]]
Indiv$Age
Indiv[[4]][1] <- Indiv[[4]][1]+1 # The fourth element of the list is a vector
Indiv$KilledEnemies
summary(Indiv)

# A dataframe is a generic matrix that may contain different types of rows or columns, having a label.

DF <- data.frame(C1 = 1, C2 = 1:10, C3 = letters[1:10])
DF
colnames(DF)
dim(DF)
DF[3:5,]
DF[-10,]
rbind(DF, c(1, 1, "a"))
DF # Why didn't it change ?
DF <- cbind(DF, 10:1)
colnames(DF)[4] <- "C2inv"
rownames(DF) <- paste("R", 1:10, sep="")
DF["R3","C2inv"]

# For numeric vectors, descriptive statistics are easily handled with the numerous associated functions.

n <- 1000
X <- rnorm(n, mean=3, sd=2)
m <- mean(X)
var(X)
sum((X-m)^2)/n # The difference ?
sum((X-m)^2)/(n-1)
median(X)
quantile(X)
quantile(X, probs=c(0.3, 0.6, 0.9))
min(X)
max(X)

# Like the usual programming languages, R is able to deal with conditions and loops.
# Note that we use == to test for equality whereas we use != to test for difference
# and <, <=, >, >= to test for comparisons.

a <- 1
b <- 2
(a == 1) # Essential, crucial : see the difference between `a = 1' and `a == 1'
(b == 1)
(a != 1)
(a == 1) | (b == 1)
(a == 1) & (b == 1)
!(b == 1)
(b != 1) == !(b == 1) # What ??
(a == 1) | (b == 2)
xor((a == 1), (b == 2)) # What's the difference between `or' and `xor' ?

# The syntax is if (cond) { instr } else { instr } where the else block is optional.
# An ifelse shortcut is also available.

# Let's flip a coin
x <- runif(1)
if (x < 0.5) {
  print("Heads")
} else {
  print("Tails")
}
ifelse(runif(1) < 0.5, "Heads", "Tails")

# The syntax is for (var in seq) { instr }.

# Let's enumerate the alphabet
for (i in 1:length(letters)) {
  print(letters[i])
}

# Note that the sequence is not necessarily numeric, for example we can look through a list.

# What are the registered properties of Indiv ?
for (prop in Indiv) {
  print(prop)
}

# The syntax is while (cond) { instr }.

# Let's compute the sum of the first n terms of a geometric sequence
q <- 1/3
n <- 20
s <- 0
i <- 0
while (i <= (n-1)) {
  s <- s+q^i
  i <- i+1
}
print(paste("Sum :", s))
print((1-q^n)/(1-q)) # Faster ?

# The syntax is repeat { instr if (cond) { break } }.

# Let's compute the terms of an arithmetic sequence until it exceeds N
r <- 1/3
N <- 100
s <- 0
i <- 0
repeat {
  s <- s + r
  i <- i+1
  if (s > N) {
    break
  }
}
print(paste("Index :", i))
print(paste("Value :", s))

## Functions

# We can also define our own functions. The syntax is name = function(arg) { instr return(var) },
# where the return command is optional. Some examples are provided below.

# If your function does not need to return any value, then do not use the return command.

# Flip n coins with heads probability p
flipcoins <- function(n, p) {
  for (i in 1:n) {
    x <- runif(1)
    if (x < p) {
      print("Heads")
    } else {
      print("Tails")
    }
  }
}
flipcoins(10, 0.1)
flipcoins(15, 0.5)
flipcoins(2, 0.9)

# Use return(val) to return the result of a treatment in your function.

# Concatenate 3 vectors into a single matrix
concat <- function(V1, V2, V3) {
  Mat <- cbind(V1, V2, V3)
  return(Mat)
}
M <- concat(c(1,0,0), c(0,1,0), c(0,0,1))
M <- concat(rnorm(10), runif(10), rbinom(10,5,0.2))

# A simple method to produce more than one output is to create a list with all required variables.

# Estimate mean and variance of a sample
estimMV <- function(Sample) {
  m <- mean(Sample)
  v <- var(Sample)
  out <- list(Mean = m, Var = v)
  return(out)
}
Est <- estimMV(rnorm(100, 1, sqrt(3)))
print(Est$Mean)
print(Est$Var)
Est <- estimMV(runif(100, -2, 2))
print(Est$Mean)
print(Est$Var)

## Basic graphic tools

# The usual functions applied to the 2D graphical representations are plot, lines, curve and points.
# Do not hesitate to look at help(plot) to get an overview of the numerous opportunities.
# Try to change pch, col, type, lwd or lty arguments. Look also at xlim,
# ylim, main, xlab or ylab to decorate the graph.

# Discrete representation of f(x) = ln(x^2 + 1/x^2)
X <- seq(-4, 4, by=0.01)
Y <- log(X^2+1/X^2)
plot(X, Y, col="blue")
plot(X, Y, col="blue", pch=3)
plot(X, Y, col="blue", type="l", main="Graph")
plot(X, Y, col="magenta", type="l", lwd=3, lty=2, xlab="Abs. X", ylab="Ord. Y")

# Discrete representations of f(x) = ln(x^2 + 1/x^2) and g(x) = -x^2+6
Z <- -X^2+6
plot(X, Y, col="magenta", type="l", lwd=3, lty=2, xlab="Abs. X", ylab="Ord. Y")
lines(X, Z, type="l", lwd=3, col="red")

# That's... nonsense, really
X <- rnorm(20)
Y <- rexp(20)
plot(X, Y, col="blue", type="p")
points(X+0.1, Y+0.1 , pch=2, col="red")
lines(sort(X), Y, lty=2, col="orange")
text(mean(X), max(Y), "Hello", col="magenta")

# Use of `curve' to get continuous representations of functions of x
curve(sin(x), from=-2*pi, to=2*pi, col="red", lwd=2, xlim=c(-4, 4), ylim=c(-1, 1))
curve(cos(x), from=-2*pi, to=2*pi, col="blue", lwd=2, add=TRUE)

# Add a grid
grid(col="lightgray", lty="dotted")

# Same example as above, with its legend
X <- seq(-4, 4, by=0.01)
Y <- log(X^2+1/X^2)
Z <- -X^2+6
plot(X, Y, col="magenta", type="l", lwd=3, lty=2, xlab="Abs. X", ylab="Ord. Y")
lines(X, Z, type="l", lwd=3, col="red")
legend("topright", c("f(x)", "g(x)"), col=c("magenta", "red"), lwd=c(3, 3), lty=c(2, 1))

## Statistical tools

# Histograms, boxplots, regression lines, kernel densities, ... are also easily available using R.
# Here are some examples.

# Histogram, density and boxplot of a standard Gaussian sample
X <- rnorm(1000)
hist(X, breaks=15, col="lightblue", border="blue", freq=FALSE, xlim=c(-4,4))
lines(density(X), col="red", lwd=2, lty=2)
boxplot(X, main="Boxplot of X", col=c("gold"))

# Regression line of a scatter plot
X <- 0.5*rnorm(100)
E <- rnorm(100)
Y <- 2 + 2.5*X + E
plot(X, Y, type="p", pch=3)
LinReg <- lm(Y~X)
summary(LinReg)

Then try to solve these exercises:

Exercise 1

Use the iris dataset. Compute the mean of the fourth column. Why is the summary function a «good but limited function»?

Compute now the median of the fourth column. Does R help you to decide how to choose between the mean and the median (or another computation) as the best descriptor of the values? Does R show the unit of the petals' width?

Exercise 2

Read the elf.dar data file using the explanations given at the end of the page elf.htm. Convert the SEXE column of the dataframe into a factor: 0=male and 1=female. Use the table and prop.table functions to compute absolute and relative counts. How can you sort them in decreasing order?
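
As a rough sketch of the factor conversion and the sorted counts, assuming a small made-up vector of 0/1 codes in place of the real SEXE column (the values below are invented and are not taken from elf.dar):

```r
# Hypothetical 0/1 codes standing in for the SEXE column (invented values)
sexe_codes <- c(0, 1, 1, 0, 1, 1, 1, 0, 1, 1)

# Convert the codes into a labelled factor: 0 = male, 1 = female
sexe <- factor(sexe_codes, levels = c(0, 1), labels = c("male", "female"))

counts <- table(sexe)          # absolute counts
props  <- prop.table(counts)   # relative counts (proportions)

# sort() also works on tables; decreasing = TRUE puts the largest count first
sorted_counts <- sort(counts, decreasing = TRUE)
print(sorted_counts)
print(round(100 * sort(props, decreasing = TRUE), 1))
```

With the real data, the same calls would be applied to the SEXE column of the data frame read from the file.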

Exercise 3

Compute the GC content of gene X94991.1. Use all nice bioinformatics functions of R to do it with a minimum of instructions.

Hint: use the R code from R15 and install the ape package.

Exercise 4

Use the elf data again. Compare the ages of women and men with the help of the t.test function. Comment on the output.

Does R check that the assumptions of the test are fulfilled?

Exercise 5

What is the purpose of the msaR and AlignStat packages?

Exercise 6

Explain in which cases it is better to use beanplots than boxplots. Use R to show it. You have to install the beanplot package.

Exercise 7

Install and then load the rms package. Why does the installation take so long? Which datasets are included?

This package is associated with a Springer book. What is its name?

Exercise 8

Install and then load the faraway package. Which datasets are included?

This package is associated with a CRC book. What is its name?

How do you remove all rows with NA values from the diabetes dataset of this package in R? Is it a good idea to do so?

Exercise 9

Load the survival package. Check that you don't need to install it. Why? Is there also a book associated with this package?

Try to find the class and the dimensions of two of its datasets, named kidney and leukemia.

How can you find the class and the dimensions of all the datasets of this package?

Exercise 10

Why should you avoid for loops as much as possible in R if you are dealing with columns or rows of data frames?

2. I think I know R for statistics and bioinformatics

There are good and bad practices in R for computing statistics and producing bioinformatics results. Check how you do things with these exercises.

No programming is needed here. Use RStudio to edit and run your code.

Exercise 1

Use the iris dataset. Compute the means of the first four columns with a single instruction.

Can you apply the summary function on the first column for each species with a single instruction?
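
One possible approach, given as a sketch rather than the expected answer: colMeans handles the four numeric columns at once, and tapply applies summary to the first column within each species. The variable names col_means and by_species are only illustrative.

```r
# Means of the first four (numeric) columns in one call
col_means <- colMeans(iris[, 1:4])
print(col_means)

# summary() of the first column, computed separately for each species
by_species <- tapply(iris[, 1], iris$Species, summary)
print(by_species)
```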

Exercise 2

Use the cars dataset. Write a single instruction to have the row names showing Car001 Car002... No loop accepted.
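
A minimal sketch of a loop-free renaming, assuming that zero-padded three-digit numbers are wanted; it works on a copy named cars2 (a name chosen here) so that the built-in dataset stays untouched:

```r
cars2 <- cars  # work on a copy of the built-in dataset

# sprintf is vectorised: "%03d" pads each row number with zeros to 3 digits
rownames(cars2) <- sprintf("Car%03d", seq_len(nrow(cars2)))

print(head(cars2, 3))
```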

Exercise 3

What is the shortest way to see the column names and their indices, as in this example:
     [1,] Sepal.Length
     [2,] Sepal.Width
     [3,] Petal.Length
     [4,] Petal.Width
     [5,] Species
     
You may use the iris dataset. Remember: no programming and no for loop here.

Exercise 4

Read the diabetes dataset at the address http://forge.info.univ-angers.fr/~gh/wstat/Eda/diabetes.dar.

Beware that the first line holds the column names and that the first column gives the row names.

Remove all the rows with NA for the bp.1s variable. Which column then has the most NA values?

Hint: use apply and an anonymous function.
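
The hint can be illustrated on a toy data frame, so that no download is needed. Only bp.1s is a name from the exercise; the other columns and all values are invented for the illustration.

```r
# Toy stand-in for the diabetes data (invented values)
toy <- data.frame(
  chol  = c(203, NA, 228, 78),
  hdl   = c(56, 24, NA, NA),
  bp.1s = c(118, NA, 190, 138)
)

# Keep only the rows where bp.1s is known
kept <- toy[!is.na(toy$bp.1s), ]

# apply over columns (MARGIN = 2) with an anonymous function counting NAs
na_counts <- apply(kept, 2, function(col) sum(is.na(col)))
print(na_counts)

# Name of the column with the largest number of NA values
worst <- names(which.max(na_counts))
print(worst)
```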

Exercise 5

Use the diabetes dataset again. With a single instruction, compute the categorical variable ageCL, based on the rule 'young' if age<18, 'old' otherwise, and add it to this dataset.

When you modify a data frame, what are the differences between the transform and mutate functions?
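
A sketch of the single-instruction idea with base R's transform and the vectorised ifelse, shown on a toy data frame with invented ages (mutate comes from the dplyr package and is not used here):

```r
# Toy data frame with invented ages
toy <- data.frame(age = c(12, 46, 17, 18, 63))

# transform() evaluates the expression inside the data frame and adds the column
toy <- transform(toy, ageCL = ifelse(age < 18, "young", "old"))
print(toy)
```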

Exercise 6

What is the most efficient way to compute and display the maximal value and the number of times it occurs in a vector with many many values? How can you prove it?
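
One way to approach the "prove it" part is to time competing solutions on the same large vector. The sketch below compares a direct max plus comparison with a full table; the vector size and the seed are arbitrary choices, and the timings will differ from machine to machine.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- sample.int(1000, size = 1e6, replace = TRUE)

# Candidate 1: one pass for the maximum, one pass to count its occurrences
max_value <- max(x)
n_max <- sum(x == max_value)
cat("Maximum:", max_value, "occurs", n_max, "times\n")

# Candidate 2: build the whole frequency table, then read its last cell
tab <- table(x)
alt_count <- as.integer(tab[length(tab)])

# Compare the elapsed times of both candidates
print(system.time(sum(x == max(x))))
print(system.time(table(x)))
```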

Exercise 7

Describe an ordinal variable with counts, percentages and cumulative frequencies, without any for loop, as in the following table. Don't forget the NA values.
     Frequency table for the variable  METAVIR_F
                         0      1      2      3      4   <NA>
     Count           250.0  800.0  516.0  364.0  380.0  647.0
     Sums            250.0 1050.0 1566.0 1930.0 2310.0 2957.0
     Percentages       8.5   27.1   17.5   12.3   12.9   21.9
     Cumulative        8.5   35.5   53.0   65.3   78.1  100.0
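
A possible loop-free construction of such a table, sketched on a short invented vector (not the METAVIR_F data). The key assumption is that table is called with useNA = "always" so that missing values get their own column.

```r
# Invented ordinal values with some missing data
x <- c(0, 1, 1, 2, NA, 4, 3, 1, NA, 2)

# Counts per level, with an extra cell for NA
tab <- table(factor(x, levels = 0:4), useNA = "always")
counts <- as.vector(tab)
labels <- names(tab)
labels[is.na(labels)] <- "<NA>"
names(counts) <- labels

pct <- 100 * counts / sum(counts)

# Stack the four rows; cumsum gives the running totals
freq_table <- rbind(
  Count       = counts,
  Sums        = cumsum(counts),
  Percentages = round(pct, 1),
  Cumulative  = round(cumsum(pct), 1)
)
print(freq_table)
```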
     
Exercise 8

Build a graphical description of a continuous variable with the curve for the estimation of the density and the normal candidate curve like the one below.

Make a function of your instructions. Which parameters are needed?
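
A sketch of what such a function might look like. The name plotDensityNormal, its parameters and the colours are choices made here, not part of the exercise; the normal candidate uses the sample mean and standard deviation.

```r
plotDensityNormal <- function(values,
                              main = "Density estimate and normal candidate",
                              xlab = "values") {
  values <- values[!is.na(values)]       # drop missing values
  m <- mean(values)
  s <- sd(values)

  h    <- hist(values, plot = FALSE)     # histogram, computed but not drawn yet
  dens <- density(values)                # kernel density estimate
  gx   <- seq(min(dens$x), max(dens$x), length.out = 200)
  gy   <- dnorm(gx, mean = m, sd = s)    # normal curve with the same mean and sd

  # Draw the histogram on the density scale, tall enough for both curves
  plot(h, freq = FALSE, col = "lightblue", border = "blue",
       main = main, xlab = xlab, ylim = c(0, max(h$density, dens$y, gy)))
  lines(dens, col = "red", lwd = 2)
  lines(gx, gy, col = "darkgreen", lwd = 2, lty = 2)
  legend("topright", c("kernel density", "normal candidate"),
         col = c("red", "darkgreen"), lwd = 2, lty = c(1, 2))

  invisible(list(mean = m, sd = s))
}
```

It could be tried with, for example, plotDensityNormal(iris$Sepal.Length, xlab = "Sepal length").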

Exercise 9

What are the pros and cons of the ggplot2 and lattice packages compared to classical plots in R?

Exercise 10

Are you able to use Shiny and a Jupyter R notebook to have a small app in R? Prove it.

3. I think I know how to program in R

Let's check it. Use RStudio to edit, run, debug and profile (you said you were a programmer, right?) your code.

Exercise 1

Use the iris dataset. Apply the summary function for all numeric columns for each species with a single instruction.

Hint: use an anonymous function for tapply.
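
One reading of the hint, as a sketch: an anonymous function wraps tapply so that it can be applied to each numeric column in turn. Using lapply for the outer step is a choice made here; other combinations are possible.

```r
# For each of the four numeric columns, summarise by species
res <- lapply(iris[, 1:4], function(col) tapply(col, iris$Species, summary))

print(res$Sepal.Length)
```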

Exercise 2

Create a cats function that underlines a string with a given character, using "=" as the default.

Example:
     > cats("First part: descriptive statistics")
     
     First part: descriptive statistics
     ==================================
     
     > cats("Second part: inferential statistics","-")
     
     Second part: inferential statistics
     -----------------------------------
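
A minimal sketch of such a function, assuming the blank lines before and after the title shown in the example are wanted; strrep repeats the underline character once per character of the title.

```r
cats <- function(text, char = "=") {
  # blank line, title, underline of the same length, blank line
  cat("\n", text, "\n", strrep(char, nchar(text)), "\n\n", sep = "")
}

cats("First part: descriptive statistics")
cats("Second part: inferential statistics", "-")
```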
     
     
     
Exercise 3

Create a timer function that prints the date and time before and after executing some code and computes the duration of the execution.
     > timer( myFunction( 10**3 ) )
     
     Start: 08 june 2018 11:13:02 CEST
     [...] output
     Stop:  08 june 2018 11:13:02 CEST
     
     Time difference of 0.001301765 secs
     
Hint: use the ellipsis for the parameter of the function.
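
A sketch of how the ellipsis can be used here: R evaluates arguments lazily, so the code passed through ... only runs when list(...) is evaluated, between the two calls to Sys.time. The date format string is an assumption chosen to resemble the example output.

```r
timer <- function(...) {
  fmt <- "%d %B %Y %H:%M:%S %Z"
  start <- Sys.time()
  cat("Start:", format(start, fmt), "\n")

  results <- list(...)            # forcing the ellipsis runs the code

  stop_time <- Sys.time()
  cat("Stop: ", format(stop_time, fmt), "\n\n")
  print(stop_time - start)        # prints "Time difference of ... secs"

  # return the value of the timed expression without printing it
  invisible(if (length(results) > 0) results[[1]] else NULL)
}

total <- timer(sum(sqrt(1:10^6)))
```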

Exercise 4

Create a function extractPvalue that extracts the p-value of a t-test. For example:
     
     > extractPvalue( t.test(iris2$Sepal.Length ~ iris2$Species2) )
     
     1.866144e-07
     
Create a function extractPvalues that produces a table of all t-tests for the columns of a data frame, using the name of the factor as a parameter. For example:
     
     > extractPvalues( iris2, "Species2" )
     
     Variable        p-value
     Sepal.Length    1.866144e-07
     Sepal.Width     0.001819
     Petal.Length    ...
     Petal.Width
     
The iris2 data frame corresponds to the iris dataset without the "setosa" flowers. Define it with a single instruction.

Hint: in extractPvalues use apply with an anonymous function that calls the extractPvalue function from the previous exercise.
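
A sketch under two assumptions: Species2 is taken to be the species factor with the unused "setosa" level dropped, and sapply over the numeric column names is used in place of apply. The p-value of a t-test is stored in the p.value component of the object returned by t.test.

```r
# iris without the setosa flowers, defined in a single instruction
iris2 <- transform(subset(iris, Species != "setosa"),
                   Species2 = droplevels(Species))

# The result of t.test is a list; the p-value is its p.value component
extractPvalue <- function(test) test$p.value

extractPvalues <- function(df, factorName) {
  numericCols <- names(df)[sapply(df, is.numeric)]
  pvalues <- sapply(numericCols, function(v) {
    extractPvalue(t.test(df[[v]] ~ df[[factorName]]))
  })
  data.frame(Variable = numericCols, p.value = pvalues, row.names = NULL)
}

print(extractPvalue(t.test(iris2$Sepal.Length ~ iris2$Species2)))
print(extractPvalues(iris2, "Species2"))
```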

Exercise 5

Define a function that uses a quantitative variable and two factors and displays the boxplots side by side as in the example of the tooth growth for the guinea pigs found in example("boxplot"). Don't forget the legend.

Exercise 6

The chi-square test function only computes the value of the test statistic; it doesn't show where the main differences between the theoretical and the observed values are. Define a function that details the contribution (theo-obs)²/theo of each level and sorts the levels by relative importance. As usual, no loops. Here is an example:
     Details of the chi-square statistic test value:
     
       Ind.    The Obs     Dif      Cntr       Pct    Cumul
          2 27.500  55 -27.500 27.500000 42.060623 45.50227
          1  6.875  18 -11.125 18.002273 27.534066 18.00227
          3 41.250  21  20.250  9.940909 15.204394 55.44318
          4 27.500  12  15.500  8.736364 13.362069 64.17955
          5  6.875   4   2.875  1.202273  1.838849 65.38182
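
A sketch of such a function. The name chi2Details and its arguments are chosen here; the observed counts and the probabilities 1/16, 4/16, 6/16, 4/16, 1/16 are read off the Obs and The columns of the example. The cumulative column is computed after sorting, which is an assumption and differs from the first two rows of the example output.

```r
chi2Details <- function(obs, p) {
  theo <- sum(obs) * p                 # theoretical counts
  cntr <- (theo - obs)^2 / theo        # contribution of each level
  out <- data.frame(
    Ind  = seq_along(obs),
    The  = theo,
    Obs  = obs,
    Dif  = theo - obs,
    Cntr = cntr,
    Pct  = 100 * cntr / sum(cntr)
  )
  out <- out[order(out$Cntr, decreasing = TRUE), ]  # largest contribution first
  out$Cumul <- cumsum(out$Cntr)
  out
}

obs <- c(18, 55, 21, 12, 4)
p   <- c(1, 4, 6, 4, 1) / 16
details <- chi2Details(obs, p)
print(details, row.names = FALSE)
```

The last value of the Cumul column is the chi-square statistic itself, which can be compared with chisq.test(obs, p = p)$statistic.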
     
Exercise 7

Define a function that displays all the subsets of a given set. For example, for givenSet <- c("a","b","c") it must display something like:
     1/8  empty set
     2/8  { a }
     3/8  { b }
     4/8  { c }
     5/8  { a , b }
     6/8  { a , c }
     7/8  { b , c }
     8/8  { a , b , c }
     
The subsets must be numbered and produced with an increasing number of elements (variant: decreasing) and displayed in alphabetic order.

Which parts of biostatistics or bioinformatics may need these subsets? Try to produce both iterative and recursive solutions.
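
One non-recursive sketch relies on combn, which enumerates the combinations of a given size in lexicographic order when the input is sorted. The function name showSubsets is chosen here, and a recursive variant is left as part of the exercise.

```r
showSubsets <- function(givenSet) {
  givenSet <- sort(unique(givenSet))

  # all combinations of size 1, 2, ..., n, as a list of lists
  bySize <- lapply(seq_len(length(givenSet)),
                   function(k) combn(givenSet, k, simplify = FALSE))

  # the empty set first, then the subsets by increasing size
  subsets <- c(list(character(0)), unlist(bySize, recursive = FALSE))

  labels <- sapply(subsets, function(s) {
    if (length(s) == 0) "empty set"
    else paste("{", paste(s, collapse = " , "), "}")
  })

  total <- length(subsets)
  cat(sprintf("%d/%d  %s\n", seq_len(total), total, labels), sep = "")
  invisible(subsets)
}

showSubsets(c("a", "b", "c"))
```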

Exercise 8

Run the library("gdata") instruction and check that it displays some information. Define a function .library (yes, with a dot at the beginning) that loads the library silently, that is, that suppresses the warning outputs. What is the use of the dot at the beginning of the function's name?
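
A sketch of a silent loader built on two base R wrappers, suppressPackageStartupMessages and suppressWarnings. It is demonstrated with the tools package, which ships with R, because gdata may not be installed; the choice of taking the package name as a character string is an assumption.

```r
.library <- function(pkg) {
  # pkg is the package name as a character string, e.g. "gdata"
  suppressWarnings(
    suppressPackageStartupMessages(
      library(pkg, character.only = TRUE)
    )
  )
  invisible(pkg)
}

.library("tools")   # loads quietly; nothing is printed
```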

Exercise 9

Write a function that finds the longest common subsequence of n (greater than two) amino acid sequences. Beware that this is not the same problem as finding the longest common substring. First describe the method and its complexity using big-O notation. Is it possible to use it for DNA sequences, which may be very long? Prove that it is fast by profiling your code.

Exercise 10

Build a small example for a class of statistical objects (continuous, factor...) with basic methods (size, describe, plot...) using the S4, R5 (Reference Classes) and R6 formalisms in order to explain the pros and cons of these three object-oriented mechanisms in R.

Why must every R programmer decide to learn tidyverse or not?

What is the best R package to run "serious" tests (unit tests, integration tests...) in R?

4. I know Python better than R

Great. So use Python to solve the exercises from section 2 («I think I know R for statistics and bioinformatics») and from section 3 («I think I know how to program in R»). Then send your scripts to me at gilles.hunault@univ-angers.fr so I can check your programming skills.

Final note: selected answers to the exercises are hidden but clickable on this page. Do you like to play hide and seek?
