## R warm up session for the Bioinformatics Summer School

gilles.hunault@univ-angers.fr

Remember that, with more than 2 million functions distributed as packages and grouped in task views, R is a fantastic tool... but not the only one.

There is even a site dedicated to R and bioinformatics called Bioconductor.

Official documentation (in many languages) is here.

Very, very short lectures:

## About R and bioinformatics

## About R and [bio]statistics

You can follow this link to get a short reference card for R and RStudio. A local copy is here.

## Choose the section that suits you best:

1. I don't know R
2. I think I know R for statistics and bioinformatics
3. I think I know how to program in R
4. I know Python better than R

## 1. I don't know R

OK, you need a gentle introduction to R, so practice is the best solution. Read the following document, r-intro, based on Frédéric PROIA's warm-up document, and type the code in RStudio to see what R answers.

You can load the file warmup.r with all its R code if you don't want to type the code, but writing things out usually helps you remember them. You can also cut and paste the following lines, one instruction at a time:

Then try to solve these exercises:

## Exercise 1

Use the `iris` dataset. Compute the mean of the fourth column. Why is the `summary` function a « good but limited function »?

Now compute the median of the fourth column. Does R help you decide whether the mean, the median or another computation is the best descriptor of the values? Does R show the unit of the petals' width?
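As a possible starting point for this exercise, here is a minimal sketch that only uses the built-in `iris` data frame and base R functions. The variable names (`petalWidth`, `meanWidth`, `medianWidth`) are illustrative choices, and the histogram is just one way among others to look at the shape of the values.

```r
# Fourth column of the built-in iris data frame
petalWidth <- iris[, 4]

# Name of the column being described
print(colnames(iris)[4])

# Two classical descriptors of the centre of the values
meanWidth   <- mean(petalWidth)
medianWidth <- median(petalWidth)
print(c(mean = meanWidth, median = medianWidth))

# summary() gives quartiles, mean and extremes, but no standard deviation
print(summary(petalWidth))

# A histogram helps to judge which descriptor suits the data
hist(petalWidth, main = "Petal.Width", xlab = "Petal width")
```

R returns the numbers without comment: the choice between mean and median remains yours.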

## Exercise 2

Read the elf.dar data file using the explanations given at the end of the page elf.htm. Convert the `SEXE` column of the data frame into a factor: 0=male and 1=female. Use the `table` and `prop.table` functions to compute absolute and relative counts. How do you sort them in decreasing order?

## Exercise 3

Compute the GC content of gene X94991.1. Use all the nice bioinformatics functions of R to do it with a minimum of instructions.

Hint: use the R code from R15 and install the `ape` package.

## Exercise 4

Use the `elf` data again. Compare the ages of women and men with the help of the `t.test` function. Comment on the output.

Does R check that the assumptions of the test are fulfilled?
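To see what a `t.test` call and its output look like before working on the `elf` data, here is a hedged sketch on stand-in data: two species of the built-in `iris` dataset play the role of the two groups. The names `twoGroups`, `res` and `normality` are illustrative, and the extra checks (`shapiro.test`, `var.test`) are only one possible way to look at the assumptions.

```r
# Stand-in data (assumption): two iris species replace the women/men groups
twoGroups <- droplevels(subset(iris, Species != "setosa"))

# Welch two-sample t-test, the default of t.test() (var.equal = FALSE)
res <- t.test(Sepal.Length ~ Species, data = twoGroups)
print(res)

# t.test() runs no assumption check by itself: separate calls are needed
normality <- tapply(twoGroups$Sepal.Length, twoGroups$Species,
                    function(v) shapiro.test(v)$p.value)
print(normality)                                   # normality, per group
print(var.test(Sepal.Length ~ Species, data = twoGroups)$p.value)  # variances
```

The same calls should apply to the `elf` data once the sex column is a factor with two levels.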

## Exercise 5

What is the purpose of the `msaR` and `AlignStat` packages?

## Exercise 6

Explain in which cases it is better to use beanplots than boxplots. Use R to show it. You have to install the `beanplot` package.

## Exercise 7

Install and then load the `rms` package. Why does the installation take so long? Which datasets are included?

This package is associated with a Springer book. What is its title?

## Exercise 8

Install and then load the `faraway` package. Which datasets are included?

This package is associated with a CRC book. What is its title?

How do you remove all the rows with NA values from the `diabetes` dataset of this package with R? Is it a good idea to do so?

## Exercise 9

Load the `survival` package. Check that you don't need to install it. Why? Is there also a book associated with this package?

Try to find the class and the dimensions of two of its datasets, named `kidney` and `leukemia`.

How can you find the class and the dimensions of all the datasets of this package?

## Exercise 10

Why should you avoid `for` loops as much as possible in R if you are dealing with columns or rows of data frames?

## 2. I think I know R for statistics and bioinformatics

There are some good and some bad practices in R to compute statistics and produce bioinformatics results. Check how you do things with these exercises.

No programming is needed here. Use RStudio to edit and run your code.
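To make the idea of good and bad practices concrete, here is a small sketch that contrasts an explicit loop with a single vectorised instruction on the built-in `iris` dataset. The names `meansLoop` and `meansVec` are illustrative, and `colMeans` is only one of several possible vectorised choices.

```r
# Loop version: it works, but it is verbose and easy to get wrong
meansLoop <- numeric(4)
for (j in 1:4) {
  meansLoop[j] <- mean(iris[, j])
}
names(meansLoop) <- names(iris)[1:4]

# Vectorised version: a single instruction for the same result
meansVec <- colMeans(iris[, 1:4])

print(meansVec)
print(all.equal(meansLoop, meansVec))   # both approaches agree
```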

## Exercise 1

Use the `iris` dataset. Compute the means of the first four columns with a single instruction.

Can you apply the `summary` function to the first column for each species with a single instruction?

## Exercise 2

Use the `cars` dataset. Write a single instruction to have the row names showing Car001 Car002... No loop accepted.

## Exercise 3

What is the shortest way to see the column names and their indices, as in this example:

    [1,] Sepal.Length
    [2,] Sepal.Width
    [3,] Petal.Length
    [4,] Petal.Width
    [5,] Species

You may use the `iris` dataset. Remember: neither programming nor for loop here.

## Exercise 4

Read the `diabetes` dataset at the address http://forge.info.univ-angers.fr/~gh/wstat/Eda/diabetes.dar.

Beware that the first line contains the column names and that the first column gives the row names.

Remove all the rows with NA for the `bp.1s` variable. Which column then has the largest number of NA values?

Hint: use `apply` and an anonymous function.

## Exercise 5

Use the `diabetes` dataset again. With a single instruction, compute and add to this dataset the categorical variable `ageCL` based on the rule: 'young' if age<18, 'old' otherwise.

When you modify a data frame, what are the differences between the `transform` and `mutate` functions?

## Exercise 6

What is the most efficient way to compute and display the maximal value and the number of times it occurs in a vector with many many values? How can you prove it?
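One way to approach the « prove it » part is to write several candidates and time them on the same data. The sketch below is only an illustration: the vector `x` and the function names `countA` and `countB` are made up for the example, and `system.time` gives rough timings (a dedicated benchmarking package would give more reliable ones).

```r
set.seed(42)                                  # reproducible example data
x <- sample.int(1000L, 1e6, replace = TRUE)   # hypothetical large vector

# Candidate A: two vectorised passes over the data
countA <- function(v) {
  m <- max(v)
  c(max = m, count = sum(v == m))
}

# Candidate B: full frequency table, then read its last entry
# (table() sorts the distinct values, so the last one is the maximum)
countB <- function(v) {
  tb <- table(v)
  c(max = max(v), count = as.integer(tb[length(tb)]))
}

print(countA(x))
print(identical(countA(x), countB(x)))        # same answer from both

# Rough timings: compare the "elapsed" values
print(system.time(countA(x)))
print(system.time(countB(x)))
```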

## Exercise 7

Describe an ordinal variable with counts, percentages and cumulative frequencies, without any for loop, as in the following table. Don't forget the NA values.

    Frequency table for the variable METAVIR_F
                      0      1      2      3      4   <NA>
    Count         250.0  800.0  516.0  364.0  380.0  647.0
    Sums          250.0 1050.0 1566.0 1930.0 2310.0 2957.0
    Percentages     8.5   27.1   17.5   12.3   12.9   21.9
    Cumulative      8.5   35.5   53.0   65.3   78.1  100.0

## Exercise 8

Build a graphical description of a continuous variable with the curve for the estimation of the density and the normal candidate curve like the one below.

Make a function of your instructions. Which parameters are needed?
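Here is a hedged sketch of what such a function could look like with base graphics only. The function name `plotDensityNormal`, its parameters (`x`, `main`, `xlab`) and the colours are assumptions made for the illustration, not the expected solution. The normal candidate is taken as the normal curve with the same mean and standard deviation as the data.

```r
plotDensityNormal <- function(x, main = "Density estimate and normal curve",
                              xlab = "Values") {
  x <- x[!is.na(x)]                        # drop missing values
  dens <- density(x)                       # kernel density estimate
  h <- hist(x, plot = FALSE)               # histogram computed, not drawn yet
  grid <- seq(min(dens$x), max(dens$x), length.out = 200)
  normal <- dnorm(grid, mean = mean(x), sd = sd(x))
  ymax <- max(h$density, dens$y, normal)   # room for bars and both curves
  hist(x, freq = FALSE, main = main, xlab = xlab,
       xlim = range(grid), ylim = c(0, ymax),
       col = "grey90", border = "white")
  lines(dens, col = "blue", lwd = 2)
  lines(grid, normal, col = "red", lwd = 2, lty = 2)
  legend("topright", legend = c("Estimated density", "Normal curve"),
         col = c("blue", "red"), lwd = 2, lty = c(1, 2), bty = "n")
  invisible(list(mean = mean(x), sd = sd(x), n = length(x)))
}

# Usage example on a built-in continuous variable
stats <- plotDensityNormal(iris$Sepal.Length, xlab = "Sepal length")
```

The data vector is the only mandatory parameter here; title and axis label have defaults.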

## Exercise 9

What are the pros and cons of the `ggplot2` and `lattice` packages compared to classical plots in R?

## Exercise 10

Are you able to use `Shiny` and a Jupyter R notebook to build a small app in R? Prove it.

## 3. I think I know how to program in R

Let's check it. Use RStudio to edit, run, debug and profile (you said you were a programmer, right?) your code.

## Exercise 1

Use the `iris` dataset. Apply the `summary` function to all numeric columns for each species with a single instruction.

Hint: use an anonymous function for `tapply`.

## Exercise 2

Create a `cats` function that underlines a string with a given character, using "=" as the default.

Example:

    > cats("First part: descriptive statistics")
    First part: descriptive statistics
    ==================================
    > cats("Second part: inferential statistics","-")
    Second part: inferential statistics
    -----------------------------------

## Exercise 3

Create a `timer` function that prints the date and time before and after executing some code and that computes the duration of the execution.

    > timer( myFunction( 10**3 ) )
    Start: 08 june 2018 11:13:02 CEST
    [...] output
    Stop: 08 june 2018 11:13:02 CEST
    Time difference of 0.001301765 secs

Hint: use the ellipsis for the parameter of the function.

## Exercise 4

Create a function `extractPvalue` that extracts the p-value of a t-test. For example:

    > extractPvalue( t.test(iris2$Sepal.Length ~ iris2$Species2) )
    1.866144e-07

Create a function `extractPvalues` that produces a table of all t-tests for the columns of a data frame, using the name of the factor as a parameter. For example:

    > extractPvalues( iris2, "Species2" )
    Variable      p-value
    Sepal.Length  1.866144e-07
    Sepal.Width   0.001819
    Petal.Length  ...
    Petal.Width

The `iris2` data frame corresponds to the `iris` dataset without the "setosa" flowers. Define it with a single instruction.

Hint: in `extractPvalues`, use `apply` with an anonymous function that calls the `extractPvalue` function from the first part of this exercise.

## Exercise 5

Define a function that uses a quantitative variable and two factors and displays the boxplots side by side, as in the example of the tooth growth for the guinea pigs found in `example("boxplot")`. Don't forget the legend.

## Exercise 6

The chi-square test function computes only the value of the test statistic but doesn't show where the main differences between the theoretical and the observed values are. Define a function that details the contribution (theo-obs)²/theo for each level and that sorts them by relative importance. As usual, no loops. Here is an example:

    Details of the chi-square statistic test value:
    Ind.    The  Obs      Dif       Cntr        Pct     Cumul
       2 27.500   55  -27.500  27.500000  42.060623  45.50227
       1  6.875   18  -11.125  18.002273  27.534066  18.00227
       3 41.250   21   20.250   9.940909  15.204394  55.44318
       4 27.500   12   15.500   8.736364  13.362069  64.17955
       5  6.875    4    2.875   1.202273   1.838849  65.38182

## Exercise 7

Define a function that displays all the subsets of a given set. For example, for `givenSet <- c("a","b","c")` it must display something like:

    1/8 empty set
    2/8 { a }
    3/8 { b }
    4/8 { c }
    5/8 { a , b }
    6/8 { a , c }
    7/8 { b , c }
    8/8 { a , b , c}

The subsets must be numbered, produced with an increasing number of elements (variant: decreasing) and displayed in alphabetical order.

Which parts of biostatistics or bioinformatics may need these subsets? Try to produce both iterative and recursive solutions.
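As one possible non-recursive approach, the sketch below builds the subsets size by size with the base function `combn`. The function name `showSubsets` is a made-up name for the illustration, and the recursive variant is left to the reader.

```r
showSubsets <- function(givenSet) {
  givenSet <- sort(unique(givenSet))        # alphabetical order, no duplicates
  n <- length(givenSet)
  # Subsets of size 1..n; combn() keeps the sorted order within each size
  bySize <- lapply(seq_len(n),
                   function(k) combn(givenSet, k, simplify = FALSE))
  # The empty set comes first, then all the other subsets, flattened
  subsets <- c(list(character(0)), unlist(bySize, recursive = FALSE))
  total <- length(subsets)                  # 2^n subsets in total
  labels <- vapply(subsets, function(s) {
    if (length(s) == 0) "empty set"
    else paste0("{ ", paste(s, collapse = " , "), " }")
  }, character(1))
  out <- sprintf("%d/%d %s", seq_len(total), total, labels)
  cat(out, sep = "\n")                      # display, one subset per line
  invisible(out)                            # return the lines for reuse
}

res <- showSubsets(c("a", "b", "c"))
```

Since the number of subsets doubles with each added element, this is only practical for small sets.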

## Exercise 8

Run the `library("gdata")` instruction and check that it displays some information. Define a function `.library` (yes, with a dot at the beginning) that loads the library silently, that is, that suppresses the warning outputs. What is the use of the dot at the beginning of the function's name?

## Exercise 9

Write a function that finds the longest common subsequence of n (greater than two) amino acid sequences. Beware that this is not the same problem as finding the longest common substring. First describe the method and its complexity using big-O notation. Is it possible to use it for DNA sequences, which may be very long? Prove that it is fast by profiling your code.

## Exercise 10

Build a small example for a class of statistical objects (continuous, factor...) with basic methods (size, describe, plot...) using the S4, R5 (Reference Classes) and R6 formalisms in order to explain the pros and cons of these three object-oriented mechanisms in R.

Why must every R programmer decide whether or not to learn `tidyverse`?

What is the best R package to run "serious" tests (unit tests, integration tests...) in R?

## 4. I know Python better than R

Great. So use Python to solve the exercises from section 2 (« I think I know R for statistics and bioinformatics ») and from section 3 (« I think I know how to program in R »). Then send your scripts to me at gilles.hunault@univ-angers.fr so I can check your programming skills.

Final note: selected answers to the exercises are hidden but clickable on this page. Do you like to play hide and seek?
