CMI M1 statistiques séance 3

On peut utiliser la fonction ifelse().

Démonstration :


     ## création d'une variable binaire, on reprend les données LEA
     
     if (!exists("cats")) {  void <- capture.output( source("http://forge.info.univ-angers.fr/~gh/statgh.r",encoding="latin1") ) }
     
     lea <- lit.dar("http://forge.info.univ-angers.fr/~gh/Datasets/lea.dar")
     
     cats("Variable reign")
     print(table(lea$reign))
     
     cats("Variable plante")
     lea$plante <- ifelse(lea$reign=="Viridiplantae",1,0)
     print(table(lea$plante))


     Variable reign
     ==============
     
     
         Alveolata      Bacteria Euryarchaeota         Fungi       Metazoa Parabasalidea Viridiplantae
                 1            38             1            11            23             1           698
     
     Variable plante
     ===============
     
     
       0   1
      75 698

Toute variable est potentiellement intéressante. Il peut être instructif de réaliser une ANOVA avec cette variable pour la cible et en particulier de tracer des boxplots ou des beanplots de la variable en fonction de la cible pour voir s'il y a des différences significatives entre les deux modalités.

Exemples :


     ## boxplots et beanplots d'une variable binaire, données LEA
     
     if (!exists("cats")) {  void <- capture.output( source("http://forge.info.univ-angers.fr/~gh/statgh.r",encoding="latin1") ) }
     
     library(beanplot)
     
     lea <- lit.dar("http://forge.info.univ-angers.fr/~gh/Datasets/lea.dar")
     
     cats("Variable reign")
     print(table(lea$reign))
     
     cats("Variable plante")
     lea$plante <- ifelse(lea$reign=="Viridiplantae",1,0)
     print(table(lea$plante))
     
     # filtrage
     
     leaOK <- lea[ (lea$mw>0) & (lea$length<900) , ]
     
     boxplot(leaOK$length~leaOK$plante,main="Length vs plante",col="lightgreen")
     beanplot(leaOK$length~leaOK$plante,main="Length vs plante",col="orange")
     
     boxplot(leaOK$pi~leaOK$plante,main="PI en fonction de plante",col="lightblue")
     beanplot(leaOK$pi~leaOK$plante,main="PI en fonction de plante",col="purple")
     
     # avec les fonctions gh
     
     decritQTparFacteur("LENGTH dans LEA",leaOK$length,"aa","PLANTE",leaOK$plante,"0 1",TRUE)
     with(leaOK, decritQTparFacteur("PI dans LEA",pi,"g/mol","PLANTE",plante,"0 1",TRUE))


     Variable plante
     ===============
       0   1
      75 698
     
     Variable LENGTH
     ===============
     
     VARIABLE QT  LENGTH dans LEA ,unit=aa
     VARIABLE QL  PLANTE  labels :  0 1
     
                 N       Moy Unite       Ect Cdv        Q1       Med        Q3       EIQ       Min       Max
     Global    764   199.034    aa   114.406  57   129.500   168.000   232.500   103.000    68.000   847.000
     0          71   279.831    aa   179.819  64   145.500   200.000   377.000   231.500    70.000   847.000
     1         693   190.756    aa   101.880  53   128.000   165.000   227.000    99.000    68.000   723.000
     
     Analysis of Variance Table
     
     Response: nomVarQT
                Df  Sum Sq Mean Sq F value    Pr(>F)
     ql          1  510985  510985  41.035 2.616e-10 ***
     Residuals 762 9488802   12452
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     ANOVA : there is a significant diffence at the level 0.05 for  LENGTH dans LEA   PLANTE
     
        Kruskal-Wallis rank sum test
     
     data:  nomVarQT by ql
     Kruskal-Wallis chi-squared = 16.521, df = 1, p-value = 4.811e-05
     
     KRUSKAL-WALLIS : there is a significant diffence at the level 0.05 for  LENGTH dans LEA   PLANTE
     
     Variable PI
     ===============
     
     
     VARIABLE QT  PI dans LEA ,unit=g/mol
     VARIABLE QL  PLANTE  labels :  0 1
     
                 N       Moy Unite       Ect Cdv        Q1       Med        Q3       EIQ       Min       Max
     Global    764     7.496 g/mol     2.050  27     5.630     7.130     9.570     3.940     3.810    12.610
     0          71     7.142 g/mol     2.319  32     5.005     6.460     9.820     4.815     4.150    12.610
     1         693     7.532 g/mol     2.017  27     5.740     7.160     9.560     3.820     3.810    12.430
     
     Analysis of Variance Table
     
     Response: nomVarQT
                Df Sum Sq Mean Sq F value Pr(>F)
     ql          1    9.8  9.8029  2.3325 0.1271
     Residuals 762 3202.4  4.2027
     ANOVA : there is no significant diffence at the level 0.05 for  PI dans LEA   PLANTE
     
        Kruskal-Wallis rank sum test
     
     data:  nomVarQT by ql
     Kruskal-Wallis chi-squared = 2.2567, df = 1, p-value = 0.133
     
     KRUSKAL-WALLIS : there is no significant diffence at the level 0.05 for  PI dans LEA   PLANTE

Consulter cmi03mrlb.r.


     ## tableau résumé de toutes les régressions logistiques simples
     ## pour un dataframe de variables quantitatives dont la colonne 1 est la cible
     
     # lecture des fonctions gh
     
     if (!exists("cats")) {  void <- capture.output( source("http://forge.info.univ-angers.fr/~gh/statgh.r",encoding="latin1") ) }
     
     #######################################################################
     
     mrlb <- function(df) {
     
     #######################################################################
     
     nbvar <- ncol(df)
     
     # rappel rapide des données
     
     print(summary(df))
     
     # préparation du tableau des résultats
     
     cols   <- c("AUROC","AIC","BIC")
     nbc    <- length(cols)
     tabRes <- data.frame(matrix(NA,nrow=nbvar-1,ncol=nbc))
     names(tabRes)     <- cols
     row.names(tabRes) <- names(df)[2:nbvar]
     
     # remplissage
     
     for (idv in (2:nbvar)) {
         rlb  <- glm( df[,1] ~ df[ ,idv] ,family="binomial")
         ypredites              <- predict(rlb,newdata=df,type="response")
         tabRes$AUROC[ idv-1 ]  <- round(aurocQlPred(unlist(df[,1]),ypredites),4)
         tabRes$AIC[ idv-1 ]    <- round(AIC(rlb),4)
         tabRes$BIC[ idv-1 ]    <- round(BIC(rlb),4)
     } # fin pour idv
     
     # affichage
     
     cat("\nTableau de toutes les régressions logistiques simples\n\n")
     idx <- order(tabRes$AUROC,decreasing=TRUE)
     print(tabRes[idx,])
     cat("\n")
     
     } # fin de fonction mrls
     
     #######################################################################
     
     ## exemples d'applications
     
     # iris
     
     data(iris)
     irisb        <- iris
     irisb$setosa <- ifelse(iris$Species=="setosa",1,0)
     irisb        <- irisb[ , c("setosa","Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width") ]
     mrlb( irisb )
     
     # LEA
     
     lea <- lit.dar("http://forge.info.univ-angers.fr/~gh/Datasets/lea.dar")
     lea$plante <- ifelse(lea$reign=="Viridiplantae",1,0)
     lea$plante <- ifelse(lea$reign=="Viridiplantae",1,0)
     leaOK <- lea[ (lea$mw>0) & (lea$length<900) , ]
     leaOK <- leaOK[ , c("plante","length", "foldindex", "pi", "mw", "gravy") ]
     mrlb( leaOK )


     DARA IRIS
     =========
     
          setosa        Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
      Min.   :0.0000   Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
      1st Qu.:0.0000   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
      Median :0.0000   Median :5.800   Median :3.000   Median :4.350   Median :1.300
      Mean   :0.3333   Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
      3rd Qu.:1.0000   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
      Max.   :1.0000   Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
     
     Tableau de toutes les régressions logistiques simples
     
                   AUROC      AIC      BIC
     Petal.Length 1.0000   4.0000  10.0213
     Petal.Width  1.0000   4.0000  10.0213
     Sepal.Length 0.9586  75.8364  81.8577
     Sepal.Width  0.8796 127.8280 133.8493
     
     DATA LEAOK
     ==========
     
          plante           length        foldindex              pi               mw            gravy
      Min.   :0.0000   Min.   : 68.0   Min.   :-0.45034   Min.   : 3.810   Min.   : 7521   Min.   :-2.2056
      1st Qu.:1.0000   1st Qu.:129.5   1st Qu.:-0.19583   1st Qu.: 5.630   1st Qu.:14004   1st Qu.:-1.3232
      Median :1.0000   Median :168.0   Median :-0.11698   Median : 7.130   Median :17451   Median :-1.0998
      Mean   :0.9071   Mean   :199.0   Mean   :-0.10008   Mean   : 7.496   Mean   :21247   Mean   :-1.0113
      3rd Qu.:1.0000   3rd Qu.:232.5   3rd Qu.:-0.03051   3rd Qu.: 9.570   3rd Qu.:25025   3rd Qu.:-0.8109
      Max.   :1.0000   Max.   :847.0   Max.   : 0.41210   Max.   :12.610   Max.   :94076   Max.   : 0.7991
     
     Tableau de toutes les régressions logistiques simples
     
                AUROC      AIC      BIC
     gravy     0.7541 422.5821 431.8592
     foldindex 0.7387 425.4206 434.6978
     length    0.6463 447.0219 456.2990
     mw        0.6446 447.2591 456.5363
     pi        0.5541 474.2179 483.4951

Il est possible de télécharger la fonction mrlb() sur internet, via l'instruction :


     source("http://forge.info.univ-angers.fr/~gh/Cmi/mrlb.r",encoding="latin1")

Cours CMI / M1 Statistiques - séance 3

Résumé de la séance 3

Liens utiles

Exercices de régression logistique binaire (RLB)