PREDIPATH : analysis of the 65x65 data
Clickable table of contents
2. Analysis of the 0/1 data: reduction of dimension
3. Analysis of the 0/1 data: prediction of the status
4. Analysis of the percentage data: reduction of dimension
5. Analysis of the percentage data: prediction of the status
1. Description of the data
The original data is a 65x65 matrix of percentages of identity.
The genomes, described in the file allGroups are known to be commensal or pathogenic, as shown in the following excerpt:
ASSEMBLY_GENOME SPECIES GROUP GCF_000026185.1 Erwinia_tasmaniensis_Et1 commensal GCF_000196615.1 Erwinia_billingiae_Eb661|Eb661 commensal GCF_000336255.1 Erwinia_toletana_DAPP-PG_735|DAPP-PG_735 commensal GCF_000745075.1 Erwinia_sp._9145|9145 commensal GCF_000770305.1 Erwinia_oleae|DAPP-PG531 commensal GCF_000773975.1 Erwinia_typographi|M043b commensal [...] GCF_002732435.1 Erwinia_amylovora|NHWL02-2 pathogenic GCF_002732445.1 Erwinia_amylovora|OR1 pathogenic GCF_002732485.1 Erwinia_amylovora|MANB02-1 pathogenic GCF_002732505.1 Erwinia_amylovora|EA110 pathogenic GCF_002803865.1 Erwinia_amylovora|E-2 pathogenic GCF_002952315.1 Erwinia_pyrifoliae|EpK1 pathogenicThe clusters are described in the file allClusters. Some clusters are shown below:
Name Assembly_genome cluster_1 argannot~~(Bla)Penicillin_Binding_Protein_Ecoli:CP002291:664439-666340:1902 cluster_2 card~~gb|AAC75733.1|ARO:3000074|emrB cluster_3 card~~gb|AIL15701|ARO:3003775|Escherichia cluster_4 card~~gb|BAE77595.1|ARO:3003303|Escherichia ... cluster_64 vfdb_pre~~VFG043692(gi:112292717) cluster_65 vfdb_pre~~VFG044340(gi:292489703)In order to better see the information, cluster names have been shortened and normalized: cluster_x becomes CL0x. The names of the genomes have been shortened too and prefixed by C for commensal or by P for pathogenic. Here is a part of the data to be analyzed, the complete file is here:
Short-name status CL01 CL02 CL03 CL04 CL05 CL06 CL07 CL08 CL09 CL10 CL11 CL12 CL13 CL14 CL15 C-0026185.1 commensal 0 0 0 0 98.095 0 0 0 0 0 95.976 0 90.984 0.00 90.783 C-0196615.1 commensal 0 0 0 0 99.048 0 0 0 0 0 95.380 0 90.164 0.00 0.000 C-0336255.1 commensal 0 0 0 0 98.571 0 0 0 0 0 95.746 0 90.164 90.84 90.668 C-0745075.1 commensal 0 0 0 0 98.571 0 0 0 0 0 96.051 0 0.000 0.00 91.129 C-0770305.1 commensal 0 0 0 0 98.571 0 0 0 0 0 96.051 0 0.000 0.00 91.129 [...] P-2732445.1 pathogenic 0 0 0 0 98.571 0 0 0 0 0 96.051 0 0 0 0.000 P-2732485.1 pathogenic 0 0 0 0 98.095 0 0 0 0 0 96.051 0 0 0 0.000 P-2732505.1 pathogenic 0 0 0 0 98.571 0 0 0 0 0 95.976 0 0 0 0.000 P-2803865.1 pathogenic 0 0 0 0 98.571 0 0 0 0 0 96.051 0 0 0 0.000 P-2952315.1 pathogenic 0 0 0 0 98.571 0 0 0 0 0 95.753 0 0 0 90.3232. Analysis of the 0/1 data: reduction of dimension
The original data have been converted to 0/1 data: if the percentage is not 0 then the value is 1. This provides a matrix of presence/absence to be analyzed.
Here is a part of the data to be analyzed, the complete file is here in text format and there in CSV format:
Genome status CL01 CL02 CL03 CL04 CL05 CL06 CL07 CL08 CL09 CL10 CL11 CL12 CL13 CL14 CL15 C-0026185.1 commensal 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 C-0196615.1 commensal 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 C-0336255.1 commensal 0 0 0 0 1 0 0 0 0 0 1 0 1 1 1 C-0745075.1 commensal 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 C-0770305.1 commensal 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 [...] P-2732445.1 pathogenic 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 P-2732485.1 pathogenic 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 P-2732505.1 pathogenic 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 P-2803865.1 pathogenic 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 P-2952315.1 pathogenic 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1As one can see, some columns look similar. And in fact, some colums are equal. So we first wrote a script to remove them. Here is the result of this script:
NB Column Equal-to: 01 status 02 CL01 CL02 CL04 CL06 CL07 CL08 CL09 CL10 CL12 CL23 CL53 CL55 04 CL03 06 CL05 12 CL11 14 CL13 15 CL14 16 CL15 17 CL16 18 CL17 CL52 19 CL18 20 CL19 CL20 22 CL21 23 CL22 25 CL24 26 CL25 27 CL26 CL40 28 CL27 29 CL28 CL29 CL31 CL32 CL38 31 CL30 34 CL33 CL39 CL42 CL44 CL46 CL47 35 CL34 CL36 CL49 CL65 36 CL35 CL50 38 CL37 42 CL41 44 CL43 46 CL45 49 CL48 52 CL51 55 CL54 57 CL56 CL57 CL58 CL59 CL60 CL61 CL62 CL63 CL64So instead of analyzing 65 columns of 0/1 data, we have to analyze only 30 columns. Since no resulting column if constant (details not shown here), we can proceed to predict the status of the genomes.
3. Analysis of the 0/1 data: prediction of the status
The classical statistical method for the classification of two classes is called binary logistic regression. Here it fails since there is a complete separation of the data, which means there is a combination of some columns that predicts exactly the status.
We have found such a combination (probably minimal) with 4 columns:
Coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 1.03273 0.02297 44.961 < 2e-16 *** CL41 0.96727 0.05544 17.449 < 2e-16 *** CL33 0.45818 0.06591 6.952 3.02e-09 *** CL34 -0.45818 0.08175 -5.605 5.56e-07 *** CL13 -0.08364 0.03574 -2.340 0.0226 * Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.08739 on 60 degrees of freedom Multiple R-squared: 0.9697, Adjusted R-squared: 0.9677 F-statistic: 480.6 on 4 and 60 DF, p-value: < 2.2e-16 Comparing status and round(ml$fitted.values),0): perfect separation!The minimal data needed to predict the status is then shown below:
Details of the predictors values ================================ Status 1 = commensal status CL41 CL33 CL34 CL13 C-0026185.1 1 1 0 1 1 C-0196615.1 1 0 0 0 1 C-0336255.1 1 0 0 0 1 C-0745075.1 1 0 0 0 0 C-0770305.1 1 0 0 0 0 C-0773975.1 1 0 0 0 0 C-1267535.1 1 0 0 0 1 C-1267545.1 1 0 0 0 1 C-1269445.1 1 0 0 0 1 C-1422605.1 1 0 0 0 0 C-1484765.1 1 0 0 0 1 C-1517405.1 1 0 0 0 0 C-1571305.1 1 0 0 0 1 C-2751995.1 1 0 0 0 0 C-2752015.1 1 0 0 0 0 C-2752025.1 1 0 0 0 0 C-2752055.1 1 0 0 0 0 C-2752075.1 1 0 0 0 0 C-2752165.1 1 0 0 0 0 C-2752575.1 1 0 0 0 1 C-2752595.1 1 0 0 0 0 C-2865965.1 1 0 0 0 0 C-2980095.1 1 0 0 0 1 C-0068895.1 1 0 0 0 0 Status 2 = pathogenic status CL41 CL33 CL34 CL13 P-0026985.1 2 1 1 1 0 P-0027205.1 2 1 1 1 0 P-0027265.1 2 1 1 1 0 P-0091565.1 2 1 1 1 0 P-0165815.1 2 1 1 1 0 P-0240705.2 2 1 1 1 0 P-0367545.1 2 1 1 1 0 P-0367565.1 2 1 1 1 0 P-0367585.1 2 1 1 1 0 P-0367605.1 2 1 1 1 0 P-0367625.1 2 1 1 1 0 P-0367645.1 2 1 1 1 0 P-0367665.1 2 1 1 1 0 P-0404125.1 2 1 0 0 0 P-0513355.1 2 1 1 1 0 P-0513395.1 2 1 1 1 0 P-0513415.1 2 1 1 1 0 P-0590885.1 2 1 0 0 0 P-0696075.1 2 1 1 1 0 P-0975275.1 2 1 0 0 0 P-1050515.1 2 1 0 1 0 P-2732125.1 2 1 1 1 0 P-2732175.1 2 1 1 1 0 P-2732205.1 2 1 1 1 0 P-2732215.1 2 1 1 1 0 P-2732245.1 2 1 1 1 0 P-2732255.1 2 1 1 1 0 P-2732285.1 2 1 1 1 0 P-2732295.1 2 1 1 1 0 P-2732315.1 2 1 1 1 0 P-2732335.1 2 1 1 1 0 P-2732365.1 2 1 1 1 0 P-2732385.1 2 1 1 1 0 P-2732405.1 2 1 1 1 0 P-2732425.1 2 1 1 1 0 P-2732435.1 2 1 1 1 0 P-2732445.1 2 1 1 1 0 P-2732485.1 2 1 1 1 0 P-2732505.1 2 1 1 1 0 P-2803865.1 2 1 1 1 0 P-2952315.1 2 1 1 1 0Since these data correspond to presence/absence, a count-heatmap is probably a better way to understand these data (you can click on the image to have a better view):
Columns: CL13 CL41 CL33 CL34; Rows: commensal then pathogenic.
Since the linear regression model is not easy to understand, the main columns used to predict the status are given below but none is specific, which means that alone none is sufficient for the prediction:
Thresholds : presence : pct1 >= 90 %; absence: pct0 < 10 % no cluster is able to predict "commensal" 6 clusters nearly allow to predict "pathogenic" NB Column sum sumCommens pctCommens sumPatho pctPatho specCommens specPatho presqueCommens presquePatho 19 CL33 37 0 0.00 37 90.24 YYY 20 CL34 39 1 4.17 38 92.68 YYY 21 CL35 40 1 4.17 39 95.12 YYY 23 CL41 42 1 4.17 41 100.00 YES YYY 24 CL43 38 0 0.00 38 92.68 YYY 25 CL45 41 1 4.17 40 97.56 YYY Details of the predicted values: ================================ NB Column sum sumCommens pctCommens sumPatho pctPatho specCommens specPatho presqueCommens presquePatho 01 CL01 1 1 4.17 0 0.00 02 CL03 4 4 16.67 0 0.00 03 CL11 64 23 95.83 41 100.00 YES XXX YYY 04 CL13 10 10 41.67 0 0.00 05 CL14 12 12 50.00 0 0.00 06 CL15 29 21 87.50 8 19.51 07 CL16 6 6 25.00 0 0.00 08 CL17 2 2 8.33 0 0.00 09 CL18 44 6 25.00 38 92.68 YYY 10 CL19 1 0 0.00 1 2.44 11 CL21 62 23 95.83 39 95.12 XXX YYY 12 CL22 1 1 4.17 0 0.00 13 CL24 14 0 0.00 14 34.15 14 CL25 30 0 0.00 30 73.17 15 CL26 32 0 0.00 32 78.05 16 CL27 30 0 0.00 30 73.17 17 CL28 33 0 0.00 33 80.49 18 CL30 29 0 0.00 29 70.73 19 CL33 37 0 0.00 37 90.24 YYY 20 CL34 39 1 4.17 38 92.68 YYY 21 CL35 40 1 4.17 39 95.12 YYY 22 CL37 34 0 0.00 34 82.93 23 CL41 42 1 4.17 41 100.00 YES YYY 24 CL43 38 0 0.00 38 92.68 YYY 25 CL45 41 1 4.17 40 97.56 YYY 26 CL48 36 0 0.00 36 87.80 27 CL51 2 0 0.00 2 4.88 28 CL54 2 2 8.33 0 0.00 29 CL56 1 1 4.17 0 0.00So, roughly speaking, cluster 41 predicts nearly pathogenicity. The three other clusters are here to take care of genome C-0026185.1.
Genome status predit arrondi C-0026185.1 1 1.4581818 1 C-0196615.1 1 0.9490909 1 C-0336255.1 1 0.9490909 1 C-0745075.1 1 1.0327273 1 C-0770305.1 1 1.0327273 1 C-0773975.1 1 1.0327273 1 C-1267535.1 1 0.9490909 1 C-1267545.1 1 0.9490909 1 C-1269445.1 1 0.9490909 1 C-1422605.1 1 1.0327273 1 C-1484765.1 1 0.9490909 1 C-1517405.1 1 1.0327273 1 C-1571305.1 1 0.9490909 1 C-2751995.1 1 1.0327273 1 C-2752015.1 1 1.0327273 1 C-2752025.1 1 1.0327273 1 C-2752055.1 1 1.0327273 1 C-2752075.1 1 1.0327273 1 C-2752165.1 1 1.0327273 1 C-2752575.1 1 0.9490909 1 C-2752595.1 1 1.0327273 1 C-2865965.1 1 1.0327273 1 C-2980095.1 1 0.9490909 1 C-0068895.1 1 1.0327273 1 P-0026985.1 2 2.0000000 2 P-0027205.1 2 2.0000000 2 P-0027265.1 2 2.0000000 2 P-0091565.1 2 2.0000000 2 P-0165815.1 2 2.0000000 2 P-0240705.2 2 2.0000000 2 P-0367545.1 2 2.0000000 2 P-0367565.1 2 2.0000000 2 P-0367585.1 2 2.0000000 2 P-0367605.1 2 2.0000000 2 P-0367625.1 2 2.0000000 2 P-0367645.1 2 2.0000000 2 P-0367665.1 2 2.0000000 2 P-0404125.1 2 2.0000000 2 P-0513355.1 2 2.0000000 2 P-0513395.1 2 2.0000000 2 P-0513415.1 2 2.0000000 2 P-0590885.1 2 2.0000000 2 P-0696075.1 2 2.0000000 2 P-0975275.1 2 2.0000000 2 P-1050515.1 2 1.5418182 2 P-2732125.1 2 2.0000000 2 P-2732175.1 2 2.0000000 2 P-2732205.1 2 2.0000000 2 P-2732215.1 2 2.0000000 2 P-2732245.1 2 2.0000000 2 P-2732255.1 2 2.0000000 2 P-2732285.1 2 2.0000000 2 P-2732295.1 2 2.0000000 2 P-2732315.1 2 2.0000000 2 P-2732335.1 2 2.0000000 2 P-2732365.1 2 2.0000000 2 P-2732385.1 2 2.0000000 2 P-2732405.1 2 2.0000000 2 P-2732425.1 2 2.0000000 2 P-2732435.1 2 2.0000000 2 P-2732445.1 2 2.0000000 2 P-2732485.1 2 2.0000000 2 P-2732505.1 2 2.0000000 2 P-2803865.1 2 2.0000000 2 P-2952315.1 2 2.0000000 2The predictions are not robust since there are some values around 1.5; it is difficult to assign them to class 1 or 2.
4. Analysis of the percentage data: reduction of dimension
So now we are using the original data (percentages of identity). As before, we try to remove equal columns. This time, only one column can be removed:
NB Column Equal-to: 01 status 02 CL01 03 CL02 04 CL03 05 CL04 06 CL05 07 CL06 08 CL07 09 CL08 10 CL09 11 CL10 12 CL11 13 CL12 14 CL13 15 CL14 16 CL15 17 CL16 18 CL17 19 CL18 20 CL19 CL20 22 CL21 23 CL22 24 CL23 25 CL24 26 CL25 27 CL26 28 CL27 29 CL28 30 CL29 31 CL30 32 CL31 33 CL32 34 CL33 35 CL34 36 CL35 37 CL36 38 CL37 39 CL38 40 CL39 41 CL40 42 CL41 43 CL42 44 CL43 45 CL44 46 CL45 47 CL46 48 CL47 49 CL48 50 CL49 51 CL50 52 CL51 53 CL52 54 CL53 55 CL54 56 CL55 57 CL56 58 CL57 59 CL58 60 CL59 61 CL60 62 CL61 63 CL62 64 CL63 65 CL645. Analysis of the percentage data: prediction of the status
Here again the binary logistic regression fails since there is also a complete separation of the data, which means there is a combination of some columns that performs a complete separation of the data with respect to the status.
We have probably found a minimal solution with these 3 columns and their combination:
Coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 1.0001358 0.0187541 53.329 < 2e-16 *** CL41 0.0108103 0.0005939 18.201 < 2e-16 *** CL33 0.0046968 0.0006854 6.853 4.14e-09 *** CL34 -0.0054701 0.0008640 -6.331 3.22e-08 *** Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.08994 on 61 degrees of freedom Multiple R-squared: 0.9674, Adjusted R-squared: 0.9658 F-statistic: 603.5 on 3 and 61 DF, p-value: < 2.2e-16 Comparing status and round(ml$fitted.values),0): perfect separation!The minimal data needed to predict the status is shown below:
Equal values ============ genome status CL33 CL34 CL41 C-0196615.1 1 0 0 0 C-0336255.1 1 0 0 0 C-0745075.1 1 0 0 0 ... P-2732485.1 2 100 100 100 P-2732505.1 2 100 100 100 P-2803865.1 2 100 100 100 Unequal values ============== genome status CL33 CL34 CL41 C-0026185.1 1 0 97 94 P-0026985.1 2 92 98 99 P-0027265.1 2 92 98 99 P-0165815.1 2 92 98 99 P-0404125.1 2 0 0 92 P-0590885.1 2 0 0 94 P-0975275.1 2 0 0 92 P-1050515.1 2 0 94 94 P-2952315.1 2 92 98 99Here are the predicted values:
genome status predicted rounded C-0026185.1 1 1.490340 1 C-0196615.1 1 1.000136 1 C-0336255.1 1 1.000136 1 C-0745075.1 1 1.000136 1 C-0770305.1 1 1.000136 1 C-0773975.1 1 1.000136 1 C-1267535.1 1 1.000136 1 C-1267545.1 1 1.000136 1 C-1269445.1 1 1.000136 1 C-1422605.1 1 1.000136 1 C-1484765.1 1 1.000136 1 C-1517405.1 1 1.000136 1 C-1571305.1 1 1.000136 1 C-2751995.1 1 1.000136 1 C-2752015.1 1 1.000136 1 C-2752025.1 1 1.000136 1 C-2752055.1 1 1.000136 1 C-2752075.1 1 1.000136 1 C-2752165.1 1 1.000136 1 C-2752575.1 1 1.000136 1 C-2752595.1 1 1.000136 1 C-2865965.1 1 1.000136 1 C-2980095.1 1 1.000136 1 C-0068895.1 1 1.000136 1 P-0026985.1 2 1.965778 2 P-0027205.1 2 2.003826 2 P-0027265.1 2 1.965778 2 P-0091565.1 2 2.003826 2 P-0165815.1 2 1.965778 2 P-0240705.2 2 2.003826 2 P-0367545.1 2 2.003826 2 P-0367565.1 2 2.003826 2 P-0367585.1 2 2.003826 2 P-0367605.1 2 2.003826 2 P-0367625.1 2 2.003826 2 P-0367645.1 2 2.003826 2 P-0367665.1 2 2.003826 2 P-0404125.1 2 1.999958 2 P-0513355.1 2 2.003826 2 P-0513395.1 2 2.003826 2 P-0513415.1 2 2.003826 2 P-0590885.1 2 2.014876 2 P-0696075.1 2 2.003826 2 P-0975275.1 2 1.999958 2 P-1050515.1 2 1.502364 2 P-2732125.1 2 2.003826 2 P-2732175.1 2 2.003826 2 P-2732205.1 2 2.003826 2 P-2732215.1 2 2.003826 2 P-2732245.1 2 2.003826 2 P-2732255.1 2 2.003826 2 P-2732285.1 2 2.003826 2 P-2732295.1 2 2.003826 2 P-2732315.1 2 2.003826 2 P-2732335.1 2 2.003826 2 P-2732365.1 2 2.003826 2 P-2732385.1 2 2.003826 2 P-2732405.1 2 2.003826 2 P-2732425.1 2 2.003826 2 P-2732435.1 2 2.003826 2 P-2732445.1 2 2.003826 2 P-2732485.1 2 2.003826 2 P-2732505.1 2 2.003826 2 P-2803865.1 2 2.003826 2 P-2952315.1 2 1.965778 2Here also a count-heatmap is probably a better way to understand these data (you can click on the image to have a better view):
Columns: CL41 CL33 CL34; Rows: commensal then pathogenic.
So, as previously, cluster 41, named vfdb_pre~~VFG041501(gi:292487026), is nearly a predictor of pathogenicity since all pathogenic genomes have a non zero value for this cluster and all commensal genomes are null for this cluster except the genome named GCF_000026185.1 Erwinia_tasmaniensis_Et.
As before the predictions are not robust since again there are some values around 1.5; it is also difficult to assign them to class 1 or 2.
6. Concluding remarks
Though we have found some predictors for the status, the solution is not robust since one genomes can change it all.
If it was possible to ignore genome C-0026185.1 the predictor for pathogenicity would be exactly the presence of cluster 41, identified as vfdb_pre~~VFG041501(gi:292487026).
One has to think, using the 0/1 matrix, of the clusters equal to the predictors CL33 and CL34.
In order to get a robust classification, we have to think of a new strategy:
We can remove genome named GCF_000026185.1 Erwinia_tasmaniensis_Et so pathogenicity can be considered as equal to CL41 greater than 90.
We can analyze the real status of genome named GCF_000026185.1 Erwinia_tasmaniensis_Et for it may be pathogenic with a bad annotation.
We can decide to add more data to be sure of the information.
We can refine pathogenicity using more than two levels.
Retour à la page principale de (gH)