1 Introduction

This package builds multivariate predictors that determine which of several classes a given sample belongs to. Several multivariate classification methods are available, including the Compound Covariate Predictor, Diagonal Linear Discriminant Analysis, Nearest Neighbor Predictor, Nearest Centroid Predictor, and Support Vector Machine Predictor. For each class prediction method requested, the package estimates how accurately the classes can be predicted by that multivariate class predictor. The whole procedure is evaluated by cross-validation; the available methods are leave-one-out cross-validation, k-fold cross-validation and 0.632+ bootstrap validation. The cross-validated estimate of the misclassification rate is computed and the performance of each classifier is reported. New samples can then be classified with the specified classifiers, using the multivariate predictors built from the full dataset.

2 Installation

To install the package from its binary version, first install the ROC dependency package manually by running the following script in the R console:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ROC")

Afterwards, install the classpredict R package from the local file. Click on “Packages” on the R menu bar and select “install package(s) from local files”. Browse for “classpredict_0.2.zip” and click on “open”.
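The same local installation can probably also be done from the R console. The lines below are a minimal sketch, assuming the zip file was saved in the current working directory; the path is hypothetical and should be adjusted. install.packages with repos = NULL installs from a local file instead of a repository.

```r
# Hypothetical location of the downloaded binary; adjust to where the file was saved.
zipFile <- file.path(getwd(), "classpredict_0.2.zip")

if (file.exists(zipFile)) {
  # repos = NULL tells install.packages() to install from a local file
  # rather than from a repository; "win.binary" matches the Windows .zip format.
  install.packages(zipFile, repos = NULL, type = "win.binary")
} else {
  message("File not found: ", zipFile)
}
```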

3 Quick Start

This package provides test.classPredict for a quick start of class prediction analysis on one of the built-in sample datasets (i.e., “Brca”, “Perou”, and “Pomeroy”).

library(classpredict)
res <- test.classPredict("Brca",outputName = "ClassPrediction", generateHTML = FALSE)
## Getting analysis results ...

The list res includes the following objects:

names(res)
##  [1] "performClass"        "percentCorrectClass" "predNewSamples"     
##  [4] "classifierTable"     "probInClass"         "CCPSenSpec"         
##  [7] "LDASenSpec"          "K1NNSenSpec"         "K3NNSenSpec"        
## [10] "CentroidSenSpec"     "SVMSenSpec"          "BCPPSenSpec"        
## [13] "probNew"             "weightLinearPred"    "thresholdLinearPred"
## [16] "GRPCentroid"         "pmethod"             "workPath"

Here is a brief explanation of each object in res:

  • res$performClass is a data frame with the performance of classifiers during cross-validation:
res$performClass[1:11,]
##    Array id Class label Mean Number of genes in classifier CCP Correct?
## 1     s1996       BRCA1                                 16          YES
## 2     s1822       BRCA1                                 20          YES
## 3     s1714       BRCA1                                 28          YES
## 4     s1224       BRCA1                                 15          YES
## 5     s1252       BRCA1                                 28          YES
## 6     s1510       BRCA1                                 20          YES
## 7     s1905       BRCA1                                 20          YES
## 8     s1900       BRCA2                                 13          YES
## 9     s1787       BRCA2                                 17          YES
## 10    s1721       BRCA2                                 10          YES
## 11    s1486       BRCA2                                 17           NO
##    DLDA Correct? 1NN Correct? 3NN Correct? Nearest Centroid Correct?
## 1            YES          YES          YES                       YES
## 2            YES          YES          YES                       YES
## 3            YES          YES          YES                       YES
## 4            YES          YES          YES                       YES
## 5             NO          YES           NO                       YES
## 6            YES          YES          YES                       YES
## 7            YES          YES          YES                       YES
## 8            YES          YES           NO                        NO
## 9            YES          YES          YES                       YES
## 10           YES          YES          YES                       YES
## 11            NO          YES           NO                        NO
##    SVM Correct? BCCP Correct?
## 1           YES           YES
## 2           YES           YES
## 3           YES           YES
## 4           YES           YES
## 5           YES           YES
## 6           YES           YES
## 7           YES           YES
## 8           YES           YES
## 9           YES           YES
## 10          YES           YES
## 11           NO            NO
  • res$percentCorrectClass is a data frame with the percentage of training samples correctly classified during cross-validation by each prediction method.
res$percentCorrectClass
##   CCP Correct? DLDA Correct? 1NN Correct? 3NN Correct?
## 1           91            82          100           73
##   Nearest Centroid Correct? SVM Correct? BCCP Correct?
## 1                        82           91            91
  • res$predNewSamples is a data frame with predicted class for each new sample. NC means that a sample is not classified. In this example, there are four new samples.
res$predNewSamples[1:4,]
##   ExpID TrueClass   CCP   LDA    K1    K3 Centroid   SVM  BCCP
## 1 s1816   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2
## 2 s1616   predict BRCA2 BRCA1 BRCA2 BRCA1    BRCA2 BRCA2    NC
## 3 s1063   predict BRCA1 BRCA1 BRCA1 BRCA1    BRCA1 BRCA1 BRCA1
## 4 s1936   predict BRCA2 BRCA2 BRCA2 BRCA2    BRCA2 BRCA2 BRCA2
  • res$probNew is a data frame with the predicted probability of each new sample belonging to the class (BRCA1) from the Bayesian Compound Covariate method.
res$probNew[1:4,]
##   Array id Class Probability
## 1    s1816 BRCA1  p < 1.0e-3
## 2    s1616 BRCA1       0.344
## 3    s1063 BRCA1           1
## 4    s1936 BRCA1  p < 1.0e-3
  • res$classifierTable is a data frame with composition of classifiers such as geometric means of values in each class, p-values and Gene IDs.

  • res$probInClass is a data frame with the predicted probability of each training sample belonging to a class during cross-validation from the Bayesian Compound Covariate Predictor.

  • res$CCPSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Compound Covariate Predictor Classifier.

  • res$LDASenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Diagonal Linear Discriminant Analysis Classifier.

  • res$K1NNSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 1-Nearest Neighbor Classifier.

  • res$K3NNSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the 3-Nearest Neighbor Classifier.

  • res$CentroidSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Nearest Centroid Classifier.

  • res$SVMSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Support Vector Machine Classifier.

  • res$BCPPSenSpec is a data frame with performance (i.e., sensitivity, specificity, positive predictive value, negative predictive value) of the Bayesian Compound Covariate Classifier.

  • res$weightLinearPred is a data frame with gene weights for linear predictors such as Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Support Vector Machine.

  • res$thresholdLinearPred contains the thresholds for the linear prediction rules associated with res$weightLinearPred. Each prediction rule is defined by the inner product of the weights (\(w_i\)) and the log expression values (\(x_i\)) of the significant genes. In this case, a sample is classified to the class BRCA1 if the sum is greater than the threshold; that is, \(\sum_i w_i x_i > \text{threshold}\).

  • res$GRPCentroid is a data frame with centroid of each class for each predictor gene.

  • res$pmethod is a vector of prediction methods that are specified.

  • res$workPath is the path for Fortran and other intermediate outputs.
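The linear prediction rule described for res$thresholdLinearPred can be illustrated with a small sketch. The weights, expression values and threshold below are made up for illustration and are not taken from res$weightLinearPred or res$thresholdLinearPred; the gene and sample names are hypothetical as well.

```r
# Hypothetical weights for three significant genes.
w <- c(geneA = 2.0, geneB = -1.5, geneC = 0.5)
threshold <- 0.25  # hypothetical threshold

# Log expression values: rows are genes, columns are samples.
x <- cbind(sample1 = c(1.0, 0.2, 0.4),
           sample2 = c(-0.5, 0.8, 0.1))
rownames(x) <- names(w)

# Inner product of the weights and the expression values, one score per sample.
score <- drop(crossprod(w, x))

# A sample is assigned to BRCA1 when its score exceeds the threshold.
predicted <- ifelse(score > threshold, "BRCA1", "BRCA2")
```

With these toy numbers, sample1 scores 1.9 and is assigned to BRCA1, while sample2 scores -2.15 and is assigned to BRCA2.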

Cross-validation ROC curves are provided for Compound Covariate Predictor, Diagonal Linear Discriminant Analysis and Bayesian Compound Covariate Classifiers.

library(classpredict)
res <- test.classPredict("Brca",outputName = "ClassPrediction")
## Getting analysis results ...
plotROCCurve(res,"ccp")
plotROCCurve(res,"dlda")
plotROCCurve(res,"bcc")
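For readers unfamiliar with ROC curves, the sketch below shows how the points of such a curve can be derived from continuous scores and true class labels in base R. It is not the implementation behind plotROCCurve; the scores and labels are toy values chosen for illustration.

```r
# Toy continuous scores (e.g., a linear predictor or a class probability) and true labels.
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2)
truth <- c("BRCA1", "BRCA1", "BRCA2", "BRCA1", "BRCA2", "BRCA2", "BRCA2")
positive <- "BRCA1"

# One cut-off per distinct score, from strictest to most lenient.
cutoffs <- sort(unique(score), decreasing = TRUE)

roc <- t(sapply(cutoffs, function(cut) {
  called <- score >= cut  # samples called positive at this cut-off
  c(cutoff = cut,
    sensitivity = sum(called & truth == positive) / sum(truth == positive),
    specificity = sum(!called & truth != positive) / sum(truth != positive))
}))
roc <- as.data.frame(roc)

# Area under the curve by the trapezoidal rule, starting from the (0, 0) corner.
fpr <- c(0, 1 - roc$specificity)
tpr <- c(0, roc$sensitivity)
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "b", xlab = "1 - specificity", ylab = "Sensitivity")
```

Each row of roc corresponds to one cut-off; lowering the cut-off can only increase sensitivity and decrease specificity, which is what traces out the curve.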

When the argument generateHTML is set to TRUE, an HTML file called ClassPrediction.html will be created under C:\Users\YourUserName\Documents\Brca\Output\ClassPrediction.

4 Data Input

classPredict is the main R function to perform class prediction analysis. This section describes in detail how to prepare the inputs for classPredict. Once again, we use the “Brca” sample data as an example. The package contains the following “Brca” sample information:

  • Brca_LOGRAT.txt: a table of expression data with rows representing genes and columns representing samples;

  • Brca_FILTER.TXT: a list of filtering information, where 1 means the corresponding gene passes the filters while 0 means it is excluded from analysis;

  • Brca_GENEID.txt: a table of gene information corresponding to row information of Brca_LOGRAT.txt and Brca_FILTER.TXT;

  • Brca_EXPDESIGN.txt: a table with class information and/or separate test set information.

There are a total of 15 samples, of which 11 will be used as training data and the remaining four are new samples for class prediction. We run the following code to obtain objects such as exprTrain and exprTest as inputs to classPredict.

dataset<-"Brca"
# gene IDs
geneId <- read.delim(system.file("extdata", paste0(dataset, "_GENEID.txt"), package = "classpredict"), as.is = TRUE, colClasses = "character") 
# expression data
x <- read.delim(system.file("extdata", paste0(dataset, "_LOGRAT.TXT"), package = "classpredict"), header = FALSE)
# filter information, 1 - pass the filter, 0 - filtered
filter <- scan(system.file("extdata", paste0(dataset, "_FILTER.TXT"), package = "classpredict"), quiet = TRUE)
# class information
expdesign <- read.delim(system.file("extdata", paste0(dataset, "_EXPDESIGN.txt"), package = "classpredict"), as.is = TRUE)
# training/test information
testSet <- expdesign[, 10]
trainingInd <- which(testSet == "training")
predictInd <- which(testSet == "predict")
ind1 <- which(expdesign[trainingInd, 4] == "BRCA1")
ind2 <- which(expdesign[trainingInd, 4] == "BRCA2")
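# ind1 and ind2 are positions within the training subset; map them back to columns of x
ind <- trainingInd[c(ind1, ind2)]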
exprTrain <- x[, ind]
colnames(exprTrain) <- expdesign[ind, 1]
exprTest <- x[, predictInd]
colnames(exprTest) <- expdesign[predictInd, 1]

exprTrain is a 3226*11 matrix with rows representing genes and columns representing 11 training samples.

exprTrain[1:5,]
##        s1996       s1822      s1714       s1224      s1252      s1510
## 1 -3.0817938 -2.73039293 -1.8744690 -2.28824496 -0.3453870 -1.4232113
## 2  0.2781018 -0.20113993 -0.5334322 -0.57929373 -0.2874397 -0.8826430
## 3  0.4375801  0.10479617  0.9533499 -0.22050031  0.3532323 -0.6731896
## 4 -0.8389376 -0.23562828  0.6195197  0.81221521 -0.4181434 -0.5250910
## 5 -0.4340958  0.06756324  0.7655347 -0.09386685 -0.4181434  0.3841435
##        s1905      s1900      s1787       s1721       s1486
## 1 -1.6828099 -1.7776077 -0.2410080 -0.29195589  0.24146917
## 2 -1.0000000 -0.4150376 -1.0223678 -0.74802077 -1.16699564
## 3  0.9940752  0.5109619 -0.1643868  0.02185956  0.24146917
## 4  0.7697023  0.2630344  0.6429682  1.45843005 -0.04146478
## 5 -0.2725259 -0.1926452 -0.5145731 -0.62403196 -0.01761806

exprTest is a 3226*4 matrix with the expressions of four new samples.

exprTest[1:5,]
##        s1816      s1616      s1063       s1936
## 1 -0.8214026 -0.5618789 -0.4611339 -0.93288577
## 2 -0.8614801 -1.6322682 -0.7737241 -0.33342373
## 3  0.4066253  0.4381211  0.4116309  1.25153875
## 4  1.3286228  1.3737305  0.5574818  1.02272010
## 5  1.3330686  1.2422009  0.0402640  0.03394729
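One detail of this kind of split is easy to get wrong: which() applied to a subset returns positions relative to that subset, so they need to be mapped back through the training indices before selecting columns of the full expression table. The sketch below illustrates this on a toy design table and expression matrix; all names and values are hypothetical and do not come from the Brca files.

```r
# Toy design table; the prediction sample sits between the training samples.
design <- data.frame(
  id    = c("a1", "a2", "p1", "b1", "b2"),
  class = c("BRCA1", "BRCA1", "", "BRCA2", "BRCA2"),
  set   = c("training", "training", "predict", "training", "training"),
  stringsAsFactors = FALSE
)
# Toy expression matrix: 3 genes (rows) by 5 samples (columns).
expr <- matrix(seq_len(15), nrow = 3)

trainingInd <- which(design$set == "training")  # positions 1, 2, 4, 5
predictInd  <- which(design$set == "predict")   # position 3

# Positions of each class within the training subset ...
rel1 <- which(design$class[trainingInd] == "BRCA1")
rel2 <- which(design$class[trainingInd] == "BRCA2")
# ... mapped back to column positions in the full table.
cols <- trainingInd[c(rel1, rel2)]

train <- expr[, cols, drop = FALSE]
colnames(train) <- design$id[cols]
test <- expr[, predictInd, drop = FALSE]
colnames(test) <- design$id[predictInd]
cls <- design$class[cols]  # class labels in the same order as the columns of train
```

When the training samples happen to occupy the first rows of the design table, the relative and absolute positions coincide, which is why the mapping step only matters for other orderings.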

The following procedure develops seven classifiers from all training samples and uses them to predict the classes of the new samples. Individual genes used by the classifiers are selected at the 0.001 significance level. The random variance model is used for the univariate tests. Leave-one-out cross-validation is employed to evaluate class prediction accuracy: within each cross-validation loop, predictors are selected and classifiers are trained on the cross-validated training set, and the cross-validated estimate of misclassification error is calculated over the cross-validated test set. Equal prior probabilities are assumed for the Bayesian Compound Covariate Predictor.

projectPath <- file.path(Sys.getenv("HOME"),"Brca")
outputName <- "classPrediction2"
generateHTML <- TRUE
prevalence <- c(length(ind1)/(length(ind1)+length(ind2)),length(ind2)/(length(ind1)+length(ind2)))
names(prevalence) <- c("BRCA1", "BRCA2")
resList <- classPredict(exprTrain = exprTrain, exprTest = exprTest, isPaired = FALSE, 
                        pairVar.train = NULL, pairVar.test = NULL, geneId,
                        cls = c(rep("BRCA1", length(ind1)), rep("BRCA2", length(ind2))),
                        pmethod = c("ccp", "bcc", "dlda", "knn", "nc", "svm"), 
                        geneSelect = "igenes.univAlpha",
                        univAlpha = 0.001, univMcr = 0, foldDiff = 0, rvm = TRUE, filter = filter, 
                        ngenePairs = 25, nfrvm = 10, cvMethod = 1, kfoldValue = 10, bccPrior = 1, 
                        bccThresh = 0.8, nperm = 0, svmCost = 1, svmWeight = 1, fixseed = 1, 
                        prevalence = prevalence, projectPath = projectPath, 
                        outputName = outputName, generateHTML = generateHTML)
if (generateHTML)
  browseURL(file.path(projectPath, "Output", outputName,
            paste0(outputName, ".html")))

It returns the same list as shown in the Quick Start Section. For more details about classPredict, please type help("classPredict") in the R console.
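To make the evaluation scheme concrete, here is a minimal base R sketch of leave-one-out cross-validation in which gene selection is repeated inside every fold. It is not the package's implementation: the data are simulated, an ordinary two-sample t-test stands in for the random variance model, and a simple nearest centroid rule stands in for the seven classifiers. All object names are hypothetical.

```r
set.seed(1)  # simulated data, not the Brca sample data
nGenes <- 200
cls <- rep(c("A", "B"), times = c(6, 6))
expr <- matrix(rnorm(nGenes * length(cls)), nrow = nGenes)
# Shift the first 10 genes in class B so that there is a signal to find.
expr[1:10, cls == "B"] <- expr[1:10, cls == "B"] + 3

alpha <- 0.001  # univariate significance level for gene selection

predicted <- character(length(cls))
nSelected <- integer(length(cls))
for (i in seq_along(cls)) {
  trainX <- expr[, -i, drop = FALSE]
  trainY <- cls[-i]
  # Gene selection uses the training fold only, never the left-out sample.
  pval <- apply(trainX, 1, function(g) {
    t.test(g[trainY == "A"], g[trainY == "B"], var.equal = TRUE)$p.value
  })
  keep <- which(pval < alpha)
  if (length(keep) == 0) keep <- which.min(pval)  # fall back to the single best gene
  nSelected[i] <- length(keep)
  # Nearest centroid rule on the selected genes.
  centA <- rowMeans(trainX[keep, trainY == "A", drop = FALSE])
  centB <- rowMeans(trainX[keep, trainY == "B", drop = FALSE])
  xi <- expr[keep, i]
  predicted[i] <- if (sum((xi - centA)^2) <= sum((xi - centB)^2)) "A" else "B"
}

# Cross-validated percentage of correctly classified samples.
percentCorrect <- round(100 * mean(predicted == cls))
```

Repeating the gene selection inside each fold is the important design choice: selecting genes once on all samples and cross-validating only the classifier would let the left-out sample influence the predictor and bias the error estimate downwards.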

5 Session info

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] classpredict_0.2
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0  magrittr_1.5    htmltools_0.3.6 tools_3.6.0    
##  [5] yaml_2.2.0      Rcpp_1.0.1      stringi_1.4.3   rmarkdown_1.13 
##  [9] knitr_1.23      stringr_1.4.0   digest_0.6.19   xfun_0.8       
## [13] ROC_1.60.0      evaluate_0.14