One of the main focuses of the Molecular Statistics and Bioinformatics
Section is developing methodology to assist in the analysis of gene
expression data derived from microarray systems. We are interested in
developing statistical methods for both the "early phase" (design of
arrays, image analysis, data quality assessment, ratio calculation and
normalization, etc.) and "discovery phase" (cluster analysis, class
prediction, tests of significance, etc.) of microarray research projects.
Several methods we have developed and are developing are discussed below.
Assessment of Cluster Reproducibility
Hierarchical cluster analysis is a popular method for examining the relationships between genes or experiments based on gene expression data from microarray experiments. Cluster analysis can be helpful for determining which genes have the most similar expression across experiments and which experiments have the most similar gene expression profiles. However, cluster algorithms always result in the formation of clusters, even for data sets where little or no underlying structure is present. Furthermore, statistical significance cannot be assigned to particular clusters using standard statistical techniques.
Therefore, we have developed measures that are helpful in assessing the reproducibility of individual clusters. The fundamental idea behind these measures is that the most believable clusters are those that would persist given small perturbations of the data, where the perturbations represent an anticipated level of noise in gene expression measurements due to assay variability and variation due to sub-sampling of specimens. This method was used to assess the reproducibility of a novel clustering of melanoma samples and cell lines in a collaboration with a group of investigators from the National Human Genome Research Institute [1].
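The perturbation idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the melanoma study: it assumes Gaussian noise with a user-supplied standard deviation (noise_sd), average-linkage clustering on Euclidean distance cut at a fixed number of clusters k, and it scores each original cluster by the proportion of its specimen pairs that remain together after re-clustering the perturbed data. All function and parameter names are hypothetical.

```python
import math
import random
from itertools import combinations


def euclid(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clusters(points, k):
    """Average-linkage agglomerative clustering, cut at k clusters.

    Returns a list of clusters, each a sorted list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
            d = total / (len(clusters[i]) * len(clusters[j]))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return [sorted(c) for c in clusters]


def cluster_robustness(points, k, noise_sd, n_perturb=100, seed=0):
    """For each original cluster, the proportion of member pairs that stay
    together when noisy copies of the data are re-clustered (assumed measure)."""
    rng = random.Random(seed)
    original = hierarchical_clusters(points, k)
    kept = [0] * len(original)
    for _ in range(n_perturb):
        # Assumption: independent Gaussian noise on every measurement.
        noisy = [[x + rng.gauss(0, noise_sd) for x in p] for p in points]
        label = {}
        for c_id, members in enumerate(hierarchical_clusters(noisy, k)):
            for m in members:
                label[m] = c_id
        for c, members in enumerate(original):
            kept[c] += sum(label[a] == label[b]
                           for a, b in combinations(members, 2))
    result = []
    for c, members in enumerate(original):
        n_pairs = len(members) * (len(members) - 1) // 2
        score = kept[c] / (n_pairs * n_perturb) if n_pairs else float("nan")
        result.append((members, score))
    return result
```

A score near 1 suggests a cluster that persists under the assumed noise level, while a low score suggests the cluster may be an artifact of the algorithm. The value of noise_sd would need to be chosen to reflect the anticipated assay and sub-sampling variability.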
Class Prediction of Tumor Subtypes
One of the potential uses of microarray data is for the classification of specimens into phenotypic, prognostic or predictive groups based solely on their gene expression profiles. For example, a substantial proportion of node-negative breast cancers will be cured by surgery alone (with no further treatment). Determining which patients have such cancers would be of great clinical benefit. Though breast cancers that are curable by surgery alone may not be phenotypically distinguishable from others, it is possible they have distinguishable gene expression patterns that can be used as the basis of classification.
We have developed a method for the classification of specimens into
one of two pre-determined classes based on gene expression data using
a compound covariate predictor [2]. The predictor is a linear combination
of the log-expression ratios of genes differentially expressed between
the two classes, with the log-ratio of each gene weighted by the univariate
two-sample t-statistic for the gene. A classification threshold is selected
that assigns a specimen to one of the two classes based on the value
of its compound covariate predictor. We use a cross-validated approach
for the classification of specimens and have developed a permutation
test for determining the significance of resulting misclassification
error rates. We are applying this method to microarray data for various
types of cancer in collaboration with researchers within the National
Institutes of Health.
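The compound covariate predictor described above can be illustrated with a short Python sketch. Several details are assumptions made for the example rather than statements about our implementation: genes are treated as differentially expressed when the absolute pooled-variance t-statistic reaches a cutoff (t_cut), the classification threshold is taken as the midpoint of the two class means of the compound covariate, cross-validation is leave-one-out, and the permutation test shuffles class labels and recomputes the cross-validated error count. All function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic for one gene."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # simplification: skip genes with no within-class variation
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def fit_ccp(X, y, t_cut=2.0):
    """X: list of specimens (lists of log-ratios); y: list of 0/1 labels.

    Returns (weights, threshold); weights maps gene index -> t-statistic.
    """
    weights = {}
    for g in range(len(X[0])):
        a = [x[g] for x, lab in zip(X, y) if lab == 1]
        b = [x[g] for x, lab in zip(X, y) if lab == 0]
        t = t_statistic(a, b)
        if abs(t) >= t_cut:  # assumed gene-selection rule
            weights[g] = t

    def score(x):
        return sum(w * x[g] for g, w in weights.items())

    m1 = mean(score(x) for x, lab in zip(X, y) if lab == 1)
    m0 = mean(score(x) for x, lab in zip(X, y) if lab == 0)
    return weights, (m1 + m0) / 2  # assumed threshold: midpoint of class means


def predict(model, x):
    """Assign class 1 if the compound covariate exceeds the threshold."""
    weights, threshold = model
    return 1 if sum(w * x[g] for g, w in weights.items()) > threshold else 0


def loocv_errors(X, y, t_cut=2.0):
    """Leave-one-out misclassifications; gene selection is redone
    inside every fold so the held-out specimen never influences it."""
    errors = 0
    for i in range(len(X)):
        model = fit_ccp(X[:i] + X[i + 1:], y[:i] + y[i + 1:], t_cut)
        errors += predict(model, X[i]) != y[i]
    return errors


def permutation_p_value(X, y, n_perm=200, seed=0, t_cut=2.0):
    """Proportion of label permutations whose cross-validated error count
    is no larger than the observed one."""
    rng = random.Random(seed)
    observed = loocv_errors(X, y, t_cut)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        hits += loocv_errors(X, y_perm, t_cut) <= observed
    return (1 + hits) / (1 + n_perm)
```

The sketch requires at least three specimens per class so that each leave-one-out training set still allows a variance estimate. Repeating gene selection within each fold is the important design choice: selecting genes once on the full data set would bias the cross-validated error rate downward.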
Comparison of Microarray Designs for Class Comparison and Class Discovery
Microarray design can have a significant impact on researchers'
ability to identify genes associated with cancer phenotypes, and to
discover new taxonomies for tumors from expression profiles. Complementary
DNA microarrays are based on competitive hybridization of pairs of RNA
samples to the array. Frequently, a sub-sample of a common reference
RNA sample is used as one of the two samples hybridized on each microarray.
Recently, other experimental designs for allocating samples to arrays
have been proposed. However, the relative merits of microarray designs
have not been thoroughly evaluated.
We have developed a statistical model that facilitates the evaluation
of designs when the goal is to compare pre-specified groups, and when
the goal is to seek out new taxonomies. In all cases, design description
must include the level at which samples are to be drawn, e.g. which
are to represent multiple aliquots from a single RNA source, and which
aliquots from different sources. When comparing pre-specified groups,
the relative efficiencies of different designs are shown to depend on
the relation between intra- and inter-sample variability. When seeking
out new taxonomies, both analytic results and Monte Carlo methods show
that for certain designs the ability to identify meaningful clusters
breaks down as the sample size increases. These results suggest some
relatively straightforward guidelines for selecting a microarray design
depending on the objectives of the experiment.
Prognostic Prediction Using Gene Expression Profiles
We are extending our research on tumor class prediction for binary
outcome data to cases where the outcome is continuous. Specifically, we
are developing methodology for associating patterns of gene expression
with survival time in patients diagnosed with cancer. We adjust for
standard prognostic factors in developing a gene expression prognostic
index so that, if significance is obtained for the resulting index,
the index provides prognostic information beyond current standards.
Controlling the number of false discoveries: Application to high-dimensional genomic data
A straightforward approach to the identification of genes expressed
differentially between different groups of individuals is to perform
a univariate analysis of group mean differences for each gene, and then
identify those genes that are most statistically significant. Using
nominal significance levels (unadjusted for the multiple comparisons)
will lead to the identification of many genes that are not truly differentially
expressed ("false discoveries"). A reasonable strategy in many situations
is to allow a small number of false discoveries, or to allow a small proportion
of the identified genes to be false discoveries. Although previous work
has considered control of the expected proportion of false discoveries,
we show these methods may be inadequate. We propose two stepwise permutation-based
procedures to control with specified confidence the actual number of
false discoveries and approximately the actual proportion of false discoveries.
Limited simulation studies demonstrate substantial gain in sensitivity
to detect truly differentially expressed genes even when allowing as
few as one or two false discoveries. We apply these new methods to analyze
a microarray data set consisting of measurements on approximately 9000
genes in paired tumor specimens, collected both before and after chemotherapy
on 20 breast cancer patients. The methods described are broadly applicable
to the problem of identifying which variables of any large set of measured
variables differ between pre-specified groups.
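To make the idea of tolerating a fixed number of false discoveries concrete, the following Python sketch shows a simplified single-step permutation rule for paired data. It is not the stepwise procedures proposed above, which are not reproduced here. The sketch assumes each patient contributes one vector of post-minus-pre differences, generates the null distribution by flipping the sign of a patient's entire vector (which preserves correlation among genes), and takes as the critical value a conservative upper quantile of the (u+1)-th largest absolute paired t-statistic. Under these assumptions, more than u false discoveries should occur with probability of roughly alpha or less. All names are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def paired_t(diffs):
    """One-sample t-statistic on paired differences for a single gene."""
    m, s = mean(diffs), stdev(diffs)
    if s == 0:
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    return m / (s / math.sqrt(len(diffs)))


def abs_t_by_gene(D):
    """D: list of patients, each a list of per-gene paired differences."""
    return [abs(paired_t([row[g] for row in D])) for g in range(len(D[0]))]


def select_genes(D, u=1, alpha=0.05, n_perm=500, seed=0):
    """Return (selected gene indices, critical value).

    Single-step rule (a simplification): a gene is selected when its
    observed |t| exceeds an upper quantile of the permutation
    distribution of the (u+1)-th largest |t| across all genes.
    Requires more than u genes.
    """
    rng = random.Random(seed)
    observed = abs_t_by_gene(D)
    null_stats = []
    for _ in range(n_perm):
        flipped = []
        for row in D:
            sign = rng.choice((-1, 1))  # one sign per patient, all genes
            flipped.append([sign * v for v in row])
        ranked = sorted(abs_t_by_gene(flipped), reverse=True)
        null_stats.append(ranked[u])  # (u+1)-th largest statistic
    null_stats.sort()
    idx = min(n_perm - 1, math.ceil((1 - alpha) * n_perm))
    crit = null_stats[idx]
    selected = [g for g, t in enumerate(observed) if t > crit]
    return selected, crit
```

Setting u=0 reduces this rule to a max-statistic procedure that guards against any false discovery; allowing u of one or two lowers the critical value and is the source of the gain in sensitivity noted above.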
Identifying pre-post chemotherapy differences in gene expression in breast tumors: a statistical method appropriate for this aim
Although widely used for the analysis of gene expression microarray
data, cluster analysis may not be the most appropriate statistical technique
for some study aims. We demonstrate this by considering a previous analysis
of microarray data obtained on breast tumor specimens, many of which
were paired specimens from the same patient before and after chemotherapy.
Reanalyzing the data using statistical methods that appropriately utilize
the paired differences for identification of differentially expressed
genes, we find 17 genes that we can confidently identify as more highly expressed
after chemotherapy than before. These findings were not reported by
the original investigators who analyzed the data using cluster analysis
techniques.
Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data
Recent technological advances such as cDNA microarray technology have
made it possible to simultaneously interrogate thousands of genes in
a biological specimen. A cDNA microarray experiment produces a gene
expression "profile". Often interest lies in discovering novel subgroupings,
or "clusters", of specimens based on their profiles, for example identification
of new tumor taxonomies. Cluster analysis techniques such as hierarchical
clustering and self-organizing maps have frequently been used for investigating
structure in microarray data. However, clustering algorithms always
detect clusters, even on random data, and it is easy to misinterpret
the results without some objective measure of the reproducibility of
the clusters. We present statistical methods for testing for overall
clustering of gene expression profiles, and we define easily interpretable
measures of cluster-specific reproducibility that facilitate understanding
of the clustering structure. We apply these methods to elucidate structure
in cDNA microarray gene expression profiles obtained on melanoma tumors
and on prostate specimens.
Multiple comparisons methods applied to multivariate Cox regression models
When clinical outcome data are available for a set of specimens that
have been molecularly profiled by cDNA arrays, it is of interest to
identify genes whose expression levels are associated with survival.
One approach to this problem is to perform a univariate survival analysis
relating survival to expression level for each gene, and then identify
those genes that are most statistically significant. Using nominal significance
levels (unadjusted for the multiple comparisons) will lead to the identification
of many genes that are not truly associated with survival ("false discoveries").
If there is no adjustment for other covariates in the survival model,
step-down permutation methods that control the number or proportion
of false discoveries can be readily adapted to this setting. However,
identification of genes that remain significantly associated with survival
after adjustment for standard prognostic variables is of particular
interest. In this situation, permutation techniques cannot be directly
applied due to likely correlations among the genes and standard prognostic
variables. We are exploring modified permutation and bootstrapping methods
to address this problem.
Statistical treatment of saturated spots in cDNA microarray data
Saturation of fluorescent signal may be encountered for spots on a
cDNA microarray corresponding to highly expressed genes. Naïve thresholding
of pixel levels at the saturation point will lead to underestimation
of total intensity for the spot. We explore some statistical methods
to adjust for saturation and provide less biased estimates of total
spot intensity.
References
1. Bittner, M. et al., Molecular classification of cutaneous malignant
melanoma by gene expression profiling, Nature, 406:536-540, 2000.
2. Tukey, J.W., Tightening the clinical trial, Controlled Clinical
Trials, 14:266-285, 1993.