RESEARCH PROGRAM
The main focus
of my research has been the development of statistical methodology and its application
to biological and medical problems. More specifically, my work has been centered
on developing and applying data mining techniques to the high-dimensional
cancer datasets arising
in gene expression and
copy number microarray, methylation and genotyping studies. I am also interested in developing methods for combining different types of data across genomic platforms
(e.g., gene expression and array CGH) and across species (e.g., human and mouse). My
ongoing research includes:
- Development of methods for variable selection.
Identifying
the loci responsible for variation in quantitative or binary traits, such as tumor size, cancer
subtype or survival status, is a
problem of great importance
to biologists. One of the main features of the genomic datasets is their unfavorable
"p" (number of variables) to "n" (number of samples) ratio. Many
important genetic variables affect the trait of interest via epistasis
rather than on their own, and identification of such genes is notoriously
hard when the p/n ratio is large. In my PhD thesis,
I developed a novel approach for discovering interacting loci in the
context of mouse linkage studies. There, I used binary decision trees
in combination with powerful aggregation approaches, shifting
the focus of the analysis from prediction to variable selection. This approach has
been successfully used as an exploratory tool in the study of plasmacytoma-related
morbidity in Emu-v-abl transgenic
mice (Symons et al., 2001). I am interested in
applying these variable selection ideas
to the more complicated setting of microarray studies and in generalizing
them for pathway discovery.
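The shift from prediction to variable selection can be illustrated with a minimal sketch. It is not the method developed in the thesis: it assumes binary predictors and a binary trait, uses single-split trees (stumps) chosen by Gini impurity, and simply records how often each variable wins across bootstrap samples. Single splits cannot capture interactions, which would need deeper trees. The function names (gini, best_stump_variable, selection_frequency) are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump_variable(rows, y):
    """Index of the binary variable whose split has the lowest weighted impurity."""
    n = len(rows)

    def split_impurity(j):
        left = [yi for r, yi in zip(rows, y) if r[j] == 0]
        right = [yi for r, yi in zip(rows, y) if r[j] == 1]
        return (len(left) * gini(left) + len(right) * gini(right)) / n

    return min(range(len(rows[0])), key=split_impurity)


def selection_frequency(rows, y, n_boot=100, seed=0):
    """Fraction of bootstrap samples in which each variable is the best split."""
    rng = random.Random(seed)
    n = len(rows)
    counts = [0] * len(rows[0])
    for _ in range(n_boot):
        # Resample observations with replacement, then pick the winning variable.
        idx = [rng.randrange(n) for _ in range(n)]
        winner = best_stump_variable([rows[i] for i in idx], [y[i] for i in idx])
        counts[winner] += 1
    return [c / n_boot for c in counts]
```

Variables with a high selection frequency are candidates for follow-up. The frequencies are an exploratory ranking, not a significance test.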
- Development of methods for accurate class prediction.
Accurate class prediction
is a problem of the utmost importance in cancer classification. There, biologists
are often interested in developing genetic methods for tumor
subtype identification and prognosis. Microarray tumor data are complex: they combine
an unfavorable ratio of the number of variables to the number
of samples, many sets of highly correlated variables
(e.g., co-regulated genes), and high between-patient heterogeneity. The question
therefore arises as to whether classic
statistical methods for discrimination can be used for this new type of data. Together with my Ph.D. adviser,
Dr. Terry Speed, and my
collaborator, Dr. Sandrine Dudoit, I
conducted a thorough comparison study on several publicly available gene
expression datasets, each containing known cancer subtypes. We were able
to fairly compare a number of traditional discrimination methods, such
as k-nearest neighbors (kNN) and linear discriminant analysis (LDA), with
state-of-the-art machine learning approaches, including
bagging and boosting. In our JASA article (2001)
and in book chapters published by Chapman & Hall/CRC and Kluwer,
we demonstrated
that in a typical gene expression dataset, the otherwise successful machine learning
classifiers have no advantage over traditional statistical methods
noted for their high bias and low variance, e.g., kNN or LDA with the
assumption of uncorrelated variables. It is very likely that as
the number of samples in a typical microarray dataset increases, machine
learning methods capable of exploring the space of interactions will
gain an edge over standard discrimination approaches. I am interested
in exploring this with the large datasets available to me at the UCSF
Cancer Center. Additionally, noting that epidemiological issues have been largely
ignored in published microarray studies, I am very interested
in developing epidemiologically
meaningful approaches to the design and analysis of genomic studies for clinical purposes, in particular,
carefully incorporating clinical information and population
prevalence of the subtypes into the design and analysis. This work has started
in collaboration with Dr. Terry Speed.
- Development of methods for novel class discovery.
Cluster analysis involves
the search through data for observations that are similar enough to each other to
be grouped together. When a
clustering algorithm is applied
to a set of observations, a partition of the data is obtained whether or not the
data exhibit a true or "natural" grouping structure. This fact causes
no problems if clustering is done for obtaining a practical grouping of
the given set of objects, for instance for organizational purposes.
However, if interest lies more in the recognition of an unknown classification
of the data, an artificial clustering is not acceptable, and therefore
clusters resulting from the algorithm must be investigated for their
relevance. Apart from descriptive,
graphical or exploratory methods, this task can be performed by using probabilistic models and suitable
statistical significance tests. Discovery of novel tumor classes using
gene expression data is one example where the need to reliably estimate
the number of clusters and accurately allocate observations arises.
Together with my collaborator, Dr. Sandrine
Dudoit, I proposed applying
resampling methods to (i) estimate the number of clusters in a dataset and (ii) improve
the accuracy of the cluster assignment. The approach to (i) uses
ideas from discriminant analysis. Since the clusters obtained from cluster
analysis are eventually used for prediction purposes, it is natural to
apply discrimination techniques in clustering. For (ii), bootstrap aggregation
is used to improve cluster accuracy and to assign confidence to
the labels of the individual observations. We have demonstrated
the utility of both approaches
on simulated data and real microarray datasets through careful comparisons of our methods with
existing methods. This work is presented in two manuscripts published in Bioinformatics (2002)
and Genome Biology (2003). I am interested in continuing to build upon
the ideas proposed in these manuscripts and applying them to the
datasets of my collaborators at the UCSF Cancer Center.
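The use of bootstrap aggregation to attach confidence to cluster labels can be sketched as follows. This is a simplified illustration under stated assumptions, not the published procedure: the data are one-dimensional, the clustering algorithm is plain k-means, and labels are aligned across bootstrap samples simply by sorting the cluster centers (a general implementation would need to match labels between clusterings explicitly). The names kmeans_1d, nearest and bagged_labels are hypothetical.

```python
import random


def nearest(v, centers):
    """Index of the center closest to value v."""
    return min(range(len(centers)), key=lambda j: abs(v - centers[j]))


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's k-means on 1-D data; returns the centers sorted ascending."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centers)].append(v)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def bagged_labels(values, k, n_boot=200, seed=0):
    """Majority-vote labels and vote fractions over bootstrap clusterings."""
    rng = random.Random(seed)
    votes = [[0] * k for _ in values]
    for _ in range(n_boot):
        # Cluster a bootstrap sample, then classify every original observation.
        sample = [rng.choice(values) for _ in values]
        centers = kmeans_1d(sample, k)
        for i, v in enumerate(values):
            votes[i][nearest(v, centers)] += 1
    labels = [max(range(k), key=lambda j: row[j]) for row in votes]
    confidence = [max(row) / n_boot for row in votes]
    return labels, confidence
```

Observations that sit between two groups receive split votes and therefore a lower confidence value, while observations deep inside a group are labeled consistently across bootstrap samples.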
- Development of methods for the analysis of array CGH data.
The development
of solid tumors is associated with acquisition of complex genetic alterations, indicating
that failures in the mechanisms that maintain the integrity of the genome
contribute to tumor evolution. Thus, one expects that the particular
types of genomic alterations seen in tumors reflect underlying failures in
maintenance of genetic stability, as well as selection for changes that
provide growth advantage. Microarray-based comparative genomic
hybridization (array CGH) can be used to investigate genomic alterations. The
computational task is to map and characterize the number and types of
copy number alterations present in the tumors, and so define copy number
phenotypes as well as to associate them with known biological
markers. To exploit the spatial coherence between nearby clones, I proposed
an unsupervised hidden Markov model (HMM) approach. The clones are partitioned
into states that represent the underlying copy number of each group
of clones. The structural changes in a tumor genome may be recorded and
characterized computationally using the above methodology. The method
is described in our 2004 paper in JMVA and has been successfully applied to a number of cell line and primary tumor datasets.
This research has greatly
benefited from the continuing input of my biological collaborators Dr. Donna Albertson, Dr. Dan Pinkel and
A. Snijders. My
ongoing effort is focused on refining the methodology and incorporating it into the aCGH package for R/Bioconductor, which I have developed together with P. Dimitrov,
a PhD student at UC Berkeley. Additionally, I am very interested in
using the HMM approach for data reduction, so that the resulting lower-dimensional
dataset can be used as input to classification and variable selection
procedures, as well as to procedures
for combining different data types. We are also investigating approaches for identifying discrete
levels of the copy number across the entire genome. Finally, with my biological
collaborators, we have successfully
applied the above methodology to efficiently identify regions of homozygosity and heterozygosity in backcross
mice using array CGH data.
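The way an HMM uses spatial coherence to partition clones into copy number states can be illustrated with a minimal decoding sketch. It makes strong simplifying assumptions that are not taken from the published method: the state means, the common standard deviation and the probability of staying in the same state are fixed by hand, whereas an unsupervised fit would estimate them from the data. The function name viterbi_gaussian and all parameter values are hypothetical.

```python
import math


def viterbi_gaussian(obs, means, sd, stay_prob):
    """Most likely state path for a Gaussian-emission HMM with fixed parameters.

    obs: log2 ratios ordered along the chromosome.
    means: one emission mean per state, e.g. loss, normal, gain.
    """
    if not obs:
        return []
    k = len(means)
    log_stay = math.log(stay_prob)
    # Probability of leaving a state is shared equally among the other states.
    log_move = math.log((1.0 - stay_prob) / (k - 1))
    log_init = -math.log(k)  # uniform initial distribution

    def log_emit(x, mu):
        # Gaussian log density up to a constant common to all states.
        return -0.5 * ((x - mu) / sd) ** 2

    score = [log_init + log_emit(obs[0], m) for m in means]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            best_i = max(
                range(k),
                key=lambda i: score[i] + (log_stay if i == j else log_move),
            )
            trans = log_stay if best_i == j else log_move
            new_score.append(score[best_i] + trans + log_emit(x, means[j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(k), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With a high probability of staying in the same state, a run of several elevated clones is assigned to the gain state, while a single elevated clone surrounded by normal ones is treated as noise. This is the practical benefit of modeling neighboring clones jointly rather than thresholding each clone separately.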
- Development of methods for combining clinical, copy number and expression data for identification of driver genes and discovery of novel pathways.
A growing number of cancer
datasets contain copy number and gene expression measurements as well
as clinical information on the samples. An interesting biological question is
how to identify so-called driver genes, i.e., genes whose deregulation
is strongly selected for by the tumor. Currently, the computational
methods for addressing this question are very primitive and generally
treat copy number and expression data symmetrically by concentrating
on computing correlations of mRNA levels and copy number for a given
gene. Clinical information has been largely ignored in this context. Together
with my collaborator, Dr. Ru
Fang Yeh, we propose novel
methods for combining arbitrary types of genomic and clinical information
to infer gene modules. This
work is at a very preliminary stage. If it is successful,
we will proceed to test the resulting hypotheses using the experimental approaches of our
collaborators at LBL.
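For reference, the simple baseline described above, in which each gene is scored by the correlation between its copy number and its mRNA level across samples, can be sketched as follows. This illustrates only that baseline, not the proposed methods, and it ignores clinical information entirely. The function names and the dictionary layout (gene name mapped to a list of per-sample values) are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def rank_genes_by_correlation(copy_number, expression):
    """Rank genes by copy number / expression correlation, highest first.

    Both arguments map a gene name to per-sample values in the same sample order.
    Genes missing from either dataset, or with constant values, are skipped.
    """
    scores = []
    for gene, cn_values in copy_number.items():
        if gene not in expression:
            continue
        r = pearson(cn_values, expression[gene])
        if not math.isnan(r):
            scores.append((gene, r))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

Genes at the top of such a ranking are those whose expression tracks copy number most closely. The score is symmetric in the two data types, which is the limitation noted above.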