RESEARCH PROGRAM

The main focus of my research has been the development of statistical methodology and its application to biological and medical problems. More specifically, my work has centered on developing and applying data mining techniques to the high-dimensional cancer datasets arising in gene expression and copy number microarray, methylation, and genotyping studies. I am also interested in developing methods for combining different types of data across genomic platforms (e.g., gene expression and array CGH) and across species (e.g., human and mouse). My ongoing research includes:

Identifying the loci responsible for variation in quantitative or binary traits, such as tumor size, cancer subtype, or survival status, is a problem of great importance to biologists. One of the main features of genomic datasets is their unfavorable ratio of "p" (number of variables) to "n" (number of samples). Many important genetic variables affect the trait of interest via epistasis rather than on their own, and identifying such genes is notoriously hard when the p/n ratio is large. In my PhD thesis, I developed a novel approach for discovering interacting loci in the context of mouse linkage studies. There, I used binary decision trees in combination with powerful aggregation approaches, shifting the focus of the analysis from prediction to variable selection. This approach has been used successfully as an exploratory tool in the study of plasmacytoma-related morbidity in Emu-v-abl transgenic mice (Symons et al., 2001). I am interested in applying these variable selection ideas to the even more complicated setting of microarray studies and in generalizing them for pathway discovery.
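
The idea of using aggregated trees for variable selection rather than prediction can be illustrated with a minimal sketch. It is not the thesis method itself: it assumes binary (0/1) markers, a binary trait, greedy depth-2 trees split on Gini impurity, and plain bootstrap aggregation, and all function names and the toy data are hypothetical. Each marker is scored by the fraction of bootstrap trees that split on it.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels, idx):
    """Marker whose 0/1 split gives the lowest weighted Gini impurity on samples idx."""
    best_var, best_score = None, float("inf")
    for j in range(len(rows[0])):
        left = [labels[i] for i in idx if rows[i][j] == 0]
        right = [labels[i] for i in idx if rows[i][j] == 1]
        if not left or not right:
            continue  # marker is constant on this node
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(idx)
        if score < best_score:
            best_var, best_score = j, score
    return best_var

def tree_variables(rows, labels, idx, depth):
    """Grow a greedy tree on samples idx; return the set of markers it splits on."""
    if depth == 0 or len({labels[i] for i in idx}) < 2:
        return set()
    j = best_split(rows, labels, idx)
    if j is None:
        return set()
    used = {j}
    for value in (0, 1):
        child = [i for i in idx if rows[i][j] == value]
        used |= tree_variables(rows, labels, child, depth - 1)
    return used

def selection_frequency(rows, labels, n_trees=200, depth=2, seed=0):
    """Fraction of bootstrap trees in which each marker is used for a split."""
    rng = random.Random(seed)
    n = len(rows)
    counts = Counter()
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        counts.update(tree_variables(rows, labels, boot, depth))
    return {j: counts[j] / n_trees for j in range(len(rows[0]))}

# Toy data (assumption): 20 random markers; the trait needs markers 0 AND 1 together.
rng = random.Random(1)
n_samples, n_markers = 100, 20
genotypes = [[rng.randint(0, 1) for _ in range(n_markers)] for _ in range(n_samples)]
trait = [int(g[0] == 1 and g[1] == 1) for g in genotypes]

freq = selection_frequency(genotypes, trait)
top_two = sorted(freq, key=freq.get, reverse=True)[:2]
print(sorted(top_two), freq[0], freq[1])
```

In this toy setting both interacting markers have a marginal effect, so a greedy tree can find them. A purely epistatic (XOR-like) effect would need a different search strategy, which this sketch does not attempt.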

Accurate class prediction is a problem of the utmost importance in cancer classification, where biologists are often interested in developing genetic methods for tumor subtype identification and prognosis. Microarray tumor data are complex: the ratio of the number of variables to the number of samples is unfavorable, there are many sets of highly correlated variables (e.g., co-regulated genes), and between-patient heterogeneity is high. The question therefore arises as to whether classic statistical methods for discrimination can be used for this new type of data. Together with my PhD adviser Dr. Terry Speed and collaborator Dr. Sandrine Dudoit, I conducted a thorough comparison study on several publicly available gene expression datasets, each containing known cancer subtypes. We were able to compare fairly a number of traditional discrimination methods, such as k-nearest neighbors (kNN) and linear discriminant analysis (LDA), with state-of-the-art machine learning approaches, including bagging and boosting. In our JASA article (2001) and in book chapters published by Chapman & Hall/CRC and Kluwer, we demonstrated that in a typical gene expression dataset, the otherwise successful machine learning classifiers have no advantage over traditional statistical methods noted for their high bias and low variance, e.g., kNN or LDA with the assumption of uncorrelated variables. It is very likely that as the number of samples in a typical microarray dataset increases, machine learning methods capable of exploring the space of interactions will gain an edge over standard discrimination approaches. I am interested in exploring this with the large datasets available to me at the UCSF Cancer Center.
Additionally, noting that epidemiological issues have been largely ignored in published microarray studies, I am very interested in developing epidemiologically meaningful approaches to the design and analysis of genomic studies for clinical purposes, in particular by carefully incorporating clinical information and the population prevalence of the subtypes into both. This work has started in collaboration with Dr. Terry Speed.
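
To make "LDA with the assumption of uncorrelated variables" concrete, here is a minimal sketch of a diagonal LDA rule, under my reading of that phrase. It assumes equal class priors and a pooled per-variable variance, and the function names and toy data are hypothetical rather than taken from the cited study. A sample is assigned to the class whose mean is closest in variance-scaled squared distance.

```python
def fit_dlda(X, y):
    """Estimate class means and pooled within-class variances, one per variable."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    dof = len(X) - len(classes)
    variances = []
    for j in range(p):
        ss = sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y))
        variances.append(max(ss / dof, 1e-8))  # floor avoids division by zero
    return means, variances

def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, variances = model

    def scaled_distance(k):
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[k], variances))

    return min(means, key=scaled_distance)

# Toy data (assumption): variable 0 separates the classes, variable 1 is noise.
X = [[1.0, 5.0], [1.2, 4.0], [0.8, 6.0], [3.0, 5.5], [3.2, 4.5], [2.8, 5.0]]
y = ["A", "A", "A", "B", "B", "B"]

model = fit_dlda(X, y)
predictions = [predict_dlda(model, x) for x in [[1.1, 5.2], [2.9, 4.0]]]
print(predictions)
```

Ignoring the correlations leaves only p variances to estimate instead of a full p-by-p covariance matrix, which is one plausible reason such a rule stays stable when p is much larger than n.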


Cluster analysis involves the search through data for observations that are similar enough to each other to be grouped together. When a clustering algorithm is applied to a set of observations, a partition of the data is obtained whether or not the data exhibit a true or "natural" grouping structure. This causes no problems if clustering is done to obtain a practical grouping of the given set of objects, for instance for organizational purposes. However, if the interest lies in recognizing an unknown classification of the data, an artificial clustering is not acceptable, and the clusters produced by the algorithm must be investigated for their relevance. Apart from descriptive, graphical, or exploratory methods, this task can be performed using probabilistic models and suitable statistical significance tests. Discovery of novel tumor classes using gene expression data is one example where the need arises to estimate the number of clusters reliably and to allocate observations accurately. Together with my collaborator Dr. Sandrine Dudoit, I proposed applying resampling methods to (i) estimate the number of clusters in a dataset and (ii) improve the accuracy of the cluster assignment. The approach to (i) uses ideas from discriminant analysis: since the clusters obtained from cluster analysis are eventually used for prediction purposes, it is natural to apply discrimination techniques in clustering. For (ii), bootstrap aggregation is used to improve cluster accuracy and to assign confidence to the labels of the individual observations. We have successfully demonstrated the utility of both approaches on simulated data and real microarray datasets by conducting careful comparison studies of our methods against the available methods. This work is presented in two manuscripts published in Bioinformatics (2002) and Genome Biology (2003). I am interested in continuing to build upon the ideas proposed in these manuscripts and in applying them to the datasets of my collaborators at the UCSF Cancer Center.
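
The bootstrap aggregation idea in (ii) can be sketched as follows. This is a simplified illustration, not the published procedure: it assumes k-means as the base clustering method, and it keeps bootstrap labels aligned with the original clustering by starting each bootstrap run from the original centers instead of explicitly permuting labels. All names and the toy data are hypothetical. Each observation receives the label it is given most often across bootstrap clusterings, and the share of votes for that label serves as its confidence.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def assign(pt, centers):
    """Index of the center nearest to pt."""
    return min(range(len(centers)), key=lambda c: dist2(pt, centers[c]))

def kmeans(points, centers, n_iter=20):
    """Lloyd's algorithm from the given starting centers; returns the final centers."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for pt in points:
            groups[assign(pt, centers)].append(pt)
        # An empty group keeps its previous center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else old
            for g, old in zip(groups, centers)
        ]
    return centers

def bagged_clustering(points, k, n_boot=100, seed=0):
    """Majority-vote labels and vote shares over bootstrap k-means clusterings."""
    rng = random.Random(seed)
    reference = kmeans(points, [list(p) for p in points[:k]])
    votes = [[0] * k for _ in points]
    for _ in range(n_boot):
        boot = [rng.choice(points) for _ in points]  # sample with replacement
        centers = kmeans(boot, reference)  # same start keeps labels aligned
        for i, pt in enumerate(points):
            votes[i][assign(pt, centers)] += 1
    labels = [max(range(k), key=lambda c: v[c]) for v in votes]
    confidence = [max(v) / n_boot for v in votes]
    return labels, confidence

# Toy data (assumption): two tight groups and one point roughly halfway between them.
points = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9],
          [0.1, 0.3], [4.8, 5.2], [2.6, 2.5]]

labels, confidence = bagged_clustering(points, k=2)
for pt, lab, conf in zip(points, labels, confidence):
    print(pt, lab, conf)
```

Observations that sit clearly inside a group should receive the same label in almost every bootstrap run, while an ambiguous observation such as the last point tends to split its votes and so receives a lower confidence.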


The development of solid tumors is associated with the acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic alterations seen in tumors reflect underlying failures in the maintenance of genetic stability, as well as selection for changes that provide a growth advantage. Microarray-based comparative genomic hybridization (array CGH) can be used to investigate genomic alterations. The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so to define copy number phenotypes and associate them with known biological markers. To exploit the spatial coherence between nearby clones, I proposed an unsupervised hidden Markov model (HMM) approach. The clones are partitioned into states that represent the underlying copy number of the group of clones. The structural changes in a tumor genome can then be recorded and characterized computationally. The method is described in our 2004 paper in JMVA and has been successfully applied to a number of cell line and primary tumor datasets. This research has greatly benefited from the continuing input of my biological collaborators Dr. Donna Albertson, Dr. Dan Pinkel, and A. Snijders. My current effort is focused on refining the methodology and incorporating it into the aCGH package for R/Bioconductor, which I have developed together with P. Dimitrov, a PhD student at UC Berkeley. Additionally, I am very interested in using the HMM approach for data reduction, so that the resulting lower-dimensional dataset can serve as input to classification and variable selection procedures, as well as to procedures for combining different data types. We are also investigating approaches for identifying discrete levels of copy number across the entire genome.
Finally, my biological collaborators and I have successfully applied the above methodology to identify regions of homozygosity and heterozygosity efficiently in backcross mice using array CGH data.
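
How an HMM exploits the spatial coherence of neighboring clones can be seen in a minimal decoding sketch. It is not the published method, which is unsupervised and estimates its parameters from the data. Here the state means, the common standard deviation, and the probability of staying in the same state are fixed by assumption, and only the Viterbi decoding step is shown. The function name and the toy log2 ratios are hypothetical.

```python
import math

def viterbi_segment(log_ratios, state_means, sd=0.2, stay_prob=0.9):
    """Most likely state path for clones ordered along a chromosome."""
    n_states = len(state_means)
    stay = math.log(stay_prob)
    switch = math.log((1 - stay_prob) / (n_states - 1))

    def emit(x, s):
        # Gaussian log-density up to a constant shared by all states
        return -0.5 * ((x - state_means[s]) / sd) ** 2

    def step(r, s):
        # Log transition probability from state r to state s
        return stay if r == s else switch

    score = [emit(log_ratios[0], s) - math.log(n_states) for s in range(n_states)]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            prev = max(range(n_states), key=lambda r: score[r] + step(r, s))
            pointers.append(prev)
            new_score.append(score[prev] + step(prev, s) + emit(x, s))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy log2 ratios (assumption): normal, gain with one noisy clone, normal, loss.
log_ratios = [0.02, -0.05, 0.10, 0.45, 0.62, 0.05, 0.55, 0.48,
              -0.03, 0.01, -0.55, -0.48, -0.60]
state_means = [-0.5, 0.0, 0.5]  # state 0 = loss, 1 = normal, 2 = gain

path = viterbi_segment(log_ratios, state_means)
print(path)
```

The sixth clone (0.05) looks normal on its own, but because leaving and re-entering the gain state is penalized twice, the decoded path keeps it inside the surrounding gained segment. The per-clone measurements are thereby reduced to a handful of segments, which is the kind of data reduction mentioned above.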


More cancer datasets are becoming available that contain copy number and gene expression measurements as well as clinical information on the samples. An interesting biological question is how to find so-called driver genes, i.e., the genes whose deregulation is strongly selected for by the tumor. Currently, the computational methods for addressing this question are very primitive and generally treat copy number and expression data symmetrically, concentrating on computing correlations between mRNA levels and copy number for a given gene. Clinical information has been largely ignored in this context. Together with my collaborator, Dr. Ru Fang Yeh, I propose novel methods for combining arbitrary types of genomic and clinical information to infer gene modules. This work is at a very preliminary stage. If it is successful, we will proceed to test the resulting hypotheses using the experimental approaches of our collaborators at LBL.
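
For reference, the simple per-gene correlation approach described above can be sketched as follows. This is only the baseline, not the proposed method, and the gene names and values are made-up toy data. For each gene, the Pearson correlation between copy number and expression is computed across samples, and the genes are ranked by it.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data (assumption): five samples, two hypothetical genes.
copy_number = {
    "GENE_A": [0.0, 0.5, 1.0, 0.1, 0.8],
    "GENE_B": [0.0, 0.5, 1.0, 0.1, 0.8],
}
expression = {
    "GENE_A": [1.0, 2.1, 3.0, 1.2, 2.6],    # tracks copy number closely
    "GENE_B": [2.0, 1.9, 2.1, 2.0, 1.95],   # essentially flat
}

correlation = {g: pearson(copy_number[g], expression[g]) for g in copy_number}
ranked = sorted(correlation, key=correlation.get, reverse=True)
print(ranked)
```

The ranking uses only the two genomic measurements and treats them symmetrically. Clinical information on the samples plays no role, which is the limitation noted above.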