http://biostat.ucsf.edu/events.html
UCSF
Department of Epidemiology & Biostatistics

Division of Biostatistics

Seminar Archives

May 6, 2008

Brian Leroux

Department of Biostatistics, University of Washington

Estimation of the Intraclass Correlation Coefficient

The Intraclass Correlation Coefficient (ICC) is a useful parameter for planning studies that involve clustered data, such as dental studies and group-randomized trials. The ICC is the correlation between two outcomes within the same cluster, and is used for performing power calculations. The ANOVA method is a convenient method for estimation of the ICC, which uses only simple closed-form expressions. Other methods, such as maximum-likelihood, may achieve greater precision than the ANOVA method, but require distributional assumptions and iterative computational procedures. These other methods require distributional assumptions for hypothesis testing and confidence intervals for the ICC. In this talk, I describe a new closed-form method of estimation and inference for the ICC that does not require distributional assumptions. Simulation studies show that the new method yields valid inferences for the ICC in a wide range of settings. The new method performs similarly to ANOVA when the data are normally distributed and cluster sizes are equal, but has greater precision than ANOVA if cluster sizes are unequal. Applications are made to data sets from a dental practice-based research network and a school-based smoking prevention trial.
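For readers unfamiliar with the closed-form ANOVA estimator mentioned in the abstract, the following is a minimal sketch of the standard one-way ANOVA estimator of the ICC with possibly unequal cluster sizes. It is not the new method described in the talk. The function name `anova_icc` and the list-of-lists input format are assumptions made for illustration.

```python
from typing import Sequence


def anova_icc(clusters: Sequence[Sequence[float]]) -> float:
    """One-way ANOVA estimator of the ICC (cluster sizes may differ).

    `clusters` holds one sequence of outcomes per cluster (hypothetical layout).
    """
    k = len(clusters)                       # number of clusters
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    if k < 2 or min(sizes) < 1 or n_total <= k:
        raise ValueError("need >= 2 non-empty clusters and within-cluster replication")

    grand_mean = sum(sum(c) for c in clusters) / n_total
    means = [sum(c) / len(c) for c in clusters]

    # Between-cluster and within-cluster mean squares.
    msb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    msw = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c) / (n_total - k)

    # Effective cluster size; equals the common size when clusters are balanced.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (msb - msw) / (msb + (n0 - 1) * msw)
```

For example, `anova_icc([[1, 2], [3, 4], [5, 6]])` gives a between-cluster mean square of 8 and a within-cluster mean square of 0.5, so the estimate is 7.5 / 8.5.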

April 24, 2008

Paul Scheet

Center for Statistical Genetics and Department of Biostatistics, University of Michigan

A Statistical Model for Patterns of Population Genetic Variation with Applications

Current high-throughput technologies have enabled large-scale surveys of population genetic data, such as those for genome-wide association (GWA) studies of complex traits. These data demand computationally tractable models for inference. In this talk I present a statistical model for patterns of linkage disequilibrium (the correlation of alleles at nearby loci; LD) among tightly-linked SNPs. I demonstrate how this model may be used to improve association mapping techniques by imputing genotypes from a dense reference panel of individuals and by directly modeling haplotype variation to detect associations with rare SNPs. I also present a new LD-based quality control tool for genotype data, which can detect, and in some cases correct, genotyping errors. Finally, I will present a new framework for incorporating haplotype information into traditional single-marker methods for analyzing population genetic data and apply the model to a recent survey of SNP data from the Human Genome Diversity Project to visualize global haplotype variation.

April 1, 2008

David Siegmund

Department of Statistics, Stanford University

Mapping Quantitative Traits

I describe a unified model for the statistical foundations of population based association mapping and family based linkage mapping of quantitative traits in humans. Analysis of the model involves the efficient score statistic for the conditional likelihood, given the phenotypes. Analytic expressions for noncentrality parameters give qualitative insight into the relative power of different statistics and the loss of power that occurs if the scientist's assumed genetic model differs from nature's "true" genetic model. The multiple comparisons problem of genome scans to search for anonymous genes is discussed.

Reference: Dupuis J, Siegmund D and Yakir B. (2007) PNAS, 104:20210-5.

March 4, 2008

David Draper

Department of Applied Mathematics & Statistics, UC Santa Cruz

BAYESIAN DECISION THEORY IN BIOSTATISTICS: THE UTILITY OF UTILITY

The discipline of statistics may be divided broadly into four activities: description (graphical and numerical summaries of a data set, without attempting to reason outward from it), inference (drawing probabilistic conclusions about the underlying process that gave rise to the data), prediction (summarizing uncertainty about future observables), and decision-making (looking for optimal behavioral choices in the face of uncertainty, by constructing appropriate utility functions and maximizing expected utility). The history of the discipline has tended to focus on description and inference at the expense of prediction and decision-making; in particular, problems that at first look inferential may profitably be reformulated as decisions, and people sometimes use inferential tools to suggest "optimal" behaviors that are not as optimal as they initially seem. In this talk I'll describe two case studies in biostatistics in which Bayesian decision theory gives new insight in settings that seem inferential: variable selection in generalized linear models (with application to the construction of a cost-effective scale for measuring sickness at admission to hospital) and determining the efficacy of a vaccine against HIV.

January 15, 2008

Fushing Hsieh

Department of Statistics, UC Davis

Nonparametric state-space decoding computations for non-autonomous dynamics

The hierarchical factor segmentation (HFS) algorithm is introduced as a nonparametric computation for decoding the state-space trajectory underlying various types of time series data generated from non-autonomous dynamics. We illustrate two applications of the HFS algorithm: one for decoding in a hidden Markov model (HMM) and the other for computing signature-phases in circadian rhythms. For the HMM, the efficiency of the HFS algorithm is compared with the popular dynamic-programming-based Viterbi and posterior-Viterbi algorithms on simulated as well as real CpG island genetics data. For circadian rhythms, we analyze event-time series (actogram) data generated from experiments possibly coupled with light pulse interruptions. A sequence of signature-phases is computed to mark a sequence of rhythmic cycles of variable length. The signature-phase also provides a rigorous foundation for phase-shift measurements, which are very important in biomedical hormone therapy. Our non-Fourier analysis of rhythmic dynamics is compared with Fourier analysis and periodogram-based methodologies.

December 18, 2007

Andrew Vickers

Memorial Sloan-Kettering Cancer Center, New York

How do we know whether a predictive model is of clinical value? How do we know whether a molecular marker is worth measuring? A discussion of some simple decision analytic methods

There is increasing interest in and use of multivariable prediction models to aid clinical management. In oncology, it has been shown that such models are more accurate than the use of crude risk categories, such as those based on cancer stage. Accordingly, it has been suggested that multivariable models should be used to make decisions about patient care, such as whether a patient should receive chemotherapy after initial curative surgery. Research on molecular markers has mirrored the growth of prediction models: currently an enormous number of papers are published examining whether a tissue or blood marker can predict the occurrence or course of disease.

Markers and models are currently evaluated in terms of accuracy using metrics such as the area-under-the-curve (AUC), sensitivity and specificity or the concordance index. A model is thought to be a good one if it is accurate; a marker is claimed to be of value if it increases the accuracy of a model. But how accurate is accurate enough? For instance, should we use a model with an AUC of 0.65, or only those with AUCs above 0.75? Similarly, if a marker improves AUC from, say, 0.65 to 0.68, is it worth using in the clinic?

This all depends, of course, on what the model or marker will be used for. Evaluating models and markers in terms of clinical consequences is the remit of a field known as "decision analysis". The problem with decision analysis, however, is that it requires additional information, for example, on the benefits, harms and costs of treatment, or on patient preferences for different health states. Perhaps as a result, the number of papers in the literature using decision analytic methods is dwarfed by those that report accuracy.

In this presentation, I will describe some simple decision analytic methods that can be directly applied to the data set of a model or marker, without the need for external information. These methods can therefore be used to tell us whether or not to use a model in the clinic, or whether a marker is a good one. To illustrate the use of the methods I will look at markers for the detection of prostate cancer, and also examine whether a statistical model is a better basis than cancer stage for determining use of chemotherapy after radical cystectomy.
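The abstract does not name the methods, but the first reference listed for this talk is on decision curve analysis, whose central quantity is the net benefit of a model at a chosen threshold probability. The sketch below assumes that this is the kind of calculation meant; the function names and the input layout (predicted risks plus 0/1 outcomes) are hypothetical.

```python
from typing import Sequence


def net_benefit(risks: Sequence[float], outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of treating everyone whose predicted risk is >= threshold.

    net benefit = TP/n - (FP/n) * threshold / (1 - threshold)
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if len(risks) != len(outcomes) or not risks:
        raise ValueError("risks and outcomes must be non-empty and of equal length")
    n = len(risks)
    true_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    false_pos = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes: Sequence[int], threshold: float) -> float:
    """Net benefit of the reference strategy 'treat every patient'."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Plotting `net_benefit` against a range of thresholds, alongside the treat-all curve and the treat-none line at zero, gives a decision curve; only the study's own data are needed.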

References:
Vickers AJ and Elkin EB. Decision Curve Analysis: A Novel Method for Evaluating Prediction Models. Medical Decision Making 2006;26(6):565-74.
Vickers AJ, Kramer BS, Baker SG. Selecting patients for randomized trials: a systematic approach based on risk group. Trials 2006;7:30.

December 11, 2007

Diana Miglioretti

Center for Health Studies, Seattle

Modeling the dissemination of a screening test (and other interests)

Microsimulation modelers such as the Cancer Intervention and Surveillance Modeling Network (CISNET) rely on accurate models of screening dissemination to estimate the contribution of screening to observed changes in cancer incidence and mortality. We propose an approach for estimating the age at first screening test from current status data collected via two series of cross-sectional surveys. To model the national probability of ever having had the screening test of interest, we incorporate birth cohort effects into a mixed-influence diffusion model. We link a state-specific model to the national-level diffusion model using a marginalized modeling approach. To simulate screening histories for our microsimulation model, we will link this model to a latent class survival model for modeling multiple gap times between screening examinations, which I will briefly describe as a work in progress. If there is time, I will also describe some of my other interests, including my work with the Breast Cancer Surveillance Consortium.

November 20, 2007

Raquel Prado

Department of Applied Mathematics & Statistics, UC Santa Cruz

ASSESSING THE EFFECT OF SELECTION IN DNA SEQUENCES ENCODING MALARIA ANTIGENS

A model-based approach for assessing the effect of natural selection at the amino acid level in protein-coding DNA sequences is presented. Bayesian generalized linear models are used to describe patterns of codon mutations in count data derived from sequence alignments. Such models provide a flexible framework that allows experts to simultaneously perform the following tasks: detecting residues with relatively large ratios of non-synonymous to synonymous mutation probabilities; comparing intra-specific and inter-specific mutation probabilities, as well as mutation probabilities across various protein domains; and determining if radical changes are being encouraged by natural selection. Key modeling features include the incorporation of biologically meaningful information via structured priors and model validation via posterior predictive checks and/or estimation of gene trees. The methodology is illustrated with analyses of polymorphic data obtained from isolates of the apical membrane antigen-1 in the human malaria parasite P. falciparum. Divergence data derived from a strain of the homologous gene in P. reichenowi are also analyzed.

October 2, 2007

Marc Coram

Department of Health Research and Policy, Stanford University

Allele Frequency Estimation by Borrowing Strength across Populations

In genetic studies, allele frequency at a genetic marker is routinely inferred, often using genotypes from a small set of individuals. Improving the accuracy of these estimates will benefit studies of human genetic variation or the genetic etiology of heritable traits. Here, we propose an empirical Bayes approach for estimating allele frequencies at single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. Applications of our method to data from recent genomic projects suggest that this empirical Bayes approach can substantially reduce the variability in the frequency estimates, while introducing little bias. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.

September 25, 2007

John Boscardin

Department of Biostatistics, UCLA

FLEXIBLE MODELING OF HETEROGENEOUS LONGITUDINAL DATA

We model multivariate longitudinal data on multiple subjects using a state space smoothing spline approach. The covariance parameters for the state space model are subject-specific so as to allow for heterogeneity, but are modeled hierarchically to facilitate borrowing of information across subjects to the extent supported by the data. The performance and applicability of this model is highlighted using intensive care unit data from a prospective, observational study of severe head trauma patients. In this setting, real-time inference is required for clinical utility. Extensions to within-subject heterogeneity will also be discussed. This is joint work with Hector Lemus.

May 18, 2007

Jason Fine

Department of Statistics and Department of Biostatistics & Medical Informatics, University of Wisconsin, Madison

NONPARAMETRIC ASSOCIATION ANALYSIS OF MULTIVARIATE COMPETING RISKS DATA, WITH APPLICATION TO DEMENTIA ONSET IN AN AGING POPULATION

While nonparametric association analyses of bivariate failure times have been widely studied, analogous analyses of bivariate competing risks data have not been investigated. Such analyses are important in familial association studies in genetic epidemiology and demography, where multiple interacting failure types may invalidate nonparametric analyses for independently censored clustered survival data. The scenario is common in population-based studies where onset of certain chronic diseases, e.g., psychiatric disorders, may be dependently censored by death. I first develop nonparametric estimators for the bivariate cause-specific hazards function and the bivariate cumulative incidence function, which are natural extensions of their univariate counterparts and make no assumptions about the dependence of the risks. The estimators are shown to be uniformly consistent and to converge weakly to Gaussian processes.

Time-dependent summary association measures are proposed and yield formal tests of independence in clusters. The practical utility of the methodology is illustrated in an analysis of dementia in the Cache County Aging Study, where dependent censoring by mortality is heavy and the onset associations are strongly time-varying.

May 8, 2007

Mary Lesperance

Department of Mathematics & Statistics, University of Victoria

GRAPHICAL TECHNIQUES FOR GENE EXPRESSION STUDIES

Correspondence analysis (CA) is a descriptive technique designed for investigating the association between row and column variables by graphically displaying the patterns in the data. It has been widely applied to categorical data. We explore and develop variations of CA techniques to identify differentially expressed genes and to assess the quality of replicate DNA arrays.

Multiple correspondence analysis (MCA) and a related technique called joint correspondence analysis (JCA) are methods for visualizing the joint features of two or more categorical variables. We have been working with the Genetic Pathology Evaluation Centre (GPEC) at UBC and the Breast Cancer Outcomes Unit (BCOU) at the B.C. Cancer Agency (BCCA) to study relationships between molecular markers and outcomes for breast cancer. Molecular markers and diagnostic variables are typically categorized as positive/negative by pathologists and oncologists, whereas outcome measures such as time to recurrence or breast cancer specific survival time are continuous and possibly censored. We consider fuzzy coding methods to display survival information in an MCA analysis of molecular markers.

May 3, 2007

Mitchell Gail

Chief of Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute

PROBABILITY OF DETECTING DISEASE-ASSOCIATED SINGLE NUCLEOTIDE POLYMORPHISMS IN CASE-CONTROL STUDIES WITH WHOLE GENOME SCANS

Some case-control genome-wide association studies (CCGWASs) select promising single nucleotide polymorphisms (SNPs) by ranking corresponding p-values, rather than by applying the same p-value threshold to each SNP. For such a study, we define the detection probability (DP) for a specific disease-associated SNP as the probability that the SNP will be "T-selected", namely have one of the top T largest chi-square values (or smallest p-values) for trend tests of association. The corresponding proportion positive (PP) is the fraction of selected SNPs that are true disease-associated SNPs. We study DP and PP analytically and via simulations, both for fixed and for random effects models of genetic risk, that allow for heterogeneity in genetic risk. DP increases with genetic effect size and case-control sample size, and decreases with the number of non-disease SNPs, mainly through the ratio of T to N, the total number of SNPs. We show that DP increases very slowly with T, and the increment in DP per unit increase in T declines rapidly with T. DP is also diminished if the number of true disease SNPs exceeds T. For a genetic odds ratio per minor allele of 1.2 or less, even a CCGWAS with 1000 cases and 1000 controls requires T to be impractically large to achieve an acceptable DP, leading to PP values so low as to make the study futile and misleading. We further calculate the sample size of the initial CCGWAS that is required to minimize the total cost of a research program that also includes follow-up studies to examine the T selected SNPs. A large initial CCGWAS is desirable if genetic effects are small or if the cost of a follow-up study is large.
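The detection probability defined above can be approximated by simulation. The sketch below is a deliberately simplified illustration, not the authors' calculation: it assumes a single disease SNP, independent null SNPs whose test statistics are squared standard normals, and a disease-SNP statistic of the form (Z + shift)^2, where the hypothetical `shift` parameter stands in for the combined effect of genetic effect size and sample size.

```python
import random


def detection_probability(n_null: int, top_t: int, shift: float,
                          n_reps: int = 500, seed: int = 0) -> float:
    """Monte Carlo estimate of P(disease SNP ranks among the top_t statistics).

    Illustrative assumptions: one disease SNP, n_null independent null SNPs,
    null statistics Z**2 with Z ~ N(0, 1), disease statistic (Z + shift)**2.
    """
    rng = random.Random(seed)
    hits = 0
    for _rep in range(n_reps):
        disease_stat = (rng.gauss(0.0, 1.0) + shift) ** 2
        # Count the null SNPs whose statistic outranks the disease SNP.
        n_larger = sum(
            1 for _snp in range(n_null)
            if rng.gauss(0.0, 1.0) ** 2 > disease_stat
        )
        # The disease SNP is "T-selected" if fewer than top_t nulls beat it.
        if n_larger < top_t:
            hits += 1
    return hits / n_reps
```

With `shift = 0` the disease SNP is exchangeable with the nulls, so the estimate should sit near T / (N + 1); increasing `shift` moves it toward 1, in line with the abstract's statement that DP increases with effect size and sample size.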

Joint work with M. Pfeiffer, William Wheeler and David Pee.

April 10, 2007

Biao Xing

Senior Biostatistician, Genentech, Inc.

BLINDED SAMPLE SIZE REESTIMATION IN RANDOMIZED CLINICAL TRIALS WITH CONTINUOUS ENDPOINT

Blinded sample size reestimation allows for modifying the sample size of an ongoing trial to ensure sufficient statistical power without breaking the blind. One challenge is the blinded estimation of the within group variance. Early proposed methods either make untenable assumptions or are only applicable to two-treatment trials. Moreover, these methods are often biased. We proposed a simple unbiased method, which also makes minimal assumptions. The method uses the enrollment order of subjects and the randomization block size to estimate the variance and then reestimate the sample size. It can be applied to normal or non-normal data, to trials with two or more arms, equal or unequal allocation schemes, and fixed or random randomization block sizes. Results from simulations and data analysis suggest that the proposed blinded sample size estimation approach is practical.
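To make the reestimation step concrete, here is a minimal sketch of how an updated variance estimate feeds into a revised sample size. It uses the textbook normal-approximation formula for a two-arm trial with equal allocation and a continuous endpoint; it does not implement the blinded variance estimator proposed in the talk, and the function name and defaults are assumptions.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(sigma: float, delta: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided, two-sample comparison of means.

    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2  (normal approximation)
    sigma: within-group standard deviation (e.g. a blinded interim estimate)
    delta: difference in means the trial is designed to detect
    """
    if sigma <= 0 or delta == 0:
        raise ValueError("sigma must be positive and delta non-zero")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)
```

Under these assumptions, a planning value of sigma = 1.0 with delta = 0.5 gives 63 subjects per arm; if the interim estimate of sigma turns out to be 1.2, the same formula calls for 91 per arm.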

March 27, 2007

Alan Hubbard

Division of Biostatistics, UC Berkeley SPH

A NEW SCREENING ALGORITHM FOR MULTIPLE RISK FACTOR/DISEASE ASSOCIATION STUDIES: COMBINING VARIABLE IMPORTANCE, THE CONDITIONAL PERMUTATION DISTRIBUTION AND MULTIPLE TESTING PROCEDURES

A typical study design for investigating potential causes of disease can involve collecting a large number of potential risk factors and disease outcomes in a random sample of individuals. A new proposal is made, merging recent developments in causal inference and computational biology with existing methods, for an algorithm providing simultaneous ranking/testing of many potential risk factors. The procedure has three features: 1) The Parameter of Interest: a natural parameter of interest for certain study designs and risk factors is that inspired by the so-called population intervention model (Hubbard and van der Laan, 2005), which can also serve as a more general measure of variable importance. Under assumptions, this importance measure for a particular risk factor (variable) can be interpreted as the change in the mean disease outcome in a population if an intervention (for all subjects) set the variable to its "safest" level. 2) Marginal Inference: to provide inference (p-values) for estimated variable importances, conditional permutation methods (Rosenbaum, 1984) are available; these methods can potentially provide exact finite sample tests. 3) Experimentwise Inference: A recently proposed multiple testing procedure (using the quantile-function) inspired by problems in computational biology provides sharp control of experimentwise type I errors, while using the conditional permutation distribution for marginal control. The benefits of this new combined methodology are a parameter with public health significance, robust inference that automatically accounts for model selection, and a set of risk factors for which one has the most evidence (confidence) of having an impact on disease in the target population. This technique, which could replace typically ad hoc approaches, provides an automated procedure for analyzing studies of many candidate risk factors and disease outcomes.

March 2, 2007

Yuanyuan Xiao

Center for Bioinformatics & Molecular Biostatistics, UCSF

SNP GENOTYPING USING AFFYMETRIX GENECHIP ARRAYS

Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, for example, use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.

In this talk, I will discuss current genotyping methods using Affymetrix SNP arrays and will introduce a new algorithm (MAMS) we have developed, which combines single-array multi-SNP and multi-array single-SNP calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods. Using a set of publicly available HapMap arrays/samples with known genotypes (from other genotyping technologies) as benchmarks, we illustrate the performance of MAMS in comparison with existing genotyping algorithms.

February 22, 2007

Yu Shen

Department of Biostatistics, M. D. Anderson Cancer Center

Inference of Tamoxifen's Effects on Prevention of Breast Cancer from a Randomized Controlled Trial

Breast cancer is the most common non-skin cancer among women in the United States, and continues to be an important cause of morbidity and mortality for women at high risk of developing the disease. The advent of preventive intervention and early detection of cancer brings greater hope to the control of breast cancer, while also posing significant challenges to researchers and public health policy makers. To provide a quantitative framework for describing the natural history of breast cancer and assessing the impact of primary preventive intervention on the natural progression of the disease, we propose a flexible semiparametric model to assess the effects of a preventive agent on the incidence of breast cancer as well as time to the diagnosis of the disease, separately, in the framework of a cure-rate model. We used an estimating equation approach to estimate the unknown parameters, and assessed the semiparametric model assumption with a test based on the area between two survival curves. This is joint work with Qin and Costantino.

February 21, 2007

Eran Halperin

International Computer Science Institute, UC Berkeley

WHOLE-GENOME DISEASE ASSOCIATION STUDIES: CHALLENGES AND SOLUTIONS

The recent data release of the Haplotype Mapping project, and the rapid reduction in genotyping costs, open new directions and opportunities in the study of complex diseases via the analysis of single nucleotide polymorphism (SNP) data. At the same time, the increased size of the SNP datasets sets new computational and statistical challenges.

In this talk I will discuss some of the challenges set by the large scale of these studies, and the current solutions to these challenges. In particular, I will describe recent results on whole-genome haplotype analysis, including haplotype inference, and the incorporation of the HapMap data in haplotype analysis of case-control studies. I will also discuss potential drawbacks of these methods due to population substructure, and suggest solutions that are scalable to the coming large-scale studies.

February 16, 2007

Wei Li

Dana-Farber Cancer Institute, Harvard School of Public Health

ChIP-chip ON GENOME TILING ARRAYS: TOWARDS AN UNDERSTANDING OF THE GLOBAL TRANSCRIPTIONAL REGULATION

Identifying the regulatory targets of a transcription factor (TF) is crucial to understanding its biological function. Chromatin Immunoprecipitation coupled with DNA microarray analysis (ChIP-chip) has quickly evolved as a popular technique to study the in vivo targets of DNA-binding proteins at the genome level.

We developed a series of algorithms to reliably detect and annotate ChIP-enriched regions using Affymetrix whole-genome tiling arrays, including 1) Model-based Analysis of Tiling-arrays (MAT) for ChIP-region detection, 2) extreme MApping of OligoNucleotide (xMAN) for microarray probe mapping, and 3) Cis-regulatory Element Annotation System (CEAS) for ChIP-region annotation. Since their inception in early 2006, they have been adopted by hundreds of academic users and are now considered the ChIP-chip data analysis standard in many labs. We are also coordinating the ENCODE spike-in consortium, which consists of more than 10 transcriptional regulation groups worldwide, to systematically analyze the performance variability introduced in ChIP-chip protocols, array platforms, and analysis methods.

We applied those algorithms to the ChIP-chip data of Estrogen Receptor (ER) and Androgen Receptor (AR) on Affymetrix human genome tiling arrays, and successfully identified thousands of novel binding sites, most of which are far from the promoters of known genes. A screen for enriched motifs within those regions revealed both the typical and non-typical AR responsive elements (ARE) and several other co-factor motifs, including Forkhead and Ap1. Co-immunoprecipitation and re-ChIP assays confirmed the interaction with these co-factors in vivo. Specific targeted silencing of these various cofactors differentially affected hormone-induced gene expression and cell cycle progression.

February 14, 2007

Niko Beerenwinkel

Harvard University

EVOLUTIONARY ESCAPE ON FITNESS LANDSCAPES

The evolution of HIV within individual patients is associated with disease progression and failure of antiretroviral drug therapy. Using graphical models we describe the development of HIV drug resistance mutations and show how these models improve predictions of the clinical outcome of combination therapy. We present combinatorial algorithms for computing the risk of escape of an evolving population on a given fitness landscape. The geometry of fitness landscapes and the underlying gene interactions are analyzed in an attempt to generalize the notion of pairwise epistasis to higher-order genetic systems. Finally, we discuss the new and exciting prospects for analyzing viral genetic variation that arises from recent pyro-sequencing technology.

February 6, 2007

John Rice

Department of Statistics, UC Berkeley

Testing Many Hypotheses

Suppose that a very large number of independent null hypotheses are tested, almost all of which are true. How can the proportion of false null hypotheses be estimated? For motivation, I will briefly discuss the Taiwanese-American Occultation Survey, and will explain how this question arises. I will then present some results based on joint work with Nicolai Meinshausen.
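As a point of reference for the question posed in this abstract, the sketch below shows one simple and widely used estimator of the proportion of false null hypotheses, the Storey-type estimator based on the fraction of p-values above a cutoff. This is a swapped-in illustration and is not the method from the joint work with Meinshausen; the function name and the cutoff default `lam = 0.5` are assumptions.

```python
from typing import Sequence


def false_null_proportion(p_values: Sequence[float], lam: float = 0.5) -> float:
    """Storey-type estimate of the proportion of false null hypotheses.

    True nulls have roughly uniform p-values, so about (1 - lam) of them
    exceed lam. The estimated proportion of true nulls is therefore
    #{p > lam} / (m * (1 - lam)), capped at 1, and the estimated proportion
    of false nulls is one minus that.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    m = len(p_values)
    if m == 0:
        raise ValueError("p_values must be non-empty")
    n_above = sum(1 for p in p_values if p > lam)
    pi0_hat = min(1.0, n_above / (m * (1.0 - lam)))
    return 1.0 - pi0_hat
```

The estimate is conservative when some false nulls also produce p-values above `lam`, which is one reason sharper procedures are of interest when almost all nulls are true.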

January 9, 2007

Alan Dabney

Department of Statistics, Texas A&M

MODEL-BASED PROTEIN SUMMARIES AND DIFFERENTIAL LABEL-FREE QUANTITATIVE PROTEOMICS

An LC-MS experiment begins with the component peptides of a mixture of proteins. Peptides are first separated by liquid chromatography, then each peptide is characterized by mass and quantified by peak height using mass spectrometry. Differential label-free quantitative proteomics refers to the use of peak heights to compare peptide abundance between groups of interest. Statistical issues include: intensity- (peak-height-) dependent bias, widespread informative missingness, and the desire to make inference at the protein level on the basis of peptides. I will present a model-based approach to addressing these issues. The method will be illustrated on data from the Pacific Northwest National Laboratory.

November 14, 2006

Jay Bartroff

Department of Statistics, Stanford University

MODERN SEQUENTIAL ANALYSIS IN COMPUTERIZED ADAPTIVE TESTING

Sequential analysis of data is used in a variety of types of psychometric tests, including computerized adaptive testing (CAT), classroom interaction intervention, psychological studies with longitudinal data, depression diagnosis, and even crime-suspect identification tests. Focusing on CAT, we discuss designing efficient procedures using sequential generalized likelihood ratio tests, and show how these techniques can lead to substantial improvement over currently-used stopping rules and conventional fixed-length tests. We also extend the asymptotic optimality theory of these tests from the i.i.d. setting to the case of sequentially generated experiments, as in CAT. An example of these tests is given using a real math question pool provided by a subsidiary of the Educational Testing Service. Further practical issues like test security and content balancing will be discussed, and the interesting theoretical challenges they pose. This is joint work with T. L. Lai and Matthew Finkelman.

October 17, 2006

Ying Qing Chen

Statistical Center for HIV/AIDS Research & Prevention (SCHARP),
Fred Hutchinson Cancer Research Center

On Attributable Risk Functions

Time-to-event endpoints are often used in clinical and epidemiological studies to evaluate disease association with hazardous exposures. In the statistical literature of time-to-event analysis, such association is usually measured by the hazard ratio in the proportional hazards model. In public health, it is also of considerable interest to assess the excess risk attributable to an exposure in a given population. In this talk, we discuss the notion of "population attributable fraction" for binary outcomes and extend it to the attributable risk function for event times in prospective studies. A simple estimator of the time-varying attributable risk function is proposed under the proportional hazards model. Our proposed methodology is motivated and demonstrated by the data collected in a multicenter acquired immunodeficiency syndrome (AIDS) cohort study to estimate the attributable risk of human immunodeficiency virus type 1 (HIV-1) infections due to several potential risk factors.
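For the binary-outcome case that the talk takes as its starting point, the population attributable fraction is commonly computed from the exposure prevalence and the relative risk (Levin's formula). The sketch below shows that classical calculation only; it is not the time-varying estimator proposed in the talk, and the function name is an assumption.

```python
def population_attributable_fraction(prevalence: float, relative_risk: float) -> float:
    """Levin's formula for a binary exposure and a binary outcome.

    PAF = p * (RR - 1) / (1 + p * (RR - 1))
    prevalence: proportion of the population exposed (p)
    relative_risk: risk in the exposed divided by risk in the unexposed (RR)
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative_risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)
```

For instance, if 30% of a population is exposed and exposure doubles the risk, the formula gives 0.3 / 1.3, so roughly 23% of cases would be attributed to the exposure.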

September 26, 2006

Hans Mueller

Department of Statistics, UC Davis

Functional Methods for Longitudinal Data

Functional methods are designed for the analysis of samples of time course data under minimal assumptions. Three useful functional concepts in the context of longitudinal biological data are warping, functional principal components and functional regression. Corresponding models, problems and examples will be discussed. Applications include gene time course expression data, sparse and irregular longitudinal data, and longitudinal data with time-to-event.

December 22, 2005 (joint local ASA meeting)

Wing Hung Wong

Department of Statistics, Stanford University

GLOBAL STUDY OF GENE REGULATION IN EMBRYONIC STEM CELLS

We are interested in the transcriptional programs underlying embryonic stem cells and their early differentiated lineages. Our approach uses gene expression profiling, cell sorting, chromatin immunoprecipitation, as well as multi-species cis-regulatory sequence analysis, to identify developmentally regulated genes and to characterize the sequence elements responsible for their regulation.

CBMB December 13, 2005

Zemin Zhang

Senior Scientist, Department of Bioinformatics, Genentech, Inc.

CANCER TARGET FINDING FROM DNA COPY NUMBER ANALYSIS

Aberrant DNA amplification is one of the most common mutations in cancer cells and it frequently leads to increased expression of encapsulated cancer-promoting genes. Such genetic changes also provide opportunities for cancer diagnostics and targeted therapies. For example, the anti-HER2 antibody drug Herceptin has been used for treating breast cancer patients diagnosed with HER2 amplification. To identify additional HER2-like targets, we took several approaches to study recurrently amplified regions in cancers. First, we developed a computational method for scanning an EST-based transcriptome to find genomic regions harboring clusters of genes with increased expression in cancer tissues. We demonstrated that these regions correlated with previously identified tumor amplicons. We then analyzed BAC clone-based array CGH data using a variety of methods to explore DNA copy number changes in hundreds of breast-, brain- and colon-tumor samples. Since the low resolution of BAC-clone array CGH (approximately 1-2 Mbp) impedes the localization of culprit cancer-causing genes, we incorporated the Affymetrix expression microarray data to pinpoint amplified genes with increased expression in cancer. The correlation between expression patterns of neighboring genes was found to be helpful in confirming DNA amplifications and localizing the culprit cancer genes. In addition, we developed a computational scanning method to search for genes frequently amplified across different types of tumor tissues. Preliminary applications of this method demonstrated great reductions in the noise level commonly seen with array CGH data, and implicated known general oncogenes as well as novel genes with potential tumor-promoting functions.

December 13, 2005

Chris Triggs

Department of Statistics, University of Auckland

ESTABLISHING IDENTITY USING DNA PROFILES

The widespread use of evidence from DNA profiles has transformed forensic science. Many questions of identity in both civil and criminal investigations can be reduced to the question: Does this person belong in this pedigree? We have a set of people whose genetic profiles and relationships we know. We wish to assess the weight of evidence as to whether another individual, whose profile may be known only in part, is related to the first group. Cases of disputed paternity, where we wish to assess whether a man is the biological father of a particular child, are examples of this.

After a general discussion of how to present probabilistic evidence, I will discuss two specific problems. The first is a case of disputed paternity with evidence from a large pedigree. The second arises from the Boxing Day tsunami, where there were a large number of unidentified bodies. Genetic information from family members was known about some of the missing people.

November 28, 2005

Xiao-Hua Andrew Zhou

Department of Biostatistics, University of Washington

SEMI-PARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION OF ROC CURVES

The diagnostic capability or accuracy of a medical test is often assessed using a receiver operating characteristic (ROC) curve. In this talk, I will discuss a new semi-parametric likelihood approach to estimate the ROC curve that satisfies the property of invariance of the ROC curve. I will show that our new estimator is asymptotically normal and report a simulation study, which demonstrates that the proposed estimator has the best performance among all the existing semi-parametric estimators considered here. Finally, I will outline a new semi-parametric estimation approach for ROC curve regression models.

This is joint work with Huazhen Lin.

November 15, 2005

Robert Tibshirani

Professor of Health Research and Policy, and Statistics, Stanford University

BIOMARKER DISCOVERY: FACT OR ARTEFACT?

The areas of genomics and proteomics present exciting challenges for Statistical Sciences. The main challenge is to extract interpretable and reproducible information from datasets with large numbers of features (genes, SNPs, proteins) and a relatively small number of observations (biological samples, patients). There has been a flurry of statistical work, some by statisticians and also by other quantitative researchers in biology, computer science, physics and engineering.

In my view, the quality of this work has been very mixed. The race to publish and obtain grant funding has produced a significant number of fragile, irreproducible analyses. I will discuss in detail a recent controversial study published in NEJM, and then I will suggest ways in which the field can move forward in a more productive way. I will briefly describe a new tool called "supervised principal components," an example of a promising method for tackling this kind of problem.

CBMB November 15, 2005

David M. Rocke

UC Davis Division of Biostatistics (School of Medicine),
Department of Applied Science (College of Engineering),
and Institute for Data Analysis and Visualization

VARIABILITY AND DATA TRANSFORMATION FOR GENE EXPRESSION, PROTEOMICS, AND METABOLOMICS DATA

Biologists now have the capacity to measure thousands of compounds simultaneously from a single biological sample using gene expression arrays, mass spectrometry, NMR spectroscopy or other methods. These methods can be used to measure mRNA transcripts, proteins, short peptides, lipids, and other biologically active compounds. In this talk, I will describe an important statistical challenge in the use of such data. Using raw data, logarithms, or ratios, the variability of the measurements is strongly dependent on the level of expression, causing a failure of the assumptions of most standard methods of statistical analysis. We present a solution to this problem via a specially tuned data transformation and show how it promotes the effectiveness of simple and sophisticated analyses of the data.

October 7, 2005

Victor De Gruttola

Department of Biostatistics, Harvard University

RELATING GENOTYPE TO PHENOTYPE: RESAMPLING-BASED MULTIPLE HYPOTHESIS TESTING USING ORDER STATISTICS

Development and spread of resistance to anti-retroviral drugs limits their utility. We present multiple testing methods relating HIV genotype to phenotype. A semi-parametric resampling approach identifies patterns of mutations at a set of relevant codons associated with changes in drug susceptibility with respect to wild-type. It compares observed, ordered, mean responses to expected order statistics from an unspecified error distribution, preserves the family-wise error rate asymptotically, and is approximately conservative in finite samples. Two applications use protease sequences and measures of in-vitro sensitivity from the Stanford HIV Drug Resistance Database. The first identifies patterns of mutations that enhance or decrease drug susceptibility; the second investigates interactions. This latter shows that while M46I/L mutations are associated with drug resistance, adding L88D/S mutations leads to hypersusceptible virus. Further addition of T90M/L mutations results in highly resistant virus. This allows the investigation of how mutations act in the presence of others and may suggest mechanisms by which resistance occurs or is reversed through the accumulation of mutations.

Joint work with Jennifer Schumi

October 7, 2005

Mark Segal

UCSF Department of Epidemiology & Biostatistics

CHESS, CHANCE AND CONSPIRACY

Chess and chance are seemingly strange bedfellows. Luck and/or randomness have no apparent role in move selection when the game is played at the highest levels. However, when competition is at the ultimate level, that of the World Chess Championship (WCC), chess and conspiracy are not strange bedfellows, there being a long and colorful history of accusations levied between participants. One such accusation, frequently repeated, was that all the games in the 1985 WCC (Karpov vs Kasparov) were fixed and pre-arranged move-by-move. That this claim was advanced by a former World Champion, Bobby Fischer, argues that it at least be investigated. That the only published, concrete basis for this claim consists of an observed run of particular moves, allows this investigation to be performed using probabilistic and statistical methods. In particular, we employ imbedded finite Markov chains to evaluate distributions of select runs statistics. Further, we demonstrate how both chess computers and game databases can be brought to bear on the problem.

No knowledge of chess is assumed -- we touch on poker, go, checkers, baseball, basketball, parapsychology and cosmology so hopefully there is something for everyone.
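As a rough illustration of how a finite Markov chain can give the exact distribution of a runs statistic, the sketch below computes the probability of at least one run of k consecutive "successes" in n independent trials. It assumes i.i.d. Bernoulli trials with success probability p, which is a simplification chosen for illustration and not necessarily the model used in the talk. The function name is hypothetical.

```python
def prob_run_at_least(n, k, p):
    """P(at least one run of k consecutive successes in n Bernoulli(p) trials).

    The chain's transient states are the current trailing run length 0..k-1;
    reaching length k is treated as an absorbing state.
    """
    if n < 0 or k < 1:
        raise ValueError("need n >= 0 and k >= 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    state = [1.0] + [0.0] * (k - 1)  # start with trailing run length 0
    absorbed = 0.0                   # probability a run of length k has occurred
    for _ in range(n):
        new_state = [0.0] * k
        for run_length, mass in enumerate(state):
            new_state[0] += mass * (1.0 - p)  # a failure resets the run
            if run_length + 1 == k:
                absorbed += mass * p          # a success completes the run
            else:
                new_state[run_length + 1] += mass * p
        state = new_state
    return absorbed


# Example: a run of 2 heads somewhere in 3 fair coin tosses (HHH, HHT, THH).
example = prob_run_at_least(3, 2, 0.5)
```

Tracking only the trailing run length keeps the state space at k states, so the calculation is linear in n and avoids enumerating all sequences.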

September 26, 2005

Joanne Chapman

School of Physical, Environmental and Mathematical Sciences
University of New South Wales, Canberra, Australia

A CORRELATED GAMMA FRAILTY MODEL FOR BIVARIATE PROPORTIONAL HAZARDS SURVIVAL DATA

In this talk I'll give you a brief review of the basic definitions and functions used in survival analysis, and take you on a quick trip along the path of progress since the development of the famous, and well-referenced, Cox proportional hazards model in 1972.

I'll also talk about the concept of frailty (a measure of unknowns) and show how it is incorporated into standard survival models. In bivariate survival, as well as being a measure of unknown heterogeneity, frailty also measures association. I will introduce an extension to presently available models that allows us to easily model negative association; and show the particular importance of this when the heterogeneity present in the data is small.

CBMB August 18, 2005

Gordon Smyth

Walter & Eliza Hall Institute of Medical Research

PARAMETER SHRINKAGE AND SEPARATE CHANNEL ANALYSIS OF TWO-COLOUR MICROARRAY DATA

Analysis of two-colour microarray data is traditionally conducted by way of the log-ratios of red to green channel intensities for each spot on each array. A point of some controversy is whether more information can be obtained from the data by maintaining the two channels as separate observations. Pioneering work by Wolfinger et al (2001) advocated an approach using mixed linear models, treating each spot as a block with a random effect. Although flexible and useful, the mixed model framework greatly complicates the application of empirical Bayesian methods. This talk will describe an approach to separate channel analysis using random effects and heteroscedastic regression which applies different rates of shrinkage to different aspects of the covariance models. The aim is to gain information while preserving computational efficiency and simplicity of interpretation for the final models.

CBMB July 8, 2005

Natalie Thorne

University of Cambridge, Computational Biology Group
Department of Oncology, Hutchison/MRC Research Centre

ISSUES IN THE ANALYSIS OF DNA METHYLATION ARRAY DATA

DNA methylation plays an important role in regulation of gene transcription and is strongly implicated in cancer development. There are many limitations of the current methods to detect DNA methylation in a high-throughput genome-wide profiling manner. This is limiting the identification of potential new DNA methylation markers that predict or promote neoplastic progression.

We have been working on developing and comparing various methods for assessing genome-wide DNA methylation using an annotated 12K CpG island microarray. I will discuss some of the issues that need to be addressed in the low level analysis of such microarray data. In particular I will discuss the problem of normalisation.

June 8, 2005 (joint local ASA meeting)

Peter Bacchetti

UCSF Department of Epidemiology & Biostatistics

A COMPLETELY DIFFERENT APPROACH TO SAMPLE SIZE PLANNING

This talk will critique the usual power-based methods of determining sample size and propose an alternative. One difficulty is that the standard approach requires exact specification of inputs that generally are not known in advance, such as the standard deviations and size of the difference. A neglected but equally serious problem is that ignoring the cost implications of different sample size choices cannot be justified. We propose and justify a new approach for choosing sample size based on cost efficiency, the ratio of a study's scientific and/or practical value to its total cost. This can lead to very different answers than conventional power-based methods or Bayesian maximization of expected utility. By showing that a study's projected scientific or practical value exhibits diminishing marginal returns as a function of increasing sample size for a wide variety of definitions of study value, we are able to propose two simple methods that are justified as not falling short of the most cost-efficient sample size. The first is to choose the sample size that minimizes the average cost per subject. The second is to choose sample size to minimize total cost divided by the square root of sample size. This latter method is theoretically more justifiable for innovative studies, but also appears to perform well and has some justification in other cases. For example, if projected study value is assumed to be proportional to power at one specific alternative and total cost is a linear function of sample size, then this approach is more cost efficient than the sample size producing 90% power. In many situations, these methods are easier to implement, based on more reliable inputs, and better justified than current conventional approaches.

Most of the material is from joint work with Chuck McCulloch and Mark Segal.
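The two rules described in the abstract can be sketched as a simple search over candidate sample sizes. This is a minimal illustration only: the cost function, its dollar figures, and the function names are hypothetical, and the grid search stands in for whatever optimization the authors actually use.

```python
import math


def cheapest_per_subject(total_cost, candidates):
    """Rule 1: the sample size minimizing average cost per subject."""
    return min(candidates, key=lambda n: total_cost(n) / n)


def cheapest_per_root_n(total_cost, candidates):
    """Rule 2: the sample size minimizing total cost / sqrt(sample size)."""
    return min(candidates, key=lambda n: total_cost(n) / math.sqrt(n))


def linear_cost(n):
    """Hypothetical budget: 10,000 fixed start-up cost plus 100 per subject."""
    return 10_000 + 100 * n


# With a linear cost F + c*n, rule 2 is minimized where n = F / c.
n_rule2 = cheapest_per_root_n(linear_cost, range(1, 1001))
```

For a purely linear cost the average cost per subject keeps falling as n grows, so the first rule simply returns the largest feasible size; it becomes informative when per-subject costs eventually rise, for example when recruitment gets harder at larger n.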

May 11, 2005

Joseph Hogan

Departments of Medical Science and Community Health, Brown University

SENSITIVITY ANALYSIS FOR ESTIMATES OF CAUSAL TREATMENT EFFECT IN LONGITUDINAL HIV COHORT STUDIES

This talk is intended to illustrate the use of instrumental variables and associated sensitivity analysis for estimating causal treatment effects of HAART from observational cohort studies. Our focus will be on transparent representation of underlying assumptions, and on the role of coherent sensitivity analyses to understand the effects of departures from those assumptions. Characteristics of an 'ideal' sensitivity analysis will be proposed.

As part of the talk, we highlight key differences between various approaches to causal inference (e.g. propensity scores versus instrumental variables); for the most part, they can be differentiated by underlying assumptions about whether all confounders have been observed. It is argued that this at least partially explains why (for example) economists tend to prefer instrumental variables while epidemiologists favor propensity scores and inverse weighting.

April 6, 2005

Bryan Shepherd

Department of Biostatistics, University of Washington

COMPARING OUTCOMES THAT ONLY EXIST IN A GROUP CHOSEN AFTER RANDOMIZATION

In many experiments researchers would like to compare between treatments an outcome that only exists in a subset of participants selected after randomization. For example, in preventive HIV vaccine efficacy trials it is of interest to determine whether randomization to vaccine causes lower HIV viral load, a quantity that only exists in participants who acquire HIV. I will talk about some of the challenges of making these comparisons and propose sensitivity analysis methods using causal inference techniques. These methods estimate the average causal effect of treatment assignment on a post-infection outcome among those who would be infected whether randomized to vaccine or placebo. Our key assumption is that subjects randomized to the vaccine arm who become infected would also have become infected if randomized to the placebo arm. It is not known which of those subjects infected in the placebo arm would have been infected if randomized to the vaccine, but this can be modeled conditional on baseline covariates, the observed viral load, and a specified sensitivity parameter. I apply these methods to the first Phase III preventative HIV vaccine trial (VaxGen's trial of AIDSVAX B/B).

February 23, 2005

Sophia Rabe-Hesketh

Educational Statistics & Interdepartmental Group in Biostatistics
University of California, Berkeley

GENERALIZED LINEAR LATENT AND MIXED MODELS FOR NOMINAL DATA

As the name implies, generalized linear latent and mixed models (GLLAMMs) are multilevel latent variable models. The latent variables may represent true variables measured with error or random coefficients. Alternatively, they may be used merely to induce dependence among different responses, possibly of mixed types. Latent variables can be regressed on other latent and observed variables varying at the same or higher levels.

I will begin by describing the GLLAMM framework and then consider models for nominal responses.

Two important examples of nominal data are unordered polytomous responses, such as treatment chosen by a physician, and rankings, such as the preference order of different beers. It is natural to formulate models for such responses in terms of the latent 'utility' or 'attractiveness' of the alternatives (treatments or beers), giving rise to the well-known multinomial logit model. When the data have a multilevel structure, dependence among the observed responses from the same cluster (given the covariates) can be thought of as arising from residual correlations among the underlying utilities. It is useful to structure these correlations using latent variables varying at different levels of the hierarchical dataset. The methodology will be applied to party choice and rankings from the 1987-1992 panel of the British Election Study. Three levels will be considered: elections, voters, and constituencies.

October 13, 2004

David Oakes

Department of Biostatistics and Computational Biology, University of Rochester

ON THE POTENTIAL FOR RISK REVERSAL DUE TO HETEROGENEITY SELECTION

Suppose that the total risk for experiencing an event is the sum of an observed portion (o) and an unobserved portion (u). The unobserved portion follows a population distribution dF(u) say. Consider now the risk of experiencing a second event, among people who have already experienced a first event. Even if the individual risks are unchanged (so that an individual who has risk o + u for the first event has the same risk o + u for a second event) the population level distribution of risks will change, following Bayes' theorem. We show how this can lead to risk reversal, whereby, among those who have suffered a first event, higher values of o are associated on average with lower risks for a second event. These musings were prompted by unexpected results from a study of the influence of coronary disease-related genotypes on recurrent cardiac events among patients who had experienced a myocardial infarction. The talk will describe the study, its unexpected results, and their possible explanation via this risk reversal phenomenon.

September 22, 2004

Tianxi Cai

Department of Biostatistics, Harvard School of Public Health

SEMI-PARAMETRIC BOX-COX POWER TRANSFORMATION MODELS FOR CENSORED SURVIVAL OBSERVATIONS

The accelerated failure time model specifies that the logarithm of the failure time is linearly related to the covariate vector without assuming a parametric error distribution. In this article, we consider the semi-parametric Box-Cox transformation model, which includes the above regression model as a special case, to analyze possibly censored failure time observations. Inference procedures for the transformation and regression parameters are proposed via a resampling technique. Prediction of the survival function of future subjects with a specific covariate vector is also provided via point-wise and simultaneous interval estimates. All the proposals are illustrated with the data sets from two clinical studies.

January 28, 2004

Peter Gilbert

Statistical Center for HIV/AIDS Research & Prevention (SCHARP)

SENSITIVITY ANALYSES COMPARING OUTCOMES MEASURED ONLY IN A SUBSET SELECTED POST-RANDOMIZATION, WITH APPLICATION TO HIV VACCINE TRIALS

In many experiments researchers want to compare an outcome that is only measured in a subset of participants selected after randomization. For example, in HIV vaccine efficacy trials it is of interest to determine whether randomization to vaccine causes lower viral load, a quantity that only exists in infected subjects. To make a causal comparison and account for potential selection bias we propose a sensitivity analysis following the principal stratification framework set forth by Frangakis and Rubin (2002). Our goal is to obtain the average causal effect of treatment assignment on viral load at a given baseline covariate level in the always infected principal stratum (those who would have been infected whether they had been assigned to vaccine or placebo). We assume stable unit treatment values (SUTVA), randomization, and that subjects randomized to the vaccine arm who became infected would also have become infected if randomized to the placebo arm (monotonicity). Membership in the always infected stratum is unknown, but can be modeled conditional on randomization arm, infection status, covariates, the observed viral load, and a specified sensitivity parameter. The observed viral load is also modeled as a function of covariates and given treatment assignment. We can then obtain maximum likelihood estimates of the average causal effect conditional on covariates and the sensitivity parameter. This approach is extended to include censoring and non-continuous outcome variables. We apply our method to VaxGen's Phase III HIV vaccine trial, and conclude that vaccination has no significant effect on viral load.

November 5, 2003

John Kornak

UCSF/VA Medical Center Magnetic Resonance Unit

ISSUES IN THE STATISTICAL ANALYSIS OF fMRI DATA

Functional magnetic resonance imaging (fMRI) is a non-invasive imaging technique capable of detecting changes in cerebral activity. fMRI experiments typically focus on the detection of these changes by quantifying local hemodynamic responses to brain activity, and on estimating their magnitude and extent. The complex biological mechanisms underlying the phenomenon observed via fMRI are not fully understood, leading to many widely differing statistical approaches to data analysis.

This talk will describe several alternative statistical approaches for fMRI data and focus on issues relating to the compensation for temporal-smearing effects of the hemodynamic response and spatial analysis of response parameters/statistics. Evidence is presented of the need to consider local shape variation of the hemodynamic response function in order to optimally estimate brain activation levels. A fully Bayesian spatial model, taking estimated brain activation levels as input, is then constructed for the purpose of determining regions of activation. In contrast to the usual spatial thresholding approaches, this model inherently trades local hemodynamic response magnitude with spatial extent.

October 15, 2003

Michael LeBlanc

Department of Biostatistics, University of Washington /
Fred Hutchinson Cancer Research Center

ADAPTIVE RISK GROUP REFINEMENT

Combinations of univariate clinical decisions, such as {serum calcium = 3} and {age <60}, are often easier to interpret than smooth or additive decision boundaries obtained from fitting additive regression models. Tree-based or recursive partitioning methods, such as the Classification and Regression Tree (CART) algorithm due to Breiman, Friedman, Olshen and Stone (1984), are widely used for coming up with such simple rules.

Tree based methods, while useful, do not directly allow for calibration of patient groups in terms of average patient outcome or the proportion of patients in the group. For instance, in developing a clinical trial for a new aggressive therapy, one must limit the study to only those patients with sufficiently poor prognosis appropriate for the toxicity associated with that therapy. However, the poor prognostic group must include a sufficient proportion of the patients with that disease to make patient accrual to the clinical trial feasible.

Motivated by the Patient Rule Induction Method (PRIM) of Friedman and Fisher (1999), we construct interpretable prognostic rules based on a sequence of "box shaped" regions in the predictor space indexed by the fraction of patients in the prognostic group. Simulations are used to study the properties of the method and compare it to constructing prognostic groups based on regression trees and linear proportional hazards models. We consider graphical methods for understanding constructed regions and also describe an analysis of several completed clinical trials for patients with multiple myeloma.

CBMB October 3, 2003

Alexander Schliep

Max Planck Institute

ANALYZING GENE EXPRESSION TIME-SERIES DATA

A number of microarray datasets provide some information about how cellular processes cause changes over time. Observing and measuring those changes over time allows insights into the how and why of regulation. However, the proper way of analyzing the resulting time-course data is still very much an issue under investigation. The inherent time dependencies in the data suggest that clustering techniques which reflect those dependencies might be more appropriate.

We propose an approach based on Hidden Markov Models (HMMs) to account for the horizontal dependencies along the time axis in time-course data and to cope with the prevalent errors and missing values. The HMMs are used within a model-based clustering or mixture-modeling framework.

The inherent robustness problems with clustering noisy data are circumvented by adding partial information about known groups of data in the complete data set. This is known as partly supervised learning.

GQL, a graphical user interface, allows us to interactively explore time-course expression profile datasets.

October 2, 2003

Kelly H. Zou

Department of Radiology, Brigham and Women's Hospital
Department of Health Care Policy, Harvard Medical School

STATISTICAL VALIDATION OF IMAGING ANALYSIS

The validity of image segmentation is an important issue in image processing because it has a direct impact on surgical planning. We examined classification accuracy in imaging analysis based on three two-sample validation metrics against the estimated composite latent gold standard, which was derived from several experts' manual segmentations by an expectation-maximization (EM) algorithm called STAPLE. The distribution functions of the tumor and control pixel data were parametrically assumed to be a mixture of two beta distributions with different shape parameters. We estimated the corresponding receiver operating characteristic (ROC) curve, Dice similarity coefficient, and mutual information, over all possible decision thresholds. Based on each validation metric, an optimal threshold was then computed via maximization. We illustrated these methods using magnetic resonance (MR) imaging data on three radiologic examples: (1) accuracy of brain tumor segmentation, (2) reliability of elbow medial collateral ligament assessment, and (3) hidden gold standard in prostate peripheral zone segmentation for brachytherapy. The performances of these validation metrics were investigated via Monte-Carlo simulation. Extensions of incorporating spatial correlation structures were briefly considered under a Markov random fields model.
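Of the three validation metrics named above, the Dice similarity coefficient is the simplest to state: twice the size of the overlap divided by the sum of the two region sizes. The sketch below computes it for two binary segmentations represented as sets of pixel indices. It is an illustration of the basic definition only, assuming hard (already thresholded) segmentations, and does not reproduce the threshold sweep or the STAPLE-derived gold standard described in the abstract. The function name and example pixel sets are hypothetical.

```python
def dice_coefficient(segmentation_a, segmentation_b):
    """Dice similarity: 2 * |A intersect B| / (|A| + |B|) for two pixel sets."""
    a, b = set(segmentation_a), set(segmentation_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty segmentations agree fully
    return 2.0 * len(a & b) / (len(a) + len(b))


# Example: two 3-pixel regions sharing 2 pixels.
overlap_score = dice_coefficient({1, 2, 3}, {2, 3, 4})
```

The score ranges from 0 (no overlap) to 1 (identical regions), which is what makes it usable as an objective to maximize over decision thresholds.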

CBMB August 13, 2003

Eric Schadt

Rosetta Inpharmatics (A wholly owned subsidiary of Merck)

INFERRING CAUSALITY FROM MICROARRAY DATA IN SEGREGATING POPULATIONS: AN UNBIASED APPROACH TO THE IDENTIFICATION OF TARGETS FOR COMMON HUMAN DISEASES

A key goal of biomedical research is to identify the basis of common human diseases. Here I present a procedure for the identification of key drivers of common human diseases using gene expression data in a segregating mouse population. Central to this procedure is the integration of genetic and gene expression information with clinical trait data to infer causal patterns of association between key drivers and disease phenotypes. This procedure allows for the objective identification of druggable targets for common human diseases. Specific examples on the application of this method to obesity traits will be provided.

August 1, 2003

Jason P. Fine

Departments of Statistics and Biostatistics, University of Wisconsin, Madison

COMPARING NON-NESTED COX MODELS

This talk will focus on choosing between two, possibly non-nested proportional hazards models using the partial likelihood ratio test. I will present the limiting distribution of the test under general conditions. The multiplicative hazards models being fitted may be misspecified and the true model is not assumed to be contained by either of the fitted models. The null hypothesis is that the models are equidistant in Kullback-Leibler distance applied to the rank likelihood. The ratio statistic is consistent for the model which is closer to the truth. However, its distribution takes one of two forms and depends on the unknown data generating mechanism, which complicates inference. A two-step testing procedure is proposed which is valid regardless of the true model. The first step involves a novel test for the equality of the fitted models which is separate from the partial likelihood. The methodology has important applications in model assessment. A reanalysis of the well-known PBC data will be used to demonstrate its utility in selecting the functional forms of covariates and relative risks.

June 11, 2003

Marvin Zelen

Department of Biostatistics, Harvard University

THE EARLY DETECTION OF DISEASE AND STOCHASTIC MODELS

Early detection of disease presents opportunities for using existing technologies to significantly improve patient benefit. The possibility of diagnosing a chronic disease early, while it is asymptomatic, may result in treating the disease in an earlier stage leading to better prognosis. Many cancers, diabetes, tuberculosis, cardiovascular disease, HIV related diseases, etc., may have better prognosis when combined with an effective treatment. However, gathering scientific evidence to demonstrate benefit has proved to be difficult. Clinical trials have been arduous to carry out, because of the need to have large numbers of subjects, long follow-up periods and problems of non-compliance. Implementing public health early detection programs has proved to be costly and not based on analytic considerations. Many of these difficulties are a result of not understanding the early disease detection process and the disease natural histories. One way to approach these problems is to model the early detection process. This talk will discuss stochastic models for the early detection of disease. Breast cancer will be used to illustrate some of the ideas. The talk will discuss breast cancer randomized trials, stage shift and benefit, scheduling of examinations, issues of screening younger women and those at elevated risk and the planning of trials.

CBMB May 19, 2003

Laura Lazzeroni

Division of Biostatistics, Stanford University

ALLELE SHARING AND ALLELIC ASSOCIATION IN AFFECTED SIB PAIRS

The genotypes of affected siblings contain information about both allele sharing and allelic association, either of which can point to the presence of a disease-related gene. Allele sharing tests, also known as linkage or identity-by-descent tests, are designed to detect whether siblings who share the same disease also tend to inherit the same alleles at a genetic locus. Allelic association tests, such as the transmission-disequilibrium test, are designed to detect the association of a disease and a particular allele in the population at large. Whether allele sharing or allelic association is stronger and which type of test is more powerful depends on unknown factors, including the true genetic disease model at any linked risk-related loci, the strength of any other genetic and environmental risk factors and the population distribution of those factors. The difference in power can be substantial. I will discuss a test designed to detect both allele sharing and allelic association that is as powerful, or nearly as powerful, in any setting as the more powerful of the sharing and association tests. Underlying the test is a mixture model formulated in terms of family-specific relative risks. I will show how this model also yields interesting clues about which genetic and population models are most plausible in light of observed levels of allele sharing and association. This information can be used to decide whether an implicated locus provides a promising lead for further research.

May 9, 2003

Glen Satten

The Centers for Disease Control

will speak on two topics:
IS THERE EVIDENCE THAT SEROCONVERTING REPEAT BLOOD DONORS CHANGE THE PATTERN OF THEIR BLOOD DONATIONS AT SEROCONVERSION - OR, HOW SPECIAL IS A "SPECIAL" INTERVAL?

Glen A. Satten, George Schreiber, Simone Glynn, Michael P. Busch, David Wright, Fanhui Kong, Steve Kleinman and the REDS study group

Length-biased sampling occurs in renewal processes when the probability that an interval is selected is proportional to the length of the interval. This can occur when intervals are selected because they contain an event that is independent of the renewal process and occurs with constant hazard. For example, if the times between donations for repeat blood donors are independent and identically distributed, and if the donor seroconverts to HIV (develops antibodies that indicate infection with human immunodeficiency virus), then the interval between the last HIV seronegative and first HIV seropositive donation is expected to be longer than that donor's previous time intervals between donations. We develop hypothesis tests to determine if the relationship between the typical and length-biased intervals is as expected, or if there is departure from length-biased sampling. We further develop a regression method to determine if there are covariates that explain the departure from length-biased sampling. Our approach is motivated by the question of whether there is evidence that repeat blood donors who develop antibodies to HIV or other viral infections change their donation pattern in some way because of seroconversion.
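The length-biasing effect described above can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: inter-donation times are drawn as i.i.d. exponentials, the event time is exponential (constant hazard) and independent of the donations, and the function names and parameter values are hypothetical rather than taken from the talk.

```python
import random

def simulate_donor(rng, mean_gap=1.0, event_hazard=0.05):
    """Simulate one donor; return (earlier gaps, gap containing the event)."""
    # The event (e.g. seroconversion) is independent of the donation process
    # and occurs with constant hazard, so its time is exponential.
    event_time = rng.expovariate(event_hazard)
    elapsed = 0.0
    earlier_gaps = []
    while True:
        gap = rng.expovariate(1.0 / mean_gap)  # i.i.d. inter-donation time
        if elapsed + gap >= event_time:
            # This is the "special" interval that contains the event.
            return earlier_gaps, gap
        earlier_gaps.append(gap)
        elapsed += gap

def compare_intervals(n_donors=5000, seed=1):
    """Return (mean typical interval, mean event-containing interval)."""
    rng = random.Random(seed)
    typical, special = [], []
    for _ in range(n_donors):
        earlier, containing = simulate_donor(rng)
        typical.extend(earlier)
        special.append(containing)
    return sum(typical) / len(typical), sum(special) / len(special)

typical_mean, special_mean = compare_intervals()
print(f"mean typical interval: {typical_mean:.2f}")
print(f"mean interval containing the event: {special_mean:.2f}")
```

With exponential gaps, the interval containing the event should come out roughly twice as long on average as the earlier intervals, even though the donor's behavior never changes. A departure from this expected relationship is the kind of signal the hypothesis tests in the abstract are designed to detect.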

CASE-CONTROL STUDIES WITH DIFFERENTIAL NUMBERS OF MEASUREMENTS

Glen A. Satten, W. Dana Flanders

By accident or by design, there may be a systematic difference in the number of exposure measurements that are available for case patients and for control patients. For example, in a recent study of Stachybotrys atra (Etzel et al. 1998 Arch Pediatr Adolesc Med. 152:757-62), the spore count in the homes of case patients was compared to the spore count in the homes of matched control probands. For case patients, an average of 6 environmental measurements were taken, while for control probands an average of 3 environmental measurements were taken. To account for the difference in the number of measurements, Etzel et al. used the average spore count for each study participant as a summary of their exposure. While this appears reasonable, we show that it may result in bias. We present a novel estimator that gives valid inference even when the number of measurements in cases is systematically different from the number in controls. We also consider analyses that use the maximum recorded exposure for each study participant.

March 13, 2003

Karl Broman

Department of Biostatistics, Johns Hopkins University

IDENTIFYING ESSENTIAL GENES IN M. TUBERCULOSIS BY RANDOM TRANSPOSON MUTAGENESIS

Mycobacterium tuberculosis (Mtb) is the organism which causes tuberculosis. Its circular genome of 4.4 Mbp has been completely sequenced and contains 4250 genes. In random transposon mutagenesis, one creates a library of mutants, each of which contains a single insertion of a transposon. Here we consider the Himar1 transposon, which inserts at random at a dinucleotide TA. The Mtb genome contains 74,403 such TA sites. We consider data on a library of 1425 transposon insertion mutants; for each mutant, the particular TA site at which insertion occurred has been determined. That a mutant with transposon insertion within a particular gene is viable indicates that the gene is not essential for the viability of the organism. Genes that are essential for the viability of the organism will never show up in such a library of insertion mutants.

We describe a Bayesian method for estimating the proportion of essential genes in the Mtb genome and for identifying genes likely to be essential, on the basis of such data. The prior distribution for the number of essential genes was taken to be uniform. A Gibbs sampler was used to estimate the posterior distribution.
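As a rough illustration of why a gene with no observed insertions is not necessarily essential, the sketch below computes the chance that a non-essential gene is missed purely by chance, using the counts quoted in the abstract (74,403 TA sites and 1425 mutants). It assumes that each insertion lands uniformly and independently on any TA site in the genome, which ignores the sites made unavailable by essential genes. The function name is illustrative, and this is not the speaker's Bayesian method.

```python
TA_SITES = 74_403   # TA sites in the Mtb genome (from the abstract)
MUTANTS = 1_425     # insertion mutants in the library (from the abstract)

def prob_no_insertion(sites_in_gene, total_sites=TA_SITES, n_mutants=MUTANTS):
    """Chance that none of the mutants hits a gene with the given number of TA sites.

    Assumes every insertion picks one TA site uniformly at random,
    independently of the other insertions.
    """
    return (1.0 - sites_in_gene / total_sites) ** n_mutants

for k in (5, 20, 50, 200):
    print(f"{k:>3} TA sites: P(no insertion) = {prob_no_insertion(k):.3f}")
```

Under these assumptions a gene with only a handful of TA sites is very likely to be missed even if it is non-essential, so the absence of an insertion carries real information only for genes with many TA sites. This is one reason a formal probabilistic treatment is useful.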

CBMB March 19, 2003

Serafim Batzoglou

Assistant Professor of Computer Science, Stanford University

ALIGNMENTS, MOTIFS, AND MICROARRAYS

High-throughput experimentation technologies such as whole-genome sequencing and gene microarrays are transforming the way we do biology. From the traditional one-organism, few-genes framework we are quickly moving to many-organism, whole-genome studies. These are powered by algorithms, systems, and paradigms from computer science. In this talk we will cover some of the computational techniques we develop towards high-throughput biology. We will talk about methods for whole-genome multiple alignment and application to the human/mouse/rat genomes, gene microarray expression analysis, and regulatory motif-finding based on cross-species conservation and microarray measurements.

February 27, 2003

Hongzhe Li

Rowe Program in Human Genetics, UC Davis School of Medicine

THE ADDITIVE GENETIC GAMMA FRAILTY MODELS FOR GENETIC LINKAGE AND ASSOCIATION ANALYSIS

Many complex human diseases are due to multiple disease genes and both genetic and environmental risk factors. These diseases often also show variable age of disease onset. In order to incorporate both covariates and age of onset information into genetic analysis, we define an additive genetic gamma frailty model constructed based on the inheritance vectors. Within this modeling framework, we derive a retrospective likelihood ratio test for linkage and a score test for testing genetic association in the linked region using sibships data. Such tests can incorporate both affected and unaffected sibs, environmental covariates and age at disease onset or censoring information, and therefore provide a practical solution to mapping genes for complex diseases with variable age of onset. Simulation studies indicate that the proposed methods have correct type 1 error rates and perform better than the commonly used methods for linkage or association analysis. We further demonstrate the methods using the simulated data set from GAW12 and a real data set of affected sib pairs of prostate cancer.

February 18, 2003

Florin Vaida

Assistant Professor of Biostatistics, Harvard School of Public Health

CONDITIONAL AKAIKE INFORMATION FOR MIXED EFFECTS MODELS

In this talk we show that for a mixed effects model where the focus is on the cluster-specific inference the commonly used definition for AIC is not appropriate. We propose a new definition for the Akaike information to be used in such conditional inference, and we show that for a linear mixed effects model this definition leads to an Akaike information criterion (AIC) where the penalty for the random effects is related to the effective number of parameters, rho, proposed by Hodges and Sargent (Biometrika 2001); rho reflects an interim level of complexity between a fixed-effects model with no cluster effects, and a corresponding model with fixed cluster-specific effects. We compare the conditional AIC with the marginal AIC (in current standard use), and we argue that the latter is only appropriate when the inference is focused on the marginal, population-level parameters. We discuss the relationship of the conditional AIC with the deviance information criterion and other related work. A pharmaco-kinetics data application is used to illuminate the distinction between the two inference settings, and the usefulness of the conditional AIC.

CBMB January 23, 2003

Tony Rossini

Assistant Professor of Biostatistics, University of Washington

STATISTICAL ANALYSIS OF THE GENETIC DIVERSITY OF PATHOGENS

The average pair-wise evolutionary distance between molecular sequences is a simple and approximate measure which describes genetic diversity. Unfortunately, computing the standard errors for this quantity is not straightforward. We describe 2 approaches for doing this. One is a simple and statistically valid approach for computing simple inferential statistics based on U-statistics. This method takes into account the correlation due to reuse of sequence clones. We assume that the resulting sequences are sampled independently within units such as people or bodily compartments being compared for genetic diversity. The second approach uses linear mixed effects models to accommodate variation, and allows for testing for the necessity of random effects. These approaches are examined using data from a study of HIV pathogenesis in children. We conclude with some general problems with this approach and suggest future research into computationally intensive methods using phylogenies that address those concerns.
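To make the diversity measure concrete, here is a minimal sketch of the average pairwise distance for a set of aligned sequences. It assumes, for simplicity, that the distance is the raw proportion of differing sites; evolutionary distances in practice usually apply a substitution-model correction. The function names and toy sequences are hypothetical.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)

def average_pairwise_distance(sequences):
    """Mean distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

clones = ["ACGT", "ACGA", "TCGA"]  # toy aligned sequences
print(average_pairwise_distance(clones))
```

Each sequence appears in several pairs, so the pairwise distances are correlated. That reuse is why a naive standard error that treats the pairs as independent is not valid, and why the abstract turns to U-statistics.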

January 15, 2003

Sandrine Dudoit

Assistant Professor of Biostatistics, University of California, Berkeley

STATISTICAL METHODS AND SOFTWARE FOR THE ANALYSIS OF DNA MICROARRAY EXPERIMENTS

DNA microarrays are part of a new class of biotechnologies that allow the monitoring of expression levels in cells for thousands of genes simultaneously. Microarray experiments are being performed increasingly in biological and medical research to address a wide range of problems. In cancer research, microarrays are used to study the molecular variations among tumors with the aim of developing better diagnosis and treatment strategies for the disease. Microarray experiments generate large and complex multivariate datasets. The application of sound statistical design and analysis principles can greatly improve the efficiency and reliability of these experiments throughout the data acquisition and analysis process. Efficient and well-designed statistical software is an essential link between the development of statistical methodology and its positive and timely impact on biology. I will present a survey of statistical methods and software for the analysis of DNA microarray data. I will discuss more specifically computing resources developed as part of the Bioconductor project. This collaborative effort aims to produce an open source and open development computing environment for the analysis of genomic data (www.bioconductor.org).

December 4, 2002

Patrick Heagerty

Associate Professor of Biostatistics, University of Washington

TIME-DEPENDENT ROC CURVES AND LONGITUDINAL DIAGNOSTIC ACCURACY

ROC curves are a popular method for displaying sensitivity and specificity of a continuous diagnostic marker, Y, for a binary disease variable, D. However, many disease outcomes are time-dependent, D(t), and ROC curves that vary as a function of time may be more appropriate. A common example of a time-dependent variable is vital status where D(t)=1 if a patient has died prior to time t and is 0 otherwise. In Heagerty, Lumley and Pepe (2000) we have proposed summarizing the discrimination potential of a marker Y, measured at baseline (t=0), by calculating ROC curves for cumulative disease or death by time t. In other study designs both the disease outcome, D(t), and the marker, Y(t), are measured longitudinally. For this situation there are alternative approaches to defining and estimating sensitivity and specificity. One approach directly estimates the distribution of the marker process conditional on the survival time using semi-parametric regression quantiles as described in Heagerty and Pepe (1999). A second approach uses "partly conditional" survival methods and more naturally handles censored onset times. The alternative definitions and estimation approaches will be illustrated using longitudinal pulmonary function measurements among cystic fibrosis subjects, and using the Multicenter AIDS Cohort Study (MACS) data.
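The idea of an ROC curve for cumulative disease by time t can be sketched for a baseline marker as follows. This is a simplified illustration that assumes every event time is fully observed (no censoring), which the estimators discussed in the talk do not require. Subjects with event time at or before t are treated as cases and the rest as controls; the function name and toy data are hypothetical.

```python
def roc_point_at_time(markers, event_times, cutoff, t):
    """Sensitivity and specificity of the rule 'marker > cutoff' for D(t).

    D(t) = 1 if the event occurred by time t, else 0.
    Assumes all event times are observed (no censoring).
    """
    cases = [y for y, T in zip(markers, event_times) if T <= t]
    controls = [y for y, T in zip(markers, event_times) if T > t]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at time t")
    sensitivity = sum(y > cutoff for y in cases) / len(cases)
    specificity = sum(y <= cutoff for y in controls) / len(controls)
    return sensitivity, specificity

# Toy data: higher marker values go with earlier events.
markers = [5, 4, 3, 2, 1]
event_times = [1, 2, 6, 8, 9]

# The same cutoff gives different operating points at different times.
for t in (1.5, 5, 7):
    print(t, roc_point_at_time(markers, event_times, cutoff=3.5, t=t))
```

Sweeping the cutoff at a fixed t traces out one ROC curve, and repeating this for several values of t shows how the marker's discrimination changes over time.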

November 15, 2002

John Nelder

Imperial College, London

EXTENDED LIKELIHOOD INFERENCE APPLIED TO A NEW CLASS OF MODELS

Random-effect models require an extension of Fisher likelihood. Extended likelihood (Pawitan) or, equivalently, h-likelihood (Lee & Nelder), provides a basis for likelihood inference applicable to random-effect models. The model class, called hierarchical generalized linear models (HGLMs), is derived from generalized linear models (GLMs). It supports (1) joint modelling of mean and dispersion; (2) GLM errors for the response; (3) random effects in the linear predictor for the mean, with distributions following any conjugate distribution of a GLM distribution; (4) structured dispersion components depending on covariates. Fitting of fixed and random effects, given dispersion components, reduces to fitting an augmented GLM, while fitting dispersion components, given fixed and random effects, uses an adjusted profile h-likelihood and reduces to a second interlinked GLM, which generalizes REML to all the GLM distributions. A single algorithm can fit all members of the class and does not require either prior distributions or the multiple quadrature needed for methods using marginal likelihood.

Model checking also generalizes from GLMs and allows the visual checking of all aspects of the model. The model class can be extended to cover correlated data expressed by random terms in the model, thus allowing fitting of spatial and temporal models with GLM errors. Correlations can be expressed by transformations of white noise, by structured covariance matrices, or by structured precision matrices. Finally the class can be extended to double HGLMs, which allow random effects in the dispersion model as well as in the mean. This leads, among other things, to a potentially large expansion of classes of models used in finance, the properties of which have still to be investigated.

October 23, 2002

Francesca Dominici

Assistant Professor of Biostatistics, Johns Hopkins University

ESTIMATING HEALTH EFFECTS OF AIR POLLUTION: STATISTICAL CHALLENGES, FINDINGS, AND POLICY IMPLICATIONS

Evidence from time series studies of air pollution and health is central to major policy decisions concerning the risk of death associated with air pollution exposure. The nature and characteristics of time series data make risk estimation challenging, requiring development of complex statistical methods able to detect effects that are very small relative to the combined effects of confounders and residual variation.

Using the National Morbidity, Mortality, and Air Pollution Study, which includes time series data from the 90 largest US locations for the period 1987-1994, we discuss: parametric versus semi-parametric approaches for estimating city-specific relative risks; hierarchical models for synthesizing city-specific estimates; and estimation of the exposure-response relation between air pollution and mortality.

We report national-level estimates of the health effects of air pollution, review their sensitivity to model choice and prior distributions and discuss policy implications.

Sources of model uncertainty call for a systematic assessment of model choice and for development of new methods. Importantly, the weight given to this scientific evidence in setting policy requires a level of confidence in findings that is difficult to attain in the small effects/many potential confounders context, regardless of the sophistication of the statistical approach.

May 20, 2002

Stephen Senn

Department of Statistical Science, University College London

TWO CHEERS FOR P-VALUES

P-values are a practical success but a critical failure. Scientists the world over use them but scarcely a statistician can be found to defend them. Bayesians in particular find them ridiculous but even the modern frequentist has little time for them.

The invention of P-values is often mistakenly ascribed to RA Fisher but in fact they are far older, dating back at least as far as Daniel Bernoulli's significance test of 1734 regarding the inclinations of the planetary orbits. The Bayesian Karl Pearson also used them in his famous paper of 1900 on the chi-square goodness of fit test, some 25 years before the publication of Fisher's influential Statistical Methods for Research Workers.

Recently there has been a growing campaign against their use in medical statistics. The journal Epidemiology has even banned them. Bayesian critics have drawn attention to the fact that a just significant result has a moderate replication probability whilst failing to note that this is a desirable and necessary property shared by Bayesian statements. P-values have even been attacked in the popular press.

In this talk I shall consider whether there are any grounds for continuing to use this ubiquitous but despised device.

April 22, 2002

Ying Lu

Department of Radiology, UCSF

ON THE EQUIVALENCE OF TWO DIAGNOSTIC TESTS BASED ON PAIRED OBSERVATIONS

Equivalence of two diagnostic tests is a common problem in medical research. Often we want to determine if a new diagnostic test is as good as the standard reference test. Sometimes, we are interested in an inexpensive test that may have an acceptable inferiority in sensitivity or specificity. While hypothesis testing procedures and sample size formulas for the equivalence of sensitivity or specificity alone have been proposed, very few studies have discussed simultaneous comparisons for both indications. In this paper, we present three different hypothesis testing procedures and sample size formulas for simultaneous comparison of sensitivity and specificity based on paired observations and with known disease status. These statistical procedures are then used to compare two classification rules that identify women for future osteoporotic fracture. Simulation experiments demonstrate that the new tests and sample size formulas give the appropriate type I and II error rates. Differences between our approach and the approach of Lui and Cumberland (2001) are discussed. This is joint work with Drs. H. Jin, ST Harris, and HK Genant. The research is supported by NIH grant R03 AR47104.

April 18, 2002

Su-Chun Cheng

Department of Statistics, Texas A&M University

APPLICATIONS OF SEMIPARAMETRIC TRANSFORMATION MODELS: CLUSTERED FAILURE TIME DATA AND COVARIATE MEASUREMENT ERROR

The Cox model has been used extensively to model univariate failure times as a function of covariates in the analysis of clinical trials. Nevertheless, its proportional hazards assumption may be questionable for some data. To this end, Dabrowska & Doksum (1988), Cheng, Wei & Ying (1995, 1997) and Scharfstein, Tsiatis & Gilbert (1998) studied a class of semiparametric transformation models, under which an unknown transformation of the event time is linearly related to the covariates with various completely specified error distributions. This class of regression models, which includes the proportional hazards and proportional odds models as special cases, provides useful alternatives to the Cox model for analyzing survival data. The methods for univariate event times, however, may be inappropriate if the data consist of a large number of small clusters of correlated failure times. Also, in the presence of covariate measurement error, survival analysis with the observed covariate may yield a biased estimate for the regression parameter. Existing research on these two topics has focused on using the Cox model with a frailty for clustered event times and on adapting the Cox model to account for covariates with measurement errors. In this talk, I will present separate applications of the transformation models that are generalized to handle clustered survival data and to accommodate covariate measurement error.

CBMB March 14, 2002

Xiaole Liu

Stanford Medical Informatics

DISCOVERY OF TRANSCRIPTION FACTOR BINDING SITES USING COMPUTATIONAL STATISTICS

The rapid development of sequencing technology has enabled the human and many other genomes to be sequenced and made publicly available. Microarray technology has also become considerably more robust and sensitive. The combination of the two allows biologists to study gene expression and transcription regulation at the genome level. Given a set of upstream DNA sequences whose downstream genes are clustered together based on similarity in gene expression profile, or a set of DNA sequences enriched in chromatin immunoprecipitation followed by microarray experiments (ChIP-array), it is desirable to conduct computational analysis to find common sequence motifs that are the potential transcription factor binding sites regulating transcription.

I will review the established approaches for discovering common DNA motifs in a set of sequences, and introduce two computational statistics approaches, BioProspector and MDscan.

BioProspector searches for common sequence motifs from any general cluster of DNA sequences, especially potential transcription factor binding sites from upstream sequences of genes clustered by expression profile similarity. BioProspector adopts a Gibbs sampling motif discovery strategy, but provides many improvements. Motifs can have one-block, two-block, or two-block palindromic patterns. BioProspector allows variable copies of a motif per sequence, and uses a background model with Markov dependency to improve the specificity of motifs. The statistical significance of a discovered motif can be calculated by Monte Carlo simulation. Current results for testing each BioProspector feature have been very encouraging. A BioProspector web site has been set up for biologists to load their sequences onto the server for motif discovery. Another program, MatrixScan, has been developed to search the genome for more potential sites using a discovered motif matrix.

MDscan is a fast and novel algorithm that looks for motifs from a set of sequences when one has confidence that a subgroup of the sequences contains the motif more abundantly. It can be used to find protein-DNA interaction sites from sequences selected by ChIP-array experiments because the sequences highly enriched by ChIP-array are very likely to contain the real protein-DNA interaction sites, and with multiple copies per sequence. The comparison of MDscan with several other motif-finding programs shows the advantage of MDscan in both speed and accuracy. It also succeeds in identifying the correct motifs from all published ChIP-array experiments.

CBMB March 4, 2002

Ru-Fang Yeh

Department of Biology, Massachusetts Institute of Technology

PREDICTING HOMOLOGOUS GENE STRUCTURES AND EXONIC SPLICING ENHANCERS IN THE HUMAN GENOME

The sequence of the human genome provides the foundation for new approaches to study the organization and functions of human genes. In this talk, I will demonstrate the use of sequence analysis methods to address two different but closely related problems - identification of genes and exonic splicing enhancers.

A major challenge following the completion of the human genome project is to identify the locations and encoded protein sequences of all human genes. We have developed GenomeScan, a new gene identification program which combines the power of ab initio gene-finding algorithms, as in Genscan, with database search results (such as BLASTX) in an integrated model. Accuracy from extensive testing and results of the application of GenomeScan to 2.7 billion bases of publicly available human genomic DNA will be discussed.

The vast amount of sequence data also allows us to study the association of sequence content with various biological processes. Our PROFILER method uses a statistical analysis of exon-intron and splice site composition to screen for short oligonucleotide sequence motifs in exons that enhance pre-mRNA splicing. Representatives of the predicted motifs were found to possess significant enhancer activity when tested in vivo, while point mutants exhibited sharply reduced activity as predicted. The experimental results verified the ability of PROFILER to predict the splicing phenotypes of exonic mutations in human genes.

December 13, 2001

David Strauss

The Life Expectancy Project

MORTALITY RESEARCH AT THE UC LIFE EXPECTANCY PROJECT

The Life Expectancy Project is a San Francisco-based research and consulting group, formerly housed at UC Riverside. The group, which works with a California database of 235,000 persons with mental disability, has published more than 100 articles, mostly on epidemiological and actuarial studies of mortality.

The talk will be an overview of a broad range of topics:

Details on most of this can be found on the group's web site, www.LifeExpectancy.com.

September 24, 2001

Jacqueline Law

Pharsight Corporation

THE JOINT MODELING OF A LONGITUDINAL DISEASE PROGRESSION MARKER AND THE FAILURE TIME PROCESS IN THE PRESENCE OF CURE

In this talk I will present a cure model which incorporates a longitudinal disease progression marker. The model is motivated by studies of patients with prostate cancer undergoing radiation therapy. The patients are followed until recurrence of the prostate cancer or censoring, with the PSA marker measured intermittently. Some patients are cured by the treatment and are immune from recurrence. A joint-cure model is developed for this type of data, in which the longitudinal marker and the failure time process are modeled jointly, with a fraction of patients assumed to be immune from the endpoint. A hierarchical nonlinear mixed effects model is assumed for the marker and a time-dependent Cox's proportional hazards model is used to model the time to endpoint. The probability of cure is modeled by a logistic link. The parameters are estimated using a Monte Carlo EM algorithm. Importance sampling with an adaptively chosen t-distribution and variable Monte Carlo sample size is used. This model is fitted to a prostate cancer database. A simulation study is also performed. It is found that the parameter estimates have better statistical properties when the longitudinal disease progression marker is incorporated into the cure model. The classification of the censored patients into the cure group and the susceptible group based on the estimated conditional recurrence probability from the joint-cure model has a higher sensitivity and specificity, and a lower misclassification probability compared with the standard cure model. The addition of the longitudinal data has the effect of reducing the impact of the identifiability problems in a standard cure model and can help overcome biases due to informative censoring.

July 5, 2001

Alistair Scott

Department of Statistics, University of Auckland, New Zealand

Case Control Studies with Complex Sampling

The use of complex sampling designs in population-based case-control studies is relatively common, particularly for sampling the control population. This is prompted by the usual cost and logistical benefits conferred by multi-stage sampling. Complex sampling is typically ignored in the analysis, but with the advent of packages like SUDAAN, survey-weighted analyses that take account of the sample design can be carried out routinely. This talk explores some more efficient alternatives, which can also be implemented using readily available software. We also look at robustness of the procedures when the model is mis-specified.

April 10, 2001

Jack Kalbfleisch

Department of Statistics and Actuarial Science, University of Waterloo

Bootstrapping the Estimating Function

In the Estimating Function (EF) Bootstrap, the distribution of the estimating function is estimated by resampling its terms using bootstrap techniques. Studentized versions of the EF Bootstrap yield methods that are invariant under reparametrizations and yield higher order approximations to confidence regions. This approach often has substantial advantage, both in computation and accuracy, over more traditional bootstrap methods and it applies to a wide class of practical problems where the data are independent but not necessarily identically distributed. We will discuss applications in this context and extensions to estimating components of a vector parameter. Simulations are used to compare the EF bootstrap with competing methods in several examples including the common means problem and nonlinear regression. We will conclude with some discussion of extensions of this approach to autoregressive models and outline a number of problems for further study.

March 26, 2001

Heping Zhang

Associate Professor of Biostatistics, Yale University School of Medicine

Multivariate Adaptive Splines Models for the Analysis of Longitudinal Data (MASAL)

A mixed-effects multivariate adaptive splines model will be presented to analyze longitudinal or growth curves data that may or may not have been collected through a regular measurement schedule. The MASAL algorithm by Zhang (1994, 1997, 1999) will be described and applied to determine the nonparametric fixed-effects in the mixed-effects multivariate adaptive splines model. The potential of this procedure is illustrated with the analysis of a data set on the effect of cocaine use by pregnant women on the growth of their infants after birth. In addition, residual diagnoses are presented to validate the mixed-effects multivariate adaptive splines model.

March 19, 2001

Chengcheng Hu

Department of Biostatistics, University of Washington

Cox Regression with Mismeasured or Missing Covariates

This talk deals with the estimation of the Cox proportional hazards model when covariates are measured with error or missing. For the measurement error problem, the classical additive measurement error model is considered, as well as a more general model which represents the mismeasured version of the covariate as an arbitrary linear function of the true covariates plus a random noise. No distributional form is imposed on the covariates or the error. Assuming that the covariates are measured precisely for a validation set, we develop consistent and asymptotically normal estimators for the regression parameters and the cumulative baseline hazard function. Simulation studies indicate that the proposed estimators work well for practical sample sizes, and a real example is provided. The method is also adapted to the situation when only replicate measurements are available for the covariates, instead of a validation set. A similar approach is taken to study the Cox model with missing covariates. Imputed covariates are used and a class of modified partial likelihood score functions is proposed to correct the bias in the ordinary imputation approach. The resulting estimators are shown to be consistent and asymptotically normal, and their finite sample properties are explored using simulation.

March 12, 2001

Jason Fine

Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison

Risk Assessment via a Robust Probit Model, with Application to Toxicology

A number of frameworks may be used to assess the risk associated with a continuous toxicity outcome. In a rat study of aconiazide, a drug under investigation for treatment of tuberculosis, animals receiving high doses tended to experience increased weight loss. The goal of our analysis is to identify a "safe" dose. One approach is to formulate the effect of the exposure on the adverse effect with a simple normal model and to compute the risk function using tail probabilities from the standard normal distribution. This risk function depends heavily on the assumed model and may be sensitive to misspecification. A semiparametric alternative based on another definition of risk has recently been studied. However, it is not clear whether the two approaches are related. We explore a semiparametric normal model, in which an unknown transformation of the adverse response satisfies the linear model. It is demonstrated that this formulation unifies the two approaches, allowing for a coherent risk analysis of the dose-response data. The methodology includes estimation and inference for the unknown transformation in the semiparametric model for the continuous response. Novel model-checking techniques are proposed for diagnosing lack-of-fit, including a formal sup-norm test of the simple normal model. The aconiazide data serves as a case study for the risk assessment procedure.

March 6, 2001

Maja Pavlic

Group in Biostatistics, University of California, Berkeley

Modeling response to treatment using normal mixtures

Repeat measurements of patient characteristics are often used to assess response to treatment. If some proportion of patients does not respond to treatment at all, they behave as if they have not been treated. This definition leads to a mixture model description of responders and non-responders to treatment.

However, mixture models can be used to closely approximate any distribution. For this reason, model assumptions, in particular mixture dimensions, need to be checked. Choosing the number of mixture components is a non-regular model selection problem since the likelihood ratio test statistic does not follow its usual asymptotic distribution. We propose minimizing the estimated distance between the fitted and true model densities to choose a mixture of optimal dimension. The distances we consider are Kullback-Leibler and L2. Simulation studies indicate that the method of minimizing distance performs well in comparison to other available model selection functionals.

The Fracture Intervention Trial (FIT) was a randomized clinical trial of the osteoporosis drug alendronate. The statistical methods discussed are applied to the bone mineral density (BMD) change data from FIT. Based on the observed changes in BMD, we challenge the existence of non-responders to alendronate using the mixture model selection methods.

December 12, 2000

Joerg Rahnenfuehrer

Group in Biostatistics, University of California, Berkeley

Data Compression and Statistical Inference: Multivariate Permutation Tests for Clustered Data

The talk deals with the choice of clustering algorithms for multivariate data sets. We make use of a wide class of algorithms containing as special cases both the well-known k-means algorithm and the Kohonen (1985) algorithm. These algorithms define partitions of a type investigated in depth by Poetzelberger and Strasser (1999).

We compare the quality of the clustering procedures by first applying them to multivariate data sets and then treating a k-sample problem. For computing the test statistics the data points are replaced by their conditional expectations with respect to the MSP-partition. We present Monte Carlo simulations of power functions for tests that are carried out as multivariate permutation tests.
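
The procedure in the preceding paragraph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the speaker's method: plain k-means stands in for the general class of algorithms, each point is replaced by the center of its cluster cell (the within-cell mean once k-means has converged), and the test statistic is a between-group sum of squares chosen here for simplicity. All function names and defaults are hypothetical.

```python
import random


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_compress(points, k, rng, iters=20):
    """Run plain k-means on the pooled data and replace each point by its
    nearest final center (the mean of its cell once k-means has converged)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            cells[nearest].append(p)
        # Keep the old center if a cell happens to be empty.
        centers = [centroid(cell) if cell else centers[c]
                   for c, cell in enumerate(cells)]
    return [min(centers, key=lambda c: sq_dist(p, c)) for p in points]


def between_group_stat(values, labels):
    """Sum over groups of n_g times the squared distance between the
    group mean and the grand mean."""
    grand = centroid(values)
    stat = 0.0
    for g in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == g]
        stat += len(members) * sq_dist(centroid(members), grand)
    return stat


def permutation_pvalue(points, labels, k=4, n_perm=999, seed=0):
    """k-sample permutation test computed on the compressed data."""
    rng = random.Random(seed)
    # Clustering ignores the group labels, so permuting labels stays valid.
    compressed = kmeans_compress(points, k, rng)
    observed = between_group_stat(compressed, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_group_stat(compressed, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    gen = random.Random(1)
    group_a = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(15)]
    group_b = [(gen.gauss(1, 1), gen.gauss(1, 1)) for _ in range(15)]
    p = permutation_pvalue(group_a + group_b, ["a"] * 15 + ["b"] * 15)
    print(f"permutation p-value: {p:.3f}")
```

Repeating such a test over many simulated data sets, for different clustering algorithms and tail behaviors, would give Monte Carlo estimates of the power functions mentioned above.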

The results show a vital and decisive connection between the optimal choice of the clustering algorithm and the tails of the probability distribution of the data. Especially for distributions with heavy tails the performance of k-means type algorithms totally breaks down.

Finally, we demonstrate the influence of the choice of clustering algorithm on the quality of the compression of high-dimensional real data sets from microarray experiments, where poorly performing algorithms are often applied.

October 26, 2000

Mark van der Laan

Department of Statistics and Group in Biostatistics, University of California, Berkeley

Statistical Inference with Microarray Data

Large-scale gene expression studies are becoming increasingly common as new microarray technology makes it possible to capture the gene expression profiles for thousands of genes at once. Statistical inference with such high dimensional data structures (and, all too often, relatively small samples) is a challenging analytical problem. In the current microbiology literature, (hierarchical) cluster analysis methods have been used to find groups of genes with similar patterns of expression. Such methods are purely exploratory and, thus, do not provide any type of significance level for features in the data or any opportunities for purposeful experimental design. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. We focus on rules that operate on mean vectors and covariance (i.e. correlation) matrices; we also employ the output of a standard cluster analysis methodology ("partitioning around medoids" or PAM) to further refine the subset by exploiting the dependence of certain subsets of genes. An estimate of the target subset is obtained by applying the procedure to the sample statistics (e.g. mean and covariance). The parametric bootstrap is used to estimate the distribution of these estimated subsets; relevant summary measures of this distribution are also proposed. We prove consistency of the subset estimates and asymptotic validity of this parametric bootstrap under the assumption that the sample size converges faster to infinity than the logarithm of the number of genes. The practical performance of the method is illustrated with a simulation study. The method has also been used to analyze cancer-patient data.

September 28, 2000

Mark Segal

Division of Biostatistics, UCSF

Clustering of Translocation Breakpoints

Translocation, the physical movement of genetic material from one chromosome to another, can juxtapose portions of two cellular genes to generate chimeric gene products and/or alter regulation of gene expression. This provides a putative oncogenic stimulus and, indeed, several gene fusions from translocations have been identified in leukemias, lymphomas, and sarcomas.

The statistical analysis of translocation breakpoints has focussed on the extent to which they cluster. Somewhat questionable methods have been employed in this regard. After highlighting these shortcomings, we introduce a variety of approaches including scan statistics, smoothed bootstrap, and gap statistics, that provide a comprehensive means for appraising clustering. We apply this battery to TEL-AML1 translocations, the most common translocation in childhood ALL. Results obtained indicate much weaker evidence for clustering than previously published.

April 10, 1999

Dan Scharfstein

Assistant Professor, Department of Biostatistics
Johns Hopkins School of Hygiene and Public Health

Methods for Conducting Sensitivity Analysis of Trials with Potentially Non-ignorable Competing Causes of Censoring

We consider inference for the treatment-arm mean difference of an outcome that would have been measured at the end of a randomized follow-up study if, during the course of the study, patients had not initiated a non-randomized therapy or dropped out. We argue that the treatment-arm mean difference is not identified unless unverifiable assumptions are made. We describe identifying assumptions that are tantamount to postulating relationships between the components of a pattern-mixture model, but can also be interpreted as imposing restrictions on the cause-specific censoring probabilities of a selection model. We then argue that although sufficient for identification, these assumptions are insufficient for inference due to the curse of dimensionality. We propose reducing dimensionality by specifying semiparametric cause-specific selection models. These models are useful for conducting a sensitivity analysis to examine how inference for the treatment-arm mean difference changes as one varies the magnitude of the cause-specific selection bias over a plausible range. We provide methodology for conducting such sensitivity analysis and illustrate our methods with an analysis of data from the AIDS Clinical Trial Group (ACTG) study 002.

This is joint work with Andrea Rotnitzky and James Robins (Harvard School of Public Health) and Ting-Li Su (Quintiles, Inc.).

March 13, 1999

Michael I. Jordan

Department of Statistics, University of California, Berkeley

Graphical models and variational approximation

Graphical models provide an elegant formalism for probabilistic computation that unifies much of the literature on complex probabilistic models in computer science, engineering, statistics, and physics. For sparse graphs (e.g., graphs in the form of chains or trees, such as Kalman filters, hidden Markov models, and probabilistic decision trees), there exist general algorithms for probabilistic inference that are exact, efficient and practical. For dense graphs, however, the exact algorithms are often (hopelessly) inefficient, and this fact has hindered the application of this richer class of models to real-life problems. I discuss variational methodology, which provides a general framework for approximate inference in graphical models. I illustrate variational methods with examples of applications to problems in prediction, diagnosis and control.

September 29, 1999

Hongzhe Li

University of California Davis School of Medicine

THE ADDITIVE GAMMA FRAILTY MODELS FOR LINKAGE ANALYSIS OF AGE-AT-ONSET VARIATION FOR COMPLEX DISEASES

Age-at-onset data arise frequently in mapping studies of complex traits, for which model-free methods have been widely used recently. Since these complex traits often lack simple inheritance patterns, the robustness of model-free methods, i.e., not requiring specification of the mode of inheritance, is especially desirable. However, current methods in nonparametric linkage analysis concentrate mainly on affected relative pairs or affected family members, with age-of-onset information either ignored or taken into account by specifying age-dependent penetrances for liability classes.

I will first demonstrate that the power of these methods could be greatly affected by ages at onset and that naively combining affected subjects with different ages at onset could result in reduced power to detect linkage. I will then present an additive gamma frailty model for linkage analysis of age-at-onset variation. For each individual, I define a frailty as the sum of the frailty due to the putative disease locus, based on the inheritance distribution, and the frailty due to the additive polygenic effect, and I use the Cox proportional hazards model to model age at onset. I will show that the variance of the frailty, and therefore the variance of age at onset, can be written as the sum of the variance due to the putative disease gene and the variance due to polygenes, and that a test of linkage can be formulated as a test of zero variance due to the putative disease gene. I will derive the conditional hazard ratio parameter for sib pairs and define a likelihood-ratio-based lod score statistic under the proposed model. Finally, I will present simulation studies to show that the proposed test has the correct type I error rate yet gains power compared with tests that ignore age-at-onset data.

July 15, 1999

Nancy Flournoy

Department of Mathematics and Statistics, American University

STATISTICAL SCIENCE: A CASE STUDY IN DISCOVERY

My current research in Adaptive Designs is motivated by my 15-year tenure with the team that pioneered bone marrow transplantation at the Fred Hutchinson Cancer Research Center.  In some medical investigations I became extremely dissatisfied with the rate of learning using existing statistical design methodologies.  I will briefly describe the experimental conditions that led to this frustration and motivate my research interests.  I will outline progress that has been made.

Although my current research focuses on 'problems' that result from using existing design methodologies, existing statistical methodologies are also marvelous tools of discovery.  I never think of a study in isolation, but rather as a stage in the process of knowledge acquisition.  I will review a serial mixture of randomized experiments and observational studies by which we were able to establish that cytomegalovirus (CMV) infection could result from blood transfusions. At that time hepatitis was the only viral infection thought to transmit in this manner.  Just as case studies started to appear reporting HIV infection following blood transfusions, our findings for CMV led blood banks to gear up for doing routine viral screenings.

My medical sciences motivated research and this series of investigations illustrate the way I would promote statistics as an integral part of the medical sciences discovery and learning process.

July 29, 1999

Mark Segal

UCSF Division of Biostatistics

Prediction of Binding Peptide Sequences: Application of Trees and Bump-Hunts.

Milik et al. (Nature Biotechnology, 16: 753-6, 1998) use artificial neural networks (ANNs) to predict the amino acid sequences of peptides that bind to the particular MHC class I molecule, K^b. Their motivation is that simple rules for such prediction, based solely on preferences for specific amino acids in certain (anchor) positions, are inadequate and that binding is influenced by the amino acids in all positions of the peptide. The purpose of the ANN application was to elucidate these more complex rules. While ANNs provide a powerful and flexible machinery, they have some shortcomings with respect to this problem: (i) difficulty handling highly polymorphic positions in terms of the amino acid representation itself (as opposed to derived properties thereof); (ii) a "black-box" representation of the prediction rule that precludes interpretative insight; and (iii) mediocre performance in terms of sensitivity and specificity for the phage library analyzed.

We demonstrate that handling unordered categorical covariates with numerous levels and attendant interactions (shortcoming (i)) is, in fact, problematic for many regression methods. Further, this and the other difficulties can be effectively redressed using classification tree techniques. We illustrate this approach using the same data studied by Milik et al. Additionally, recently devised bump-hunting methods that also adeptly handle unordered categorical covariates are applied. Other interesting problem features including (a) position covariation, and (b) whether observed associations are attributable to amino acid properties, are addressed.

July 6, 1999

Charles McCulloch

Departments of Statistical Science and Biometrics, Cornell University

Latent Class Mixed Models

Linear mixed models are a well-known method for incorporating heterogeneity (e.g., subject-to-subject variation) into a statistical analysis for continuous responses. However, heterogeneity cannot always be captured by the usual assumptions of normally distributed random effects. Latent class mixed models offer a way of incorporating additional heterogeneity which can be used to uncover distinct subpopulations, to incorporate correlated non-normally distributed outcomes, and to classify individuals. The methodology is illustrated with data from the Nutritional Prevention of Cancer trials: latent class models are used with longitudinal data on prostate specific antigen (PSA) as well as incidence of prostate cancer. Four subpopulations are identified which differ both with regard to their PSA trajectories and their incidence rate of prostate cancer.

June 10, 1999

Ying Qing Chen

Department of Biostatistics, Johns Hopkins University

Accelerated Hazards Model and Its Extensions

The proportional hazards model for survival time data assumes that the risk factors of interest predict their effect multiplicatively on an underlying unknown hazard function. Although this model has been studied widely in the statistical literature, it may not be applicable when the assumption of constant proportionality is violated. In a two-arm randomized clinical trial, for example, participants in the treatment group would have the same risk process through time as those in the control group, except that the treatment would speed up or slow down this process. Some alternatives such as the accelerated failure time model have been developed in the literature. In this talk, an accelerated hazards model is introduced to estimate such a treatment effect when there is a scale change relationship between hazard functions. The methodology and its estimation procedure are studied within a two-sample setting. Extensions of the model to other general settings are discussed. The proposed method is applied to a real data set to investigate the practical usage.
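
For orientation, the three model forms contrasted above can be written for a single covariate Z as follows. The notation is a common convention assumed here rather than taken from the abstract: λ0 denotes the unspecified baseline hazard and β the regression parameter, and the sign convention for the accelerated failure time model varies across texts.

```latex
\begin{align*}
\text{Proportional hazards:} \quad
  & \lambda(t \mid Z) = \lambda_0(t)\, e^{\beta Z} \\
\text{Accelerated failure time:} \quad
  & \lambda(t \mid Z) = \lambda_0\!\left(t\, e^{\beta Z}\right) e^{\beta Z} \\
\text{Accelerated hazards:} \quad
  & \lambda(t \mid Z) = \lambda_0\!\left(t\, e^{\beta Z}\right)
\end{align*}
```

In a two-arm trial with Z = 1 for treatment and Z = 0 for control, the accelerated hazards form says that the treatment-arm hazard at time t equals the control-arm hazard at time t e^β, so the two arms share one risk process run on different time scales.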

This work is joint with Mei-Cheng Wang of the Department of Biostatistics, Johns Hopkins University.

April 15, 1999

Adam Olshen

Stanford Human Genome Center, Department of Genetics

SAMapper: A Maximum Likelihood Method for Constructing Radiation Hybrid Maps

In this talk I will discuss both the uses of radiation hybrid mapping and the mapping program SAMapper developed at the Stanford Human Genome Center. Radiation hybrid mapping is the most common method of making high resolution maps of the human genome. A human-hamster radiation hybrid is constructed by irradiating a human cell and fusing it with a hamster cell. A hybrid contains the whole hamster genome and random portions of the human genome. Maps are constructed from data on the retention of human markers in a panel of hybrids. The purpose of our new method is to adapt the Boehnke-Lange-Cox maximum likelihood techniques for radiation hybrid mapping so that it is possible to build good maps on the order of several thousand markers. This is accomplished through the use of reasonable plug-in estimators to speed up the likelihood calculation and simulated annealing to search through the many possible orders of markers. In addition, we have developed a novel method of bootstrapping to assess the uncertainty in our maps.

This work is joint with Laura Lazzeroni, Ying Luo and David Cox.

April 9, 1999

Anthony J. Lawrance

School of Mathematics and Statistics, University of Birmingham, Birmingham, England

Engine Mapping: Statistical Modelling as Auto Engineering

Engine mapping is the term used in the auto industry when modelling engine outputs, such as torque and emissions, in terms of engine inputs, such as load, air-fuel ratio and exhaust gas recycling ratio. Such relationships are required by electronic engine controllers to provide optimum fuel economy within legal restrictions on exhaust gas emissions and within the operational limits of the engine. The older method offered considerable scope for improvement and a new approach has been developed. The key idea is to consider the problem in two stages. The first stage is concerned with response or output sequences as functions of spark advance, which is the way the data are collected, from experiments designed in terms of the input variables. The second stage involves informed multivariate regression modelling of key engineering quantities of curves fitted to these sequences. This division of the problem allows both input from the engineering base and the effective use of statistical modelling and a variety of diagnostics; it produces models with much improved predictive performance. The approach is outlined and then illustrated on data from a designed experiment carried out earlier during the work. A number of novel statistical features are involved, but there are some parallels with repeated measures. The topic had not previously been subject to detailed study in either the statistics or engineering communities and is now being refined for implementation with future production engines. The work is being carried out as a closely collaborative project with the Ford Engineering Research Centre in the UK.

December 7, 1998

David Giltinan

Genentech

Sensitivity Analysis in Mixed Effects Models using the Weighted Bootstrap