Analyzing discrete data: challenges in the modern era of data science

Dungang Liu Chair
University of Cincinnati
Dungang Liu Organizer
University of Cincinnati
Sunday, Aug 6: 4:00 PM - 5:50 PM
Invited Paper Session 
Metro Toronto Convention Centre 
Room: CC-202A, CC-202B



Main Sponsor

Social Statistics Section

Co-Sponsors

Section on Statistics in Marketing
Survey Research Methods Section


Bayesian Semiparametric Joint Modeling of Longitudinal Data and Discrete Outcomes

The Women's Health Initiative (WHI) Life and Longevity After Cancer (LILAC) study is an excellent resource for studying the impact of treatment on the health of cancer survivors. For instance, longitudinal data collected in the main WHI study could be used to identify clinical and lifestyle factors predictive of treatment-related symptoms. However, the timing of the longitudinal data relative to diagnosis is highly variable, which complicates the analysis. To address this issue, we propose two Bayesian semiparametric joint models: one for a binary outcome (e.g., presence/absence of a symptom) and one for a count outcome (e.g., total number of symptoms). In both models, the longitudinal data are modeled jointly using a generalized linear mixed model (GLMM). The GLMM is used to impute values of the longitudinal variables at a time point of interest (e.g., diagnosis), and the imputed values are then used as predictors in the outcome models. Binary outcomes are modeled using logistic regression, while count outcomes are modeled using a nonparametric rounded mixture of Gaussians, which can accommodate overdispersion, zero-inflation, and multimodality. Applications to the LILAC data will be presented.
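To make the count-outcome component more concrete, the following is a minimal sketch of a rounded Gaussian mixture probability mass function. It is an illustration only, not the authors' model: the thresholds (a latent value at or below 0 maps to a count of 0, and a latent value in (j-1, j] maps to count j) are an assumed convention, the function names are hypothetical, and the mixture is finite with fixed weights rather than nonparametric.

```python
import math


def normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rounded_gaussian_pmf(j, mu, sigma):
    """P(Y = j) when a latent N(mu, sigma^2) variable is rounded to a count.

    Assumed thresholds: latent <= 0 gives Y = 0; latent in (j-1, j] gives Y = j.
    """
    upper = normal_cdf((j - mu) / sigma)
    if j == 0:
        return upper
    return upper - normal_cdf((j - 1 - mu) / sigma)


def rounded_mixture_pmf(j, weights, mus, sigmas):
    """PMF of a finite mixture of rounded Gaussians (weights should sum to 1)."""
    return sum(w * rounded_gaussian_pmf(j, m, s)
               for w, m, s in zip(weights, mus, sigmas))


# Hypothetical parameters: one component centered below zero puts extra
# mass at 0 (zero-inflation); the other is centered near a count of 5.
weights, mus, sigmas = [0.4, 0.6], [-2.0, 5.0], [1.0, 1.5]
pmf = [rounded_mixture_pmf(j, weights, mus, sigmas) for j in range(30)]
print([round(p, 4) for p in pmf[:10]])
```

Under this sketch, a component whose latent mean lies below zero contributes most of its mass to the count 0, and well-separated components produce several modes, which is how a single mixture can exhibit both zero-inflation and multimodality.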


Woobeen Lim, Food and Drug Administration
Michelle Naughton, The Ohio State University
Electra Paskett, The Ohio State University


Michael Pennell, The Ohio State University

Jeffreys-prior penalty in binomial-response generalized linear models

Penalization of the likelihood by Jeffreys' invariant prior, or by a positive power thereof, is shown to produce finite-valued maximum penalized likelihood estimates in a broad class of binomial generalized linear models. Such models include logistic regression, where the Jeffreys-prior penalty is known to reduce the asymptotic bias of the maximum likelihood estimator, and models with other commonly used link functions such as probit and log-log. We discuss shrinkage towards equiprobability across observations, the implications of finiteness and shrinkage for inference about the model parameters, and the performance of maximum penalized likelihood estimation in settings with moderately high-dimensional covariate structures. These theoretical results and methods underpin the increasingly widespread use of reduced-bias and similarly penalized binomial regression models in many applied fields. 
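As a small illustration of finiteness and shrinkage, the sketch below treats the simplest possible case: an intercept-only logistic model for y successes in n trials. This is an assumption made for brevity and is not the general setting of the abstract. Here the Fisher information is n p (1 - p), the penalty is a times its logarithm (a = 0.5 corresponds to the Jeffreys prior), and the penalized estimate of p has the closed form (y + a) / (n + 2a). The function names are hypothetical.

```python
import math


def expit(beta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-beta))


def penalized_loglik(beta, y, n, a=0.5):
    """Binomial log-likelihood plus a * log(Fisher information).

    Intercept-only logistic model; a = 0.5 gives the Jeffreys-prior penalty.
    """
    p = expit(beta)
    loglik = y * math.log(p) + (n - y) * math.log(1.0 - p)
    return loglik + a * math.log(n * p * (1.0 - p))


def fit_intercept(y, n, a=0.5, max_iter=100, tol=1e-12):
    """Maximize the penalized log-likelihood over beta by Newton's method."""
    beta = 0.0
    for _ in range(max_iter):
        p = expit(beta)
        score = y - n * p + a * (1.0 - 2.0 * p)   # derivative w.r.t. beta
        info = (n + 2.0 * a) * p * (1.0 - p)      # negative second derivative
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# With y = 0 the ordinary maximum likelihood estimate of beta is -infinity;
# the penalized estimate stays finite.
beta_hat = fit_intercept(0, 10)
print(beta_hat, expit(beta_hat))
```

In this special case the penalized fit equals the maximum likelihood fit after adding a pseudo-successes and a pseudo-failures, which makes the shrinkage of the fitted probability towards 1/2 easy to see.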


Ioannis Kosmidis, University of Warwick

New Residuals for Regression Models with Discrete Outcomes Based on Double Probability Integral Transform

Making informed decisions about model adequacy has been an outstanding issue for regression models with discrete outcomes. Standard assessment tools for such outcomes (e.g., deviance residuals) often show a large discrepancy from the hypothesized pattern even under the true model and thus are not informative. To fill this gap, we construct a new type of residuals for general discrete (e.g., binary and count) outcomes. The proposed residuals are based on two layers of probability integral transform. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of the QQ plots can further help identify potential causes of misspecification. Through simulation studies, we demonstrate empirically that the proposed residuals outperform commonly used residuals in various model assessment tasks: they stay close to the hypothesized pattern under the true model and depart markedly from it under model misspecification, which makes them an effective assessment tool.
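The sketch below shows one plausible reading of "two layers of probability integral transform" for a Poisson regression; it is an assumption-laden illustration and not necessarily the authors' definition. The first layer evaluates the model CDF at the observed outcome, s_i = F(y_i | x_i). The second layer evaluates at s_i an estimate of the distribution function of F(Y | X), obtained by averaging the model-implied probability P(F(Y | x_j) <= s_i) over the other observations' covariates (leave-one-out). The simulated data, the coefficients 0.5 and 0.8, the use of the true rather than fitted means, and all function names are hypothetical.

```python
import math
import random


def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam)."""
    term = math.exp(-lam)
    total = term
    for j in range(1, k + 1):
        term *= lam / j
        total += term
    return min(total, 1.0)


def pit_cdf_given_mean(s, lam, max_k=1000):
    """P(F(Y) <= s) for Y ~ Poisson(lam): the largest CDF value not above s."""
    term = math.exp(-lam)
    cdf, best, k = term, 0.0, 0
    while cdf <= s and k < max_k:
        best = cdf
        k += 1
        term *= lam / k
        cdf += term
    return min(best, 1.0)


def double_pit_residuals(y, lam):
    """Second-layer transform of F(y_i | x_i), averaging over all j != i."""
    n = len(y)
    residuals = []
    for i in range(n):
        s = poisson_cdf(y[i], lam[i])                      # first layer
        total = sum(pit_cdf_given_mean(s, lam[j])
                    for j in range(n) if j != i)           # second layer
        residuals.append(total / (n - 1))
    return residuals


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_residuals(n=200, seed=0):
    """Simulate from a Poisson model with one continuous covariate."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    lam = [math.exp(0.5 + 0.8 * xi) for xi in x]   # hypothetical true means
    y = [sample_poisson(l, rng) for l in lam]
    return double_pit_residuals(y, lam)


residuals = simulate_residuals()
print(sum(residuals) / len(residuals))
```

Sorting these residuals and plotting them against uniform quantiles gives the kind of QQ plot described in the abstract; in practice the means would come from a fitted model rather than from known coefficients.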


Lu Yang, University of Minnesota

Rank Intraclass Correlation for Clustered Data

Clustered data are common in practice. Observations in the same cluster are often more similar to each other than to those from other clusters. The intraclass correlation coefficient (ICC), first introduced by Fisher, is frequently used to measure this similarity. However, the ICC is sensitive to extreme values and skewed distributions, and depends on the scale of the data. It is also not applicable to ordered categorical data. We define the rank ICC as a natural extension of Fisher's ICC to the rank scale, and describe its corresponding population parameter. We also extend the definition to data with more than two hierarchical levels. We describe estimation and inference procedures, conduct simulations to evaluate the performance of our method, and illustrate it with real-data examples involving skewed data, count data, and three-level data.
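The idea of moving Fisher's ICC to the rank scale can be sketched as follows. This is a simplified illustration under stated assumptions and not the authors' estimator: observations are replaced by midranks pooled over all clusters, and Fisher's pairwise ICC (the Pearson correlation over all ordered within-cluster pairs) is computed on those ranks, with equal-sized clusters assumed so that no cluster weighting is needed. The function names and the toy data are hypothetical.

```python
import math


def midranks(values):
    """Ranks 1..n with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pairwise_icc(clusters):
    """Fisher's pairwise ICC, assuming equal-sized clusters."""
    flat = [v for cluster in clusters for v in cluster]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    cross, pairs = 0.0, 0
    for cluster in clusters:
        for a in range(len(cluster)):
            for b in range(len(cluster)):
                if a != b:
                    cross += (cluster[a] - mean) * (cluster[b] - mean)
                    pairs += 1
    return cross / (pairs * var)


def rank_icc(clusters):
    """Pairwise ICC computed on pooled midranks instead of raw values."""
    flat = [v for cluster in clusters for v in cluster]
    ranks = midranks(flat)
    ranked, start = [], 0
    for cluster in clusters:
        ranked.append(ranks[start:start + len(cluster)])
        start += len(cluster)
    return pairwise_icc(ranked)


# Hypothetical skewed data with one extreme value in the second cluster.
data = [[1.0, 2.0, 3.0], [10.0, 12.0, 1000.0], [40.0, 50.0, 60.0]]
print(pairwise_icc(data), rank_icc(data))
```

Because only the ordering of the observations enters the calculation, the rank-based value is unchanged by any strictly increasing transformation of the data (for example, taking logarithms), and it remains defined for ordered categories coded as integers.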


Chun Li, University of Southern California