Contributed Poster Presentations: Section on Statistical Learning and Data Science

Chair: Jacob Bien
University of Southern California
 
Tuesday, Aug 8: 10:30 AM - 12:20 PM
Contributed Posters 
Metro Toronto Convention Centre 
Room: CC-Hall E 

Main Sponsor

Section on Statistical Learning and Data Science

Presentations

29 Accelerate Recurrent Neural Networks via Decomposition and Parallelization

Recurrent neural networks (RNNs) have been widely used for sequence modeling tasks. In particular, they excel at capturing recurrent patterns in data. This paper reveals that the weight matrix for the previous hidden state determines the recurrent dynamics an RNN can provide. Using the real Jordan decomposition, we show that the recurrent dynamics can be decoupled, with each component essentially driven by the eigenvalues of the weight matrix. A novel concept of recurrence features is formally defined. We also demonstrate that an RNN is equivalent to a combination of a series of small RNNs with negligible loss. Correspondingly, this project proposes an RNN surrogate, the Parallelized RNN (ParaRNN), in which the constituent small RNNs are run in parallel. Training the ParaRNN is faster than training the vanilla RNN, and the hidden state of the original RNN can be recovered by aggregating all the small hidden states. Furthermore, we generalize our experience with the vanilla RNN and propose the Segregate-Parallelize-Aggregate framework to accelerate a broader range of recurrent neural networks.
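
The decoupling argument is easiest to see for a linear recurrence. The sketch below (a minimal illustration, not the authors' implementation) diagonalizes the recurrent weight matrix of a linear RNN, runs one small scalar recurrence per eigenvalue, and recovers the original hidden states by aggregation; the real Jordan decomposition in the paper covers the general, possibly non-diagonalizable case, and all names here are illustrative.

```python
# Minimal sketch: decoupling a linear recurrence h_t = W h_{t-1} + U x_t.
# Assumes W is diagonalizable; complex eigenvalue pairs are handled in
# complex arithmetic and the imaginary parts cancel on aggregation.
import numpy as np

rng = np.random.default_rng(0)
d, p, T = 4, 3, 50
W = 0.3 * rng.normal(size=(d, d))   # hidden-to-hidden weights
U = rng.normal(size=(d, p))         # input-to-hidden weights
X = rng.normal(size=(T, p))         # input sequence

def vanilla_rnn(W, U, X):
    h, out = np.zeros(W.shape[0]), []
    for x in X:
        h = W @ h + U @ x
        out.append(h)
    return np.array(out)

# Segregate: z_t = D z_{t-1} + (P^{-1} U) x_t, one scalar recurrence per eigenvalue.
eigvals, P = np.linalg.eig(W)
V = np.linalg.inv(P) @ U

def small_rnn(lam, v, X):
    z, out = 0.0, []
    for x in X:
        z = lam * z + v @ x         # independent scalar recurrence
        out.append(z)
    return np.array(out)

# Parallelize: each small recurrence could run on its own worker; here we loop.
Z = np.column_stack([small_rnn(eigvals[k], V[k], X) for k in range(d)])

# Aggregate: recover the original hidden states.
H = (Z @ P.T).real
assert np.allclose(H, vanilla_rnn(W, U, X), atol=1e-6)
```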

Keywords

Machine Learning

Neural Networks

Recurrent Dynamics 

Co-Author(s)

Feiqing Huang, The University of Hong Kong
Kexin Lu, The University of Hong Kong
Guodong Li, University of Hong Kong

First Author

Yuxi CAI

Presenting Author

Yuxi CAI

31 Estimation and Inference on Pointwise Mutual Information Matrix

We modify a latent vector model to study the conditional dependency structures among the codes in electronic health records (EHR). We derive a low-rank estimator for the pointwise mutual information (PMI) matrix under our data-generation model, based on the empirical PMI matrix. The statistical rates and asymptotic normality of the proposed estimators are established, and we also characterize the relationship between the latent embedding vectors and the PMI matrix. Numerical results from simulation studies and EHR data suggest that the testing method based on our proposed low-rank PMI estimator outperforms existing testing methods based on the empirical estimator and on cosine similarity.
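
For readers unfamiliar with PMI matrices, the sketch below shows the two ingredients the abstract refers to: an empirical PMI matrix built from co-occurrence counts, and a rank-r truncation of it. The data, rank, and SVD-based truncation are illustrative placeholders; the proposed estimator and its inference procedure are more involved.

```python
# Illustrative only: empirical PMI from toy co-occurrence counts, then a
# rank-r approximation via the SVD.
import numpy as np

rng = np.random.default_rng(1)
d = 20                                  # number of EHR codes (toy)
C = rng.poisson(5.0, size=(d, d))       # toy co-occurrence counts
C = C + C.T                             # symmetrize

total = C.sum()
p_joint = C / total
p_marg = C.sum(axis=1) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_joint) - np.log(np.outer(p_marg, p_marg))
pmi[~np.isfinite(pmi)] = 0.0            # guard cells with zero counts

r = 3                                   # working rank (placeholder)
U, s, Vt = np.linalg.svd(pmi)
pmi_lowrank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
print(np.linalg.norm(pmi - pmi_lowrank, "fro"))
```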

Keywords

large-scale inference

Electronic health records

latent vector model

random matrix 

Co-Author(s)

Shuting Shen, Harvard University
Ziming Gan
Doudou Zhou
Junwei Lu, Harvard T.H. Chan School of Public Health
Tianxi Cai, Harvard University

First Author

Zhiwei Xu

Presenting Author

Zhiwei Xu

32 Improving Variable Selection in Beta Regression Models using Correlated Resampling

Technological innovations allow large amounts of data to be collected for a single observation. As a result, practitioners often face problems in which the number of variables exceeds the number of observations. Such situations arise in many fields, from the basic sciences to the social sciences, and variable selection techniques are required.

Moreover, the correlation between variables cannot be neglected since it is often very high, and variable selection methods often fail to distinguish between informative and uninformative variables. Therefore, variable selection has become one of the major challenges in statistics. Yet, if the number of variables greatly exceeds the number of observations or if the variables are highly correlated, the performance of variable selection methods is usually limited in terms of recall and precision.

We propose a general algorithm that improves variable selection in data sets with correlated variables by using the correlation structure to select reliable variables in parsimonious or non-parsimonious Beta regression models. It improves the performance of many existing models, as demonstrated on simulated and real datasets.
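
As a rough illustration of resampling-based selection (with strong caveats: an ordinary Lasso stands in for the penalized Beta regression, and the correlation-aware resampling step that is central to the proposed algorithm is not reproduced), variables can be ranked by how often they are selected across bootstrap resamples:

```python
# Hedged sketch: selection frequencies across bootstrap resamples.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 60, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.0          # 5 informative variables
y = X @ beta + rng.normal(scale=0.5, size=n)

B, freq = 100, np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n, replace=True)       # bootstrap resample
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    freq += (fit.coef_ != 0)
selected = np.where(freq / B > 0.8)[0]              # keep stably selected variables
print(selected)
```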

Keywords

Variable selection

Beta regression models

Correlated resampling

Sparse regression

Penalized regression 

Co-Author

Myriam Maumy-Bertrand, Universite De Technologie De Troyes

First Author

Frederic Bertrand, University of Technology of Troyes

Presenting Author

Frederic Bertrand, University of Technology of Troyes

34 Integrative Factorization for Multiple Linked Tensors

Advances in biomedical technologies generate high-content biomedical data that are both multi-way and multi-source. Integrative analysis of these data sets has the potential to capture and synthesize different facets of complex biological systems. However, such studies are limited by current statistical models. In this work, we propose a Multiple Linked Tensors Factorization (MULTIFAC) method that extends the CANDECOMP/PARAFAC decomposition to simultaneously reduce the dimension of multiple multi-way arrays and approximate the underlying true signal. The model can automatically reveal latent structures that are individual to, or shared across, subsets of data sources. We also extend the algorithm to an expectation-maximization (EM) version to handle incomplete data. Extensive simulation studies demonstrate MULTIFAC's ability to (i) approximate the underlying signal, (ii) identify shared and unshared structures, and (iii) impute missing data. We apply our method to integrate two omics data sources (metabolomic and proteomic) across two tissue compartments (blood and cerebrospinal fluid) and multiple developmental time points for a study on iron deficiency.
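
As background, the sketch below runs a plain CANDECOMP/PARAFAC (CP) decomposition on a single noisy tensor with tensorly; MULTIFAC extends this idea to several linked tensors with shared and individual components and an EM step for missing entries, none of which is reproduced here.

```python
# Plain CP decomposition of one tensor (background illustration only).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(3)
A, B, C = (rng.normal(size=(s, 2)) for s in (10, 8, 6))
signal = np.einsum("ir,jr,kr->ijk", A, B, C)        # rank-2 ground truth
X = tl.tensor(signal + 0.1 * rng.normal(size=signal.shape))

weights, factors = parafac(X, rank=2)               # CP factors, one per mode
X_hat = tl.cp_to_tensor((weights, factors))         # low-rank reconstruction
print(float(tl.norm(X - X_hat) / tl.norm(X)))       # relative approximation error
```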

Keywords

CANDECOMP/PARAFAC decomposition

dimension reduction

data integration

missing data imputation

tensor 

Co-Author

Eric Lock, University of Minnesota

First Author

Zhiyu Kang

Presenting Author

Zhiyu Kang

37 Machine Learning methods for bias correction and precision optimization using covariate adjustment

In randomized trials, adjustment for pre-specified baseline covariates results in efficiency gains for estimated treatment effects. With complete data, the effect estimate remains unbiased even when the adjustment model is misspecified. When outcome data are missing, however, misspecification of the adjustment model can lead to biased treatment effect estimates. This paper investigates the use of machine learning (ML) for the adjustment model and addresses two questions. For complete data, we investigate whether ML improves efficiency gains relative to a misspecified adjustment model. Here we find that improvements are directly related to the proportion of variation explained by the baseline covariates under the correct model. For missing data, we examine whether using ML can improve efficiency while avoiding bias attributable to model misspecification. Similar findings hold for missing data, with the degree of bias correction depending on the missing-data mechanism. The methods and findings are illustrated in simulation studies and an application to a randomized trial, and can provide additional guidance on the appropriate use of covariate adjustment in randomized trials.
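
A generic sketch of covariate adjustment with an ML outcome model is given below; it is a standardization (g-computation) estimator with a gradient-boosting outcome model on simulated complete data, offered only to fix ideas, and is not the paper's estimator or simulation design.

```python
# Covariate-adjusted vs unadjusted treatment-effect estimates (illustration).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 0.5, size=n)                      # randomized treatment
y = 1.0 * A + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(size=n)

# Fit a flexible outcome model within each arm, then average predictions
# over all covariate values (standardization).
m1 = GradientBoostingRegressor().fit(X[A == 1], y[A == 1])
m0 = GradientBoostingRegressor().fit(X[A == 0], y[A == 0])
ate_adjusted = np.mean(m1.predict(X) - m0.predict(X))
ate_unadjusted = y[A == 1].mean() - y[A == 0].mean()
print(ate_unadjusted, ate_adjusted)
```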

Keywords

Machine learning

Covariate adjustment

Bias correction

Precision optimization

Model specification

Missing data in randomized trials 

Co-Author

Joseph Hogan, Brown University

First Author

Amos Okutse, Brown University

Presenting Author

Amos Okutse, Brown University

38 MoatPlus: Price Anomaly Detection System at Scale for E-commerce

Online marketplaces execute a large volume of price updates initiated by individual marketplace sellers each day. This price democratization comes with growing data-quality challenges. The lack of the centralized guardrails available to a traditional online retailer makes it more likely that inaccurate prices are published on the website, leading to poor customer experience and potential revenue loss. We present MoatPlus (Masked Optimal Anchors using Trees, Proximity-based Labeling and Unsupervised Statistical-features), a scalable price anomaly detection framework for a growing marketplace platform. The goal is to leverage proximity and historical price trends from unsupervised statistical features to generate an upper price bound. We build an ensemble of models to detect irregularities in price-based features, exclude irregular features, and use an optimized weighting scheme to build a reliable price bound in a real-time pricing pipeline. We found that our approach reduces false negatives, improving system recall by 21% while maintaining high precision.
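
The notion of an upper price bound can be illustrated with a toy guardrail; the sketch below uses a simple quantile of historical prices with a fixed slack, whereas MoatPlus builds the bound from an ensemble with masking and an optimized weighting scheme. Thresholds and function names are placeholders.

```python
# Toy upper price bound and anomaly flag (illustration only).
import numpy as np

def upper_price_bound(history, q=0.95, slack=1.10):
    """Quantile-based guardrail with a small multiplicative slack (arbitrary choices)."""
    return slack * np.quantile(history, q)

history = np.array([19.99, 20.49, 21.00, 19.50, 20.75, 22.00])
bound = upper_price_bound(history)

for new_price in (21.50, 89.99):
    flag = "ANOMALY" if new_price > bound else "ok"
    print(f"price={new_price:.2f} bound={bound:.2f} -> {flag}")
```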

Keywords

e-commerce

anomaly detection

data mining

machine learning

big data

streaming system 

Co-Author(s)

Qiwen Kang
Lijie Wan, Walmart Lab, Sunnyvale

First Author

Akshit Sarpal, Walmart Labs

Presenting Author

Akshit Sarpal, Walmart Labs

39 Model Improvement with Machine Learning Techniques

Machine Learning techniques such as Decision Trees (CART), Bagging, Boosting, Random Forest, Support Vector Machines (SVM), and Naïve Bayes methods are used to improve the predictions of classification models. Case studies on customer churn will be discussed, and the accuracy of the different types of models will be compared using ROC curves.
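
A minimal version of this comparison workflow is sketched below on a synthetic churn-like dataset (a placeholder, not the case-study data), scoring two of the listed model types by ROC AUC:

```python
# Compare two classifiers by ROC AUC on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    model.fit(Xtr, ytr)
    auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```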

Keywords

Machine Learning

ROC Curve

Logistic Regression

Random Forest

Support Vector Machine 

Co-Author

Jacob Callahan, Minnesota State University

First Author

Deepshikha Sanjel, Minnesota State University, Mankato

Presenting Author

Deepshikha Sanjel, Minnesota State University, Mankato

40 PLS models and their extension for big data

Fitting Cox models in a big data context (on a massive scale in terms of volume, intensity, and complexity, exceeding the capacity of the usual analytic tools) is often challenging. If some data are missing, it is even more difficult. We proposed algorithms that can fit Cox models in high-dimensional settings using extensions of partial least squares regression to the Cox model. Some of them are able to cope with missing data. We were recently able to extend our most recent algorithms to big data, thus allowing us to fit Cox models to big data with missing values.

In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted.

Bertrand and Maumy (2021). Fitting and Cross-Validating Cox Models to Censored Big Data With Missing Values Using Extensions of Partial Least Squares Regression Models. Front. Big Data 4:684794.

Keywords

Cox Models

Censored data

Big Data

Partial Least Squares Regression 

Co-Author

Frederic Bertrand, University of Technology of Troyes

First Author

Myriam Maumy-Bertrand, Universite De Technologie De Troyes

Presenting Author

Myriam Maumy-Bertrand, Universite De Technologie De Troyes

41 Post-selection inference for e-value based confidence intervals

Suppose that one can construct a valid (1-δ)-confidence interval (CI) for each of K parameters of potential interest. If a data analyst uses an arbitrary data-dependent criterion to select some subset S of parameters, then the aforementioned CIs for the selected parameters are no longer valid due to selection bias. We design a new method to adjust the intervals in order to control the false coverage rate (FCR).
The main established method is the "BY procedure" by Benjamini and Yekutieli (JASA, 2005). Unfortunately, the BY guarantees require certain restrictions on the selection criterion and on the dependence between the CIs. We propose a natural and much simpler method which is valid under any dependence structure between the original CIs, and any (unknown) selection criterion, but which only applies to a special, yet broad, class of CIs. Our procedure reports (1-δ|S|/K)-CIs for the selected parameters, and we prove that it controls the FCR at δ for confidence intervals that implicitly invert *e-values*; examples include those constructed via supermartingale methods, via universal inference, or via Chernoff-style bounds on the moment generating function, among others. 
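
The reported adjustment is simple to apply in practice: after selection, each chosen parameter's interval is recomputed at miscoverage level δ|S|/K instead of δ. The sketch below illustrates this with Hoeffding-type intervals for bounded means, one instance of the Chernoff-style bounds mentioned above; the data and the selection rule are arbitrary placeholders.

```python
# Post-selection level adjustment: report (1 - delta*|S|/K) CIs for selected parameters.
import numpy as np

rng = np.random.default_rng(5)
K, n, delta = 50, 200, 0.05
data = rng.beta(2, 5, size=(K, n))                 # K independent samples (toy)

def hoeffding_ci(x, alpha):
    # Chernoff/Hoeffding-style interval for the mean of [0, 1]-bounded data.
    half = np.sqrt(np.log(2 / alpha) / (2 * len(x)))
    return x.mean() - half, x.mean() + half

means = data.mean(axis=1)
S = np.where(means > 0.30)[0]                      # arbitrary data-dependent selection
alpha_adj = delta * len(S) / K                     # adjusted miscoverage level

for k in S:
    lo, hi = hoeffding_ci(data[k], alpha_adj)
    print(f"parameter {k}: ({lo:.3f}, {hi:.3f}) at level {1 - alpha_adj:.4f}")
```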

Keywords

Benjamini-Yekutieli procedure

false coverage rate

Bayes factor

A/B testing 

Co-Author(s)

Ruodu Wang, University of Waterloo
Aaditya Ramdas, Carnegie Mellon University

First Author

Ziyu Xu

Presenting Author

Ziyu Xu

42 Random effects expectation-maximization (RE-EM) tree: A catalyst in classifying longitudinal data

The random effects expectation-maximization (RE-EM) tree, a tree-based data mining tool, accounts for the within-subject correlation structure in longitudinal data. It partitions the time space into smaller segments to achieve homogeneity in the response, thereby serving as an efficient method for approximating knots when fitting a piecewise mixed effects model to unbalanced longitudinal data. Successful application of the recently introduced post-hoc mixture modeling of BLUPs for the classification of unbalanced longitudinal data requires an optimal approximation of the knots for fitting a piecewise linear mixed effects model. Applying the RE-EM tree to a dataset of early childhood growth patterns detected three knots, which we used to fit a piecewise linear mixed effects model. Post-hoc mixture modeling of the BLUPs from the piecewise mixed effects model produced distinct trajectories of early-childhood pathways to obesity.

Keywords

Random effects expectation-maximization

Change point

Piecewise mixed effects model

Longitudinal unbalanced data

Cluster analysis

Tree-based data mining 

First Author

Md Jobayer Hossain, Nemours Biomedical Research, A.I. DuPont Children's Hospital

Presenting Author

Araf Jahin

43 ReBoot: Distributed statistical learning via refitting Bootstrap samples

With the data explosion in the digital era, it is common for data to be distributed across multiple sites. However, there are two challenges in analyzing decentralized data: (a) communication of large-scale data between sites is expensive and inefficient; (b) data are not allowed to be shared for privacy or legal reasons. To overcome these challenges, we propose a one-shot distributed learning algorithm via refitting Bootstrap samples, which we refer to as ReBoot. Theoretically, we analyze the statistical rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. ReBoot achieves the full-sample statistical rate in both cases whenever the subsample size is not too small. We show that the systematic bias of ReBoot, the error that is independent of the number of subsamples, is O(n^-2) in GLM, where n is the subsample size. A simulation study illustrates the statistical advantage of ReBoot over competing methods. In addition, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks; it outperforms FedAvg in the early rounds of communication.
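
A heavily simplified sketch of the one-shot idea for logistic regression is given below; how ReBoot actually regenerates covariates and responses, and its theoretical guarantees, are not reproduced, and every detail here (distributions, sample sizes, use of sklearn) is an assumption made for illustration.

```python
# One-shot aggregation by refitting on pseudo-data simulated from local fits.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
p, n_site, n_sites = 5, 500, 8
beta = rng.normal(size=p)

def simulate_site(n):
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
    return X, y

# Each site fits locally and shares only its fitted model.
local_models = [LogisticRegression(max_iter=1000).fit(*simulate_site(n_site))
                for _ in range(n_sites)]

# The center draws bootstrap-style pseudo-samples from each local fit and refits once.
Xb_all, yb_all = [], []
for m in local_models:
    Xb = rng.normal(size=(n_site, p))                    # regenerated covariates (assumption)
    yb = rng.binomial(1, m.predict_proba(Xb)[:, 1])      # responses from the local model
    Xb_all.append(Xb); yb_all.append(yb)
refit = LogisticRegression(max_iter=1000).fit(np.vstack(Xb_all), np.concatenate(yb_all))
print(np.round(refit.coef_.ravel(), 2), np.round(beta, 2))
```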

Keywords

Distributed Learning

One-Shot Aggregation

Generalized Linear Models

Phase Retrieval

Model Aggregation 

Co-Author

Ziwei Zhu, University of Michigan, Ann Arbor

First Author

Yumeng Wang

Presenting Author

Yumeng Wang

44 Targeted Optimal Treatment Regime Learning Using Summary Statistics

Personalized decision-making, which aims to derive optimal individualized treatment rules (ITRs) based on individual characteristics, has recently attracted increasing attention in many fields. The current literature mainly focuses on estimating ITRs from a single source population. In real-world applications, the distribution of a target population can differ from that of the source population. Due to privacy concerns and other practical issues, individual-level data from the target population are often not available. We consider an ITR estimation problem in which the source and target populations may be heterogeneous, individual-level data are available from the source population, and only summary information about the covariates is accessible from the target population. We develop a weighting framework that tailors an ITR to a given target population by leveraging the available summary statistics. Both the empirical performance and the theoretical properties of the proposed estimators are examined.
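
One way to picture the weighting step, consistent with the entropy-balancing keyword below but not necessarily identical to the proposed estimator, is to calibrate weights on the source sample so that weighted covariate means match the target population's summary statistics:

```python
# Entropy-balancing-style calibration weights matching target summary means (sketch).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p = 500, 3
X = rng.normal(size=(n, p))                     # source-population covariates
target_means = np.array([0.5, -0.2, 0.1])       # summary statistics from the target

def dual(lam):
    # Dual of the entropy-balancing problem: weights are proportional to exp(X @ lam).
    return np.log(np.mean(np.exp(X @ lam))) - lam @ target_means

lam_hat = minimize(dual, np.zeros(p), method="BFGS").x
w = np.exp(X @ lam_hat); w /= w.sum()
print(np.round(w @ X, 3), target_means)         # weighted source means ~ target means
```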

Keywords

Covariate shift

Double robustness

Empirical likelihood

Entropy balancing

Multi-source policy learning 

Co-Author(s)

Wenbin Lu, North Carolina State University
Shu Yang, North Carolina State University, Department of Statistics

First Author

Jianing Chu, North Carolina State University

Presenting Author

Jianing Chu, North Carolina State University

46 Valid Inference for Machine Learning Model Parameters

The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this can come with the risk of overtraining; in order for the model to generalize well, it is of great importance that we are able to find the optimal parameter for the model on the entire population---not only on the given training sample. In this paper, we construct valid confidence sets for this optimal parameter of a machine learning model, which can be generated using only the training data without any knowledge of the population. We then show that studying the distribution of this confidence set allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well-approximated using bootstrapping techniques. 
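
As a point of contrast with the proposed construction, the sketch below shows the plain bootstrap approximation alluded to at the end of the abstract: refit the model on resampled training data and summarize the spread of the learned parameter. It is a generic illustration, not the paper's confidence-set procedure.

```python
# Bootstrap spread of a learned parameter (generic illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 400
X = rng.normal(size=(n, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * X[:, 0])))

boots = []
for _ in range(500):
    idx = rng.choice(n, n, replace=True)                  # resample the training data
    fit = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    boots.append(fit.coef_[0, 0])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"bootstrap 95% interval for the slope: ({lo:.2f}, {hi:.2f})")
```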

Keywords

PAC-Learning

Glivenko-Cantelli Classes

Random Sets

Imprecise Probability

Hypothesis Testing 

Co-Author

Jonathan Williams, North Carolina State University

First Author

Neil Dey

Presenting Author

Neil Dey

47 Wasserstein Tangential PCA for Probability Measures

This presentation is concerned with learning variations of probability measures in the Wasserstein space. We introduce a spectral method, termed Wasserstein tangential principal component analysis (WT-PCA), to capture the local principal modes of geodesic variation of a collection of absolutely continuous probability measures at their barycenter.
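
In one dimension the construction can be made concrete: the log map at the barycenter can be expressed through quantile functions, so a simplified version of WT-PCA amounts to a PCA of centered quantile functions on a common grid. The sketch below is this simplified, assumption-laden version (it ignores the weighting by the barycenter measure and works with empirical quantiles), not the general method.

```python
# 1-D simplification: PCA of centered quantile functions at the barycenter.
import numpy as np

rng = np.random.default_rng(9)
m, grid = 30, np.linspace(0.01, 0.99, 99)
# A collection of 1-D measures (Gaussians with varying location and scale),
# each represented by its empirical quantile function on the grid.
samples = [rng.normal(loc=rng.uniform(-1, 1), scale=rng.uniform(0.5, 2), size=1000)
           for _ in range(m)]
Q = np.array([np.quantile(s, grid) for s in samples])

Q_bar = Q.mean(axis=0)                       # quantile function of the (1-D) barycenter
V = Q - Q_bar                                # tangent-space representatives
U, s, Vt = np.linalg.svd(V, full_matrices=False)
print("variance explained by the first two modes:",
      np.round(s[:2] ** 2 / (s ** 2).sum(), 3))
```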

Keywords

Principal component analysis

Wasserstein covariance operator

optimal transport 

Co-Author(s)

Xiaohui Chen, University of Illinois at Urbana-Champaign
Young-Heon Kim, University of British Columbia

First Author

Peng Xu

Presenting Author

Peng Xu

WITHDRAWN Alternative Estimators in the Poisson Inverse Gaussian Regression Model with Multicollinearity

To address the over-dispersion problem, the Poisson Inverse Gaussian Regression model (PIGRM) is applied to the modeling of count datasets. The PIGRM parameters are typically estimated using the maximum likelihood estimator (MLE). When the explanatory variables in the PIGRM are correlated, the MLE does not produce useful findings. In this work, some biased estimators, namely the Stein, ridge, Liu, and modified Liu estimators, are adapted to resolve the issue of multicollinearity in the PIGRM. These biased estimators behave differently for different models, which is why they are considered for the PIGRM in order to identify the best one. Every biased estimator has a biasing parameter with some limitations. Additionally, this study proposes some biasing parameters for the Stein estimator. The performance of the considered biased estimators is evaluated with the help of a simulation study under different parametric conditions and a real-life application, based on the minimum mean squared error criterion. The simulation and application findings favor the ridge estimator with specific biasing parameters because it provides less variation than the Stein, Liu, and modified Liu estimators.

Keywords

Poisson Inverse Gaussian Regression Model 

First Author

Muhammad Amin, University of Sargodha

WITHDRAWN Inference for nonstationary economic time series

Many data in economics are observed over time, admitting temporal correlation and also exhibiting persistent upward and downward movements. A relaxation of standard assumptions is nonstationarity modeled through locally stationary processes with a smoothly varying trend.
This talk will present novel estimators for high-dimensional autocovariance and precision matrices that use the local stationarity property. The estimators are used to derive consistent predictors for nonstationary time series. Besides some theoretical verification, we illustrate the finite-sample properties of the new methodology with a simulation study and an application to economic data.

Keywords

time series

nonstationarity

economics 

Co-Author

David Matteson, Cornell University

First Author

Marie-Christine Duker, Cornell University

WITHDRAWN IntelliCare: Unified Intelligence for Predictive and Explanatory Care in Human Aging

Aging is a complex process that affects organisms differently, and chronological age does not always align with biological age. The prediction of human age is becoming increasingly important in various fields. Given the variety of data modalities in existence, e.g., facial images, brain MRI, and disease diagnoses, a multi-modal approach that can account for different biomarkers will be crucial and groundbreaking. In our work, we propose a unified model, "IntelliCare", a pre-trained transformer framework for predictive and explanatory care in human aging. Moreover, multi-modal learning for medical imaging faces subgroup distribution shifts in medical data, which can arise from unlabeled subclasses inside every superclass, causing hidden stratification. We use explanations to improve model robustness against subgroup distribution shifts; robustness is therefore a core aspect of model quality that is essential for ensuring explainability. We evaluate IntelliCare on 10 datasets for pretraining and on 5 datasets for fine-tuning in human aging.

Keywords

Age prediction

Distribution shifts

Multi-modal learning

Representation learning

Robustness

Transfer learning 

Co-Author(s)

Jun Yu
Kai Zhang, Lehigh University

First Author

Eashan Adhikarla, Department of Computer Science and Engineering, Lehigh University

WITHDRAWN The Weighted Latent Space Edge Clustering Model for Network Data

Networks arise as dominant structures in many fields. In network analysis, community detection, the unsupervised clustering of the actors, is critical for understanding network structure. Although various statistical models have been developed for community detection, very little work has been devoted to edge clustering in comparison with traditional node clustering approaches. In particular, no existing methods leverage edge weights when clustering edges.
Thus, we propose the Weighted Latent Space Edge Clustering (WLSEC) model, which addresses this methodological research gap by clustering weighted directed edges. The WLSEC model is built from the latent space model, where the probability of an edge between nodes and the weight of that edge depend on the features of both nodes and the latent environments. We then propose a generalized EM (GEM) algorithm and gradient-based Monte Carlo algorithms to estimate the WLSEC model. We evaluate the performance of the WLSEC model through both simulation studies and real-world networks. Compared with the unweighted latent space edge clustering model, the WLSEC model achieves a significant improvement in accuracy.

Keywords

Network Analysis

Clustering

Community detection

Latent space models 

Co-Author

Daniel Sewell, University of Iowa

First Author

Haomin Li, University of Iowa

Presenting Author

Haomin Li, University of Iowa