Tuesday, Aug 8: 10:30 AM - 12:20 PM
Contributed Posters
Metro Toronto Convention Centre
Room: CC-Hall E
Main Sponsor
Section on Statistical Learning and Data Science
Presentations
Recurrent neural networks (RNNs) have been widely used for sequence modeling tasks; in particular, they excel at capturing recurrent patterns in data. This paper reveals that the weight matrix for the previous hidden state determines the recurrent dynamics an RNN can express. Using the real Jordan decomposition, we show that the recurrent dynamics can be decoupled, with each component essentially driven by the eigenvalues of the weight matrix. A novel concept of recurrence features is formally defined. We also demonstrate that an RNN is equivalent, with negligible loss, to a combination of a series of small RNNs. Correspondingly, this project proposes an RNN surrogate, the Parallelized RNN (ParaRNN), in which the constituent small RNNs run in parallel. Training the ParaRNN is faster than training the vanilla RNN, and the hidden state of the original RNN can be recovered by aggregating all the small hidden states. Furthermore, we generalize our experience with the vanilla RNN and propose the Segregate-Parallelize-Aggregate framework to accelerate a broader range of recurrent neural networks.
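As a minimal illustration of the decoupling idea (not the authors' implementation), the sketch below uses a linear recurrence with a block-diagonal hidden-to-hidden matrix: running each block as its own small recurrence "in parallel" and concatenating the results recovers the full hidden state. All names, sizes, and the linear (rather than nonlinear) recurrence are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_h = 20, 3, 6
blocks = [2, 2, 2]                                  # sizes of the small hidden states

# Block-diagonal hidden-to-hidden matrix: the recurrence decouples across blocks.
Ws = [rng.normal(scale=0.5, size=(b, b)) for b in blocks]
W = np.zeros((d_h, d_h))
offsets = np.cumsum([0] + blocks)
for Wb, s, e in zip(Ws, offsets[:-1], offsets[1:]):
    W[s:e, s:e] = Wb
U = rng.normal(size=(d_h, d_in))
x = rng.normal(size=(T, d_in))

def run(W, U, x):
    """Linear recurrence h_t = W h_{t-1} + U x_t, returning all hidden states."""
    h = np.zeros(W.shape[0])
    hs = []
    for t in range(x.shape[0]):
        h = W @ h + U @ x[t]
        hs.append(h.copy())
    return np.array(hs)

# Full recurrence vs. the small recurrences run independently, then aggregated.
H_full = run(W, U, x)
H_small = [run(Wb, U[s:e], x) for Wb, s, e in zip(Ws, offsets[:-1], offsets[1:])]
assert np.allclose(H_full, np.hstack(H_small))      # aggregation recovers the hidden state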
Keywords
Machine Learning
Neural Networks
Recurrent Dynamics
We modify a latent vector model to study the conditional dependency structures among the codes in the electronic health record (EHR). We derive a low-rank estimator for the pointwise mutual information (PMI) matrix under our data generation model based on the empirical PMI matrix. The statistical rates and asymptotic normality of the proposed estimators are established, and we also show the relationship between the latent embedding vectors and the PMI matrix. Numerical results from simulation studies and EHR data suggest that the testing method based on our proposed low-rank PMI estimator outperforms existing testing methods based on the empirical estimator and cosine similarity.
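A minimal sketch of the two ingredients named above, assuming a symmetric co-occurrence count matrix of EHR codes is available (the matrix, rank, and names are hypothetical): the empirical PMI matrix and a rank-r estimator obtained by truncating its SVD.

import numpy as np

def empirical_pmi(C, eps=1e-8):
    """Empirical PMI from a symmetric co-occurrence count matrix C."""
    total = C.sum()
    p_joint = C / total
    p_marg = C.sum(axis=1) / total
    return np.log((p_joint + eps) / (np.outer(p_marg, p_marg) + eps))

def low_rank_pmi(pmi, r):
    """Rank-r approximation of the empirical PMI matrix via truncated SVD."""
    U, s, Vt = np.linalg.svd(pmi)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
C = rng.integers(1, 50, size=(30, 30))
C = C + C.T                                   # symmetric code co-occurrence counts
pmi_hat = empirical_pmi(C)
pmi_low_rank = low_rank_pmi(pmi_hat, r=5)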
Keywords
large-scale inference
Electronic health records
latent vector model
random matrix
Technological innovations allow large amounts of data to be collected on a single observation. As a result, practitioners often face problems in which the number of variables exceeds the number of observations. Such situations arise in many fields, from the basic sciences to the social sciences, and variable selection techniques are required.
Moreover, the correlation between variables cannot be neglected: it is often very high and causes variable selection methods to fail to distinguish between informative and uninformative variables. Variable selection has therefore become one of the major challenges in statistics. Yet, if the number of variables greatly exceeds the number of observations, or if the variables are highly correlated, the performance of variable selection methods is usually limited in terms of recall and precision.
We propose a general algorithm that uses the correlation structure to select reliable variables in parsimonious or non-parsimonious Beta regression models, thereby improving model selection in data sets with correlated variables. It improves the performance of many existing models, as demonstrated on simulated and real datasets.
Keywords
Variable selection
Beta regression models
Correlated resampling
Sparse regression
Penalized regression
Advances in biomedical technologies generate high-content biomedical data that are both multi-way and multi-source. Integrative analysis of these data sets has the potential to capture and synthesize different facets of a complex biological system. However, such studies are limited by current statistical models. In this work, we propose a Multiple Linked Tensors Factorization (MULTIFAC) method that extends the CANDECOMP/PARAFAC decomposition to simultaneously reduce the dimension of multiple multi-way arrays and approximate the underlying true signal. The model can automatically reveal latent structures that are specific to individual data sources or shared across subsets of data sources. We also extend the algorithm to an expectation-maximization (EM) version to handle incomplete data. Extensive simulation studies demonstrate MULTIFAC's ability to (i) approximate the underlying signal, (ii) identify shared and unshared structures, and (iii) impute missing data. We apply our method to integrate two omics data sources (metabolomic and proteomic) across two tissue compartments (blood and cerebrospinal fluid) and multiple developmental time points for a study on iron deficiency.
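For readers unfamiliar with the building block, the sketch below runs a plain CANDECOMP/PARAFAC decomposition of a single simulated three-way array with the tensorly library (assuming a recent tensorly version); it is not the MULTIFAC model itself, which links several such factorizations and handles missing entries via EM, and all sizes and names are illustrative.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
# Hypothetical array: subjects x features x time points, rank-3 signal plus noise.
A = rng.normal(size=(40, 3))
B = rng.normal(size=(25, 3))
C = rng.normal(size=(6, 3))
signal = np.einsum('ir,jr,kr->ijk', A, B, C)
X = signal + 0.1 * rng.normal(size=signal.shape)

# Rank-3 CP decomposition and low-rank reconstruction of the underlying signal.
cp = parafac(tl.tensor(X), rank=3, n_iter_max=200)
X_hat = tl.cp_to_tensor(cp)
print(np.linalg.norm(X_hat - signal) / np.linalg.norm(signal))   # relative recovery error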
Keywords
CANDECOMP/PARAFAC decomposition
dimension reduction
data integration
missing data imputation
tensor
In randomized trials, adjustment for pre-specified baseline covariates yields efficiency gains for estimated treatment effects. The effect estimate remains unbiased for complete data even when the adjustment model is misspecified. When outcome data are missing, however, misspecification of an adjustment model can lead to biased treatment effect estimates. This paper investigates the use of machine learning (ML) for the adjustment model and addresses two questions. For complete data, we investigate whether ML improves efficiency gains relative to a misspecified adjustment model. Here we find that improvements are directly related to the proportion of variation explained by baseline covariates under the correct model. For missing data, we examine whether using ML can improve efficiency while avoiding bias attributable to model misspecification. Similar findings hold for missing data, with the degree of bias correction depending on the missing data mechanism. The methods and findings are illustrated in simulation studies and in an application to a randomized trial, and they can provide additional guidance for the appropriate use of covariate adjustment in randomized trials.
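As a concrete, generic illustration of ML-based covariate adjustment (a standardization-type estimate with arm-specific random forests, used here as a stand-in for the estimators studied in the paper), the sketch below uses simulated data; the data-generating model and all settings are assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                      # baseline covariates
A = rng.integers(0, 2, size=n)                   # randomized treatment assignment
Y = 2 * A + X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)

# Arm-specific ML outcome models on baseline covariates.
m1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[A == 1], Y[A == 1])
m0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[A == 0], Y[A == 0])

# Standardization: average predicted outcomes over the full covariate distribution.
ate_adjusted = np.mean(m1.predict(X) - m0.predict(X))
ate_unadjusted = Y[A == 1].mean() - Y[A == 0].mean()
print(ate_adjusted, ate_unadjusted)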
Keywords
Machine learning
Covariate adjustment
Bias correction
Precision optimization
Model specification
Missing data in randomized trials
Online marketplaces execute a large volume of price updates initiated by individual marketplace sellers on the platform each day. This price democratization comes with increasing data-quality challenges. The lack of centralized guardrails available to a traditional online retailer makes it more likely that inaccurate prices are published on the website, leading to poor customer experience and potential revenue loss. We present MoatPlus (Masked Optimal Anchors using Trees, Proximity-based Labeling and Unsupervised Statistical-features), a scalable price anomaly detection framework for a growing marketplace platform. The goal is to leverage proximity and historical price trends from unsupervised statistical features to generate an upper price bound. We build an ensemble of models to detect irregularities in price-based features, exclude irregular features, and use an optimized weighting scheme to build a reliable price bound in a real-time pricing pipeline. We found that our approach reduces false negatives, improving system recall by 21% while maintaining high precision.
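The sketch below illustrates only the general idea of an unsupervised statistical upper price bound (a robust center-plus-spread rule over a seller's historical prices); it is a simplified stand-in, not the MoatPlus ensemble, and the rule, thresholds, and names are hypothetical.

import numpy as np

def upper_price_bound(history, q=0.95, k=3.0):
    """Upper bound from historical prices: robust center plus k * MAD,
    floored at a high quantile of the history."""
    history = np.asarray(history, dtype=float)
    med = np.median(history)
    mad = np.median(np.abs(history - med))
    return max(np.quantile(history, q), med + k * 1.4826 * mad)

def is_anomalous(new_price, history):
    """Flag a newly submitted price that exceeds the upper bound."""
    return new_price > upper_price_bound(history)

history = [19.99, 20.49, 21.00, 19.50, 20.75, 22.00]
print(is_anomalous(21.50, history), is_anomalous(199.00, history))   # False True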
Keywords
e-commerce
anomaly detection
data mining
machine learning
big data
streaming system
Machine Learning techniques such as Decision Trees (CART), Bagging, Boosting, Random Forest, Support Vector Machines (SVM), and Naïve Bayes Methods are used to improve predictions of classification models. Case studies with customer churn will be discussed, and comparisons of the accuracy between different types of models will be made using ROC curves.
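A minimal sketch of the kind of comparison described, using scikit-learn on synthetic data in place of the customer churn case studies; the model settings are illustrative and ROC AUC is used as the summary of the ROC curve.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for a churn data set: features plus a binary churn label.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "svm": SVC(probability=True, random_state=0),
    "naive_bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(Xtr, ytr)
    auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")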
Keywords
Machine Learning
ROC Curve
Logistic Regression
Random Forest
Support Vector Machine
Fitting Cox models in a big data context (data whose volume, intensity, and complexity exceed the capacity of usual analytic tools) is often challenging, and even more so when some data are missing. We proposed algorithms that can fit Cox models in high-dimensional settings using extensions of partial least squares regression to the Cox model. Some of them cope with missing data. We were recently able to extend our most recent algorithms to big data, allowing us to fit Cox models for big data with missing values.
In addition, we proposed sparse group extensions of our algorithms and defined a new robust measure based on the Schmid score and the R coefficient of determination for least absolute deviation: the integrated R Schmid Score weighted.
Bertrand and Maumy (2021). Fitting and Cross-Validating Cox Models to Censored Big Data With Missing Values Using Extensions of Partial Least Squares Regression Models. Front. Big Data 4:684794.
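The dimension-reduction-then-Cox idea can be illustrated as below, with PCA standing in for the partial least squares extensions developed by the authors; the data are simulated, the lifelines package is used only for the Cox fit, and all names are hypothetical.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n, p, n_comp = 300, 1000, 5
X = rng.normal(size=(n, p))                          # high-dimensional covariates
risk = X[:, 0] - 0.5 * X[:, 1]
T = rng.exponential(scale=np.exp(-risk))             # survival times
C = rng.exponential(scale=2.0, size=n)               # censoring times
df = pd.DataFrame(PCA(n_components=n_comp).fit_transform(X),
                  columns=[f"comp{j}" for j in range(n_comp)])
df["time"] = np.minimum(T, C)
df["event"] = (T <= C).astype(int)

# Cox model on the reduced components (PCA as a stand-in for PLS-type components).
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()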
Keywords
Cox Models
Censored data
Big Data
Partial Least Squares Regression
Suppose that one can construct a valid (1-δ)-confidence interval (CI) for each of K parameters of potential interest. If a data analyst uses an arbitrary data-dependent criterion to select some subset S of parameters, then the aforementioned CIs for the selected parameters are no longer valid due to selection bias. We design a new method to adjust the intervals in order to control the false coverage rate (FCR).
The main established method is the "BY procedure" of Benjamini and Yekutieli (JASA, 2005). Unfortunately, the BY guarantees require certain restrictions on the selection criterion and on the dependence between the CIs. We propose a natural and much simpler method which is valid under any dependence structure between the original CIs and any (unknown) selection criterion, but which applies only to a special, yet broad, class of CIs. Our procedure reports (1-δ|S|/K)-CIs for the selected parameters, and we prove that it controls the FCR at δ for confidence intervals that implicitly invert e-values; examples include those constructed via supermartingale methods, via universal inference, or via Chernoff-style bounds on the moment generating function, among others.
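A sketch of the reporting rule itself (not of the e-value theory): each parameter has a CI constructor that can be evaluated at any level, an arbitrary data-dependent rule selects a subset, and the selected parameters are reported at level 1 - δ|S|/K. The Hoeffding-style CIs are one example of the Chernoff-type constructions mentioned above; the data, the selection rule, and all names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
K, n, delta = 50, 200, 0.05
data = rng.normal(loc=rng.uniform(0, 1, size=K)[:, None], scale=0.25, size=(K, n))
data = np.clip(data, 0, 1)                           # bounded observations in [0, 1]

def hoeffding_ci(sample):
    """Chernoff/Hoeffding-style CI constructor for a bounded mean, as a function of the level."""
    def ci(level):
        half = np.sqrt(np.log(2 / (1 - level)) / (2 * len(sample)))
        m = sample.mean()
        return m - half, m + half
    return ci

ci_constructors = [hoeffding_ci(data[i]) for i in range(K)]

# Arbitrary data-dependent selection rule: keep parameters with large sample means.
selected = [i for i in range(K) if data[i].mean() > 0.7]

# Report (1 - delta * |S| / K)-CIs for the selected parameters.
level = 1 - delta * len(selected) / K
adjusted_cis = {i: ci_constructors[i](level) for i in selected}
print(level, adjusted_cis)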
Keywords
Benjamini-Yekutieli procedure
false coverage rate
Bayes factor
A/B testing
The Random Effects Expectation-Maximization (RE-EM) tree, a tree-based data mining tool, accounts for within-subject correlation in longitudinal data. By partitioning the time axis into smaller segments to achieve homogeneity in the response, it serves as an efficient method for approximating knots when fitting piecewise mixed effects models to unbalanced longitudinal data. Successful application of the recently introduced post-hoc mixture modeling of BLUPs for the classification of unbalanced longitudinal data requires an optimal approximation of the knots for fitting a piecewise linear mixed effects model. Applying the RE-EM tree to a dataset of early childhood growth patterns detected three knots, which we used to fit a piecewise linear mixed effects model. Post-hoc mixture modeling of the BLUPs from the piecewise mixed effects model produced distinct trajectories of early-childhood pathways to obesity.
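The sketch below shows only the piecewise linear mixed effects step: build truncated-linear spline terms at given knots and fit a random-intercept model with statsmodels. The RE-EM tree step that would supply the knots is not shown; the data, knot locations, and variable names are hypothetical placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format growth data: child id, age, and a BMI-like response.
rng = np.random.default_rng(1)
ids = np.repeat(np.arange(100), 8)
age = np.tile(np.linspace(0, 6, 8), 100)
bmi = (16 + 0.5 * age - 0.3 * np.maximum(age - 2, 0)
       + np.repeat(rng.normal(0, 1, 100), 8) + rng.normal(0, 0.5, size=age.size))
df = pd.DataFrame({"id": ids, "age": age, "bmi": bmi})

# Knots (placeholders here) would come from the RE-EM tree's partition of the time axis.
knots = [2.0, 4.0]
for j, k in enumerate(knots, start=1):
    df[f"age_k{j}"] = np.clip(df["age"] - k, 0, None)      # truncated linear basis

# Piecewise linear mixed effects model with a random intercept per child.
result = smf.mixedlm("bmi ~ age + age_k1 + age_k2", df, groups=df["id"]).fit()
print(result.summary())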
Keywords
Random effects expectation-maximization
Change point
Piecewise mixed effects model
Longitudinal unbalanced data
Cluster analysis
Tree-based data mining
With the data explosion in the digital era, it is common for data to be distributed across multiple sites. However, there are two challenges in analyzing decentralized data: (a) communication of large-scale data between sites is expensive and inefficient; (b) data are not allowed to be shared for privacy or legal reasons. To address these challenges, we propose a one-shot distributed learning algorithm via refitting bootstrap samples, which we refer to as ReBoot. Theoretically, we analyze the statistical rate of ReBoot for generalized linear models (GLM) and noisy phase retrieval, which represent convex and non-convex problems, respectively. ReBoot achieves the full-sample statistical rate in both cases whenever the subsample size is not too small. We show that the systematic bias of ReBoot, the error that is independent of the number of subsamples, is O(n^-2) in GLM, where n is the subsample size. A simulation study illustrates the statistical advantage of ReBoot over competing methods. In addition, we propose FedReBoot, an iterative version of ReBoot, to aggregate convolutional neural networks; it exhibits superiority over FedAvg in early rounds of communication.
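One plausible reading of the refit-the-bootstrap idea is sketched below for logistic regression: each site shares only its fitted local model, and the central machine simulates a bootstrap sample from each local fit and refits a single model on the pooled samples. The covariate-generation step, the number of sites, and all settings are assumptions for illustration, not the paper's exact algorithm.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, K, n = 5, 10, 500
beta = rng.normal(size=d)

def simulate(n):
    """One site's local data from a logistic model with coefficient beta."""
    X = rng.normal(size=(n, d))
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))
    return X, y

# Each site fits a local GLM and shares only the fitted model (one-shot communication).
local_fits = [LogisticRegression(C=1e6, max_iter=2000).fit(*simulate(n)) for _ in range(K)]

# Central machine: draw a bootstrap sample from each local fit, then refit once on the pool.
Xb_all, yb_all = [], []
for fit in local_fits:
    Xb = rng.normal(size=(n, d))                 # assumed covariate distribution (illustrative)
    yb = rng.binomial(1, fit.predict_proba(Xb)[:, 1])
    Xb_all.append(Xb)
    yb_all.append(yb)
reboot = LogisticRegression(C=1e6, max_iter=2000).fit(np.vstack(Xb_all), np.concatenate(yb_all))
print(np.round(reboot.coef_.ravel() - beta, 3))  # pooled refit is close to the truth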
Keywords
Distributed Learning
One-Shot Aggregation
Generalized Linear Models
Phase Retrieval
Model Aggregation
Personalized decision-making, which aims to derive optimal individualized treatment rules (ITRs) based on individual characteristics, has recently attracted increasing attention in many fields. The current literature mainly focuses on estimating ITRs from a single source population. In real-world applications, the distribution of a target population can differ from that of the source population. Due to privacy concerns and other practical issues, individual-level data from the target population are often unavailable. We consider an ITR estimation problem where the source and target populations may be heterogeneous, individual data are available from the source population, and only summary information on covariates is accessible from the target population. We develop a weighting framework that tailors an ITR for a given target population by leveraging the available summary statistics. Both the empirical performance and the theoretical properties of the proposed estimators are examined.
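The weighting ingredient can be sketched with entropy balancing (one of the keywords): calibrate weights on the source sample so that weighted covariate means match the target population's summary means. This is only one component, not the full ITR estimation procedure, and the data and names are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 1000, 3
X_source = rng.normal(size=(n, d))               # individual-level source covariates
mu_target = np.array([0.5, -0.2, 0.1])           # target population summary means

def dual(lam):
    # Convex dual of entropy balancing; its minimizer gives exponential-tilting weights.
    return np.log(np.exp((X_source - mu_target) @ lam).sum())

lam_hat = minimize(dual, np.zeros(d), method="BFGS").x
w = np.exp((X_source - mu_target) @ lam_hat)
w /= w.sum()

print(np.round(X_source.T @ w, 3), mu_target)    # weighted means match the target means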
Keywords
Covariate shift
Double robustness
Empirical likelihood
Entropy balancing
Multi-source policy learning
Co-Author(s)
Wenbin Lu, North Carolina State University
Shu Yang, North Carolina State University, Department of Statistics
First Author
Jianing Chu, North Carolina State University
Presenting Author
Jianing Chu, North Carolina State University
The parameters of a machine learning model are typically learned by minimizing a loss function on a set of training data. However, this comes with the risk of overtraining: for the model to generalize well, it is of great importance that we are able to find the optimal parameter for the model on the entire population, not only on the given training sample. In this paper, we construct valid confidence sets for this optimal parameter of a machine learning model, which can be generated using only the training data without any knowledge of the population. We then show that studying the distribution of this confidence set allows us to assign a notion of confidence to arbitrary regions of the parameter space, and we demonstrate that this distribution can be well approximated using bootstrapping techniques.
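A minimal sketch of the bootstrap approximation mentioned at the end: refit the minimizer of the training loss on bootstrap resamples and use the resulting cloud of parameters as an approximate confidence region. This illustrates the general mechanism on least squares, not the paper's specific construction; the data and settings are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

def fit(X, y):
    # Minimizer of the empirical squared-error loss (ordinary least squares).
    return np.linalg.lstsq(X, y, rcond=None)[0]

theta_hat = fit(X, y)

# Bootstrap the minimizer: refit on resampled training sets.
B = 1000
boot = np.empty((B, d))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot[b] = fit(X[idx], y[idx])

# Componentwise 95% percentile region as a simple confidence set.
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)
print(np.column_stack([lower, theta_hat, upper]))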
Keywords
PAC-Learning
Glivenko-Cantelli Classes
Random Sets
Imprecise Probability
Hypothesis Testing
This presentation is concerned with learning variations of probability measures in the Wasserstein space. We introduce a spectral method, termed Wasserstein tangential principal component analysis (WT-PCA), to capture the local principal modes of geodesic variation of a collection of absolutely continuous probability measures at their barycenter.
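For measures on the real line, the tangent-space construction at the barycenter has a simple form that the sketch below uses as an illustration; it is a one-dimensional special case, not the general WT-PCA method, and the Gaussian family and grid are assumptions. Represent each measure by its quantile function, center at the barycenter's quantile function (the pointwise mean), and run PCA on the centered quantile functions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, grid = 50, np.linspace(0.01, 0.99, 99)

# A collection of Gaussian measures, each represented by its quantile function on a grid.
means = rng.normal(0, 1, size=m)
sds = rng.uniform(0.5, 2.0, size=m)
Q = np.array([norm.ppf(grid, loc=mu, scale=sd) for mu, sd in zip(means, sds)])

# Barycenter of measures on the real line: pointwise mean of the quantile functions.
Q_bar = Q.mean(axis=0)

# Tangent vectors at the barycenter and their principal modes of variation.
V = Q - Q_bar
U, s, Wt = np.linalg.svd(V, full_matrices=False)
principal_modes = Wt[:2]                         # first two modes of geodesic variation
scores = V @ Wt[:2].T                            # projections of each measure onto the modes
print(s[:2] ** 2 / (s ** 2).sum())               # share of variation explained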
Keywords
Principal component analysis
Wasserstein covariance operator
optimal transport
To address the over-dispersion problem, the Poisson Inverse Gaussian Regression Model (PIGRM) is applied to the modeling of count datasets. The PIGRM is typically estimated using the maximum likelihood estimator (MLE). When the explanatory variables in the PIGRM are correlated, however, the MLE does not produce useful results. In this work, several biased estimators, i.e., the Stein, ridge, Liu, and modified Liu estimators, are adapted to resolve the issue of multicollinearity in the PIGRM. These biased estimators behave differently for different models, which is why they are considered for the PIGRM in order to identify the best one. Every biased estimator has a biasing parameter with some limitations, and this study also proposes some biasing parameters for the Stein estimator. The performance of the considered biased estimators is evaluated through a simulation study under different parametric conditions and a real-life application, based on the minimum mean squared error criterion. The simulation and application findings favor the ridge estimator with specific biasing parameters because it provides less variation than the Stein, Liu, and modified Liu estimators.
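For reference, the ridge-type estimator in such GLM settings usually takes the standard form below, where \(\hat{W}\) is the estimated weight matrix from the final iteratively reweighted least squares step and \(k > 0\) is the biasing parameter; this is the generic formula, not necessarily the exact variants compared in this study.

\[
\hat{\beta}_{\mathrm{ridge}}(k) = \left(X^\top \hat{W} X + k I\right)^{-1} X^\top \hat{W} X \, \hat{\beta}_{\mathrm{MLE}}
\]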
Keywords
Poisson Inverse Gaussian Regression Model
Many data in economics are observed over time, exhibiting temporal correlation as well as persistent upward and downward movements. A relaxation of standard assumptions is nonstationarity, modeled here through locally stationary processes with a smoothly varying trend.
This talk will present novel estimators for high-dimensional autocovariance and precision matrices that use the local stationarity property. The estimators are used to derive consistent predictors for nonstationary time series. In addition to theoretical results, we illustrate the finite sample properties of the new methodology through a simulation study and an application to economic data.
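A kernel-smoothed local autocovariance is one standard ingredient in such locally stationary settings; the sketch below computes it for a univariate time-varying AR(1) series at a rescaled time point u. This is a generic univariate illustration, not the authors' high-dimensional estimator, and the kernel, bandwidth, and simulated process are assumptions.

import numpy as np

def local_autocovariance(x, u, lag, bandwidth):
    """Kernel-weighted lag-`lag` autocovariance around rescaled time u in [0, 1]."""
    n = len(x)
    t = np.arange(n - lag)
    # Epanechnikov kernel weights centered at u on the rescaled time axis.
    z = (t / n - u) / bandwidth
    w = np.where(np.abs(z) <= 1, 0.75 * (1 - z ** 2), 0.0)
    if w.sum() == 0:
        return 0.0
    w = w / w.sum()
    x0 = x[: n - lag] - np.sum(w * x[: n - lag])
    x1 = x[lag:] - np.sum(w * x[lag:])
    return np.sum(w * x0 * x1)

rng = np.random.default_rng(0)
n = 2000
# AR(1) with a smoothly time-varying coefficient (a locally stationary process).
phi = 0.2 + 0.6 * np.linspace(0, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi[t] * x[t - 1] + rng.normal()

print(local_autocovariance(x, u=0.1, lag=1, bandwidth=0.1),
      local_autocovariance(x, u=0.9, lag=1, bandwidth=0.1))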
Keywords
time series
nonstationarity
economics
Aging is a complex process that affects organisms differently, and chronological age does not always align with biological age. The prediction of human age is becoming increasingly important in various fields. Given the variety of data modalities in existence (e.g., facial images, brain MRI, disease diagnoses), a multi-modal approach that can account for different biomarkers will be crucial and groundbreaking. In our work, we propose a unified model, "IntelliCare", a pre-trained transformer framework that provides predictive and explanatory care for human aging. Moreover, multi-modal learning for medical imaging faces subgroup distribution shifts in medical data, which can arise from unlabeled subclasses inside every superclass, causing hidden stratification. We use explanations to improve model robustness against subgroup distribution shifts; robustness is thus a core aspect of model quality that is essential for ensuring explainability. We evaluate IntelliCare on 10 datasets for pretraining and 5 datasets for fine-tuning in human aging.
Keywords
Age prediction
Distribution shifts
Multi-modal learning
Representation learning
Robustness
Transfer learning
Networks arise as dominant structures in many fields. In network analysis, community detection, the unsupervised clustering of actors, is critical for understanding network structure. Although various statistical models have been developed for community detection, very little work has been devoted to edge clustering in comparison with traditional node clustering approaches. In particular, no existing methods leverage edge weights when clustering edges.
We therefore propose the Weighted Latent Space Edge Clustering (WLSEC) model, which addresses this methodological gap by clustering weighted directed edges. The WLSEC model is built from the latent space model, in which the probability of an edge between nodes and the weight of that edge depend on the features of both nodes and the latent environments. We then propose a generalized EM (GEM) algorithm and gradient-based Monte Carlo algorithms to estimate the WLSEC model. We evaluate the performance of the WLSEC model through both simulation studies and real-world networks. Compared with the unweighted latent space edge clustering model, the WLSEC model achieves a significant improvement in accuracy.
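To fix ideas about the latent space ingredient (not the WLSEC estimation procedure), the sketch below simulates a weighted directed network in which both the edge probability and the edge weight depend on the distance between the nodes' latent positions; all distributional choices, parameters, and names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim = 60, 2
Z = rng.normal(size=(n_nodes, dim))              # latent positions

alpha, gamma = 1.0, 1.5
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    for j in range(n_nodes):
        if i == j:
            continue
        dist = np.linalg.norm(Z[i] - Z[j])
        p_edge = 1 / (1 + np.exp(-(alpha - dist)))                   # closer nodes connect more often
        if rng.random() < p_edge:
            W[i, j] = rng.lognormal(mean=gamma - dist, sigma=0.3)    # closer nodes get larger weights

print((W > 0).mean(), W[W > 0].mean())           # edge density and average positive weight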
Keywords
Network Analysis
Clustering
Community detection
Latent space models