Sunday, Aug 6: 4:00 PM - 5:50 PM
1423
Invited Paper Session
Metro Toronto Convention Centre
Room: CC-701B
Applied: Yes
Main Sponsor
Section on Statistical Learning and Data Science
Co Sponsors
Association for the Advancement of Artificial Intelligence
Section on Statistical Computing
Presentations
We propose an adaptive sampling approach for multiple testing which aims to maximize statistical power while ensuring anytime false discovery control. We consider n distributions whose means are partitioned by whether they are below or equal to a baseline (nulls), versus above the baseline (actual positives). In addition, each distribution can be sequentially and repeatedly sampled. Inspired by the multi-armed bandit literature, we provide an algorithm that takes as few samples as possible to exceed a target true positive proportion (i.e. proportion of actual positives discovered) while giving anytime control of the false discovery proportion (nulls predicted as actual positives). Our sample complexity results match known information theoretic lower bounds and through simulations we show a substantial performance improvement over uniform sampling and an adaptive elimination style algorithm. Given the simplicity of the approach, and its sample efficiency, the method has promise for wide adoption in the biological sciences, clinical testing for drug discovery, and online A/B/n testing problems.
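To give a rough flavor of the kind of adaptive sampling the abstract describes, here is a minimal sketch, not the authors' algorithm: it repeatedly samples the most ambiguous arm and declares a discovery once an anytime-valid lower confidence bound clears the baseline. The Hoeffding-style confidence width, the widest-interval selection rule, and all parameter names are illustrative assumptions.

```python
import numpy as np

def adaptive_discovery(means_sampler, n_arms, baseline=0.0,
                       delta=0.05, budget=20000, rng=None):
    """Illustrative adaptive sampler: pull the most ambiguous arm and
    declare a discovery once its anytime lower confidence bound
    clears the baseline (a sketch, not the paper's algorithm)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    counts = np.zeros(n_arms, dtype=int)
    sums = np.zeros(n_arms)
    discovered = set()
    # one initial sample per arm
    for a in range(n_arms):
        sums[a] += means_sampler(a, rng)
        counts[a] += 1
    for _ in range(budget):
        mu = sums / counts
        # crude anytime-valid half-width (Hoeffding + union bound over time)
        rad = np.sqrt(2 * np.log(n_arms * counts**2 / delta) / counts)
        lo, hi = mu - rad, mu + rad
        for a in range(n_arms):
            if a not in discovered and lo[a] > baseline:
                discovered.add(a)
        # arms whose upper bound is below baseline can never be positives
        active = [a for a in range(n_arms)
                  if a not in discovered and hi[a] > baseline]
        if not active:
            break
        # sample the active arm with the widest interval (most ambiguous)
        a = max(active, key=lambda i: rad[i])
        sums[a] += means_sampler(a, rng)
        counts[a] += 1
    return discovered
```

With a conservative bound like this, arms well above the baseline are discovered quickly while null arms are never declared positive, mimicking the anytime false discovery control described above at the cost of extra samples.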
I review examples of adaptive experiments conducted at Meta, describing the kinds of statistical and computational problems faced in running adaptive experiments at scale in products that reach billions of users.
I present work on using Bayesian optimization to conduct large-scale experimentation more efficiently, and I consider challenging open problems that arise for statisticians and machine learning researchers doing adaptive experiments at technology companies. I have set up open-source frameworks for adaptive experimentation that give statisticians and machine learning researchers opportunities to understand the problems being solved and the best solutions available to industry, as well as considerable potential for contributing their own techniques and methods.
Multi-armed bandit algorithms like Thompson Sampling (TS) can be used to conduct adaptive experiments to improve user experience. However, such assignment strategies can often increase the False Positive Rate (FPR) and decrease statistical power. We tackle this by introducing a novel heuristic algorithm, called TS-PostDiff (Posterior Probability of Difference), which takes a Bayesian approach to mixing TS and Uniform Random (UR) strategies, allowing for more UR exploration when there is little or no reward to be gained. We evaluate TS-PostDiff against state-of-the-art strategies. The empirical and simulation results help characterize the trade-offs of these approaches between reward, FPR, and statistical power, as well as the circumstances under which each is effective. We quantify the advantage of TS-PostDiff in performing well across multiple differences in arm means (effect sizes), showing the benefits of adaptively changing randomization/exploration in TS in a "Statistically Considerate" manner. This highlights important considerations for future algorithm development and analysis to better balance reward and statistical analysis.
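A minimal sketch of the mixing idea, assuming a two-armed Bernoulli bandit with Beta posteriors: estimate the posterior probability that the arms differ by less than some caliper c, and use that probability to decide between a Uniform Random and a plain TS allocation. The caliper value, Monte Carlo estimate, and mixing rule here are illustrative guesses at the heuristic, not the authors' exact algorithm.

```python
import numpy as np

def ts_postdiff_choose(alpha, beta, c=0.1, n_mc=1000, rng=None):
    """One allocation step of a TS-PostDiff-style policy for a
    two-armed Bernoulli bandit (illustrative sketch).
    alpha, beta: length-2 arrays of Beta posterior parameters."""
    rng = rng if rng is not None else np.random.default_rng()
    # Monte Carlo estimate of P(|theta_1 - theta_0| < c | data)
    th = rng.beta(alpha, beta, size=(n_mc, 2))
    p_small_diff = np.mean(np.abs(th[:, 1] - th[:, 0]) < c)
    if rng.random() < p_small_diff:
        # arms look similar: explore uniformly at random
        return int(rng.integers(2))
    # otherwise act like plain Thompson Sampling
    return int(np.argmax(rng.beta(alpha, beta)))

def run(p_true, horizon=2000, seed=0):
    """Simulate the policy on Bernoulli arms with success rates p_true."""
    rng = np.random.default_rng(seed)
    alpha, beta = np.ones(2), np.ones(2)
    pulls = np.zeros(2, dtype=int)
    for _ in range(horizon):
        a = ts_postdiff_choose(alpha, beta, rng=rng)
        r = rng.random() < p_true[a]
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
    return pulls
```

When the true arm means are equal, p_small_diff stays large and the policy behaves close to uniform randomization (helping FPR and power); when the means differ, p_small_diff shrinks and the policy recovers TS-style reward maximization.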
Contextual bandit algorithms are becoming ubiquitous tools for adaptive sequential experimentation in healthcare and the tech industry. This adaptivity raises interesting but hard statistical inference questions: e.g., how would our algorithm have done if we had used a policy different from the logging policy that was used to collect the data, a problem known as "off-policy evaluation" (OPE).
Using modern martingale techniques, we present a suite of methods for OPE inference that can be used even while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy is data-adaptively changing due to learning, and even if the context distributions are drifting over time.
Concretely, we derive confidence sequences – sequences of confidence intervals that are uniformly valid over time – for various functionals, including (time-varying) off-policy mean reward values, as well as the entire CDF of the off-policy rewards. All of our methods (a) are valid at arbitrary stopping times (b) only make nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of our estimators.
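For intuition only, here is a crude confidence sequence for the off-policy mean reward built from inverse-propensity-weighted (IPW) pseudo-rewards. Unlike the martingale methods described above, this sketch assumes rewards in [0, 1] and importance weights bounded by a known w_max, and uses a conservative Hoeffding bound with a union bound over time; the talk's methods avoid both assumptions and adapt to empirical variance.

```python
import numpy as np

def ipw_confidence_sequence(rewards, logging_probs, target_probs,
                            w_max, delta=0.05):
    """Crude anytime-valid confidence sequence for the off-policy mean
    reward via IPW (illustrative sketch, not the talk's method).
    logging_probs[t], target_probs[t]: probability each policy assigns
    to the action actually taken at time t."""
    w = target_probs / logging_probs           # importance weights
    x = w * rewards                            # IPW pseudo-rewards in [0, w_max]
    t = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / t
    # Hoeffding bound with error budget delta/(t*(t+1)) at time t,
    # which sums to delta: valid simultaneously over all times.
    rad = w_max * np.sqrt(np.log(2 * t * (t + 1) / delta) / (2 * t))
    return mean - rad, mean + rad
```

Because the intervals hold uniformly over time, the experiment can be monitored continuously and stopped at any data-dependent time without invalidating coverage, which is the property the abstract's methods deliver under much weaker assumptions.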
Using bandit algorithms to conduct adaptive randomised experiments can minimise regret, but it poses major challenges for statistical inference. Recent attempts to address these challenges typically impose restrictions on the exploitative nature of the bandit algorithm (trading off regret) and require large sample sizes to ensure asymptotic guarantees. However, large experiments generally follow a successful pilot study, which is tightly constrained in its size or duration. Increasing power in such small pilot experiments, without limiting the adaptive nature of the algorithm, can allow promising interventions to reach a larger experimental phase. In this work we introduce a novel hypothesis test, uniquely based on the allocation probabilities of the bandit algorithm, without constraining its exploitative nature or requiring a minimum experimental size. We characterise our Allocation Probability Test when applied to Thompson Sampling, presenting its theoretical properties and illustrating its finite-sample performance compared to state-of-the-art approaches. We show the regret and inferential advantages of our approach in both simulations and a real-world experiment.
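To illustrate what a test "based only on allocation probabilities" might look like, the sketch below records the Thompson Sampling allocation probability of arm 1 at each step, summarises it by its time average, and calibrates a one-sided p-value by Monte Carlo simulation under the null of equal arms. The statistic, the Monte Carlo calibration, and all parameter values are illustrative stand-ins, not the authors' Allocation Probability Test.

```python
import numpy as np

def ts_prob_arm1(alpha, beta, n_mc=500, rng=None):
    """Monte Carlo estimate of P(theta_1 > theta_0) under Beta posteriors,
    i.e. the TS allocation probability of arm 1."""
    rng = rng if rng is not None else np.random.default_rng()
    th = rng.beta(alpha, beta, size=(n_mc, 2))
    return float(np.mean(th[:, 1] > th[:, 0]))

def run_ts(p_true, horizon, rng):
    """Run two-armed Bernoulli TS, returning the time-averaged
    allocation probability of arm 1."""
    alpha, beta = np.ones(2), np.ones(2)
    probs = []
    for _ in range(horizon):
        pi1 = ts_prob_arm1(alpha, beta, rng=rng)
        probs.append(pi1)
        a = int(rng.random() < pi1)          # allocate arm 1 w.p. pi1
        r = rng.random() < p_true[a]
        alpha[a] += r
        beta[a] += 1 - r
    return float(np.mean(probs))

def allocation_probability_test(p_true, horizon=200, n_null=100, seed=0):
    """Illustrative Monte Carlo test: is arm 1 allocated more heavily
    than it would be under the null of equal arm means?"""
    rng = np.random.default_rng(seed)
    obs = run_ts(p_true, horizon, rng)
    null = [run_ts([0.5, 0.5], horizon, rng) for _ in range(n_null)]
    # one-sided Monte Carlo p-value with the usual +1 correction
    return (1 + sum(s >= obs for s in null)) / (1 + n_null)
```

The appeal of this style of test, as in the abstract, is that it uses only quantities the bandit algorithm already computes, so no constraint is placed on its exploitative behaviour.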
Online reinforcement learning algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the sampled user data trajectories; this can cause standard variance estimators for i.i.d. data to underestimate the true variance of estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively sampled data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, I discuss how we plan to use this approach for the primary analysis for Oralytics, a mobile health study to encourage oral self-care, that will use a pooling reinforcement learning algorithm.
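For orientation, the sketch below shows the standard i.i.d. sandwich variance estimator for the least-squares Z-estimator, the construction that underestimates variance under adaptive sampling. The talk's adaptive sandwich estimator corrects the "meat" term for the dependence induced by pooling; that correction is not reproduced here.

```python
import numpy as np

def sandwich_variance(X, y, theta):
    """Standard i.i.d. sandwich variance estimate for the least-squares
    Z-estimator solving sum_i X_i (y_i - X_i' theta) = 0.
    (Illustrative baseline only: under pooled adaptive sampling this
    underestimates the true variance, motivating the talk's corrected
    'adaptive sandwich' estimator.)"""
    resid = y - X @ theta
    bread = np.linalg.inv(X.T @ X)            # inverse derivative of the estimating equation
    meat = (X * resid[:, None] ** 2).T @ X    # sum of outer products of score terms
    return bread @ meat @ bread               # estimated Cov(theta_hat)
```

The "bread" comes from the derivative of the estimating equation and the "meat" from the variability of its terms; adaptive sampling breaks the independence assumed when forming the meat as a simple sum, which is exactly where a correction is needed.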