Machine learning and imputation techniques for survey design and missing data

Elizabeth Petraglia, Chair
Westat
 
Sunday, Aug 6: 4:00 PM - 5:50 PM
0021 
Contributed Papers 
Metro Toronto Convention Centre 
Room: CC-206B 

Main Sponsor

Government Statistics Section

Presentations

A Machine Learning Approach for Imputation of Missing Race and Ethnicity Information in EHRs

Race and ethnicity data are frequently missing in electronic health records (EHRs). Simply excluding individuals with missing data from the analysis can result in a loss of analytic power and biased estimates, thus limiting the use of EHRs for health disparities research. Surnames and residential addresses can be used to predict race and ethnicity, but these variables are unavailable in anonymized collections of EHRs. Furthermore, race and ethnicity information is often missing not at random (e.g., individuals refuse to provide this information due to privacy concerns); imputation models that assume the data are missing at random therefore fail to correct for this known potential bias. To address these issues, we propose a local similarity imputation method based on machine learning techniques using geocoded auxiliary information, behavioral risk factors, and health status features. The new approach was compared with two well-established imputation methods: hot-deck and Bayesian multiple imputation. A simulation study was used to evaluate the imputation accuracy of each method. The results showed that the new approach outperformed the other two imputation methods, with high sensitivity and specificity. 
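The local similarity idea can be illustrated with a minimal k-nearest-donor sketch: each record with a missing label borrows the majority label of its most similar complete records. The toy feature vectors and labels below are illustrative stand-ins, not the study's actual features or method.

```python
# Minimal sketch of local-similarity (k-nearest-donor) imputation, assuming
# toy numeric auxiliary features (e.g., geocoded or health-status scores).
from collections import Counter
import math

def impute_local_similarity(records, k=3):
    """Fill missing 'race' labels using the k most similar complete records."""
    donors = [r for r in records if r["race"] is not None]
    out = []
    for r in records:
        if r["race"] is not None:
            out.append(dict(r))
            continue
        # Rank donors by Euclidean distance over the auxiliary feature vector
        ranked = sorted(donors, key=lambda d: math.dist(d["features"], r["features"]))
        votes = Counter(d["race"] for d in ranked[:k])
        imputed = dict(r)
        imputed["race"] = votes.most_common(1)[0][0]
        out.append(imputed)
    return out

records = [
    {"features": [0.1, 0.2], "race": "A"},
    {"features": [0.2, 0.1], "race": "A"},
    {"features": [0.9, 0.8], "race": "B"},
    {"features": [0.8, 0.9], "race": "B"},
    {"features": [0.15, 0.15], "race": None},  # sits in the "A" cluster
]
filled = impute_local_similarity(records, k=3)
```

Unlike a classic hot deck, which draws one donor, the local majority vote smooths over the k nearest complete records.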

Keywords

missing data imputation

machine learning techniques

missing race and ethnicity

local similarity imputation

electronic health record 

Co-Author(s)

Deborah Rolka, Centers for Disease Control and Prevention
Elizabeth Lundeen, CDC
Yu Chen, Centers for Disease Control and Prevention
Rachel Rutkowski, CDC

First Author

Hui Xie, Centers for Disease Control and Prevention

Presenting Author

Hui Xie, Centers for Disease Control and Prevention

Early Season Corn Acreage Estimates in the Presence of Extreme Weather

The United States Department of Agriculture's National Agricultural Statistics Service (NASS) provides timely and accurate statistics in service to U.S. agriculture. One example is planted acreage estimates for corn. NASS conducts surveys in March and June to provide early season estimates of corn acreage. Since planting typically occurs in May and June, farmers are generally reporting planting intentions in March. The information collected through the June survey is typically a close representation of what is planted, since corn planting is generally complete by the end of June. It is possible, however, that planting can be prevented by extreme weather. If this is the case, the June survey may still capture planting intentions, which can bias the results when those intentions change due to weather conditions. More information is necessary to mitigate this potential source of bias. The objective of this study is to use machine learning to combine the June survey estimate with precipitation, temperature, economic, and other data to forecast corn planted acreage. The accuracy of the model estimates is measured by the relative error with respect to official acreages. 
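As a simplified illustration of combining a June survey estimate with weather data, the sketch below models the survey's historical bias as a linear function of a single precipitation anomaly and scores the result by relative error. The one-predictor model and toy historical values stand in for the fuller machine learning approach described in the abstract.

```python
# Illustrative sketch: adjust the June survey estimate using a weather
# covariate, with the adjustment fit on toy past-year data (all values invented).

def fit_simple_ols(x, y):
    """Closed-form slope/intercept for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Past years: June survey estimate, June precipitation anomaly, official acreage
survey   = [90.0, 92.0, 88.0, 91.0]   # million acres
precip   = [0.0, 2.0, -1.0, 1.0]      # inches above normal (excess delays planting)
official = [90.0, 90.0, 88.5, 90.5]

# Model the survey's bias (official - survey) as a function of the anomaly
a, b = fit_simple_ols(precip, [o - s for o, s in zip(official, survey)])

def forecast(june_estimate, precip_anomaly):
    """June estimate corrected for the weather-driven intention-vs-planted gap."""
    return june_estimate + a + b * precip_anomaly

def relative_error(estimate, actual):
    """Evaluation metric from the abstract: relative error vs. official acreage."""
    return abs(estimate - actual) / actual
```

In a wet year the fitted slope pulls the forecast below the reported intentions, which is exactly the prevented-planting correction the abstract motivates.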

Keywords

Machine Learning

Agriculture

Anomaly Detection 

Co-Author(s)

Noemi Guindin, USDA NASS
Kevin Hunt, USDA NASS
Claire Boryan, USDA/NASS
Luca Sartore, National Institute of Statistical Sciences

First Author

Jonathon Abernethy, USDA/NASS

Presenting Author

Jonathon Abernethy, USDA/NASS

Group Quarters Count Imputation for the 2020 Census

Group quarters (GQs) are places where people live in a group living arrangement owned or managed by entities providing housing for the residents. GQs include such places as university student housing, residential treatment centers, nursing facilities, group homes, military barracks, and correctional facilities for adults. During the 2020 Census, at the end of data collection, many GQs had not provided the necessary information indicating their occupancy status or population count. To address this issue, we assembled a GQ count imputation team to remove reporting errors from GQs where possible and to apply a count imputation procedure when valid responses from occupied GQs were not available.

The team's work was divided into two stages. First, we partitioned the GQ universe into (a) resolved and (b) unresolved GQs: resolved GQs had a clear status and count, while unresolved GQs were known to be occupied but did not have a population count. Second, we developed an imputation method that was statistically robust yet quick to implement, and applied it to the unresolved GQs. This work describes the GQ imputation process and provides high-level results. 
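One simple count-imputation rule of the kind described can be sketched as follows, under the assumption that resolved, occupied GQs of the same type serve as donors; the actual 2020 Census procedure is more elaborate, and the GQ types and counts below are invented for illustration.

```python
# Hedged sketch: impute each unresolved GQ's count as the median population
# count of resolved, occupied GQs of the same type (toy data, not Census data).
from statistics import median
from collections import defaultdict

def impute_counts(resolved, unresolved):
    by_type = defaultdict(list)
    for gq in resolved:
        if gq["count"] > 0:                 # occupied donors only
            by_type[gq["type"]].append(gq["count"])
    donor_medians = {t: median(c) for t, c in by_type.items()}
    return [dict(gq, count=donor_medians[gq["type"]]) for gq in unresolved]

resolved = [
    {"type": "dorm", "count": 120},
    {"type": "dorm", "count": 80},
    {"type": "dorm", "count": 100},
    {"type": "nursing", "count": 40},
    {"type": "nursing", "count": 60},
]
unresolved = [
    {"type": "dorm", "count": None},     # occupied, count unknown
    {"type": "nursing", "count": None},
]
imputed = impute_counts(resolved, unresolved)
```

The median is a robust choice for a donor statistic, since GQ counts within a type can be heavily skewed by a few very large facilities.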

Keywords

group quarters

count imputation 

First Author

Andrew Keller, US Census Bureau

Presenting Author

Andrew Keller, US Census Bureau

Machine learning and climate indicators as an approach to enhance operational decisions

Forecasting crop yield ahead of a harvest period is a complex problem. In recent years abnormal conditions such as drought, heat waves, freezes, and floods have been observed in major United States crop-producing regions, making forecasting even more challenging. An increase in climate variability and frequency of extreme weather due to climate change is expected to bring additional challenges to crop yield forecasts for different crops in diverse geographies. Our research focuses on using machine learning approaches to develop indicators for critical climate events, with an emphasis on indicators for winter wheat yield. The development of certain indicators may lead to useful predictors for operational decisions during the winter wheat growing season. Current issues in applying machine learning to agriculture include the limitations of linear and non-linear approaches for capturing crop yield, as well as the importance of including scientific expert judgment in the model selection process. 

Keywords

machine learning

model selection

climate change

crop yield

forecasting 

Co-Author

Alex Tarter, National Agricultural Statistics Service

First Author

Noemi Guindin, USDA NASS

Presenting Author

Alex Tarter, National Agricultural Statistics Service

On calibrated inverse probability weighting via a machine learning model for incomplete survey data

Incomplete data, whether arising from nonresponse in surveys or counterfactual outcomes in observational studies, may lead to biased estimation of study variables. Nonresponse and selection bias may be mitigated with techniques that weight the incomplete data to match characteristics of the partially unobserved complete data. Inverse probability weighting is a widely used method in causal inference that relies on a propensity model to construct adjusted weights, whereas calibration is a common method among survey statisticians that uses constrained optimization to construct adjusted weights. This paper reviews inverse probability weighting and a particular calibration method by distinguishing them in the statistical sense of variable balancing, extending propensity score construction to include generalized boosting models, and demonstrating the use of inverse probability weighting and calibration, separately and together, through a widely cited simulation study evaluation. 
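A minimal sketch of the two weighting steps follows, with a one-covariate logistic propensity model standing in for the generalized boosting model and a post-stratification-style calibration to known group totals standing in for the constrained-optimization calibration; all data are illustrative.

```python
# Illustrative sketch: inverse probability weighting followed by a simple
# calibration step that forces weighted respondent totals to match known
# full-sample group sizes (toy data; not the paper's actual models).
import math

def fit_logistic(x, r, steps=5000, lr=0.1):
    """One-covariate logistic regression for P(respond) via gradient ascent."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        ga = gb = 0.0
        for xi, ri in zip(x, r):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += (ri - p) / n
            gb += (ri - p) * xi / n
        a += lr * ga
        b += lr * gb
    return a, b

# Full sample: binary covariate (group 0/1) and response indicator
x = [0] * 50 + [1] * 50
r = [1] * 40 + [0] * 10 + [1] * 20 + [0] * 30   # group 0 responds more often

a, b = fit_logistic(x, r)
prop = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]

# Step 1: inverse-probability weights for respondents only
resp_x = [xi for xi, ri in zip(x, r) if ri == 1]
ipw = [1.0 / p for ri, p in zip(r, prop) if ri == 1]

# Step 2: calibrate weights within each group to hit the known group sizes
for g, total in ((0, 50), (1, 50)):
    s = sum(w for w, xi in zip(ipw, resp_x) if xi == g)
    ipw = [w * total / s if xi == g else w for w, xi in zip(ipw, resp_x)]
```

A useful property visible here is the one the paper exploits when combining the methods: calibration forces exact agreement with known totals even when the propensity model is only approximately right.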

Keywords

Missing data analysis

Calibration

Machine learning

Survey analysis 

Co-Author(s)

Darcy Morris, U.S. Census Bureau
Patrick Joyce, US Census Bureau
Isaac Dompreh

First Author

Joseph Kang, US Census Bureau

Presenting Author

Joseph Kang, US Census Bureau

Optimizing Imputation for an Area Survey

Every year the U.S. Department of Agriculture's National Agricultural Statistics Service (NASS) conducts the June Area Survey (JAS) based on an area frame, which has complete coverage of all land in the contiguous U.S. The data collected from the JAS are used to supply direct estimates of acreage and measures of sampling coverage for NASS's list frame, which consists of all known farms in the U.S. Response rates have been declining in many federal surveys, including the JAS, leading to heavier reliance on imputation. NASS has begun exploring automatic imputation for the JAS using various machine learning models. Previous research has found that NASS's Predictive Cropland Data Layer (PCDL) has good predictive power at certain entropy levels for major U.S. crop commodities. This paper explores the interaction between entropy levels, PCDL values, and other data from administrative and survey sources to determine which entropy levels are appropriate for imputing the JAS. 
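The role of entropy levels can be illustrated with a toy screening rule: accept an auxiliary prediction for imputation only when its class-probability entropy falls below a threshold, then examine the accuracy/coverage trade-off across thresholds. The probabilities and thresholds below are invented for illustration; PCDL itself is not reproduced here.

```python
# Toy sketch of entropy-screened imputation: low-entropy (confident)
# predictions are kept as imputation donors, high-entropy ones discarded.
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def accuracy_at_threshold(preds, truth, tau):
    """Accuracy and count of predictions kept under entropy threshold tau."""
    kept = [(max(range(len(p)), key=p.__getitem__), t)
            for p, t in zip(preds, truth) if entropy(p) <= tau]
    if not kept:
        return None, 0
    acc = sum(yhat == y for yhat, y in kept) / len(kept)
    return acc, len(kept)

# Illustrative class-probability vectors for a two-class crop label
preds = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.1, 0.9]]
truth = [0, 1, 0, 1]

# Accuracy vs. coverage at a grid of candidate entropy thresholds
tradeoff = {tau: accuracy_at_threshold(preds, truth, tau)
            for tau in (0.4, 0.7, 1.0)}
```

Tightening the threshold raises accuracy but shrinks the share of records that can be imputed from the auxiliary source, which is the trade-off an entropy-level study has to balance.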

Keywords

Imputation

area frame

nonresponse

geospatial

administrative data

multiple data sources 

Co-Author(s)

Arthur Rosales, USDA/NASS
Luca Sartore, National Institute of Statistical Sciences
Tara Murphy, USDA National Agricultural Statistics Service

First Author

Sean Rhodes, USDA

Presenting Author

Sean Rhodes, USDA