Sunday, Aug 6: 4:00 PM - 5:50 PM
0021
Contributed Papers
Metro Toronto Convention Centre
Room: CC-206B
Main Sponsor
Government Statistics Section
Presentations
Race and ethnicity data are frequently missing in electronic health records (EHRs). Simply excluding individuals with missing data from the analysis can result in a loss of analytic power and biased estimates, thus limiting the use of EHRs for health disparities research. Surnames and residential addresses can be used to predict race and ethnicity, but these variables are unavailable in anonymized collections of EHRs. Furthermore, race and ethnicity information is often missing not at random (e.g., individuals refuse to provide this information due to privacy concerns); imputation models that assume the data are missing at random fail to correct for the resulting bias. To address these issues, we propose a local similarity imputation method based on machine learning techniques using geocoded auxiliary information, behavioral risk factors, and health status features. The new approach was compared with two well-established imputation methods: hot-deck imputation and Bayesian multiple imputation. A simulation study was used to evaluate the imputation accuracy of each method. The results showed that the new approach outperformed the other two methods, achieving high sensitivity and specificity.
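The abstract does not describe the proposed algorithm in detail, so the following is only a minimal sketch of the general idea behind donor-based imputation: a nearest-neighbor hot-deck that copies a missing value from the most similar complete record. It is not the authors' method. The function name, the feature names (`risk_score`, `tract_income`), and the records are all hypothetical.

```python
import math


def nearest_donor_impute(records, features, target):
    """Fill missing `target` values by copying from the closest complete record.

    Similarity is plain Euclidean distance over `features` (an assumption made
    for illustration; the abstract does not specify a similarity measure).
    """
    donors = [r for r in records if r[target] is not None]
    if not donors:
        raise ValueError("no complete records available as donors")

    completed = []
    for rec in records:
        if rec[target] is None:
            point = [rec[f] for f in features]
            donor = min(
                donors,
                key=lambda d: math.dist(point, [d[f] for f in features]),
            )
            rec = {**rec, target: donor[target]}  # copy, do not mutate the input
        completed.append(rec)
    return completed


# Hypothetical, made-up records; features are assumed to be pre-scaled to [0, 1].
records = [
    {"risk_score": 0.10, "tract_income": 0.20, "group": "A"},
    {"risk_score": 0.90, "tract_income": 0.80, "group": "B"},
    {"risk_score": 0.85, "tract_income": 0.75, "group": None},
]

completed = nearest_donor_impute(records, ["risk_score", "tract_income"], "group")
```

In a simulation study of the kind described, imputed values would be compared against deliberately masked true values to compute sensitivity and specificity for each category.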
Keywords
missing imputation
machine learning techniques
race and ethnicity missing
local similarity imputation
electronic health record
The United States Department of Agriculture National Agricultural Statistics Service (NASS) provides timely and accurate statistics in service to U.S. agriculture. One example is planted acreage estimates for corn. NASS conducts surveys in March and June to provide early season estimates of corn acreage. Since planting typically occurs in May and June, farmers are generally reporting planting intentions in March. The information collected through the June survey is typically a close representation of what is planted, since corn planting is generally complete by the end of June. It is possible, however, that planting can be prevented by extreme weather. If this is the case, the June survey may still include planting intentions, which can bias the results when intentions change due to weather conditions. More information is necessary to mitigate this potential source of bias. The objective of this study is to use machine learning to combine the June survey estimate with precipitation, temperature, economic, and other data to forecast corn planted acreage. The accuracy of the model estimates is measured by the relative error with respect to official acreages.
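The relative-error criterion mentioned above can be sketched as follows. The abstract does not say whether the error is signed or absolute, so the signed form is an assumption here, and the acreage figures are made up for illustration.

```python
def relative_error(estimate: float, official: float) -> float:
    """Signed relative error of an estimate with respect to the official value."""
    if official == 0:
        raise ValueError("official value must be nonzero")
    return (estimate - official) / official


# Hypothetical values in million acres (not NASS figures).
model_estimate = 91.0
official_acreage = 90.0

err = relative_error(model_estimate, official_acreage)
abs_err = abs(err)  # the absolute form, if direction of the miss does not matter
```

A positive value means the model overestimates the official acreage; a negative value means it underestimates.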
Keywords
Machine Learning
Agriculture
Anomaly Detection
Group quarters (GQs) are places where people live in a group living arrangement owned or managed by entities providing housing for the residents. GQs include such places as university student housing, residential treatment centers, nursing facilities, group homes, military barracks, and correctional facilities for adults. During the 2020 Census, when processing GQs at the end of data collection, many had not provided the necessary information indicating their occupancy status or population count. To address this issue, we assembled a GQ count imputation team to remove reporting errors from GQs when possible, and to apply a count imputation procedure when valid responses from occupied GQs were not available.
The team's work was divided into two stages. First, we partitioned the GQ universe into (a) resolved and (b) unresolved GQs. Resolved GQs had a clear status and count. Unresolved GQs were known to be occupied but did not have a population count. Second, we developed an imputation method that was statistically robust yet quick to implement, and applied it to the unresolved GQs. This work describes the GQ imputation process and provides high-level results.
Keywords
group quarters
count imputation
Forecasting crop yield ahead of a harvest period is a complex problem. In recent years, abnormal conditions such as drought, heat waves, freezes, and floods have been observed in major United States crop-producing regions, making forecasting even more challenging. An increase in climate variability and in the frequency of extreme weather due to climate change is expected to bring additional challenges to yield forecasts for different crops in diverse geographies. Our research focuses on using machine learning approaches to develop indicators for critical climate events, with an emphasis on indicators for the yield of winter wheat. Some of these indicators may prove to be useful predictors for operational decisions during the winter wheat growing season. Current issues in applying machine learning to agriculture include the limitations of linear and non-linear approaches for capturing crop yield, as well as the importance of including scientific expert judgment in the model selection process.
Keywords
machine learning
model selection
climate change
crop yield
forecasting
Incomplete data, whether arising from nonresponse in survey data or from counterfactual outcomes in observational studies, may lead to biased estimation of study variables. Nonresponse and selection bias may be mitigated with techniques that weight the incomplete data to match characteristics of the partially unobserved complete data. Inverse probability weighting is a widely used method in causal inference that relies on a propensity model to construct adjusted weights, whereas calibration is a common method among survey statisticians that uses constrained optimization to construct adjusted weights. This paper reviews inverse probability weighting and a particular calibration method by distinguishing them in the statistical sense of variable balancing, extending propensity score construction to include generalized boosting models, and demonstrating the use of inverse probability weighting and calibration, separately and together, through a widely cited simulation study.
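As a minimal sketch of how the two weighting ideas can be combined, the example below applies inverse probability weights and then rescales them so that weighted group counts match known population totals. Post-stratification is used here as a simple stand-in for calibration; the abstract does not name the particular calibration method it reviews. The propensities are assumed known (in practice they would be estimated, for example by a boosting model), and all data values are made up.

```python
from collections import defaultdict


def ipw_weights(propensities):
    """Inverse probability weights: each respondent gets weight 1 / p."""
    for p in propensities:
        if not 0.0 < p <= 1.0:
            raise ValueError("propensities must lie in (0, 1]")
    return [1.0 / p for p in propensities]


def poststratify(weights, groups, population_totals):
    """Rescale weights so each group's weight sum equals its population total."""
    group_sums = defaultdict(float)
    for w, g in zip(weights, groups):
        group_sums[g] += w
    return [
        w * population_totals[g] / group_sums[g]
        for w, g in zip(weights, groups)
    ]


def weighted_mean(values, weights):
    """Weighted mean of `values` under `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical respondents (illustrative values only).
y = [10.0, 12.0, 20.0, 22.0]
groups = ["a", "a", "b", "b"]
propensities = [0.5, 0.5, 0.25, 0.25]        # assumed response propensities
population_totals = {"a": 40.0, "b": 60.0}   # assumed known control totals

w_ipw = ipw_weights(propensities)                        # step 1: IPW
w_cal = poststratify(w_ipw, groups, population_totals)   # step 2: calibrate

mean_ipw = weighted_mean(y, w_ipw)
mean_cal = weighted_mean(y, w_cal)
```

The first step corrects for unequal response probabilities; the second forces the weighted sample to balance on the grouping variable exactly, which is the sense of variable balancing the abstract refers to.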
Keywords
Missing data analysis
Calibration
Machine learning
Survey analysis
Every year the U.S. Department of Agriculture's National Agricultural Statistics Service (NASS) conducts the June Area Survey (JAS) based on an area frame, which has complete coverage of all land in the contiguous U.S. The data collected from the JAS are used to supply direct estimates of acreage and measures of sampling coverage for NASS's list frame, which consists of all known farms in the U.S. Response rates have been declining in many federal surveys, including the JAS, leading to heavier reliance on imputation. NASS has begun exploring automatic imputation for the JAS using various machine learning models. Previous research has found that NASS's Predictive Cropland Data Layer (PCDL) has good predictive power at certain entropy levels for major U.S. crop commodities. This paper explores the interaction between the entropy levels, PCDL values, and other data from administrative and survey sources to determine which entropy levels are appropriate for imputing the JAS.
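The abstract does not define how entropy is computed for the PCDL. Purely as an illustration, the sketch below assumes Shannon entropy over a vector of predicted crop probabilities and uses a hypothetical cutoff to decide whether a prediction is certain enough to use for imputation. The function names, the probabilities, and the threshold are all assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a (possibly unnormalized) probability vector."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def usable_for_imputation(probs, max_entropy):
    """Accept a prediction only when its entropy is at or below the cutoff."""
    return shannon_entropy(probs) <= max_entropy


# Hypothetical predicted probabilities for three crop classes.
confident = [0.95, 0.03, 0.02]
uncertain = [0.40, 0.35, 0.25]

MAX_ENTROPY = 0.5  # hypothetical threshold in bits, chosen only for this example

use_confident = usable_for_imputation(confident, MAX_ENTROPY)
use_uncertain = usable_for_imputation(uncertain, MAX_ENTROPY)
```

Low entropy means the predicted probability mass is concentrated on one class, so a rule of this shape would impute only where the prediction is relatively unambiguous.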
Keywords
Imputation
area frame
nonresponse
geospatial
administrative data
multiple data sources