Tuesday, Aug 8: 8:30 AM - 10:20 AM
Topic-Contributed Paper Session
Metro Toronto Convention Centre
Survey Research Methods Section
Government Statistics Section
Social Statistics Section
Statistical agencies widely use cell suppression methods for economic censuses and establishment surveys to protect sensitive tabular data from disclosure to the public. The goal is to reduce the risk of disclosure by first identifying sensitive cells as primary suppressions and then selecting additional cells as complementary (secondary) suppressions, so that an attacker cannot recover the primary cells from the published values and totals. In general, the cell suppression problem (CSP) can be formulated as a linear programming problem. In this presentation, cell suppression models are reviewed, with a focus on network flow models with heuristic solutions for two-dimensional tables as well as exact optimal solutions. Applications of cell suppression methods at statistical agencies are highlighted. The extension of the solutions to high-dimensional, hierarchical, and linked tables is also discussed.
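To make the need for complementary suppressions concrete, the following minimal sketch simulates the simplest attack on a two-dimensional table: recovering a suppressed cell by subtracting the published cells from a published row or column total. The function name `subtraction_attack`, the table values, and the suppression patterns are illustrative assumptions, not taken from any agency's software. Surviving this attack is a necessary condition only; it does not establish the protection bounds that the LP and network flow models enforce.

```python
def subtraction_attack(table, suppressed):
    """Return {cell: value} for suppressed cells recoverable by subtraction.

    table: interior cell values (list of rows). Row and column totals are
    assumed to be published. suppressed: set of (row, col) hidden cells.
    Illustrative sketch only, not a full disclosure audit.
    """
    n_rows, n_cols = len(table), len(table[0])
    # Each "line" is one row or one column together with its published total.
    lines = [[(i, j) for j in range(n_cols)] for i in range(n_rows)]
    lines += [[(i, j) for i in range(n_rows)] for j in range(n_cols)]
    totals = [sum(row) for row in table]
    totals += [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]

    unknown = set(suppressed)
    recovered = {}
    progress = True
    while progress:
        progress = False
        for cells, total in zip(lines, totals):
            hidden = [c for c in cells if c in unknown]
            if len(hidden) == 1:
                # Every other cell in this line is published or already
                # recovered, so the attacker knows its value.
                visible = sum(table[i][j] for (i, j) in cells
                              if (i, j) not in unknown)
                recovered[hidden[0]] = total - visible
                unknown.discard(hidden[0])
                progress = True
    return recovered


# Hypothetical 3x3 table; cell (0, 0) is the primary suppression.
table = [[10, 20, 30],
         [5, 15, 25],
         [40, 50, 60]]

# The primary suppression alone is recovered exactly from its row total.
primary_only = subtraction_attack(table, {(0, 0)})

# Three complementary cells forming a rectangle defeat this attack.
protected = subtraction_attack(table, {(0, 0), (0, 1), (1, 0), (1, 1)})
```

Here `primary_only` contains the true value of cell (0, 0), while `protected` is empty because every affected row and column holds two hidden cells.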
The current implementation of the complementary cell suppression methodology at Statistics Canada relies on a linear programming (LP) solver in SAS to find the optimal solution. As an alternative, open-source LP solvers are being investigated. It is not clear which of these solvers performs best on the suppression problem until they are actually used and their performance assessed. Therefore, a Python version of suppression was implemented using open-source linear programming packages. There are several challenges in comparing the performance of solvers. For example, it is difficult to assess the solution of the linear programming problem, since the heuristic method requires solving LP problems sequentially. This presentation discusses the performance of alternative solvers in relation to typical suppression problems.
Integer programming (IP) methods allow for finding an optimal solution for complementary suppressions when processing multiple primary suppressions. However, the model complexity of IP makes it impractical to use in most production environments. The complexity of an IP model is largely determined by the number of primary suppressions and the size of the solution space (the cells available to complement the primaries), and it grows exponentially with both. Therefore, IP scales poorly and generally handles only small datasets. Over the years, researchers have tried to solve cell suppression problems with IP, but it has found only limited application.
In this research, we first identify the limits of IP by exploring what problem sizes IP can handle, in terms of the number of primary suppressions and the size of the solution space. We then investigate ways to reduce the IP model size. Several approaches are used: restricting the solution space, restricting the number of primary suppressions in the model, and a top-down approach that divides the data geographically. While the resulting method does not guarantee an optimal solution, the results may still be an improvement over solutions obtained by LP.
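The effect of the solution space on problem size can be illustrated with a small exhaustive search. The sketch below is not an IP model and not the method of this research: it enumerates every subset of candidate cells, which is the search space an IP solver must reason about implicitly. The names `exposed` and `brute_force_complements`, the table values, and the cost measure (total value of complementary cells) are assumptions made for illustration. The protection test is deliberately simplistic: a pattern is rejected only if some row or column holds exactly one suppressed cell.

```python
from itertools import combinations


def exposed(n_rows, n_cols, pattern):
    """True if some row or column holds exactly one suppressed cell."""
    for i in range(n_rows):
        if sum((i, j) in pattern for j in range(n_cols)) == 1:
            return True
    for j in range(n_cols):
        if sum((i, j) in pattern for i in range(n_rows)) == 1:
            return True
    return False


def brute_force_complements(table, primaries, candidates=None):
    """Exhaustively search for the cheapest complementary suppressions.

    Returns (best_set, best_cost, patterns_examined). With n candidate
    cells, 2**n patterns are examined. Illustrative sketch only.
    """
    n_rows, n_cols = len(table), len(table[0])
    if candidates is None:
        # Unrestricted solution space: every non-primary cell.
        candidates = [(i, j) for i in range(n_rows) for j in range(n_cols)
                      if (i, j) not in primaries]
    best, best_cost, examined = None, float("inf"), 0
    for k in range(len(candidates) + 1):
        for extra in combinations(candidates, k):
            examined += 1
            if exposed(n_rows, n_cols, set(primaries) | set(extra)):
                continue
            cost = sum(table[i][j] for (i, j) in extra)
            if cost < best_cost:
                best, best_cost = set(extra), cost
    return best, best_cost, examined


# Hypothetical 3x3 table with one primary suppression at (0, 0).
table = [[10, 20, 30],
         [5, 15, 25],
         [40, 50, 60]]

full = brute_force_complements(table, {(0, 0)})
# Restricting the solution space to three nearby cells.
restricted = brute_force_complements(
    table, {(0, 0)}, candidates=[(0, 1), (1, 0), (1, 1)])
```

With eight candidate cells the unrestricted search examines 2**8 = 256 patterns, while the restricted search examines 2**3 = 8. In this toy case both find the same complementary cells, but a restriction can in general exclude the true optimum, which is why a reduced model no longer guarantees optimality.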
We discuss two major initiatives that the U.S. Energy Information Administration (EIA) is leading to improve our statistical disclosure limitation procedures: (1) modernizing our cell suppression software and (2) establishing a formal procedure for conducting disclosure reviews of our data products. To modernize our cell suppression software, we first acquired the research prototype of the Census Bureau's linear programming (LP) cell suppression software in July 2017. Since February 2019, we have been successfully using our modified version of the Census Bureau's LP prototype in production to perform disclosure analysis for year-end estimates of U.S. photovoltaic module shipments by state and territory. In February 2021, we successfully tested our modified version of the Census Bureau's LP prototype in our modernized processing environment, and we plan to use this version of the prototype in our new production environment. In a related effort, we plan to form a Disclosure Review Board and incorporate formal disclosure reviews as part of our production processes to ensure that we consistently apply appropriate statistical disclosure limitation techniques to our data products.