Monday, Aug 7: 8:30 AM - 10:20 AM

1557

Topic-Contributed Paper Session

Metro Toronto Convention Centre

Room: CC-206F

International Indian Statistical Association

Section on Physical and Engineering Sciences

Section on Statistics and the Environment

Most general-purpose classification methods, such as support-vector machine (SVM) and random forest (RF), fail to account for an unusual characteristic of astronomical data: known measurement error uncertainties. In astronomical data, this information is often given in the data but discarded because popular machine learning classifiers cannot incorporate it. We propose a simulation-based approach that incorporates heteroscedastic measurement error into an existing classification method to better quantify uncertainty in classification. The proposed method first simulates perturbed realizations of the data from a Bayesian posterior predictive distribution of a Gaussian measurement error model. Then, a chosen classifier is fit to each simulation. The variation across the simulations naturally reflects the uncertainty propagated from the measurement errors in both labeled and unlabeled data sets. We demonstrate the use of this approach via two numerical studies. The first is a thorough simulation study applying the proposed procedure to SVM and RF, which are well-known hard and soft classifiers, respectively. The second study is a realistic classification problem.
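
The perturb-and-refit loop described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: each simulation simply adds independent Gaussian noise with the reported standard deviation to every measurement (a simplification of the posterior predictive draw), a toy nearest-centroid classifier stands in for SVM or RF, and the function names and data are hypothetical.

```python
import random
from collections import Counter

def fit_centroids(X, y):
    """Toy nearest-centroid classifier (stand-in for SVM/RF): {label: centroid}."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))

def perturb(X, S, rng):
    """One simulated realization: each value gets noise with its own known sigma."""
    return [[rng.gauss(x, s) for x, s in zip(row, srow)] for row, srow in zip(X, S)]

def class_frequencies(X_lab, S_lab, y_lab, X_new, S_new, n_sims=200, seed=0):
    """Refit on perturbed labeled data, classify perturbed unlabeled data, tally votes."""
    rng = random.Random(seed)
    votes = [Counter() for _ in X_new]
    for _ in range(n_sims):
        model = fit_centroids(perturb(X_lab, S_lab, rng), y_lab)
        for i, row in enumerate(perturb(X_new, S_new, rng)):
            votes[i][predict(model, row)] += 1
    return [{lab: c / n_sims for lab, c in v.items()} for v in votes]

# Synthetic illustration: the second unlabeled point is noisy and near the boundary.
X_lab = [[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.9, 2.8]]
S_lab = [[0.1, 0.1]] * 4
y_lab = ["A", "A", "B", "B"]
X_new = [[0.1, 0.0], [1.5, 1.5]]
S_new = [[0.1, 0.1], [1.0, 1.0]]
freqs = class_frequencies(X_lab, S_lab, y_lab, X_new, S_new)
```

Here `freqs[i]` holds the fraction of simulations assigning each label to unlabeled point `i`. A point with small errors far from the class boundary is labeled consistently, while a point with large errors near the boundary splits its votes, which is the spread across simulations that the abstract uses to express classification uncertainty.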

Model fitting and validation with Poisson counting processes have been adopted and implemented in software packages for high-energy physics. Because a heterogeneous Poisson counting process often has a large number of energy bins with zero counts, traditional large-sample approximations are inappropriate in practice. Numerical solutions have been proposed in the astrophysics literature, and astronomers have long been interested in theoretical guarantees for the procedures that they adopt. We study the goodness-of-fit problem with rigorous statistical methods and show the practical implications of our results through numerical studies.
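
As a rough illustration of the kind of numerical approach this abstract alludes to, the sketch below calibrates a Poisson deviance statistic (often called the Cash or C statistic) by Monte Carlo simulation instead of a chi-square approximation. It rests on assumptions not stated in the abstract: the choice of statistic is a guess at what is meant, the model means are treated as fully specified (no refitting in each simulation), and the function names are hypothetical.

```python
import math
import random

def cstat(counts, means):
    """Poisson deviance: 2 * sum(m - n + n*log(n/m)); the log term is 0 when n == 0."""
    total = 0.0
    for n, m in zip(counts, means):
        total += m - n
        if n > 0:
            total += n * math.log(n / m)
    return 2.0 * total

def poisson_draw(mean, rng):
    """Knuth's multiplication method; adequate for the small means of sparse bins."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def mc_pvalue(counts, means, n_sims=2000, seed=0):
    """Monte Carlo p-value of the observed statistic under the specified model means."""
    rng = random.Random(seed)
    observed = cstat(counts, means)
    exceed = 0
    for _ in range(n_sims):
        sim = [poisson_draw(m, rng) for m in means]
        if cstat(sim, means) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_sims + 1)

# Synthetic sparse spectrum: 50 bins with expected 0.2 counts each.
means = [0.2] * 50
bad_counts = [3] * 50            # grossly inconsistent with the model
stat_bad, p_bad = mc_pvalue(bad_counts, means)
```

With expected counts this low, most simulated bins are empty, which is the regime where the abstract says large-sample approximations break down. The simulation sidesteps that approximation but carries no theoretical guarantee of its own, and that gap is what the talk addresses.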

I discuss examples of how we at the CHASC Astrostatistics Collaboration have analyzed high-energy astronomical data (spatial, spatial+temporal, spatial+spectral, spectral+temporal, spatial+spectral+temporal) with approaches tailored to the specific astronomical problems of interest. The differences among these approaches highlight the compromises that were necessary to make progress and the trade-offs needed to maximize the utility of the results. The methods include Bayesian, frequentist, computer vision, and machine learning techniques, several of them used in combination.

Astronomers often deal with data where the covariates and the dependent variable are measured with heteroskedastic, non-Gaussian errors. While techniques have been developed for estimating regression parameters for data with heteroskedasticity and measurement errors, most methods lack procedures for model validation, such as checking structural assumptions. We develop a model validation test, using ideas from conformal prediction, that is invariant to heteroskedasticity and measurement errors. We empirically demonstrate that this new test gives finite-sample control over Type I error probabilities under a variety of assumptions on the measurement errors in the observed data, while other prediction intervals do not. We further demonstrate how our conformal prediction approach can be used for testing structural assumptions of proposed models from the literature relating planet mass and planet radius.
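
For readers unfamiliar with conformal prediction, the sketch below shows the standard split conformal construction that such methods build on. It is not the authors' test: it ignores measurement error in the covariate, uses a plain least-squares line, and runs on synthetic data with hypothetical names. It only illustrates how finite-sample coverage can hold without assuming constant or Gaussian noise.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def split_conformal(x_fit, y_fit, x_cal, y_cal, alpha=0.1):
    """Fit on one split, then take a calibration-residual quantile as the half-width."""
    intercept, slope = fit_line(x_fit, y_fit)
    scores = sorted(abs(yc - (intercept + slope * xc)) for xc, yc in zip(x_cal, y_cal))
    k = math.ceil((len(scores) + 1) * (1 - alpha))   # finite-sample corrected rank
    half_width = scores[k - 1] if k <= len(scores) else math.inf
    return intercept, slope, half_width

def simulate(n, rng):
    """Synthetic data whose noise level grows with x (heteroskedastic)."""
    xs = [rng.uniform(0.0, 2.0) for _ in range(n)]
    ys = [1.0 + 0.5 * x + rng.gauss(0.0, 0.1 + 0.2 * x) for x in xs]
    return xs, ys

rng = random.Random(1)
x_fit, y_fit = simulate(500, rng)
x_cal, y_cal = simulate(500, rng)
x_new, y_new = simulate(2000, rng)

intercept, slope, half_width = split_conformal(x_fit, y_fit, x_cal, y_cal, alpha=0.1)
covered = sum(abs(y - (intercept + slope * x)) <= half_width
              for x, y in zip(x_new, y_new))
coverage = covered / len(x_new)   # should be close to the nominal 0.9
```

The coverage guarantee here is marginal and relies only on exchangeability of the calibration and new points. Because a prediction interval can be inverted into a test, observations falling outside the interval more often than `alpha` would count as evidence against the fitted structural form.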