Wednesday, Aug 9: 8:30 AM - 10:20 AM
1668
Topic-Contributed Paper Session
Metro Toronto Convention Centre
Room: CC-205B
Applied
Yes
Main Sponsor
Health Policy Statistics Section
Co Sponsors
Committee on Applied Statisticians
Section for Statistical Programmers and Analysts
Presentations
Artificial intelligence (AI) and single-cell studies have been making waves in the science and technology communities. AI offers a broad range of methods that can be used to investigate diverse data- and hypothesis-driven questions in single-cell biology (Ma, Q., Xu, D. Deep learning shapes single-cell data analysis. Nat Rev Mol Cell Biol, 2022). The highly heterogeneous nature of single-cell data can be analyzed across a wide range of research topics by generalizing deep-learning model design and optimization in a hypothesis-free manner. This talk will introduce in-house graph representation learning methods for gene expression data to discover underlying mechanisms in diverse biological systems.
Speaker
Qin Ma, The Ohio State University
In the analysis of single-cell RNA sequencing data, researchers often first characterize the variation between cells by estimating a latent variable, representing some aspect of the individual cell state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control or nominal coverage. Furthermore, approaches such as sample splitting that can be fruitfully applied to solve similar problems in other settings are not applicable in this context. We introduce count splitting, an extremely flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes.
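The splitting step described above can be sketched in a few lines. This is a minimal illustration based on the standard Poisson thinning property, not the authors' implementation: if a count X is Poisson with mean m, and X_train given X is Binomial(X, eps), then X_train and X_test = X - X_train are independent Poisson counts with means eps*m and (1 - eps)*m. The function name `count_split`, the parameter `eps`, and the simulated matrix are assumptions made for this example.

```python
import numpy as np


def count_split(counts, eps=0.5, seed=None):
    """Split a matrix of counts into two matrices by binomial thinning.

    Hypothetical helper for illustration. Under a Poisson assumption on
    `counts`, the two returned matrices are independent, so one can be used
    to estimate a latent variable and the other to test genes against it.
    """
    counts = np.asarray(counts)
    if not np.issubdtype(counts.dtype, np.integer):
        raise TypeError("counts must be an integer array")
    if (counts < 0).any():
        raise ValueError("counts must be non-negative")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")

    rng = np.random.default_rng(seed)
    # Each observed count is divided at random: every unit goes to the
    # training matrix with probability eps, otherwise to the test matrix.
    train = rng.binomial(counts, eps)
    test = counts - train
    return train, test


# Simulated cells-by-genes matrix (200 cells, 50 genes), for illustration only.
X = np.random.default_rng(0).poisson(5.0, size=(200, 50))
X_train, X_test = count_split(X, eps=0.5, seed=1)
```

In this sketch, `X_train` would feed the latent variable estimation (for example, clustering or a pseudotime method) and `X_test` would feed the per-gene association tests. The choice of `eps` trades information between the two steps.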
Cell type identification is a key step in analyzing single-cell RNA-seq (scRNA-seq) data. Existing methods typically follow a cluster-then-annotate approach: cells are clustered using principal components that capture the reduced dimensionality of highly variable genes, and cell type annotation is then performed separately for each cell cluster using external marker gene information. This separation can lead to poor annotation because of discrepancies between the clustering and cell-type annotation subspaces. We propose a novel two-graph fusion technique that judiciously integrates complementary information from both marker and variable gene sets to perform simultaneous dimensionality reduction and cell type annotation. We directly incorporate marker gene information into uniform manifold approximation and projection (UMAP) to improve cell-type predictions. Through comprehensive evaluations on several real scRNA-seq datasets spanning cancerous tissues (melanoma, colorectal carcinoma, and brain metastasis) as well as normal tissues from human and mouse, we demonstrate the efficacy of the proposed method over state-of-the-art cell-type annotation approaches.
I will describe two techniques for the analysis of single-cell sequencing data. (1) Forest Fire Clustering. This is an efficient and interpretable method for cell-type discovery from single-cell data. It makes minimal prior assumptions and, unlike current approaches, calculates a non-parametric posterior probability that each cell is assigned a cell-type label. These posterior distributions allow for the evaluation of a label confidence for each cell and enable the computation of "label entropies," highlighting transitions along developmental trajectories. (2) SCAN-ATAC-Sim. It is difficult to benchmark the performance of various scATAC-seq analysis techniques (such as clustering and deconvolution) without a known set of gold-standard cell types. To simulate scATAC-seq experiments with known cell-type labels, we introduce an efficient and scalable scATAC-seq simulation method that down-samples bulk ATAC-seq data (e.g., from representative cell lines or tissues). Our protocol uses a consistent but tunable signal-to-noise ratio across cell types in a scATAC-seq simulation.
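As a rough illustration of the "label entropies" mentioned under (1), the sketch below computes a Shannon entropy from each cell's posterior label probabilities. The abstract does not give the exact definition used in Forest Fire Clustering, so the Shannon form, the function name `label_entropy`, and the toy probabilities are assumptions for this example.

```python
import numpy as np


def label_entropy(posteriors):
    """Shannon entropy (in nats) of each cell's posterior over labels.

    Hypothetical helper for illustration. `posteriors` is a cells-by-labels
    array of non-negative weights; rows are normalized to sum to 1.
    """
    p = np.asarray(posteriors, dtype=float)
    p = p / p.sum(axis=1, keepdims=True)
    # Treat 0 * log(0) as 0, the usual convention for entropy.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p), 0.0)
    return -terms.sum(axis=1)


# Toy posteriors for three cells over three candidate labels.
toy = np.array([
    [1.0, 0.0, 0.0],  # confidently labeled cell
    [0.5, 0.5, 0.0],  # cell split between two labels
    [1.0, 1.0, 1.0],  # unnormalized weights, maximally uncertain
])
entropies = label_entropy(toy)
```

Under this definition, a cell with a single dominant label has entropy near zero, while a cell whose posterior is spread over several labels has higher entropy, which is the kind of signal that could flag cells in transition between states.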