chromatin accessibility, and imaging transcriptomics datasets, and show that Augur outperforms existing methods based on differential gene expression. Augur identified the neural circuits restoring locomotion in mice following spinal cord neurostimulation.

Within a decade, single-cell technologies have scaled from individual cells to entire organisms [1,2]. Investigators are now able to quantify RNA and protein expression, resolve their spatial organization in complex tissues, and dissect their regulation in hundreds of thousands of cells. This exponential increase in scale is enabling a transition from atlasing of healthy tissues to delineating the cell type-specific responses to disease and experimental perturbation [3,4]. This shift requires a parallel analytical transition, from cataloguing the marked molecular differences between cell types to resolving more subtle phenotypic alterations within cell types.

Existing tools focus on identifying individual genes or proteins with statistically significant differences between conditions [5]. However, inferences at the level of individual analytes are ill-suited to address the broader question of which cell types are most responsive to a perturbation in the multidimensional space of single-cell data. Such prioritizations could clarify the contribution of each cell type to organismal phenotypes such as disease state, or identify cellular subpopulations that mediate the response to external stimuli such as drug treatment. Cell type prioritization could also guide downstream investigation, including the selection of experimental systems such as Cre lines or FACS gates to support causal experiments. However, investigators currently lack bespoke tools to identify cell types affected by perturbation.

Here, we introduce Augur, a versatile method to prioritize cell types based on their molecular response to a biological perturbation (Fig. 1a).
We reasoned that cell types most responsive to a perturbation should be more separable, within the multidimensional space of single-cell measurements, than less affected ones, and that the relative difficulty of this separation would provide a quantitative basis for cell type prioritization. We formalized this difficulty as a classification task, asking how accurately disease or perturbation state could be predicted from highly multidimensional single-cell measurements. For each cell type, Augur withholds a proportion of sample labels, and trains a classifier on the labeled subset. The classifier predictions are compared with the experimental labels, and cell types are prioritized based on the area under the receiver operating characteristic curve (AUC) of these predictions in cross-validation.

Fig. 1. Augur correctly prioritizes cell types in synthetic and experimental single-cell datasets. a, Schematic overview of Augur. b, AUCs of Augur and a naive random forest classifier without subsampling in simulated scRNA-seq datasets made up of increasing numbers of cells. Cell type prioritizations are confounded by training dataset size for the naive classifier, but Augur abolishes this confounding factor. The mean and standard deviation of n = 10 independent simulations are shown. Dotted lines show linear regression; shaded areas show 95% confidence intervals. c, Pearson correlations between the AUC of each cell type and the number of cells of that type sequenced, across a compendium of 22 scRNA-seq datasets, for Augur and a naive random forest classifier without subsampling. d, Augur AUCs scale monotonically with both the proportion of differentially expressed (DE) genes and the magnitude of DE in simulated cell populations of n = 200 cells.
e, Relationship between the number of DE genes detected by a representative test for single-cell differential gene expression (Wilcoxon rank-sum test) and the proportion of DE genes simulated between the two populations, for simulated populations of between n = 100 and n = 1,000 cells. f, Cell type prioritization in simulated scRNA-seq data from a tissue with 5,000 cells, eight cell types and increasingly unequal numbers of cells per type, as quantified by the Gini coefficient. The Pearson correlation with the simulation ground truth (proportion of DE genes) is shown for Augur and a representative test for single-cell DE (Wilcoxon rank-sum test). The mean and standard deviation of n = 10 independent simulations are shown. Dashed line shows the mean cell type Gini coefficient across n = 22 published scRNA-seq datasets (0.52). **, p < 0.01; ***, p < 0.001, two-sided paired t-test. g, Pearson correlation between cell type prioritizations (AUC/number of DE genes) and simulation ground truth for Augur and six tests for single-cell DE in simulated tissues containing.
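
The prioritization idea described above (score each cell type by how well a classifier separates conditions in cross-validation, then rank cell types by AUC) can be illustrated with a minimal sketch. This is not Augur's implementation: the nearest-centroid classifier is a deliberately simple stand-in, the fixed-size subsampling step is an assumption suggested by the contrast with the "naive classifier without subsampling", and all function names, parameters, and the simulated data are hypothetical.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Per-gene mean expression across a list of cells."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_validated_auc(X, y, n_subsample=20, folds=4, rng=None):
    """Subsample a fixed number of cells per condition, then score held-out cells.

    Assumption: fixed-size subsampling keeps the score comparable across
    cell types that differ in abundance.
    """
    if rng is None:
        rng = random.Random(0)
    idx = []
    for label in (0, 1):
        members = [i for i, v in enumerate(y) if v == label]
        idx += rng.sample(members, n_subsample)
    rng.shuffle(idx)
    scores, truth = [], []
    for k in range(folds):
        test = idx[k::folds]  # labels withheld in this fold
        train = [i for i in idx if i not in test]
        c0 = centroid([X[i] for i in train if y[i] == 0])
        c1 = centroid([X[i] for i in train if y[i] == 1])
        for i in test:
            # Higher score means the cell lies closer to the perturbed centroid.
            scores.append(sq_dist(X[i], c0) - sq_dist(X[i], c1))
            truth.append(y[i])
    # Held-out scores from all folds are pooled before computing the AUC.
    return auc(scores, truth)


def simulate_cell_type(n_cells, n_genes, shift, rng):
    """Toy data: perturbed cells (label 1) have every gene shifted by `shift`."""
    X, y = [], []
    for i in range(n_cells):
        label = i % 2
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_genes)])
        y.append(label)
    return X, y


rng = random.Random(42)
cell_types = {"responsive": 1.0, "unaffected": 0.0}  # hypothetical effect sizes
aucs = {}
for name, shift in cell_types.items():
    X, y = simulate_cell_type(200, 10, shift, rng)
    aucs[name] = cross_validated_auc(X, y, rng=rng)

# Cell types ranked from most to least separable.
ranking = sorted(aucs, key=aucs.get, reverse=True)
print(ranking, aucs)
```

In this toy setting the cell type with a simulated expression shift should receive an AUC near 1, while the unaffected type should score near 0.5 (chance), which is the sense in which separability gives a quantitative basis for ranking.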