In a study in which a treatment alters the composition of cells, subjects in the treatment and control groups may have different numbers of cells of each cell type. In our simulation study, we also found that the pseudobulk method was conservative, but in some settings, mixed models had inflated FDR. The number of UMIs for cell c was taken to be the size factor $s_{jc}$ in stage 3 of the proposed model. Each panel shows results for 100 simulated datasets in one simulation setting. Further, applying computational methods that account for all sources of variation will be necessary to gain better insights into biological systems, operating at the granular level of cells all the way up to the level of populations of subjects. Crowell et al. These approaches will likely yield better type I and type II error rate control, but as we saw for the mixed method in our simulation, the computation times can be substantially longer, and the computational burden of these methods scales with the number of cells, whereas that of the pseudobulk method scales with the number of subjects. The study by Zimmerman et al. 10e-20) with a different symbol at the top of the graph. Next, we matched the empirical moments of the distributions of $E_{ijc}$ and $E_{ij}$ to the population moments. Multiple methods and bioinformatic tools exist for initial scRNA-seq data processing, including normalization, dimensionality reduction, visualization, cell type identification, lineage relationships and differential gene expression (DGE) analysis (Chen et al., 2019; Hwang et al., 2018; Luecken and Theis, 2019; Vieth et al., 2019; Zaragosi et al., 2020). However, a better approach is to avoid using P-values as quantitative or rankable results in plots; they are not meant to be used in that way.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/). Article DOI: https://doi.org/10.1093/bioinformatics/btab337. Bioconductor package: https://www.bioconductor.org/packages/release/bioc/html/aggregateBioVar.html. The expression parameter for the difference between groups 1 and 2, $\beta_{i2}$, was varied in order to evaluate the properties of DS analysis under a number of different scenarios. Supplementary Table S1 shows performance measures derived from these curves. Define the aggregated counts $K_{ij}=\sum_{c} K_{ijc}$, and let $s_j=\sum_{c} s_{jc}$. Specifically, the CDFs are in high agreement for the subject method in the range of P-values from 0 to 0.2, whereas the mixed method has a slight inflation of small P-values in the same range compared to the permutation test. (a) Volcano plots and (b) heatmaps of top 50 genes for 7 different DS analysis methods. Among the other five methods, when the number of differentially expressed genes was small (pDE = 0.01), the mixed method had the highest PPV values, whereas for higher numbers of differentially expressed genes (pDE > 0.01), the DESeq2 method had the highest PPV values. Figure 3(b and c) shows the PPV and negative predictive value (NPV) for each method and simulation setting under an adjusted P-value cutoff of 0.05. provides an argument for using mixed models over pseudobulk methods because pseudobulk methods discovered fewer differentially expressed genes. Carver College of Medicine, University of Iowa.
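The aggregation defined above, in which the counts $K_{ijc}$ and size factors $s_{jc}$ are summed over the cells $c$ of each subject $j$, can be sketched in a few lines. This is a minimal Python illustration, not the authors' implementation (aggregateBioVar itself is an R package); the function name `aggregate_by_subject`, the dictionary layout and the toy cell and subject names are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_subject(cell_counts, cell_subject, cell_size_factor):
    """Sum counts K_ijc and size factors s_jc over the cells c of each subject j.

    cell_counts:      {cell: {gene: count}}   (hypothetical layout)
    cell_subject:     {cell: subject}
    cell_size_factor: {cell: size factor, e.g. number of UMIs}
    """
    counts = defaultdict(lambda: defaultdict(int))  # subject -> gene -> K_ij
    size_factors = defaultdict(int)                 # subject -> s_j
    for cell, subject in cell_subject.items():
        size_factors[subject] += cell_size_factor[cell]
        for gene, k in cell_counts[cell].items():
            counts[subject][gene] += k
    return {s: dict(g) for s, g in counts.items()}, dict(size_factors)


# Toy data: three cells from two subjects (all values are made up).
cell_counts = {
    "c1": {"CFTR": 3, "KRT5": 0},
    "c2": {"CFTR": 1, "KRT5": 2},
    "c3": {"CFTR": 4, "KRT5": 5},
}
cell_subject = {"c1": "s1", "c2": "s1", "c3": "s2"}
cell_size_factor = {"c1": 100, "c2": 200, "c3": 150}

K, s = aggregate_by_subject(cell_counts, cell_subject, cell_size_factor)
```

The aggregated table has one column per subject rather than per cell, which is consistent with the observation that the cost of the pseudobulk method scales with the number of subjects.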
Then, for each method, we defined the permutation test statistic to be the unadjusted P-value generated by the method. Here, we present a highly configurable function that produces publication-ready volcano plots. In addition to simulated data, we analysed an animal model dataset containing large and small airway epithelia from CF and non-CF pigs (Rogers et al., 2008). (b) CD66+ basal cells were identified via detection of CEACAM5 or CEACAM6. The use of the dotplot is only meaningful when the counts matrix contains zeros representing no gene counts. Four of the methods were applications of the FindMarkers function in the R package Seurat (Butler et al., 2018). For each method, the computed P-values for all genes were adjusted to control the FDR using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). In order to contrast DS analysis with cells as units of analysis versus subjects as units of analysis, we analysed both simulated and experimental data. In terms of identifying the true positives, wilcox and mixed had better performance (TPR = 0.62 and 0.56, respectively) than subject (TPR = 0.34). With this data you can now make a volcano plot. Repeat for all cell clusters/types of interest, depending on your research questions. Here is the volcano plot. I read before that we are not allowed to do differential gene expression analysis using the integrated data. Performance measures for DS analysis of simulated data. The results of our comparisons are shown in Figure 6. (c) Volcano plots show results of three methods (subject, wilcox and mixed) used to identify CD66+ and CD66- basal cell marker genes. The lists of genes detected by the other six methods likely contain many false discoveries. In bulk RNA-seq studies, gene counts are often assumed to follow a negative binomial distribution (Hardcastle and Kelly, 2010; Leng et al., 2013; Love et al., 2014; Robinson et al., 2010).
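The Benjamini-Hochberg adjustment mentioned above can be written compactly. The sketch below is a plain-Python illustration with a function name chosen for the example; in R, `p.adjust(p, method = "BH")` performs the same step.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
detected = [p < 0.05 for p in adj]  # genes called at an FDR cutoff of 0.05
```

Genes whose adjusted P-value falls below the chosen cutoff (0.05 in the analyses described here) are the detected genes.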
(c and d) Volcano plots show results of three methods (subject, wilcox and mixed) used to find differentially expressed genes between IPF and healthy lungs in (c) AT2 cells and (d) AM. Applying themes to plots. Suppose that the cell-level variance $\sigma_{ij}^2 \to 0$. Supplementary Figure S10 shows concordance between adjusted P-values for each method. In each panel, PR curves are plotted for each of seven DS analysis methods: subject (red), wilcox (blue), NB (green), MAST (purple), DESeq2 (orange), Monocle (gold) and mixed (brown). Subject-level gene expression scores were computed as the average counts per million for all cells from each subject. If subjects are composed of different proportions of types A and B, DS results could be due to different cell compositions rather than different mean expression levels. Further, subject has the highest AUPR (0.21) followed by mixed (0.14) and wilcox (0.08). The other two methods were Monocle, which used a negative binomial generalized additive model to test for differences in gene expression using the R package Monocle (Qiu et al., 2017a, b; Trapnell et al., 2014), and mixed, which modeled counts using a negative binomial generalized linear mixed model with a random effect to account for differences in gene expression between subjects; for mixed, DS testing was performed using a Wald test.
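As an illustration of the subject-level expression scores described above, the following sketch computes counts per million (CPM) within each cell and then averages over the cells of each subject. This is one reading of "average counts per million for all cells from each subject" and is an assumption; the function name, the data layout and the choice to skip cells with zero total counts are likewise made up for the example.

```python
def subject_cpm_scores(cell_counts, cell_subject):
    """Average per-cell CPM over the cells of each subject.

    cell_counts:  {cell: {gene: count}}, every cell listing the same genes
    cell_subject: {cell: subject}
    """
    per_subject = {}  # subject -> list of per-cell CPM dictionaries
    for cell, counts in cell_counts.items():
        total = sum(counts.values())
        if total == 0:
            continue  # a cell with no counts has no defined CPM
        cpm = {gene: k / total * 1e6 for gene, k in counts.items()}
        per_subject.setdefault(cell_subject[cell], []).append(cpm)

    scores = {}
    for subject, cells in per_subject.items():
        scores[subject] = {
            gene: sum(c[gene] for c in cells) / len(cells) for gene in cells[0]
        }
    return scores


toy_counts = {"c1": {"A": 1, "B": 3}, "c2": {"A": 2, "B": 2}}
toy_subject = {"c1": "s1", "c2": "s1"}
scores = subject_cpm_scores(toy_counts, toy_subject)
```

Because each cell contributes equally to the average, a subject's score still depends on its mix of cell types, which is the composition issue raised above.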

    # This can be changed with the `group.by` parameter
    # Use community-created themes, overwriting the default Seurat-applied theme Install ggmin
    # with remotes::install_github('sjessa/ggmin')
    # Seurat also provides several built-in themes, such as DarkTheme; for more details see
    # Include additional data to display alongside cell names by passing in a data frame of
    # information Works well when using FetchData
    ## [1] "AAGATTACCGCCTT" "AAGCCATGAACTGC" "AATTACGAATTCCT" "ACCCGTTGCTTCTA"
    # Now, we find markers that are specific to the new cells, and find clear DC markers
    ##                 p_val avg_log2FC pct.1 pct.2    p_val_adj
    ## FCER1A   3.239004e-69  3.7008561 0.800 0.017 4.441970e-65
    ## SERPINF1 7.761413e-36  1.5737896 0.457 0.013 1.064400e-31
    ## HLA-DQB2 1.721094e-34  0.9685974 0.429 0.010 2.360309e-30
    ## CD1C     2.304106e-33  1.7785158 0.514 0.025 3.159851e-29
    ## ENHO     5.099765e-32  1.3734708 0.400 0.010 6.993818e-28
    ## ITM2C    4.299994e-29  1.5590007 0.371 0.010 5.897012e-25
    ## [1] "selected"     "Naive CD4 T"  "Memory CD4 T" "CD14+ Mono"   "B"
    ## [6] "CD8 T"        "FCGR3A+ Mono" "NK"           "Platelet"
    # LabelClusters and LabelPoints will label clusters (a coloring variable) or individual points
    # Both functions support `repel`, which will intelligently stagger labels and draw connecting
    # lines from the labels to the points or clusters
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
    ## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
    ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
    ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
    ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
    ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
    ## [9] LC_ADDRESS=C LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    ## [1] stats graphics grDevices utils datasets methods base
    ## [1] patchwork_1.1.2 ggplot2_3.4.1
    ## [3] thp1.eccite.SeuratData_3.1.5 stxBrain.SeuratData_0.1.1
    ## [5] ssHippo.SeuratData_3.1.4 pbmcsca.SeuratData_3.0.0
    ## [7] pbmcMultiome.SeuratData_0.1.2 pbmc3k.SeuratData_3.1.4
    ## [9] panc8.SeuratData_3.0.2 ifnb.SeuratData_3.1.0
    ## [11] hcabm40k.SeuratData_3.0.0 bmcite.SeuratData_0.3.0
    ## [13] SeuratData_0.2.2 SeuratObject_4.1.3

6a) and plotting well-known markers of these two cell types (Fig. For a sequence of cutoff values between 0 and 1, precision, also known as positive predictive value (PPV), is the fraction of genes with adjusted P-values less than a cutoff (detected genes) that are differentially expressed. These methods provide interpretable results that generalize to a population of research subjects, account for important sources of biological and technical variability and provide adequate FDR control. The general process for detecting genes then would be: Repeat for all cell clusters/types of interest, depending on your research questions. Marker detection methods were found to have unacceptable FDR due to pseudoreplication bias, in which cells from the same individual are correlated but treated as independent replicates, and pseudobulk methods were found to be too conservative, in the sense that too many differentially expressed genes were undiscovered. (e and f) ROC and PR curves for subject, wilcox and mixed methods using bulk RNA-seq as a gold standard for (e) AT2 cells and (f) AM. This figure suggests that the methods that account for between-subject differences in gene expression (subject and mixed) will detect different sets of genes than the methods that treat cells as the units of analysis. We evaluated the performance of our tested approaches for human multi-subject DS analysis in health and disease. In order to determine the reliability of the unadjusted P-values computed by each method, we compared them to the unadjusted P-values obtained from a permutation test.
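The definition of precision given above translates directly into code. The sketch below also computes recall (the true positive rate), which is the standard companion measure in a PR curve; the function name and the toy values are assumptions for illustration only.

```python
def precision_recall(adjusted_p, is_de, cutoff):
    """Precision (PPV) and recall (TPR) at a single adjusted P-value cutoff.

    adjusted_p: adjusted P-values, one per gene
    is_de:      True where the gene is truly differentially expressed
    """
    detected = [p < cutoff for p in adjusted_p]
    true_pos = sum(1 for d, t in zip(detected, is_de) if d and t)
    n_detected = sum(detected)
    n_true = sum(is_de)
    # Precision is undefined when nothing is detected; recall when nothing is DE.
    precision = true_pos / n_detected if n_detected else float("nan")
    recall = true_pos / n_true if n_true else float("nan")
    return precision, recall


toy_adjusted = [0.01, 0.20, 0.03, 0.60]
toy_truth = [True, True, False, False]
ppv, tpr = precision_recall(toy_adjusted, toy_truth, cutoff=0.05)

# Sweeping the cutoff over a grid between 0 and 1 traces out a PR curve.
curve = [precision_recall(toy_adjusted, toy_truth, c / 10) for c in range(1, 10)]
```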
To illustrate scalability and performance of various methods in real-world conditions, we show results in a porcine model of cystic fibrosis and analyses of skin, trachea and lung tissues in human sample datasets. We have found this particularly useful for small clusters that do not always separate using unbiased clustering, but which look tantalizingly distinct. The main idea of the theorem is that if gene counts are summed across cells and the number of cells grows large for each subject, the influence of cell-level variation on the summed counts is negligible. The other six methods involved DS testing with cells as the units of analysis. #' @return Returns a volcano plot from the output of the FindMarkers function from the Seurat package, which is a ggplot object that can be modified or plotted. (a) AUPR, (b) PPV with adjusted P-value cutoff 0.05 and (c) NPV with adjusted P-value cutoff 0.05 for 7 DS analysis methods. Two of the methods had much longer computation times, with DESeq2 running for 186 min and mixed running for 334 min. Standard normalization, scaling, clustering and dimension reduction were performed using the R package Seurat version 3.1.1 (Butler et al., 2018; Satija et al., 2015; Stuart et al., 2019). In that case, the number of modes in the expression distribution in the CF group (bimodal) and the non-CF group (unimodal) would be different, but the pseudobulk method may not detect a difference, because it is only able to detect differences in mean expression. These results suggest that only the subject method will exhibit appropriate type I error rate control. If $z_{jc1}, z_{jc2}, \ldots, z_{jcL}$ are $L$ cell-level covariates, then a log-linear regression model could take the form $\log \lambda_{ijc} = \sum_{l} z_{jcl}\,\beta_{ijl}$. Supplementary Figure S14 shows the results of marker detection for T cells and macrophages.
(2019) used scRNA-seq to profile cells from the lungs of healthy subjects and those with pulmonary fibrosis disease subtypes, including hypersensitivity pneumonitis, systemic sclerosis-associated and myositis-associated interstitial lung diseases and IPF (Reyfman et al., 2019). To obtain permutation P-values, we measured the proportion of permutation test statistics less than or equal to the observed test statistic, which is the permutation test statistic under the observed labels. Figure 2 shows precision-recall (PR) curves averaged over 100 simulated datasets for each simulation setting and method. If a gene was not differentially expressed, the value of $\beta_{i2}$ was set to 0. To avoid confounding the results by disease, this analysis is confined to data from six healthy subjects in the dataset. See ?FindMarkers in the Seurat package for all options. NCF = non-CF. Iowa Institute of Human Genetics, Roy J. and Lucille A. In the second stage, the observed data for each gene, measured as a count, is assumed to follow a Poisson distribution with mean equal to the product of a size factor, such as sequencing depth, and gene expression generated in the first stage. The volcano plot for the subject method shows three genes with adjusted P-value < 0.05 ($-\log_{10}(\mathrm{FDR}) > 1.3$), whereas the other six methods detected a much larger number of genes. Marker detection methods allow quantification of variation between cells and exploration of expression heterogeneity within tissues.
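The permutation P-value described above, the proportion of permutation test statistics less than or equal to the observed one, can be sketched as follows. The test statistic is the unadjusted P-value returned by a method, so small values are the extreme ones. The driver `permutation_test`, the stand-in `toy_method` and the choice to shuffle group labels with a seeded generator are assumptions for illustration, not the authors' code.

```python
import random


def permutation_p_value(observed_stat, permuted_stats):
    """Proportion of permutation statistics <= the observed statistic."""
    return sum(s <= observed_stat for s in permuted_stats) / len(permuted_stats)


def permutation_test(values, labels, method_p_value, n_perm=200, seed=0):
    """Shuffle group labels n_perm times and compare against the observed labels.

    method_p_value(values, labels) is a hypothetical callable that returns the
    unadjusted P-value (the test statistic) of one DS method.
    """
    rng = random.Random(seed)
    observed = method_p_value(values, labels)
    shuffled = list(labels)
    permuted = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        permuted.append(method_p_value(values, shuffled))
    return permutation_p_value(observed, permuted)


def toy_method(values, labels):
    """Made-up stand-in for a DS method: smaller output = stronger evidence."""
    a = [v for v, g in zip(values, labels) if g == "A"]
    b = [v for v, g in zip(values, labels) if g == "B"]
    diff = abs(sum(a) / len(a) - sum(b) / len(b))
    return 1.0 / (1.0 + diff)


values = [1.0, 1.2, 0.9, 3.1, 2.8, 3.3]
labels = ["A", "A", "A", "B", "B", "B"]
p_perm = permutation_test(values, labels, toy_method, n_perm=200, seed=1)
```

Comparing these permutation P-values with the unadjusted P-values reported by each method is what indicates whether a method's own P-values can be trusted.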
To characterize these sources of variation, we consider the following three-stage model: In stage i, variation in expression between subjects is due to differences in covariates via the regression function $q_{ij}$ and residual subject-to-subject variation via the dispersion parameter $\phi_i$. True positives were identified as those genes in the bulk RNA-seq analysis with FDR < 0.05 and $|\log_2(\mathrm{IPF}/\mathrm{healthy})| > 1$. We can then change the identity of these cells to turn them into their own mini-cluster. First, we present a statistical model linking differences in gene counts at the cellular level to four sources: (i) subject-specific factors (e.g. First, a random proportion of genes, pDE, were flagged as differentially expressed. Infinite p-values are set defined value of the highest . Plotting multiple plots was previously achieved with the CombinePlot() function. In contrast, single-cell experiments contain an additional source of biological variation between cells. For each setting, 100 datasets were simulated, and we compared seven different DS methods. The following equations are identical: .