The Single-cell Pediatric Cancer Atlas: Data portal and open-source tools for single-cell transcriptomics of pediatric tumors

This manuscript (permalink) was automatically generated from AlexsLemonade/ScPCA-manuscript@804c5ba on May 11, 2026.

Authors

✉ — Correspondence possible via GitHub Issues or email to Jaclyn N. Taroni <jaclyn.taroni@ccdatalab.org>.

Summary

The Single-cell Pediatric Cancer Atlas (ScPCA) Portal (https://scpca.alexslemonade.org/) is a data resource for uniformly processed single-cell and single-nuclei RNA sequencing (RNA-seq) data and de-identified metadata from pediatric tumor samples. Originally comprised of data from 10 projects funded by Alex’s Lemonade Stand Foundation (ALSF), the Portal currently contains summarized gene expression data for over 700 samples across 55 cancer types from ALSF-funded and community-contributed datasets. Downloads include gene expression data as SingleCellExperiment or AnnData objects containing raw and normalized counts, PCA and UMAP coordinates, automated cell type annotations, and copy-number variation estimates, along with summary reports. Some samples have additional data from bulk RNA-seq, spatial transcriptomics, and/or feature barcoding (e.g., CITE-seq and cell hashing) included in the download. All data on the Portal were uniformly processed using scpca-nf, an efficient and open-source Nextflow workflow that uses alevin-fry to quantify gene expression. Comprehensive documentation, including descriptions of file contents and a guide to getting started, is available at https://scpca.readthedocs.io.

Introduction

The number of studies employing single-cell RNA-seq (scRNA-seq) has grown rapidly since this technology was introduced1. Unlike its predecessor, bulk RNA-seq, which averages expression profiles of all cells within a sample, single-cell technology quantifies gene expression in individual cells. Tumors are transcriptionally heterogeneous, highlighting the importance of using scRNA-seq in studying tumor samples2. Researchers can use scRNA-seq data from patient tumor samples to analyze and identify individual cell populations that may influence tumor growth, resistance, and metastasis3, as well as to examine how tumor cells interact with normal cells in the tumor microenvironment4.

As the number of scRNA-seq datasets expands, efforts have emerged to create centralized data resources. For example, CELLxGENE5,6 offers gene expression data from samples spanning hundreds of cell types in standardized analysis formats. Other resources such as the Human Cell Atlas (HCA) and Human Tumor Atlas Network (HTAN) offer harmonized data, enabling reliable cross-sample comparisons and discovery across diverse biological contexts and disease types. The HCA, which provides a comprehensive map of all cell types in the human body using single-cell genomics, contains uniformly processed scRNA-seq data obtained from normal tissue with few samples derived from diseased tissue7. The HTAN also hosts a collection of genomic data from tumors across multiple cancer types, including scRNA-seq8.

Despite these resources, there have been considerably fewer efforts to harmonize and distribute data specifically from pediatric tumors. As pediatric cancer is much less common than adult cancer, fewer samples are available and access to pediatric tumor data is often limited9. Recently, Xu and colleagues highlighted lack of standardization of pediatric cancer single-cell data as a barrier to reuse in their attempt to create an atlas10. Thus, it is imperative to make harmonized data from pediatric tumors accessible to researchers11. To address this unmet need, Alex’s Lemonade Stand Foundation (ALSF) and the Childhood Cancer Data Lab developed and maintain the Single-cell Pediatric Cancer Atlas (ScPCA) Portal (https://scpca.alexslemonade.org/), a data resource for single-cell and single-nuclei RNA-seq data of pediatric tumor samples.

The ScPCA Portal holds uniformly processed summarized gene expression from 10x Genomics droplet-based single-cell and single-nuclei RNA-seq for over 700 samples from a diverse set of 55 types of pediatric cancers. Originally comprised of data from 10 projects funded by ALSF, the Portal has since expanded to include data contributed by pediatric cancer research community members. The Portal additionally includes data obtained from bulk RNA-seq, spatial transcriptomics, and feature barcoding methods such as CITE-seq and cell hashing. All data on the Portal are available in formats ready for downstream analysis with common workflow ecosystems, including SingleCellExperiment objects used by R/Bioconductor12 or AnnData objects used by Scanpy and related Python modules13. Downloaded objects contain both raw and normalized gene expression counts, dimensionality reduction results, cell type annotations, and copy-number variation estimates. As of May 2026, over 950 unique downloaders had accessed the Portal since its launch.

To ensure uniform processing of all current and future Portal data, we created scpca-nf, an open-source Nextflow14 pipeline (https://github.com/AlexsLemonade/scpca-nf). This pipeline increases transparency and facilitates analyses across multiple samples and projects without re-processing. The scpca-nf workflow uses alevin-fry15 for fast and efficient quantification of single-cell gene expression for all samples on the Portal, including scRNA-seq data and any associated CITE-seq or cell hash data. The scpca-nf pipeline is also a resource, allowing researchers to process their own samples for comparison to Portal data and/or for submission to the Portal.

Here, we present the Single-cell Pediatric Cancer Atlas as a freely-available resource for all pediatric cancer researchers. The ScPCA Portal provides downloads ready for immediate use, allowing researchers to skip time-consuming data re-processing and wrangling steps. We provide comprehensive documentation about data processing and the contents of the Portal, including a guide to getting started working with an ScPCA dataset (https://scpca.readthedocs.io/). The ScPCA Portal advances pediatric cancer research by accelerating researchers’ ability to answer important biological questions.

Results

The Single-cell Pediatric Cancer Atlas Portal

In March 2022, the Childhood Cancer Data Lab launched the Single-cell Pediatric Cancer Atlas (ScPCA) Portal to make uniformly processed, summarized single-cell and single-nuclei RNA-seq data and de-identified metadata from pediatric tumor samples openly available for download by the research community. Data on the Portal were obtained through two mechanisms: raw data were accepted from ALSF-funded investigators and processed using our open-source pipeline scpca-nf, or investigators processed their raw data using scpca-nf and submitted the output for inclusion on the Portal.

All samples on the Portal include a core set of metadata obtained from investigators, including age, sex, diagnosis, subdiagnosis (if applicable), tissue location, and disease stage. The majority of projects include additional metadata, such as treatment or tumor stage, if provided. We standardized all provided metadata to maintain consistency across projects. Where applicable, we include ontology term identifiers in addition to human-readable values. We use ontology term identifiers obtained from HsapDv (age)16, PATO (sex)17,18, NCBI taxonomy (organism)19,20, MONDO (disease)21,22, UBERON (tissue)23-25, and Hancestro (ethnicity, if applicable)26,27. These ontology term identifiers standardize metadata terms and facilitate comparisons across Portal datasets and other research projects.

The Portal contains data from over 700 samples and 55 tumor types28-34. Figure 1A summarizes all samples from patient tumors and patient-derived xenografts currently available on the Portal. The most common sample diagnosis on the Portal is leukemia (n = 216), followed by sarcoma and soft tissue tumors (n = 194), brain and central nervous system tumors (n = 167), and a variety of other solid tumors (n = 115). Most samples were collected at initial diagnosis (n = 520), with a smaller number of samples collected at recurrence (n = 129), during progressive disease (n = 12), during or after treatment (n = 11), or post-mortem (n = 5). The Portal also contains a small number of human tumor cell line samples (n = 6) and non-cancerous samples (n = 6).

Sample data includes summarized gene expression data from either single-cell or single-nuclei RNA sequencing with the 10x Genomics droplet-based platform. Some samples also include additional data, such as CITE-seq quantification of cell-surface protein levels with antibody-derived tags (ADTs)35, or hashtag oligonucleotide (HTO) quantification for multiplexed samples36. Raw FASTQ files are not available for download from the Portal, but we provide links to raw data sources in external repositories, such as the Database of Genotypes and Phenotypes (dbGaP)37,38, when available. Of the samples on the Portal, 95 have associated CITE-seq data and 35 have associated multiplexing data. In some cases, multiple libraries from the same sample were created for additional assays, either for bulk RNA-seq (n = 182) or spatial transcriptomics (n = 41). Seven samples in the Portal have only bulk RNA-seq or spatial transcriptomics data. A summary of the number of single-cell or single-nuclei samples and their associated additional modalities is shown in Figure 1B, and a detailed summary of the total samples with each sequencing method broken down by project is available in Table S1.

Samples on the Portal are organized by project, where each project is a collection of similar samples from an individual lab. Users can filter projects based on diagnosis, included modalities (e.g., CITE-seq, bulk RNA-seq), 10x Genomics kit version (e.g., 10Xv2, 10Xv3), and whether a project includes samples from patient-derived xenografts or cell lines. The project card displays an abstract, the total number of samples, a list of all present diagnoses, and links to associated publications and external resources, e.g., SRA or GEO (Figure 1C). The project card also indicates sequencing metadata, such as the 10x Genomics kit version, the suspension type (cell or nucleus), whether additional sequencing such as bulk RNA-seq is present, and whether the samples were multiplexed using cell hashing.

The Portal also provides visualization of individual samples via the UCSC Cell Browser interface39 as seen in Figure 1C. Interactive UMAPs allow users to explore the cells within each sample, coloring by cell type annotations, gene expression values, or other calculated metrics.

Uniform processing of data available on the ScPCA Portal

We developed scpca-nf, an open-source and efficient Nextflow14 workflow for quantifying single-cell and single-nuclei RNA-seq data to process data for the Portal. Nextflow is a workflow management system that facilitates multi-step and long-running bioinformatics processes in a portable and reproducible manner across computing environments, including high-performance computing clusters and cloud-based computing40. Nextflow allows seamless dependency management, as each workflow process is run using a specified container image. This flexibility and containerization make the workflow easily portable for general use. Setup requires only installing Nextflow and a supported container engine, managing a configuration file for the computing environment, and organizing input files.

When building scpca-nf, we sought a fast and memory-efficient tool for gene expression quantification to minimize processing costs with comparable performance to the widely-used Cell Ranger platform41,42. Our comparisons between alevin-fry15 and Cell Ranger showed that alevin-fry had lower run time and memory usage (Figure S1A) but retained comparable mean gene expression (Figure S1B), total UMIs per cell (Figure S1C), and total genes detected per cell (Figure S1D). We therefore used salmon alevin and alevin-fry15 in scpca-nf to quantify gene expression data.

Taking FASTQ files as input, scpca-nf aligns reads using the selective alignment option in salmon alevin to an index with transcripts corresponding to spliced cDNA and intronic regions, known as the splici index in alevin-fry (Figure 2A). alevin-fry outputs a gene-by-cell count matrix for all barcodes identified, even those that may not contain true cells.

scpca-nf performs filtering of empty droplets, removal of low-quality cells, normalization, dimensionality reduction, cell type annotation, and copy-number variation (CNV) inference (Figure 2A). We used the Bioconductor ecosystem43,44 for filtering, normalization, and dimensionality reduction because of its rich documentation, wide use in the community, and relatively small file sizes. The unfiltered gene-by-cell counts matrices are filtered to remove any barcodes that are unlikely to contain cells using DropletUtils::emptyDropsCellRanger()45,46. Low-quality cells are identified and removed with miQC47, which jointly models the proportion of mitochondrial reads and detected genes per cell and calculates a probability that each cell is compromised. The remaining cells’ counts are normalized48, and reduced-dimension representations are calculated using both principal component analysis (PCA) and uniform manifold approximation and projection (UMAP)49. Cell types are classified using three automated methods, SingleR50, CellAssign51, and SCimilarity52, and a consensus cell type label is derived from these labels. Finally, CNV is estimated for each cell using inferCNV53.

To support both R and Python users, downloads are available as either SingleCellExperiment or AnnData54 objects. The workflow outputs three different SingleCellExperiment objects as .rds files: a processed object containing dimension reduction results, cell type annotations, and CNV inference; an unfiltered object with no processing; and a filtered object with the empty droplet filtered gene-by-cell matrices. scpca-nf also converts all SingleCellExperiment objects to AnnData objects, which are saved as .h5ad files (Figure 2A). Downloads contain the unfiltered, filtered, and processed objects from scpca-nf to allow users to choose to perform their own filtering and normalization or to start their analysis from a processed object. Providing unfiltered raw counts is consistent with the recommendations in Xu et al.10 for maximizing reusability when sharing pediatric cancer single-cell data.

All downloads from the Portal include a quality control (QC) report with a summary of processing information (e.g., alevin-fry version), library statistics (e.g., the total number of cells), and a collection of diagnostic plots for each library (Figure 2B-G). A knee plot displaying total UMI counts for all droplets (i.e., including empty droplets) indicates the effects of the empty droplet filtering (Figure 2B). For each cell remaining after filtering, the total UMI count, genes detected, and mitochondrial fraction are calculated and summarized in a scatter plot (Figure 2C). We include plots showing the miQC model and the results of miQC filtering (Figure 2D-E). We also provide a UMAP with cells colored by the number of genes detected and a faceted UMAP plot colored by the expression of highly-variable genes (Figure 2F-G).
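The per-cell statistics summarized in the QC report can be illustrated with a short sketch. This is not scpca-nf code, which computes these values with R/Bioconductor tools; it is a minimal Python illustration under two assumptions that do not come from the manuscript: the counts are held in a plain dictionary keyed by gene symbol, and mitochondrial genes are recognized by an `MT-` symbol prefix. The function name `per_cell_qc` and the toy counts are hypothetical.

```python
# Toy illustration of the per-cell QC statistics described above.
# Assumptions (not from scpca-nf): counts live in a plain dict keyed by gene
# symbol, and mitochondrial genes are recognized by an "MT-" prefix.

def per_cell_qc(counts, mito_prefix="MT-"):
    """Return total UMIs, genes detected, and mitochondrial fraction per cell.

    counts: dict mapping gene symbol -> list of UMI counts, one entry per cell.
    """
    n_cells = len(next(iter(counts.values())))
    stats = []
    for cell in range(n_cells):
        # Total UMI count for this cell across all genes.
        total = sum(gene_counts[cell] for gene_counts in counts.values())
        # Number of genes with at least one UMI in this cell.
        detected = sum(1 for gene_counts in counts.values() if gene_counts[cell] > 0)
        # UMIs assigned to mitochondrial genes.
        mito = sum(
            gene_counts[cell]
            for gene, gene_counts in counts.items()
            if gene.startswith(mito_prefix)
        )
        stats.append({
            "total_umi": total,
            "genes_detected": detected,
            "mito_fraction": mito / total if total else 0.0,
        })
    return stats


# Hypothetical counts for four genes in three cells.
toy_counts = {
    "CD3E":   [5, 0, 1],
    "PTPRC":  [3, 0, 0],
    "MT-CO1": [2, 9, 0],
    "MT-ND1": [0, 1, 0],
}
qc = per_cell_qc(toy_counts)
```

In this toy example the second cell has a mitochondrial fraction of 1.0 and only two detected genes, the kind of profile that a model jointly considering both values, such as miQC, would be expected to flag as likely compromised.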

Processing samples with additional modalities

scpca-nf includes modules for processing sequencing modalities beyond single-cell or single-nuclei RNA-seq, including CITE-seq (ADT)35, multiplexed (cell hashing)36, spatial transcriptomics, and bulk RNA-seq.

For CITE-seq libraries, ADT reads are quantified using salmon alevin and alevin-fry (Figure S2A). The workflow performs ADT-by-cell counts matrix normalization (see Methods for details) and calculates QC statistics that users can employ for additional filtering before downstream analysis. For these libraries, the QC report includes additional ADT-related statistics and ADT-specific diagnostic and exploratory plots (Figure S2B-D).

For multiplexed libraries, the HTO FASTQ files are quantified using salmon alevin and alevin-fry (Figure S2E). Although scpca-nf quantifies the HTO data and includes an HTO-by-cell counts matrix in all objects, final demultiplexing is not performed. Instead, scpca-nf applies multiple demultiplexing methods, including demultiplexing with DropletUtils::hashedDrops()45,46, Seurat::HTODemux()36, and genetic demultiplexing55 if bulk RNA-seq data from constituent samples are available. All demultiplexing results are saved in the filtered and processed SingleCellExperiment objects, and HTO-specific library statistics are included in the QC report.

For bulk RNA-seq data, scpca-nf trims reads using fastp56, quantifies expression with salmon (Figure S3A)57, and outputs a TSV file with the gene-by-sample counts matrix for all samples in a given ScPCA project.

Spatial transcriptomics data is processed with Space Ranger58 to quantify expression and process slide images (Figure S3B). The output includes the spot-by-gene matrix along with a summary report produced by Space Ranger.

Merged objects

Combining data from multiple samples into a single object facilitates joint gene-level analyses, such as differential expression or gene set enrichment analyses. Therefore, we provide a single merged object for each ScPCA project containing all raw and normalized gene expression data and metadata for all single-cell and single-nuclei RNA-seq libraries (with some exceptions as described in the Methods) produced using our merge.nf workflow (Figure S3C). Merged objects are not batch-corrected or integrated; users can perform their own batch correction or integration as needed to suit their experimental designs.

Downloading projects from the ScPCA Portal

Users can download data from individual samples or all data from an ScPCA project as either SingleCellExperiment (.rds) or AnnData (.h5ad) objects. When downloading a complete project, users can either download individual files for each sample (Figure 2H) or one file containing the gene expression data and metadata for all samples in the project as a merged object (Figure 2I). Users can also generate custom datasets by selecting specific samples across projects for a single download. In addition to the web interface, we provide an R package, ScPCAr, for programmatic access to metadata and files on the Portal, available at https://alexslemonade.github.io/ScPCAr/.

Annotating cell types

Assigning cell type labels to single-cell and single-nuclei RNA-seq data is an essential step in analysis. Cell type annotation requires knowledge of expected cell types and their associated gene expression patterns, which may be available from public databases or individual publications. Automated cell type annotation methods leveraging public databases are an excellent initial step in this process, as they can be applied consistently and transparently across samples. As such, we include cell type annotations from three different automated methods, SingleR50, CellAssign51, and SCimilarity52, in all processed SingleCellExperiment and AnnData objects (Figure 3A) (see Methods for details). An additional cell type report with information about references, comparisons among cell type annotation methods, and diagnostic plots is also provided.

Some ScPCA projects also have curated cell type annotations, including tumor cells and disease-specific cell states, provided by submitters. These submitter-provided annotations are in all SingleCellExperiment and AnnData objects (unfiltered, filtered, and processed). Cell type reports for these projects include a table summarizing the submitter cell type annotations, a UMAP colored by the submitter annotation, and a comparison of submitter annotations to automated cell typing results.

Assigning consensus cell types

SingleR, CellAssign, and SCimilarity use different references and computational approaches. Additionally, most public annotated reference datasets compatible with SingleR and CellAssign, including those used for the Portal, are derived from normal tissue, making annotating tumor datasets particularly difficult. Consistent cell type annotations across methods can indicate higher confidence, so we created a set of ontology-aware rules to assign consensus cell type labels based on the methods’ agreement.

scpca-nf assigns consensus cell type labels when two of the three automated methods agree. Specifically, we perform pairwise comparisons among automated cell type annotations to identify the latest common ancestor (LCA) in Cell Ontology59-61. The consensus cell type is the LCA term with the fewest descendants (Figure 3B). To ensure specificity, cells are only assigned a consensus cell type if the identified LCA has fewer than 170 descendant terms (see Methods for exceptions). This threshold excludes overly general cell ontology terms like lymphocyte, while retaining meaningful classifications like T cell and B cell. Consensus cell type assignments are available in all processed objects on the Portal. Figure 3C shows an example heatmap comparing automated and consensus cell type labels for a glioma library on the Portal.
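The consensus rule described above can be sketched on a toy ontology. The following is an illustration only, not the scpca-nf implementation: the ontology in `PARENTS` is a small hypothetical hierarchy rather than Cell Ontology, the function names are invented for this example, the `"Unknown"` fallback label is an assumption, and the exceptions noted in the Methods are not modeled. Because the toy ontology has only eight terms, the example uses a descendant threshold of 3 in place of the threshold of 170 applied to Cell Ontology.

```python
from itertools import combinations

# Hypothetical toy ontology: term -> list of parent terms ("is a" relations).
PARENTS = {
    "cell": [],
    "lymphocyte": ["cell"],
    "myeloid cell": ["cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"],
    "CD8 T cell": ["T cell"],
    "macrophage": ["myeloid cell"],
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def n_descendants(term):
    """Count the terms that have `term` as a strict ancestor."""
    return sum(1 for other in PARENTS if other != term and term in ancestors(other))


def pairwise_lca(a, b):
    """Most specific shared ancestor of two terms (the one with fewest descendants)."""
    shared = ancestors(a) & ancestors(b)
    return min(shared, key=n_descendants) if shared else None


def consensus(labels, max_descendants):
    """Consensus label from the per-method labels, or "Unknown" if none qualifies."""
    candidates = []
    for a, b in combinations(labels, 2):
        lca = pairwise_lca(a, b)
        # Keep only LCAs that are specific enough to be informative.
        if lca is not None and n_descendants(lca) < max_descendants:
            candidates.append(lca)
    if not candidates:
        return "Unknown"
    # Among qualifying LCAs, prefer the most specific term.
    return min(candidates, key=n_descendants)


# Two methods assign T cell subtypes, so the consensus is their shared parent.
specific = consensus(("CD4 T cell", "CD8 T cell", "macrophage"), max_descendants=3)
# Here the only shared terms are too general for the toy threshold.
too_general = consensus(("T cell", "B cell", "macrophage"), max_descendants=3)
```

In the first call the two T cell subtypes share `T cell` as their most specific common ancestor, so that term becomes the consensus label. In the second call the closest shared term, `lymphocyte`, has too many descendants for the toy threshold, so no consensus is assigned, mirroring how the real threshold excludes overly general terms.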

This ontology-based approach allowed us to account for different levels of granularity in annotation reference datasets. For example, Figure S4A displays cells that are annotated as different T cell subtypes by each automated method. Harmonizing annotations into a consensus cell type provides a single, consistent label for each cell (Figure S4B) and facilitates downstream analyses across multiple samples.

We validated consensus cell types by evaluating cell-type-specific marker gene expression across all cells (Figure 4A, Figure S5), observing high concordance between consensus labels and marker gene expression. Library-specific versions of Figures 4A and 3C are included in the QC report, allowing users to assess the reliability of consensus annotations and compare labels across methods.

Consensus cell type annotations in brain and CNS samples available on the Portal

Consensus annotations are particularly useful when examining samples from multiple projects. Figure 4B, for example, displays cell types across all high-grade glioma (HGG) and low-grade glial/glioneuronal (LGG) samples, which originate from six projects and four investigators, and reveals similar cell types across all samples.

Previous studies have characterized the glioma immune microenvironment as predominantly comprised of myeloid cells, including microglia and glioma-associated macrophages, with smaller proportions of lymphocytes such as T cells62,63. While we observe that most immune cells in glioma samples are myeloid or T cell types, there is notable heterogeneity within HGG and LGG subtypes (Figure 4C). Figure 4D shows the expression of cell-type specific markers for more granular immune cell types, validating the assignment of various immune cell types within the assigned consensus cell types. A summary of all the consensus cell types observed in all other ScPCA samples can be found in Figure S6.

Augmenting cell type annotations for malignant cell identification

Because the references used for automated methods do not consider tumor cell states, they provide limited information for distinguishing malignant from normal cells. We therefore sought complementary avenues to help identify malignant cells.

To this end, we launched the OpenScPCA project64, an open-science collaborative initiative to characterize and analyze Portal data, focusing first on improving cell type annotations. Thus far, we have added cell type annotations for two projects, SCPCP000004 (neuroblastoma) and SCPCP000015 (Ewing sarcoma), to the Portal based on OpenScPCA analyses. Figure 5A displays a UMAP of all libraries in SCPCP000004 highlighting this project’s OpenScPCA annotations, derived using the NBAtlas reference dataset65. Unlike the consensus cell type annotations, the OpenScPCA annotations distinguish between normal and malignant cells and contain far fewer uncharacterized cells. Summary cell type reports for projects with OpenScPCA annotations also include comparisons between scpca-nf and OpenScPCA annotations.

In addition, scpca-nf applies inferCNV53 to estimate copy-number alterations (Figure 2A) when enough normal cells are present in a library to serve as a reference (see Methods for details). The CNV estimates complement the consensus cell types by providing a proxy for a cell’s malignant status; cells with high levels of CNV are more likely to be tumor cells. Across libraries within SCPCP000004, malignant cell type annotations from OpenScPCA (which does not use CNV information) (Figure 5A) and the total per-cell CNV (Figure 5B) broadly correspond. We probed this relationship further within a single neuroblastoma library, finding signatures of canonical neuroblastoma CNV events such as 1p loss, 11q loss, and 17q gain65-67 within malignant cells (Figure 5C). By contrast, normal cells show very few CNV events. Unknown cells show CNV event signatures similar to the malignant cells, suggesting many of these cells may be malignant.

Malignancy can also be assessed by interpreting consensus cell types alongside CNV inferences, which is particularly useful for projects that do not yet have OpenScPCA annotations. Figure 5D shows per-cell CNV distributions for the most common consensus cell types in a neuroblastoma library. Unknown and neuron cells have distinctly higher values, suggesting possible malignancy. We see similar patterns in a ganglioglioma library (Figure S4B-C), where consensus immune cell types have low CNV values, while other cell types, including oligodendrocyte precursor cells, neuron-associated cells, and unknown cells, have much higher CNV values.

Analysis of bulk RNA-seq

Several projects in the ScPCA Portal contain bulk RNA-seq data in addition to single-cell or single-nuclei RNA-seq data. Compared to bulk RNA-seq, single-cell and single-nuclei RNA-seq technologies may fail to capture certain cell types68 due to technical aspects of library preparation. We therefore asked whether we could identify differences in biological signal between these two modalities that may suggest distinct cell type distributions by interrogating ScPCA projects of solid tumors, considering only samples with both sequencing modalities. We analyzed 97 samples across five projects: SCPCP000001 (high-grade glioma), SCPCP000002 (low-grade glioma), SCPCP000006 (Wilms tumor), SCPCP000009 (CNS tumors), and SCPCP000017 (osteosarcoma). As described in Methods, we derived pseudobulk expression matrices for each single-cell or single-nuclei library and compared the pseudobulk expression to bulk using a series of linear models (one per ScPCA project) with a random effect controlling for sample (Figure 6A, Figure S7A). As expected, all projects showed a positive relationship between bulk and pseudobulk expression.
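The idea of a pseudobulk profile can be shown with a small sketch. The exact procedure used for this analysis is described in the Methods and is not reproduced here; the code below only assumes the common approach of summing raw counts across all cells in a library. The log2 counts-per-million step, the function names, and the toy counts are illustrative assumptions and should not be read as the manuscript's normalization.

```python
import math


def pseudobulk(counts_by_cell):
    """Sum raw counts across cells to get one pseudobulk profile per library.

    counts_by_cell: list of dicts, each mapping gene -> raw count for one cell.
    """
    totals = {}
    for cell in counts_by_cell:
        for gene, count in cell.items():
            totals[gene] = totals.get(gene, 0) + count
    return totals


def log2_cpm(profile, pseudocount=1.0):
    """Illustrative normalization only (an assumption, not the Methods)."""
    library_size = sum(profile.values())
    return {
        gene: math.log2(count / library_size * 1e6 + pseudocount)
        for gene, count in profile.items()
    }


# Hypothetical library with three cells and three genes.
cells = [
    {"GFAP": 4, "OLIG2": 0, "PTPRC": 1},
    {"GFAP": 2, "OLIG2": 3, "PTPRC": 0},
    {"GFAP": 0, "OLIG2": 1, "PTPRC": 5},
]
profile = pseudobulk(cells)
normalized = log2_cpm(profile)
```

A profile of this form has one value per gene per library, which puts it on the same footing as a bulk RNA-seq sample and makes a gene-by-gene comparison between the two modalities possible.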

We next performed an overrepresentation analysis to probe for differences in gene expression suggestive of different cell type compositions and/or abundances between modalities. We calculated the per-gene median of each project’s model residuals and identified outliers. “Positive outliers” are genes with higher bulk RNA-seq expression than expected from pseudobulk expression, and “negative outliers” are genes with lower bulk RNA-seq expression than expected from pseudobulk expression. Using cell type marker gene sets from each project’s respective CellAssign reference, we calculated the odds ratio in each direction as the odds a cell type marker gene is present in the given outlier direction compared to other genes. Following permutation testing and P-value correction to control the FDR at 5%, we found several cell type marker gene sets with higher, but never lower, bulk RNA-seq expression than expected (Figure 6B, Figure S7B).
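The odds ratio described above can be written out on a 2x2 table of genes. The sketch below is a minimal illustration with invented gene names and a hypothetical function name, `odds_ratio`; it covers only the odds ratio itself and does not reproduce the outlier detection, the permutation testing, or the FDR correction.

```python
def odds_ratio(marker_genes, outlier_genes, all_genes):
    """Odds that a marker gene is an outlier, relative to all other genes."""
    universe = set(all_genes)
    markers = set(marker_genes) & universe
    outliers = set(outlier_genes) & universe
    others = universe - markers

    a = len(markers & outliers)   # marker genes that are outliers
    b = len(markers - outliers)   # marker genes that are not outliers
    c = len(others & outliers)    # other genes that are outliers
    d = len(others - outliers)    # other genes that are not outliers

    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (a * d) / (b * c)


# Hypothetical example: 20 genes, 4 markers for one cell type, 3 positive outliers.
all_genes = [f"g{i}" for i in range(1, 21)]
marker_genes = ["g1", "g2", "g3", "g4"]
positive_outliers = ["g1", "g2", "g5"]

ratio = odds_ratio(marker_genes, positive_outliers, all_genes)
```

Here two of the four marker genes are outliers (odds of 2/2) compared with one of the sixteen other genes (odds of 1/15), giving an odds ratio of 15. An odds ratio above 1 in the positive direction indicates that the marker gene set is enriched among genes with higher bulk expression than expected from pseudobulk expression.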

In brain and CNS tumors, the overrepresented marker gene sets in bulk RNA-seq were primarily stromal (e.g., endothelial cells and pericytes) and neuronal cell types (e.g., astrocytes and glial cells), all of which are prevalent in glioma tumor microenvironments69,70 (Figure 6B). By contrast, only monocytes and neuronal cell types were overrepresented in SCPCP000009 (brain and CNS tumors) bulk RNA-seq. As SCPCP000009 was sequenced at the single-nuclei level while SCPCP000001 and SCPCP000002 were sequenced at the single-cell level, this difference may reflect the reduced sensitivity of single-nuclei approaches to detecting immune cells71. The other single-nuclei projects similarly showed immune cell overrepresentation in bulk RNA-seq: monocytes for SCPCP000006 and a mix of immune and non-immune cell types for SCPCP000017 (Figure S7B), the latter possibly reflecting challenges in dissociating bone tissue72. Overall, while bulk and single-cell/nuclei expression are highly correlated, cell type composition differences between modalities may reflect cell-type-specific loss in single-cell experiments.

STAR Methods

RESOURCE AVAILABILITY

Lead contact

Requests for resources and information regarding resource sharing should be directed to the lead contact, Jaclyn N. Taroni (jaclyn.taroni@ccdatalab.org).

Materials availability

This study did not generate any new material resources.

Data and code availability

Data

All processed RNA-seq data and de-identified metadata described in this study are freely available through the ScPCA Portal at https://scpca.alexslemonade.org/, which is designed for long-term open access of single-cell, single-nuclei, and spatial transcriptomics data; data are available as of the date of publication. Each project, sample, and library is assigned a stable, unique accession number. Raw data (e.g., FASTQ files) are not available for download from the Portal due to the human origins of most samples, which may be subject to controlled-access restrictions. When raw or processed data are also deposited to external sources such as the Database of Genotypes and Phenotypes (dbGaP) or Gene Expression Omnibus (GEO), the project accession numbers are available from the Portal.

All projects included in this publication are available from the ScPCA Portal with the following accession numbers: SCPCP000001, SCPCP000002, SCPCP000003, SCPCP000004, SCPCP000005, SCPCP000006, SCPCP000007, SCPCP000008, SCPCP000009, SCPCP000010, SCPCP000011, SCPCP000012, SCPCP000013, SCPCP000014, SCPCP000015, SCPCP000016, SCPCP000017, SCPCP000018, SCPCP000020, SCPCP000021, SCPCP000022, SCPCP000023, SCPCP000024.

Code

Additional information

ScPCA documentation that describes the contents of downloads available from the Portal is available at https://scpca.readthedocs.io/en/stable/. Any additional information required to reanalyze the data in this study is available from the lead contact upon request.

EXPERIMENTAL MODEL AND SUBJECT DETAILS

Data on the ScPCA Portal were generated and compiled by each contributing lab and institution. As of May 1, 2026, gene expression data from 704 samples are available. This includes 581 human patient samples, 117 samples from patient-derived xenografts, 4 samples from immortalized human cell lines, and 2 samples from cell lines passaged from patient-derived xenografts. Together, these samples represent 55 different pediatric cancer types.

Metadata

Submitters were required to submit the age, sex, organism, diagnosis, subdiagnosis (if applicable), disease timing (e.g., initial diagnosis) and tissue of origin for each sample. The submitted metadata was standardized across projects, including converting all ages to years, removing abbreviations used in diagnosis, subdiagnosis, or tissue of origin, and using standard values across projects as much as possible for diagnosis, subdiagnosis, disease timing, and tissue of origin. For example, all samples obtained at diagnosis were assigned the value Initial diagnosis for disease timing.

In an effort to ensure sample metadata for ScPCA are compatible with CZI’s CELLxGENE, ontology term identifiers were assigned to metadata categories for each sample following the guidelines present in the CELLxGENE schema80,81, as shown in Table 1.

Table 1: Assignment of metadata fields to ontology terms.
Metadata field Ontology term description
Age Ontology term obtained from HsapDv16. For ages 0-11 months, the HsapDv for age in months was used. For ages 12 months and greater, the HsapDv for age in years was used.
Sex Ontology term obtained from PATO, either male (PATO:0000384), female (PATO:0000383), or unknown17,18.
Organism NCBI taxonomy term for organism. All current samples available on the Portal are from Homo sapiens (NCBITaxon:9606)19,20.
Diagnosis The most appropriate MONDO term based on the provided diagnosis21,22. An exact match was identified for most samples, but in a handful of cases, the most closely related term was used.
Tissue of origin The most appropriate UBERON term based on the provided tissue of origin23–25. An exact match was identified for most samples, but in a handful of cases, the most closely related term was used.
Ethnicity (if applicable) If the submitter provided ethnicity, the associated Hancestro term26,27. If ethnicity is unavailable, unknown is used.

The majority (87%) of projects on the Portal have additional metadata fields, such as the presence or absence of treatment, tumor grade, or whether a sample was obtained from a primary tumor or metastasis.

Ethics statement

For ALSF-funded datasets comprised of human subjects data, Institutional Review Boards (IRB) or research ethics boards at grantee institutions approved the research or determined it was exempt. For community-contributed datasets containing summarized data and de-identified metadata from human subjects, submitting institutions certified that all approvals and consents were obtained or listed the IRB protocol in transfer agreements. ALSF-funded xenograft datasets were approved by the grantee institution’s Institutional Animal Care and Use Committee.

METHOD DETAILS

Data generation and processing

Raw data and metadata were generated and compiled by each lab and institution contributing to the Portal. Single-cell or single-nuclei libraries were generated using one of the commercially available kits from 10x Genomics. For bulk RNA-seq, RNA was collected and sequenced using either paired-end or single-end sequencing. For spatial transcriptomics, cDNA libraries were generated using the Visium kit from 10x Genomics. All libraries were processed using our open-source pipeline, scpca-nf, to produce summarized gene expression data. A detailed summary with the total number of samples and libraries collected for each sequencing method broken down by project is available in Table S1.

Processing single-cell and single-nuclei RNA-seq data with alevin-fry

To quantify RNA-seq gene expression for each cell or nucleus in a library, scpca-nf uses salmon alevin82 and alevin-fry15 to generate a gene-by-cell counts matrix. Prior to mapping, we generated an index using transcripts from both spliced cDNA and unspliced cDNA sequences, denoted as the splici index15. The index was generated from the human genome, GRCh38, Ensembl version 104. salmon alevin was run using selective alignment to the splici index with the --rad option to generate a reduced alignment data (RAD) file required for input to alevin-fry.

The RAD file was used as input to the recommended alevin-fry workflow, with the following customizations. At the generate-permit-list step, we used the --unfiltered-pl option to provide a list of expected barcodes specific to the 10x kit used to generate each library. The quant step was run using the cr-like-em resolution strategy for feature quantification and UMI de-duplication.

Post alevin-fry processing of single-cell and single-nuclei RNA-seq data

The output from running alevin-fry includes a gene-by-cell counts matrix, with counts from both spliced and unspliced reads for all potential cell barcodes. The gene-by-cell counts matrix is read into R to create a SingleCellExperiment object using fishpond::load_fry(). The resulting object contains a counts assay with a gene-by-cell counts matrix where all spliced and unspliced reads for a given gene are totaled together. We also include a spliced assay that contains a gene-by-cell counts matrix with only spliced reads. These matrices include all potential cells, including empty droplets, and are provided for all Portal downloads in the unfiltered objects saved as .rds files with the _unfiltered.rds suffix.

Each droplet was tested for deviation from the ambient RNA profile using DropletUtils::emptyDropsCellRanger()45,46 and those with an FDR ≤ 0.01 were retained as likely cells. If a library did not have a sufficient number of droplets and DropletUtils::emptyDropsCellRanger() failed, cells with fewer than 100 UMIs were removed. Gene expression data for any cells that remain after filtering are provided in the filtered objects saved as .rds files with the _filtered.rds suffix. These filtered objects additionally contain results from doublet detection performed with scDblFinder::scDblFinder()83, including each cell’s predicted class (“singlet” or “doublet”) as well as the associated doublet score. However, predicted doublets were not filtered out; users can instead use these scDblFinder results to filter doublets as needed for their analyses.
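
The two filtering branches described above can be illustrated with a minimal Python sketch. This is not the scpca-nf implementation, which runs in R. The function name, the dictionary inputs, and the convention that fdr is None when the empty-droplet test could not be run are all assumptions made for illustration.

```python
def filter_droplets(umi_counts, fdr=None, fdr_cutoff=0.01, min_umi=100):
    """Return the barcodes retained as likely cells.

    umi_counts: dict mapping barcode -> total UMI count.
    fdr: dict mapping barcode -> FDR from the empty-droplet test, or None
         if the test could not be run for this library (assumed convention).
    """
    if fdr is not None:
        # Primary path: keep droplets that deviate from the ambient profile.
        return [bc for bc in umi_counts
                if fdr.get(bc) is not None and fdr[bc] <= fdr_cutoff]
    # Fallback path: remove droplets with fewer than min_umi UMIs.
    return [bc for bc, n_umi in umi_counts.items() if n_umi >= min_umi]
```

With a per-droplet FDR available, only droplets at or below the cutoff are kept; without it, the UMI threshold alone decides which droplets remain.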

Following removal of empty droplets, scpca-nf proceeds to remove cells that are likely to be compromised by damage or low-quality sequencing. miQC was used to calculate the posterior probability that each cell is compromised47. Any cells with a probability of being compromised greater than 0.75 and fewer than 200 genes detected were removed before further processing. The gene expression counts from the remaining cells were log-normalized using the deconvolution method from Lun, Bach, and Marioni48. Briefly, scran::quickCluster() was used to derive cell clusters on which to calculate sum factors with scran::computeSumFactors(), which are in turn used during normalization with scuttle::logNormCounts(). If this deconvolution-based approach failed for any reason, only scuttle::logNormCounts() was used for normalization.
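
As a rough illustration of the fallback normalization path, the sketch below computes library-size factors scaled to a mean of 1 and then applies a log2 transformation with a pseudocount of 1. We believe this mirrors the default behavior of scuttle::logNormCounts(), but the Python function names and list-based inputs are hypothetical, and the deconvolution step itself is not reproduced here.

```python
import math


def library_size_factors(counts):
    """counts: list of per-cell lists of raw gene counts.

    Returns one size factor per cell, scaled so the factors average to 1.
    Assumes every cell has a non-zero total count.
    """
    totals = [sum(cell) for cell in counts]
    mean_total = sum(totals) / len(totals)
    return [total / mean_total for total in totals]


def log_norm_counts(counts, size_factors, pseudocount=1.0):
    """Divide each cell's counts by its size factor, then log2-transform."""
    return [
        [math.log2(c / sf + pseudocount) for c in cell]
        for cell, sf in zip(counts, size_factors)
    ]
```

Two cells with the same relative expression but different sequencing depth end up with identical normalized values, which is the purpose of the size factors.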

Next, scran::modelGeneVar() was used to model gene variance from the log-normalized counts and scran::getTopHVGs() was used to select the top 2000 high-variance genes. These were used as input to calculate the top 50 principal components using scater::runPCA(). Finally, UMAP embeddings were calculated from the principal components with scater::runUMAP(). The raw and log-normalized counts, list of 2000 high-variance genes, principal components, and UMAP embeddings are all stored in the processed objects saved as .rds files with the _processed.rds suffix.

Quantifying gene expression for libraries with CITE-seq or cell hashing

All libraries with antibody-derived tags (ADTs) or hashtag oligonucleotides (HTOs) were mapped to a reference index using salmon alevin and quantified using alevin-fry. The reference indices were constructed using the salmon index command with the --feature option. References were custom-built for each ScPCA project and constructed using the submitter-provided list of ADTs or HTOs and their barcode sequences.

The ADT-by-cell or HTO-by-cell counts matrix produced by alevin-fry was read into R as a SingleCellExperiment object and saved as an alternative experiment (altExp) in the same SingleCellExperiment object with the unfiltered gene expression counts data. The altExp within the unfiltered object contains all identified ADTs or HTOs and all barcodes identified in the RNA-seq gene expression data. Any barcodes that only appeared in either ADT or HTO data were discarded, and cell barcodes that were only found in the gene expression data (i.e., did not appear in the ADT or HTO data) were assigned zero counts for all ADTs and HTOs. Any cells removed after filtering empty droplets were also removed from the ADT and HTO counts matrices before creating the filtered SingleCellExperiment object.
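
The barcode reconciliation described above can be sketched as follows. This is an illustrative Python example with a hypothetical function name and nested-dictionary inputs, not the R code used in scpca-nf.

```python
def align_tag_counts(rna_barcodes, tag_counts, tag_names):
    """Align ADT or HTO counts to the barcodes found in the RNA data.

    rna_barcodes: list of barcodes present in the gene expression data.
    tag_counts: dict mapping barcode -> {tag name: count}.
    tag_names: list of all ADT or HTO names in the reference.
    """
    aligned = {}
    for barcode in rna_barcodes:
        # Barcodes missing from the tag data receive zero counts for every tag.
        observed = tag_counts.get(barcode, {})
        aligned[barcode] = {tag: observed.get(tag, 0) for tag in tag_names}
    # Barcodes seen only in the tag data are never visited, so they are dropped.
    return aligned
```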

Processing ADT expression data from CITE-seq

The ADT count matrix stored in the unfiltered object was used to calculate an ambient profile with DropletUtils::ambientProfileEmpty(). The ambient profile was used to calculate quality-control statistics with DropletUtils::cleanTagCounts() for all cells remaining after removing empty droplets. Any negative or isotype controls were taken into account when calculating QC statistics. This function flags cells as low-quality if they either have very high levels of ambient contamination and/or negative/isotype controls (if present), or lack ambient expression altogether, which may indicate failed capture. However, we did not remove any cells based on ADT quality because that would remove those cells from the SingleCellExperiment object, regardless of the quality of the RNA expression. Instead, the filtered and processed objects contain the results from running DropletUtils::cleanTagCounts(), which users can leverage for filtering as desired.

ADT count data were then normalized using scuttle::computeMedianFactors(), which calculates a per-cell size factor as the median ratio of the cell’s counts to the background profile previously calculated with DropletUtils::ambientProfileEmpty(). We then used these factors to normalize ADT counts with scuttle::logNormCounts(). If median-based normalization failed for any reason, ADT counts were log-transformed after adding a pseudocount of 1. We only performed normalization on cells that would be retained after ADT filtering; we assigned NA normalized counts to any cells that would be filtered out based on DropletUtils::cleanTagCounts(). The normalized ADT data are available in the altExp of the processed object. Although scpca-nf normalizes ADT counts, the workflow does not perform any dimensionality reduction of ADT data; only the RNA counts data are used as input for dimensionality reduction. Additionally, note that we did not perform background subtraction on the ADT counts, but we provide the ambient profile calculated with DropletUtils::ambientProfileEmpty(), which users can employ to perform global de-noising as needed. During conversion to AnnData objects, the modalities are exported as separate RNA (_rna.h5ad) and ADT (_adt.h5ad) objects.
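
To make the median-based size factor concrete, the following Python sketch computes, for each retained cell, the median ratio of its ADT counts to the ambient profile, and assigns missing values to cells flagged for removal. The function names and inputs are hypothetical. Any rescaling of the factors that scuttle::computeMedianFactors() may apply is deliberately omitted, and the log2 transformation with a pseudocount of 1 is an assumption based on the defaults of scuttle::logNormCounts().

```python
import math
from statistics import median


def median_size_factors(adt_counts, ambient, keep):
    """adt_counts: dict barcode -> list of ADT counts (same order as ambient).

    ambient: list of ambient-profile values, one per ADT (assumed non-zero).
    keep: dict barcode -> True if the cell passes ADT quality control.
    """
    factors = {}
    for barcode, counts in adt_counts.items():
        if keep[barcode]:
            factors[barcode] = median(c / a for c, a in zip(counts, ambient))
        else:
            factors[barcode] = None  # flagged cells get no size factor
    return factors


def log_normalize_adt(adt_counts, factors):
    """Log2-normalize retained cells; flagged cells receive missing values."""
    normalized = {}
    for barcode, counts in adt_counts.items():
        sf = factors[barcode]
        if sf is None:
            normalized[barcode] = [None] * len(counts)
        else:
            normalized[barcode] = [math.log2(c / sf + 1) for c in counts]
    return normalized
```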

Processing HTO data from multiplexed libraries

As with ADT data, scpca-nf does not filter any cells based on HTO expression, and any cells removed after filtering empty droplets based on the unfiltered RNA counts matrix are also removed from the HTO counts matrix in the filtered object. scpca-nf does not perform any additional filtering or processing of the HTO-by-cell counts matrix, so the same filtered matrix is included in the processed object.

To identify which cells come from which sample in a multiplexed library, we applied three different demultiplexing methods: genetic demultiplexing, HTO demultiplexing using DropletUtils::hashedDrops(), and HTO demultiplexing using Seurat::HTODemux(). We do not provide separate SingleCellExperiment objects for each sample in a library. Each multiplexed library object contains the counts data from all samples and the results from all three demultiplexing methods to allow users to select which method(s) to use.

Genetic demultiplexing

If all samples in a multiplexed library were also sequenced using bulk RNA-seq, we performed genetic demultiplexing using genotype data from both bulk RNA-seq and single-cell or single-nuclei RNA-seq55. If bulk RNA-seq was not available, no genetic demultiplexing was performed.

Bulk RNA-seq reads for each sample were mapped to a reference genome using STAR84, and multiplexed single-cell or single-nuclei RNA-seq reads were mapped to the same reference genome using STARsolo85. The mapped bulk reads were used to call variants and assign genotypes with bcftools mpileup86. cellsnp-lite was then used to genotype single-cell data at the identified sites found in the bulk RNA-seq data87. Finally, vireo was used to identify the sample of origin87.

HTO demultiplexing

For all multiplexed libraries, we performed demultiplexing using DropletUtils::hashedDrops() and Seurat::HTODemux(). For both methods, we used the default parameters and only performed demultiplexing on the filtered cells present in the filtered object. The results from both these methods are available in the filtered and processed objects.

Quantification of spatial transcriptomics data

10x Genomics’ Space Ranger58 was used to quantify gene expression data from spatial transcriptomics libraries. cellranger mkref was used to create a reference index from the human genome, GRCh38, Ensembl version 104. The FASTQ files, microscopic slide image, and slide serial number were provided as input to spaceranger count. The raw and filtered counts matrix, slide images, and the summary report output by spaceranger count are included in the output from scpca-nf.

Quantification of bulk RNA-seq data

fastp was used to trim adapters and perform quality and length filtering on all FASTQ files from bulk RNA-seq. We used a decoy-aware reference created from spliced cDNA sequences with the entire human genome sequence (GRCh38, Ensembl version 104) as the decoy57. The trimmed reads were then provided as input to salmon quant for selective alignment. In addition to using the default parameters for salmon quant, we applied the --seqBias and --gcBias flags to correct for sequence-specific biases due to random hexamer priming and fragment-level GC biases, respectively.

Cell type annotation

Cell type labels determined by SingleR50, CellAssign51, and SCimilarity52 were added to processed SingleCellExperiment objects. If cell types were obtained from the submitter of the dataset, the submitter-provided annotations were incorporated into all SingleCellExperiment objects (unfiltered, filtered, and processed).

To prepare the references used for assigning cell types, we developed a separate workflow, build-celltype-index.nf, within scpca-nf. We used the BlueprintEncodeData reference from the celldex package50,88,89 to train the SingleR classification model with SingleR::trainSingleR(). In the main scpca-nf workflow, this model and the processed SingleCellExperiment object were input to SingleR::classifySingleR(). The SingleR output of cell type annotations and a score matrix for each cell and all possible cell types were added to the processed SingleCellExperiment object.

The BlueprintEncodeData reference was chosen because it had either a similar or higher delta median statistic compared to other references available in celldex when applied to samples from a variety of diagnoses. The delta median statistic for each cell was calculated by subtracting the median cell type score from the score associated with the assigned cell type90 and was used to evaluate confidence in SingleR cell type assignments. The cell type report shows the distribution of delta median values for each cell type. A higher delta median statistic for a cell generally indicates higher confidence in the final cell type annotation.
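
The delta median statistic can be written out in a few lines. The sketch below is a Python illustration of the definition given above; the function name and the score dictionary are hypothetical, and the assigned label is taken as an input rather than derived from the scores.

```python
from statistics import median


def delta_median(scores, assigned):
    """scores: dict mapping cell type -> score for one cell.

    assigned: the cell type label assigned to that cell.
    Returns the assigned label's score minus the median score.
    """
    return scores[assigned] - median(scores.values())
```

For a cell with scores of 0.8 (T cell), 0.5 (B cell), and 0.3 (monocyte) that is assigned the T cell label, the delta median is 0.8 - 0.5 = 0.3.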

For CellAssign, marker gene references were created using the marker gene lists available on PanglaoDB91. Organ-specific references were built using all cell types in a specified organ listed in PanglaoDB to accommodate all ScPCA projects encompassing a variety of disease and tissue types. If a set of disease types in a given project encompassed cells that may be present in multiple organ groups, multiple organs were combined. Since many cancers may have infiltrating immune cells, all immune cells were also included in each organ-specific reference. For example, we created a reference containing bone, connective tissue, smooth muscle, and immune cells for sarcomas that appear in bone or soft tissue. The specific reference information and list of organs included in that reference for a given library is available in the metadata of each processed object.

Given the processed SingleCellExperiment object and organ-specific reference, scvi.external.CellAssign() was used in the main scpca-nf workflow to train the model and predict the assigned cell type. For each cell, CellAssign calculates a probability of assignment to each cell type in the reference. The probability matrix and a prediction based on the most probable cell type were added as cell type annotations to the processed SingleCellExperiment object. We also display the distribution of all probabilities calculated by CellAssign in the cell type report; more confident labels are expected to have many values close to 1.

For SCimilarity, the foundation model described in Heimberg et al.52 containing 7.3 million cells from various normal and diseased tissues was obtained from Zenodo92 and used to annotate cells in all samples. Embeddings were first computed on the processed AnnData objects using scimilarity.get_embeddings() followed by cell type prediction using scimilarity.get_predictions_knn() with weighting=True. The assigned cell type label and the distance of the query cell to the closest cell in the model were added to the processed SingleCellExperiment object. A plot showing the distribution of the distance metric calculated by SCimilarity is present in the cell type report. Distances larger than 0.05 can indicate that the model is less confident in the prediction.

Assigning consensus cell types

Cell type labels obtained from SingleR, CellAssign, and SCimilarity were then used to assign an ontology-aware consensus cell type label. We first assigned each of the cell types present in the PanglaoDB91 reference used with CellAssign to an appropriate Cell Ontology term59–61,93. For cell types available in the BlueprintEncodeData reference used with SingleR and the foundation model used with SCimilarity, we used the provided Cell Ontology terms.

We then created a reference table containing all possible combinations of cell types assigned using SingleR, CellAssign, and SCimilarity. Consensus cell types are assigned if two of the three annotations share a latest common ancestor (LCA), identified using ontoProc::findCommonAncestors()94, that meets the following criteria. Otherwise, no consensus cell type is assigned, and the cell is labeled as “Unknown”.

  1. The terms share at least 1 LCA that either has fewer than 170 descendants or is one of “neuron”, “epithelial cell”, “columnar/cuboidal epithelial cell”, or “endo-epithelial cell”.

  2. If more than 1 LCA is shared between two terms, then the LCA with the fewest descendants is kept and all others are discarded.

  3. If the LCA has fewer than 170 descendants and is one of the following non-specific LCA terms, no consensus cell type is assigned: “bone cell”, “lining cell”, “blood cell”, “progenitor cell”, “supporting cell”, “biogenic amine secreting cell”, “protein secreting cell”, “extracellular matrix secreting cell”, “serotonin secreting cell”, “peptide hormone secreting cell”, “exocrine cell”, “sensory receptor cell”, or “interstitial cell”.

If more than one LCA is identified as a possible consensus cell type, meaning there is agreement among all three methods, the LCA with the fewest descendants is used as the consensus cell type.
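
The criteria above can be sketched on a toy ontology. The Python example below is illustrative only: the real workflow uses ontoProc::findCommonAncestors() on the Cell Ontology, whereas here the ontology (PARENTS), the function names, and the choice to count a term as its own ancestor are all assumptions. The criteria are applied in the order listed, which is our reading of the text.

```python
from itertools import combinations

# Toy ontology (hypothetical): each term maps to its direct parents.
PARENTS = {
    "cell": [], "leukocyte": ["cell"], "neuron": ["cell"],
    "lymphocyte": ["leukocyte"], "myeloid cell": ["leukocyte"],
    "T cell": ["lymphocyte"], "B cell": ["lymphocyte"],
    "CD4 T cell": ["T cell"], "monocyte": ["myeloid cell"],
}


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def n_descendants(term, parents):
    return sum(1 for other in parents
               if other != term and term in ancestors(other, parents))


def latest_common_ancestors(a, b, parents):
    """Common ancestors that have no descendant among the common ancestors."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {t for t in common
            if not any(u != t and t in ancestors(u, parents) for u in common)}


def pair_consensus(a, b, parents, max_desc, allowed_broad, blocked):
    lcas = latest_common_ancestors(a, b, parents)
    # Criterion 1: specific enough, or an explicitly allowed broad term.
    kept = [t for t in sorted(lcas)
            if n_descendants(t, parents) < max_desc or t in allowed_broad]
    if not kept:
        return None
    # Criterion 2: keep the LCA with the fewest descendants.
    best = min(kept, key=lambda t: n_descendants(t, parents))
    # Criterion 3: non-specific terms never become a consensus label.
    return None if best in blocked else best


def consensus_cell_type(labels, parents, max_desc=170,
                        allowed_broad=frozenset(), blocked=frozenset()):
    """labels: the three method-specific labels for one cell."""
    candidates = {pair_consensus(a, b, parents, max_desc, allowed_broad, blocked)
                  for a, b in combinations(labels, 2)}
    candidates.discard(None)
    if not candidates:
        return "Unknown"
    return min(sorted(candidates), key=lambda t: n_descendants(t, parents))
```

Because the toy ontology has only nine terms, a smaller threshold (for example, max_desc=5) is needed to see the first criterion exclude broad terms such as "leukocyte".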

The consensus cell type assignments, including both the Cell Ontology term and the associated human-readable name, are available in processed object files on the Portal.

Consensus cell type assignments were evaluated by examining the expression of a set of cell-type-specific marker genes. Marker genes were obtained from the list of Human cell markers on CellMarker 2.095. We considered only those that are specific to a single cell type, with the exception of hematopoietic precursor cells, which express genes found in other, more differentiated immune cells.

Cell types annotated as part of the OpenScPCA Project

As part of the ongoing OpenScPCA project64, cell types for each project are manually annotated to label disease-specific cell types or cell states. After annotations for all samples in a given project have been validated, they are added to all SingleCellExperiment objects (unfiltered, filtered, and processed) for that project on the Portal. To date, cell types have been assigned and validated for SCPCP000004 (neuroblastoma) and SCPCP000015 (Ewing sarcoma). The approaches for cell type annotation were originally developed in the OpenScPCA-analysis GitHub repository96 in the cell-type-neuroblastoma-04 and cell-type-ewings analysis modules, respectively. These analysis modules provide full information on the specific approaches used for annotation. The cell type annotations included in the ScPCA Portal were subsequently generated in corresponding Nextflow modules in the OpenScPCA-nf GitHub repository97.

For SCPCP000004 (neuroblastoma), shown in Figure 5, cell type annotation is performed with both SingleR98 and scANVI/scArches99 using the NBAtlas from Bonine et al. as a reference65. Final annotations are derived based on agreement between these two methods and the consensus cell types. If SingleR and scANVI/scArches agree exactly, then that label is used. If the SingleR and scANVI/scArches labels are in the same broad family (e.g., T cell and CD4+ T cell), then the broad family label is used (e.g., T cell). If SingleR and scANVI/scArches disagree and at least one of them agrees with the consensus cell type label, then that label is assigned. If the consensus cell type is Unknown, one of the SingleR or scANVI/scArches labels is Neuroendocrine, and the other is one of Schwann, Stromal other, or Fibroblast, then the Neuroendocrine label is assigned.
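
The decision rules for neuroblastoma can be summarized as a small function. This Python sketch is an illustration of the rules as stated, not the code in the cell-type-neuroblastoma-04 module. The function name and the broad_family lookup are hypothetical, and returning None when no rule applies is an assumption, since the fallback is not described here.

```python
def neuroblastoma_label(singler, scanvi, consensus, broad_family):
    """Combine two reference-based labels with the consensus cell type.

    broad_family: dict mapping a label to its broad family label
                  (hypothetical lookup; unlisted labels map to themselves).
    """
    # Rule 1: exact agreement between the two methods.
    if singler == scanvi:
        return singler
    # Rule 2: both labels fall in the same broad family.
    family_a = broad_family.get(singler, singler)
    family_b = broad_family.get(scanvi, scanvi)
    if family_a == family_b:
        return family_a
    # Rule 3: the methods disagree, but one matches the consensus label.
    if consensus != "Unknown" and consensus in (singler, scanvi):
        return consensus
    # Rule 4: Neuroendocrine paired with a stromal-like label.
    stromal_like = {"Schwann", "Stromal other", "Fibroblast"}
    if consensus == "Unknown" and "Neuroendocrine" in (singler, scanvi):
        other = scanvi if singler == "Neuroendocrine" else singler
        if other in stromal_like:
            return "Neuroendocrine"
    return None  # assumption: no rule applied
```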

For SCPCP000015 (Ewing sarcoma), tumor cells were first identified by running AUCell100 on a merged object containing all samples with EWS::FLI1 marker gene sets obtained from MSigDB and published literature. Cells with an AUC > 0.4 for the EWS::FLI1 marker gene set obtained from Aynaud et al.101 and AUC > 0.1 for the STAEGE_EWING_FAMILY_TUMOR gene set from MSigDB102 were classified as tumor EWS-high. Cells with an AUC > 0.1 for the EWS::FLI1 marker gene set obtained from Wrenn et al.103 and AUC > 0.05 for the HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION gene set from MSigDB104 were classified as tumor EWS-low. Cells that met the criteria for tumor EWS-high and had mean expression of proliferative markers (MKI67, PCNA, and TOP2A) > 0 were classified as tumor EWS-high proliferative. For all non-tumor cells, the consensus cell type label was used.
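
The AUC thresholds above translate into a simple per-cell classifier. The Python sketch below is illustrative; the dictionary keys are hypothetical names for the four gene sets, and checking the EWS-high criteria before the EWS-low criteria is an assumption about precedence for cells that satisfy both.

```python
def classify_ewing_cell(auc, expression):
    """auc: dict of AUCell scores for one cell, keyed by gene set.

    expression: dict of normalized expression values for that cell.
    Returns a tumor label, or None for non-tumor cells (which keep
    their consensus cell type label).
    """
    is_high = (auc["aynaud_ews_fli1"] > 0.4
               and auc["staege_ewing_family_tumor"] > 0.1)
    is_low = (auc["wrenn_ews_fli1"] > 0.1
              and auc["hallmark_emt"] > 0.05)
    if is_high:
        markers = ("MKI67", "PCNA", "TOP2A")
        mean_proliferative = sum(expression[g] for g in markers) / len(markers)
        if mean_proliferative > 0:
            return "tumor EWS-high proliferative"
        return "tumor EWS-high"
    if is_low:
        return "tumor EWS-low"
    return None
```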

Copy-number variation inference

We used inferCNV53 with the i6 HMM to estimate copy-number variation (CNV) events for each library, for each chromosome arm. We designated a set of normal consensus cell types to use for each library’s normal reference based on the given sample’s diagnosis. The list of cell types included in the reference used for inferCNV can be found in the metadata of the processed object for a given library. All libraries were processed with inferCNV except: i) libraries without assigned consensus cell types, ii) libraries with fewer than 100 normal reference cells, and iii) libraries from non-cancerous samples. We calculated the total CNVs per cell using the feature output from the i6 HMM by summing CNV calls across all chromosome arms.

Generating merged data

Merged objects are created with the merge.nf workflow within scpca-nf. This workflow takes as input the processed SingleCellExperiment objects in a given ScPCA project output by scpca-nf and creates a single merged SingleCellExperiment object containing gene expression data and metadata from all libraries in that project. The merged object includes both raw and normalized counts for all cells from all libraries. Because the same reference index was used to quantify all single-cell and single-nuclei RNA-seq data, the set of genes is the same in the merged object and the individual objects. Library-, cell- and gene-specific metadata from each of the processed SingleCellExperiment objects are also combined and stored in the merged object. The merge.nf workflow does not perform batch correction or integration, so the counts in the merged object are not batch-corrected.

The top 2000 shared high-variance genes are identified from the merged counts matrix by modeling variance using scran::modelGeneVar() and specifying library IDs for the block argument. These genes are used to calculate library-aware principal components with batchelor::multiBatchPCA()105. The top 50 principal components were selected and used to calculate UMAP embeddings for the merged object.

If any libraries included in the ScPCA project contain additional ADT data, the raw and normalized ADT data are also merged and stored in the altExp slot of the merged SingleCellExperiment object. If the merged object contains an altExp with merged ADT data, two AnnData objects are exported to create separate RNA (_rna.h5ad) and ADT (_adt.h5ad) objects.

If any libraries in the ScPCA project are multiplexed and contain HTO data, no merged object is created due to potential ambiguity in identifying samples across multiplexed libraries. Merged objects were not created for projects with more than 100 samples because of the computational resources required to work with them.

Converting SingleCellExperiment objects to AnnData objects

zellkonverter::writeH5AD()106 was used to convert SingleCellExperiment objects to AnnData format and export the objects as .h5ad files. For any SingleCellExperiment objects containing an altExp (e.g., ADT data), the RNA and ADT data were exported and saved separately as RNA (_rna.h5ad) and ADT (_adt.h5ad) files. Multiplexed libraries were not converted to AnnData objects, due to the potential for ambiguity in sample origin assignments.

All merged SingleCellExperiment objects were converted to AnnData objects and saved as .h5ad files. If a merged SingleCellExperiment object contained any ADT data, the RNA and ADT data were exported and saved separately as RNA (_rna.h5ad) and ADT (_adt.h5ad) objects. In contrast, if a merged SingleCellExperiment object contained HTO data due to the presence of any multiplexed libraries in the merged object, the HTO data was removed from the SingleCellExperiment object and not included in the exported AnnData object.

QUANTIFICATION AND STATISTICAL REPORTING

Details regarding total sample and/or cell numbers and any statistical tests used can be found in the figure legends and figures. For Figure 4, Figure 5, Figure S4, and Figure S5, numbers in parentheses in the figure indicate total cell numbers. For Figure 1, Figure 6, and Figure S7, numbers in parentheses in the figure indicate total sample numbers.

Benchmarking of alevin-fry and cellranger count performance

Six libraries, three single-cell and three single-nuclei, were randomly selected and used to benchmark the performance of alevin-fry and cellranger count, results of which are shown in Figure S1. Libraries were processed with salmon alevin v1.5.2 and alevin-fry v0.4.1 or cellranger count from Cell Ranger v6.1.2. Results were generated using default parameters for single-cell libraries and use of the --include-introns flag to include intronic reads for single-nuclei libraries only. The Pearson correlation between mean gene expression across both methods is reported in Figure S1B.

Analysis of bulk RNA-seq data

Data preparation

We identified solid tumor samples with both bulk and single-cell (or single-nuclei) RNA-seq data in the ScPCA Portal for analysis, with multiplexed samples excluded (N=105). We removed low-quality samples based on visual inspection of quality control reports (N=8), leaving a total of 97 samples across five ScPCA projects for analysis.

For each project, we transformed and normalized bulk counts matrices for all samples using DESeq2::rlog()107. We obtained pseudobulk counts by summing raw single-cell counts for each sample, and similarly transformed each project’s resulting counts matrix with DESeq2::rlog(). We filtered out genes which were not observed in either the bulk or pseudobulk raw counts matrices before subsequent analysis. For each project, we then used the lme4 R package108 to construct a linear model predicting bulk from pseudobulk counts considering a random effect for sample id: bulk ~ pseudobulk + (1|sample_id).
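
Pseudobulk aggregation amounts to summing raw counts over the cells of each sample. The sketch below shows this in Python with hypothetical names and dictionary-based inputs; the rlog transformation and the mixed-effects model fit are not reproduced.

```python
def pseudobulk(cell_counts, cell_to_sample):
    """Sum raw single-cell counts per gene within each sample.

    cell_counts: dict mapping barcode -> {gene: raw count}.
    cell_to_sample: dict mapping barcode -> sample identifier.
    """
    totals = {}
    for barcode, genes in cell_counts.items():
        sample_totals = totals.setdefault(cell_to_sample[barcode], {})
        for gene, count in genes.items():
            sample_totals[gene] = sample_totals.get(gene, 0) + count
    return totals
```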

Overrepresentation analysis

We next conducted overrepresentation analysis (ORA) to ascertain whether certain cell types might be overrepresented in either modality (bulk vs. pseudobulk). We specifically tested overrepresentation of the PanglaoDB cell type marker gene sets used for each project’s respective CellAssign reference.

For input to the ORA, we summarized model residuals within each project by taking the median residual for each gene across samples and then transformed these summarized residuals into Z-scores. We identified outlier genes as those with Z-scores greater than 2.5 (positive outliers) or less than -2.5 (negative outliers). In this case, positive outliers represent genes with comparatively higher expression in the bulk modality, and negative outliers represent genes with comparatively higher expression in the single-cell modality.
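
The residual summarization and outlier calls can be sketched as follows. This Python example is illustrative: the function name and input layout are hypothetical, and the use of the sample standard deviation when computing Z-scores is an assumption.

```python
from statistics import mean, median, stdev


def outlier_genes(residuals, z_cutoff=2.5):
    """residuals: dict mapping gene -> list of per-sample model residuals.

    Returns (z_scores, positive_outliers, negative_outliers).
    """
    # Summarize each gene by its median residual across samples.
    summarized = {gene: median(values) for gene, values in residuals.items()}
    center = mean(list(summarized.values()))
    spread = stdev(list(summarized.values()))  # sample SD (assumption)
    z_scores = {gene: (value - center) / spread
                for gene, value in summarized.items()}
    positive = sorted(g for g, z in z_scores.items() if z > z_cutoff)
    negative = sorted(g for g, z in z_scores.items() if z < -z_cutoff)
    return z_scores, positive, negative
```

Positive outliers correspond to genes with comparatively higher expression in bulk, and negative outliers to genes with comparatively higher expression in pseudobulk.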

For each set of cell type marker genes, we calculated two odds ratios representing whether genes were overrepresented in the positive outliers (enriched in bulk) or negative outliers (enriched in pseudobulk). We calculated P-values for both the bulk and pseudobulk enrichment directions via permutation testing with 10,000 replicates. We defined gene sets with significant overrepresentation as those with a false-discovery-rate-corrected P-value ≤ 0.05109.
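
One way to carry out such a test is sketched below in Python. Several details are assumptions rather than a description of our analysis code: the 0.5 continuity correction in the odds ratio, the permutation scheme (re-drawing the outlier set at random from all tested genes), and the use of the Benjamini-Hochberg procedure for FDR correction. All function names are hypothetical.

```python
import random


def odds_ratio(gene_set, outliers, universe):
    """Odds ratio for marker genes among outliers, with a 0.5 correction."""
    in_set = gene_set & universe
    a = len(in_set & outliers)               # marker and outlier
    b = len(in_set - outliers)               # marker, not outlier
    c = len((outliers & universe) - in_set)  # outlier, not marker
    d = len(universe) - a - b - c            # neither
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


def permutation_pvalue(gene_set, outliers, universe, n_perm=10_000, seed=0):
    """One-sided P-value from randomly re-drawn outlier sets."""
    rng = random.Random(seed)
    genes = sorted(universe)
    n_outliers = len(outliers & universe)
    observed = odds_ratio(gene_set, outliers, universe)
    hits = sum(
        odds_ratio(gene_set, set(rng.sample(genes, n_outliers)), universe) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

The same functions would be called twice per gene set, once with the positive outliers and once with the negative outliers, to obtain the bulk and pseudobulk enrichment directions.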

ADDITIONAL RESOURCES

Documentation for the ScPCA Portal can be found at https://scpca.readthedocs.io.

KEY RESOURCES TABLE

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited Data
Summarized gene expression data This paper https://scpca.alexslemonade.org
Software and algorithms
scpca-nf workflow used for processing all ScPCA Portal data This paper https://github.com/AlexsLemonade/scpca-nf73
ScPCA Portal code This paper https://github.com/AlexsLemonade/scpca-portal74
Benchmarking of tools used to build scpca-nf This paper https://github.com/AlexsLemonade/alsf-scpca75 and https://github.com/AlexsLemonade/sc-data-integration76
Code for creating reference files for consensus cell type assignment This paper https://github.com/AlexsLemonade/OpenScPCA-analysis110
Workflow for assigning OpenScPCA project cell type annotations to ScPCA data This paper https://github.com/AlexsLemonade/OpenScPCA-nf77
ScPCAr package for programmatically downloading from the Portal This paper https://github.com/AlexsLemonade/ScPCAr78
Code for underlying figures and analyses This paper https://github.com/AlexsLemonade/scpca-paper-figures79
Nextflow Di Tommaso et al.14 https://github.com/nextflow-io/nextflow/tree/master
salmon Patro et al.57 https://anaconda.org/channels/bioconda/packages/salmon/overview
alevin-fry He et al.15 https://anaconda.org/channels/bioconda/packages/alevin-fry/overview
Space Ranger 10x Genomics https://www.10xgenomics.com/support/software/space-ranger/latest
SingleCellExperiment Amezquita et al.12 https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html
anndata Virshup et al.54 https://pypi.org/project/scvi-tools/
DropletUtils Lun et al.45; Griffiths et al.46 https://bioconductor.org/packages/release/bioc/html/DropletUtils.html
miQC Hippen et al.47 https://bioconductor.org/packages/release/bioc/html/miQC.html
scDblFinder Germain et al.83 https://bioconductor.org/packages/release/bioc/html/scDblFinder.html
scuttle McCarthy et al.111 https://bioconductor.org/packages/release/bioc/html/scuttle.html
scran Lun et al.112 https://bioconductor.org/packages/release/bioc/html/scran.html
scater McCarthy et al.111 https://bioconductor.org/packages/release/bioc/html/scater.html
Seurat Hao et al.113 https://satijalab.org/seurat/
vireo Huang et al.114; Weber et al.55 https://github.com/single-cell-genetics/vireo
batchelor Lun et al.105 https://bioconductor.org/packages/release/bioc/html/batchelor.html
SingleR Aran et al.50 https://bioconductor.org/packages/release/bioc/html/SingleR.html
celldex Aran et al.50 https://bioconductor.org/packages/release/data/experiment/html/celldex.html
CellAssign Zhang et al.51 https://docs.scvi-tools.org/en/stable/installation.html
SCimilarity Heimberg et al.52 https://genentech.github.io/scimilarity/index.html
inferCNV inferCNV of the Trinity CTAT Project53 https://www.bioconductor.org/packages/release/bioc/html/infercnv.html
zellkonverter Zappia et al.106 https://bioconductor.org/packages/release/bioc/html/zellkonverter.html
DESeq2 Love et al.107 https://bioconductor.org/packages/release/bioc/html/DESeq2.html
lme4 Bates et al.108 https://cran.r-project.org/web/packages/lme4/index.html
ontoProc Carey94 https://bioconductor.org/packages/release/bioc/html/ontoProc.html
Other
HsapDv Ontology Lookup Service16 https://www.ebi.ac.uk/ols4/ontologies/hsapdv
PATO Gkoutos et al.17 https://www.ebi.ac.uk/ols4/ontologies/pato
NCBI Taxonomy Schoch et al.19 https://www.ncbi.nlm.nih.gov/taxonomy
Mondo Vasilevsky et al.21 https://www.ebi.ac.uk/ols4/ontologies/mondo
UBERON Haendel et al.23 https://www.ebi.ac.uk/ols4/ontologies/uberon
Hancestro Morales et al.26 https://www.ebi.ac.uk/ols4/ontologies/hancestro
Cell Ontology Bard et al.61; Meehan et al.60; Diehl et al.59; Tan et al.93 https://www.ebi.ac.uk/ols4/ontologies/cl
Blueprint and ENCODE Martens et al.88; The ENCODE Consortium89 https://rdrr.io/github/LTLA/celldex/man/BlueprintEncodeData.html
PanglaoDB 91 https://panglaodb.se/
CellMarker 2.0 95 http://117.50.127.228/CellMarker/

Discussion

The ScPCA Portal is a downloadable collection of uniformly processed, summarized single-cell and single-nuclei RNA-seq data and de-identified metadata from pediatric tumor samples. The Portal includes over 700 samples from 55 tumor types, making this the most comprehensive collection of publicly available pediatric tumor single-cell transcriptomic datasets to our knowledge. Users can browse and search the Portal via an intuitive web interface and explore visualizations of individual samples via a UCSC Cell Browser interface39. Summarized data are available at three different processing stages (unfiltered, filtered, or processed objects), allowing users to start from a processed object or perform their own processing, e.g., filtering and normalization. Processed objects contain normalized gene expression data, reduced dimensionality results from PCA and UMAP, cell type annotations, and CNV inference results. Standardized metadata, containing human-readable values for all fields and ontology term identifiers for a subset of fields, are also provided for all samples. Every library includes a quality control report for assessing data quality. These processed results and metadata allow researchers to save time by moving directly to downstream analyses, such as identifying marker genes or exploring genes of interest.

Data on the Portal are available as either SingleCellExperiment or AnnData objects for ease of use within R or Python (i.e., Bioconductor or scverse, respectively) environments. The AnnData objects also allow users to integrate ScPCA data with data and tools on other platforms. In particular, our AnnData objects are designed to be mostly compliant with the requirements of CZI CELLxGENE5,6,115, and these objects can also be used with Kana116,117. Data can be downloaded both via the Portal web interface and programmatically using the ScPCAr R package, which is a wrapper for the ScPCA Portal API (https://api.scpca.alexslemonade.org/docs/).

Cell type annotations are also available from three automated methods: SingleR, CellAssign, and SCimilarity. The first two approaches use publicly available references, while the third uses a foundation model to label cells. These labels are used to construct ontology-aware consensus cell type labels, which provide a harmonized labeling scheme across samples.
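As an illustration of the ontology-aware consensus step (described in the Figure 3B legend as choosing the latest common ancestor with the fewest descendants), the following minimal sketch applies that rule to a toy ontology. The ontology fragment, the function names, and the tie-breaking by term name are assumptions made for illustration; this is not the scpca-nf implementation, and a production workflow would likely also reject overly broad ancestors, which this sketch does not do.

```python
from itertools import combinations

# Toy ontology fragment (hypothetical): each term maps to its direct parents.
PARENTS = {
    "cell": [],
    "leukocyte": ["cell"],
    "lymphocyte": ["leukocyte"],
    "T cell": ["lymphocyte"],
    "CD4-positive T cell": ["T cell"],
    "CD8-positive T cell": ["T cell"],
    "B cell": ["lymphocyte"],
    "macrophage": ["leukocyte"],
}


def ancestors_inclusive(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in PARENTS[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def n_descendants(term):
    """Count the terms that have `term` as a proper ancestor."""
    return sum(
        1 for other in PARENTS
        if other != term and term in ancestors_inclusive(other)
    )


def pair_lca(a, b):
    """Shared ancestor of a and b with the fewest descendants."""
    shared = ancestors_inclusive(a) & ancestors_inclusive(b)
    return min(shared, key=lambda t: (n_descendants(t), t))


def consensus(labels):
    """Most specific latest common ancestor across all pairs of labels."""
    candidates = [pair_lca(a, b) for a, b in combinations(labels, 2)]
    return min(candidates, key=lambda t: (n_descendants(t), t))


result = consensus(["CD4-positive T cell", "CD8-positive T cell", "B cell"])
print(result)
```

In this toy example, the two T cell subtypes share "T cell" (two descendants), whereas each pairing with "B cell" only shares the broader "lymphocyte" (four descendants), so "T cell" is selected.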

Many samples on the Portal have additional sequencing data which can also be downloaded, including ADT data from CITE-seq, cell hashing data, bulk RNA-seq, or spatial transcriptomics. These modalities can facilitate expanded analyses. For example, ADT data can help label unknown cell types and correlate RNA to protein expression35, and spatial transcriptomics data can be leveraged to gain more insights about distribution of cell types in the tissue118. Similarly, users can gain more insight from bulk RNA-seq data available on the Portal by integrating with scRNA-seq data from the same sample119,120. The scRNA-seq data available on the Portal can also be used to deconvolute existing bulk RNA-seq datasets. The ScPCA Portal therefore enables multimodal comparisons that reveal biological and/or technical signals that would otherwise not be apparent from one sequencing modality alone.

We also introduced our open-source and efficient workflow for uniformly processing datasets available on the Portal, scpca-nf, which is available to the entire research community. In one command, scpca-nf can process raw data from various sequencing types, turning FASTQ files into processed SingleCellExperiment or AnnData objects ready for downstream analyses. Using Nextflow as the framework for scpca-nf with container images for each process makes the workflow modular and portable. Processed output from running scpca-nf on samples from pediatric tumors, cell lines, or other model organisms is eligible for submission to the ScPCA Portal, enabling us to continue to grow the Portal, improving its utility to the research community.

Portal data can also support re-analysis of existing pediatric cancer datasets with bulk RNA-seq, such as the Pediatric Brain Tumor Atlas121,122, allowing researchers to glean more insight from published data without obtaining additional samples. Ultimately, the ScPCA Portal will save researchers time and money, advancing pediatric cancer research.

Limitations of this study

We note several limitations of the current ScPCA Portal and scpca-nf workflow. First, labeling tumor cells is challenging as neither SingleR nor CellAssign references include them. We mitigate this in several ways. As references are biased towards normal cells, cells without a consensus cell type annotation are likely to be malignant. We provide CNV estimates for roughly half of Portal libraries, and together with consensus cell types, these can help identify tumor cells, particularly for diagnoses with common copy number alterations such as neuroblastoma (Figure 5B-D) and osteosarcoma. We are also augmenting projects through the ongoing OpenScPCA project64, which provides robust, fully open-source annotations that formally distinguish normal from tumor cells and will continue to expand across Portal projects.

Second, although we provide merged objects for each project, we do not provide integrated or batch-corrected objects, as the appropriate correction approach depends on the scientific question and is best left to the user.

Third, while the Portal features several modalities beyond single-cell and single-nuclei RNA-seq, not all modalities are currently represented; epigenomic modalities such as scATAC-seq, for example, are absent. Our modular Nextflow workflow is flexibly designed to accommodate additional modalities as the Portal expands.

Acknowledgments

We thank the data generators and submitters of the Single-cell Pediatric Cancer Atlas. We also thank Anna Greene for her role in constructing the Single-cell Pediatric Cancer Atlas funding opportunity.

This work was funded through the Alex’s Lemonade Stand Foundation Childhood Cancer Data Lab and Childhood Cancer Data Lab Postdoctoral Fellowship (SMF).

Author Contributions

Author Contributions
Allegra G. Hawkins Methodology, Software, Investigation, Validation, Formal analysis, Data curation, Writing - Original Draft, Writing - Review & Editing, Visualization
Joshua A. Shapiro Methodology, Software, Investigation, Validation, Formal analysis, Resources, Data curation, Writing - Original Draft, Writing - Review & Editing, Visualization
Stephanie J. Spielman Methodology, Software, Investigation, Validation, Formal analysis, Data curation, Writing - Original Draft, Writing - Review & Editing, Visualization
David S. Mejia Methodology, Software, Validation, Data curation, Writing - Review & Editing, Resources
Deepashree Venkatesh Prasad Methodology, Software, Validation, Visualization, Writing - Review & Editing
Nozomi Ichihara Methodology, Software, Writing - Review & Editing
Arkadii Yakovets Methodology, Software, Validation, Data curation, Resources, Writing - Review & Editing
Avrohom M. Gottlieb Methodology, Software, Validation, Data curation, Writing - Review & Editing, Resources
Kurt G. Wheeler Methodology, Software, Validation, Data curation, Resources, Writing - Review & Editing
Chante J. Bethell Software, Validation, Writing - Review & Editing
Steven M. Foltz Writing - Review & Editing
Jennifer O’Malley Data curation, Supervision, Writing - Review & Editing
Casey S. Greene Conceptualization, Project administration, Supervision, Writing - Review & Editing
Jaclyn N. Taroni Conceptualization, Methodology, Investigation, Validation, Data curation, Writing - Original Draft, Writing - Review & Editing, Visualization, Supervision, Project administration

Declarations of Interest

AGH, JAS, SJS, DSM, DVP, NI, AY, AMG, KGW, CJB, JO, and JNT are or were employees of Alex’s Lemonade Stand Foundation, a sponsor of this research.

Supplemental Information Index

Document S1. Figures S1-S7.

Table S1. Overview of ScPCA Portal Datasets, Related to Figure 1. This table provides descriptions and sample and library counts for projects in the ScPCA Portal.

scpca_project_id: ScPCA project unique identifier. Diagnosis group: Diagnosis group as shown in Figure 1A. Diagnoses: Full set of diagnoses for all samples in the project. Total number of samples (S): Number of samples in the project. Total number of libraries (L): Number of libraries in the project. Due to additional sequencing modalities and/or multiplexing, projects may have more libraries than samples. All remaining columns give the number of libraries (as designated with (L)) with the given suspension type, 10x kit version, or additional modality.

Table S2. Summary of references used for cell type annotation with CellAssign, Related to Figure 3. This table summarizes the references used for assigning cell types using CellAssign. References were built using all cell types from a set of organs in PanglaoDB’s marker gene list.

scpca_project_id: ScPCA project unique identifier. Diagnoses: Full set of diagnoses for all samples in the project. ScPCA reference name: Custom reference name. PanglaoDB organs included in reference: Organs from PanglaoDB included in the reference. Marker genes for all cell types in each organ were included.

Figure Titles and Legends

Figure 1: Overview of ScPCA Portal contents.

A. Barplots showing sample counts across four main cancer groupings in the ScPCA Portal with the total number of samples per cancer type displayed. Bars are colored by the number of samples with the indicated disease timing.

B. Barplot showing sample counts across modalities present in the ScPCA Portal. All single-cell or single-nuclei samples in the Portal are shown under the “All Samples” heading, a subset of which also have additional modalities as shown under the “Samples with additional modalities” heading. As indicated, sample suspensions are either single-cell or single-nuclei. For example, 75 single-cell samples and 101 single-nuclei samples have accompanying bulk RNA-seq data. Note that two samples were sequenced with both single-cell and single-nuclei suspensions. Samples with only bulk RNA-seq or spatial transcriptomics modalities are not shown.

C. Example of a project card as displayed on the “Browse” page of the ScPCA Portal and a “Visualize” view for a library within that project, colored by consensus cell type annotation. This project card and visualized sample are from project SCPCP000004123,124. Project cards include information about the number of samples, technologies and modalities, additional sample metadata information, submitter-provided diagnoses, and a submitter-provided abstract. Where available, submitter-provided citation information and other databases hosting this data are also provided. The visualization employs the UCSC Cell Browser39, enabling interactive exploration with user-selectable coloring options for cell types, gene-level expression, and other per-cell annotations created by scpca-nf.

Also see Table S1.




Figure 2: Overview of the scpca-nf workflow and download file structures.

A. Overview of scpca-nf, the primary workflow for processing single-cell/nuclei RNA-seq data for the ScPCA Portal. Mapping is first performed with alevin-fry to generate a gene-by-cell count matrix, which is read into R and converted into a SingleCellExperiment (SCE) object (Unfiltered SCE Object). Next, empty droplets are filtered out (Filtered SCE Object). The filtered object undergoes additional post-processing, including removing low-quality cells, normalizing counts, and performing dimension reduction with principal components analysis and UMAP. The object undergoes cell type annotation and CNV inference (Processed SCE Object). A summary QC report and a supplemental cell type report are prepared and exported. Finally, all SCE files are converted to AnnData format and exported.

Panels B-G show abbreviated versions of figures that appear in the summary QC report, shown here for SCPCL00000133, as follows:

B. The total UMI count for each cell in the Unfiltered SCE Object, ordered by rank. Points are colored by the percentage of cells that pass the empty droplets filter.

C. The number of genes detected in each cell passing the empty droplets filter against the total UMI count. Points are colored by the percentage of mitochondrial reads in the cell.

D. miQC model diagnostic plot showing the percent of mitochondrial reads in each cell against the number of genes detected in the Filtered SCE Object. Points are colored by the probability that the cell is compromised as determined by miQC.

E. The percent of mitochondrial reads in each cell against the number of genes detected in each cell. Points are colored by whether the cell was kept or removed, as determined by both miQC and a minimum unique gene count cutoff, prior to normalization and dimensionality reduction.

F. UMAP of log-normalized RNA expression colored by the number of genes detected.

G. UMAP of log-normalized RNA expression for the top four most variable genes, colored by the given gene’s expression. In the actual summary QC report, the top 12 most highly variable genes are shown.

H. File download structure for an ScPCA Portal project download in SingleCellExperiment (SCE) format. The download folder is named according to the project ID, data format, and the date it was downloaded. Download folders contain a folder for all single-cell data, _single-cell. Here, each sample ID has a dedicated folder containing the three processing levels of the expression data, the summary QC report, and cell type report all named according to the ScPCA library ID. The single-cell_metadata.tsv file contains sample metadata for all samples included in the download. The README.md file provides information about the contents of each download file, additional contact and citation information, and terms of use for data downloaded from the ScPCA Portal. The folder _bulk – only present for projects with bulk RNA-Seq data – contains a gene-by-sample matrix of counts quantified by salmon and associated metadata for samples with bulk RNA-Seq data.

I. File download structure for an ScPCA Portal merged project download in SCE format. As in panel H, the download folder is named by project ID, format, and date. Download folders contain a folder for all single-cell data, _single-cell_merged, with a single merged object containing all samples in the given project and a summary report detailing the merged object’s contents. Summary QC and cell type reports for each library are provided in the individual_reports folder arranged by their library ID. As in panel (H), additional files single-cell_metadata.tsv, _bulk_quant.tsv, _bulk_metadata.tsv, and README.md are also included.

Also see Figures S1-S3.
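The kept/removed classification described for panel E combines two criteria: a miQC-derived probability that a cell is compromised and a minimum number of unique genes detected. A minimal sketch of that combined rule is shown below. The `Cell` class, the barcodes, and both threshold values are hypothetical and chosen only for illustration; they should not be read as the settings used by scpca-nf.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    barcode: str
    genes_detected: int       # number of unique genes detected in the cell
    prob_compromised: float   # probability from a miQC-style model (assumed precomputed)


def keep_cell(cell, max_prob_compromised=0.75, min_genes=200):
    """Keep a cell only if it passes both filters (threshold values are assumptions)."""
    passes_miqc = cell.prob_compromised <= max_prob_compromised
    passes_gene_count = cell.genes_detected >= min_genes
    return passes_miqc and passes_gene_count


cells = [
    Cell("AAAC", genes_detected=1500, prob_compromised=0.05),  # passes both filters
    Cell("AAAG", genes_detected=150, prob_compromised=0.10),   # too few genes
    Cell("AAAT", genes_detected=900, prob_compromised=0.95),   # likely compromised
    Cell("AACA", genes_detected=400, prob_compromised=0.60),   # passes both filters
]

kept = [cell.barcode for cell in cells if keep_cell(cell)]
print(kept)
```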

Figure 3: Consensus cell type annotation in scpca-nf.

A. Expanded view of cell type annotation process within scpca-nf, as introduced in Figure 2A. Cell type annotation is performed on the Processed SCE Object. SingleR50 annotation uses a celldex50 reference dataset with ontology labels, CellAssign51 annotation uses a list of marker genes compiled from PanglaoDB91, and SCimilarity52 annotation uses the SCimilarity foundation model. These cell typing results, along with consensus cell type annotations, are then added to the Processed SCE Object. A cell type summary report with information about reference sources, comparisons among cell type annotation methods, and diagnostic plots is created. Note that cell type annotations are also included in the Processed AnnData Object (Figure 2A).

B. Diagram of ontology-aware consensus cell type annotation performed in scpca-nf, using T cell annotation as an example. The numbers in parentheses indicate the number of descendants for each cell type in the Cell Ontology. Pairwise comparisons are made among annotations, and the consensus cell type is the latest common ancestor in Cell Ontology with the fewest descendants, indicated by the black check mark.

C. Example heatmap (SCPCL000001) as shown in the cell type summary report comparing the consensus cell type annotations to automated annotations assigned by SingleR, CellAssign, and SCimilarity. Heatmap cells are colored by the Jaccard similarity index. A value of 1 indicates complete overlap and 0 indicates no overlap between cells annotated with each label. This figure shows only the top seven consensus cell type annotations with at least three cells, but the heatmap in the cell type summary report shows all cell types.

Also see Figure S4 and Table S2.
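The Jaccard similarity index used in panel C is the size of the intersection of two sets of cells divided by the size of their union. The sketch below computes it for every pairing of a consensus label with a label from one annotation method. The barcodes and labels are made up for illustration, and the helper names are not taken from scpca-nf.

```python
from collections import defaultdict


def cells_by_label(labels):
    """Group cell barcodes by their assigned label."""
    groups = defaultdict(set)
    for barcode, label in labels.items():
        groups[label].add(barcode)
    return dict(groups)


def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def jaccard_matrix(consensus_labels, method_labels):
    """Jaccard index for every (consensus label, method label) pair."""
    consensus_sets = cells_by_label(consensus_labels)
    method_sets = cells_by_label(method_labels)
    return {
        (c_label, m_label): jaccard(c_cells, m_cells)
        for c_label, c_cells in consensus_sets.items()
        for m_label, m_cells in method_sets.items()
    }


# Hypothetical per-cell labels
consensus_labels = {"c1": "T cell", "c2": "T cell", "c3": "B cell", "c4": "macrophage"}
method_labels = {"c1": "T cell", "c2": "B cell", "c3": "B cell", "c4": "macrophage"}

matrix = jaccard_matrix(consensus_labels, method_labels)
print(matrix[("T cell", "T cell")])
```

A value of 1 arises only when exactly the same cells carry both labels, as for "macrophage" in this toy input.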

Figure 4: Consensus cell type annotations in brain and CNS tumors.

A. Dot plot showing expression of cell-type-specific marker genes across all libraries from brain and central nervous system (CNS) tumors, excluding multiplexed libraries. Expression is shown for each broad cell type annotation, each of which represents a collection of similar consensus cell type annotations. The y-axis displays broad consensus cell types. The x-axis displays marker genes, determined by CellMarker 2.095, used to validate cell types in the top annotation bar. Dots are colored by mean gene expression across libraries and sized proportionally to the percent of libraries they are observed in, out of all cells with the same broad cell type annotation in brain and CNS tumor libraries. Up to 10 marker genes are shown per broad cell type. Only broad cell type annotations present in at least 50 cells across samples in the given diagnosis group are shown.

B. Barplot showing the percentage of each broad consensus cell type annotation across brain and CNS tumor libraries, separated into high-grade glioma (top panel) and low-grade glial/glioneuronal (bottom panel) diagnoses for non-multiplexed libraries from patient tissue samples.

C. Barplot showing all consensus cell types classified as immune cells across brain and CNS tumor libraries, separated into high-grade glioma (top panel) and low-grade glial/glioneuronal (bottom panel) diagnoses for non-multiplexed libraries. The percentage of immune cells classified as the indicated consensus cell type is shown. Only libraries comprised of at least 1% immune cells, based on consensus cell type annotations, are shown. Specific consensus cell types for myeloid and lymphocyte immune cells are shown, with all other consensus immune cell types included in “other.” Myeloid or lymphocyte immune cell types with fewer than 1000 cells across all libraries are included in “other.”

D. Dot plot as in panel A, but restricted to immune cells across all non-multiplexed libraries from brain and CNS tumors and showing consensus rather than broad cell types. Only broad cell type annotations present in at least 50 cells across samples in the given diagnosis group are shown. Cell types without associated marker genes in CellMarker 2.0 are excluded, including lymphocyte of B lineage, mature T cell, mature alpha-beta T cell, alpha-beta T cell, myeloid leukocyte, and tissue-resident macrophage.

Also see Figures S5-S6.

Figure 5: Cell type annotation and CNV inference on neuroblastoma samples.

A-B. UMAP of all libraries in the neuroblastoma-only ScPCA Project SCPCP000004 (N = 42). The UMAP was constructed from the merged SCPCP000004 object with equal library weighting, but no batch correction was performed. Panel A highlights cell type annotations made with the OpenScPCA Project, collapsed into broad annotation groups. Panel B highlights total per-cell CNV events calculated as the sum of chromosome arms with a CNV event, as estimated by the i6 HMM in InferCNV53. Gray cells in panel B represent libraries excluded from InferCNV inference due to insufficient normal cells for defining the InferCNV reference baseline.

C. Heatmap displaying per-cell CNV events across chromosomes with canonical neuroblastoma alterations65–67 for a single library, SCPCL000130. Each cell in SCPCL000130 is represented by two adjacent rows, the first indicating copy-number gain and the second indicating copy-number loss. The heatmap is grouped by chromosome arm and OpenScPCA Project cell type annotation, where “normal” cells comprise all non-malignant cells.

D. Ridge plot from the summary QC report (SCPCL000130) of per-cell total CNV distributions across the top seven consensus cell type annotations. Other consensus cell types are shown in the “all remaining cell types” category.
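The total per-cell CNV value shown in panels B and D is described as the sum of chromosome arms with a CNV event. A minimal sketch of that summation follows. It assumes that per-arm calls have already been collapsed to "gain", "loss", or "neutral" for each cell; how InferCNV HMM output is collapsed to arm-level calls is not reproduced here, and the cells and calls are invented for illustration.

```python
def total_cnv_events(arm_calls):
    """Count chromosome arms whose call is a gain or a loss."""
    return sum(1 for state in arm_calls.values() if state in ("gain", "loss"))


# Hypothetical arm-level calls for three cells
calls_by_cell = {
    "cell_1": {"1p": "loss", "2p": "neutral", "11q": "loss", "17q": "gain"},
    "cell_2": {"1p": "neutral", "2p": "neutral", "11q": "neutral", "17q": "neutral"},
    "cell_3": {"1p": "neutral", "2p": "neutral", "11q": "neutral", "17q": "gain"},
}

totals = {cell: total_cnv_events(calls) for cell, calls in calls_by_cell.items()}
print(totals)
```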

Figure 6: Comparison of bulk and pseudobulk modalities.

A. Scatter plots colored by point density of DESeq2-transformed and normalized bulk RNA-seq expression compared to pseudobulk expression from single-cell/nuclei RNA-seq, with a regression line shown. Samples with RNA-seq for both bulk and single-cell/nuclei modalities, excluding multiplexed samples, from ScPCA projects of brain and CNS tumors are shown, with sample counts in parentheses.

B. Odds ratios, which indicate overrepresentation of cell-type marker genes in bulk relative to single-cell/nuclei RNA-seq, from overrepresentation analysis for the same samples shown in panel A, colored by FDR-corrected significance. 68 cell types were evaluated for each project.

Also see Figure S7.
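Pseudobulk expression, as compared against bulk RNA-seq in panel A, can be obtained by summing counts for each gene across all cells in a library. The sketch below shows only that aggregation step on invented counts; the DESeq2 transformation and normalization mentioned in the legend are not reproduced, and the gene symbols are placeholders.

```python
from collections import Counter


def pseudobulk(cells):
    """Sum per-gene counts across cells; each cell is a {gene: count} mapping."""
    totals = Counter()
    for counts in cells:
        totals.update(counts)  # Counter.update adds counts rather than replacing them
    return dict(totals)


# Hypothetical sparse counts for three cells
cells = [
    {"MYCN": 3, "PHOX2B": 1},
    {"MYCN": 2},
    {"CD3E": 4, "PHOX2B": 1},
]

library_profile = pseudobulk(cells)
print(library_profile)
```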

Supplementary Figures

Figure S1: Results from benchmarking alevin-fry and cellranger count performance, Related to Figure 2.

Panels compare metrics for six ScPCA libraries (three single-cell and three single-nuclei), obtained from processing with salmon alevin and alevin-fry or cellranger count. Results were generated with CellRanger v6.1.2 using default parameters for single-cell libraries and the --include-introns flag to include intronic reads for single-nuclei libraries only. Libraries were processed with salmon alevin v1.5.2 and alevin-fry v0.4.1 using an index containing both spliced and unspliced cDNA (see Methods). Libraries used for benchmarking were randomly chosen.

A. Runtime in minutes (top row) and peak memory in GB (bottom row) for libraries processed with both platforms. Processing with alevin-fry was consistently faster and more memory-efficient than processing with cellranger count.

Panels B-D show only cells present in both alevin-fry and cellranger count outputs.

B. Comparison of mean gene expression values for libraries processed with both platforms, shown on a log-scale. Each point is a gene, and only genes detected in at least 5 cells are shown. \(R^2\) values shown in the top left corner of each panel reflect broad agreement in mean gene expression values between platforms.

C. Comparison of log total UMI counts for libraries processed with both platforms. The total UMI counts per cell broadly agree between platforms, although alevin-fry returned slightly higher values for certain single-cell libraries.

D. Comparison of log total genes detected per cell for libraries processed with both platforms. The total number of genes detected per cell broadly agrees between platforms, although alevin-fry returned slightly higher values for certain single-cell libraries.
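To make the agreement metric in panel B concrete, the sketch below computes an \(R^2\) between log-transformed mean expression values from two platforms. It assumes that \(R^2\) is the squared Pearson correlation and that the log base is 10; the legend specifies neither, and the expression values are invented.

```python
import math


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical mean expression per gene from two quantification tools
platform_a_means = [0.5, 1.2, 8.0, 40.0, 150.0]
platform_b_means = [0.6, 1.0, 9.5, 38.0, 170.0]

log_a = [math.log10(value) for value in platform_a_means]
log_b = [math.log10(value) for value in platform_b_means]

agreement = r_squared(log_a, log_b)
print(round(agreement, 3))
```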

Figure S2: Processing additional single-cell modalities in scpca-nf, Related to Figure 2.

A. Overview of the scpca-nf workflow for processing libraries with CITE-seq or antibody-derived tag (ADT) data. The workflow mirrors that shown in Figure 2A with several differences accounting for the presence of ADT data. First, both an RNA and ADT FASTQ file are required as input to alevin-fry, along with a TSV file containing information about ADT barcodes. Second, during post-processing, statistics are calculated to filter cells based on ADT counts, but the filter is not applied. ADT counts are also normalized and included in the Processed SCE Object. Third, the summary QC report will include a CITE-seq section with additional information about ADT-level processing. Fourth, the workflow exports SCE objects containing both RNA and ADT results, while separate AnnData objects for RNA and ADT are exported.

Panels B-D show example figures that appear in the CITE-seq section of the summary QC report, shown here for SCPCL000290.

B. The percent of mitochondrial reads in each cell against the number of genes detected in each cell. The panel labeled “Keep” displays cells that are retained based on both RNA and ADT counts. Other panels display cells that are filtered based only on the given type of counts.

C. Density plots of the log-normalized ADT counts for the library’s four most variable ADTs.

D. UMAP of log-normalized RNA expression values. Cells are colored by expression of the given highly-variable ADT.

E. Overview of the scpca-nf workflow for multiplexed libraries. The workflow mirrors that shown in Figure 2A with several differences accounting for the presence of multiplexed data. First, both an RNA and HTO FASTQ file are required as input to alevin-fry, along with a TSV file providing information about library pools. Second, in parallel, the RNA FASTQ file, the HTO FASTQ file, and, if available, a corresponding bulk RNA FASTQ file for each sample present in the multiplexed library are provided to a demultiplexing subprocess. The workflow calculates demultiplexing results based on HTO counts, as well as genetic demultiplexing results if the library has corresponding bulk RNA-seq data. Demultiplexing results are stored in all exported SCE object versions, but libraries themselves are not demultiplexed. Third, only SCE, not AnnData, files are provided for multiplexed libraries.

Figure S3: Processing other sequencing modalities and merging objects with scpca-nf, Related to Figure 2.

A. Overview of the bulk RNA-Seq workflow. Reads are trimmed using fastp, and salmon is used to map reads and quantify counts. Quantified expression files are grouped by project and exported as a gene-by-sample count matrix in TSV format.

B. Overview of the spatial transcriptomics workflow. The FASTQ file and tissue and/or CytAssist image for a given library are input to spaceranger. spaceranger results are returned without any further processing.

C. Overview of the merged workflow. Processed SCE objects in a given project are merged into a single object, including ADT counts from CITE-seq data if present, and a merged summary report is generated. Merged objects are provided in either SCE or AnnData format.

D. Example of UMAPs as shown in the merged summary report. Cells from the library of interest are in red, and cells from other libraries are in gray. The UMAP was constructed from the merged object with equal library weighting, but no batch correction was performed. The libraries pictured are a subset of libraries in the ScPCA project SCPCP000003. For this figure specifically, the merged UMAP was constructed from these four libraries only, but the merged object and summary report on the ScPCA Portal for SCPCP000003 contain all of this project’s libraries.

Figure S4: Ontology-aware consensus cell type assignment provides harmonized labels for cells, Related to Figure 3.

A. UMAP highlighting cells annotated as types of T cells with SingleR, CellAssign, SCimilarity as well as the associated consensus cell types for the library SCPCL000049. All other cells are shown in gray. The top three T cell types are shown for each method, with remaining T cell types combined in “other T cells.”

B. UMAP showing the top seven consensus cell types in SCPCL000049. Other consensus cell types are included in the “all remaining cell types” category.

C. UMAP showing total per-cell CNV events, calculated by summing the number of chromosome arms with a CNV event as estimated by the i6 HMM in InferCNV53, for SCPCL000049.

Figure S5: Consensus cell type annotation gene expression in other diagnosis groups, Related to Figure 4.

Dot plots showing expression of cell-type-specific marker genes across libraries from Leukemia (A), Sarcoma (B), and Other solid tumors (C) diagnosis groups. Expression is shown for each broad cell type annotation. The y-axis displays broad consensus cell types. The x-axis displays marker genes, determined by CellMarker 2.095, used to validate cell types in the top annotation bar. Dots are colored by mean gene expression across libraries and sized proportionally to the percent of libraries they are observed in, out of all cells with the same broad cell type annotation in the given diagnosis. Up to 10 marker genes are shown per broad cell type. Only broad cell type annotations present in at least 50 cells across samples in the given diagnosis group are shown. Cell types without associated marker genes in CellMarker 2.0 are excluded, including pigment cell for Sarcoma (B) and pigment cell and kidney cell for Other solid tumors (C).

Figure S6: Consensus cell type annotation distributions in other diagnosis groups, Related to Figure 4.

Barplots of the percentage of cells annotated as each broad consensus cell type annotation across all libraries from Leukemia (A), Sarcoma (B), and Other solid tumors (C) diagnosis groups, specifically for non-multiplexed libraries from patient tissue samples. Libraries are grouped by diagnosis in each panel. Each column represents the distribution of cell types within a single library.

Figure S7: Comparison of bulk and pseudobulk modalities for additional projects, Related to Figure 6.

A. Scatter plots colored by point density of DESeq2-transformed and normalized bulk RNA-seq expression compared to pseudobulk expression from single-nuclei RNA-seq, with a regression line shown. Projects with RNA-seq for both bulk and single-cell/nuclei modalities that are not displayed in Figure 6A are shown, with sample counts in parentheses. All samples shown here are single-nuclei libraries.

B. Odds ratios, which indicate overrepresentation of cell-type marker genes in bulk relative to single-cell/nuclei RNA-seq, from overrepresentation analysis for the same samples shown in panel A, colored by FDR-corrected significance. 44 cell types were evaluated for project SCPCP000006, and 50 cell types were evaluated for project SCPCP000017.
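For readers less familiar with overrepresentation analysis, the sketch below computes an odds ratio from a 2x2 table of marker-gene membership against membership in a bulk-enriched gene set. The construction of the table, the gene identifiers, and the handling of a zero denominator are assumptions for illustration; the exact test and gene-set definitions used for Figure 6B and this panel are described in the Methods and are not reproduced here.

```python
def odds_ratio(marker_genes, bulk_enriched_genes, all_genes):
    """Odds ratio for marker genes being over-represented among bulk-enriched genes."""
    universe = set(all_genes)
    markers = set(marker_genes) & universe
    enriched = set(bulk_enriched_genes) & universe

    a = len(markers & enriched)      # marker and bulk-enriched
    b = len(markers - enriched)      # marker, not bulk-enriched
    c = len(enriched - markers)      # bulk-enriched, not marker
    d = len(universe) - a - b - c    # neither

    if b * c == 0:
        # Undefined ratio; report infinity when there is any positive evidence
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


# Hypothetical gene universe of ten genes
all_genes = [f"g{i}" for i in range(1, 11)]
marker_genes = ["g1", "g2", "g3", "g4"]
bulk_enriched_genes = ["g1", "g2", "g3", "g5"]

ratio = odds_ratio(marker_genes, bulk_enriched_genes, all_genes)
print(ratio)
```

An odds ratio above 1 in this toy table indicates that marker genes appear among the bulk-enriched genes more often than expected from the rest of the gene universe.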

References

1.
Svensson, V., Vento-Tormo, R., and Teichmann, S.A. (2018). Exponential scaling of single-cell RNA-seq in the past decade. Nat Protoc 13, 599–604. 10.1038/nprot.2017.149.
2.
Trapnell, C. (2015). Defining cell types and states with single-cell genomics. Genome Res. 25, 1491–1498. 10.1101/gr.190595.115.
3.
Patel, A.P., Tirosh, I., Trombetta, J.J., Shalek, A.K., Gillespie, S.M., Wakimoto, H., Cahill, D.P., Nahed, B.V., Curry, W.T., Martuza, R.L., et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401. 10.1126/science.1254257.
4.
Barkley, D., Moncada, R., Pour, M., Liberman, D.A., Dryg, I., Werba, G., Wang, W., Baron, M., Rao, A., Xia, B., et al. (2022). Cancer cell states recur across tumor types and form specific interactions with the tumor microenvironment. Nat Genet 54, 1192–1201. 10.1038/s41588-022-01141-9.
5.
Megill, C., Martin, B., Weaver, C., Bell, S., Prins, L., Badajoz, S., McCandless, B., Pisco, A.O., Kinsella, M., Griffin, F., et al. (2021). cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices. Preprint at bioRxiv. 10.1101/2021.04.05.438318.
6.
CZI Single-Cell Biology Program, Abdulla, S., Aevermann, B., Assis, P., Badajoz, S., Bell, S.M., Bezzi, E., Cakir, B., Chaffer, J., Chambers, S., et al. (2023). CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Preprint at bioRxiv. 10.1101/2023.10.30.563174.
7.
Regev, A., Teichmann, S.A., Lander, E.S., Amit, I., Benoist, C., Birney, E., Bodenmiller, B., Campbell, P., Carninci, P., Clatworthy, M., et al. (2017). The Human Cell Atlas. eLife 6, e27041. 10.7554/elife.27041.
8.
Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J.E., Ashenberg, O., Cerami, E., Coffey, R.J., Demir, E., et al. (2020). The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell 181, 236–249. 10.1016/j.cell.2020.03.053.
9.
Tang, X.-W., Jiang, J., Huang, S., Shi, X.-M., Xu, H., Xu, J., Peng, J.-Y., Zhang, W., Shi, L., Zhong, X.-L., et al. (2024). Long-term trends in cancer incidence and mortality among U.S. children and adolescents: a SEER database analysis from 1975 to 2018. Front. Pediatr. 12, 1357093. 10.3389/fped.2024.1357093.
10.
Xu, X., Saxon, J., Soon, M.S.F., Lee, C.Y., and Tuong, Z.K. (2025). Data standards for single-cell RNA-sequencing of paediatric cancer. Clinical & Translational Immunology 14, e70033. 10.1002/cti2.70033.
11.
Jeanquartier, F., Jean-Quartier, C., and Holzinger, A. (2019). Use case driven evaluation of open databases for pediatric cancer research. BioData Mining 12, 2. 10.1186/s13040-018-0190-8.
12.
Amezquita, R.A., Lun, A.T.L., Becht, E., Carey, V.J., Carpp, L.N., Geistlinger, L., Marini, F., Rue-Albrecht, K., Risso, D., Soneson, C., et al. (2019). Orchestrating single-cell analysis with Bioconductor. Nat Methods 17, 137–145. 10.1038/s41592-019-0654-x.
13.
Wolf, F.A., Angerer, P., and Theis, F.J. (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15. 10.1186/s13059-017-1382-0.
14.
Di Tommaso, P., Chatzou, M., Floden, E.W., Barja, P.P., Palumbo, E., and Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nat Biotechnol 35, 316–319. 10.1038/nbt.3820.
15.
He, D., Zakeri, M., Sarkar, H., Soneson, C., Srivastava, A., and Patro, R. (2022). Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322. 10.1038/s41592-022-01408-3.
16.
EMBL-EBI. Ontology Lookup Service (OLS) - HsapDv. https://www.ebi.ac.uk/ols4/ontologies/hsapdv.
17.
Gkoutos, G.V., Schofield, P.N., and Hoehndorf, R. (2017). The anatomy of phenotype ontologies: principles, properties and applications. Briefings in Bioinformatics 19, 1008–1021. 10.1093/bib/bbx035.
18.
EMBL-EBI. Ontology Lookup Service (OLS) - PATO. https://www.ebi.ac.uk/ols4/ontologies/pato.
19.
Schoch, C.L., Ciufo, S., Domrachev, M., Hotton, C.L., Kannan, S., Khovanskaya, R., Leipe, D., Mcveigh, R., O'Neill, K., Robbertse, B., et al. (2020). NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062. 10.1093/database/baaa062.
20.
National Center for Biotechnology Information. NCBI Taxonomy. https://www.ncbi.nlm.nih.gov/taxonomy.
21.
Vasilevsky, N.A., Toro, S., Matentzoglu, N., Flack, J.E., Mullen, K.R., Hegde, H., Gehrke, S., Whetzel, P.L., Shwetar, Y., Harris, N.L., et al. (2025). Mondo: integrating disease terminology across communities. Genetics 232, iyaf215. 10.1093/genetics/iyaf215.
22.
EMBL-EBI. Ontology Lookup Service (OLS) - MONDO. https://www.ebi.ac.uk/ols4/ontologies/mondo.
23.
Haendel, M.A., Balhoff, J.P., Bastian, F.B., Blackburn, D.C., Blake, J.A., Bradford, Y., Comte, A., Dahdul, W.M., Dececchi, T.A., Druzinsky, R.E., et al. (2014). Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Sem 5, 21. 10.1186/2041-1480-5-21.
24.
Mungall, C.J., Torniai, C., Gkoutos, G.V., Lewis, S.E., and Haendel, M.A. (2012). Uberon, an integrative multi-species anatomy ontology. Genome Biol 13, R5. 10.1186/gb-2012-13-1-r5.
25.
EMBL-EBI. Ontology Lookup Service (OLS) - UBERON. https://www.ebi.ac.uk/ols4/ontologies/uberon.
26.
Morales, J., Welter, D., Bowler, E.H., Cerezo, M., Harris, L.W., McMahon, A.C., Hall, P., Junkins, H.A., Milano, A., Hastings, E., et al. (2018). A standardized framework for representation of ancestry data in genomics studies, with application to the NHGRI-EBI GWAS Catalog. Genome Biol 19, 21. 10.1186/s13059-018-1396-2.
27.
EMBL-EBI. Ontology Lookup Service (OLS) - HANCESTRO. https://www.ebi.ac.uk/ols4/ontologies/hancestro.
28.
Patel, A.G., Chen, X., Huang, X., Clay, M.R., Komarova, N.L., Krasin, M.J., Pappo, A., Tillman, H., Orr, B.A., McEvoy, J., et al. (2022). The myogenesis program drives clonal selection and drug resistance in rhabdomyosarcoma. Developmental Cell 57, 1226–1240.e8. 10.1016/j.devcel.2022.04.003.
29.
Lee, M.K., Azizgolshani, N., Shapiro, J.A., Nguyen, L.N., Kolling, F.W., Zanazzi, G.J., Frost, H.R., and Christensen, B.C. (2024). Identifying tumor type and cell type-specific gene expression alterations in pediatric central nervous system tumors. Nat Commun 15, 3634. 10.1038/s41467-024-47712-8.
30.
Lee, M.K., Azizgolshani, N., Zhang, Z., Perreard, L., Kolling, F.W., Nguyen, L.N., Zanazzi, G.J., Salas, L.A., and Christensen, B.C. (2024). Associations in cell type-specific hydroxymethylation and transcriptional alterations of pediatric central nervous system tumors. Nat Commun 15, 3635. 10.1038/s41467-024-47943-9.
31.
Stewart, E., Federico, S.M., Chen, X., Shelat, A.A., Bradley, C., Gordon, B., Karlstrom, A., Twarog, N.R., Clay, M.R., Bahrami, A., et al. (2017). Orthotopic patient-derived xenografts of paediatric solid tumours. Nature 549, 96–100. 10.1038/nature23647.
32.
Norrie, J.L., Nityanandam, A., Lai, K., Chen, X., Wilson, M., Stewart, E., Griffiths, L., Jin, H., Wu, G., Orr, B., et al. (2021). Retinoblastoma from human stem cell-derived retinal organoids. Nat Commun 12, 4535. 10.1038/s41467-021-24781-7.
33.
DeSisto, J., Donson, A.M., Griesinger, A.M., Fu, R., Riemondy, K., Mulcahy Levy, J., Siegenthaler, J.A., Foreman, N.K., Vibhakar, R., and Green, A.L. (2023). Tumor and immune cell types interact to produce heterogeneous phenotypes of pediatric high-grade glioma. Neuro-Oncology 26, 538–552. 10.1093/neuonc/noad207.
34.
Zeng, A.G.X., Iacobucci, I., Shah, S., Mitchell, A., Wong, G., Bansal, S., Chen, D., Gao, Q., Kim, H., Kennedy, J.A., et al. (2025). Single-cell Transcriptional Atlas of Human Hematopoiesis Reveals Genetic and Hierarchy-Based Determinants of Aberrant AML Differentiation. Blood Cancer Discovery 6, 307–324. 10.1158/2643-3230.bcd-24-0342.
35.
Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P.K., Swerdlow, H., Satija, R., and Smibert, P. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nat Methods 14, 865–868. 10.1038/nmeth.4380.
36.
Stoeckius, M., Zheng, S., Houck-Loomis, B., Hao, S., Yeung, B.Z., Mauck, W.M., III, Smibert, P., and Satija, R. (2018). Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol 19, 224. 10.1186/s13059-018-1603-1.
37.
Mailman, M.D., Feolo, M., Jin, Y., Kimura, M., Tryka, K., Bagoutdinov, R., Hao, L., Kiang, A., Paschall, J., Phan, L., et al. (2007). The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 39, 1181–1186. 10.1038/ng1007-1181.
38.
Tryka, K.A., Hao, L., Sturcke, A., Jin, Y., Wang, Z.Y., Ziyabari, L., Lee, M., Popova, N., Sharopova, N., Kimura, M., et al. (2013). NCBI’s Database of Genotypes and Phenotypes: dbGaP. Nucl. Acids Res. 42, D975–D979. 10.1093/nar/gkt1211.
39.
Speir, M.L., Bhaduri, A., Markov, N.S., Moreno, P., Nowakowski, T.J., Papatheodorou, I., Pollen, A.A., Raney, B.J., Seninge, L., Kent, W.J., et al. (2021). UCSC Cell Browser: visualize your single-cell data. Bioinformatics 37, 4578–4580. 10.1093/bioinformatics/btab503.
40.
Langer, B.E., Amaral, A., Baudement, M.-O., Bonath, F., Charles, M., Chitneedi, P.K., Clark, E.L., Di Tommaso, P., Djebali, S., Ewels, P.A., et al. (2025). Empowering bioinformatics communities with Nextflow and nf-core. Genome Biol 26, 228. 10.1186/s13059-025-03673-9.
41.
Zheng, G.X.Y., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat Commun 8, 14049. 10.1038/ncomms14049.
42.
43.
Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J., et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5, R80. 10.1186/gb-2004-5-10-r80.
44.
Huber, W., Carey, V.J., Gentleman, R., Anders, S., Carlson, M., Carvalho, B.S., Bravo, H.C., Davis, S., Gatto, L., Girke, T., et al. (2015). Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12, 115–121. 10.1038/nmeth.3252.
45.
Lun, A.T.L., Riesenfeld, S., Andrews, T., Dao, T.P., Gomes, T., participants in the 1st Human Cell Atlas Jamboree, and Marioni, J.C. (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20, 63. 10.1186/s13059-019-1662-y.
46.
Griffiths, J.A., Richard, A.C., Bach, K., Lun, A.T.L., and Marioni, J.C. (2018). Detection and removal of barcode swapping in single-cell RNA-seq data. Nat Commun 9, 2667. 10.1038/s41467-018-05083-x.
47.
Hippen, A.A., Falco, M.M., Weber, L.M., Erkan, E.P., Zhang, K., Doherty, J.A., Vähärautio, A., Greene, C.S., and Hicks, S.C. (2021). miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data. PLoS Comput Biol 17, e1009290. 10.1371/journal.pcbi.1009290.
48.
Lun, A.T.L., Bach, K., and Marioni, J.C. (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol 17, 75. 10.1186/s13059-016-0947-7.
49.
McInnes, L., Healy, J., and Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Preprint at arXiv. 10.48550/arXiv.1802.03426.
50.
Aran, D., Looney, A.P., Liu, L., Wu, E., Fong, V., Hsu, A., Chak, S., Naikawadi, R.P., Wolters, P.J., Abate, A.R., et al. (2019). Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20, 163–172. 10.1038/s41590-018-0276-y.
51.
Zhang, A.W., O’Flanagan, C., Chavez, E.A., Lim, J.L.P., Ceglia, N., McPherson, A., Wiens, M., Walters, P., Chan, T., Hewitson, B., et al. (2019). Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods 16, 1007–1015. 10.1038/s41592-019-0529-1.
52.
Heimberg, G., Kuo, T., DePianto, D.J., Salem, O., Heigl, T., Diamant, N., Scalia, G., Biancalani, T., Turley, S.J., Rock, J.R., et al. (2024). A cell atlas foundation model for scalable search of similar human cells. Nature 638, 1085–1094. 10.1038/s41586-024-08411-y.
53.
Broad Institute. InferCNV: Inferring Copy Number Alterations from Tumor Single Cell RNA-Seq Data. https://github.com/broadinstitute/inferCNV.
54.
Virshup, I., Rybakov, S., Theis, F.J., Angerer, P., and Wolf, F.A. (2024). anndata: Access and store annotated data matrices. JOSS 9, 4371. 10.21105/joss.04371.
55.
Weber, L.M., Hippen, A.A., Hickey, P.F., Berrett, K.C., Gertz, J., Doherty, J.A., Greene, C.S., and Hicks, S.C. (2021). Genetic demultiplexing of pooled single-cell RNA-sequencing samples in cancer facilitates effective experimental design. GigaScience 10, giab062. 10.1093/gigascience/giab062.
56.
Chen, S., Zhou, Y., Chen, Y., and Gu, J. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890. 10.1093/bioinformatics/bty560.
57.
Patro, R., Duggal, G., Love, M.I., Irizarry, R.A., and Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14, 417–419. 10.1038/nmeth.4197.
58.
59.
Diehl, A.D., Meehan, T.F., Bradford, Y.M., Brush, M.H., Dahdul, W.M., Dougall, D.S., He, Y., Osumi-Sutherland, D., Ruttenberg, A., Sarntivijai, S., et al. (2016). The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semant 7, 44. 10.1186/s13326-016-0088-7.
60.
Meehan, T.F., Masci, A.M., Abdulla, A., Cowell, L.G., Blake, J.A., Mungall, C.J., and Diehl, A.D. (2011). Logical Development of the Cell Ontology. BMC Bioinformatics 12, 6. 10.1186/1471-2105-12-6.
61.
Bard, J., Rhee, S.Y., and Ashburner, M. (2005). An ontology for cell types. Genome Biol 6, R21. 10.1186/gb-2005-6-2-r21.
62.
Sarantopoulos, A., Ene, C., and Aquilanti, E. (2024). Therapeutic approaches to modulate the immune microenvironment in gliomas. npj Precis. Onc. 8, 241. 10.1038/s41698-024-00717-4.
63.
Sharma, P., Aaroe, A., Liang, J., and Puduvalli, V.K. (2023). Tumor microenvironment in glioblastoma: Current and emerging concepts. Neuro-Oncology Advances 5, vdad009. 10.1093/noajnl/vdad009.
64.
Alex's Lemonade Stand Foundation. OpenScPCA Project. https://openscpca.readthedocs.io.
65.
Bonine, N., Zanzani, V., Van Hemelryk, A., Vanneste, B., Zwicker, C., Thoné, T., Roelandt, S., Bekaert, S.-L., Koster, J., Janoueix-Lerosey, I., et al. (2024). NBAtlas: A harmonized single-cell transcriptomic reference atlas of human neuroblastoma tumors. Cell Reports 43, 114804. 10.1016/j.celrep.2024.114804.
66.
Matthay, K.K., Maris, J.M., Schleiermacher, G., Nakagawara, A., Mackall, C.L., Diller, L., and Weiss, W.A. (2016). Neuroblastoma. Nat Rev Dis Primers 2, 16078. 10.1038/nrdp.2016.78.
67.
Tirode, F., Surdez, D., Ma, X., Parker, M., Le Deley, M.C., Bahrami, A., Zhang, Z., Lapouble, E., Grossetête-Lalami, S., Rusch, M., et al. (2014). Genomic Landscape of Ewing Sarcoma Defines an Aggressive Subtype with Co-Association of STAG2 and TP53 Mutations. Cancer Discovery 4, 1342–1353. 10.1158/2159-8290.cd-14-0622.
68.
Ding, J., Adiconis, X., Simmons, S.K., Kowalczyk, M.S., Hession, C.C., Marjanovic, N.D., Hughes, T.K., Wadsworth, M.H., Burks, T., Nguyen, L.T., et al. (2020). Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 38, 737–746. 10.1038/s41587-020-0465-8.
69.
Messiaen, J., Jacobs, S.A., and De Smet, F. (2023). The tumor micro-environment in pediatric glioma: friend or foe? Front. Immunol. 14, 1227126. 10.3389/fimmu.2023.1227126.
70.
Tripathy, D.K., Panda, L.P., Biswal, S., and Barhwal, K. (2024). Insights into the glioblastoma tumor microenvironment: current and emerging therapeutic approaches. Front. Pharmacol. 15, 1355242. 10.3389/fphar.2024.1355242.
71.
Kim, N., Kang, H., Jo, A., Yoo, S.-A., and Lee, H.-O. (2023). Perspectives on single-nucleus RNA sequencing in different cell types and tissues. JPTM 57, 52–59. 10.4132/jptm.2022.12.19.
72.
Truong, D.D., Lamhamedi-Cherradi, S.-E., Porter, R.W., Krishnan, S., Swaminathan, J., Gibson, A., Lazar, A.J., Livingston, J.A., Gopalakrishnan, V., Gordon, N., et al. (2023). Dissociation protocols used for sarcoma tissues bias the transcriptome observed in single-cell and single-nucleus RNA sequencing. BMC Cancer 23, 488. 10.1186/s12885-023-10977-1.
73.
Hawkins, A., Spielman, S.J., Shapiro, J., Prasad, D., and Taroni, J. (2026). AlexsLemonade/scpca-nf. Zenodo. 10.5281/zenodo.20072636.
74.
Gottlieb, A., Mejia, D., Ichihara, N., Yakovets, A., Wheeler, K., Taroni, J., Hawkins, A., Prasad, D.V., and Shapiro, J. (2026). AlexsLemonade/scpca-portal. Zenodo. 10.5281/zenodo.20058961.
75.
Shapiro, J., Hawkins, A., Wheeler, K., Spielman, S.J., and Yakovets, A. (2026). AlexsLemonade/alsf-scpca. Zenodo. 10.5281/zenodo.20044281.
76.
Hawkins, A., Spielman, S.J., Shapiro, J., Bethell, C., and Taroni, J. (2026). AlexsLemonade/sc-data-integration. Zenodo. 10.5281/zenodo.20044314.
77.
Shapiro, J., Hawkins, A., Spielman, S.J., Jackson, R., and Taroni, J. (2026). AlexsLemonade/OpenScPCA-nf. Zenodo. 10.5281/zenodo.20056054.
78.
Shapiro, J. (2026). AlexsLemonade/ScPCAr. Zenodo. 10.5281/zenodo.20044462.
79.
Spielman, S.J., Hawkins, A., Shapiro, J., Prasad, D.V., and Taroni, J. (2026). AlexsLemonade/scpca-paper-figures. Zenodo. 10.5281/zenodo.20123383.
80.
Chan Zuckerberg Initiative. CELLxGENE. https://cellxgene.cziscience.com.
81.
Chan Zuckerberg Initiative. CZI Single Cell Curation Schema 3.0.0. https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md.
82.
Srivastava, A., Malik, L., Sarkar, H., Zakeri, M., Almodaresi, F., Soneson, C., Love, M.I., Kingsford, C., and Patro, R. (2020). Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 21, 239. 10.1186/s13059-020-02151-8.
83.
Germain, P.-L., Lun, A., Garcia Meixide, C., Macnair, W., and Robinson, M.D. (2022). Doublet identification in single-cell sequencing data using scDblFinder. F1000Res 10, 979. 10.12688/f1000research.73600.2.
84.
Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., and Gingeras, T.R. (2012). STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21. 10.1093/bioinformatics/bts635.
85.
Kaminow, B., Yunusov, D., and Dobin, A. (2021). STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data. Preprint at bioRxiv. 10.1101/2021.05.05.442755.
86.
Danecek, P., Bonfield, J.K., Liddle, J., Marshall, J., Ohan, V., Pollard, M.O., Whitwham, A., Keane, T., McCarthy, S.A., Davies, R.M., et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience 10, giab008. 10.1093/gigascience/giab008.
87.
Huang, X., and Huang, Y. (2021). Cellsnp-lite: an efficient tool for genotyping single cells. Bioinformatics 37, 4569–4571. 10.1093/bioinformatics/btab358.
88.
Martens, J.H.A., and Stunnenberg, H.G. (2013). BLUEPRINT: mapping human blood cell epigenomes. Haematologica 98, 1487–1489. 10.3324/haematol.2013.094243.
89.
The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. 10.1038/nature11247.
90.
Lun, A.T.L. SingleR Book: Annotation Diagnostics Based on the Deltas Across Cells. https://bioconductor.org/books/release/SingleRBook/annotation-diagnostics.html#based-on-the-deltas-across-cells.
91.
Franzén, O., Gan, L.-M., and Björkegren, J.L.M. (2019). PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046. 10.1093/database/baz046.
92.
Heimberg, G., Kuo, T., Diamant, N., Salem, O., Bravo, H.C., and Vander Heiden, J. (2024). SCimilarity Models. Zenodo. 10.5281/zenodo.10685499.
93.
Tan, S.Z.K., Puig-Barbe, A., Goutte-Gattat, D., Eastwood, C., Aevermann, B., Avola, A., Balhoff, J.P., Bayindir, I.U., Belfiore, J., Caron, A.R., et al. (2026). The Cell Ontology in the age of single-cell omics. Sci Data. 10.1038/s41597-026-07173-8.
94.
Carey, V. (2017). ontoProc. Bioconductor. 10.18129/b9.bioc.ontoproc.
95.
Hu, C., Li, T., Xu, Y., Zhang, X., Li, F., Bai, J., Chen, J., Jiang, W., Yang, K., Ou, Q., et al. (2022). CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Research 51, D870–D876. 10.1093/nar/gkac947.
96.
Alex's Lemonade Stand Foundation. OpenScPCA-analysis GitHub Repository. https://github.com/AlexsLemonade/OpenScPCA-analysis.
97.
Alex's Lemonade Stand Foundation. OpenScPCA-nf GitHub Repository. https://github.com/AlexsLemonade/OpenScPCA-nf.
98.
Bonine, N., Zanzani, V., Van Hemelryk, A., Vanneste, B., Zwicker, C., Thoné, T., Roelandt, S., Bekaert, S.-L., Koster, J., Janoueix-Lerosey, I., et al. (2024). NBAtlas: A harmonized single-cell transcriptomic reference atlas of human neuroblastoma tumors. Cell Reports 43, 114804. 10.1016/j.celrep.2024.114804.
99.
Lotfollahi, M., Naghipourfar, M., Luecken, M.D., Khajavi, M., Büttner, M., Wagenstetter, M., Avsec, Ž., Gayoso, A., Yosef, N., Interlandi, M., et al. (2021). Mapping single-cell data to reference atlases by transfer learning. Nat Biotechnol 40, 121–130. 10.1038/s41587-021-01001-7.
100.
Aibar, S. (2017). AUCell. Bioconductor. 10.18129/b9.bioc.aucell.
101.
Aynaud, M.-M., Mirabeau, O., Gruel, N., Grossetête, S., Boeva, V., Durand, S., Surdez, D., Saulnier, O., Zaïdi, S., Gribkova, S., et al. (2020). Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Reports 30, 1767–1779.e6. 10.1016/j.celrep.2020.01.049.
102.
Staege, M.S., Hutter, C., Neumann, I., Foja, S., Hattenhorst, U.E., Hansen, G., Afar, D., and Burdach, S.E.G. (2004). DNA Microarrays Reveal Relationship of Ewing Family Tumors to Both Endothelial and Fetal Neural Crest-Derived Cells and Define Novel Targets. Cancer Research 64, 8213–8221. 10.1158/0008-5472.can-03-4059.
103.
Wrenn, E.D., Apfelbaum, A.A., Rudzinski, E.R., Deng, X., Jiang, W., Sud, S., Van Noord, R.A., Newman, E.A., Garcia, N.M., Miyaki, A., et al. (2023). Cancer-Associated Fibroblast-Like Tumor Cells Remodel the Ewing Sarcoma Tumor Microenvironment. Clinical Cancer Research 29, 5140–5154. 10.1158/1078-0432.ccr-23-1111.
104.
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, Jill P., and Tamayo, P. (2015). The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems 1, 417–425. 10.1016/j.cels.2015.12.004.
105.
Haghverdi, L., Lun, A.T.L., Morgan, M.D., and Marioni, J.C. (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol 36, 421–427. 10.1038/nbt.4091.
106.
Zappia, L., and Lun, A.T.L. (2020). zellkonverter. Bioconductor. 10.18129/b9.bioc.zellkonverter.
107.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550. 10.1186/s13059-014-0550-8.
108.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1–48. 10.18637/jss.v067.i01.
109.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x.
110.
Spielman, S.J., Hawkins, A., Shapiro, J., Taroni, J., Plaschka, M., O'Malley, J., UTSouthwesternDSSR, Prasad, D.V., Chen, J., Truong, D., et al. (2026). AlexsLemonade/OpenScPCA-analysis. Zenodo. 10.5281/zenodo.18459136.
111.
McCarthy, D.J., Campbell, K.R., Lun, A.T.L., and Wills, Q.F. (2017). Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186. 10.1093/bioinformatics/btw777.
112.
Lun, A.T.L., McCarthy, D.J., and Marioni, J.C. (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res 5, 2122. 10.12688/f1000research.9501.2.
113.
Hao, Y., Stuart, T., Kowalski, M.H., Choudhary, S., Hoffman, P., Hartman, A., Srivastava, A., Molla, G., Madad, S., Fernandez-Granda, C., et al. (2023). Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42, 293–304. 10.1038/s41587-023-01767-y.
114.
Huang, Y., McCarthy, D.J., and Stegle, O. (2019). Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference. Genome Biol 20, 273. 10.1186/s13059-019-1865-2.
115.
Chan Zuckerberg Initiative. Cellxgene Data Portal. https://cellxgene.cziscience.com/.
116.
Lun, A.T.L., and Kancherla, J. (2023). Powering single-cell analyses in the browser with WebAssembly. JOSS 8, 5603. 10.21105/joss.05603.
117.
Kancherla, J., and Lun, A.T.L. kana. https://www.kanaverse.org/kana/.
118.
Li, H., Zhou, J., Li, Z., Chen, S., Liao, X., Zhang, B., Zhang, R., Wang, Y., Sun, S., and Gao, X. (2023). A comprehensive benchmarking with practical guidelines for cellular deconvolution of spatial transcriptomics. Nat Commun 14, 1548. 10.1038/s41467-023-37168-7.
119.
Avila Cobos, F., Vandesompele, J., Mestdagh, P., and De Preter, K. (2018). Computational deconvolution of transcriptomics data from mixed cell populations. Bioinformatics 34, 1969–1979. 10.1093/bioinformatics/bty019.
120.
Cobos, F.A., Panah, M.J.N., Epps, J., Long, X., Man, T.-K., Chiu, H.-S., Chomsky, E., Kiner, E., Krueger, M.J., di Bernardo, D., et al. (2023). Effective methods for bulk RNA-seq deconvolution using scnRNA-seq transcriptomes. Genome Biol 24, 177. 10.1186/s13059-023-03016-6.
121.
Lilly, J.V., Rokita, J.L., Mason, J.L., Patton, T., Stefankiewiz, S., Higgins, D., Trooskin, G., Larouci, C.A., Arya, K., Appert, E., et al. (2023). The children's brain tumor network (CBTN) - Accelerating research in pediatric central nervous system tumors through collaboration and open science. Neoplasia 35, 100846. 10.1016/j.neo.2022.100846.
122.
Shapiro, J.A., Gaonkar, K.S., Spielman, S.J., Savonen, C.L., Bethell, C.J., Jin, R., Rathi, K.S., Zhu, Y., Egolf, L.E., Farrow, B.K., et al. (2023). OpenPBTA: The Open Pediatric Brain Tumor Atlas. Cell Genomics 3, 100340. 10.1016/j.xgen.2023.100340.
123.
Patel, A.G., Ashenberg, O., Collins, N.B., Segerstolpe, Å., Jiang, S., Slyper, M., Huang, X., Caraccio, C., Jin, H., Sheppard, H., et al. (2024). A spatial cell atlas of neuroblastoma reveals developmental, epigenetic and spatial axis of tumor heterogeneity. Preprint at bioRxiv. 10.1101/2024.01.07.574538.
124.
Chapple, R.H., Liu, X., Natarajan, S., Alexander, M.I.M., Kim, Y., Patel, A.G., LaFlamme, C.W., Pan, M., Wright, W.C., Lee, H.-M., et al. (2024). An integrated single-cell RNA-seq map of human neuroblastoma tumors and preclinical models uncovers divergent mesenchymal-like gene expression programs. Genome Biol 25, 161. 10.1186/s13059-024-03309-4.