Before installing smartSim, ensure that the appropriate Conda environment is activated. This environment is provided at: https://github.com/MarchalLab/smartSim.
To install smartSim from GitHub, follow the instructions below:
library(devtools)
options(timeout=400) #example datasets can take a while to download
install_github("MarchalLab/smartSim", dependencies = TRUE, repos = BiocManager::repositories())
Alternatively, you may install SmartSim from a local copy using:
ssPath <- "path/to/smartSim" #use your path to smartSim
install.packages(ssPath, repos = NULL, type="source")
Once installed, load SmartSim along with the remaining required packages:
library(smartSim)
library(R.utils)
## Warning: package 'R.oo' was built under R version 4.4.3
We begin by creating the main output directory, followed by subdirectories to organize the generated data:
# Define and create the main output directory
outDir <- "example_data_out"
dir.create(outDir, recursive = TRUE)
## Warning in dir.create(outDir, recursive = TRUE): 'example_data_out' already
## exists
# Define subdirectories
dataCharDir <- file.path(outDir, "dataChar")
trueTranscriptDir <- file.path(outDir, "trueTranscripts")
readDir <- file.path(outDir, "reads")
We begin by loading the required aligned bam files. For this tutorial, we use an example dataset derived from 192 HEK293FT cells that underwent Smart-seq3 library preparation, as described in Hagemann-Jensen et al. [1] (accession number: E-MTAB-8735).
We downloaded this data, and used zUMIs [2] to align the reads to the human reference genome and determine the cell barcode and UMI for each read. We provide the script used at: https://github.com/MarchalLab/smartSim/scripts/getData.sh
For demonstration purposes, only the reads aligning to chromosome 1 have been selected.
# Locate the example bam files provided with the smartSim package
bamDir <- system.file("extdata","aligned", package = "smartSim")
Next, we load the reference annotation files required for simulation. Ensure that the GTF and FASTA files match those used during the alignment step when providing your own bam files.
# Define paths for uncompressed reference files
gtfFile <- file.path(outDir, "chr1.gtf")
fastaFile <- file.path(outDir, "chr1.fa")
# Locate and uncompress the reference files provided with the smartSim package
if (! file.exists(gtfFile)){
gunzip(system.file("extdata","chr1.gtf.gz", package = "smartSim"), destname = gtfFile, remove = FALSE)
}
if (! file.exists(fastaFile)){
gunzip(system.file("extdata","chr1.fa.gz", package = "smartSim"), destname = fastaFile, remove = FALSE)
}
In this tutorial, we simulate two groups of single cells, each containing 10 cells. Every cell expresses 10 distinct genes, with each gene producing two (already annotated) transcript isoforms.
To mimic the characteristics of Smart-seq3 data from Hagemann-Jensen et al., we use:
In this tutorial, we will generate 2 groups of 10 cells each. Each cell expresses 10 different genes with each 2 (already annotated) transcripts. Similarly to the Smart-seq3 data from Hagemann-Jensen, we use a cell barcode with length 8, a UMI with length 8 and paired end reads with length x \(\times\) 150bp.
# Simulation design parameters
nClust = 2 #number of cell populations
nCellsPerClust = 10 #number of cells per cell population
nGenes = 10 #number of genes to simulate
nTx = 2 #number of transcripts per gene
# Smart-seq3 and sequencing configuration
barcodeLen = 8 #length of cell barcode
UMIlen = 8 #length of UMI
readLen = 150 #length of 1 end of PE read
tag = "ATTGCGCAATG" #Smart-seq3 tag used to identify UMI containing reads
# Set seed for reproducibility
set.seed(1234)
To simulate data that closely mimics the characteristics of the existing Smart-seq3 dataset, we first perform a data characterization step. This process extracts empirical distributions and summary statistics from the aligned reads, which are later used to guide simulation.
The following features are extracted:
fragmentLength_nonUMI.tsv
);fragmentLength_UMI.tsv
);UMIcounts.tsv
);intronicReads.tsv
);readQual
.tsv`);bcQual.tsv
).Note: For demonstration purposes, we limit the analysis to the first
10 000 reads per bam file using numReads = 10000
. By
default, all reads are processed (numReads = NULL
).
Additionally, we assume that cell barcodes are stored under the “BC” tag and UMIs are stored under the “UB” tag in the bam files. These tags are consistent with the output from zUMIs [2]. If you used a different tool for alignment and demultiplexing, ensure that the barcode and UMI tags follow the same convention.
# Characterize the Smart-seq3 dataset
dataChar <- characterizeData(bam = bamDir,
gtf = gtfFile,
fasta = fastaFile,
out = dataCharDir,
nCPU = 4,
numReads = 10000,
seed = 1234,
stage = "geneFilt",
phred = 33)
This step typically takes around 20 minutes, but the exact time may vary depending on your system’s performance, data size, and I/O speed. Runtime may vary across different machines.
To save time, you can skip the data characterization step by loading the precomputed characteristics that are provided with the smartSim package. This allows you to proceed directly to the simulation step without rerunning the resource-intensive characterization process.
# Load precomputed data characteristics
dataChar <- DataCharacteristics(inDir = system.file("extdata","dataChar", package = "smartSim"))
If your aligned dataset is not demultiplexed in cell specific bam files, you can instead provide a single bam file directly to the characterizeData() function.
As before, ensure that:
# Characterize data from a single bam file
dataChar <- characterizeData(bam = "path/to/your/bamFile.bam",
gtf = gtfFile,
fasta = fastaFile,
out = dataCharDir,
nCPU = 4,
numReads = NULL,
seed = 1234,
stage = "geneFilt",
phred = 33)
To generate synthetic data, we begin by selecting a subset of annotated transcripts from the reference GTF file. In this example, we randomly select 2 transcripts per gene for 10 randomly chosen genes.
# Select annotated transcripts for simulation
gtf_simul <- getTranscripts(gtf = gtfFile,
nTx = nTx,
out = trueTranscriptDir,
nRandomGenes = nGenes,
minTxLen = 1000,
seed = 1234)
# Preview selected transcripts
head(gtf_simul)
## [1] "Selecting transcripts for simulation"
## seqnames start end width strand source type score phase
## 64622 chr1 36455718 36464384 8667 - HAVANA gene NA NA
## 64644 chr1 36455772 36459703 3932 - HAVANA transcript NA NA
## 64645 chr1 36457923 36459703 1781 - HAVANA exon NA NA
## 64646 chr1 36456187 36456378 192 - HAVANA exon NA NA
## 64647 chr1 36455772 36455925 154 - HAVANA exon NA NA
## 64651 chr1 36455774 36464345 8572 - HAVANA transcript NA NA
## gene_id gene_type gene_name level hgnc_id
## 64622 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## 64644 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## 64645 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## 64646 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## 64647 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## 64651 ENSG00000116898.12 protein_coding MRPS15 2 HGNC:14504
## havana_gene transcript_id transcript_type
## 64622 OTTHUMG00000008042.5 <NA> <NA>
## 64644 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64645 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64646 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64647 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64651 OTTHUMG00000008042.5 ENST00000462067.1 processed_transcript
## transcript_name transcript_support_level tag havana_transcript
## 64622 <NA> <NA> <NA> <NA>
## 64644 MRPS15-204 2 <NA> OTTHUMT00000022054.2
## 64645 MRPS15-204 2 <NA> OTTHUMT00000022054.2
## 64646 MRPS15-204 2 <NA> OTTHUMT00000022054.2
## 64647 MRPS15-204 2 <NA> OTTHUMT00000022054.2
## 64651 MRPS15-202 2 <NA> OTTHUMT00000022053.2
## exon_number exon_id ont protein_id ccdsid
## 64622 <NA> <NA> <NA> <NA> <NA>
## 64644 <NA> <NA> <NA> <NA> <NA>
## 64645 1 ENSE00001810196.1 <NA> <NA> <NA>
## 64646 2 ENSE00003613150.1 <NA> <NA> <NA>
## 64647 3 ENSE00001880390.1 <NA> <NA> <NA>
## 64651 <NA> <NA> <NA> <NA> <NA>
This step generates a table containing all gene, transcript and exon information for the selected transcripts.
If you have a predefined list of gene IDs, you can pass them directly
to getTranscripts()
instead of selecting genes at
random:
# Example: using 10 predefined gene IDs
genes <- c("ENSG00000116898.12", "ENSG00000162437.14", "ENSG00000223883.2", "ENSG00000009307.16", "ENSG00000177551.5",
"ENSG00000281741.2", "ENSG00000159208.16", "ENSG00000143418.20", "ENSG00000183856.11", "ENSG00000188585.9")
gtf_simul <- getTranscripts(gtf = gtfFile,
nTx = nTx,
out = trueTranscriptDir,
genes = genes,
minTxLen = 1000,
seed = 1234)
# Preview selected transcripts
head(gtf_simul)
Ensure that each gene in your list has at least nTx
annotated transcripts available in the reference GTF file. Genes that do
not meet this requirement will be ignored by
getTranscripts()
.
If you want to simulate different numbers of transcripts per gene, you can run getTranscripts() separately for each different groups of genes, and then concatenate the resulting GTF tables. This approach allows full control over transcript complexity per gene, enabling more flexible simulation scenarios.
The example below simulates:
# Define gene groups
genes1 <- c("ENSG00000116898.12", "ENSG00000162437.14", "ENSG00000177551.5", "ENSG00000281741.2", "ENSG00000183856.11")
genes2 <- c("ENSG00000223883.2", "ENSG00000009307.16", "ENSG00000159208.16", "ENSG00000143418.20", "ENSG00000188585.9")
# Select 2 transcripts per gene
gtf_simul1 <- getTranscripts(gtf = gtfFile,
nTx = 2,
out = trueTranscriptDir,
genes = genes1,
minTxLen = 1000,
seed = 1234)
# Select 3 transcripts per gene
gtf_simul2 <- getTranscripts(gtf = gtfFile,
nTx = 3,
out = trueTranscriptDir,
genes = genes2,
minTxLen = 1000,
seed = 1234)
# Combine into a single table
gtf_simul <- rbind(gtf_simul1, gtf_simul2)
# Preview selected transcripts
head(gtf_simul)
You can also extract transcript annotations from a predefined list of transcript IDs. This allows full control over which transcripts are included in the simulation. In the example below, we retrieve annotation data for 10 transcripts from 4 different genes.
# Predefined list of transcript IDs
transcripts <- c("ENST00000488606.5", "ENST00000462067.1",
"ENST00000371072.8", "ENST00000418058.1",
"ENST00000482825.7", "ENST00000345896.8", "ENST00000271688.10",
"ENST00000646925.1", "ENST00000623247.2", "ENST00000445098.2")
# Extract corresponding GTF annotation
gtf_simul <- getTranscripts(gtf = gtfFile,
out = trueTranscriptDir,
transcripts = transcripts,
minTxLen = 1000,
seed = 1234)
head(gtf_simul)
To simulate alternative splicing and transcriptome complexity, you can introduce novel splicing events to selected genes. Below, we demonstrate how to add both novel junctions and novel exons.
We introduce one new junction in each of the first two genes from the current annotation:
# Select first two genes for novel junctions
genes_nj <- unique(gtf_simul$gene_id)[1:2]
# Add new junctions
gtf_simul <- addNewJunctions(annot_gtf = gtf_simul,
gene_ids = genes_nj,
seed = 1234)
#inspect transcript IDs
print(unique(gtf_simul$transcript_id))
## [1] "adding de novo junctions"
## [1] NA "ENST00000488606.5"
## [3] "ENST00000462067.1" "ENSG00000116898.12newJunction"
## [5] "ENST00000371072.8" "ENST00000418058.1"
## [7] "ENSG00000162437.14newJunction" "ENST00000667299.1"
## [9] "ENST00000667473.1" "ENST00000610726.4"
## [11] "ENST00000369530.5" "ENST00000320238.3"
## [13] "ENST00000369506.1" "ENST00000626088.1"
## [15] "ENST00000640686.1" "ENST00000369094.5"
## [17] "ENST00000469255.5" "ENST00000368954.10"
## [19] "ENST00000368949.8" "ENST00000361170.7"
## [21] "ENST00000491900.1" "ENST00000641292.1"
## [23] "ENST00000445098.2"
Each new transcript containing a new junction is assigned a name
following the pattern: geneIDnewJunction
.
We add one new exon to each of the next two genes. By default, the new exon will be selected such that the splice sites are canonical, i.e. a GT donor site and an AG acceptor site:
# Select next two genes for novel exons
genes_ne <- unique(gtf_simul$gene_id)[3:4]
# Add new exons
gtf_simul <- addNewExons(annot_gtf = gtf_simul,
fasta = fastaFile,
gene_ids = genes_ne,
canonical = TRUE,
seed = 1234)
#inspect transcriptIDs
print(unique(gtf_simul$transcript_id))
## [1] "adding de novo exons"
## [1] NA "ENST00000488606.5"
## [3] "ENST00000462067.1" "ENSG00000116898.12newJunction"
## [5] "ENST00000371072.8" "ENST00000418058.1"
## [7] "ENSG00000162437.14newJunction" "ENST00000667299.1"
## [9] "ENST00000667473.1" "ENSG00000223883.2newExon"
## [11] "ENST00000610726.4" "ENST00000369530.5"
## [13] "ENSG00000009307.16newExon" "ENST00000320238.3"
## [15] "ENST00000369506.1" "ENST00000626088.1"
## [17] "ENST00000640686.1" "ENST00000369094.5"
## [19] "ENST00000469255.5" "ENST00000368954.10"
## [21] "ENST00000368949.8" "ENST00000361170.7"
## [23] "ENST00000491900.1" "ENST00000641292.1"
## [25] "ENST00000445098.2"
Each new transcript containing a new exon is assigned a name
following the pattern: geneIDnewExon
.
Finally, we save the updated GTF and corresponding transcript sequences to file. These will serve as the ground truth for simulation:
# Export ground truth transcript annotation and sequences
transcript_seq = writeTrueTranscripts(gtf = gtf_simul,
fasta = fastaFile,
out = trueTranscriptDir)
## [1] "writing true transcripts to file"
For each cluster of cells, we generate cell-specific gene expression
profiles using getRandomGeneExpression()
. This approach
samples expression values from real Smart-seq3 data to closely mimic
biological variability and read coverage patterns observed in
practice.
Since we are simulating two clusters, each cluster receives a
distinct gene expression profile that is shared across all cells in the
cluster. To introduce cell-specific variability in sequencing depth,
each expression profile is scaled by a randomly sampled cell-specific
cellSize
factor.
geneExpression <- getRandomGeneExpression(nClusters = nClust,
nCellsPerCluster = nCellsPerClust,
genes = unique(gtf_simul$gene_id),
dataChar = dataChar,
minCount = nTx,
seed = 1234)
print(head(geneExpression))
## [1] "getting random gene expression"
## cluster cell gene_id geneExpression
## 1 1 cluster1_cell1 ENSG00000116898.12 1.2028442
## 2 1 cluster1_cell2 ENSG00000116898.12 1.1573085
## 3 1 cluster1_cell3 ENSG00000116898.12 0.7771064
## 4 1 cluster1_cell4 ENSG00000116898.12 0.5586372
## 5 1 cluster1_cell5 ENSG00000116898.12 0.9171300
## 6 1 cluster1_cell6 ENSG00000116898.12 4.7318168
To simulate alternative splicing, we generate cluster-specific
Percent Spliced In (PSI) values for each transcript using
getRandomClusterPSI()
. PSI values represent the relative
abundance of each transcript within a gene and are used to simulate
transcript-level expression profiles.
In this example:
fixedGenes
) will have the
same PSI values across all clusters, representing genes without
differential spicing.diffGenes
) will have distinct PSI
values per cluster, mimicking differential splicing between cell
populations.# Split genes into those with and without differential splicing
geneIDs <- unique(gtf_simul$gene_id)
fixedGenes <- geneIDs[1:(length(geneIDs)/2)]
diffGenes <- geneIDs[((length(geneIDs)/2)+1):length(geneIDs)]
# Generate cluster-specific PSI values
clusterPSI <- getRandomClusterPSI(gtf = gtf_simul,
nClusters = nClust,
fixedGenes = fixedGenes,
diffGenes = diffGenes,
seed = 1234)
print(clusterPSI)
## # A tibble: 48 × 4
## # Groups: gene_id [10]
## cluster gene_id transcript_id PSI
## <chr> <chr> <chr> <dbl>
## 1 1 ENSG00000116898.12 ENST00000488606.5 0.0845
## 2 1 ENSG00000116898.12 ENST00000462067.1 0.463
## 3 1 ENSG00000116898.12 ENSG00000116898.12newJunction 0.453
## 4 1 ENSG00000162437.14 ENST00000371072.8 0.293
## 5 1 ENSG00000162437.14 ENST00000418058.1 0.405
## 6 1 ENSG00000162437.14 ENSG00000162437.14newJunction 0.301
## 7 1 ENSG00000223883.2 ENST00000667299.1 0.0105
## 8 1 ENSG00000223883.2 ENST00000667473.1 0.256
## 9 1 ENSG00000223883.2 ENSG00000223883.2newExon 0.733
## 10 1 ENSG00000009307.16 ENST00000610726.4 0.293
## # ℹ 38 more rows
You may also provide a custom PSI table, as long as it includes the following columns: “cluster”, “gene_id”, “transcript_id” and “PSI”. Ensure that for each gene in each cluster, the PSI values across transcripts sum to 1.
Here, we provide an example in which the transcripts are equally abundant in all genes.
library(dplyr, quietly = TRUE)
#PSI uniformly distributed across all transcripts of a gene
clusterPSI <- merge(unique(geneExpression[, c("cluster", "gene_id")]),
gtf_simul[gtf_simul$type == "transcript", c("gene_id", "transcript_id")],
by = "gene_id")
clusterPSI$PSI <- 1
#ensure PSI values of each gene sum to one per cluster
clusterPSI <- clusterPSI %>%
group_by(gene_id, cluster) %>%
mutate(PSI = PSI / sum(PSI))
print(clusterPSI)
To emulate realistic transcriptome complexity, we get the fraction of unspliced transcripts per gene per cluster. These values are derived from the provided example dataset to reflect patterns observed in real Smart-seq3 data.
In this example we will simulate reads such that:
fixedGenes
) have a shared
unspliced fraction across clusters, representing no cluster-specific
rate of RNA transcription and degradation.diffGenes
) can exhibit
differential unspliced proportions, allowing for variation in RNA
processing between clusters.# Define gene sets for fixed vs. differential unspliced ratios
geneIDs <- unique(gtf_simul$gene_id)
fixedGenes <- geneIDs[1:(length(geneIDs)/2)]
diffGenes <- geneIDs[((length(geneIDs)/2)+1):length(geneIDs)]
# Generate unspliced transcript fractions per gene and cluster
unspliced <- getRandomUnspliced(nClusters = nClust,
fixedGenes = fixedGenes,
diffGenes = diffGenes,
dataChar = dataChar,
seed = 1234)
print(unspliced)
## [1] "getting random fractions of unspliced transcript"
## cluster gene_id nUnspliced
## 1 1 ENSG00000116898.12 0.00
## 2 1 ENSG00000162437.14 0.00
## 3 1 ENSG00000223883.2 0.00
## 4 1 ENSG00000009307.16 0.00
## 5 1 ENSG00000177551.5 0.70
## 6 2 ENSG00000116898.12 0.00
## 7 2 ENSG00000162437.14 0.00
## 8 2 ENSG00000223883.2 0.00
## 9 2 ENSG00000009307.16 0.00
## 10 2 ENSG00000177551.5 0.70
## 11 1 ENSG00000281741.2 0.94
## 12 1 ENSG00000159208.16 0.00
## 13 1 ENSG00000143418.20 0.00
## 14 1 ENSG00000183856.11 0.93
## 15 1 ENSG00000188585.9 0.00
## 16 2 ENSG00000281741.2 0.95
## 17 2 ENSG00000159208.16 0.00
## 18 2 ENSG00000143418.20 0.00
## 19 2 ENSG00000183856.11 0.50
## 20 2 ENSG00000188585.9 0.00
Since the example dataset is derived from single-cell Smart-seq3 (not
single-nucleus RNA-seq), the majority of genes contain low or no
unspliced RNA. As a result, many nUnspliced
values are
equal to 0.
You can also provide a custom table of unspliced transcript fractions. This table must include the following columns: “cluster”, “gene_id” and “nUnspliced”.
Here, we provide an example in which there are no unspliced transcripts.
unspliced <- unique(geneExpression[, c("cluster", "gene_id")])
unspliced$nUnspliced <- 0
print(unspliced)
Finally, we combine the cluster-specific PSI values and fractions of unspliced transcripts with the cell-specific gene expression to get the expression for each spliced and unspliced transcript.
transcriptExpression <- getTranscriptExpression(clusterPSI = clusterPSI,
unspliced = unspliced,
geneExpression = geneExpression,
outFile = paste0(trueTranscriptDir, "/transcriptExpression.tsv"))
print(head(transcriptExpression))
## gene_id cluster transcript_id PSI cell
## 1 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell10
## 2 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell2
## 3 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell3
## 4 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell4
## 5 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell1
## 6 ENSG00000009307.16 1 ENST00000610726.4 0.2933855 cluster1_cell6
## geneExpression transcriptExpression
## 1 9.366274 3
## 2 19.095590 6
## 3 12.822255 4
## 4 9.217515 3
## 5 19.846930 6
## 6 78.074977 23
With all required inputs prepared, we now proceed to simulate the raw sequencing reads for all cells. We generate reads that mimic real Smart-seq3 data by combining transcript expression levels, sequencing characteristics, and error profiles.
simulateReads(transcriptExpression = transcriptExpression,
transcript_seq = transcript_seq,
dataChar = dataChar,
out = readDir,
barcodeLen = barcodeLen,
UMIlen = UMIlen,
readLen = readLen,
tag = tag,
PAF = 2,
errorRate = 0.005,
seed = 1234)
## [1] "simulating reads"
## [1] "cluster1_cell10"
## [1] "cluster1_cell2"
## [1] "cluster1_cell3"
## [1] "cluster1_cell4"
## [1] "cluster1_cell1"
## [1] "cluster1_cell6"
## [1] "cluster1_cell7"
## [1] "cluster1_cell8"
## [1] "cluster1_cell5"
## [1] "cluster1_cell9"
## [1] "cluster2_cell10"
## [1] "cluster2_cell2"
## [1] "cluster2_cell3"
## [1] "cluster2_cell4"
## [1] "cluster2_cell1"
## [1] "cluster2_cell6"
## [1] "cluster2_cell7"
## [1] "cluster2_cell8"
## [1] "cluster2_cell5"
## [1] "cluster2_cell9"
In the readDir
directory, smartSim will write out :
unassigned_I1.fq.gz
,
unassigned_I2.fq.gz
, unassigned_R1.fq.gz
and
unassinged_R2.fq.gz
);barcodes.txt
);cell2barcode.tsv
).