Installation

Before installing smartSim, ensure that the appropriate Conda environment is activated. This environment is provided at: https://github.com/MarchalLab/smartSim.

To install smartSim from GitHub, follow the instructions below:

library(devtools)
options(timeout=400) #example datasets can take a while to download
install_github("MarchalLab/smartSim", dependencies = TRUE, repos = BiocManager::repositories())

Alternatively, you may install SmartSim from a local copy using:

ssPath <- "path/to/smartSim" #use your path to smartSim
install.packages(ssPath, repos = NULL, type="source")

Once installed, load SmartSim along with the remaining required packages:

library(smartSim)
library(R.utils)

## Warning: package 'R.oo' was built under R version 4.4.3

Output directories

We begin by creating the main output directory, followed by subdirectories to organize the generated data:

# Define and create the main output directory
outDir <- "example_data_out"
dir.create(outDir, recursive = TRUE)

## Warning in dir.create(outDir, recursive = TRUE): 'example_data_out' already
## exists

# Define subdirectories
dataCharDir <- file.path(outDir, "dataChar")
trueTranscriptDir <- file.path(outDir, "trueTranscripts")
readDir <- file.path(outDir, "reads")

Input files

We begin by loading the required aligned bam files. For this tutorial, we use an example dataset derived from 192 HEK293FT cells that underwent Smart-seq3 library preparation, as described in Hagemann-Jensen et al. [1] (accession number: E-MTAB-8735).

We downloaded this data, and used zUMIs [2] to align the reads to the human reference genome and determine the cell barcode and UMI for each read. We provide the script used at: https://github.com/MarchalLab/smartSim/scripts/getData.sh

For demonstration purposes, only the reads aligning to chromosome 1 have been selected.

# Locate the example bam files provided with the smartSim package
bamDir <- system.file("extdata","aligned", package = "smartSim")

Next, we load the reference annotation files required for simulation. Ensure that the GTF and FASTA files match those used during the alignment step when providing your own bam files.

# Define paths for uncompressed reference files
gtfFile <- file.path(outDir, "chr1.gtf")
fastaFile <- file.path(outDir, "chr1.fa")

# Locate and uncompress the reference files provided with the smartSim package
if (! file.exists(gtfFile)){
  gunzip(system.file("extdata","chr1.gtf.gz", package = "smartSim"), destname = gtfFile, remove = FALSE)
}
if (! file.exists(fastaFile)){
  gunzip(system.file("extdata","chr1.fa.gz", package = "smartSim"), destname = fastaFile, remove = FALSE)
}

Input parameters

In this tutorial, we simulate two groups of single cells, each containing 10 cells. Every cell expresses 10 distinct genes, with each gene producing two (already annotated) transcript isoforms.

To mimic the characteristics of Smart-seq3 data from Hagemann-Jensen et al., we use:

Cell barcodes of length 8
Unique Molecular Identifiers (UMIs) of length 8
Paired-end reads with lengths of 150 bp each (i.e., 2 \(\times\) 150 bp)

In this tutorial, we will generate 2 groups of 10 cells each. Each cell expresses 10 different genes with each 2 (already annotated) transcripts. Similarly to the Smart-seq3 data from Hagemann-Jensen, we use a cell barcode with length 8, a UMI with length 8 and paired end reads with length x \(\times\) 150bp.

# Simulation design parameters
nClust = 2            #number of cell populations
nCellsPerClust = 10   #number of cells per cell population
nGenes = 10           #number of genes to simulate
nTx = 2               #number of transcripts per gene

# Smart-seq3 and sequencing configuration
barcodeLen = 8        #length of cell barcode
UMIlen = 8            #length of UMI
readLen = 150         #length of 1 end of PE read
tag = "ATTGCGCAATG"   #Smart-seq3 tag used to identify UMI containing reads

# Set seed for reproducibility
set.seed(1234)

Data characterization

To simulate data that closely mimics the characteristics of the existing Smart-seq3 dataset, we first perform a data characterization step. This process extracts empirical distributions and summary statistics from the aligned reads, which are later used to guide simulation.

The following features are extracted:

the distribution of fragment lengths for UMI containing reads (fragmentLength_nonUMI.tsv);
the distribution of fragment lengths for internal reads (fragmentLength_UMI.tsv);
the number of reads per UMI (including counts for internal reads) per gene per cell (UMIcounts.tsv);
the number of intronic vs exonic bases covered per gene per cell, as a proxy for the unspliced vs. spliced transcripts (intronicReads.tsv);
the positional distribution of quality scores for read sequences (readQual.tsv`);
the positional distribution of quality scores for barcode sequences (bcQual.tsv).

Note: For demonstration purposes, we limit the analysis to the first 10 000 reads per bam file using numReads = 10000. By default, all reads are processed (numReads = NULL).

Additionally, we assume that cell barcodes are stored under the “BC” tag and UMIs are stored under the “UB” tag in the bam files. These tags are consistent with the output from zUMIs [2]. If you used a different tool for alignment and demultiplexing, ensure that the barcode and UMI tags follow the same convention.

# Characterize the Smart-seq3 dataset
dataChar <- characterizeData(bam = bamDir,
                             gtf = gtfFile,
                             fasta = fastaFile,
                             out = dataCharDir,
                             nCPU = 4,
                             numReads = 10000,
                             seed = 1234,
                             stage = "geneFilt",
                             phred = 33)

This step typically takes around 20 minutes, but the exact time may vary depending on your system’s performance, data size, and I/O speed. Runtime may vary across different machines.

To save time, you can skip the data characterization step by loading the precomputed characteristics that are provided with the smartSim package. This allows you to proceed directly to the simulation step without rerunning the resource-intensive characterization process.

# Load precomputed data characteristics
dataChar <- DataCharacteristics(inDir = system.file("extdata","dataChar", package = "smartSim"))

If your aligned dataset is not demultiplexed in cell specific bam files, you can instead provide a single bam file directly to the characterizeData() function.

As before, ensure that:

Cell barcode information is stored in the “BC” tag
UMI information is stored in the “UB” tag.

# Characterize data from a single bam file
dataChar <- characterizeData(bam = "path/to/your/bamFile.bam",
                             gtf = gtfFile,
                             fasta = fastaFile,
                             out = dataCharDir,
                             nCPU = 4,
                             numReads = NULL,
                             seed = 1234,
                             stage = "geneFilt",
                             phred = 33)

Preparing ground truth

Getting all transcripts to simulate

To generate synthetic data, we begin by selecting a subset of annotated transcripts from the reference GTF file. In this example, we randomly select 2 transcripts per gene for 10 randomly chosen genes.

# Select annotated transcripts for simulation
gtf_simul <- getTranscripts(gtf = gtfFile,
                            nTx = nTx,
                            out = trueTranscriptDir,
                            nRandomGenes = nGenes,
                            minTxLen = 1000,
                            seed = 1234)

# Preview selected transcripts
head(gtf_simul)

## [1] "Selecting transcripts for simulation"
##       seqnames    start      end width strand source       type score phase
## 64622     chr1 36455718 36464384  8667      - HAVANA       gene    NA    NA
## 64644     chr1 36455772 36459703  3932      - HAVANA transcript    NA    NA
## 64645     chr1 36457923 36459703  1781      - HAVANA       exon    NA    NA
## 64646     chr1 36456187 36456378   192      - HAVANA       exon    NA    NA
## 64647     chr1 36455772 36455925   154      - HAVANA       exon    NA    NA
## 64651     chr1 36455774 36464345  8572      - HAVANA transcript    NA    NA
##                  gene_id      gene_type gene_name level    hgnc_id
## 64622 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
## 64644 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
## 64645 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
## 64646 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
## 64647 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
## 64651 ENSG00000116898.12 protein_coding    MRPS15     2 HGNC:14504
##                havana_gene     transcript_id      transcript_type
## 64622 OTTHUMG00000008042.5              <NA>                 <NA>
## 64644 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64645 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64646 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64647 OTTHUMG00000008042.5 ENST00000488606.5 processed_transcript
## 64651 OTTHUMG00000008042.5 ENST00000462067.1 processed_transcript
##       transcript_name transcript_support_level  tag    havana_transcript
## 64622            <NA>                     <NA> <NA>                 <NA>
## 64644      MRPS15-204                        2 <NA> OTTHUMT00000022054.2
## 64645      MRPS15-204                        2 <NA> OTTHUMT00000022054.2
## 64646      MRPS15-204                        2 <NA> OTTHUMT00000022054.2
## 64647      MRPS15-204                        2 <NA> OTTHUMT00000022054.2
## 64651      MRPS15-202                        2 <NA> OTTHUMT00000022053.2
##       exon_number           exon_id  ont protein_id ccdsid
## 64622        <NA>              <NA> <NA>       <NA>   <NA>
## 64644        <NA>              <NA> <NA>       <NA>   <NA>
## 64645           1 ENSE00001810196.1 <NA>       <NA>   <NA>
## 64646           2 ENSE00003613150.1 <NA>       <NA>   <NA>
## 64647           3 ENSE00001880390.1 <NA>       <NA>   <NA>
## 64651        <NA>              <NA> <NA>       <NA>   <NA>

This step generates a table containing all gene, transcript and exon information for the selected transcripts.

If you have a predefined list of gene IDs, you can pass them directly to getTranscripts() instead of selecting genes at random:

# Example: using 10 predefined gene IDs
genes <- c("ENSG00000116898.12", "ENSG00000162437.14", "ENSG00000223883.2",  "ENSG00000009307.16", "ENSG00000177551.5", 
           "ENSG00000281741.2", "ENSG00000159208.16", "ENSG00000143418.20", "ENSG00000183856.11", "ENSG00000188585.9")
gtf_simul <- getTranscripts(gtf = gtfFile,
                            nTx = nTx,
                            out = trueTranscriptDir,
                            genes = genes,
                            minTxLen = 1000,
                            seed = 1234)

# Preview selected transcripts
head(gtf_simul)

Ensure that each gene in your list has at least nTx annotated transcripts available in the reference GTF file. Genes that do not meet this requirement will be ignored by getTranscripts().

If you want to simulate different numbers of transcripts per gene, you can run getTranscripts() separately for each different groups of genes, and then concatenate the resulting GTF tables. This approach allows full control over transcript complexity per gene, enabling more flexible simulation scenarios.

The example below simulates:

5 genes with 2 transcripts per gene
5 genes with 3 transcripts per gene

# Define gene groups
genes1 <- c("ENSG00000116898.12", "ENSG00000162437.14",  "ENSG00000177551.5", "ENSG00000281741.2", "ENSG00000183856.11") 
genes2 <- c("ENSG00000223883.2", "ENSG00000009307.16", "ENSG00000159208.16", "ENSG00000143418.20",  "ENSG00000188585.9")

# Select 2 transcripts per gene
gtf_simul1 <- getTranscripts(gtf = gtfFile,
                            nTx = 2,
                            out = trueTranscriptDir,
                            genes = genes1,
                            minTxLen = 1000,
                            seed = 1234)

# Select 3 transcripts per gene
gtf_simul2 <- getTranscripts(gtf = gtfFile,
                            nTx = 3,
                            out = trueTranscriptDir,
                            genes = genes2,
                            minTxLen = 1000,
                            seed = 1234)

# Combine into a single table
gtf_simul <- rbind(gtf_simul1, gtf_simul2)

# Preview selected transcripts
head(gtf_simul)

You can also extract transcript annotations from a predefined list of transcript IDs. This allows full control over which transcripts are included in the simulation. In the example below, we retrieve annotation data for 10 transcripts from 4 different genes.

# Predefined list of transcript IDs
transcripts <- c("ENST00000488606.5", "ENST00000462067.1", 
                 "ENST00000371072.8", "ENST00000418058.1", 
                 "ENST00000482825.7", "ENST00000345896.8", "ENST00000271688.10", 
                 "ENST00000646925.1", "ENST00000623247.2", "ENST00000445098.2")

# Extract corresponding GTF annotation
gtf_simul <- getTranscripts(gtf = gtfFile,
                            out = trueTranscriptDir,
                            transcripts = transcripts,
                            minTxLen = 1000,
                            seed = 1234)
head(gtf_simul)

Introducing novel splicing events

To simulate alternative splicing and transcriptome complexity, you can introduce novel splicing events to selected genes. Below, we demonstrate how to add both novel junctions and novel exons.

Add new junctions

We introduce one new junction in each of the first two genes from the current annotation:

# Select first two genes for novel junctions
genes_nj <- unique(gtf_simul$gene_id)[1:2]

# Add new junctions
gtf_simul <- addNewJunctions(annot_gtf = gtf_simul,
                             gene_ids = genes_nj,
                             seed = 1234)

#inspect transcript IDs
print(unique(gtf_simul$transcript_id))

## [1] "adding de novo junctions"
##  [1] NA                              "ENST00000488606.5"            
##  [3] "ENST00000462067.1"             "ENSG00000116898.12newJunction"
##  [5] "ENST00000371072.8"             "ENST00000418058.1"            
##  [7] "ENSG00000162437.14newJunction" "ENST00000667299.1"            
##  [9] "ENST00000667473.1"             "ENST00000610726.4"            
## [11] "ENST00000369530.5"             "ENST00000320238.3"            
## [13] "ENST00000369506.1"             "ENST00000626088.1"            
## [15] "ENST00000640686.1"             "ENST00000369094.5"            
## [17] "ENST00000469255.5"             "ENST00000368954.10"           
## [19] "ENST00000368949.8"             "ENST00000361170.7"            
## [21] "ENST00000491900.1"             "ENST00000641292.1"            
## [23] "ENST00000445098.2"

Each new transcript containing a new junction is assigned a name following the pattern: geneIDnewJunction.

Add new exons

We add one new exon to each of the next two genes. By default, the new exon will be selected such that the splice sites are canonical, i.e. a GT donor site and an AG acceptor site:

# Select next two genes for novel exons
genes_ne <- unique(gtf_simul$gene_id)[3:4]

# Add new exons
gtf_simul <- addNewExons(annot_gtf = gtf_simul,
                         fasta = fastaFile,
                         gene_ids = genes_ne,
                         canonical = TRUE,
                         seed = 1234)

#inspect transcriptIDs
print(unique(gtf_simul$transcript_id))

## [1] "adding de novo exons"
##  [1] NA                              "ENST00000488606.5"            
##  [3] "ENST00000462067.1"             "ENSG00000116898.12newJunction"
##  [5] "ENST00000371072.8"             "ENST00000418058.1"            
##  [7] "ENSG00000162437.14newJunction" "ENST00000667299.1"            
##  [9] "ENST00000667473.1"             "ENSG00000223883.2newExon"     
## [11] "ENST00000610726.4"             "ENST00000369530.5"            
## [13] "ENSG00000009307.16newExon"     "ENST00000320238.3"            
## [15] "ENST00000369506.1"             "ENST00000626088.1"            
## [17] "ENST00000640686.1"             "ENST00000369094.5"            
## [19] "ENST00000469255.5"             "ENST00000368954.10"           
## [21] "ENST00000368949.8"             "ENST00000361170.7"            
## [23] "ENST00000491900.1"             "ENST00000641292.1"            
## [25] "ENST00000445098.2"

Each new transcript containing a new exon is assigned a name following the pattern: geneIDnewExon.

Write final transcript annotation

Finally, we save the updated GTF and corresponding transcript sequences to file. These will serve as the ground truth for simulation:

# Export ground truth transcript annotation and sequences
transcript_seq = writeTrueTranscripts(gtf = gtf_simul,
                                      fasta = fastaFile,
                                      out = trueTranscriptDir)

## [1] "writing true transcripts to file"

Get gene expression

For each cluster of cells, we generate cell-specific gene expression profiles using getRandomGeneExpression(). This approach samples expression values from real Smart-seq3 data to closely mimic biological variability and read coverage patterns observed in practice.

Since we are simulating two clusters, each cluster receives a distinct gene expression profile that is shared across all cells in the cluster. To introduce cell-specific variability in sequencing depth, each expression profile is scaled by a randomly sampled cell-specific cellSize factor.

geneExpression <- getRandomGeneExpression(nClusters = nClust,
                                          nCellsPerCluster = nCellsPerClust,
                                          genes = unique(gtf_simul$gene_id),
                                          dataChar = dataChar,
                                          minCount = nTx,
                                          seed = 1234)
print(head(geneExpression))

## [1] "getting random gene expression"
##   cluster           cell            gene_id geneExpression
## 1       1 cluster1_cell1 ENSG00000116898.12      1.2028442
## 2       1 cluster1_cell2 ENSG00000116898.12      1.1573085
## 3       1 cluster1_cell3 ENSG00000116898.12      0.7771064
## 4       1 cluster1_cell4 ENSG00000116898.12      0.5586372
## 5       1 cluster1_cell5 ENSG00000116898.12      0.9171300
## 6       1 cluster1_cell6 ENSG00000116898.12      4.7318168

Get Percent Spliced In values

To simulate alternative splicing, we generate cluster-specific Percent Spliced In (PSI) values for each transcript using getRandomClusterPSI(). PSI values represent the relative abundance of each transcript within a gene and are used to simulate transcript-level expression profiles.

In this example:

The first half of the genes (fixedGenes) will have the same PSI values across all clusters, representing genes without differential spicing.
The second half (diffGenes) will have distinct PSI values per cluster, mimicking differential splicing between cell populations.

# Split genes into those with and without differential splicing
geneIDs <- unique(gtf_simul$gene_id)
fixedGenes <- geneIDs[1:(length(geneIDs)/2)]
diffGenes <- geneIDs[((length(geneIDs)/2)+1):length(geneIDs)]


# Generate cluster-specific PSI values
clusterPSI <- getRandomClusterPSI(gtf = gtf_simul,
                                  nClusters = nClust,
                                  fixedGenes = fixedGenes,
                                  diffGenes = diffGenes,
                                  seed = 1234)
print(clusterPSI)

## # A tibble: 48 × 4
## # Groups:   gene_id [10]
##    cluster gene_id            transcript_id                    PSI
##    <chr>   <chr>              <chr>                          <dbl>
##  1 1       ENSG00000116898.12 ENST00000488606.5             0.0845
##  2 1       ENSG00000116898.12 ENST00000462067.1             0.463 
##  3 1       ENSG00000116898.12 ENSG00000116898.12newJunction 0.453 
##  4 1       ENSG00000162437.14 ENST00000371072.8             0.293 
##  5 1       ENSG00000162437.14 ENST00000418058.1             0.405 
##  6 1       ENSG00000162437.14 ENSG00000162437.14newJunction 0.301 
##  7 1       ENSG00000223883.2  ENST00000667299.1             0.0105
##  8 1       ENSG00000223883.2  ENST00000667473.1             0.256 
##  9 1       ENSG00000223883.2  ENSG00000223883.2newExon      0.733 
## 10 1       ENSG00000009307.16 ENST00000610726.4             0.293 
## # ℹ 38 more rows

You may also provide a custom PSI table, as long as it includes the following columns: “cluster”, “gene_id”, “transcript_id” and “PSI”. Ensure that for each gene in each cluster, the PSI values across transcripts sum to 1.

Here, we provide an example in which the transcripts are equally abundant in all genes.

library(dplyr, quietly = TRUE)

#PSI uniformly distributed across all transcripts of a gene
clusterPSI <- merge(unique(geneExpression[, c("cluster", "gene_id")]), 
                    gtf_simul[gtf_simul$type == "transcript", c("gene_id", "transcript_id")], 
                    by = "gene_id")
clusterPSI$PSI <- 1 

#ensure PSI values of each gene sum to one per cluster
clusterPSI <- clusterPSI %>%
  group_by(gene_id, cluster) %>%
  mutate(PSI = PSI / sum(PSI))

print(clusterPSI)

Get fraction of unspliced transcripts

To emulate realistic transcriptome complexity, we get the fraction of unspliced transcripts per gene per cluster. These values are derived from the provided example dataset to reflect patterns observed in real Smart-seq3 data.

In this example we will simulate reads such that:

The first half of the genes (fixedGenes) have a shared unspliced fraction across clusters, representing no cluster-specific rate of RNA transcription and degradation.
The second half of the genes (diffGenes) can exhibit differential unspliced proportions, allowing for variation in RNA processing between clusters.

# Define gene sets for fixed vs. differential unspliced ratios
geneIDs <- unique(gtf_simul$gene_id)
fixedGenes <- geneIDs[1:(length(geneIDs)/2)]
diffGenes <- geneIDs[((length(geneIDs)/2)+1):length(geneIDs)]

# Generate unspliced transcript fractions per gene and cluster
unspliced <- getRandomUnspliced(nClusters = nClust, 
                                fixedGenes = fixedGenes, 
                                diffGenes = diffGenes, 
                                dataChar = dataChar, 
                                seed = 1234)
print(unspliced)

## [1] "getting random fractions of unspliced transcript"
##    cluster            gene_id nUnspliced
## 1        1 ENSG00000116898.12       0.00
## 2        1 ENSG00000162437.14       0.00
## 3        1  ENSG00000223883.2       0.00
## 4        1 ENSG00000009307.16       0.00
## 5        1  ENSG00000177551.5       0.70
## 6        2 ENSG00000116898.12       0.00
## 7        2 ENSG00000162437.14       0.00
## 8        2  ENSG00000223883.2       0.00
## 9        2 ENSG00000009307.16       0.00
## 10       2  ENSG00000177551.5       0.70
## 11       1  ENSG00000281741.2       0.94
## 12       1 ENSG00000159208.16       0.00
## 13       1 ENSG00000143418.20       0.00
## 14       1 ENSG00000183856.11       0.93
## 15       1  ENSG00000188585.9       0.00
## 16       2  ENSG00000281741.2       0.95
## 17       2 ENSG00000159208.16       0.00
## 18       2 ENSG00000143418.20       0.00
## 19       2 ENSG00000183856.11       0.50
## 20       2  ENSG00000188585.9       0.00

Since the example dataset is derived from single-cell Smart-seq3 (not single-nucleus RNA-seq), the majority of genes contain low or no unspliced RNA. As a result, many nUnspliced values are equal to 0.

You can also provide a custom table of unspliced transcript fractions. This table must include the following columns: “cluster”, “gene_id” and “nUnspliced”.

Here, we provide an example in which there are no unspliced transcripts.

unspliced <- unique(geneExpression[, c("cluster", "gene_id")])
unspliced$nUnspliced <- 0

print(unspliced)

Get transcript expression

Finally, we combine the cluster-specific PSI values and fractions of unspliced transcripts with the cell-specific gene expression to get the expression for each spliced and unspliced transcript.

transcriptExpression <- getTranscriptExpression(clusterPSI = clusterPSI, 
                                                unspliced = unspliced, 
                                                geneExpression = geneExpression, 
                                                outFile = paste0(trueTranscriptDir, "/transcriptExpression.tsv"))

print(head(transcriptExpression))

##              gene_id cluster     transcript_id       PSI            cell
## 1 ENSG00000009307.16       1 ENST00000610726.4 0.2933855 cluster1_cell10
## 2 ENSG00000009307.16       1 ENST00000610726.4 0.2933855  cluster1_cell2
## 3 ENSG00000009307.16       1 ENST00000610726.4 0.2933855  cluster1_cell3
## 4 ENSG00000009307.16       1 ENST00000610726.4 0.2933855  cluster1_cell4
## 5 ENSG00000009307.16       1 ENST00000610726.4 0.2933855  cluster1_cell1
## 6 ENSG00000009307.16       1 ENST00000610726.4 0.2933855  cluster1_cell6
##   geneExpression transcriptExpression
## 1       9.366274                    3
## 2      19.095590                    6
## 3      12.822255                    4
## 4       9.217515                    3
## 5      19.846930                    6
## 6      78.074977                   23

simulate reads

With all required inputs prepared, we now proceed to simulate the raw sequencing reads for all cells. We generate reads that mimic real Smart-seq3 data by combining transcript expression levels, sequencing characteristics, and error profiles.

simulateReads(transcriptExpression = transcriptExpression,
              transcript_seq = transcript_seq,
              dataChar = dataChar,
              out = readDir,
              barcodeLen = barcodeLen,
              UMIlen = UMIlen,
              readLen = readLen,
              tag = tag,
              PAF = 2,
              errorRate = 0.005,
              seed = 1234)

## [1] "simulating reads"
## [1] "cluster1_cell10"
## [1] "cluster1_cell2"
## [1] "cluster1_cell3"
## [1] "cluster1_cell4"
## [1] "cluster1_cell1"
## [1] "cluster1_cell6"
## [1] "cluster1_cell7"
## [1] "cluster1_cell8"
## [1] "cluster1_cell5"
## [1] "cluster1_cell9"
## [1] "cluster2_cell10"
## [1] "cluster2_cell2"
## [1] "cluster2_cell3"
## [1] "cluster2_cell4"
## [1] "cluster2_cell1"
## [1] "cluster2_cell6"
## [1] "cluster2_cell7"
## [1] "cluster2_cell8"
## [1] "cluster2_cell5"
## [1] "cluster2_cell9"

In the readDir directory, smartSim will write out :

the index and read fastq files (unassigned_I1.fq.gz, unassigned_I2.fq.gz, unassigned_R1.fq.gz and unassinged_R2.fq.gz);
the file containing all cell barcodes (barcodes.txt);
and the file linking the cell barcodes to the cell and cluster ids (cell2barcode.tsv).

References

[1]

Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks GJ, Larsson AJM, et al. Single-cell RNA counting at allele and isoform resolution using smart-seq3. Nature Biotechnology 2020;38:708–14. https://doi.org/10.1038/s41587-020-0497-0.

[2]

Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. zUMIs - a fast and flexible pipeline to process RNA sequencing data with UMIs. GigaScience 2018;7:giy059. https://doi.org/10.1093/gigascience/giy059.

smartSim tutorial