The following repository is an interactively executed demonstration of Scikick usage through an analysis of single cell transcriptomic (scRNAseq) data.
This analysis is based on work from the Orchestrating Single Cell Analysis book.
The `notebooks` directory contains a series of notebooks which were developed to analyze scRNAseq datasets. Each notebook will be introduced and added to the Scikick project as if they were developed throughout a true project timeline.
The final report generated by Scikick for this tutorial project can be seen here.
Links to `sk <command> --help` outputs are provided throughout this tutorial for review and convenience. It is recommended to first read at least the “Hello World” usage of Scikick prior to reading this page.
Essential screenshots of the report state are provided throughout this tutorial and there is no need to directly execute the commands. However, to interactively view the state of the report at each stage of the tutorial, commands must be executed interactively in-line with the text. Full execution of this notebook takes approximately 10 minutes and installation of the required software for the project can take up to a couple of hours.
Later in this tutorial, a demonstration will show how to execute this project using Singularity with Scikick such that no project-specific dependencies require installation.
If you are executing the tutorial commands interactively, you should ensure you have followed the setup instructions in this section.
If you are reading only, continue to the initialize scikick section.
This tutorial utilizes a bash kernel to execute bash commands in Jupyter. Additionally, this tutorial was developed in Jupyter Lab; when viewing the report site throughout the tutorial, it may be necessary to click “trust html” to enable all site content to be properly displayed.
Commands can also simply be executed in a shell terminal and reports viewed by opening in any modern web browser.
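For reference, the generated site is plain HTML under `report/out_html/` (the output location used throughout this tutorial), so the homepage can be opened from a terminal. The `xdg-open`/`open` commands and echo fallback below are illustrative conveniences, not part of Scikick:

```shell
# Open the generated homepage in the default browser
# (Linux: xdg-open; macOS: open; otherwise open the file manually)
REPORT_HOME=report/out_html/index.html
xdg-open "$REPORT_HOME" 2>/dev/null \
  || open "$REPORT_HOME" 2>/dev/null \
  || echo "Open $REPORT_HOME in a web browser"
```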
Ensure you have a working installation of Scikick.
To execute the code in this tutorial, download Scikick’s source code repository and navigate to the directory containing the analysis code and this Jupyter notebook, found under `docs/scikick_documentation/`.
The software used throughout the analysis is required for execution. This project uses R >= 4.0 for analysis. Dependencies for the analysis can be installed from the project root (i.e. `single-cell_analysis/`) in R with:
# Install BiocManager and remotes prior to this command
BiocManager::install(remotes::local_package_deps(dependencies=TRUE), version="3.12")
# Starting from scikick source code docs/scikick_documentation (this notebook)
pwd
/Users/mcarlucci/development/scikickstuff/scikick/docs/scikick_documentation
# Move into the scRNAseq project
cd single-cell_analysis
# (this code cell is hidden by HTML tags) - Remove previous executions of this tutorial
if [ -f scikick.yml ]; then
rm scikick.yml || echo "Not found"
rm -rf report || echo "Not found"
rm -rf output || echo "Not found"
# If the tutorial was executed previously, some notebooks will have been moved
mv notebooks/nestorowa/* notebooks/ || echo "No notebooks found"
fi
mv: rename notebooks/nestorowa/* to notebooks/*: No such file or directory
No notebooks found
In the project root directory, the Scikick project is initialized with `sk init`.
sk init -y
sk: Checking scikick software dependencies
sk: Importing template analysis configuration file
sk: Writing to scikick.yml
We will add some optional website styling to the configuration file (`scikick.yml`).
cat _site.yml
cat _site.yml >> scikick.yml
# Optional site theme customization
output:
  BiocStyle::html_document:
    code_folding: hide
    theme: readable
    toc_float: true
    toc: true
    number_sections: false
    toc_depth: 5
    self_contained: true
It is often useful to start by writing some background on the project and providing an overview of data sources, analysis goals, and the current state of the project. The `index.Rmd` notebook provides an overview of the scRNAseq analysis.
# Inspecting index.Rmd contents
cat index.Rmd
---
title: Project Overview
---
This project implements analyses from the [OSCA](http://bioconductor.org/books/release/OSCA/)
book on hematopoietic stem cell (HSC) single-cell RNA sequencing (scRNAseq) datasets. The main purpose of the project is to illustrate how [Scikick](https://github.com/matthewcarlucci/scikick) projects are configured and typically used in a realistic setting. A [walkthrough](../../../report/out_html/SCRNA_walkthrough.html) documents how Scikick was used throughout the project to execute and manage the state of the workflow.
# Credits to the Original OSCA Work
All pages in this project (except this one) were obtained directly from the source code of the OSCA book (https://github.com/Bioconductor/OrchestratingSingleCellAnalysis/, revision d56676f9d) released under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/us/) license and created by: Robert Amezquita, Aaron Lun, Stephanie Hicks, and Raphael Gottardo.
Changes to the original source code are as follows:
- Notebooks were split across multiple notebooks and placed into directories to demonstrate the inter-notebook state management of Scikick.
- The first chunk of each original notebook was removed.
- Titles were added to each page.
- Headers were adjusted to the new document hierarchy.
- "Session Info" sections were removed from each page (replaced by Scikick outputs).
- The cache-based build system was replaced with standard read/write operations.
- Package loads were added to each page as needed.
- Referencing using functionality specific to the bookdown format was removed or replaced.
# Reading a Scikick Report
- Pages each correspond to a computational notebook.
- Notebooks were configured to execute in a specific order.
- The "Project Map" at the bottom of each page illustrates the order in which notebooks were executed.
## Project Map
The "Project Map" automatically generated by Scikick below (and at the bottom of all pages)
can aid in navigating and visualizing the project. Each node is a page that can be clicked to navigate to it. Pages can also be navigated with the standard bootstrap navigation at the top of each page.
`index.Rmd` is added with `sk add`.
sk add index.Rmd
sk: An index file index.Rmd has been added and will be used as the homepage
sk: Added index.Rmd
`index.Rmd` is the only special file name in Scikick projects. It will be used for homepage content. We can see in the message above that Scikick recognized this.
An initial website will now be created with sk run.
sk run
sk: Executing code in index.Rmd, outputting to report/out_md/index.md
sk: Adding project map to report/out_md/index.md as report/out_md/index_tmp.md
sk: Creating site layout from scikick.yml
sk: Converting report/out_md/index_tmp.md to report/out_html/index.html
sk: Done, homepage is report/out_html/index.html
Note: The console outputs seen above for project map, site layout, and HTML generation will be suppressed throughout the rest of the tutorial by using `-q` with `sk run`.
We can see the contents of `index.Rmd` were added to the homepage.
Now we are ready to start adding analysis to the project.
During data exploration, it is often a good idea to develop analyses in separate notebooks when multiple stages of data transformation are being performed. To practice this modularized approach, notebooks in this tutorial are small and focus on specific tasks. Scikick provides tools to manage notebooks developed in this modular fashion. Modularization may be slightly exaggerated in this project, whereas a typical project may combine many of these steps into a single notebook at the user’s discretion.
The `notebooks/import.Rmd` notebook was developed to import the scRNAseq dataset from Nestorowa et al. 2016. This notebook loads and annotates the data, which is then saved to a file for later usage.
cat notebooks/import.Rmd
---
title: Import
---
This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with Smart-seq2 [Nestorowa et al., 2016](https://doi.org/10.1182/blood-2016-05-716480).
# Data loading
```{r data-loading}
library(scRNAseq)
sce.nest <- NestorowaHSCData()
```
```{r gene-annotation}
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.nest),
keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),]
```
After loading and annotation, we inspect the resulting `SingleCellExperiment` object:
```{r}
sce.nest
```
```{r export}
dir.create("output",showWarnings = FALSE)
saveRDS(sce.nest,"output/nestorowa_import_sce.RDS")
```
The notebook will now be added to the project.
sk add notebooks/import.Rmd
sk: Added notebooks/import.Rmd
Checking the new state of the project with sk status.
sk status
--- index.Rmd
m-- notebooks/import.Rmd
Scripts to execute: 1
HTMLs to compile ('---'): 2
Scikick has determined that `import.Rmd` must be executed since it is missing its output report file, while `index.Rmd` only requires small changes to its final page output (to include a menu item for `import.Rmd`).
sk run
sk: Adding project map to report/out_md/index.md as report/out_md/index_tmp.md
sk: Creating site layout from scikick.yml
sk: Converting report/out_md/index_tmp.md to report/out_html/index.html
sk: Executing code in notebooks/import.Rmd, outputting to report/out_md/notebooks/import.md
sk: Adding project map to report/out_md/notebooks/import.md as report/out_md/notebooks/import_tmp.md
sk: Converting report/out_md/notebooks/import_tmp.md to report/out_html/notebooks/import.html
sk: Done, homepage is report/out_html/index.html
We can see that `import.Rmd` was added to the report site (under the navigation bar as “Import”) with no additional configuration necessary.
A notebook was developed which performs quality control on the data downloaded by the `import.Rmd` notebook. That is, `quality_control.Rmd` must be executed after `import.Rmd`.
cat notebooks/quality_control.Rmd
---
title: Quality Control
---
```{r setup}
sce.nest <- readRDS("output/nestorowa_import_sce.RDS")
```
```{r}
unfiltered <- sce.nest
```
For some reason, no mitochondrial transcripts are available, so we will perform quality control using the spike-in proportions only.
```{r quality-control-grun}
library(scater)
stats <- perCellQCMetrics(sce.nest)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent")
sce.nest <- sce.nest[,!qc$discard]
```
We examine the number of cells discarded for each reason.
```{r}
colSums(as.matrix(qc))
```
We create some diagnostic plots for each metric.
```{r unref-nest-qc-dist, fig.wide=TRUE, fig.cap="Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded."}
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard
gridExtra::grid.arrange(
plotColData(unfiltered, y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
plotColData(unfiltered, y="altexps_ERCC_percent",
colour_by="discard") + ggtitle("ERCC percent"),
ncol=2
)
```
```{r export}
saveRDS(sce.nest,"output/nestorowa_quality_control_sce.RDS")
```
`sk add` with the `-d/--depends-on` flag is used to configure the execution order of these two notebooks.
sk add notebooks/quality_control.Rmd --depends-on notebooks/import.Rmd
sk: Added notebooks/quality_control.Rmd
sk: Added dependency notebooks/import.Rmd to notebooks/quality_control.Rmd
sk: notebooks/quality_control.Rmd will be executed after any executions of notebooks/import.Rmd
sk status
--- index.Rmd
--- notebooks/import.Rmd
m-- notebooks/quality_control.Rmd
Scripts to execute: 1
HTMLs to compile ('---'): 3
However, since `import.Rmd` has already run, only `quality_control.Rmd` requires execution.
sk run -q
sk: Executing code in notebooks/quality_control.Rmd, outputting to report/out_md/notebooks/quality_control.md
sk: Done, homepage is report/out_html/index.html
With multiple notebooks present in the final report, the order of execution is no longer immediately clear. Viewing the project maps generated by Scikick at the bottom of each page rectifies this by clearly outlining the connection made between `quality_control` and `import` (with `-d/--depends-on`). This map can also be used to navigate across pages of the report.
A notebook is added for implementing normalization and variance modeling of the transcript counts for the samples which survived quality control in `quality_control.Rmd`.
cat notebooks/normalization.Rmd
---
title: Normalization and Variance Modelling
---
```{r setup}
library(scater)
library(scran)
library(BiocStyle)
library(pheatmap)
sce.nest <- readRDS("output/nestorowa_quality_control_sce.RDS")
```
# Normalization
```{r normalization}
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.nest)
sce.nest <- computeSumFactors(sce.nest, clusters=clusters)
sce.nest <- logNormCounts(sce.nest)
```
We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check.
```{r}
summary(sizeFactors(sce.nest))
```
```{r unref-nest-norm, fig.cap="Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset."}
plot(librarySizeFactors(sce.nest), sizeFactors(sce.nest), pch=16,
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
```
# Variance modelling
We use the spike-in transcripts to model the technical noise as a function of the mean.
```{r variance-modelling}
set.seed(00010101)
dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC")
top.nest <- getTopHVGs(dec.nest, prop=0.1)
```
```{r unref-nest-var, fig.cap="Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red)."}
plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.nest)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
points(curfit$mean, curfit$var, col="red")
```
```{r export}
saveRDS(sce.nest,"output/nestorowa_normalization_sce.RDS")
saveRDS(top.nest,"output/nestorowa_normalization_top.RDS")
saveRDS(dec.nest,"output/nestorowa_normalization_dec.RDS")
```
sk add notebooks/normalization.Rmd -d notebooks/quality_control.Rmd
sk: Added notebooks/normalization.Rmd
sk: Added dependency notebooks/quality_control.Rmd to notebooks/normalization.Rmd
sk: notebooks/normalization.Rmd will be executed after any executions of notebooks/quality_control.Rmd
sk run -q
sk: Executing code in notebooks/normalization.Rmd, outputting to report/out_md/notebooks/normalization.md
sk: Done, homepage is report/out_html/index.html
We have now obtained a meaningful dataset for exploring biological results.
Once the data has been cleaned and normalized into a meaningful format for interpretation, it is common to perform some exploration of the data. Some common scRNAseq data exploration tasks are performed in a `further_exploration.Rmd` notebook.
sk add notebooks/further_exploration.Rmd -d notebooks/normalization.Rmd
sk: Added notebooks/further_exploration.Rmd
sk: Added dependency notebooks/normalization.Rmd to notebooks/further_exploration.Rmd
sk: notebooks/further_exploration.Rmd will be executed after any executions of notebooks/normalization.Rmd
sk run -q
sk: Executing code in notebooks/further_exploration.Rmd, outputting to report/out_md/notebooks/further_exploration.md
sk: Done, homepage is report/out_html/index.html
We now have a complete analysis of the Nestorowa dataset. Each notebook’s respective page is found in the Scikick report. The project map clearly shows the order of execution of these pages.
With Scikick usage throughout this project thus far, we can now be confident that, if we remove the report directory, we are able to regenerate all results with `sk run`.
# E.g. We could remove the report and regenerate from scratch.
# This will be skipped to save computations.
# rm -rf report
# sk run -q
We can take a look at the files required to re-execute this project.
ls notebooks/*Rmd
ls scikick.yml
notebooks/further_exploration.Rmd notebooks/normalization.Rmd
notebooks/import.Rmd notebooks/quality_control.Rmd
scikick.yml
We can now come back and check on the state of this project at any time in the future to find that there are no pending updates and that the report in `report/out_html` represents a full execution of these notebooks.
# View notebooks, dependencies, and state
sk status -v
index.Rmd
notebooks/import.Rmd
notebooks/quality_control.Rmd
notebooks/import.Rmd
notebooks/normalization.Rmd
notebooks/quality_control.Rmd
notebooks/further_exploration.Rmd
notebooks/normalization.Rmd
Scripts to execute: 0
HTMLs to compile ('---'): 0
Up to date (' '): 5
If we now update the original data import (e.g. modify `import.Rmd` to use a new version of the raw data), Scikick is aware of the state of and relationships between multiple notebooks.
# simulate an update to import.Rmd
printf "\nSimulated change" >> notebooks/import.Rmd
sk status
s-- notebooks/import.Rmd
-e- notebooks/quality_control.Rmd
-e- notebooks/normalization.Rmd
-e- notebooks/further_exploration.Rmd
Scripts to execute: 4
HTMLs to compile ('---'): 4
We see that `import` (`s--`, indicating an update to “self”) and all downstream notebooks (`-e-`, indicating an update to an external dependency) must be re-executed.
sk run -q
sk: Executing code in notebooks/import.Rmd, outputting to report/out_md/notebooks/import.md
sk: Executing code in notebooks/quality_control.Rmd, outputting to report/out_md/notebooks/quality_control.md
sk: Executing code in notebooks/normalization.Rmd, outputting to report/out_md/notebooks/normalization.md
sk: Executing code in notebooks/further_exploration.Rmd, outputting to report/out_md/notebooks/further_exploration.md
sk: Done, homepage is report/out_html/index.html
Scientific projects often develop as a series of experiments where analysis is performed in stages. We will now add two similar parallel workflows for two new scRNAseq experiments.
As a project moves forward, more datasets may be added, requiring reorganization of the project. Workflow configurations can be difficult to coordinate with file reorganization, and Scikick provides features to accommodate this. The `sk mv` command enables dynamic reorganization of projects while attempting to minimize the need for re-execution. It does this by applying a standard shell `mv` command (or `git mv` if using `-g`), while also moving the corresponding report files such that re-execution is not necessary.
For this project, we will use `sk mv` to migrate all analysis of the Nestorowa dataset above to a subfolder for this experiment, such that new experiments can be added in a parallel fashion.
mkdir notebooks/nestorowa
sk mv notebooks/*.Rmd notebooks/nestorowa
mkdir: notebooks/nestorowa: File exists
sk: mv notebooks/further_exploration.Rmd notebooks/nestorowa
sk: mv notebooks/import.Rmd notebooks/nestorowa
sk: mv notebooks/normalization.Rmd notebooks/nestorowa
sk: mv notebooks/quality_control.Rmd notebooks/nestorowa
sk: mv report/out_md/notebooks/figure/further_exploration report/out_md/notebooks/nestorowa/figure/further_exploration
sk: notebooks/further_exploration.Rmd renamed to notebooks/nestorowa/further_exploration.Rmd in ./scikick.yml
sk: notebooks/import.Rmd renamed to notebooks/nestorowa/import.Rmd in ./scikick.yml
sk: mv report/out_md/notebooks/figure/normalization report/out_md/notebooks/nestorowa/figure/normalization
sk: notebooks/normalization.Rmd renamed to notebooks/nestorowa/normalization.Rmd in ./scikick.yml
sk: mv report/out_md/notebooks/figure/quality_control report/out_md/notebooks/nestorowa/figure/quality_control
sk: notebooks/quality_control.Rmd renamed to notebooks/nestorowa/quality_control.Rmd in ./scikick.yml
Note that no code requires re-execution.
sk status
--- index.Rmd
--- notebooks/nestorowa/import.Rmd
--- notebooks/nestorowa/quality_control.Rmd
--- notebooks/nestorowa/normalization.Rmd
--- notebooks/nestorowa/further_exploration.Rmd
Scripts to execute: 0
HTMLs to compile ('---'): 5
A similar series of notebooks was developed for the two additional experiments.
sk add notebooks/grun/import.Rmd
sk add notebooks/grun/quality_control.Rmd -d notebooks/grun/import.Rmd
sk add notebooks/grun/normalization.Rmd -d notebooks/grun/quality_control.Rmd
sk add notebooks/grun/further_exploration.Rmd -d notebooks/grun/normalization.Rmd
sk add notebooks/paul/import.Rmd
sk add notebooks/paul/quality_control.Rmd -d notebooks/paul/import.Rmd
sk add notebooks/paul/normalization.Rmd -d notebooks/paul/quality_control.Rmd
sk add notebooks/paul/further_exploration.Rmd -d notebooks/paul/normalization.Rmd
sk: Added notebooks/grun/import.Rmd
sk: Added notebooks/grun/quality_control.Rmd
sk: Added dependency notebooks/grun/import.Rmd to notebooks/grun/quality_control.Rmd
sk: notebooks/grun/quality_control.Rmd will be executed after any executions of notebooks/grun/import.Rmd
sk: Added notebooks/grun/normalization.Rmd
sk: Added dependency notebooks/grun/quality_control.Rmd to notebooks/grun/normalization.Rmd
sk: notebooks/grun/normalization.Rmd will be executed after any executions of notebooks/grun/quality_control.Rmd
sk: Added notebooks/grun/further_exploration.Rmd
sk: Added dependency notebooks/grun/normalization.Rmd to notebooks/grun/further_exploration.Rmd
sk: notebooks/grun/further_exploration.Rmd will be executed after any executions of notebooks/grun/normalization.Rmd
sk: Added notebooks/paul/import.Rmd
sk: Added notebooks/paul/quality_control.Rmd
sk: Added dependency notebooks/paul/import.Rmd to notebooks/paul/quality_control.Rmd
sk: notebooks/paul/quality_control.Rmd will be executed after any executions of notebooks/paul/import.Rmd
sk: Added notebooks/paul/normalization.Rmd
sk: Added dependency notebooks/paul/quality_control.Rmd to notebooks/paul/normalization.Rmd
sk: notebooks/paul/normalization.Rmd will be executed after any executions of notebooks/paul/quality_control.Rmd
sk: Added notebooks/paul/further_exploration.Rmd
sk: Added dependency notebooks/paul/normalization.Rmd to notebooks/paul/further_exploration.Rmd
sk: notebooks/paul/further_exploration.Rmd will be executed after any executions of notebooks/paul/normalization.Rmd
Finally, a set of notebooks is added which performs a combined analysis of the three datasets that were each prepared in parallel. This `merge.Rmd` stage depends on the results of each of the data preparations performed previously (i.e., the `quality_control.Rmd` and `normalization.Rmd` notebooks); however, it does not depend on the biological analysis that was performed for each dataset (i.e., the analyses performed in `further_exploration.Rmd` notebooks are not used by `merge.Rmd`).
sk add notebooks/merged/merge.Rmd -d notebooks/grun/quality_control.Rmd -d notebooks/paul/quality_control.Rmd -d notebooks/nestorowa/normalization.Rmd
sk add notebooks/merged/combined_analysis.Rmd -d notebooks/merged/merge.Rmd
sk: Added notebooks/merged/merge.Rmd
sk: Added dependency notebooks/grun/quality_control.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/grun/quality_control.Rmd
sk: Added dependency notebooks/paul/quality_control.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/paul/quality_control.Rmd
sk: Added dependency notebooks/nestorowa/normalization.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/nestorowa/normalization.Rmd
sk: Added notebooks/merged/combined_analysis.Rmd
sk: Added dependency notebooks/merged/merge.Rmd to notebooks/merged/combined_analysis.Rmd
sk: notebooks/merged/combined_analysis.Rmd will be executed after any executions of notebooks/merged/merge.Rmd
With parallel series of notebooks, as in the current state of this project, notebooks can be executed in parallel by passing a flag (`-j8`) to snakemake via the `-s` flag of `sk run`.
The two additional experiments and merged analysis will now be executed.
sk run -q -s -j8
sk: Snakemake arguments received: -j8
sk: Executing code in notebooks/grun/import.Rmd, outputting to report/out_md/notebooks/grun/import.md
sk: Executing code in notebooks/paul/import.Rmd, outputting to report/out_md/notebooks/paul/import.md
sk: Executing code in notebooks/grun/quality_control.Rmd, outputting to report/out_md/notebooks/grun/quality_control.md
sk: Executing code in notebooks/paul/quality_control.Rmd, outputting to report/out_md/notebooks/paul/quality_control.md
sk: Executing code in notebooks/grun/normalization.Rmd, outputting to report/out_md/notebooks/grun/normalization.md
sk: Executing code in notebooks/grun/further_exploration.Rmd, outputting to report/out_md/notebooks/grun/further_exploration.md
sk: Executing code in notebooks/paul/normalization.Rmd, outputting to report/out_md/notebooks/paul/normalization.md
sk: Executing code in notebooks/merged/merge.Rmd, outputting to report/out_md/notebooks/merged/merge.md
sk: Executing code in notebooks/merged/combined_analysis.Rmd, outputting to report/out_md/notebooks/merged/combined_analysis.md
sk: Executing code in notebooks/paul/further_exploration.Rmd, outputting to report/out_md/notebooks/paul/further_exploration.md
sk: Done, homepage is report/out_html/index.html
A well-organized final report site is generated. The project map in the report now features branching sets of analyses.
This demonstration utilizes real datasets and real analyses to show how Scikick is used in a practical setting when adapting to new analysis additions. Use of Scikick in a less controlled setting (when the analysis is not predetermined) should provide even greater utility for maintaining workflow connections and reporting capabilities.
Scikick is implemented through snakemake workflows, allowing for usage of many features of snakemake. Users familiar with snakemake may be able to take advantage of many of its flags while using Scikick. Some frequently used examples are provided here.
Passing snakemake arguments with `-s` and utilizing the snakemake `-F` flag will force the entire project to execute from scratch. This can be a useful sanity check when time permits.
# This will be skipped to save computations.
# sk run -s -F
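Beyond `-F`, other standard snakemake flags can be forwarded through `-s` in the same way. For example (a sketch; flag availability depends on the installed snakemake version):

```shell
# Dry run: list the notebooks that would execute, without running them
sk run -s -n
# Keep executing independent branches even if one notebook fails
sk run -s --keep-going
```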
When readers attempt to reproduce this work or any other computational work, they may run into issues with software or data. In these cases, it is useful for the reader to be able to recreate the original software environment. The OSCA project maintains a docker image at Docker Hub. This image can first be assigned to the Scikick project with:
# Using a fixed image tag for future reproducibility
sk config --singularity docker://bioconductor/orchestratingsinglecellanalysis:RELEASE_3_12
sk: Argument singularity set to docker://bioconductor/orchestratingsinglecellanalysis:RELEASE_3_12
Providing the flag `--use-singularity` to snakemake will download the container and execute all notebooks within it.
Singularity configurations like the one above effectively make it possible to execute any Scikick data analysis project with only the core Scikick dependencies and no analysis software dependencies directly required.
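As a minimal sketch (without the cache binds or SLURM profile that the project's `run.sh` adds below), a fully containerized run reduces to:

```shell
# Pulls the configured container on first use and runs all notebooks inside it
sk run -s --use-singularity
```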
This scRNAseq project additionally requires the image to have write access to the `/home/cache` directory for data downloads, so a writable location must be provided for execution with the additional argument `--singularity-args`. Additionally, the script is configured to distribute notebook executions across a SLURM cluster. A short `run.sh` script contains a call to Scikick with the arguments necessary for full execution.
cat run.sh
##### Project Execution Overview
# This script contains:
# - Commands to handle scRNAseq (project-specific) data imports
# - Various commands to execute Scikick
# - Using singularity + SLURM (default)
# - Using singularity
#### Data Import
# For every project it is necessary to determine how
# to get the data. For this project, data comes from
# R packages with built-in caching mechanisms
### Copying Bioconductor data cache into container
# Currently it is necessary to reroute a directory
# to /home/cache since the docker container does not
# have read/write permissions and singularity does
# not have elevated permissions. Directories are passed to
# singularity with the -B/--bind argument.
## Best is to use the system cache dir to avoid re-downloads
# ehub=$(Rscript -e "cat(Sys.getenv('EXPERIMENT_HUB_CACHE'))")
# ahub=$(Rscript -e "cat(Sys.getenv('ANNOTATION_HUB_CACHE'))")
## For ease, creating dedicated directories for data downloads
mkdir -p input/cache/ExperimentHub
mkdir -p input/cache/AnnotationHub
### Analysis Execution Commands
## Run all with Singularity and on a SLURM cluster
# Note that the singularity pull can require tmp space which can be set prior to the run with:
# TMPDIR=/tmpdir/with/space
# The snakemake profile "slurm" must be configured prior to executing this command
sk run -v -s --use-singularity \
--singularity-args "'-B input/cache/ExperimentHub:/home/cache/ExperimentHub -B input/cache/AnnotationHub:/home/cache/AnnotationHub'" \
--profile slurm
## Singularity usage
# sk run -v -s --use-singularity \
# --singularity-args "'-B input/cache/ExperimentHub:/home/cache/ExperimentHub -B input/cache/AnnotationHub:/home/cache/AnnotationHub'"
Linking this execution with a continuous integration service implements a version of Continuous Analysis, where the archived reports may be referred to at any time in the future with the knowledge that they are reproducible. A template for such a configuration using GitLab CI is provided below:
cat .gitlab-ci.yml
# Stage 1 - Run the analysis and store results on GitLab
analysis:
stage: build
# This task is executed on each push of new code to the GitLab server
script:
- conda activate scikick # activate a preconfigured environment with scikick and dependencies installed
- bash run.sh # script calls scikick to run the analysis with singularity on a SLURM cluster
cache:
paths:
- .snakemake/singularity # Avoid pulling containers from DockerHub on each build
artifacts:
paths:
- report/ # keep the Scikick analysis archive files
- "*.out" # keep SLURM cluster log files
when: always # keep these artifacts on failure to investigate cluster error logs
# Stage 2 - On successful build, send the report to a server for sharing
deploy:
stage: deploy
script:
- mkdir -p .ci-deploy-dir/$CI_PROJECT_PATH/$CI_COMMIT_REF_NAME/ # prepare a directory based on GitLab project path
- rsync -av report/out_html/* .ci-deploy-dir/$CI_PROJECT_PATH/$CI_COMMIT_REF_NAME/ # put the site in the directory
- rsync -Pav .ci-deploy-dir/ admin@myserver.com:/home/admin/webroot/html/ # send to a dedicated gitlab-ci dir on webserver
The analysis codebase will now contain Scikick reports that are fully verified for reproducibility.