The following repository is an interactively executed demonstration of Scikick usage through an analysis of single cell transcriptomic (scRNAseq) data.
This analysis is based on work from the Orchestrating Single Cell Analysis book.
The `notebooks` directory contains a series of notebooks which were developed to analyze scRNAseq datasets. Each notebook will be introduced and added to the Scikick project as if they were developed throughout a true project timeline.
The final report generated by Scikick for this tutorial project can be seen here.
Links to `sk <command> --help` outputs are provided throughout this tutorial for review and convenience. It is recommended to first read at least the “Hello World” usage of Scikick prior to reading this page.
Essential screenshots of the report state are provided throughout this tutorial and there is no need to directly execute the commands. However, to interactively view the state of the report at each stage of the tutorial, commands must be executed interactively in-line with the text. Full execution of this notebook takes approximately 10 minutes and installation of the required software for the project can take up to a couple of hours.
Later in this tutorial, a demonstration will show how to execute this project using Singularity with Scikick such that no project-specific dependencies require installation.
If you are executing the tutorial commands interactively, you should ensure you have followed the setup instructions in this section.
If you are reading only, continue to the initialize scikick section.
This tutorial utilizes a bash kernel to execute bash commands in Jupyter. Additionally, this tutorial was developed in Jupyter Lab; when viewing the report site throughout the tutorial, it may be necessary to click “trust html” to enable all site content to be properly displayed.
Commands can also simply be executed in a shell terminal and reports viewed by opening in any modern web browser.
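For reference, the generated site is plain HTML under `report/out_html/` (the output location used throughout this tutorial), so the homepage can be opened from a terminal. The `xdg-open`/`open` commands and echo fallback below are illustrative conveniences, not part of Scikick:

```shell
# Open the generated homepage in the default browser
# (Linux: xdg-open; macOS: open; otherwise open the file manually)
REPORT_HOME=report/out_html/index.html
xdg-open "$REPORT_HOME" 2>/dev/null \
  || open "$REPORT_HOME" 2>/dev/null \
  || echo "Open $REPORT_HOME in a web browser"
```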
Ensure you have a working installation of Scikick.
To execute the code in this tutorial, download Scikick’s source code repository and navigate to the directory containing the analysis code and this Jupyter notebook, found under `docs/scikick_documentation/`.
The software used throughout the analysis is required for execution. This project uses R >= 4.0 for analysis. Dependencies for the analysis can be installed from the project root (i.e. `single-cell_analysis/`) in R with:
# Install BiocManager and remotes prior to this command
BiocManager::install(remotes::local_package_deps(dependencies=TRUE), version="3.12")
# Starting from scikick source code docs/scikick_documentation (this notebook)
pwd
/Users/mcarlucci/development/scikickstuff/scikick/docs/scikick_documentation
# Move into the scRNAseq project
cd single-cell_analysis
# (this code cell is hidden by HTML tags) - Remove previous executions of this tutorial
if [ -f scikick.yml ]; then
rm scikick.yml || echo "Not found"
rm -rf report || echo "Not found"
rm -rf output || echo "Not found"
# If the tutorial was executed previously, some notebooks will have been moved
mv notebooks/nestorowa/* notebooks/ || echo "No notebooks found"
fi
mv: rename notebooks/nestorowa/* to notebooks/*: No such file or directory
No notebooks found
In the project root directory, the Scikick project is initialized with `sk init`.
sk init -y
sk: Checking scikick software dependencies
sk: Importing template analysis configuration file
sk: Writing to scikick.yml
We will add some optional website styling to the configuration file (`scikick.yml`).
cat _site.yml
cat _site.yml >> scikick.yml
# Optional site theme customization
output:
  BiocStyle::html_document:
    code_folding: hide
    theme: readable
    toc_float: true
    toc: true
    number_sections: false
    toc_depth: 5
    self_contained: true
It is often useful to start by writing some background on the project and providing an overview of data sources, analysis goals, and the current state of the project. The `index.Rmd` notebook provides an overview of the scRNAseq analysis.
# Inspecting index.Rmd contents
cat index.Rmd
---
title: Project Overview
---
This project implements analyses from the [OSCA](http://bioconductor.org/books/release/OSCA/)
book on hematopoietic stem cell (HSC) single-cell RNA sequencing (scRNAseq) datasets. The main purpose of the project is to illustrate how [Scikick](https://github.com/matthewcarlucci/scikick) projects are configured and typically used in a realistic setting. A [walkthrough](../../../report/out_html/SCRNA_walkthrough.html) documents how Scikick was used throughout the project to execute and manage the state of the workflow.
# Credits to the Original OSCA Work
All pages in this project (except this one) were obtained directly from the source code of the OSCA book (https://github.com/Bioconductor/OrchestratingSingleCellAnalysis/, revision d56676f9d) released under a [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/us/) license and created by: Robert Amezquita, Aaron Lun, Stephanie Hicks, and Raphael Gottardo.
Changes to the original source code are as follows:
- Notebooks were split across multiple notebooks and placed into directories to demonstrate the inter-notebook state management of Scikick.
- The first chunk of each original notebook was removed.
- Titles were added to each page.
- Headers were adjusted to the new document hierarchy.
- "Session Info" sections were removed from each page (replaced by Scikick outputs).
- The cache-based build system was replaced with standard read/write operations.
- Package loads were added to each page as needed.
- Referencing using functionality specific to the bookdown format was removed or replaced.
# Reading a Scikick Report
- Pages each correspond to a computational notebook.
- Notebooks were configured to execute in a specific order.
- The "Project Map" at the bottom of each page illustrates the order in which notebooks were executed.
## Project Map
The "Project Map" automatically generated by Scikick below (and at the bottom of all pages)
can aid in navigating and visualizing the project. Each node is a page that can be clicked to navigate to it. Pages can also be navigated with the standard bootstrap navigation at the top of each page.
`index.Rmd` is added with `sk add`.
sk add index.Rmd
sk: An index file index.Rmd has been added and will be used as the homepage
sk: Added index.Rmd
`index.Rmd` is the only special file name in Scikick projects. It will be used for homepage content. We can see in the message above that Scikick recognized this.
An initial website will now be created with sk run.
sk run
sk: Executing code in index.Rmd, outputting to report/out_md/index.md
sk: Adding project map to report/out_md/index.md as report/out_md/index_tmp.md
sk: Creating site layout from scikick.yml
sk: Converting report/out_md/index_tmp.md to report/out_html/index.html
sk: Done, homepage is report/out_html/index.html
Note: The console outputs seen above for project map, site layout, and HTML generation will be suppressed throughout the rest of the tutorial by using `-q` with `sk run`.
We can see the contents of `index.Rmd` were added to the homepage.
Now we are ready to start adding analysis to the project.
During data exploration, it is often a good idea to develop analyses in separate notebooks when multiple stages of data transformation are being performed. To practice this modularized approach, notebooks in this tutorial are small and focus on specific tasks. Scikick provides tools to manage notebooks developed in this modular fashion. Modularization may be slightly exaggerated in this project, whereas a typical project may combine many of these steps into a single notebook at the user’s discretion.
The `notebooks/import.Rmd` notebook was developed to import the scRNAseq dataset from Nestorowa et al. 2016. This notebook loads and annotates the data, which is then saved to a file for later usage.
cat notebooks/import.Rmd
---
title: Import
---
This performs an analysis of the mouse haematopoietic stem cell (HSC) dataset generated with Smart-seq2 [Nestorowa et al., 2016](https://doi.org/10.1182/blood-2016-05-716480).
# Data loading
```{r data-loading}
library(scRNAseq)
sce.nest <- NestorowaHSCData()
```
```{r gene-annotation}
library(AnnotationHub)
ens.mm.v97 <- AnnotationHub()[["AH73905"]]
anno <- select(ens.mm.v97, keys=rownames(sce.nest),
keytype="GENEID", columns=c("SYMBOL", "SEQNAME"))
rowData(sce.nest) <- anno[match(rownames(sce.nest), anno$GENEID),]
```
After loading and annotation, we inspect the resulting `SingleCellExperiment` object:
```{r}
sce.nest
```
```{r export}
dir.create("output",showWarnings = FALSE)
saveRDS(sce.nest,"output/nestorowa_import_sce.RDS")
```
The notebook will now be added to the project.
sk add notebooks/import.Rmd
sk: Added notebooks/import.Rmd
Checking the new state of the project with sk status.
sk status
--- index.Rmd
m-- notebooks/import.Rmd
Scripts to execute: 1
HTMLs to compile ('---'): 2
Scikick has determined that `import.Rmd` must be executed since it is missing its output report file, while `index.Rmd` only requires small changes to its final page output (to include a menu item for `import.Rmd`).
sk run
sk: Adding project map to report/out_md/index.md as report/out_md/index_tmp.md
sk: Creating site layout from scikick.yml
sk: Converting report/out_md/index_tmp.md to report/out_html/index.html
sk: Executing code in notebooks/import.Rmd, outputting to report/out_md/notebooks/import.md
sk: Adding project map to report/out_md/notebooks/import.md as report/out_md/notebooks/import_tmp.md
sk: Converting report/out_md/notebooks/import_tmp.md to report/out_html/notebooks/import.html
sk: Done, homepage is report/out_html/index.html
We can see that `import.Rmd` was added to the report site (under the navigation bar as “Import”) with no additional configuration necessary.
A notebook was developed which performs quality control on the data downloaded by the `import.Rmd` notebook. That is, `quality_control.Rmd` must be executed after `import.Rmd`.
cat notebooks/quality_control.Rmd
---
title: Quality Control
---
```{r setup}
sce.nest <- readRDS("output/nestorowa_import_sce.RDS")
```
```{r}
unfiltered <- sce.nest
```
For some reason, no mitochondrial transcripts are available, so we will perform quality control using the spike-in proportions only.
```{r quality-control-grun}
library(scater)
stats <- perCellQCMetrics(sce.nest)
qc <- quickPerCellQC(stats, percent_subsets="altexps_ERCC_percent")
sce.nest <- sce.nest[,!qc$discard]
```
We examine the number of cells discarded for each reason.
```{r}
colSums(as.matrix(qc))
```
We create some diagnostic plots for each metric.
```{r unref-nest-qc-dist, fig.wide=TRUE, fig.cap="Distribution of each QC metric across cells in the Nestorowa HSC dataset. Each point represents a cell and is colored according to whether that cell was discarded."}
colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- qc$discard
gridExtra::grid.arrange(
plotColData(unfiltered, y="sum", colour_by="discard") +
scale_y_log10() + ggtitle("Total count"),
plotColData(unfiltered, y="detected", colour_by="discard") +
scale_y_log10() + ggtitle("Detected features"),
plotColData(unfiltered, y="altexps_ERCC_percent",
colour_by="discard") + ggtitle("ERCC percent"),
ncol=2
)
```
```{r export}
saveRDS(sce.nest,"output/nestorowa_quality_control_sce.RDS")
```
`sk add` with the `-d/--depends-on` flag is used to configure the execution order of these two notebooks.
sk add notebooks/quality_control.Rmd --depends-on notebooks/import.Rmd
sk: Added notebooks/quality_control.Rmd
sk: Added dependency notebooks/import.Rmd to notebooks/quality_control.Rmd
sk: notebooks/quality_control.Rmd will be executed after any executions of notebooks/import.Rmd
sk status
--- index.Rmd
--- notebooks/import.Rmd
m-- notebooks/quality_control.Rmd
Scripts to execute: 1
HTMLs to compile ('---'): 3
However, since `import.Rmd` has already run, only `quality_control.Rmd` requires execution.
sk run -q
sk: Executing code in notebooks/quality_control.Rmd, outputting to report/out_md/notebooks/quality_control.md
sk: Done, homepage is report/out_html/index.html
With multiple notebooks present in the final report, the order of execution is no longer immediately clear. Viewing the project maps generated by Scikick at the bottom of each page rectifies this by clearly outlining the connection made between `quality_control` and `import` (with `-d/--depends-on`). This map can also be used to navigate across pages of the report.
A notebook is added for implementing normalization and variance modeling of the transcript counts for the samples which survived quality control in `quality_control.Rmd`.
cat notebooks/normalization.Rmd
---
title: Normalization and Variance Modelling
---
```{r setup}
library(scater)
library(scran)
library(BiocStyle)
library(pheatmap)
sce.nest <- readRDS("output/nestorowa_quality_control_sce.RDS")
```
# Normalization
```{r normalization}
library(scran)
set.seed(101000110)
clusters <- quickCluster(sce.nest)
sce.nest <- computeSumFactors(sce.nest, clusters=clusters)
sce.nest <- logNormCounts(sce.nest)
```
We examine some key metrics for the distribution of size factors, and compare it to the library sizes as a sanity check.
```{r}
summary(sizeFactors(sce.nest))
```
```{r unref-nest-norm, fig.cap="Relationship between the library size factors and the deconvolution size factors in the Nestorowa HSC dataset."}
plot(librarySizeFactors(sce.nest), sizeFactors(sce.nest), pch=16,
xlab="Library size factors", ylab="Deconvolution factors", log="xy")
```
# Variance modelling
We use the spike-in transcripts to model the technical noise as a function of the mean.
```{r variance-modelling}
set.seed(00010101)
dec.nest <- modelGeneVarWithSpikes(sce.nest, "ERCC")
top.nest <- getTopHVGs(dec.nest, prop=0.1)
```
```{r unref-nest-var, fig.cap="Per-gene variance as a function of the mean for the log-expression values in the Nestorowa HSC dataset. Each point represents a gene (black) with the mean-variance trend (blue) fitted to the spike-ins (red)."}
plot(dec.nest$mean, dec.nest$total, pch=16, cex=0.5,
xlab="Mean of log-expression", ylab="Variance of log-expression")
curfit <- metadata(dec.nest)
curve(curfit$trend(x), col='dodgerblue', add=TRUE, lwd=2)
points(curfit$mean, curfit$var, col="red")
```
```{r export}
saveRDS(sce.nest,"output/nestorowa_normalization_sce.RDS")
saveRDS(top.nest,"output/nestorowa_normalization_top.RDS")
saveRDS(dec.nest,"output/nestorowa_normalization_dec.RDS")
```
sk add notebooks/normalization.Rmd -d notebooks/quality_control.Rmd
sk: Added notebooks/normalization.Rmd
sk: Added dependency notebooks/quality_control.Rmd to notebooks/normalization.Rmd
sk: notebooks/normalization.Rmd will be executed after any executions of notebooks/quality_control.Rmd
sk run -q
sk: Executing code in notebooks/normalization.Rmd, outputting to report/out_md/notebooks/normalization.md
sk: Done, homepage is report/out_html/index.html
We have now obtained a meaningful dataset for exploring biological results.
Once the data has been cleaned and normalized into a meaningful format for interpretation, it is common to perform some exploration of the data. Some common scRNAseq data exploration tasks are performed in a `further_exploration.Rmd` notebook.
sk add notebooks/further_exploration.Rmd -d notebooks/normalization.Rmd
sk: Added notebooks/further_exploration.Rmd
sk: Added dependency notebooks/normalization.Rmd to notebooks/further_exploration.Rmd
sk: notebooks/further_exploration.Rmd will be executed after any executions of notebooks/normalization.Rmd
sk run -q
sk: Executing code in notebooks/further_exploration.Rmd, outputting to report/out_md/notebooks/further_exploration.md
sk: Done, homepage is report/out_html/index.html
We now have a complete analysis of the Nestorowa dataset. Each notebook’s respective page is found in the Scikick report. The project map clearly shows the order of execution of these pages.
With Scikick usage throughout this project thus far, we can now be confident that, if we remove the report directory, we are able to regenerate all results with `sk run`.
# E.g. We could remove the report and regenerate from scratch.
# This will be skipped to save computations.
# rm -rf report
# sk run -q
We can take a look at the files required to re-execute this project.
ls notebooks/*Rmd
ls scikick.yml
notebooks/further_exploration.Rmd notebooks/normalization.Rmd
notebooks/import.Rmd notebooks/quality_control.Rmd
scikick.yml
We can now come back and check on the state of this project at any time in the future to find that there are no pending updates and that the report in `report/out_html` represents a full execution of these notebooks.
# View notebooks, dependencies, and state
sk status -v
index.Rmd
notebooks/import.Rmd
notebooks/quality_control.Rmd
notebooks/import.Rmd
notebooks/normalization.Rmd
notebooks/quality_control.Rmd
notebooks/further_exploration.Rmd
notebooks/normalization.Rmd
Scripts to execute: 0
HTMLs to compile ('---'): 0
Up to date (' '): 5
If we now update the original data import (e.g. modify `import.Rmd` to use a new version of the raw data), Scikick is aware of the state of and relationships between multiple notebooks.
# simulate an update to import.Rmd
printf "\nSimulated change" >> notebooks/import.Rmd
sk status
s-- notebooks/import.Rmd
-e- notebooks/quality_control.Rmd
-e- notebooks/normalization.Rmd
-e- notebooks/further_exploration.Rmd
Scripts to execute: 4
HTMLs to compile ('---'): 4
We see that `import` (`s--`, indicating an update to “self”) and all downstream notebooks (`-e-`, indicating an update to an external dependency) must be re-executed.
sk run -q
sk: Executing code in notebooks/import.Rmd, outputting to report/out_md/notebooks/import.md
sk: Executing code in notebooks/quality_control.Rmd, outputting to report/out_md/notebooks/quality_control.md
sk: Executing code in notebooks/normalization.Rmd, outputting to report/out_md/notebooks/normalization.md
sk: Executing code in notebooks/further_exploration.Rmd, outputting to report/out_md/notebooks/further_exploration.md
sk: Done, homepage is report/out_html/index.html
Scientific projects often develop as a series of experiments where analysis is performed in stages. We will now add two similar parallel workflows for two new scRNAseq experiments.
As a project moves forward, more datasets may be added, requiring reorganization of the project. Workflow configurations can be difficult to coordinate with file reorganization, and Scikick provides features to accommodate this. The `sk mv` command enables dynamic reorganization of projects while attempting to minimize the need for re-execution. It does this by applying a standard shell `mv` command (or `git mv` if using `-g`), while also moving the corresponding report files such that re-execution is not necessary.
For this project, we will use `sk mv` to migrate all analysis of the Nestorowa dataset above to a subfolder for this experiment, such that new experiments can be added in a parallel fashion.
mkdir notebooks/nestorowa
sk mv notebooks/*.Rmd notebooks/nestorowa
mkdir: notebooks/nestorowa: File exists
sk: mv notebooks/further_exploration.Rmd notebooks/nestorowa
sk: mv notebooks/import.Rmd notebooks/nestorowa
sk: mv notebooks/normalization.Rmd notebooks/nestorowa
sk: mv notebooks/quality_control.Rmd notebooks/nestorowa
sk: mv report/out_md/notebooks/figure/further_exploration report/out_md/notebooks/nestorowa/figure/further_exploration
sk: notebooks/further_exploration.Rmd renamed to notebooks/nestorowa/further_exploration.Rmd in ./scikick.yml
sk: notebooks/import.Rmd renamed to notebooks/nestorowa/import.Rmd in ./scikick.yml
sk: mv report/out_md/notebooks/figure/normalization report/out_md/notebooks/nestorowa/figure/normalization
sk: notebooks/normalization.Rmd renamed to notebooks/nestorowa/normalization.Rmd in ./scikick.yml
sk: mv report/out_md/notebooks/figure/quality_control report/out_md/notebooks/nestorowa/figure/quality_control
sk: notebooks/quality_control.Rmd renamed to notebooks/nestorowa/quality_control.Rmd in ./scikick.yml
Note that no code requires re-execution.
sk status
--- index.Rmd
--- notebooks/nestorowa/import.Rmd
--- notebooks/nestorowa/quality_control.Rmd
--- notebooks/nestorowa/normalization.Rmd
--- notebooks/nestorowa/further_exploration.Rmd
Scripts to execute: 0
HTMLs to compile ('---'): 5
A similar series of notebooks was developed for the two additional experiments.
sk add notebooks/grun/import.Rmd
sk add notebooks/grun/quality_control.Rmd -d notebooks/grun/import.Rmd
sk add notebooks/grun/normalization.Rmd -d notebooks/grun/quality_control.Rmd
sk add notebooks/grun/further_exploration.Rmd -d notebooks/grun/normalization.Rmd
sk add notebooks/paul/import.Rmd
sk add notebooks/paul/quality_control.Rmd -d notebooks/paul/import.Rmd
sk add notebooks/paul/normalization.Rmd -d notebooks/paul/quality_control.Rmd
sk add notebooks/paul/further_exploration.Rmd -d notebooks/paul/normalization.Rmd
sk: Added notebooks/grun/import.Rmd
sk: Added notebooks/grun/quality_control.Rmd
sk: Added dependency notebooks/grun/import.Rmd to notebooks/grun/quality_control.Rmd
sk: notebooks/grun/quality_control.Rmd will be executed after any executions of notebooks/grun/import.Rmd
sk: Added notebooks/grun/normalization.Rmd
sk: Added dependency notebooks/grun/quality_control.Rmd to notebooks/grun/normalization.Rmd
sk: notebooks/grun/normalization.Rmd will be executed after any executions of notebooks/grun/quality_control.Rmd
sk: Added notebooks/grun/further_exploration.Rmd
sk: Added dependency notebooks/grun/normalization.Rmd to notebooks/grun/further_exploration.Rmd
sk: notebooks/grun/further_exploration.Rmd will be executed after any executions of notebooks/grun/normalization.Rmd
sk: Added notebooks/paul/import.Rmd
sk: Added notebooks/paul/quality_control.Rmd
sk: Added dependency notebooks/paul/import.Rmd to notebooks/paul/quality_control.Rmd
sk: notebooks/paul/quality_control.Rmd will be executed after any executions of notebooks/paul/import.Rmd
sk: Added notebooks/paul/normalization.Rmd
sk: Added dependency notebooks/paul/quality_control.Rmd to notebooks/paul/normalization.Rmd
sk: notebooks/paul/normalization.Rmd will be executed after any executions of notebooks/paul/quality_control.Rmd
sk: Added notebooks/paul/further_exploration.Rmd
sk: Added dependency notebooks/paul/normalization.Rmd to notebooks/paul/further_exploration.Rmd
sk: notebooks/paul/further_exploration.Rmd will be executed after any executions of notebooks/paul/normalization.Rmd
Finally, a set of notebooks is added which performs a combined analysis of the three datasets that were each prepared in parallel. This `merge.Rmd` stage depends on the results of each of the data preparations performed previously (i.e., the `quality_control.Rmd` and `normalization.Rmd` notebooks); however, it does not depend on the biological analysis that was performed for each dataset (i.e., the analyses performed in `further_exploration.Rmd` notebooks are not used by `merge.Rmd`).
sk add notebooks/merged/merge.Rmd -d notebooks/grun/quality_control.Rmd -d notebooks/paul/quality_control.Rmd -d notebooks/nestorowa/normalization.Rmd
sk add notebooks/merged/combined_analysis.Rmd -d notebooks/merged/merge.Rmd
sk: Added notebooks/merged/merge.Rmd
sk: Added dependency notebooks/grun/quality_control.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/grun/quality_control.Rmd
sk: Added dependency notebooks/paul/quality_control.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/paul/quality_control.Rmd
sk: Added dependency notebooks/nestorowa/normalization.Rmd to notebooks/merged/merge.Rmd
sk: notebooks/merged/merge.Rmd will be executed after any executions of notebooks/nestorowa/normalization.Rmd
sk: Added notebooks/merged/combined_analysis.Rmd
sk: Added dependency notebooks/merged/merge.Rmd to notebooks/merged/combined_analysis.Rmd
sk: notebooks/merged/combined_analysis.Rmd will be executed after any executions of notebooks/merged/merge.Rmd
With parallel series of notebooks, as in the current state of this project, notebooks can be executed in parallel by passing a flag (`-j8`) to snakemake via the `-s` flag of `sk run`.
The two additional experiments and merged analysis will now be executed.
sk run -q -s -j8
sk: Snakemake arguments received: -j8
sk: Executing code in notebooks/grun/import.Rmd, outputting to report/out_md/notebooks/grun/import.md
sk: Executing code in notebooks/paul/import.Rmd, outputting to report/out_md/notebooks/paul/import.md
sk: Executing code in notebooks/grun/quality_control.Rmd, outputting to report/out_md/notebooks/grun/quality_control.md
sk: Executing code in notebooks/paul/quality_control.Rmd, outputting to report/out_md/notebooks/paul/quality_control.md
sk: Executing code in notebooks/grun/normalization.Rmd, outputting to report/out_md/notebooks/grun/normalization.md
sk: Executing code in notebooks/grun/further_exploration.Rmd, outputting to report/out_md/notebooks/grun/further_exploration.md
sk: Executing code in notebooks/paul/normalization.Rmd, outputting to report/out_md/notebooks/paul/normalization.md
sk: Executing code in notebooks/merged/merge.Rmd, outputting to report/out_md/notebooks/merged/merge.md
sk: Executing code in notebooks/merged/combined_analysis.Rmd, outputting to report/out_md/notebooks/merged/combined_analysis.md
sk: Executing code in notebooks/paul/further_exploration.Rmd, outputting to report/out_md/notebooks/paul/further_exploration.md
sk: Done, homepage is report/out_html/index.html
A well-organized final report site is generated. The project map in the report now features branching sets of analyses.
This demonstration utilizes real datasets and real analyses to show how Scikick is used in a practical setting when adapting to new analysis additions. Use of Scikick in a less controlled setting (when the analysis is not predetermined) should provide even greater utility for maintaining workflow connections and reporting capabilities.
Scikick is implemented through snakemake workflows, allowing for usage of many features of snakemake. Users familiar with snakemake may be able to take advantage of many of its flags while using Scikick. Some frequently used examples are provided here.
Passing snakemake arguments with `-s` and utilizing the snakemake `-F` flag will force the entire project to execute from scratch. This can be a useful sanity check when time permits.
# This will be skipped to save computations.
# sk run -s -F
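Beyond `-F`, other standard snakemake flags can be forwarded through `-s` in the same way. For example (a sketch; flag availability depends on the installed snakemake version):

```shell
# Dry run: list the notebooks that would execute, without running them
sk run -s -n
# Keep executing independent branches even if one notebook fails
sk run -s --keep-going
```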
When readers attempt to reproduce this work or any other computational work, they may run into issues with software or data. In these cases, it is useful for the reader to be able to recreate the original software environment. The OSCA project maintains a docker image at Docker Hub. This image can first be assigned to the Scikick project with:
# Using a fixed image tag for future reproducibility
sk config --singularity docker://bioconductor/orchestratingsinglecellanalysis:RELEASE_3_12
sk: Argument singularity set to docker://bioconductor/orchestratingsinglecellanalysis:RELEASE_3_12
Providing the flag `--use-singularity` to snakemake will download the container and execute all notebooks within it.
Singularity configurations like the one above effectively make it possible to execute any Scikick data analysis project with only the core Scikick dependencies and no analysis software dependencies directly required.
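As a minimal sketch (without the cache binds or SLURM profile that the project's `run.sh` adds below), a fully containerized run reduces to:

```shell
# Pulls the configured container on first use and runs all notebooks inside it
sk run -s --use-singularity
```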
This scRNAseq project additionally requires the image to have write access to the `/home/cache` directory for data downloads, so a writable location must be provided for execution with the additional argument `--singularity-args`. Additionally, the script is configured to distribute notebook executions across a SLURM cluster. A short `run.sh` script contains a call to Scikick with the arguments necessary for full execution.
cat run.sh
##### Project Execution Overview
# This script contains:
# - Commands to handle scRNAseq (project-specific) data imports
# - Various commands to execute Scikick
# - Using singularity + SLURM (default)
# - Using singularity
#### Data Import
# For every project it is necessary to determine how
# to get the data. For this project, data comes from
# R packages with built-in caching mechanisms
### Copying Bioconductor data cache into container
# Currently it is necessary to reroute a directory
# to /home/cache since the docker container does not
# have read/write permissions and singularity does
# not have elevated permissions. Directories are passed to
# singularity with the -B/--bind argument.
## Best is to use the system cache dir to avoid re-downloads
# ehub=$(Rscript -e "cat(Sys.getenv('EXPERIMENT_HUB_CACHE'))")
# ahub=$(Rscript -e "cat(Sys.getenv('ANNOTATION_HUB_CACHE'))")
## For ease, creating dedicated directories for data downloads
mkdir -p input/cache/ExperimentHub
mkdir -p input/cache/AnnotationHub
### Analysis Execution Commands
## Run all with Singularity and on a SLURM cluster
# Note that the singularity pull can require tmp space which can be set prior to the run with:
# TMPDIR=/tmpdir/with/space
# The snakemake profile "slurm" must be configured prior to executing this command
sk run -v -s --use-singularity \
--singularity-args "'-B input/cache/ExperimentHub:/home/cache/ExperimentHub -B input/cache/AnnotationHub:/home/cache/AnnotationHub'" \
--profile slurm
## Singularity usage
# sk run -v -s --use-singularity \
# --singularity-args "'-B input/cache/ExperimentHub:/home/cache/ExperimentHub -B input/cache/AnnotationHub:/home/cache/AnnotationHub'"
Linking this execution with a continuous integration service implements a version of Continuous Analysis, where the archived reports may be referred to at any time in the future with the knowledge that they are reproducible. A template for such a configuration using GitLab CI is provided below:
cat .gitlab-ci.yml
# Stage 1 - Run the analysis and store results on GitLab
analysis:
stage: build
# This task is executed on each push of new code to the GitLab server
script:
- conda activate scikick # activate a preconfigured environment with scikick and dependencies installed
- bash run.sh # script calls scikick to run the analysis with singularity on a SLURM cluster
cache:
paths:
- .snakemake/singularity # Avoid pulling containers from DockerHub on each build
artifacts:
paths:
- report/ # keep the Scikick analysis archive files
- "*.out" # keep SLURM cluster log files
when: always # keep these artifacts on failure to investigate cluster error logs
# Stage 2 - On successful build, send the report to a server for sharing
deploy:
stage: deploy
script:
- mkdir -p .ci-deploy-dir/$CI_PROJECT_PATH/$CI_COMMIT_REF_NAME/ # prepare a directory based on GitLab project path
- rsync -av report/out_html/* .ci-deploy-dir/$CI_PROJECT_PATH/$CI_COMMIT_REF_NAME/ # put the site in the directory
- rsync -Pav .ci-deploy-dir/ admin@myserver.com:/home/admin/webroot/html/ # send to a dedicated gitlab-ci dir on webserver
The analysis codebase will now contain Scikick reports that are fully verified for reproducibility.