Scikick

Notebook-Centric Workflow Automation

Matthew Carlucci1,2*, Tadas Bareikis1,2, Karolis Koncevičius2, Povilas Gibas1,2, Algimantas Kriščiūnas2, Art Petronis1,2 and Gabriel Oh1,2,3

1The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health
2Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
3Stanford University School of Medicine, Stanford, California, USA

*matthew.carlucci@camh.ca

21 July 2023

Abstract

Collections of data analysis notebooks often lose coherence due to the complex and branching nature of investigations. Scikick is a command-line utility that manages the ensemble of computational notebooks developed throughout a project, providing simple commands for workflow configuration, report generation, and state management.

Preface: Notebook-Centric Workflows

A thorough data analysis typically involves multiple computational notebooks (e.g., R Markdown documents, Jupyter notebooks, or plain scripts). Consider this two-stage data analysis, in which QC.Rmd provides a cleaned dataset for model.Rmd to perform modeling:

|-- input/raw_data.csv
|-- code
|   |-- QC.Rmd
|   |-- model.Rmd
|-- output/QC/QC_data.csv
|-- report/out_html
|   |-- QC.html
|   |-- model.html

Each of these notebooks may be internally complex, but the essence of this workflow is:

QC.Rmd must run before model.Rmd

This simple definition can be applied to:

Re-execute the notebook collection in the correct order.
Avoid unnecessary execution of QC.Rmd when only model.Rmd changes.
Build a shareable report from the rendered notebooks.
Collect relevant execution logs.
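
As a rough illustration of how far this single ordering rule goes, the sketch below encodes "QC.Rmd must run before model.Rmd" as a dependency graph and derives both the execution order and the set of notebooks to re-execute after a change. The dictionary layout and the function names are assumptions made for this example; they are not Scikick's configuration format or API. It relies on graphlib, which is in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each notebook maps to the set of notebooks that must run before it.
# This layout is an illustrative assumption, not Scikick's config format.
workflow = {
    "QC.Rmd": set(),
    "model.Rmd": {"QC.Rmd"},
}


def execution_order(deps):
    """Return notebooks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def needs_rerun(deps, changed):
    """Return the changed notebooks plus everything downstream, in run order."""
    order = execution_order(deps)
    stale = set(changed)
    for notebook in order:
        # A notebook becomes stale as soon as any of its upstream notebooks is.
        if deps[notebook] & stale:
            stale.add(notebook)
    return [notebook for notebook in order if notebook in stale]


print(execution_order(workflow))             # ['QC.Rmd', 'model.Rmd']
print(needs_rerun(workflow, {"model.Rmd"}))  # ['model.Rmd']
print(needs_rerun(workflow, {"QC.Rmd"}))     # ['QC.Rmd', 'model.Rmd']
```

Editing only model.Rmd leaves QC.Rmd untouched, while editing QC.Rmd marks both notebooks for re-execution.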

These features are key to using notebooks for complex analyses; however, accomplishing them currently requires too much configuration. To stay focused on an investigation, analysts need tools that streamline the organization of notebook collections.

Scikick

Scikick is a command-line tool for connecting and executing related data analyses. A few simple commands generate cohesive investigative reports and help ensure future reproducibility.

Figure 1 (reference to be added upon availability).

Common useful features for ad hoc data analysis are managed through Scikick:

Preset methods for executing a variety of notebook formats to markdown output
Awareness of up-to-date results
Explicit definitions of notebook execution order
Website generation with automated navigation based on workflow definition
Collection of page metadata (session info, page runtime, git history)
Definition of notebook import dependencies (e.g., importing functions from a source file)

These features allow for easy development of transparent data analysis repositories.
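
One common way to obtain "awareness of up-to-date results" is to compare file modification times, as make-style build tools do. The sketch below shows that idea only. This section does not describe the mechanism Scikick actually uses, and the function names and file names here are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def is_up_to_date(output, inputs):
    """True if output exists and is at least as new as every input file."""
    output = Path(output)
    if not output.exists():
        return False
    out_mtime = output.stat().st_mtime
    return all(Path(p).stat().st_mtime <= out_mtime for p in inputs)


def demo():
    """Walk through three states of a notebook and its rendered page."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        notebook = Path(tmp, "model.Rmd")
        page = Path(tmp, "model.html")
        notebook.write_text("# model")
        results.append(is_up_to_date(page, [notebook]))  # page not built yet

        page.write_text("<html></html>")
        os.utime(notebook, (1_000, 1_000))  # pretend the notebook is old
        os.utime(page, (2_000, 2_000))      # and the page was built later
        results.append(is_up_to_date(page, [notebook]))

        os.utime(notebook, (3_000, 3_000))  # notebook edited after the build
        results.append(is_up_to_date(page, [notebook]))
    return results


print(demo())  # [False, True, False]
```

A page counts as current only when it exists and no input has been modified since it was built; any other state triggers re-execution.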

Commands for configuring the workflow are inspired by git: sk init, sk add, sk status, sk del, sk mv.

Scikick currently contains methods for executing .R, .Rmd, .ipynb (experimental), and .md (simple copy) to .md output pages. .md files are then compiled into a website (currently with rmarkdown::render_site).
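
The format handling described above amounts to a lookup keyed on file extension. The following is a minimal sketch of such a lookup; the handler labels and the function name are hypothetical and do not correspond to Scikick's internals.

```python
from pathlib import Path

# Formats named in the text, with the handling described there.
# The labels are illustrative; they are not Scikick function names.
HANDLERS = {
    ".R": "execute",
    ".Rmd": "execute",
    ".ipynb": "execute (experimental)",
    ".md": "copy",
}


def plan_page(source):
    """Return (handling, markdown page name) for a source file."""
    path = Path(source)
    if path.suffix not in HANDLERS:
        raise ValueError(f"unsupported notebook format: {path.suffix!r}")
    return HANDLERS[path.suffix], path.with_suffix(".md").name


print(plan_page("code/QC.Rmd"))  # ('execute', 'QC.md')
print(plan_page("notes.md"))     # ('copy', 'notes.md')
```

Raising an error for unknown extensions keeps unsupported files from silently dropping out of the report.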

See the output website for an implementation of a single cell transcriptomic analysis.

Installation

Scikick is currently tested and working on Unix systems (e.g., macOS and Linux) with the following versions of software installed:

Requirements	Recommended
python3 (>=3.6)	git >= 2.0
R + packages `install.packages(c("rmarkdown", "knitr", "yaml","git2r"))`	singularity >= 2.4
pandoc > 2.0	conda
	GraphViz (for project maps)
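
Before installing, it can help to confirm that the tools in the table are on the PATH. The script below is a minimal sketch of such a check and is not part of Scikick. The executable names are assumptions about a typical installation (for example, GraphViz is detected through its dot binary), and it checks neither tool versions nor the listed R packages.

```python
import shutil
import sys

# Executable names are assumptions about a typical install
# (for example, GraphViz is detected through its `dot` binary).
REQUIRED = ["R", "pandoc"]
RECOMMENDED = ["git", "singularity", "conda", "dot"]


def check_environment():
    """Map each requirement to True/False depending on whether it was found."""
    found = {"python>=3.6": sys.version_info >= (3, 6)}
    for exe in REQUIRED + RECOMMENDED:
        found[exe] = shutil.which(exe) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_environment().items():
        kind = "recommended" if name in RECOMMENDED else "required"
        print(f"{name:<12} {kind:<12} {'found' if ok else 'MISSING'}")
```

A missing recommended tool does not block installation; only the entries marked as required correspond to the first column of the table.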

With the requirements above installed, Scikick can be installed using pip:

pip install scikick

The entire installation process should take no more than 5-10 minutes excluding software download times.

Getting Started

Follow the short “hello world” usage of Scikick.
Execute the demo project in a terminal with sk init --demo. This will walk through a short demonstration which executes basic Scikick commands and generates a project for inspection.
Read a longer realistic usage of Scikick for single cell transcriptomics.
Read about the core design of Scikick.
