Abstract

Collections of data analysis notebooks often lose coherence due to the complex and branching nature of investigations. Scikick is a command-line utility that manages the ensemble of computational notebooks developed throughout a project. It provides simple commands for workflow configuration, report generation, and state management.

Preface: Notebook-Centric Workflows

A thorough data analysis will involve multiple computational notebooks (e.g., in Rmarkdown, Jupyter, or plain scripts). Consider this two-stage data analysis, where QC.Rmd provides a cleaned dataset that model.Rmd uses for modeling:

|-- input/raw_data.csv
|-- code
|   |-- QC.Rmd
|   |-- model.Rmd
|-- output/QC/QC_data.csv
|-- report/out_html
|   |-- QC.html
|   |-- model.html

Each of these notebooks may be internally complex, but the essence of this workflow is:

QC.Rmd must run before model.Rmd

This simple definition can be applied to:

  • Re-execute the notebook collection in the correct order.
  • Avoid unnecessary execution of QC.Rmd when only model.Rmd changes.
  • Build a shareable report from the rendered notebooks.
  • Collect relevant execution logs.
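
The first two points can be illustrated with a minimal Python sketch. This is not Scikick's implementation: the `deps` mapping and the functions `execution_order` and `needs_rerun` are hypothetical names chosen for illustration, modification times are passed in as plain numbers, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: each notebook maps to the notebooks
# that must run before it.
deps = {
    "code/QC.Rmd": [],
    "code/model.Rmd": ["code/QC.Rmd"],
}


def execution_order(deps):
    """Return the notebooks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def needs_rerun(deps, source_mtime, output_mtime):
    """Return the notebooks that must be re-executed, in execution order.

    A notebook is stale when its output is missing, when its output is
    older than its source, or when any notebook it depends on is stale.
    """
    order = execution_order(deps)
    stale = set()
    for notebook in order:
        out = output_mtime.get(notebook)
        outdated = out is None or out < source_mtime[notebook]
        upstream_stale = any(dep in stale for dep in deps[notebook])
        if outdated or upstream_stale:
            stale.add(notebook)
    return [notebook for notebook in order if notebook in stale]


if __name__ == "__main__":
    # Only model.Rmd was edited after its last output was produced.
    sources = {"code/QC.Rmd": 1, "code/model.Rmd": 5}
    outputs = {"code/QC.Rmd": 2, "code/model.Rmd": 3}
    print(execution_order(deps))
    print(needs_rerun(deps, sources, outputs))
```

In this sketch, editing only model.Rmd leaves QC.Rmd untouched, while editing QC.Rmd marks both notebooks for re-execution because model.Rmd depends on it.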

These features are key to the use of notebooks for complex analyses. However, too much configuration is currently required to accomplish these goals. To remain focused on an investigation, tools are needed to streamline the organization of notebook collections.

Scikick

Scikick is a command-line tool for connecting and executing related data analyses. A few simple commands generate cohesive investigative reports and help ensure future reproducibility.

Figure 1 (reference to be added upon availability).

Common useful features for ad hoc data analysis are managed through Scikick:

  • Preset methods for executing a variety of notebook formats to markdown output
  • Awareness of up-to-date results
  • Explicit definitions of notebook execution order
  • Website generation with automated navigation based on workflow definition
  • Collection of page metadata (session info, page runtime, git history)
  • Definition of notebook import dependencies (e.g., importing functions from a source file)

These features allow for easy development of transparent data analysis repositories.

The commands for configuring the workflow are inspired by git: sk init, sk add, sk status, sk del, sk mv.

Scikick currently contains methods for executing .R, .Rmd, .ipynb (experimental), and .md (simple copy) to .md output pages. .md files are then compiled into a website (currently with rmarkdown::render_site).
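
The idea of choosing an execution method by file extension can be sketched as a simple lookup. This is an illustrative sketch only: the `METHODS` table, its labels, and the `method_for` function are hypothetical and do not describe Scikick's internals.

```python
from pathlib import Path

# Illustrative labels only; they name the kind of step described above,
# not the commands Scikick actually runs.
METHODS = {
    ".R": "execute R script",
    ".Rmd": "execute R Markdown notebook",
    ".ipynb": "execute Jupyter notebook (experimental)",
    ".md": "copy as-is",
}


def method_for(path):
    """Return the label of the step that would turn `path` into a .md page."""
    suffix = Path(path).suffix
    if suffix not in METHODS:
        raise ValueError(f"unsupported notebook format: {suffix!r}")
    return METHODS[suffix]


if __name__ == "__main__":
    for notebook in ("code/QC.Rmd", "code/model.Rmd"):
        print(notebook, "->", method_for(notebook))
```

Keeping the mapping in one table means a new format is supported by adding a single entry, and unsupported files fail early with a clear error.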

See the output website for an implementation of a single cell transcriptomic analysis.

Installation

Scikick is currently tested and working on Unix systems (e.g., macOS and Linux) with the following versions of software installed:

Requirements                                                             | Recommended
python3 (>=3.6)                                                          | git >= 2.0
R + packages: install.packages(c("rmarkdown", "knitr", "yaml", "git2r")) | singularity >= 2.4
pandoc > 2.0                                                             | conda
GraphViz (for project maps)                                              |

With the requirements above installed, Scikick can be installed using pip:

pip install scikick

The entire installation process should take 5-10 minutes at most, excluding software download time.
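
Before installing, it can help to confirm that the listed tools are reachable from the shell. The following is a minimal sketch using only the Python standard library. The executable names in `TOOLS` (for example `Rscript` for R and `dot` for GraphViz) are assumptions about how each tool is typically exposed on PATH, `find_missing` is a hypothetical helper, and the sketch does not check versions or R packages.

```python
import shutil

# Assumed executable name -> tool it stands for in the list above.
TOOLS = {
    "python3": "python3 (>=3.6)",
    "Rscript": "R",
    "pandoc": "pandoc > 2.0",
    "git": "git >= 2.0",
    "singularity": "singularity >= 2.4",
    "conda": "conda",
    "dot": "GraphViz",
}


def find_missing(names, which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = find_missing(TOOLS)
    if missing:
        for name in missing:
            print(f"not found on PATH: {name} ({TOOLS[name]})")
    else:
        print("all listed tools were found on PATH")
```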

Getting Started

  • Follow the short “hello world” usage of Scikick.

  • Execute the demo project in a terminal with sk init --demo. This will walk through a short demonstration which executes basic Scikick commands and generates a project for inspection.

  • Read a longer realistic usage of Scikick for single cell transcriptomics.

  • Read about the core design of Scikick.
