Abstract
Collections of data analysis notebooks often lose coherence due to the complex and branching nature of investigations. Scikick is a command-line utility for managing ensembles of computational notebooks developed throughout a project. It provides simple commands for workflow configuration, report generation, and state management.
A thorough data analysis will involve multiple computational notebooks (e.g., in Rmarkdown, Jupyter, or plain scripts).
Consider this two-stage data analysis where QC.Rmd provides a cleaned dataset for model.Rmd to perform modeling:

    |-- input/raw_data.csv
    |-- code
    |   |-- QC.Rmd
    |   |-- model.Rmd
    |-- output/QC/QC_data.csv
    |-- report/out_html
    |   |-- QC.html
    |   |-- model.html

Each of these notebooks may be internally complex, but the essence of this workflow is: QC.Rmd must run before model.Rmd.
Among other things, this simple definition can be applied to avoid re-executing QC.Rmd when only model.Rmd changes.

These features are key to the use of notebooks for complex analyses. However, too much configuration is currently required to accomplish these goals. To remain focused on an investigation, tools are needed to streamline the organization of notebook collections.
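The rule that QC.Rmd must run before model.Rmd can be illustrated with a small Python sketch. This is not Scikick's implementation: the `DEPENDS_ON` mapping and the `execution_order` and `needs_rerun` helpers are hypothetical names introduced here. The sketch assumes the dependency graph has no cycles and that a notebook must be re-executed when it, or anything it depends on, has changed.

```python
# Hypothetical dependency map: each notebook lists the notebooks it depends on.
DEPENDS_ON = {
    "code/QC.Rmd": set(),
    "code/model.Rmd": {"code/QC.Rmd"},
}


def execution_order(depends_on):
    """Return notebooks ordered so that dependencies come first.

    Assumes the dependency graph is acyclic (no cycle detection is done).
    """
    order, seen = [], set()

    def visit(notebook):
        if notebook in seen:
            return
        seen.add(notebook)
        # Visit dependencies first so they land earlier in the order.
        for dependency in sorted(depends_on.get(notebook, ())):
            visit(dependency)
        order.append(notebook)

    for notebook in sorted(depends_on):
        visit(notebook)
    return order


def needs_rerun(depends_on, changed):
    """Return, in execution order, the changed notebooks and their dependents."""
    stale = set()
    ordered = execution_order(depends_on)
    for notebook in ordered:
        dependencies = set(depends_on.get(notebook, ()))
        # A notebook is stale if it changed or if any dependency is stale.
        if notebook in changed or dependencies & stale:
            stale.add(notebook)
    return [notebook for notebook in ordered if notebook in stale]


if __name__ == "__main__":
    print(execution_order(DEPENDS_ON))
    # Only model.Rmd changed, so QC.Rmd is not re-executed.
    print(needs_rerun(DEPENDS_ON, changed={"code/model.Rmd"}))
    # QC.Rmd changed, so model.Rmd must be re-executed as well.
    print(needs_rerun(DEPENDS_ON, changed={"code/QC.Rmd"}))
```

Processing the notebooks in dependency order lets staleness propagate downstream in a single pass, which is why a change to QC.Rmd also marks model.Rmd, while a change to model.Rmd alone leaves QC.Rmd untouched.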
Scikick is a command-line tool for connecting and executing related data analyses. A few simple commands generate cohesive investigative reports and help ensure future reproducibility.
Figure 1 (reference to be added upon availability).
Common useful features for ad hoc data analysis are managed through Scikick. These features allow for easy development of transparent data analysis repositories.
Commands for configuring the workflow are inspired by git: sk init, sk add, sk status, sk del, sk mv.
Scikick currently contains methods for executing .R, .Rmd, .ipynb (experimental), and .md (simple copy) to .md output pages. The .md files are then compiled into a website (currently with rmarkdown::render_site).
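The handling of file types described above can be summarized as a lookup table. The following is a minimal sketch, not Scikick's code: `HANDLERS` and `plan_page` are hypothetical names, and the rule that an output page keeps the base name of its source file is an assumption made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical table mirroring the supported file types listed above.
HANDLERS = {
    ".R": "execute",
    ".Rmd": "execute",
    ".ipynb": "execute (experimental)",
    ".md": "copy",
}


def plan_page(source):
    """Return (action, output page name) for a source file.

    Assumption: the .md output page reuses the source file's base name.
    """
    path = PurePosixPath(source)
    try:
        action = HANDLERS[path.suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {path.suffix!r}") from None
    return action, path.with_suffix(".md").name


if __name__ == "__main__":
    for source in ["code/QC.Rmd", "code/model.Rmd", "notes/README.md"]:
        print(source, "->", plan_page(source))
```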
See the output website for an implementation of a single cell transcriptomic analysis.
Scikick is currently tested and working on Unix systems (e.g., macOS and Linux) with the following versions of software installed:
Requirements | Recommended |
---|---|
python3 (>=3.6) | git >= 2.0 |
R + packages install.packages(c("rmarkdown", "knitr", "yaml", "git2r")) | singularity >= 2.4 |
pandoc > 2.0 | conda |
GraphViz (for project maps) | |
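
Whether the tools in the table are available can be checked before installing. The sketch below is one possible way to do so and is not part of Scikick: `missing_tools` and `python_ok` are hypothetical helper names, and the executable names (`R`, `pandoc`, `dot` for GraphViz, `git`, `singularity`, `conda`) are assumed to be the names found on the `PATH`. It only checks that the executables exist; it does not verify their versions or the installed R packages.

```python
import shutil
import sys

# Assumed executable names for the tools in the table above.
REQUIRED = ["R", "pandoc", "dot"]  # "dot" is the GraphViz executable
RECOMMENDED = ["git", "singularity", "conda"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)


if __name__ == "__main__":
    if not python_ok():
        print("Python >= 3.6 is required")
    for label, tools in [("required", REQUIRED), ("recommended", RECOMMENDED)]:
        missing = missing_tools(tools)
        status = ", ".join(missing) if missing else "none"
        print(f"Missing {label} tools: {status}")
```

Passing the lookup function as the `which` parameter keeps the check easy to exercise without the tools installed.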
With the requirements above installed, Scikick can be installed using pip:
pip install scikick
The entire installation process should take no more than 5-10 minutes excluding software download times.
Follow the short “hello world” usage of Scikick.
Execute the demo project in a terminal with sk init --demo. This will walk through a short demonstration which executes basic Scikick commands and generates a project for inspection.
Read a longer realistic usage of Scikick for single cell transcriptomics.
Read about the core design of Scikick.