
Welcome

Welcome to the general notes, explanations and related material for the EpiCompBio group.

The plan is to collectively build an approach that integrates reproducibility into our research.

Here you should be able to find some general ideas, introduction, references and the overall approach.

Go to the documentation for this repository to see the notes.

As usual, feel free to add your own, edit, comment, question, etc.

Notes on reproducibility in biomedical research

Author: Antonio
Date: 22 Dec 2016

General approach

The core idea is about reproducibility, integrating best practice, creating good habits and learning together. Reproducibility is the key principle. This (simply but not easily) requires discipline and a few tools that can be used and/or understood by as many people as possible.

Our research needs to be:

  • Reproducible (same question, same data, same code, same results)
  • Replicable (by other labs using other experiments, data, analysis, technology, etc.)
  • And ideally open source, open access and open data with code that is re-usable (by us and others)

The above requires:

  • Clear questions and hypotheses
  • Accessible, read-only data (data collection and experiments largely already follow these guidelines but the same applies)
  • Transparent, easy to read, well documented code
  • Transparent, easy to read, well documented workflows (e.g. data processing steps and analysis)
  • Software version control
  • Computer environment logging (e.g. what tools, what versions, what OS)
  • Parameter logging

There are many guidelines and thoughts (see below for a few suggestions).

Data analysis (of any size) should aim to have:

  • Clear, documented code
  • Ideally automation and re-usability
  • Unit testing at its most basic: does my new code change my expected results?

We’ve started with the EpiCompBio GitHub account.

There are many tools out there that aim to put the above into practice (see list of pipeline review references for example).

We don’t need to re-invent the wheel, simply to make best use of our collective skills and existing tools/approaches.

See the suggested approach and tools for what we are thinking of and currently implementing/using as well as the repos and tools that are/will appear in the EpiCompBio group.

Some references and tutorials

Other training resources

Software Carpentry’s (SC) git novice

Notes on computational pipelines for biomedical research

Author: Antonio Berlanga
Date: 22 Dec 2016

(Disclaimer (!): I’ve been learning as I go and I still have a long way… Please add, discuss, correct, etc.)

Suggested tools and approach to implement in computational and statistical analysis

Conceptually the problem is simple: chain third party tools and custom scripts together to answer a specific scientific question.

In practice this is a nightmare. It requires project discipline and very good statistical and scripting practices.

For example, in a genetic association project you might need:

  • Genotype QC pipeline
  • Imputation pipeline
  • Association testing pipeline
  • Downstream annotation of significant SNPs

All of these require many programs (third-party tools like plink), some custom scripts (such as plotting and stats in R) and will probably need to run on a high-performance cluster.

Keeping track of results, re-running with different parameters, logging, version control, communicating with colleagues, etc. quickly becomes difficult. A systematic approach is needed from the beginning that will cut the overhead and allow all results and plots to be traced back to the commands, parameters, software versions, and any steps used to process the original data, in an easy, transparent and reproducible way.

I have failed to do this and over time the costs can be high.

There are many tools out there that aim to put reproducibility in computational biology (or general data analysis) into practice (see list of pipeline review references for example).

Reproducibility requires data + code

We don’t need to re-invent the wheel, simply to make best use of our collective skills and existing tools/approaches.

Sound principles come first, languages and tools are secondary (to an extent, tools over time shape our thinking so good initial choices are important).

However a general, common framework and way of working is necessary. After a lot of initial personal and group pain we should hopefully see gains.

Using Python and UNIX philosophy as the building bases

Python:

Python is a popular, well-supported, general programming language which is flexible, powerful and readable. A great choice overall for beginners. It can serve as the glue for pipelines even if many scripts and programs are in other languages. Ruby, Perl and others are largely equivalent. There are dozens of online sources for learning and a very active community.

Ultimately a combination of Unix (or an equivalent compute environment), statistics and programming is needed; different people use different combinations.

Using Python, R and Unix is pretty powerful and a well-trodden path.

The main/basic idea is to be able to structure scripts into packages and re-use them (or at least freeze and present them at publication).
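
For instance, a minimal Python package layout (all names here are placeholders) often looks something like:

my_project/
    my_project/
        __init__.py
        run_analysis.py      # a stand-alone, CLI-callable script
        module_utils.py      # functions shared between scripts/pipelines
    tests/
        test_run_analysis.py
    docs/
    setup.py
    README.rst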

There’s a lot out there on software structure, see for example:

Structuring Your Project — The Hitchhiker’s Guide to Python
Software Carpentry - Intermediate and Advanced

UNIX is at the heart of most of the common and powerful operating systems available. See:

Philosophy

A classic book on UNIX:

The Unix Programming Environment, Brian W. Kernighan and Rob Pike (Prentice-Hall Software Series, 1983)

A general update on the above:

The Art of Unix Programming, Eric S. Raymond (Addison-Wesley Professional Computing, 2003)

Actual tools and practice

In general, pipelines should ideally be:

  • Well documented
  • Configurable for available compute resources
  • Not hard-coded: job parameters should be configurable, as they will be arbitrary and project specific
  • Run from the command line
  • Report extensive logging for debugging and versioning
  • Easy to build on
  • Runnable locally or on a cluster
  • Able to handle single and multi-jobs
  • Portable across computational environments

A big problem across the field is portability, currently without good answers, but pipelines can go some way towards this.

The general approach I’m suggesting is the one used at CGAT, which in turn adopts many current computational best-practice standards. See:

The CGAT Code Collection includes cgat scripts for genomics and CGAT Pipelines, a framework and set of ruffus based pipelines to run workflows in computational biology.

CGAT scripts and pipelines use popular, open source, mostly free, proven tools with excellent community support such as Python, R, Github, Travis CI, plus the myriad of genomics and biology software options for specific tasks.

A lot of this work is in beta, as are most pipeline approaches, of which there are many. Galaxy is a well-known one and could be an answer; it is designed to be easy for biologists to use and works well for that, but version control, scalability and other issues exist. See Galaxy and the Biostars community for example.

CGAT is based on Ruffus, a python pipeline tool which is flexible, powerful and readable (being python).

CGAT Pipelines can help manage computer resources, clusters, logging, execution, versioning and, more importantly, to work under a common framework (think languages, style, choice of tools, etc.).

CGAT Pipelines have their own backbone (for controlling jobs, communicating with the cluster, logging, software/package structuring, etc.). I’m still on the learning curve but think this is one of the best approaches because of its flexibility and power (once you get to grips with it). See the backbone scripts.

A pipeline example can be:
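
For illustration only, a minimal Ruffus sketch could look like the following (the file names and tasks are invented; a real CGAT pipeline adds configuration, logging and cluster submission on top):

"""pipeline_minimal.py: a toy ruffus pipeline (illustrative only)."""
import gzip
import shutil

from ruffus import transform, suffix, pipeline_run

# Task 1: compress every .txt file in the working directory
@transform("*.txt", suffix(".txt"), ".txt.gz")
def compress(infile, outfile):
    with open(infile, "rb") as fin, gzip.open(outfile, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Task 2: count the lines of each compressed file
@transform(compress, suffix(".txt.gz"), ".counts")
def count_lines(infile, outfile):
    with gzip.open(infile, "rt") as fin, open(outfile, "w") as fout:
        fout.write("%i\n" % sum(1 for _ in fin))

if __name__ == "__main__":
    # run every task needed to produce the final targets
    pipeline_run([count_lines])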

Limitations of CGAT (but common to these types of tools) are:

  • Pipelines have many dependencies
  • Setting up the initial environment is often very problematic
  • Keeping track of packages and managing them is a big overhead
  • There’s a steep learning curve in general and to each pipeline/approach
  • The “system” (e.g. funders and current science practice) rewards results rather than repeatability, so there is little time for it and little interest

An excellent complement/alternative is Jupyter and its notebook (aka IPython), particularly for interactive work:

On a side note, for managing packages see Conda, a great way to reduce time spent on this.

Structuring code

A general, proven approach to follow is one based on basic python organisation:

  • Scripts - Write stand-alone scripts which are callable from the CLI and can take arbitrary parameters
  • Modules - Include functions and code which could be used by more than one script/pipeline, bundled by overall aim/use
  • Pipeline - a python script (e.g. using ruffus) which chains multiple tasks (the functions or steps needed to answer the project’s question) across jobs (the input data, e.g. 10 fastq files which will all be treated in the same way) and which can be submitted to the cluster (e.g. managed by drmaa, which then communicates with SGE or PBSPro).

To this, we can ideally add:

  • Unit tests - aiming to test each script, parameter and function with small example data. Aimed at stability only, i.e. do new code changes mess up the expected results? (See the sketch after this list.)
  • A good option is to run them via Travis CI or Jenkins CI, integrated with GitHub (tests are automatically triggered after each commit and need configuration (e.g. yaml), data and expected results).
  • Report - aiming to write a basic automated report that picks up some basic stats, tables and plots from the pipeline results and puts them in one document (using e.g. sphinx, markdown, or similar tool).
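
As a rough illustration of such a test, assuming pytest and a hypothetical module my_tools with a count_lines function, a test file could look like:

# test_my_tools.py: minimal pytest sketch; my_tools.count_lines is a
# hypothetical function that counts the lines in a file
import my_tools

def test_count_lines(tmp_path):
    # small, known input data
    infile = tmp_path / "example.txt"
    infile.write_text("a\nb\nc\n")
    # the expected result should not change when the code changes
    assert my_tools.count_lines(str(infile)) == 3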

Tools to use

All of the above can be achieved with:

  • Version control such as Github
  • Continuous integration for unit testing, such as Travis CI (runs with GitHub)
  • Choice of programming and statistical languages (e.g. Python, Perl, R, Matlab, etc)
  • Computation pipeline tool such as Ruffus
  • Sufficient computing resources: your laptop, a unix cluster, etc. depending on tasks and data
  • A general framework which is extendable, allows us to keep relatively sane, and enhances the above (CGAT Pipelines, Galaxy, etc.).

Other languages

In terms of packaging and structuring of projects and programs other languages do their own thing.

For examples of R and its repository take a look at:

Packaging in Python

Python packaging has a messy history and isn’t completely straightforward.

See these webpages first and follow their guidelines:

python - Differences between distribute, distutils, setuptools and distutils2? - Stack Overflow

Python Packaging User Guide — Python Packaging User Guide documentation

pypa/sampleproject: A sample project that exists for PyPUG’s “Tutorial on Packaging and Distributing Projects”


You can also see other package examples (e.g. cryptography) and the Python package sample.

You can also use the structure provided by project_quickstart.py as a starting point.


Once you have the project file and directory structure, get the packages you need, e.g.:

pip install -U pip twine check-manifest setuptools

Create/edit MANIFEST.in and setup.py files as necessary (and possibly an INI file depending on how you’ve set things up).
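
As a rough sketch (the package name, version, dependencies and entry point below are placeholders), a minimal setup.py could look like:

# setup.py: minimal sketch; adjust name, version, dependencies and entry point
from setuptools import setup, find_packages

setup(
    name="package_xxx",
    version="0.1.0",
    description="One-line description of the package",
    packages=find_packages(),
    install_requires=["docopt"],  # runtime dependencies
    entry_points={
        # exposes a 'package_xxx' command on the command line
        "console_scripts": ["package_xxx = package_xxx.cli:main"],
    },
)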

Use check-manifest to detect errors in setup.py:

check-manifest

Run the following to test and create the distribution:

python setup.py check

python setup.py sdist bdist_wheel

You can create an environment and test in a separate directory (using conda for example):

cd .. ; mkdir test_package ; cd test_package

conda create -n test_env python=3.6

Install any dependencies needed with conda or pip

source activate test_env

git clone https://github.com/xxx/package_xxx.git

Install and test package:

cd package_xxx

python setup.py install

Then test the main script elsewhere (you can create an entry point in setup.py), run a test example, etc.

If this is an actual package to share with others, upload it to PyPI:

See the instructions page for all the details.

  • Register yourself at PyPI and create a ~/.pypirc file.
  • Register your package manually or with twine (see instructions above)
  • Upload your package (requires twine for this example):
twine upload dist/*

The package should now be ready to install anywhere with:

pip install package_xxx

Further references:

This blog has an explanation of how to carry out all of this with examples of MANIFEST.in and setup.py files.

It also has further information on how to use PyPI’s test server.

Things changed a fair amount from Python 2.x to 3.x, so check the most recent information (see the links above).

https://the-hitchhikers-guide-to-packaging.readthedocs.io/en/latest/introduction.html

https://wiki.python.org/moin/Distutils/Tutorial

http://www.diveintopython3.net/packaging.html

https://blog.niteoweb.com/setuptools-run-custom-code-in-setup-py

http://stackoverflow.com/questions/774824/explain-python-entry-points

http://stackoverflow.com/questions/13307408/python-packaging-data-files-are-put-properly-in-tar-gz-file-but-are-not-install?rq=1

Notes on how to create technical documents

restructuredText (reST, rST)

  • reStructuredText may be the best option for accessible, flexible, readable, plain-text, Python-based technical documentation. reST is the source format; Sphinx is a builder tool that transforms reStructuredText into different target output formats.
  • It can be integrated with Sphinx for converting to html, pdf, etc. and for supporting multiple languages (not just Python).
  • reStructuredText is similar to Markdown. If webpages are the main objective maybe use Markdown though.

For technical documents maybe reStructuredText is better.

  • reStructuredText seems much easier than html and LaTeX and travels much better as it is plain text that can be used as the source for later conversion.
  • With python+docutils+sphinx it can be formatted to html, with pdflatex to pdf for example.
  • reStructuredText can be transformed to Markdown with Pandoc and then rendered as html.
  • reStructuredText supports math.
  • A good option is CGATReport, based on Python, Sphinx and rST. It integrates plotting code directly into its workflow though (good?). The central data structure is a pandas dataframe. It has a table renderer.
  • Also see Jupyter ecosystem and its notebook (previously IPython Notebook):
    http://jupyter.org/
    • This works well, looks great, has huge flexibility, is becoming language independent, great when run interactively.
    • Requires windowing though for the notebook which can be very cumbersome when working remotely.
    • Some problems when turning from notebook to script for automated analysis?

Including citations

reST directives to include external files

To include images

Images can be pulled in with e.g.:
`.. image:: /path/to/image.jpg`

Making slides with rst

After running Hovercraft, go to your default browser and open the local port (e.g. http://0.0.0.0:8000) to see the presentation, or create a directory and use -o to save the output.

More on Hovercraft with additional tools.

Other options are:
  • Create rst document
  • Convert from rst to e.g. beamer (latex slides) with pandoc:
pandoc myfile.rst -s -f rst --to beamer -o myfile.pdf
pandoc -V theme:Warsaw -N -S --toc --normalize -f rst -t beamer -o myfile.pdf myfile.rst

See this page for more options and explanations.

Other options: see this tutorial and GitHub code with rst and Beamer.

Problems with reST

  • Tracking changes is a problem though (between collaborators not using git, i.e. collaborator’s comments in a Word review form):
    http://criticmarkup.com/
  • Rendering external tables easily with rST? See CGATReport and R library xtable

These aren’t specific to rST though.

Jupyter notebooks

For exploratory analysis these might be a great solution. They are very flexible, can mix languages, and keep plots, code and text together. See an example of an RNA-seq publication here and a blog with some tips and info. A notebook server is needed to run them properly.

R markdown and its notebook

R markdown v2 is another excellent option in this regard. See also R Markdown to Word. If you’re running analysis locally, R notebooks and Jupyter are probably far better than rst and Sphinx for reports. See these blogs (a, b, c, d, e) comparing R and Jupyter notebooks for instance and other tutorials.

You can also run Rmd files with command line parameters like (f, g, h). This is the main tutorial.

Check the reference guide and article templates for Rmarkdown.

VIM or Emacs?

See org-mode in vim for example (org-mode comes originally from emacs). Since working with code blocks in vim’s org-mode doesn’t seem possible, for vim users maybe emacs + evil + org-mode is the better option.

TO DO

Additional references and blogs

First Steps with Sphinx — Sphinx 1.5.1 documentation
reStructuredText Primer
reStructuredText vs Markdown for documentation
Why Sphinx and reStructuredText ? — Varnish version 2.1.5 documentation
Managing bibliographic citations in Sphinx — Wiser 0.1 documentation
reStructuredText - Wikipedia
How To Write Papers with Restructured Text
Standard format conversions between reST and LaTeX:
There is some support for reST to Word:

restructuredText tutorials/info


rst can be used on its own and then converted to html, pdf etc with different tools.

Sphinx adds many useful tools and is based on rst; one of the main ones is connecting many files into a single hierarchy of documents. Sphinx also makes it easy to document your Python-based project.

Generate the documentation with sphinx:

pip install sphinx

A sphinx template report can be generated with:

mkdir my_report_docs

cd my_report_docs

sphinx-quickstart

After adding content, you can generate html, pdf, etc. with:

make html

The rendered file will be found in the _build directory.


Include other rst files:

.. toctree::
    :maxdepth: 2
    :numbered:
    :titlesonly:
    :glob:
    :hidden:

    intro.rst
    chapter1.rst
    chapter2.rst

See the toctree directive for full info.


It is also possible to include the literal contents of a file with:

.. literalinclude:: filename
    :linenos:
    :language: python
    :lines: 1, 3-5
    :start-after: 3
    :end-before: 5

Include an image:

.. image:: images/ball1.gif

Or:

.. image:: images/xxx.png
   :height: 100
   :width: 200
   :scale: 50
   :alt: alternate text

See image directive full markup.

Or import a figure which can have a caption and whatever else you add:

.. figure:: xxx.jpg
    :width: 200px
    :align: center
    :height: 100px
    :alt: alternate text
    :figclass: align-center

    A caption would be written here as plain text. You can add more below it, e.g.:

    .. code-block:: python

        import image

Include a simple csv table:

.. csv-table:: a title
   :header: "name", "firstname", "age"
   :widths: 20, 20, 10

   "Smith", "John", 40
   "Smith", "John, Junior", 20

See csv-table directive for example.


For useful extensions to rst and sphinx see this tutorial on extensions:

In a sphinx conf.py file you can specify the extensions needed, e.g.:

extensions = [
    'easydev.copybutton',
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.coverage',
    'sphinx.ext.graphviz',
    'sphinx.ext.doctest',
    'sphinx.ext.intersphinx',
    'sphinx.ext.todo',
    'sphinx.ext.pngmath',
    'sphinx.ext.ifconfig',
    'matplotlib.sphinxext.only_directives',
    'matplotlib.sphinxext.plot_directive',
]

The math directive, e.g.:

.. math::

    n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k

With a math extension enabled in conf.py (e.g. sphinx.ext.pngmath, listed above), this would produce:

\[n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k\]

References, e.g. [CIT2002] are defined at the bottom of the page as:

.. [CIT2002] A citation

and called with:

[CIT2002]_

Generate python package documentation with Sphinx and render rst docs

  • Install Sphinx (which provides sphinx-quickstart, sphinx-apidoc and sphinx-build)

  • In the python package docs directory that you would’ve created:

    • Run ‘sphinx-quickstart’ to set up configuration values and generate template rst docs. See an example and First Steps with Sphinx.
    • Manually edit index.rst and other files to use as content
    • sphinx-build -b html sourcedir builddir ; or build a PDF (needs the LaTeX builder installed)
    • sphinx-apidoc -o . .. to generate module and function documents from docstrings within scripts
    • make clean, make html or make latexpdf to clean and generate further builds (make is equivalent to sphinx-build because quickstart creates Makefile and make.bat files)
    • sphinx-autobuild . _build_html rebuilds automatically on changes (useful for large docs that are faster to update than to rebuild from scratch)
    • Pull/push etc to GitHub account
    • Create an account in ReadTheDocs (RTD) and connect to GitHub repository
    • Some configuration is needed (TO DO) at RTD
    • RTD rebuilds after every commit using ‘sphinx-build -b html . _build/html’
    • If the build on RTD doesn’t work try ‘Wipe’ in /projects/[project]/versions/
    • You can add the ‘_build’ dir to .gitignore

If using project_quickstart, which copies templates generated by a basic sphinx-quickstart run, you can do:

These commands need to be run where the conf.py and other Sphinx templates are, usually:

  • project_XXXX/code/docs for the code documentation
  • project_XXXX/documents_and_manuscript/ for the manuscript preparation

pandoc common commands

See pandoc demos

Some examples for rst to PDF:
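
For example, a basic conversion (PDF output needs a LaTeX engine installed):

pandoc myfile.rst -o myfile.pdf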

Use option --bibliography=FILE for rendering citations.

Word docx:
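
For example, a basic conversion:

pandoc myfile.rst -o myfile.docx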


Links with more information (some are old):

http://docs.readthedocs.io/en/latest/builds.html

https://daler.github.io/sphinxdoc-test/includeme.html

http://lucasbardella.com/blog/2010/02/hosting-your-sphinx-docs-in-github

http://cgat.readthedocs.io/en/latest/PipelineReports.html

Annoying quirks?!

For titles to appear in the ReadTheDocs table of contents they need to be as:

####
Why?
####

If using ‘=’ instead they don’t seem to build…

Substitutions in restructuredText files

Variables are specified as “|xxxx|”.

An example would be:

The main file, e.g.:

.. Include templates from external file (this is a comment).
.. include:: substitution_vars.rst

I can include text, like |my custom text|, from other files.

I can include more, such as the following: |more text|

And |here|.

Then specify the substitutions in a separate rst file, such as “substitution_vars.rst”.

.. Fill in the variables in the external rst file:

.. |my custom text| replace:: "example text 1"

.. |more text| replace:: several other things as well

.. |here| replace:: bla bla bla

Because the substitution process uses the “.. include:: ” directive, everything in the file will get included. Thus use hidden comments if needed.

Use rst2pdf to render, which will then substitute the variables with the appropriate changes. pandoc doesn’t seem to read the include directive properly. rst2pdf runs in Python 2.7 only.

Run:

rst2pdf file_with_include.rst -o file_with_include.pdf

See the SO question and technical documentation for more information.

Creating figure layouts for publication programmatically

It seems surprisingly hard to find tools for this. The current workflow I’d suggest (based on Python, essentially out of the svgutils blog):

  1. Plot with any tool, save as svg
  2. Import the figures you want into svgutils
  3. Create legends, titles, layout, etc.
  4. Save multi-panel figure as svg
  5. Convert svg to pdf (with inkscape on the command line for example)
  6. Insert images into rst with image and other directives to create a file with text, figures, tables, etc. and which can later be converted to pdf or html.

See below for details, references and basic examples. I haven’t tested many of these but left them here as suggestions.

Python package svgutils, probably the one to use, starts and ends with svg files.

See this explanation for more details.

from svgutils.compose import *

Figure("16cm", "6.5cm",
        Panel(
              SVG("sigmoid_fit.svg"),
              Text("A", 25, 20, size=12, weight='bold')
             ),
        Panel(
              SVG("anscombe.svg").scale(0.5),
              Text("B", 25, 20, size=12, weight='bold')
             ).move(280, 0)
        ).save("fig_final_compose.svg")

A similar package is svg_stack: concatenate SVG files

A useful svg utility package might be scour.

Inkscape

e.g.

inkscape --file=fig_final.svg --export-area-drawing --without-gui --export-pdf=output.pdf

Full suite, equivalent to Adobe Illustrator but open source and free.

To install on a Mac you can use:

brew install caskformula/caskformula/inkscape

Another package for image file conversions is CairoSVG.

cairosvg -o fig_final.pdf fig_final.svg

restructuredText and SVG

See the documentation on reStructuredText and svg. The mix of pdf, svg, rst, html, etc. can become nightmarish.

rst doesn’t seem to have a specific figure layout tool but there are some workarounds.

See the image rst directive details for more information and examples.

There is also the figure directive.


LaTeX does not support svg directly, so svg files first need converting to pdf or eps. Inkscape can be used for this.

If you’re using LaTeX see this document for further help.

inkscape -D -z --file=image.svg --export-pdf=image.pdf --export-latex

There is a Sphinx svg image directive that you can try:

Tables are a different matter altogether. You can wrap figures in a table within rst.

grImport does something similar and can manipulate figures/images starting from PostScript:

The imager package (also here) can import vector graphics, but is meant for image manipulation rather than creating layouts.

Create and format PowerPoint documents from R software - Easy Guides - Wiki - STHDA

Microsoft Word and PowerPoint Documents Generation ReporteRs package

CRAN - Package cowplot

Arranging plots in a grid


OpenCV

PIL Pillow Fork

Both are for general image processing rather than figure layout.

How to Create Publication-Quality Figures

Overview Inkscape

Ten Simple Rules for Better Figures

imagemagick - How can I convert a PNG to a PDF in high quality so it’s not blurry or fuzzy? - Unix & Linux Stack Exchange

Combine several images horizontally with Python - Stack Overflow

Python Image Library: How to combine 4 images into a 2 x 2 grid? - Stack Overflow

Notes on continuous integration

Did the code I just pushed change all my results? Did I introduce a bug?

Basic steps with Travis for Python:
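
Very roughly: enable the repository on the Travis CI site (it authenticates with your GitHub account), add a .travis.yml to the repository listing the Python versions to test, the install commands (e.g. pip installing the package and its requirements) and the test command (e.g. pytest), then push; each subsequent commit triggers a build.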

For a general example of a simple Travis setup see:

https://github.com/EpiCompBio/genotype_tools

Pages to check and further info:

https://www.hackzine.org/even-better-testing-with-pytest-tox-travis.html

If too heavy for Travis CI, see Jenkins, the CGAT Jenkins setup and some internal instructions for it:

References and reviews of computational pipelines

  • ‘Big data’, Hadoop and cloud computing in genomics. - PubMed - NCBI

https://www.ncbi.nlm.nih.gov/pubmed/23872175

  • Using Docker Bioconductor containers

http://databio.org/docker_bioconductor/#/title

  • Ruffus — ruffus 2.6.3 documentation

http://www.ruffus.org.uk/index.html

  • Best practices — pypiper 0.3 documentation

http://pypiper.readthedocs.io/en/latest/best-practices.html

  • A review of bioinformatic pipeline frameworks

http://bib.oxfordjournals.org/content/early/2016/03/23/bib.bbw020.full

  • ssadedin/bpipe: Bpipe - a tool for running and managing bioinformatics pipelines

https://github.com/ssadedin/bpipe

  • Snakemake—a scalable bioinformatics workflow engine

http://bioinformatics.oxfordjournals.org/content/28/19/2520.short

  • Omics Pipe: a community-based framework for reproducible multi-omics data analysis

https://bioinformatics.oxfordjournals.org/content/31/11/1724.full

  • Scientific workflow systems - can one size fit all? - IEEE Xplore Document

http://ieeexplore.ieee.org/document/4786077/?reload=true&arnumber=4786077

  • Experiences with workflows for automating data-intensive bioinformatics

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539931/

  • NLeSC Guide

https://nlesc.gitbooks.io/guide/content/index.html

  • Top 10 metrics for life science software good practices - F1000Research

https://f1000research.com/articles/5-2000/v1

  • danielskatz/sustaining-research-projects: sustainability models for research software projects

https://github.com/danielskatz/sustaining-research-Projects

  • Science Code Manifesto

http://sciencecodemanifesto.org/

Notes on using Docopt for script command line options

Docopt:

http://docopt.org/

https://github.com/docopt/docopt

An example for loading arguments from an INI file:
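
A minimal sketch of one way to do this (the file, section and option names are invented): parse the command line with docopt first, then fill in any options not given on the command line from an INI file read with configparser:

"""my_script.py: toy example of combining docopt with an INI file.

Usage:
    my_script.py [--ini=FILE] [--threshold=N]

Options:
    --ini=FILE       Read default option values from an INI file.
    --threshold=N    Some numeric parameter.
"""
import configparser

from docopt import docopt

if __name__ == "__main__":
    args = docopt(__doc__)
    if args["--ini"]:
        config = configparser.ConfigParser()
        config.read(args["--ini"])
        # fill in options not given on the command line
        # ('my_script' is a made-up section name)
        for key, value in config["my_script"].items():
            option = "--" + key
            if option in args and args[option] is None:
                args[option] = value
    print(args)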

Help message format

Basic reminders for docopt:
  • Use two spaces to separate options with their informal description
  • () are required, [] are optional
  • Default values are specified in the Options section with eg [default: xxx.log]

See also Schema for input argument validation example.

The “Usage:” pattern in docopt can’t contain empty lines and ends with an empty line.

The first word after “Usage:” is interpreted as the program’s name, e.g.:

‘python xxx.py’ makes it think your programme is called ‘python’ with
option ‘xxx.py’

Docopt reads multi-line descriptions in Options so 80 character lines can be wrapped.

Docopt recognises ‘Usage’ and ‘Options’ sections in docstrings when they are followed by ‘:’ (case insensitive).

R script examples and notes

Simple example (shebangs are the first line in a real script):

That’s all that’s needed to have command line options for R scripts.

Docopt basically reads the help message string and converts the options into a dictionary that is then passed on.


Second case

See how to specify the arguments and options (docopt has been ported to many languages):

See examples:

Docker workflow notes and Dockerfile example

Author: Antonio J Berlanga-Taylor
Date: July 3 2016

Docker is software for building and running containers. Containers are like virtual machines (but generally better, see the links below).

You can use Docker to develop tools in your local machine (regardless of its own operating system) and test without worrying about how your software will perform in a different environment.

Docker automates setting up and configuring environments. This makes it easy to build software, collaborate with others in data analysis, etc.

You can specify in your Dockerfile all the needed dependencies and instructions to run your application, scripts, etc.

Your application could be software for others or even your specific data analysis project.

The Dockerfile can be like a lab notebook containing instructions on the computing environment, external and internal software, versions used and example data for instance.

Learning Docker and writing a Dockerfile may be time consuming initially but could save you and collaborators major headaches when you try to reproduce results.

You can find many and better tutorials online and in the official Docker documentation. As with other files in this repository, these are basic notes, links and reminders to get started.

Note

Some code snippets, info, etc. are taken directly from some of the references. Files need cleaning up.

Docker notes

General steps to create a Dockerfile (see more details below):

See Dockerfile writing best practice guidelines.

  1. Create a file called Dockerfile, which we will use to place the relevant instructions and commands that will define what our container can do
  2. Build the Docker container, this will install and apply all the environment variables and configurations that you have specified in the Dockerfile
  3. Run the resulting container on any computing platform (with Docker installed on that platform).

Alternatively, use a pre-specified image with all the tools you need. Anything that isn’t in the Dockerfile is not kept, though (i.e. any new tools installed interactively are lost).

Base image with all the tools needed except the application itself:

  • Linux distro
  • Python
  • R
  • Conda
  • Alpine Linux (minimalistic)

Copy the Dockerfile, e.g.:

Add necessary tools to it and run as in steps above.

Otherwise, pull already available image and work from there:

Do e.g.:

docker pull continuumio/miniconda3
docker run -i -t continuumio/miniconda3 /bin/bash

Data science base images:

Alpine Linux (5 MB) is a minimal image that can serve as a starting point.

With this use Dockerfile containing e.g.:

FROM alpine:3.3
RUN apk add --no-cache mysql-client
ENTRYPOINT ["mysql"]

Steps to create a Dockerfile

  1. Write a dockerfile, see:

Dockerfile for data science examples:

Minimal alpine Python 3 image example.


Dockerfile contents:

# specify base image (could also use e.g. jupyter/scipy-notebook)
FROM ubuntu:14.04

# provide creator/maintainer of this Dockerfile
MAINTAINER Antonio J Berlanga-Taylor <a.berlanga@imperial.ac.uk>

# Specify some of the useful system tools and libraries to include in the Ubuntu bare bones image:

# Update the sources list
RUN apt-get update
# RUN cmds are linux cmds

# install useful system tools and libraries
RUN apt-get install -y libfreetype6-dev && \
        apt-get install -y libglib2.0-0 \
        libxext6 \
        libsm6 \
        libxrender1 \
        libblas-dev \
        liblapack-dev \
        gfortran \
        libfontconfig1 --fix-missing

RUN apt-get install -y tar \
        git \
        curl \
        nano \
        wget \
        dialog \
        net-tools \
        build-essential

# install Python and pip package manager
# TO DO: change for conda, eg:
# https://hub.docker.com/r/continuumio/miniconda/
# from where jupyter can run
RUN apt-get install -y python \
        python-dev \
        python-distribute \
        python-pip

# install useful and/or required Python libraries to run your script
# # TO DO: change for conda recipe, Bioc image, etc.
RUN pip install matplotlib \
        seaborn \
        pandas \
        numpy \
        scipy \
        sklearn \
        python-dateutil \
        gensim

COPY localfile.R /home/ubuntu/localfilecopy.R
# COPY copies local files into the container

EXPOSE 5000
# opens ports that can be mapped to server ports (?)

# define the command to run when the Docker container starts
ENTRYPOINT ["python"]
CMD ["my_script.py"]
# This is the first command that will run once the container starts
# Note Docker doesn't accept single quotes, only double.

  2. Build the docker image

From within the docker terminal and from the directory where the dockerfile (and eg python script) are:

docker build -t your_image_name .

See the messages printed and if it builds successfully it will appear in:

docker image list

  3. Run the docker image:

Create and enter a folder where data is located and/or will be saved:

cd ~/Documents/github.dir/docker_tests.dir

Install Docker on the platform to use beforehand:

https://docs.docker.com/engine/installation/linux/rhel/

docker run --rm -ti your_image_name

The -it flag makes the container run interactively.

--rm automatically removes the container when exiting.

Install a package:

apt-get update
apt-get install vim

Get the data (mount a volume) and point docker to it:

docker run -it -d -p 8888:8888 -v /home/ubuntu/xxx:/Users/USER/data/datfile.data your_image_name
# or e.g.:
docker run -it -v $(pwd):/tmp DOCKER_IMAGE /bin/bash
docker run -it -v $(pwd):/anyname/ continuumio/miniconda3 bash

-p flag sets the ports (to access a Jupyter notebook server locally). To find Docker’s ip address use:

docker-machine ip default #’default’ for docker machine

-d runs the container in detached mode, as a background process.

-v specifies which directory on the local machine to store results

Files get copied across after a certain period (?)

Data/files can be copied across with docker cp:

docker cp <containerId>:/file/path/within/container /host/path/target

These will be lost when the container is stopped (but not results saved locally) unless pre-specified in the Dockerfile.

Shut down the docker container:

docker ps # to get CONTAINER_ID
docker rm -f CONTAINER_ID

Get system info:

docker info

Remove old containers and images:

docker ps -a
docker rm CONTAINER_ID
docker images
docker rmi IMAGE

Stop and remove all containers:

docker stop $(docker ps -aq) # stop is 'graceful', can also use docker kill $(docker ps -aq)
docker rm $(docker ps -aq)

Create a Docker Hub account to upload your images and make them available to others.
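
Roughly (the account and image names below are placeholders):

docker login # authenticate with your Docker Hub account
docker tag my_docker_tag user_xxx/my_docker_tag # tag the local image with your account name
docker push user_xxx/my_docker_tag # upload it to Docker Hub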

Testing workflow example

Create a Dockerfile as above, see for example:

https://github.com/AntonioJBT/project_quickstart/blob/master/Dockerfile

See these minimal images with Alpine and Python for instance.

Copy Dockerfile to test directory (not necessary though), build image locally and run:

mkdir docker_tests
cd docker_tests
cp /path_to/project_xxx/Dockerfile .
docker build --no-cache=true -t user_xxx/my_docker_tag . # Build a local image, disable the cache
docker images # Check it's there
docker run --rm -ti user_xxx/my_docker_tag # Run your image interactively and remove the container when exiting

If you didn’t share a volume between the container and host, you can copy files across with docker cp (see also question on Stack Overflow):

docker ps -a # to get the container numeric ID
docker cp b0cbb62d9cd7:/home/my_files.tar.gz . # docker cp <numeric ID>:/full_path/to/file /host/location/

Aggressive clean up:

docker images -a # Show all images
docker images -f "dangling=true" # Show <none> images, remains of previous builds
docker rmi $(docker images -qf "dangling=true") # Delete <none> images, add '-a' option to delete ALL images
docker ps -a # Show all containers
docker stop $(docker ps -aq) # stop is 'graceful', can also use docker kill $(docker ps -aq)
docker rm $(docker ps -aq) # Delete all containers

# A softer removal of containers:

You can then go back to your code, make changes, push/pull to your version control system and start again with Dockerfile to test your package in a different environment to your machine.

TO DO

  • Integrate with GitHub and Travis?
  • Use Docker Hub to push and pull, akin to GitHub (integrate?)

To do

Author: Antonio
Date: 22 Dec 2016

DONE

  • The HPC team has installed ruffus and drmaa, these are working fine. David and I have been testing these and everything seems in order.
  • I’ve installed (in my user space) the CGAT scripts and pipeline framework, others may need to do this as well later on. The CGAT tools work well for me, they required some manual installation of several of the python libraries though. I also made some changes so that they could communicate with PBS Pro, these work without problems. There has been a recent switch to python 3 though so I have to test the modifications in the CGAT scripts. Not expecting problems though (famous last words)…

TO DO

  • Antonio, David, Vangelis, Gao: Finish building the Genotype QC tool
  • Genotype QC is currently the first pipeline we’ll build with these tools/approaches. Although not straightforward, it will essentially follow the CGAT and Ruffus workflows and tools.
  • Discuss with Ibrahim, Rui, Deborah: Matlab isn’t open source, big problem:
    • Discuss Matlab users’ needs and how principles can be applied without forcing others to learn new languages.
    • Can we simply create Matlab scripts and run them in Ruffus pipelines? Permissions?
    • Matlab code can still be published openly (i.e. we can publish the code, but people without a Matlab licence can’t re-run/use/modify it?)

Best practice checklists

See also:

Ten simple rules for making research software more robust

Box 1

Our complete list of potential topics considered indicative of good practices in software development.


Reference: Artaza et al., 2016


Each of these topics has quantitative and qualitative metrics that may help track the adoption of good practice and monitor compliance with the guidelines in life sciences.

  1. Version control:

    1. Yes/no?
    2. How many committers?
    3. When was the version control started?
    4. When was the last commit?
  2. Code reviews:

    1. Yes/no?
    2. Star rating based on code description
  3. Automated testing:

    1. Yes/no?

    2. Coverage for unit tests

    3. Yes/no for individual tests:

      1. Unit tests
      2. Functional tests
      3. Integration tests
      4. Regression tests
    4. Are the tests part of the code in the repository?

  4. Not reinventing the wheel:

    1. Using libraries?
    2. Using Frameworks?
    3. Describing the algorithm, explaining why known code is reimplemented.
    4. Reinventing should be documented. References to the algorithm?
    5. Percentage of code written from scratch?
    6. Percentage of code that is involved in the core functionality?
  5. Discoverability:

    1. Via structured search on functionality?
    2. Is it in the ELIXIR Tools and Data Services Registry or others (e.g., BioSharing)?
  6. Reusability of source code:

    1. Number of reuses = number of derived projects/external commits?
  7. Reusability of software:

    1. Number of citations on the paper
    2. Having a basic description of features in structured ELIXIR format (EDAM ontology) - in the ELIXIR Tools and Data Services Registry?
  8. Licensing:

    1. Is there a license?
    2. Is the source available?
    3. Is it open source according to opensource.org?
  9. Issue tracking/bug tracking:

    1. Does it have a publicly accessible issue tracker?
    2. How long are issues open?
    3. What is the number of unresolved issues?
    4. How much activity has there been in the last three months in the issue tracker?
  10. Support processes:

    1. Are basic processes defined? Like governance, mailing list, releases, …
  11. Compliance with community standards:

    1. Yes/no?
    2. Specifies the level of compliance, specification version or metrics?
  12. Buildable code:

    1. Does the compiler give warnings?
    2. Does a static analysis (“lint”) give warnings?
    3. Is an automated build system used?
  13. Open development:

    1. Number of external committers in the repositories
  14. Making data available:

    1. Yes/no?
    2. Where?
  15. Documentation:

    1. Ratio code/comments, code lines/document lines?
    2. Percentage of code dedicated to documentation?
  16. Simplicity:

    1. Measure of cyclomatic complexity
  17. Dependency management:

    1. Is it done automatically using a system?
    2. Does it use a language-standard repository to pull in dependencies?
    3. Is software made available as a dependency in a dependency repository?