
Welcome

Welcome to the general notes, explanations and related material for the EpiCompBio group.

The plan is to collectively build an approach that integrates reproducibility into our research.

Here you should be able to find some general ideas, introduction, references and the overall approach.

Go to the documentation for this repository to see the notes.

As usual, feel free to add your own, edit, comment, question, etc.

Notes on reproducibility in biomedical research

Author: Antonio
Date: 22 Dec 2016

General approach

The core idea is about reproducibility, integrating best practice, creating good habits and learning together. Reproducibility is the key principle. This (simply but not easily) requires discipline and a few tools that can be used and/or understood by as many people as possible.

Our research needs to be:

  • Reproducible (same question, same data, same code, same results)
  • Replicable (by other labs using other experiments, data, analysis, technology, etc.)
  • And ideally open source, open access and open data with code that is re-usable (by us and others)

The above requires:

  • Clear questions and hypotheses
  • Accessible, read-only data (data collection and experiments largely already follow these guidelines but the same applies)
  • Transparent, easy to read, well documented code
  • Transparent, easy to read, well documented workflows (e.g. data processing steps and analysis)
  • Software version control
  • Computer environment logging (e.g. what tools, what versions, what OS)
  • Parameter logging

There are many guidelines and thoughts (see below for a few suggestions).

Data analysis (of any size) should aim to have:

  • Clear, documented code
  • Ideally automation and re-usability
  • Unit testing at its most basic: does my new code change my expected results?

We’ve started with the EpiCompBio GitHub account.

There are many tools out there that aim to put the above into practice (see list of pipeline review references for example).

We don’t need to re-invent the wheel, simply to make best use of our collective skills and existing tools/approaches.

See the suggested approach and tools for what we are thinking of and currently implementing/using as well as the repos and tools that are/will appear in the EpiCompBio group.

Some references and tutorials

Other training resources

Software Carpentry’s (SC) git novice

Notes on computational pipelines for biomedical research

Author: Antonio Berlanga
Date: 22 Dec 2016

(Disclaimer (!): I’ve been learning as I go and I still have a long way… Please add, discuss, correct, etc.)

Suggested tools and approach to implement in computational and statistical analysis

Conceptually the problem is simple: chain third party tools and custom scripts together to answer a specific scientific question.

In practice this is a nightmare. It requires project discipline and very good statistical and scripting practices.

For example, in a genetic association project you might need:

  • Genotype QC pipeline
  • Imputation pipeline
  • Association testing pipeline
  • Downstream annotation of significant SNPs

All of these require many programs (third-party tools like plink), some custom scripts (such as plotting and stats in R) and will probably need to run on a high-performance cluster.

Keeping track of results, re-running with different parameters, logging, version control, communicating with colleagues, etc. quickly becomes difficult. A systematic approach is needed from the beginning that will cut the overhead and allow all results and plots to be traced back to the commands, parameters, software versions, and any steps used to process the original data, in an easy, transparent and reproducible way.

I have failed to do this and over time the costs can be high.

There are many tools out there that aim to put reproducibility in computational biology (or general data analysis) into practice (see list of pipeline review references for example).

Reproducibility requires data + code

We don’t need to re-invent the wheel, simply to make best use of our collective skills and existing tools/approaches.

Sound principles come first, languages and tools are secondary (to an extent, tools over time shape our thinking so good initial choices are important).

However a general, common framework and way of working is necessary. After a lot of initial personal and group pain we should hopefully see gains.

Using Python and UNIX philosophy as the building bases

Python:

Python is a popular, well-supported, general programming language which is flexible, powerful and readable. A great choice overall for beginners. It can serve as the glue for pipelines even if many scripts and programs are in other languages. Ruby, Perl and others are largely equivalent. There are dozens of online sources for learning and a very active community.

Ultimately a combination of Unix (or an equivalent compute environment), statistics and programming is needed; different people use different combinations.

Using Python, R and Unix is pretty powerful and a well-trodden path.

The main/basic idea is to be able to structure scripts into packages and re-use them (or at least freeze and present them at publication).
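
For instance, a minimal Python package layout (all names here are placeholders) often looks something like:

my_project/
    my_project/
        __init__.py
        run_analysis.py      # a stand-alone, CLI-callable script
        module_utils.py      # functions shared between scripts/pipelines
    tests/
        test_run_analysis.py
    docs/
    setup.py
    README.rst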

There’s a lot out there on software structure, see for example:

Structuring Your Project — The Hitchhiker’s Guide to Python
Software Carpentry - Intermediate and Advanced

UNIX is at the heart of most of the common and powerful operating systems available. See:

Philosophy

A classic book on UNIX:

The Unix Programming Environment, Brian W. Kernighan and Rob Pike (Prentice-Hall Software Series, 1983)

A general update on the above:

The Art of Unix Programming, Eric S. Raymond (Addison-Wesley Professional Computing, 2003)

Actual tools and practice

In general, pipelines should ideally be:

  • Well documented
  • Configurable for available compute resources
  • Not hard-coded: job parameters should be configurable, as they will be arbitrary and project specific
  • Run from the command line
  • Report extensive logging for debugging and versioning
  • Easy to build on
  • Runnable locally or on a cluster
  • Able to handle single and multi-jobs
  • Portable across computational environments

A big problem across the field is portability, currently without good answers, but pipelines can go some way towards this.

The general approach I’m suggesting is the one used at CGAT, which in turn adopts many current computational best-practice standards. See:

The CGAT Code Collection includes cgat scripts for genomics and CGAT Pipelines, a framework and set of ruffus based pipelines to run workflows in computational biology.

CGAT scripts and pipelines use popular, open source, mostly free, proven tools with excellent community support such as Python, R, Github, Travis CI, plus the myriad of genomics and biology software options for specific tasks.

A lot of this work is in beta, as are most pipeline approaches, of which there are many. Galaxy is a well-known one and could be an answer; it is designed to be easy for biologists to use and works well for that, but version control, scalability and other issues exist. See Galaxy and the Biostars community for example.

CGAT is based on Ruffus, a python pipeline tool which is flexible, powerful and readable (being python).

CGAT Pipelines can help manage computer resources, clusters, logging, execution, versioning and, more importantly, to work under a common framework (think languages, style, choice of tools, etc.).

CGAT Pipelines have their own backbone (for controlling jobs, communicating with the cluster, logging, software/package structuring, etc.). I’m still on the learning curve but think this is one of the best approaches because of its flexibility and power (once you get to grips with it). See the backbone scripts.

A pipeline example can be:
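
For illustration only, a minimal Ruffus sketch could look like the following (the file names and tasks are invented; a real CGAT pipeline adds configuration, logging and cluster submission on top):

"""pipeline_minimal.py: a toy ruffus pipeline (illustrative only)."""
import gzip
import shutil

from ruffus import transform, suffix, pipeline_run

# Task 1: compress every .txt file in the working directory
@transform("*.txt", suffix(".txt"), ".txt.gz")
def compress(infile, outfile):
    with open(infile, "rb") as fin, gzip.open(outfile, "wb") as fout:
        shutil.copyfileobj(fin, fout)

# Task 2: count the lines of each compressed file
@transform(compress, suffix(".txt.gz"), ".counts")
def count_lines(infile, outfile):
    with gzip.open(infile, "rt") as fin, open(outfile, "w") as fout:
        fout.write("%i\n" % sum(1 for _ in fin))

if __name__ == "__main__":
    # run every task needed to produce the final targets
    pipeline_run([count_lines])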

Limitations of CGAT (but common to these types of tools) are:

  • Pipelines have many dependencies
  • Setting up the initial environment is often very problematic
  • Keeping track of packages and managing them is a big overhead
  • There’s a steep learning curve in general and to each pipeline/approach
  • The “system” (e.g. funders and current science practice) rewards results rather than repeatability, so there is little time for it and little interest

An excellent complement/alternative is Jupyter and its notebook (aka IPython), particularly for interactive work:

On a side note, for managing packages see Conda, a great way to reduce time spent on this.

Structuring code

A general, proven approach to follow is one based on basic python organisation:

  • Scripts - Write stand-alone scripts which are callable from the CLI and can take arbitrary parameters
  • Modules - Include functions and code which could be used by more than one script/pipeline, bundled by overall aim/use
  • Pipeline - a python script (e.g. using ruffus) which chains multiple tasks (the functions or steps needed to answer the project’s question) across jobs (the input data, e.g. 10 fastq files which will all be treated in the same way) and which can be submitted to the cluster (e.g. managed by drmaa, which then communicates with SGE or PBSPro).

To this, we can ideally add:

  • Unit tests - aiming to test each script, parameter and function with small example data. Aimed at stability only, i.e. do new code changes mess up the expected results? (See the sketch after this list.)
  • A good option is to run them via Travis CI or Jenkins CI, integrated with GitHub (tests are automatically triggered after each commit and need configuration (e.g. yaml), data and expected results).
  • Report - aiming to write a basic automated report that picks up some basic stats, tables and plots from the pipeline results and puts them in one document (using e.g. sphinx, markdown, or similar tool).
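
As a rough illustration of such a test, assuming pytest and a hypothetical module my_tools with a count_lines function, a test file could look like:

# test_my_tools.py: minimal pytest sketch; my_tools.count_lines is a
# hypothetical function that counts the lines in a file
import my_tools

def test_count_lines(tmp_path):
    # small, known input data
    infile = tmp_path / "example.txt"
    infile.write_text("a\nb\nc\n")
    # the expected result should not change when the code changes
    assert my_tools.count_lines(str(infile)) == 3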

Tools to use

All of the above can be achieved with:

  • Version control such as Github
  • Continuous integration for unit testing, such as Travis CI (runs with GitHub)
  • Choice of programming and statistical languages (e.g. Python, Perl, R, Matlab, etc)
  • Computation pipeline tool such as Ruffus
  • Sufficient computing resources: your laptop, a unix cluster, etc. depending on tasks and data
  • A general framework which is extendable, allows us to keep relatively sane, and enhances the above (CGAT Pipelines, Galaxy, etc.).

Other languages

In terms of packaging and structuring of projects and programs other languages do their own thing.

For examples of R and its repository take a look at:

Packaging in Python

Python packaging has a messy history and isn’t completely straightforward.

See these webpages first and follow their guidelines:

python - Differences between distribute, distutils, setuptools and distutils2? - Stack Overflow

Python Packaging User Guide — Python Packaging User Guide documentation

pypa/sampleproject: A sample project that exists for PyPUG’s “Tutorial on Packaging and Distributing Projects”


You can also see other package examples (e.g. cryptography) and the Python package sample.

You can also use the structure provided by project_quickstart.py as a starting point.


Once you have the project file and directory structure, get the packages you need, e.g.:

pip install -U pip twine check-manifest setuptools

Create/edit MANIFEST.in and setup.py files as necessary (and possibly an INI file depending on how you’ve set things up).
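
As a rough sketch (the package name, version, dependencies and entry point below are placeholders), a minimal setup.py could look like:

# setup.py: minimal sketch; adjust name, version, dependencies and entry point
from setuptools import setup, find_packages

setup(
    name="package_xxx",
    version="0.1.0",
    description="One-line description of the package",
    packages=find_packages(),
    install_requires=["docopt"],  # runtime dependencies
    entry_points={
        # exposes a 'package_xxx' command on the command line
        "console_scripts": ["package_xxx = package_xxx.cli:main"],
    },
)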

Use check-manifest to detect errors in setup.py:

check-manifest

Run the following to test and create the distribution:

python setup.py check

python setup.py sdist bdist_wheel

You can create an environment and test in a separate directory (using conda for example):

cd .. ; mkdir test_package ; cd test_package

conda create -n test_env python=3.6

Install any dependencies needed with conda or pip

source activate test_env

git clone https://github.com/xxx/package_xxx.git

Install and test package:

cd package_xxx

python setup.py install

Then test the main script elsewhere (you can create an entry point in setup.py), run a test example, etc.

If this is an actual package to share with others, upload it to PyPI:

See the instructions page for all the details.

  • Register yourself at PyPI and create a ~/.pypirc file.
  • Register your package manually or with twine (see instructions above)
  • Upload your package (requires twine for this example):
twine upload dist/*

The package should now be ready to install anywhere with:

pip install package_xxx

Further references:

This blog has an explanation of how to carry out all of this with examples of MANIFEST.in and setup.py files.

It also has further information on how to use PyPI’s test server.

Things changed a fair amount from Python 2.x to 3.x, so check the most recent information (see the links above).

https://the-hitchhikers-guide-to-packaging.readthedocs.io/en/latest/introduction.html

https://wiki.python.org/moin/Distutils/Tutorial

http://www.diveintopython3.net/packaging.html

https://blog.niteoweb.com/setuptools-run-custom-code-in-setup-py

http://stackoverflow.com/questions/774824/explain-python-entry-points

http://stackoverflow.com/questions/13307408/python-packaging-data-files-are-put-properly-in-tar-gz-file-but-are-not-install?rq=1

Notes on how to create technical documents

restructuredText (reST, rST)

  • reStructuredText may be the best option for accessible, flexible, readable, plain-text, Python-based technical documentation. reST is the source format; Sphinx is a builder tool that transforms reStructuredText into different target output formats.
  • It can be integrated with Sphinx for converting to html, pdf, etc. and for supporting multiple languages (not just Python).
  • reStructuredText is similar to Markdown. If webpages are the main objective maybe use Markdown though.

For technical documents maybe reStructuredText is better.

  • reStructuredText seems much easier than html and LaTeX and travels much better as it is plain text that can be used as the source for later conversion.
  • With python+docutils+sphinx it can be formatted to html, with pdflatex to pdf for example.
  • reStructuredText can be transformed to Markdown with Pandoc and then rendered as html.
  • reStructuredText supports math.
  • A good option is CGATReport, based on Python, Sphinx and rST. It integrates plotting code directly into its workflow though (good?). The central data structure is a pandas dataframe. It has a table renderer.
  • Also see Jupyter ecosystem and its notebook (previously IPython Notebook):
    http://jupyter.org/
    • This works well, looks great, has huge flexibility, is becoming language independent, great when run interactively.
    • Requires windowing though for the notebook which can be very cumbersome when working remotely.
    • Some problems when turning from notebook to script for automated analysis?

Including citations

reST directives to include external files

To include images

Images can be pulled in with e.g.:
`.. image:: /path/to/image.jpg`

Making slides with rst

After running Hovercraft, go to your default browser and open the local port (e.g. http://0.0.0.0:8000) to see the presentation, or create a directory and use -o to save the output.

More on Hovercraft with additional tools.

Other options are:
  • Create rst document
  • Convert from rst to e.g. beamer (latex slides) with pandoc:
pandoc myfile.rst -s -f rst --to beamer -o myfile.pdf
pandoc -V theme:Warsaw -N -S --toc --normalize -f rst -t beamer -o myfile.pdf myfile.rst

See this page for more options and explanations.

Other options: see this tutorial and GitHub code with rst and Beamer.

Problems with reST

  • Tracking changes is a problem though (between collaborators not using git, i.e. collaborator’s comments in a Word review form):
    http://criticmarkup.com/
  • Rendering external tables easily with rST? See CGATReport and R library xtable

These aren’t specific to rST though.

Jupyter notebooks

For exploratory analysis these might be a great solution. They are very flexible, can mix languages, and keep plots, code and text together. See an example of an RNA-seq publication here and a blog with some tips and info. A notebook server is needed to run them properly.

R markdown and its notebook

R markdown v2 is another excellent option in this regard. See also R Markdown to Word. If you’re running analysis locally, R notebooks and Jupyter are probably far better than rst and Sphinx for reports. See these blogs (a, b, c, d, e) comparing R and Jupyter notebooks for instance and other tutorials.

You can also run Rmd files with command line parameters like (f, g, h). This is the main tutorial.

Check the reference guide and article templates for Rmarkdown.

VIM or Emacs?

See org-mode in vim for example (org-mode comes originally from emacs). Since working with code blocks in vim’s org-mode doesn’t seem possible, for vim users maybe emacs + evil + org-mode is the better option.

TO DO

Additional references and blogs

First Steps with Sphinx — Sphinx 1.5.1 documentation
reStructuredText Primer
reStructuredText vs Markdown for documentation
Why Sphinx and reStructuredText ? — Varnish version 2.1.5 documentation
Managing bibliographic citations in Sphinx — Wiser 0.1 documentation
reStructuredText - Wikipedia
How To Write Papers with Restructured Text
Standard format conversions between reST and LaTeX:
There is some support for reST to Word:

restructuredText tutorials/info


rst can be used on its own and then converted to html, pdf etc with different tools.

Sphinx adds many useful tools and is based on rst; one of the main ones is connecting many files into a single hierarchy of documents. Sphinx also makes it easy to document your Python-based project.

Generate the documentation with sphinx:

pip install sphinx

A sphinx template report can be generated with:

mkdir my_report_docs

cd my_report_docs

sphinx-quickstart

After adding content, you can generate html, pdf, etc. with:

make html

The rendered file will be found in the _build directory.


Include other rst files:

.. toctree::
    :maxdepth: 2
    :numbered:
    :titlesonly:
    :glob:
    :hidden:

    intro.rst
    chapter1.rst
    chapter2.rst

See the toctree directive for full info.


It is also possible to include the literal contents of a file with:

.. literalinclude:: filename
    :linenos:
    :language: python
    :lines: 1, 3-5
    :start-after: 3
    :end-before: 5

Include an image:

.. image:: images/ball1.gif

Or:

.. image:: images/xxx.png
   :height: 100
   :width: 200
   :scale: 50
   :alt: alternate text

See image directive full markup.

Or import a figure which can have a caption and whatever else you add:

.. figure:: xxx.jpg
    :width: 200px
    :align: center
    :height: 100px
    :alt: alternate text
    :figclass: align-center

    A caption would be written here as plain text. You can add more below it, e.g.:

    .. code-block:: python

        import image

Include a simple csv table:

.. csv-table:: a title
   :header: "name", "firstname", "age"
   :widths: 20, 20, 10

   "Smith", "John", 40
   "Smith", "John, Junior", 20

See csv-table directive for example.


For useful extensions to rst and sphinx see this tutorial on extensions:

In a sphinx conf.py file you can specify the extensions needed, e.g.:

extensions = [
    'easydev.copybutton',
    'sphinx.ext.autodoc',
    'sphinx.ext.autosummary',
    'sphinx.ext.coverage',
    'sphinx.ext.graphviz',
    'sphinx.ext.doctest',
    'sphinx.ext.intersphinx',
    'sphinx.ext.todo',
    'sphinx.ext.pngmath',
    'sphinx.ext.ifconfig',
    'matplotlib.sphinxext.only_directives',
    'matplotlib.sphinxext.plot_directive',
]

The math directive, e.g.:

.. math::

    n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k

With a math extension enabled in conf.py (e.g. sphinx.ext.pngmath, listed above), this would produce:

\[n_{\mathrm{offset}} = \sum_{k=0}^{N-1} s_k n_k\]

References, e.g. [CIT2002] are defined at the bottom of the page as:

.. [CIT2002] A citation

and called with:

[CIT2002]_

Generate python package documentation with Sphinx and render rst docs

  • Install Sphinx (which provides sphinx-quickstart, sphinx-apidoc and sphinx-build)

  • In the python package docs directory that you would’ve created:

    • Run ‘sphinx-quickstart’ to set up configuration values and generate template rst docs. See an example and First Steps with Sphinx.
    • Manually edit index.rst and other files to use as content
    • sphinx-build -b html sourcedir builddir ; or build a PDF (needs the LaTeX builder installed)
    • sphinx-apidoc -o . .. to generate module and function documents from docstrings within scripts
    • make clean, make html or make latexpdf to clean and generate further builds (make is equivalent to sphinx-build because quickstart creates Makefile and make.bat files)
    • sphinx-autobuild . _build_html rebuilds automatically on changes (useful for large docs that are faster to update than to rebuild from scratch)
    • Pull/push etc to GitHub account
    • Create an account in ReadTheDocs (RTD) and connect to GitHub repository
    • Some configuration is needed (TO DO) at RTD
    • RTD rebuilds after every commit using ‘sphinx-build -b html . _build/html’
    • If the build on RTD doesn’t work try ‘Wipe’ in /projects/[project]/versions/
    • You can add the ‘_build’ dir to .gitignore

If using project_quickstart, which copies templates generated by a basic sphinx-quickstart run, you can do:

These commands need to be run where the conf.py and other Sphinx templates are, usually:

  • project_XXXX/code/docs for the code documentation
  • project_XXXX/documents_and_manuscript/ for the manuscript preparation

pandoc common commands

See pandoc demos

Some examples for rst to PDF:
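
For example, a basic conversion (PDF output needs a LaTeX engine installed):

pandoc myfile.rst -o myfile.pdf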

Use option --bibliography=FILE for rendering citations.

Word docx:
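
For example, a basic conversion:

pandoc myfile.rst -o myfile.docx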


Links with more information (some are old):

http://docs.readthedocs.io/en/latest/builds.html

https://daler.github.io/sphinxdoc-test/includeme.html

http://lucasbardella.com/blog/2010/02/hosting-your-sphinx-docs-in-github

http://cgat.readthedocs.io/en/latest/PipelineReports.html

Annoying quirks?!

For titles to appear in the ReadTheDocs table of contents they need to be as:

####
Why?
####

If using ‘=’ instead they don’t seem to build…

Substitutions in restructuredText files

Variables are specified as “|xxxx|”.

An example would be:

The main file, e.g.:

.. Include templates from external file (this is a comment).
.. include:: substitution_vars.rst

I can include text, like |my custom text|, from other files.

I can include more, such as the following: |more text|

And |here|.

Then specify the substitutions in a separate rst file, such as “substitution_vars.rst”.

.. Fill in the variables in the external rst file:

.. |my custom text| replace:: "example text 1"

.. |more text| replace:: several other things as well

.. |here| replace:: bla bla bla

Because the substitution process uses the “.. include:: ” directive, everything in the file will get included. Thus use hidden comments if needed.

Use rst2pdf to render, which will then substitute the variables with the appropriate changes. pandoc doesn’t seem to read the include directive properly. rst2pdf runs in Python 2.7 only.

Run:

rst2pdf file_with_include.rst -o file_with_include.pdf

See the SO question and technical documentation for more information.

Creating figure layouts for publication programmatically

It seems surprisingly hard to find tools for this. The current workflow I’d suggest (based on Python, essentially out of the svgutils blog):

  1. Plot with any tool, save as svg
  2. Import the figures you want into svgutils
  3. Create legends, titles, layout, etc.
  4. Save multi-panel figure as svg
  5. Convert svg to pdf (with inkscape on the command line for example)
  6. Insert images into rst with image and other directives to create a file with text, figures, tables, etc. and which can later be converted to pdf or html.

See below for details, references and basic examples. I haven’t tested many of these but left them here as suggestions.

Python package svgutils, probably the one to use, starts and ends with svg files.

See this explanation for more details.

from svgutils.compose import *

Figure("16cm", "6.5cm",
        Panel(
              SVG("sigmoid_fit.svg"),
              Text("A", 25, 20, size=12, weight='bold')
             ),
        Panel(
              SVG("anscombe.svg").scale(0.5),
              Text("B", 25, 20, size=12, weight='bold')
             ).move(280, 0)
        ).save("fig_final_compose.svg")

A similar package is svg_stack: concatenate SVG files

A useful svg utility package might be scour.

Inkscape

e.g.

inkscape --file=fig_final.svg --export-area-drawing --without-gui --export-pdf=output.pdf

Full suite, equivalent to Adobe Illustrator but open source and free.

To install on a Mac you can use:

brew install caskformula/caskformula/inkscape

Another package for image file conversions is CairoSVG.

cairosvg -o fig_final.pdf fig_final.svg

restructuredText and SVG

See the documentation on reStructuredText and svg. The mix of pdf, svg, rst, html, etc. can become nightmarish.

rst doesn’t seem to have a specific figure layout tool but there are some workarounds.

See the image rst directive details for more information and examples.

There is also the figure directive.


LaTeX does not support svg directly, so svg files first need converting to pdf or eps. Inkscape can be used for this.

If you’re using LaTeX see this document for further help.

inkscape -D -z --file=image.svg --export-pdf=image.pdf --export-latex

There is a Sphinx svg image directive that you can try:

Tables are a different matter altogether. You can wrap figures in a table within rst.

grImport does something similar and can manipulate figures/images starting from PostScript:

The imager package (also here) can import vector graphics, but is meant for image manipulation rather than creating layouts.

Create and format PowerPoint documents from R software - Easy Guides - Wiki - STHDA

Microsoft Word and PowerPoint Documents Generation ReporteRs package

CRAN - Package cowplot

Arranging plots in a grid


OpenCV

PIL Pillow Fork

Both are for general image processing rather than figure layout.

How to Create Publication-Quality Figures

Overview Inkscape

Ten Simple Rules for Better Figures

imagemagick - How can I convert a PNG to a PDF in high quality so it’s not blurry or fuzzy? - Unix & Linux Stack Exchange

Combine several images horizontally with Python - Stack Overflow

Python Image Library: How to combine 4 images into a 2 x 2 grid? - Stack Overflow

Notes on continuous integration

Did the code I just pushed change all my results? Did I introduce a bug?

Basic steps with Travis for Python:
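
Very roughly: enable the repository on the Travis CI site (it authenticates with your GitHub account), add a .travis.yml to the repository listing the Python versions to test, the install commands (e.g. pip installing the package and its requirements) and the test command (e.g. pytest), then push; each subsequent commit triggers a build.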

For a general example of a simple Travis setup see:

https://github.com/EpiCompBio/genotype_tools

Pages to check and further info:

https://www.hackzine.org/even-better-testing-with-pytest-tox-travis.html

If too heavy for Travis CI, see Jenkins, the CGAT Jenkins setup and some internal instructions for it:

References and reviews of computational pipelines

  • ‘Big data’, Hadoop and cloud computing in genomics. - PubMed - NCBI

https://www.ncbi.nlm.nih.gov/pubmed/23872175

  • Using Docker Bioconductor containers

http://databio.org/docker_bioconductor/#/title

  • Ruffus — ruffus 2.6.3 documentation

http://www.ruffus.org.uk/index.html

  • Best practices — pypiper 0.3 documentation

http://pypiper.readthedocs.io/en/latest/best-practices.html

  • A review of bioinformatic pipeline frameworks

http://bib.oxfordjournals.org/content/early/2016/03/23/bib.bbw020.full

  • ssadedin/bpipe: Bpipe - a tool for running and managing bioinformatics pipelines

https://github.com/ssadedin/bpipe

  • Snakemake—a scalable bioinformatics workflow engine

http://bioinformatics.oxfordjournals.org/content/28/19/2520.short

  • Omics Pipe: a community-based framework for reproducible multi-omics data analysis

https://bioinformatics.oxfordjournals.org/content/31/11/1724.full

  • Scientific workflow systems - can one size fit all? - IEEE Xplore Document

http://ieeexplore.ieee.org/document/4786077/?reload=true&arnumber=4786077

  • Experiences with workflows for automating data-intensive bioinformatics

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539931/

  • NLeSC Guide

https://nlesc.gitbooks.io/guide/content/index.html

  • Top 10 metrics for life science software good practices - F1000Research

https://f1000research.com/articles/5-2000/v1

  • danielskatz/sustaining-research-projects: sustainability models for research software projects

https://github.com/danielskatz/sustaining-research-Projects

  • Science Code Manifesto

http://sciencecodemanifesto.org/

Notes on using Docopt for script command line options

Docopt:

http://docopt.org/

https://github.com/docopt/docopt

An example for loading arguments from an INI file:
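
A minimal sketch of one way to do this (the file, section and option names are invented): parse the command line with docopt first, then fill in any options not given on the command line from an INI file read with configparser:

"""my_script.py: toy example of combining docopt with an INI file.

Usage:
    my_script.py [--ini=FILE] [--threshold=N]

Options:
    --ini=FILE       Read default option values from an INI file.
    --threshold=N    Some numeric parameter.
"""
import configparser

from docopt import docopt

if __name__ == "__main__":
    args = docopt(__doc__)
    if args["--ini"]:
        config = configparser.ConfigParser()
        config.read(args["--ini"])
        # fill in options not given on the command line
        # ('my_script' is a made-up section name)
        for key, value in config["my_script"].items():
            option = "--" + key
            if option in args and args[option] is None:
                args[option] = value
    print(args)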

Help message format

Basic reminders for docopt:
  • Use two spaces to separate options with their informal description
  • () are required, [] are optional
  • Default values are specified in the Options section with eg [default: xxx.log]

See also Schema for input argument validation example.

The “Usage:” pattern in docopt can’t contain empty lines and ends with an empty line.

The first word after “Usage:” is interpreted as the program’s name, e.g.:

‘python xxx.py’ makes it think your programme is called ‘python’ with
option ‘xxx.py’

Docopt reads multi-line descriptions in Options so 80 character lines can be wrapped.

Docopt recognises ‘Usage’ and ‘Options’ sections in docstrings when they are followed by ‘:’ (case insensitive).

R script examples and notes

Simple example (shebangs are the first line in a real script):

That’s all that’s needed to have command line options for R scripts.

Docopt basically reads the help message string and converts the options into a dictionary that is then passed on.


Second case

See how to specify the arguments and options (docopt has been ported to many languages):

See examples:

Docker workflow notes and Dockerfile example

Author: Antonio J Berlanga-Taylor
Date: July 3 2016

Docker is software for building and running containers. Containers are like virtual machines (but generally better, see the links below).

You can use Docker to develop tools in your local machine (regardless of its own operating system) and test without worrying about how your software will perform in a different environment.

Docker automates setting up and configuring environments. This makes it easy to build software, collaborate with others in data analysis, etc.

You can specify in your Dockerfile all the needed dependencies and instructions to run your application, scripts, etc.

Your application could be software for others or even your specific data analysis project.

The Dockerfile can be like a lab notebook containing instructions on the computing environment, external and internal software, versions used and example data for instance.

Learning Docker and writing a Dockerfile may be time consuming initially but could save you and collaborators major headaches when you try to reproduce results.

You can find many and better tutorials online and in the official Docker documentation. As with other files in this repository, these are basic notes, links and reminders to get started.

Note

Some code snippets, info, etc. are taken directly from some of the references. Files need cleaning up.

Docker notes

General steps to create a Dockerfile (see more details below):

See Dockerfile writing best practice guidelines.

  1. Create a file called Dockerfile, which we will use to place the relevant instructions and commands that will define what our container can do
  2. Build the Docker container, this will install and apply all the environment variables and configurations that you have specified in the Dockerfile
  3. Run the resulting container on any computing platform (with Docker installed on that platform).

Alternatively, use a pre-specified image with all the tools you need. Anything that isn’t in the Dockerfile is not kept, though (i.e. any new tools installed interactively are lost).

Base image with all the tools needed except the application itself:

  • Linux distro
  • Python
  • R
  • Conda
  • Alpine Linux (minimalistic)

Copy the Dockerfile, e.g.:

Add necessary tools to it and run as in steps above.

Otherwise, pull already available image and work from there:

Do e.g.:

docker pull continuumio/miniconda3
docker run -i -t continuumio/miniconda3 /bin/bash

Data science base images:

Alpine Linux (5 MB) is a minimal image that can serve as a starting point.

With this use Dockerfile containing e.g.:

FROM alpine:3.3
RUN apk add --no-cache mysql-client
ENTRYPOINT ["mysql"]

Steps to create a Dockerfile

  1. Write a dockerfile, see:

Dockerfile for data science examples:

Minimal alpine Python 3 image example.


Dockerfile contents:

# specify base image (could also use e.g. jupyter/scipy-notebook)
FROM ubuntu:14.04

# provide creator/maintainer of this Dockerfile
MAINTAINER Antonio J Berlanga-Taylor <a.berlanga@imperial.ac.uk>

# Specify some of the useful system tools and libraries to include in the Ubuntu bare bones image:

# Update the sources list
RUN apt-get update
# RUN cmds are linux cmds

# install useful system tools and libraries
RUN apt-get install -y libfreetype6-dev && \
        apt-get install -y libglib2.0-0 \
        libxext6 \
        libsm6 \
        libxrender1 \
        libblas-dev \
        liblapack-dev \
        gfortran \
        libfontconfig1 --fix-missing

RUN apt-get install -y tar \
        git \
        curl \
        nano \
        wget \
        dialog \
        net-tools \
        build-essential

# install Python and pip package manager
# TO DO: change for conda, eg:
# https://hub.docker.com/r/continuumio/miniconda/
# from where jupyter can run
RUN apt-get install -y python \
        python-dev \
        python-distribute \
        python-pip

# install useful and/or required Python libraries to run your script
# # TO DO: change for conda recipe, Bioc image, etc.
RUN pip install matplotlib \
        seaborn \
        pandas \
        numpy \
        scipy \
        sklearn \
        python-dateutil \
        gensim

COPY localfile.R /home/ubuntu/localfilecopy.R
# COPY copies local files into the container

EXPOSE 5000
# opens ports that can be mapped to server ports (?)

# define the command to run when the Docker container starts
ENTRYPOINT ["python"]
CMD ["my_script.py"]
# This is the first command that will run once the container starts
# Note Docker doesn't accept single quotes, only double.

  2. Build the docker image

From within the docker terminal and from the directory where the dockerfile (and eg python script) are:

docker build -t your_image_name .

See the messages printed and if it builds successfully it will appear in:

docker image list

  3. Run the docker image:

Create and enter a folder where data is located and/or will be saved:

cd ~/Documents/github.dir/docker_tests.dir

Install Docker on the platform to use beforehand:

https://docs.docker.com/engine/installation/linux/rhel/

docker run --rm -ti your_image_name

The -it flag makes the container run interactively.

--rm automatically removes the container when exiting.

Install a package:

apt-get update
apt-get install vim

Get the data (mount a volume) and point docker to it:

docker run -it -d -p 8888:8888 -v /home/ubuntu/xxx:/Users/USER/data/datfile.data your_image_name
# or e.g.:
docker run -it -v $(pwd):/tmp DOCKER_IMAGE /bin/bash
docker run -it -v $(pwd):/anyname/ continuumio/miniconda3 bash

-p flag sets the ports (to access a Jupyter notebook server locally). To find Docker’s ip address use:

docker-machine ip default #’default’ for docker machine

-d runs the container in detached mode, as a background process.

-v specifies which directory on the local machine to store results

Files get copied across after a certain period (?)

Data/files can be copied across with docker cp:

docker cp <containerId>:/file/path/within/container /host/path/target

These will be lost when the container is stopped (but not results saved locally) unless pre-specified in the Dockerfile.

Shut down the docker container:

docker ps # to get CONTAINER_ID
docker rm -f CONTAINER_ID

Get system info:

docker info

Remove old containers and images:

docker ps -a
docker rm CONTAINER_ID
docker images
docker rmi IMAGE

Stop and remove all containers:

docker stop $(docker ps -aq) # stop is 'graceful', can also use docker kill $(docker ps -aq)
docker rm $(docker ps -aq)

Create a Docker Hub account to upload your images and make them available to others.
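
Roughly (the account and image names below are placeholders):

docker login # authenticate with your Docker Hub account
docker tag my_docker_tag user_xxx/my_docker_tag # tag the local image with your account name
docker push user_xxx/my_docker_tag # upload it to Docker Hub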

Testing workflow example

Create a Dockerfile as above, see for example:

https://github.com/AntonioJBT/project_quickstart/blob/master/Dockerfile

See these minimal images with Alpine and Python for instance.

Copy Dockerfile to test directory (not necessary though), build image locally and run:

mkdir docker_tests
cd docker_tests
cp /path_to/project_xxx/Dockerfile .
docker build --no-cache=true -t user_xxx/my_docker_tag . # Build a local image, disable the cache
docker images # Check it's there
docker run --rm -ti user_xxx/my_docker_tag # Run your image interactively and remove the container when exiting

If you didn’t share a volume between the container and host, you can copy files across with docker cp (see also question on Stack Overflow):

docker ps -a # to get the container numeric ID
docker cp b0cbb62d9cd7:/home/my_files.tar.gz . # docker cp <numeric ID>:/full_path/to/file /host/location/

Aggressive clean up:

docker images -a # Show all images
docker images -f "dangling=true" # Show <none> images, remains of previous builds
docker rmi $(docker images -qf "dangling=true") # Delete <none> images, add '-a' option to delete ALL images
docker ps -a # Show all containers
docker stop $(docker ps -aq) # stop is 'graceful', can also use docker kill $(docker ps -aq)
docker rm $(docker ps -aq) # Delete all containers

# A softer removal of containers:

You can then go back to your code, make changes, push/pull to your version control system and start again with Dockerfile to test your package in a different environment to your machine.

TO DO

  • Integrate with GitHub and Travis?
  • Use Docker Hub to push and pull, akin to GitHub (integrate?)

To do

Author: Antonio
Date: 22 Dec 2016

DONE

  • The HPC team has installed ruffus and drmaa, these are working fine. David and I have been testing these and everything seems in order.
  • I’ve installed (in my user space) the CGAT scripts and pipeline framework, others may need to do this as well later on. The CGAT tools work well for me, they required some manual installation of several of the python libraries though. I also made some changes so that they could communicate with PBS Pro, these work without problems. There has been a recent switch to python 3 though so I have to test the modifications in the CGAT scripts. Not expecting problems though (famous last words)…

TO DO

  • Antonio, David, Vangelis, Gao: Finish building the Genotype QC tool
  • Genotype QC is currently the first pipeline we’ll build with these tools/approaches. Although not straightforward, it will essentially follow the CGAT and Ruffus workflows and tools.
  • Discuss with Ibrahim, Rui, Deborah: Matlab isn’t open source, big problem:
    • Discuss Matlab users’ needs and how principles can be applied without forcing others to learn new languages.
    • Can we simply create Matlab scripts and run them in Ruffus pipelines? Permissions?
    • Matlab code can still be published openly (i.e. we can publish the code, but people without a Matlab licence can’t re-run/use/modify it?)

Best practice checklists

See also:

Ten simple rules for making research software more robust

Box 1

Our complete list of potential topics considered indicative of good practices in software development.


Reference: Artaza et al., 2016


Each of these topics has quantitative and qualitative metrics that may help track the adoption of good practice and monitor compliance with the guidelines in life sciences.

  1. Version control:

    1. Yes/no?
    2. How many committers?
    3. When was the version control started?
    4. When was the last commit?
  2. Code reviews:

    1. Yes/no?
    2. Star rating based on code description
  3. Automated testing:

    1. Yes/no?

    2. Coverage for unit tests

    3. Yes/no for individual tests:

      1. Unit tests
      2. Functional tests
      3. Integration tests
      4. Regression tests
    4. Are the tests part of the code in the repository?

  4. Not reinventing the wheel:

    1. Using libraries?
    2. Using Frameworks?
    3. Describing the algorithm, explaining why known code is reimplemented.
    4. Reinventing should be documented. References to the algorithm?
    5. Percentage of code written from scratch?
    6. Percentage of code that is involved in the core functionality?
  5. Discoverability:

    1. Via structured search on functionality?
    2. Is it in the ELIXIR Tools and Data Services Registry or others (e.g., BioSharing)?
  6. Reusability of source code:

    1. Number of reuses = number of derived projects/external commits?
  7. Reusability of software:

    1. Number of citations on the paper
    2. Having a basic description of features in structured ELIXIR format (EDAM ontology) - in the ELIXIR Tools and Data Services Registry?
  8. Licensing:

    1. Is there a license?
    2. Is the source available?
    3. Is it open source according to opensource.org?
  9. Issue tracking/bug tracking:

    1. Does it have a publicly accessible issue tracker?
    2. How long are issues open?
    3. What is the number of unresolved issues?
    4. How much activity has there been in the last three months in the issue tracker?
  10. Support processes:

    1. Are basic processes defined? Like governance, mailing list, releases, …
  11. Compliance with community standards:

    1. Yes/no?
    2. Specifies the level of compliance, specification version or metrics?
  12. Buildable code:

    1. Does the compiler give warnings?
    2. Does a static analysis (“lint”) give warnings?
    3. Is an automated build system used?
  13. Open development:

    1. Number of external committers in the repositories
  14. Making data available:

    1. Yes/no?
    2. Where?
  15. Documentation:

    1. Ratio code/comments, code lines/document lines?
    2. Percentage of code dedicated to documentation?
  16. Simplicity:

    1. Measure of cyclomatic complexity
  17. Dependency management:

    1. Is it done automatically using a system?
    2. Does it use a language-standard repository to pull in dependencies?
    3. Is software made available as a dependency in a dependency repository?