We have already discussed the importance of reproducibility in computational research, and commented on some good practices to improve it. Reading the Ten Simple Rules that Sandve and co-workers proposed last October, we cannot help but underline the importance of pipelines. A sound pipeline-based approach keeps researchers away from potentially harmful manual interventions on data and lets them track their workflow properly. Pipelines are well suited to typical bioinformatics tasks, which are usually huge and require a large amount of computation and several sorting and filtering steps. Despite the usual controversies on the matter, Python is becoming the first choice of many bioinformaticians because of its powerful features, its lively and well-populated community, and its ease of use. That is why I think it is fair to discuss a couple of Python-based pipeline creation tools.
The first step is to take stock of the main features we should ask of a Python pipeline creation tool. Anyone will appreciate a lightweight system, and a simple syntax, scalability and the ability to manage complex workflows with ease are all very welcome. Another aspect I want to mention is the possibility of including previously written code in a pipeline system. There are two main reasons for this. First, functions and classes may be reused in different projects, and a pipeline system that works as a “wrapper” around your code makes this easier. Second, many Python beginners are not really used to thinking in pipelines, as Python works great with a module-based approach (even if not exclusively).
The solutions I found can be distinguished by their relationship with the code, that is, by how they thread into a Python script. Let’s assume, for simplicity, that a program is made up of functions grouped into modules, and that several modules constitute the whole thing. We have thus identified three concentric levels: a function level, a module level, and a multi-file level. Pipeline systems for Python are basically modules that let you add simple-syntax code to your scripts to manage the data flow. Several commands, usually defined as decorators, are provided to organize the data flow into a pipeline. So, we will discuss how different solutions work at different levels.
Pipelines working at function level: Ruffus and Joblib
Published in 2010 in Bioinformatics by Leo Goodstadt at Oxford University, Ruffus is available on its official website, where you can find complete documentation and tutorials. Ruffus works by connecting consecutive input/output files, and requires the developer to write the functions in the code following the order of the data flow. Any function in the pipeline is preceded by the @follows decorator, which indicates the flow direction, and by the @files decorator, which names the input and output files. That is why I describe it as a pipeline system working at the “function level”: the internal structure of a script depends on the structure of the pipeline.
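To make the function-level idea concrete, here is a minimal sketch that uses only the standard library. The `files`, `follows` and `run` helpers below are hypothetical stand-ins written for this post: they borrow the decorator names mentioned above but do not reproduce Ruffus's actual API or behaviour. The sequences and file names are made up as well.

```python
import os
import tempfile

# Hypothetical helpers: they imitate the idea, not Ruffus's real API.
def files(infile, outfile):
    """Attach an input path and an output path to a task function."""
    def decorator(func):
        func.infile, func.outfile = infile, outfile
        return func
    return decorator

def follows(*upstream):
    """Record which tasks must run before the decorated one."""
    def decorator(func):
        func.upstream = upstream
        return func
    return decorator

def run(target, done=None):
    """Run the upstream tasks first, then the target; return the run order."""
    done = [] if done is None else done
    for dependency in getattr(target, "upstream", ()):
        run(dependency, done)
    if target not in done:
        target(target.infile, target.outfile)
        done.append(target)
    return done

# Made-up input data in a temporary directory.
workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "reads.txt")
filtered = os.path.join(workdir, "filtered.txt")
ordered = os.path.join(workdir, "sorted.txt")
with open(raw, "w") as handle:
    handle.write("TTGACA\nAC\nACGT\nGG\n")

@files(raw, filtered)
def filter_short(infile, outfile):
    """Keep sequences of at least four bases."""
    with open(infile) as src, open(outfile, "w") as dst:
        dst.writelines(line for line in src if len(line.strip()) >= 4)

@follows(filter_short)
@files(filtered, ordered)
def sort_reads(infile, outfile):
    """Sort the filtered sequences alphabetically."""
    with open(infile) as src, open(outfile, "w") as dst:
        dst.writelines(sorted(src))

order = run(sort_reads)
with open(ordered) as handle:
    result = handle.read().split()
```

Note how the script itself is shaped by the pipeline: every step is a function with an input and an output file, written in data-flow order.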
This approach is somewhat related to the one implemented in Joblib, a Python pipeline system that is mostly oriented towards easing parallel computation. Despite the substantial differences, in both cases the structure of the script depends on the structure of the pipeline.
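As a rough illustration of a script organized around parallel function calls, the sketch below maps one function over several inputs. It deliberately uses the standard library's `concurrent.futures` as a stand-in, so it is not Joblib's API (Joblib's own `Parallel` and `delayed` helpers follow a broadly similar map-a-function pattern). The `gc_content` function and the sequences are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def gc_content(sequence):
    """Return the fraction of G and C bases in a sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

# Made-up input data.
sequences = ["ACGT", "GGCC", "ATAT", "GATC"]

# Apply the same function to every item, two workers at a time.
# pool.map returns results in the same order as the inputs.
with ThreadPoolExecutor(max_workers=2) as pool:
    gc_values = list(pool.map(gc_content, sequences))
```

Here too the code has to be written as small functions that the tool can dispatch, which is what ties the script to the pipeline.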
Pipelines working at module level: Leaf
Leaf is a project published a couple of months ago by Francesco Napolitano at the University of Salerno, Italy. The key idea is to provide a system to declare a pipeline structure without changing the code. At the beginning of the module, you can add a decorator that builds a graphical scheme of the pipeline you have in mind. A simple visual language, the Leaf Graphical Language, is implemented to draw the dependencies, with the possibility of exporting the whole workflow as hypertext to share results. Leaf is distributed as a Python library and can be downloaded here.
The key differences between Ruffus and Leaf are shown in the following picture (Napolitano et al., 2010).
As is evident, Leaf works as a real wrapper, whereas Ruffus requires a specific script structure.
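The wrapper idea can be sketched in a few lines: the functions stay exactly as they were, and the pipeline is declared separately. The arrow notation and the `run_pipeline` helper below are hypothetical and written for this post only; they are not the Leaf Graphical Language nor Leaf's API. The sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Existing code, written with no pipeline in mind and left untouched.
def load():
    return ["ACGT", "AC", "TTGACA"]

def keep_long(seqs):
    return [s for s in seqs if len(s) >= 4]

def lengths(seqs):
    return [len(s) for s in seqs]

# Hypothetical notation: one "producer -> consumer" edge per line.
GRAPH = """
load -> keep_long
keep_long -> lengths
"""

def run_pipeline(graph_text, functions):
    """Run the named functions in dependency order and collect their results."""
    # Map each node to the set of nodes it depends on.
    deps = {}
    for line in graph_text.strip().splitlines():
        producer, consumer = (name.strip() for name in line.split("->"))
        deps.setdefault(producer, set())
        deps.setdefault(consumer, set()).add(producer)
    results = {}
    for name in TopologicalSorter(deps).static_order():
        # Upstream results are passed as arguments, in alphabetical order.
        inputs = [results[dep] for dep in sorted(deps[name])]
        results[name] = functions[name](*inputs)
    return results

results = run_pipeline(
    GRAPH, {"load": load, "keep_long": keep_long, "lengths": lengths}
)
```

Changing the workflow means editing the graph description only; the functions can be reused in another project as they are.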
Pipelines working at multi-file level
Pipelines can also be designed to interconnect different Python modules. In this case, the pipeline tool works at an “upper level”, standing above the modules. This is the philosophy underlying the most common pipeline creation software; I would like to mention Bpipe, one of the most recently developed (but there are quite a lot around). Of course, since scripts in any language can work with standard streams, we are slipping a bit away from the range of “Python-dedicated pipeline tools”, and learning good ol’ GNU make is still worthwhile if you are keen to work (or need to work) with pipelines at the module level.
I cannot really tell which one is best, since the choice depends on the project, the coder’s attitude and the specific needs. Besides, this post just runs through a few projects I happened to find, and more suggestions are very welcome.
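The standard-streams point can be shown with a small driver that chains separate processes, each reading from stdin and writing to stdout. This is only a sketch under simple assumptions: the two one-line programs stand in for separate script files, and `run_chain` is a hypothetical helper, not part of Bpipe or GNU make.

```python
import subprocess
import sys

# Stand-ins for two separate scripts; each reads stdin and writes stdout.
FILTER_STEP = (
    "import sys; "
    "sys.stdout.writelines(l for l in sys.stdin if len(l.strip()) >= 4)"
)
SORT_STEP = "import sys; sys.stdout.writelines(sorted(sys.stdin))"

def run_chain(steps, text):
    """Feed text through each step in turn, piping output to the next input."""
    for code in steps:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            input=text,
            capture_output=True,
            text=True,
            check=True,  # stop the chain if a step fails
        )
        text = completed.stdout
    return text

output = run_chain([FILTER_STEP, SORT_STEP], "TTGACA\nAC\nACGT\n")
```

Nothing in the driver cares that the steps are Python: any executable that respects the same stream convention could take their place.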