Tag Archives: pipeline

Tune up your pipeline with Luigi, the Python module to manage workflow used in Spotify.

Yesterday I have found an amazing audio comment on Nature’s Arts and Books blog that was discussing a possible influence of music on the development of modern science. Among the many connections we may find between science and music, the one I am going to propose today turns out quite unexpected.

I understand that pipeline development is overtaking the discussion in this blog, and this could actually result quite boring. That’s because I have to face my very first big project in genomics, and I am in the need to explore the best solutions and strategies to manage complex workflows. So, as I already discussed some Python solutions for pipelines and the NextFlow DSL Project, let me take some lines to talk about Luigi.

Luigi is a Python package to build complex pipelines of batch jobs. Long batch processes can be easily managed, and there are pre-designed templates. For instance, there is a full support for MapReduce development in Python, that is the only language used in Luigi. All the workflow can be monitored with a very useful graphical interface, providing a graph representing the structure of your pipeline and the status of data processing. More information and downloads are available on the Luigi GitHub page.

How is this related with music? Well, the picture above displays a romantic view of what music was in the past. Nowadays, anything is managed as a big data thing, tunes and chords are transformed into bit, with a cruel disregard for any romance. Luigi was developed in the context of the very famous (and my favourite) music application Spotify. The main developer, Erik Bernhardsson, is a NYC-based computer scientist who headed the Machine Learning Division at Spotify for six years long.

So, we can actually agree with Kerri Smith’s point on Nature: music influences scientific production. Sometimes is a matter of cultural environment, sometimes is a matter of data science.

Post-pub integration. I was informed on twitter about this page with examples of Luigi usage. Think is worth to be mentioned. Thanks to @smllmp

Nextflow: a DSL for parallel and scalable computational pipelines.

I return on pipeline creation tools because I was warned about Nextflow project, a DSL for parallel and scalable computational pipeline creation. As already discussed in a recent post, we can sort pipeline solutions according to how they deal with your code. Some of them are able to connect functions, some of them require dedicated modules, and others are able to connect different files by integrating the standard I/O.

If we think it up, the most of the people working in bioinformatics will end up mixing different things. Usually, a scripting language is used for data mining, and R comes over for statistical analysis, but other languages or tools may be needed. Thus, any solution that is able to integrate different languages, it is warmly suggested in the most of the cases.

The pretty amazing thing about Nextflow, is that the authors took their time to implement a real Domain Specific Language, namely a programming language dedicated to solve specific tasks. To better explain, languages as R, SQL o Mathematica are defined as DSL too. Of course, this bodes well in the extensibility and power of this language.

Nextflow is provided as a simple- installation package, and Java 7+ is the only required dependency. Syntax is simple as well as scalability, since you can develop on your laptop, run in the grid, and scale-out to the cloud with no modifications to your code needed. More intriguingly, the whole thing is designed to work with message passing to better deal with a parallel computing approach.

The project is developed at the CRG- Comparative Genomics group in Barcelona (the guys who created and maintain the T-COFFEE suite), and is headed by Paolo di Tommaso.

Software for pipeline creation with Python.

We already had the chance to discuss about the importance of reproduciblity in computational research, and to comment some good practices to improve it. As we read the Ten Simple Rules that Sandve and co-workers proposed the last October, we cannot help but underline the importance of pipelines. A correct pipeline- based approach will prevent researchers from potentially harmful manual interventions on data, and to get them to have a correct tracking of their workflow. Pipelines are just perfect to deal with the usually huge bioinformatics tasks, that require a big amount of calculation, and several sorting and filtering steps. Despite the usual controversies about this, we can tell that Python is becoming the prime choice of many bioinformaticians, because of its powerful features, dynamic and populated community, and ease of use. That is why I think it is fair to discuss about a couple of python-based pipeline creation tools.

The first point, is to take stock of which main features we should ask to a python pipeline creation tool. Of course, anyone will appreciate a lightweight system for obvious reasons, and things like a simple syntax, scalability and the possibility to manage complex workflows with ease will be very welcome. Another aspect I want to mention, is the possibility to include previously created code into a pipeline system. There are two main reasons for this. First, functions and classes may be re-used in different projects, and having a pipeline system working as a “wrapper” around your code may ease this. Second, many python beginners are not really oriented to the pipeline-philosophy, as python works great with a module-based approach (even if  not exclusively).

The different solutions I got to find around, can be distinguished on the basis of their relationship with the code, and according on how they thread into a Python script. Let’s assume, for simplicity, that a program is made up by functions that are included into modules, and that several modules can constitute the whole thing. We have thus identified three concentric levels: a function level, a module level, and a multi-file level. Pipeline systems for Python are basically modules providing the possibility to include a simple-syntax code in your scripts to manage the data flow. Several commands, usually defined as decorators, are thus formalized to sort the data flow into an organized pipeline. So, we will discuss how different solution will work at different levels.

Pipelines working at function- level: Ruffus and Joblib

Published in 2010 on BMC Bioinformatics by Leo Goodstadt at Oxford University, Ruffus is available on its official website, where you can find a complete documentation and tutorials. As you can notice, Ruffus works by connecting consecutive input/output files, and imposes the developer to write the functions in the code following the order of the dataflow. Any function must be preceded by the @follows decorator, indicating the flow direction, and the @files, that calls the in/out files. That is why I mention it as a pipeline system working at a “funcion level”, as the internal module structure of a script depends on the structure of the pipeline.

This approach is someway related to the one implemented into Joblib, a python pipeline system that is mostly oriented to ease parallel computation. Despite the substantial differences, the structure of the script depends on the structure of the pipeline in both cases.

Pipelines working at module- level: Leaf

Leaf is a project published a couple of months ago by Francesco Napolitano, at the University of Salerno in Italy. The key-idea, is to provide a system to declare a pipeline structure without changing the code. At the beginning of the module, it is possible to enclose a decorator to build a graphical scheme of the pipeline you have in mind. A simple visual language, the Leaf Graphical Language, is implemented to graphically build the dependencies, with the possibility to export all the workflow as hypertext to share results. Leaf comes out as python library and can be downloaded here.

The key differences between Ruffus and Leaf are shown in the following picture (Napolitano et al., 2010).


As evident, Leaf works as a real wrapper, whereas Ruffus requires a specific script structure.

Pipelines working at multi-file level.

Pipelines can be designed to interconnect different python modules. In this case, the pipeline tool will work at an “upper level”, standing above different modules. It is the philosophy underlying the most common pipeline creation software, and I would like to mention Bpipe, that is one of the most recently developed (but there are quite a lot around). Of course, as scripts in any language can work with standard streams, we are slipping a bit away from the range of “python-dedicated pipeline tools”, and learning the good ol’ GNU-make is still worthwile if you are keen to work (or in the need of working) with pipelines at a module-level.

I cannot really tell which is the best one, since the choice  will depend on the project, the coder’s attitude and the specific needs. Furthermore, this post is just rattling off some few projects I got to find around, and more suggestions will be just welcome.