Tag Archives: bigdata

Tune up your pipeline with Luigi, the Python workflow manager used at Spotify.

Yesterday I found an amazing audio comment on Nature’s Arts and Books blog discussing the possible influence of music on the development of modern science. Among the many connections we may find between science and music, the one I am going to propose today is quite unexpected.

I understand that pipeline development is taking over the discussion on this blog, and that this could get quite boring. The reason is that I am facing my very first big project in genomics, and I need to explore the best solutions and strategies to manage complex workflows. So, as I have already discussed some Python solutions for pipelines and the Nextflow DSL project, let me take a few lines to talk about Luigi.

Luigi is a Python package for building complex pipelines of batch jobs. Long-running batch processes can be managed easily, and pre-designed templates are available. For instance, there is full support for MapReduce development in Python, which is the only language used in Luigi. The whole workflow can be monitored through a very useful graphical interface, which shows a graph of the structure of your pipeline and the status of data processing. More information and downloads are available on the Luigi GitHub page.
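To give a feel for what "a pipeline of batch jobs" means, here is a minimal sketch in plain Python. It does not use Luigi: the Task class, its requires(), is_complete() and run() methods, the build() function and the two toy jobs are all hypothetical names made up for this illustration. Luigi tasks declare their dependencies and their work in a similar spirit, but please refer to the Luigi documentation for the real API.

```python
# Toy workflow engine in plain Python (NOT Luigi's API; all names are hypothetical).
class Task:
    """A batch job that knows its dependencies and whether it is already done."""

    def __init__(self, store):
        self.store = store  # shared dict standing in for output files on disk

    def requires(self):
        return []  # tasks that must finish before this one

    def is_complete(self):
        return type(self).__name__ in self.store

    def run(self):
        raise NotImplementedError


class ExtractReads(Task):
    def run(self):
        self.store["ExtractReads"] = ["ACGT", "ACGA", "TTGA"]


class CountReads(Task):
    def requires(self):
        return [ExtractReads(self.store)]

    def run(self):
        self.store["CountReads"] = len(self.store["ExtractReads"])


def build(task, log):
    """Run dependencies first, skipping any task whose output already exists."""
    if task.is_complete():
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)


store, log = {}, []
build(CountReads(store), log)
print(log, store["CountReads"])
```

Asking for the last job is enough: the engine walks back through the dependencies, runs what is missing, and leaves finished jobs alone if you call build() again.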

How is this related to music? Well, the picture above displays a romantic view of what music was in the past. Nowadays, everything is handled as big data: tunes and chords are turned into bits, with a cruel disregard for any romance. Luigi was developed at Spotify, the very famous (and my favourite) music application. The main developer, Erik Bernhardsson, is a NYC-based computer scientist who headed the Machine Learning Division at Spotify for six years.

So, we can actually agree with Kerri Smith’s point in Nature: music influences scientific production. Sometimes it is a matter of cultural environment, sometimes it is a matter of data science.

Post-pub integration. I was informed on Twitter about this page with examples of Luigi usage. I think it is worth mentioning. Thanks to @smllmp.

Nextflow: a DSL for parallel and scalable computational pipelines.

I return to pipeline creation tools because I was told about the Nextflow project, a DSL for creating parallel and scalable computational pipelines. As already discussed in a recent post, we can sort pipeline solutions according to how they deal with your code. Some connect functions, some require dedicated modules, and others connect different files by integrating standard I/O.

If we think about it, most people working in bioinformatics end up mixing different things. Usually, a scripting language is used for data mining and R takes over for statistical analysis, but other languages or tools may be needed. Thus, any solution that can integrate different languages is warmly recommended in most cases.

The pretty amazing thing about Nextflow is that the authors took the time to implement a real Domain-Specific Language (DSL), namely a programming language dedicated to solving specific tasks. To better explain, languages such as R, SQL or Mathematica are regarded as DSLs too. Of course, this bodes well for the extensibility and power of the language.

Nextflow is provided as an easy-to-install package, and Java 7+ is the only required dependency. The syntax is simple, and so is scaling: you can develop on your laptop, run on the grid, and scale out to the cloud with no modifications to your code. More intriguingly, the whole thing is designed to work with message passing, to better deal with parallel computing.
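The message-passing idea can be illustrated with a small sketch. It is written in Python rather than in the Nextflow DSL, and it says nothing about how Nextflow is actually implemented: the process() helper, the DONE sentinel and the two toy steps are hypothetical names chosen for this example. Each step runs in its own thread and talks to the others only through queues, so no step needs to know where its input comes from.

```python
import queue
import threading

DONE = object()  # sentinel message marking the end of a channel


def process(func, inbox, outbox):
    """Start a worker that applies func to each message and forwards the result."""
    def worker():
        while True:
            message = inbox.get()
            if message is DONE:
                outbox.put(DONE)  # propagate end-of-stream downstream
                return
            outbox.put(func(message))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


raw, cleaned, lengths = queue.Queue(), queue.Queue(), queue.Queue()

# Two steps chained by channels: strip whitespace, then measure sequence length.
workers = [
    process(str.strip, raw, cleaned),
    process(len, cleaned, lengths),
]

for sequence in [" ACGT\n", "TTGACA ", "GG"]:
    raw.put(sequence)
raw.put(DONE)

for thread in workers:
    thread.join()

# Drain the last channel until the end-of-stream sentinel shows up.
results = []
while True:
    item = lengths.get()
    if item is DONE:
        break
    results.append(item)

print(results)
```

The second step starts working as soon as the first one emits a message, without waiting for the whole input to be processed, which is the point of designing a pipeline around messages.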

The project is developed at the CRG Comparative Genomics group in Barcelona (the guys who created and maintain the T-COFFEE suite), and is headed by Paolo Di Tommaso.

MapReduce: using the Google bigdata algorithm in bioinformatics

With this post, I start exploring the possible contributions Google may make to bioinformatics, a topic I have been meaning to dig into for a while. The rise of big data in biology is causing Big G to show a growing interest in this field, and very soon the Mountain View giant may increase its investments in Life Sciences. This is definitely desirable, since the great expertise in big data they have at Google could significantly boost bioinformatics research.

Today, we’ll try to understand how Google’s most fundamental big data analysis algorithm works. MapReduce is a programming model developed and brought to fame by Google. It is used to simplify the processing of huge amounts of data. The main goal in commercial data analysis is to provide special-purpose solutions for computing on large quantities of data. Companies are interested in analysing a lot of data very quickly to improve their campaigns and to understand markets.

How does MapReduce work?

MapReduce allows computations on clusters of workstations, executed in parallel on unstructured data or within a database. The name “MapReduce” derives from two functions implemented in many languages, map() and reduce(). In Python, for instance, map() applies a function to every item of an iterable and returns the results (as a list in Python 2, as a lazy iterator in Python 3). If additional iterable arguments are passed, the function must take that many arguments and is applied to the items from all iterables in parallel. The reduce() function (which lives in the functools module in Python 3) instead applies a function of two arguments cumulatively to the items of an iterable, from left to right, so as to reduce the iterable to a single value. Whereas map() transforms each item of a list independently, reduce() combines these values by iteration, reducing them to a single value. A MapReduce program is thus composed of three fundamental steps: the map step, the shuffle step and the reduce step.
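A quick example of the two Python functions may help. This is a minimal sketch for Python 3, where reduce() has to be imported from functools; the sample sequences and numbers are made up for the illustration.

```python
from functools import reduce  # reduce() is a builtin only in Python 2
from operator import add

# One function, one iterable: the function is applied to every item.
lengths = list(map(len, ["ACGT", "TTGACA", "GG"]))

# Two iterables: the function receives one item from each, pair by pair.
sums = list(map(add, [1, 2, 3], [10, 20, 30]))

# Cumulative application from left to right: ((4 + 6) + 2).
total = reduce(add, lengths)

print(lengths, sums, total)
```

The list() calls are only there because map() is lazy in Python 3 and computes nothing until the results are requested.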

In the Map Step, the master node imports the input data, splits them into small subsets, and distributes the work to the slave nodes. Each slave node produces the intermediate result of a map() function, in the form of [key,value] pairs, which are saved to a distributed file. The output file location is reported to the master at the end of the mapping phase.

In the Shuffle Step, the master node collects the answers from the slave nodes, combines the [key,value] pairs into lists of values sharing the same key, and sorts them by key. Sorting can be lexicographical, increasing or user-defined.

In the Reduce Step, the reduction function is applied to the list of values of each key, producing the final result for that key.


Although it may appear quite tricky to understand, it’s anything but complicated. Data are first labeled with a mapping function, then sorted, and finally summarised by a reducing procedure. The *magic* is in distributing the work across many nodes in order to speed up big data processing. The output file can be processed again in a new MapReduce cycle.
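The three steps can be sketched in a few lines of Python. This is a single-machine toy version of the classic word count, not a distributed implementation: the function names map_step(), shuffle_step() and reduce_step() and the sample sentences are my own choices for the illustration.

```python
from collections import defaultdict
from functools import reduce


def map_step(chunk):
    """Emit a (key, value) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_step(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_step(key, values):
    """Collapse the list of values of one key into a single value."""
    return key, reduce(lambda a, b: a + b, values)


chunks = ["the cat sat", "the dog sat", "the end"]  # one chunk per worker node

mapped = [pair for chunk in chunks for pair in map_step(chunk)]
shuffled = shuffle_step(mapped)
counts = [reduce_step(key, values) for key, values in shuffled]

print(counts)
```

On a real cluster, each call to map_step() would run on a different node, and the reduce_step() calls could be spread across nodes as well, since every key is handled independently of the others.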

Surfing around YouTube, I really enjoyed this video by Jesse Anderson, which explains how MapReduce works using playing cards. Simple, effective and brilliant indeed.

How could this help in bioinformatics?

As Lin Dai and collaborators pointed out in Biology Direct in 2012, bioinformatics is moving from in-house computing infrastructure to utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. As a matter of fact, cloud computing is becoming more and more important in genomics, and is expected to become fundamental in the long run.

Some MapReduce implementations dedicated to sequencing analyses are already available in the literature. Michael Schatz’s CloudBurst is a fair example. Published in 2008, CloudBurst is based on the open-source MapReduce package Hadoop and reduces the running time from hours to minutes for typical jobs involving the mapping of millions of short reads to the human genome.
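To see why read mapping fits the MapReduce model at all, here is a toy sketch in Python. It is not CloudBurst code and makes strong simplifying assumptions: seeds are exact k-mers, only the first k-mer of each read is used, and the reduce step just reports candidate positions without extending or verifying the alignment. All function names, the seed length K and the sequences are hypothetical.

```python
from collections import defaultdict

K = 4  # seed length (arbitrary choice for this toy example)


def map_reference(reference):
    """Emit (k-mer, ('ref', position)) for every k-mer of the reference."""
    return [(reference[i:i + K], ("ref", i)) for i in range(len(reference) - K + 1)]


def map_read(name, read):
    """Emit (k-mer, ('read', name)) for the first k-mer of a read (its seed)."""
    return [(read[:K], ("read", name))]


def shuffle(pairs):
    """Group all values sharing the same k-mer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_seed(values):
    """Pair every read with every reference position sharing the same seed."""
    positions = [v[1] for v in values if v[0] == "ref"]
    names = [v[1] for v in values if v[0] == "read"]
    return [(name, pos) for name in names for pos in positions]


reference = "ACGTTGACCA"
reads = {"r1": "GTTGAC", "r2": "TGACCA", "r3": "AAAAAA"}

pairs = map_reference(reference)
for name, read in reads.items():
    pairs += map_read(name, read)

hits = sorted(hit for values in shuffle(pairs).values() for hit in reduce_seed(values))
print(hits)
```

The key point is that the k-mer acts as the key: reads and reference positions sharing a seed end up in the same group, so each group can be processed on a different node.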

More recently, many methods have been developed to make cloud technology available for bioinformatics analyses. I will just rattle off three examples, but there are plenty of these software solutions out there. Galaxy Cloud, software for NGS data analysis, is most likely the best known. In protein structure and function prediction, the PredictProtein Debian package uses cloud technologies as well, and I found this cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples very interesting indeed.

The list could be far longer, and a more thorough search of the literature would return plenty of cloud-based methods. In most of them, MapReduce is the reference programming model, and understanding its basic functioning could turn out to be as important as understanding classic bioinformatics algorithms.

Some resources to learn and use MapReduce.

Hadoop is a MapReduce package developed by Apache and distributed under the Apache free-software licence. I take it as the reference distribution, but it is not the only one. You can explore the official website and download a very detailed PDF guide. The tutorial is oriented towards Java, but you can easily find further documentation on running MapReduce in other languages.

Finally, you may also consider watching a video tutorial. I had a look at this one on Vimeo, and it seems quite complete, but more will come up if you search around.

Will MapReduce replace SQL databases?

Someone may have noticed that MapReduce can sort and process data, and that structuring the data is not essential. Asking whether this method could make SQL databases obsolete is a fair question, and still a rather open one. As far as I can tell from surfing around the web, this is most likely not the case. In 2009, some researchers from American universities argued that MapReduce lacked many key features, and a fair amount of criticism has accompanied MapReduce over the years. MapReduce is a great method for processing data quickly, but structured query languages still provide the best tools for storing them. That is why you will find many solutions for applying MapReduce to SQL databases.
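The two worlds are closer than they may look. As a small illustration, the sketch below computes the same aggregate twice: once with a SQL GROUP BY query, using the sqlite3 module from the Python standard library, and once in MapReduce style. The table name, the column names and the numbers are made up for the example, and nothing here is distributed.

```python
import sqlite3
from collections import defaultdict

rows = [("chr1", 12), ("chr2", 7), ("chr1", 5), ("chr2", 1), ("chrX", 4)]

# SQL version: the database groups and aggregates declaratively.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE hits (chromosome TEXT, reads INTEGER)")
connection.executemany("INSERT INTO hits VALUES (?, ?)", rows)
sql_result = connection.execute(
    "SELECT chromosome, SUM(reads) FROM hits GROUP BY chromosome ORDER BY chromosome"
).fetchall()
connection.close()

# MapReduce-style version: emit (key, value) pairs, group by key, reduce with sum().
groups = defaultdict(list)
for chromosome, reads in rows:  # map and shuffle in one pass
    groups[chromosome].append(reads)
mapreduce_result = sorted((key, sum(values)) for key, values in groups.items())

print(sql_result == mapreduce_result, sql_result)
```

GROUP BY plays the role of the shuffle step and SUM() the role of the reduce step, which gives an idea of why combining the two approaches is a natural thing to do.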

I take this as a further demonstration that competition is often pointless, and that the best comes from collaboration. If possible, we had better take the best of everything we have available.