Tune up your pipeline with Luigi, the Python module for workflow management used at Spotify.

Yesterday I found an amazing audio commentary on Nature’s Arts and Books blog discussing a possible influence of music on the development of modern science. Among the many connections we may find between science and music, the one I am going to propose today is quite unexpected.

I understand that pipeline development is taking over the discussion on this blog, and that this could actually get quite boring. The reason is that I am facing my very first big project in genomics, and I need to explore the best solutions and strategies to manage complex workflows. So, as I have already discussed some Python solutions for pipelines and the NextFlow DSL project, let me take a few lines to talk about Luigi.

Luigi is a Python package to build complex pipelines of batch jobs. Long batch processes can be easily managed, and there are pre-designed templates. For instance, there is full support for MapReduce development in Python, which is the only language used in Luigi. The whole workflow can be monitored through a very useful graphical interface, which provides a graph representing the structure of your pipeline and the status of data processing. More information and downloads are available on the Luigi GitHub page.
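To give an idea of what "a pipeline of batch jobs" means in practice, here is a minimal sketch in plain Python. It does not import Luigi: the Task base class, the two example tasks and the build() function are all hypothetical names made up for this illustration. Luigi's own tasks follow a similar requires/output/run pattern, but please refer to the Luigi documentation for the real API.

```python
import os
import tempfile


class Task:
    """Hypothetical base class: a task is complete once its output file exists."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


class WriteReads(Task):
    """Toy first step: write three fake reads to a file."""

    def output(self):
        return os.path.join(self.workdir, "reads.txt")

    def run(self):
        with open(self.output(), "w") as handle:
            handle.write("ACGT\nGGCA\nTTAG\n")


class CountReads(Task):
    """Toy second step: count the reads produced by WriteReads."""

    def requires(self):
        return [WriteReads(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "count.txt")

    def run(self):
        with open(self.requires()[0].output()) as handle:
            n_reads = sum(1 for line in handle if line.strip())
        with open(self.output(), "w") as handle:
            handle.write(str(n_reads))


def build(task, executed=None):
    """Run upstream tasks first, then the task itself, skipping completed ones."""
    executed = [] if executed is None else executed
    if task.complete():
        return executed
    for dependency in task.requires():
        build(dependency, executed)
    task.run()
    executed.append(type(task).__name__)
    return executed


workdir = tempfile.mkdtemp()
first_run = build(CountReads(workdir))   # runs both tasks, upstream first
second_run = build(CountReads(workdir))  # outputs already exist, nothing to do
print(first_run, second_run)
```

The point of the design is that a task is skipped when its output already exists, so a long pipeline that is interrupted can be restarted without redoing the finished steps.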

How is this related to music? Well, the picture above displays a romantic view of what music was in the past. Nowadays, everything is managed as a big data thing: tunes and chords are transformed into bits, with a cruel disregard for any romance. Luigi was developed at Spotify, the very famous (and my favourite) music application. The main developer, Erik Bernhardsson, is a NYC-based computer scientist who headed the Machine Learning Division at Spotify for six years.

So, we can actually agree with Kerri Smith’s point in Nature: music influences scientific production. Sometimes it is a matter of cultural environment, sometimes it is a matter of data science.

Post-pub integration. I was informed on Twitter about this page with examples of Luigi usage. I think it is worth mentioning. Thanks to @smllmp


MapReduce: using the Google bigdata algorithm in bioinformatics

With this post, I start exploring the possible contributions Google may provide to bioinformatics, a topic I have been meaning to dig into for a while. The rise of big data in biology is causing Big G to show a growing interest in this field, and very soon the Mountain View giant may increase its investments in Life Sciences. This is definitely desirable, since the deep big data expertise they have at Google could significantly boost bioinformatics research.

Today, we’ll try to understand how Google’s most fundamental big data analysis algorithm works. MapReduce is a programming model developed and brought to fame by Google. It is used to simplify the processing of huge amounts of data. Its main purpose in commercial data analysis is to provide special-purpose solutions to compute big quantities of data. Companies are interested in analysing a lot of data very quickly, to run their campaigns better and to understand markets.

How does MapReduce work?

MapReduce allows computations to be executed in parallel on clusters of workstations, either on unstructured data or within a database. The name “MapReduce” derives from two functions implemented in many languages, map() and reduce(). In Python, for instance, map() applies a function to every item of an iterable and returns the results (as a list in Python 2, as a lazy iterator in Python 3). If additional iterables are passed, the function must take that many arguments and is applied to the items from all the iterables in parallel. The reduce() function (functools.reduce() in Python 3), instead, applies a function of two arguments cumulatively to the items of an iterable, from left to right, so as to reduce the iterable to a single value. Whereas map() assigns a value to every object in a list, reduce() combines these values by iteration into a single value. A MapReduce program is thus composed of three fundamental steps: the map step, the shuffle step and the reduce step.
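The two built-in functions can be tried directly in a Python 3 session. This is only a small sketch: the list of read lengths is made-up example data.

```python
from functools import reduce  # reduce() lives in functools in Python 3

read_lengths = [36, 50, 75, 100]  # made-up example data

# map() applies a function to every item; list() materialises the result,
# because in Python 3 map() returns a lazy iterator.
doubled = list(map(lambda n: n * 2, read_lengths))

# With two iterables, the function takes two arguments, one from each.
pair_sums = list(map(lambda a, b: a + b, read_lengths, [1, 2, 3, 4]))

# reduce() folds the items from left to right into a single value:
# ((36 + 50) + 75) + 100
total = reduce(lambda acc, n: acc + n, read_lengths)

print(doubled)
print(pair_sums)
print(total)
```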

In the Map Step, the master node imports the input data, splits them into small subsets, and distributes the work to the slave nodes. Each slave node produces the intermediate result of a map() function, in the form of [key,value] pairs, which are saved to a distributed file. The output file location is reported to the master at the end of the mapping phase.

In the Shuffle Step, the master node collects the answers from the slave nodes, combines the [key,value] pairs into lists of values sharing the same key, and sorts them by key. Sorting can be lexicographical, increasing or user-defined.

In the Reduce Step, the reduction function is applied to each list of values, producing one summarised result per key.


Although it may appear quite tricky to understand, it is anything but complicated. Data are first labeled with a mapping function, then sorted, and finally summarised by a reducing procedure. The *magic* is in the distribution of the work over many nodes in order to speed up big data processing. The output file can be processed again in a new MapReduce cycle.
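The three steps can be mimicked on a single machine with a few lines of Python. This is only a sketch of the logic, not a distributed implementation: there is no master node, no slave nodes and no distributed file, and the function names (map_step, shuffle_step, reduce_step) and the two toy reads are invented for the example. Here the job is to count the 2-mers found in a couple of short sequences.

```python
from collections import defaultdict
from functools import reduce


def map_step(read, k=2):
    """Emit a (key, value) pair for each k-mer found in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_step(pairs):
    """Group the values sharing the same key, then sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_step(key, values):
    """Summarise the list of values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


reads = ["ACGA", "CGAT"]  # toy input; a real job would split millions of reads

# In a real cluster each read (or block of reads) would be mapped on a different node.
mapped = [pair for read in reads for pair in map_step(read)]
shuffled = shuffle_step(mapped)
counts = dict(reduce_step(key, values) for key, values in shuffled)

print(shuffled)
print(counts)
```

Since every call to map_step() only looks at its own read, and every call to reduce_step() only looks at its own key, both steps can be spread over many nodes without the nodes having to talk to each other.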

Surfing around YouTube, I really enjoyed this video by Jesse Anderson, which explains how MapReduce works using playing cards. Simple, effective and brilliant indeed.

How could this help in bioinformatics?

As Lin Dai and collaborators pointed out in Biology Direct in 2012, bioinformatics is moving from in-house computing infrastructure into utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. As a matter of fact, cloud computing is getting more and more important in genomics, and is expected to become fundamental in the long run.

Some MapReduce implementations dedicated to sequencing analyses are already available in the literature. Michael Schatz’s CloudBurst is a fair example. Published in 2009, CloudBurst builds on the open-source MapReduce package Hadoop to reduce the running time from hours to minutes for typical jobs involving the mapping of millions of short reads to the human genome.

More recently, many methods have been developed to make cloud technology available for bioinformatics analyses. I will just rattle off three examples, but there are plenty of these software solutions out there. Galaxy Cloud, software for NGS data analysis, is most likely the best known. In protein structure and function prediction, the PredictProtein Debian package uses cloud technologies as well, and I found this cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples very interesting indeed.

The list could be far longer, and a more detailed search of the literature would return plenty of cloud-based methods. In most of them, MapReduce is the reference programming model, and understanding its basic functioning could turn out to be as important as understanding classic bioinformatics algorithms.

Some resources to learn and use MapReduce.

Hadoop is a MapReduce package developed by Apache and distributed under the Apache free-software licence. I take it as the reference distribution, but it is not the only one. You can explore the official website and download a very detailed PDF guide. The tutorial is oriented to the Java language, but you can easily find further documentation on running MapReduce in other languages.

Finally, you may also consider watching a video tutorial. I had a look at this one on Vimeo, and it seems quite complete, but more will come up if you search around.

Will MapReduce replace SQL databases?

Someone may have noticed that MapReduce allows us to sort and process data, and that structuring the data is not essential. Asking whether this method could make SQL databases obsolete is a fair question, and still quite an open one. As far as I can understand from surfing around the web, this is most likely not the case. In 2009, some researchers from American universities argued that MapReduce lacked many key features, and a bit of criticism has accompanied MapReduce over the years. MapReduce is a great method to process data quickly, but structured query languages still provide the best tools for storing them. That is why you will find many solutions for applying MapReduce to SQL databases.

I take this as a further demonstration that competition is often pointless, and that the best comes from collaboration. If possible, we’d better take the best of everything we have available.

Book now for hack.summit(), a virtual conference for developers.

Next week, December 1st-4th, the big names of computing will be online to discuss coding and software development. The hack.summit() is an initiative aimed at supporting non-profit coding, and it is promoted by the online consulting service hack.hands(). On their website, you can scroll through the list of participants, and the names are truly impressive.

Google Glass creator Tom Chi, BitTorrent inventor Bram Cohen, Brian Fox, who created the GNU Shell, Python Software Foundation director Alex Gaynor and even Microsoft Executive Vice-President Qi Lu will be online and available to talk with attendees. You’ll have the chance to interact with the speakers and ask them questions, getting in touch with an impressive cohort of top-level developers.

Registration is quite simple. You can decide whether to donate a small amount or use the Pay With a Tweet system. Revenues will be used to support nonprofit projects. The initiative’s goals are in fact to raise money for coding nonprofits, educate programmers of all languages and skill-sets, and encourage mentorship among software developers.

I will definitely attend this, hoping to hear some good tips from one of the people behind the foundation supporting my favourite language, Python.

So far, the initiative has broken a new record, gathering more than 36k participants, and registrations are still growing. So, do not miss this chance to join this amazing meeting and meet your programming idol.

Click here to explore and register for hack.summit().

Click here to view the hack.hands() website.

An exceptional YouTube playlist to start with machine learning.

A “Machine Learning” algorithm is defined as an algorithm able to change its structure and functioning according to the data submitted. In other words, a machine learning algorithm is capable of learning from data and of being refined after implementation. Nowadays, many structural biology (e.g. psi-pred, jpred), bioinformatics (HMM-based software) and systems biology (network analysis and db comparison) algorithms rely on machine learning methods, and an insight into the basic principles underlying them is very useful to all those who work on software development. Unfortunately, an extensive study of such an advanced topic may be pretty tough for someone with a biological background.

Surfing on YouTube, I was really pleased to find mathematicalmonk’s channel. Actually, I have no clue who this guy is, but I am pretty sure that he did a good job with his tutorials. Along with other advanced mathematical topics, Machine Learning is explained in a playlist of 160 videos, where the author explains the basic concepts of Machine Learning with simplicity and great clarity. The course goes through all the major topics needed for an introduction to machine learning methods, and it’s a perfect starting point for your exploration of machine learning.

Above this post, you can play the introductory video to get an idea of the topic and the kind of lessons on offer. The whole playlist can be found by following this link.

Software Carpentry. A great resource to learn scientific computing.

Software Carpentry is a volunteer organization founded in 1998, whose mission is to enhance scientists’ productivity by teaching them basic computing skills. In January 2012, Software Carpentry joined the Mozilla galaxy, and it became part of Mozilla Science Lab in June 2013. The organization provides workshops in several parts of the world, as well as free online lessons. Their focus is on teaching scientists the basics of some very useful in silico methods, such as structured programming (Python and R), numerical programming (NumPy), version control (Git, Mercurial and SVN), unit testing and automated tasks in Unix, regular expressions and relational databases (SQL).

Most likely you have heard about them, but I will still take some time to report on their amazing work. After long experience in this field, the guys at Software Carpentry have developed a very simple deal with their customers: short courses at an affordable price. You pay for their accommodation, put yourself in the mood to become more efficient, and dedicate a couple of days to learning very useful scientific computing stuff. As they state on their website, courses must be short to better meet the needs of very busy researchers, and workshops are the most efficient way to learn. Efficiency is actually the thing they care about the most, and you may want to consider its importance in your own work.

The truly amazing thing is that they really operate worldwide. So, I think you may want to browse their website and get to know this long-established and very useful project.


Ubuntu One is a goner. But Ubuntu still rocks.

Today I tried out the new version of Ubuntu, the 14.04 LTS stable release “Trusty Tahr”. I am still having a look at the whole thing, and I guess that another review of this Ubuntu version would be a pointless addition to the global information entropy. I’d better redirect you to some very interesting reviews out there, such as this one on ZDNet, or this other one on OMG!Ubuntu!. It is stable, fast and shows no big changes from previous releases, confirming the path that the British software company developing Ubuntu has been following for a while: making Ubuntu easier and better integrated with different devices.

The really bad news for me came from the Ubuntu One project. I was trying to understand how in the world I couldn’t find the Ubuntu One application on my system, and why I couldn’t even install it from the Software Center. After a swift search on the web, I found the sad answer. On the official Canonical Blog, Jane Silber (Canonical CEO since 2010) announces that the Ubuntu One file sharing services will be discontinued with effect from June 2014. Ubuntu One is a goner. This made me a little sad because, like many other Ubuntu One premium users, I enjoyed the speed and affordability of this file sync service. I remember when, during a cloudburst, my lab computer’s breakdown made me understand the importance of cloud technologies. Ubuntu One saved my whole thesis work.

I will have to surrender to Dropbox. Sad story.

Chaos theory and theoretical modeling of Gut Microbiota.

Some old papers are just like the old pictures you find when you rummage in the garret on a rainy afternoon. Cluttered in a trunk just to be forgotten, they tell you about days gone by, but sometimes they can reveal facts you just couldn’t suspect. In the last decade, research on the symbiotic interactions between mammals and bacteria has made impressive steps forward. The so-called microbiota has been extensively studied, revealing its fundamental role in immunity, aging and several pathogenic processes. The latest intriguing hypothesis links the composition of the microbiota with autism, showing how crucial the role of microbes in our body can be.

For my very last exam, I had to study the composition of the microbiota and its role in pathogenesis and immunity. We were asked to choose a paper and make a small presentation. And, as on that rainy afternoon in the old garret in front of that old trunk, I found a picture from the past: a paper hailing from the 90s, when no one had a clear idea of what those bacteria were doing on our body’s external surfaces. So, let me show you this old, but still very interesting picture.

The title immediately caught my attention:

Nonlinear dynamics, chaos-theory, and the “sciences of complexity”: their relevance to the study of the interaction between host and microflora

The article (PDF), dating from 1997, is written by M.H.F. Wilkinson, a computer scientist from the University of Groningen, in Holland. After a concise and really clear explanation of non-linear dynamics, chaos theory and complex systems, the paper proposes that gut and microbiota constitute a pseudo-chaotic complex system, thus showing non-linear dynamics. Based on this assumption, a computer model of the microflora was built. The “organisms” are divided into aerobes and anaerobes and made to compete for space and nutrients. IMHO, the result is quite striking. Several simulations of gut colonization confirm the interdependency between anaerobes and facultative anaerobes, but the most intriguing result is presented here:


If you check out the recently proposed dynamics for gut colonization by microbiota, from birth through the very first weeks of life, you will find a very similar path. Facultative anaerobes, such as Firmicutes, take the stage first and prepare the ground for strict anaerobes, such as Bacteroides, which take over later to become the biggest population. Intriguingly, the theoretical modeling predicted this years before the experimental approaches.

There is another important fact to highlight. This work hits the mark when it points out the semi-chaotic behavior of the system constituted by the intestine and the microbial community. As mentioned by the author, a semi-chaotic organization is really suitable for those systems that need a compromise between plasticity and stability, such as biological systems subject to evolutionary constraints.
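For readers who have never played with a non-linear system, here is a tiny Python sketch of the difference between stable and chaotic behavior. Please note that this is not the model from Wilkinson's paper: it is the classic logistic map, a textbook toy that I am swapping in only for illustration, and the function name and parameter values are my own choices. A single "population" x between 0 and 1 is updated as x = r * x * (1 - x), where r is a growth parameter.

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# With r = 2.5 the population settles on a fixed point (1 - 1/r = 0.6).
stable = logistic_trajectory(2.5, 0.2, 100)

# With r = 4.0 the map is chaotic: two runs whose starting points differ
# by one part in a billion end up on completely different paths.
chaotic_a = logistic_trajectory(4.0, 0.2, 100)
chaotic_b = logistic_trajectory(4.0, 0.2 + 1e-9, 100)
max_gap = max(abs(a - b) for a, b in zip(chaotic_a, chaotic_b))

print(stable[-1])
print(max_gap)
```

The same simple rule gives either full stability or full unpredictability depending on one parameter, which is why a system sitting in between can offer a compromise between the two.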

An old picture, but still really interesting. A memory from a past in which there was no mass sequencing, big data, proteomics or systems biology. To understand the complexity of biological systems, scholars made use of theoretical models developed in physics and math, such as chaos theory or fractal geometry. This theoretical modeling step is quite often underestimated by genomic scientists and systems biologists, and wrongly forgotten in the post-genomic era.

Old pictures can be interesting.