Tune up your pipeline with Luigi, the Python workflow management module used at Spotify.

Yesterday I found an amazing audio commentary on Nature’s Arts and Books blog that discussed a possible influence of music on the development of modern science. Among the many connections we may find between science and music, the one I am going to propose today is quite unexpected.

I understand that pipeline development is taking over the discussion on this blog, and that this could actually get quite boring. That’s because I am facing my very first big project in genomics, and I need to explore the best solutions and strategies to manage complex workflows. So, as I have already discussed some Python solutions for pipelines and the Nextflow DSL project, let me take a few lines to talk about Luigi.

Luigi is a Python package for building complex pipelines of batch jobs. Long-running batch processes can be managed easily, and there are pre-designed templates. For instance, there is full support for MapReduce development in Python, which is the only language used in Luigi. The whole workflow can be monitored through a very useful graphical interface, which provides a graph representing the structure of your pipeline and the status of data processing. More information and downloads are available on the Luigi GitHub page.
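To give a feeling of what a pipeline of batch jobs looks like, here is a minimal sketch in plain Python. It is not Luigi code: it only imitates the general idea, as I understand it, of tasks that declare what they require, what they output and how they run. The class names (Task, WriteReads, CountReads), the build function and the file names are all made up for illustration.

```python
import os
import tempfile


class Task:
    """Toy task (hypothetical, not Luigi's class): dependencies, output, work."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []  # upstream tasks; none by default

    def output(self):
        raise NotImplementedError  # path of the file this task produces

    def run(self):
        raise NotImplementedError  # the actual batch job

    def complete(self):
        # A task counts as done when its output file already exists.
        return os.path.exists(self.output())


class WriteReads(Task):
    def output(self):
        return os.path.join(self.workdir, "reads.txt")

    def run(self):
        with open(self.output(), "w") as handle:
            handle.write("ACGT\nGGCA\nTTAG\n")


class CountReads(Task):
    def requires(self):
        return [WriteReads(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "count.txt")

    def run(self):
        with open(self.requires()[0].output()) as handle:
            n_reads = sum(1 for line in handle if line.strip())
        with open(self.output(), "w") as handle:
            handle.write(str(n_reads))


def build(task, log):
    """Run upstream tasks first, skipping any task whose output exists."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        first_log, second_log = [], []
        build(CountReads(workdir), first_log)
        build(CountReads(workdir), second_log)  # nothing left to do
        with open(CountReads(workdir).output()) as handle:
            return first_log, second_log, handle.read()
```

Calling demo() builds the two-step pipeline twice in a temporary directory. The second build does nothing, because the outputs are already there, and this is what makes long batch workflows restartable.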

How is this related to music? Well, the picture above displays a romantic view of what music was in the past. Nowadays, everything is handled as a big data matter: tunes and chords are transformed into bits, with a cruel disregard for any romance. Luigi was developed at the very famous (and my favourite) music application, Spotify. The main developer, Erik Bernhardsson, is an NYC-based computer scientist who headed the Machine Learning Division at Spotify for six years.

So, we can actually agree with Kerri Smith’s point in Nature: music influences scientific production. Sometimes it is a matter of cultural environment, sometimes it is a matter of data science.

Post-publication addition. I was informed on Twitter about this page with examples of Luigi usage. I think it is worth mentioning. Thanks to @smllmp.


Let's explore rSeqNP, a non-parametric approach to detect differential expression from RNA-Seq data

Even though RNA-Seq has established itself as the “tool of choice” to explore the transcriptome, there are still many challenges in improving the reliability and reproducibility of this method. On the statistical side, the choice between parametric and non-parametric tests is a point of controversy. In 2011, Robert Tibshirani argued that even though parametric tests are widely used in genomic sequencing analysis, they are still very sensitive to data dispersion. Poisson or negative binomial models are very useful, but they can still be affected by outliers, and a possible solution may come from the application of a non-parametric statistical test. This opinion seems to be quite widely accepted among biostatisticians, and many methods are being developed to implement non-parametric approaches in software packages.
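To make the point about outliers concrete, here is a small sketch of a generic rank-based test in plain Python: an exact two-sided permutation test on the rank sum of one group. It is only an illustration of the non-parametric idea, not the method of any specific paper or package, and the function names and expression values are invented.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def exact_rank_sum_pvalue(group_a, group_b):
    """Two-sided exact p-value for the rank sum of group_a."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_total = len(group_a), len(ranks)
    expected = n_a * (n_total + 1) / 2  # rank sum expected under the null
    observed = abs(sum(ranks[:n_a]) - expected)
    extreme = total = 0
    # Enumerate every way of assigning n_a samples to the first group.
    for subset in combinations(range(n_total), n_a):
        total += 1
        if abs(sum(ranks[i] for i in subset) - expected) >= observed - 1e-9:
            extreme += 1
    return extreme / total


# Hypothetical expression values for one gene in two conditions.
control = [5, 7, 6]
treated = [20, 22, 21]
treated_with_outlier = [20, 22, 2100]
```

Since only the ranks enter the statistic, exact_rank_sum_pvalue(control, treated) and exact_rank_sum_pvalue(control, treated_with_outlier) give the same p-value: the extreme value of 2100 has no more weight than 21. The full enumeration is only feasible for small sample sizes.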

Although I suggest exploring Tibshirani’s paper and the method it proposes, today I would like to focus on rSeqNP, an R package that implements a non-parametric approach and was recently published in Bioinformatics. Before running the analysis, the raw dataset needs to be pre-processed. This is because the package works on the expression estimates of all the genes and their isoforms for each sample in the RNA-Seq study, and the outputs from rSeq, RSEM and Cuffdiff are accepted.

The package offers different methods for different kinds of analysis, as explained in the following table (Shi et al., Bioinformatics 2015).


In simulation analyses, the package has been shown to have a well-controlled type I error rate and to achieve good statistical power for moderate sample sizes and effect sizes. In the supplementary data, you can find a demo analysis on real data (Leng et al., 2013) and a comparison with the EBSeq package. rSeqNP can also detect alternative splicing, by computing an overall score, the gene-level differential score (GDS).

The package and its documentation are free to download at http://wwwpersonal.umich.edu/~jianghui/rseqnp/.

Project Rosalind: learn bioinformatics and programming through problem solving

We could agree that a bioinformatician is basically a naked, starving castaway who’s trying to survive on a desert island. As in one of those reality shows that run on TV, or in the movie starring Tom Hanks, he or she is provided with a knife, a few clothes, and a good dose of motivation. In this allegory, the island is computational research in the life sciences, the knife represents programming and mathematical skills, and the few clothes are biological knowledge. As for a castaway, the main occupation of the computational biologist is to solve problems, doing his/her best to build new tools, explore the environment, fetch food (or a fair amount of coffee), and grow his/her knowledge.

Many educational programmes in bioinformatics, both at academic and open-course level, are oriented towards providing the basis for computational work: programming skills, the minimum biological knowledge, and statistics. In our story, this would mean that most of the programmes you are going to meet will just provide you with the knife and a couple of tattered clothes.

This is the reason why I was really amazed when I discovered Rosalind, a website offering a bioinformatics training system that is oriented towards problem solving. The training is organised as a game. You sign up with your email, and you are asked to solve bioinformatics problems at different levels of complexity. Problems are divided into several topics, and each problem gives you points if solved, with no penalty for failure. Remarkably, and despite what one might expect, this doesn’t look like a website for students only. The diversity of the problems and the number of fields involved are really high, and even experienced bioinformaticians may find this website really useful to learn new things. Moreover, lecturers can apply for a professor account and use Rosalind to generate exercises for their classes.

The project is carried out by a Russian-American collaboration between the University of California at San Diego and Saint Petersburg Academic University, along with the Russian Academy of Sciences. It is inspired by a handful of e-learning projects that aim to provide a problem-solving platform on the web, such as Project Euler and Google Code Jam.

Luckily, computational research in biology can be represented only partially by the castaway allegory. Indeed, when you do bioinformatics you are not on a remote island, as you can enjoy communication with other scientists and the (more or less) free learning resources available on the web. And even if you may sometimes feel alone on your island, with dirty and torn clothes on and a blunt knife in your hand, you can still lean on some comfort and help. From this perspective, we may regard projects like Rosalind as a nice volleyball friend keeping you company during the darkest nights.

Genome 3D organisation and evolution. Going beyond flat files.

The genome is a real thing, and this is something we strongly need to keep in mind. The development of bioinformatics has led us to make a very important, but still bold, simplification. A strong focus on sequences, and the information they bear, has allowed us to understand how genes determine the structure and function of proteins, and is driving the work of anyone focusing on the interpretation of non-coding elements, in the restless search for what some call the regulatory code. Basically, we took the object shown in the picture above and transformed it into flat files, to which information theory was then applied. Beyond the obvious and widely discussed advantages, this approach has the potential to be misleading. The genome is a physical body, with its physical and chemical features. And as epigenetics is putting protein-DNA interaction under the spotlight, many studies are underlining that the functioning, the regulation, and thus the evolution of the genome need to be explored by considering the genome as what it really is: a complex three-dimensional object.

I really enjoyed reading a paper dating back to 2011, authored by Johan H. Gibcus and Job Dekker from the University of Massachusetts. Entitled The Hierarchy of the 3D Genome, the article provides an effective point of view on how radically DNA folding affects genome regulation. Recent innovations in probing interphase chromatin folding are providing new insights into the spatial organisation of genomes and its role in gene regulation. Indeed, a paper by Marc A. Marti-Renom (CNAG, Barcelona) in PLOS, which aims to explain the state of the art of computational methods for genome folding analysis, argues that after the advent of fluorescent in situ hybridisation imaging and chromosome conformation capture methods, the availability of experimental data on genome three-dimensional organisation has dramatically increased. This information has recently been made available in the 3D Genome Database (3DGD), which is the result of the work of a Chinese team and gathers the Hi-C chromatin conformation capture data of four species (human, mouse, drosophila and yeast).

Of course, many results proving a role of genome folding in gene regulation and phenotype determination are emerging. As already discussed in this blog, researchers from McGill University in Canada have shown that leukaemia types can be classified with chromatin conformation data. From an evolutionary point of view, we could have a look at this paper published in Nature in 2012, in which specific chromatin-interaction domains, defined as topological domains, are found to be conserved over time and across different species.

Beyond any consideration and further discussion, we could assume that a change in the approach we adopt in genome studies is needed. These findings suggest that a further level of complexity affects genome regulation, and this definitely cannot be ignored. In evolution, we should ask how chromatin structures have become established over the years, and understand their meaning for phenotype and adaptation. Of particular interest would be the role of non-coding sequences, the so-called (and not so) junk DNA, which has been found in many topological domains and may have a role. Ultimately, as we assign a function to the three-dimensional structure of DNA, as we did for proteins, we should investigate the relationship between sequence and structure, and the information exchange between proteins and DNA in protein binding. It seems that not everything is clear about the nature of the information in biological macromolecules, but that is hardly news.

Nextflow: a DSL for parallel and scalable computational pipelines.

I return to pipeline creation tools because I was told about the Nextflow project, a DSL for creating parallel and scalable computational pipelines. As already discussed in a recent post, we can sort pipeline solutions according to how they deal with your code. Some of them are able to connect functions, some of them require dedicated modules, and others are able to connect different files by integrating the standard I/O.

If we think about it, most people working in bioinformatics will end up mixing different things. Usually, a scripting language is used for data mining and R comes in for statistical analysis, but other languages or tools may be needed. Thus, in most cases, any solution that is able to integrate different languages is warmly recommended.

The pretty amazing thing about Nextflow is that the authors took their time to implement a real Domain Specific Language, namely a programming language dedicated to solving specific tasks. To explain better, languages such as R, SQL or Mathematica are considered DSLs too. Of course, this bodes well for the extensibility and power of this language.

Nextflow is provided as an easy-to-install package, and Java 7+ is the only required dependency. The syntax is simple, and so is scaling: you can develop on your laptop, run on the grid, and scale out to the cloud with no modifications to your code. More intriguingly, the whole thing is designed to work with message passing, to better deal with a parallel computing approach.
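To illustrate what message passing means in a pipeline, here is a rough sketch in plain Python with threads and queues. This is not Nextflow syntax and says nothing about how Nextflow is implemented: it only shows the general idea of independent stages that talk to each other exclusively through channels. The names (stage, run_pipeline, reverse_complement, gc_fraction) and the sequences are invented for the example.

```python
import queue
import threading

DONE = object()  # sentinel telling the next stage that no more messages follow


def stage(func, inbox, outbox):
    """Apply func to every message from inbox and forward the result."""
    while True:
        message = inbox.get()
        if message is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(message))


def run_pipeline(items, funcs):
    """Chain one thread per function, connected only by queues."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(func, channels[i], channels[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(DONE)
    results = []
    while True:
        message = channels[-1].get()
        if message is DONE:
            break
        results.append(message)
    for worker in workers:
        worker.join()
    return results


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


gc_values = run_pipeline(["AACG", "GGCC", "ATAT"], [reverse_complement, gc_fraction])
```

The stages share no state: each one only reads from its input queue and writes to its output queue, so a stage can start working on the first sequence while the previous stage is still busy with the next one. Because every stage here is a single thread reading a FIFO queue, the results come out in the same order as the inputs.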

The project is developed at the CRG Comparative Genomics group in Barcelona (the guys who created and maintain the T-COFFEE suite), and is headed by Paolo di Tommaso.