MethylMix: an R package for identifying DNA methylation-driven genes

The paper I am going to explore today introduces MethylMix, an R package designed to identify DNA methylation-driven genes. DNA methylation is one of the processes that are more extensively studied in biomedicine, since it has been found as a principal mechanism of gene regulation in many diseases. Although high-throughput methods are able to produce huge amounts of DNA methylation measurements, there are quite a few tools to formally identify hypo and hypermethylated genes.

This is the reason why Olivier Gevaert from Stanford proposed his MethylMix, an algorithm to identify disease-specific hyper- and hypo-methylated genes, published online yesterday on Oxford Bioinformatics.

The key idea of this work is that it is not possible to lean on an arbitrary threshold to determine the differential methylation of a gene, and the assessment of differential methylation has to be made in comparison to normal tissue. Moreover, the identification of differentially methylated genes must come along with a transcriptionally predictive effect, thus implying a functional relevance of methylation.

MethylMix first calculates a set of possible methylation states for each CpG site that is found to be associated with genes showing differential expression. This set is created by comparison with clinical samples and using the Bayesian Information Criterion (BIC). Then, a normal methylation state is defined as the mean DNA-methylation level in normal tissue samples. Each set is compared with the normal methylation state in order to calculate the Differential Methylation Value or DM- value, defined as the difference between the methylation state with the mean DNA-methylation in control samples. The output is thus an indication of which genes are differentially methylated and differentially expressed.

As mentioned, the algorithm is implemented as an R package, it’s already included in the Bioconductor package section.


Information and intelligence: let's meet the Smart Slime

As we talk about information in biology, mind goes through DNA and protein sequences, on the tracks of what we have learned to call “bioinformatics”, along with a fair amount of algorithms, open source methods, coding hacks, genome assemblies, libraries and bugs. Luckily, Biology is much more complex and beautiful than mere green strings, and information in biological systems flows at different scales, involving several processes. And if we can call communication the information transfer between two entities, the capability of a system to learn from environmental information in order to improve its adaptive response, could be fairly (even if a bit boldly) defined as intelligence.

What is intelligence? Where does it rely? How can we measure it? Great questions, that would generate a huge discussion. Far bigger than this small blog. Anyway, I guess that the best option here is to start talking about something very simple. A slimy mold, for instance.

Physarum polycephalum is known as the many-headed slime and, as reported on Wikipedia is a slime mold that inhabits shady, cool, moist areas, such as decaying leaves and logs. Like slime molds in general, it is sensitive to light; in particular, light can repel the slime mold and be a factor in triggering spore growth. The really amazing fact about this slimy fellow, is that many investigations have proved him as capable of the capability to solve complex problems.

The video I am sharing above these lines, is a TED talk held by Heather Barnett. Designer working with bio-materials and artist, Heather Barnett creates art with slime mold, and shows us how much amazing this organism can be.

With a simple, but very effective cell-based information processing system,  P. polycepalum has been proved to quickly find the best path to food through a maze, way faster than me when I had to find the best path to train station from my home through Gràcia. More, the video shows how the mold reconstructed in scale the Tokyo suburban rail system, proving its capability to solve complex problems and tasks.

Of course, to anyone studying cognitive processes at a  molecular level, this organism provides an excellent model, but we may also fetch some good idea for evolutionary biology too. The best way to thread into this is a visit to Heather Barnett’s website, that is provided with many video (there is a youtube channel too), information and references.


Software for pipeline creation with Python.

We already had the chance to discuss about the importance of reproduciblity in computational research, and to comment some good practices to improve it. As we read the Ten Simple Rules that Sandve and co-workers proposed the last October, we cannot help but underline the importance of pipelines. A correct pipeline- based approach will prevent researchers from potentially harmful manual interventions on data, and to get them to have a correct tracking of their workflow. Pipelines are just perfect to deal with the usually huge bioinformatics tasks, that require a big amount of calculation, and several sorting and filtering steps. Despite the usual controversies about this, we can tell that Python is becoming the prime choice of many bioinformaticians, because of its powerful features, dynamic and populated community, and ease of use. That is why I think it is fair to discuss about a couple of python-based pipeline creation tools.

The first point, is to take stock of which main features we should ask to a python pipeline creation tool. Of course, anyone will appreciate a lightweight system for obvious reasons, and things like a simple syntax, scalability and the possibility to manage complex workflows with ease will be very welcome. Another aspect I want to mention, is the possibility to include previously created code into a pipeline system. There are two main reasons for this. First, functions and classes may be re-used in different projects, and having a pipeline system working as a “wrapper” around your code may ease this. Second, many python beginners are not really oriented to the pipeline-philosophy, as python works great with a module-based approach (even if  not exclusively).

The different solutions I got to find around, can be distinguished on the basis of their relationship with the code, and according on how they thread into a Python script. Let’s assume, for simplicity, that a program is made up by functions that are included into modules, and that several modules can constitute the whole thing. We have thus identified three concentric levels: a function level, a module level, and a multi-file level. Pipeline systems for Python are basically modules providing the possibility to include a simple-syntax code in your scripts to manage the data flow. Several commands, usually defined as decorators, are thus formalized to sort the data flow into an organized pipeline. So, we will discuss how different solution will work at different levels.

Pipelines working at function- level: Ruffus and Joblib

Published in 2010 on BMC Bioinformatics by Leo Goodstadt at Oxford University, Ruffus is available on its official website, where you can find a complete documentation and tutorials. As you can notice, Ruffus works by connecting consecutive input/output files, and imposes the developer to write the functions in the code following the order of the dataflow. Any function must be preceded by the @follows decorator, indicating the flow direction, and the @files, that calls the in/out files. That is why I mention it as a pipeline system working at a “funcion level”, as the internal module structure of a script depends on the structure of the pipeline.

This approach is someway related to the one implemented into Joblib, a python pipeline system that is mostly oriented to ease parallel computation. Despite the substantial differences, the structure of the script depends on the structure of the pipeline in both cases.

Pipelines working at module- level: Leaf

Leaf is a project published a couple of months ago by Francesco Napolitano, at the University of Salerno in Italy. The key-idea, is to provide a system to declare a pipeline structure without changing the code. At the beginning of the module, it is possible to enclose a decorator to build a graphical scheme of the pipeline you have in mind. A simple visual language, the Leaf Graphical Language, is implemented to graphically build the dependencies, with the possibility to export all the workflow as hypertext to share results. Leaf comes out as python library and can be downloaded here.

The key differences between Ruffus and Leaf are shown in the following picture (Napolitano et al., 2010).


As evident, Leaf works as a real wrapper, whereas Ruffus requires a specific script structure.

Pipelines working at multi-file level.

Pipelines can be designed to interconnect different python modules. In this case, the pipeline tool will work at an “upper level”, standing above different modules. It is the philosophy underlying the most common pipeline creation software, and I would like to mention Bpipe, that is one of the most recently developed (but there are quite a lot around). Of course, as scripts in any language can work with standard streams, we are slipping a bit away from the range of “python-dedicated pipeline tools”, and learning the good ol’ GNU-make is still worthwile if you are keen to work (or in the need of working) with pipelines at a module-level.

I cannot really tell which is the best one, since the choice  will depend on the project, the coder’s attitude and the specific needs. Furthermore, this post is just rattling off some few projects I got to find around, and more suggestions will be just welcome.