Tag Archives: python

Tune up your pipeline with Luigi, the Python workflow management module used at Spotify.

Yesterday I found an amazing audio commentary on Nature’s Arts and Books blog discussing a possible influence of music on the development of modern science. Among the many connections we may find between science and music, the one I am going to propose today turns out to be quite unexpected.

I understand that pipeline development is taking over the discussion on this blog, and that this could get quite boring. That’s because I am facing my very first big project in genomics, and I need to explore the best solutions and strategies to manage complex workflows. So, as I have already discussed some Python solutions for pipelines and the NextFlow DSL project, let me take a few lines to talk about Luigi.

Luigi is a Python package for building complex pipelines of batch jobs. Long-running batch processes can be managed easily, and there are pre-designed task templates. For instance, there is full support for MapReduce development in Python, which is the only language used in Luigi. The whole workflow can be monitored through a very useful graphical interface, which shows a graph of your pipeline's structure and the status of data processing. More information and downloads are available on the Luigi GitHub page.
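To give a feeling for how such a pipeline is organized, here is a toy model written with the standard library only. It is not Luigi's API: it merely imitates the pattern of tasks that declare what they require, what they output and how they run, and it assumes that outputs are plain file paths. The class names, file names and the build() helper are all made up for this illustration.

```python
import os
import tempfile


class Task:
    """Toy task: declares its dependencies, its output file and its work."""

    def __init__(self, workdir):
        self.workdir = workdir

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file already exists
        return os.path.exists(self.output())


def build(task, log):
    """Run dependencies first and skip any task whose output is present."""
    if task.complete():
        log.append("skip " + type(task).__name__)
        return
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append("ran " + type(task).__name__)


class WriteNumbers(Task):
    def output(self):
        return os.path.join(self.workdir, "numbers.txt")

    def run(self):
        with open(self.output(), "w") as handle:
            handle.write("\n".join(str(n) for n in range(1, 6)))


class SumNumbers(Task):
    def requires(self):
        return [WriteNumbers(self.workdir)]

    def output(self):
        return os.path.join(self.workdir, "sum.txt")

    def run(self):
        # Read the file produced by the upstream task
        with open(self.requires()[0].output()) as handle:
            total = sum(int(line) for line in handle)
        with open(self.output(), "w") as handle:
            handle.write(str(total))


workdir = tempfile.mkdtemp()
log = []
build(SumNumbers(workdir), log)
build(SumNumbers(workdir), log)  # second call finds sum.txt and does nothing
print(log)
```

The point of the design is that a task is never run twice: once its output exists, the whole branch behind it is skipped, which is what makes long batch pipelines restartable.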

How is this related to music? Well, the picture above displays a romantic view of what music was in the past. Nowadays everything is handled as big data: tunes and chords are transformed into bits, with a cruel disregard for any romance. Luigi was developed at the very famous (and my favourite) music application Spotify. The main developer, Erik Bernhardsson, is a NYC-based computer scientist who headed the Machine Learning Division at Spotify for six years.

So, we can actually agree with Kerri Smith’s point in Nature: music influences scientific production. Sometimes it is a matter of cultural environment, sometimes it is a matter of data science.

Post-publication addition. I was informed on Twitter about this page with examples of Luigi usage. I think it is worth mentioning. Thanks to @smllmp

The Oncodrive suite. Bioinformatics methods to detect driver mutations in cancer.

One of the most amazing groups whose work I have recently explored is based at UPF, a young and rapidly growing university in Barcelona. The Biomedical Genomics Group applies its strong computational expertise to cancer research, focusing on the identification of the mutations that actually determine the tumor phenotype, the so-called driver mutations. The tool I share with you today identifies driver mutations using a clustering approach. The idea is quite simple: since gain-of-function mutations in cancer tend to cluster in specific protein regions, where they provide an adaptive advantage to cancer cells, one can use this feature to identify a driver mutation. This is a crucial need for anyone working in cancer genomics. When you sequence the genome of a cancer cell, you basically find a total mess of mutations, and your job is to distinguish the ones that drive the cancer.

One of the current challenges of oncogenomics is to distinguish the genomic alterations that are involved in tumourigenesis (i.e. drivers), from those that give no advantage to cancer cells, but occur stochastically as a by-product of cancer development. (Bioinformatics, 2013)

The lab has published a set of tools, actually a real software suite called Oncodrive, providing computational methods for the identification of cancer driver mutations. On August 27th, 2014, the group announced the publication of a new member of this suite, OncodriveROLE, and I take this opportunity to publish a short summary of the whole suite.

 

OncodriveFM


Method to identify cancer drivers from cancer somatic mutations in a cohort of tumors. It computes the bias towards the accumulation of variants with high functional impact (FM bias).

link | paper

OncodriveCIS

Method to identify genes that accumulate copy number alterations important for tumour development. This is done by computing the functional impact of CNAs by measuring their effect on the expression of the genes affected.

link | paper

OncodriveCLUST

Method to identify genes in which mutations accumulate within specific regions of the protein, which denote events selected by the tumour. It computes a score measuring the mutation clustering of a gene across the protein sequence and compares it with a background model.

link | paper

OncodriveROLE

Method to classify cancer driver genes into Activating or Loss of Function roles.

link | paper
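The clustering idea behind OncodriveCLUST can be made concrete with a toy sketch. This is not the published scoring scheme: the score (the fraction of mutations that have another mutation within a few residues), the uniform random background, the window size and the mutation positions are all assumptions chosen only for illustration.

```python
import random


def clustering_score(positions, window=5):
    """Fraction of mutations with another mutation within `window` residues."""
    ordered = sorted(positions)
    if len(ordered) < 2:
        return 0.0
    clustered = 0
    for i, pos in enumerate(ordered):
        # In a sorted list the nearest mutations are the adjacent entries
        near_left = i > 0 and pos - ordered[i - 1] <= window
        near_right = i < len(ordered) - 1 and ordered[i + 1] - pos <= window
        if near_left or near_right:
            clustered += 1
    return clustered / len(ordered)


def empirical_p_value(positions, protein_length, window=5, n_sim=1000, seed=0):
    """Compare the observed score with mutations scattered uniformly at random."""
    rng = random.Random(seed)
    observed = clustering_score(positions, window)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.randint(1, protein_length) for _ in positions]
        if clustering_score(simulated, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero
    return (hits + 1) / (n_sim + 1)


# Hypothetical mutated residues in a hypothetical 1000-residue protein
mutations = [12, 12, 13, 14, 15, 16, 17, 18, 400, 900]
score = clustering_score(mutations)
p_value = empirical_p_value(mutations, protein_length=1000)
print(score, p_value)
```

A gene whose mutations pile up in one region gets a high score that random scattering rarely reaches, which is the intuition for flagging it as a candidate driver.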

I haven’t tried them, since I am working on dystrophy and still have no mutations to detect, but if I got this straight, all the tools are distributed as Python libraries. Moreover, I really suggest you visit the lab’s tools page, where you can find up to 13 different software solutions dedicated to cancer research.

Software Carpentry. A great resource to learn scientific computing.

Software Carpentry is a volunteer organization founded in 1998, whose mission is to enhance scientists’ productivity by teaching them basic computing skills. In January 2012, Software Carpentry joined the Mozilla galaxy, and it became part of the Mozilla Science Lab in June 2013. The organization provides workshops in several parts of the world and free online lessons. Their focus is to teach scientists the basics of some very useful in silico methods, such as structured programming (Python and R), numerical programming (NumPy), version control (Git, Mercurial and SVN), unit testing, task automation in Unix, regular expressions and relational databases (SQL).

Most likely you have heard about them, but I still want to take some time to report on their amazing work. After long experience in this field, the people at Software Carpentry have developed a very simple deal with their customers: short courses at an affordable price. You pay for their accommodation, put yourself in the mood to become more efficient, and dedicate a couple of days to learning very useful scientific computing skills. As they state on their website, courses must be short to better meet the needs of very busy researchers, and workshops represent the most efficient way to learn. Efficiency is actually the thing they care about most, and you may want to consider its importance in your own work.

The truly amazing thing is that they really operate worldwide. So, I think you should consider browsing their website and getting to know this long-established and very useful project.

 

Dealing with UniProt: the Python programming interface.

My thesis is mostly focused on proteins, so you can be sure that I have become really familiar with UniProt. UniProt comes with a strong user interface that eases any approach: biochemists can easily find what they need, just as a bioinformatician can set up scripts to obtain information directly from the database. In fact, UniProt has a very good programming interface, usable from all the main programming languages and very well explained in a detailed official tutorial. To find it, you can click this link, or you can search Google with the query “uniprot programmatically”. For me, it was quite complicated to find this page by browsing the website, since the UniProt documentation is huge (and I have no patience for this).

For instance, I have to retrieve a bunch of helix-turn-helix transcription factors in FASTA format. I have a text file with one ID per line and must load the sequences as sequence objects with the Bio.SeqIO module from Biopython. It is quite easy indeed; the whole game is in the following function:

from urllib.request import urlopen

def getseq(ID, extension):
    # Build the entry address: base URL + ID + "." + file type
    base_url = "http://www.uniprot.org/uniprot/"
    url = base_url + ID + "." + extension
    # Download the entry and decode the raw bytes into a string
    with urlopen(url) as response:
        return response.read().decode("utf-8")

Please note that I still can’t figure out how to display tabs in WordPress, so check the indentation if you ever use this. The function is written for Python 3, where urllib.request replaces the old urllib2 module and is the only import needed. The decoded return value can be handled as a string: one can print it to a text file to be parsed with the SeqIO.parse() method. The amazing thing about UniProt is that programmatic access to the website is made easy by the very simple organization of the database. If you know the ID, you just have to build a web address with the file type you need as the extension.
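The rest of the task, reading one ID per line and turning the downloaded FASTA text into records, can be sketched as follows. To stay free of dependencies and network access, this sketch uses a small hand-written FASTA splitter and made-up input; the IDs, headers and sequences are placeholders, and read_ids() and parse_fasta() are names invented here. With Biopython installed, the same text could instead be wrapped in io.StringIO and passed to SeqIO.parse(handle, "fasta").

```python
import io


def read_ids(handle):
    """Return the non-empty lines of an open text file, one ID per line."""
    return [line.strip() for line in handle if line.strip()]


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder input standing in for the ID file and the downloaded text
ids = read_ids(io.StringIO("ID_ONE\n\nID_TWO\n"))
fasta_text = ">ID_ONE demo first\nMKT\nAYI\n>ID_TWO demo second\nGGS\n"
records = parse_fasta(fasta_text)
print(ids)
print(records)
```

In a real run, each ID would be passed to the download function and the returned text concatenated before parsing.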

PANDAS creator explains how to take the very first steps in Python data analysis.

The very first thing that comes to mind when the words statistics and programming are associated is definitely R. R is one of the most widely used programming languages in statistics and data analysis. But what if you had to embed the statistical part of your work in a bigger scripting project? What if you need to use the output of your statistical analysis as an input for your script?

Python users can enjoy Pandas

There are actually many solutions. You can use pipelines to connect different scripts, or you can use “bridges”, libraries designed to connect R with other languages, such as R to Python or JRI for Java. Python users have another option, the very famous library PANDAS, which brings the R philosophy into a full Python library.

I don’t think many readers out there are totally unaware of this library. Anyway, let me remind you that you can have a look at PANDAS and download it from the official website.
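As a minimal sketch of what this looks like in practice, assuming pandas is installed, the snippet below computes a per-group statistic and then feeds the result straight into ordinary Python code. The column names, values and threshold are invented for the example.

```python
import pandas as pd

# Hypothetical measurements: two samples with two values each
df = pd.DataFrame({
    "sample": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# The statistical step: mean value per sample
means = df.groupby("sample")["value"].mean()

# The scripting step: reuse the result like any other Python object
threshold = 5.0
high = [name for name, mean in means.items() if mean > threshold]
print(high)
```

No bridge or intermediate file is needed here, because the statistics and the rest of the script live in the same interpreter.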

A very simple 3H seminar!

For those who are starting to use this library and want to take their very first steps, the video embedded at the top can be a good tool. Wes McKinney, the creator of PANDAS, gives a hands-on introduction to manipulating and analyzing large and small structured data sets in Python using the pandas library. So, if you have 3 hours to spend on this, you are very very welcome (WTF Wes!?!?).

Something about Wes McKinney

I have been hearing about him for a while, and he really looks like a proven authority on Python data analysis. A San Francisco-based Python hacker and entrepreneur, he is also the author of the book Python for Data Analysis. I often keep an eye on his work through his blog and his Twitter profile.

PyMOD, a PyMol plugin for embedding multiple alignments in homology modelling

The project I want to discuss today is probably the best thing that has come out of my current lab in recent years (Bioinformatics Lab at the Biochemistry Department of Sapienza University of Rome). Developed by Emanuele Bramucci for his Master's thesis, PyMOD is a plugin for the famous molecular visualization system PyMol, and it was released in 2011.

It represents a simple and user-friendly bridge between PyMol and several other applications of interest, such as PSI-BLAST, MUSCLE, CEalign, Modeller and ClustalW. As the project's homepage puts it: sequence similarity searches, multiple sequence-structure alignments, and homology modelling within PyMOL. It is fully supported on all major operating systems (Windows, Mac OS and Linux), but has not yet been tested on PyMol 1.5.

A video of an example workflow is embedded at the top. I suggest you visit the project’s homepage:

http://schubert.bio.uniroma1.it/pymod/index.html

and enjoy the video and its delightful blues music. It is really worth it in any case.

 

 

Introducing pyphylogenomics 0.2.0, a python module for phylogenomics.

If we had to find a word to best represent the encounter between informatics and evolutionary biology, that word would definitely be phylogenomics. Large-scale analysis, made possible by the large amount of data that has become available in the last 20 years, is rapidly imposing the rule that “nothing makes sense in evolutionary biology except in the light of high throughput analysis”. Understanding the most important evolutionary relationships depends on large-scale analysis, since the relevant information is often hidden in a mountain of genetic material.

The Python module I am going to present is brand new and completely dedicated to phylogenomics. It was written by Carlos Peña, from the Nymphalidae Systematic Group at the University of Turku (Finland). The package implements several functionalities, including BLAST integration, sequence analysis, FASTQ to FASTA conversion and primer site finding.

To try it in your Python environment you will need the modules MySQLdb and Biopython installed. Just go to the PyPI page and download the package, then unzip it, open a terminal in the unzipped folder and type:
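To show what one of these functionalities involves, here is a standard-library sketch of a FASTQ to FASTA conversion. It does not use pyphylogenomics and says nothing about how the package implements the feature. It assumes simple four-line FASTQ records with unwrapped sequences, and the function name and reads are invented for the example.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records into FASTA text, dropping qualities."""
    lines = [line.strip() for line in fastq_text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    fasta = []
    for start in range(0, len(lines), 4):
        header, sequence, separator, _quality = lines[start:start + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record %d" % (start // 4 + 1))
        # FASTA keeps only the read name and the sequence
        fasta.append(">" + header[1:])
        fasta.append(sequence)
    return "".join(line + "\n" for line in fasta)


# Two made-up reads
fastq = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n"
print(fastq_to_fasta(fastq))
```

The conversion simply discards the quality line of each record, so it cannot be reversed.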


python setup.py build
python setup.py install

Ubuntu users will need to prefix the install command with sudo. For more information check the GitHub page.