Category Archives: bioinformatics

Is "how to do bioinformatics" the major topic in bioinformaticians online reading habits?

Sifting through my website stats, I realised that bioinformaticians read more posts discussing “how to do bioinformatics” than posts with strictly scientific content. Is this a quirk of this blog, or does it reflect a common problem with our working habits?

Drawing some conclusions after two years of atcgeek

After almost two years of blogging on atcgeek, I can venture to say a thing or two about this experience. Scrolling through the statistics of this blog, I cannot really complain about the interest it has generated in readers. Even though I won’t become famous by writing here, I can say that 38k views since January 2014, with peaks around 1k views/day, is not a bad result, considering the long pause I had to take whilst moving to Barcelona and starting my PhD. Nothing really special, but not a disastrous failure either.

The three main topics at atcgeek

Although I divide my posts into thematic categories (bioinformatics, biochemistry, structural biology, etc.) and into types of article (news, insights, video, hacks and personal blog), I realised that I basically tend to write on three topics: education and work practices, methods, and reflections. Posts of the first kind are about “how to work in bioinformatics” or “where to learn the basics”. The second kind are those in which I report on recently published methods, and the third category collects the posts that propose scientific insights on the role and nature of computational and theoretical biology.

Most of the interest goes to posts on education and work habits.

The order in which I mentioned these three topics coincides with their ranking in terms of interest generated. Education and work practices come first, methods second, and the bronze medal goes to the insights. Swiftly and boldly comparing my site statistics with the interest generated on social networks, I dare say that the people who read atcgeek are particularly interested in discussing how to improve their working habits and how to start working in bioinformatics, or in sharing a bit of self-irony with me as I talk about the shit I do at work. Take it as an impression that is barely supported by statistics, but plausible enough to raise a question.

Based on what I see on atcgeek, people are more into discussing how to do bioinformatics, or how to learn the basics, than bioinformatics itself, and there could be some reasons behind this.

Of course, we should keep in mind that this blog is written by a PhD student who is sharing his experience while taking his first steps in computational biology. This matters, since anyone would be more interested in the opinions of someone more influential than me for anything concerning the “scientific part”. The main goal of this blog is to share my experience horizontally and to interact productively with my visitors, rather than to claim expertise in the field and aim at “coaching” the readers. On the other hand, if experience matters, it should matter for both topics, since the thoughts of an experienced scientist are more valuable than mine on work habits and on science alike.

Do we have a problem with how to do our work?

Although the skew I am seeing in readers’ interest may be due to the characteristics of this blog, I still have the feeling that “how to work” is the major hot topic in the bioinformatics community, and we may strongly suspect that this reflects a problem. Bioinformatics is basically the domain of non-computer scientists working with computers, the merger of two super-rapidly changing sciences, and the development of proven, shared and consolidated work strategies is far from being a reality, especially compared with experimental biology, where lab practices are widely discussed and protocols are consolidated.

There is one last thing to say. In the actual ranking of most visited posts, the most read one is not really about bioinformatics. Let’s say that this discussion is focused on what bioinformaticians read online when they are keen to read about science; including the other interests could be confusing.

BTW, thank you for the interest in this stupid diary.

The four most stupid things I have ever done in bioinformatics.

It was a cold November morning in 2011. Sapienza University has a huge campus next to the city centre of Rome, where the main faculties are housed in huge rationalist-style buildings. Yet the faculty of Biochemistry has a detached site in San Lorenzo, the neighbourhood flanking the campus. I was crossing the streets of this wonderful ex-industrial, alternative neighbourhood to reach my new lab. The clock read 10:30 AM, and I was joining bioinformatics. Professor Stefano Pascarella had agreed to supervise my master’s thesis, and it was my very first day. Four years have passed; I have graduated and worked in five different labs, and even if my experience is not really long, I think I already have a couple of stories to tell.

Stupidity matters. Although most people link science to intelligence and genius, seeing research as a matter for the “smart guys”, we must admit that lab routine is often studded with the crap we make, and that researchers can become protagonists of acts of remarkable stupidity. And if we scan the first, faltering steps of a researcher’s career, we may find a couple of funny, nerdish stories to tell colleagues at a bar. And since I’d be so sorry to learn that some of you might run out of funny anecdotes about grad students’ stupidity, let me report the four most stupid things I have ever done in bioinformatics.

Trying to fetch information from UniProt on 1750 genes without any programming

The first task of my master’s thesis was simple. My advisor provided me with a list of 250 UniProt IDs of MocR proteins from several bacterial genomes: helix-turn-helix transcription factors with an aminotransferase domain that regulates them allosterically upon pyridoxal-5’-phosphate binding. The lab had identified these sequences with HMMER, and we wanted to know something more about the flanking regions. The professor told me to annotate the 3 upstream and 3 downstream coding regions in order to see whether some recurrence could indicate a conserved multigenic region; simple and straightforward.

The next day, at 8 pm and after ten hours of work, I was shattered, casting a lost look at my screen. A hard lesson I have learned since is that if you did something wrong in designing your bioinformatics workflow, a spreadsheet will show up at some point. I was staring at an OpenOffice Calc window with about 40 rows, having managed to find a way to manually scan the flanking regions. I don’t remember my glorious strategy exactly, but it must have sounded something like this:

  1. Copy and paste the ID into UniProt and search for it.
  2. Scroll all the way down to the cross-link pointing at a graphical genome browser and open it.
  3. Perfect, you are on the spot! Now move the browser back and forward, and you will find the flanking genes.
  4. Select each flanking gene in the interval and make your way back to UniProt.
  5. Save the information you get (basically the UniProt ID) in a spreadsheet and move on.

I was then advised to stop doing this and get on with studying Python. That was the day I learned that there is no bioinformatics without programming.
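For the record, the whole job can be scripted in a handful of lines. What follows is a minimal sketch of the first step only, fetching each entry programmatically; it assumes today’s UniProt REST endpoint (https://rest.uniprot.org) and the third-party requests library, the ids.txt file name is illustrative, and the flanking-gene logic is left out.

#!/usr/bin/env python3
# Minimal sketch: fetch UniProt entries for a list of accessions.
# Assumes the current UniProtKB REST endpoint and the "requests" library.
import sys
import requests

def fetch_entry(accession):
    """Return one UniProtKB entry as parsed JSON."""
    url = "https://rest.uniprot.org/uniprotkb/{}.json".format(accession)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    with open(sys.argv[1]) as handle:  # e.g. ids.txt, one accession per line
        for line in handle:
            accession = line.strip()
            if not accession:
                continue
            entry = fetch_entry(accession)
            # Print the accession and the recommended protein name, if any.
            name = (entry.get("proteinDescription", {})
                         .get("recommendedName", {})
                         .get("fullName", {})
                         .get("value", "NA"))
            print(accession, name, sep="\t")

A loop like this would have turned my ten hours of clicking into a coffee break.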

Protein-DNA docking to fetch promoters.

After these first explorations, the final goal of my M.Sc. thesis became the identification of a conserved promoter region upstream of the neighbouring genes pdxS and pdxT, which code for the two subunits of the pyridoxal-phosphate synthase holoenzyme in bacteria. This memory tastes a bit sweet, as usual when you end up remembering how naive you were as a newbie. It was early 2012, January or maybe February. During a lab meeting, I argued that a good option for finding our promoters was to perform a docking analysis on a set of candidate promoter sequences, docked against the MocR transcription factor that had been found to activate their transcription. After explaining my point, I realised that everyone was just looking at me with dismay. Do you know that awful feeling when everyone in the room looks at you like you’re crazy? It was explained to me that the methods developed for protein-DNA docking were still too ineffective to give a reliable result. Protein-DNA docking to infer the binding region of an HTH? Pure science fiction. At least, that day I was introduced to one of my favourite topics in bioinformatics: the communication between DNA and proteins.

Declaring profanities as variables in your code.

Even if I am quite used to threading jokes into my code, taking it as a “nerdish rebellion” against my even more nerdish work routine, what I am going to tell here didn’t actually happen to me. I include this story, which I heard second-hand, because it’s really worth reading.

When working in a team, sharing code is fundamental, and the best habit you can adopt is to name variables in a human language and to write proper comments, so that the people who will read your code can understand it (to any possible extent). Anyway, the first thing you should care about before sharing your code is to make sure that it won’t worsen the opinion your colleagues have of you.
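To make the advice concrete, here is a made-up two-version snippet; both run identically, but only one of them is fit to be shared:

# Cryptic version: it works, but nobody else can read it.
s, n = 0.0, 0
for x in [2.1, 3.4, 2.8]:
    s += x
    n += 1
print(s / n)

# Shareable version: descriptive names and a comment.
expression_values = [2.1, 3.4, 2.8]  # normalised expression estimates
mean_expression = sum(expression_values) / len(expression_values)
print(mean_expression)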

This story has all the ingredients that a good academic joke needs to succeed: a polite, old-mannered thesis director, a graduate student with a sense of humour his advisor won’t get, swear words, profanities, and a Perl script to show them all up.

Stefano Pascarella is not old at all, but he is still the quintessential super-mannered, polite Italian professor. I worked in his lab for two years and never heard him yell at anyone or even express disappointment harshly. Quite remarkable, since he was my thesis advisor. I never met the student who is the protagonist of this story, so I can only picture him as the typical twenty-something master’s student. The only thing I am pretty sure about is that one day he wasn’t at the lab, and his code was needed for some reason.

Professor Pascarella sat down in front of the terminal and rapidly found the file he needed. The people who told me this story just can’t forget the expression on the professor’s face: a calm, bored look immediately turned into a serious face, which swiftly faded into disconcert. Every single variable in the code he was reading was either a swear word or a profanity.

Later that day, the student received an email “kindly asking” him “to take his coding routine more seriously”.

Ignoring the find/replace function in a text editor.

Ok, I can guess what you are thinking: “This moron didn’t know that text editors have a find/replace function and corrected a whole script manually to change a single word”. Not so; I did something that is possibly worse. When I started to write code, I did not know much about the existence of this amazing function in my text editor, but I was still very sure that the process had to be automated. My ignorance of text editors mixed dramatically with my inclination for programming, giving rise to one of the most stupid things I have ever done.

When I finished and tested the script, named changeword.py, I was totally sure it was one of the best things I could produce with my short programming experience. I don’t really remember the code, but it must have looked something like this:

#!/usr/bin/python
# changeword.py: print a file with every occurrence of a word replaced (Python 2).
import sys

filein = sys.argv[1]          # input file
word_to_change = sys.argv[2]  # word to search for
replacement = sys.argv[3]     # word to substitute

a = open(filein, 'rU')
b = a.read()
a.close()
print b.replace(word_to_change, replacement)

To run it, you just passed it the file, the word you wanted to change and its replacement, and everything went to standard output:

$> ./changeword.py my_file.txt first_word second_word > my_corrected_file.txt

Et voilà, the text came out changed. Luckily, at a certain point I realised that my fantastic script didn’t cover every change I might need, and I decided to discuss the problem with a postdoc in my lab. He is still laughing about it.

Writing the MD5 checksum into the same file it was computed from.

Fatigue plays tricks, and makes a perfect source of inspiration for stupid actions. When you are tired you can experience severe logical failures, and brilliantly shatter your work in seconds.

This happened a few months ago. Keeping track of your input, output and script files is very important, and even if we are not used to version control systems, annotating every file with its MD5 checksum may help, to some extent, to keep better track of your work.

The MD5 algorithm assigns a practically unique code to any given input. If you feed a file to MD5, the output code will correspond univocally to that file. Of course, if you modify the file, the resulting MD5 code will change.

I was finishing a long scripting session and was adding information to my tab-separated output file in a hashed header. Once I had calculated my MD5 code, I had the brilliant idea of writing it into the same file I had extracted it from. Needless to say, after I pasted the MD5 code into the file, the MD5 code of that new file inexorably changed.

It took me a good quarter of an hour to realise it. It was 9 PM, and I thought it was just my brain asking me to go home for some rest.
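In case you want to relive my failure, here is a minimal sketch of the trap using Python’s hashlib; the file name is illustrative:

#!/usr/bin/env python3
# Minimal demonstration of the trap: appending a file's MD5 checksum to
# the file itself changes the file, so the stored checksum no longer
# matches. The file name is illustrative.
import hashlib

def md5_of_file(path):
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

path = "results.tsv"
checksum = md5_of_file(path)
with open(path, "a") as handle:  # the fatal step
    handle.write("# md5: {}\n".format(checksum))
print(checksum == md5_of_file(path))  # prints False, inexorably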

As I said at the beginning of this article, stupidity matters. And being able to laugh at yourself matters even more. Cognitive work requires the application of all your rationality, and it is thus fundamental to understand its limits, that is, the borders of your intellectual skills as shaped by stupidity. I think there is no shame in recognising your own limits, and publicly admitting them is somewhat therapeutic.

Quoting an Italian PhD student I met at my department, who recently graduated: “there is no use for a PhD course except in the light of understanding how stupid you are”. I have recently enrolled in my second year of PhD here at the CRAG, and I still have a long way to go to explore the deepest corners of my stupidity.

After all, the Diesel advertisement shown as the heading image of this post may be right: you are stupid only if you try to explore your limits. And that is just about what I am up to.

Applying phylogenetics and bioinformatics to NF-kB studies

To anyone who has anything to do with immunity studies, the nuclear factor kappa-light-chain-enhancer of activated B cells will sound really familiar. NF-kB is a protein complex charged with initiating the transcriptional response to external stimuli, such as stress, cytokines, antigens, bacteria, free radicals or UV irradiation. Expressed in active B cells, it is the protagonist of the immune response at the molecular level.

For quite a long time, its evolutionary characterisation was rather neglected, since no homologous sequence had been found. Actually, I often realise that biomedical studies tend to keep quite far from evolutionary approaches. Biomedicine is about understanding processes happening here and now, and it often aims to quickly find a reliable therapeutic approach for the subject’s disease. So many factors to study, so little time. This shifts biomedical studies away from the influence of evolutionary biology. A real pity, as Catriona MacCallum pointed out in PLoS Biology in 2007, since the contribution of evolutionary biology to biomedicine has a big, almost unexplored potential.

Recently, NF-kB and NF-kB-like proteins have been discovered in “basal” marine animals and in non-metazoans, allowing the study of the early evolution of this nuclear complex of extraordinary importance for human health. John R. Finnerty and Thomas D. Gilmore from Boston University published an interesting paper on this topic just a few months ago, and I dare to introduce it here for two main reasons.

The first is the clear scientific interest of their work. It represents one of the few, really valuable evolutionary approaches to an all-biomedical subject, and it highlights deep conservation and repeated instances of parallel evolution in the sequence and structure of NF-κB in distant animal groups, suggesting that important functional constraints limit the evolution of this protein. The second is that it also explains how to easily apply phylogenetic and bioinformatic approaches even without previous hard training.

The authors run on the double track of reporting a scientific result and introducing the reader to some simple (but still effective) computational tools that more or less anyone can use to bring phylogenetics into their own work. This makes “Methods for Analyzing the Evolutionary Relationship of NF-κB Proteins Using Free, Web-Driven Bioinformatics and Phylogenetic Tools” a very interesting read, both for bioinformaticians who need to communicate with experimentalists and for people working on NF-kB.

The article is part of the methods book NF-kappaB: Methods and Protocols, edited by Michael J. May and published by Springer Protocols.

Let's explore rSeqNP, a non-parametric approach to detect differential expression from RNA-Seq data

Even if the establishment of RNA-Seq as the “tool of choice” for exploring the transcriptome is a fact, there are still quite a lot of challenges in improving the reliability and reproducibility of the method. On the statistical side, the choice between parametric and non-parametric tests is a point of controversy. In 2011, Robert Tibshirani argued that even though parametric tests are widely used in genomic sequencing analysis, they are very sensitive to data dispersion. Poisson or negative binomial models are very useful, but they can still be affected by outliers, and a possible solution may come from the application of a non-parametric statistical test. This opinion seems to be quite widely accepted among biostatisticians, and many methods are being developed to implement non-parametric approaches in software packages.
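As a generic toy contrast (nothing to do with rSeqNP’s actual implementation, and the numbers are made up), a rank-based test can see a consistent shift that an outlier hides from a t-test:

#!/usr/bin/env python3
# Toy contrast between a parametric and a non-parametric test; the
# expression values are made up, with one outlier in the treated group.
from scipy import stats

control = [10.1, 9.8, 10.4, 10.0, 10.2]
treated = [12.5, 13.1, 12.8, 13.4, 200.0]  # note the outlier

# The t-test is dominated by the outlier's huge variance...
t_stat, t_p = stats.ttest_ind(control, treated)
# ...while the rank-based test only sees that every treated value is higher.
u_stat, u_p = stats.mannwhitneyu(control, treated, alternative="two-sided")

print("t-test p = {:.3f}".format(t_p))
print("Mann-Whitney U p = {:.3f}".format(u_p))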

Even if I suggest exploring Tibshirani’s paper and the method it proposes, today I would like to focus on rSeqNP, an R package implementing a non-parametric approach, recently published in Bioinformatics. Before the analysis can be run, the package needs a processed dataset: it works on the expression estimates of all the genes and their isoforms for each sample in the RNA-Seq study, and outputs from rSeq, RSEM and Cuffdiff are accepted.

The package provides different methods for different kinds of analysis, as explained in the following table (Shi et al., Bioinformatics 2015).

[Table 1 from Shi et al. (2015), listing the methods rSeqNP provides for each type of analysis]

In simulation analyses, the package was shown to have a well-controlled type I error rate and to achieve good statistical power for moderate sample sizes and effect sizes. In the supplementary data, you can find a demo analysis on real data (Leng et al., 2013) and a comparison with the EBSeq package. rSeqNP can also detect alternative splicing by computing an overall score, the gene-level differential score (GDS).

The package and its documentation are free to download at http://www-personal.umich.edu/~jianghui/rseqnp/.

Project Rosalind: learn bioinformatics and programming through problem solving

We could agree that a bioinformatician is basically a naked, starving castaway trying to survive on a desert island. As in one of those reality shows that run on TV, or in the movie starring Tom Hanks, he is provided with a knife, very few clothes, and a good dose of motivation. In this allegory, the island is computational research in the life sciences, the knife represents programming and mathematical skills, and the few clothes are the biological knowledge. Like a castaway, the computational biologist’s main occupation is to solve problems, doing his or her best to build new tools, explore the environment, fetch food (or a fair amount of coffee), and grow his or her knowledge.

Many educational programmes in bioinformatics, both at the academic and the open-course level, are oriented towards providing the basis for computational work: programming skills, the minimum biological knowledge, and statistics. In our story, this would mean that most of the programmes you are going to meet will just provide you with the knife and a couple of tattered clothes.

This is the reason why I was really amazed when I discovered Rosalind, a website proposing a bioinformatics training system oriented to problem solving. The training is organised as a game: you subscribe with your email, and you are offered bioinformatics problems at different levels of complexity. Problems are divided into several topics, and every problem gives you points if solved, with no penalty for failure. Remarkably, and against any expectation, this doesn’t look like a website for students only. The diversity of the problems proposed and the number of fields involved are really high, and even experienced bioinformaticians may find the website really useful for learning new things. Moreover, lecturers can apply for a professor account and use Rosalind to generate exercises to propose in class.
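To give a flavour of the entry level, one of the very first problems on the site simply asks you to count the nucleotides in a DNA string; a minimal solution fits in two lines (the sample string below is illustrative):

# Rosalind's "Counting DNA Nucleotides" flavour: print the counts of
# A, C, G and T in a DNA string. The sample string is illustrative.
dna = "AGCTTTTCATTCTGACTGCA"
print(dna.count("A"), dna.count("C"), dna.count("G"), dna.count("T"))

From there, the problems climb quickly towards alignment, assembly and phylogenetics.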

The project is carried out through a Russian-American collaboration between the University of California at San Diego and Saint Petersburg Academic University, along with the Russian Academy of Sciences. It is inspired by a handful of e-learning projects that provide problem-solving platforms on the web, such as Project Euler and Google Code Jam.

Luckily, computational research in biology is only partially represented by the castaway allegory. Indeed, as you do bioinformatics you are not on a remote island: you can enjoy communication with other scientists, and the (more or less) free learning resources available on the web. And even if you may sometimes feel alone on your island, with dirty, torn clothes on and a blunt knife in your hand, you can still lean on some comfort and help. In this light, we may regard projects like Rosalind as a nice volleyball friend keeping you company during the darkest nights.

Nextflow: a DSL for parallel and scalable computational pipelines.

I return to pipeline creation tools because I was told about the Nextflow project, a DSL for creating parallel and scalable computational pipelines. As already discussed in a recent post, we can sort pipeline solutions according to how they deal with your code: some are able to connect functions, some require dedicated modules, and others are able to connect different programs by integrating their standard I/O.
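To make the standard-I/O style concrete, this is the bare idea in plain Python (a generic sketch of the gluing approach, not Nextflow’s syntax; the commands and file name are illustrative):

#!/usr/bin/env python3
# Generic sketch of gluing two programs through standard I/O, the
# equivalent of the shell pipeline: sort input.txt | uniq -c
import subprocess

sort_proc = subprocess.Popen(["sort", "input.txt"], stdout=subprocess.PIPE)
uniq_proc = subprocess.Popen(["uniq", "-c"], stdin=sort_proc.stdout,
                             stdout=subprocess.PIPE)
sort_proc.stdout.close()  # let sort get SIGPIPE if uniq exits early
output, _ = uniq_proc.communicate()
print(output.decode())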

Come to think of it, most people working in bioinformatics end up mixing different things. Usually a scripting language is used for data mining and R comes in for statistical analysis, but other languages or tools may be needed. Thus, any solution that is able to integrate different languages is warmly recommended in most cases.

The pretty amazing thing about Nextflow is that the authors took the time to implement a real Domain-Specific Language, namely a programming language dedicated to solving specific tasks. To put this in context, languages such as R, SQL or Mathematica are defined as DSLs too. Of course, this bodes well for the extensibility and power of the language.

Nextflow is provided as a simple installation package, and Java 7+ is the only required dependency. The syntax is simple, and so is scaling: you can develop on your laptop, run on a grid, and scale out to the cloud with no modifications to your code. More intriguingly, the whole thing is designed around message passing, to better suit a parallel computing approach.

The project is developed at the Comparative Genomics group of the CRG in Barcelona (the people who created and maintain the T-COFFEE suite), and is headed by Paolo Di Tommaso.

MethylMix: an R package for identifying DNA methylation-driven genes

The paper I am going to explore today introduces MethylMix, an R package designed to identify DNA methylation-driven genes. DNA methylation is one of the most extensively studied processes in biomedicine, since it has been found to be a principal mechanism of gene regulation in many diseases. Although high-throughput methods can produce huge amounts of DNA methylation measurements, there are only a few tools that formally identify hypo- and hypermethylated genes.

This is the reason why Olivier Gevaert from Stanford proposed MethylMix, an algorithm to identify disease-specific hyper- and hypomethylated genes, published online yesterday in Oxford’s Bioinformatics.

The key idea of this work is that one cannot lean on an arbitrary threshold to determine the differential methylation of a gene: the assessment of differential methylation has to be made in comparison with normal tissue. Moreover, the identification of differentially methylated genes must come along with a transcriptionally predictive effect, thus implying a functional relevance of methylation.

MethylMix first calculates a set of possible methylation states for each CpG site found to be associated with genes showing differential expression. This set is created by comparison across clinical samples, using the Bayesian Information Criterion (BIC). Then, a normal methylation state is defined as the mean DNA methylation level in normal tissue samples. Each methylation state is compared with the normal methylation state to calculate the Differential Methylation value (DM-value), defined as the difference between that methylation state and the mean DNA methylation in control samples. The output is thus an indication of which genes are both differentially methylated and differentially expressed.
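As a toy numerical illustration of the DM-value alone (MethylMix itself fits a mixture model selected by BIC to define the states; the values below are made-up beta values):

#!/usr/bin/env python3
# Toy illustration of the DM-value: the difference between a methylation
# state found in disease samples and the mean methylation of normal
# samples. Values are made-up beta values; this is not MethylMix itself.
disease_states = [0.85, 0.20]      # methylation states in tumour samples
normal_betas = [0.45, 0.50, 0.40]  # beta values in normal tissue

normal_mean = sum(normal_betas) / len(normal_betas)
dm_values = [round(state - normal_mean, 2) for state in disease_states]
print(dm_values)  # positive: hypermethylated; negative: hypomethylated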

As mentioned, the algorithm is implemented as an R package, which is already available through Bioconductor.