Right about one year ago, I was sharing a flat with some Spanish guys in the deep heart of Gràcia, a historic neighbourhood in Barcelona. To be honest, those guys fitted the definition of “friki” quite well (a Spanish rendering of the term “freaky”), which indicates the kind of people attracted by oriental spirituality, organic food, ecological behaviours, handmade flea-market clothes and hemp derivatives of all kinds. Boldly and briefly: hippies. Being an ecologist activist with radical autonomist positions (I am a bit of a hippie too), I tend to have a good relationship with this kind of people, at least until they find out that I work in Science, in Biology, and most importantly in Plant Biology. The path from me explaining my work to them asking about GMOs is very short, and my efforts to explain that I just study the evolution of plants without modifying them are normally useless. And right about one year ago, I had to spend a whole afternoon defending my work and debunking a lot of their misconceptions.
It was a cold November morning in 2011. Sapienza University has a huge campus next to the city centre of Rome, where the main faculties are housed in huge buildings in the rationalist style. Yet the faculty of Biochemistry has a detached site in San Lorenzo, the neighbourhood flanking the campus. I was crossing the streets of this wonderful ex-industrial alternative hood to reach my new lab. The clock showed 10:30 AM, and I was joining bioinformatics. Professor Stefano Pascarella had agreed to supervise my master thesis, and it was my very first day. Four years have passed, I have graduated and worked in five different labs, and even if my experience is not really long, I think I already have a couple of stories to tell.
Stupidity matters. Although most people tend to link science to intelligence and genius, seeing research as a matter for the “smart guys”, we must admit that the lab routine is often studded with the crap we make, and that researchers can become protagonists of acts of remarkable stupidity. And if we scan the first, faltering steps of a researcher’s career, we may find a couple of funny nerdy stories to tell colleagues in a bar. And since I’d be so sorry to know that any of you might run out of funny anecdotes about grad students’ stupidity, let me report the four most stupid things I have ever done in bioinformatics.
Trying to fetch information from UniProt on 1750 genes without any programming
The first task of my master thesis was simple. My advisor provided me with a list of 250 UniProt IDs of MocR proteins from several bacterial genomes: helix-turn-helix transcription factors with an aminotransferase-like domain that regulates them allosterically through pyridoxal-5’-phosphate binding. The lab had identified these sequences with HMMER, and we wanted to know something more about the flanking regions. The professor told me to annotate the 3 upstream and 3 downstream coding regions in order to see whether some recurrences could indicate a conserved multigenic region; simple and straightforward.
The next day I was shattered, staring blankly at my screen at 8 pm after ten hours of work. A hard lesson that I have learned over time is that if you did something wrong in designing your bioinformatics workflow, a spreadsheet will show up at a certain point. I was staring at an OpenOffice Calc window with about 40 rows, and had managed to find a way to scan the flanking regions manually. I don’t remember my glorious strategy exactly, but it must have gone something like this:
- Copy and paste the ID into UniProt and search for it.
- Scroll all the way down to the cross-link pointing to a graphical genome browser and open it.
- Perfect, you are on the spot! Now move the browser back and forth, and you will find the flanking sequences.
- Select any flanking gene in the interval and make your way back to UniProt.
- Save the information you get (the UniProt ID, basically) in a spreadshit and go on.
I was then advised to stop doing this and to get on with studying Python. That was the day I learned that there is no bioinformatics without programming.
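Looking back, the lookup itself is a few lines of code once the gene coordinates sit in a table. Here is a minimal sketch, assuming a list of (gene ID, start position) pairs for one genome has already been extracted from an annotation file. The function name `flanking_genes`, the example IDs and the coordinates are all made up for illustration, and this is not what the lab actually ran.

```python
def flanking_genes(genes, target_id, n=3):
    """Return the n genes before and the n genes after target_id.

    genes: list of (gene_id, start) tuples for a single genome.
    Assumption: "upstream" simply means a lower start coordinate;
    strand and circular chromosomes are ignored in this sketch.
    """
    ordered = [gene_id for gene_id, _ in sorted(genes, key=lambda g: g[1])]
    index = ordered.index(target_id)  # raises ValueError if the ID is absent
    upstream = ordered[max(0, index - n):index]
    downstream = ordered[index + 1:index + 1 + n]
    return upstream, downstream


# Made-up coordinates and IDs, purely for illustration.
example = [("g5", 5000), ("g1", 100), ("mocR", 3000), ("g2", 900),
           ("g3", 1500), ("g4", 2200), ("g6", 6100), ("g7", 7000)]
up, down = flanking_genes(example, "mocR")
print(up, down)
```

Looping this over 250 target IDs gives the 1750 entries without a single copy and paste. A real version would also have to take the strand of the target gene into account.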
The protein-DNA docking to fetch promoters.
After the first explorations, the final goal of my M.Sc. thesis work became the identification of a conserved promoter region upstream of the neighbouring genes pdxS and pdxT, coding for the two subunits of the pyridoxal-phosphate synthase holoenzyme in bacteria. This memory tastes a bit sweet, as usual when you end up remembering how naive you were as a newbie. It was early 2012, January or maybe February. During a lab meeting, I argued that a good option to find our promoters was to perform a docking analysis on a set of candidate promoter sequences, docked with the MocR transcription factor that had been found to activate their transcription. After explaining my point, I realised that everyone was just looking at me with dismay. Do you know that awful feeling of everyone in the room looking at you like you’re crazy? It was explained to me that the methods developed for protein-DNA docking were still too unreliable to give a trustworthy result. Protein-DNA docking to infer the binding region of an HTH? Pure science fiction. At least, that day I was introduced to one of my favourite topics in bioinformatics: the communication between DNA and proteins.
Declaring profanities as variables in your code.
Even if I am quite used to threading jokes into my code, taking it as a “nerdy rebellion” against my even more nerdy work routine, what I am going to tell here didn’t actually happen to me. I include this story, which I only heard of, because it’s really worth reading.
In teamwork, sharing code is fundamental, and the best habit you can adopt is to name variables in human language and to write proper comments, so that the people who read your code can understand it (to whatever extent possible). Anyway, the first thing you should care about before sharing your code is making sure that it won’t worsen the opinion your colleagues have of you.
This story has all the ingredients that a good academic joke needs to succeed: a polite and old-fashioned thesis director, a graduate student with a sense of humour that his advisor won’t get, swear words, profanities, and a Perl script to show them off.
Stefano Pascarella is not old at all, but he is still the kind of super-mannered and polite Italian professor. I worked in his lab for two years, and never heard him yelling at anyone or even expressing disappointment harshly. Quite remarkable, since he was my thesis advisor. I never met the student who is the protagonist of this story, though, and I can only picture him as the typical 20-something master student. The only thing I am pretty sure about is that one day he wasn’t in the lab, and his code was needed for some reason.
Professor Pascarella sat down in front of the terminal and rapidly found the file he needed. The people who told me this story just can’t forget the expression on the professor’s face. A calm and bored expression immediately turned serious, then swiftly faded into dismay. Every single variable in the code he was reading was either a bad word or a profanity.
Later that day, the student received an email “kindly asking” him “to take his coding routine more seriously”.
Ignoring the find/replace function in a text editor.
Ok, I can guess what you are thinking. “This moron didn’t know that text editors had a find/replace function and corrected a whole script manually to change a single word”. Not so: I did something that is possibly worse. When I started to write code, I did not actually know about the existence of this amazing function in my text editor, but I was still very sure that the process had to be automated. My ignorance of text editors mixed dramatically with my inclination for programming, giving rise to one of the most stupid things I have ever done.
When I finished and tested the script named changeword.py, I was totally sure that it was one of the best things I could produce with my short programming experience. I don’t really remember the code, but it must have looked something like this:
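#!/usr/bin/env python3
import sys
filein = sys.argv[1]          # file to read
word_to_change = sys.argv[2]  # word to look for
replacement = sys.argv[3]     # word to put in its place
a = open(filein, 'r')
b = a.read()
a.close()
# Send the modified text to the standard output
sys.stdout.write(b.replace(word_to_change, replacement))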
To run it, you just needed to pass the file, the word you wanted to change and its replacement, and everything went to the standard output:
$> ./changeword.py my_file.txt first_word second_word > my_corrected_file.txt
Et voilà, the text came out changed. Luckily, at a certain point I realised that my fantastic script couldn’t handle every change I might need, and decided to discuss this problem with a postdoc in my lab. He is still laughing about this.
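I no longer remember which changes defeated the script. One plausible case, and this is only a guess, is that a plain substring replacement also rewrites the word when it sits inside a longer one. The sketch below contrasts the two behaviours using Python’s standard `re` module; the function names are mine and purely illustrative.

```python
import re


def replace_substring(text, word, replacement):
    """Plain replacement: also rewrites `word` inside longer words."""
    return text.replace(word, replacement)


def replace_whole_word(text, word, replacement):
    """Replace `word` only where it stands alone.

    The pattern wraps the escaped word in regex word-boundary anchors.
    re.escape keeps characters such as '.' in `word` from being read as
    regex syntax, and the callable keeps `replacement` literal.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.sub(pattern, lambda match: replacement, text)


sample = "gene genes gene"
print(replace_substring(sample, "gene", "locus"))
print(replace_whole_word(sample, "gene", "locus"))
```

The word-boundary anchors only behave as expected when the word starts and ends with a letter, digit or underscore, which is one more reason to use the find/replace dialog of a decent text editor.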
Writing the MD5 checksum into the same file from which I computed it.
This happened a few months ago. Tracking your input, output and script files is very important, and even if we are not used to version control systems, annotating every file with its MD5 code may help, to some extent, in keeping better track of your work.
The MD5 algorithm computes a fixed-length code (a checksum) from an input. If you feed a file to MD5, the resulting code is, for all practical purposes, a fingerprint of that file. Of course, if you modify the file, the resulting MD5 code will change.
I was finishing a long scripting course and was adding information to my tab-separated output file in a hash-prefixed header. Once I had calculated my MD5 code, I had the brilliant idea of writing it into the same file I had computed it from. Needless to say, after I pasted the MD5 code into the file, the MD5 code of that new file inexorably changed.
It took me a good quarter of an hour to realise it. It was 9 PM, and I thought it was just my brain asking me to go home for some rest.
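The trap is easy to reproduce. Here is a minimal sketch with Python’s standard `hashlib` module, working on a temporary tab-separated file. The file name, its content and the `.md5` sidecar convention are assumptions made up for the example, not what I actually used that evening.

```python
import hashlib
import os
import tempfile


def md5_of_file(path):
    """Return the hexadecimal MD5 digest of the file's bytes."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sidecar(path):
    """Store the checksum next to the file (hypothetical '.md5' convention)."""
    checksum = md5_of_file(path)
    with open(path + ".md5", "w") as handle:
        handle.write(checksum + "\n")
    return checksum


def demo():
    """Try the sidecar approach first, then repeat my mistake, on a temp file."""
    with tempfile.TemporaryDirectory() as workdir:
        table = os.path.join(workdir, "output.tsv")
        with open(table, "w") as handle:
            handle.write("gene\tscore\npdxS\t1.0\n")
        before = md5_of_file(table)
        sidecar = write_sidecar(table)   # the data file is left untouched
        unchanged = md5_of_file(table)
        # The mistake: append the checksum to the file it was computed from.
        with open(table, "a") as handle:
            handle.write("# md5: " + before + "\n")
        after = md5_of_file(table)
        return before, sidecar, unchanged, after


before, sidecar, unchanged, after = demo()
print(before == unchanged, before == after)
```

Keeping the checksum in a separate file leaves the data file, and therefore its checksum, untouched. Writing it into the file itself guarantees that the recorded value no longer describes the file.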
As I said at the beginning of this article, stupidity matters. And being able to laugh at yourself matters even more. Cognitive work requires the application of all your rationality, and it is thus fundamental to understand its limits, that is, the borders of your intellectual skills that are shaped by stupidity. I think that there is no shame in recognising your own limits, and publicly admitting them is somehow therapeutic.
Quoting an Italian PhD student I met at my department, who recently graduated: “there is no use for a PhD course except in the light of understanding how stupid you are”. I have recently registered for the second year of my PhD here at the CRAG, and still have a long way ahead to explore the deepest corners of my stupidity.
After all, the Diesel advertisement shown as the heading image of this post may be right. You are stupid only if you try to explore your limits. And this is exactly what I am up to.
War on Science, yet another chapter. I think that anyone working in Science or caring about it, and anyone who aims to raise public awareness of scientific issues of global interest, tends to spend some time countering hoaxes, misconceptions and anti-scientific propaganda. Most of the time, you end up pointing those who claim that “official science” is lying to the documents published by health and science officials. If someone claims that vaccines cause autism, or are potentially harmful to children’s health, you may consider responding with data provided by health institutions. Likewise, if someone is keen to promote homoeopathy as a real cure, documents published by the NIH, FDA or WHO that show its lack of efficacy could turn out to be really useful. Basically, most of the time, national and international health institutions are on your side, providing you and the general public with referenced data and clear positions in favour of “official” biomedical science. But what happens if a national health institution changes course and starts supporting one of the major scientific hoaxes ever, such as homoeopathy?
This disturbing scenario has just become reality in Italy. The current Minister of Health, Beatrice Lorenzin, wrote the preface of a book supporting homoeopathy, entitled In Praise Of Homoeopathy (Elogio della omeopatia) and written by Giovanni Gorga, president of an association of companies producing and distributing homoeopathic products. Even if the official position of the Ministry on homoeopathy has remained unchanged, and homoeopathic products are sold in Italy as “medicines without any approved therapeutic indication”, this clear stance by Minister Lorenzin raises concerns in the Italian scientific community.
The Italian non-profit organization CICAP, which has been working for years to counteract the spread of anti-scientific information in Italy, has presented an open letter asking the Minister to clarify her position on the real efficacy of homoeopathic products, and to publicly declare that there is no evidence supporting it. The International Association of Italian Researchers (AIRI) is also working to spread this letter and rally the support of Italian researchers.
Being an ecologist and a radical leftist, I am very far from being a supporter of Beatrice Lorenzin. Forty-four years old, serving as minister of health since the spring of 2013, first in the government led by Enrico Letta and then in the one led by Matteo Renzi, Beatrice Lorenzin built her political career within the right-wing coalition led by Silvio Berlusconi. Anyway, I have always considered her a very reasonable woman and a politician of rare quality in the awful Italian political landscape (not a big medal, actually). I am in fact pretty surprised by this awkward failure, and I am still confident that everything can be fixed.
I would just regard this as yet another strange thing coming from Italy, or as one of the many events showing how difficult the relationship between science and governance is, but I fear that something more serious is on the way. In the neo-liberal West, governments are all about the economy, and the promotion of the private sector has become the only concern of administrations of any political colour. Last April, Bloomberg published an analysis pointing out that homoeopathy is a billion-dollar market in the United States. I fear that the only element we don’t consider in this matter is how much money a hoax can generate. Even if it is pretty clear that homoeopathic products have no effect on human health, homoeopathy is still able to generate consensus and to turn it into business and jobs. And a disturbing question comes along: will governments shut down a wealthy sector for ethical reasons?
We could agree that a bioinformatician is basically a naked, starving castaway trying to survive on a desert island. As in one of those reality shows on TV, or in the movie starring Tom Hanks, the castaway is provided with a knife, a few clothes, and a good dose of motivation. In this allegory, the island is computational research in the life sciences, the knife represents programming and mathematical skills, and the few clothes are biological knowledge. As for a castaway, the main occupation of the computational biologist is to solve problems, doing his or her best to build new tools, explore the environment, fetch food (or a fair amount of coffee), and grow his or her knowledge.
Many educational programmes in bioinformatics, both at academic and open-course level, are oriented towards providing the basis for computational work: programming skills, the minimum biological knowledge, and statistics. In our story, this means that most of the programmes you are going to meet will just provide you with the knife and a couple of tattered clothes.
This is the reason why I was really amazed when I discovered Rosalind, a website offering a bioinformatics training system oriented towards problem solving. The training is organised as a game. You sign up with your email, and you are offered bioinformatics problems at different levels of complexity. Problems are divided into several topics, and every problem gives you points if solved, with no penalty for failure. Remarkably, and contrary to what one might expect, this doesn’t look like a website for students only. The diversity of the problems and the number of fields involved are really high, and even experienced bioinformaticians may find this website really useful to learn new things. Moreover, lecturers can apply for a professor account and use Rosalind to generate exercises for their classes.
The project is run by a Russian-American collaboration between the University of California at San Diego and Saint Petersburg Academic University, along with the Russian Academy of Sciences. It is inspired by a handful of e-learning projects that provide problem-solving platforms on the web, such as Project Euler and Google Code Jam.
Luckily, computational research in biology is represented only partially by the castaway allegory. Indeed, when you do bioinformatics you are not on a remote island, as you can enjoy communication with other scientists and the (more or less) free learning resources available on the web. And even if you may feel alone on your island sometimes, with dirty and torn clothes on and a blunt knife in your hand, you can still lean on some comfort and help. In this light, we may regard projects like Rosalind as a nice volleyball friend keeping you company during the darkest nights.
The genome is a real thing, and this is something we strongly need to keep in mind. The development of bioinformatics has led us to make a very important, but still bold, simplification. A strong focus on sequences, and the information they bear, allowed us to understand how genes determine the structure and function of proteins, and is driving the work of everyone focusing on the interpretation of non-coding elements, in the restless search for what some call the regulatory code. Basically, we took the object shown in the picture above and transformed it into flat files to which information theory was applied. Beyond the obvious and widely discussed advantages, this approach has the potential to be misleading. The genome is a physical body, with its physical and chemical features. And as epigenetics is putting protein-DNA interaction under the spotlight, many studies are underlining that the functioning, the regulation, and thus the evolution of the genome need to be explored considering the genome as what it really is: a complex three-dimensional object.
I really enjoyed reading a paper dating back to 2011, authored by Johan H. Gibcus and Job Dekker from the University of Massachusetts. Entitled The Hierarchy of the 3D Genome, the article provides an effective point of view on how radically DNA folding affects genome regulation. Recent innovations in probing interphase chromatin folding are providing new insights into the spatial organisation of genomes and its role in gene regulation. Indeed, a paper by Marc M. Renom (CNAG-Barcelona) in PLOS, aimed at explaining the state of the art of computational methods for genome folding analysis, argues that since the advent of fluorescence in situ hybridisation imaging and chromosome conformation capture methods, the availability of experimental data on the three-dimensional organisation of genomes has dramatically increased. This information has recently been made available in the 3D Genome Database (3DGD), the work of a Chinese team, which gathers Hi-C chromatin conformation capture data for four species (human, mouse, Drosophila and yeast).
Of course, many results showing a role of genome folding in gene regulation and phenotype determination are emerging. As already discussed in this blog, researchers from McGill University in Canada have shown that leukaemia types can be classified with chromatin conformation data. From an evolutionary point of view, we could have a look at this paper published in Nature in 2012, in which specific chromatin-interaction domains, defined as topological domains, are found to be conserved over time and across species.
Beyond any further consideration and discussion, we could assume that a change in the approach we adopt in genome studies is needed. These findings suggest that a further level of complexity affects genome regulation, and this definitely cannot be ignored. In evolution, we should ask how chromatin structures have become established over time, and understand their meaning in phenotype and adaptation. Of particular interest would be the role of non-coding sequences, the so-called junk (and not so junk) DNA, which has been found in many topological domains and may have a role. Ultimately, as we assign a function to the three-dimensional structure of DNA, as we did for proteins, we should investigate the relationship between sequence and structure, and the information exchange between proteins and DNA in protein binding. It seems that not everything is clear about the nature of the information in biological macromolecules, but that is anything but news.
We already had the chance to discuss the importance of reproducibility in computational research, and to comment on some good practices to improve it. As we read the Ten Simple Rules that Sandve and co-workers proposed last October, we cannot help but underline the importance of pipelines. A correct pipeline-based approach prevents researchers from making potentially harmful manual interventions on data, and lets them keep proper track of their workflow. Pipelines are just perfect for dealing with the usually huge bioinformatics tasks, which require a large amount of computation and several sorting and filtering steps. Despite the usual controversies about this, we can say that Python is becoming the first choice of many bioinformaticians, because of its powerful features, its dynamic and large community, and its ease of use. That is why I think it is fair to discuss a couple of Python-based pipeline creation tools.
The first point is to take stock of the main features we should ask of a Python pipeline creation tool. Of course, anyone will appreciate a lightweight system for obvious reasons, and things like a simple syntax, scalability and the possibility to manage complex workflows with ease will be very welcome. Another aspect I want to mention is the possibility to include previously written code in a pipeline system. There are two main reasons for this. First, functions and classes may be re-used in different projects, and having a pipeline system working as a “wrapper” around your code may ease this. Second, many Python beginners are not really used to the pipeline philosophy, as Python works great with a module-based approach (even if not exclusively).
The different solutions I found around can be distinguished on the basis of their relationship with the code, that is, according to how they thread into a Python script. Let’s assume, for simplicity, that a program is made up of functions that are grouped into modules, and that several modules make up the whole thing. We have thus identified three concentric levels: a function level, a module level, and a multi-file level. Pipeline systems for Python are basically modules that let you add some simple-syntax code to your scripts to manage the data flow. Several commands, usually defined as decorators, are provided to arrange the data flow into an organised pipeline. So, we will discuss how different solutions work at different levels.
Pipelines working at function-level: Ruffus and Joblib
Published in 2010 in Bioinformatics by Leo Goodstadt at Oxford University, Ruffus is available on its official website, where you can find complete documentation and tutorials. As you can notice, Ruffus works by connecting consecutive input/output files, and requires the developer to write the functions in the code following the order of the data flow. Each function must be preceded by the @follows decorator, indicating the flow direction, and by @files, which declares the input/output files. That is why I describe it as a pipeline system working at a “function level”: the internal module structure of a script depends on the structure of the pipeline.
This approach is somehow related to the one implemented in Joblib, a Python pipeline system that is mostly oriented towards easing parallel computation. Despite the substantial differences, in both cases the structure of the script depends on the structure of the pipeline.
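To give a feeling of what “the structure of the script depends on the structure of the pipeline” means, here is a toy decorator-based pipeline in plain Python. It is only a sketch of the general idea and it is not the Ruffus or Joblib API: the names `task`, `run` and `PIPELINE` are hypothetical, and data is passed in memory rather than through input/output files.

```python
PIPELINE = []  # every registered step, in the order it appears in the script


def task(after=None):
    """Register a function as a pipeline step that runs after `after`."""
    def register(func):
        func.after = after  # remember the upstream step on the function itself
        PIPELINE.append(func)
        return func
    return register


@task()
def load(_):
    # First step: ignores its input and returns some made-up gene names.
    return ["pdxT", "pdxS", "pdxS"]


@task(after=load)
def deduplicate(genes):
    return sorted(set(genes))


@task(after=deduplicate)
def count(genes):
    return len(genes)


def run(target):
    """Run `target` after recursively running the step it depends on."""
    upstream = run(target.after) if target.after is not None else None
    return target(upstream)


print(run(count))
```

Each step has to be defined before the step that follows it, because the decorator refers to the upstream function by name. That is exactly the constraint on script layout discussed above.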
Pipelines working at module-level: Leaf
Leaf is a project published a couple of months ago by Francesco Napolitano, at the University of Salerno in Italy. The key idea is to provide a system to declare a pipeline structure without changing the code. At the beginning of the module, it is possible to include a decorator to build a graphical scheme of the pipeline you have in mind. A simple visual language, the Leaf Graphical Language, is implemented to build the dependencies graphically, with the possibility to export the whole workflow as hypertext to share results. Leaf comes as a Python library and can be downloaded here.
The key differences between Ruffus and Leaf are shown in the following picture (Napolitano et al., 2010).
As is evident, Leaf works as a real wrapper, whereas Ruffus requires a specific script structure.
Pipelines working at multi-file level.
Pipelines can be designed to interconnect different Python modules. In this case, the pipeline tool works at an “upper level”, standing above the different modules. This is the philosophy underlying the most common pipeline creation software, and I would like to mention Bpipe, which is one of the most recently developed (but there are quite a lot around). Of course, as scripts in any language can work with standard streams, we are slipping a bit away from the range of “Python-dedicated pipeline tools”, and learning good ol’ GNU Make is still worthwhile if you are keen to work (or need to work) with pipelines at this level.
I cannot really tell which is the best one, since the choice depends on the project, the coder’s attitude and the specific needs. Furthermore, this post is just rattling off a few projects I happened to find around, and more suggestions are very welcome.
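The point about standard streams can be sketched with Python’s standard `subprocess` module: each step is a separate program that reads from standard input and writes to standard output, and a small driver hands the output of one step to the next. The two inline “scripts” below are stand-ins I made up so that the example is self-contained; in a real project they would be separate files, possibly in different languages.

```python
import subprocess
import sys

# Two stand-in "scripts", passed inline with -c so the sketch is self-contained.
UPPERCASE = "import sys; sys.stdout.write(sys.stdin.read().upper())"
REVERSE = "import sys; sys.stdout.write(sys.stdin.read().strip()[::-1])"


def run_chain(steps, text):
    """Run each step as its own process, handing the standard output
    of one step to the standard input of the next."""
    data = text
    for source in steps:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=data,
            capture_output=True,
            text=True,
            check=True,  # stop the chain if a step exits with an error
        )
        data = result.stdout
    return data


print(run_chain([UPPERCASE, REVERSE], "acgt\n"))
```

The driver knows nothing about what happens inside each step, which is the sense in which a tool at this level stands above the modules.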
The role of mathematical modelling in evolutionary biology is often questioned, despite being integral to studies on evolution. Unlike other scientific disciplines, such as Physics or Chemistry, Biology was born as a descriptive science, and the establishment of mathematics as an effective and indispensable part of investigation is still to be fully accomplished. In evolutionary research, an important role of mathematics is to provide a “proof-of-concept” test of verbal explanations, paralleling the way in which empirical data are used to test hypotheses.
Whereas the connection between empirical analyses and theoretical modelling is straightforward in some cases, such as the construction of likelihood functions for parameter inference and model choice, empiricists may not appreciate the importance of highly abstract models, which might not provide immediately testable predictions. Probably, skepticism stems from some misconceptions and misunderstandings about mathematical modelling, and a clarification about its role may ease the communication between experimentalists and theoretical biologists.
Some evolutionary biologists from the USA point this out in a very clear paper, published some days ago in PLOS Biology, and first-authored by Maria Servedio from the University of North Carolina. The parallels between empirical experimental techniques and proof-of-concept modelling in the scientific process are explained in the following flowchart.
As shown, proof-of-concept models are best suited to testing the logical correctness of verbal hypotheses, that is, whether certain assumptions actually lead to certain predictions. Hypotheses about which assumptions are most commonly met in Nature are instead argued to be addressable only by empirical approaches.
The discussion of the most common misunderstandings centres on three main points.
The authors first argue that the main misunderstandings about mathematical modelling arise when theoreticians are asked how they might test their proof-of-concept models empirically. The models are argued to be themselves tests of the validity of verbal assumptions, and their outcome can thus determine whether a verbal model is valid or flawed.
Second, this does not mean that proof-of-concept models do not need to interact with empirical work. Actually, in most cases, quite the contrary is true. Many vital links between theory and natural systems can be found at the assumption stage, the prediction stage and even the discussion stage, when empirical results are threaded into a broader conceptual framework.
Third, the authors point out that a discordance between theoretical predictions and empirical data may be of great interest, giving both theoreticians and experimentalists the opportunity to appreciate underrated phenomena, or to reconsider the assumptions and empirical procedures.
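To make the idea of a proof-of-concept model a bit more concrete, here is a toy example that is not taken from the paper. The verbal claim is “an allele with a constant fitness advantage will spread through a very large haploid population”, and the model is the textbook deterministic recursion p' = p(1 + s) / (1 + ps). The function names and the parameter values are my own assumptions for illustration.

```python
def next_frequency(p, s):
    """One generation of haploid selection.

    Allele A has fitness 1 + s and the alternative allele has fitness 1;
    p is the current frequency of A. Drift and mutation are ignored.
    """
    return p * (1 + s) / (1 + p * s)


def trajectory(p0, s, generations):
    """Frequencies of allele A from generation 0 to `generations`."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], s))
    return freqs


advantaged = trajectory(0.01, 0.1, 200)  # rare allele, 10% advantage
neutral = trajectory(0.01, 0.0, 200)     # same allele without any advantage
print(advantaged[-1], neutral[-1])
```

The model does not tell us whether any real allele has a constant advantage, which remains an empirical question. It only checks that the verbal argument holds together under its own assumptions.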
Although this paper discusses in detail the role of theoretical modelling in evolutionary biology, we should take our time to reflect, in general terms, on the relationship between experimental work and mathematical modelling. I am about to write about the criticism levelled at some of the most common algorithms for NGS data analysis, because I have the feeling that the search for proper mathematical modelling algorithms will be one of bioinformaticians’ main occupations in the coming years. This article thus serves as a fair example, even if not directly applicable to all fields of the life sciences, of how the relationship between empirical and mathematical work should be properly interpreted.