Five ingredients to become a bioinformatician.

How do you become a bioinformatician? Many people, at different career stages, are trying to answer this question, looking for the best path to the knowledge the job requires. Many academic institutions offer Bioinformatics degrees at undergraduate and postgraduate level, and you can easily find free courses on the internet to get a fair introduction to bioinformatics. However, the fast pace of development in this field, and the huge diversity of its applications, tend to weaken the effectiveness of official courses. If Science was a war, bioinformatics would be guerilla, a merciless battle where you are alone, fighting in a jungle of algorithms, big data, statistics and software solutions.

One of the blogs I enjoy the most is Guillaume Filion’s “The Grand Locus”. Guillaume is a group leader at CRG in Barcelona, and he linked his group’s informal page to a personal blog, where he writes mostly about bioinformatics and biostatistics, putting in an amazing and very effective mix of scientific insights and personal experience. In one of his latest posts, Guillaume tries to answer a question that is becoming very common as the interest in Bioinformatics grows: “How to become a bioinformatician?”. The answer given on The Grand Locus is pretty unexpected, since no particular path, language to learn, skill set, course or strategy is suggested, just three simple tips on how to change your mindset before you start your “journey in bioinformatics”. A great point indeed. You really have to get out of your “comfort zone”, and try over and over with a lot of things you used to ignore before. Also, you must understand the importance of collaboration and community, and you definitely need to become addicted.

Anyway, some more practical tips may be needed, and a discussion about the very first things to learn in order to take the first steps in bioinformatics could turn out useful to many. Honestly, I am not the one to give suggestions. I have been in bioinformatics for only three years, I am still submitting my first papers, and I am quite reasonably in need of some good hints myself. That is why I want to share a couple of ideas with you, and ask you to bring some good criticism to the quick recipe I am going to write down.

I fear I must apologize to the experienced users who will get to read this. I will go into detail, describing a lot of things that are well known to computational biologists, and I understand that this could turn out a bit boring.

First ingredient: the minimal biological knowledge.

Computational biology is a strongly interdisciplinary field, attracting the interest of scientists with radically diverse backgrounds. The number of possible applications is also pretty high, since biology is a vast and heterogeneous area of study. Trivially, the very first thing to do is to acquire the required biological knowledge. Scientists with a non-biological profile will need to train on the basics of biology. The trick is to focus on the key concepts (cell structure, molecular biology, genome organization, evolution…) without being overly fussy. In the long run, anyone tends to reach a very high level of expertise by working in biology. At the beginning, you are not really expected to know a lot, but you need to understand the sense of what you are going to do. Take some internet courses, read some introductory books (cell biology books are great for this, because they summarize the key concepts of biology at the sub-cellular and histological level) and browse Wikipedia (which is quite despised, but still very useful if you use it properly).

This may apply to biologists as well. Of course, if you are just changing your role in your lab, you will hopefully be quite familiar with your work. Still, during my long round of applications, I came to understand that it is quite common for a bioinformatician to range over different projects and subjects. IMHO, one of the main things a bioinformatician should work on is the ability to get rapidly into a totally new biological field.

Second ingredient: get yourself to love statistics.

Math is very important in bioinformatics. Although I kinda reject the idea that statistics is the only branch of math you will need (complex systems, fractal geometry and logical algorithm development may be badly needed in evolutionary studies), we can definitely take statistics and data analysis as the common denominator of almost all bioinformatics projects. This requires an effort, since many projects rely on Bayesian statistics, which is an advanced topic, classically taught at the end of a statistics course. The best approach is to attend a good course, or to patiently work through a big, heavy, but up-to-date statistics book. I am exploring DeGroot’s Probability and Statistics, considered by many to be the most complete book around.
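
To give a small taste of the Bayesian reasoning mentioned above, here is a minimal sketch in Python of a Beta-Binomial update, one of the simplest textbook cases. The scenario (counting reads that carry a variant) and all the numbers are made up for illustration, and the function names are my own.

```python
def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Uniform prior on the variant frequency: Beta(1, 1).
prior = (1.0, 1.0)

# Hypothetical data: 7 reads out of 20 carry the variant.
posterior = update_beta(*prior, successes=7, failures=13)

print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

The prior is simply updated by adding the observed counts, which is what makes this pair of distributions a convenient first example before moving to heavier models.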

Third ingredient: the software quartet.

On the computer side, there are four elements to keep in mind: Scripting, Unix, R and Databases. Honestly, the very first thing you need to do, most likely before adding any other ingredient, is to focus strongly on a scripting language. A scripting language is what will ease and speed up your work; it is the thing that will set you free from Excel, a real Swiss Army knife you won’t be able to do without. Scripting languages are high-level programming languages designed for quick application. Their syntax is pretty easy, and they are quite fast to learn. I started with Python and love it, but Perl is also widely used, and you can also consider Ruby, which is pretty widespread in Asia.
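
As an illustration of the kind of job that usually ends up in a spreadsheet, here is a minimal Python sketch that averages a measurement per gene from a tab-separated table. The table, its column names and the `mean_by_gene` helper are hypothetical; in real life you would open a file instead of using an in-memory string.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical tab-separated table, kept in memory for the example.
TABLE = (
    "gene\tsample\texpression\n"
    "geneA\ts1\t10.0\n"
    "geneA\ts2\t14.0\n"
    "geneB\ts1\t3.0\n"
    "geneB\ts2\t5.0\n"
)


def mean_by_gene(handle):
    """Group the 'expression' column by 'gene' and return the mean of each group."""
    values = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        values[row["gene"]].append(float(row["expression"]))
    return {gene: mean(vals) for gene, vals in values.items()}


result = mean_by_gene(io.StringIO(TABLE))
for gene, value in sorted(result.items()):
    print(gene, value)
```

The same dozen lines work unchanged on a table with a million rows, which is exactly where a spreadsheet starts to struggle.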

The second member of this quartet is R. As many will surely know, R is a language and environment dedicated to statistics. One year ago, I was pretty sure that R was not needed if you know Python and use its mathematical and statistical libraries (NumPy and pandas). After joining a project involving NGS data analysis, I had to change my mind. R comes with a huge set of packages, the Bioconductor suite, that are aimed at, very useful in, and regarded as standards for NGS and experimental data analysis. Actually, one could also consider using R as a scripting language itself. Very personal opinion: not a good idea. I am still a beginner with R, and I may be biased since I love Python, but I think that a scripting language is still the best way to process information. Also, consider that many software distributions and online databases provide APIs (Application Programming Interfaces) that let you write your own scripts to extend their functionality. For instance, Ensembl has a Perl API, and UniProt can be queried programmatically from Python, Perl and even Java. In structural biology, Rosetta comes with a Python interface (PyRosetta), and PyMOL is written in Python and allows the creation of plugins.

Third comes the Operating System. If you play cool and are geeky enough, you are most likely viewing this post on a Mac. Good point, maybe not the best though. In bioinformatics, Unix-based systems are widely used, and job adverts will very often require experience in a Unix-like environment. Mac OS is a Unix-like environment, and you can easily learn how the filesystem works and practice Bash, the most common command-line shell on Unix-like systems. Unfortunately, Mac OS is not really optimized for programmers, and you may run into some trouble. For instance, I am trying over and over to get this huge iMac in my lab to understand that I need to fucking link MySQL to Python, in order to install a library to fetch genomic sequences from the internet. No way. I warmly suggest that you take courage and install Linux. Ubuntu, user-friendly and always beautiful, is the best choice.

The last member of this quartet is the Database. You need to learn how to deal with the main online databases, and you will probably need to create your own to explore and analyze your data. It is very important to understand how a DB is organized, and how the management system works. Broadly, we can say that the dominant form of database is the relational database. Consider learning a bit of SQL (Structured Query Language) and practicing with SQL-based software, such as MySQL or SQLite. Importantly, you can link your scripts in Perl, Python, Ruby and even R to an SQL-based database. Very useful indeed.
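
To show what linking a script to an SQL-based database can look like, here is a minimal sketch using the `sqlite3` module from the Python standard library. The `genes` table, its columns and its contents are invented for the example, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database: handy for practicing, gone when the script ends.
conn = sqlite3.connect(":memory:")

# Hypothetical table describing a few genes.
conn.execute(
    "CREATE TABLE genes (name TEXT PRIMARY KEY, chromosome TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("geneA", "chr1", 1200), ("geneB", "chr1", 800), ("geneC", "chr2", 1500)],
)
conn.commit()

# Count the genes and average their length on each chromosome.
rows = conn.execute(
    "SELECT chromosome, COUNT(*), AVG(length) "
    "FROM genes GROUP BY chromosome ORDER BY chromosome"
).fetchall()

for chromosome, n_genes, avg_length in rows:
    print(chromosome, n_genes, avg_length)

conn.close()
```

The `?` placeholders let the driver handle quoting, which is a habit worth picking up from day one.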

Fourth ingredient: stay very tuned.

Bioinformatics is a rapidly developing science, and novelties come pretty often. Together with a basic knowledge of the fundamental algorithms (FASTA, BLAST, CLUSTAL…) and file formats, you must improve your ability to find new algorithms and decide which is the most suitable for your work. On this blog, I try to share and review new software, because this is of great interest in bioinformatics. Basically, you should work on your geekery and interactivity. Search a lot, join discussions on the internet, rummage around the web. This will help you quite a lot.
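
Speaking of file formats, a good first exercise is to read the FASTA sequence format yourself. What follows is a minimal sketch in plain Python, not a replacement for a proper library; the `read_fasta` and `gc_content` helpers and the example records are made up for illustration.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C characters in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Hypothetical two-record FASTA text; a real script would open a file.
EXAMPLE = """>seq1 hypothetical record
ATGC
GGCC
>seq2
ATAT
"""

records = list(read_fasta(io.StringIO(EXAMPLE)))
for header, seq in records:
    print(header, len(seq), gc_content(seq))
```

Writing a parser like this once helps you understand what the established libraries are doing for you when you switch to them.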

Fifth ingredient: the computer awesomeness.

Many people I talk with, wet lab guys in particular, are surprised at how I am able to spend so much time on the computer. Terminal work can be wearisome, and you need to find the best arrangement with the machine. Over time, I realized I have developed a set of habits that speed up my computer work. This is very personal, as everyone will find their own best way to work on a PC. I can tell you that I have found it very useful to train myself to use the keyboard rather than the mouse, to keep my desktop beautiful, my screen clean and my files ordered, and to learn a couple of tricks to stay comfortable and zen when working. The point here is that you should not limit yourself to learning new notions, but also improve your habits to become more effective.

The legendary Russian chess player Garry Kasparov used to say that it is very important, in game strategy, to bring all your heavy pieces to the center of the chessboard. This way, you will have full control of the game and of your opponent’s moves. This is a very good idea in bioinformatics too. I am fairly sure that the first thing to do is to put yourself at the center of the chessboard. Even considering the huge diversity of bioinformatics applications, and the breakneck speed at which they evolve, a proper set of skills will help you not to drown. After so many words, the main point is just: keep calm and learn to program.



    1. Thanks for your comment, glad that my post was useful. Actually, after a long discussion on reddit, I think I need to add a couple of points.

      First, I realised that many biologists struggle in their relationship with people coming from an informatics background. Basically, a computer scientist doesn’t have training in biology, and this gap can get in the way of a good relationship with a Biology lab. The best would be to study a good Cell Biology manual first. I suggest Becker’s “The World of the Cell”, which is complete and very clear. You most likely won’t end up working on Cell Biology, but the book is an overview of all the main topics of biology at the molecular level, and serves as a fair introduction. Then, you may consider studying Molecular Biology and Genomics (James Watson’s manual is the best) and Genetics, which is a very old subject, so any book is just fine.

      Of course, you have to do your best to be patient and to get biologists to understand that you are still learning. Anyway, don’t worry too much at the beginning: as I said, knowledge comes along with experience.

      The second point is that I should most likely not be so categorical in my software suggestions. I mentioned scripting languages because they are easy, and quite effective for most tasks. Of course, if you know a hard-core programming language, and know how to use it, you will have more control over the structure of your algorithms.



    1. Thank you for your comment, André, and thanks a lot for mentioning Homebrew. Great tip. I never got to know this one, but I tried out MacPorts for a while, with a bit of disappointment. Actually, I have never found a package manager on Mac OS as effective as the Linux ones. But this is worth a try. Thank you again.



  1. nice tips…..!!!!
    having done a BS in bioinformatics, your post helped me identify my weak points 🙂 now i know where i have to work 😉



  2. “If Science was a war, bioinformatics would be guerilla, a merciless battle where you are alone, fighting in a jungle of algorithms, big data, statistics and software solutions” great quote. I will borrow it from you.

    A dual-monitor setup would be a good thing to consider. One screen for your terminal and the other for your browser pages and other widgets. That way you minimize having to alternate between pages on the same screen.

    Another thing to add to the point on computer awesomeness is learning how to do things on the cloud or on a supercomputer.


