Is "how to do bioinformatics" the major topic in bioinformaticians' online reading habits?

Sifting through my website stats, I realised that bioinformaticians are reading more posts discussing “how to do bioinformatics” than posts with strictly scientific content. Is this a feature of this blog, or does it reflect a common problem with working habits?

Drawing some conclusions after two years of atcgeek

After almost two years of blogging on atcgeek, I dare say a thing or two about this experience. Scrolling through the statistics of this blog, I cannot really complain about the interest it has generated among readers. Even though I won’t become famous by writing here, 38k views since January 2014, with peaks around 1k views/day, is not a bad result, considering the long pause I had to take whilst moving to Barcelona and starting my PhD. Nothing really special, but not a disastrous failure either.

The three main topics at atcgeek

Although I usually divide my posts into thematic categories (bioinformatics, biochemistry, structural biology, etc.) and into types of article (news, insights, video, hacks and personal blog), I realised that I basically tend to write on three topics: education and work practices, methods, and reflections. Posts of the first kind are about “how to work in bioinformatics” or “where to learn the basics”. The second kind are those in which I report new methods that have recently been published, and the third category collects the posts that propose scientific insights on the role and nature of computational and theoretical biology.

Most of the interest goes to posts about education and work habits.

The order in which I mentioned these three topics coincides with their ranking in terms of interest generated. Education and work practices come first, methods come second, and the bronze medal goes to the insights. Swiftly and boldly comparing my site statistics with the interest generated on social networks, I dare say that the people who read atcgeek are particularly interested in discussing how to improve their working habits and how to start working in bioinformatics, or in sharing a bit of self-irony with me as I talk about the shit I do when I work. Take it as an impression that is barely supported by statistics, but plausible enough to raise a question.

Based on what I see on atcgeek, people are more into discussing how to do bioinformatics or how to learn the basics than bioinformatics itself, and there could be some reasons behind this.

Of course, we should keep in mind that this blog is written by a PhD student who is sharing his experience while taking his first steps in computational biology. This matters, since anyone would be more interested in the opinions of someone more influential than me on anything about the “scientific part”. The main goal of this blog is to share my experience horizontally and to interact productively with my visitors, rather than to claim to be an expert in the field and aim at “coaching” the readers. On the other hand, if experience matters, it should matter for both topics, since the thoughts of an experienced scientist are more valuable than mine on both work habits and science.

Do we have a problem with how to do our work?

Although the shift I am seeing in the readers’ interest may be due to the characteristics of this blog, I still have the feeling that “how to work” is the major “hot topic” in the bioinformatics community, and we may strongly suspect that this reflects a problem. Bioinformatics is basically the domain of non-computer scientists working with computers, the merger of two super-rapidly changing sciences, and the development of proven, shared and consolidated work strategies is far from being a reality, especially when compared with experimental biology, where lab practices are widely discussed and protocols are consolidated.

There is one last thing to say. In the real ranking of the most visited posts, the most read one is not really about bioinformatics. Let’s say that this discussion should focus on what bioinformaticians read online when they are keen to read about science. Including their other interests would only confuse things.

BTW, thank you for the interest in this stupid diary.


Five ingredients to become a bioinformatician.

How to become a bioinformatician? Many people, at different career stages, are trying to answer this question, looking for the best path to acquire the knowledge required to become a bioinformatician. Many academic institutions offer Bioinformatics degrees at undergraduate and postgraduate level, and you can easily find free courses on the internet to get a fair introduction to bioinformatics. Still, the high rate of development of this field, and the huge diversity of its applications, tend to weaken the effectiveness of official courses. If Science were a war, bioinformatics would be guerrilla warfare, a merciless battle where you are alone, fighting in a jungle of algorithms, big data, statistics and software solutions.

One of the blogs I enjoy the most is Guillaume Filion’s “The Grand Locus”. Guillaume is a group leader at CRG in Barcelona, and has linked his group’s informal page to a personal blog, where he writes mostly about bioinformatics and biostatistics, in an amazing and very effective mix of scientific insights and personal experience. In one of his latest posts, Guillaume tries to answer a question that is becoming very common as the interest in Bioinformatics grows: “How to become a bioinformatician?”. The answer given on The Grand Locus is pretty unexpected, since no particular path, language to learn, skill-set, course or strategy is suggested, just three simple tips on how to change your mindset before starting your “journey in bioinformatics”. A great point indeed. You really have to get out of your “comfort zone”, and try over and over with a lot of things you used to ignore before. Also, you must understand the importance of collaboration and community, and you definitely need to become addicted.

Anyway, some more practical tips may be needed, and a discussion about the very first things to learn when taking the first steps in bioinformatics could turn out to be useful to many. Honestly, I am not the one to give suggestions. I have been in bioinformatics for only three years, I am still submitting my first papers, and I am quite reasonably in need of some good hints myself. That is why I want to share a couple of ideas with you, and ask you for some good criticism of the quick recipe I am going to write down.

I fear I must apologize to the experienced users who will get to read this. I will go into detail, describing a lot of things that are well known to computational biologists, and I understand that this could turn out a bit boring.

First ingredient: the minimal biological knowledge.

Computational biology is a strongly interdisciplinary field, attracting the interest of scientists with radically diverse backgrounds. Also, the number of possible applications is pretty high, since biology is a vast and heterogeneous area of study. Trivially, the very first thing to do is to acquire the required biological knowledge. Scientists with a non-biological profile will need to train on the basics of biology. The trick is to focus on the key concepts of biology (cell structure, molecular biology, genome organization, evolution…) without being overly fussy. In the long run, anyone tends to reach a very high level of expertise by working in biology. At the beginning, you are not really asked for deep knowledge, but you need to understand the sense of what you are going to do. Take some online courses, read some introductory books (cell biology books are great for this, because they summarize the key concepts of biology at the subcellular and histological level) and browse Wikipedia (which is quite despised, but still very useful if you can use it properly).

This may apply to biologists as well. Of course, if you are just changing your role in your lab, you will hopefully be quite familiar with your work. Still, during my long round of applications, I came to understand that it is quite common for a bioinformatician to range over different projects and subjects. IMHO, one of the main things a bioinformatician should work on is the ability to get rapidly into a totally new biological field.

Second ingredient: get yourself to love statistics.

Math in bioinformatics is very important. Although I kinda reject the idea that statistics is the only branch of math you will need (complex systems, fractal geometry and logical algorithm development may be very much needed in evolutionary studies), we can definitely take statistics and data analysis as the common denominator of almost all bioinformatics projects. This requires an effort, since many projects rely on Bayesian statistics, which is an advanced topic and is classically taught at the end of a statistics course. The best option is to attend a good course, or to patiently work through a big, heavy, but up-to-date statistics book. I am exploring DeGroot’s Probability and Statistics, considered by many to be the most complete book around.
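To make the Bayesian part look a little less scary, here is a minimal sketch of a Bayesian update in plain Python. Everything in it is an assumption made up for illustration: the scenario (10 reads cover a site and 7 of them carry a variant allele), the flat Beta(1, 1) prior, and the function names `update_beta` and `beta_mean`. It only shows the textbook Beta-Binomial case, not how any particular tool does it.

```python
def update_beta(alpha, beta, successes, failures):
    """Return the posterior Beta parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a proportion, the posterior is
    Beta(alpha + successes, beta + failures).
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical data: 7 reads with the variant allele, 3 without.
prior = (1.0, 1.0)  # flat prior: every allele fraction equally plausible
posterior = update_beta(*prior, successes=7, failures=3)

print("posterior parameters:", posterior)
print("posterior mean allele fraction:", round(beta_mean(*posterior), 3))
```

The point is just that the prior gets combined with the data to give a full distribution over the quantity of interest, rather than a single estimate.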

Third ingredient: the software quartet.

On the computer side, there are four elements to keep in mind: Scripting, Unix, R and Databases. Honestly, the very first thing you need to do, most likely before adding any other ingredient, is to focus strongly on a scripting language. A scripting language is what will ease and speed up your work; it is the thing that will set you free from Excel, a real Swiss Army knife you won’t be able to do without. Scripting languages are high-level programming languages designed for quick application. Their syntax is pretty easy, and they are quite fast to learn. I started with Python and love it, but Perl is also widely used, and you can also consider Ruby, which is fairly widespread in Asia.
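To give an idea of what "being free from Excel" looks like, here is a small sketch of the kind of script I mean: it reads sequences in FASTA format and reports the length and GC content of each one. The function names `read_fasta` and `gc_content` and the two toy records are made up for this example; in real life you would open a file instead of the in-memory `StringIO` used here, and a library such as Biopython would handle the parsing more robustly.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Toy input standing in for a real FASTA file (hypothetical records).
EXAMPLE = StringIO(">seq1 toy record\nATGC\nGGCC\n>seq2\nATATAT\n")

records = list(read_fasta(EXAMPLE))
for header, seq in records:
    print(header, len(seq), round(gc_content(seq), 2))
```

Twenty-odd lines, no clicking around, and it works the same way on two sequences or two million.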

The second member of this quartet is R. As many will surely know, R is a language and environment dedicated to statistics. One year ago, I was pretty sure that R was not needed if you know Python and use its mathematical and statistical libraries (NumPy and pandas). After joining a project involving NGS data analysis, I had to change my mind. R comes with a huge set of packages, the Bioconductor suite, that are aimed at, very useful in, and regarded as standards for NGS and experimental data analysis. Actually, one could also consider using R as a scripting language itself. Very personal opinion: not the case. I am still a beginner with R, and I may be biased since I love Python, but I think that a scripting language is still the best way to process information. Also, consider that many software distributions and online databases come with APIs (Application Programming Interfaces) that allow your scripts to extend their functionality. For instance, Ensembl has a Perl API, and UniProt can be queried programmatically from Python, Perl and even Java. In structural biology, Rosetta comes with a Python interface (PyRosetta), and PyMOL is written in Python and allows the creation of plugins.

Third comes the Operating System. If you play cool and are geeky enough, you are most likely viewing this post on a Mac. Good point, maybe not the best though. In bioinformatics, Unix-based systems are widely used, and many job adverts will require experience in a Unix-like environment. Mac OS is a Unix-like environment, and you can easily learn how the filesystem works and practice Bash, the most common command-line shell on Unix-like systems. Unfortunately, Mac OS is not really optimized for programmers, and you may run into some bad trips. For instance, I am trying over and over to get this huge iMac in my lab to understand that I need to fucking link MySQL to Python, in order to install a library to fetch genomic sequences from the internet. No way. I warmly suggest that you pluck up your courage and install Linux. Ubuntu, user-friendly and always beautiful, is the best choice.

The last member of this quartet is the Database. You need to learn how to deal with the main online databases, and you will probably need to create your own to explore and analyze your data. It is very important to understand how a DB is organized, and how the management system works. Boldly, we can say that the dominant form of database is the relational database. You can consider learning a bit of SQL (Structured Query Language) and practicing with SQL-based software, such as MySQL or SQLite. Importantly, you can link your scripts in Perl, Python, Ruby and even R to an SQL-based database. Very useful indeed.
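As a rough sketch of what linking a script to an SQL-based database can look like, the snippet below uses Python's built-in `sqlite3` module with an in-memory database, so nothing needs to be installed or written to disk. The `genes` table, its columns and the three toy rows are invented for illustration and do not come from any real resource.

```python
import sqlite3

# An in-memory SQLite database: it disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per gene.
conn.execute(
    """
    CREATE TABLE genes (
        gene_id    TEXT PRIMARY KEY,
        symbol     TEXT NOT NULL,
        chromosome TEXT NOT NULL,
        length_bp  INTEGER NOT NULL
    )
    """
)

# Toy data, made up for this example.
toy_genes = [
    ("g1", "GENE_A", "1", 1200),
    ("g2", "GENE_B", "1", 800),
    ("g3", "GENE_C", "X", 1500),
]
# Placeholders (?) let the driver handle quoting of the values.
conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?)", toy_genes)
conn.commit()

# A typical relational question: how many genes per chromosome,
# and what is their average length?
rows = conn.execute(
    """
    SELECT chromosome, COUNT(*), AVG(length_bp)
    FROM genes
    GROUP BY chromosome
    ORDER BY chromosome
    """
).fetchall()

for chromosome, n_genes, mean_length in rows:
    print(chromosome, n_genes, mean_length)

conn.close()
```

The same SQL would work, with minor changes to the connection code, against MySQL or another relational database.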

Fourth ingredient: stay very tuned

Bioinformatics is a rapidly developing science, and novelties come pretty often. Together with a basic knowledge of fundamental algorithms (FASTA, BLAST, CLUSTAL…) and file formats, you must sharpen your ability to find new algorithms and decide which is the most suitable for your work. On this blog, I try to share and review new software, because this is of great interest in bioinformatics. Basically, you should work on your geekery and interactiveness. Search a lot, discuss on the internet, rummage around the web. This will help you quite a lot.

Fifth ingredient: the computer awesomeness

Many people I talk with, wet lab guys in particular, are surprised at how I am able to spend so much time on the computer. Terminal work can be wearisome, and you need to find the best deal with the machine. Over time, I have realized that I have developed a set of habits that speed up my computer work. This is very personal, as everyone will find his/her own best way to work on a PC. I can tell you that I have found it very useful to train myself to use the keyboard over the mouse, to keep my desktop beautiful, my screen clean and my files ordered, and to have a couple of tricks to stay comfortable and zen when working. The point here is that you should not just focus on learning new notions, but also improve your habits to become more effective.

The legendary Russian chess player Garry Kasparov used to say that it is very important, in a game strategy, to bring all your heavy pieces to the center of the chessboard. This way, you will have full control of the game and of your opponent’s moves. This is a very good idea in bioinformatics too. I am fairly sure that the first thing to do is to put yourself at the center of the chessboard. Even considering the huge diversity of bioinformatics applications, and the breakneck speed at which they evolve, a proper set of skills will help you not to drown. After so many words, the main point is just: keep calm and learn to program.