With this post, I start exploring the contributions Google could make to bioinformatics, a topic I have wanted to look into for a while. The rise of big data in biology is drawing growing interest from Big G, and the Mountain View giant may soon increase its investments in the life sciences. This would be welcome, since Google’s expertise in big data could significantly boost bioinformatics research.
Today, we’ll try to understand how Google’s most fundamental big data analysis algorithm works. MapReduce is a programming model developed and brought to fame by Google, used to simplify the processing of huge amounts of data. Its main appeal in commercial data analysis is that it provides a special-purpose solution for computing over large quantities of data. Companies are interested in analysing a lot of data very quickly to run their campaigns better and to understand markets.
How does MapReduce work?
MapReduce allows computations to run in parallel on clusters of workstations, on unstructured data or within a database. The name “MapReduce” derives from two functions found in many languages, map() and reduce(). In Python, for instance, map() applies a function to every item of an iterable and returns the results (as a list in Python 2, as a lazy iterator in Python 3). If additional iterables are passed, the function must take that many arguments and is applied to the items of all iterables in parallel. The reduce() function (moved to the functools module in Python 3) instead applies a function of two arguments cumulatively to the items of an iterable, from left to right, so as to reduce the iterable to a single value. Whereas map() assigns a value to every object in a list, reduce() combines these values by iteration, reducing them to a single value. A MapReduce program builds on these two ideas and is composed of three fundamental steps: the map step, the shuffle step and the reduce step.
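To make the two built-in functions concrete, here is a minimal Python 3 sketch. The lists read_lengths and offsets are made-up values chosen only for illustration, and this shows the plain language functions, not Google's distributed framework.

```python
from functools import reduce  # in Python 3, reduce() lives in functools

# Hypothetical toy input: lengths of a few sequencing reads.
read_lengths = [100, 150, 75]

# map() applies a function to every item. In Python 3 it returns a lazy
# iterator, so list() is used here to materialise the results.
doubled = list(map(lambda n: n * 2, read_lengths))

# With two iterables, the function takes two arguments and the items
# are consumed pairwise.
offsets = [1, 2, 3]
shifted = list(map(lambda n, k: n + k, read_lengths, offsets))

# reduce() folds the items from left to right into a single value:
# ((100 + 150) + 75)
total = reduce(lambda acc, n: acc + n, read_lengths)

print(doubled)  # [200, 300, 150]
print(shifted)  # [101, 152, 78]
print(total)    # 325
```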
In the Map Step, the master node takes the input data, splits it into small subsets, and distributes the work to slave nodes. Each slave node produces the intermediate result of a map() function in the form of [key,value] pairs, which are saved to a distributed file. The location of the output file is reported to the master at the end of the mapping phase.
In the Shuffle Step, the master node collects the answers from the slave nodes, combines the [key,value] pairs into lists of values sharing the same key, and sorts them by key. Sorting can be lexicographical, increasing or user-defined.
In the Reduce Step, the reduction function is applied to the list of values collected for each key, producing a summarised result.
Although it may look tricky at first, it is anything but complicated. Data are first labelled by a mapping function, then sorted, and finally summarised by a reducing procedure. The *magic* is in distributing the work across many nodes in order to speed up big data processing. The output file can be processed again in a new MapReduce cycle.
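The three steps can be sketched on a single machine in a few lines of Python. This is only a toy under stated assumptions: the function names (map_step, shuffle_step, reduce_step, map_reduce) and the two example reads are made up for illustration, everything runs sequentially in one process, and nothing is actually distributed across nodes. It counts the 2-letter words (k-mers) found in a couple of short DNA reads.

```python
from collections import defaultdict
from functools import reduce


def map_step(read, k=2):
    """Emit a (k-mer, 1) pair for every k-mer found in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle_step(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_step(key, values):
    """Collapse the list of values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def map_reduce(reads, k=2):
    # Map: on a real cluster, each read (or chunk of reads) would be
    # handed to a different node; here it is a plain loop.
    intermediate = [pair for read in reads for pair in map_step(read, k)]
    # Shuffle: collect the pairs, group them by key, sort by key.
    grouped = shuffle_step(intermediate)
    # Reduce: one summarised value per key.
    return [reduce_step(key, values) for key, values in grouped]


reads = ["ACGT", "CGTA"]  # made-up example reads
counts = map_reduce(reads)
print(counts)  # [('AC', 1), ('CG', 2), ('GT', 2), ('TA', 1)]
```

Since the map step treats each read independently, that is the part a real framework can spread over many machines without the nodes having to talk to each other.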
Surfing around YouTube, I really enjoyed this video by Jesse Anderson, which explains how MapReduce works using playing cards. Simple, effective and brilliant indeed.
How could this help in bioinformatics?
As Lin Dai and collaborators pointed out in Biology Direct in 2012, bioinformatics is moving from in-house computing infrastructure to utility-supplied cloud computing delivered over the Internet, in order to handle the vast quantities of biological data generated by high-throughput experimental technologies. As a matter of fact, cloud computing is becoming more and more important in genomics, and is expected to become fundamental in the long run.
Some MapReduce implementations dedicated to sequencing analyses are already available in the literature. Michael Schatz’s CloudBurst is a fair example. Published in 2009, CloudBurst is built on the open-source MapReduce package Hadoop and reduces the running time from hours to minutes for typical jobs involving the mapping of millions of short reads to the human genome.
More recently, many methods have been developed to make cloud technology available for bioinformatics analyses. I will just rattle off three examples, but there are plenty of these software solutions out there. Galaxy Cloud, software for NGS data analysis, is most likely the best known. In protein structure and function prediction, the PredictProtein Debian package uses cloud technologies as well, and I found this cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples very interesting indeed.
The list could be far longer, and a more thorough literature search would turn up plenty of cloud-based methods. In most of them, MapReduce is the reference programming model, and understanding how it basically works could turn out to be as important as understanding classic bioinformatics algorithms.
Some resources to learn and use MapReduce.
Hadoop is a MapReduce package developed by Apache and distributed under the free-software Apache licence. I take it as the reference distribution, but it is not the only one. You can explore the official website and download a very detailed PDF guide. The tutorial is oriented towards Java, but you can easily find further documentation on running MapReduce in other languages.
Finally, you may also consider watching a video tutorial. I had a look at this one on Vimeo, and it seems quite complete, but more will come up if you search around.
Will MapReduce replace SQL databases?
Some may have noticed that MapReduce can sort and process data without requiring them to be structured. Asking whether this method could make SQL databases obsolete is a fair question, and still a fairly open one. As far as I can tell from surfing around the web, this is most likely not the case. In 2009, some researchers from American universities argued that MapReduce lacked many key features, and a degree of criticism has accompanied MapReduce over the years. MapReduce is a great method to process data quickly, but structured languages still provide the best tools for storing them. That is why you will find many solutions for applying MapReduce to SQL databases.
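To get a feeling for how the two approaches can sit side by side, here is a small sketch using Python's built-in sqlite3 module. The alignments table and its rows are invented for the example, and the "MapReduce" part is just a sequential imitation of the three steps on rows read from the database, not one of the real SQL-on-MapReduce solutions. The database takes care of storage, and the same per-chromosome count is obtained once with a GROUP BY query and once with map, shuffle and reduce.

```python
import sqlite3
from collections import defaultdict

# Hypothetical table of reads aligned to chromosomes (made-up data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alignments (read_id TEXT, chrom TEXT)")
conn.executemany(
    "INSERT INTO alignments VALUES (?, ?)",
    [("r1", "chr1"), ("r2", "chr2"), ("r3", "chr1"), ("r4", "chr1")],
)

# The SQL way: the database groups and aggregates.
sql_counts = conn.execute(
    "SELECT chrom, COUNT(*) FROM alignments GROUP BY chrom ORDER BY chrom"
).fetchall()

# The MapReduce way, on rows read from the same table.
rows = conn.execute("SELECT read_id, chrom FROM alignments").fetchall()

# Map: label every row with a (key, value) pair.
mapped = [(chrom, 1) for _read_id, chrom in rows]

# Shuffle: group the values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: one count per key, with keys in sorted order.
mr_counts = [(key, sum(values)) for key, values in sorted(groups.items())]

print(sql_counts)  # [('chr1', 3), ('chr2', 1)]
print(mr_counts)   # [('chr1', 3), ('chr2', 1)]
conn.close()
```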
I take this as further proof that competition is often pointless, and that the best comes from collaboration. Whenever possible, we had better take the best of everything we have available.