How do you keep your work reproducible and replicable? Once you have finished your genome-wide analysis, NGS data mining, coding, homology modeling or biostatistics, how can you make the entire job available to, and testable by, other people? The need for a proper strategy to guarantee the reproducibility of research is a major issue in almost every branch of science, and it becomes dramatic in computational research. The large number of methods available and the massive quantity of information produced tend to frustrate our efforts to keep our work replicable and reproducible.
Even though anyone working with computers develops very personal working habits, there is a trick or two for improving them and making your work more reproducible. Broadly speaking, this is the point of a very recent paper published in PLOS Computational Biology. Rather than a couple of tricks, it proposes a real decalogue: Ten Simple Rules for Reproducible Computational Research, which the authors argue are quite effective. I suggest you read this very good paper carefully; here I just list and discuss each rule.
Rule 1: For Every Result, Keep Track of How It Was Produced. Annotations are fundamental. Very often, one ends up tagging data quickly, just so as not to forget where they came from. But an extensive, explanatory legend will make it easier for your co-workers and reviewers to understand what you have done.
Rule 2: Avoid Manual Data Manipulation Steps. Take your time, be patient and write a few lines of code. Manual data manipulations are a primary source of human error and reduce the verifiability of your work.
Rule 3: Archive the Exact Versions of All External Programs Used. Boring, and it may feel like overkill, but sometimes fundamental.
Rule 4: Version Control All Custom Scripts. This is something that people tend to underestimate, but it is still very important. I actually need to improve on this myself; to get started, you can have a look here.
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats. A clean tab-delimited file or a CSV is always an act of love towards your collaborators.
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds. I have never had to use randomness myself, but a recorded seed lets anyone rerun an analysis and obtain the same results.
Rule 7: Always Store Raw Data behind Plots. This should go without saying.
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected. Do not share summarized data only; let your reviewers inspect all the steps you took.
Rule 9: Connect Textual Statements to Underlying Results. Results and their interpretation must be clearly connected.
Rule 10: Provide Public Access to Scripts, Runs, and Results. In short, keep your work transparent, and no one gets hurt.
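To make a few of these rules more concrete, here is a minimal Python sketch of how Rules 1, 3, 5, 6 and 7 might look in a small script. It is my own illustration, not code from the paper: the function name run_analysis and the file names values.csv and provenance.json are hypothetical, and the "analysis" is just a handful of random draws standing in for real work.

```python
import csv
import json
import platform
import random
import sys
from datetime import datetime, timezone
from pathlib import Path


def run_analysis(seed, out_dir):
    """Toy analysis (hypothetical): draw random values and save them
    together with a small provenance record."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Rule 6: use an explicit, recorded seed instead of hidden global state.
    rng = random.Random(seed)
    rows = [(i, rng.gauss(0.0, 1.0)) for i in range(5)]

    # Rules 5 and 7: keep the raw values behind any later plot in a plain CSV.
    data_path = out_dir / "values.csv"
    with data_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["index", "value"])
        writer.writerows(rows)

    # Rules 1 and 3: note how the result was produced and with which versions.
    provenance = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "command": sys.argv,
        "seed": seed,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "output": data_path.name,
    }
    prov_path = out_dir / "provenance.json"
    prov_path.write_text(json.dumps(provenance, indent=2))
    return data_path, prov_path
```

Calling run_analysis(seed=42, out_dir="results") twice should give the same values.csv both times, while provenance.json tells a reader which seed, interpreter and command produced it. A real project would also record the versions of the external programs and libraries it depends on.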
When I was in high school, my Italian literature teachers used to tell me that “a text is good if it is self-explanatory”. This means that readers must be able to understand your writing even if you are not there to explain it. This is more or less the simple principle one can adopt to improve the reproducibility of computational analysis.
An improvement in working habits can definitely help, even if it is not going to be enough. The role of journals, and the need to set up shared rules that enforce greater transparency and reproducibility, are also widely discussed. Ultimately, as clearly pointed out in this article in Science, success in improving reproducibility will come from collaboration between scientists and journals: more sustainable working habits, and greater transparency in published results.