Quantifying reproducibility. The case of the TB-Drugome

Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome

This web page bundles the contents and additional material of the paper accepted at PLOS ONE. It provides a summary of the paper, supporting links, and short descriptions of the materials used as input and generated as output of the described work. The full paper is available online here.

Abstract

How easy is it to reproduce the results found in a typical computational biology paper? Either through experience or intuition the reader will already know that the answer is with difficulty or not at all. Here we attempt to quantify this difficulty using a previously published paper [Kinnings et al. 2010] for different classes of user and suggest ways in which the situation might be improved. Quantification is achieved by estimating what is required to take the procedures described in the paper and include them in a formalized workflow that can reproduce the original results. We conclude with a brief discussion of the complexities of requiring reproducibility in terms of cost versus benefit. This has implications not only in reproducing the work of others from published papers, but reproducing work from your own laboratory.

Inputs used for reproducing the TB-Drugome workflow

Initially, some of the inputs defined on the Drugome's web page (accessible here) were considered. However, those materials were mostly curated results meant to highlight the outputs of the experiment, so we instead obtained the input files and scripts from the original team of scientists. In particular, these inputs were:

The configuration file for the SMAP tool (used for ligand binding site comparison).
A drug key file with the ids for the drugs used in the workflow.
The drug binding sites file.
The solved structures file, containing the solved structures for the proteins and homology models being compared against the drug binding sites.
The homology list file with the ids of the homology models associated with the proteins to be compared to the drug binding sites.
File with additional information about the proteins of the TB-Drugome.
Template file from the Protein Data Bank with the ids of the homology models.
List of solved structures with the ids of the proteins being compared in the experiment.
Homology model information file used to filter the homology models.
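
Before attempting a reproduction, it can help to confirm that every input listed above is actually present. The following is a minimal sketch in Python, not part of the original workflow. The file names are hypothetical placeholders chosen for illustration; the names used by the original team are not given on this page.

```python
from pathlib import Path

# Hypothetical file names standing in for the nine inputs listed above.
# They are assumptions for illustration, not the original authors' names.
REQUIRED_INPUTS = {
    "smap_config": "smap.conf",
    "drug_key": "drug_key.txt",
    "drug_binding_sites": "drug_binding_sites.txt",
    "solved_structures": "solved_structures.txt",
    "homology_list": "homology_list.txt",
    "protein_info": "protein_info.txt",
    "pdb_template": "pdb_template.txt",
    "solved_structure_ids": "solved_structure_ids.txt",
    "homology_model_info": "homology_model_info.txt",
}


def missing_inputs(input_dir, required=REQUIRED_INPUTS):
    """Return the sorted logical names of required inputs absent from input_dir."""
    base = Path(input_dir)
    return sorted(
        name for name, fname in required.items()
        if not (base / fname).is_file()
    )
```

Calling missing_inputs on the directory holding the inputs returns an empty list when everything is in place, which gives a quick check to run before any of the tools below are started.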

Additionally, several tools were used to compute some of the steps of the experiment:

SMAP tool: to compare the ligand binding sites with the solved structures and the homology models.
FATCAT tool: to filter the results of the SMAP step by making a comparison between dissimilar protein structures.
AutoDock Vina tool: to dock and filter the results obtained in the previous steps.
yEd: to visualize the resultant drug-protein network.
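
The order in which these tools are applied can be pictured as a chain of filters over candidate drug-protein pairs, followed by an export for visualization. The Python below is only a sketch under assumptions: it does not call SMAP, FATCAT, or AutoDock Vina, the keep/discard predicates are hypothetical stand-ins supplied by the caller, and the edge-list format is a generic placeholder rather than the format the original team fed to yEd.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]                      # (drug_id, protein_id)
Stage = Tuple[str, Callable[[Pair], bool]]  # (stage name, keep-this-pair predicate)


def run_filters(pairs: Iterable[Pair], stages: List[Stage]):
    """Apply each stage in order; return surviving pairs and per-stage counts."""
    surviving = list(pairs)
    counts = [("input", len(surviving))]
    for name, keep in stages:
        # Each stage only sees the pairs that passed all earlier stages.
        surviving = [p for p in surviving if keep(p)]
        counts.append((name, len(surviving)))
    return surviving, counts


def to_edge_list(pairs: Iterable[Pair]) -> str:
    """Serialize pairs as tab-separated drug/protein edges, one per line."""
    return "\n".join(f"{drug}\t{protein}" for drug, protein in pairs)


if __name__ == "__main__":
    # Toy identifiers and toy predicates, purely for illustration.
    toy_pairs = [("D1", "P1"), ("D1", "P2"), ("D2", "P1")]
    toy_stages = [
        ("binding-site comparison", lambda p: p[1] != "P2"),
        ("structure comparison", lambda p: True),
        ("docking", lambda p: p[0] == "D1"),
    ]
    kept, counts = run_filters(toy_pairs, toy_stages)
    print(counts)
    print(to_edge_list(kept))
```

Recording how many pairs survive each stage makes it easier to compare a reproduced run against the original one step by step, rather than only at the final network.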

Outcome of the reproducibility effort

Some of the results of the work are described directly in the paper, such as the guidelines and best practices for making your work reproducible. Additional results of the paper (from which the guidelines were derived) are described below:

The scientific workflow reproducing the TB-Drugome experiment, defined in the Wings workflow system.
Web page summarizing the inputs, parameters and outputs of a sample run of the workflow.
Publication of the workflow as Linked Data (including inputs, outputs, intermediate results and scripts). The endpoint to explore these contents is publicly available here.
Wiki with the summary of all the related contents of the workflow.
Detailed timeline of the construction of the workflow, in order to quantify the effort.
Web page containing new visualizations obtained from the reproduced results.

References

[Kinnings et al. 2010]: Kinnings, S. L.; Xie, L.; Fung, K. H.; Jackson, R. M.; Xie, L.; and Bourne, P. E. "The Mycobacterium tuberculosis Drugome and Its Polypharmacological Implications." PLoS Computational Biology 6(11): e1000976, 2010. Preprint available from http://sites.google.com/site/beyondthepdf/file-cabinet/FinalPaper.pdf

About the authors

	Daniel Garijo is a PhD student in the Ontology Engineering Group at the Artificial Intelligence Department of the Computer Science Faculty of Universidad Politécnica de Madrid. His research activities focus on e-Science and the Semantic web, specifically on how to increase the understandability of scientific workflows using provenance, metadata, intermediate results and Linked Data.
	Sarah Kinnings is a Bioinformatics Scientist at Sequenom. She holds a PhD in Bioinformatics from the University of Leeds and worked as a postdoctoral researcher at the Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD.
	Li Xie is a Senior Scientist at the Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD.
	Lei Xie PhD is an Associate Professor at Hunter College, CUNY. Previously he worked as a Research Scientist at the Skaggs School of Pharmacy and Pharmaceutical Sciences, UCSD. His current research focus is to develop and apply computational techniques to study the structure, function, dynamics, and evolution of molecular interactions on multiple scales, from atomic details to biological networks.
	Yinliang Zhang is a PhD student at the University of Science and Technology of China and a visiting graduate student at UC San Diego.
	Philip E. Bourne PhD is Associate Vice Chancellor for Innovation and Industrial Alliances, a Professor in the Department of Pharmacology and Skaggs School of Pharmacy and Pharmaceutical Sciences at the University of California San Diego, Associate Director of the RCSB Protein Data Bank and an Adjunct Professor at the Sanford Burnham Institute. Bourne's professional interests focus on relevant biological and educational outcomes derived from computation and scholarly communication. This implies algorithms, text mining, machine learning, metalanguages, biological databases, and visualization applied to problems in systems pharmacology, evolution, cell signaling, apoptosis, immunology and scientific dissemination. He has published over 300 papers and 5 books, one of which sold over 150,000 copies.
	Yolanda Gil is Director of Knowledge Technologies at the Information Sciences Institute of the University of Southern California, and Research Professor in the Computer Science Department. Her research interests include intelligent user interfaces, social knowledge collection, provenance and assessment of trust, and knowledge management in science. Her most recent work focuses on intelligent workflow systems to support collaborative data analytics at scale.

Acknowledgements

This research is sponsored by Elsevier Labs, the National Science Foundation with award number CCF-0725332, the Air Force Office of Scientific Research with award number FA9550-11-1-0104, internal funds from the University of Southern California's Information Sciences Institute and from the University of California, San Diego, and by an FPU grant (Formación de Profesorado Universitario) from the Spanish Ministry of Science and Innovation (MICINN).

This page is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Generic License.
