Biology, Wednesday, October 8, 2003. This is a SciScoop post by Ricky James.
Pandey says advances in technology have made getting data much easier, but processing it and interpreting observations are now the big hurdles in laboratories. “It has remained difficult to put together a big picture of biology, to see how one set of observations intersects with and complements others,” he says. “With this single database, biologists now will be able to quickly review what is known about the proteins and how they interact, speeding the creation of new hypotheses to test in the lab.”
The 3,000 proteins currently in the database are known to interact with anywhere from tens to hundreds of other proteins. Online, a user can pull up a visual web of protein-protein interactions with just the click of a mouse.
“The entries have been critically reviewed, making the information in the database as accurate and complete as possible,” says Pandey. “Scientists can even link directly to the scientific paper behind an item, to judge for themselves its validity.”
To create the database entries, dozens of trained biologists, most at the Institute of Bioinformatics in India, started with the database Online Mendelian Inheritance in Man (OMIM), the offspring of a paper catalog of disease genes started in 1966 by Victor A. McKusick, M.D., University Professor of Medical Genetics at Hopkins.
Focusing on these genes’ proteins, the scientists critically reviewed hundreds of thousands of scientific papers, making connections between papers and resolving inconsistencies — something automated computer programs cannot do, says Pandey. They also pulled information from smaller, existing databases to complete each protein’s entry.
“We believe that manual curation — lots of scientists poring through the literature — is the key to building a more accurate and more complete database,” says Pandey, who serves as chief scientific adviser to the Institute of Bioinformatics. “Eventually, we hope the database will be managed by the larger community of scientists, because it will be most useful if those who know these proteins best take responsibility for keeping entries up to date and accurate.”
The database currently contains everything that’s known about proteins involved in diseases, such as so-called breast cancer genes BRCA1 and BRCA2, and proteins in key pathways, such as families of enzymes that modify other proteins. It includes only experimentally proven or widely accepted facts about the proteins, without mixing in computer-generated predictions the way some other databases do, says Pandey.
The online database is also easy to use, in large part because those who designed it are experts in both computer science and biology, he adds. A biologist looking for information about BRCA1, for example, can search by any of its names and get a single entry that contains everything — its alternative names, structure, function and sequence, how it’s modified, other proteins with which it interacts, where it’s found in cells, where it’s found in the body and links to the papers that say so.
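The "search by any name, get one entry" behavior can be pictured with a small sketch. This is only an illustration of the idea, not the database's actual code or schema: the class names, field names and sample values below are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    """One consolidated record. Field names are hypothetical, not the real schema."""
    primary_name: str
    aliases: list[str] = field(default_factory=list)
    interactors: list[str] = field(default_factory=list)
    paper_links: list[str] = field(default_factory=list)


class ProteinIndex:
    """Maps every known name of a protein to the same single entry."""

    def __init__(self):
        self._by_name = {}

    def add(self, entry):
        # Register the primary name and each alias under a normalized key.
        for name in [entry.primary_name, *entry.aliases]:
            self._by_name[name.strip().upper()] = entry

    def find(self, name):
        # Case-insensitive lookup; returns None when the name is unknown.
        return self._by_name.get(name.strip().upper())


# Sample values for illustration only; they are not copied from the database.
index = ProteinIndex()
index.add(ProteinEntry("BRCA1", aliases=["RNF53"], interactors=["BARD1"]))

entry = index.find("rnf53")
print(entry.primary_name if entry else "not found")
```

Because every alias points at the same object, a search under any name returns the identical entry rather than a scattered set of partial records.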
“The richness of the database is astounding, since it was created in such a short time by expert reviews of individual publications,” says Aravinda Chakravarti, Ph.D., director of the McKusick-Nathans Institute and a co-author on the paper. “This would have been impossible without scientists to review the literature and computational biologists to make a database that is truly easy to use.”
Academic researchers will have free access to the database. Johns Hopkins Licensing and Technology Development is currently establishing licensing criteria for companies interested in using the database. The database has been active for five months and has elicited almost 2 million hits, simply from word-of-mouth and presentations at scientific meetings, says Pandey.
The Human Protein Reference Database was built using freely available, so-called open-source computer code from ZOPE (Z Object Publishing Environment), which experts at the Institute of Bioinformatics adapted to fit the project’s needs. One of the benefits of using an object-oriented structure like ZOPE, Pandey says, is that there’s no limit on the number of entries (i.e., proteins) or the number of characteristics that can be included.
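To see why an object-oriented layout places no fixed limit on characteristics, consider the rough sketch below. It uses plain Python objects as a stand-in and does not use or reproduce the ZOPE API; the class, the method names and the sample values are all hypothetical.

```python
class ProteinRecord:
    """Plain-Python stand-in for a stored object. This is not the ZOPE API."""

    def __init__(self, name):
        self.name = name
        # Open-ended mapping: any number of characteristics can be attached.
        self.characteristics = {}

    def set(self, key, value):
        self.characteristics[key] = value

    def get(self, key, default=None):
        return self.characteristics.get(key, default)


# The catalog is an ordinary mapping, so entries can be added without a preset cap.
catalog = {}

record = ProteinRecord("BRCA1")
record.set("interactions", ["BARD1"])      # illustrative sample value
record.set("localization", "nucleus")      # illustrative sample value
catalog[record.name] = record

# A new kind of characteristic can be added later without changing the class
# or touching the records that do not have it.
record.set("modifications", ["phosphorylation"])
print(sorted(record.characteristics))
```

Unlike a fixed table with a set number of columns, each object here simply carries whatever characteristics have been recorded for it so far.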