BioWarehouse – Database Integration for Bioinformatics


News

March 5, 2021: Please contact us if you are interested in using BioWarehouse.

August 19, 2015: Shutdown of BioWarehouse services

June 17, 2009: BioWarehouse 4.6 is released. This release includes:

Minor schema changes
Loader updates and bug fixes

December 23, 2008: BioWarehouse 4.5 is released. This release includes:

Minor schema changes
Loader updates and bug fixes

April 4, 2008: BioWarehouse 4.2 is released. This release includes:

Schema extensions to the Support table
Extension to the BioCyc loader to load regulation data
Bug fixes

June 14, 2007: BioWarehouse 4.1 is released. This release includes:

Schema extensions to accommodate transcription units.
Extensions to the BioCyc loader to load transcription units and associated features.
Extensions to the BioCyc loader to load associations to GO and MetaCyc Multifun ontology terms.
Changes to the MAGE loader to fix Description loading and OntologyEntry loading.

November 30, 2006: BioWarehouse 4.0 is released.
October 5, 2006: BioWarehouse 3.9 is released.
September 8, 2006: BioWarehouse version 3.8 is released.
August 22, 2006: BioWarehouse version 3.7 is released. PublicHouse is now loaded with BioWarehouse 3.7.
April 25, 2006: BioWarehouse version 3.6 is released.

Overview

BioWarehouse is a component of the Bio-SPICE project. BioWarehouse is an open-source software environment for integrating a set of biological databases into a single physical database management system for data management, mining, and exploration.

Key features of BioWarehouse:

A relational database schema that models important bioinformatics datatypes
BioWarehouse instances can be implemented using either the Oracle or MySQL database management systems
A collection of loader programs that populate the warehouse with data from public biological databases
The loader programs transform the syntax of the source databases into relational form, and transform the diverse semantics of the source database into the common semantics of the BioWarehouse schema.

BioWarehouse Loaders

BioWarehouse is populated by loader programs that translate the flat-file representation of a source database into the warehouse schema. A loader is provided for each source database supported by BioWarehouse. Once loaded into a BioWarehouse instance (running on MySQL, for example), a set of source DBs can be queried together.
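As a rough illustration of what a loader does, the sketch below parses a small flat file and inserts its records into a relational table. The flat-file layout, the `Entry` table, and the `parse_flat_file` and `load_entries` helpers are all hypothetical, and Python's built-in SQLite module stands in for MySQL or Oracle. The actual loaders are written in C and Java against the BioWarehouse schema.

```python
import sqlite3

# Hypothetical flat-file layout: two-letter line codes, records end with "//".
SAMPLE = """\
ID   P001
DE   Example kinase
//
ID   P002
DE   Example transporter
//
"""


def parse_flat_file(text):
    """Yield one dict per record, keyed by line code."""
    record = {}
    for line in text.splitlines():
        if line.strip() == "//":
            if record:
                yield record
            record = {}
        elif line.strip():
            code, _, value = line.partition(" ")
            record[code] = value.strip()
    if record:
        yield record  # tolerate a missing final terminator


def load_entries(conn, source, text):
    """Insert every parsed record into the (hypothetical) Entry table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Entry ("
        "source TEXT, accession TEXT, description TEXT)"
    )
    rows = [(source, r["ID"], r.get("DE", "")) for r in parse_flat_file(text)]
    conn.executemany("INSERT INTO Entry VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_entries(conn, "ExampleDB", SAMPLE)
```

Because every record carries its `source`, calling `load_entries` again with a second source name and file would place both databases in the same table, where they can be queried side by side.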

Some loaders are specific to a data format rather than to a single source database. For example, the BioPAX and MAGE-ML loaders can load any database that is in BioPAX or MAGE-ML format, respectively.

BioWarehouse loaders:

Source DB	Contents	Language	Citation
BioCyc DBs	Genomes, genes, proteins, metabolic pathways, reactions, compounds	C	link
BioPAX format	BioPAX format describes biological pathway and protein interaction data. Currently this loader can process BioPAX Level 2 only -- protein interaction data.	JAVA
Comprehensive Microbial Resource (CMR)	Genomes, genes, proteins, reactions	C	link
ENZYME DB	Reactions, proteins	JAVA	link
Eco2dbase	E. coli 2D protein gel database	JAVA
GenBank – bacteria only	Bacterial genes and proteins	JAVA	link
Gene Ontology	A controlled vocabulary to describe gene and gene product attributes	JAVA
Kyoto Encyclopedia of Genes and Genomes (KEGG)	Genomes, genes, proteins, metabolic pathways, reactions, compounds	C	link
MetaCyc Ontology	The MetaCyc ontology of metabolic pathways, and the MetaCyc ontology of chemical compounds	C
MAGE-ML format	The MAGE-ML file format describes gene expression datasets	JAVA
NCBI Taxonomy DB	Taxonomical organism classification	C	link
UniProt (Swiss-Prot and TrEMBL)	Protein knowledgebase	JAVA	link

Typically, many of the source database attributes are copied into the warehouse either verbatim or with minor transformations (e.g., converting from the dalton unit stored in a source database to the kilodalton unit used within the BioWarehouse). The few source attributes that are not represented in the warehouse are generally ignored, although some attributes are inferred from the raw data, for example, in cases where a gene is clearly present but not stated explicitly in the source data.
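The attribute handling described above might look roughly like the following sketch. The attribute names, the `MolecularWeightKDa` column, and the `transform_attributes` helper are hypothetical and are not taken from the BioWarehouse schema or loader code; the sketch only shows verbatim copying, a dalton-to-kilodalton conversion, and the dropping of unmapped attributes.

```python
# Hypothetical mapping: source attribute -> (warehouse column, converter).
ATTRIBUTE_MAP = {
    # Copied verbatim.
    "name": ("Name", str),
    # Minor transformation: daltons in the source, kilodaltons in the warehouse.
    "mol_weight_da": ("MolecularWeightKDa", lambda da: float(da) / 1000.0),
}


def transform_attributes(source_record):
    """Return warehouse column values; unmapped source attributes are ignored."""
    row = {}
    for attribute, value in source_record.items():
        if attribute in ATTRIBUTE_MAP:
            column, convert = ATTRIBUTE_MAP[attribute]
            row[column] = convert(value)
    return row


row = transform_attributes(
    {"name": "Example kinase", "mol_weight_da": "52300", "curator_note": "n/a"}
)
```

Here `curator_note` has no entry in the mapping, so it does not appear in the resulting row.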

Current BioWarehouse loaders are implemented in either C or Java. C-based MySQL loaders interface with MySQL using its C API. Similarly, the C-based Oracle loaders interface with Oracle using the Oracle Pro*C precompiler. Java-based loaders use the Java Database Connectivity (JDBC) API to interface with the DBMS. Each of these APIs allows SQL to be embedded and/or generated within its source language. Implementation details can be found in the documentation.

BioWarehouse Schema

The BioWarehouse schema (schema documentation) is designed to capture as much of the data of each component DB as possible within a uniform representation.

For example, in encoding data from a set of source DBs pertaining to proteins, BioWarehouse uses a single set of schema definitions that spans all attributes of proteins found across this set of DBs. This approach eliminates the semantic heterogeneity present in these DBs, allowing users to query all protein sequence DBs using the same schema. Such sharing of tables is applied wherever practical. The translation from the component DB to the warehouse is achieved by the DB loaders, which convert the conceptualization used in each component DB into the conceptualization used by the warehouse schema.
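To make the benefit of shared tables concrete, here is a minimal sketch in which proteins from two sources live in one table and are retrieved with a single query. The table and column names (`DataSet`, `Protein`, `WID`, `DataSetWID`) are simplified stand-ins chosen for illustration, the data are invented, and SQLite is used only so the example runs without a database server; consult the schema documentation for the real definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified, hypothetical tables: one row per source, one shared protein table.
conn.executescript("""
CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Protein (
    WID INTEGER PRIMARY KEY,
    Name TEXT,
    DataSetWID INTEGER REFERENCES DataSet(WID)
);
""")
conn.executemany(
    "INSERT INTO DataSet VALUES (?, ?)",
    [(1, "SourceA"), (2, "SourceB")],
)
conn.executemany(
    "INSERT INTO Protein VALUES (?, ?, ?)",
    [
        (10, "Example kinase", 1),
        (11, "Example transporter", 1),
        (12, "Example kinase", 2),
    ],
)


def find_proteins(conn, name):
    """Return (data set name, protein WID) pairs for a protein name."""
    return conn.execute(
        "SELECT d.Name, p.WID FROM Protein p "
        "JOIN DataSet d ON d.WID = p.DataSetWID "
        "WHERE p.Name = ? ORDER BY d.Name",
        (name,),
    ).fetchall()


hits = find_proteins(conn, "Example kinase")
```

The same query text covers every loaded source; restricting it to one source would only require an extra condition on the data set name.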

The major biological objects of the BioWarehouse and their interrelationships are depicted below. Arrows indicate that the objects in those tables can refer to entries in the same table.

BW schema diagram

Documentation

For each loader, there are two pieces of documentation: how to build and run the loader, and a "manual" for developers describing the details of the loader implementation and schema mappings.

Overall documentation:

Release notes
Quick Start guide
Environment Setup documentation
Schema description and database setup instructions
A description of the integration with the Dashboard for the February 2004 Bio-SPICE demonstration
Descriptions of the Perl utilities and Perl demo scripts

All of the overall documentation is listed in a table of contents at doc/index.html. The table of contents also includes a table of statistics for each loader (latest supported version of the data, input size, number of objects loaded, load time, etc.).

Usage and Obtaining the BioWarehouse

Please contact the BioWarehouse team.

BioWarehouse Applications and Publications

Projects using BioWarehouse include:

Bio-SPICE use cases
SRI’s Enzyme Genomics project
SRI’s Pathway Hole Filling project

Poster presented at 2004 Bio-SPICE PI meeting

BioWarehouse paper in BMC Bioinformatics.

Credits

The BioWarehouse team:

Peter D. Karp, Principal Investigator
Thomas J Lee
Yannick Pouliot
Valerie Wagner
Tomer Altman
Nan Guo
David Dunkley

Support

BioWarehouse development was supported by DARPA contract F30602-01-C-0153, and by NIH grants GM077905 and GM080746.

For support and inquiries, please contact the BioWarehouse team.