BioWarehouse – Database Integration for Bioinformatics



March 5, 2021: If you are interested in using BioWarehouse, please contact us.

August 19th, 2015: Shutdown of BioWarehouse services  

June 17, 2009: BioWarehouse 4.6 is released.
December 23, 2008: BioWarehouse 4.5 is released.
April 4, 2008: BioWarehouse 4.2 is released.
June 14, 2007: BioWarehouse 4.1 is released.
November 30, 2006: BioWarehouse 4.0 is released.
October 5, 2006: BioWarehouse 3.9 is released.
September 8, 2006: BioWarehouse 3.8 is released.
August 22, 2006: BioWarehouse 3.7 is released. PublicHouse is now loaded with BioWarehouse 3.7.
April 25, 2006: BioWarehouse 3.6 is released.


BioWarehouse is a component of the Bio-SPICE project. BioWarehouse is an open-source software environment for integrating a set of biological databases into a single physical database management system for data management, mining, and exploration.

Key features of BioWarehouse:

BioWarehouse Loaders

BioWarehouse is populated using loader programs that translate the flat-file representation of a source database into the warehouse schema. A loader is provided for each source database supported by BioWarehouse. Once a set of source DBs has been loaded into a BioWarehouse instance running on a DBMS such as MySQL, they can be queried together.

Some loaders are specific to a data format rather than to a single source database. For example, the BioPAX and MAGE-ML loaders can load any database that is in BioPAX or MAGE-ML format, respectively.

BioWarehouse loaders, by source DB:

BioCyc DBs: Genomes, genes, proteins, metabolic pathways, reactions, compounds
BioPAX format: The BioPAX format describes biological pathway and protein interaction data. Currently this loader can process BioPAX Level 2 only (protein interaction data). (Java)
Comprehensive Microbial Resource (CMR): Genomes, genes, proteins, reactions
Reactions, proteins
Eco2dbase: E. coli 2D protein gel database (Java)
GenBank – bacteria only: Bacterial genes and proteins
Gene Ontology: A controlled vocabulary to describe gene and gene product attributes
Kyoto Encyclopedia of Genes and Genomes (KEGG): Genomes, genes, proteins, metabolic pathways, reactions, compounds
MetaCyc Ontology: The MetaCyc ontology of metabolic pathways, and the MetaCyc ontology of chemical compounds
MAGE-ML format: The MAGE-ML file format describes gene expression datasets (Java)
NCBI Taxonomy DB: Taxonomical organism classification
UniProt (Swiss-Prot and TrEMBL): Protein knowledgebase

Typically, many of the source database attributes are copied into the warehouse either verbatim or with minor transformations (e.g., converting from the dalton unit stored in a source database to the kilodalton unit used within the BioWarehouse). The few source attributes that are not represented in the warehouse are generally ignored, although some attributes are inferred from the raw data, for example, in cases where a gene is clearly present but not stated explicitly in the source data. 
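As a minimal sketch of the kind of minor transformation described above, the following Python snippet parses a hypothetical flat-file record and converts a molecular weight from daltons to kilodaltons. The record layout, the field names (ID, NAME, MW_DA, SOURCE_NOTE), and the function names are assumptions made for illustration; they are not taken from any BioWarehouse loader, and the actual loaders are written in C and Java.

```python
DALTONS_PER_KILODALTON = 1000.0


def parse_record(text):
    """Parse 'KEY value' lines of a hypothetical flat-file record into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields


def to_warehouse_row(fields):
    """Build a warehouse-style row: name copied verbatim, weight converted to kDa."""
    return {
        "name": fields["NAME"],
        "molecular_weight_kd": float(fields["MW_DA"]) / DALTONS_PER_KILODALTON,
        # Attributes with no warehouse counterpart (here SOURCE_NOTE) are ignored.
    }


# Hypothetical source record, invented for this example.
record = """
ID P00001
NAME Example protein
MW_DA 15867.0
SOURCE_NOTE not represented in the warehouse
"""

row = to_warehouse_row(parse_record(record))
print(row)
```

Keeping the parsing step separate from the mapping step mirrors the split between reading a source format and translating it into the warehouse representation.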

Current BioWarehouse loaders are implemented in both C and Java. C-based MySQL loaders interface with MySQL using its C API. Similarly, the C-based Oracle loaders interface with Oracle using the Oracle Pro*C precompiler. Java-based loaders use the Java Database Connectivity (JDBC) API to interface with the DBMS. Each of these APIs allows SQL to be embedded and/or generated within its source language. Implementation details can be found in the documentation.
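The pattern these APIs share, SQL statements issued from the host language with values bound as parameters, can be sketched as follows. This sketch uses Python's built-in sqlite3 module and an in-memory database purely as a stand-in so that it runs without a server; it is not the MySQL C API, Pro*C, or JDBC code used by the actual loaders, and the table and column names are hypothetical.

```python
import sqlite3

# Rows as a loader might produce them after parsing; values are invented.
rows = [
    ("Example protein A", 15.867),
    ("Example protein B", 42.1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_demo (name TEXT NOT NULL, molecular_weight_kd REAL)"
)

# Using the connection as a context manager commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    conn.executemany(
        "INSERT INTO protein_demo (name, molecular_weight_kd) VALUES (?, ?)",
        rows,
    )

loaded = conn.execute("SELECT COUNT(*) FROM protein_demo").fetchone()[0]
print(loaded)
conn.close()
```

Binding values through placeholders rather than string concatenation leaves quoting and type conversion to the database driver.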

BioWarehouse Schema

The BioWarehouse schema (see the schema documentation) is designed to capture as much of the data of each component DB as possible within a uniform representation.

For example, in encoding data from a set of source DBs pertaining to proteins, BioWarehouse uses a single set of schema definitions that spans all attributes of proteins found across this set of DBs. This approach eliminates the semantic heterogeneity present in these DBs, allowing users to query all protein sequence DBs using the same schema. Such sharing of tables is applied wherever practical. The translation from the component DB to the warehouse is achieved by the DB loaders, which convert the conceptualization used in each component DB into the conceptualization used by the warehouse schema.
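To illustrate the idea of one shared set of tables, the sketch below loads protein rows from two sources into a single table and retrieves them with one query. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL or Oracle; the table name, column names, source labels, and rows are invented for illustration and do not reproduce the actual BioWarehouse schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein_demo ("
    " source_db TEXT NOT NULL,"
    " name TEXT NOT NULL,"
    " molecular_weight_kd REAL)"
)
conn.executemany(
    "INSERT INTO protein_demo VALUES (?, ?, ?)",
    [
        ("source_a", "Example kinase", 35.2),
        ("source_b", "Example kinase", 35.3),
        ("source_b", "Example transporter", 61.0),
    ],
)

# One query spans every loaded source because all rows share one table.
matches = conn.execute(
    "SELECT source_db, name FROM protein_demo WHERE name = ? ORDER BY source_db",
    ("Example kinase",),
).fetchall()
print(matches)
conn.close()
```

Recording the originating source in a column, as assumed here, is one simple way to keep provenance while still sharing a table.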

The major biological objects of the BioWarehouse and their interrelationships are depicted below. Arrows indicate that the objects in those tables can refer to entries in the same table.

[Figure: BioWarehouse schema diagram]


For each loader, there are two pieces of documentation: how to build and run the loader, and a "manual" for developers describing the details of the loader implementation and schema mappings.

Overall documentation:

All of the overall documentation is listed in a table of contents at doc/index.html. The table of contents also includes a table of statistics for all the loaders (latest supported version of the data, input size, number of objects loaded, load time, etc.).

Using and Obtaining BioWarehouse

Please contact the BioWarehouse team.

BioWarehouse Applications and Publications

Projects using BioWarehouse include:

Bio-SPICE use cases
SRI’s Enzyme Genomics project
SRI’s Pathway Hole Filling project

Poster presented at 2004 Bio-SPICE PI meeting

BioWarehouse paper in BMC Bioinformatics.


The BioWarehouse team:

Peter D. Karp, Principal Investigator
Thomas J Lee
Yannick Pouliot 
Valerie Wagner
Tomer Altman
Nan Guo
David Dunkley


BioWarehouse development was supported by DARPA contract F30602-01-C-0153, and by NIH grants GM077905 and GM080746.

For support and inquiries, please contact the BioWarehouse team.