BioWarehouse Schema

Overview of the BioWarehouse Schema
Generating the Schema From its Source Files
Schema Documentation
Loading the BioWarehouse Schema
Loading the Indexes
Resetting BioWarehouse
Altering the Schema
Schema Design and Modification Guidelines

Overview of the BioWarehouse Schema

For a general introduction to BioWarehouse, please see "BioWarehouse: a bioinformatics database warehouse toolkit," BMC Bioinformatics 7:170 2006.
The BioWarehouse schema contains the data definition statements for the BioWarehouse. These include four different types of tables - constant tables, object tables, linking tables, and special tables.
Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables. Put another way, these constants are controlled vocabulary terms in BioWarehouse columns.

Each object table describes a type of entity in a source database, such as compound or protein. Each column of an object table specifies a parameter that characterizes the object. In addition to the attributes defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object. WIDs must be globally unique throughout the entire BioWarehouse database. A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse, i.e., the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

A linking table describes a relationship among objects, such as between a gene and its protein product. A linking table row contains WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported.
The various special tables fulfill a number of roles. Some special tables describe the BioWarehouse itself, including any state information required by loaders or other tools. Some tables contain metadata about objects in the BioWarehouse. Some capture reference and crossreference information to facilitate lookup of objects.

The schema is defined by a common DBMS-independent XML generic schema template called all-schema.xml. The schema is divided into two categories: the "core" tables (defined in core-schema.xml) and the MAGE extension tables for modeling gene-expression data (defined in mage-schema.xml). The MAGE extension to the BioWarehouse schema is auto-generated based on the MAGE Object Model. (See the MAGE Loader documentation for more details.)

The schema for the BioWarehouse is located in the schema directory.

Generating the Schema From its Source Files

The BioWarehouse currently supports the MySQL and the Oracle database management systems. The schema for each supported DBMS is generated programmatically from the common DBMS-independent XML generic schema template, all-schema.xml.

MySQL schema
Oracle schema

Schema Documentation

This distribution contains browsable, cross-referenced documentation of the schema tables.

Loading the BioWarehouse Schema

Before loading the schema, follow the instructions in Environment Setup. Make sure you have a database instance created before loading the schema.

To load the schema, open a shell and navigate to the BioWarehouse directory schema. Depending on your DBMS, execute the following commands:

MySQL:
osprompt: mysql [connect-parameters]
mysql> source warehouse-mysql-create.sql;
mysql> exit

Oracle:
osprompt: sqlplus [connect-string]
sqlplus> @warehouse-oracle-create.sql
sqlplus> exit

The Warehouse is now initialized. Datasets may now be loaded into the Warehouse.

Loading the Indexes

The indexes for each supported DBMS are generated programmatically from the common DBMS-independent XML generic schema template, all-schema.xml.

MySQL index
Oracle index

These indexes should be loaded once the loaders have been run and the data is loaded. Note that these indexes are intended to improve query performance. However, if these are loaded before the loaders are run, they can significantly increase the load times.

To load the indexes, open a shell and navigate to the schema directory. Depending on your DBMS, execute the following commands:
MySQL:
osprompt: mysql [connect-parameters]
mysql> source warehouse-mysql-index.sql;
mysql> exit

This would take about TODO hours to load.

Oracle:
osprompt: sqlplus [connect-string]
sqlplus> @warehouse-oracle-index.sql
sqlplus> exit

They can take up 13 hours to load for a full BioWarehouse instance, although the time depends strongly on which datasets are loaded into BioWarehouse.

Reseting BioWarehouse

WARNING: this procedure will irrevocably drop all of the data contained in the warehouse!
To reset the warehouse so that all data is removed, and such that the schema is then subsequently reloaded, open a shell and navigate to the directory schema. Depending on your DBMS, execute the following commands:
MySQL:
osprompt: mysql [connect-parameters]
mysql> use DatabaseToDestroy;
mysql> source mysql-destroy-and-reload.sql;
mysql> exit

Oracle:
osprompt: sqlplus [connect-string]
sqlplus> @oracle-destroy-and-reload.sql
sqlplus> exit

To completely wipe out the warehouse without reloading the schema, run warehouse-mysql-destroy.sql or warehouse-oracle-destroy.sql instead of the above.

Altering the Schema

The BioWarehouse schema is defined by a common DBMS-independent XML generic schema template called all-schema.xml. The schema is divided into two categories: the "core" tables (defined in core-schema.xml) and the MAGE extension tables (defined in mage-schema.xml). The MAGE extension to the BioWarehouse schema is auto-generated based on the MAGE Object Model. (See the MAGE Loader documentation for more details.)

Changes to the core schema should be made in the core-schema.xml file. Then perform the following steps to merge the core schema with the MAGE extension to produce the full schema defined in all-schema.xml, generate new create, index, and destroy scripts, and produce new schema documentation:

osprompt: ant merge-schemas create-ddl-scripts schema-doc
The format allowed in the XML schema file is defined by utils/xsd/db-schema.xsd.

Schema Design and Modification Guidelines

The previous section describes the mechanics of schema modification. This section describes the design principles underlying the schema that should be respected when adding custom tables to the schema, in order to perserve the utility of the BioWarehouse.
WIDs
Each object in the BioWarehouse is assigned an identifier called a Warehouse ID or a WID. Unlike many relational database systems where keys are unique only within a given table, WIDs must be unique across the entire BioWarehouse. A WID is a non-negative integer greater than one (the values zero and one are reserved).
What is and is not considered an object in BioWarehouse? Objects generally correspond to any biological entity, such as proteins, pathways, or chemicals. Comments and synonyms are not considered objects, for example, which is why their tables in the BioWarehouse schema do not contain a WID column. WID's can be used to link separate tables containing multivalued attributes to the main table for a BioWarehouse object. For example, any BioWarehouse object may have multiple comments. In this case, the CommentTable has a column (OtherWID) that contains the WID of the warehouse object that is associated with the comment.
An association table is another example of a table that does not have its own WID, and therefore, is not defined as a warehouse object. An association table allows one to represent a many-to-many relationship within a relational database. The table is defined with two WID-type columns. Each would be populated with the WIDs from the BioWarehouse objects that are being associated.
The schema contains mechanisms for allocating three classes of WIDs. A special WID is used for WIDs that occur very frequently within a dataset. The most common usage is for the WID of the DataSet row representing the containing dataset. There is a relatively small pool (~1000) of special WIDs available.
A regular WID is used by default when a special WID is not called for. There are a large number of regular WIDs available (~2^63).
The BioWarehouse typically provides loaders for certain bioinformatics datasets. For the Eco2DBASE dataset, rather than provide a loader, a dump of a MySQL database is provided. This dataset uses reserved WIDs in place of regular WIDs. There are roughly one million reserved WIDs available for use. They are allocated between the special WIDs and the regular WIDs.
Schema Naming Conventions
Object tables are named for the object represented therein, typically a single word, e.g., Protein. Linking tables that contain exactly two rows, each of which is a WID referring to an object in a specific object table, are named <Table1>WID<Table2>WID, e.g., GeneWIDProteinWID. Table names are in general case-sensitive, though it is an error to define tables whose names differ only in case. This is to accomodate both MySQL and Oracle naming conventions. The first letter of a table name is capitalized; if a table name contains multiple words, each word is capitalized. To accomodate DBMS limits, table names should be less than 32 characters long.
The first letter of a column name is capitalized; if a column name contains multiple words, each word is capitalized. This is primarily for aesthetic reasons -- neither Oracle nor MySQL enforce case conventions on column names. Column names of WIDs that refer to a single table (i.e., a single type of object) are named <Table>WID, e.g., GeneWIDProteinWID.ProteinWID. Column names that may refer to objects of multiple types are named OtherWID (e.g., the DBID table). There are several generic column names such as Name and Type which should be used if appropriate.
Relationships Among Data
Relationships among data can be characterized as one-to-one, one-to-many, or many-to-many. A one-to-one relationship between an object and a datum is specified by defining a column of the appropriate type for that datum in that object table. For example, the NucleicAcid table defines a boolean column Fragment and an integer column MoleculeLength.
A one-to-many relationship is specified by defining an object table to hold the data for the 'many' case, including a WID column for the 'one' object. For example, a biological source may have many nucleic acids associated with it; the NucleicAcid table has a BioSourceWID column to reference the source of the nucleic acid molecule.
A many-to-many relationship is specified by defining a linking table that contains all the columns needed to represent the relationship. For example, the BioSourceWIDGeneWID table reflects that a given gene may be present in many organisms, and that an organism has many genes. Note that this table has only two columns -- one for each object in the relationship. A relationship may have additional properties. For example the Reactant table has a Coefficient column to indicate the number of molecules of a particular substrate that participate in an associated chemical reaction.
Table Reuse
In order to keep the schema size small, single common tables are used for information that is common to many warehouse datatypes. For example, comments and citations are common to many warehouse datatypes; each is implemented through a single table. Similarly, common datatypes are represented by a single table each, even though datasets will vary in the style and specificity of their representation of that datatype. Enforcing commonality and minimalism promotes simplicity, and enables comparative analysis of data from differing sources.
In cases where a property of a dataype is available in a source dataset but not represented in the associated object table, loaders may preserve the property in an unstructured form by associating a comment containing the property with the object.
Controlled Vocabularies
The BioWarehouse specifies a number of controlled vocabularies for certain biological concepts and ontologies. For example, the Topology column in the NucleicAcid table is constrained to the values 'circular', 'linear', or 'other':
        <column length="30" name="Topology" type="@varchar" required="false">
            <comment>Enumeration: 'circular', 'linear' or 'other'.</comment>
            <enumeration>
                <restriction value="linear" description="The topology of the molecule is Linear"/>
                <restriction value="circular" description="The topology of the molecule is Circular"/>
                <restriction value="other"
                         description="The topology of the molecule is neither circular nor linear but is known"/>
            </enumeration>
        </column>
When the schema is modified, it may be useful to add a controlled vocabulary for a new column. Generally, if the number of different values a column can take is small, a controlled vocabulary is preferred over free text. A controlled vocabulary is specified in the XML files for the schema using the <enumeration> and <restriction> XML tags.
If possible, the vocabulary should use terminology that is generic rather than specific to a particular database or type of usage. This will make the schema addition more reusable.
Type Abstraction
In order to accomodate both MySQL and Oracle within a single schema specification, the BioWarehouse defines a set of abstract types that are usable as column types in table specifications. These are defined in all-schema.xml. The complete set is listed below; this set is extensible:
   <datatypes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <dtype notation="@varchar" mysql="VARCHAR" oracle="VARCHAR2"
             description="@varchar(nnn): a varying-length string of length <= nnn"/>
        <dtype notation="@datetime" mysql="DATETIME" oracle="DATE"
             description="a date and/or a time"/>
        <dtype notation="@real32" mysql="FLOAT" oracle="NUMBER"
             description="signed floating point types using 32 bits"/>
        <dtype notation="@real64" mysql="DOUBLE" oracle="NUMBER"
             description="signed floating point types using 64 bits"/>
        <dtype notation="@int16" mysql="SMALLINT" oracle="NUMBER"
             description="signed integral types using 16 bits"/>
        <dtype notation="@int32" mysql="INT" oracle="NUMBER"
             description="signed integral types using 32 bits"/>
        <dtype notation="@int64" mysql="BIGINT" oracle="NUMBER"
             description="signed integral types using 64 bits"/>
        <dtype notation="@byte32" mysql="LONGBLOB" oracle="BLOB"
             description="binary data of length <= 4294967295 bytes"/>
        <dtype notation="@string10" mysql="TEXT" oracle="VARCHAR2(4000)"
             description="a varying-length string of length <= 2047"/>
        <dtype notation="@string16" mysql="TEXT" oracle="CLOB"
             description="a varying-length string of length <= 65535"/>
        <dtype notation="@string32" mysql="LONGTEXT" oracle="CLOB"
             description="a varying-length string of length <= 4294967295"/>
        <dtype notation="@wid" mysql="BIGINT" oracle="NUMBER"
             description="a Warehouse Identifier (a number that is unique throughout the Warehouse)"/>
        <dtype notation="@number" mysql="DECIMAL" oracle="NUMBER"
             description="@number(precision,scale): any number for which exact precision is preserved."/>
        <dtype notation="@boolean" mysql="CHAR(1)" oracle="CHAR(1)"
             description="a boolean value: 'T' for true and 'F' for false"/>
        <dtype notation="@char" mysql="CHAR(1)" oracle="CHAR(1)"
             description="a single character value"/>
    </datatypes>
Normalization
Generally the tables comprising the BioWarehouse are normalized in the sense that no data is replicated in multiple columns. An exception is that both the Entry table and all object tables contain a DataSetWID column. This is done for efficiency considerations, to avoid the need to join an object table with the large Entry table to query for its WID.
Keys and indexing
Each object table has a column named WID, which uniquely identifies the object. This column is a primary key of the table. There are no secondary keys defined in the BioWarehouse. Since the BioWarehouse represents data from different sources, it is hard in general to make any uniqueness assumptions about any other columns or combination of columns in any tables.
To assure efficient queries, each table in the BioWarehouse must be indexed appropriately. Primary keys are indexed automatically by both Oracle and MySQL, so no additional specification need be done other that making these columns primary keys.
The BioWarehouse supports two slightly different ways to specify indexes, depending on whether the index is required by data loaders, or simply needed for efficient querying once all data is loaded. All indexes are specified in the schema file using the <index> tag. If an index is required during loading, the initial=true is specified in the <index> tag. Such indexes are defined when the schema is created, and are present during loading.
Non-initial indexes are computed en masse at the discretion of the BioWarehouse administrator, usually after all, or a significant portion, of the data to be loaded is loaded. The performance tradeoff is that initial indexes degrade loading performance slightly, since they must be maintained by the DBMS during loading, so they are typically used only when that column of the table is used in the WHEN clause of a SQL SELECT or SQL UPDATE during loading.
The BioWarehouse uses simple heuristics in deciding which columns to index. For the most part, each column in each table has an index. Exceptions are boolean and controlled-vocabulary columns that have few distinct values; indexing these columns typically does not improve performance significantly.
It is possible to define multi-column indexes as well. This enhances performance if both column values are frequently specified in a query. Multi-column indexes are defined for most linking tables (e.g., the GeneWIDProteinWID table).
In MySQL, columns that contain character data and that are longer than 255 characters must specify a length no greater than 255 to be used to effectively truncate the data for the purposes of indexing. This is specified in the schema as follows:
        <index name="PROTEIN_NAME"> 
            <variant dialect="mysql" columns="NAME(20)"/> 
            <variant dialect="oracle" columns="NAME"/> 
        </index>