ENZYME Loader Developers' Manual

Version 4.6


(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Introduction
Installation and Building
Input data
Running the Loader
Dataset Specification
Translation Semantics for ENZYME Objects
Entry Table
References

Introduction

This document describes version 4.6 of the ENZYME Loader, one of several database loaders that make up the BioWarehouse. The ENZYME Loader (referred to simply as the loader) loads an ENZYME Nomenclature Database into the BioWarehouse.

ENZYME is a database of information relating to the nomenclature of enzymes. The loader reads a textual representation of ENZYME, converts it to the representation expressed in the BioWarehouse Schema, and loads the result directly into an instance of the warehouse.

Overview of BioWarehouse Schema

The BioWarehouse schema contains the data definition statements for the BioWarehouse. These define four types of tables: constant tables, object tables, linking tables, and special tables.

Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; for example, the SWISS-PROT loader adds one row to the Dataset table each time it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

Linking tables describe relationships among objects. They contain the WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported.

Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.

Schema documentation is available.


Installation and Building

See the ENZYME installation instructions for the procedure to install and build the loader.

Input data

The ENZYME installation instructions contain directions for obtaining the ENZYME database. The ENZYME database is contained in a single ASCII file. Its format is fully described in the ENZYME Nomenclature Database User Manual. Briefly, the file consists of an entry for each enzyme in the ENZYME database. Each entry consists of multiple records, and is terminated by a record consisting of the characters //. Each record consists of one or more text lines, each of which is prefixed by an identical two-letter code indicating the record type.
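
The file layout described above can be illustrated with a minimal Python sketch. It is not the loader's own parser; the function name split_entries and the Record type are hypothetical, and the sketch assumes the two-letter code occupies the first two characters of each line.

```python
from typing import Iterable, Iterator, List, Tuple

# (two-letter record code, text of each line in the record)
Record = Tuple[str, List[str]]


def split_entries(lines: Iterable[str]) -> Iterator[List[Record]]:
    """Yield one list of (code, lines) records per ENZYME entry."""
    records: List[Record] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            # Terminator record: the current entry is complete.
            if records:
                yield records
            records = []
            continue
        code, text = line[:2], line[2:].strip()
        if records and records[-1][0] == code:
            # Consecutive lines with the same code form one record block.
            records[-1][1].append(text)
        else:
            records.append((code, [text]))
    if records:
        # Assumption: tolerate a final entry that lacks a terminator.
        yield records
```

Grouping consecutive lines that share a code mirrors the manual's notion of a record as one line or a block of lines of the same record type.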


Running the Loader

The ENZYME installation instructions contain details for running the loader, including options and a description of its output.

Dataset Specification

The loader adds one row to the Dataset table as follows:

Column values for Dataset row
Column         Value assigned by ENZYME loader
WID            The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse.
Name           'Enzyme'
Version        'unknown'
ReleaseDate    When the version was released, currently 'October 27, 2001'.
LoadDate       The time/date the loader was run (SQL 'SYSDATE').
HomeURL        http://www.expasy.org/enzyme/
QueryURL       NULL


Translation Semantics for ENZYME Objects

This section describes the semantic mapping between an entry in ENZYME and its representation in the BioWarehouse. Input syntax and loader semantics are specified with extended Backus-Naur form rules, using the following notation:

    lhs ==> rhs     means lhs can be rewritten as rhs.
    x*     means x can occur zero or more times.
    x+     means x can occur one or more times.
    x?     means x is optional.
    x | y   means either x or y is valid.
    ATTR   is an attribute.
    ABC   is literal text.

Each source record contains one or more data values termed attributes. For example, the FT (feature) record contains four attributes: TYPE, FROM, TO, and DESCRIPTION. Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.

Most semantics are expressed in tabular form, showing the mapping of each source input record (i.e., one line or a block of lines of the same record type) to the warehouse Table.Column values computed from it. The most typical semantics is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Some attributes are ignored. If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.

Since ENZYME is a protein database, most attributes either define column values in the Protein table entry for the enzyme, or are linked to it through linking tables. The term current protein is used to refer to the enzyme described by the entry being translated.

The ENZYME record types are:

ID line
DE line
AN line
CA line
CF line
CC line
DI line
DR line
PR line

The ID line

The syntax of the ID (identification) line is:

    ID     EC_NUMBER
    EC_NUMBER ==> N.N.N.N;

where N is an integer.

There is only one ID record per entry. A row is added to the Reaction table and to the DBID table for the ID.


Translation semantics for ID
ENZYME Attribute    Warehouse Semantics
EC_NUMBER           Reaction.ECNumber;
                    DBID.XID;
                    DBID.OtherWID is the WID of the Protein object
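
As an illustration of the ID syntax, the following Python sketch checks that the value has the form N.N.N.N with integer components. The function name parse_id_line is hypothetical, and the sketch assumes no trailing punctuation follows the EC number on the line.

```python
import re
from typing import Optional, Tuple

# Four integers separated by periods, per the EC_NUMBER rule.
EC_NUMBER = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_id_line(line: str) -> Optional[Tuple[int, ...]]:
    """Return the four EC number components, or None if the line is not a valid ID line."""
    if not line.startswith("ID"):
        return None
    match = EC_NUMBER.fullmatch(line[2:].strip())
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

A caller would store the original EC number string in Reaction.ECNumber and DBID.XID; the integer tuple here only demonstrates the validation.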

The DE line

The syntax of the DE (description) line is:

    DE     DESCRIPTION .

where DESCRIPTION is arbitrary text.

The record can be broken into multiple DE lines.

DE specifies the recommended name of the enzyme.


Translation semantics for DE
ENZYME Attribute Warehouse Semantics
DESCRIPTION Protein.Name
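
Because a DE record can span several lines, its lines must be joined before the name is stored. A minimal Python sketch follows; the name join_de_lines is hypothetical, and both the single-space separator and the removal of the terminating period are assumptions rather than documented loader behavior.

```python
from typing import Iterable


def join_de_lines(lines: Iterable[str]) -> str:
    """Join the DE lines of one entry into a single DESCRIPTION value."""
    parts = [line[2:].strip() for line in lines if line.startswith("DE")]
    # Assumption: continuation lines are joined with one space.
    text = " ".join(parts)
    # Assumption: the final period is a delimiter, not part of the name.
    return text[:-1].rstrip() if text.endswith(".") else text
```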

The AN line

The syntax of the AN (alternate name) line is:

    AN     ALTERNATE_NAME .

where ALTERNATE_NAME is arbitrary text. Multiple AN records are allowed for an enzyme.

A row is added to the DBID table for each ALTERNATE_NAME, providing a way to query for the WID of the associated enzyme in the Protein table.


Translation semantics for AN
ENZYME Attribute     Warehouse Semantics
ALTERNATE_NAME[*]    DBID.XID;
                     DBID.OtherWID is the WID of the Protein entry of the enzyme

The CA line

The syntax of the CA (catalytic activity) line is:

    CA     reaction .
    reaction ==> text-description | REACTION-EQUATION ;
    REACTION-EQUATION ==> reactants = products
    reactants ==> ingredient add-ingredients*
    products ==> ingredient add-ingredients*
    add-ingredients ==> + ingredient
    ingredient ==> COEFFICIENT? CHEMICAL

A REACTION-EQUATION is detected by the presence of an equals sign; arbitrary text containing an equals sign, or multiple equals signs in an equation, will confuse the parser. If a text-description is given rather than a REACTION-EQUATION, it is ignored and no details of the reaction are stored.

This record describes a reaction that is catalyzed by the enzyme. A row is added to the EnzymaticReaction table for each CA record; EnzymaticReaction.ProteinWID is the current protein, and EnzymaticReaction.ReactionWID is the current reaction (i.e., the reaction with the EC number matching the ID record of this entry).

A row is added to the Chemical table for each CHEMICAL attribute. A row is also added to either the Product or Reactant table depending on which side of the reaction the chemical is on.


Translation semantics for CA
ENZYME Attribute     Warehouse Semantics
REACTION-EQUATION    Creates EnzymaticReaction row;
                     equation is stored using Product and Reactant linking tables.
CHEMICAL[*]          Chemical.Name; either a Product or Reactant row is added;
                     its OtherWID is Chemical.WID
COEFFICIENT[*]       either Product.Coefficient or Reactant.Coefficient;
                     if attribute is absent, 1 is stored.
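
The CA grammar can be sketched in Python as follows. This is an illustration under stated assumptions, not the loader's parser: the names parse_side and parse_ca are hypothetical, ingredients are assumed to be separated by a plus sign with a space on each side (so that names such as NAD(+) stay intact), and a COEFFICIENT is assumed to be a leading integer followed by white space. Where the real parser would be confused by more than one equals sign, this sketch simply returns None.

```python
import re
from typing import List, Optional, Tuple

Ingredient = Tuple[int, str]  # (coefficient, chemical name)


def parse_side(side: str) -> List[Ingredient]:
    """Split one side of the equation into (coefficient, chemical) pairs."""
    ingredients: List[Ingredient] = []
    for term in side.split(" + "):
        term = term.strip()
        match = re.match(r"(\d+)\s+(.+)", term)
        if match:
            ingredients.append((int(match.group(1)), match.group(2)))
        else:
            # Absent coefficient: 1 is stored.
            ingredients.append((1, term))
    return ingredients


def parse_ca(text: str) -> Optional[Tuple[List[Ingredient], List[Ingredient]]]:
    """Return (reactants, products), or None for a text description."""
    body = text.strip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating period
    if body.count("=") != 1:
        return None  # no equation detected, or an ambiguous one
    left, right = body.split("=")
    return parse_side(left), parse_side(right)
```

Each returned pair corresponds to one Chemical row plus one Reactant or Product row, depending on the side of the equals sign it came from.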

The CF line

The syntax of the CF (cofactor) line is:

    CF     CHEMICAL ;

This record describes a cofactor, a chemical that is required to catalyze a reaction, but is not changed by it.

The ENZYME database allows a more complex syntax for chemicals than a simple name, including disjunctions and multiple cofactors per line. The loader treats the text of the entire line as a single chemical, e.g., 'Zinc or Iron'.

A row is added to the Chemical table for the CHEMICAL attribute.

A row is added to the EnzymaticReactionCofactor linking table for each CF record; EnzymaticReactionCofactor.ChemicalWID is the WID of the added chemical, and EnzymaticReactionCofactor.EnzymaticReactionWID is the WID of the current enzymatic reaction.


Translation semantics for CF
ENZYME Attribute Warehouse Semantics
CHEMICAL Chemical.Name

The CC line

The syntax of the CC (Comment) line is:

    CC     COMMENT+
    COMMENT ==> -!- comment-line+
    comment-line ==> any text, terminated by line breaks

The record can be broken into multiple CC lines.

The value of COMMENT is the concatenation of all comment-lines that comprise it. Line breaks are converted to a single space, and the delimiters CC and -!- are not included. Comments may be of any length (in the schema they are represented as CLOBs). Some ENZYME comments contain extra white space to format them nicely in an ASCII file; this white space is preserved. Many ENZYME comments are copyright notices; these are treated like other comments.

An entry is added to the CommentTable table for each COMMENT. CommentTable.OtherWID is the WID of the current protein.


Translation semantics for CC
ENZYME Attribute Warehouse Semantics
COMMENT[*] CommentTable.Comm
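
The comment assembly rules can be sketched in Python as shown here. The name collect_comments is hypothetical. Exactly which white space the loader preserves is not spelled out above, so the sketch assumes that white space around the -!- delimiter is dropped while continuation lines are kept verbatim after the two-letter code.

```python
from typing import Iterable, List


def collect_comments(lines: Iterable[str]) -> List[str]:
    """Return one string per COMMENT found in the CC lines of an entry."""
    comments: List[str] = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        body = line[2:]  # drop the CC delimiter
        if body.lstrip().startswith("-!-"):
            # A new COMMENT begins; drop the -!- delimiter itself.
            comments.append(body.lstrip()[3:].lstrip())
        elif comments:
            # Continuation line: the line break becomes one space and the
            # rest of the line, including its padding, is kept as-is.
            comments[-1] += " " + body
    return comments
```

Each returned string would become one CommentTable row whose OtherWID is the WID of the current protein.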

The DI line

The DI (disease) record is parsed, but otherwise ignored by the loader.

The PR line

The syntax of the PR (Prosite reference) line is:

    PR     PROSITE; PROSITE_ID ;

PROSITE_ID is a PROSITE document entry accession number, uniquely identifying a document in PROSITE. The loader accepts any string delimited by white space.

This record describes a cross-reference from an ENZYME protein to an entry in the PROSITE database. A row is added to the CrossReference table for each PR. CrossReference.OtherWID refers to the current protein.


Translation semantics for PR
ENZYME Attribute    Warehouse Semantics
PROSITE_ID          CrossReference.XID;
                    CrossReference.DataSetName is 'PROSITE';
                    CrossReference.OtherWID refers to the current protein.
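
A minimal Python sketch of the PR mapping follows. The name parse_pr and the dictionary representation of a CrossReference row are hypothetical; the sketch takes the PROSITE_ID to be the white-space-delimited token after "PROSITE;" and assumes the trailing semicolon is not part of the identifier.

```python
from typing import Dict, Optional


def parse_pr(line: str) -> Optional[Dict[str, str]]:
    """Return the CrossReference column values for one PR line, or None."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "PR" or tokens[1] != "PROSITE;":
        return None
    # Any white-space-delimited string is accepted as the identifier.
    return {"XID": tokens[2].rstrip(";"), "DataSetName": "PROSITE"}
```

The OtherWID column is omitted because it depends on the WID assigned to the current protein at load time.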

The DR line

The syntax of the DR (database cross-reference) line is:

    DR     swissprot-ref+
    swissprot-ref ==> SWISSPROT_ID , SWISSPROT_AC ;

SWISSPROT_ID is the ID value of an entry in the SWISS-PROT database. SWISSPROT_AC is the primary AC (accession) value of an entry in the SWISS-PROT database.

This record describes a cross-reference from an ENZYME protein to the SWISS-PROT database. A row is added to the CrossReference table for each SWISSPROT_ID and SWISSPROT_AC. CrossReference.OtherWID refers to the current protein.


Translation semantics for DR
ENZYME Attribute    Warehouse Semantics
SWISSPROT_ID[*]     CrossReference.XID;
                    CrossReference.DataSetName is 'SWISS-PROT';
                    CrossReference.OtherWID refers to the current protein.
SWISSPROT_AC[*]     CrossReference.XID;
                    CrossReference.DataSetName is 'SWISS-PROT';
                    CrossReference.OtherWID refers to the current protein.
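
Since both SWISSPROT_ID and SWISSPROT_AC map to CrossReference.XID, the DR translation can be sketched as producing one row per value. In this Python illustration the name parse_dr and the dictionary row representation are hypothetical, and the identifiers in the usage example are made up.

```python
from typing import Dict, List


def parse_dr(line: str) -> List[Dict[str, str]]:
    """Return one CrossReference row per SWISS-PROT value on a DR line."""
    rows: List[Dict[str, str]] = []
    if not line.startswith("DR"):
        return rows
    # Each swissprot-ref ends with a semicolon and holds two values
    # separated by a comma.
    for ref in line[2:].split(";"):
        for value in ref.split(","):
            value = value.strip()
            if value:
                rows.append({"XID": value, "DataSetName": "SWISS-PROT"})
    return rows
```

For example, parse_dr("DR   AAA_HUMAN, P00001;  BBB_MOUSE, P00002;") yields four rows, two for each swissprot-ref.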

Entry Table

A row is added to the Entry table for each row that the loader adds to these tables: The Entry row is created as follows:

Column values for Entry row
Column          Value assigned by loader
OtherWID        The WID of the entry described by this row.
InsertDate      The time/date the loader was run (SQL 'SYSDATE').
CreationDate    NULL
ModifiedDate    NULL
LoadError       'T' if a parse error is detected, 'F' otherwise.
LineNumber      The line number of the input file the associated source entry starts on.
DatasetWID      The value Dataset.WID assigned to the dataset being loaded.


References