(C) 2006 SRI International. All Rights Reserved. See BioWarehouse Overview for license details.
This document describes version 4.6 of the ENZYME Loader, one of several database loaders that make up the BioWarehouse. The ENZYME Loader (referred to simply as the loader) loads an ENZYME Nomenclature Database into the BioWarehouse.
ENZYME is a database of information relating to the nomenclature of enzymes. The loader reads a textual representation of ENZYME, converts it to the representation expressed in the BioWarehouse Schema, and loads the result directly into an instance of the warehouse.
Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.
Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.
A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; e.g., the SWISS-PROT loader adds one row to the Dataset table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.
Linking tables describe relationships among objects. They contain the WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported.
Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.
Schema documentation is available.
The ENZYME installation instructions contain directions for obtaining the ENZYME database.
The ENZYME database is contained in a single ASCII file. Its format is
fully described in the ENZYME Nomenclature
Database User Manual. Briefly, the file consists of an entry
for each enzyme in the ENZYME database.
Each entry consists of multiple records, and is terminated
by a record consisting of the characters //.
Each record consists of one or more text lines, each of which is
prefixed by an identical two-letter code indicating the record type.
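The entry and record structure described above can be sketched in Python as follows. This is a minimal illustration, not the loader's own code: the function name `parse_entries` and the sample entries are hypothetical, and the sketch assumes that whitespace following the two-letter code is padding.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is its two-letter code plus the text of each line that carries it.
Record = Tuple[str, List[str]]


def parse_entries(lines: Iterable[str]) -> Iterator[List[Record]]:
    """Yield each entry as a list of records (hypothetical helper)."""
    entry: List[Record] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # Terminator record: the current entry is complete.
            if entry:
                yield entry
            entry = []
            continue
        if len(line) < 2:
            continue  # skip blank lines
        code, text = line[:2], line[2:].strip()
        if entry and entry[-1][0] == code:
            # Consecutive lines with the same code form one record block.
            entry[-1][1].append(text)
        else:
            entry.append((code, [text]))


# Illustrative input only; not copied from a specific ENZYME release.
sample = """ID   1.1.1.1
DE   Alcohol dehydrogenase.
CF   Zinc or Iron.
//
ID   1.1.1.2
DE   Alcohol dehydrogenase (NADP(+)).
//
"""
entries = list(parse_entries(sample.splitlines()))
```

Grouping consecutive lines by code mirrors the notion of a record as one line or a block of lines of the same type.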
A row is added to the Dataset table as follows:
Column | Value assigned by ENZYME loader |
---|---|
WID | The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse. |
Name | 'Enzyme'. |
Version | 'unknown' |
ReleaseDate | When the version was released, currently 'October 27, 2001'. |
LoadDate | The time/date the loader was run (SQL 'SYSDATE'). |
HomeURL | http://www.expasy.org/enzyme/ |
QueryURL | NULL |
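As a rough illustration of the Dataset row listed above, the following Python sketch builds the row as a dictionary. The WID allocator (`wid_counter`) and the function name `make_dataset_row` are hypothetical stand-ins; the real loader obtains the next available WID from the warehouse and uses SQL SYSDATE for the load date.

```python
import itertools
from datetime import datetime

# Hypothetical stand-in for the warehouse's "next available WID" mechanism.
wid_counter = itertools.count(1)


def make_dataset_row(next_wid, load_date=None):
    """Build the Dataset row using the column values from the table above."""
    return {
        "WID": next(next_wid),
        "Name": "Enzyme",
        "Version": "unknown",
        "ReleaseDate": "October 27, 2001",
        # The loader uses SQL SYSDATE; the local clock is used here instead.
        "LoadDate": load_date or datetime.now(),
        "HomeURL": "http://www.expasy.org/enzyme/",
        "QueryURL": None,  # stored as NULL
    }


dataset = make_dataset_row(wid_counter)
```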
This section describes the semantic mapping from an entry in ENZYME to its representation in the BioWarehouse. Extended Backus-Naur form rules are used to specify input syntax and loader semantics, with the following notation:
lhs ==> rhs means lhs can be rewritten as rhs.
x* means x can occur zero or more times.
x+ means x can occur one or more times.
x? means x is optional.
x | y means either x or y is valid.
ATTR is an attribute.
ABC is literal text.
Each source record contains one or more data values termed attributes. For example, the FT (feature) record contains four attributes - TYPE, FROM, TO, and DESCRIPTION. Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.
Most semantics are expressed in tabular form, showing the mapping of
each source input record (i.e., one line or a block of lines of the
same record type) to the warehouse Table.Column values computed from
it. The most typical semantics is that the attribute is simply copied
into a warehouse column; if translation is more complex, an
explanation is given. Some attributes are ignored.
If an attribute is missing from a source file but required by the
warehouse schema (i.e., its column is qualified with NOT NULL), a
warning is issued. If the missing attribute is not required, NULL is
stored.
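The rule for missing attributes can be sketched as below. The helper name `column_value` and its parameters are hypothetical. The text does not say what is stored when a required attribute is missing, so this sketch simply returns `None` (NULL) in both cases and differs only in whether a warning is issued.

```python
import warnings
from typing import Dict, Optional


def column_value(attrs: Dict[str, str], attr: str, column: str,
                 not_null: bool) -> Optional[str]:
    """Return the value to store in `column`, warning if a required one is missing."""
    value = attrs.get(attr)
    if value is None and not_null:
        # Column is declared NOT NULL but the source entry lacks the attribute.
        warnings.warn(f"{attr} is missing but {column} is NOT NULL")
    return value  # None stands for SQL NULL
```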
Since ENZYME is a protein database, most attributes either define
column values in the Protein table entry for the enzyme, or are
linked to it through linking tables. The term current protein is used
to refer to the enzyme described by the entry being translated.
The ENZYME record types are:
ID line
DE line
AN line
CA line
CF line
CC line
DI line
DR line
PR line
There is only one ID record per entry.
A row is added to the Reaction table and to the DBID table for the ID.
ENZYME Attribute | Warehouse Semantics |
---|---|
EC_NUMBER | Reaction.ECNumber. DBID.XID; DBID.OtherWID is the WID of the Protein object |
The record can be broken into multiple DE lines.
DE specifies the recommended name of the enzyme.
ENZYME Attribute | Warehouse Semantics |
---|---|
DESCRIPTION | Protein.Name |
A row is added to the DBID table for each ALTERNATE_NAME, providing a way to query for the WID of the associated enzyme in the Protein table.
ENZYME Attribute | Warehouse Semantics |
---|---|
ALTERNATE_NAME[*] | DBID.XID; DBID.OtherWID is the WID of the Protein entry of the enzyme |
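To illustrate how the DBID rows support lookup by alternate name, here is a small in-memory sketch. The function names, the list-of-dictionaries table, and the sample names and WID are all hypothetical; in the warehouse the lookup would be a query against the DBID table.

```python
from typing import Dict, List


def dbid_rows(alternate_names: List[str], protein_wid: int) -> List[Dict[str, object]]:
    """One DBID row per ALTERNATE_NAME, each pointing at the current protein."""
    return [{"OtherWID": protein_wid, "XID": name} for name in alternate_names]


def find_wids(dbid_table: List[Dict[str, object]], xid: str) -> List[object]:
    """Return the WIDs of all objects registered under the given XID."""
    return [row["OtherWID"] for row in dbid_table if row["XID"] == xid]


# Illustrative values only.
table = dbid_rows(["Aldehyde reductase", "ADH"], protein_wid=1001)
```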
A reaction-equation is detected by the presence of an equals sign; arbitrary text containing an equals sign, or multiple equals signs in an equation, will confuse the parser. If a text-description is given rather than a reaction-equation, it is ignored and no details of the reaction are stored.
This record describes a reaction that is catalyzed by the enzyme.
A row is added to the EnzymaticReaction table for each CA record; EnzymaticReaction.ProteinWID is the current protein, and EnzymaticReaction.ReactionWID is the current reaction (i.e., the reaction with the EC number matching the ID record of this entry).
A row is added to the Chemical table for each CHEMICAL attribute. A row is also added to either the Product or Reactant table depending on which side of the reaction the chemical is on.
ENZYME Attribute | Warehouse Semantics |
---|---|
REACTION-EQUATION | Creates EnzymaticReaction row; equation is stored using Product and Reactant linking tables. |
CHEMICAL[*] | Chemical.Name; either a Product or Reactant row is added; its OtherWID is Chemical.WID |
COEFFICIENT[*] | either Product.Coefficient or Reactant.Coefficient; if attribute is absent, 1 is stored. |
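The handling of a CA record can be sketched as follows. This is an approximation under stated assumptions, not the loader's parser: it assumes chemicals are separated by " + " (with surrounding spaces, so that names such as NAD(+) survive), that a coefficient is a leading integer followed by white space, that the left side holds reactants and the right side products, and that a trailing period is not part of the last chemical name. Text with no equals sign is ignored, as described above; text with more than one equals sign is also skipped here rather than reproducing whatever the real parser does with it.

```python
import re
from typing import List, Optional, Tuple

Term = Tuple[int, str]  # (coefficient, chemical name)

# Assumed coefficient syntax: leading digits, white space, then the name.
_COEFF = re.compile(r"^(\d+)\s+(.+)$")


def parse_side(side: str) -> List[Term]:
    """Split one side of an equation into (coefficient, chemical) terms."""
    terms: List[Term] = []
    for part in side.split(" + "):
        part = part.strip()
        match = _COEFF.match(part)
        if match:
            terms.append((int(match.group(1)), match.group(2)))
        else:
            terms.append((1, part))  # absent coefficient: 1 is stored
    return terms


def parse_ca(text: str) -> Optional[Tuple[List[Term], List[Term]]]:
    """Return (reactants, products), or None if the text is not an equation."""
    text = text.strip().rstrip(".")
    if text.count("=") != 1:
        return None  # text-description, or an equation this sketch cannot split
    left, right = text.split("=")
    return parse_side(left), parse_side(right)
```

Each returned term would correspond to one Chemical row plus one Reactant or Product row carrying the coefficient.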
This record describes a cofactor, a chemical that is required to catalyze a reaction, but is not changed by it.
The ENZYME database allows a more complex syntax for chemicals than a simple name, including disjunctions and multiple cofactors per line. The loader treats the text of the entire line as a single chemical, e.g., 'Zinc or Iron'.
A row is added to the Chemical table for the CHEMICAL attribute.
A row is added to the EnzymaticReactionCofactor linking table for each CF record; EnzymaticReactionCofactor.ChemicalWID is the WID of the added chemical, and EnzymaticReactionCofactor.EnzymaticReactionWID is the WID of the current enzymatic reaction.
ENZYME Attribute | Warehouse Semantics |
---|---|
CHEMICAL | Chemical.Name |
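The CF handling is simple enough to show in a few lines. In this sketch the function name, the WID allocator, and the sample WIDs are hypothetical; the point is only that the whole text of the line becomes one Chemical row, with no attempt to split disjunctions or lists.

```python
import itertools
from typing import Dict, Iterator, Tuple


def load_cf_record(text: str, enzymatic_reaction_wid: int,
                   next_wid: Iterator[int]) -> Tuple[Dict[str, object], Dict[str, object]]:
    """Create the Chemical row and the EnzymaticReactionCofactor link for one CF record."""
    # The entire line is treated as a single chemical name.
    chemical = {"WID": next(next_wid), "Name": text}
    link = {
        "ChemicalWID": chemical["WID"],
        "EnzymaticReactionWID": enzymatic_reaction_wid,
    }
    return chemical, link


# Illustrative values only.
chemical, link = load_cf_record("Zinc or Iron", 500, itertools.count(2000))
```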
-!- comment-line+
The record can be broken into multiple CC lines.
The value of COMMENT is the concatenation of all comment-lines that comprise it. Line breaks are converted to a single space, and the delimiters CC and -!- are not included. Comments may be of any length (in the schema they are represented as CLOBs). Some ENZYME comments contain extra white space to format them nicely in an ASCII file; this white space is preserved. Many ENZYME comments are copyright notices; these are treated like other comments.
An entry is added to the CommentTable table for each COMMENT. CommentTable.OtherWID is the WID of the current protein.
ENZYME Attribute | Warehouse Semantics |
---|---|
COMMENT[*] | CommentTable.Comm |
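The concatenation rule for comments can be sketched as below. The function name and sample lines are hypothetical. One assumption goes beyond the text: the indentation between the CC code and the start of each line's text is treated as layout and dropped, while white space inside a line is kept.

```python
from typing import List


def parse_comments(cc_lines: List[str]) -> List[str]:
    """Turn the CC lines of one entry into a list of COMMENT values."""
    comments: List[List[str]] = []
    for line in cc_lines:
        # Drop the two-letter code, then the indentation that follows it.
        body = line[2:] if line.startswith("CC") else line
        body = body.rstrip("\n").lstrip()
        if body.startswith("-!-"):
            # The delimiter starts a new comment and is not stored.
            comments.append([body[3:].lstrip()])
        elif comments:
            comments[-1].append(body)  # continuation of the current comment
    # Each line break becomes a single space.
    return [" ".join(parts) for parts in comments]


# Illustrative input only.
cc = [
    "CC   -!- Acts on primary or secondary alcohols;  the animal",
    "CC       enzyme acts also on cyclic alcohols.",
    "CC   -!- Second comment.",
]
result = parse_comments(cc)
```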
PROSITE_ID is a PROSITE document entry accession number, uniquely identifying a document in PROSITE. The loader accepts any string delimited by white space.
This record describes a cross-reference from an ENZYME protein to an entry in the PROSITE database.
A row is added to the CrossReference table for each PR. CrossReference.OtherWID refers to the current protein.
ENZYME Attribute | Warehouse Semantics |
---|---|
PROSITE_ID | CrossReference.XID; CrossReference.DataSetName is 'PROSITE'; CrossReference.OtherWID refers to the current protein. |
SWISSPROT_ID is the ID value of an entry in the SWISS-PROT database. SWISSPROT_AC is the primary AC (accession) value of an entry in the SWISS-PROT database.
This record describes a cross-reference from an ENZYME protein to
the SWISS-PROT database.
A row is added to the CrossReference table for each SWISSPROT_ID and SWISSPROT_AC. CrossReference.OtherWID refers to the current protein.
ENZYME Attribute | Warehouse Semantics |
---|---|
SWISSPROT_ID[*] | CrossReference.XID; CrossReference.DataSetName is 'SWISS-PROT'; CrossReference.OtherWID refers to the current protein. |
SWISSPROT_AC[*] | CrossReference.XID; CrossReference.DataSetName is 'SWISS-PROT'; CrossReference.OtherWID refers to the current protein. |
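A sketch of the SWISS-PROT cross-reference mapping follows. The syntax of the DR line is not given in this section, so the layout assumed here (pairs of the form "AC, ID;" on one line) is an assumption to be checked against the ENZYME user manual; the function name and sample values are likewise hypothetical.

```python
from typing import Dict, List


def parse_dr(text: str, protein_wid: int) -> List[Dict[str, object]]:
    """Build CrossReference rows for every SWISSPROT_ID and SWISSPROT_AC in a DR line."""
    rows: List[Dict[str, object]] = []
    for pair in text.split(";"):
        if not pair.strip():
            continue  # the trailing ';' leaves an empty piece
        # Assumed order within a pair: accession first, then entry ID.
        accession, _, entry_id = pair.partition(",")
        for xid in (entry_id.strip(), accession.strip()):
            rows.append({
                "XID": xid,
                "DataSetName": "SWISS-PROT",
                "OtherWID": protein_wid,  # the current protein
            })
    return rows


# Illustrative values only.
rows = parse_dr("P07327, ADH1A_HUMAN;  P28469, ADH1A_MACMU;", protein_wid=42)
```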
A row is added to the Entry table for each row that the loader adds to these tables:
Chemical
EnzymaticReaction
Protein
Reaction
Each Entry row is created as follows:
Column | Value assigned by loader |
---|---|
OtherWID | The WID of the entry described by this row. |
InsertDate | The time/date the loader was run (SQL 'SYSDATE'). |
CreationDate | NULL |
ModifiedDate | NULL |
LoadError | 'T' if a parse error is detected, 'F' otherwise. |
LineNumber | The line number of the input file the associated source entry starts on. |
DatasetWID | The value Dataset.WID assigned to the dataset being loaded. |
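The Entry row can be sketched in the same style as the other tables. The function name and arguments are hypothetical, and the local clock stands in for SQL SYSDATE.

```python
from datetime import datetime
from typing import Dict, Optional


def make_entry_row(other_wid: int, dataset_wid: int, line_number: int,
                   parse_error: bool,
                   insert_date: Optional[datetime] = None) -> Dict[str, object]:
    """Build the Entry row that accompanies a newly added warehouse object."""
    return {
        "OtherWID": other_wid,
        "InsertDate": insert_date or datetime.now(),
        "CreationDate": None,   # NULL
        "ModifiedDate": None,   # NULL
        "LoadError": "T" if parse_error else "F",
        "LineNumber": line_number,
        "DatasetWID": dataset_wid,
    }


# Illustrative values only.
entry_row = make_entry_row(other_wid=1001, dataset_wid=1, line_number=120,
                           parse_error=False)
```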