KEGG Loader for BioWarehouse

Version 4.6


(C) 2004-2009 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.




Introduction
Limitations
Loader Dependencies and Prerequisites
Installation and Building
Obtaining KEGG Data
KEGG Dataset
PATHWAY
GENOME
GENES
LIGAND: COMPOUND
LIGAND: REACTION
LIGAND: ENZYME
References


Introduction

This document describes version 4.6 of the KEGG Loader. It is one of several database loaders comprising the BioWarehouse.

KEGG (the Kyoto Encyclopedia of Genes and Genomes) is a collection of databases curated by the Bioinformatics Center at the Institute for Chemical Research at Kyoto University. KEGG is available online at http://www.genome.ad.jp/kegg/. KEGG contains five types of data:

  1. Pathway maps
  2. Ortholog group tables
  3. Molecular catalogs
  4. Genome maps
  5. Gene catalogs

KEGG contains three major components:

  1. PATHWAY: metabolic and regulatory pathways
  2. GENES: gene catalogs for organisms
  3. LIGAND: enzyme reactions and chemical compounds

LIGAND was originally started by Takaaki Nishioka, and is now maintained in collaboration with the KEGG project. LIGAND itself is composed of three databases:

  1. COMPOUND: collection of chemical compounds that are related to various cellular processes
  2. REACTION: collection of reactions (mostly enzymatic reactions) involving the compounds from COMPOUND
  3. ENZYME: the enzyme nomenclature

This document describes the semantic mapping from the KEGG database components PATHWAY, GENES and LIGAND to their representation in the BioWarehouse. A chapter is dedicated to each of the KEGG components, defining the mapping to the BioWarehouse schema.

Overview of BioWarehouse Schema

The BioWarehouse schema contains the data definition statements for the BioWarehouse. These include four different types of tables: constant tables, object tables, linking tables, and special tables.

Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; e.g., the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

Linking tables describe relationships among objects. They contain the WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.

Full schema information, including source files and browseable documentation, is available with this distribution.


Limitations

The latest supported data version for the KEGG loader is listed in the loader summary table. The loader may not be compatible with future versions of KEGG. KEGG does not seem to include a current version number in its download, and the version is not displayed prominently on its website, but some version and release information can be found.

The loader does not load any data from the PATHWAY component of KEGG.

The loader ignores the MASS keyword on compounds, though it could load this into Chemical.MolecularWeightCalc.


Loader Dependencies and Prerequisites

Other than the standard warehouse creation procedure, the loader does not require that any other Warehouse tools be run prior to its execution.

Installation and Building

See Building the KEGG Loader for details on installing and building the loader.

Obtaining KEGG Data

See Running the KEGG Loader for details on obtaining KEGG data and running the loader.

KEGG DataSet

All data loaded by the loader are loaded as a single dataset in the warehouse. References from one part of KEGG to another (e.g. chemicals used in a reaction) are resolved to the WID within the dataset. References to KEGG data that are not loaded use the CrossReference table to associate the data that is loaded with the data that is not loaded:

OtherWID The WID assigned to the loaded data.
XID KEGG accession for the data that is not loaded
DatabaseName Abbreviated name of the KEGG component (e.g., 'KO' for KEGG Orthology).
CrossWID NULL.
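
The resolution rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names CrossReferenceRow, resolve_reference and wid_by_accession are hypothetical and do not come from the loader source, and the accession-to-WID index is assumed to be an in-memory dictionary.

```python
from typing import Dict, NamedTuple, Optional, Union


class CrossReferenceRow(NamedTuple):
    """Hypothetical in-memory image of one CrossReference row."""
    other_wid: int            # WID of the loaded object that holds the reference
    xid: str                  # KEGG accession of the target that was not loaded
    database_name: str        # abbreviated KEGG component name, e.g. "KO"
    cross_wid: Optional[int]  # always None (NULL) for unloaded targets


def resolve_reference(
    source_wid: int,
    accession: str,
    component: str,
    wid_by_accession: Dict[str, int],
) -> Union[int, CrossReferenceRow]:
    """Return the target WID if the target was loaded, else a CrossReference row."""
    target_wid = wid_by_accession.get(accession)
    if target_wid is not None:
        return target_wid
    return CrossReferenceRow(source_wid, accession, component, None)


loaded = {"C00001": 501}  # hypothetical accession-to-WID index
resolved = resolve_reference(900, "C00001", "COMPOUND", loaded)
unresolved = resolve_reference(900, "K00001", "KO", loaded)
```

In this sketch a reference to a loaded compound yields its WID directly, while a reference into KEGG Orthology (not loaded) yields a row with CrossWID left NULL.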


DataSet Table

Each loaded version of KEGG will be assigned a new row in the DataSet table as follows:

WID The next available WID in the warehouse.
Name "KEGG"
Version The version number assigned by KEGG to this release, e.g. "34".
ReleaseDate The date that this version of KEGG was released.
LoadDate The time/date the loader was run (SQL SYSDATE).
ReleaseDate NULL.
ChangeDate The date and time the loader completed, NULL if the loader did not complete successfully.
LoadedBy The value of the system environment variable USER for the account running the loader.
Application 'KEGG Loader'
ApplicationVersion 4.6
HomeURL http://www.genome.ad.jp/.
QueryURL NULL

Entry Table

All entities that are assigned a WID (other than the DataSet above) are also given an Entry row:

OtherWID The WID assigned to the entity.
InsertDate The time/date the loader was run.
CreationDate NULL.
ModifiedDate NULL.
LoadError "T" if a parse error is detected, "F" otherwise.
DatasetWID The WID assigned to the DataSet (see above).

The LoadError field is set to true if any error occurred in loading the record from the source database. The granularity is based on the source record; i.e., if there was an error on one line of the record, all warehouse entries derived from that record will have the LoadError flag set to true.
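
The record-level granularity can be illustrated with a short Python sketch. The function name entry_rows and its parameters are hypothetical; the only behavior taken from the description above is that a single bad line flags every Entry row derived from the same source record.

```python
from typing import Dict, Iterable, List


def entry_rows(
    entity_wids: Iterable[int],
    line_had_error: Iterable[bool],
    dataset_wid: int,
    insert_date: str,
) -> List[Dict[str, object]]:
    """Build one Entry row per WID derived from a single source record."""
    # One bad line flags every warehouse object derived from the record.
    load_error = "T" if any(line_had_error) else "F"
    return [
        {
            "OtherWID": wid,
            "InsertDate": insert_date,
            "CreationDate": None,
            "ModifiedDate": None,
            "LoadError": load_error,
            "DatasetWID": dataset_wid,
        }
        for wid in entity_wids
    ]


# Two objects derived from one record whose second line failed to parse.
rows = entry_rows([101, 102], [False, True, False], dataset_wid=7,
                  insert_date="2009-01-01")
```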


PATHWAY

The KEGG PATHWAY component is a graphical structure combining the other parts of KEGG. The distributed data contains images and HTML image maps to allow for a convenient visual interaction with the data from LIGAND and GENES. Beyond what LIGAND and GENES already provide, it records which reactions occur in which organisms.

PATHWAY is not currently loaded into the warehouse.


GENOME

The Genome database contains descriptions of the organisms whose genomes are present in the GENES database. Information represented includes the organism name, the abbreviation used in KEGG, the categorization of the organism, high-level information about its genome and citations of the source of the information.


Semantic Mapping

The present loading of this file ignores statistical and map/catalog information. The ignored fields are:

  1. STATISTICS
  2. GENOMEMAP
  3. GENECATALOG

In addition, several fields are loaded strictly as comments, without any semantic interpretation of their contents:

  1. DEFINITION
  2. TAXONOMY
  3. LINEAGE
  4. MORPHOLOGY
  5. PHYSIOLOGY
  6. DATA_SOURCE
  7. ORIGINAL_DB
  8. ENVIRONMENT
  9. COMMENT

ENTRY

Each entry in GENOME begins with an ENTRY field, giving the abbreviation used in KEGG for the organism, e.g. 'hin' for 'H.influenzae'. This is stored in the DBID table:

OtherWID The BioSource.WID assigned to this organism (see GENOME Name below).
XID The three-character organism abbreviation.

NAME

The name entry gives the scientific name for the organism. This is used to populate the BioSource table:

WID A new WID assigned to this object.
Name The organism name.
DataSetWID The WID assigned to the DataSet (see DataSet table above).
all other columns NULL.

CHROMOSOME

The chromosome entry optionally gives the circularity ('Circular' or 'Linear') and, in organisms with multiple chromosomes, a chromosome name. It is used for initial population of the NucleicAcid table:

WID A new WID assigned to this object.
Name See SEQUENCE below.
Type "DNA".
Class "chromosome".
Topology "circular" if Circular, "linear" if Linear, NULL if not specified.
MoleculeLength See LENGTH below.
GeneticCodeWID The GeneticCode.WID associated with the genetic code (see SEQUENCE below).
BioSourceWID The BioSource.WID assigned to this organism (see GENOME Name above).
DataSetWID The WID assigned to the DataSet (see DataSet table above).

PLASMID

The plasmid entry gives the name of the plasmid and (optionally) whether it is circular. This is used for initial population of the NucleicAcid table:

WID A new WID assigned to this object.
Name See SEQUENCE below.
Type "DNA".
Class "plasmid".
Topology "circular" if Circular, "linear" if Linear, NULL if not specified.
MoleculeLength See LENGTH below.
GeneticCodeWID The GeneticCode.WID associated with the genetic code (see SEQUENCE below).
BioSourceWID The BioSource.WID assigned to this organism (see GENOME Name above).
DataSetWID The WID assigned to the DataSet (see DataSet table above).

SEQUENCE

The sequence item gives the Genbank accession number for the chromosome or plasmid, and (optionally) the genetic code number used.

The replicon name is constructed from the accession number and the replicon type (e.g. "Chromosome GB:L77117" for the chromosome of Methanococcus jannaschii DSM2661) and is stored in NucleicAcid.Name.
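
As a minimal sketch of this naming rule, assuming the SEQUENCE field already carries the 'GB:' prefix as in the example above (the function name replicon_name is hypothetical):

```python
def replicon_name(replicon_type: str, sequence_field: str) -> str:
    """Compose NucleicAcid.Name from the replicon type and the accession.

    Assumption: sequence_field already includes the "GB:" prefix, as in the
    documented example; the loader's real input handling may differ.
    """
    return f"{replicon_type.capitalize()} {sequence_field.strip()}"


name = replicon_name("chromosome", "GB:L77117")
```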

If the NCBI Taxonomy Loader has been run, the loader will use the genetic code number to find the associated entry in the GeneticCode table, and store it in NucleicAcid.GeneticCodeWID for this replicon. NOTE: As of approximately version 27.0 of KEGG, genetic codes do not seem to be provided in the data, so this column will not be populated.

The Genbank accession number is also stored in the CrossReference table:

OtherWID The WID assigned to this NucleicAcid (see above).
XID The Genbank accession number.
DatasetWID NULL.
DatabaseName "GENBANK"

LENGTH

The length entry gives the number of nucleotides in the replicon, and populates NucleicAcid.MoleculeLength.

Literature Citations

Each entry in the genome file gives one or more citations to the literature, contained in the fields REFERENCE (giving the Pubmed ID), AUTHORS, TITLE, and JOURNAL. These are used to populate the Citation table:

WID A new WID assigned to this citation.
Citation The concatenation of the AUTHORS, TITLE, and JOURNAL entries.
PMID The Pubmed ID from the REFERENCE entry.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Two entries are made in CitationWIDOtherWID to relate the citation back to the BioSource and to the NucleicAcid:

OtherWID The BioSource.WID assigned to this organism (see GENOME Name above).
CitationWID The WID of the citation

OtherWID The NucleicAcid.WID of the replicon (see CHROMOSOME or PLASMID above).
CitationWID The WID of the citation

GENES

The GENES database contains information on the genome of particular organisms, one organism per file. The information includes the name(s) of the gene, its position, the codon usage, amino acid sequence and nucleotide sequence.


Semantic Mapping

An entry in GENES contains up to nine fields.

ENTRY

The ENTRY line gives the gene id and the organism name. The gene id is used in the Gene table (see NAME below). The organism name is used to look up the previously loaded organism from GENOME. If the organism is found, a row is created in the BioSourceWIDGeneWID table:

BioSourceWID The WID of the organism (see GENOME NAME above).
GeneWID The WID assigned to this gene.

NAME

The first name given is assumed to be the primary name, and other names are synonyms. The name starts populating the Gene table:

WID A new WID assigned to this object.
Name The primary name of the gene.
GenomeID The gene id (from ENTRY above).
CodingRegionStart See POSITION below.
CodingRegionEnd See POSITION below.
Interrupted See POSITION below.
NucleicAcidWID NucleicAcid.WID of the replicon this gene resides on (see GENOME Chromosome and GENOME Plasmid above).
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Alternate names are stored in SynonymTable:

OtherWID The WID assigned to this gene.
Syn The alternative name

DEFINITION

The definition is stored in CommentTable:

OtherWID The WID assigned to this gene.
Comm The definition text.

CLASS

This is presently ignored.

POSITION

The position of the gene can be simply a numerical range, a join (patching together a number of regions), a complement, a range relative to other genes, and also indicate on which replicon the gene resides.

We presently ignore the non-numerical range information. Joins are considered to range from the start of the low range to the end of the high range, and the Interrupted flag is set to 'T'.

The Gene entry from above is thereby extended with:

WID ...
Name ...
GenomeID ...
CodingRegionStart The low end of the numerical range(s).
CodingRegionEnd The high end of the numerical range(s).
Direction 'F' for forward, 'R' for complement.
Interrupted 'T' if a join was present, 'F' otherwise.
DataSetWID ...

References to the replicon on which the gene resides are represented in the GeneWIDRepliconWID table:

GeneWID The WID of the Gene.
RepliconWID The WID of the Replicon.
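
The POSITION rules can be sketched in Python as below. This is a hedged illustration, not the loader's parser: the concrete syntax (an optional 'replicon:' prefix, 'complement(...)', 'join(...)', and 'start..end' ranges) is assumed from common KEGG usage, and the function name parse_position is hypothetical. Anything that is not a plain numerical range is simply skipped, mirroring the statement that non-numerical range information is ignored.

```python
import re
from typing import Dict, Optional

# A plain numerical range such as "100..200".
_NUMERIC_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_position(position: str) -> Dict[str, Optional[object]]:
    """Derive the Gene position columns from a POSITION value."""
    # Assumed syntax: an optional "replicon:" prefix before the location.
    replicon, _sep, location = position.rpartition(":")
    numbers = [int(n) for pair in _NUMERIC_RANGE.findall(location) for n in pair]
    return {
        "Replicon": replicon or None,
        # A join spans from the lowest start to the highest end.
        "CodingRegionStart": min(numbers) if numbers else None,
        "CodingRegionEnd": max(numbers) if numbers else None,
        "Direction": "R" if "complement" in location else "F",
        "Interrupted": "T" if "join" in location else "F",
    }


joined = parse_position("2:complement(join(100..200,300..450))")
simple = parse_position("500..900")
```

Here the joined gene is reported as one span from 100 to 450 on the reverse strand with Interrupted set to 'T', while the simple range stays forward and uninterrupted.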

DBLINKS

The dblinks item contains cross-link information to other databases. This is used to populate the CrossReference table:

OtherWID The WID assigned to this gene (see above).
XID The external database identifier.
DatasetWID NULL.
DatabaseName The external database name.

CODON_USAGE

This is presently ignored.

AASEQ

The AASEQ item gives the amino acid sequence of the protein encoded by this gene. It is used to complete the AASequence column of the relevant Protein row (see NAME under LIGAND: ENZYME below):

WID ...
Name ...
AASequence The given sequence.
Charge ...
Fragment ...
MolecularWeightCalc ...
MolecularWeightExp ...
PlCalc ...
PlExp ...
DataSetWID ...

NTSEQ

This is presently ignored.


LIGAND: COMPOUND

The COMPOUND section of LIGAND is a collection of metabolic compounds including substrates, products and inhibitors. Each of the chemicals referenced in the ENZYME and KEGG PATHWAY components is represented in this component. Information represented includes the naming, chemical formula, structural information, metabolic pathways, related enzymes, related protein structures, prosthetic groups and the CAS registry number.

In our semantic mapping, we ignore the information representing the structural information, as there is no current table for this in the BioWarehouse schema.


Semantic Mapping

This section describes how each of the fields in a COMPOUND entry is mapped into the BioWarehouse schema.

ENTRY

Each data item begins with an ENTRY field, giving the compound accession number for the LIGAND database. The accession number is stored in the DBID table:

OtherWID The WID assigned to this chemical (see below).
XID The accession number

NAME

The name item contains the recommended name for the compound, and optionally some alternatives. The recommended name is always first, and is mandatory. This item starts populating the Chemical table:

WID A new WID assigned to this object.
Name The recommended name.
BeilsteinName NULL.
SystematicName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula See below.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
PKA1 NULL.
PKA2 NULL.
PKA3 NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Alternative names are each stored in SynonymTable:

OtherWID The WID assigned to this chemical.
Syn The alternative name

FORMULA

The formula item is an ASCII representation of the chemical formula of this compound, e.g. H2O, C10H16N5O13P3. This is used to populate Chemical.EmpiricalFormula.

PATHWAY

The pathway item is a cross-link to the KEGG PATHWAY data, and consists of the pathway map accession number, followed by the description. This is used to populate the CrossReference table:

OtherWID The WID assigned to this chemical (see above).
XID The pathway accession number.
DatasetWID NULL.
DatabaseName "KEGG PATHWAY"

ENZYME

The enzyme item is a cross-link to the KEGG ENZYME data, and consists of the EC number, followed by a type indicating how the compound is related to the enzyme. Valid types are R for reactant, I for inhibitor, C for cofactor and E for effector.

Rather than load this data from COMPOUND, this information is loaded from ENZYME, where it is redundantly replicated in KEGG.

STRUCTURES

The structures item is a cross-link to PDB (the Protein Data Bank), which stores the three-dimensional structure information for proteins. This is used to populate the CrossReference table:

OtherWID The WID assigned to this compound (see above).
XID The PDB ID.
DatasetWID NULL.
DatabaseName "PDB"

DBLINKS

The dblinks item contains cross-link information to other databases. This is used to populate the CrossReference table:

OtherWID The WID assigned to this compound (see above).
XID The external database identifier.
DatasetWID NULL.
DatabaseName The external database name.

RPAIR

Ignored, but see the translation of RPAIR in REACTION.

GLYCAN

This section is ignored.

COMMENT

A row is added to the CommentTable table for each comment:

OtherWID The Chemical WID assigned to this compound.
Comm The comment string.

REMARK

If the remark item begins with 'Same as: ', the remainder of the item is assumed to be an accession number. It is associated internally with this compound, and subsequent references to this accession number in REACTION are treated as if they were references to this compound. Also, a row is added to the SynonymTable table for each remark:

OtherWID The WID assigned to this compound.
Syn The text that occurs after 'Same as: '

Other remarks are ignored.
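
A small Python sketch of the REMARK handling follows. The names process_remark, aliases and synonyms are hypothetical, and the alias map is assumed to be a plain dictionary consulted later when REACTION entries are translated.

```python
from typing import Dict, List, Tuple

SAME_AS = "Same as: "


def process_remark(
    remark: str,
    compound_wid: int,
    aliases: Dict[str, int],
    synonyms: List[Tuple[int, str]],
) -> bool:
    """Handle one REMARK item; return False when the remark is ignored."""
    if not remark.startswith(SAME_AS):
        return False
    accession = remark[len(SAME_AS):].strip()
    # Later references to this accession resolve to the current compound.
    aliases[accession] = compound_wid
    # Mirrors the SynonymTable row (OtherWID, Syn).
    synonyms.append((compound_wid, accession))
    return True


aliases: Dict[str, int] = {}
synonyms: List[Tuple[int, str]] = []
handled = process_remark("Same as: G10609", 501, aliases, synonyms)
ignored = process_remark("Some other remark", 501, aliases, synonyms)
```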


LIGAND: REACTION

The REACTION section of LIGAND defines a collection of chemical reactions.

These reactions are referenced in the ENZYME component of KEGG by their accession number. The great majority of reactions in ENZYME are defined in REACTION, but some are not. Furthermore, REACTION can define non-enzymatic reactions which are not referenced there.


Semantic Mapping

This section describes how each of the items in a REACTION entry is mapped into the BioWarehouse schema. The principal mapping is to the Reaction table; one or more rows are added to it based on this entry.

Most reactions have an ENZYME item, which specifies one or more Enzyme Commission (EC) numbers. For these entries, a Reaction row is created for each EC number. Partial EC numbers (those containing a dash), such as 1.2.3.- or 6.-.-.-, are treated only as synonyms and never populate either of the EC number columns. A non-partial EC number is stored in Reaction.ECNumberProposed. When the corresponding ENZYME entry is later loaded, if this reaction's accession number is mentioned in its REACTION item, that EC number becomes Reaction.ECNumber and ECNumberProposed becomes NULL, effectively making the EC number official.

WID A new WID assigned to this reaction.
Name The first name in the NAME section; NULL if absent.
DeltaG NULL.
ECNumber NULL. This may be updated when an ENZYME entry referencing this reaction is loaded.
ECNumberProposed The EC number from the ENZYME item if it is present, else NULL. Multiple EC numbers in ENZYME cause multiple Reaction rows to be loaded. This may be updated to NULL during the load of ENZYME.
Spontaneous NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

If multiple reactions are added, all links defined below are replicated for each Reaction row added.
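
The EC number handling can be sketched as follows. This reflects one reading of the rules in this document (one Reaction row per EC number, partial numbers leaving both columns NULL, and promotion to ECNumber when an ENZYME entry lists the reaction); the function names and the dictionary-based rows are hypothetical.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_partial_ec(ec_number: str) -> bool:
    """A partial EC number contains a dash, e.g. "1.2.3.-"."""
    return "-" in ec_number


def reaction_rows(name: Optional[str], ec_numbers: List[str]) -> List[Row]:
    """Create one Reaction row per EC number, or one row if there are none."""
    if not ec_numbers:
        return [{"Name": name, "ECNumber": None, "ECNumberProposed": None}]
    return [
        {
            "Name": name,
            "ECNumber": None,
            # Partial EC numbers never populate an EC number column.
            "ECNumberProposed": None if is_partial_ec(ec) else ec,
        }
        for ec in ec_numbers
    ]


def make_official(row: Row, entry_ec: str) -> None:
    """Promote the EC number when an ENZYME entry references the reaction."""
    if is_partial_ec(entry_ec):
        return
    row["ECNumber"] = entry_ec
    row["ECNumberProposed"] = None


rows = reaction_rows("example reaction", ["1.1.1.1", "1.2.3.-"])
make_official(rows[0], "1.1.1.1")
```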

ENTRY

Each data item begins with an ENTRY field, giving the reaction accession number for the LIGAND database. If the ENZYME field is present and refers to multiple EC numbers, there is no unique identifier for this Reaction, and no DBID row is loaded. Otherwise the accession number is stored in the DBID table:

OtherWID The WID assigned to this reaction (see below).
XID The accession number

NAME

The name item contains an optional list of names for the reaction. The recommended name, if present, is always first, followed by alternatives. This is stored in Reaction.Name.

Alternative names are each stored in SynonymTable:

OtherWID The WID assigned to this reaction.
Syn The alternative name

DEFINITION

A row is added to the Description table for this item. It is the textual depiction of the reaction. It is used in the processing of the ENZYME section.

OtherWID The WID assigned to this reaction (see above).
Comm The definition.
Table 'Reaction'

ENZYME

This item, when present, specifies one or more EC numbers associated with the reaction. As described above, a row is added to Reaction for each EC number mentioned in this item. If this item is absent, a single row is added with both ECNumber and ECNumberProposed as NULL.

Each EC number, whether full or partial, is stored in the SynonymTable:

OtherWID The WID assigned to this reaction.
Syn The EC number.

EQUATION

The EQUATION item specifies the reactants and products of the reaction. The equation is a list of reactants followed by a list of products, separated by a double arrow. Each reactant or product is a COMPOUND, GLYCAN, or DRUG accession number, optionally preceded by a coefficient. If a reactant or product is not an accession number that was defined in COMPOUND, a syntax error is issued, and that compound is excluded from the reaction.

If the coefficient is absent, it is assumed to be 1. An explicit coefficient is either an integer or an expression such as (n), (m-1) or (n+m). The loader recognizes most parenthesized expressions, as this is the standard syntax for KEGG. It also recognizes some unparenthesized expressions such as n-1, but not all of them.

The loader loads the compounds on the left side of the equation into the Reactant table. The loader loads the compounds on the right side of the equation into the Product table. Both table entries have identical column definitions:

ReactionWID The WID of the reaction
OtherWID The Chemical.WID assigned to the compound.
Coefficient Coefficient of this compound, or 0 if the coefficient is an expression.

If multiple reactions are created, multiple links to the reactants and products are added to Reactant and Product from each Reaction row.
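
The EQUATION translation can be sketched in Python as follows. Several details are assumptions made for illustration only: terms are taken to be separated by ' + ', the two sides by ' <=> ', accession numbers are taken to be one uppercase letter followed by five digits, and the names parse_side and parse_equation are hypothetical. Expression coefficients map to 0 and unknown compounds are reported and excluded, as described above.

```python
import re
from typing import List, Set, Tuple

# Optional coefficient (integer, parenthesized expression, or a simple
# unparenthesized expression such as "n-1") followed by an accession number.
_TERM = re.compile(r"^(?:(\d+|\([^)]*\)|[a-z](?:[+-]\d+)?)\s+)?([A-Z]\d{5})$")

Term = Tuple[str, int]  # (accession, coefficient)


def parse_side(side: str, known: Set[str], errors: List[str]) -> List[Term]:
    """Parse one side of an equation into (accession, coefficient) pairs."""
    terms: List[Term] = []
    for raw in side.split(" + "):
        match = _TERM.match(raw.strip())
        if match is None or match.group(2) not in known:
            # Unknown compounds are reported and left out of the reaction.
            errors.append(f"cannot resolve term: {raw.strip()!r}")
            continue
        coefficient_text, accession = match.groups()
        if coefficient_text is None:
            coefficient = 1  # absent coefficient
        elif coefficient_text.isdigit():
            coefficient = int(coefficient_text)
        else:
            coefficient = 0  # expression such as (n+1)
        terms.append((accession, coefficient))
    return terms


def parse_equation(equation: str, known: Set[str]):
    """Return (reactants, products, errors) for one EQUATION item."""
    errors: List[str] = []
    left, right = equation.split(" <=> ")
    return parse_side(left, known, errors), parse_side(right, known, errors), errors


known = {"C00001", "C00002", "C00003"}
reactants, products, errors = parse_equation(
    "C00001 + 2 C00002 <=> (n+1) C00003 + C99999", known
)
```

The left-hand pairs would become Reactant rows and the right-hand pairs Product rows; C99999 is not in the known set, so it is dropped and recorded as an error.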

ORTHOLOGY

The ORTHOLOGY item contains cross-link information to KEGG Orthology, which is not loaded by the loader. This is used to populate the CrossReference table:

OtherWID The WID assigned to this reaction (see above).
XID The KEGG accession number. It generally starts with 'KO'.
DatasetWID NULL.
DatabaseName The external database name, generally 'KO'.

RPAIR

The RPAIR item contains cross-link information to KEGG Rpair, which is not loaded by the loader. This is used to populate the CrossReference table:

OtherWID The WID assigned to this reaction (see above).
XID The KEGG accession number. It generally starts with 'RP'.
DatasetWID NULL.
DatabaseName The external database name, generally 'RP'.

PATHWAY

The PATHWAY item contains cross-link information to KEGG Pathway, which is not loaded by the loader. This is used to populate the CrossReference table:

OtherWID The WID assigned to this reaction (see above).
XID The KEGG accession number.
DatasetWID NULL.
DatabaseName The external database name, generally 'PATH'.

COMMENT

A row is added to the CommentTable table for each comment:

OtherWID The WID assigned to this reaction (see above).
Comm The comment string.

REMARK

A row is added to the CommentTable table for each remark:

OtherWID The WID assigned to this reaction (see above).
Comm The remark string.

REFERENCE

Ignored.


LIGAND: ENZYME

The ENZYME section of LIGAND is a collection of all known enzymatic reactions classified according to the nomenclature of the International Union of Biochemistry and Molecular Biology (IUBMB). Some of the entries in this data are taken from the ExPASy ENZYME database (http://expasy.hcuge.ch/sprot/enzyme.html) from the Swiss Institute of Bioinformatics.

Each entry is identified by the EC number, and contains information on naming, chemical reactions, metabolic compounds, metabolic pathways, genes encoding the enzyme (for several organisms), genetic diseases, and links to other databases.


Semantic Mapping

This section describes how each of the items in the ENZYME entry is mapped into the BioWarehouse schema.

ENTRY

Each data item begins with a mandatory ENTRY field, giving the EC number for the enzyme. An EC number may be either a full or a partial EC number. If an EC number is full, it is stored in the Reaction table row of each associated reaction (see below). Partial EC numbers are not stored.

Note that if the EC number is partial, it will typically have no EnzymaticReaction, Protein, or Pathway rows associated with it, as partial EC enzymes typically do not specify any genes in their GENES item. Moreover, entries for partial EC numbers typically cause no data to be loaded into the BioWarehouse.

NAME

The name item contains the recommended name for the enzyme, and optionally some alternatives. All names are assumed to refer to proteins, not ribozymes. The recommended name is always first, and is mandatory. This item is stored in the Protein table:

WID A new WID assigned to this object.
Name The recommended name.
AASequence NULL.
Charge NULL.
Fragment NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
PlCalc NULL.
PlExp NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

One copy of the protein is made for each gene which can generate it (see below). The amino acid sequence is completed when loading the gene (see above).

Alternative names are each stored in SynonymTable:

OtherWID The WID assigned to this Protein.
Syn The alternative name.

CLASS

The class item contains the meaning of the EC number, and is mandatory for all entries. There are three elements: the class, subclass and sub-subclass of the enzyme.

The class entry is not currently loaded.

SYSNAME

The sysname item contains the systematic name given by the Enzyme Commission, representing the nature of the chemical reaction. This is stored as a synonym of the reaction name, in SynonymTable:

OtherWID WID of the reaction (see below).
Syn The Systematic Name.

REACTION

The reaction item specifies one or more chemical reaction subitems, each of which specifies one or more Reaction rows to be used in translating various items of this entry.

Each subitem is in the form of an equation or a text description, followed optionally by a list of accession numbers referring to reactions defined in REACTION. If multiple subitems are specified, each is preceded by a parenthesized number such as (1).

If the REACTION item is absent, a single reaction is created from the SUBSTRATE and PRODUCT items (see below).

A reaction item is assumed to be a textual description if it contains no blank-delimited equals sign = or double arrow <=>. If a reaction is given in text, the SUBSTRATE and PRODUCT items are used to define the reaction in preference to the REACTION item, which is left uninterpreted and stored as a comment:

OtherWID The WID assigned to this reaction.
Comm The reaction string.
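
The test for a textual description can be sketched in a few lines of Python. The function name is hypothetical; 'blank-delimited' is interpreted here as the symbol appearing as a whitespace-separated token of its own.

```python
def is_textual_description(reaction: str) -> bool:
    """True when the item has no blank-delimited "=" or "<=>" token."""
    tokens = reaction.split()
    return "=" not in tokens and "<=>" not in tokens


equation_like = is_textual_description("ATP + H2O = ADP + phosphate")
text_like = is_textual_description("Hydrolysis of terminal residues at pH=7")
```

Under this reading, an equals sign embedded in a word (as in 'pH=7') does not turn the item into an equation.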

If the list of reaction accession numbers is absent from a reaction, each side of the interpreted equation is stored as per the substrate and product items (see below). The reaction is stored in the Reaction table:

WID A new WID assigned to this object.
DeltaG NULL.
ECNumber NULL.
ECNumberProposed The EC Number specified in the ENTRY item, or NULL if it is a partial EC number.
Spontaneous NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

If the list of reaction accession numbers is present (the typical case), no new Reaction rows are created. Instead, the Reaction objects created during translation of the REACTION section specified by these accession numbers are used in translating this entry. Also, the presence of an accession number serves to define the EC number specified in the ENTRY item as an official EC number. The Reaction.ECNumber associated with each accession is updated to be the EC number specified in the ENTRY item, and the Reaction.ECNumberProposed associated with it is updated to NULL.

An EnzymaticReaction entry is also created for every Reaction specified, with one copy for each Protein generated:

WID A new WID assigned to this object.
ReactionWID The WID of the Reaction assigned above.
ProteinWID The WID of the Enzyme (see NAME above).
ComplexWID NULL.
ReactionDirectionWID NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

SUBSTRATE

The substrate item contains the chemical compounds that appear on the left side of the reaction. If the REACTION item is specified and gave an interpretable reaction, the substrate is ignored. Otherwise it is used to construct a reaction as follows.

Each substrate chemical is assigned an entry in the Chemical table. If two chemicals that occur within KEGG are textually identical, they are considered the same entity. For new chemicals (not previously loaded from LIGAND COMPOUND), the fields are completed as follows:

WID A new WID assigned to this object.
Name The name of the substrate chemical.
BeilsteinName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
SystematicName NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Each of the substrate chemicals is linked to the reaction with a Reactant table entry, including the coefficient when specified. If the coefficient is not given, it is assumed to be 1:

ReactionWID The WID of the reaction
OtherWID The Chemical.WID assigned to the substrate
Coefficient Coefficient of this substrate.
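
The textual-identity rule and the default coefficient can be sketched together in Python. ChemicalRegistry and reactant_rows are hypothetical names, WIDs are drawn from a simple counter for illustration, and names are compared as exact strings with no normalization, which is one reading of 'textually identical'.

```python
from itertools import count
from typing import Dict, Iterator, List, Optional, Tuple


class ChemicalRegistry:
    """Hands out one Chemical WID per distinct chemical name."""

    def __init__(self, wid_source: Iterator[int]) -> None:
        self._wid_source = wid_source
        self._wid_by_name: Dict[str, int] = {}

    def wid_for(self, name: str) -> int:
        # Textually identical names share a WID; new names get a new one.
        if name not in self._wid_by_name:
            self._wid_by_name[name] = next(self._wid_source)
        return self._wid_by_name[name]


def reactant_rows(
    reaction_wid: int,
    substrates: List[Tuple[str, Optional[int]]],
    registry: ChemicalRegistry,
) -> List[Dict[str, int]]:
    """Link each substrate to the reaction; a missing coefficient means 1."""
    return [
        {
            "ReactionWID": reaction_wid,
            "OtherWID": registry.wid_for(name),
            "Coefficient": 1 if coefficient is None else coefficient,
        }
        for name, coefficient in substrates
    ]


registry = ChemicalRegistry(count(1000))
rows = reactant_rows(42, [("ATP", None), ("H2O", 2), ("ATP", None)], registry)
```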

PRODUCT

The product item contains the chemical compounds that appear on the right side of the reaction. If the REACTION item is specified and gave an interpretable reaction, the product is ignored. Otherwise it is used to construct a reaction as follows.

Each product chemical is assigned an entry in the Chemical table. If two chemicals that occur within LIGAND ENZYME are textually identical, they are considered the same entity. For new chemicals (not previously loaded from LIGAND COMPOUND), the fields are completed as follows:

WID A new WID assigned to this object.
Name The name of the product chemical.
BeilsteinName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
SystematicName NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Each of the product chemicals is linked to the reaction with a Product table entry, including the coefficient when specified. If the coefficient is not given, it is assumed to be 1:

ReactionWID The WID of the reaction
OtherWID The Chemical.WID assigned to the product chemical
Coefficient Coefficient of this product.

INHIBITOR

The inhibitor item names compounds that inhibit the reaction from taking place. Each compound is given an entry in the Chemical table (subject to the same textual-identity rule as for substrate/product):

WID A new WID assigned to this object.
Name The name of the inhibitor compound.
BeilsteinName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
SystematicName NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Each of the inhibitors is linked to each of the enzymatic reactions by the EnzReactionWIDChemicalWID table:

EnzymaticReactionWID The WID of the Enzymatic Reaction (see above).
ChemicalWID The WID assigned to the chemical
InhibitOrActivate 'I'
Mechanism NULL.
PhysioRelevant NULL.

COFACTOR

NOTE: As of approximately version 27 of KEGG, cofactor information appears to be missing from the data files. In this case, no cofactor information is loaded.

The cofactor item names compounds that do not appear in the reaction equation, but are described in the comment item as operating as cofactors in the reaction. Each compound is given an entry in the Chemical table (compounds with textually identical names are treated as the same entity, as for substrate and product):

WID A new WID assigned to this object.
Name The name of the cofactor compound.
BeilsteinName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
SystematicName NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Each of the cofactor compounds is linked to each of the enzymatic reactions with an EnzReactionCofactor table entry:

EnzymaticReactionWID The WID of the enzymatic reaction (see above).
ChemicalWID The WID assigned to the cofactor compound.
Prosthetic NULL.

EFFECTOR

The effector item names compounds that activate the reaction. Each compound is given an entry in the Chemical table (compounds with textually identical names are treated as the same entity, as for substrate and product):

WID A new WID assigned to this object.
Name The name of the effector compound.
BeilsteinName NULL.
CAS NULL.
Charge NULL.
EmpiricalFormula NULL.
MolecularWeightCalc NULL.
MolecularWeightExp NULL.
OctH20PartitionCoeff NULL.
SystematicName NULL.
WaterSolubility NULL.
Smiles NULL.
DataSetWID The WID assigned to the DataSet (see DataSet table above).

Each of the effectors is linked to each of the enzymatic reactions by the EnzReactionWIDChemicalWID table:

EnzymaticReactionWID The WID of the Enzymatic Reaction (see above).
ChemicalWID The WID assigned to the chemical
InhibitOrActivate 'A'
Mechanism NULL.
PhysioRelevant NULL.
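
The inhibitor and effector linking differ only in the InhibitOrActivate value, and each compound is linked to each enzymatic reaction. A minimal Python sketch of that cross product is shown below; the function name and the use of plain dictionaries for rows are assumptions made for illustration.

```python
def regulator_rows(enz_reaction_wids, chemical_wids, kind):
    """Build EnzReactionWIDChemicalWID rows: one per (reaction, chemical) pair.

    `kind` is 'I' for inhibitors and 'A' for effectors (activators).
    """
    if kind not in ("I", "A"):
        raise ValueError("kind must be 'I' or 'A'")
    return [
        {
            "EnzymaticReactionWID": reaction_wid,
            "ChemicalWID": chemical_wid,
            "InhibitOrActivate": kind,
            "Mechanism": None,       # NULL
            "PhysioRelevant": None,  # NULL
        }
        for reaction_wid in enz_reaction_wids
        for chemical_wid in chemical_wids
    ]

# Two enzymatic reactions and two effector compounds give four rows.
effector_rows = regulator_rows([501, 502], [9001, 9002], "A")
```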

COMMENT

The comment item contains free-form text commenting on the enzyme. This item populates the CommentTable:

OtherWID The WID assigned to this enzyme (see NAME above).
Comm The comment string.

There may be several comments associated with each enzyme.

PATHWAY

The pathway item is a cross-link to the KEGG PATHWAY data, and consists of the pathway map accession number followed by the description. Because the PATHWAY database itself is not parseable, this entry is used to group reactions into pathways.

A reference (sum of organisms) pathway is created, if it does not already exist:

WID A new WID assigned to this object.
Name The given descriptive name of the pathway.
Type 'R' (Reference).
BioSourceWID The BioSource.WID assigned to this organism (see GENOME Name above).
DataSetWID The WID assigned to the DataSet (see DataSet table above).

The pathway map accession number is stored in the DBID table:

OtherWID The WID assigned to this pathway.
XID The accession number.

Reactions are considered distinct if they have different text depictions specified in their DEFINITION item in REACTION. Each distinct reaction is then linked to the pathway by adding a row to the PathwayReaction table:

PathwayWID The WID assigned to this pathway.
ReactionWID The reaction WID.
PriorReactionWID NULL.
Hypothetical 'U' (Unknown).
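
The distinctness rule above can be sketched as a de-duplication on the DEFINITION text. This is an illustrative Python sketch only: the function name and input shape are hypothetical, and keeping the first WID seen for each definition is an assumption, since the text does not say which WID is used when definitions coincide.

```python
def pathway_reaction_rows(pathway_wid, reactions):
    """Build PathwayReaction rows from (reaction_wid, definition_text) pairs.

    Reactions sharing the same DEFINITION text are linked only once.
    """
    seen_definitions = set()
    rows = []
    for reaction_wid, definition in reactions:
        if definition in seen_definitions:
            continue  # not distinct: same text depiction as an earlier reaction
        seen_definitions.add(definition)
        rows.append({
            "PathwayWID": pathway_wid,
            "ReactionWID": reaction_wid,
            "PriorReactionWID": None,  # NULL
            "Hypothetical": "U",       # Unknown
        })
    return rows

links = pathway_reaction_rows(
    300, [(10, "A + B <=> C"), (11, "A + B <=> C"), (12, "C <=> D")]
)
```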

GENES

The genes item is a cross-link to the KEGG gene catalogs, showing the genes in various organisms that encode this enzyme. It is used to create organism-specific pathways, and to determine how many proteins to generate during loading: one protein is generated for each gene, as the genes may encode different amino acid sequences.

For each organism with the necessary gene(s), a new pathway is created (if not already present). The BioSource WID is looked up among the organisms previously loaded from the GENOME data.

WID A new WID assigned to this object.
Name The given descriptive name of the pathway.
Type 'O' (Organism).
BioSourceWID The BioSource.WID assigned to this organism (see GENOME Name above).
DataSetWID The WID assigned to the DataSet (see DataSet table above).

The pathway map accession number for this pathway is stored in the DBID table:

OtherWID The WID assigned to this pathway.
XID The accession number.

The enzyme is linked to the BioSource by the BioSourceWIDProteinWID table:

BioSourceWID The BioSource.WID assigned to this organism (see GENOME Name above).
ProteinWID The WID assigned to the Enzyme.

Each reaction is then linked to the new pathway by adding a row to the PathwayReaction table, in the same way as for the reference pathway above:

PathwayWID The WID assigned to this pathway.
ReactionWID The reaction WID.
PriorReactionWID NULL.
Hypothetical 'U' (Unknown).
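
The creation of organism-specific pathways can be sketched as a look-up followed by a create-if-absent step. The Python below is a hedged illustration: all names are hypothetical, the pathway cache keyed by accession number and BioSource WID is an assumption, and skipping organisms that were not loaded from GENOME is also an assumption, since the text does not state how that case is handled.

```python
from itertools import count

# Hypothetical WID allocator; real WIDs are assigned by the loader.
_wids = count(5000)

def organism_pathways(accession, name, organisms, biosource_wids, pathways, dataset_wid):
    """Return WIDs of organism-specific pathways, creating any that are missing.

    `biosource_wids` maps organism codes to BioSource WIDs loaded from GENOME.
    `pathways` caches rows by (accession, BioSource WID) so they are created once.
    """
    wids = []
    for organism in organisms:
        biosource_wid = biosource_wids.get(organism)
        if biosource_wid is None:
            continue  # assumption: organisms without a loaded BioSource are skipped
        key = (accession, biosource_wid)
        if key not in pathways:
            pathways[key] = {
                "WID": next(_wids),
                "Name": name,
                "Type": "O",  # Organism
                "BioSourceWID": biosource_wid,
                "DataSetWID": dataset_wid,
            }
        wids.append(pathways[key]["WID"])
    return wids

pathways = {}
loaded = {"org1": 71}  # made-up organism code and BioSource WID
first = organism_pathways("map00010", "Example pathway", ["org1", "org2"], loaded, pathways, 7)
second = organism_pathways("map00010", "Example pathway", ["org1"], loaded, pathways, 7)
```

The second call reuses the pathway row created by the first, which mirrors the "if not already present" condition.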

DISEASE

The disease item is a cross-link to the OMIM (Online Mendelian Inheritance in Man) database. This is used to populate the CrossReference table:

OtherWID The WID assigned to this enzyme (see NAME above).
XID The MIM Number.
DatasetWID NULL.
DatabaseName "MIM"

MOTIF

The motif item is a cross-link to the PROSITE database. Each PROSITE identifier is used to populate the CrossReference table:

OtherWID The WID assigned to this enzyme (see NAME above).
XID The PROSITE ID.
DatasetWID NULL.
DatabaseName "PS"

STRUCTURES

The structures item is a cross-link to PDB (the Protein Data Bank), which stores three-dimensional structure information for proteins. Each PDB identifier is used to populate the CrossReference table:

OtherWID The WID assigned to this enzyme (see NAME above).
XID The PDB ID.
DatasetWID NULL.
DatabaseName "PDB"

DBLINKS

The dblinks item contains cross-link information to other databases, including the ENZYME Nomenclature database from the Swiss Institute of Bioinformatics. This is used to populate the CrossReference table:

OtherWID The WID assigned to this enzyme (see NAME above).
XID The external database identifier.
DatasetWID NULL.
DatabaseName The external database name.
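
The DISEASE, MOTIF, STRUCTURES and DBLINKS items all populate the CrossReference table in the same way, differing only in where the database name comes from. A minimal Python sketch of that shared pattern follows; the function name, the input shapes, and the example identifiers are hypothetical.

```python
# Fixed DatabaseName values, as given in the DISEASE, MOTIF and STRUCTURES descriptions.
FIXED_DATABASE_NAMES = {"DISEASE": "MIM", "MOTIF": "PS", "STRUCTURES": "PDB"}

def cross_reference_rows(enzyme_wid, item, entries):
    """Build CrossReference rows for one enzyme.

    For DBLINKS, `entries` holds (database_name, identifier) pairs;
    for the other items it holds bare identifiers.
    """
    rows = []
    for entry in entries:
        if item == "DBLINKS":
            database_name, xid = entry
        else:
            database_name, xid = FIXED_DATABASE_NAMES[item], entry
        rows.append({
            "OtherWID": enzyme_wid,
            "XID": xid,
            "DatasetWID": None,  # NULL
            "DatabaseName": database_name,
        })
    return rows

# Made-up identifiers, for illustration only.
xrefs = cross_reference_rows(42, "STRUCTURES", ["1ABC", "2XYZ"])
xrefs += cross_reference_rows(42, "DBLINKS", [("ExampleDB", "12345")])
```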

ALL_REAC

Ignored.

ORTHOLOGY

Ignored, but see the translation of ORTHOLOGY in REACTION.

REFERENCE

Ignored.


References