KEGG Loader for BioWarehouse

This document describes version 4.6 of the KEGG Loader. It is one of several database loaders comprising the BioWarehouse.

KEGG (the Kyoto Encyclopedia of Genes and Genomes) is a collection of databases curated by the Bioinformatics Center at the Institute for Chemical Research at Kyoto University. KEGG is available online at http://www.genome.ad.jp/kegg/. KEGG contains five types of data:

LIGAND was originally started by Takaaki Nishioka, and is now maintained in collaboration with the KEGG project. LIGAND itself is a composite of three databases:

This document describes the semantic mapping from the KEGG database components PATHWAY, GENES and LIGAND to a representation in the BioWarehouse. A chapter is dedicated to each of the KEGG components, defining the mapping to the BioWarehouse schema.

Overview of BioWarehouse Schema

Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; e.g., the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

Linking tables describe relationships among objects. They contain the WIDs of the associated objects and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.

Full schema information, including source files and browseable documentation, is available with this distribution.

Limitations

The latest supported data version for the KEGG loader is listed in the loader summary table. The loader may not be compatible with future versions of KEGG. KEGG does not appear to include a current version number in its download, nor does it display one prominently on its website, but some version and release information can be found.

The loader ignores the MASS keyword on compounds, though it could load this into Chemical.MolecularWeightCalc.

Loader Dependencies and Prerequisites

Installation and Building

Obtaining KEGG Data

KEGG DataSet

All data loaded by the loader are loaded as a single dataset in the warehouse. References from one part of KEGG to another (e.g. chemicals used in a reaction) are resolved to the WID within the dataset. References to KEGG data that are not loaded use the CrossReference table to associate the loaded data with the data that are not loaded.

OtherWID	The WID assigned to the loaded data.
XID	KEGG accession for the data that is not loaded
DatabaseName	Abbreviated name of the KEGG component (e.g., 'KO' for KEGG Orthology).
CrossWID	NULL.

DataSet Table

Each loaded version of KEGG will be assigned a new row in the DataSet table as follows:

WID	The next available WID in the warehouse.
Name	"KEGG"
Version	The version number assigned by KEGG to this release, e.g. "34".
ReleaseDate	The date that this version of KEGG was released.
LoadDate	The time/date the loader was run (SQL `SYSDATE`).
ReleaseDate	NULL.
`ChangeDate`	The date and time the loader completed, NULL if the loader did not complete successfully.
`LoadedBy`	The value of the system environment variable USER for the account running the loader.
`Application`	'KEGG Loader'
`ApplicationVersion`	4.6
HomeURL	http://www.genome.ad.jp/.
QueryURL	NULL

Entry Table

All entities that are assigned a WID (other than the DataSet above) are also given an Entry row:

OtherWID	The WID assigned to the entity.
InsertDate	The time/date the loader was run.
CreationDate	NULL.
ModifiedDate	NULL.
LoadError	"T" if a parse error is detected, "F" otherwise.
DatasetWID	The WID assigned to the DataSet (see above).

The LoadError field is set to true if any error occurred in loading the record from the source database. The granularity is the source record: if there was an error on one line of the record, all warehouse entries derived from that record will have the LoadError flag set to true.

PATHWAY

The KEGG PATHWAY component is a graphical structure combining the other parts of KEGG. The distributed data contains images and HTML image maps to allow for a convenient visual interaction with the data from LIGAND and GENES. What it adds beyond LIGAND and GENES is the information about which reactions occur in which organisms.

GENOME

The Genome database contains descriptions of the organisms whose genomes are present in the GENES database. Information represented includes the organism name, the abbreviation used in KEGG, the categorization of the organism, high-level information about its genome and citations of the source of the information.

Semantic Mapping

The present loading of this file ignores statistical and map/catalog information. The ignored fields are:

In addition, several fields are loaded strictly as comments, without any semantic interpretation of their contents:

ENTRY

Each entry in GENOME begins with an ENTRY field, giving the abbreviation used in KEGG for the organism, e.g. 'hin' for 'H.influenzae'. This is stored in the DBID table:

OtherWID	The `BioSource.WID` assigned to this organism (see GENOME Name below).
XID	The three-character organism abbreviation.

NAME

The name entry gives the scientific name for the organism. This is used to populate the BioSource table:

CHROMOSOME

The chromosome entry optionally gives the circularity ('Circular' or 'Linear'), for initial population of the NucleicAcid table, and optionally a chromosome name (in organisms with multiple chromosomes).

WID	A new WID assigned to this object.
Name	See SEQUENCE below.
Type	"DNA".
Class	"chromosome".
Topology	"circular" if Circular, "linear" if Linear, NULL if not specified.
MoleculeLength	See LENGTH below.
GeneticCodeWID	The `GeneticCode.WID` associated with the genetic code (see SEQUENCE below).
BioSourceWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

PLASMID

The plasmid entry gives the name of the plasmid and (optionally) if it is circular. This is used for initial population of the NucleicAcid table:

WID	A new WID assigned to this object.
Name	See SEQUENCE below.
Type	"DNA".
Class	"plasmid".
Topology	"circular" if Circular, "linear" if Linear, NULL if not specified.
MoleculeLength	See LENGTH below.
GeneticCodeWID	The `GeneticCode.WID` associated with the genetic code (see SEQUENCE below).
BioSourceWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

SEQUENCE

The sequence item gives the Genbank accession number for the chromosome or plasmid, and (optionally) the genetic code number used.

The replicon name is constructed from the accession number and the replicon type, (e.g. "Chromosome GB:L77117'' for the chromosome of Methanococcus jannaschii DSM2661) and is stored in NucleicAcid.Name.
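The naming rule above, together with the Topology mapping used for chromosomes and plasmids, can be sketched as follows. This is only an illustration: the function names are hypothetical, and the separator (a single space, with the accession passed in already carrying its `GB:` prefix) is inferred from the one example given rather than taken from the loader's code.

```python
from typing import Optional


def replicon_name(replicon_type: str, accession: str) -> str:
    """Build a NucleicAcid.Name value from replicon type and accession.

    Assumption: the two parts are joined by a single space, and the
    accession already includes its database prefix (e.g. "GB:L77117").
    """
    return f"{replicon_type} {accession}"


def topology(circularity: Optional[str]) -> Optional[str]:
    """Map the KEGG circularity keyword to the Topology column value.

    Returns None (SQL NULL) when the keyword is absent or unrecognized.
    """
    if circularity is None:
        return None
    return {"Circular": "circular", "Linear": "linear"}.get(circularity)


name = replicon_name("Chromosome", "GB:L77117")
```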

If the NCBI Taxonomy Loader has been run, the loader will use the genetic code number to find the associated entry in the GeneticCode table, and store it in NucleicAcid.GeneticCodeWID for this replicon. NOTE: As of approximately version 27.0 of KEGG, genetic codes do not seem to be provided in the data, so this column will not be populated.

OtherWID	The WID assigned to this `NucleicAcid` (see above).
XID	The Genbank accession number.
DatasetWID	NULL.
DatabaseName	"GENBANK"

LENGTH

The length entry gives the number of nucleotides in the replicon, and populates NucleicAcid.MoleculeLength.

Literature Citations

Each entry in the genome file gives one or more citations to the literature, contained in the fields REFERENCE (giving the Pubmed ID), AUTHORS, TITLE, and JOURNAL. These are used to populate the Citation table:

WID	A new WID assigned to this citation.
Citation	The concatenation of the AUTHORS, TITLE, and JOURNAL entries.
PMID	The Pubmed ID from the REFERENCE entry.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Two entries are made in CitationWIDOtherWID to relate the citation back to the BioSource and to the NucleicAcid:

OtherWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
CitationWID	The WID of the citation

OtherWID	The `NucleicAcid.WID` of the replicon (see CHROMOSOME or PLASMID above).
CitationWID	The WID of the citation

GENES

The GENES database contains information on the genome of particular organisms, one organism per file. The information includes the name(s) of the gene, its position, the codon usage, amino acid sequence and nucleotide sequence.

Semantic Mapping

ENTRY

The ENTRY line gives the gene id and the organism name. The gene id is used in the Gene table (see NAME below). The organism name is used to look up the previously loaded organism from GENOME. If the organism is found, a row is created in the BioSourceWIDGeneWID table:

NAME

The first name given is assumed to be the primary name, and other names are synonyms. The name starts populating the Gene table:

WID	A new WID assigned to this object.
Name	The primary name of the gene.
GenomeID	The gene id (from ENTRY above).
CodingRegionStart	See POSITION below.
CodingRegionEnd	See POSITION below.
Interrupted	See POSITION below.
NucleicAcidWID	`NucleicAcid.WID` of the replicon this gene resides on (see GENOME Chromosome and GENOME Plasmid above).
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

DEFINITION

CLASS

POSITION

The position of the gene can be a simple numerical range, a join (patching together a number of regions), a complement, or a range relative to other genes; it can also indicate on which replicon the gene resides.

We presently ignore the non-numerical range information. A join is treated as running from the start of its lowest range to the end of its highest range, and the Interrupted flag is set to 'T'.

WID	...
Name	...
GenomeID	...
CodingRegionStart	The low end of the numerical range(s).
CodingRegionEnd	The high end of the numerical range(s).
Direction	'F' for forward, 'R' for complement.
Interrupted	'T' if a join was present, 'F' otherwise.
DataSetWID	...
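The POSITION handling described above can be sketched roughly as follows. The function name is hypothetical, and the sketch assumes that ranges are written as `start..end` and that the keywords `join` and `complement` appear literally in the field; the loader's actual parser is not shown in this document and may differ.

```python
import re
from typing import Dict, Optional

# Numerical ranges are assumed to look like "1234..5678".
_RANGE = re.compile(r"(\d+)\.\.(\d+)")


def parse_position(position: str) -> Dict[str, Optional[object]]:
    """Derive the Gene position columns from a POSITION string.

    Only numerical ranges are interpreted; anything else is ignored.
    A join spans from the lowest start to the highest end.
    """
    ranges = [(int(lo), int(hi)) for lo, hi in _RANGE.findall(position)]
    return {
        "CodingRegionStart": min(lo for lo, _ in ranges) if ranges else None,
        "CodingRegionEnd": max(hi for _, hi in ranges) if ranges else None,
        "Direction": "R" if "complement" in position else "F",
        "Interrupted": "T" if "join" in position else "F",
    }


gene_columns = parse_position("complement(join(100..200,300..450))")
```

With this input the sketch yields a start of 100, an end of 450, Direction 'R' and Interrupted 'T'.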

References to the replicon on which the gene resides are represented in the GeneWIDRepliconWID table:

DBLINKS

CODON_USAGE

AASEQ

The AASEQ item gives the amino acid sequence for the protein generated by this gene. This is used to complete the AASEQUENCE in the relevant protein (see NAME below).

NTSEQ

LIGAND: COMPOUND

The COMPOUND section of LIGAND is a collection of metabolic compounds including substrates, products and inhibitors. Each of the chemicals referenced in the ENZYME and KEGG PATHWAY components is represented in this component. Information represented includes the naming, chemical formula, structural information, metabolic pathways, related enzymes, related protein structures, prosthetic groups and the CAS registry number.

In our semantic mapping, we ignore the structural information, as there is currently no table for it in the BioWarehouse schema.

Semantic Mapping

This section describes how each of the fields in a COMPOUND entry is mapped into the BioWarehouse schema.

ENTRY

Each data item begins with an ENTRY field, giving the compound accession number for the LIGAND database. The accession number is stored in the DBID table:

NAME

The name item contains the recommended name for the compound, and optionally some alternatives. The recommended name is always first, and is mandatory. This item starts populating the Chemical table:

WID	A new WID assigned to this object.
Name	The recommended name.
BeilsteinName	NULL.
SystematicName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	See below.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
PKA1	NULL.
PKA2	NULL.
PKA3	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

FORMULA

The formula item is an ASCII representation of the chemical formula of this compound, e.g. H2O, C10H16N5O13P3. This is used to populate Chemical.EmpiricalFormula.

PATHWAY

The pathway item is a cross-link to the KEGG PATHWAY data, and consists of the pathway map accession number, followed by the description. This is used to populate the CrossReference table:

ENZYME

The enzyme item is a cross-link to the KEGG ENZYME data, and consists of the EC number, followed by a type indicating how the compound is related to the enzyme. Valid types are R for reactant, I for inhibitor, C for cofactor and E for effector.

Rather than load this data from COMPOUND, this information is loaded from ENZYME, where it is redundantly replicated in KEGG.

STRUCTURES

The structures item is a cross-link to PDB (the Protein Data Bank), which stores three-dimensional structure information for proteins. This is used to populate the CrossReference table:

DBLINKS

The dblinks item contains cross-link information to other databases. This is used to populate the CrossReference table:

RPAIR

GLYCAN

COMMENT

REMARK

If the remark item begins with 'Same as: ', the remainder of the item is assumed to be an accession number. It is associated internally with this compound, and subsequent references to this accession number in REACTION are treated as if they were references to this compound. Also, a row is added to the SynonymTable table for each remark:
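A minimal sketch of the 'Same as: ' handling, under stated assumptions: the helper names, the in-memory dictionaries and the accession values are all hypothetical, and remarks that do not start with the prefix are simply skipped here, which this document does not spell out.

```python
from typing import Dict, List, Optional, Tuple

SAME_AS = "Same as: "


def record_remark(remark: str, compound_wid: int,
                  aliases: Dict[str, int],
                  synonym_rows: List[Tuple[int, str]]) -> None:
    """Register an alias accession and a SynonymTable row for a remark."""
    if not remark.startswith(SAME_AS):
        return  # assumption: other remarks carry no alias
    accession = remark[len(SAME_AS):].strip()
    aliases[accession] = compound_wid               # used when resolving REACTION
    synonym_rows.append((compound_wid, accession))  # (OtherWID, Syn)


def resolve_compound(accession: str, compounds: Dict[str, int],
                     aliases: Dict[str, int]) -> Optional[int]:
    """Return the Chemical WID for an accession, following aliases."""
    if accession in compounds:
        return compounds[accession]
    return aliases.get(accession)


# Made-up accessions and WID, for illustration only.
compounds = {"C00001": 501}
aliases: Dict[str, int] = {}
synonym_rows: List[Tuple[int, str]] = []
record_remark("Same as: G99999", 501, aliases, synonym_rows)
```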

LIGAND: REACTION

These reactions are referenced in the ENZYME component of KEGG by their accession number. The great majority of reactions in ENZYME are defined in REACTION, but some are not. Furthermore, REACTION can define non-enzymatic reactions which are not referenced there.

Semantic Mapping

This section describes how each of the items in a REACTION entry is mapped into the BioWarehouse schema. The principal mapping is to the Reaction table; one or more rows are added to it based on this entry.

Most reactions have an ENZYME item, which specifies one or more Enzyme Commission (EC) numbers. For these entries, a Reaction row is created for each EC number. Partial EC numbers (those containing a dash) such as 1.2.3.- or 6.-.-.- are treated only as synonyms, and never populate one of the EC number columns. If the EC number is non-partial it is stored in Reaction.ECNumberProposed. But when the ENZYME is loaded, if this reaction's accession number is mentioned in its REACTION item, that EC number becomes Reaction.ECNumber, and ECNumberProposed becomes NULL, effectively making the EC number official.

WID	A new WID assigned to this reaction.
Name	The first name in the NAME section; NULL if absent.
DeltaG	NULL.
ECNumber	NULL. This may be updated when an ENZYME referencing this reaction is loaded.
ECNumberProposed	The EC number from the ENZYME item if it is present, else NULL. Multiple EC numbers in ENZYME cause multiple `Reaction` rows to be loaded. This may be updated to NULL during the load of ENZYME.
Spontaneous	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

If multiple reactions are added, all links defined below are replicated for each Reaction row added.
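The EC number rules above can be illustrated with a short sketch. The function names and the dictionary standing in for a Reaction row are assumptions made for illustration; only the column names and the partial/full distinction come from this document.

```python
from typing import Dict, Optional


def is_partial_ec(ec_number: str) -> bool:
    """A partial EC number contains a dash, e.g. 1.2.3.- or 6.-.-.-"""
    return "-" in ec_number


def new_reaction_row(name: Optional[str],
                     ec_number: Optional[str]) -> Dict[str, Optional[str]]:
    """Initial EC columns of a Reaction row created from a REACTION entry."""
    proposed = ec_number if ec_number and not is_partial_ec(ec_number) else None
    return {"Name": name, "ECNumber": None, "ECNumberProposed": proposed}


def make_official(row: Dict[str, Optional[str]], ec_number: str) -> None:
    """Applied when an ENZYME entry lists this reaction's accession."""
    row["ECNumber"] = ec_number
    row["ECNumberProposed"] = None


row = new_reaction_row(None, "1.1.1.1")
proposed_before = row["ECNumberProposed"]
make_official(row, "1.1.1.1")
```

Keeping the proposed and official values in separate columns lets a query distinguish EC numbers asserted only by REACTION from those confirmed by ENZYME.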

ENTRY

Each data item begins with an ENTRY field, giving the reaction accession number for the LIGAND database. If the ENZYME field is present and refers to multiple EC numbers, there is no unique identifier for this Reaction, and no DBID row is loaded. Otherwise the accession number is stored in the DBID table:

NAME

The name item contains an optional list of names for the reaction. The recommended name, if present, is always first, followed by alternatives. This is stored in Reaction.Name.

DEFINITION

A row is added to the Description table for this item. It is the textual depiction of the reaction. It is used in the processing of the ENZYME section.

ENZYME

This item, when present, specifies one or more EC numbers associated with the reaction. As described above, a row is added to Reaction for each EC number mentioned in this item. If this item is absent, a single row is added with both ECNumber and ECNumberProposed as NULL.

EQUATION

If the coefficient is absent, it is assumed to be 1. An explicit coefficient is either an integer or an expression such as (n), (m-1) or (n+m). The loader recognizes most parenthesized expressions, as this is the standard syntax for KEGG. It also recognizes some unparenthesized expressions such as n-1, but not all of them.

The loader loads the compounds on the left side of the equation into the Reactant table. The loader loads the compounds on the right side of the equation into the Product table. Both table entries have identical column definitions:

ReactionWID	The WID of the reaction
OtherWID	The `Chemical.WID` assigned to the compound.
Coefficient	Coefficient of this substrate, or 0 if the coefficient is an expression.

If multiple reactions are created, multiple links to the reactants and products are added to Reactant and Product from each Reaction row.
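The coefficient rules for EQUATION can be sketched as follows. This is a simplified illustration, not the loader's parser: it assumes terms are separated by ` + `, sides by ` <=> `, and that a coefficient is a leading whitespace-delimited token without embedded blanks. The function names and the sample equation are made up.

```python
from typing import List, Tuple

Term = Tuple[int, str]  # (Coefficient, compound accession)


def parse_term(term: str) -> Term:
    """Split one equation term into coefficient and compound."""
    parts = term.split()
    if len(parts) == 1:
        return 1, parts[0]                 # absent coefficient means 1
    coeff_text, compound = parts[0], parts[-1]
    # Integer coefficients are kept; expressions such as (n) are stored as 0.
    coefficient = int(coeff_text) if coeff_text.isdigit() else 0
    return coefficient, compound


def parse_equation(equation: str) -> Tuple[List[Term], List[Term]]:
    """Return (Reactant rows, Product rows) for an equation string."""
    left, right = equation.split(" <=> ")
    reactants = [parse_term(t) for t in left.split(" + ")]
    products = [parse_term(t) for t in right.split(" + ")]
    return reactants, products


reactants, products = parse_equation("C00001 + 2 C00002 <=> (n) C00003")
```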

ORTHOLOGY

The ORTHOLOGY item contains cross-link information to KEGG Orthology, which is not loaded by the loader. This is used to populate the CrossReference table:

RPAIR

The RPAIR item contains cross-link information to KEGG Rpair, which is not loaded by the loader. This is used to populate the CrossReference table:

PATHWAY

The PATHWAY item contains cross-link information to KEGG Pathway, which is not loaded by the loader. This is used to populate the CrossReference table:

COMMENT

REMARK

REFERENCE

LIGAND: ENZYME

The ENZYME section of LIGAND is a collection of all known enzymatic reactions classified according to the nomenclature of the International Union of Biochemistry and Molecular Biology (IUBMB). Some of the entries in this data are taken from the ExPASy ENZYME database (http://expasy.hcuge.ch/sprot/enzyme.html) from the Swiss Institute of Bioinformatics.

Each entry is identified by the EC number, and contains information on naming, chemical reactions, metabolic compounds, metabolic pathways, genes encoding the enzyme (for several organisms), genetic diseases, and links to other databases.

Semantic Mapping

ENTRY

Each data item begins with a mandatory ENTRY field, giving the EC number for the enzyme. An EC number may be either a full or a partial EC number. If an EC number is full, it is stored in the Reaction table row of each associated reaction (see below). Partial EC numbers are not stored.

Note that if the EC number is partial, it will typically have no EnzymaticReaction, Protein, or Pathway rows associated with it, as partial EC enzymes typically do not specify any genes in their GENES item. Moreover, entries for partial EC numbers typically cause no data to be loaded into the BioWarehouse.

NAME

The name item contains the recommended name for the enzyme, and optionally some alternatives. All names are assumed to refer to proteins, not ribozymes. The recommended name is always first, and is mandatory. This item is stored in the Protein table:

WID	A new WID assigned to this object.
Name	The recommended name.
AASequence	NULL.
Charge	NULL.
Fragment	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
PlCalc	NULL.
PlExp	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

One copy of the protein is made for each gene which can generate it (see below). The amino acid sequence is completed when loading the gene (see above).

CLASS

The class item contains the meaning of the EC number, and is mandatory for all entries. There are three elements: the class, subclass and sub-subclass of the enzyme.

SYSNAME

The sysname item contains the systematic name given by the Enzyme Commission, representing the nature of the chemical reaction. This is stored as a synonym of the reaction name, in SynonymTable:

REACTION

The reaction item specifies one or more chemical reaction subitems, each of which specifies one or more Reaction rows to be used in translating various items of this entry.

Each subitem is in the form of an equation or a text description, followed optionally by a list of accession numbers referring to reactions defined in REACTION. If multiple subitems are specified, each is preceded by a parenthesized number such as (1).

If the REACTION item is absent, a single reaction is created from the SUBSTRATE and PRODUCT items (see below).

A reaction item is assumed to be a textual description if it contains no blank-delimited equals sign = or double arrow <=>. If a reaction is given in text, the SUBSTRATE and PRODUCT items are used to define the reaction in preference to the REACTION item, which is left uninterpreted and stored as a comment:

If the list of reaction accession numbers is absent from a reaction, each side of the interpreted equation is stored as per the substrate and product items (see below). The reaction is stored in the Reaction table:

WID	A new WID assigned to this object.
DeltaG	NULL.
ECNumber	NULL.
ECNumberProposed	The EC Number specified in the ENTRY item, or NULL if it is a partial EC number.
Spontaneous	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).
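The test described earlier in this section for telling an equation from a textual description can be sketched as below. The function name is hypothetical, and "blank-delimited" is interpreted here as "appears as a whitespace-separated token", which is an assumption about the loader's behavior.

```python
def is_textual_description(reaction_item: str) -> bool:
    """True if the item has no blank-delimited '=' or '<=>' token."""
    tokens = reaction_item.split()
    return "=" not in tokens and "<=>" not in tokens


equation_like = is_textual_description("ATP + H2O = ADP + phosphate")
text_like = is_textual_description("Hydrolysis of terminal residues")
```

Under this interpretation an arrow written without surrounding blanks, such as `A<=>B`, does not count as an equation.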

If the list of reaction accession numbers is present (the typical case), no new Reaction rows are created. Instead, the Reaction objects created during translation of the REACTION section specified by these accession numbers are used in translating this entry. Also, the presence of an accession number serves to define the EC number specified in the ENTRY item as an official EC number. The Reaction.ECNumber associated with each accession is updated to be the EC number specified in the ENTRY item, and the Reaction.ECNumberProposed associated with it is updated to NULL.

An EnzymaticReaction entry is also created for every Reaction specified, with one copy for each Protein generated:

WID	A new WID assigned to this object.
ReactionWID	The WID of the Reaction assigned above.
ProteinWID	The WID of the Enzyme (see NAME above).
ComplexWID	NULL.
ReactionDirectionWID	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

SUBSTRATE

The substrate item contains the chemical compounds that appear on the left side of the reaction. If the REACTION item is specified and gave an interpretable reaction, the substrate is ignored. Otherwise it is used to construct a reaction as follows.

Each substrate chemical is assigned an entry in the Chemical table. If two chemicals occur within KEGG that are textually identical they are considered the same entity. For new chemicals (not previously loaded from LIGAND COMPOUND), the fields are completed as follows:

WID	A new WID assigned to this object.
Name	The name of the substrate chemical.
BeilsteinName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
SystematicName	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Each of the substrate chemicals is linked to the reaction with a Reactant table entry, including the coefficient when specified. If the coefficient is not given, it is assumed to be 1:

ReactionWID	The WID of the reaction
OtherWID	The `Chemical.WID` assigned to the substrate
Coefficient	Coefficient of this substrate.
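The rule that textually identical chemicals share one entity, and the default coefficient of 1, can be sketched as follows. The class and function names are hypothetical, WIDs are drawn from a simple counter for illustration, and "textually identical" is taken to mean an exact, case-sensitive string match, which is an assumption.

```python
from itertools import count
from typing import Dict, Optional


class ChemicalRegistry:
    """Hands out one Chemical WID per distinct chemical name."""

    def __init__(self, first_wid: int = 1) -> None:
        self._next_wid = count(first_wid)
        self._by_name: Dict[str, int] = {}

    def wid_for(self, name: str) -> int:
        # Reuse the existing WID when the same text was seen before.
        if name not in self._by_name:
            self._by_name[name] = next(self._next_wid)
        return self._by_name[name]


def reactant_row(reaction_wid: int, chemical_wid: int,
                 coefficient: Optional[int] = None) -> Dict[str, int]:
    """Build a Reactant row; a missing coefficient defaults to 1."""
    return {
        "ReactionWID": reaction_wid,
        "OtherWID": chemical_wid,
        "Coefficient": 1 if coefficient is None else coefficient,
    }


registry = ChemicalRegistry(first_wid=100)
atp = registry.wid_for("ATP")
water = registry.wid_for("H2O")
atp_again = registry.wid_for("ATP")
```

The same pattern applies to the Product table, which has identical columns.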

PRODUCT

The product item contains the chemical compounds that appear on the right side of the reaction. If the REACTION item is specified and gave an interpretable reaction, the product is ignored. Otherwise it is used to construct a reaction as follows.

Each product chemical is assigned an entry in the Chemical table. If two chemicals that occur within LIGAND ENZYME are textually identical, they are considered the same entity. For new chemicals (not previously loaded from LIGAND COMPOUND), the fields are completed as follows:

WID	A new WID assigned to this object.
Name	The name of the product chemical.
BeilsteinName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
SystematicName	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Each of the product chemicals is linked to the reaction with a Product table entry, including the coefficient when specified. If the coefficient is not given, it is assumed to be 1:

ReactionWID	The WID of the reaction
OtherWID	The `Chemical.WID` assigned to the product chemical
Coefficient	Coefficient of this product.

INHIBITOR

The inhibitor item names compounds that inhibit the reaction from taking place. Each compound is given an entry in the Chemical table (subject to the same textual-identity rule as for substrate/product):

WID	A new WID assigned to this object.
Name	The name of the inhibitor compound.
BeilsteinName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
SystematicName	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Each of the inhibitors is linked to each of the enzymatic reactions by the EnzReactionWIDChemicalWID table:

COFACTOR

NOTE: As of approximately version 27 of KEGG, cofactor information appears to be missing from the data files. In this case, no cofactor information is loaded.

The cofactor item names compounds that do not appear in the reaction equation, but are described in the comment item as operating as cofactors in the reaction. Each compound is given an entry in the Chemical table (subject to the same textual-identity rule as for substrate/product):

WID	A new WID assigned to this object.
Name	The name of the cofactor compound.
BeilsteinName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
SystematicName	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Each of the cofactor compounds is linked to each of the enzymatic reactions with an EnzReactionCofactor table entry:

EFFECTOR

The effector item names compounds that activate the reaction. Each compound is given an entry in the Chemical table (subject to the same textual-identity rule as for substrate/product):

WID	A new WID assigned to this object.
Name	The name of the effector compound.
BeilsteinName	NULL.
CAS	NULL.
Charge	NULL.
EmpiricalFormula	NULL.
MolecularWeightCalc	NULL.
MolecularWeightExp	NULL.
OctH20PartitionCoeff	NULL.
SystematicName	NULL.
WaterSolubility	NULL.
Smiles	NULL.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Each of the effectors is linked to each of the enzymatic reactions by the EnzReactionWIDChemicalWID table:

COMMENT

The comment item contains free form text information commenting on the enzyme. This item populates the CommentTable:

PATHWAY

The pathway item is a cross-link to the KEGG PATHWAY data, and consists of the pathway map accession number, followed by the description. As that database is not parseable, this entry is used to associate reactions into pathways.

A reference (sum of organisms) pathway is created, if it does not already exist:

WID	A new WID assigned to this object.
Name	The given descriptive name of the pathway.
Type	'R' (Reference).
BioSourceWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

Reactions are considered distinct if they have different text depictions specified in their DEFINITION item in REACTION. Each distinct reaction is then linked to the pathway by adding a row to the PathwayReaction table:

GENES

The genes item is a cross-link to the KEGG gene catalogs, showing the genes in various organisms that encode this enzyme. This is used to create organism specific pathways, and to indicate the number of proteins to generate in loading: one is generated for each gene, as they may have different amino acid sequences.

For each organism with the necessary gene(s), a new pathway is created (if not already present). The BioSource WID is looked up among the organisms previously loaded from the Genome data.

WID	A new WID assigned to this object.
Name	The given descriptive name of the pathway.
Type	'O' (Organism).
BioSourceWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
DataSetWID	The WID assigned to the DataSet (see DataSet table above).

The pathway map accession number for this pathway is stored in the DBID table:

And the Enzyme is linked to the BioSource by the BioSourceWIDProteinWID table:

BioSourceWID	The `BioSource.WID` assigned to this organism (see GENOME Name above).
ProteinWID	The WID assigned to the Enzyme.

Each reaction is then linked to the new pathway by adding a row to the PathwayReaction table, in the same way as for the reference pathway above:

DISEASE

The disease item is a cross-link to the OMIM (Online Mendelian Inheritance in Man) database. This is used to populate the CrossReference table:

MOTIF

The motif item is a cross-link to the PROSITE database. Each PROSITE identifier is used to populate the CrossReference table:

STRUCTURES

The structures item is a cross-link to PDB (the Protein Data Bank), which stores three-dimensional structure information for proteins. Each PDB identifier is used to populate the CrossReference table:

DBLINKS

The dblinks item contains cross-link information to other databases, including the ENZYME Nomenclature database from the Swiss Institute of Bioinformatics. This is used to populate the CrossReference table:

EnzymaticReactionWID	The WID of the Enzymatic Reaction (see above).
ChemicalWID	The WID assigned to the chemical
InhibitOrActivate	'I'
Mechanism	NULL.
PhysioRelevant	NULL.

WID	A new WID assigned to this object.
Name	The organism name.
DataSetWID	The WID assigned to the DataSet (see DataSet table above).
all other columns	NULL.

BioSourceWID	The WID of the organism (see GENOME NAME above).
GeneWID	The WID assigned to this gene.

OtherWID	The WID assigned to this gene.
Syn	The alternative name

OtherWID	The WID assigned to this gene.
Comm	The definition text.

GeneWID	The WID of the Gene.
RepliconWID	The WID of the Replicon.

OtherWID	The WID assigned to this compound (see above).
XID	The external database identifier.
DatasetWID	NULL.
DatabaseName	The external database name.

WID	...
Name	...
AASequence	The given sequence.
Charge	...
Fragment	...
MolecularWeightCalc	...
MolecularWeightExp	...
PlCalc	...
PlExp	...
DataSetWID	...

OtherWID	The WID assigned to this chemical (see below).
XID	The accession number

OtherWID	The WID assigned to this chemical.
Syn	The alternative name

OtherWID	The WID assigned to this chemical (see above).
XID	The pathway accession number.
DatasetWID	NULL.
DatabaseName	"KEGG PATHWAY"

OtherWID	The WID assigned to this compound (see above).
XID	The PDB ID.
DatasetWID	NULL.
DatabaseName	"PDB"

OtherWID	The `Chemical` WID assigned to this compound.
Comm	The comment string.

OtherWID	The WID assigned to this compound.
Syn	The text that occurs after 'Same as: '

OtherWID	The WID assigned to this reaction (see below).
XID	The accession number

OtherWID	The WID assigned to this reaction.
Syn	The alternative name

OtherWID	The WID assigned to this reaction (see above).
Comm	The definition.
Table	'Reaction'

OtherWID	The WID assigned to this reaction (see above).
XID	The KEGG accession number. It generally starts with 'KO'.
DatasetWID	NULL.
DatabaseName	The external database name, generally 'KO'.

OtherWID	The WID assigned to this reaction (see above).
XID	The KEGG accession number.
DatasetWID	NULL.
DatabaseName	The external database name, generally 'PATH'.

OtherWID	The WID assigned to this Protein.
Syn	The alternative name.

OtherWID	WID of the reaction (see below).
Syn	The Systematic Name.

OtherWID	The WID assigned to this enzyme (see NAME above).
Comm	The comment string.

OtherWID	The WID assigned to this pathway.
XID	The accession number.

PathwayWID	The WID assigned to this pathway.
ReactionWID	The reaction WID.
PriorReactionWID	NULL.
Hypothetical	'U' (Unknown).

OtherWID	The WID assigned to this enzyme (see NAME above).
XID	The MIM Number.
DatasetWID	NULL.
DatabaseName	"MIM"

OtherWID	The WID assigned to this enzyme (see NAME above).
XID	The PROSITE ID.
DatasetWID	NULL.
DatabaseName	"PS"