BioCyc Loader for BioWarehouse

Version 4.6


(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.



Introduction
Limitations
Installation and Building
Obtaining input data
Loader Dependencies and Prerequisites
Running the Loader
Dataset and BioSource Specification
Translation Semantics for BioCyc Objects
  • Publications
  • Compounds
  • Proteins
  • Protein Sequences
  • Transcription Units
  • Genes
  • Promoters
  • Terminators
  • DNA Binding Sites
  • Reactions
  • Enzymatic Reactions
  • Regulation
  • Pathways
  • Support Table
  • Entry Table
    References

    Introduction

    This document describes version 4.6 of the BioCyc Loader, one of several database loaders that make up the BioWarehouse. The BioCyc Loader (referred to simply as the loader) loads a Pathway/Genome Database (PGDB) into the BioWarehouse, a relational database that provides a common representation for diverse bioinformatics databases.

    PGDBs are stored in a frame-based representation system that is itself implemented in Common Lisp. The loader reads a textual flat file representation of a PGDB, converts it to the representation expressed in the Bio-SPICE Warehouse Schema, and loads the result directly into an instance of the warehouse.

    Overview of BioWarehouse Schema

    The Bio-SPICE warehouse schema contains the data definition statements for the BioWarehouse. These define four types of tables: constant tables, object tables, linking tables, and special tables.

    Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

    Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

    A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; e.g., the SWISS-PROT loader adds one row to the Dataset table each time it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

    Linking tables describe relationships among objects. They contain the WIDs of the associated objects and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and crossreference information and to facilitate lookup of objects.

    Schema documentation is available.


    Limitations

    The latest supported data version for the BioCyc loader is listed in the loader summary table.  Attributes added to the BioCyc schema after this version are not supported.

    The loader silently ignores the numerous source attributes that have no analogue in the BioWarehouse.

    In the reaction graph specification in the PREDECESSORS attribute of pathways, due to parsing limitations a reaction may have at most two predecessor reactions. Additional reactions are flagged as generic syntax errors.

    The loader treats the MetaCyc database as if it contained data for a single organism named "MetaCyc", rather than as containing experimentally elucidated data from many organisms.

    Many proteins do not have the COMMON-NAME attribute. This is because multifunctional enzymes often have different names depending on which enzymatic function is being referred to. If no common name is specified (and we prefer that no common name be specified if an activity name is appropriate), we use the common name of the corresponding enzymatic-reaction frame (or a concatenation of them separated by / if there are multiple enzymatic-reactions).

    The individual components of a publication -- title, year, etc. -- are not broken out by the loader into the associated columns of the Citation table; only the concatenation of all components is loaded into Citation.Citation.

    The TRANSCRIPTION-DIRECTION attribute is ignored for transcription units. In particular, neither it nor the TRANSCRIPTION-DIRECTION of the associated gene is used in computing Feature.StartPosition or Feature.EndPosition for DNA binding sites. This could lead to incorrect values for these columns.


    Installation and Building

    See BioCyc installation instructions for details on installing and building the loader.

    Loader Dependencies and Prerequisites

    Other than the standard warehouse creation procedure, the loader does not require that any other Warehouse tools be run prior to its execution. However, if the MetaCyc Ontology or Gene Ontology datasets have been loaded, the loader builds useful links to objects in these datasets. A warning is issued if either of these datasets has not been loaded. If multiple versions of these datasets are present in the Warehouse, the dataset with the maximum WID (typically the most recently loaded) is used.

    The Multifun Ontology contained in the MetaCyc Ontology dataset is used to populate rows of the RelatedTerm table for genes that specify a Multifun type. The Gene Ontology dataset is used to populate rows of the RelatedTerm table for proteins that specify a GO term. See genes translation and proteins translation for details.

    The loader is also invoked as part of the ChIP-chip loader. When invoked in this manner, a restricted set of data files, containing a minimal amount of data, is loaded and merged into a dataset that also contains other BioWarehouse data. In this mode, the command-line arguments provided to the loader are specified in a properties file. See the ChIP-chip documentation for full details.


    Input data

    BioCyc PGDB databases are available for a number of species. However, a license may be required to obtain them. Visit BioCyc downloads or send a request to biocyc-info@ai.sri.com for details.

    The textual representation of a PGDB consists of several ASCII files. A subset of these are used by the BioCyc loader. Furthermore, not all files are present in all PGDBs; in particular, several files are not present for the MetaCyc PGDB. Input files are loaded in the following order:

    1. pubs.dat [not present for all BioCyc PGDBs]
    2. compounds.dat
    3. proteins.dat
    4. protseq.fasta [not present for all BioCyc PGDBs]
    5. transunits.dat [not present for all BioCyc PGDBs]
    6. genes.dat
    7. promoters.dat [not present for the MetaCyc PGDB]
    8. terminators.dat [not present for the MetaCyc PGDB]
    9. dnabindsites.dat [not present for the MetaCyc PGDB]
    10. reactions.dat
    11. enzrxns.dat
    12. regulation.dat
    13. pathways.dat

    See the PGDB flat file format specification for a detailed description of the contents of the input files used by the loader. Most files are in attribute-value format.
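
    To make the attribute-value layout concrete, the following Python sketch reads such a file into entries. It is only an illustration, not the loader's parser. It assumes that each data line has the form "ATTRIBUTE - value", that lines beginning with "#" are comments, and that a line containing only "//" ends an entry. These layout details, the function name parse_attribute_value, and the sample entries are assumptions made for the example; the flat file format specification remains the authority.

```python
def parse_attribute_value(text):
    """Parse attribute-value text into a list of entries.

    Each entry is a list of (attribute, value) pairs in file order.
    Assumed layout (not taken from the loader source):
      'ATTRIBUTE - value' data lines, '#' comment lines, '//' ends an entry.
    """
    entries, current = [], []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "//":
            if current:
                entries.append(current)
            current = []
            continue
        attribute, separator, value = line.partition(" - ")
        if separator:  # lines without the separator are skipped in this sketch
            current.append((attribute.strip(), value.strip()))
    if current:
        entries.append(current)
    return entries


# Made-up sample input, for illustration only.
sample = """# a comment line
UNIQUE-ID - SAMPLE-COMPOUND-1
COMMON-NAME - sample compound
SYNONYMS - first synonym
SYNONYMS - second synonym
//
UNIQUE-ID - SAMPLE-COMPOUND-2
//
"""
entries = parse_attribute_value(sample)
```

    Keeping each entry as an ordered list of pairs, rather than a dictionary, preserves repeated attributes and the relative order of attributes such as ^COEFFICIENT that refer to the line before them.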


    Running the Loader

    The BioCyc installation instructions contain details for running the loader, including options and a description of its output.

    Dataset and BioSource Specification

    If the -m (merge) command line option is used, the loader loads data into the dataset named "BioCyc", using the WID of this entry as the DataSetWID for all objects it adds to the Warehouse. If multiple datasets of this name exist, the one with the maximal DataSet.WID is used; typically this is the dataset that was most recently loaded. If no dataset of this name exists, a warning is issued and one is created.
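
    The selection rule for merge mode can be sketched in Python as follows. This is a minimal illustration that works on in-memory rows rather than on the warehouse itself; the function name select_merge_dataset and the dictionary representation of Dataset rows are assumptions made for the example.

```python
def select_merge_dataset(datasets, name="BioCyc"):
    """Return the Dataset row called `name` with the largest WID.

    `datasets` is a list of dicts with at least 'WID' and 'Name' keys
    (a stand-in for rows of the Dataset table). Returns None when no row
    matches; at that point the loader is described as issuing a warning
    and creating a new dataset.
    """
    candidates = [row for row in datasets if row["Name"] == name]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["WID"])


# Made-up rows: two datasets named "BioCyc"; the one with the larger WID wins.
rows = [
    {"WID": 3, "Name": "BioCyc"},
    {"WID": 7, "Name": "EcoCyc"},
    {"WID": 12, "Name": "BioCyc"},
]
chosen = select_merge_dataset(rows)
```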

    If the -m command line option is not used, the loader adds one row to the Dataset table as follows, using the WID of this entry as the DataSetWID for all objects it adds to the Warehouse:

    Column values for Dataset row
    Column Value assigned by BioCyc loader
    WID A small integer that uniquely identifies this dataset in the warehouse.
    Name 'speciesCyc', where species is an abbreviation for the organism represented in the PGDB.
    Version Major version of PGDB that is loaded.
    ReleaseDate The date that this version of the PGDB was released.
    LoadDate The date and time the loader was started.
    ChangeDate The date and time the loader completed, NULL if the loader did not complete successfully.
    LoadedBy The value of the system environment variable USER for the account running the loader.
    Application 'BioCyc Loader'
    ApplicationVersion 4.6
    HomeURL http://www.biocyc.org
    QueryURL http://www.biocyc.org:1555

    The loader adds one row to the BioSource table as follows, defining the organism the BioCyc KB describes:

    Column values for BioSource row
    Column Value assigned by BioCyc loader
    WID The next available WID in the warehouse. Uniquely specifies this BioSource in the warehouse.
    Name The value specified by the -o command line option, e.g., 'Bacillus subtilis'
    Strain NULL
    DatasetWID The value Dataset.WID assigned to the dataset being loaded.


    Translation Semantics for BioCyc Objects

    This section describes the semantic mapping from the objects comprising a BioCyc knowledge base, as expressed in its flat file representation, to their representation in the Bio-SPICE data warehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. The most typical case is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Any attributes not listed are ignored.

    Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant. If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.

    Publications

    Publications are input from the file pubs.dat. Any BioCyc object can contain a reference to a publication. This is done with the CITATIONS attribute. This attribute contains the UNIQUE-ID of a publication, possibly enclosed in square brackets, and possibly without the leading "PUB-" prefix. Certain citations do not refer to publications; they refer to evidence. See the Support Table for translation details.

    A row is added to the Citation table for each entry in pubs.dat, except for those with a REFERENT-FRAME attribute. The latter provide an alternative name for a publication - either the publication's UNIQUE-ID or its REFERENT-FRAME attribute may be used to refer to the publication in CITATIONS attributes in other files.
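
    Because a CITATIONS value may or may not carry square brackets and the "PUB-" prefix, some normalization is needed before it can be matched against the UNIQUE-ID of a publication. The Python sketch below shows one way to do this. The function name normalize_citation is hypothetical, and the sketch deliberately does not handle citations that are evidence codes.

```python
def normalize_citation(value):
    """Reduce a CITATIONS value to the form 'PUB-<id>'.

    Handles only the two variations described in the text: optional
    enclosing square brackets and an optional leading 'PUB-' prefix.
    Evidence codes are outside the scope of this sketch.
    """
    key = value.strip()
    if key.startswith("[") and key.endswith("]"):
        key = key[1:-1].strip()
    if not key.startswith("PUB-"):
        key = "PUB-" + key
    return key
```

    With this normalization, the made-up values "[SAMPLE95]", "SAMPLE95", and "PUB-SAMPLE95" all map to the same key, "PUB-SAMPLE95".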


    Translation semantics for pubs.dat
    BioCyc Attribute Warehouse Semantics
    AUTHOR[*] Concatenated together to form the list of authors of the publication. The order of the authors is preserved.
    A comma and a space are inserted between each author.
    Along with other attributes, this is included in the full text of the citation, stored at Citation.Citation.

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Citation object

    MEDLINE-UID Crossreference.XID;
    Crossreference.DatabaseName is 'Medline';
    Crossreference.OtherWID is the WID of this Citation object

    PUBMED-ID Citation.PMID
    REFERENT-FRAME Attribute provides an alternate name for UNIQUE-ID; either may be used in other files to refer to the publication.
    Both refer to the same Citation object.
    Used internally to associate a publication name with a Citation.WID.

    TITLE,
    SOURCE,
    YEAR,
    URL
    Concatenated together in the given order and appended to the full list of authors derived from the AUTHOR[*] attributes
    to form Citation.Citation. Tabs are inserted between each attribute and between the author list and TITLE.

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Citation object.
    Used internally to associate a publication name with a Citation.WID.
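
    The construction of Citation.Citation described in this table can be sketched in Python as follows. The function name build_citation_text is hypothetical, and the handling of missing attributes (they are simply skipped) is an assumption, since the table does not say what the loader does in that case.

```python
def build_citation_text(authors, title=None, source=None, year=None, url=None):
    """Assemble the full citation text from the parts of a pubs.dat entry.

    Authors are joined with ', ' in their original order. The author list
    and the TITLE, SOURCE, YEAR, and URL values are then joined with tabs,
    in that order. Skipping absent parts is an assumption of this sketch.
    """
    author_list = ", ".join(authors)
    parts = [author_list, title, source, year, url]
    return "\t".join(part for part in parts if part)


# Made-up publication, for illustration only.
text = build_citation_text(
    ["Author A", "Author B"],
    title="An example title",
    source="Example Journal 1:1-10",
    year="1999",
)
```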

    Additional Tables

    No additional table rows are added for publications.

    Compounds

    Compounds are input from the file compounds.dat. A row is added to the Chemical table for each entry in compounds.dat. A Chemical may be either a single compound or a class of compounds; all entries from this file are single compounds; hence Chemical.Class is always 'F'.

    Note that rows are also added to Chemical when translating reactions.dat; these compounds are classes of compounds, and their Chemical.Class is 'T'.


    Translation semantics for compounds.dat
    BioCyc Attribute Warehouse Semantics
    CAS-REGISTRY-NUMBERS Chemical.CAS
    CHARGE Chemical.Charge
    CHEMICAL-FORMULA[*] All (Element Number) pairs are concatenated to form Chemical.EmpiricalFormula
    Ex: (H 2) and (O 1) form 'H2O1'

    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Chemical

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Chemical object

    COMMON-NAME Chemical.Name
    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Chemical object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.
    MOLECULAR-WEIGHT Chemical.MolecularWeightCalc
    PKA1 Chemical.PKA1
    PKA2 Chemical.PKA2
    PKA3 Chemical.PKA3
    REGULATES Ignored. This is the converse of the REGULATOR attribute for an enzymatic regulator.
    SMILES Chemical.Smiles
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Chemical object

    SYSTEMATIC-NAME Chemical.SystematicName
    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Chemical object

    ACTIVATORS-UNMECH-OF,
    INHIBITORS-ALLOSTERIC-OF,
    INHIBITORS-IRREVERSIBLE-OF,
    INHIBITORS-OTHER-OF
    Ignored; these are symmetric analogues to the corresponding attributes INHIBITORS-ALLOSTERIC, etc. of enzrxns.dat.

    Additional Tables

    No additional table rows are added for compounds.
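
    The CHEMICAL-FORMULA translation above can be illustrated with a short Python sketch. It assumes that each attribute value is a parenthesized pair such as "(H 2)", as in the example given in the table; the function name empirical_formula is hypothetical.

```python
import re

# One '(Element Number)' pair, e.g. '(H 2)'.
_PAIR = re.compile(r"\(\s*([A-Za-z]+)\s+(\d+)\s*\)")


def empirical_formula(values):
    """Concatenate (Element Number) pairs in order, e.g. into 'H2O1'."""
    pieces = []
    for value in values:
        match = _PAIR.fullmatch(value.strip())
        if match is None:
            raise ValueError(f"unexpected CHEMICAL-FORMULA value: {value!r}")
        element, count = match.groups()
        pieces.append(element + count)
    return "".join(pieces)


formula = empirical_formula(["(H 2)", "(O 1)"])  # 'H2O1', as in the table
```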

    Proteins

    Proteins are input from the file proteins.dat. A row is added to the Protein table for each entry in proteins.dat.


    Translation semantics for proteins.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Protein

    ^COEFFICIENT[*] Subunit.Coefficient for the immediately preceding COMPONENTS attribute.
    COMMON-NAME Protein.Name.
    Many proteins do not have the COMMON-NAME attribute. This is because multifunctional enzymes often have different names depending on which enzymatic function is being referred to. If no common name is specified (and we prefer that no common name be specified if an activity name is appropriate), we use the common name of the corresponding enzymatic-reaction frame (or a concatenation of them separated by / if there are multiple enzymatic-reactions).

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Protein object

    COMPONENTS[*] Attribute is a UNIQUE-ID for a protein. A row is added to Subunit;
    Subunit.ComplexWID is the WID of this Protein object;
    Subunit.SubunitWID is the WID of the protein associated with the attribute;
    Subunit.Coefficient is the value of the immediately following ^COEFFICIENT attribute,
    defaulting to 1 if not explicit
    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Protein object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.
    FEATURES Ignored
    GENE Ignored
    GO-TERMS[*] Ignored unless the Gene Ontology dataset has been previously loaded; each term is a DBID.XID from that dataset.
    A row is added to RelatedTerm:
    RelatedTerm.TermWID references this term;
    RelatedTerm.OtherWID is the WID of this Protein object.
    RelatedTerm.Relationship is 'keyword'

    LOCATION[*] Location.Location;
    Location.ProteinWID is the WID of this Protein object

    MOLECULAR-WEIGHT-SEQ Protein.MolecularWeightCalc
    MOLECULAR-WEIGHT-EXP Protein.MolecularWeightExp
    PI Protein.PICalc
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Protein object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Protein object

    ACTIVATORS-UNMECH-OF,
    COFACTORS-UNMECH-OF,
    INHIBITORS-UNMECH-OF,
    INHIBITORS-COMPETITIVE-OF,
    INHIBITORS-OTHER-OF,
    PROSTHETIC-GROUPS-OF
    Ignored; these are symmetric analogues to the corresponding attributes INHIBITORS-COMPETITIVE, etc. of enzrxns.dat.

    Additional Tables

    A row is added to BioSourceWIDProteinWID for each protein entry. BioSourceWIDProteinWID.BioSourceWID is the WID from the one row of BioSource created by the loader for the species being loaded. Numerous other table rows are added for proteins, but they are added when the linked object is parsed. In particular, the GENE attribute is ignored, and the protein - gene link is created when genes are parsed.

    If the Gene Ontology dataset has been loaded, a row is added to RelatedTerm for each GO-TERMS attribute that is a term in Gene Ontology.
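
    The pairing of COMPONENTS with an immediately following ^COEFFICIENT attribute can be sketched in Python as shown below. The sketch works on an ordered list of (attribute, value) pairs for one protein entry and uses UNIQUE-IDs where the loader would use WIDs. The function name subunit_rows and the assumption that coefficients are integers are illustrative choices, not details taken from the loader.

```python
def subunit_rows(complex_id, pairs):
    """Build Subunit-like rows for one protein complex.

    `pairs` is the ordered list of (attribute, value) pairs of the entry.
    Each COMPONENTS value yields one row with Coefficient 1; a ^COEFFICIENT
    that follows it immediately overrides that default.
    """
    rows = []
    previous = None
    for attribute, value in pairs:
        if attribute == "COMPONENTS":
            rows.append({"Complex": complex_id, "Subunit": value, "Coefficient": 1})
        elif attribute == "^COEFFICIENT" and previous == "COMPONENTS":
            rows[-1]["Coefficient"] = int(value)  # integer coefficients assumed
        previous = attribute
    return rows


# Made-up entry: a complex of two copies of one monomer and one of another.
entry = [
    ("UNIQUE-ID", "SAMPLE-CPLX"),
    ("COMPONENTS", "SAMPLE-MONOMER-A"),
    ("^COEFFICIENT", "2"),
    ("COMPONENTS", "SAMPLE-MONOMER-B"),
]
rows = subunit_rows("SAMPLE-CPLX", entry)
```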

    Protein Sequences

    Protein sequences are input from the file protseq.fasta. This file is not an attribute-value format file; each entry contains a protein name, some other information which is ignored, and an amino acid sequence. The row for that protein is updated to store the amino acid sequence as Protein.AASequence.

    Additional Tables

    No additional table rows are added based on entries in protseq.fasta.
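
    As an illustration of reading such a file, the Python sketch below collects one amino acid sequence per entry. It assumes the usual FASTA conventions (a header line starting with ">" followed by sequence lines) and, as a simplification, treats the first whitespace-delimited token of the header as the protein name. How the loader actually extracts the name from the header is not specified here, so that choice and the function name read_fasta are assumptions.

```python
def read_fasta(text):
    """Return a dict mapping protein name to amino acid sequence.

    The name is taken to be the first token of the '>' header line; the
    rest of the header is ignored. Sequence lines are concatenated.
    """
    sequences = {}
    name, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


# Made-up input, for illustration only.
fasta = ">SAMPLE-PROT-1 ignored description\nMKT\nAYI\n>SAMPLE-PROT-2\nMSE\n"
sequences = read_fasta(fasta)
```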

    Transcription Units

    Transcription Units are input from the file transunits.dat. A row is added to the TranscriptionUnit table for each entry in this file.


    Translation semantics for transunits.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this TranscriptionUnit

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this TranscriptionUnit object

    COMMON-NAME TranscriptionUnit.Name
    If missing, UNIQUE-ID is used in its place.

    COMPONENTS[*] Ignored; a link to this entry is added when the component is loaded.
    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this TranscriptionUnit object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    EXTENT-UNKNOWN? Ignored.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this TranscriptionUnit object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this TranscriptionUnit object

    Additional Tables

    No additional table rows are added based on entries in transunits.dat.
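
    The DBLINKS translation used in this and the other tables takes only the first two list elements. A Python sketch of that rule follows. It assumes that the attribute value is written as a parenthesized, whitespace-separated list in which elements may be double-quoted, for instance (EXAMPLEDB "12345" NIL); this syntax, the database name EXAMPLEDB, and the function name parse_dblink are assumptions made for the example.

```python
import shlex


def parse_dblink(value):
    """Extract CrossReference-like fields from one DBLINKS value.

    The first list element becomes DatabaseName and the second becomes XID;
    any further elements are ignored, as described in the table.
    """
    text = value.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError(f"DBLINKS value is not a parenthesized list: {value!r}")
    elements = shlex.split(text[1:-1])  # honours double-quoted elements
    if len(elements) < 2:
        raise ValueError(f"DBLINKS value has fewer than two elements: {value!r}")
    return {"DatabaseName": elements[0], "XID": elements[1]}


link = parse_dblink('(EXAMPLEDB "12345" NIL extra)')
```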

    Genes

    Genes are input from the file genes.dat. A row is added to the Gene table for each entry in genes.dat.


    Translation semantics for genes.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Gene

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Gene object

    COMMON-NAME Gene.Name
    COMPONENT-OF[*] If value matches the UNIQUE-ID of a previously loaded transcription unit, a row is added to TranscriptionUnitComponent:
    TranscriptionUnitComponent.TranscriptionUnitWID is the WID of the TranscriptionUnit object;
    TranscriptionUnitComponent.OtherWID is the WID of this Gene object;
    TranscriptionUnitComponent.Type is 'gene'.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Gene object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    INTERRUPTED Gene.INTERRUPTED
    LEFT-END-POSITION Gene.CodingRegionStart or
    Gene.CodingRegionEnd, depending on TRANSCRIPTION-DIRECTION

    PRODUCT[*] Value should match the UNIQUE-ID of a protein.
    If so, a row is added to GeneWIDProteinWID.

    RIGHT-END-POSITION Gene.CodingRegionEnd or
    Gene.CodingRegionStart, depending on TRANSCRIPTION-DIRECTION

    PI Gene.PICalc
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Gene object

    TRANSCRIPTION-DIRECTION A value of "+" indicates that LEFT-END-POSITION is stored as
    Gene.CodingRegionStart and that RIGHT-END-POSITION is stored as
    Gene.CodingRegionEnd.
    A value of "-" indicates that LEFT-END-POSITION is stored as
    Gene.CodingRegionEnd and that RIGHT-END-POSITION is stored as
    Gene.CodingRegionStart.

    TYPES[*] Ignored unless the MetaCyc Ontology dataset has been previously loaded; each type is a Term.Name from the Multifun subontology of that dataset. A row is added to RelatedTerm:
    RelatedTerm.TermWID references a term that is this type.
    RelatedTerm.OtherWID is the WID of this Gene object.
    RelatedTerm.Relationship is 'superclass'

    UNIQUE-ID Gene.GenomeID and
    DBID.XID;
    DBID.OtherWID is the WID of this Gene object

    Additional Tables

    A row is added to GeneWIDProteinWID for each PRODUCT attribute, associating the gene with the gene product.

    A row is added to BioSourceWIDGeneWID for each gene entry. BioSourceWIDGeneWID.BioSourceWID is the WID from the one row of BioSource created by the loader for the species being loaded.

    A row is added to TranscriptionUnitComponent when a COMPONENT-OF attribute matches the UNIQUE-ID of a previously loaded transcription unit.

    If the MetaCyc Ontology dataset has been loaded, a row is added to RelatedTerm for each TYPES attribute that is a term in the Multifun ontology that is part of the MetaCyc Ontology.
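
    The effect of TRANSCRIPTION-DIRECTION on the gene coordinates can be summarized in a few lines of Python. The function name coding_region is hypothetical, and raising an error for any value other than "+" or "-" is an assumption of this sketch; the behaviour of the loader for a missing or unexpected direction is not described above.

```python
def coding_region(left_end, right_end, direction):
    """Return (CodingRegionStart, CodingRegionEnd) for a gene.

    '+' keeps LEFT-END-POSITION as the start; '-' swaps the two ends.
    """
    if direction == "+":
        return left_end, right_end
    if direction == "-":
        return right_end, left_end
    raise ValueError(f"unsupported TRANSCRIPTION-DIRECTION: {direction!r}")


forward = coding_region(100, 400, "+")  # (100, 400)
reverse = coding_region(100, 400, "-")  # (400, 100)
```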

    Promoters

    Promoters are input from the file promoters.dat. A row is added to the Feature table for each entry in this file:
    Translation semantics for promoters.dat
    BioCyc Attribute Warehouse Semantics
    ABSOLUTE-PLUS-1-POSITION Feature.StartPosition and Feature.EndPosition.
    This is the position at which transcription starts.

    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Feature

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Feature object

    COMMON-NAME Feature.Description
    COMPONENT-OF[*] If value matches the UNIQUE-ID of a previously loaded transcription unit, a row is added to TranscriptionUnitComponent:
    TranscriptionUnitComponent.TranscriptionUnitWID is the WID of the TranscriptionUnit object;
    TranscriptionUnitComponent.OtherWID is the WID of this Feature object;
    TranscriptionUnitComponent.Type is 'promoter'.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Feature object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    PROMOTER-EVIDENCE Ignored. There is typically an associated CITATIONS attribute for this evidence code.
    REGULATED-BY Ignored. This is the converse of the REGULATED-ENTITY attribute for a transcriptional regulator.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Feature object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Feature object

    Additional Tables

    A row is added to TranscriptionUnitComponent for each COMPONENT-OF attribute that references a previously loaded transcription unit.

    Terminators

    Terminators are input from the file terminators.dat. A row is added to the Feature table for each entry in this file:
    Translation semantics for terminators.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Feature

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Feature object

    COMMON-NAME Feature.Description
    COMPONENT-OF[*] If value matches the UNIQUE-ID of a previously loaded transcription unit, a row is added to TranscriptionUnitComponent:
    TranscriptionUnitComponent.TranscriptionUnitWID is the WID of the TranscriptionUnit object;
    TranscriptionUnitComponent.OtherWID is the WID of this Feature object;
    TranscriptionUnitComponent.Type is 'terminator'.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Feature object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    LEFT-END-POSITION Feature.StartPosition
    RIGHT-END-POSITION Feature.EndPosition
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Feature object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Feature object

    Additional Tables

    A row is added to TranscriptionUnitComponent for each COMPONENT-OF attribute that references a previously loaded transcription unit.

    DNA Binding Sites

    DNA binding sites are input from the file dnabindsites.dat. A row is added to the Feature table for each entry in this file:

    The TRANSCRIPTION-DIRECTION attribute is ignored for transcription units. In particular, neither it nor the TRANSCRIPTION-DIRECTION of the associated gene is used in computing Feature.StartPosition or Feature.EndPosition for DNA binding sites. This could lead to incorrect values for these columns.

    Translation semantics for dnabindsites.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Feature

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Feature object

    COMMON-NAME Feature.Description
    COMPONENT-OF[*] If value matches the UNIQUE-ID of a previously loaded transcription unit, a row is added to TranscriptionUnitComponent:
    TranscriptionUnitComponent.TranscriptionUnitWID is the WID of the TranscriptionUnit object;
    TranscriptionUnitComponent.OtherWID is the WID of this Feature object;
    TranscriptionUnitComponent.Type is 'binding site'.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Feature object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    REGULATED-PROMOTER References the UNIQUE-ID of a promoter. Assuming that promoter has been previously loaded, its ABSOLUTE-PLUS-1-POSITION is used to convert this binding site's RELATIVE-CENTER-POSITION from a relative to an absolute position.
    RELATIVE-CENTER-POSITION This numeric value designates either an integral position, or a position halfway between two integral positions. First of all, if the attribute is positive, one is subtracted from it. It is then added to the Feature.StartPosition of the promoter named in the REGULATED-PROMOTER attribute. If the sum is integral it is stored in both Feature.StartPosition and Feature.EndPosition. If it is nonintegral, the next-lowest integer is stored in Feature.StartPosition and the next-highest integer is stored in Feature.EndPosition.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Feature object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Feature object

    Additional Tables

    A row is added to TranscriptionUnitComponent for each COMPONENT-OF attribute that references a previously loaded transcription unit.
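
    The arithmetic described for RELATIVE-CENTER-POSITION can be followed in the Python sketch below. It is an illustration of the stated rule only. Like the loader, it takes no account of transcription direction. The function name binding_site_span and the sample coordinates are made up for the example.

```python
import math
from fractions import Fraction


def binding_site_span(promoter_start, relative_center):
    """Return (Feature.StartPosition, Feature.EndPosition) for a binding site.

    `promoter_start` is the Feature.StartPosition of the regulated promoter.
    `relative_center` is an integer or a value ending in .5 (number or string).
    A positive value is reduced by one before being added to the promoter
    position. An integral sum is stored in both columns; otherwise the
    neighbouring integers below and above the sum are stored.
    """
    relative = Fraction(str(relative_center))  # exact, avoids float rounding
    if relative > 0:
        relative -= 1
    center = promoter_start + relative
    return math.floor(center), math.ceil(center)


# Promoter at position 1000 (made-up coordinates).
integral = binding_site_span(1000, -35)      # 1000 - 35 = 965
halfway = binding_site_span(1000, "-41.5")   # 1000 - 41.5 = 958.5
positive = binding_site_span(1000, 10)       # 1000 + (10 - 1) = 1009
```

    For the three calls above the rule gives (965, 965), (958, 959), and (1009, 1009) respectively.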

    Reactions

    Reactions are input from the file reactions.dat. A row is added to the Reaction table for each entry in reactions.dat.


    Translation semantics for reactions.dat
    BioCyc Attribute Warehouse Semantics
    BALANCE-STATE Ignored
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Reaction

    ^COEFFICIENT[*] Reactant.Coefficient or Product.Coefficient
    for the immediately preceding LEFT or RIGHT attribute.

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Reaction object

    COMMON-NAME SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Reaction object

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Reaction object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    DELTAG0 Reaction.DeltaG
    EC-NUMBER Reaction.ECNumber or Reaction.ECNumberProposed,
    depending on OFFICIAL-EC?

    OFFICIAL-EC? Determines whether EC-NUMBER is stored as
    Reaction.ECNumber or Reaction.ECNumberProposed

    ORPHAN? Ignored
    LEFT[*] A row is added to Reactant; its ReactionWID is the WID of this reaction.
    The attribute designates a substrate or a class of substrates involved in the reaction. It is translated as discussed below (see Additional Tables); the WID for the substrate is stored in Reactant.OtherWID. If a ^COEFFICIENT attribute follows immediately, it is stored as Reactant.Coefficient. Otherwise the value 1 is stored.

    RIGHT[*] A row is added to Product; its ReactionWID is the WID of this reaction.
    The attribute designates a substrate or a class of substrates involved in the reaction. It is translated as discussed below (see Additional Tables); the WID for the substrate is stored in Product.OtherWID. If a ^COEFFICIENT attribute follows immediately, it is stored as Product.Coefficient. Otherwise the value 1 is stored.

    SPONTANEOUS? Reaction.Spontaneous
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Reaction object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Reaction object
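
    The pairing of LEFT and RIGHT attributes with an immediately following ^COEFFICIENT can be sketched as below. The sketch assumes the entry has already been parsed into an ordered list of (name, value) pairs and that coefficients are plain integers; the function name and the "Substrate" key (the name that is later resolved to OtherWID) are hypothetical.

```python
def reaction_side_rows(attributes, reaction_wid):
    """Split ordered (name, value) pairs into Reactant and Product rows."""
    reactants, products = [], []
    previous = None  # row created by the immediately preceding LEFT/RIGHT
    for name, value in attributes:
        if name in ("LEFT", "RIGHT"):
            # Coefficient defaults to 1 unless a ^COEFFICIENT follows directly.
            previous = {"ReactionWID": reaction_wid,
                        "Substrate": value,
                        "Coefficient": 1}
            (reactants if name == "LEFT" else products).append(previous)
        elif name == "^COEFFICIENT" and previous is not None:
            previous["Coefficient"] = int(value)  # assumes integer coefficients
            previous = None
        else:
            # Any other attribute breaks the "immediately preceding" link.
            previous = None
    return reactants, products
```

    For example, the pairs LEFT=WATER, LEFT=ATP, ^COEFFICIENT=2, RIGHT=ADP yield two Reactant rows (coefficients 1 and 2) and one Product row (coefficient 1).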

    Additional Tables

    As noted, a row is added to Reactant for each LEFT attribute and to Product for each RIGHT attribute. If the attribute matches the UNIQUE-ID, the COMMON-NAME, or one of the SYNONYMS of a Chemical or Protein, that object's WID is stored as Reactant.OtherWID or Product.OtherWID, respectively. Otherwise the attribute is assumed to specify a class of chemicals; a row is created in Chemical for it, with Chemical.Name set to the attribute and Chemical.Class set to 'T', and the WID of that new row is stored instead.
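
    The lookup-or-create behavior for substrates can be sketched with an in-memory index. This is an illustration under simplifying assumptions: WIDs come from a local counter, the Chemical and Protein tables are collapsed into one name index, and the class SubstrateResolver and its methods are hypothetical.

```python
import itertools

class SubstrateResolver:
    """Resolve a LEFT/RIGHT value to a WID, creating a class row if needed."""

    def __init__(self):
        self._next_wid = itertools.count(1)  # stand-in for WID allocation
        self.wid_by_name = {}                # UNIQUE-ID / COMMON-NAME / synonym -> WID
        self.class_rows = []                 # Chemical rows created for classes

    def add_object(self, unique_id, common_name=None, synonyms=()):
        """Register a previously loaded Chemical or Protein under all its names."""
        wid = next(self._next_wid)
        for name in (unique_id, common_name, *synonyms):
            if name is not None:
                self.wid_by_name.setdefault(name, wid)
        return wid

    def resolve(self, attribute):
        """Return the WID for a substrate name, adding a class row on a miss."""
        wid = self.wid_by_name.get(attribute)
        if wid is None:
            wid = next(self._next_wid)
            self.class_rows.append({"WID": wid, "Name": attribute, "Class": "T"})
            self.wid_by_name[attribute] = wid
        return wid
```

    Registering the new class under its own name means a later reaction that names the same class reuses the row rather than creating a duplicate; whether the loader does the same is an assumption of this sketch.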

    Enzymatic Reactions

    Enzymatic reactions are input from the file enzrxns.dat. A row is added to the EnzymaticReaction table for each entry in enzrxns.dat.


    Translation semantics for enzrxns.dat
    BioCyc Attribute Warehouse Semantics
    ALTERNATIVE-COFACTORS[*]
    ALTERNATIVE-SUBSTRATES[*]
    The attribute consists of a list of names (PRIMARY ALT1 ALT2 ... ALTn).
    For each name, the Chemical table is queried for a UNIQUE_ID or a COMMON_NAME of a compound in the database being loaded that matches the name. If none is found, a row is added to Chemical; the name is stored as Chemical.Name; its Chemical.WID is used as described below.

    N rows are added to EnzReactionAltCompound, one for each ALTi:
    EnzReactionAltCompound.EnzymaticReactionWID is the WID of this enzymatic reaction;
    EnzReactionAltCompound.PrimaryWID is the WID associated with the primary compound;
    EnzReactionAltCompound.AlternativeWID is the WID associated with compound ALTi;
    EnzReactionAltCompound.Cofactor is 'T' for ALTERNATIVE-COFACTORS, 'F' for ALTERNATIVE-SUBSTRATES.

    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this EnzymaticReaction

    COFACTORS[*]
    PROSTHETIC-GROUPS[*]
    COFACTORS-OR-PROSTHETIC-GROUPS[*]
    The Chemical and Protein tables (in that order) are queried for a UNIQUE_ID or a COMMON_NAME of a compound or a protein in the database being loaded that matches the value of this attribute. If none is found, a row is added to Chemical; the attribute value is stored as Chemical.Name; its Chemical.WID is used as described below.

    A row is added to EnzReactionCofactor; EnzReactionCofactor.EnzymaticReactionWID is the WID of this enzymatic reaction; EnzReactionCofactor.CompoundWID is the WID associated with the compound or protein; EnzReactionCofactor.Prosthetic is 'F' for COFACTORS, 'T' for PROSTHETIC-GROUPS, and NULL for COFACTORS-OR-PROSTHETIC-GROUPS.

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this EnzymaticReaction object

    COMMON-NAME Ignored
    REQUIRED-PROTEIN-COMPLEX Value should match the UNIQUE_ID of a protein.
    If so, the WID of the protein is stored as EnzymaticReaction.ComplexWID.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this EnzymaticReaction object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    ENZYME Required. Value should match the UNIQUE_ID of a protein.
    If so, the WID of the protein is stored as EnzymaticReaction.ProteinWID.

    REACTION Required. Value should match the UNIQUE_ID of a reaction.
    If so, the WID of the reaction is stored as EnzymaticReaction.ReactionWID.

    REACTION-DIRECTION EnzymaticReaction.ReactionDirection.
    REGULATED-BY Ignored. This is the converse of the REGULATED-ENTITY attribute for an enzymatic regulator.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this EnzymaticReaction object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this EnzymaticReaction object
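
    The expansion of an ALTERNATIVE-COFACTORS or ALTERNATIVE-SUBSTRATES list into EnzReactionAltCompound rows can be sketched as follows. The function name is hypothetical, and wid_of is assumed to be a mapping from compound name to an already resolved Chemical WID.

```python
def alt_compound_rows(enzrxn_wid, names, wid_of, is_cofactor):
    """Build one EnzReactionAltCompound row per alternative compound.

    names is the parsed list (PRIMARY ALT1 ... ALTn); wid_of maps each
    name to its Chemical WID (an assumption of this sketch).
    """
    primary, *alternatives = names
    flag = "T" if is_cofactor else "F"  # 'T' for ALTERNATIVE-COFACTORS
    return [
        {
            "EnzymaticReactionWID": enzrxn_wid,
            "PrimaryWID": wid_of[primary],
            "AlternativeWID": wid_of[alt],
            "Cofactor": flag,
        }
        for alt in alternatives
    ]
```

    A list with a primary compound and two alternatives therefore produces two rows, both sharing the same PrimaryWID.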

    Additional Tables

    The loader adds rows to EnzReactionAltCompound and EnzReactionCofactor as noted above, and to EnzReactionInhibitorActivator as described under Regulation.

    Regulation

    A regulation entry describes a relationship between a regulated entity and an agent that performs regulation. There are two types of regulation that are represented in the database and translated to the BioWarehouse -- regulation of transcription, and regulation of enzyme activity. The regulation type is indicated by the TYPES attribute.

    In transcriptional regulation, the regulator is a protein that is a transcription factor; the regulated entity is a promoter that is a component of one or more transcription units. In enzymatic regulation, the regulator is a chemical; the regulated entity is an enzymatic reaction. Comments, citations, synonyms, and cross-references of each entry are linked to the regulator. Regulator proteins and compounds will have multiple DBIDs -- their UNIQUE-ID from this entry as well as the UNIQUE-ID from the protein or compound entry.

    All regulation is characterized by a mode, indicating whether the process is inhibited or activated. In addition, enzymatic regulation is characterized by a regulation mechanism, as well as a flag indicating physiological relevance.

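    Note that for transcriptional regulation, no regulation-specific row is added to the BioWarehouse, because the schema has no representation of transcription factors; only the citations, comments, cross-references, synonyms, and DBIDs described below are linked to the regulator protein. However, the naming conventions of the BioCyc database may be exploited to find all transcription factors by finding all proteins that have a DBID.XID starting with 'REG'.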
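
    The lookup of transcription factors by DBID.XID prefix can be sketched with an in-memory SQLite database. The table layouts below are reduced to the columns needed for the query, the Protein.Name column is an assumption, and the sample rows are made up for illustration; they are not BioCyc data.

```python
import sqlite3

def build_demo_db():
    """Create a tiny stand-in for the Protein and DBID tables (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE DBID (OtherWID INTEGER, XID TEXT);
        INSERT INTO Protein VALUES (1, 'example regulator');
        INSERT INTO Protein VALUES (2, 'example enzyme');
        INSERT INTO DBID VALUES (1, 'EXAMPLE-PROTEIN-1');
        INSERT INTO DBID VALUES (1, 'REG-EXAMPLE-1');
        INSERT INTO DBID VALUES (2, 'EXAMPLE-PROTEIN-2');
    """)
    return conn

def transcription_factors(conn):
    """Return (WID, Name) of every protein with a DBID.XID starting with 'REG'."""
    query = """
        SELECT DISTINCT p.WID, p.Name
        FROM Protein p
        JOIN DBID d ON d.OtherWID = p.WID
        WHERE d.XID LIKE 'REG%'
        ORDER BY p.WID
    """
    return conn.execute(query).fetchall()

print(transcription_factors(build_demo_db()))
```

    DISTINCT matters here because a regulator that appears in several regulation entries receives one DBID row per entry.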

    Translation semantics for regulation.dat where TYPES is 'Regulation-of-Transcription-Initiation'
    BioCyc Attribute Warehouse Semantics
    ASSOCIATED-BINDING-SITE Ignored.
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of the Protein referenced by the REGULATOR attribute.

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of the Protein referenced by the REGULATOR attribute.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of the Protein referenced by the REGULATOR attribute;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    MODE Ignored.
    REGULATED-ENTITY References the UNIQUE-ID of a promoter Feature that is a component of a transcription unit.
    REGULATOR References the UNIQUE-ID of a protein that is a transcription factor.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of the Protein referenced by the REGULATOR attribute.

    TYPES 'Regulation-of-Transcription-Initiation'. Determines whether this entry is translated as transcriptional or enzymatic regulation.
    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of the Protein referenced by the REGULATOR attribute.
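
    The DBLINKS translation used throughout these tables (first element to CrossReference.DatabaseName, second to CrossReference.XID, remainder ignored) can be sketched as below. The sketch assumes the list is written as a parenthesized, whitespace-separated sequence whose elements may be double-quoted; that syntax is an assumption about the flat file, and parse_dblink is a hypothetical name.

```python
import re

def parse_dblink(value):
    """Return the CrossReference columns for one DBLINKS value, or None.

    Assumes a form such as (DATABASE "identifier" ...); only the first
    two elements are used, the rest of the list is ignored.
    """
    inner = value.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    # A token is either a double-quoted string or a run of non-whitespace.
    tokens = re.findall(r'"[^"]*"|\S+', inner)
    elements = [tok.strip('"') for tok in tokens]
    if len(elements) < 2:
        return None  # fewer than two elements: nothing to store
    return {"DatabaseName": elements[0], "XID": elements[1]}
```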


    For each entry that describes enzymatic regulation, a row is added to EnzReactionInhibitorActivator. Entry attributes determine the column values as described in the table below.

    Translation semantics for regulation.dat where TYPES is 'Regulation-of-Enzyme-Activity'
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of the Chemical or Protein referenced by the REGULATOR attribute.

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of the Chemical referenced by the REGULATOR attribute.

    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of the Chemical or Protein referenced by the REGULATOR attribute;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    MECHANISM EnzReactionInhibitorActivator.Mechanism is
    • 'A' if MECHANISM is ':ALLOSTERIC',
    • 'C' if MECHANISM is ':COMPETITIVE',
    • 'I' if MECHANISM is ':IRREVERSIBLE',
    • 'N' if MECHANISM is ':NONCOMPETITIVE' or ':UNCOMPETITIVE',
    • 'U' if MECHANISM is ':UNKMECH',
    • NULL otherwise.
    MODE EnzReactionInhibitorActivator.InhibitOrActivate is
    • 'A' if MODE is '+',
    • 'I' if MODE is '-', and
    • NULL otherwise.
    PHYSIOLOGICALLY-RELEVANT? EnzReactionInhibitorActivator.PhysioRelevant is
    • 'T' if PHYSIOLOGICALLY-RELEVANT? is 'T',
    • 'F' otherwise.
    REGULATED-ENTITY References the UNIQUE-ID of an enzymatic reaction.
    Its WID is EnzReactionInhibitorActivator.EnzymaticReactionWID.
    REGULATOR The Chemical table is queried for a UNIQUE_ID or a COMMON_NAME of a compound in the database being loaded that matches the value of this attribute. If none is found, a row is added to Chemical; the attribute value is stored as Chemical.Name. Its Chemical.WID is stored as EnzReactionInhibitorActivator.CompoundWID.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of the Chemical or Protein referenced by the REGULATOR attribute.

    TYPES 'Regulation-of-Enzyme-Activity'. Determines whether this entry is translated as transcriptional or enzymatic regulation.
    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of the Chemical or Protein referenced by the REGULATOR attribute.
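
    The MECHANISM, MODE, and PHYSIOLOGICALLY-RELEVANT? translations can be sketched as simple lookups. The entry is assumed to be a dictionary of attribute name to value, the function name is hypothetical, and PhysioRelevant is read here from the PHYSIOLOGICALLY-RELEVANT? attribute itself, which is this sketch's reading of the table.

```python
MECHANISM_CODES = {
    ":ALLOSTERIC": "A",
    ":COMPETITIVE": "C",
    ":IRREVERSIBLE": "I",
    ":NONCOMPETITIVE": "N",
    ":UNCOMPETITIVE": "N",
    ":UNKMECH": "U",
}
MODE_CODES = {"+": "A", "-": "I"}

def inhibitor_activator_columns(entry):
    """Map a regulation entry (dict of attribute -> value) to column values."""
    return {
        # dict.get returns None (NULL) for missing or unrecognized values.
        "Mechanism": MECHANISM_CODES.get(entry.get("MECHANISM")),
        "InhibitOrActivate": MODE_CODES.get(entry.get("MODE")),
        "PhysioRelevant": "T" if entry.get("PHYSIOLOGICALLY-RELEVANT?") == "T" else "F",
    }
```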

    Pathways

    Pathways are input from the file pathways.dat. A row is added to the Pathway table for each entry in pathways.dat. The Pathway.Type column value of each row is set to 'O' to signify the pathway is from a real organism. Pathway.BioSourceWID is the WID from the one row of BioSource created by the loader for the species being loaded.

    Pathway entries can reference other pathways (using their UNIQUE-ID), and there is no guarantee that a pathway entry will be defined before a reference to it occurs. The loader adds a row to Pathway and assigns it a WID upon the first reference to a pathway, and performs a SQL UPDATE of the row when its entry is fully defined.
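
    The handling of forward references (insert a placeholder row on first reference, update it when the entry is defined) can be sketched with SQLite. The Pathway table is reduced to three columns, and the class PathwayTable and its methods are hypothetical.

```python
import sqlite3

class PathwayTable:
    """Allocate a Pathway row on first reference; fill it in on definition."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE Pathway (WID INTEGER PRIMARY KEY, Name TEXT, Type TEXT)"
        )
        self.wid_by_id = {}  # UNIQUE-ID -> WID of the row already inserted

    def wid_for(self, unique_id):
        """Return the WID for a pathway, inserting a placeholder if unseen."""
        if unique_id not in self.wid_by_id:
            cur = self.conn.execute(
                "INSERT INTO Pathway (Name, Type) VALUES (NULL, 'O')"
            )
            self.wid_by_id[unique_id] = cur.lastrowid
        return self.wid_by_id[unique_id]

    def define(self, unique_id, name):
        """Record the full entry; reuses the placeholder row if one exists."""
        wid = self.wid_for(unique_id)
        self.conn.execute("UPDATE Pathway SET Name = ? WHERE WID = ?", (name, wid))
        return wid
```

    Because every reference goes through wid_for, a pathway referenced before its definition keeps the same WID afterwards, so rows that already point to it stay valid.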

    Translation semantics for pathways.dat
    BioCyc Attribute Warehouse Semantics
    CITATIONS[*] Each attribute is either an evidence code or the UNIQUE-ID of a publication. See Support Table for translation of evidence codes. Each publication UNIQUE-ID is possibly enclosed in square brackets, and possibly missing the leading "PUB-".
    A row is added to CitationWIDOtherWID;
    CitationWIDOtherWID.CitationWID is the WID of the Citation associated with this unique ID;
    CitationWIDOtherWID.OtherWID is the WID of this Pathway

    COMMENT[*] CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Pathway object

    COMMON-NAME Pathway.Name
    DBLINKS[*] Attribute is a list of two or more elements. A row is added to CrossReference;
    CrossReference.OtherWID is the WID of this Pathway object;
    CrossReference.DatabaseName is the first element of the list;
    CrossReference.XID is the second element of the list;
    the rest of the list is ignored.

    HYPOTHETICAL-REACTIONS[*] Value should match the UNIQUE_ID of a reaction.
    If so, PathwayReaction.Hypothetical is 'T'
    for the row added to PathwayReaction for the reaction.

    NET-REACTION-EQUATION CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Pathway object

    PATHWAY-INTERACTIONS CommentTable.Comm;
    CommentTable.OtherWID is the WID of this Pathway object

    PATHWAY-LINKS[*] Indicates pathways that are linked via a common substrate.
    If value is of form (Unique-ID) it is probably a pathway reference and is ignored. Otherwise value is of form (Compound PathwaySpec1 ... PathwaySpecN) where each PathwaySpec is either a descriptor or (descriptor . direction). The direction is ignored. The descriptor may be either a quoted string or a UNIQUE-ID of some BioCyc object (not necessarily a pathway). If the descriptor is anything other than a UNIQUE-ID for a previously defined pathway, a row is added to Pathway:
    Pathway.Name is the descriptor.
    A row is added to PathwayLink for each PathwaySpec:
    PathwayLink.ChemicalWID is the WID for the Compound;
    PathwayLink.Pathway1WID is the WID of this Pathway object;
    PathwayLink.Pathway2WID is the WID of the linked Pathway.
    During postprocessing, any Pathway row that was added for a descriptor but never matched by a pathway entry (i.e., the descriptor was not actually a pathway) is deleted from Pathway, along with any PathwayLink rows that reference it.

    PREDECESSORS[*] Collectively, these specify the graph of reactions that form the pathway. Each value is of one of two forms:
    1. (Successor Predecessor1 ... PredecessorN); zero predecessors are allowed
    2. Predecessor-PathwayID
    For case 1:
    A row is added to PathwayReaction for each Predecessor. For each row:
    PathwayReaction.PathwayWID is the WID of this Pathway object;
    PathwayReaction.ReactionWID is the Reaction WID of Successor;
    PathwayReaction.Hypothetical is 'F', unless successor is named as a HYPOTHETICAL-REACTIONS attribute;
    PathwayReaction.PriorReactionWID is the Reaction WID of Predecessor, or NULL if there are none.
    For case 2:
    Each such attribute should also occur as a SUB-PATHWAYS attribute. If it does not, it is ignored.
    REACTION-LIST[*] Each attribute is a UNIQUE-ID of a reaction or pathway. Pathways occurring here are ignored. For each reaction occurring here, but not occurring as a Successor in a PREDECESSORS attribute, a row is added to PathwayReaction:
    PathwayReaction.PathwayWID is the WID of this Pathway object;
    PathwayReaction.ReactionWID is the Reaction WID of the attribute;
    PathwayReaction.Hypothetical is 'F', unless the attribute is named as a HYPOTHETICAL-REACTIONS attribute;
    PathwayReaction.PriorReactionWID is NULL.
    SUB-PATHWAYS[*] This pathway inherits the reaction graph of the pathway whose UNIQUE-ID equals the attribute. That is, a PathwayReaction row is added for this pathway for each PathwayReaction row of the pathway designated by the attribute. The columns of each PathwayReaction row are identical, except that PathwayWID is changed from the attribute's pathway WID to the WID of this pathway.
    Note: sub/superpathway information is loaded via SUPER-PATHWAYS.
    SUPER-PATHWAYS[*] Value should match the UNIQUE_ID of a pathway.
    If so, a row is added to SuperPathway:
    SuperPathway.SuperPathwayWID is the WID associated with this UNIQUE_ID;
    SuperPathway.PathwayWID is the WID of this Pathway object.
    SYNONYMS[*] SynonymTable.Syn;
    SynonymTable.OtherWID is the WID of this Pathway object

    UNIQUE-ID DBID.XID;
    DBID.OtherWID is the WID of this Pathway object

    Additional Tables

    Rows are added to the tables SuperPathway, PathwayReaction, and PathwayLink as specified above.
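
    The construction of PathwayReaction rows from PREDECESSORS and REACTION-LIST can be sketched as below. Only the list form of PREDECESSORS (case 1) is handled; sub-pathway inheritance is left out. The function name and parameters are hypothetical, and wid_of is assumed to map reaction UNIQUE-IDs (not pathway IDs) to Reaction WIDs.

```python
def pathway_reaction_rows(pathway_wid, predecessors, reaction_list,
                          hypothetical, wid_of):
    """Build PathwayReaction rows for one pathway.

    predecessors:  list of tuples (successor, pred1, ..., predN), N >= 0
    reaction_list: UNIQUE-IDs from REACTION-LIST (may include pathways)
    hypothetical:  set of UNIQUE-IDs named in HYPOTHETICAL-REACTIONS
    """
    def row(reaction, prior):
        return {
            "PathwayWID": pathway_wid,
            "ReactionWID": wid_of[reaction],
            "PriorReactionWID": wid_of[prior] if prior is not None else None,
            "Hypothetical": "T" if reaction in hypothetical else "F",
        }

    rows, successors = [], set()
    for successor, *preds in predecessors:
        successors.add(successor)
        # With zero predecessors, one row is still added with a NULL prior.
        for pred in preds or [None]:
            rows.append(row(successor, pred))
    for reaction in reaction_list:
        # Pathway IDs are not in wid_of and are skipped.
        if reaction in wid_of and reaction not in successors:
            rows.append(row(reaction, None))
    return rows
```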

    Support Table

    Most BioCyc entries allow a CITATIONS attribute to be specified. If this attribute begins with a colon or contains the string ':EV-' anywhere, it is treated as an indicator of evidence for the validity of the associated entry, or some portion of it. The evidence code is the text between the colon and the next colon (if present) or the end of the attribute (excluding the colons).
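
    The evidence-code rule above can be sketched as follows. When the attribute begins with a colon, this sketch takes the code from that leading colon; otherwise it takes it from the colon of the first ':EV-'. That tie-breaking choice and the function name are assumptions.

```python
def evidence_code(citation):
    """Return the evidence code in a CITATIONS value, or None if there is none."""
    if citation.startswith(":"):
        start = 0
    else:
        start = citation.find(":EV-")
        if start < 0:
            return None  # a plain publication reference
    # The code runs up to the next colon, or to the end of the attribute.
    end = citation.find(":", start + 1)
    return citation[start + 1:] if end < 0 else citation[start + 1:end]

print(evidence_code(":EV-EXP-IMP-POLAR-MUTATION"))  # EV-EXP-IMP-POLAR-MUTATION
print(evidence_code("[PUB-EXAMPLE]"))                # None
```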

    For each attribute of this form, a row in the Support table is created as follows:

    Column values for Support row
    Column Value assigned by BioCyc loader
    WID The BioWarehouse ID allocated for this Support.
    OtherWID The WID of the entry this supporting evidence applies to.
    Type The evidence code (e.g., 'EV-EXP-IMP-POLAR-MUTATION').
    Note: this is not consistent with the schema documentation, which states this column is either 'computational' or 'experimental'.

    Confidence NULL
    DatasetWID The value Dataset.WID assigned to the dataset being loaded.

    Additional Tables

    If the CITATIONS attribute contains a reference to a publication ID, a row is added to CitationWIDOtherWID to associate the Support row with the Citation of the publication.
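
    Extracting the publication ID from a CITATIONS value can be sketched as below, following the earlier note that an ID may be enclosed in square brackets and may lack the leading "PUB-". The sketch assumes that, when an evidence code is also present, the publication ID is the text before the first colon; that assumption and the function name are not taken from the loader.

```python
def publication_id(citation):
    """Return the normalized publication UNIQUE-ID, or None if absent."""
    ident = citation.strip()
    if ident.startswith("[") and ident.endswith("]"):
        ident = ident[1:-1].strip()       # drop enclosing square brackets
    ident = ident.split(":", 1)[0].strip()  # keep text before any evidence code
    if not ident:
        return None                       # evidence code only, no publication
    if not ident.startswith("PUB-"):
        ident = "PUB-" + ident            # restore the missing prefix
    return ident
```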

    Entry Table

    For each object loaded from the database, a row in the Entry table is created as follows:

    Column values for Entry row
    Column Value assigned by BioCyc loader
    OtherWID The WID of the entry described by this row. Entry may be in
    Chemical, Reaction, Protein, Gene, EnzymaticReaction, or Pathway.

    InsertDate The time/date the loader was run.
    CreationDate NULL
    ModifiedDate NULL
    LineNumber The line number from the input file on which this entry began.
    LoadError 'T' if a parse error is detected, 'F' otherwise.
    DatasetWID The value Dataset.WID assigned to the dataset being loaded.


    References

  • BioSPICE Web Site
  • BioCyc.org
  • Pathway/Genome Databases (PGDBs)
  • Schema documentation