MetaCyc Ontology Loader for BioWarehouse

Version 4.6


(C) 2005 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.



Introduction
Limitations
Installation and Building
Obtaining input data
Running the Loader
Dataset and BioSource Specification
Translation Semantics for MetaCyc Ontology Objects
  • Classes
  • Entry Table
References

    Introduction

    This document describes version 4.6 of the MetaCyc Ontology Loader. It is one of several database loaders comprising the BioWarehouse. The BioWarehouse is a relational database that provides a common representation for diverse bioinformatics databases.

    The MetaCyc Ontology Loader (referred to simply as the loader) loads a portion of the MetaCyc Pathway/Genome Database (PGDB) into the BioWarehouse. PGDBs are stored in a frame-based representation system that is implemented in Common Lisp. The loader reads a textual flat file representation of a PGDB, converts it to the representation expressed in the BioWarehouse Schema, and loads the result directly into an instance of the warehouse.

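    The loader loads all terms from the three ontologies that MetaCyc contains:

  • the chemical compound ontology (rooted at the class Compounds-and-Elements)
  • the MultiFun gene ontology (rooted at the class All-Genes)
  • the pathway ontology (rooted at the class Pathways)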

    MetaCyc is a large database containing many other types of data; these are not loaded by this loader. Much of this data may be loaded by loading MetaCyc with the BioCyc Loader.

    Overview of BioWarehouse Schema

    The BioWarehouse schema contains the data definition statements for the BioWarehouse. These define four different types of tables: constant tables, object tables, linking tables, and special tables.

    Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

    Each object table describes a type of entity in a source database, such as a compound or a protein. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

    A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; for example, the SWISS-PROT loader adds one row to the DataSet table when it is run. The WID of this row is referred to as the dataset WID. It appears as a column in each object table and identifies the source database of the object.

    Linking tables describe relationships among objects. They contain the WIDs of the associated objects and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.

    Schema documentation is available.


    Limitations

    The latest supported data version for the MetaCyc Ontology loader is listed in the loader summary table.  Attributes added to the MetaCyc Ontology schema after this version are not supported.

    Only the three ontologies noted above are loaded by this loader - no other ontological terms, and no other data types present in MetaCyc.


    Installation and Building

    See MetaCyc Ontology installation instructions for details on installing and building the loader.

    Input data

    MetaCyc is a BioCyc PGDB. A license may be required to obtain it. Visit BioCyc downloads or send a request to biocyc-info@ai.sri.com for details. See the PGDB flat file format specification for a detailed specification of the contents of the input files used by the loader. Most files are in attribute-value format.

    The textual representation of a PGDB consists of several ASCII files. The loader loads only one of them: classes.dat.
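
    To make the attribute-value format concrete, the following Python sketch groups the lines of a classes.dat-style file into records. The layout it assumes ("ATTRIBUTE - value" lines, "//" ending a record, "#" starting a comment line, and a leading "/" continuing the previous value) and the function name parse_records are assumptions of this sketch; the PGDB flat file format specification remains the authority, and this is not the loader's actual parser.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record, mapping each attribute to its list of values."""
    record: Dict[str, List[str]] = {}
    last_attr: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#") or not line.strip():
            continue  # assumed: comment lines start with '#'; blank lines are skipped
        if line.strip() == "//":  # assumed record terminator
            if record:
                yield record
            record, last_attr = {}, None
            continue
        if line.startswith("/") and last_attr is not None:
            # assumed: a leading '/' continues the previous value on a new line
            record[last_attr][-1] += "\n" + line[1:]
            continue
        attr, sep, value = line.partition(" - ")
        if not sep:
            continue  # not an attribute line; ignored in this sketch
        record.setdefault(attr, []).append(value)  # repeated attributes accumulate
        last_attr = attr
    if record:
        yield record  # tolerate a missing final terminator


# Hypothetical excerpt, for illustration only.
sample = """\
# hypothetical excerpt
UNIQUE-ID - Pathways
COMMON-NAME - Pathways
COMMENT - All pathways.
//
UNIQUE-ID - Biosynthesis
TYPES - Pathways
SYNONYMS - Anabolism
SYNONYMS - Synthesis
//
""".splitlines()

records = list(parse_records(sample))
```

    Storing every attribute as a list keeps repeated attributes such as SYNONYMS in file order, which matters wherever the order of occurrences is significant.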


    Running the Loader

    The MetaCyc Ontology loader installation instructions contain details for running the loader, including options and a description of its output.

    Dataset Specification

    If the -m (merge) command line option is used, the loader loads data into the dataset named "BioCyc", using the WID of this entry as the DataSetWID for all objects it adds to the Warehouse. If multiple datasets of this name exist, the one with the maximal DataSet.WID is used; typically this is the dataset that was most recently loaded. If no dataset of this name exists, a warning is issued and one is created.
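
    The dataset selection rule for the merge case can be sketched as a single query. The example below uses an in-memory SQLite database and a two-column stand-in for the DataSet table so that it is self-contained; the table definition, the function name find_merge_dataset, and the sample rows are assumptions for illustration, not the loader's code or the real warehouse schema.

```python
import sqlite3
from typing import Optional


def find_merge_dataset(conn: sqlite3.Connection, name: str = "BioCyc") -> Optional[int]:
    """Return the largest WID among datasets with this name, or None if there is none."""
    row = conn.execute(
        "SELECT MAX(WID) FROM DataSet WHERE Name = ?", (name,)
    ).fetchone()
    return row[0]  # MAX over zero rows yields NULL, which sqlite3 returns as None


# Toy stand-in for the warehouse DataSet table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO DataSet (WID, Name) VALUES (?, ?)",
    [(3, "BioCyc"), (7, "BioCyc"), (9, "SWISS-PROT")],
)

merge_wid = find_merge_dataset(conn)                      # two "BioCyc" rows exist
missing_wid = find_merge_dataset(conn, "No-Such-Dataset")  # no matching row
```

    A result of None corresponds to the case in which the loader issues a warning and creates the dataset.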

    If the -m command line option is not used, the loader creates three datasets, one for each of the three subontologies it loads. One row is added to the DataSet table for each dataset. The DataSet.WID of each row is used as the DataSetWID for all objects of the corresponding subontology that the loader adds to the Warehouse:

    Column values for Dataset row

    Column              Value assigned by the loader
    WID                 A small integer that uniquely identifies this dataset in the warehouse.
    Name                'MetaCyc Chemical Compound Ontology' or
                        'MultiFun Gene Ontology' or
                        'MetaCyc Pathway Ontology'
    Version             Major version of MetaCyc that was loaded.
    ReleaseDate         The date that this version of MetaCyc was released.
    LoadDate            The date and time the loader was run.
    ChangeDate          The date and time the loader completed; NULL if the loader did not complete successfully.
    LoadedBy            The value of the system environment variable USER for the account running the loader.
    Application         'MetaCyc Ontology Loader'
    ApplicationVersion  4.6
    HomeURL             http://www.biocyc.org
    QueryURL            http://www.biocyc.org:1555


    Translation Semantics for MetaCyc Ontology Objects

    This section describes the semantic mapping from the objects comprising the MetaCyc Ontology knowledge base, as represented in its flat files, to their representation in the BioWarehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. In the most typical case the attribute is simply copied into a warehouse column; if the translation is more complex, an explanation is given. Any attributes not listed are ignored.

    Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant. If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.

    Classes

    Classes are input from the file classes.dat. It contains a description of the class system used to represent MetaCyc data. The classes of interest to the loader are the hierarchies rooted at the following classes:

  • Compounds-and-Elements
  • All-Genes
  • Pathways

    Only these three classes, and the classes that have one of them as an ancestor as specified by the TYPES attribute, are loaded by the loader; all others are ignored. A row is added to the Term table for each such entry in classes.dat. Since each term is part of a hierarchy, Term.Hierarchy is 'T' for all entries. In addition, a row is added to the TermRelationship table for every entry except the three roots to describe the superclass of the entry. Each entry has exactly one superclass except the roots, which have no superclass.

    In the input file, classes are sorted in hierarchical order. The loader exploits this ordering to avoid multiple passes over the data.
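
    The single-pass selection can be sketched as follows. The sketch assumes that "hierarchical order" means every class appears after its superclass, so that a class can be accepted or rejected the moment it is read. The function name select_terms and the sample class names other than the three roots are hypothetical; this illustrates the idea and is not the loader's implementation.

```python
from typing import Dict, Iterable, Optional, Tuple

# The three root UNIQUE-ID values named in this document.
ROOTS = ("Compounds-and-Elements", "All-Genes", "Pathways")


def select_terms(entries: Iterable[Tuple[str, Optional[str]]]) -> Dict[str, str]:
    """Map each loaded class to the root of its hierarchy, in one pass.

    entries yields (unique_id, superclass_id) pairs, with every superclass
    assumed to appear before its subclasses.
    """
    root_of: Dict[str, str] = {}
    for unique_id, parent in entries:
        if unique_id in ROOTS:
            root_of[unique_id] = unique_id        # a root starts its own hierarchy
        elif parent in root_of:
            root_of[unique_id] = root_of[parent]  # inherit the root of the superclass
        # otherwise the class lies outside the three hierarchies and is ignored
    return root_of


# Hypothetical entries, already in hierarchical order.
entries = [
    ("Pathways", None),
    ("Biosynthesis", "Pathways"),
    ("Amino-Acid-Biosynthesis", "Biosynthesis"),
    ("Proteins", "Macromolecules"),
    ("All-Genes", None),
    ("Unclassified-Genes", "All-Genes"),
]
loaded = select_terms(entries)
```

    Recording the root alongside each accepted class also identifies the subontology the term belongs to, and therefore the dataset it should be assigned to when three datasets are created.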


    Translation semantics for classes.dat

    BioCyc Attribute  Warehouse Semantics
    COMMENT[1]        The first comment is considered the defining comment and is stored in Term.Definition.
    COMMENT[2+]       Typically only one comment per entry is present, but if more are present, they are stored as comments:
                      CommentTable.Comm;
                      CommentTable.OtherWID is the WID of this Term object.
    COMMON-NAME       Term.Name; if this attribute is missing, the UNIQUE-ID attribute is substituted.
    SYNONYMS[*]       SynonymTable.Syn;
                      SynonymTable.OtherWID is the WID of this Term object.
    TYPES             One row is added to TermRelationship:
                      TermRelationship.RelatedTermWID is the WID of the Term object whose UNIQUE-ID matches this value.
                      TermRelationship.TermWID is the WID of this Term object.
                      TermRelationship.Relationship is 'superclass'.
                      Note: TYPES contains only one superclass, despite the plurality of its name.
    UNIQUE-ID         DBID.XID;
                      DBID.OtherWID is the WID of this Term object.
                      If it is 'Compounds-and-Elements', 'All-Genes', or 'Pathways',
                      Term.Root is 'T', else Term.Root is 'F'.
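
    As an illustration of the classes.dat semantics above, the following Python sketch turns one parsed entry into the rows it would contribute to each warehouse table. Table and column names are taken from this document; representing rows as dicts, the function name translate_class, the WID values, and the sample entry are assumptions of the sketch rather than the loader's actual code.

```python
from typing import Dict, List

ROOTS = {"Compounds-and-Elements", "All-Genes", "Pathways"}


def translate_class(entry: Dict[str, List[str]], wid: int,
                    wid_by_unique_id: Dict[str, int]) -> Dict[str, List[dict]]:
    """Build the warehouse rows for one classes.dat entry (attribute -> values)."""
    unique_id = entry["UNIQUE-ID"][0]
    comments = entry.get("COMMENT", [])
    rows: Dict[str, List[dict]] = {
        "Term": [{
            "WID": wid,
            # COMMON-NAME falls back to UNIQUE-ID when it is missing
            "Name": entry.get("COMMON-NAME", [unique_id])[0],
            # COMMENT[1] is the defining comment
            "Definition": comments[0] if comments else None,
            "Hierarchy": "T",
            "Root": "T" if unique_id in ROOTS else "F",
        }],
        "DBID": [{"OtherWID": wid, "XID": unique_id}],
        # COMMENT[2+] become ordinary comments
        "CommentTable": [{"OtherWID": wid, "Comm": c} for c in comments[1:]],
        # SYNONYMS[*]: one row per occurrence
        "SynonymTable": [{"OtherWID": wid, "Syn": s}
                         for s in entry.get("SYNONYMS", [])],
        "TermRelationship": [],
    }
    if unique_id not in ROOTS:
        parent = entry["TYPES"][0]  # TYPES holds exactly one superclass
        rows["TermRelationship"].append({
            "TermWID": wid,
            "RelatedTermWID": wid_by_unique_id[parent],
            "Relationship": "superclass",
        })
    return rows


# Hypothetical entry and WIDs, for illustration only.
entry = {
    "UNIQUE-ID": ["Biosynthesis"],
    "TYPES": ["Pathways"],
    "COMMENT": ["Defining text.", "Extra note."],
    "SYNONYMS": ["Anabolism"],
}
rows = translate_class(entry, wid=101, wid_by_unique_id={"Pathways": 100})
```

    Because superclasses precede their subclasses in the input, the lookup of the superclass WID can rely on a mapping filled in as earlier entries are processed.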

    Entry Table

    For each object loaded from the database, a row in the Entry table is created as follows:

    Column values for Entry row

    Column        Value assigned by the loader
    OtherWID      The Term WID of the entry described by this row.
    InsertDate    The date and time the loader was run.
    CreationDate  NULL
    ModifiedDate  NULL
    LineNumber    The line number in the input file on which this entry began.
    LoadError     'T' if a parse error is detected, 'F' otherwise.
    DatasetWID    The DataSet.WID value assigned to the dataset containing the subontology this term belongs to.
                  Three distinct datasets are created by the loader.


    References

  • Building and Running the loader
  • BioSPICE Web Site
  • BioCyc.org
  • Pathway/Genome Databases (PGDBs)
  • Schema documentation