NCBI Taxonomy Loader for Bio-SPICE Warehouse

Version 4.6


(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Introduction
Installation and Building
Obtaining input data
Running the Loader
Dataset Specification
Translation Semantics for NCBI Taxonomy Objects
  • Division
  • Genetic Code
  • Taxon
  • Entry Table
References

    Introduction

    This document describes version 4.6 of the NCBI Taxonomy Loader, one of several database loaders that make up the Bio-SPICE Warehouse. The NCBI Taxonomy database contains information about taxa, which the loader stores in the Bio-SPICE Warehouse, a relational database that provides a common representation for diverse bioinformatics databases.


    Installation and Building

    See NCBI Loader installation instructions for details on installing and building the loader.

    Input data

    The latest supported data version for the NCBI Taxonomy loader is listed in the loader summary table. The NCBI Taxonomy database can be downloaded from the NCBI Taxonomy downloads site.

    The textual representation of the database consists of several ASCII files. The loader reads the following files, in this order:

    1. division.dmp
    2. gencode.dmp
    3. names.dmp
    4. nodes.dmp
    Within each file, the data fields are separated from each other by the sequence tab, vertical bar, tab ("tab|tab").
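
    The field layout described above can be read with a few lines of Python. This is a minimal sketch, not the loader's own parser: the function names `parse_dmp_line` and `read_dmp` are hypothetical, and the sketch assumes that each row also ends with a trailing tab and vertical bar, as the NCBI dump files normally do.

```python
from typing import Iterator, List

FIELD_SEP = "\t|\t"  # separator between fields ("tab|tab")
ROW_END = "\t|"      # assumed terminator at the end of each row


def parse_dmp_line(line: str) -> List[str]:
    """Split one line of a .dmp file into a list of field values."""
    line = line.rstrip("\r\n")
    if line.endswith(ROW_END):
        line = line[: -len(ROW_END)]
    return [field.strip() for field in line.split(FIELD_SEP)]


def read_dmp(path: str) -> Iterator[List[str]]:
    """Yield the parsed fields of every non-blank line in a .dmp file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_dmp_line(line)
```

    For example, a division.dmp line for Bacteria would yield four fields: the division id, the code, the name, and an empty comments field.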

    Running the Loader

    The NCBI Loader installation instructions contain details for running the loader, including its options and a description of its output.


    Dataset Specification

    The loader adds one row to the Dataset table as follows:

    Column values for Dataset row
    Column Value assigned by NCBI Taxonomy loader
    WID The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse.
    Name 'NCBI Taxonomy'
    LoadDate The time/date the loader was run.
    Version This is by convention the download date of the data, as NCBI Taxonomy changes continually and is not versioned.
    ReleaseDate This is by convention the download date of the data, as NCBI Taxonomy changes continually and is not versioned.
    ChangeDate The date and time the loader completed; NULL if the loader did not complete successfully.
    LoadedBy The value of the system environment variable USER for the account running the loader.
    Application 'NCBI Taxonomy Loader'
    ApplicationVersion 4.6
    HomeURL http://ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy


    Translation Semantics for NCBI Taxonomy database

    This section describes the semantic mapping between the data in the NCBI Taxonomy database and the Bio-SPICE Warehouse. Semantics are expressed in tabular form, showing how each source attribute maps to the warehouse Table.Column values computed from it. In the most typical case the attribute is simply copied into a warehouse column; where the translation is more complex, an explanation is given.

    If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.
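
    The rule for missing attributes can be sketched as follows. This is an illustration only: the function name `value_for_column` and its parameters are hypothetical, and the sketch assumes that NULL (Python `None`) is returned after the warning for a required column, which the text above does not specify.

```python
import warnings
from typing import Optional


def value_for_column(raw: Optional[str], column: str, not_null: bool) -> Optional[str]:
    """Return the value to store, or None (NULL) if the attribute is missing.

    A warning is issued when the missing attribute belongs to a column
    that the warehouse schema declares NOT NULL.
    """
    value = raw.strip() if raw is not None else ""
    if value:
        return value
    if not_null:
        warnings.warn(f"missing required attribute for {column}")
    return None  # assumption: NULL is returned in both missing cases
```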

    Division
    Genetic Codes
    Taxon
    Entry Table

    Division

    Divisions are input from the file division.dmp. A row is added to the Division table for each entry in division.dmp. Each row holds a WID that uniquely identifies the Division, together with the WID of the NCBI Taxonomy dataset.


    Translation semantics for division.dmp
    NCBI Taxonomy Database Attribute Warehouse Semantics
    Division Id DBID.XID
    DBID.OtherWID is the WID of this Division object.

    Division Cde Division.Code
    The value specifies the Division code, which is a three-letter abbreviation for the Division. For example: BCT for Bacteria.

    Division Name Division.Name
    The name of this Division.

    Comments CommentTable.Comm
    CommentTable.OtherWID is the WID of this Division object.

    Linking Tables

    No linking table rows are added for Division.

    Genetic Codes

    Genetic Codes are input from the file gencode.dmp. A row is added to the GeneticCode table for each entry in gencode.dmp. Each row holds a WID that uniquely identifies the genetic code, together with the WID of the NCBI Taxonomy dataset.


    Translation semantics for gencode.dmp
    NCBI Taxonomy Database Attribute Warehouse Semantics
    Genetic code id GeneticCode.NCBIID and
    DBID.XID; DBID.OtherWID is the WID of this GeneticCode object.

    Abbreviation Ignored
    Name GeneticCode.Name
    If the name field contains multiple names (separated by semicolons), they are stored in the SynonymTable.
    SynonymTable.Syn stores the synonym and SynonymTable.OtherWID is the WID of this GeneticCode object.

    Cde GeneticCode.TranslationTable
    Starts GeneticCode.StartCodon
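
    The handling of a multi-valued name field can be sketched as below. The function name `split_genetic_code_names` is hypothetical, and the sketch assumes that the first name in the field becomes GeneticCode.Name while the remaining names become SynonymTable.Syn values; the table above does not state which name is treated as primary.

```python
from typing import List, Tuple


def split_genetic_code_names(name_field: str) -> Tuple[str, List[str]]:
    """Split a semicolon-separated name field into (primary name, synonyms)."""
    names = [part.strip() for part in name_field.split(";") if part.strip()]
    if not names:
        return "", []
    # Assumption: the first name is the primary one, the rest are synonyms.
    return names[0], names[1:]
```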

    Linking Tables

    No linking table rows are added for Genetic Code.

    Taxon

    Taxon is input from the files names.dmp and nodes.dmp. The names.dmp file has multiple entries per tax id, because the same taxon can be referred to by different names. For example, tax id 9606 has three entries corresponding to human beings: "human", "Homo sapiens" and "man". The name class identifies whether a name_txt is a scientific name, a common name, a GenBank name, a synonym or something else.

    For each tax id, a row is created in the Taxon table and one name (either the name_txt or the unique name) is stored in it. All other names for that tax id are added to the SynonymTable. The following priority rules decide which name is stored in the Taxon table (in order of decreasing priority):
    1. Scientific unique name
    2. Scientific name_txt
    3. Non-scientific unique name
    4. Non-scientific name_txt
    For each tax id, a unique WID is created and stored in the Taxon table with the preferred name. An entry is also created in the Entry table for this Taxon.
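
    The four priority rules above can be expressed as a small ranking function. This is a sketch under assumptions, not the loader's code: `NameRecord` and `choose_names` are hypothetical names, the name class of a scientific name is assumed to be the literal string "scientific name", and ties are assumed to be broken by file order.

```python
from typing import List, NamedTuple, Tuple


class NameRecord(NamedTuple):
    name_txt: str
    unique_name: str
    name_class: str


def choose_names(records: List[NameRecord]) -> Tuple[str, List[str]]:
    """Return (preferred name, synonyms) for all names.dmp rows of one tax id."""
    ranked: List[Tuple[int, str]] = []
    for record in records:
        scientific = record.name_class == "scientific name"
        if record.unique_name:
            ranked.append((1 if scientific else 3, record.unique_name))
        if record.name_txt:
            ranked.append((2 if scientific else 4, record.name_txt))
    if not ranked:
        raise ValueError("no names available for this tax id")
    ranked.sort(key=lambda pair: pair[0])  # stable: file order kept among ties
    preferred = ranked[0][1]
    synonyms: List[str] = []
    for _, name in ranked[1:]:
        if name != preferred and name not in synonyms:
            synonyms.append(name)
    return preferred, synonyms
```

    With the three entries for tax id 9606 mentioned above, the scientific name "Homo sapiens" is preferred and "human" and "man" become synonyms.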


    Translation semantics for names.dmp
    NCBI Taxonomy Database Attribute Warehouse Semantics
    tax_id DBID.XID
    DBID.OtherWID is the WID of this Taxon object.

    name_txt Taxon.Name or SynonymTable.Syn, depending on the priority of this name relative to the other names for the same tax id.
    If the name is enclosed in single quotes, the quotes are removed.
    If a name other than 'environmental sample' ends with a qualifying name in angle brackets (e.g., 'Bacteria <Bacteria>'), the text in the angle brackets and any preceding spaces are removed.

    unique name Taxon.Name or SynonymTable.Syn, depending on the priority of this name relative to the other names for the same tax id.
    If the name is enclosed in single quotes, the quotes are removed.
    If a name other than 'environmental sample' ends with a qualifying name in angle brackets (e.g., 'Bacteria <Bacteria>'), the text in the angle brackets and any preceding spaces are removed.

    name class Helps determine the priority of this name. Depending on this priority and on the other name classes available for the same tax id, the name is placed in either Taxon.Name or SynonymTable.Syn.
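
    The two clean-up rules for name_txt and unique name can be sketched with a regular expression. The function name `clean_name` is hypothetical, and the sketch interprets the 'environmental sample' exception as applying to any name that begins with that text; the loader's exact test may differ.

```python
import re

# A trailing qualifier in angle brackets, plus any spaces before it.
_QUALIFIER = re.compile(r"\s*<[^<>]*>$")


def clean_name(name: str) -> str:
    """Strip enclosing single quotes and a trailing <qualifier> from a name."""
    name = name.strip()
    if len(name) >= 2 and name.startswith("'") and name.endswith("'"):
        name = name[1:-1]
    # Assumption: names starting with 'environmental sample' keep the qualifier.
    if not name.lower().startswith("environmental sample"):
        name = _QUALIFIER.sub("", name)
    return name
```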


    Translation semantics for nodes.dmp
    NCBI Taxonomy Database Attribute Warehouse Semantics
    tax_id Used to look up the corresponding WID for the Taxon and update the Taxon row that was added while parsing names.dmp. If this is null, an error is raised. This is based on the assumption that names.dmp contains all tax ids, so a WID for every Taxon should already have been created.
    parent tax_id Taxon.Parent_WID. If this is null, an error is raised.
    rank Taxon.Rank. If this rank does not exist in the Enumeration table, an error is raised.
    embl code Ignored
    division id Taxon.Division_WID. The corresponding WID for the Division is found from the division id and stored. If the division id is not found, an error is raised.
    inherited div flag Taxon.Inherited_Div_Flag. 'T' or 'F' based on whether the inherited div flag is 1 or 0.
    genetic code id Taxon.Gencode_WID. The corresponding WID for the Genetic Code is found from the genetic code id and stored. If the genetic code id is not found, an error is raised.
    inherited GC flag Taxon.Inherited_GC_Flag. 'T' or 'F' based on whether the inherited GC flag is 1 or 0.
    mitochondrial genetic code id Taxon.MC_Gencode_WID. The corresponding WID for the Genetic Code is found from the mitochondrial genetic code id and stored. If the mitochondrial genetic code id is not found, an error is raised.
    inherited MGC flag Taxon.Inherited_MCGC_Flag. 'T' or 'F' based on whether the inherited MGC flag is 1 or 0.
    GenBank hidden flag Ignored
    hidden subtree root flag Ignored
    comments CommentTable.Comm with CommentTable.OtherWID equal to the WID of this object.
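
    Two recurring operations in the nodes.dmp table are the flag conversion and the id-to-WID lookup. The sketch below illustrates both. `LoadError`, `flag_to_tf` and `lookup_wid` are hypothetical names, and raising an error for a flag value other than 1 or 0 is an assumption, since the table only describes those two values.

```python
from typing import Dict


class LoadError(Exception):
    """Raised when a nodes.dmp value cannot be translated."""


def flag_to_tf(raw: str) -> str:
    """Map an inherited flag of 1 or 0 to 'T' or 'F'."""
    if raw == "1":
        return "T"
    if raw == "0":
        return "F"
    raise LoadError(f"unexpected flag value: {raw!r}")  # assumption


def lookup_wid(source_id: str, wid_by_id: Dict[str, int], what: str) -> int:
    """Find the WID created earlier for a tax id, division id or genetic code id."""
    if not source_id:
        raise LoadError(f"{what} is null")
    try:
        return wid_by_id[source_id]
    except KeyError:
        raise LoadError(f"{what} {source_id} not found") from None
```

    In this sketch, the dictionaries passed as `wid_by_id` would be filled while loading division.dmp, gencode.dmp and names.dmp, which is why those files are read before nodes.dmp.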

    Linking Tables

    No linking table rows are added for Taxon.

    Entry Table

    For each object loaded from the database, a row in the Entry table is created as follows:

    Column values for Entry row
    Column Value assigned by NCBI loader
    OtherWID The WID of the entry described by this row. The entry may be in Division, GeneticCode or Taxon.

    InsertDate The time/date the loader was run.
    LoadError 'T' if a parse error is detected, 'F' otherwise.
    LineNumber The line number at which the error was detected.
    DatasetWID The value Dataset.WID assigned to the dataset being loaded.
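
    The Entry row described above can be modelled as a simple record. This is only a sketch: `EntryRow` and `make_entry_row` are hypothetical names, and storing NULL (Python `None`) in LineNumber when no error occurred is an assumption that the table does not confirm.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class EntryRow:
    other_wid: int              # WID of the Division, GeneticCode or Taxon
    insert_date: datetime       # time/date the loader was run
    load_error: str             # 'T' if a parse error was detected, else 'F'
    line_number: Optional[int]  # line of the error; None assumed otherwise
    dataset_wid: int            # Dataset.WID of the dataset being loaded


def make_entry_row(object_wid: int, dataset_wid: int, run_started: datetime,
                   error_line: Optional[int] = None) -> EntryRow:
    """Build the Entry row for one loaded object."""
    return EntryRow(
        other_wid=object_wid,
        insert_date=run_started,
        load_error="T" if error_line is not None else "F",
        line_number=error_line,
        dataset_wid=dataset_wid,
    )
```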

    References

  • Bio-SPICE Community Web Site
  • NCBI Taxonomy Home Page
  • NCBI Home Page
  • Online schema documentation