ECO2DBASE Loader for Bio-SPICE Warehouse

Version 4.6


(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Introduction
Translation Semantics for eco2dbase Tables
  • Table gene
  • Table prot
  • Table spot
  • Table located_on_gel
  • Table method
  • Table spot_method
  • Table rr_map_rows
  • References

    Introduction

    This document describes version 4.6 of the eco2dbase loader. It is one of several database loaders comprising the Bio-SPICE Warehouse. For more information regarding the eco2dbase Loader, see the eco2dbase manual.

    Overview of Bio-SPICE Warehouse Schema

    The Bio-SPICE warehouse schema contains the data definition statements for the Bio-SPICE Warehouse. These include four different types of tables - constant tables, object tables, linking tables, and special tables.

    Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

    Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

    It is very important to note, that WIDs have been set aside and reserved the eco2dbase data. For the eco2dbase DataSet, the reserved WID is 1001 or larger. For the rest of the eco2dbase data, the reserved WIDs are from 0 to 999,999. WIDs were reserved for the eco2dbase dataset to ensure that there are no conflicts (overlapping WIDs) with other datasets in the eco2dbase dataset. This is necesary to do because BIO-SPICE Warehouse users do not run the full eco2dbase loader. Users instead load database dumps into their BIO-SPICE Warehouse schema. Because their schema may already have other data in there it is necessary to make sure there is no overlap with the WIDs.

    A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse, i.e., the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

    A linking table describes relationships among objects. They contain WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and crossreference information and to facilitate lookup of objects.

    Full schema information, including source files and browseable documentation, is available with this distribution.


    Translation Semantics for eco2dbase Tables

    This section describes the semantic mapping between the tables comprising the eco2dbase database and to its representation in the BioSpice data warehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. The most typical case is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Any attributes not listed are ignored.

    Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.

    If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored. Most semantics are expressed in tabular form, showing the mapping of each input attribute to the warehouse Table.Column values computed from it. The most typical semantics is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Some attributes are ignored.

    Comments that are NULL or are "-" are ignored.


    Table gene

    The gene table contains a description of each gene present in eco2dbase. A row is added to the Gene table for each entry in gene.

    There is no nucleic acid information in the eco2dbase database. Not every spot or protein is linked to a gene. The gene table contains the set of all known genes at the time, so that users can make their own identifications, and because users might be interested in genes not identified in this project.


    Translation semantics for gene
    eco2dbase Attribute Warehouse Semantics
    GENE_NAME Gene.Name; every gene contains a name and each name is distinct.
    L_END Gene.CodingRegionStart
    R_END Gene.CodingRegionEnd
    GENE_DIRECTION Gene.Direction
    B_NUM SynonymTable.Syn
    GENE_NOTE CommentTable.Comm

    Linking Tables

    No rows to linking tables are added when this file is loaded. However, various linking table rows are added when subsequent tables are loaded.

    Table prot

    The prot table contains a description of each protein present in eco2dbase. A row is added to the Protein table for each entry in prot.

    Not every protein is linked to a gene, and not every spot is linked to a protein. The protein table contains the set of all known proteins at the time, so that users can make their own identifications, and because users might be interested in genes not identified in this project.


    When querying the protein data from eco2dbase, the organism_id=1 is specified to ensure we get e.coli proteins only.


    Translation semantics for prot
    eco2dbase Attribute Warehouse Semantics
    PROTEIN_NAME Protein.name; the protein name is the protein definition. This is sometimes null,
    and some proteins map to the same gene, and have the same name.

    CALC_MW Protein.MolecularWeightCalc; calculated from DNA sequence
    CALC_PI Protein.PICalc; calculated from DNA sequence

    Linking Tables

    if gene_id in the eco2dbase schema is not null, a row is added to GeneWIDProteinWID to associate the protein to the gene.

    Table spot

    The spot table contains information for each spot. A spot entry in this table records a single spot that can occur across multiple gels. Some spots are linked to all gels, meaning that have been run and identified on all possible gel conditions. Also, the spot ll of which should be run under may have been run under a variety of different growth conditions.

    Naming convention: Alphanumeric IDs and RRM (or RR_MAP) are the two types of IDs for spots in the eco2dbase database. For alphanumeric IDs, an example spot name is A008.0. The A indicates a certain pI range, and the number indicates a MolWt range (see picture on p2069 of Neidhardt). Alphanumeric IDs were intended to be for proteins that have been identified as a product of a particular gene, but not all spots that been linked to a known protein have an alphanumeric ID, and not all spots with an alphanumeric ID are linked to a known protein. At some point, eco2dbase shifted from using the Alphanumeric spot names to using the RRM numbers as spot names. This was done to help improve redundancy issues, but there are still redundancy issues. RRM IDs were intended to be for proteins from the Response/Regulation Map (RRM) Project that have been observed but still await identification.

    Ultimately, some spots have alphanumeric IDs, some have RRM IDs, and some spots have both. Using the combination of alphanumeric and RRM gives a distinct name for the spot (this is important to note, since while alphanumeric ID are distinct,one RRM ID, R2463, is duplicated). We will use the WID to ensure a distinct name, and will store the alphanumeric and RRM IDs in the BIO-SPICE Warehouse schema.

    Also, it should be noted that every spot has a location, and that most (not all) spots with a RRM ID have regulation data.


    Translation semantics for spot
    eco2dbase Attribute Warehouse Semantics
    ALPHANUMERIC Spot.Name; the alphanumeric ID will be used as the primary name in the BIO-SPICE Warehouse schema
    An example spot name is A008.0. The A indicates a certain pI range, and the number indicates a MolWt range (see picture on p2069).

    RRM_NUMBER DBID.XID; not every spot has a RRM number.
    EST_MW Spot.MolecularWeightEst; Molecular weight estimated from migration on
    reference 2D gel.

    EST_PI Spot.PIEst; Isoelectric point (pI) estimated from
    migration on reference 2D gel.

    Linking Tables

    A row is added to ProteinWidSpotWid to associate the spot to a protein, for cases where the spot has been mapped to a known protein.

    located_on_gel

    The located_on_gel table maps spots to the one or more gels they are found on. The Gel_Number indicates the "reference gel" from the Neidhardt book (p2067). Each "reference gel" provides a reference set of growth conditions and gel conditions. Additional gels are then run to compare other growth conditions to that reference condition, using the same gel conditions.

    Spot locs without a gel number (1 record), and spots on an invalid gel number (two spots are on 4B), and spots without an X or Y coordinate will not be copied to the GelLocation table of the biowarehouse schema. Also, there is one spot (spot_id 830) that is the located_on_gel table, but not in the spot table. This will not be added to biowarehouse database.


    Translation semantics for located_on_gel
    eco2dbase Attribute Warehouse Semantics
    REF_GEL GelLocation.refGel; Y or N indicates whether the gel used
    to define the coordinates for this spot to determine MW and pI.

    X_COORD GelLocation.Xcoord; X coordinate on the gel
    Y_COORD GelLocation.Ycoord; Y coordinate on the gel
    GEL_NUMBER For each gel, a record will be created in the experiment table to capture about the conditions of each gel.
    GelLocation.ExperimentWID will refer to the WID of the gel (which is stored in the experiment table.

    Experiment.Description will contain the gel conditions pulled from captions of Figures 1, 2A, 2B, 3A,
    3B, 4. These descriptions are also at:
    ftp://ftp.ncbi.nlm.nih.gov/repository/ECO2DBASE/edition6/image.info

    Gels 1, 2A, 2B, and 4 were run with E. coli K-12, strain W3110; Gels 3A/3B were run with E.coli B/r, strain NC3
    Experiment.BioSourceWID will link to a record in the biosource table so that this information is captured.

    Experiment.GroupSize will equal 0, since there are no child experiments
    DBID.XID will storeteh gel number and be linked to the Experiment table.
    Experiment.Type will contain "2D Gel."

    Spot_ID GelLocation.SpotWID; will be used to link to a record in the spot table

    Linking Tables

    No linking table rows are added.

    Table method

    The method describes how a spot was identified as a particular protein. Purified protein, co-migration, genetic criteria (mutant, deletion), deduced from sequence info, physiological.


    Translation semantics for method
    eco2dbase Attribute Warehouse Semantics
    METHOD_ABBR SpotIdMethod.MethodAbbrev; A short, one letter abbreviation for the method name.
    METHOD_NAME SpotIdMethod.MethodNameD; The name of the method.
    METHOD_DESC SpotIdMethod.MethodDesc; a more detailed description of the method

    Linking Tables

    No linking table rows are added.

    Table spot_method

    The spot_method links spots to the methods by which they were identified as a particular protein. This table is necessary to support many-to-many mapping, since a spot can be identified by multiple methods.


    Translation semantics for spot_method
    eco2dbase Attribute Warehouse Semantics
    SPOT_ID SpotWIDSpotIdMethodWID.SpotWID
    METHOD_ID SpotWIDSpotIdMethodWID.SpotIdMethodWID

    Linking Tables

    No linking table rows are added.

    Table rr_map_rows

    The rr_map_rows table indicates the quantity of protein for a given spot grown under a particular set of growth (or stress) conditions.

    99999999 for the protein level is equivalent to "Y" (yes, the spot is present) for results that were qualitativey analyzed.

    One spot has a null prot_level, so these rr_map_rows will not be loaded into biowarehouse. Not all spots have records in this table.


    Translation semantics for rr_map_rows
    eco2dbase Attribute Warehouse Semantics
    ALPHANUMERIC and RRM_NUMBER ExperimentData.OtherWID; used to link to spot record. The combination of these two IDs
    allows you to link to a unique Spot ID in the spot table of the eco2dbase schema.

    PROT_LEVEL ExperimentData.Data is the amount of protein seen under those growth conditions.
    ExperimentData.Kind will be "O" for observed.
    ExperimentData.Role will be "Amount of protein seen under the growth conditions."

    TREATMENT The protein level will be stored in the ExperimentData table, and will link to the Experiment table,
    which will store the information about the growth conditions.

    Experiment.Description will contain the treatment (growth conditions)
    Experiment.Type will contain the text "Growth Conditions for 2D Gel Experiment."
    Experiment.GroupSize will equal 0, since there are no child experiments

    Linking Tables

    Rows will be added to the ExperimentRelationship table to associate each growth conditions to each of the gel conditions. ExperimentRelationship.ExperimentWID will be the growth conditions WID from the Experiment table; ExperimentRelationship.RelatedExperimentWID will be the gel conditions WID from the Experment table.

    References

  • eco2dbase at NCBI ftp site
  • Chapter 115 in the second edition of the book "Escherichia coli and Salmonella" by Neidhardt.