ECO2DBASE Loader for Bio-SPICE Warehouse

Version 4.6

(C) 2006 SRI
International. All Rights Reserved.  See BioWarehouse
Overview for license details.

Introduction
Translation Semantics for eco2dbase Tables

Table gene

References

Introduction

This document describes version 4.6 of the eco2dbase loader. It is one of several database loaders comprising the Bio-SPICE Warehouse. For more information regarding the eco2dbase Loader, see the eco2dbase manual.

Overview of Bio-SPICE Warehouse Schema

The Bio-SPICE warehouse schema contains the data definition statements for the Bio-SPICE Warehouse. These include four different types of tables - constant tables, object tables, linking tables, and special tables.

Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

It is very important to note, that WIDs have been set aside and reserved the eco2dbase data. For the eco2dbase DataSet, the reserved WID is 1001 or larger. For the rest of the eco2dbase data, the reserved WIDs are from 0 to 999,999. WIDs were reserved for the eco2dbase dataset to ensure that there are no conflicts (overlapping WIDs) with other datasets in the eco2dbase dataset. This is necesary to do because BIO-SPICE Warehouse users do not run the full eco2dbase loader. Users instead load database dumps into their BIO-SPICE Warehouse schema. Because their schema may already have other data in there it is necessary to make sure there is no overlap with the WIDs.

A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse, i.e., the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

A linking table describes relationships among objects. They contain WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and crossreference information and to facilitate lookup of objects.

Full schema information, including source files and browseable documentation, is available with this distribution.

Translation Semantics for eco2dbase Tables

This section describes the semantic mapping between the tables comprising the eco2dbase database and to its representation in the BioSpice data warehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. The most typical case is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Any attributes not listed are ignored.

Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.

If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored. Most semantics are expressed in tabular form, showing the mapping of each input attribute to the warehouse Table.Column values computed from it. The most typical semantics is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Some attributes are ignored.

Comments that are NULL or are "-" are ignored.

Table gene

The gene table contains a description of each gene present in eco2dbase. A row is added to the Gene table for each entry in gene.

There is no nucleic acid information in the eco2dbase database. Not every spot or protein is linked to a gene. The gene table contains the set of all known genes at the time, so that users can make their own identifications, and because users might be interested in genes not identified in this project.

Translation semantics for gene

eco2dbase Attribute Warehouse Semantics

GENE_NAME Gene.Name; every gene contains a name and each name is distinct.

L_END Gene.CodingRegionStart

R_END Gene.CodingRegionEnd

GENE_DIRECTION Gene.Direction

B_NUM SynonymTable.Syn

GENE_NOTE CommentTable.Comm

**Translation semantics for `gene`**
eco2dbase Attribute	Warehouse Semantics
GENE_NAME	`Gene.Name`; every gene contains a name and each name is distinct.
L_END	`Gene.CodingRegionStart`
R_END	`Gene.CodingRegionEnd`
GENE_DIRECTION	`Gene.Direction`
B_NUM	`SynonymTable.Syn`
GENE_NOTE	`CommentTable.Comm`

Linking Tables

No rows to linking tables are added when this file is loaded. However, various linking table rows are added when subsequent tables are loaded.

Table prot

The prot table contains a description of each protein present in eco2dbase. A row is added to the Protein table for each entry in prot.

Not every protein is linked to a gene, and not every spot is linked to a protein. The protein table contains the set of all known proteins at the time, so that users can make their own identifications, and because users might be interested in genes not identified in this project.

When querying the protein data from eco2dbase, the organism_id=1 is specified to ensure we get e.coli proteins only.

Translation semantics for prot

eco2dbase Attribute Warehouse Semantics

PROTEIN_NAME Protein.name; the protein name is the protein definition. This is sometimes null,
and some proteins map to the same gene, and have the same name.

CALC_MW Protein.MolecularWeightCalc; calculated from DNA sequence

CALC_PI Protein.PICalc; calculated from DNA sequence

**Translation semantics for `prot`**
eco2dbase Attribute	Warehouse Semantics
PROTEIN_NAME	`Protein.name`; the protein name is the protein definition. This is sometimes null, and some proteins map to the same gene, and have the same name.
CALC_MW	`Protein.MolecularWeightCalc`; calculated from DNA sequence
CALC_PI	`Protein.PICalc`; calculated from DNA sequence

Linking Tables

if gene_id in the eco2dbase schema is not null, a row is added to GeneWIDProteinWID to associate the protein to the gene.

Table spot

The spot table contains information for each spot. A spot entry in this table records a single spot that can occur across multiple gels. Some spots are linked to all gels, meaning that have been run and identified on all possible gel conditions. Also, the spot ll of which should be run under may have been run under a variety of different growth conditions.

Naming convention: Alphanumeric IDs and RRM (or RR_MAP) are the two types of IDs for spots in the eco2dbase database. For alphanumeric IDs, an example spot name is A008.0. The A indicates a certain pI range, and the number indicates a MolWt range (see picture on p2069 of Neidhardt). Alphanumeric IDs were intended to be for proteins that have been identified as a product of a particular gene, but not all spots that been linked to a known protein have an alphanumeric ID, and not all spots with an alphanumeric ID are linked to a known protein. At some point, eco2dbase shifted from using the Alphanumeric spot names to using the RRM numbers as spot names. This was done to help improve redundancy issues, but there are still redundancy issues. RRM IDs were intended to be for proteins from the Response/Regulation Map (RRM) Project that have been observed but still await identification.

Ultimately, some spots have alphanumeric IDs, some have RRM IDs, and some spots have both. Using the combination of alphanumeric and RRM gives a distinct name for the spot (this is important to note, since while alphanumeric ID are distinct,one RRM ID, R2463, is duplicated). We will use the WID to ensure a distinct name, and will store the alphanumeric and RRM IDs in the BIO-SPICE Warehouse schema.

Also, it should be noted that every spot has a location, and that most (not all) spots with a RRM ID have regulation data.

Translation semantics for spot

eco2dbase Attribute Warehouse Semantics

ALPHANUMERIC Spot.Name; the alphanumeric ID will be used as the primary name in the BIO-SPICE Warehouse schema
An example spot name is A008.0. The A indicates a certain pI range, and the number indicates a MolWt range (see picture on p2069).

RRM_NUMBER DBID.XID; not every spot has a RRM number.

EST_MW Spot.MolecularWeightEst; Molecular weight estimated from migration on
reference 2D gel.

EST_PI Spot.PIEst; Isoelectric point (pI) estimated from
migration on reference 2D gel.

**Translation semantics for `spot`**
eco2dbase Attribute	Warehouse Semantics
ALPHANUMERIC	`Spot.Name`; the alphanumeric ID will be used as the primary name in the BIO-SPICE Warehouse schema An example spot name is A008.0. The A indicates a certain pI range, and the number indicates a MolWt range (see picture on p2069).
RRM_NUMBER	`DBID.XID`; not every spot has a RRM number.
EST_MW	`Spot.MolecularWeightEst`; Molecular weight estimated from migration on reference 2D gel.
EST_PI	`Spot.PIEst`; Isoelectric point (pI) estimated from migration on reference 2D gel.

Linking Tables

A row is added to ProteinWidSpotWid to associate the spot to a protein, for cases where the spot has been mapped to a known protein.

located_on_gel

The located_on_gel table maps spots to the one or more gels they are found on. The Gel_Number indicates the "reference gel" from the Neidhardt book (p2067). Each "reference gel" provides a reference set of growth conditions and gel conditions. Additional gels are then run to compare other growth conditions to that reference condition, using the same gel conditions.

Spot locs without a gel number (1 record), and spots on an invalid gel number (two spots are on 4B), and spots without an X or Y coordinate will not be copied to the GelLocation table of the biowarehouse schema. Also, there is one spot (spot_id 830) that is the located_on_gel table, but not in the spot table. This will not be added to biowarehouse database.

Translation semantics for located_on_gel

eco2dbase Attribute Warehouse Semantics

REF_GEL GelLocation.refGel; Y or N indicates whether the gel used
to define the coordinates for this spot to determine MW and pI.

X_COORD GelLocation.Xcoord; X coordinate on the gel

Y_COORD GelLocation.Ycoord; Y coordinate on the gel

GEL_NUMBER For each gel, a record will be created in the experiment table to capture about the conditions of each gel.
GelLocation.ExperimentWID will refer to the WID of the gel (which is stored in the experiment table.

Experiment.Description will contain the gel conditions pulled from captions of Figures 1, 2A, 2B, 3A,
3B, 4. These descriptions are also at:
ftp://ftp.ncbi.nlm.nih.gov/repository/ECO2DBASE/edition6/image.info

Gels 1, 2A, 2B, and 4 were run with E. coli K-12, strain W3110; Gels 3A/3B were run with E.coli B/r, strain NC3
Experiment.BioSourceWID will link to a record in the biosource table so that this information is captured.

Experiment.GroupSize will equal 0, since there are no child experiments
DBID.XID will storeteh gel number and be linked to the Experiment table.
Experiment.Type will contain "2D Gel."

Spot_ID GelLocation.SpotWID; will be used to link to a record in the spot table

**Translation semantics for `located_on_gel`**
eco2dbase Attribute	Warehouse Semantics
REF_GEL	`GelLocation.refGel`; Y or N indicates whether the gel used to define the coordinates for this spot to determine MW and pI.
X_COORD	`GelLocation.Xcoord`; X coordinate on the gel
Y_COORD	`GelLocation.Ycoord`; Y coordinate on the gel
GEL_NUMBER	For each gel, a record will be created in the experiment table to capture about the conditions of each gel. `GelLocation.ExperimentWID` will refer to the WID of the gel (which is stored in the experiment table. `Experiment.Description` will contain the gel conditions pulled from captions of Figures 1, 2A, 2B, 3A, 3B, 4. These descriptions are also at: ftp://ftp.ncbi.nlm.nih.gov/repository/ECO2DBASE/edition6/image.info Gels 1, 2A, 2B, and 4 were run with E. coli K-12, strain W3110; Gels 3A/3B were run with E.coli B/r, strain NC3 `Experiment.BioSourceWID` will link to a record in the biosource table so that this information is captured. `Experiment.GroupSize` will equal 0, since there are no child experiments `DBID.XID` will storeteh gel number and be linked to the Experiment table. `Experiment.Type` will contain "2D Gel."
Spot_ID	`GelLocation.SpotWID`; will be used to link to a record in the spot table

Linking Tables

No linking table rows are added.

Table method

The method describes how a spot was identified as a particular protein. Purified protein, co-migration, genetic criteria (mutant, deletion), deduced from sequence info, physiological.

Translation semantics for method

eco2dbase Attribute Warehouse Semantics

METHOD_ABBR SpotIdMethod.MethodAbbrev; A short, one letter abbreviation for the method name.

METHOD_NAME SpotIdMethod.MethodNameD; The name of the method.

METHOD_DESC SpotIdMethod.MethodDesc; a more detailed description of the method

**Translation semantics for `method`**
eco2dbase Attribute	Warehouse Semantics
METHOD_ABBR	`SpotIdMethod.MethodAbbrev`; A short, one letter abbreviation for the method name.
METHOD_NAME	`SpotIdMethod.MethodNameD`; The name of the method.
METHOD_DESC	`SpotIdMethod.MethodDesc`; a more detailed description of the method

Linking Tables

No linking table rows are added.

Table spot_method

The spot_method links spots to the methods by which they were identified as a particular protein. This table is necessary to support many-to-many mapping, since a spot can be identified by multiple methods.

Translation semantics for spot_method

eco2dbase Attribute Warehouse Semantics

SPOT_ID SpotWIDSpotIdMethodWID.SpotWID

METHOD_ID SpotWIDSpotIdMethodWID.SpotIdMethodWID

**Translation semantics for `spot_method`**
eco2dbase Attribute	Warehouse Semantics
SPOT_ID	`SpotWIDSpotIdMethodWID.SpotWID`
METHOD_ID	`SpotWIDSpotIdMethodWID.SpotIdMethodWID`

Linking Tables

No linking table rows are added.

Table rr_map_rows

The rr_map_rows table indicates the quantity of protein for a given spot grown under a particular set of growth (or stress) conditions.

99999999 for the protein level is equivalent to "Y" (yes, the spot is present) for results that were qualitativey analyzed.

One spot has a null prot_level, so these rr_map_rows will not be loaded into biowarehouse. Not all spots have records in this table.

Translation semantics for rr_map_rows

eco2dbase Attribute Warehouse Semantics

ALPHANUMERIC and RRM_NUMBER ExperimentData.OtherWID; used to link to spot record. The combination of these two IDs
allows you to link to a unique Spot ID in the spot table of the eco2dbase schema.

PROT_LEVEL ExperimentData.Data is the amount of protein seen under those growth conditions.
ExperimentData.Kind will be "O" for observed.
ExperimentData.Role will be "Amount of protein seen under the growth conditions."

TREATMENT The protein level will be stored in the ExperimentData table, and will link to the Experiment table,
which will store the information about the growth conditions.

Experiment.Description will contain the treatment (growth conditions)
Experiment.Type will contain the text "Growth Conditions for 2D Gel Experiment."
Experiment.GroupSize will equal 0, since there are no child experiments

**Translation semantics for `rr_map_rows`**
eco2dbase Attribute	Warehouse Semantics
ALPHANUMERIC and RRM_NUMBER	`ExperimentData.OtherWID`; used to link to spot record. The combination of these two IDs allows you to link to a unique Spot ID in the spot table of the eco2dbase schema.
PROT_LEVEL	`ExperimentData.Data` is the amount of protein seen under those growth conditions. `ExperimentData.Kind` will be "O" for observed. `ExperimentData.Role` will be "Amount of protein seen under the growth conditions."
TREATMENT	The protein level will be stored in the ExperimentData table, and will link to the Experiment table, which will store the information about the growth conditions. `Experiment.Description` will contain the treatment (growth conditions) `Experiment.Type` will contain the text "Growth Conditions for 2D Gel Experiment." `Experiment.GroupSize` will equal 0, since there are no child experiments

Linking Tables

Rows will be added to the ExperimentRelationship table to associate each growth conditions to each of the gel conditions. ExperimentRelationship.ExperimentWID will be the growth conditions WID from the Experiment table; ExperimentRelationship.RelatedExperimentWID will be the gel conditions WID from the Experment table.

References

eco2dbase at NCBI ftp site

Chapter 115 in the second edition of the book "Escherichia coli and Salmonella" by Neidhardt.