CMR Loader for BioWarehouse

Version 4.6

(C) 2005 SRI International. All Rights Reserved. See BioWarehouse Overview for license details.

Introduction
Limitations
Installation and Building
Input data
Loader Dependencies and Prerequisites
Running the Loader
Variants of CMR
Dataset Specification
Sequence Comparison Computation Specification
Translation Semantics for CMR Tables

Table db_data

Table all_vs_all
References

Introduction

This document describes version 4.6 of the Comprehensive Microbial Resource (CMR) Loader. It is one of several database loaders comprising the BioWarehouse. The CMR Loader (referred to simply as the loader) loads a collection of files comprising the Comprehensive Microbial Resource (sometimes referred to as the Omniome database) into the BioWarehouse, an Oracle relational database that provides a common representation for diverse bioinformatics databases. CMR is developed and maintained by The Institute for Genomic Research (TIGR).

Overview of BioWarehouse Schema

The Bio-SPICE warehouse schema contains the data definition statements for the BioWarehouse. These define four types of tables: constant tables, object tables, linking tables, and special tables.

Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; for example, the SWISS-PROT loader adds one row to this table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

Linking tables describe relationships among objects. They contain the WIDs of the associated objects and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.

Full schema information, including source files and browseable documentation, is available with this distribution.

Limitations

The CMR loader should not be run concurrently with other loaders. The loader is likely to fail in this case. This limitation was imposed for performance reasons.

The latest supported data version for the CMR loader is listed in the loader summary table. The loader may not be compatible with future versions of CMR. Due to changes in the CMR flat file format, the loader is not compatible with earlier CMR versions.

MySQL servers impose a limit on the maximum size of an SQL statement. The CMR loader currently assumes that this limit is at least 4,000,000 bytes, and truncates long data elements such as sequences to ensure this limit is not exceeded. This assumed limit is currently not changeable. You must configure your MySQL server so that its max_allowed_packet parameter is at least 4M, as described in Environment Setup (MySQL).

The CMR database contains modification dates for a number of its data types. However, these dates are ignored, rather than stored in Entry.ModifiedDate.

Installation and Building

See CMR installation instructions for details on installing and building the loader.

Input data

Currently, loadable data for CMR is available at the TIGR Omniome Database public FTP site.

The CMR database consists of several tables. Data for each table is in a separate file or group of files. By CMR convention, files are named bcp_tablename. A subset of these are used by the loader. Each file is described in the following table, and is listed in the order in which it is loaded.
NOTE: As of release 3.7, the ident and nt_ident tables have been deprecated in favor of the new_ident table. The former tables are loaded if present, in order to support backward compatibility.

**Input data for CMR Loader**
Table	File name(s)	Loaded when	Record delimiter	Field delimiter
`db_data`	`bcp_db_data`	Always	Newline	Tab
`taxon`	`bcp_taxon`	Always	Newline	Tab
`asmbl_data`	`bcp_asmbl_data`	Always	&&&	@@@
`assembly`	`bcp_assembly`	Always	&&&	@@@
`new_ident`	`bcp_new_ident`	Loading any variant except TIGR (see variant command line option)	&&&	@@@
`nt_ident`	`bcp_nt_ident`	DEPRECATED as of 3.7 in favor of `new_ident`. Loading any variant except TIGR (see variant command line option)	&&&	@@@
`feat_link`	`bcp_feat_link`	Loading PRIMARY variant (see variant command line option)	Newline	Tab
`ident`	`bcp_ident`	DEPRECATED as of 3.7 in favor of `new_ident`. Always, but when loading PRIMARY ORFs, (see variant command line option) genes duplicated in `nt_ident` are not loaded	&&&	@@@
`asm_feature`	`bcp_asm_feature`	Always	&&&	@@@
`ORF_attribute`	`bcp_ORF_attribute`	Always	Newline	Tab
`accession`	`bcp_accession`	Always	Newline	Tab
`all_vs_all`	`bcp_all_vs_all_1`, `bcp_all_vs_all_2`, etc., and/or `bcp_all_vs_all_01`, `bcp_all_vs_all_02`, etc.	Always	Newline	Tab

All other tables are ignored by the loader. All input files must reside in the same directory, which is specified as a command line parameter to the loader.

There are two file formats. One format uses a newline to terminate records and a tab character to separate fields within the record. The other format, which allows embedded tab and newline characters within its fields, uses the character sequence &&& as the record terminator and the character sequence @@@ as the field separator. In files with no line breaks between entries, any report by the loader of line numbers or total number of lines refers to entries, not lines.
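
As an illustration of the two formats, the following is a minimal sketch of a record reader. It is not the loader's own parser; the function name `iter_records` and the format keys are hypothetical, and the sketch assumes that no extra characters appear between a record terminator and the start of the next record.

```python
from typing import Iterator, List

# Record terminator and field separator for the two documented formats.
# The dictionary keys are hypothetical labels chosen for this sketch.
FORMATS = {
    "newline_tab": ("\n", "\t"),
    "delimited": ("&&&", "@@@"),
}


def iter_records(text: str, fmt: str) -> Iterator[List[str]]:
    """Yield each record of a CMR dump file as a list of field strings."""
    record_end, field_sep = FORMATS[fmt]
    for raw in text.split(record_end):
        # The terminator follows the last record, so the final piece
        # produced by split() is empty and is skipped.
        if raw == "":
            continue
        yield raw.split(field_sep)


if __name__ == "__main__":
    sample = "1@@@first line\nsecond\tline@@@x&&&2@@@plain@@@y&&&"
    for number, fields in enumerate(iter_records(sample, "delimited"), start=1):
        # 'number' counts entries, not physical lines.
        print(number, fields)
```

Because fields in the &&&/@@@ format may contain newlines, the counter in the usage example counts entries, which matches how the loader reports line numbers for such files.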

Loader Dependencies and Prerequisites

Other than the standard warehouse creation procedure, the loader does not require that any other Warehouse tools be run prior to its execution. However, if the NCBI Taxonomy or Enzyme datasets have been loaded, the CMR loader builds useful links to objects in these datasets. A warning is issued if either of these datasets has not been loaded. If multiple versions of these datasets are present in the Warehouse, the dataset with the maximum WID (typically the most recently loaded) is used.
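
The maximum-WID rule can be sketched as follows. This is an illustration only: the function name `select_dataset_wid`, the in-memory list of (WID, name) pairs, and the dataset names used in the example are assumptions, not the loader's actual query or the names stored in the Dataset table.

```python
from typing import Iterable, Optional, Tuple


def select_dataset_wid(datasets: Iterable[Tuple[int, str]], name: str) -> Optional[int]:
    """Return the largest WID among datasets with the given name, or None.

    None corresponds to the case where the dataset has not been loaded,
    for which the loader issues a warning.
    """
    wids = [wid for wid, dataset_name in datasets if dataset_name == name]
    return max(wids) if wids else None


if __name__ == "__main__":
    # Hypothetical contents of the Dataset table: (WID, Name).
    loaded = [(2, "NCBI Taxonomy"), (57, "NCBI Taxonomy"), (31, "CMR")]
    print(select_dataset_wid(loaded, "NCBI Taxonomy"))
    if select_dataset_wid(loaded, "Enzyme") is None:
        print("warning: Enzyme dataset not loaded")
```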

The NCBI Taxonomy dataset is used to populate the NucleicAcid.GeneticCodeWID column for the nucleic acid molecules of each organism in CMR. See Table db_data for details.

The Enzyme dataset is used to create an EnzymaticReaction for each CMR Open Reading Frame that specifies an Enzyme Commission (EC) number. See Table new_ident for details.

Running the Loader

The CMR installation instructions contain details for running the loader, including options and a description of its output.

The CMR loader should not be run concurrently with other loaders. The loader is likely to fail in this case.

To show progress, the loader prints a period for each 1000 WIDs allocated. For large input tables that don't allocate many WIDs, a colon is printed every 1000 input lines.

Variants of CMR

The loader accepts a variant command line option that specifies which Open Reading Frames (ORFs) and associated genes and proteins are to be loaded. Its semantics are as follows:

all: [default] All ORFs in CMR are loaded.
original: If a genome has been sequenced at a non-TIGR site, only the non-TIGR ORFs are loaded; all TIGR ORFs for the genome are excluded. If a genome has been sequenced at TIGR, these TIGR ORFs are loaded.
primary: If a genome has been sequenced at a non-TIGR site, all non-TIGR ORFs are loaded. All TIGR ORFs that find the same gene as a non-TIGR ORF are excluded. All TIGR ORFs that find new genes are loaded. The gene-matching criterion used by TIGR is that the genes must significantly overlap and must have the same 3' end; the 5' ends may differ due to alternate start sites.
TIGR: Only those ORFs that have been sequenced by TIGR are loaded.

To summarize, original loads only original (generally higher quality) annotations, while primary will load an automated annotation if there is no original annotation for that gene.

To implement the various variants, the following locus naming conventions are assumed:

'NTL...' = original nonTIGR annotation
'NTx...' [x /= 'L'] = automated TIGR (non-original) annotation
'yz...' [yz /= 'NT'] = original TIGR annotation
'NT01MC...' = original locus (magnetococcus organism, exception to above rules)
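
The naming conventions above can be expressed as a small classifier. This is a sketch, not the loader's code: the `LocusKind` enum, the function name, and the sample locus strings are hypothetical. Because the document does not say whether the NT01MC exception counts as TIGR or non-TIGR, the sketch gives it its own category.

```python
from enum import Enum


class LocusKind(Enum):
    ORIGINAL_NON_TIGR = "original non-TIGR annotation"
    AUTOMATED_TIGR = "automated TIGR (non-original) annotation"
    ORIGINAL_TIGR = "original TIGR annotation"
    ORIGINAL_EXCEPTION = "original locus (NT01MC exception)"


def classify_locus(locus: str) -> LocusKind:
    """Classify a locus name by its prefix, following the listed conventions."""
    # The NT01MC exception must be tested before the general 'NTx' rule,
    # because it would otherwise be treated as an automated annotation.
    if locus.startswith("NT01MC"):
        return LocusKind.ORIGINAL_EXCEPTION
    if locus.startswith("NTL"):
        return LocusKind.ORIGINAL_NON_TIGR
    if locus.startswith("NT"):
        return LocusKind.AUTOMATED_TIGR
    return LocusKind.ORIGINAL_TIGR


if __name__ == "__main__":
    # Made-up locus names, used only to exercise each rule.
    for name in ("NTL01XX0001", "NT02XX0001", "NT01MC0001", "AB0001"):
        print(name, "->", classify_locus(name).value)
```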

Dataset Specification

The loader adds one row to the Dataset table as follows:

**Column values for `Dataset` row**
Column	Value assigned by CMR loader
`WID`	The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse.
`Name`	'CMR'.
`ReleaseDate`	NULL
`Version`	Version assigned by TIGR to this release of CMR (e.g., '19.0').
`ReleaseDate`	The date that this version of CMR was released.
`LoadDate`	The time/date the loader was run.
`ChangeDate`	The date and time the loader completed, NULL if the loader did not complete successfully.
`LoadedBy`	The value of the system environment variable USER for the account running the loader.
`Application`	'CMR Loader'
`ApplicationVersion`	4.6
`HomeURL`	http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
`QueryURL`	http://www.tigr.org/tigr-scripts/CMR2/GenomeSlicer.spl

Sequence Comparison Computation Specification

The CMR database contains sequence comparison results in its all_vs_all table. These are stored in the SequenceMatch table of the warehouse. A description of the sequence comparison algorithm performed is stored in a row of the Computation table as follows:

**Column values for `Computation` row**
Column	Value assigned by CMR loader
`WID`	Uniquely specifies this computation in the warehouse. Referred to by `SequenceMatch` table rows.
`Name`	'blastp'.
`DatasetWID`	`Dataset.WID` for the entry in `Dataset` for this CMR dataset.
`Description`	'The search algorithm employed for the all vs. all searches is blastp. See http://www.ncbi.nlm.nih.gov/BLAST for details. There is no max number of matches allowed in the all_vs_all searches. However, nothing under 10% identity or 40% similarity are reported, and nothing over a P-value of 1 is reported in the database.'

The ComputationWID column of every row added to SequenceMatch refers to this row of Computation.

Translation Semantics for CMR Tables

This section describes the semantic mapping from the tables comprising the CMR database, and their associated flat file representation, to their representation in the Bio-SPICE data warehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. In the most typical case, the attribute is simply copied into a warehouse column; if the translation is more complex, an explanation is given. Any attributes not listed are ignored.

Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.

If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.

Comments that are NULL or are "-" are ignored.

Table db_data

The db_data table contains a description of each genome present in CMR. A row is added to the BioSource table for each entry in db_data.

**Translation semantics for `db_data`**
CMR Attribute	Warehouse Semantics
organism_name	`BioSource.Name`
genetic_code	Used internally to look up a `GeneticCode` entry from the most recently loaded NCBI Taxonomy database; if an entry is found where its `GeneticCode.NCBIID` matches genetic_code, its `GeneticCode.WID` is stored in `NucleicAcid.GeneticCodeWID` for any nucleic acids of this organism.
original_db	`DBID.XID; DBID.OtherWID = BioSource.WID`

Linking Tables

No linking table rows are added when this file is loaded. However, various linking table rows are added when subsequent tables are loaded.

Table taxon

The taxon table supplements the genome descriptions in the db_data table. For each entry, an existing row of BioSource is updated: the row whose BioSource.Name matches the concatenation of the genus, species, and strain fields. The update removes the strain from BioSource.Name and stores it in BioSource.Strain. If the strain name remains in BioSource.Name after loading, the taxon row for that organism is either missing or erroneous.

**Translation semantics for `taxon`**
CMR Attribute	Warehouse Semantics
genus, species	Concatenated together with separating spaces to form the genome's `BioSource.Name`. This identifies the associated `BioSource` entry that is updated.
strain	`BioSource.Strain`.
taxon_id	Used internally to look up the taxon entry in the Warehouse being loaded from a previously loaded NCBI Taxonomy database; this is stored in `BioSource.TaxonWID`. If the taxon is not found, NULL is stored.
short_name	`SynonymTable.Syn; SynonymTable.OtherWID = BioSource.WID`
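
The name update described for the taxon table can be sketched as below. This is only an illustration under stated assumptions: the function name `split_strain` is hypothetical, fields are assumed to be joined with single spaces, and an empty strain is assumed to contribute nothing to the concatenated name.

```python
from typing import Optional, Tuple


def split_strain(biosource_name: str, genus: str, species: str,
                 strain: str) -> Optional[Tuple[str, str]]:
    """Return the updated (BioSource.Name, BioSource.Strain), or None.

    None means this taxon entry does not match the given BioSource row,
    so that row would keep the strain embedded in its name.
    """
    # Build the name the way it is assumed to appear before the update.
    parts = [part for part in (genus, species, strain) if part]
    if biosource_name != " ".join(parts):
        return None
    new_name = " ".join(part for part in (genus, species) if part)
    return new_name, strain


if __name__ == "__main__":
    print(split_strain("Escherichia coli K12", "Escherichia", "coli", "K12"))
    print(split_strain("Escherichia coli K12", "Escherichia", "coli", "O157"))
```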

Linking Tables

No linking table rows are added when this file is loaded.

Table asmbl_data

The asmbl_data table contains information for each DNA molecule in CMR. A NucleicAcid row is created for each entry in asmbl_data. If the assembly has a name, a Description row for the NucleicAcid is created.

**Translation semantics for `asmbl_data`**
CMR Attribute	Warehouse Semantics
name	`NucleicAcid.Name`
db_data_id	~`NucleicAcid.BioSourceWID`; determined via a lookup table that associates each db_data_id with a `BioSource.WID`.
type	`NucleicAcid.Class` if the value is 'chromosome' or 'plasmid', else 'other'; `NucleicAcid.FullySequenced` is 'T' if the value is 'chromosome' or 'plasmid', else 'F'; a value other than 'chromosome' or 'plasmid' or 'pseudomolecule' causes a parse error. A pseudomolecule is a molecule that has been put together using individual contigs (which may or may not be ordered) in an "unfinished" genome. The assembly is not actually closed; the available pieces were joined to make a single molecule, because CMR is not set up to handle multiple contigs. The following sequence separates the contigs in the pseudomolecule (it contains stops and starts in all 6 reading frames): NNNNNCACACACTTAATTAATTAAGTGTGTGNNNNN
topology	`NucleicAcid.Topology`; any value other than 'circular' or 'linear' is loaded as 'other'.
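
The type and topology rules in the table above can be sketched as two small functions. The function names are hypothetical, and raising `ValueError` stands in for the loader's parse error, whose actual form is not described here.

```python
from typing import Tuple


def map_assembly_type(value: str) -> Tuple[str, str]:
    """Return (NucleicAcid.Class, NucleicAcid.FullySequenced) for a CMR type."""
    if value in ("chromosome", "plasmid"):
        return value, "T"
    if value == "pseudomolecule":
        # Recognized, but neither a closed chromosome nor a plasmid.
        return "other", "F"
    # Any other value is treated as a parse error.
    raise ValueError(f"unrecognized asmbl_data type: {value!r}")


def map_topology(value: str) -> str:
    """Return NucleicAcid.Topology; unrecognized values load as 'other'."""
    return value if value in ("circular", "linear") else "other"


if __name__ == "__main__":
    print(map_assembly_type("plasmid"), map_topology("circular"))
    print(map_assembly_type("pseudomolecule"), map_topology("unknown"))
```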

Linking Tables

No linking table rows are added.

Table assembly

The assembly table contains the sequence of a DNA molecule. A Subsequence row is created for each entry in assembly. Sequences can be rather large, sometimes several megabytes long. Note that your database server may impose a limit on data transfers that causes sequences to be truncated (see Limitations).

**Translation semantics for `assembly`**
CMR Attribute	Warehouse Semantics
asmbl_id	Determines `Subsequence.NucleicAcidWID` by looking up the nucleic acid of `asmbl_id` in a lookup table.
sequence	`Subsequence.Sequence`; its length is stored in `Subsequence.Length` and in `NucleicAcid.MoleculeLength` where `NucleicAcid.WID` = `Subsequence.NucleicAcidWID`.
comment	`CommentTable.Comm; CommentTable.OtherWID = Subsequence.WID`
change_date	`Entry.ModifiedDate; Entry.OtherWID = Subsequence.WID`

Linking Tables

No linking table rows are added.

Table new_ident

The new_ident table contains information for each Open Reading Frame (ORF) in CMR that was not sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.

**Translation semantics for `new_ident`**
CMR Attribute	Warehouse Semantics
locus	`DBID.XID; DBID.OtherWID = Gene.WID`. Used internally to associate a `Protein.WID` and a `Gene.WID` with this entry, to facilitate subsequent updates to these rows by the loader.
nt_locus	`Gene.GenomeID`; `SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID`
com_name	`Protein.Name`
gene_sym	`Gene.Name`
ec#	`SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;` Also, it is used internally to look up a `Reaction` entry from the most recently loaded Enzyme database: if an Enzyme entry is found where `Reaction.ECNumber` matches ec# exactly, an `EnzymaticReaction` entry is added where this `Reaction.WID` is `EnzymaticReaction.ReactionWID` and `Protein.WID` is `EnzymaticReaction.ProteinWID`. No check for the validity of the EC number is made.
comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
nt_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
pub_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
auto_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = Protein.WID` and `Entry.ModifiedDate; Entry.OtherWID = Gene.WID`.
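
The EC number lookup described for the ec# attribute can be sketched as follows. This is a simplified illustration: the function name, the dictionary standing in for the Reaction rows of the Enzyme dataset, and the assumption of at most one Reaction per EC number are all hypothetical.

```python
from typing import Dict, Optional


def enzymatic_reaction_for(protein_wid: int, ec_number: Optional[str],
                           reaction_wid_by_ec: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return the EnzymaticReaction column values to add, or None.

    The match is exact: the ec# string is compared as-is, with no
    normalization and no check that it is a valid EC number.
    """
    if not ec_number:
        return None
    reaction_wid = reaction_wid_by_ec.get(ec_number)
    if reaction_wid is None:
        return None
    return {"ReactionWID": reaction_wid, "ProteinWID": protein_wid}


if __name__ == "__main__":
    # Hypothetical Reaction.ECNumber -> Reaction.WID index.
    reactions = {"1.1.1.1": 5001}
    print(enzymatic_reaction_for(42, "1.1.1.1", reactions))
    print(enzymatic_reaction_for(42, "1.1.1.-", reactions))
```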

Linking Tables

A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.

Table nt_ident

NOTE: this table is no longer present in CMR, as of version 3.7.
The nt_ident table contains information for each Open Reading Frame (ORF) in CMR that was not sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.

**Translation semantics for `nt_ident`**
CMR Attribute	Warehouse Semantics
locus	`DBID.XID; DBID.OtherWID = Gene.WID`. Note: by CMR convention, each locus in this table begins with "NTL", to designate it as not sequenced at TIGR. Used internally to associate a `Protein.WID` and a `Gene.WID` with this entry, to facilitate subsequent updates to these rows by the loader.
nt_locus	`Gene.GenomeID`; `SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID`
com_name	`Protein.Name`
gene_sym	`Gene.Name`
ec#	`SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;` Also, it is used internally to look up a `Reaction` entry from the most recently loaded Enzyme database: if an Enzyme entry is found where `Reaction.ECNumber` matches ec# exactly, an `EnzymaticReaction` entry is added where this `Reaction.WID` is `EnzymaticReaction.ReactionWID` and `Protein.WID` is `EnzymaticReaction.ProteinWID`. No check for the validity of the EC number is made.
comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
nt_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
pub_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
auto_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = Protein.WID` and `Entry.ModifiedDate; Entry.OtherWID = Gene.WID`.

Linking Tables

A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.

Table ident

NOTE: this table is no longer present in CMR, as of version 3.7.
The ident table contains information for each Open Reading Frame (ORF) in CMR that was sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.

**Translation semantics for `ident`**
CMR Attribute	Warehouse Semantics
locus	`Gene.GenomeID`; `DBID.XID; DBID.OtherWID = Gene.WID`. Used internally to associate a `Protein.WID` and a `Gene.WID` with this entry, to facilitate subsequent updates to these rows by the loader.
com_name	`Protein.Name`
gene_sym	`Gene.Name`
ec#	`SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;` Also, it is used internally to look up a `Reaction` entry from the most recently loaded Enzyme database: if an Enzyme entry is found where `Reaction.ECNumber` matches ec# exactly, an `EnzymaticReaction` row is added where this `Reaction.WID` is `EnzymaticReaction.ReactionWID` and `Protein.WID` is `EnzymaticReaction.ProteinWID`. No check for the validity of the EC number is made.
comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
nt_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
pub_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
auto_comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = Protein.WID` and `Entry.ModifiedDate; Entry.OtherWID = Gene.WID`.

Linking Tables

A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.

Table feat_link

The feat_link table contains links between related loci. It is used by the loader to determine whether to exclude secondary loci when only primary loci are being loaded (see Variants of CMR above).

No tables are changed directly based on loading this table.

**Translation semantics for `feat_link`**
CMR Attribute	Warehouse Semantics
id	Ignored.
parent_locus	Ignored unless loading the primary variant of CMR. When loading the primary variant, if this locus is a non-TIGR locus, the associated child_locus is marked so that it is excluded from the load of CMR.
child_locus	See parent_locus

Linking Tables

No linking table rows are added.

Table asm_feature

The asm_feature table contains information for each feature identified on DNA molecules in CMR. There are various types of CMR features; each entry contains a feat_type attribute.

IS - insertion sequence
NTORF - non-TIGR Open Reading Frame
ORF - TIGR Open Reading Frame
PHAGE - phage DNA (virus)
RBS - ribosomal binding site
RPT - repeat
RPT_A - repeat
TERM - rho-independent terminator
CRISPR - unknown
CRISPR_spacer - unknown
TE - unknown
IE - unknown
BACTERIOCIN - unknown
INTRON - intron feature
rRNA or NTrRNA - ribosomal RNA
tRNA or NTtRNA - transfer RNA
NTmiscRNA - unspecified type of RNA
sRNA - small RNA, broadly defined as RNA not directly involved in protein synthesis. Small RNAs are usually in the 75-400 nucleotide range, although some are as long as a thousand base pairs. They are synthesized by RNA Polymerase I, II, or III.

The value of this attribute determines the warehouse loading semantics as follows.

If feat_type is 'ORF' or 'NTORF', the feature is a protein. Its locus should have an entry in the new_ident table. Its sequence is ignored.
If feat_type is 'rRNA', 'sRNA', 'tRNA', 'NTrRNA', 'NTtRNA', or 'NTmiscRNA', the feature is a nucleic acid. A NucleicAcid and a Subsequence are created for it, the latter to contain its sequence.
If feat_type is any other value, a Feature is created for it. Its sequence is ignored. If the value is not among the recognized values listed above, a warning is issued.
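
The three cases above amount to a dispatch on feat_type, sketched below. The function name `route_feature` and the returned labels are hypothetical. The set of other recognized values is taken from the feature type list in this section.

```python
from typing import Tuple

PROTEIN_TYPES = {"ORF", "NTORF"}
NUCLEIC_ACID_TYPES = {"rRNA", "sRNA", "tRNA", "NTrRNA", "NTtRNA", "NTmiscRNA"}
# Remaining recognized feature types from the list in this section.
OTHER_RECOGNIZED_TYPES = {
    "IS", "PHAGE", "RBS", "RPT", "RPT_A", "TERM", "CRISPR",
    "CRISPR_spacer", "TE", "IE", "BACTERIOCIN", "INTRON",
}


def route_feature(feat_type: str) -> Tuple[str, bool]:
    """Return (target, warn) for an asm_feature entry.

    target is 'protein', 'nucleic_acid', or 'feature'; warn is True when
    the feat_type is not among the recognized values.
    """
    if feat_type in PROTEIN_TYPES:
        return "protein", False
    if feat_type in NUCLEIC_ACID_TYPES:
        return "nucleic_acid", False
    # Everything else becomes a Feature; unrecognized values also warn.
    return "feature", feat_type not in OTHER_RECOGNIZED_TYPES


if __name__ == "__main__":
    for value in ("NTORF", "tRNA", "RBS", "MYSTERY"):
        print(value, route_feature(value))
```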

Since the loader semantics differ significantly depending on feat_type, each case is described separately.

Protein features

**Translation semantics for `asm_feature` where `feat_type` is 'ORF' or 'NTORF'**
CMR Attribute	Warehouse Semantics
name	Ignored
feat_type ['ORF','NTORF']	`Gene.Type` = 'polypeptide'.
db_data_id	Used internally to map this entry to its associated `BioSource`.
asmbl_id	Used internally to map this entry to the `NucleicAcid` of the replicon that contains this gene, which is stored in `Gene.NucleicAcidWID`.
locus	Used internally to map this entry to its associated `Protein` and `Gene`, which are created during the loading of the new_ident table.
sequence	Ignored (except for consistency checking; see below).
protein	`Protein.AASequence` where `Protein.WID` is indicated by locus.
end5, end3	If end5 < end3, `Gene.Direction` is set to `'forward'`, else it is set to `'reverse'`. min(end5, end3) is stored in `Gene.CodingRegionStart`. max(end5, end3) is stored in `Gene.CodingRegionEnd`. Note that these are indices into `NucleicAcid.Sequence`. Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex.
comment	`CommentTable.Comm; CommentTable.OtherWID = Protein.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = Protein.WID`
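
The end5/end3 rule in the table above can be sketched as follows. The function name gene_coordinates and the dictionary return shape are assumptions made for this illustration; only the column names and the comparison rule come from the table.

```python
def gene_coordinates(end5, end3):
    """Derive Gene.Direction and the coding region from end5 and end3.

    Assumes the gene does not span the origin of a circular replicon,
    a case the text notes is not handled.
    """
    direction = "forward" if end5 < end3 else "reverse"
    return {
        "Direction": direction,
        "CodingRegionStart": min(end5, end3),
        "CodingRegionEnd": max(end5, end3),
    }
```

A gene with end5 = 400 and end3 = 100 is therefore stored as a reverse-direction gene whose coding region runs from 100 to 400.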

Consistency checking is performed on the lengths of the nucleotide and protein sequences. If the nucleotide length is not 3 * the amino acid length, a diagnostic is issued, and the entry is marked in the Entry table as containing an error.
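
A minimal sketch of this length check, applying the rule exactly as stated above. The function name check_lengths and the wording of the diagnostic are hypothetical; how the loader records the error in the Entry table is not shown.

```python
def check_lengths(locus, nucleotide_seq, protein_seq):
    """Return a diagnostic string if the lengths disagree, else None."""
    expected = 3 * len(protein_seq)
    if len(nucleotide_seq) != expected:
        return (f"{locus}: nucleotide length {len(nucleotide_seq)} "
                f"is not 3 * amino acid length ({expected})")
    return None
```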

Linking Tables

Rows are added to BioSourceWIDProteinWID and BioSourceWIDGeneWID to associate the gene and protein with the organism it occurs in.

RNA features (rRNA, sRNA, tRNA, NTrRNA, NTtRNA, and NTmisc_RNA)

**Translation semantics for `asm_feature` where `feat_type` contains 'RNA'**
CMR Attribute	Warehouse Semantics
name	Ignored
feat_type ['sRNA', 'NTmisc_RNA']	Determines `NucleicAcid.Type` = 'RNA' and `Gene.Type` and `NucleicAcid.Class` = 'other'
feat_type ['rRNA', 'NTrRNA']	Determines `NucleicAcid.Type` = 'RNA' and `Gene.Type` and `NucleicAcid.Class` = 'rRNA'.
feat_type ['tRNA', 'NTtRNA']	Determines `NucleicAcid.Type` = 'RNA' and `Gene.Type` and `NucleicAcid.Class` = 'tRNA'
db_data_id	Used internally to map this entry to its associated `BioSource`.
asmbl_id	Used internally to map this entry to the `NucleicAcid` of the replicon that contains this gene, which is stored in `Gene.NucleicAcidWID`.
locus	`Gene.GenomeID`, and `DBID.XID; DBID.OtherWID = NucleicAcid.WID` for the nucleic acid that is the RNA product.
sequence	`Subsequence.Sequence` where `Subsequence.NucleicAcidWID` is indicated by asmbl_id. Note the DNA sequence is stored explicitly.
protein	Ignored (should be empty).
end5, end3	If end5 < end3, `Gene.Direction` is set to `'forward'`, else it is set to `'reverse'`. min(end5, end3) is stored in `Gene.CodingRegionStart`. max(end5, end3) is stored in `Gene.CodingRegionEnd`. Note that these are indices into `NucleicAcid.Sequence`. Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex.
comment	`CommentTable.Comm; CommentTable.OtherWID = NucleicAcid.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = NucleicAcid.WID`

For each entry, a row is added to Gene, NucleicAcid, and Subsequence. CMR does not identify a gene associated with this type of entry; it is constructed from available information.

Note that there are two nucleic acids associated with each entry of this type - the nucleic acid for the replicon containing the feature, and the nucleic acid that represents the gene product. Gene.NucleicAcidWID references the replicon this feature occurs on, not the gene product. Gene.SubsequenceWID references the Subsequence that is created for this entry.
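
The feat_type rows of the RNA table can be sketched as a lookup. The names RNA_CLASS and rna_semantics are assumptions made for this illustration. Because this document spells the miscellaneous RNA type both as NTmiscRNA and as NTmisc_RNA, the sketch accepts both spellings; which one CMR actually uses is not established here.

```python
# Class assigned to each RNA feat_type, per the table above.
# Both spellings of the miscellaneous RNA type are accepted (assumption).
RNA_CLASS = {
    "sRNA": "other",
    "NTmisc_RNA": "other",
    "NTmiscRNA": "other",
    "rRNA": "rRNA",
    "NTrRNA": "rRNA",
    "tRNA": "tRNA",
    "NTtRNA": "tRNA",
}


def rna_semantics(feat_type):
    """Return the column values determined by an RNA feat_type."""
    rna_class = RNA_CLASS[feat_type]  # raises KeyError for non-RNA types
    return {
        "NucleicAcid.Type": "RNA",
        "NucleicAcid.Class": rna_class,
        "Gene.Type": rna_class,
    }
```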

Linking Tables

A row is added to BioSourceWIDGeneWID to associate the gene with the organism it occurs in. A row is added to GeneWIDNucleicAcidWID to associate the gene with its RNA product.

Other features

**Translation semantics for `asm_feature` where `feat_type` is anything but 'ORF', 'NTORF', and those containing 'RNA'**
CMR Attribute	Warehouse Semantics
name	`Feature.Description`
feat_type [anything but 'ORF', 'NTORF', and those containing 'RNA']	`Feature.Type`.
db_data_id	Used internally to map this entry to its associated `BioSource`.
asmbl_id	Used internally to map this entry to the `NucleicAcid` of the replicon that contains this feature, which is associated with the feature using the linking table `NucleicAcidWIDFeatureWID`.
locus	`DBID.XID; DBID.OtherWID = Feature.WID`.
sequence	Ignored.
protein	Ignored (should be empty).
end5	`Feature.StartPosition`. This is an index into the full sequence of the replicon the feature occurs on.
end3	`Feature.EndPosition`. This is an index into the full sequence of the replicon the feature occurs on.
comment	`CommentTable.Comm; CommentTable.OtherWID = Feature.WID`
date	`Entry.ModifiedDate; Entry.OtherWID = Feature.WID`

A row is added to Feature for each entry. The column value Feature.Class is always NULL. The column value Feature.SequenceType is always 'N', indicating that this is a nucleic acid feature, and that the sequence is not explicitly stored (it is available by indexing the replicon). The column value Feature.SequenceWID always references the NucleicAcid of the replicon the feature occurs on. The column value Feature.ExperimentalSupport is always 'F' (false). The column value Feature.ComputationalSupport is always 'T' (true).

Linking Tables

The biological source of the feature is referenced by NucleicAcid.BioSourceWID of the replicon.

Table orf_attribute

The orf_attribute table supplements the descriptions of open reading frames by providing values of attributes of their associated nucleic acid sequences and proteins. No rows are added to any warehouse tables; for each entry, a row in either Subsequence or Protein is updated according to the attribute defined in the entry.

**Translation semantics for `orf_attribute`**
CMR Attribute	Warehouse Semantics
locus	Used internally to map this entry to its associated `Subsequence` or `Protein`
score	Specifies the value of the attribute.
att_type	Determines the attribute whose value is specified by score: 'GC' - `Subsequence.PercentGC`; 'MW' - `Protein.MolecularWeightCalc`, converted from Daltons to kiloDaltons; 'PI' - `Protein.PICalc`.
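
The att_type dispatch above, including the unit conversion for 'MW', can be sketched as follows. The function name orf_attribute_update and the tuple return shape are assumptions made for this illustration; the sketch only computes which column would be updated and with what value.

```python
def orf_attribute_update(att_type, score):
    """Return (table, column, value) for one orf_attribute entry."""
    if att_type == "GC":
        return ("Subsequence", "PercentGC", score)
    if att_type == "MW":
        # CMR gives the weight in Daltons; the warehouse stores kiloDaltons.
        return ("Protein", "MolecularWeightCalc", score / 1000.0)
    if att_type == "PI":
        return ("Protein", "PICalc", score)
    # Handling of other att_type values is not described in the text.
    raise ValueError(f"unexpected att_type: {att_type!r}")
```

For example, an 'MW' entry with a score of 50000.0 Daltons would update Protein.MolecularWeightCalc to 50.0.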

Linking Tables

No linking table rows are added.

Table accession

The accession table contains cross-references to other databases (e.g., SWISS-PROT, EcoCyc) from loci defined in CMR. A row is added to the CrossReference table for each entry in accession.

**Translation semantics for `accession`**
CMR Attribute	Warehouse Semantics
locus	An internal lookup table of all loci from the previously parsed `new_ident` is searched; the associated Gene WID is `CrossReference.OtherWID`.
accession_db	`CrossReference.DataBaseName`
accession_id	`CrossReference.XID; CrossReference.OtherWID = Gene.WID`

Linking Tables

No linking table rows are added when this file is loaded.

Table all_vs_all

The all_vs_all table contains the results of exhaustive BLAST searches of every amino acid sequence in CMR. Specifically, each protein is compared with every other protein in CMR, and the best matches for each are stored, along with information that characterizes the match. Due to its size, the table is divided into multiple files.

A row is added to SequenceMatch for each entry. A single row is added to Computation for all of CMR; SequenceMatch.ComputationWID references it to describe the sequence matching algorithm.

The search algorithm employed for the all vs. all searches is blastp; see the NCBI BLAST site for details. The all_vs_all searches impose no maximum number of matches. However, matches under 10% identity or 40% similarity are not reported, nor are matches with a P-value over 1. Other parameters are:

expect=.001 - the E-value cutoff; the allowed range is 0.001-1000 (website default is 10). Lower numbers are more stringent.
cutoff=120 - the cutoff score for reporting HSPs (high-scoring segment pairs); the allowed range is 50-120 (the website default value is calculated from the EXPECT value). Higher numbers are more stringent.
alignments=500 - gives at most 500 different accessions for each protein (website default is 250).
descriptions=500 - gives at most 500 different accessions for the gene (website default is 500).
filter=seg+xnu - added filters. These filters mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie & States (Computers and Chemistry, 1993). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
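
The reporting thresholds stated above (identity, similarity, and P-value) can be read as a simple predicate. This sketch assumes that a match must satisfy all three conditions to be reported and that the boundary values themselves are allowed; the text does not state either point explicitly, and the function name is_reported is hypothetical.

```python
def is_reported(per_id, per_sim, pvalue):
    """Return True if a match meets the thresholds described in the text.

    per_id and per_sim are percentages (0-100).
    """
    return per_id >= 10.0 and per_sim >= 40.0 and pvalue <= 1.0
```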

NOTE: The CMR website provides a somewhat different sequence matching procedure online, called Blast-Extend-Repraze or BER, that computes a modified Smith-Waterman alignment. See the CMR website for details.

**Translation semantics for `all_vs_all`**
CMR Attribute	Warehouse Semantics
locus	An internal lookup table of all loci from the previously parsed `new_ident` is searched; the associated `Protein` WID is `SequenceMatch.QueryWID`.
accession	An internal lookup table of all loci from the previously parsed `new_ident` is searched; the associated `Protein` WID is `SequenceMatch.MatchWID`.
pvalue	`SequenceMatch.PValue`.
per_id	`SequenceMatch.PercentIdentical`.
per_sim	`SequenceMatch.PercentSimilar`.
match_len	`SequenceMatch.Length`.
match_order	`SequenceMatch.Rank`.
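
The mapping in the table above can be sketched as follows. The function name to_sequence_match, the dictionary-based lookup table protein_wid_by_locus, and the choice to skip entries whose loci are not found are all assumptions made for this illustration; the text does not say how the loader handles a failed lookup.

```python
def to_sequence_match(entry, protein_wid_by_locus, computation_wid):
    """Build a SequenceMatch row (as a dict) from one all_vs_all entry.

    protein_wid_by_locus stands in for the internal lookup table built
    while parsing new_ident. Returns None if either locus is unknown
    (assumed behavior).
    """
    query_wid = protein_wid_by_locus.get(entry["locus"])
    match_wid = protein_wid_by_locus.get(entry["accession"])
    if query_wid is None or match_wid is None:
        return None
    return {
        "QueryWID": query_wid,
        "MatchWID": match_wid,
        "ComputationWID": computation_wid,
        "PValue": entry["pvalue"],
        "PercentIdentical": entry["per_id"],
        "PercentSimilar": entry["per_sim"],
        "Length": entry["match_len"],
        "Rank": entry["match_order"],
    }
```

The single Computation row created for all of CMR supplies computation_wid for every SequenceMatch row.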

Linking Tables

No linking table rows are added.

References

CMR Home Page