(C) 2005 SRI International. All Rights Reserved. See BioWarehouse Overview for license details.
Introduction
This document describes version 4.6 of the Comprehensive Microbial Resource (CMR) Loader, one of several database loaders comprising the BioWarehouse. The CMR Loader (referred to simply as the loader) loads a collection of files comprising the Comprehensive Microbial Resource (sometimes referred to as the Omniome database) into the BioWarehouse, an Oracle relational database that provides a common representation for diverse bioinformatics databases. CMR is developed and maintained by The Institute for Genomic Research (TIGR).
Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.
Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.
A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; for example, the SWISS-PROT loader adds one row to the Dataset table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.
Linking tables describe relationships among objects. They contain the WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and cross-reference information and to facilitate lookup of objects.
Full schema information, including source files and browseable documentation, is available with this distribution.
The CMR loader should not be run concurrently with other loaders. The loader is likely to fail in this case. This limitation was imposed for performance reasons.
The latest supported data version for the CMR loader is listed in the loader summary table. The loader may not be compatible with future versions of CMR. Due to changes in the CMR flat file format, the loader is not compatible with earlier CMR versions.
MySQL servers impose a limit on the maximum size of an SQL statement. The CMR loader currently assumes that this limit is at least 4,000,000 bytes, and will truncate long data elements such as sequences to ensure this limit is not exceeded. The assumed limit is currently not changeable. You must therefore configure your MySQL server so that its max_allowed_packet parameter is at least 4M, as described in Environment Setup (MySQL).
The CMR database contains modification dates for a number of its data types. However, these dates are ignored, rather than stored in Entry.ModifiedDate.
The CMR database consists of several tables. Data for each table is in a separate file or group of files. By CMR convention, files are named bcp_tablename. Only a subset of these files is used by the loader. Each file is described in the following table, and is listed in the order in which it is loaded.
NOTE: As of release 3.7, the ident and nt_ident tables have been deprecated in favor of the new_ident table. The former tables are loaded if present, in order to support backward compatibility.
Table | File name(s) | Loaded when | Record delimiter | Field delimiter |
---|---|---|---|---|
db_data | bcp_db_data | Always | Newline | Tab |
taxon | bcp_taxon | Always | Newline | Tab |
asmbl_data | bcp_asmbl_data | Always | &&& | @@@ |
assembly | bcp_assembly | Always | &&& | @@@ |
new_ident | bcp_new_ident | Loading any variant except TIGR (see variant command line option) | &&& | @@@ |
nt_ident | bcp_nt_ident | DEPRECATED as of 3.7 in favor of new_ident. Loading any variant except TIGR (see variant command line option) | &&& | @@@ |
feat_link | bcp_feat_link | Loading PRIMARY variant (see variant command line option) | Newline | Tab |
ident | bcp_ident | DEPRECATED as of 3.7 in favor of new_ident. Always, but when loading PRIMARY ORFs (see variant command line option), genes duplicated in nt_ident are not loaded | &&& | @@@ |
asm_feature | bcp_asm_feature | Always | &&& | @@@ |
ORF_attribute | bcp_ORF_attribute | Always | Newline | Tab |
accession | bcp_accession | Always | Newline | Tab |
all_vs_all | bcp_all_vs_all_1, bcp_all_vs_all_2, etc., and/or bcp_all_vs_all_01, bcp_all_vs_all_02, etc. | Always | Newline | Tab |
All other tables are ignored by the loader. All input files must reside in the same directory, which is specified as a command line parameter to the loader.
There are two file formats. One format uses a newline to terminate records, and a tab character to separate fields within the record. The other format, which allows embedded tab and newline characters within its fields, uses the character sequence &&& as the record terminator and the character sequence @@@ as the field separator. In files with no line breaks between entries, any report by the loader of line numbers or total number of lines refers to entries, not lines.
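As a minimal sketch of how the two formats could be read (this is not the loader's actual implementation), the following Python splits file content into records and fields. The function name `parse_records`, the `FORMATS` table, and the decision to skip whitespace-only records are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# (record terminator, field separator) for the two formats described above.
# The dictionary keys are hypothetical labels, not names used by the loader.
FORMATS: Dict[str, Tuple[str, str]] = {
    "newline_tab": ("\n", "\t"),
    "delimited": ("&&&", "@@@"),
}


def parse_records(text: str, record_sep: str, field_sep: str) -> List[List[str]]:
    """Split raw file content into records, each a list of field strings."""
    records = []
    for raw in text.split(record_sep):
        # Assumption: a whitespace-only chunk (such as the one that follows
        # the final terminator) is not a record and is skipped.
        if not raw.strip():
            continue
        records.append(raw.split(field_sep))
    return records


# Fields in the &&& / @@@ format may contain embedded tabs and newlines.
sample = "1@@@line one\nline two&&&2@@@x\ty&&&"
for fields in parse_records(sample, *FORMATS["delimited"]):
    print(fields)
```

Because each entry is located by its terminator rather than by line breaks, a counter over the returned list counts entries, which matches how the loader reports "line" numbers for files without line breaks.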
The NCBI Taxonomy dataset is used to populate the NucleicAcid.GeneticCodeWID column for the nucleic acid molecules of each organism in CMR. See Table db_data for details.
The Enzyme dataset is used to create an EnzymaticReaction for each CMR Open Reading Frame that specifies an Enzyme Commission (EC) number. See Table new_ident for details.
To show progress, the loader prints a period for each 1000 WIDs allocated. For large input tables that don't allocate many WIDs, a colon is printed every 1000 input lines.
The loader accepts a variant command line option that specifies which Open Reading Frames (ORFs) and associated genes and proteins are to be loaded. Its semantics are as follows:
To implement the various variants, the following locus naming conventions are assumed:
Dataset table as follows:
Column | Value assigned by CMR loader |
---|---|
WID | The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse. |
Name | 'CMR'. |
ReleaseDate | NULL |
Version | Version assigned by TIGR to this release of CMR (e.g., '19.0'). |
ReleaseDate | The date that this version of CMR was released. |
LoadDate | The time/date the loader was run. |
ChangeDate | The date and time the loader completed, NULL if the loader did not complete successfully. |
LoadedBy | The value of the system environment variable USER for the account running the loader. |
Application | 'CMR Loader' |
ApplicationVersion | 4.6 |
HomeURL | http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl |
QueryURL | http://www.tigr.org/tigr-scripts/CMR2/GenomeSlicer.spl |
SequenceMatch table of the warehouse. A description of the sequence comparison algorithm performed is stored in a row of the Computation table as follows:
Column | Value assigned by CMR loader |
---|---|
WID | Uniquely specifies this computation in the warehouse. Referred to by SequenceMatch table rows. |
Name | 'blastp'. |
DatasetWID | Dataset.WID for the entry in Dataset for this CMR dataset. |
Description | 'The search algorithm employed for the all vs. all searches is blastp. See http://www.ncbi.nlm.nih.gov/BLAST for details. There is no max number of matches allowed in the all_vs_all searches. However, nothing under 10% identity or 40% similarity are reported, and nothing over a P-value of 1 is reported in the database.' |
The ComputationWID column of every row added to SequenceMatch refers to this row of Computation.
Comments that are NULL or are "-" are ignored.
BioSource table for each entry in db_data.
CMR Attribute | Warehouse Semantics |
---|---|
organism_name | BioSource.Name |
genetic_code | Used internally to look up a GeneticCode entry from the most recently loaded NCBI Taxonomy database; if an entry is found where its GeneticCode.NCBIID matches genetic_code, its GeneticCode.WID is stored in NucleicAcid.GeneticCodeWID for any nucleic acids of this organism. |
original_db | DBID.XID; DBID.OtherWID = BioSource.WID |
BioSource is updated. The row that is updated is the row whose BioSource.Name matches the concatenation of the genus, species, and strain fields. The update removes the strain from BioSource.Name and populates BioSource.Strain with it. If the strain name remains in BioSource.Name after loading, this indicates that the Taxon row for that organism is either missing or erroneous.
CMR Attribute | Warehouse Semantics |
---|---|
genus, species | Concatenated together with separating spaces to form the genome's BioSource.Name. This identifies the associated BioSource entry that is updated. |
strain | BioSource.Strain. |
taxon_id | Used internally to look up the taxon entry in the Warehouse being loaded from a previously loaded NCBI Taxonomy database; this is stored in BioSource.TaxonWID. If the taxon is not found, NULL is stored. |
short_name | SynonymTable.Syn; SynonymTable.OtherWID = BioSource.WID |
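The BioSource update described above can be pictured with a short sketch. It assumes that the existing name is the space-separated concatenation of genus, species, and strain, and that a row which does not match is left untouched; the function name `split_strain` and the dictionary return value are hypothetical, not part of the loader.

```python
from typing import Dict, Optional


def split_strain(current_name: str, genus: str, species: str,
                 strain: str) -> Optional[Dict[str, str]]:
    """Return updated BioSource column values, or None if the row does not match."""
    # Assumption: the pre-update name is "genus species strain", space-separated.
    expected = " ".join(part for part in (genus, species, strain) if part)
    if current_name != expected:
        # No matching row: the strain stays in BioSource.Name, which the text
        # above describes as a sign of a missing or erroneous Taxon row.
        return None
    return {"Name": f"{genus} {species}", "Strain": strain}


print(split_strain("Escherichia coli K12", "Escherichia", "coli", "K12"))
```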
NucleicAcid row is created for each entry in asmbl_data. If the assembly has a name, a Description row for the NucleicAcid is created.
CMR Attribute | Warehouse Semantics |
---|---|
name | NucleicAcid.Name |
db_data_id | NucleicAcid.BioSourceWID; determined via a lookup table that associates each db_data_id with a BioSource.WID. |
type | NucleicAcid.Class if the value is 'chromosome' or 'plasmid', else 'other'; NucleicAcid.FullySequenced is 'T' if the value is 'chromosome' or 'plasmid', else 'F'; a value other than 'chromosome' or 'plasmid' or 'pseudomolecule' causes a parse error. A pseudomolecule is a molecule that has been put together using individual contigs (which may or may not be ordered) in an "unfinished" genome. This means the assembly is not actually closed, but we shoved the pieces we had together to make a single molecule (the CMR isn't really set up to handle multiple contigs). This is the sequence we use to separate the contigs in the pseudomolecule (stops and starts in all 6 reading frames): NNNNNCACACACTTAATTAATTAAGTGTGTGNNNNN |
topology | NucleicAcid.Topology; any value other than 'circular' or 'linear' is loaded as 'other'. |
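The type and topology rules in the table above can be sketched as two small functions. This is an illustrative reading of the rules, not the loader's code; the function names and the use of `ValueError` to stand in for the loader's parse error are assumptions.

```python
from typing import Dict

FINISHED_TYPES = ("chromosome", "plasmid")
ACCEPTED_TYPES = FINISHED_TYPES + ("pseudomolecule",)


def classify_assembly_type(asm_type: str) -> Dict[str, str]:
    """Map an asmbl_data type value to NucleicAcid.Class and FullySequenced."""
    if asm_type not in ACCEPTED_TYPES:
        # The loader reports a parse error here; ValueError is a stand-in.
        raise ValueError(f"unrecognized assembly type: {asm_type!r}")
    finished = asm_type in FINISHED_TYPES
    return {
        "Class": asm_type if finished else "other",
        "FullySequenced": "T" if finished else "F",
    }


def normalize_topology(topology: str) -> str:
    """Any value other than 'circular' or 'linear' is loaded as 'other'."""
    return topology if topology in ("circular", "linear") else "other"


print(classify_assembly_type("pseudomolecule"))
print(normalize_topology("unknown"))
```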
Subsequence row is created for each entry in assembly. Sequences can be rather large, sometimes several megabytes long. Note that your database server may impose a limit on data transfers that causes sequences to be truncated (see Limitations).
CMR Attribute | Warehouse Semantics |
---|---|
asmbl_id | Determines Subsequence.NucleicAcidWID by looking up the nucleic acid of asmbl_id in a lookup table |
sequence | Subsequence.Sequence; its length is stored in Subsequence.Length and in NucleicAcid.MoleculeLength where NucleicAcid.WID = Subsequence.NucleicAcidWID. |
comment | CommentTable.Comm; CommentTable.OtherWID = Subsequence.WID |
change_date | Entry.ModifiedDate; Entry.OtherWID = Subsequence.WID |
Protein and a Gene entry for the ORF.
CMR Attribute | Warehouse Semantics |
---|---|
locus | DBID.XID; DBID.OtherWID = Gene.WID. Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader. |
nt_locus | Gene.GenomeID; SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID |
com_name | Protein.Name |
gene_sym | Gene.Name |
ec# | SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID. Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database: if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an EnzymaticReaction entry is added where this Reaction.WID is EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID. No check for the validity of the EC number is made. |
comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
nt_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
pub_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
auto_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
date | Entry.ModifiedDate; Entry.OtherWID = Protein.WID and Entry.ModifiedDate; Entry.OtherWID = Gene.WID. |
GeneWIDProteinWID for each entry, to associate the gene with its protein product.
Protein and a Gene entry for the ORF.
CMR Attribute | Warehouse Semantics |
---|---|
locus | DBID.XID; DBID.OtherWID = Gene.WID. Note: by CMR convention, each locus in this table begins with "NTL", to designate it as not sequenced at TIGR. Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader. |
nt_locus | Gene.GenomeID; SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID |
com_name | Protein.Name |
gene_sym | Gene.Name |
ec# | SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID. Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database: if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an EnzymaticReaction entry is added where this Reaction.WID is EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID. No check for the validity of the EC number is made. |
comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
nt_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
pub_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
auto_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
date | Entry.ModifiedDate; Entry.OtherWID = Protein.WID and Entry.ModifiedDate; Entry.OtherWID = Gene.WID. |
GeneWIDProteinWID for each entry, to associate the gene with its protein product.
Protein and a Gene entry for the ORF.
CMR Attribute | Warehouse Semantics |
---|---|
locus | Gene.GenomeID; DBID.XID; DBID.OtherWID = Gene.WID. Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader. |
com_name | Protein.Name |
gene_sym | Gene.Name |
ec# | SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID. Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database: if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an EnzymaticReaction row is added where this Reaction.WID is EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID. No check for the validity of the EC number is made. |
comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
nt_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
pub_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
auto_comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
date | Entry.ModifiedDate; Entry.OtherWID = Protein.WID and Entry.ModifiedDate; Entry.OtherWID = Gene.WID. |
GeneWIDProteinWID for each entry, to associate the gene with its protein product.
No tables are changed directly based on loading this table.
CMR Attribute | Warehouse Semantics |
---|---|
id | Ignored. |
parent_locus | Ignored unless loading the primary variant of CMR. Otherwise, if this locus is a non-TIGR locus, the associated child_locus is marked so that it is excluded from the load of CMR. |
child_locus | See parent_locus |
NucleicAcid and a Subsequence are created for it, the latter to contain its sequence.
Feature is created for it. Its sequence is ignored. If the value is not among the recognized values listed above, a warning is issued.
CMR Attribute | Warehouse Semantics |
---|---|
name | Ignored |
feat_type ['ORF','NTORF'] | Gene.Type = 'polypeptide'. |
db_data_id | Used internally to map this entry to its associated BioSource. |
asmbl_id | Used internally to map this entry to the NucleicAcid of the replicon that contains this gene, which is stored in Gene.NucleicAcidWID. |
locus | Used internally to map this entry to its associated Protein and Gene, which are created during the loading of the new_ident table. |
sequence | Ignored (except for consistency checking; see below). |
protein | Protein.AASequence where Protein.WID is indicated by locus. |
end5, end3 | If end5 < end3, Gene.Direction is set to 'forward', else it is set to 'reverse'. min(end5, end3) is stored in Gene.CodingRegionStart. max(end5, end3) is stored in Gene.CodingRegionEnd. Note that these are indices into NucleicAcid.Sequence. Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex. |
comment | CommentTable.Comm; CommentTable.OtherWID = Protein.WID |
date | Entry.ModifiedDate; Entry.OtherWID = Protein.WID |
Consistency checking is performed on the lengths of the nucleotide and protein sequences. If the nucleotide length is not 3 * the amino acid length, a diagnostic is issued, and the entry is marked in the Entry table as containing an error.
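The coordinate handling and the length check described above are simple enough to sketch directly. The function names below are hypothetical, and the sketch takes the stated rule (nucleotide length equals 3 times the amino acid length) at face value; how the loader records the diagnostic is not shown.

```python
from typing import Dict, Union


def gene_coordinates(end5: int, end3: int) -> Dict[str, Union[str, int]]:
    """Derive Gene.Direction and the coding region bounds from end5 and end3."""
    return {
        "Direction": "forward" if end5 < end3 else "reverse",
        "CodingRegionStart": min(end5, end3),
        "CodingRegionEnd": max(end5, end3),
    }


def lengths_consistent(nucleotide_seq: str, protein_seq: str) -> bool:
    """True when the nucleotide length is exactly 3 * the amino acid length."""
    return len(nucleotide_seq) == 3 * len(protein_seq)


print(gene_coordinates(400, 100))
print(lengths_consistent("ATGAAA", "MK"))
```

Note that this mapping assumes the gene does not span the origin of a circular replicon, in line with the caveat in the table.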
BioSourceWIDProteinWID and BioSourceWIDGeneWID to associate the gene and protein with the organism it occurs in.
CMR Attribute | Warehouse Semantics |
---|---|
name | Ignored |
feat_type ['sRNA', 'NTmisc_RNA'] | Determines NucleicAcid.Type = 'RNA' and Gene.Type and NucleicAcid.Class = 'other'. |
feat_type ['rRNA', 'NTrRNA'] | Determines NucleicAcid.Type = 'RNA' and Gene.Type and NucleicAcid.Class = 'rRNA'. |
feat_type ['tRNA', 'NTtRNA'] | Determines NucleicAcid.Type = 'RNA' and Gene.Type and NucleicAcid.Class = 'tRNA'. |
db_data_id | Used internally to map this entry to its associated BioSource. |
asmbl_id | Used internally to map this entry to the NucleicAcid of the replicon that contains this gene, which is stored in Gene.NucleicAcidWID. |
locus | Gene.GenomeID, and DBID.XID; DBID.OtherWID = NucleicAcid.WID for the nucleic acid that is the RNA product. |
sequence | Subsequence.Sequence where Subsequence.NucleicAcidWID is indicated by asmbl_id. Note the DNA sequence is stored explicitly. |
protein | Ignored (should be empty). |
end5, end3 | If end5 < end3, Gene.Direction is set to 'forward', else it is set to 'reverse'. min(end5, end3) is stored in Gene.CodingRegionStart. max(end5, end3) is stored in Gene.CodingRegionEnd. Note that these are indices into NucleicAcid.Sequence. Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex. |
comment | CommentTable.Comm; CommentTable.OtherWID = NucleicAcid.WID |
date | Entry.ModifiedDate; Entry.OtherWID = NucleicAcid.WID |
For each entry, a row is added to Gene, NucleicAcid, and Subsequence. CMR does not identify a gene associated with this type of entry; it is constructed from available information.
Note that there are two nucleic acids associated with each entry of this type: the nucleic acid for the replicon containing the feature, and the nucleic acid that represents the gene product. Gene.NucleicAcidWID references the replicon this feature occurs on, not the gene product. Gene.SubsequenceWID references the Subsequence that is created for this entry.
BioSourceWIDGeneWID to associate the gene with the organism it occurs in.
A row is added to GeneWIDNucleicAcidWID to associate the gene with its RNA product.
CMR Attribute | Warehouse Semantics |
---|---|
name | Feature.Description |
feat_type [anything but 'ORF', 'NTORF', and those containing 'RNA'] | Feature.Type. |
db_data_id | Used internally to map this entry to its associated BioSource. |
asmbl_id | Used internally to map this entry to the NucleicAcid of the replicon that contains this gene, which is associated with the feature using the linking table NucleicAcidWIDFeatureWID. |
locus | DBID.XID; DBID.OtherWID = Feature.WID. |
sequence | Ignored. |
protein | Ignored (should be empty). |
end5 | Feature.StartPosition. This is an index into the full sequence of the replicon the feature occurs on. |
end3 | Feature.EndPosition. This is an index into the full sequence of the replicon the feature occurs on. |
comment | CommentTable.Comm; CommentTable.OtherWID = Feature.WID |
date | Entry.ModifiedDate; Entry.OtherWID = Feature.WID |
A row is added to Feature for each entry.
The column value Feature.Class is always NULL.
The column value Feature.Type is always 'N', indicating that this is a nucleic acid feature, and that the sequence is not explicitly stored (it is available by indexing the replicon).
The column value Feature.SequenceWID always references the NucleicAcid of the replicon the feature occurs on.
The column value Feature.ExperimentalSupport is always 'F' (false).
The column value Feature.ComputationalSupport is always 'T' (true).
NucleicAcid.BioSourceWID of the replicon.
Subsequence or Protein is updated according to the attribute defined in the entry.
CMR Attribute | Warehouse Semantics |
---|---|
locus | Used internally to map this entry to its associated Subsequence or Protein |
score | Specifies the value of the attribute. |
att_type | Determines the attribute whose value is specified by score: 'GC' - Subsequence.PercentGC; 'MW' - Protein.MolecularWeightCalc, converted from Daltons to kiloDaltons; 'PI' - Protein.PICalc |
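The att_type dispatch in the table above can be sketched as follows. The function name `map_attribute`, the tuple return value, and returning None for an unrecognized att_type are assumptions made for illustration; the source does not say how the loader treats other values.

```python
from typing import Optional, Tuple


def map_attribute(att_type: str, score: float) -> Optional[Tuple[str, str, float]]:
    """Return (warehouse table, column, value) for one ORF_attribute entry."""
    if att_type == "GC":
        return ("Subsequence", "PercentGC", score)
    if att_type == "MW":
        # CMR supplies Daltons; the warehouse column holds kiloDaltons.
        return ("Protein", "MolecularWeightCalc", score / 1000.0)
    if att_type == "PI":
        return ("Protein", "PICalc", score)
    # Assumption: other attribute types are not mapped to any column.
    return None


print(map_attribute("MW", 52000.0))
```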
CrossReference table for each entry in accession.
CMR Attribute | Warehouse Semantics |
---|---|
locus | An internal lookup table of all loci from the previously parsed new_ident is searched; the associated Gene WID is CrossReference.OtherWID. |
accession_db | CrossReference.DataBaseName |
accession_id | CrossReference.XID; CrossReference.OtherWID = Gene.WID |
The search algorithm employed for the all vs. all searches is blastp. See NCBI BLAST site for details. There is no maximum number of matches allowed in the all_vs_all searches. However, no match under 10% identity or 40% similarity is reported, and no match with a P-value over 1 is reported in the database. Other parameters are:
CMR Attribute | Warehouse Semantics |
---|---|
locus | An internal lookup table of all loci from the previously parsed new_ident is searched; the associated Protein WID is SequenceMatch.QueryWID. |
accession | An internal lookup table of all loci from the previously parsed new_ident is searched; the associated Protein WID is SequenceMatch.MatchWID. |
pvalue | SequenceMatch.PValue. |
per_id | SequenceMatch.PercentIdentical. |
per_sim | SequenceMatch.PercentSimilar. |
match_len | SequenceMatch.Length. |
match_order | SequenceMatch.Rank. |