CMR Loader for BioWarehouse

Version 4.6


(C) 2005 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Introduction
Limitations
Installation and Building
Input data
Loader Dependencies and Prerequisites
Running the Loader
Variants of CMR
Dataset Specification
Sequence Comparison Computation Specification
Translation Semantics for CMR Tables
  • Table db_data
  • Table taxon
  • Table asmbl_data
  • Table assembly
  • Table new_ident
  • Table nt_ident
  • Table ident
  • Table feat_link
  • Table asm_feature
  • Table orf_attribute
  • Table accession
  • Table all_vs_all
References

    Introduction

    This document describes version 4.6 of the Comprehensive Microbial Resource (CMR) Loader, one of several database loaders comprising the BioWarehouse. The CMR Loader (referred to simply as the loader) loads a collection of files comprising the Comprehensive Microbial Resource (sometimes referred to as the Omniome database) into the BioWarehouse, an Oracle relational database that provides a common representation for diverse bioinformatics databases. CMR is developed and maintained by The Institute for Genomic Research (TIGR).

    Overview of BioWarehouse Schema

    The Bio-SPICE warehouse schema contains the data definition statements for the BioWarehouse. These include four different types of tables - constant tables, object tables, linking tables, and special tables.

    Constant tables specify scientific data such as information from the Periodic Table of Elements, as well as constants used as column values in various warehouse tables.

    Object tables describe a type of entity in a source database, such as compounds and proteins. Each column of an object table specifies a parameter that characterizes the object. In addition to the parameters defined by the source database, the loader assigns a unique warehouse ID (WID) to each object, which is used by other tables to reference the object.

    A special type of warehouse object is the dataset. A dataset object is created for each dataset loaded into the warehouse; e.g., the SWISS-PROT loader adds one row to the Dataset table when it is run. Its WID is referred to as the dataset WID and is a column in each object table, specifying the source database of the object.

    Linking tables describe relationships among objects. They contain the WIDs of the associated objects, and any additional columns needed to characterize the relationship. In general, many-to-many relationships are supported. Special tables exist to capture reference and crossreference information and to facilitate lookup of objects.

    Full schema information, including source files and browseable documentation, is available with this distribution.


    Limitations

    The CMR loader should not be run concurrently with other loaders. The loader is likely to fail in this case. This limitation was imposed for performance reasons.

    The latest supported data version for the CMR loader is listed in the loader summary table. The loader may not be compatible with future versions of CMR. Due to changes in the CMR flat file format, the loader is not compatible with earlier CMR versions.

    MySQL servers impose a limit on the maximum size of an SQL statement. The CMR loader currently assumes that this limit is at least 4,000,000 bytes, and truncates long data elements such as sequences to ensure the limit is not exceeded. The limit assumed by the loader is currently not changeable. Configure your MySQL server so that its max_allowed_packet parameter is at least 4M, as described in Environment Setup (MySQL).

    The CMR database contains modification dates for a number of its data types. However, these dates are ignored, rather than stored in Entry.ModifiedDate.


    Installation and Building

    See CMR installation instructions for details on installing and building the loader.

    Input data

    Currently, loadable data for CMR is available at the TIGR Omniome Database public FTP site.

    The CMR database consists of several tables. Data for each table is in a separate file or group of files. By CMR convention, files are named bcp_tablename. A subset of these are used by the loader. Each file is described in the following table, and is listed in the order in which it is loaded.
    NOTE: As of release 3.7, the ident and nt_ident tables have been deprecated in favor of the new_ident table. The former tables are loaded if present, in order to support backward compatibility.

    Input data for CMR Loader

    Table          File name(s)             Loaded when                           Record delimiter  Field delimiter
    db_data        bcp_db_data              Always                                Newline           Tab
    taxon          bcp_taxon                Always                                Newline           Tab
    asmbl_data     bcp_asmbl_data           Always                                &&&               @@@
    assembly       bcp_assembly             Always                                &&&               @@@
    new_ident      bcp_new_ident            Loading any variant except TIGR       &&&               @@@
                                            (see variant command line option)
    nt_ident       bcp_nt_ident             DEPRECATED as of 3.7 in favor of      &&&               @@@
                                            new_ident. Loading any variant
                                            except TIGR (see variant command
                                            line option)
    feat_link      bcp_feat_link            Loading PRIMARY variant               Newline           Tab
                                            (see variant command line option)
    ident          bcp_ident                DEPRECATED as of 3.7 in favor of      &&&               @@@
                                            new_ident. Always, but when loading
                                            PRIMARY ORFs (see variant command
                                            line option), genes duplicated in
                                            nt_ident are not loaded
    asm_feature    bcp_asm_feature          Always                                &&&               @@@
    ORF_attribute  bcp_ORF_attribute        Always                                Newline           Tab
    accession      bcp_accession            Always                                Newline           Tab
    all_vs_all     bcp_all_vs_all_1,        Always                                Newline           Tab
                   bcp_all_vs_all_2, etc.,
                   and/or
                   bcp_all_vs_all_01,
                   bcp_all_vs_all_02, etc.

    All other tables are ignored by the loader. All input files must reside in the same directory, which is specified as a command line parameter to the loader.

    There are two file formats. One format uses a newline to terminate records, and a tab character to separate fields within the record. The other format, which allows embedded tab and newline characters within its fields, uses the character sequence &&& as the record terminator and the character sequence @@@ as the field separator. In files with no line breaks between entries, any line number or line count reported by the loader refers to entries, not lines.
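
    The two formats can be illustrated with a minimal Python sketch. This is not the loader's own parser; the function names are hypothetical, and the treatment of a line break written directly after a &&& terminator is an assumption, since the format description above does not say whether such a break is data or layout.

```python
from typing import Iterator, List

RECORD_END = "&&&"  # record terminator of the delimited format
FIELD_SEP = "@@@"   # field separator of the delimited format


def parse_tab_format(text: str) -> Iterator[List[str]]:
    """Yield one list of fields per newline-terminated record."""
    for line in text.split("\n"):
        if line:  # skip the empty piece after the final newline
            yield line.split("\t")


def parse_delimited_format(text: str) -> Iterator[List[str]]:
    """Yield one list of fields per &&&-terminated record.

    Fields may contain embedded tabs and newlines, so only the explicit
    delimiters are treated as structure.
    """
    for record in text.split(RECORD_END):
        # Assumption: a line break directly after &&& is layout, not data.
        record = record.lstrip("\r\n")
        if record:
            yield record.split(FIELD_SEP)
```

    Because a record in the second format may span several physical lines, counting records rather than lines is the natural way to report positions in such files.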


    Loader Dependencies and Prerequisites

    Other than the standard warehouse creation procedure, the loader does not require that any other Warehouse tools be run prior to its execution. However, if the NCBI Taxonomy or Enzyme datasets have been loaded, the CMR loader builds useful links to objects in these datasets. A warning is issued if either of these datasets has not been loaded. If multiple versions of these datasets are present in the Warehouse, the dataset with the maximum WID (typically the most recently loaded) is used.

    The NCBI Taxonomy dataset is used to populate the NucleicAcid.GeneticCodeWID column for the nucleic acid molecules of each organism in CMR. See Table db_data for details.

    The Enzyme dataset is used to create an EnzymaticReaction for each CMR Open Reading Frame that specifies an Enzyme Commission (EC) number. See Table new_ident for details.
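
    The rule for choosing among multiple loaded versions of a dataset can be sketched as follows. This is an illustration only: the function name and the (WID, name) pair representation are hypothetical, and the name strings used to identify a dataset are assumptions rather than values taken from the warehouse.

```python
from typing import Iterable, Optional, Tuple


def newest_dataset_wid(datasets: Iterable[Tuple[int, str]], name: str) -> Optional[int]:
    """Return the largest WID among datasets called `name`, or None if absent."""
    wids = [wid for wid, dataset_name in datasets if dataset_name == name]
    return max(wids) if wids else None
```

    A result of None corresponds to the case in which the loader issues a warning and builds no links to that dataset.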


    Running the Loader

    The CMR installation instructions contain details for running the loader, including options and a description of its output.

    The CMR loader should not be run concurrently with other loaders. The loader is likely to fail in this case.

    To show progress, the loader prints a period for each 1000 WIDs allocated. For large input tables that don't allocate many WIDs, a colon is printed every 1000 input lines.

    Variants of CMR

    The loader accepts a variant command line option that specifies which Open Reading Frames (ORFs) and associated genes and proteins are to be loaded. Its semantics are as follows:

    To summarize, original loads only original (generally higher quality) annotations, while primary will load an automated annotation if there is no original annotation for that gene.

    To implement the various variants, the following locus naming conventions are assumed:

    1. 'NTL...' = original nonTIGR annotation
    2. 'NTx...' [x /= 'L'] = automated TIGR (non-original) annotation
    3. 'yz...' [yz /= 'NT'] = original TIGR annotation
    4. 'NT01MC...' = original locus (magnetococcus organism, exception to above rules)
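
    The naming conventions above can be expressed as a small classifier. The following Python sketch is an illustration under the stated rules, not the loader's code; the function names and the category labels it returns are hypothetical. The 'NT01MC' exception is tested first because it would otherwise match the general 'NT' rule.

```python
def classify_locus(locus: str) -> str:
    """Classify a CMR locus name by its prefix."""
    if locus.startswith("NT01MC"):
        # Rule 4: exception to the rules below, so it is checked first.
        return "original"
    if locus.startswith("NTL"):
        # Rule 1: original non-TIGR annotation.
        return "original non-TIGR"
    if locus.startswith("NT"):
        # Rule 2: 'NT' followed by anything other than 'L'.
        return "automated TIGR"
    # Rule 3: any prefix other than 'NT'.
    return "original TIGR"


def is_original(locus: str) -> bool:
    """True for every category except the automated TIGR annotation."""
    return classify_locus(locus) != "automated TIGR"
```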

    Dataset Specification

    The loader adds one row to the Dataset table as follows:

    Column values for Dataset row
    Column Value assigned by CMR loader
    WID The next available WID in the warehouse. Uniquely specifies this dataset in the warehouse.
    Name 'CMR'.
    ReleaseDate NULL
    Version Version assigned by TIGR to this release of CMR (e.g., '19.0').
    ReleaseDate The date that this version of CMR was released.
    LoadDate The time/date the loader was run.
    ChangeDate The date and time the loader completed, NULL if the loader did not complete successfully.
    LoadedBy The value of the system environment variable USER for the account running the loader.
    Application 'CMR Loader'
    ApplicationVersion 4.6
    HomeURL http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
    QueryURL http://www.tigr.org/tigr-scripts/CMR2/GenomeSlicer.spl


    Sequence Comparison Computation Specification

    The CMR database contains sequence comparison results in its all_vs_all table. These are stored in the SequenceMatch table of the warehouse. A description of the sequence comparison algorithm performed is stored in a row of the Computation table as follows:

    Column values for Computation row
    Column Value assigned by CMR loader
    WID Uniquely specifies this computation in the warehouse.
    Referred to by SequenceMatch table rows.

    Name 'blastp'.
    DatasetWID Dataset.WID for the entry in Dataset for this CMR dataset.
    Description 'The search algorithm employed for the all vs. all searches is blastp. See http://www.ncbi.nlm.nih.gov/BLAST for details. There is no max number of matches allowed in the all_vs_all searches. However, nothing under 10% identity or 40% similarity are reported, and nothing over a P-value of 1 is reported in the database.'

    The ComputationWID column of every row added to SequenceMatch refers to this row of Computation.


    Translation Semantics for CMR Tables

    This section describes the semantic mapping from the tables comprising the CMR database, and their associated flat file representation, to their representation in the BioSpice data warehouse. Semantics are expressed in tabular form, showing the mapping of each source attribute to the warehouse Table.Column values computed from it. The most typical case is that the attribute is simply copied into a warehouse column; if translation is more complex, an explanation is given. Any attributes not listed are ignored.

    Some attributes can occur multiple times for a source object. The notation ATTRIBUTE[*] is used to indicate that the semantics apply to all occurrences; typically a row is added to a warehouse table for each. The notation ATTRIBUTE[1], ATTRIBUTE[2], etc., is used where the attribute order is significant.

    If an attribute is missing from a source file but required by the warehouse schema (i.e., its column is qualified with NOT NULL), a warning is issued. If the missing attribute is not required, NULL is stored.

    Comments that are NULL or are "-" are ignored.


    Table db_data

    The db_data table contains a description of each genome present in CMR. A row is added to the BioSource table for each entry in db_data.


    Translation semantics for db_data
    CMR Attribute Warehouse Semantics
    organism_name BioSource.Name
    genetic_code Used internally to look up a GeneticCode entry from the most recently loaded NCBI Taxonomy database;
    if an entry is found where its GeneticCode.NCBIID matches genetic_code, its GeneticCode.WID is stored in NucleicAcid.GeneticCodeWID for any nucleic acids of this organism.

    original_db DBID.XID; DBID.OtherWID = BioSource.WID

    Linking Tables

    No rows are added to linking tables when this file is loaded. However, various linking table rows are added when subsequent tables are loaded.

    Table taxon

    The taxon table supplements the genome descriptions in the db_data table. For each entry, an existing row of BioSource is updated: the row whose BioSource.Name matches the concatenation of the genus, species, and strain fields. The update removes the strain from BioSource.Name and stores it in BioSource.Strain. If the strain name remains in BioSource.Name after loading, the taxon row for that organism is either missing or erroneous.


    Translation semantics for taxon
    CMR Attribute Warehouse Semantics
    genus,
    species
    Concatenated together with separating spaces to form the genome's BioSource.Name.
    This identifies the associated BioSource entry that is updated.

    strain BioSource.Strain.
    taxon_id Used internally to look up the taxon entry in the Warehouse being loaded from a previously loaded NCBI Taxonomy database;
    this is stored in BioSource.TaxonWID. If the taxon is not found, NULL is stored.

    short_name SynonymTable.Syn; SynonymTable.OtherWID = BioSource.WID

    Linking Tables

    No rows are added to linking tables when this file is loaded.
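
    The BioSource update described for the taxon table can be sketched in Python as follows. This is a simplified illustration, not the loader's implementation: BioSource rows are modeled as dictionaries, the function name is hypothetical, and it is assumed that the three fields are joined with single spaces and that the match on the name is exact.

```python
from typing import Dict, List, Optional


def apply_taxon_row(biosources: List[Dict[str, Optional[str]]],
                    genus: str, species: str, strain: str) -> bool:
    """Move the strain out of the matching BioSource name.

    Returns True if a BioSource row was found and updated.
    """
    full_name = " ".join(part for part in (genus, species, strain) if part)
    for row in biosources:
        if row["Name"] == full_name:
            row["Name"] = " ".join(part for part in (genus, species) if part)
            row["Strain"] = strain or None
            return True
    # No match: the strain stays in the name, which signals a missing or
    # erroneous taxon row for this organism.
    return False
```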

    Table asmbl_data

    The asmbl_data table contains information for each DNA molecule in CMR. A NucleicAcid row is created for each entry in asmbl_data. If the assembly has a name, a Description row for the NucleicAcid is created.


    Translation semantics for asmbl_data
    CMR Attribute Warehouse Semantics
    name NucleicAcid.Name
    db_data_id NucleicAcid.BioSourceWID; determined via a lookup table that associates each db_data_id with a BioSource.WID.
    type NucleicAcid.Class if the value is 'chromosome' or 'plasmid', else 'other';
    NucleicAcid.FullySequenced is 'T' if the value is 'chromosome' or 'plasmid', else 'F';
    a value other than 'chromosome', 'plasmid', or 'pseudomolecule' causes a parse error. A pseudomolecule is a molecule assembled from the individual contigs (which may or may not be ordered) of an "unfinished" genome. The assembly is not actually closed; the available pieces are joined into a single molecule because CMR is not designed to handle multiple contigs. The contigs in a pseudomolecule are separated by the following sequence, which contains stops and starts in all 6 reading frames: NNNNNCACACACTTAATTAATTAAGTGTGTGNNNNN

    topology NucleicAcid.Topology; any value other than 'circular' or 'linear' is loaded as 'other'.

    Linking Tables

    No linking table rows are added.
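
    The handling of the type and topology attributes can be summarized in a short Python sketch. The function names are hypothetical, and a ValueError stands in for the loader's parse error.

```python
from typing import Tuple

FINISHED_TYPES = {"chromosome", "plasmid"}


def translate_type(value: str) -> Tuple[str, str]:
    """Return (NucleicAcid.Class, NucleicAcid.FullySequenced) for a type value."""
    if value in FINISHED_TYPES:
        return value, "T"
    if value == "pseudomolecule":
        return "other", "F"
    # Any other value is rejected (modeled here as an exception).
    raise ValueError(f"unexpected asmbl_data type: {value!r}")


def translate_topology(value: str) -> str:
    """Return NucleicAcid.Topology for a topology value."""
    return value if value in {"circular", "linear"} else "other"
```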

    Table assembly

    The assembly table contains the sequence of a DNA molecule. A Subsequence row is created for each entry in assembly. Sequences can be rather large, sometimes several megabytes long. Note that your database server may impose a limit on data transfers that causes sequences to be truncated (see Limitations).


    Translation semantics for assembly
    CMR Attribute Warehouse Semantics
    asmbl_id Determines Subsequence.NucleicAcidWID by looking up the nucleic acid of asmbl_id in a lookup table

    sequence Subsequence.Sequence; its length is stored in Subsequence.Length
    and in NucleicAcid.MoleculeLength where NucleicAcid.WID = Subsequence.NucleicAcidWID.

    comment CommentTable.Comm; CommentTable.OtherWID = Subsequence.WID
    change_date Entry.ModifiedDate; Entry.OtherWID = Subsequence.WID

    Linking Tables

    No linking table rows are added.

    Table new_ident

    The new_ident table contains information for each Open Reading Frame (ORF) in CMR that was not sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.


    Translation semantics for new_ident
    CMR Attribute Warehouse Semantics
    locus DBID.XID; DBID.OtherWID = Gene.WID.
    Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader.

    nt_locus Gene.GenomeID
    SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID

    com_name Protein.Name
    gene_sym Gene.Name
    ec# SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;
    Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database:
    if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an
    EnzymaticReaction entry is added where this Reaction.WID is
    EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID.
    No check for the validity of the EC number is made.

    comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    nt_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    pub_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    auto_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    date Entry.ModifiedDate; Entry.OtherWID = Protein.WID and
    Entry.ModifiedDate; Entry.OtherWID = Gene.WID.

    Linking Tables

    A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.
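
    The EC number lookup described for the ec# attribute can be sketched as follows. This is an illustration only: the dictionary mapping Reaction.ECNumber to Reaction.WID is a hypothetical stand-in for the query against the loaded Enzyme dataset, and the function name is invented for this example.

```python
from typing import Dict, Optional, Tuple


def link_enzymatic_reaction(ec_number: Optional[str], protein_wid: int,
                            reaction_wid_by_ec: Dict[str, int]) -> Optional[Tuple[int, int]]:
    """Return (ReactionWID, ProteinWID) for a new EnzymaticReaction row, or None."""
    if not ec_number:
        return None
    # Exact string match only; the EC number itself is not validated.
    reaction_wid = reaction_wid_by_ec.get(ec_number)
    if reaction_wid is None:
        return None
    return reaction_wid, protein_wid
```

    The synonym row for the EC number is stored whether or not a matching reaction is found; only the EnzymaticReaction row depends on the lookup.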

    Table nt_ident

    NOTE: this table is no longer present in CMR, as of version 3.7.
    The nt_ident table contains information for each Open Reading Frame (ORF) in CMR that was not sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.


    Translation semantics for nt_ident
    CMR Attribute Warehouse Semantics
    locus DBID.XID; DBID.OtherWID = Gene.WID.
    Note: by CMR convention, each locus in this table begins with "NTL", to designate it as not sequenced at TIGR.
    Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader.

    nt_locus Gene.GenomeID
    SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID

    com_name Protein.Name
    gene_sym Gene.Name
    ec# SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;
    Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database:
    if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an
    EnzymaticReaction entry is added where this Reaction.WID is
    EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID.
    No check for the validity of the EC number is made.

    comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    nt_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    pub_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    auto_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    date Entry.ModifiedDate; Entry.OtherWID = Protein.WID and
    Entry.ModifiedDate; Entry.OtherWID = Gene.WID.

    Linking Tables

    A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.

    Table ident

    NOTE: this table is no longer present in CMR, as of version 3.7.
    The ident table contains information for each Open Reading Frame (ORF) in CMR that was sequenced by TIGR. Each row defines a Protein and a Gene entry for the ORF.


    Translation semantics for ident
    CMR Attribute Warehouse Semantics
    locus Gene.GenomeID;
    DBID.XID; DBID.OtherWID = Gene.WID.
    Used internally to associate a Protein.WID and a Gene.WID with this entry, to facilitate subsequent updates to these rows by the loader.

    com_name Protein.Name
    gene_sym Gene.Name
    ec# SynonymTable.Syn; SynonymTable.OtherWID = Protein.WID;
    Also, it is used internally to look up a Reaction entry from the most recently loaded Enzyme database:
    if an Enzyme entry is found where Reaction.ECNumber matches ec# exactly, an
    EnzymaticReaction row is added where this Reaction.WID is
    EnzymaticReaction.ReactionWID and Protein.WID is EnzymaticReaction.ProteinWID.
    No check for the validity of the EC number is made.
    comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    nt_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    pub_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    auto_comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
    date Entry.ModifiedDate; Entry.OtherWID = Protein.WID and
    Entry.ModifiedDate; Entry.OtherWID = Gene.WID.

    Linking Tables

    A row is added to GeneWIDProteinWID for each entry, to associate the gene with its protein product.

    Table feat_link

    The feat_link table contains links between related loci. It is used by the loader to determine whether to exclude secondary loci when only primary loci are being loaded (see Variants of CMR above).

    No warehouse tables are changed directly as a result of loading this table.

    Translation semantics for feat_link
    CMR Attribute Warehouse Semantics
    id Ignored.
    parent_locus Ignored unless loading the primary variant of CMR.
    When loading the primary variant, if this locus is a non-TIGR locus, the associated child_locus is marked so that it is excluded from the load of CMR.

    child_locus See parent_locus

    Linking Tables

    No linking table rows are added.
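
    The exclusion rule can be sketched in Python as follows. Two points are assumptions made for this illustration: that a non-TIGR locus is one whose name begins with 'NTL' (following the naming conventions in Variants of CMR), and that the variant option is compared without regard to case. The function name is hypothetical.

```python
from typing import Iterable, Set, Tuple


def excluded_child_loci(feat_links: Iterable[Tuple[str, str, str]], variant: str) -> Set[str]:
    """Collect child loci to skip, given (id, parent_locus, child_locus) rows."""
    excluded: Set[str] = set()
    if variant.lower() != "primary":
        return excluded  # feat_link is ignored for all other variants
    for _id, parent_locus, child_locus in feat_links:
        # Assumption: 'NTL' marks a non-TIGR locus.
        if parent_locus.startswith("NTL"):
            excluded.add(child_locus)
    return excluded
```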

    Table asm_feature

    The asm_feature table contains information for each feature identified on DNA molecules in CMR. There are various types of CMR features; each entry contains a feat_type attribute. The value of this attribute determines the warehouse loading semantics as follows. Since the loader semantics differ significantly depending on feat_type, each case is described separately.
    1. Protein features

      Translation semantics for asm_feature where feat_type is 'ORF' or 'NTORF'
      CMR Attribute Warehouse Semantics
      name Ignored
      feat_type ['ORF','NTORF'] Gene.Type = 'polypeptide'.
      db_data_id Used internally to map this entry to its associated BioSource.
      asmbl_id Used internally to map this entry to the NucleicAcid of the replicon that contains this gene,
      which is stored in Gene.NucleicAcidWID.

      locus Used internally to map this entry to its associated Protein and Gene,
      which are created during the loading of the new_ident table.

      sequence Ignored (except for consistency checking; see below).
      protein Protein.AASequence where Protein.WID is indicated by locus.
      end5,
      end3
      If end5 < end3, Gene.Direction is set to 'forward', else it is set to 'reverse'.
      min(end5, end3) is stored in Gene.CodingRegionStart.
      max(end5, end3) is stored in Gene.CodingRegionEnd.
      Note that these are indices into NucleicAcid.Sequence.
      Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex.

      comment CommentTable.Comm; CommentTable.OtherWID = Protein.WID
      date Entry.ModifiedDate; Entry.OtherWID = Protein.WID

      Consistency checking is performed on the lengths of the nucleotide and protein sequences. If the nucleotide length is not 3 * the amino acid length, a diagnostic is issued, and the entry is marked in the Entry table as containing an error.

      Linking Tables

      Rows are added to BioSourceWIDProteinWID and BioSourceWIDGeneWID to associate the gene and protein with the organism it occurs in.
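
      The coordinate mapping and the length check for protein features can be written compactly. The following Python sketch uses hypothetical function names and applies the rules exactly as stated above.

```python
from typing import Tuple


def gene_coordinates(end5: int, end3: int) -> Tuple[str, int, int]:
    """Return (Gene.Direction, Gene.CodingRegionStart, Gene.CodingRegionEnd)."""
    direction = "forward" if end5 < end3 else "reverse"
    return direction, min(end5, end3), max(end5, end3)


def lengths_consistent(nucleotide_sequence: str, protein_sequence: str) -> bool:
    """True if the nucleotide length is exactly 3 times the amino acid length."""
    return len(nucleotide_sequence) == 3 * len(protein_sequence)
```

      An entry for which lengths_consistent returns False corresponds to the case in which a diagnostic is issued and the entry is marked as containing an error.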

    2. RNA features (rRNA, sRNA, tRNA, NTrRNA, NTtRNA, and NTmisc_RNA)

      Translation semantics for asm_feature where feat_type contains 'RNA'
      CMR Attribute Warehouse Semantics
      name Ignored
      feat_type ['sRNA', 'NTmisc_RNA'] Determines NucleicAcid.Type = 'RNA' and
      Gene.Type and NucleicAcid.Class = 'other'

      feat_type ['rRNA', 'NTrRNA'] Determines NucleicAcid.Type = 'RNA' and
      Gene.Type and NucleicAcid.Class = 'rRNA'

      feat_type ['tRNA', 'NTtRNA'] Determines NucleicAcid.Type = 'RNA' and
      Gene.Type and NucleicAcid.Class = 'tRNA'

      db_data_id Used internally to map this entry to its associated BioSource.
      asmbl_id Used internally to map this entry to the NucleicAcid of the replicon that contains this gene,
      which is stored in Gene.NucleicAcidWID.

      locus Gene.GenomeID, and
      DBID.XID; DBID.OtherWID = NucleicAcid.WID for the nucleic acid that is the RNA product.

      sequence Subsequence.Sequence where Subsequence.NucleicAcidWID is indicated by asmbl_id.
      Note the DNA sequence is stored explicitly.

      protein Ignored (should be empty).
      end5,
      end3
      If end5 < end3, Gene.Direction is set to 'forward', else it is set to 'reverse'.
      min(end5, end3) is stored in Gene.CodingRegionStart.
      max(end5, end3) is stored in Gene.CodingRegionEnd.
      Note that these are indices into NucleicAcid.Sequence.
      Evidently CMR, as of this release, does not detect genes that span the origin of a circular replicon. For such genes, the coding region mappings are more complex.
      comment CommentTable.Comm; CommentTable.OtherWID = NucleicAcid.WID
      date Entry.ModifiedDate; Entry.OtherWID = NucleicAcid.WID

      For each entry, a row is added to Gene, NucleicAcid, and Subsequence. CMR does not identify a gene associated with this type of entry; it is constructed from available information.

      Note that there are two nucleic acids associated with each entry of this type - the nucleic acid for the replicon containing the feature, and the nucleic acid that represents the gene product. Gene.NucleicAcidWID references the replicon this feature occurs on, not the gene product. Gene.SubsequenceWID references the Subsequence that is created for this entry.

      Linking Tables

      A row is added to BioSourceWIDGeneWID to associate the gene with the organism it occurs in. A row is added to GeneWIDNucleicAcidWID to associate the gene with its RNA product.

    3. Other features

      Translation semantics for asm_feature where feat_type is anything but 'ORF', 'NTORF', and those containing 'RNA'
      CMR Attribute Warehouse Semantics
      name Feature.Description
      feat_type [anything but 'ORF', 'NTORF', and those containing 'RNA'] Feature.Type.
      db_data_id Used internally to map this entry to its associated BioSource.
      asmbl_id Used internally to map this entry to the NucleicAcid of the replicon that contains this gene, which is associated with the feature using the linking table NucleicAcidWIDFeatureWID.
      locus DBID.XID; DBID.OtherWID = Feature.WID.
      sequence Ignored.
      protein Ignored (should be empty).
      end5 Feature.StartPosition.
      This is an index into the full sequence of the replicon the feature occurs on.

      end3 Feature.EndPosition.
      This is an index into the full sequence of the replicon the feature occurs on.

      comment CommentTable.Comm; CommentTable.OtherWID = Feature.WID
      date Entry.ModifiedDate; Entry.OtherWID = Feature.WID

      A row is added to Feature for each entry. The column value Feature.Class is always NULL. The column value Feature.Type is always 'N', indicating that this is a nucleic acid feature, and that the sequence is not explicitly stored (it is available by indexing the replicon). The column value Feature.SequenceWID always references the NucleicAcid of the replicon the feature occurs on. The column value Feature.ExperimentalSupport is always 'F' (false). The column value Feature.ComputationalSupport is always 'T' (true).

      Linking Tables

      The biological source of the feature is referenced by NucleicAcid.BioSourceWID of the replicon.
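
    The three cases above are selected by the value of feat_type. A minimal Python sketch of that dispatch follows; the function names and the category labels are hypothetical, and only the RNA feature types named in this document are mapped to a class.

```python
# Gene.Type / NucleicAcid.Class for each RNA feature type listed above.
RNA_CLASS = {
    "sRNA": "other",
    "NTmisc_RNA": "other",
    "rRNA": "rRNA",
    "NTrRNA": "rRNA",
    "tRNA": "tRNA",
    "NTtRNA": "tRNA",
}


def feature_category(feat_type: str) -> str:
    """Select which of the three sets of semantics applies to an entry."""
    if feat_type in ("ORF", "NTORF"):
        return "protein"
    if "RNA" in feat_type:
        return "rna"
    return "other"


def rna_class(feat_type: str) -> str:
    """Return the class for an RNA feature type; unlisted types raise KeyError."""
    return RNA_CLASS[feat_type]
```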

    Table orf_attribute

    The orf_attribute table supplements the descriptions of open reading frames by providing values of attributes of their associated nucleic acid sequences and proteins. No rows are added to any warehouse tables; for each entry, a row in either Subsequence or Protein is updated according to the attribute defined in the entry.


    Translation semantics for orf_attribute
    CMR Attribute Warehouse Semantics
    locus Used internally to map this entry to its associated Subsequence or Protein
    score Specifies the value of the attribute.
    att_type Determines the attribute whose value is specified by score:
    'GC' - Subsequence.PercentGC
    'MW' - Protein.MolecularWeightCalc, converted from Daltons to kiloDaltons.
    'PI' - Protein.PICalc

    Linking Tables

    No linking table rows are added.
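
    The att_type dispatch can be sketched as follows. Warehouse rows are modeled as dictionaries, the function name is hypothetical, and returning False for an unrecognized att_type is an assumption, since this document does not state how such entries are handled.

```python
from typing import Dict


def apply_orf_attribute(att_type: str, score: float,
                        subsequence: Dict[str, float],
                        protein: Dict[str, float]) -> bool:
    """Update the Subsequence or Protein row for one orf_attribute entry."""
    if att_type == "GC":
        subsequence["PercentGC"] = score
    elif att_type == "MW":
        # CMR reports Daltons; the warehouse column holds kiloDaltons.
        protein["MolecularWeightCalc"] = score / 1000.0
    elif att_type == "PI":
        protein["PICalc"] = score
    else:
        return False  # assumption: unrecognized attribute types are skipped
    return True
```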

    Table accession

    The accession table contains crossreferences to other databases (e.g., SWISS-PROT, EcoCyc) from loci defined in CMR. A row is added to the CrossReference table for each entry in accession.


    Translation semantics for accession
    CMR Attribute Warehouse Semantics
    locus An internal lookup table of all loci from the previously parsed new_ident is searched;
    the associated Gene WID is CrossReference.OtherWID.

    accession_db CrossReference.DataBaseName
    accession_id CrossReference.XID; CrossReference.OtherWID = Gene.WID

    Linking Tables

    No rows are added to linking tables when this file is loaded.

    Table all_vs_all

    The all_vs_all table contains the results of exhaustive BLAST searches of every amino acid sequence in CMR. Specifically, each protein is compared with every other protein in CMR, and the best matches for each are stored, along with information that characterizes the match. Due to its size, the table is divided into multiple files.

    A row is added to SequenceMatch for each entry. A single row is added to Computation for all of CMR; SequenceMatch.ComputationWID references it to describe the sequence matching algorithm.

    The search algorithm employed for the all vs. all searches is blastp. See NCBI BLAST site for details. There is no maximum number of matches allowed in the all_vs_all searches. However, nothing under 10% identity or 40% similarity is reported, and nothing over a P-value of 1 is reported in the database. Other parameters are:

    NOTE: The CMR website provides a somewhat different sequence matching procedure online, called Blast-Extend-Repraze or BER, that computes a modified Smith-Waterman alignment. See the CMR website for details.


    Translation semantics for all_vs_all
    CMR Attribute Warehouse Semantics
    locus An internal lookup table of all loci from the previously parsed new_ident is searched;
    the associated Protein WID is SequenceMatch.QueryWID.

    accession An internal lookup table of all loci from the previously parsed new_ident is searched;
    the associated Protein WID is SequenceMatch.MatchWID.

    pvalue SequenceMatch.PValue.
    per_id SequenceMatch.PercentIdentical.
    per_sim SequenceMatch.PercentSimilar.
    match_len SequenceMatch.Length.
    match_order SequenceMatch.Rank.

    Linking Tables

    No linking table rows are added.
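
    The mapping of one all_vs_all entry to a SequenceMatch row can be sketched in Python as follows. The dictionary-based lookup of Protein WIDs by locus, the function name, and the choice to skip entries whose locus or accession is unknown are assumptions made for this illustration.

```python
from typing import Dict, Optional, Union

Value = Union[int, float]


def to_sequence_match(row: Dict[str, str],
                      protein_wid_by_locus: Dict[str, int],
                      computation_wid: int) -> Optional[Dict[str, Value]]:
    """Build the SequenceMatch column values for one all_vs_all entry."""
    query_wid = protein_wid_by_locus.get(row["locus"])
    match_wid = protein_wid_by_locus.get(row["accession"])
    if query_wid is None or match_wid is None:
        return None  # assumption: entries with an unknown locus are skipped
    return {
        "QueryWID": query_wid,
        "MatchWID": match_wid,
        "ComputationWID": computation_wid,
        "PValue": float(row["pvalue"]),
        "PercentIdentical": float(row["per_id"]),
        "PercentSimilar": float(row["per_sim"]),
        "Length": int(row["match_len"]),
        "Rank": int(row["match_order"]),
    }
```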

    References

  • CMR Home Page