Bio-SPICE BioWarehouse
Version 4.6
June 17, 2009

For documentation on the Bio-SPICE BioWarehouse, open doc/index.html with
the browser of your choice.

CHANGES for release 4.6 (June 2009)
=======================

Schema:
* Added index: BioSource.Strain.
* Added column: Feature.Variant

All C loaders:
* Accept YYYY/MM/DD format for release date.

KEGG loader:
* Support for KEGG version 50.
* Major change: the LIGAND:REACTION file is now translated. Most reactions
  are created by translating it rather than during translation of the
  LIGAND:ENZYME file.

CMR loader:
* Changed logic to exclude non-original RNA loci based on feat_type
  rather than naming conventions (which are not 100% reliable).

UniProt loader:
* Updated to UniProt 15.2 schema
* Improved handling of feature references
* Now store s in Feature.Variant

CHANGES for release 4.5 (December 2008)
=======================

Schema:
* Modified column: DataSet.ReleaseDate to be Date type instead of String

All loaders:
* Release date must be provided via the -r command line argument

All Java loaders:
* Java loaders now require Java 1.6+

All C loaders:
* Fixed Makefile bug in which not all old files were cleaned out; this may
  have caused strange problems when compiling for different architectures

BioCyc Loader:
* Extended allowed syntax for pathway links to gracefully ignore
  non-pathways

KEGG loader:
* Support for KEGG version 46.0. NOTE: the loader gives several hundred
  spurious syntax errors when parsing the Ligand:Enzyme file upon
  encountering PRODUCT or SUBSTRATE entries that are not tagged with a
  KEGG Compound identifier; these are not loaded into the Chemical table.

NCBI Taxonomy loader:
* Warns if a non-date or a date format other than YYYY-MM-DD is used to
  specify the dataset version

UniProt loader:
* Updated to UniProt format version 14.2

CHANGES for release 4.4 (October 2008)
=======================

UniProt Loader:
* Updated loader to be compatible with latest UniProt Schema (XSD),
  release 14.2

CHANGES for release 4.3 (September 2008)
=======================

Schema changes:
* New table: DataSetHierarchy
  Datasets may now be hierarchically organized. To convert any query of a
  specific dataset into one that queries that dataset and/or any contained
  datasets, join with DataSetHierarchy and add
  'AND SomeTable.datasetwid=DataSetHierarchy.subwid AND DataSetHierarchy.superwid=X'
  to the WHERE clause of the SELECT statement, where X is the dataset
  being queried.

New loader - ChIP-Chip Loader
* Loads data for ChIP-chip (chromatin immunoprecipitation with
  genome-tiling DNA chip expression) experiments

BioCyc loader:
* New options:
  -l (link hierarchically to most recent "BioCyc" dataset)
  -w (add all data to an existing dataset)
  -F (makes all input files optional)
* Bug alert: Under Oracle, -w hangs if datasetwid does not refer to a
  valid dataset.

CHANGES for release 4.2 (April 2008)
=======================

Schema changes:
* Added EvidenceType column to Support table and enumeration descriptions

BioCyc loader:
* Loads regulation data from file regulation.dat:
  - Full support for enzymatic regulation
  - Minimal support for transcriptional regulation (see BioCyc loader
    manual)
* Fixed bug in which no gene was linked to its transcription unit

UniProt Loader:
* Domain and component names are now treated as synonyms of the protein.

CHANGES for release 4.1 (June 2007)
=======================

Schema changes:
* New tables: TranscriptionUnit and TranscriptionUnitComponent.
* New columns in Feature table: RegionOrPoint and PointType.

BioCyc loader:
* Loads transcription units.
* Loads promoters, terminators, and DNA binding sites.
* Links MetaCyc terms to genes (if MetaCyc Ontology is loaded)
* Links GO terms for proteins (if GO Ontology is loaded)
* Fixed bug in loading publications - referent frames were sometimes
  clobbering the Citation columns of the publication referred to

MAGE loader:
* Fixed bug where MAGE Description element was improperly mapped
* MAGE OntologyEntry.category now mapped as parent Term of
  OntologyEntry.name

CHANGES for release 4.0 (November 2006)
=======================

Schema changes:
* Extensions to the schema to accommodate MicroArray Gene Expression data.
  Schema extensions were derived from the MAGE Object Model; see the
  MAGE loader documentation for more information
  (mage-loader/doc/index.html). 109 new tables added to the schema.
* Extensions to represent flow cytometry data:
  - new general-purpose table RelatedTerm for associating a term with an
    object
  - new tables LightSource, FlowCytometryProbe, FlowCytometrySample
  - new BioSource columns Diseased, Disease, ATCCId
  - new ExperimentData column MageData
  - see the new tables and ExperimentData.html for representation details
* Extensions to represent 2D protein gel data. Added tables
  ExperimentRelationship, GelLocation, ProteinWIDSpotWID, Spot,
  SpotIdMethod, and SpotWIDSpotIdMethodWID.
* Added range of reserved WIDs for loaders with particular WID needs
  (used by eco2dbase loader)
    Special WID range  - 1 to 999
    Reserved WID range - 1,000 to 999,999
    Regular WID range  - 1,000,000 on
* Replaced PointOfContact table with Contact table
* New schema source file organization
* New schema maintenance script (schema/build.xml)
* See complete change list at head of schema/core-schema.xml

New loader - MicroArray Gene Expression (MAGE) Loader
* Loads gene expression data in MAGE-ML format
* See loader documentation at mage-loader/doc/index.html

New loader - BioPAX Loader
* Loads data in BioPAX format
* See loader documentation at biopax-loader/doc/index.html

New loader - Eco2dbase Loader
* The Eco2dbase data set contains E.
  coli 2D Protein Gel data
* We provide MySQL and Oracle dump files of the eco2dbase data set that
  are directly loaded into the BioWarehouse
* See loader documentation at eco2dbase-loader/doc/index.html

All Java loaders
* Changed commit mechanism; should result in slight loader performance
  increases

New Warehouse Utility Programs
* See doc/Utils.html for complete details
* Master build and install script
* Database summary utilities
* Dataset deletion tool

New documentation:
* How to Write a Loader (doc/HowToWriteALoader.html)
* Schema design and modification guidelines (in doc/Schema.html)

CHANGES for release 3.8 (September 2006)
=======================

Schema changes:
* New tables: Interaction and InteractionParticipant. These are not
  currently used, but will be used to represent molecular interactions.
* New column: Chemical.Class ('T' if a chemical represents a class of
  related chemicals).

BioCyc loader:
* Bug fix for missing pathway reactions when reaction has no predecessors.
* Bug fix: proteins are now included in Reactant and Product tables when
  they occur as reaction substrates.
* New feature: substrates occurring in reactions, but not defined as
  compounds, are assumed to be classes of compounds. Each class is added
  to the Chemical table.

CHANGES for release 3.7 (August 2006)
=======================

CMR Loader:
* Support for CMR 19.0 - deprecate input files bcp_ident and bcp_nt_ident
  in favor of bcp_new_ident. Also, see the indexing change note below.

UniProt Loader:
* Support for UniProt 8.0 and later.

All C-based loaders:
* Bug fix: system header files that were not included caused errors on
  some platforms.

No schema table changes.

Indexing change:
* For MySQL, indexing of the SequenceMatch table has been disabled by
  commenting out index statements in schema/mysql-index.sql. This should
  only affect users who load the CMR database, or who load their own data
  into this table.
  The reason is that the table size (180 million rows) after the load of
  CMR makes index computation prohibitively slow. We are working on a
  solution; if you need this table indexed, please contact us.

CHANGES for release 3.6 (April 2006)
=======================

New support email address:
* For support and inquiries please contact support@biowarehouse.org

New loader - MetaCyc Ontology:
* Loads the MetaCyc Chemical Compounds ontology, the Multifun Gene
  ontology, and the MetaCyc Pathway ontology into three separate datasets.
* Each ontology term is stored in the Term table. Superclass relationships
  are stored in the TermRelationship table. Synonyms and alternative
  identifiers for a term are stored in the SynonymTable.
* See the loader documentation for full details about the schema mappings
  and how to build and run the loader
  (metacyc-ontology-loader/doc/index.html).

Major schema changes:
* Added columns Term.Hierarchical and Term.Root to better represent
  hierarchical ontologies.
* Added DataSet.ChangeDate; it is NULL while a loader is running or if a
  fatal error has occurred, and can be used by applications to track
  change times.

KEGG Loader:
* Support for versions 34.0 - 37.0. Loader now requires bison 1.875+.

UniProt Loader:
* Support for versions 7.1 - 7.2 of UniProt.

CHANGES for release 3.5 (July 2005)
=======================

New loader - Gene Ontology:
* Loads the Gene Ontology (termdb data only) into the BioWarehouse.
  Associations (instance data) are not loaded at this time.
* Each ontology term is stored in the Term table. Relationships between
  terms (is_a or part_of relationships) are stored in the TermRelationship
  table. Synonyms and alternative identifiers for a term are stored in the
  SynonymTable.
* See the loader documentation for full details about the schema mappings
  and how to build and run the loader (go-loader/docs/index.html).
* See also http://geneontology.org/.

MySQL changes:
* Version 4.1.2+ of MySQL is now supported.
  Most recently tested version is 4.1.10. FOREIGN KEY constraint checking
  is now enforced in MySQL.

Oracle changes:
* Upgraded the Oracle JDBC driver to be compatible with Oracle 10.1.0.4.0.
  If you experience any difficulty running the Java loaders against your
  version of Oracle, please email us at support@biowarehouse.org.

Major schema changes:
* Added columns LoadedBy, Application, and ApplicationVersion to DataSet
  table. These are populated by our loader applications, with LoadedBy
  being the value of the $USER environment variable, Application being the
  name of the loader program, and ApplicationVersion being the version of
  the BioWarehouse release (e.g., 3.5).
* Removed table TermWIDParentWID in favor of the new table
  TermRelationship. Added column Term.Definition.

UniProt/SwissProt/TrEMBL Loader:
* Updated the loader to work with UniProt schema v1.14 (UniProt release
  5.0).
* The contents of are now stored in the SynonymTable instead of DBID.
* Protein.MolecularWeightCalc is now stored as kDaltons, not Daltons.
* BioSource.Strain is now populated by the contents of as the individual
  elements under strain have been deprecated in the UniProt schema.

CMR Loader:
* Support for another command line argument for the load variant -
  original (see cmr-loader/doc/index.html for usage, and
  cmr-loader/doc/cmr-manual.html for discussion): if a genome has been
  sequenced at a non-TIGR site, only the non-TIGR ORFs are loaded; all
  TIGR ORFs for the genome are excluded. If a genome has been sequenced at
  TIGR, these TIGR ORFs are loaded.
* Numerous bug fixes:
  - Strain name is now stripped out of BioSource.Name. (It still populates
    BioSource.Strain).
  - Gene.Direction, Gene.CodingRegionStart and Gene.CodingRegionEnd are
    now populated as documented in the schema.
  - NucleicAcid.MoleculeLength is now populated for both DNA and RNA.
* Support for CMR version 16.0 (May 2005):
  - Read files named bcp_all_vs_all_00, bcp_all_vs_all_01, etc., as well
    as those named using the old convention bcp_all_vs_all_1,
    bcp_all_vs_all_2, etc.

BioCyc Loader:
* Fixed a bug (manifested in the MetaCyc DB) that caused many proteins and
  enzymatic reactions to be omitted from the load.
* Support for version 9.1 of BioCyc:
  - If a subpathway occurs in the PREDECESSORS slot of a pathway, all its
    reactions are included in the reaction graph for that pathway that is
    stored in the PathwayReaction table.

NCBI Taxonomy Loader:
* Removed extraneous white space from some names that also contain
  extraneous angle-bracketed information (e.g. the name of "Bacteria " is
  now stored as "Bacteria").

Enzyme Loader:
* Application command-line parameters changed to accommodate setting of
  the data version and release date.

CHANGES for release 3.1
=======================

Bug fixes:
* Some of the binary files in release 3.0 were corrupted. They have been
  replaced with fresh copies.

CHANGES for release 3.0 (December 2004)
=======================

Major schema changes:
* Organism table was removed in favor of BioSource and BioSubtype tables
* NucleicAcid table was significantly generalized
* Replicon table was removed in favor of generalized NucleicAcid table
* Subsequence table was added
* Renamed EnzReactionWIDChemicalWID table to EnzReactionAltCompound and
  added Cofactor column
* Made Gene.Name nullable
* Added GeneticCode.NCBIID
* Added Description table, used for certain defining or distinguished
  comments
* Generalized the Feature table
* Clarified that GeneWIDNucleicAcidWID associates a gene with its nucleic
  acid product
* Added Version column to DBID and CrossReference tables

New tool - Enumeration definitions loader
* Used when installing the Warehouse
* Loads the Enumeration table with definitions for all terminology and
  abbreviations used in various data columns of Warehouse tables

New loader - GenBank
* Loads the XML version of the GenBank database.
* Files need to be converted from the ASN1 format to XML, as described in
  the GenBank BioWarehouse documentation.
* Currently only the BCT division has been tested.

Swiss-Prot loader:
* The Swiss-Prot loader has been re-implemented as the "UniProt Loader",
  which reads files in the UniProt XML format.

Enzyme loader:
* GUI version has been deprecated in favor of the command line version.
* Simplified command line options.

CMR loader:
* Updated for latest version of data (circa November 2004)
* Now supports three variants - all, primary, and TIGR - for loading
  different subsets of Open Reading Frames
* If NCBI Taxonomy database is loaded, populates
  NucleicAcid.GeneticCodeWID
* If Enzyme database is loaded, populates EnzymaticReaction with EC number
  and associated reaction

KEGG loader:
* Updated for latest version of data (32.0)
* Improved parsing to remove spurious parse errors.
* Converted manual to HTML format.

BioCyc loader:
* Updated for latest version of data (8.5)
* Populates the EnzReactionCofactor, EnzReactionAltCompound, and
  EnzReactionInhibitorActivator tables
* Stores all comments defined in the source database
* Stores all DBLINKS defined in the source database in the CrossReference
  table

NCBI Taxonomy loader:
* Updated for latest version of data (circa August 2004)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CHANGES for release 2.1 (2004)
=======================

Added mysql and oracle targets to schemadoc Makefile to construct the
documentation.

Separated non-required table indexes from the schema proper:
* Updated schemadoc tool to accept concatenated schema+index file.
* Updated translate-schema.lisp to support separate file of indexes.

CHANGES for release 2.0.2
=========================

Support for KEGG version 27.0:
* Support RELEASE property in genome file for version 27+.
* Several fixes to compound and enzyme grammars.

Fixed bug in cmr-loader (Oracle): Gene.Direction is now always NULL, since
CMR apparently does not specify a transcription direction for its features.

Fixed bug in biocyc-loader: some syntax errors in genes, compounds, and
enzymatic reactions were not getting counted in error totals.

Fixed bug in miamexpress-loader/src/miamexpress-archive-loader.perl:
incorrect value for Experiment.ArchiveWID was being stored.

CHANGES for release 2.0.1
=========================

CMR loader - removed stale files from src/ that were causing a compile
error.

NCBI loader (Oracle) - now allocates a special (low-valued) WID for its
dataset.

CHANGES for release 2.0 (December 2003)
=======================

Extensive changes to support for the MySQL Database Management System.

Support for representation of experimental data within a Warehouse dataset.

New loaders
-----------
CMR (Comprehensive Microbial Resource)
NCBI Taxonomy
MIAMExpress archive
Versions of all loaders for MySQL.

Schema changes
--------------
The old schema file, warehouse-schema.sql, has been replaced with a
DBMS-independent template, schema-template.sql, as well as specific schema
for Oracle and MySQL, oracle-schema.sql and mysql-schema.sql. The template
is used to generate the specific schema files programmatically.

New tables:
  Warehouse
  WIDTable
  SpecialWIDTable
  NucleicAcid
  NucleicAcidWIDFeatureWID
  Computation
  Experiment
  ExperimentData
  Archive
  PointOfContact

New columns:
  Replicon.NucleicAcidWID
  Gene.NucleicAcidWID

Loader changes
--------------
Numerous fixes to error detection and error count reporting have been made.

Release 1.0 (November 2002)
-----------