BioWarehouse Utilities



This document describes the set of utility programs available for various tasks in the BioWarehouse.

General Instructions
Master Build & Install Script
Perl Utilities
"Find the Object" Program
Database Summary (HTML Dump) Program
DataSet Summary Program
DataSet Deletion Program
Force Drop Schema Program
Schema Diff Tool


General Instructions

Many of the utility scripts are run from Ant and read their application parameters, as name-value pairs, from a properties file.  This file is typically called "developer.properties" and should be placed in the directory from which the Ant command is invoked.  Ant commands should be issued from within the directory where the Ant file (typically called "build.xml") resides.  Each utility below has its own specific instructions.

Example developer.properties file:


host=localhost
dbms=mysql
port=3306
username=myuser
password=mypass
name=database-name-or-SID
version=1
release=test
file=filename-if-needed
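
Before invoking Ant it can help to confirm that the properties file defines the connection parameters. The following is a minimal sketch, not part of the BioWarehouse distribution: the function name check_properties is hypothetical, and the list of keys is an assumption taken from the example file above (a given utility may need more or fewer).

```shell
# Report connection properties that are missing from a properties file.
# ASSUMPTION: the key list mirrors the example developer.properties above;
# adjust it for the utility you intend to run.
check_properties() {
    props_file=$1
    missing=""
    for key in host dbms port username password name; do
        # A key counts as present when some line starts with "key=".
        if ! grep -q "^${key}=" "$props_file"; then
            missing="${missing}${missing:+ }${key}"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing: $missing"
        return 1
    fi
    echo "ok"
}
```

For example, "check_properties developer.properties" prints "ok" when all six keys are defined, and otherwise lists the missing ones and returns a non-zero status.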


Master Build & Install Script

Purpose: This master script can build and run any of the BioWarehouse loader programs using one script for either DBMS.  The script is Ant-based, so targets can be chained together in a single command.  Example usages are below.

Location: warehouse/utils/bin/warehouse-install.xml (build targets are defined in build.xml).

Usage:
Within the warehouse/utils/bin directory, copy the file developer.properties.template to developer.properties, and edit the file for your values.  You only need to define values for the properties you intend to use, which includes the database connection parameters and any of the loader properties for loaders you want to run.

To view the available actions for this file, type:

osprompt: ant -f warehouse-install.xml -p

Buildfile: warehouse-install.xml

Main targets:

 build-all               Build all Warehouse loaders
 build-all-c             Build all C-based Warehouse loaders
 build-all-java          Build all Java-based Warehouse loaders
 build-biocyc            Build the Biocyc loader
 build-cmr               Build the CMR loader
 build-enzyme            Build the Enzyme loader
 build-genbank           Build the GenBank loader
 build-go                Build the GO loader
 build-kegg              Build the KEGG loader
 build-mage              Build the MAGE loader
 build-metacyc-ontology  Build the Metacyc Ontology loader
 build-ncbi-taxonomy     Build the NCBI Taxonomy loader
 build-uniprot           Build the Uniprot loader
 clean-all               Clean all Warehouse loaders
 clean-all-c             Clean all C-based Warehouse loaders
 clean-all-java          Clean all Java-based Warehouse loaders
 clean-biocyc            Clean the Biocyc loader
 clean-cmr               Clean the CMR loader
 clean-enzyme            Clean the Enzyme loader
 clean-genbank           Clean the GenBank loader
 clean-go                Clean the GO loader
 clean-kegg              Clean the KEGG loader
 clean-mage              Clean the MAGE loader
 clean-metacyc-ontology  Clean the Metacyc Ontology loader
 clean-ncbi-taxonomy     Clean the NCBI Taxonomy loader
 clean-uniprot           Clean the Uniprot loader
 destroy-schema          Run the schema destruction script
 load-all                Resets schema and loads all loaders.
 load-biocyc             Runs the Biocyc loader.
 load-cmr                Runs the CMR loader.
 load-enzyme             Runs the Enzyme loader.  Dependencies: NCBI Taxonomy
 load-genbank            Runs the GenBank loader.  Dependencies: NCBI Taxonomy
 load-go                 Runs the GO loader.
 load-kegg               Runs the KEGG loader.
 load-mage               Runs the MAGE loader.
 load-metacyc-ontology   Runs the METACYC loader.
 load-ncbi-taxonomy      Runs the NCBI Taxonomy loader.
 load-schema             Load the schema DDL files
 load-uniprot            Runs the Uniprot loader.  Dependencies: NCBI Taxonomy, Enzyme
 reset-schema            Resets the schema.
 reset-warehouse         Resets the warehouse.  Reloads schema, then runs NCBI Taxonomy.
Default target: help


If a loader depends on one or more other loaders (see the "Dependencies" notes in the target list above), run those loaders first.
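
The dependency rule can be sketched as a small shell helper that lists a loader's prerequisite targets before the loader itself. This is an illustration only: the function name targets_for is hypothetical, the dependency table is copied from the target descriptions in the listing above, and it assumes the prerequisite data sets have not already been loaded.

```shell
# Print the load targets a loader needs, prerequisites first.
# The dependency table is copied from the "Dependencies" notes above.
targets_for() {
    case "$1" in
        enzyme|genbank) echo "load-ncbi-taxonomy load-$1" ;;
        uniprot)        echo "load-ncbi-taxonomy load-enzyme load-uniprot" ;;
        *)              echo "load-$1" ;;
    esac
}

# Show the command that would be run, rather than running it.
echo "ant -f warehouse-install.xml $(targets_for uniprot)"
```

Printing the command first makes it easy to review the target order before anything is loaded into the database.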

Example Usages:

Load the schema and run the NCBI Taxonomy loader:

osprompt: ant -f warehouse-install.xml load-schema load-ncbi-taxonomy

Run the GO and BioCyc loaders (assuming schema has been defined):

osprompt: ant -f warehouse-install.xml load-go load-biocyc

Check that loaders will build on your system (compile all loaders but do not run them):

osprompt: ant -f warehouse-install.xml build-all


Perl Utilities

Various Perl scripts are provided with the Warehouse. This document explains how to configure the environment for these scripts, and how to run a script that is particularly useful for testing the Warehouse and summarizing its contents.

Setting up Perl DBI

Perl DBI must be installed first. It can be obtained from the site: http://www.cpan.org/modules/by-module/DBI/.

Oracle: Install Perl DBD for Oracle. It can be obtained from: http://www.cpan.org/modules/by-module/DBD/.

MySQL: Install Perl DBD for MySQL. It can be obtained from: http://www.cpan.org/modules/by-module/DBD/.

Querying existing datasets

The directory utils/src/perl contains the Perl script dataSetStats.pl. When run, it prints the data sets currently in the database and the number of entries in each. Usage is:

Oracle:

 osprompt: perl dataSetStats.pl oracle userid password [sid] [host]
sid defaults to the value of the environment variable ORACLE_SID.
host defaults to localhost.

MySQL:

 osprompt: perl dataSetStats.pl mysql userid password [database] [host]
database defaults to the value of the environment variable ORACLE_SID (though this may not be what you wanted).
host defaults to localhost.

As an example, output like the following is printed after running the Swiss-Prot, Enzyme and BioCyc loaders:

 WID: 2
NAME: Swiss-Prot
VERSION: 40.0
LOADDATE: 02-OCT-02
RELEASEDATE: October 2001
HOMEURL: http://www.expasy.org/sprot/
QUERYURL:
Number of Entries: 859675


WID: 859677
NAME: Enzyme
VERSION: unknown
LOADDATE: 03-OCT-02
RELEASEDATE: October 27, 2001
HOMEURL: http://www.expasy.org/enzyme/
QUERYURL:
Number of Entries: 15705


WID: 875382
NAME: BsubCyc
VERSION: 6.0
LOADDATE: 03-OCT-02
RELEASEDATE: 2002-02-15 00:00:00
HOMEURL: http://ecocyc.org:1555/BSUB/organism-summary?object=BSUB
QUERYURL: http://ecocyc.org:1555//BSUB/NEW-IMAGE?object=%s
Number of Entries: 10523
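
Output in this form can be condensed into one line per data set. The sketch below is a hypothetical helper (summarize_stats is not part of the distribution) and assumes the "NAME:" and "Number of Entries:" labels appear exactly as in the sample above.

```shell
# Print one "name<TAB>entries" line per data set, followed by a grand total.
# Reads dataSetStats.pl output on standard input.
# ASSUMPTION: the field labels match the sample output shown above.
summarize_stats() {
    awk '
        /^ *NAME:/ { sub(/^ *NAME: */, ""); name = $0 }
        /^ *Number of Entries:/ {
            sub(/^ *Number of Entries: */, "")
            printf "%s\t%d\n", name, $0
            total += $0
        }
        END { printf "TOTAL\t%d\n", total }
    '
}
```

A typical use would be to pipe the script into it, for example "perl dataSetStats.pl mysql userid password | summarize_stats".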

"Find the Object" Program

Purpose:  The "Find the Object" program determines which table a BioWarehouse object resides in, given its WID.

Location: warehouse/utils/src/java/build.xml

Usage:

Requires a developer.properties file with database connection parameters and a property called "wid" whose value is the WID of the object to be searched for.

osprompt: ant run-find-object
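
Because the WID is read from the properties file, looking up several objects means editing the "wid" line each time. A minimal sketch of a helper that does this follows; set_property is a hypothetical name, and the sketch assumes simple key=value lines with no continuation lines or escaped characters.

```shell
# Set key=value in a properties file, replacing any existing line for the
# key or appending one if the key is absent.
# ASSUMPTION: plain "key=value" lines, no continuations or escapes.
set_property() {
    file=$1 key=$2 value=$3
    tmp="${file}.tmp"
    # Copy every line except old definitions of the key, then append the new one.
    grep -v "^${key}=" "$file" > "$tmp" || true
    echo "${key}=${value}" >> "$tmp"
    mv "$tmp" "$file"
}
```

For example, "set_property developer.properties wid 875382" followed by "ant run-find-object" would search for the object with WID 875382 (a value borrowed from the sample output earlier in this document purely for illustration).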


Database Summary (HTML Dump) Program

Purpose:  This program queries the database and produces an HTML representation of all tables and all data in the database.  The resulting output file is called "summary.html". Only tables with one or more rows are represented in the output.  This program should not be run on large database instances, as the output file will be correspondingly large.

Location: warehouse/utils/src/java/build.xml

Usage:

Requires a warehouse.properties file with database connection parameters.

osprompt: ant run-html-dump


DataSet Summary Program

Purpose:  This program writes a summary of the datasets loaded into the BioWarehouse.  The resulting output file is called "datasets.txt".

Location: warehouse/utils/src/java/build.xml

Usage:

Requires a warehouse.properties file with database connection parameters.

osprompt: ant run-summarize-datasets

Hint: If the number of datasets is large, this text file may be difficult to read as the lines will wrap.  The output can be more easily read if opened in Excel, using the comma as the field delimiter.
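
Where a spreadsheet is not at hand, the comma-delimited file can also be reflowed on the command line. The following is a sketch only: show_records is a hypothetical helper, and it assumes fields contain no embedded commas or quoting, which the actual file format may not guarantee.

```shell
# Print each comma-separated record as a numbered list of fields,
# one field per line, so long records do not wrap.
# ASSUMPTION: fields contain no embedded commas or quotes.
show_records() {
    awk -F',' '{
        printf "record %d\n", NR
        for (i = 1; i <= NF; i++) printf "  %d: %s\n", i, $i
    }' "$1"
}
```

It would be invoked as "show_records datasets.txt".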


DataSet Deletion Program

Purpose: This utility deletes one or more datasets, identified by their DataSet WIDs, from the BioWarehouse.  It is useful if a loader was aborted during a run and the resulting dataset is incomplete.

Location: utils/bin/runDeleteDataSet.sh

Usage:


Prerequisite: First build the Warehouse common Java utilities at warehouse/utils/src/java using the command

ant build

Then use the script:

usage: runDeleteDataSet.sh
 -x,--datasetwids <datasetwids>   Comma-separated list of DataSet.WIDs of
                                  datasets to be deleted
 -d,--dbms <dbms>                 DBMS type (mysql or oracle)
 -h,--help                        Print usage instructions
 -n,--name <name>                 Name or SID of database
 -p,--properties <file>           Name of properties file
 -s,--host <host>                 Name or IP address of database server
                                  host
 -t,--port <port>                 Port database server is listening at
 -u,--username <username>         Username for connection to the database
 -w,--password <password>         Password for connection to the database

Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a property file are specified in name-value pairs. For
example: port=1234
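
The precedence rule in the usage text can be illustrated with a short shell sketch. This is not the program's own implementation: resolve_setting is a hypothetical function, and taking the last matching line when a key is repeated in the file is an assumption.

```shell
# Resolve one setting: a non-empty command-line value wins; otherwise the
# value is read from the properties file.
# ASSUMPTION: if a key is repeated in the file, the last definition is used.
resolve_setting() {
    key=$1 cli_value=$2 props_file=$3
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    else
        # Print the text after "key=" for each matching line, keep the last.
        sed -n "s/^${key}=//p" "$props_file" | tail -n 1
    fi
}
```

With "port=1234" in the file, "resolve_setting port 3306 developer.properties" prints 3306, while "resolve_setting port '' developer.properties" prints 1234. In the same way, passing -t on the runDeleteDataSet.sh command line together with -p would be expected to override the port given in the properties file.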


Alternatively, the Ant script at warehouse/utils/src/java/build.xml may be used (with the above properties specified in developer.properties):

osprompt: ant run-delete-dataset



Force Drop Schema Program

Purpose:  This program forcibly drops all tables in a database instance (and the two WID sequences if the database is Oracle).  It is useful if the database gets into a state where the schema destruction script does not work.  This may happen if there is an error while creating a schema, if the schema is altered from its default BioWarehouse definition, or if a previous run of the schema destruction script failed.  All data will be lost, so use this utility with extreme caution.

Location: warehouse/schema/build.xml

Usage:

Requires a warehouse.properties file with database connection parameters.

osprompt: ant drop-all
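
Since the operation is irreversible, one option is to wrap the Ant call in a confirmation step. The sketch below is a suggestion rather than anything shipped with BioWarehouse; confirm_drop is a hypothetical function, and the name passed to it is whatever database name or SID you expect to be dropping.

```shell
# Ask the operator to retype the database name before a destructive action.
# Returns success only when the typed text matches the expected name.
confirm_drop() {
    db_name=$1
    # The prompt goes to stderr so it does not mix with captured output.
    printf 'Type "%s" to drop ALL tables in that database: ' "$db_name" >&2
    read -r answer || return 1
    [ "$answer" = "$db_name" ]
}
```

It could then be combined with the target above as "confirm_drop database-name-or-SID && ant drop-all", so that Ant runs only after the name has been typed correctly.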



Schema Diff Tool

Purpose:  This utility reports differences between versions of the BioWarehouse schema (XML format).

Location: warehouse/utils/bin/runSchemaDiffTool.sh

Usage:

Prerequisite: First build the Warehouse common Java utilities at warehouse/utils/src/java using the command

ant build

Then use the script:

usage: runSchemaDiffTool.sh schema-file-1.xml schema-file-2.xml