BioCyc Loader

(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.

This document describes how to build and run the BioCyc Loader. The BioCyc loader is located in the biocyc-loader/flatfile/ subdirectory of the warehouse distribution. For more information regarding the BioCyc loader, see the BioCyc Manual. Note that BioCyc contains a number of databases, each describing a particular organism. Multiple databases may be loaded into the Warehouse; each is contained in a separate Warehouse dataset.

Building the BioCyc Loader
Running the BioCyc Loader
Testing the BioCyc Loader
Documentation

Building the BioCyc Loader

Before building the loader, make sure the environment is configured according to Environment Setup and that the schema is loaded into the database as specified in the Schema document. If you want objects in BioCyc to be linked to GO terms in the Gene Ontology, be sure to run the Gene Ontology loader first. The Gene Ontology must be loaded into the BioWarehouse as a single DataSet (i.e., do not use the '-c' option).

To build the loader, bring up a shell and navigate to the biocyc-loader/flatfile/src/ directory. Then:

MySQL:
osprompt: make clean
osprompt: make db=mysql
Creates the file mysql-biocyc-loader

Oracle:
osprompt: make clean
osprompt: make db=oracle
If file wh_oracle_util.c is reported missing, re-run the above make command:
osprompt: make db=oracle
Creates the file oracle-biocyc-loader

Also, a symbolic link named "biocyc-loader" is created, which points to the newly created executable. This can be used as a synonym for the most recently created DBMS-specific loader if desired.

If the build fails with errors about header files that cannot be found, read the section on configuring the appropriate client in Environment Setup. Possible causes are an improper installation of Pro*C (Oracle) or library/header files installed in an incorrect location.
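The build steps above can be wrapped in a small script. The following is a minimal sketch, not part of the distribution: the helper names loader_name and build_loader are hypothetical, and it assumes the current directory is biocyc-loader/flatfile/src/. For Oracle it retries make once on any failure, which is broader than the specific wh_oracle_util.c case described above.

```shell
#!/bin/sh
# Sketch only: wraps the documented "make clean" / "make db=..." steps.
# loader_name and build_loader are hypothetical helper names.

# Map a DBMS name to the executable the build is documented to create.
loader_name() {
    case "$1" in
        mysql|oracle) printf '%s-biocyc-loader\n' "$1" ;;
        *) echo "unknown db: $1 (expected mysql or oracle)" >&2; return 1 ;;
    esac
}

# Run the documented build steps in the current directory.
build_loader() {
    exe=$(loader_name "$1") || return 1
    make clean || return 1
    # The Oracle build may need a second "make" if wh_oracle_util.c is
    # reported missing; this sketch simply retries once on any failure.
    make "db=$1" || { [ "$1" = oracle ] && make "db=$1"; } || return 1
    # Confirm that the expected executable was actually produced.
    [ -x "$exe" ] || { echo "build finished but $exe not found" >&2; return 1; }
    echo "built $exe"
}
```

After sourcing the script, build_loader mysql or build_loader oracle runs the corresponding build.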

Running the BioCyc Loader

The BioCyc loader can be run on the machine where the Warehouse database is installed, or it can be run remotely. To run the Oracle loader remotely, use the Net8 Configuration Assistant to add the remote database; consult the Oracle documentation for instructions. To run the MySQL loader remotely, provide the host parameter for the machine where the MySQL server resides.

Obtaining the BioCyc databases

The BioCyc Manual contains information regarding the BioCyc data sets. The latest supported data version for the BioCyc loader is listed in the loader summary table.

This section presents two examples, which use the bsubcyc and hincyc data sets. The example data sets were created using version 8.5 of B. subtilis (bsubcyc) and version 8.5 of Hm. influenzae (hincyc). Log files from the MySQL load of these databases are at load-mysql-bsubcyc.log and load-mysql-hincyc.log.

See http://biocyc.org/ for information on how to obtain a license and download the data. Also see Pathway/Genome Database (PGDB) for a description of the database file format.

Running the BioCyc loader

The biocyc-loader/flatfile/src/ directory contains scripts to run the MySQL and Oracle loaders.

MySQL:

    ./run-mysql host database user password datadir organism sourcedb version releasedate [options] 
host - The machine address where the MySQL server/database resides.
database - Name of the MySQL database to be loaded.
user - MySQL userid.
password - MySQL password for userid.
datadir - Directory which contains the PGDB database files to be loaded.
organism - Name of the BioCyc organism to be loaded. Ex: "Hm. influenzae"
sourcedb - Name of the BioCyc database to be loaded. Ex: "HinCyc"
version - Version of the BioCyc database to be loaded. Ex: "8.5"
releasedate - Release date of the BioCyc database to be loaded. Ex: 2008-04-01
options - Any of the following may be specified:
-l - Links the dataset created by the loader to a parent dataset named "BioCyc" by adding a row to the DataSetHierarchy table.
-m - Merges all data loaded into the dataset named "BioCyc". If "BioCyc" does not exist, it is created; if there are multiple "BioCyc" datasets, the one with the largest dataset WID is used.
-w datasetwid - Specifies the dataset into which all loaded data is merged; datasetwid must be the DataSet.WID of a previously loaded BioWarehouse dataset.
-F - Causes warnings, rather than a fatal error, to be issued for any missing input files.
For example:
    ./run-mysql 123.45.67.8 warehouse me mypwd /space/bio/databases/biocyc/bsubcyc/8.5 "B.subtilis" "BsubCyc" "8.5" 2008-04-01
This command loads the organism "B.subtilis" into the MySQL database named warehouse. The data files for the organism are located in the directory /space/bio/databases/biocyc/bsubcyc/8.5, and the user name and password used to access MySQL are me and mypwd.

Another example which loads hincyc:

    ./run-mysql machine.myco.com Biospice her herpwd /space/bio/databases/biocyc/hincyc/8.5  "Hm.influenzae" "HinCyc" "8.5" 2008-04-01
This command loads the organism "Hm.influenzae" into the MySQL database named Biospice. The data files for the organism are located in the directory /space/bio/databases/biocyc/hincyc/8.5, and the user name and password used to access MySQL are her and herpwd.
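Because each BioCyc database is loaded into its own dataset, loading several of them means repeating the run-mysql command with different arguments. The following is a minimal sketch of a wrapper for that, not part of the distribution: the helper name load_pgdb and the DRY_RUN variable are hypothetical, the connection values are the placeholders from the first example above, and the directory layout (base directory, then database name, then version) is assumed from the example paths. The -l option is appended so that each dataset is linked to the parent "BioCyc" dataset. By default the command is only printed, without its quoting.

```shell
#!/bin/sh
# Sketch only: builds one ./run-mysql command per PGDB.
# load_pgdb and DRY_RUN are hypothetical; the values below are the
# placeholder connection settings from the first example above.

HOST=123.45.67.8
DB=warehouse
DBUSER=me
DBPASS=mypwd
BASE=/space/bio/databases/biocyc

# Arguments: subdirectory, organism, sourcedb, version, releasedate
load_pgdb() {
    # Replace the positional parameters with the full command line,
    # in the documented order, followed by the -l option.
    set -- ./run-mysql "$HOST" "$DB" "$DBUSER" "$DBPASS" \
        "$BASE/$1/$4" "$2" "$3" "$4" "$5" -l
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"      # dry run (default): print the command only
    else
        "$@"           # run the loader for real
    fi
}
```

With the script sourced, load_pgdb bsubcyc "B.subtilis" BsubCyc 8.5 2008-04-01 prints the command for review; setting DRY_RUN=0 in the environment executes it instead.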

Oracle:

    ./run-oracle "user/passwd" datadir organism sourcedb version releasedate [options] 
user/passwd - Oracle user name and password, optionally followed by @ and the database name. Ex: "dan/mypwd", "dan/mypwd@mydb"
datadir - Directory which contains the PGDB database files.
organism - Organism Name. Ex: "Hm. influenzae"
sourcedb - Name of the BioCyc database to be loaded. Ex: "HinCyc"
version - Version of the BioCyc database to be loaded. Ex: "8.5"
releasedate - Release date of the BioCyc database to be loaded. Ex: 2008-04-01
options - Any of the following may be specified:
-l - Links the dataset created by the loader to a parent dataset named "BioCyc" by adding a row to the DataSetHierarchy table.
-m - Merges all data loaded into the dataset named "BioCyc". If "BioCyc" does not exist, it is created; if there are multiple "BioCyc" datasets, the one with the largest dataset WID is used.
-w datasetwid - Specifies the dataset into which all loaded data is merged; datasetwid must be the DataSet.WID of a previously loaded BioWarehouse dataset.
-F - Causes warnings, rather than a fatal error, to be issued for any missing input files.
For example:
    ./run-oracle "me/mypwd@mydb" /space/bio/databases/biocyc/bsubcyc/8.5 "B.subtilis" "BsubCyc" "8.5"  2008-04-01
This command loads the organism "B.subtilis" into the Oracle database mydb. The data files for the organism are located in the directory /space/bio/databases/biocyc/bsubcyc/8.5, and the user name and password used to access Oracle are me and mypwd.

Another example which loads hincyc:

    ./run-oracle "me/mypwd" /space/bio/databases/biocyc/hincyc/8.5 "Hm.influenzae" "HinCyc" "8.5"  2008-04-01
This command loads the organism "Hm.influenzae". The data files for the organism are located in the directory /space/bio/databases/biocyc/hincyc/8.5, and the user name and password used to access Oracle are me and mypwd.
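Both run scripts take a data directory and a release date, and a mistyped value is easier to catch before a load starts than after. The following is a minimal sketch of such a pre-flight check, not part of the distribution: the helper names is_release_date and preflight are hypothetical, and the YYYY-MM-DD shape is an assumption based on the 2008-04-01 examples above rather than a documented requirement of the loader.

```shell
#!/bin/sh
# Sketch only: sanity checks to run before ./run-mysql or ./run-oracle.
# is_release_date and preflight are hypothetical helper names.

# Return 0 if $1 has the shape YYYY-MM-DD (digits and dashes only;
# this does not verify that the date exists on the calendar).
is_release_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Arguments: datadir, releasedate
preflight() {
    [ -d "$1" ] || { echo "datadir not found: $1" >&2; return 1; }
    is_release_date "$2" || {
        echo "releasedate should look like 2008-04-01, got: $2" >&2
        return 1
    }
}
```

For example, preflight /space/bio/databases/biocyc/hincyc/8.5 2008-04-01 succeeds only if that directory exists and the date has the expected shape.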

Testing the BioCyc Loader

The loader may report parse errors; this is expected. The expected output when loading bsubcyc can be found here. The expected output when loading hincyc can be found here.

Query the database to confirm that the BioCyc data set was loaded. See the document on Running the Perl Utility scripts for how to do this. After running the query, the BioCyc data set should have 10523 entries loaded for bsubcyc and 5049 entries loaded for hincyc.
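The comparison against the expected entry counts can be scripted. The following is a minimal sketch, not part of the distribution: the helper names expected_entries and check_entries are hypothetical, and the observed count is passed in by hand because the invocation of the Perl utility scripts is not covered here. The expected values are those quoted above for the version 8.5 example data sets.

```shell
#!/bin/sh
# Sketch only: compares an observed entry count with the documented one.
# expected_entries and check_entries are hypothetical helper names.

# Print the documented entry count for an example data set (version 8.5).
expected_entries() {
    case "$1" in
        bsubcyc) echo 10523 ;;
        hincyc)  echo 5049 ;;
        *) echo "no expected count recorded for: $1" >&2; return 1 ;;
    esac
}

# Arguments: data set name, observed entry count
check_entries() {
    want=$(expected_entries "$1") || return 1
    if [ "$2" -eq "$want" ]; then
        echo "$1: OK ($2 entries)"
    else
        echo "$1: expected $want entries, got $2" >&2
        return 1
    fi
}
```

For example, check_entries hincyc 5049 reports success, while any other count for hincyc returns a non-zero status.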

Developer Documentation

The following documents are located in the doc directory: