KEGG Loader

(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


This document describes how to build and run the KEGG loader. The KEGG loader is located in the kegg-loader/ subdirectory of the warehouse distribution.

Building the KEGG Loader
Running the KEGG loader
Documentation

Building the KEGG Loader

Before building the loader, make sure the environment is configured, and ProC is installed properly, according to the Environment Setup document. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the kegg-loader/src/ directory. Then:

MySQL:
osprompt: make clean
osprompt: make db=mysql
Creates the file mysql-kegg-loader

Oracle:
osprompt: make clean
osprompt: make db=oracle
If file wh_oracle_util.c is reported missing, re-run the above make command:
osprompt: make db=oracle
Creates the file oracle-kegg-loader

Also, a symbolic link named "kegg-loader" is created, which points to the newly created executable. This can be used as a synonym for the most recently created DBMS-specific loader if desired.

If the build fails with warnings about header files that cannot be found, read the section on configuring ProC in Environment Setup. Run the simple ProC test and make sure ProC is configured properly.
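
The build steps above can be wrapped in a small shell function. The following is a minimal sketch, not part of the distribution: the function name build_kegg_loader is hypothetical, it assumes it is called from the kegg-loader/src/ directory, and it simplifies the Oracle case by re-running make once after any failed build rather than only when wh_oracle_util.c is reported missing.

```shell
#!/bin/sh
# Sketch only: build_kegg_loader is a hypothetical helper name.
# Run it from the kegg-loader/src/ directory.
build_kegg_loader() {
    db=$1
    # Only the two DBMS targets named in this document are accepted.
    case "$db" in
        mysql|oracle) ;;
        *) echo "usage: build_kegg_loader mysql|oracle" >&2; return 2 ;;
    esac

    make clean || return 1

    # First build attempt.
    make "db=$db" && return 0

    # Assumption: for Oracle, any first failure is treated like the
    # "wh_oracle_util.c is reported missing" case and make is re-run once.
    [ "$db" = oracle ] || return 1
    echo "first Oracle build failed; re-running make once" >&2
    make "db=$db"
}
```

For example, build_kegg_loader oracle runs make clean followed by make db=oracle, with a single retry if the first build fails.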

Running the KEGG loader

The KEGG loader can be run from the same machine where the Warehouse database is installed, or it can be run remotely. To run the Oracle loader remotely, consult the Oracle documentation for instructions. To run the MySQL loader remotely, simply provide the correct host parameter for the location of the MySQL server.

Obtaining the KEGG database

The latest supported data version for the KEGG loader is listed in the loader summary table. KEGG version and release information can generally be found at GenomeNet release information. Note that the KEGG version number is provided as a parameter to the loader.

Academic users may download archived KEGG versions by anonymous FTP from the KEGG FTP site. The files to be downloaded are genome, ligand.tar.gz, and genes.tar.gz.

The file genome should reside at the top level of the directory containing the data. The other files should be uncompressed in this directory. When ligand.tar.gz is uncompressed, the files compound, reaction, and enzyme should reside in a subdirectory named ligand/ (along with various other files that are not used by the loader). When genes.tar.gz is uncompressed, numerous gene files should reside in a subdirectory named genes.
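
Because a load can run for many hours, it may be worth confirming the directory layout before starting. The following is a minimal sketch of such a check, based only on the layout described above; the function name check_kegg_layout is hypothetical and is not part of the distribution.

```shell
#!/bin/sh
# Sketch only: check_kegg_layout is a hypothetical helper name.
# It checks the data directory layout described in this document.
check_kegg_layout() {
    datadir=$1
    status=0

    # genome sits at the top level; compound, reaction, and enzyme
    # come from ligand.tar.gz and sit under ligand/.
    for f in genome ligand/compound ligand/reaction ligand/enzyme; do
        if [ ! -f "$datadir/$f" ]; then
            echo "missing file: $datadir/$f" >&2
            status=1
        fi
    done

    # genes.tar.gz should unpack into a genes/ subdirectory.
    if [ ! -d "$datadir/genes" ]; then
        echo "missing directory: $datadir/genes" >&2
        status=1
    fi

    return $status
}
```

For example, check_kegg_layout /space/bio/databases/KEGG/released returns 0 when every expected file and directory is present, and otherwise prints each missing path and returns 1. It does not check the contents of the genes/ subdirectory.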

Running the KEGG loader

The kegg-loader/src/ directory contains scripts to run the MySQL and Oracle loaders.

MySQL:

    ./run-mysql host database user password datadir version releasedate 
        host - The machine address where the MySQL server/database resides.
        database - Name of the MySQL database to be loaded.
        user - MySQL userid.
        password - MySQL password for userid.
        datadir - Directory which contains the KEGG data files to be loaded.
        version - Version of KEGG being loaded.
        releasedate - Release date of KEGG being loaded.
   
For example:
    ./run-mysql 123.45.67.8 warehouse me mypwd /space/bio/databases/KEGG/released 43.0 2008-04-01
This command loads KEGG data into the MySQL database named warehouse. The data files are located in the directory /space/bio/databases/KEGG/released and the user name and password used to access MySQL are me and mypwd.

Oracle:

    ./run-oracle "user/passwd" datadir version releasedate 
        user/passwd - User name and password. Ex: "dan/mypwd", "dan/mypwd@mydb"
        datadir - Directory which contains the KEGG data files.
        version - Version of KEGG being loaded.
        releasedate - Release date of KEGG being loaded.
   
For example:
    ./run-oracle "me/mypwd@mydb" /space/bio/databases/KEGG/released 43.0 2008-04-01
This command loads KEGG data into the Oracle database mydb. The data files are located in the directory /space/bio/databases/KEGG/released, and the user name and password used to access Oracle are me and mypwd.

On a 1.5 GHz machine with 1 GB of RAM, the MySQL KEGG loader takes 14 hours to load KEGG version 37.0. The loader may print numerous parse errors; this is expected. The output of the KEGG loader should closely match the following output:

KEGG loader output
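
Given the long run time, it can help to capture the loader output in a file so that it can be compared with the sample output afterwards. The following is a minimal sketch under stated assumptions: the function name run_logged and the log file name are hypothetical, and date +%s (seconds since the epoch) is assumed to be available, as it is with the GNU and BSD date commands.

```shell
#!/bin/sh
# Sketch only: run_logged is a hypothetical helper name.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift

    start=$(date +%s)
    status=0
    # Send both standard output and standard error to the log file.
    "$@" >"$log" 2>&1 || status=$?
    end=$(date +%s)

    echo "exit status $status after $((end - start)) seconds; output in $log"
    return $status
}
```

For example, run_logged kegg-load.log ./run-mysql 123.45.67.8 warehouse me mypwd /space/bio/databases/KEGG/released 43.0 2008-04-01 runs the MySQL loader exactly as shown earlier and leaves its output, including any parse errors, in kegg-load.log.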

After running the loader, you may run a query to determine how many warehouse objects were loaded. See the document on Running the Perl Utility scripts for how to do so. This query should report over one million entries loaded; the exact number depends on the version of the database that was loaded. KEGG 37.0 contains 1,818,129 entries.

Documentation

The following documents are located in the doc directory:

  • KEGG Manual (HTML)
  • Sample MySQL output