This document describes how to build and run the KEGG loader.
The KEGG loader is located in the kegg-loader/ subdirectory of the warehouse distribution.
Before building the loader, make sure the environment is configured, and ProC is installed properly, according to the Environment Setup document. Also make sure the schema is loaded into the database as specified in the Schema document.
To build the loader, bring up a shell and navigate to the kegg-loader/src/ directory. Then:
MySQL:
osprompt: make clean
osprompt: make db=mysql
Creates the file mysql-kegg-loader
Oracle:
osprompt: make clean
osprompt: make db=oracle
If the file wh_oracle_util.c is reported missing, re-run the above make command:
osprompt: make db=oracle
Creates the file oracle-kegg-loader
Also, a symbolic link named "kegg-loader" is created, which points to the newly created executable. This can be used as a synonym for the most recently created DBMS-specific loader if desired.
If the build fails with warnings about header files that are not found, read the section on configuring ProC in Environment Setup. Run the simple ProC test and make sure ProC is configured properly.
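The build steps above can be collected into a small shell function. This is a minimal sketch and not part of the distribution: the name build_kegg_loader is hypothetical, and it assumes that make exits with a non-zero status when wh_oracle_util.c is reported missing, which the text above does not state.

```shell
#!/bin/sh
# Hypothetical helper; run it from the kegg-loader/src/ directory.
# Usage: build_kegg_loader mysql|oracle
build_kegg_loader() {
    db=$1
    case $db in
        mysql|oracle) ;;
        *) echo "usage: build_kegg_loader mysql|oracle" >&2; return 2 ;;
    esac

    make clean || return 1

    if ! make db="$db"; then
        # Only the Oracle build is documented as needing a second attempt.
        # Assumption: the missing-file report makes make exit non-zero.
        [ "$db" = oracle ] || return 1
        echo "first build attempt failed; re-running make db=$db" >&2
        make db="$db" || return 1
    fi
}
```

For example, build_kegg_loader oracle runs make clean, then make db=oracle, and repeats the second command once if the first attempt fails. A MySQL build that fails is reported immediately, since no retry is documented for it.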
Obtaining the KEGG database
The latest supported data version for the KEGG loader is listed in the loader summary table. KEGG version and release information can generally be found at GenomeNet release information. Note that the KEGG version number is provided as a parameter to the loader.
Academic users may download archived KEGG versions by anonymous ftp from the KEGG FTP site. The files to be downloaded are:
- genome
- ligand.tar.gz
- genes.tar.gz
The file genome should reside at the top level of the directory containing the data. The other files should be uncompressed in this directory. When ligand.tar.gz is uncompressed, the files compound, reaction, and enzyme should reside in a subdirectory named ligand/ (along with various other files that are not used by the loader). When genes.tar.gz is uncompressed, numerous gene files should reside in a subdirectory named genes/.
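The directory layout described here can be checked before starting a load. The following is a minimal sketch; the function name check_kegg_datadir is hypothetical, and it tests only for the files and subdirectories named in this section, not for their contents.

```shell
#!/bin/sh
# Hypothetical helper: check_kegg_datadir DIR
# Returns 0 if DIR contains the files and subdirectories named in this
# section, 1 otherwise. File contents are not inspected.
check_kegg_datadir() {
    datadir=$1
    status=0

    # Regular files read by the loader.
    for f in genome ligand/compound ligand/reaction ligand/enzyme; do
        if [ ! -f "$datadir/$f" ]; then
            echo "missing file: $datadir/$f" >&2
            status=1
        fi
    done

    # Subdirectory that holds the gene files.
    if [ ! -d "$datadir/genes" ]; then
        echo "missing directory: $datadir/genes" >&2
        status=1
    fi

    return "$status"
}
```

The archives themselves can be unpacked in the data directory with tar -xzf ligand.tar.gz and tar -xzf genes.tar.gz. Afterwards, a call such as check_kegg_datadir /space/bio/databases/KEGG/released reports each missing item on standard error and returns non-zero if anything is absent.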
Running the KEGG loader
The kegg-loader/src/ directory contains scripts to run the MySQL and Oracle loaders.
MySQL:
./run-mysql host database user password datadir version releasedate
host - The machine address where the MySQL server/database resides.
database - Name of the MySQL database to be loaded.
user - MySQL userid.
password - MySQL password for userid.
datadir - Directory which contains the KEGG data files to be loaded.
version - Version of KEGG being loaded.
releasedate - Release date of KEGG being loaded.
For example:
./run-mysql 123.45.67.8 warehouse me mypwd /space/bio/databases/KEGG/released 43.0 2008-04-01
This command loads KEGG data into the MySQL database named warehouse. The data files are located in the directory /space/bio/databases/KEGG/released and the user name and password used to access MySQL are me and mypwd.
Oracle:
./run-oracle "user/passwd" datadir version releasedate
user/passwd - User name and password. Ex: "dan/mypwd", "dan/mypwd@mydb"
datadir - Directory which contains the KEGG data files.
version - Version of KEGG being loaded.
releasedate - Release date of KEGG being loaded.
For example:
./run-oracle "me/mypwd@mydb" /space/bio/databases/KEGG/released 43.0 2008-04-01
This command loads KEGG data into the Oracle database mydb. The data files are located in the directory /space/bio/databases/KEGG/released and the user name and password used to access Oracle are given as "me/mypwd".
On a 1.5 GHz machine with 1 GB RAM, the MySQL loader takes 14 hours to load KEGG version 37.0. The loader may print numerous parse errors; this is expected. The output of the KEGG loader should closely match the sample MySQL output in the doc directory.
After running the loader, you may run a query to determine how many warehouse objects were loaded. See the document on Running the Perl Utility scripts for how to do so. This query should report over one million entries loaded; the exact number depends on the version of the database that was loaded. KEGG 37.0 contains 1,818,129 entries.
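Because a load can run for many hours and its output is meant to be compared against the sample output, it can help to capture everything the loader prints in a file. The following is a minimal sketch; run_logged is a hypothetical helper, not a script shipped with the loader, and the log file name in the usage note is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND with standard output and standard error written to LOGFILE,
# appends the command's exit status to the log, and returns that status.
run_logged() {
    logfile=$1
    shift
    if "$@" > "$logfile" 2>&1; then
        status=0
    else
        status=$?
    fi
    echo "exit status: $status" >> "$logfile"
    return "$status"
}
```

For example, run_logged kegg-load.log ./run-mysql 123.45.67.8 warehouse me mypwd /space/bio/databases/KEGG/released 43.0 2008-04-01 leaves the loader output, including any parse errors, in kegg-load.log for later comparison.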
The following documents are located in the doc directory:
- KEGG Manual (HTML)
- Sample MySQL output