BioWarehouse GenBank Loader

Version 4.6

(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.

This document describes how to build and run the BioWarehouse GenBank loader.  The GenBank loader is located in the genbank-loader/ subdirectory of the BioWarehouse distribution.  For more information about the GenBank loader, please see the GenBank Developer Manual.

Introduction
Building the Loader
Running the Loader
Troubleshooting
Developer Information


Introduction

The GenBank loader parses files containing data from the GenBank database developed at the US National Center for Biotechnology Information (NCBI). The file format used is the XML format produced by processing GenBank ASN.1 format files with the NCBI asn2xml tool.  

Building the Loader

Before building the loaders, make sure Java and Ant are properly set up according to the Environment Setup document, and make sure the environment tests for Java and Ant pass.

To build the loader, open a shell and navigate to the genbank-loader/ directory. Execute ant clean followed by ant. This builds the loader and places the distribution in the directory genbank-loader/dist. If the build is successful, the following files should be in genbank-loader/dist:

GenBank.properties
genbank-loader-4.6.jar
runGenBankLoader.sh

An example of the expected build output can be found here.  A list of all available build targets is available using ant -projecthelp.
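
The build and the check of its output can be sketched as a small shell helper. This is only an illustration: check_dist is a hypothetical name that is not part of BioWarehouse, and the jar is matched with a wildcard because its version number varies by release.

```shell
#!/bin/sh
# Sketch: confirm that the files listed above exist in the dist directory.
# check_dist is a hypothetical helper, not part of the BioWarehouse distribution.

# check_dist DIST_DIR
# Prints one line per missing item; returns non-zero if anything is missing.
check_dist() {
    dist_dir=$1
    missing=0
    for name in GenBank.properties runGenBankLoader.sh; do
        if [ ! -f "$dist_dir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    # The jar name carries a version number, so match it with a glob.
    found_jar=0
    for jar in "$dist_dir"/genbank-loader-*.jar; do
        if [ -f "$jar" ]; then
            found_jar=1
        fi
    done
    if [ "$found_jar" -eq 0 ]; then
        echo "missing: genbank-loader-*.jar"
        missing=1
    fi
    return "$missing"
}

# Typical use, run from the genbank-loader/ directory:
#   ant clean && ant && check_dist dist
```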

Running the Loader

The GenBank loader can be run remotely or on the same machine as the database server into which it loads data.

Obtaining the input data files
The latest supported data version for the GenBank loader is listed in the loader summary table. GenBank data files are available in several formats, including GenBank flat file format and ASN.1 format.  The BioWarehouse GenBank loader uses the ASN.1 format files, which must be converted to XML format using the asn2xml tool (described below).

The ASN.1 GenBank input files are available from ftp://ftp.ncbi.nih.gov/ncbi-asn1.  The README file at this site describes the file naming convention and what files are currently available.  In particular, the files are named gb*.aso.gz, where the * consists of a division name and a file number.  For example, there are eight files that comprise the bct division, and they are named gbbct1.aso.gz through gbbct8.aso.gz.  GenBank divisions are described here.
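
The naming convention can be illustrated with a short shell sketch that builds the file names and URLs for one division. The function names are hypothetical, and the file count is an assumption that you must supply yourself: it changes between GenBank releases, so check the README at the FTP site first.

```shell
#!/bin/sh
# Sketch: generate GenBank ASN.1 file names for one division.
# division_files and division_urls are hypothetical helper names.

BASE_URL="ftp://ftp.ncbi.nih.gov/ncbi-asn1"

# division_files DIVISION COUNT
# Prints gb<division>1.aso.gz through gb<division><COUNT>.aso.gz, one per line.
division_files() {
    division=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "gb${division}${i}.aso.gz"
        i=$((i + 1))
    done
}

# division_urls DIVISION COUNT
# Prints the full URL of each file in the division.
division_urls() {
    for f in $(division_files "$1" "$2"); do
        echo "$BASE_URL/$f"
    done
}
```

For example, division_urls bct 8 prints the eight URLs for the bct division described above, which can then be passed to a download tool of your choice.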

Currently, we have only tested the loader on the BCT division.  Please see the Known Limitations section of the GenBank Developer Manual.

These files must be converted to XML using the asn2xml tool, available at ftp://ftp.ncbi.nih.gov/toolbox/xml/asn2xml/.  Versions of the tool are available for a variety of operating systems, including Linux.  The conversion of the ASN.1 format to XML is described at ftp://ftp.ncbi.nlm.nih.gov/toolbox/data_specs/OBSOLETE/asn2xml/README.asn2xml.

Usage and help for the asn2xml tool are available by executing:

prompt> asn2xml -

asn2xml 1.0   arguments:

  -i  Filename for asn.1 input [File In]
    default = stdin
  -e  Input is a Seq-entry [T/F]  Optional
    default = F
  -s  Input is a Seq-submit [T/F]  Optional
    default = F
  -b  Input asnfile in binary mode [T/F]  Optional
    default = T
  -o  Filename for XML output [File Out]  Optional
    default = stdout
  -l  Log errors to file named: [File Out]  Optional

Example usage:

prompt> gunzip input.aso.gz
prompt> asn2xml -i input.aso -o output.xml
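
When a division consists of several files, the two steps above can be wrapped in a loop. The following is a minimal sketch, assuming all downloaded files sit in one directory; convert_all and the ASN2XML variable are hypothetical names, and only the -i and -o options shown in the usage listing are used.

```shell
#!/bin/sh
# Sketch: gunzip and convert every *.aso.gz file in a directory.
# ASN2XML is a hypothetical variable naming the asn2xml binary to run.
ASN2XML=${ASN2XML:-asn2xml}

# convert_all DIR
# Stops at the first file that fails to decompress or convert.
convert_all() {
    dir=$1
    for gz in "$dir"/*.aso.gz; do
        [ -f "$gz" ] || continue      # the glob matched nothing
        aso=${gz%.gz}                 # gbbct1.aso.gz -> gbbct1.aso
        xml=${aso%.aso}.xml           # gbbct1.aso    -> gbbct1.xml
        gunzip "$gz" || return 1
        "$ASN2XML" -i "$aso" -o "$xml" || return 1
        echo "converted: $xml"
    done
}
```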

The XML files produced by the asn2xml tool reference a DTD (NCBI_Seqset.dtd).  The NCBI_Seqset DTD in turn references a set of DTD module files (.mod files) that together define the XML schema.  The parser uses the DTD to parse the files, so you must download the DTD and .mod files into the same directory as the input XML files.  They are available from http://www.ncbi.nlm.nih.gov/IEB/.  (Click on the "XML DTDs" link.)   Please note: the DOCTYPE declaration at the top of each generated XML file may reference the DTD by an absolute path, such as the following:

<!DOCTYPE Bioseq-set PUBLIC "-//NCBI//NCBI Seqset/EN" "/NCBI_Seqset.dtd">

The initial slash in "/NCBI_Seqset.dtd" tells the parser to look in your root directory.  This will most likely need to be changed to a local or relative path.  We suggest you remove the initial slash to indicate that the DTD files are in the same directory as your input XML files.  For example:

<!DOCTYPE Bioseq-set PUBLIC "-//NCBI//NCBI Seqset/EN" "NCBI_Seqset.dtd">
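
Because the converted files can be too large to open comfortably in an editor, the change can also be made with sed. This is a sketch only: fix_doctype is a hypothetical helper, and it handles just the exact path "/NCBI_Seqset.dtd" shown above. A different absolute path in your files would need a different pattern.

```shell
#!/bin/sh
# Sketch: remove the leading slash from the DTD path in a converted XML file.
# fix_doctype is a hypothetical helper name.

# fix_doctype FILE
# Rewrites "/NCBI_Seqset.dtd" to "NCBI_Seqset.dtd" and replaces FILE.
fix_doctype() {
    file=$1
    tmp_file="$file.tmp"
    # Write to a temporary file first so a failed sed leaves FILE untouched.
    sed 's|"/NCBI_Seqset\.dtd"|"NCBI_Seqset.dtd"|' "$file" > "$tmp_file" &&
        mv "$tmp_file" "$file"
}
```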



Loading the GenBank data
The script runGenBankLoader.sh in the dist/ directory is used to parse and load the GenBank dataset.  The script has the following usage:

usage: runGenbankLoader.sh
 -l,--load-all              Load all data, including data found to be
                            suspect
 -d,--dbms <dbms>           DBMS type (mysql or oracle)
 -f,--file <file>           Name of input data file
 -h,--help                  Print usage instructions
 -n,--name <name>           Name or SID of database
 -p,--properties <file>     Name of properties file
 -r,--release               Release date of the input dataset
 -s,--host <host>           Name or IP address of database server host
 -t,--port <port>           Port database server is listening at
 -u,--username <username>   Username for connection to the database
 -v,--version               Version number of the input dataset
 -w,--password <password>   Password for connection to the database

Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a property file are specified in name-value pairs. For
example: port=1234


Example:  Running the loader using only command line arguments:

./runGenBankLoader.sh -d oracle -f input.xml -n biospice -s localhost -t 1521 -u myname -w mypassword -v 123 -r "March 28, 2006"

Example:  Running the loader using a properties file:

Edit GenBank.properties to have the required values:

dbms=oracle
file=input.xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=March 28, 2006

Then run the script by passing in the name of the properties file:

./runGenBankLoader.sh -p GenBank.properties
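
Since values on the command line take precedence over those in a properties file, one properties file can hold the connection settings while -f selects the input file for each run. The sketch below loops over several converted files in that way. load_all and the LOADER variable are hypothetical names, and it is an assumption that the files of a division can be loaded one after another into the same database; this document does not confirm it.

```shell
#!/bin/sh
# Sketch: run the loader once per input file with a shared properties file.
# LOADER is a hypothetical variable naming the loader script to run.
LOADER=${LOADER:-./runGenBankLoader.sh}

# load_all PROPERTIES_FILE XML_FILE...
# The -f option on the command line overrides any file= entry in the
# properties file. Stops at the first run that fails.
load_all() {
    props=$1
    shift
    for xml in "$@"; do
        echo "loading: $xml"
        "$LOADER" -p "$props" -f "$xml" || {
            echo "loader failed on: $xml" >&2
            return 1
        }
    done
}
```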

Expected output
The GenBank loader uses the Apache log4j logging system.  Output from the program is written to a file called GenBankLoader.log.  The log file is overwritten each time the loader is run.  The default threshold for the log file is set to "INFO". 

It is possible to override the default log4j configuration by copying the configuration file GenBank-log4j-config.xml from the etc/ directory to the dist/ directory and editing the threshold values for the console appender or file appender.  For example, to print debugging information to standard output, change the console appender threshold from

<param name="Threshold" value="OFF"/>

to

<param name="Threshold" value="DEBUG"/>
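
The copy and the edit can be combined in one step, as in the sketch below. enable_console_debug is a hypothetical helper, and it assumes that the console appender is the only one whose threshold is "OFF" in the shipped configuration file; inspect the file before relying on it.

```shell
#!/bin/sh
# Sketch: copy the log4j configuration into dist/ with console debugging on.
# enable_console_debug is a hypothetical helper name.

# enable_console_debug ETC_DIR DIST_DIR
# Assumes only the console appender uses the threshold value OFF.
enable_console_debug() {
    etc_dir=$1
    dist_dir=$2
    config=GenBank-log4j-config.xml
    sed 's|<param name="Threshold" value="OFF"/>|<param name="Threshold" value="DEBUG"/>|' \
        "$etc_dir/$config" > "$dist_dir/$config"
}
```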

Troubleshooting

The GenBank loader often requires a very large amount of memory. If the loader reports "java.lang.OutOfMemoryError" before the parser report appears, the load has failed. Alternatively, the loader may fail to start at all, reporting an error such as "unable to allocate object heap". In either case, you can adjust the minimum and maximum heap sizes used by the loader. To do this, edit the file runGenBankLoader.sh and look for the line that executes java and sets the -ms and -mx parameters:
${JAVA_HOME}/bin/java -ms100M -mx1500M ... (rest omitted)
The default is a minimum heap size of 100M and a maximum heap size of 1500M. If the loader runs out of memory, try increasing the maximum heap size; if the error persists, the machine may not have enough memory. If the loader does not run because it cannot allocate the object heap, decrease the maximum heap size in small steps until the error goes away.
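
If you need to try several heap sizes, the -mx value can be changed from the command line rather than in an editor. The following is a sketch under the assumption that the script contains a single -mx option written as in the line above; set_max_heap is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: change the maximum heap size in the loader script.
# set_max_heap is a hypothetical helper name.

# set_max_heap SCRIPT SIZE
# Replaces the value after -mx (for example 1500M) with SIZE.
set_max_heap() {
    script=$1
    size=$2
    tmp_file="$script.tmp"
    # Copy the result back with cat so the script keeps its execute permission.
    sed "s/-mx[0-9][0-9]*[KkMmGg]*/-mx$size/" "$script" > "$tmp_file" &&
        cat "$tmp_file" > "$script" &&
        rm -f "$tmp_file"
}

# Typical use, run from the dist/ directory:
#   set_max_heap runGenBankLoader.sh 1200M
```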

Developer Information

For more detailed information on the implementation of the GenBank loader, please see the GenBank Developer Manual.