BioWarehouse GenBank Loader
Version 4.6
This document describes how to build and run the BioWarehouse GenBank
loader. The GenBank loader is located in the genbank-loader/ subdirectory
of the BioWarehouse distribution. For more information about the GenBank
loader, please see the GenBank Developer Manual.
Introduction
The GenBank loader parses files containing data from the GenBank database,
developed at the US National Center for Biotechnology Information (NCBI).
The file format used is the XML format produced by processing GenBank
ASN.1 format files with the NCBI asn2xml tool.
Building the Loader
Before building the loader, make sure Java and Ant are properly set up
according to the Environment Setup document, and make sure the environment
tests for Java and Ant pass.
To build the loader, open a shell and navigate to the genbank-loader/
directory. Execute ant clean followed by ant. This will build the loader
and place the distribution in the directory genbank-loader/dist.
If the build is successful, the following files should be in
genbank-loader/dist:
GenBank.properties
genbank-loader-3.6.jar
runGenBankLoader.sh
An example of the expected build output can be found here. A list of all
available build targets can be displayed with ant -projecthelp.
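The check for the expected files can be scripted. The following is a minimal sketch, not part of the BioWarehouse distribution: check_dist is a hypothetical helper, and the file names are copied from the list above, so the jar name may need adjusting for your release.

```shell
#!/bin/sh
# Sketch only: report any expected build artifact missing from a dist directory.
# check_dist is a hypothetical helper; the file names come from the list above.
check_dist() {
    dist_dir=$1
    missing=0
    for f in GenBank.properties genbank-loader-3.6.jar runGenBankLoader.sh; do
        if [ ! -f "$dist_dir/$f" ]; then
            echo "missing: $dist_dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 only if every expected file was found.
    return "$missing"
}
```

From the genbank-loader/ directory, it could be used as: ant clean && ant && check_dist dist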
Running the Loader
The GenBank loader can be run remotely or on the same machine where the
database server into which it loads data resides.
Obtaining the input data files.
The latest supported data version for the GenBank loader is listed in
the loader summary table.
GenBank data files are available in several formats, including GenBank
flat file format and ASN.1 format. The BioWarehouse GenBank loader uses
the ASN.1 format files, which must be converted to XML format using the
asn2xml tool (described below).
The ASN.1 GenBank input files are available from ftp://ftp.ncbi.nih.gov/ncbi-asn1.
The README file at this site describes the file naming convention and what
files are currently available. In particular, the files are named
gb*.aso.gz, where the * consists of a division name and a file number. For
example, the eight files that comprise the bct division are named
gbbct1.aso.gz through gbbct8.aso.gz. GenBank divisions are described here.
Currently, we have only tested the loader on the BCT division.
Please see the Known Limitations section of the GenBank Loader Developers'
Manual.
The ASN.1 files must be converted to XML using the asn2xml tool, available
at ftp://ftp.ncbi.nih.gov/toolbox/xml/asn2xml/.
Versions of the tool are available for a variety of operating systems,
including Linux. The conversion of the ASN.1 format to XML is
described at ftp://ftp.ncbi.nlm.nih.gov/toolbox/data_specs/OBSOLETE/asn2xml/README.asn2xml.
Usage and help for the asn2xml tool are available by executing:
prompt> asn2xml -
asn2xml 1.0 arguments:
  -i  Filename for asn.1 input [File In]
      default = stdin
  -e  Input is a Seq-entry [T/F] Optional
      default = F
  -s  Input is a Seq-submit [T/F] Optional
      default = F
  -b  Input asnfile in binary mode [T/F] Optional
      default = T
  -o  Filename for XML output [File Out] Optional
      default = stdout
  -l  Log errors to file named: [File Out] Optional
Example usage:
prompt> gunzip input.aso.gz
prompt> asn2xml -i input.aso -o output.xml
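When a division consists of several files, the two commands above can be wrapped in a loop. This is a sketch under stated assumptions: xml_name and convert_all are hypothetical helpers, the output naming (gbbct1.aso.gz becomes gbbct1.xml) is an illustrative choice, and only the -i and -o options shown in the usage text are used.

```shell
#!/bin/sh
# Sketch only: batch-convert downloaded GenBank ASN.1 files to XML.
# xml_name and convert_all are hypothetical helpers, not part of BioWarehouse.

# Map an input name to an output name, e.g. gbbct1.aso.gz -> gbbct1.xml
# (the .xml naming is an assumption made for this sketch).
xml_name() {
    base=${1##*/}                 # strip any leading directory
    echo "${base%.aso.gz}.xml"    # swap the .aso.gz suffix for .xml
}

# Uncompress and convert every gb*.aso.gz file in the given directory.
convert_all() {
    dir=$1
    for gz in "$dir"/gb*.aso.gz; do
        [ -f "$gz" ] || continue               # skip the unexpanded pattern
        aso=${gz%.gz}
        gunzip -c "$gz" > "$aso" || return 1   # keep the .gz, write the .aso
        asn2xml -i "$aso" -o "$dir/$(xml_name "$gz")" || return 1
    done
}
```

For example, convert_all . would process every matching file in the current directory and stop at the first failure.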
The XML files produced by the asn2xml tool reference a DTD
(NCBI_Seqset.dtd). The NCBI_Seqset DTD references a set of DTD
files (.mod files) that together define the XML schema. The
parser uses the DTD to parse the files, so it is necessary to download
the schema files into the same directory as the input XML files.
The DTD and .mod files are available from http://www.ncbi.nlm.nih.gov/IEB/.
(Click on the "XML DTDs" link.) Please note: the generated XML files
reference the DTD at the top of the file, and this reference may include
an absolute path to the DTD files, such as the following:
<!DOCTYPE Bioseq-set PUBLIC "-//NCBI//NCBI Seqset/EN" "/NCBI_Seqset.dtd">
The initial slash in "/NCBI_Seqset.dtd" tells the parser to look in
your root directory. This will most likely need to be changed to
a local or relative path. We suggest you remove the initial slash
to indicate that the DTD files are in the same directory as your
input XML files. For example:
<!DOCTYPE Bioseq-set PUBLIC "-//NCBI//NCBI Seqset/EN" "NCBI_Seqset.dtd">
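For more than a handful of files, the DOCTYPE edit can be done with sed instead of by hand. The following is a sketch, assuming the DOCTYPE declaration sits on a single line as in the example above; fix_dtd_path is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: rewrite an absolute DTD path in the DOCTYPE line to a bare
# file name, so the parser looks next to the XML file.
# fix_dtd_path is a hypothetical helper, not part of BioWarehouse.
fix_dtd_path() {
    xml=$1
    tmp="$xml.tmp"
    # Only lines containing <!DOCTYPE are touched. Any quoted path ending in
    # /NCBI_Seqset.dtd is replaced by the bare file name.
    sed '/<!DOCTYPE/ s|"[^"]*/NCBI_Seqset\.dtd"|"NCBI_Seqset.dtd"|' "$xml" > "$tmp" \
        && mv "$tmp" "$xml"
}
```

A call such as fix_dtd_path output.xml rewrites the file in place; files whose DOCTYPE already uses the bare name are left unchanged in content.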
Loading the GenBank data
The script runGenBankLoader.sh in the dist/ directory is used to parse and
load the GenBank dataset. The script has the following usage:
usage: runGenbankLoader.sh
 -l,--load-all                Load all data, including data found to be suspect
 -d,--dbms <dbms>             DBMS type (mysql or oracle)
 -f,--file <file>             Name of input data file
 -h,--help                    Print usage instructions
 -n,--name <name>             Name or SID of database
 -p,--properties <file>       Name of properties file
 -r,--release                 Release date of the input dataset
 -s,--host <host>             Name or IP address of database server host
 -t,--port <port>             Port database server is listening at
 -u,--username <username>     Username for connection to the database
 -v,--version                 Version number of the input dataset
 -w,--password <password>     Password for connection to the database
Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a properties file are specified as name-value pairs.
For example: port=1234
Example: Running the loader using only command line arguments:
./runGenBankLoader.sh -d oracle -f input.xml -n biospice \
    -s localhost -t 1521 -u myname -w mypassword \
    -v 123 -r "March 28, 2006"
Example: Running the loader using a properties file:
Edit genbank.properties to have the required values:
dbms=oracle
file=input.xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=March 28, 2006
Then run the script by passing in the name of the properties file:
./runGenBankLoader.sh -p genbank.properties
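Because command line values take precedence over the properties file, the two can be combined: connection settings stay in the properties file while the input file is given per run. The sketch below illustrates this for several XML files. load_each and the LOADER variable are hypothetical, and whether successive runs against the same database behave as intended is an assumption to verify for your setup.

```shell
#!/bin/sh
# Sketch only: run the loader once per XML file, taking shared settings from
# a properties file and overriding the input file with -f on the command line.
# load_each and LOADER are hypothetical names used for illustration.
LOADER=${LOADER:-./runGenBankLoader.sh}

load_each() {
    props=$1
    shift
    for xml in "$@"; do
        # -f on the command line overrides any file= entry in the properties file.
        "$LOADER" -p "$props" -f "$xml" || return 1
    done
}
```

For example: load_each genbank.properties gbbct1.xml gbbct2.xml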
Expected output
The GenBank loader uses the Apache log4j logging system. Output from the
program is written to a file called GenBankLoader.log.
The log file is overwritten each time the loader is run. The
default threshold for the log file is set to "INFO".
It is possible to override the default log4j configuration by copying
the configuration file GenBank-log4j-config.xml from the etc/ directory to
the dist/ directory and editing the threshold values for the console
appender or file appender. For example, to print debugging information to
standard out, change the console appender threshold from
<param name="Threshold" value="OFF"/>
to
<param name="Threshold" value="DEBUG"/>
Troubleshooting
The GenBank loader often requires a very large amount of memory. If
the loader reports "java.lang.OutOfMemoryError" before the parser report,
the load has failed. Alternatively, the loader may fail to run at all,
reporting an error such as "unable to allocate object heap". In either
case, it is possible to adjust the minimum and maximum heap sizes used by
the loader. To do this, edit the file runGenBankLoader.sh and look for the
line which executes java and sets the -ms and -mx parameters:
${JAVA_HOME}/bin/java -ms100M -mx1500M ... (rest omitted)
The default is a minimum heap size of 100M and a maximum heap size of
1500M. If the loader runs out of memory, try increasing the maximum heap
size; if that does not help, the machine may not have enough memory for
the load. If the loader does not run because it cannot allocate the object
heap, gradually decrease the maximum heap size until the error goes away.
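When trying several heap sizes in a row, the edit to runGenBankLoader.sh can be scripted. This is a sketch that assumes the script contains a -mx option written as in the line shown above (digits followed by a K, M, or G suffix); set_max_heap is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: change the -mx (maximum heap) value in a loader script.
# set_max_heap is a hypothetical helper, not part of BioWarehouse.
set_max_heap() {
    script=$1
    size=$2      # for example 1200M
    # Keep a backup, then write the result back over the original file so
    # that its permissions (including the executable bit) are preserved.
    cp -p "$script" "$script.bak" || return 1
    sed "s/-mx[0-9][0-9]*[KkMmGg]/-mx$size/" "$script.bak" > "$script"
}
```

For example, set_max_heap runGenBankLoader.sh 1200M lowers the maximum heap from the default 1500M and leaves the previous version in runGenBankLoader.sh.bak.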
Developer Information
For more detailed information on the implementation of the GenBank
loader, please see the GenBank Developer Manual.