GenBank Loader Developer Manual
Version 4.6
(C) 2004 SRI International.
All Rights Reserved. See BioWarehouse
Overview for license details.
Overview
The GenBank loader loads the NCBI GenBank database of nucleotide
sequences into the BioWarehouse (also referred to as the
"Warehouse"). The input files to the GenBank loader are a set of
large XML files. The XML schema is quite complex, containing over
8,000 element definitions. The large size and complexity of the XML
make for a complex loader. The information in this document is
meant to give an overview of how the loader proceeds, but does not
document the specific details in most cases.
The GenBank loader proceeds in an iterative cycle of three phases:
parsing, processing, and storing. In the parsing phase, a
fragment of the input XML is parsed for specific pieces of
information. In the processing phase, the data mined from the
parsing phase is mapped or translated into a set of in-memory objects
patterned after the BioWarehouse schema.
In the storing phase, the data is stored into the Warehouse. The
reason for separating these activities is that each phase cannot be
executed until the previous one finishes. In the processing
phase, we do not know how to translate the data from the XML until the
entire fragment has been examined. How we handle one piece of
data often depends on the presence or absence of another piece of
data. Similarly, we cannot store the data into the Warehouse
until we finish the processing phase, as the data for one table insert
may not be determined until all the data has been processed.
In addition, the GenBank XML schema exhibits a superstructure. The
input files are too large to be loaded into memory at one time.
Hence the parser maintains a notion of state and location within the
superstructure.
Finally, there are some data relationships that cannot be determined
until the parser has completed parsing all input files. The final
phase of the GenBank loader is to search for and handle these
relationships.
The Parsing Phase
The parser portion of the GenBank loader is a combination of a SAX and
DOM parser called, appropriately, SAXDOMIX. This open source XML
parsing library is produced by Devsphere. The
SAXDOMIX parser proceeds in the low-level event-based SAX mode until it
encounters a tag that we are interested in. At that point, the
parser consumes all SAX events and builds a DOM for the element we
requested. The DOM is then processed and stored into the
Warehouse. The parser reverts to SAX mode until the next element
of interest is detected.
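The switch between SAX mode and DOM mode described above can be sketched with the standard JAXP classes. This is only an illustration of the technique, not the SAXDOMIX API: the class name HybridSaxDomSketch and its scan method are made up for this example, and the fragments are collected in a list for inspection, whereas the loader processes and stores each DOM as soon as it is complete.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: plain JAXP, not the SAXDOMIX API.
class HybridSaxDomSketch extends DefaultHandler {
    private final String target;     // name of the element of interest
    private final Document factory;  // used only to create DOM nodes
    private Node current;            // null while in plain SAX mode
    final List<Element> fragments = new ArrayList<>();

    HybridSaxDomSketch(String target, Document factory) {
        this.target = target;
        this.factory = factory;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && !qName.equals(target)) {
            return; // SAX mode: not an element of interest
        }
        // DOM mode: turn the event into a node of the fragment being built.
        Element element = factory.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (current != null) {
            current.appendChild(element);
        }
        current = element;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(factory.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // SAX mode
        }
        Node parent = current.getParentNode();
        if (parent == null) {
            // The fragment root just closed: hand it over and revert to SAX mode.
            fragments.add((Element) current);
        }
        current = parent;
    }

    // Scans an XML string and returns one DOM fragment per target element.
    static List<Element> scan(String xml, String target) {
        try {
            Document factory = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            HybridSaxDomSketch handler = new HybridSaxDomSketch(target, factory);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.fragments;
        } catch (Exception e) {
            throw new IllegalStateException("scan failed", e);
        }
    }
}
```

Calling HybridSaxDomSketch.scan(xml, "Bioseq") yields one small DOM per Bioseq while the rest of the document is seen only as SAX events.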
The main element of interest in the XML schema is the
<Bioseq>. Each Bioseq represents XXX. Often, several
Bioseqs are grouped together into a set of Bioseqs. In this case,
there may be data present in the set that applies to all Bioseqs within
the set. However, there is no guarantee of the order in which
this information is encountered. Therefore, whenever the parser
detects the start of a set, it begins tracking the Bioseqs in the
set. When the set is closed (when the ending set tag is
encountered), any data collected as part of the set (called
"metadata" for convenience) is applied identically to each Bioseq
within the set. In addition, the GenBank schema allows
sets-within-sets. Metadata detected at any level of a set applies
to the Bioseqs in the set and to the Bioseqs in all sets contained by
the current set. Once a set has been closed, and the metadata
applied to all Bioseqs in its set, it is no longer necessary to
maintain the notion of the current set. If the current set is a
child of a parent set, it passes all its Bioseqs up to the parent set
to handle. If the current set is the "root" set, then the Bioseqs
are discarded after the data has been stored in the Warehouse. In
all cases, a Bioseq is parsed, processed, and stored as soon as it is
encountered, whereas set metadata is applied only after the set is
closed. Because the Bioseq has already been stored in the Warehouse
database by then, the metadata is generally applied as updates to
existing table entries rather than as new table entries (in most
cases).
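The set bookkeeping described above amounts to a stack of open sets. The following is a minimal sketch under that assumption; SetTrackerSketch, Entry, and the method names are hypothetical and do not come from the loader source, and metadata is reduced to plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of nested-set tracking; not the loader's own classes.
class SetTrackerSketch {
    // Stand-in for one Bioseq that has already been stored.
    static class Entry {
        final String id;
        final List<String> appliedMetadata = new ArrayList<>();
        Entry(String id) { this.id = id; }
    }

    private static class OpenSet {
        final List<Entry> members = new ArrayList<>();
        final List<String> metadata = new ArrayList<>();
    }

    private final Deque<OpenSet> openSets = new ArrayDeque<>();

    // Called when the start tag of a set is seen.
    void startSet() {
        openSets.push(new OpenSet());
    }

    // Called for each Bioseq; the innermost open set starts tracking it.
    Entry addBioseq(String id) {
        Entry entry = new Entry(id);
        if (!openSets.isEmpty()) {
            openSets.peek().members.add(entry);
        }
        return entry;
    }

    // Called for each piece of set-level metadata, in whatever order it appears.
    void addMetadata(String item) {
        if (openSets.isEmpty()) {
            throw new IllegalStateException("metadata outside of any set");
        }
        openSets.peek().metadata.add(item);
    }

    // Called when the end tag of a set is seen.
    void endSet() {
        OpenSet closed = openSets.pop();
        for (Entry entry : closed.members) {
            entry.appliedMetadata.addAll(closed.metadata); // an update in the real loader
        }
        if (!openSets.isEmpty()) {
            // A child set passes its Bioseqs up so the parent's metadata reaches them too.
            openSets.peek().members.addAll(closed.members);
        }
        // For the root set the members are simply dropped here.
    }
}
```

Because members are passed upward when a set closes, a Bioseq in an inner set receives the inner metadata first and the outer metadata when the enclosing set closes.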
The elements handled by the SAX parser are used to maintain our notion
of location within the superstructure. These elements include:
Bioseq-set
Bioseq-set_seq-set
The elements handled by the DOM parser include both the Bioseqs and the
"metadata" portion of the set structure. These elements include:
Bioseq
Bioseq-set_annot
Bioseq-set_descr
The following table lists all elements that we parse below a
<Bioseq> element. The format of each line in this table is:
Tag:Attribute/{Class}
Each tag that we parse for is listed. If we specifically check a
given attribute on a tag, it is listed as :Attribute. If the tag is
parsed by a particular class, the class name is listed between curly
braces. The hierarchy indicates two things:
- Tags listed at an indented level below another tag are searched
for as grandchildren of the tag only after the grandparent tag has been
found. However, tags at the same level of indentation are not
necessarily at the same level in the XML document.
- All tags listed below a tag with a {Class} are parsed by that
class.
Table 1: XML elements parsed under <Bioseq>
Bioseq/{Bioseq}
Bioseq_id/{BioseqId}
Seq-id/{SeqId}
Seq-id_genbank, Seq-id_embl, Seq-id_pir, Seq-id_swissprot,
Seq-id_other, Seq-id_ddbj
Textseq-id/{TextseqId}
Textseq-id_name
Textseq-id_accession
Textseq-id_release
Textseq-id_version
Seq-id_gi
Seq-id_patent
Patent-seq-id/{PatentSeqId}
Patent-seq-id_seqid
Patent-seq-id_cit
Id-pat_country
Id-pat_id
Id-pat_id_number
Id-pat_id_app-number
Id-pat_doc-type
Seq-id_pdb
PDB-seq-id
PDB-seq-id_mol
Bioseq_inst/{BioseqInst}
Seq-inst/{SeqInst}
Seq-inst_repr:value
Seq-inst_mol:value
Seq-inst_length
Seq-inst_fuzz
Int-fuzz
Int-fuzz_lim:value
Seq-inst_topology:value
Seq-inst_strand:value
Seq-inst_seq-data
Seq-data_iupacna
IUPACna
Seq-inst_ext
Seq-ext_seg
Seq-loc_whole
Seq-id/{SeqId}
** This branch is the same as {SeqId} above **
Bioseq_annot/{BioseqAnnot}
Seq-feat/{SeqFeat}
Seq-feat_location/{SeqLoc}
Seq-loc_empty
Seq-id/{SeqId}
** This branch is the same as {SeqId} above **
Seq-loc_whole
Seq-id/{SeqId}
** This branch is the same as {SeqId} above **
Seq-loc_int
Seq-interval_from
Seq-interval_to
Na-strand:value
Seq-interval_id
Seq-id/{SeqId}
** This branch is the same as {SeqId} above **
Seq-interval_fuzz-from
Int-fuzz_lim:value
Seq-interval_fuzz-to
Int-fuzz_lim:value
Seq-feat_data
SeqFeatData_gene
Gene-ref_locus
Gene-ref_locus-tag
Gene-ref_pseudo:value
Gene-ref_syn_E
** All direct children of this element are processed, but their names
are not checked **
SeqFeatData_cdregion
Cdregion_frame:value
Genetic-code_E_id_def
Genetic-code_E_id
SeqFeatData_prot/{SeqFeatDataProt}
Prot-ref_name_E
Prot-ref_desc
Prot-ref_ec_E
Prot-ref_activity_E
SeqFeatData_rna
RNA-ref_pseudo
RNA-ref_type:value
Trna-ext_aa_ncbieaa
SeqFeatData_imp
Imp-feat_key
Seq-feat_comment
Seq-feat_product
Seq-feat_location/{SeqLoc}
** This branch is the same as {SeqLoc} above **
Seq-feat_partial:value
Seq-feat_xref
SeqFeatData_prot/{SeqFeatDataProt}
** This branch is the same as {SeqFeatDataProt} above **
Bioseq_descr/{BioseqDescr}
Dbtag/{Dbtag}
Object-id_id
BinomialOrgName_genus
BinomialOrgName_species
OrgMod
OrgMod_subtype
OrgMod_subtype:value
OrgName
OrgName_gcode
OrgName_div
SubSource
SubSource_subtype
SubSource_subtype:value
SubSource_name
Implementation Detail
The classes related to setting up the SAXDOMIX parser are located in
the package com.sri.biospice.warehouse.genbank.parser. Classes that
represent parsed elements of XML are located in the
com.sri.biospice.warehouse.genbank.xmlhandler package. In the
xmlhandler package, each constructor takes in a DOM tree rooted at the
element that class represents. If necessary, the class creates child
objects for any of the sub-branches of the tree that are relevant. In
all cases, the "parsing" portion of the GenBank loader occurs inside
the constructors of the xmlhandler classes. The classes that track the
information in a single Bioseq and the sets of Bioseqs are located in
the com.sri.biospice.warehouse.genbank package.
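The constructor convention described above might look like the following sketch. The classes SeqIdSketch, TextseqIdSketch, and DomSketchUtil are invented for this example and are not the real xmlhandler classes; the element names are taken from Table 1, and the lookup by descendant tag name is an assumption about how a handler might search its subtree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical handler for a <Textseq-id> subtree; all parsing happens in the constructor.
class TextseqIdSketch {
    final String accession;
    final String version;

    TextseqIdSketch(Element textseqId) {
        accession = DomSketchUtil.firstText(textseqId, "Textseq-id_accession");
        version = DomSketchUtil.firstText(textseqId, "Textseq-id_version");
    }
}

// Hypothetical handler for a <Seq-id> subtree; it creates a child handler for a relevant sub-branch.
class SeqIdSketch {
    final TextseqIdSketch textseqId; // null when the branch is absent
    final String gi;                 // null when <Seq-id_gi> is absent

    SeqIdSketch(Element seqId) {
        NodeList branch = seqId.getElementsByTagName("Textseq-id");
        textseqId = branch.getLength() == 0 ? null : new TextseqIdSketch((Element) branch.item(0));
        gi = DomSketchUtil.firstText(seqId, "Seq-id_gi");
    }
}

// Small helpers used by the sketch above.
class DomSketchUtil {
    // Returns the text of the first descendant with the given tag name, or null.
    static String firstText(Element root, String tag) {
        NodeList hits = root.getElementsByTagName(tag);
        return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
    }

    // Parses an XML string and returns its root element (for trying the sketch out).
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Keeping the parsing in constructors means an object of such a class is fully populated as soon as it exists, which suits a separate processing step that runs later.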
The Processing Phase
The processing phase of the loader implements the logic that maps the
data from a Bioseq to the Warehouse schema tables. The mapping
for this loader is quite complex and is documented by several Microsoft
Excel spreadsheets. This section of the documentation is
intended to give an overview of the mapping relationships.
Table 2: Database tables populated by the GenBank Loader
- Dataset
- BioSource, BioSubtype, Feature, Function, Gene, NucleicAcid,
Protein, Subsequence
- CommentTable, CrossReference, DBID, Entry, SynonymTable
- BioSourceWIDBioSubtypeWID, BioSourceWIDGeneWID,
GeneWIDNucleicAcidWID, GeneWIDProteinWID, ProteinWIDFunctionWID
- Our current understanding of a Bioseq is that it represents either a
gene, a nucleic acid, or a protein. We know that a Bioseq
represents a protein if the molecule type (defined by Seq-inst_mol) is
"aa". However, we currently have no way of distinguishing between
a Bioseq that represents a gene and one that represents a nucleic
acid. Currently, we create a Protein for each Bioseq of type "aa"
and a NucleicAcid for every other Bioseq.
- We are currently only storing bacterial Bioseqs.
However, we cannot always detect that a Bioseq is non-bacterial when we
first encounter it. (That information may reside elsewhere in the
set superstructure.) A Bioseq is therefore assumed to be bacterial
until proven otherwise. As soon as we detect that the Bioseq
represents non-bacterial data, we delete all data from the Warehouse
associated with that Bioseq.
- When a DOM for a Bioseq is obtained, we parse four of its immediate
children: Bioseq_id, Bioseq_inst, Bioseq_annot, and Bioseq_descr.
During the "parsing" phase, we extract relevant information from each
of these elements and its children. This data is not mapped to
the Warehouse schema until all four elements have been parsed, because
the mapping or "processing" of the data from one element may depend
upon what data was found in another subtree of the four elements.
- Parsing and processing the Bioseq_id element:
  - All immediate Seq-id children of the Bioseq_id element are parsed.
  - The data in the Seq-id is parsed to obtain various XIDs of this
    Bioseq. If the dataset name of the XID is "GenBank", then the XID
    information is stored in the DBID table; otherwise it is stored in
    the CrossReference table. The parent object table entry (the
    object referred to by OtherWID) is either a NucleicAcid or a
    Protein, depending on the molecule type of this Bioseq (determined
    under Bioseq_inst).
- Parsing and processing the Bioseq_inst element:
  - The data in the child Seq-inst element generally contains
    sequence-related data (length, topology, strandedness, etc.).
  - The contents of the "value" attribute of Seq-inst_mol determine
    the molecule type of this Bioseq. If the value is "aa", the
    molecule type is Protein; otherwise it is NucleicAcid. (The
    molecule type can be "aa", "na", "dna", "rna", "not-set" or
    "other".)
  - In addition, there may be more Seq-id elements. Unlike those
    under Bioseq_id, these Seq-id elements are parsed to obtain the
    constituent GIs of the Bioseq, which are stored as a comment
    associated with the NucleicAcid for the Bioseq.
  - The data parsed in this branch are only mapped to the Warehouse
    if the molecule type of this Bioseq is NucleicAcid.
- Parsing and processing the Bioseq_annot element:
  - All Seq-feat child elements are parsed to obtain feature
    information.
- Parsing and processing the Bioseq_descr element:
  - This element is parsed to obtain BioSource and BioSubtype
    information.
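Two of the mapping rules in the list above are simple enough to state as code. The sketch below is only a restatement of those rules; MappingSketch and its method names are hypothetical, and the real loader works with Warehouse table objects rather than strings and enums.

```java
// Hypothetical restatement of two mapping rules; not the loader's own code.
class MappingSketch {
    enum ObjectTable { NUCLEIC_ACID, PROTEIN }

    // Molecule type rule: the "value" attribute of Seq-inst_mol decides the object table.
    // Only "aa" yields a Protein; "na", "dna", "rna", "not-set" and "other" all
    // yield a NucleicAcid.
    static ObjectTable objectTableFor(String molValue) {
        return "aa".equals(molValue) ? ObjectTable.PROTEIN : ObjectTable.NUCLEIC_ACID;
    }

    // XID rule: identifiers from the "GenBank" dataset go to DBID,
    // identifiers from any other dataset go to CrossReference.
    static String xidTableFor(String datasetName) {
        return "GenBank".equals(datasetName) ? "DBID" : "CrossReference";
    }
}
```

For example, objectTableFor("dna") gives NUCLEIC_ACID, and xidTableFor("EMBL") gives "CrossReference" (the dataset name "EMBL" is used here purely as an example input).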
Implementation Detail
Each class that represents an XML element (in the xmlhandler
package) implements the ProcessedElement interface. That is, each
class has a process(BioseqEntry) method that is invoked to perform the
processing phase for the element represented by the class. The
BioseqEntry keeps track of the Warehouse table entries relevant to one
Bioseq. The BioseqEntry may represent a Bioseq currently being
parsed or a Bioseq already stored in the database; the
ProcessedElement is not aware of which kind of Bioseq the entry object
represents.
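The contract described above could take roughly the following shape. The names carry a Sketch suffix because the real signatures of ProcessedElement and BioseqEntry are not shown in this manual; the fields, the constructor arguments, and the string rows are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for BioseqEntry: collects the table entries that belong to one Bioseq.
class BioseqEntrySketch {
    final List<String> pendingRows = new ArrayList<>();
}

// Stand-in for the ProcessedElement interface.
interface ProcessedElementSketch {
    void process(BioseqEntrySketch entry);
}

// Stand-in for an xmlhandler class. A real handler would receive a DOM subtree
// in its constructor; plain values are passed here to keep the sketch short.
class SeqInstSketch implements ProcessedElementSketch {
    private final String molValue;
    private final int length;

    SeqInstSketch(String molValue, int length) {
        this.molValue = molValue;
        this.length = length;
    }

    @Override
    public void process(BioseqEntrySketch entry) {
        if ("aa".equals(molValue)) {
            return; // Bioseq_inst data is only mapped for NucleicAcid entries
        }
        // The handler does not know whether the entry is new or already stored.
        entry.pendingRows.add("NucleicAcid length=" + length);
    }
}
```

Separating the constructor (parsing) from process (mapping) lets every element of a Bioseq be parsed before any mapping decision is made.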
The Storage Phase
The storage phase of the GenBank loader occurs after all data for a
Bioseq (or a Bioseq set) has been parsed and processed. All
Warehouse schema table entries are stored into the database and
committed. The GenBank loader uses the Warehouse Common
Java components to handle all database connectivity.
Known Limitations
As of version 3.0, this loader has been tested on the BCT and CON
divisions. However, running the loader on the CON division takes
approximately 80 hours.
As of version 3.0, this loader has not been tested on the following
GenBank subdivisions (Bug 287):
- High-throughput genomic sequences: HTGS division
- Genome survey sequences: GSS division
- Patented sequences: PAT division
In addition, the following division MAY contain bacterial
sequences as well:
- High throughput cDNA sequencing: HTC division
As of version 3.0, point sequence features are ignored (Bug 525).
We ignore any <Seq-feat> that contains a <Seq-loc_pnt> as its
location.
How-to Information
About the GenBank code base
The GenBank loader is written in Java. It uses the Common
Warehouse Java components located in the common/
subdirectory of the warehouse distribution. These common
components are re-usable Java classes that know how to interact with
databases and with the tables of the Warehouse schema. Several
Warehouse loaders currently use these components. For more
information, please see the documentation on
the common Java components.
How to generate the JavaDoc API documentation
From the genbank-loader directory, execute
ant javadoc
The auto-generated API documentation is located at docs/api/index.html.
Please see the documentation
for the common Java components for information on the location of the
common JavaDoc.
How to run the loader using Ant
Developers may find it more convenient to run the GenBank loader from
the Ant script. The run Ant target is provided for this use. To use
this target, create a file called developer.properties in the
genbank-loader/ directory. Include the same properties necessary to
run the shell script. In addition, define the run.jvm.memory
property, which is passed as the maximum heap size of the JVM.
How to use the utilities for examining the input files
The com.sri.warehouse.genbank.util package contains some utility
programs to facilitate quick examination of an XML file using the
SAXDOMIX approach. To use this code, create a class that inherits
from DOMScanner and implement the abstract methods. The only input to
the program is the name of a file. Use the Ant run-dom-scanner target
to run your program. (Be sure to insert the name of your class into
the java task of the run-dom-scanner target.)
JDBC Issues
The data handled by the GenBank parser often exceeds the limits placed
on JDBC data sizes. The sequence data (DNA, AA sequence, etc.) can be
thousands of characters long. In particular:
- We cannot create Strings that represent SQL insert statements with
length longer than ~4,000 characters. Hence we use PreparedStatement
to do the inserts and updates into tables.
- The setString() method of PreparedStatement also has a length limit
of 32,766 characters. In addition, Oracle says that using the
setString() method for large binary data may result in corruption of
the data when using the thin client:
  - "There is a limitation regarding the use of stream input for LOB
    types. Stream input for LOB types can only be used for 8.1.7 or
    later JDBC OCI driver connecting to an 8.1.7 or later Oracle
    server. The use of stream input for LOB types in all other
    configurations may result in data corruption. PreparedStatement
    stream input APIs include: setBinaryStream(), setAsciiStream(),
    setUnicodeStream(), setCharacterStream() and setObject()."
- To insert CLOBs, we use the setCharacterStream() method of
PreparedStatement.
  - The Oracle thin driver has an error using this method. The
    Oracle OCI driver does not.
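Binding a long sequence through setCharacterStream() might look like the sketch below. ClobInsertSketch, bindSequence, and the SQL text with its table and column names are placeholders invented for this example; only PreparedStatement.setCharacterStream(int, Reader, int) is the standard JDBC call.

```java
import java.io.StringReader;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper; not part of the loader or of the common components.
class ClobInsertSketch {
    // Placeholder statement text: table and column names are made up.
    static final String INSERT_SQL =
            "INSERT INTO ExampleSequence (WID, SequenceData) VALUES (?, ?)";

    // Binds a long character value as a stream instead of with setString().
    static void bindSequence(PreparedStatement statement, int index, String sequence) {
        try {
            statement.setCharacterStream(index, new StringReader(sequence), sequence.length());
        } catch (SQLException e) {
            throw new IllegalStateException("could not bind sequence data", e);
        }
    }
}
```

A caller would prepare INSERT_SQL on its connection, bind the other parameters as usual, call bindSequence for the sequence column, and then execute the statement.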
How to configure the log4j logging system
The GenBank parser uses an XML configuration file
(GenBank-log4j-config.xml) to configure the logging utility. The
configuration file is included in the jar file when the program is
compiled. However, the program is written to check first for a file
of this name in the current working directory. If this file exists,
the loader uses it. Otherwise the loader uses the config file bundled
inside the jar file.
To override the default log4j settings, simply copy the
GenBank-log4j-config.xml file from the etc/ directory and place it in
the current working directory (the top-level directory if running the
program with Ant or the dist/ directory if running the program using
the shell script).
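The lookup order described above (working directory first, then the copy bundled in the jar) can be sketched with the standard library alone. LogConfigLocatorSketch and locate are hypothetical names, and the step that hands the resulting URL to log4j is left out; how the loader actually performs the lookup is not shown in this manual.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration of the lookup order; not the loader's own code.
class LogConfigLocatorSketch {
    // Returns the override file in workingDir if it exists, otherwise the
    // classpath (jar) resource of the same name, or null if neither is found.
    static URL locate(File workingDir, String fileName) {
        File override = new File(workingDir, fileName);
        if (override.isFile()) {
            try {
                return override.toURI().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalStateException("bad override path: " + override, e);
            }
        }
        return LogConfigLocatorSketch.class.getClassLoader().getResource(fileName);
    }
}
```

A call such as locate(new File(System.getProperty("user.dir")), "GenBank-log4j-config.xml") would return the local override when present and the bundled file otherwise.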
The threshold level at which log4j is set has a significant impact on
the loader performance. Although the log4j logging utility is
known for its speed, there are so many debugging statements in the
program, and it runs for such a long time, that turning the threshold
to DEBUG can double or triple the execution time of the program.
Setting the logging threshold to DEBUG causes approximately a gigabyte
of log output to be generated for every gigabyte of input data! For
these reasons, many debugging statements are commented out of the
code, but left in place to be easily re-included if necessary.
Other Documentation
The GenBank XSD schema is NCBI_Seqset_new.xsd.
Developers working on the GenBank loader implementation should also be
familiar with the Warehouse Common
Java components.