GenBank Loader Developer Manual

Version 4.6

(C) 2004 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Overview
The Parsing Phase
The Processing Phase
The Storage Phase
Known Limitations
How-to Information
Other Documentation


Overview

The GenBank loader loads the NCBI GenBank database of nucleotide sequences into the BioWarehouse (also referred to as the "Warehouse").  The input files to the GenBank loader are a set of large XML files.  The XML schema is quite complex, containing over 8,000 element definitions.  The large size and complexity of the XML make for a complex loader.  This document is meant to give an overview of how the loader proceeds; in most cases it does not document the specific details.

The GenBank loader proceeds in an iterative cycle of three phases: parsing, processing, and storing.  In the parsing phase, a fragment of the input XML is parsed for specific pieces of information.  In the processing phase, the data mined during parsing is mapped or translated into a set of in-memory objects patterned after the BioWarehouse schema.  In the storing phase, the data is stored into the Warehouse.  These activities are separated because no phase can begin until the previous one finishes.  In the processing phase, we do not know how to translate the data from the XML until the entire fragment has been examined, because how we handle one piece of data often depends on the presence or absence of another piece of data.  Similarly, we cannot store the data into the Warehouse until the processing phase finishes, as the data for one table insert may not be determined until all the data has been processed.
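The cycle described above can be sketched roughly as follows.  This is a minimal illustration only: the class, its nested types, and the rule that maps an accession to a DBID row are all invented for the example and are not taken from the loader's code base.  The only point it tries to show is that processing works on a fully parsed fragment, and storing works on fully processed rows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parse -> process -> store cycle.
// All names here are assumptions made for illustration.
final class ThreePhaseCycle {

    /** Output of the parsing phase: raw values mined from one XML fragment. */
    static final class ParsedFragment {
        final String accession; // may be null when the fragment has none
        final String sequence;
        ParsedFragment(String accession, String sequence) {
            this.accession = accession;
            this.sequence = sequence;
        }
    }

    /** Output of the processing phase: an in-memory stand-in for one table row. */
    static final class Row {
        final String table;
        final String value;
        Row(String table, String value) {
            this.table = table;
            this.value = value;
        }
    }

    /** Processing phase: needs the whole fragment, because one value's handling
     *  can depend on the presence or absence of another. */
    static List<Row> process(ParsedFragment fragment) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("NucleicAcid", fragment.sequence));
        if (fragment.accession != null) {
            rows.add(new Row("DBID", fragment.accession)); // invented mapping rule
        }
        return rows;
    }

    /** Storing phase: a list stands in for the Warehouse database. */
    static void store(List<Row> rows, List<Row> warehouse) {
        warehouse.addAll(rows);
    }

    /** Runs process and store once per already-parsed fragment. */
    static List<Row> load(List<ParsedFragment> fragments) {
        List<Row> warehouse = new ArrayList<>();
        for (ParsedFragment fragment : fragments) {
            store(process(fragment), warehouse);
        }
        return warehouse;
    }

    public static void main(String[] args) {
        List<ParsedFragment> parsed = new ArrayList<>();
        parsed.add(new ParsedFragment("ACC1", "ACGT"));
        for (Row row : load(parsed)) {
            System.out.println(row.table + " <- " + row.value);
        }
    }
}
```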

In addition, the GenBank XML schema exhibits a superstructure.  The input files are too large to be loaded into memory at one time.  Hence the parser maintains a notion of state and location within the superstructure.

Finally, there are some data relationships that cannot be determined until the parser has completed parsing all input files.  The final phase of the GenBank loader is to search for and handle these relationships.

The Parsing Phase

The parser portion of the GenBank loader is a combination of a SAX and DOM parser called, appropriately, SAXDOMIX.  This open source XML parsing library is produced by Devsphere.  The SAXDOMIX parser proceeds in the low-level event-based SAX mode until it encounters a tag that we are interested in.  At that point, the parser consumes all SAX events and builds a DOM for the element we requested.  The DOM is then processed and stored into the Warehouse.  The parser reverts to SAX mode until the next element of interest is detected.
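The mixed SAX and DOM approach can be illustrated with the XML APIs that ship with the JDK.  The sketch below does not use the SAXDOMIX API and is not the loader's code; the class name and the extract method are assumptions.  It skips SAX events until it sees the requested element, builds a small DOM fragment for that element, and then returns to skipping.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: JDK SAX events are turned into DOM fragments
// only for the element of interest. Not the SAXDOMIX API.
final class SaxDomSketch extends DefaultHandler {
    private final String target;                             // element to build DOMs for
    private final Document doc = newDocument();              // factory for DOM nodes
    private final Deque<Element> open = new ArrayDeque<>();  // elements currently being built
    private final List<Element> fragments = new ArrayList<>();

    SaxDomSketch(String target) {
        this.target = target;
    }

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (open.isEmpty() && !qName.equals(target)) {
            return; // SAX mode: not inside an element of interest
        }
        Element element = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            element.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (!open.isEmpty()) {
            open.peek().appendChild(element);
        }
        open.push(element);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            return; // SAX mode
        }
        Element finished = open.pop();
        if (open.isEmpty()) {
            fragments.add(finished); // DOM fragment complete; back to SAX mode
        }
    }

    /** Returns one DOM fragment per occurrence of the target element. */
    static List<Element> extract(String xml, String target) {
        SaxDomSketch handler = new SaxDomSketch(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.fragments;
    }
}
```

Only one fragment at a time needs to be held by a caller that handles each fragment as soon as it is complete; this sketch collects them in a list purely to keep the example short.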

The main element of interest in the XML schema is the <Bioseq>.  Each Bioseq represents XXX.  Often, several Bioseqs are grouped together into a set of Bioseqs.  In this case, there may be data present in the set that applies to all Bioseqs within the set.  However, there is no guarantee of the order in which this information is encountered.  Therefore, whenever the parser detects the start of a set, it begins tracking the Bioseqs in the set.  When the set is closed (when the ending set tag is encountered), any data collected as part of the set (called "metadata" for convenience) is applied identically to each Bioseq within the set.

In addition, the GenBank schema allows sets within sets.  Metadata detected at any level of a set applies to the Bioseqs in that set and to the Bioseqs in all sets it contains.  Once a set has been closed and its metadata applied to all of its Bioseqs, it is no longer necessary to track that set.  If the closed set is a child of a parent set, it passes all its Bioseqs up to the parent set to handle.  If the closed set is the "root" set, its Bioseqs are discarded once the data has been stored in the Warehouse.

In all cases, a Bioseq is parsed, processed, and stored as soon as it is encountered, and the collected metadata is applied only when the enclosing set closes.  Because the Bioseq has already been stored in the Warehouse database by then, the metadata is in most cases applied as updates to existing table entries rather than as new table entries.
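The set-tracking behavior can be sketched with a stack of open sets.  This is an illustration under assumed names (SetTracker, BioseqRecord, and the four event methods are hypothetical), with metadata reduced to plain strings; the loader's own classes are not shown in this document.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of tracking nested Bioseq sets with a stack.
final class SetTracker {

    /** Stand-in for a Bioseq that has already been stored. */
    static final class BioseqRecord {
        final String id;
        final List<String> appliedMetadata = new ArrayList<>();
        BioseqRecord(String id) {
            this.id = id;
        }
    }

    /** A set whose end tag has not been seen yet. */
    private static final class OpenSet {
        final List<BioseqRecord> bioseqs = new ArrayList<>();
        final List<String> metadata = new ArrayList<>();
    }

    private final Deque<OpenSet> stack = new ArrayDeque<>();

    /** Start tag of a set: begin tracking a new innermost set. */
    void startSet() {
        stack.push(new OpenSet());
    }

    /** A Bioseq was encountered (and, in the loader, already stored). */
    void bioseq(BioseqRecord record) {
        if (!stack.isEmpty()) {
            stack.peek().bioseqs.add(record);
        }
    }

    /** Metadata may arrive before or after the Bioseqs it applies to. */
    void metadata(String value) {
        if (!stack.isEmpty()) {
            stack.peek().metadata.add(value);
        }
    }

    /** End tag of a set: apply its metadata, then hand its Bioseqs to the parent. */
    void endSet() {
        OpenSet closed = stack.pop();
        for (BioseqRecord record : closed.bioseqs) {
            record.appliedMetadata.addAll(closed.metadata); // an update in the real loader
        }
        OpenSet parent = stack.peek(); // null when the closed set was the root
        if (parent != null) {
            parent.bioseqs.addAll(closed.bioseqs);
        }
        // For the root set, the references are simply dropped here.
    }
}
```

Because a child set passes its Bioseqs up when it closes, metadata collected on an outer set reaches the Bioseqs of every set nested inside it, regardless of where in the set the metadata appeared.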

The elements handled by the SAX parser are used to maintain our notion of location within the superstructure.  These elements include:

Bioseq-set
Bioseq-set_seq-set


The elements handled by the DOM parser include both the Bioseqs and the "metadata" portion of the set structure.  These elements include:

Bioseq
Bioseq-set_annot
Bioseq-set_descr

The following table lists all elements that we parse below a <Bioseq> element.  The format of each line in this table is:

Tag:Attribute/{Class}

Each tag that we parse for is listed.  If we specifically check a given attribute on a tag, it is listed as :Attribute.  If the tag is parsed by a particular class, the class name is listed between curly braces.

The hierarchy indicates two things:
  1. Tags listed at an indented level below another tag are searched for as grandchildren of the tag only after the grandparent tag has been found.  However, tags at the same level of indentation are not necessarily at the same level in the XML document.
  2. All tags listed below a tag with a {Class} are parsed by that class.

Table 1:  XML elements parsed under <Bioseq>
Bioseq/{Bioseq}
    Bioseq_id/{BioseqId}
        Seq-id/{SeqId}
            Seq-id_genbank, Seq-id_embl, Seq-id_pir, Seq-id_swissprot, Seq-id_other, Seq-id_ddbj
                Textseq-id/{TextseqId}
                    Textseq-id_name
                    Textseq-id_accession
                    Textseq-id_release
                    Textseq-id_version
            Seq-id_gi
            Seq-id_patent
                Patent-seq_id/{PatentSeqId}
                    Patent-seq-id_seqid
                    Patent-seq-id_cit
                        Id-pat_country
                        Id-pat_id
                            id-pat_id_number
                            id-pat_id_app-number
                        Id-pat_doc-type
            Seq-id_pdb
                PDB-seq-id
                PDB-seq-id_mol
    Bioseq_inst/{BioseqInst}
        Seq-inst/{SeqInst}
            Seq-inst_repr:value
            Seq-inst_mol:value
            Seq-inst_length
            Seq-inst_fuzz
                Int-fuzz
                    Int-fuzz_lim:value
            Seq-inst_topology:value
            Seq-inst_strand:value
            Seq-inst_seq-data
                Seq-data_iupacna
                    IUPACna
            Seq-inst_ext
                Seq-ext_seg
                    Seq-loc_whole
                        Seq-id/{SeqId}
                            ** This branch is the same as {SeqId} above **
    Bioseq_annot/{BioseqAnnot}
        Seq-feat/{SeqFeat}
            Seq-feat_location/{SeqLoc}
                Seq-loc_empty
                    Seq-id/{SeqId}
                        ** This branch is the same as {SeqId} above **
                Seq-loc_whole
                    Seq-id/{SeqId}
                        ** This branch is the same as {SeqId} above **
                Seq-loc_int
                    Seq-interval_from
                    Seq-interval_to
                    Na-strand:value
                    Seq-interval_id
                        Seq-id/{SeqId}
                            ** This branch is the same as {SeqId} above **
                    Seq-interval_fuzz-from
                        Int-fuzz_lim:value
                    Seq-interval_fuzz-to
                        Int-fuzz_lim:value
            Seq-feat_data
                SeqFeatData_gene
                    Gene-ref_locus
                    Gene-ref_locus-tag
                    Gene-ref_pseudo:value
                    Gene-ref_syn_E
                        ** All direct children of this element are processed, but their names are not checked **
                SeqFeatData_cdregion
                    Cdregion_frame:value
                    Genetic-code_E_id_def
                    Genetic-code_E_id
                SeqFeatData_prot/{SeqFeatDataProt}
                    Prot-ref_name_E
                    Prot-ref_desc
                    Prot-ref_ec_E
                    Prot-ref_activity_E
                SeqFeatData_rna
                    RNA-ref_pseudo
                    RNA-ref_type:value
                    Trna-ext_aa_ncbieaa
                SeqFeatData_imp
                    Imp-feat_key
            Seq-feat_comment
            Seq-feat_product
                Seq-feat_location/{SeqLoc}
                    ** This branch is the same as {SeqLoc} above **
            Seq-feat_partial:value
            Seq-feat_xref
                SeqFeatData_prot/{SeqFeatDataProt}
                    ** This branch is the same as {SeqFeatDataProt} above **
    Bioseq_descr/{BioseqDescr}
        Dbtag/{Dbtag}
            Object-id_id
        BinomialOrgName_genus
        BinomialOrgName_species
        OrgMod
            OrgMod_subtype
            OrgMod_subtype:value
        OrgName
            OrgName_gcode
            OrgName_div
        SubSource
            SubSource_subtype
            SubSource_subtype:value
            SubSource_name


Implementation Detail

The classes related to setting up the SAXDOMIX parser are located in the package com.sri.biospice.warehouse.genbank.parser.  Classes that represent parsed elements of XML are located in the com.sri.biospice.warehouse.genbank.xmlhandler package.  In the xmlhandler package, each constructor takes in a DOM tree rooted at the element that class represents.  If necessary, the class creates child classes for any of the sub-branches of the tree that are relevant.  In all cases, the entire "parsing" portion of the GenBank loader occurs inside the constructors of the xmlhandler classes.  The classes that track the information in a single Bioseq and the sets of Bioseqs are located in the com.sri.biospice.warehouse.genbank package.
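As a rough illustration of the constructor-does-the-parsing convention, the sketch below reads two of the entries listed in Table 1 under Seq-inst: the value attribute of Seq-inst_mol and the text of Seq-inst_length.  The class SeqInstSketch and its helper methods are hypothetical; the real SeqInst class may be organized differently.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of an xmlhandler-style class: all parsing happens
// in the constructor, which receives a DOM tree rooted at <Seq-inst>.
final class SeqInstSketch {
    final String mol;     // value attribute of <Seq-inst_mol>, or null if absent
    final Integer length; // text of <Seq-inst_length>, or null if absent

    SeqInstSketch(Element seqInst) {
        Element molElement = firstDescendant(seqInst, "Seq-inst_mol");
        this.mol = (molElement != null && molElement.hasAttribute("value"))
                ? molElement.getAttribute("value")
                : null;

        Element lengthElement = firstDescendant(seqInst, "Seq-inst_length");
        this.length = (lengthElement == null)
                ? null
                : Integer.valueOf(lengthElement.getTextContent().trim());
    }

    /** Returns the first descendant with the given tag name, or null. */
    private static Element firstDescendant(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        return hits.getLength() == 0 ? null : (Element) hits.item(0);
    }

    /** Test helper (an assumption): parses a small XML string into its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(
                "<Seq-inst><Seq-inst_mol value=\"dna\"/>"
                + "<Seq-inst_length>1200</Seq-inst_length></Seq-inst>");
        SeqInstSketch parsed = new SeqInstSketch(root);
        System.out.println(parsed.mol + " " + parsed.length);
    }
}
```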

The Processing Phase

The processing phase of the loader implements the logic that maps the data from a Bioseq to the Warehouse schema tables.  The mapping for this loader is quite complex and is documented by several Microsoft Excel spreadsheets.   This section of the documentation is intended to give an overview of the mapping relationships.

Table 2:  Database tables populated by the GenBank Loader
Dataset
BioSource
BioSubtype
Feature
Function
Gene
NucleicAcid
Protein
Subsequence
CommentTable
CrossReference
DBID
Entry
SynonymTable
BioSourceWIDBioSubtypeWID
BioSourceWIDGeneWID
GeneWIDNucleicAcidWID
GeneWIDProteinWID
ProteinWIDFunctionWID



Implementation Detail

Each class that represents an XML element (in the xmlhandler package) implements the ProcessedElement interface.  That is, each class has a process(BioseqEntry) method that is invoked to perform the processing phase for the element represented by the class.  The BioseqEntry keeps track of the Warehouse table entries relevant to one Bioseq.  The BioseqEntry may represent a Bioseq currently being parsed or a Bioseq already stored in the database; the ProcessedElement is not aware of which kind of Bioseq the entry object represents.
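The shape of this contract might look like the sketch below.  Only the names ProcessedElement, BioseqEntry, and process(BioseqEntry) come from this document; the void return type, the pendingRows field, the GeneElement class, and the string row format are assumptions made so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ProcessedElement contract.
final class ProcessingSketch {

    /** Stand-in for BioseqEntry: collects table entries for one Bioseq. */
    static final class BioseqEntry {
        final List<String> pendingRows = new ArrayList<>();
    }

    /** Return type assumed to be void; the real signature may differ. */
    interface ProcessedElement {
        void process(BioseqEntry entry);
    }

    /** Invented element class: the value is collected at construction time
     *  (parsing phase) and mapped to a row in process (processing phase). */
    static final class GeneElement implements ProcessedElement {
        private final String locus; // may be null if the tag was absent

        GeneElement(String locus) {
            this.locus = locus;
        }

        @Override
        public void process(BioseqEntry entry) {
            // The element does not know whether the entry is new or already stored.
            if (locus != null) {
                entry.pendingRows.add("Gene:" + locus);
            }
        }
    }

    /** Runs the processing phase for every parsed element of one Bioseq. */
    static BioseqEntry processAll(List<ProcessedElement> elements) {
        BioseqEntry entry = new BioseqEntry();
        for (ProcessedElement element : elements) {
            element.process(entry);
        }
        return entry;
    }
}
```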

The Storage Phase

The storage phase of the GenBank loader occurs after all data for a Bioseq (or a Bioseq set) has been parsed and processed.  All Warehouse schema table entries are stored into the database and committed.  The GenBank loader uses the Warehouse Common Java components to handle all database connectivity.


Known Limitations

As of version 3.0, this loader has been tested on the BCT and CON divisions.  However, running the loader on the CON division takes approximately 80 hours.

As of version 3.0, this loader has not been tested on the following GenBank subdivisions (Bug 287):


In addition, the following division MAY contain bacterial sequences as well:

As of version 3.0, point sequence features are ignored (Bug 525).  We ignore any <Seq-feat> that contains a <Seq-loc_pnt> as its location.


How-to Information

About the GenBank code base

The GenBank loader is written in Java.  It uses the Common Warehouse Java components located in the common/ subdirectory of the warehouse distribution.  These common components are re-usable Java classes that know how to interact with databases and with the tables of the Warehouse schema.  Several Warehouse loaders currently use these components.  For more information, please see the documentation on the common Java components.

How to generate the JavaDoc API documentation

From the genbank-loader directory, execute

ant javadoc

The auto-generated API documentation is located at docs/api/index.html.

Please see the documentation for the common Java components for information on the location of the common JavaDoc.

How to run the loader using Ant

Developers may find it more convenient to run the GenBank loader from the Ant script.  The run Ant target is provided for this use.  To use this target, create a file called developer.properties in the genbank-loader/ directory.  Include the same properties necessary to run the shell script.  In addition, define the run.jvm.memory property, which is passed as the maximum heap size of the JVM.

How to use the utilities for examining the input files

The com.sri.warehouse.genbank.util package contains some utility programs to facilitate quick examination of an XML file using the SAXDOMIX approach.  To use this code, create a class that inherits from DOMScanner and implement the abstract methods.  The only input to the program is the name of a file.  Use the Ant run-dom-scanner target to run your program.  (Be sure to insert the name of your class into the java task of the run-dom-scanner target.)

JDBC Issues

The GenBank parser exceeds most limits placed on JDBC data sizes in many cases.  For example, the sequence data (DNA, AA sequence, etc.) can be thousands of characters long.  In particular:

How to configure the log4j logging system

The GenBank parser uses an XML configuration file (GenBank-log4j-config.xml) to configure the logging utility.  The configuration file is included in the jar file when the program is compiled.  However, the program is written to check first for a file of this name in the current working directory.  If this file exists, the loader uses it.  Otherwise the loader uses the config file bundled inside the jar file. 
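The lookup order described above (working directory first, then the copy bundled in the jar) could be sketched as follows.  The class and method names are hypothetical, and this is not the loader's actual code; only the file name GenBank-log4j-config.xml comes from this document.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of the two-step configuration lookup.
final class LogConfigLocator {
    static final String CONFIG_NAME = "GenBank-log4j-config.xml";

    /**
     * Returns the URL of the configuration to use: a file in workingDir if one
     * exists, otherwise the classpath resource, otherwise null.
     */
    static URL locate(File workingDir, ClassLoader loader) {
        File local = new File(workingDir, CONFIG_NAME);
        if (local.isFile()) {
            try {
                return local.toURI().toURL(); // override found in the working directory
            } catch (MalformedURLException e) {
                throw new IllegalStateException(e);
            }
        }
        return loader.getResource(CONFIG_NAME); // bundled copy, or null if none
    }

    public static void main(String[] args) {
        URL url = locate(new File(System.getProperty("user.dir")),
                LogConfigLocator.class.getClassLoader());
        System.out.println(url == null ? "no configuration found" : "using " + url);
    }
}
```

With log4j 1.x, a URL found this way can be handed to org.apache.log4j.xml.DOMConfigurator.configure(URL); whether the loader uses that exact call is not stated here.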

To override the default log4j settings, simply copy the GenBank-log4j-config.xml file from the etc/ directory and place it in the current working directory (the top-level directory if running the program with Ant or the dist/ directory if running the program using the shell script).

The threshold level at which log4j is set has a significant impact on the loader performance.  Although the log4j logging utility is known for its speed, there are so many debugging statements in the program, and it runs for such a long time, that turning the threshold to DEBUG can double or triple the execution time of the program.  Setting the logging threshold to DEBUG causes approximately a gigabyte of log file to be generated for every gigabyte of input data!  For these reasons, many debugging statements are commented out of the code, but left in place to be easily re-included if necessary.


Other Documentation


The GenBank XSD schema is NCBI_Seqset_new.xsd.

Developers working on the GenBank loader implementation should also be familiar with the Warehouse Common Java components.