UniProt (Swiss-Prot/TrEMBL) Loader for the BioWarehouse

Version 4.6

(C) 2006 SRI International.
All Rights Reserved. See BioWarehouse Overview for license details.

Introduction
Obtaining the Input Files
Building the Loader
Running the Loader
Schema Mappings
Developer Information

Introduction

The UniProt loader loads either the Swiss-Prot Protein Knowledgebase or TrEMBL (in the UniProt XML format) into the BioWarehouse. Information about UniProt can be found at http://www.ebi.uniprot.org/index.shtml. More information about Swiss-Prot and TrEMBL can be found at http://www.expasy.org/sprot/.

Obtaining the Input Files

The latest supported data version for the UniProt loader is listed in the loader summary table. The input file to the UniProt loader is an XML file following the UniProt XML format. These files may be downloaded from http://www.ebi.uniprot.org/database/download.shtml.

For Swiss-Prot, select the "UniProt/Swiss-Prot" database in the "XML" format. The downloaded file is named: uniprot_sprot.xml.gz.
For TrEMBL, select the "UniProt/TrEMBL" database in the "XML" format. The downloaded file is named: uniprot_trembl.xml.gz.

This version of the UniProt loader supports UniProt version 15.2. Later versions of UniProt can be loaded if the UniProt schema has not changed since version 15.2. Otherwise, please contact support@biowarehouse.org and request a patch to the UniProt loader.

Building the Loader

Before building the loader, make sure the environment is configured according to the Environment Setup. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the uniprot-loader directory. Then:

osprompt: ant clean
osprompt: ant build

For a list of all project targets, execute:

osprompt: ant -projecthelp

The expected build output is here.

Running the Loader

Dependencies: The UniProt loader is dependent on the NCBI Taxonomy and Enzyme loaders.

The UniProt loader is run from the uniprot-loader/dist directory.

usage: runUniprotLoader.sh
 -q,--quit-after <num entries>   Quit after parsing X number of entries.
                                 (For testing purposes only.)
 -l,--load-all                   Load all data, including data found to be
                                 suspect
 -d,--dbms <dbms>                DBMS type (mysql or oracle)
 -f,--file <file>                Name of input data file(s)
 -h,--help                       Print usage instructions
 -n,--name <name>                Name or SID of database
 -p,--properties <file>          Name of properties file
 -r,--release <release date>     Release date of the input dataset
 -s,--host <host>                Name or IP address of database server host
 -t,--port <port>                Port database server is listening at
 -u,--username <username>        Username for connection to the database
 -v,--version <version number>   Version number of the input dataset
 -w,--password <password>        Password for connection to the database

Properties may be set on the command line or in the properties file. Values on the command line take precedence over those in a properties file. Properties in a properties file are specified as name-value pairs. For example: port=1234

A template properties file can be found in the dist directory (uniprot.properties).
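The precedence rule above can be illustrated with a minimal Python sketch. This is not the loader's own code (the loader is written in Java); the function names `parse_properties` and `effective_settings` are hypothetical and exist only for this illustration, and the assumption that `#` starts a comment line follows the usual properties-file convention rather than anything stated here.

```python
def parse_properties(text):
    """Parse name=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        props[name.strip()] = value.strip()
    return props


def effective_settings(file_props, cli_props):
    """Merge both sources; a value given on the command line wins."""
    merged = dict(file_props)
    # Options that were not supplied on the command line are None here
    # and must not overwrite values from the properties file.
    merged.update({k: v for k, v in cli_props.items() if v is not None})
    return merged


file_props = parse_properties("dbms=oracle\nport=1521\nhost=localhost\n")
settings = effective_settings(file_props, {"port": "1234", "host": None})
print(settings)
```

Here `port` comes from the command line while `dbms` and `host` fall back to the properties file.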

Example of specifying parameters on the command line:

./runUniprotLoader.sh -d oracle -f input.xml -n biospice -s chive.ai.sri.com -t 1234 -u myusername -w mypassword -v 123 -r "November 24, 2004"

Example: Running the loader using a properties file:

Edit uniprot.properties to have the required values:

dbms=oracle
file=input.xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=November 24, 2004

Then run the script by passing in the name of the properties file:

./runUniprotLoader.sh -p uniprot.properties

A log file is generated during the run. The log file is located at uniprot-loader/dist/UniprotLoader.log.

If one input file is specified, the DataSet.Name is taken from the 'dataset' attribute of the '<entry>' element (e.g. "SwissProt" or "TrEMBL"). If multiple input files are specified, DataSet.Name will be "UniProt".
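The DataSet.Name rule can be sketched as follows. This is an illustrative Python analogue, not the loader's Java implementation; the helper name `dataset_name` and the inline sample document are assumptions made for the example, and the sample omits the XML namespace that real UniProt files declare (the sketch strips any namespace from tag names so it works either way).

```python
import io
import xml.etree.ElementTree as ET


def dataset_name(sources):
    """Return the DataSet.Name for a list of input files (paths or file objects)."""
    if not sources:
        raise ValueError("at least one input file is required")
    if len(sources) > 1:
        return "UniProt"
    # Single file: read the 'dataset' attribute of the first <entry>.
    # Attributes are already available on the "start" event, so the
    # rest of the file does not need to be read.
    for _event, elem in ET.iterparse(sources[0], events=("start",)):
        if elem.tag.rsplit("}", 1)[-1] == "entry":
            return elem.get("dataset")
    return None


sample = b'<uniprot><entry dataset="TrEMBL" created="2004-01-01"/></uniprot>'
print(dataset_name([io.BytesIO(sample)]))
print(dataset_name([io.BytesIO(sample), io.BytesIO(sample)]))
```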

Schema Mappings

Key
* = zero or more
+ = one or more
Att: x, y, z = A list of the attributes for that element

The following table describes the mapping between the UniProt XSD schema and the Warehouse schema. The "UniProt XML Schema" column describes only the elements and attributes parsed by the UniProt loader; it is not a complete description of the UniProt XSD schema.

UniProt XML Schema	Mapping to Warehouse Schema
<uniprot>	[1] One entry is made in the DataSet table for the entire input file.
<entry>+ Att: dataset, created, modified	[2] Each entry is parsed, processed, and stored before parsing the next entry. [3] The created and modified attributes are applied to each `Entry` as `Entry.CreationDate` and `Entry.ModificationDate`, respectively. For each entry, the following rows are added to the Warehouse database (as described below): Entry (one for each row in an Object Table); Protein (one row only); BioSource and BioSourceWIDProteinWID (one or more rows); Gene and GeneWIDProteinWID (zero or more rows); Feature (zero or more rows); DBID (two or more rows); SynonymTable (zero or more rows); Term (zero or more rows); Citation and CitationWIDOtherWID (one or more rows).
<accession>+	[4] The contents of <accession> are stored in `DBID.XID` with the `DBID.OtherWID` = the WID of the Protein for this <entry> and `DBID.Type` = "Accession".
<name>+	[5] The contents of <name> are stored in SynonymTable.Syn with the `SynonymTable.OtherWID` = the WID of the Protein for this <entry>.
<protein> Att: type, evidence, ref <name>+ <domain>* <component>* <name>+	Note: We currently do not handle the "evidence" attribute of <protein>. [6] If the "type" attribute contains "fragment" or "fragments", then set `Protein.Fragment` to "T". Otherwise set `Protein.Fragment` to "F". [8] <recommendedName>/<fullName> is stored as `Protein.Name`. All subsequent name elements are stored as synonyms in `SynonymTable`. Exception: if the <name> starts with "EC ", then we strip off this prefix and handle the name as an EC number as described in <dbReference>, below. [31] For each <domain>: if the <name> begins with "EC ", then we strip off this prefix and handle the name as an EC number as described in <dbReference>, below. All other names will be stored in the `SynonymTable` with `SynonymTable.OtherWID` = the `WID` of the `Protein` for this <entry>. For each <component>: [11] For all <name> elements, the names listed here should be appended together and prefixed with the following to construct a comment: "This protein can be cleaved to produce the following functional components: NAME1, NAME2, ...". Each name will be stored in the `SynonymTable` with `SynonymTable.OtherWID` = the `WID` of the `Protein` for this <entry>.
<gene>* <name>+ Att: type	[12] For each <gene>, create an entry in `Gene`. If there is a <name> with type="primary", then that name is stored in `Gene.Name` and all other names are stored as synonyms. If there is no <name> with type="primary", then the first <name> encountered is stored as `Gene.Name` and all other names are stored as synonyms. `Gene.Type` = "polypeptide". An entry is added to `GeneWIDProteinWID` for each `Gene`.
<organism>+ <name>+ Att: type (common, full, scientific, synonym, or abbreviation) <dbReference>+ <lineage>*	[27] For each <organism> and each distinct <name>, create an entry in `BioSource` with `BioSource.Name` = <name>. If the "type" attribute of <dbReference> is "NCBI Taxonomy", then set `BioSource.TaxonWID` equal to the WID obtained by the following query: "select OtherWID from DBID where XID=[the "id" attribute of <dbReference>]". [29] For each entry in `BioSource`, add an entry in `BioSourceWIDProteinWID` with `ProteinWID` = the WID of the `Protein` for this UniProt <entry>. For each entry in `BioSource` and each entry in `Gene`, add an entry in `BioSourceWIDGeneWID`.
<reference>+ <citation> <dbReference> <sptrCitationGroup> <source>* <species>* <strain>* <plasmid>* <transposon>* <tissue>*	[26] Map <citation> to the `Citation` table: `Citation.Citation` = list of all available citation attributes, editors, and authors; `Citation.PMID` = "id" attribute of <dbReference> if its "type" attribute is "PubMed"; `CitationWIDOtherWID.OtherWID` = `Protein.WID`; `CitationWIDOtherWID.CitationWID` = `Citation.WID` from above. [28] For each <source> in each <reference>, create a new entry in `BioSource` if the contents of any of the tags (strain, species, etc.) differ from any source tag previously encountered. `BioSource.Name` = contents of <species>, if present. `BioSource.Strain` = contents of <strain>/<name>, if present. Note: If a strain is listed without a species, use the scientific name of the organism for `BioSource.Name`. If there is more than one organism, log a warning. `BioSource.Tissue` = contents of <tissue>, if present. If a species is listed without a strain, we populate `BioSource.TaxonWID` using the following query: "`select WID from Taxon where Rank='species' and Name='species name'`". See also requirement [29] under <organism>, above.
<comment>	[13] The contents of each <comment> are stored in `CommentTable.Comm`. Each comment is prefixed with its "type" attribute (first letter capitalized), followed by a colon and then the contents of <comment>. For each protein function comment (the "type" attribute is "function" in the XML file), store "protein function" in `Support.Type` and whether the function was experimentally determined in `Support.EvidenceType`. Non-experimental qualifiers appear in the "status" attribute and are stored as "computational" in `Support.EvidenceType`: <comment type="function" status="By similarity">, <comment type="function" status="Potential">, <comment type="function" status="Probable">. Comments without a "status" attribute are assumed to be experimental and are stored as "experimental" in `Support.EvidenceType`.
<proteinExistence>	The type attribute of each <proteinExistence> is stored in `Support.EvidenceType` with "protein existence" as the value of `Support.Type`
<dbReference> Att: type, id, evidence, key <property>	[30] If the "type" attribute is "EC", then the "id" attribute contains an EC number. Add the EC number to the `SynonymTable` with `SynonymTable.OtherWID` = the `WID` of the `Protein` for this <entry>. Add an entry to the `EnzymaticReaction` table with `EnzymaticReaction.ProteinWID` = `WID` of the `Protein` for this <entry> and `EnzymaticReaction.ReactionWID` = the `WID` of the `Reaction` whose `Reaction.ECNumber` equals the EC number found in the UniProt file. [25] Note: in all cases of encountering EC numbers, do not add the EC number if it has already been encountered. [14] Otherwise, for each <dbReference> an entry is made in `CrossReference`: `CrossReference.XID` = "id" attribute; `CrossReference.Version` = null; `CrossReference.DatabaseName` = "type" attribute; `CrossReference.Type` = "Accession". If the "type" attribute is EMBL, DDBJ, or GenBank, then `CrossReference.OtherWID` = `Gene.WID`; otherwise `CrossReference.OtherWID` = `Protein.WID`.
<keyword>*	[15] The contents of <keyword> are stored as `Term.Name`.
<feature>* Att: type, status, id, description, evidence, ref <variation> <location> <begin> Att: position, status <end> Att: position, status <position> Att: position, status	[16] For each <feature>, create an entry in the `Feature` table: `Feature.Type` = the "type" attribute of <feature>; `Feature.Class` = the "status" attribute of <feature>; `Feature.SubsequenceWID` = `Protein.WID`; `Feature.Description` = "description" attribute; `Feature.SequenceType` = "P" for protein. [32] If the "ref" attribute is present, then its value is also the value of the "key" attribute of the corresponding <reference> element. When this is encountered during parsing, add a row to `CitationWIDOtherWID` that links the Feature WID with the Citation WID that was already processed. A feature location has either a <begin>, <end> pair or a <position>. [20] If <begin> and <end> are present, the position attributes of these tags are stored as `Feature.StartPosition` and `Feature.EndPosition`, respectively. Set `Feature.StartPositionApproximate` and `Feature.EndPositionApproximate` depending on the status attribute. [21] <position> is used when the length of the feature is one. The contents of <position> map to both `Feature.StartPosition` and `Feature.EndPosition`. [33] For each <variation>, a new entry is created in the `Feature` table, with `Feature.Variant` set to the value contained by <variation>.
<sequence> Att: length, mass, checksum, updated	[22] The length attribute is stored as `Protein.Length`. [23] The mass attribute is stored as `Protein.MolecularWeightCalc`. The value in the input file is in Daltons and is converted to kDa before being inserted into the database. [24] The contents of <sequence> are stored as `Protein.AASequence`.
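
Several of the per-value rules in the table ([6], [12], [13], [23]) are simple enough to sketch as small functions. The following Python is an illustrative reading of those rules only, not the loader's Java code; all function names are hypothetical, and the single space after the colon in `format_comment` is an assumption, since the table does not specify the separator whitespace.

```python
def fragment_flag(type_attr):
    """[6] 'T' if the <protein> type attribute mentions a fragment, else 'F'."""
    return "T" if "fragment" in (type_attr or "") else "F"


def pick_gene_name(names):
    """[12] names: list of (text, type) pairs in document order.

    Returns (gene_name, synonyms): the primary name if one exists,
    otherwise the first name encountered; the rest become synonyms.
    """
    primary = next((text for text, kind in names if kind == "primary"), None)
    chosen = primary if primary is not None else names[0][0]
    synonyms = [text for text, _kind in names if text != chosen]
    return chosen, synonyms


def format_comment(type_attr, text):
    """[13] Prefix the comment with its capitalized type and a colon."""
    return type_attr[:1].upper() + type_attr[1:] + ": " + text


def evidence_type(status):
    """[13] A status qualifier means the function was not shown experimentally."""
    return "computational" if status else "experimental"


def daltons_to_kda(mass_daltons):
    """[23] Convert the <sequence> mass attribute from Daltons to kDa."""
    return mass_daltons / 1000.0


print(pick_gene_name([("abcA", "synonym"), ("xyzB", "primary")]))
print(format_comment("function", "Binds DNA."), evidence_type("By similarity"))
```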

(Requirements count = [33]. Increment this number when adding requirements so that others don't have to search through the text to see what the requirements count is up to.)
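
As a further illustration of requirements [20] and [21], the feature location mapping might look like the Python sketch below. It is not taken from the loader. The function name `feature_location` is hypothetical, the sample elements omit the UniProt XML namespace, and the rule used for the approximate flags (any status other than "certain" marks the position as approximate) is an assumption, because the table only says the flags depend on the status attribute.

```python
import xml.etree.ElementTree as ET


def _position(elem):
    """Read the 'position' attribute as an int, or None if it is absent."""
    value = elem.get("position")
    return int(value) if value is not None else None


def _approximate(elem):
    # Assumption: a missing status, or status="certain", means exact.
    status = elem.get("status")
    return status is not None and status != "certain"


def feature_location(feature):
    """Map a <feature> element's <location> to Feature table columns."""
    location = feature.find("location")
    begin, end = location.find("begin"), location.find("end")
    single = location.find("position")
    if begin is not None and end is not None:
        start_el, end_el = begin, end          # [20] begin/end pair
    elif single is not None:
        start_el = end_el = single             # [21] feature of length one
    else:
        raise ValueError("location has neither begin/end nor position")
    return {
        "StartPosition": _position(start_el),
        "EndPosition": _position(end_el),
        "StartPositionApproximate": _approximate(start_el),
        "EndPositionApproximate": _approximate(end_el),
    }


ranged = ET.fromstring(
    '<feature type="chain"><location><begin position="1"/>'
    '<end position="120" status="uncertain"/></location></feature>'
)
single_site = ET.fromstring(
    '<feature type="site"><location><position position="42"/></location></feature>'
)
print(feature_location(ranged))
print(feature_location(single_site))
```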

Loader Design

The UniProt loader handles each <entry> in three phases: parsing, processing, and storing. During the parsing phase, the input XML data is parsed into an object model. The processing phase translates the XML object model into the Warehouse schema model. (In effect, the processing phase performs the schema mapping described above.) During the storing phase, the data is stored into the Warehouse database.

Parsing

The parsing of the input XML data into a Java object model is done automatically by an XML-to-Java binding mechanism called XMLBeans. Prior to compilation of the Java code, the XMLBeans compiler examines the UniProt schema and generates a set of classes corresponding to the elements in the schema. These schema classes are then used by the UniProt loader. The UniProt loader uses the SAXDOMIX XML parser to parse a single <entry> element into a DOM. It then uses the XMLBeans schema classes to parse the DOM for the <entry> into an EntryType object. The class that performs the parsing phase is com.sri.biospice.warehouse.uniprot.parser.UniprotParser.
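
The entry-at-a-time approach keeps memory use bounded even for very large input files. The loader itself does this in Java with SAXDOMIX and XMLBeans; the Python sketch below only illustrates the same idea with the standard library's `iterparse`. The generator name `iter_entries` is hypothetical, its `quit_after` parameter mimics the loader's --quit-after option, and the inline sample omits the UniProt XML namespace.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source, quit_after=None):
    """Yield one fully parsed <entry> element at a time from a UniProt XML file."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag.rsplit("}", 1)[-1] == "entry":
            yield elem
            count += 1
            # Drop the finished entry so memory does not grow with file size.
            # The caller must be done with the element before asking for the next.
            root.clear()
            if quit_after is not None and count >= quit_after:
                return


sample = (
    b"<uniprot>"
    b"<entry><accession>P00001</accession></entry>"
    b"<entry><accession>P00002</accession></entry>"
    b"<entry><accession>P00003</accession></entry>"
    b"</uniprot>"
)
accessions = [e.findtext("accession") for e in iter_entries(io.BytesIO(sample), quit_after=2)]
print(accessions)
```

Because each element is cleared once the consumer moves on, any values needed later (as in the list comprehension above) have to be extracted while the entry is current.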

Processing

In this phase, the loader uses the accessor methods of the schema classes and maps the data to the Warehouse schema classes. com.sri.biospice.warehouse.uniprot.UniprotEntry performs the processing of a single <entry>.

Storing

In this phase, the loader uses the Warehouse schema classes in the Warehouse Common Java repository to store the data into an instance of the Warehouse database.

XMLBeans

It is not necessary to download XMLBeans to do development on the UniProt loader unless it is necessary to update the loader to a newer version of the UniProt schema. The schema Java classes are generated using the XMLBeans schema compiler (scomp), which examines the UniProt schema XSD document and generates a jar file containing all the UniProt schema classes. The generated jar file is located at lib/uniprot-schema-1.1.jar.

While XMLBeans provides an Ant task to run the schema compiler, the Ant task does not currently support all the options that the command-line version of the scomp compiler does. In particular, the "no UPA" option is not supported. The current version of the UniProt schema (1.1) violates the "Unique Particle Attribution" (UPA) rule for XML schemas in the definition of the comment type. Therefore it is necessary to compile the schema using the "no UPA" rule. The command line syntax used to generate the schema jar was:

[XMLBeans home]/bin/scomp.cmd -noupa uniprot.xsd

The Javadoc API for the generated classes is located here. This documentation was generated by using the XMLBeans schema compiler to create Java source code files for the UniProt schema classes, and then using the standard Javadoc tool to create the API documentation. However, these source files aren't needed to compile or run the loader.

Running the loader from Ant

A developer may find it more convenient to run the program using Ant instead of the shell script. The build.xml file loads the necessary properties from a developer.properties file located in the base directory. This file should contain all the same properties as used by the loader, plus the run.jvm.memory property used by the run script. The loader may then be run using the command ant run.