UniProt (Swiss-Prot/TrEMBL) Loader for the BioWarehouse

Version 4.6

(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.



Introduction

The UniProt loader loads either the Swiss-Prot Protein Knowledgebase or TrEMBL (in the UniProt XML format) into the BioWarehouse.  Information about UniProt can be found at http://www.ebi.uniprot.org/index.shtml.  More information about Swiss-Prot and TrEMBL can be found at http://www.expasy.org/sprot/.


Obtaining the Input Files

The latest supported data version for the UniProt loader is listed in the loader summary table.  The input file to the UniProt loader is an XML file following the UniProt XML format.  These files may be downloaded from http://www.ebi.uniprot.org/database/download.shtml

For Swiss-Prot, select the "UniProt/Swiss-Prot" database in the "XML" format.  The downloaded file is named: uniprot_sprot.xml.gz.
For TrEMBL, select the "UniProt/TrEMBL" database in the "XML" format.  The downloaded file is named: uniprot_trembl.xml.gz.

This version of the UniProt loader supports UniProt version 15.2.  Later versions of UniProt can be loaded if the UniProt schema has not changed since version 15.2.  Otherwise, please contact support@biowarehouse.org and request a patch to the UniProt loader.


Building the Loader

Before building the loader, make sure the environment is configured according to the Environment Setup. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the uniprot-loader directory. Then:

    osprompt: ant clean
    osprompt: ant build

For a list of all project targets, execute:

    osprompt: ant -projecthelp

The expected build output is here.




Running the Loader

Dependencies: The UniProt loader is dependent on the NCBI Taxonomy and Enzyme loaders.

The UniProt loader is run from the uniprot-loader/dist directory.

usage: runUniprotLoader.sh
 -q,--quit-after <num entries>   Quit after parsing X number of entries.
                                 (For testing purposes only.)
 -l,--load-all                   Load all data, including data found to be
                                 suspect
 -d,--dbms <dbms>                DBMS type (mysql or oracle)
 -f,--file <file>                Name of input data file(s)
 -h,--help                       Print usage instructions
 -n,--name <name>                Name or SID of database
 -p,--properties <file>          Name of properties file
 -r,--release <release date>     Release date of the input dataset
 -s,--host <host>                Name or IP address of database server
                                 host
 -t,--port <port>                Port database server is listening at
 -u,--username <username>        Username for connection to the database
 -v,--version <version number>   Version number of the input dataset
 -w,--password <password>        Password for connection to the database

Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a property file are specified in name-value pairs. For
example: port=1234

A template properties file can be found in the dist directory (uniprot.properties).
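
The precedence rule above (command-line values win over properties-file values) can be sketched with java.util.Properties.  This is only an illustration, not the loader's actual code: the class name LoaderSettings and its methods are hypothetical, and the sketch assumes the command-line options have already been collected into a Properties object keyed by the same names as the properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch: command-line values override properties-file values. */
class LoaderSettings {

    /** Parses name=value pairs, e.g. "port=1234", one pair per line. */
    static Properties parse(String text) {
        Properties parsed = new Properties();
        try {
            parsed.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parsed;
    }

    /** Returns settings in which every key set on the command line wins. */
    static Properties merge(Properties fileValues, Properties commandLine) {
        // Properties(defaults) falls back to fileValues only when a key
        // is absent from the merged table itself.
        Properties merged = new Properties(fileValues);
        for (String key : commandLine.stringPropertyNames()) {
            merged.setProperty(key, commandLine.getProperty(key));
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties file = parse("host=localhost\nport=1521\n");
        Properties cli = new Properties();
        cli.setProperty("port", "1234"); // as if "-t 1234" had been passed

        Properties merged = merge(file, cli);
        System.out.println(merged.getProperty("host")); // from the file
        System.out.println(merged.getProperty("port")); // from the command line
    }
}
```

Using the file values as Properties defaults keeps the two sources separate, so a value is taken from the file only when the command line did not supply it.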

Example of specifying parameters on the command line:

./runUniprotLoader.sh -d oracle -f input.xml -n biospice -s chive.ai.sri.com -t 1234 -u myusername -w mypassword -v 123 -r "November 24, 2004"

Example:  Running the loader using a properties file:

Edit uniprot.properties to have the required values:

dbms=oracle
file=input.xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=November 24, 2004

Then run the script by passing in the name of the properties file:

./runUniprotLoader.sh -p uniprot.properties

A log file is generated during the run.  The log file is located at uniprot-loader/dist/UniprotLoader.log.

If one input file is specified, the DataSet.Name is taken from the 'dataset' attribute of the '<entry>' element (e.g. "SwissProt" or "TrEMBL").  If multiple input files are specified, DataSet.Name will be "UniProt".
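
The naming rule above can be written as a small function.  This is a sketch only; the class and method names are hypothetical and do not come from the loader source.

```java
/** Hypothetical sketch of how DataSet.Name is chosen. */
class DataSetNaming {

    /**
     * @param inputFileCount        number of input files passed to the loader
     * @param entryDatasetAttribute value of the 'dataset' attribute of an entry element
     * @return the value to store in DataSet.Name
     */
    static String dataSetName(int inputFileCount, String entryDatasetAttribute) {
        if (inputFileCount > 1) {
            // Several files may mix Swiss-Prot and TrEMBL, so a common name is used.
            return "UniProt";
        }
        return entryDatasetAttribute;
    }

    public static void main(String[] args) {
        System.out.println(dataSetName(1, "TrEMBL"));
        System.out.println(dataSetName(2, "TrEMBL"));
    }
}
```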


Schema Mappings

Key
* = zero or more
+ = one or more
Att: x, y, z  = A list of the attributes for that element


The following table describes the mapping between the UniProt XSD schema and the Warehouse schema.  The "UniProt XML Schema" column describes only the elements and attributes parsed by the UniProt loader, and is not a complete description of the UniProt XSD schema.


UniProt XML Schema

Mapping to Warehouse Schema

  • <uniprot>
  • [1] One entry is made in the DataSet table for the entire input file.
    • <entry>+ Att: dataset, created, modified
  • [2] Each entry is parsed, processed, and stored before parsing the next entry.
  • [3] The created and modified attributes are applied to each Entry as Entry.CreationDate and Entry.ModificationDate, respectively.
  • For each entry, the following rows are added to the Warehouse database (as described below):
    • Entry (one for each row in an Object Table)
    • Protein (one row only)
    • BioSource and BioSourceWIDProteinWID (one or more rows)
    • Gene and GeneWIDProteinWID (zero or more rows)
    • Feature (zero or more rows)
    • DBID (two or more rows)
    • SynonymTable (zero or more rows)
    • Term (zero or more rows)
    • Citation and CitationWIDOtherWID (one or more rows)
      • <accession>+
  • [4] The contents of <accession> are stored in DBID.XID with the DBID.OtherWID = the WID of the Protein for this <entry> and DBID.Type = "Accession".
      • <name>+
  • [5] The contents of <name> are stored in SynonymTable.Syn with the SynonymTable.OtherWID = the WID of the Protein for this <entry>.
      • <protein> Att: type, evidence, ref
        • <name>+
        • <domain>*
        • <component>*
          • <name>+
  • Note:  We currently do not handle the "evidence" attribute of <protein>.
  • [6] If the "type" attribute contains "fragment" or "fragments", then set Protein.Fragment to "T".  Otherwise set Protein.Fragment to "F".
  • [8] <recommendedName>/<fullName> is stored as Protein.Name.  All subsequent name elements are stored as synonyms in SynonymTable.  Exception: if the <name> starts with "EC ", then we strip off this prefix and handle the name as an EC number as described in <dbReference>, below.
  • [31] For each <domain>, if the <name> begins with "EC " then we strip off this prefix and handle the name as an EC number as described in <dbReference>, below.  All other names are stored in the SynonymTable with SynonymTable.OtherWID = the WID of the Protein for this <entry>.
  • For each <component>
    • [11] For all <name> elements:
      • The names listed here should be appended together and prefixed with the following, to construct a comment: "This protein can be cleaved to produce the following functional components: NAME1, NAME2, ..."
      • Each name will be stored in the SynonymTable with SynonymTable.OtherWID = the WID of the Protein for this <entry>
      • <gene>*
        • <name>+ Att: type
  • [12] For each <gene>, create an entry in Gene.
    • If there is a <name> with type="primary", then that name is stored in Gene.Name.  All other names are stored as synonyms.
    • If there is no <name> with type="primary", then the first <name> encountered is stored as Gene.Name.  All other names are stored as synonyms.
    • Gene.Type = "polypeptide"
    • An entry is added to GeneWIDProteinWID for each Gene.
      • <organism>+
        • <name>+ Att: type (common, full, scientific, synonym, or abbreviation)
        • <dbReference>+
        • <lineage>*
  • [27] For each <organism>
    • For each distinct <name>
      • Create an entry in BioSource
      • BioSource.Name = <name>
      • If the "type" attribute of <dbReference> is "NCBI Taxonomy", then set BioSource.TaxonWID equal to the WID obtained by the following query: "select OtherWID from DBID where XID=[the "id" attribute of <dbReference>]"
  • [29] For each entry in BioSource, add an entry in BioSourceWIDProteinWID with ProteinWID = the WID of the Protein for this uniprot <entry>.  For each entry in BioSource and each entry in Gene, add an entry in BioSourceWIDGeneWID.
      • <reference>+
        • <citation>
          • <dbReference>
        • <sptrCitationGroup>
          • <source>*
            • <species>*
            • <strain>*
            • <plasmid>*
            • <transposon>*
            • <tissue>*
  • [26] Map <citation> to Citation table
    • Citation.Citation = list of all available citation attributes, editors, and authors.
    • Citation.PMID = id attribute of <dbReference> if "type" attribute="PubMed"
    • CitationWIDOtherWID.OtherWID = Protein.WID
    • CitationWIDOtherWID.CitationWID = Citation.WID from above
  • [28] For each <source> in each <reference>
    • Create a new entry in BioSource if the contents of any of the tags (strain, species, etc.) differ from those of any source previously encountered.
    • BioSource.Name = contents of <species>, if present
    • BioSource.Strain = contents of <strain>/<name>, if present
      • Note: If a strain is listed without a species, use the scientific name of the organism for BioSource.Name.  If there is more than one organism, log a warning.
    • BioSource.Tissue = contents of <tissue>, if present
    • If a species is listed without a strain, we populate BioSource.TaxonWID using the following query: "select WID from Taxon where Rank='species' and Name='species name'".
  • See also requirement [29] under <organism>, above.
      • <comment>
  • [13] The contents of each <comment> are stored in CommentTable.Comm.  Prefix each comment with the "type" attribute (with its first letter capitalized), followed by a colon and then the contents of <comment>.
  • For each protein function comment (the type attribute is "function" in the XML file), store "protein function" in Support.Type, and store whether the function was experimentally determined in Support.EvidenceType.
    A non-experimental qualifier is given in the "status" attribute; any such comment is stored as "computational" in Support.EvidenceType:
    <comment type="function" status="By similarity">
    <comment type="function" status="Potential">
    <comment type="function" status="Probable">
    Comments without a "status" attribute are taken to be experimental and are stored as "experimental" in Support.EvidenceType.
      • <proteinExistence>
  • The type attribute of each <proteinExistence> is stored in Support.EvidenceType, with "protein existence" as the value of Support.Type.
      • <dbReference> Att: type, id, evidence, key
        • <property>
  • [30] If the "type" attribute is "EC" then the "id" attribute contains an EC number.
    • Add the EC number to the SynonymTable with SynonymTable.OtherWID = the WID of the Protein for this <entry>
    • Add an entry to the EnzymaticReaction table with EnzymaticReaction.ProteinWID = WID of the Protein for this <entry> and EnzymaticReaction.ReactionWID = the WID of the Reaction whose Reaction.ECNumber equals the EC number found in the uniprot file.
    • * [25] Note: in all cases of encountering EC numbers, do not add the EC number if the EC number has already been encountered.
  • [14] Otherwise, for each <dbReference> an entry is made in CrossReference:
    • CrossReference.XID = "id" attribute
    • CrossReference.Version = null
    • CrossReference.DatabaseName = "type" attribute
    • CrossReference.Type = "Accession"
    • If "type" attribute is:
      •  EMBL or DDBJ or GenBank
        • CrossReference.OtherWID = Gene.WID
      • Otherwise
        • CrossReference.OtherWID = Protein.WID
      • <keyword>*
  • [15] The contents of <keyword> are stored as Term.Name.
      • <feature>* Att: type, status, id, description, evidence, ref
        • <variation>
        • <location>
          • <begin> Att: position, status
          • <end> Att: position, status
          • <position> Att: position, status
  • [16] For each <feature>, create an entry in the Feature table
    • Feature.Type = the "type" attribute of <feature>
    • Feature.Class = the "status" attribute of <feature>
    • Feature.SubsequenceWID = Protein.WID
    • Feature.Description = "description" attribute
    • Feature.SequenceType = "P" for protein
    • [32] If the "ref" attribute is present, then its value is also the value of the "key" attribute of the corresponding <reference> element.  When this is encountered during parsing, add a row to CitationWIDOtherWID that links the Feature WID with the WID of the Citation that was already processed.
    • A feature location either has a <begin>, <end> pair or a <position>.
      • [20] If <begin> and <end> are present, the position attributes of these tags are stored as Feature.StartPosition and Feature.EndPosition, respectively.  Set Feature.StartPositionApproximate and Feature.EndPositionApproximate depending on the status attribute.
      • [21] <position> is used when the length of the feature is one.  The contents of <position> map to Feature.StartPosition and Feature.EndPosition.
    • [33] For each <variation>, a new entry is created in the Feature table, with Feature.Variant set to the value contained by <variation>.
      • <sequence> Att: length, mass, checksum, updated
  • [22] The length attribute is stored as Protein.Length.
  • [23] The mass attribute is stored as Protein.MolecularWeightCalc.  The value in the input file is in Daltons and is converted to kDa before being inserted into the database.
  • [24] The contents of the <sequence> is stored as Protein.AASequence.
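
Several of the protein, gene, comment, and cross-reference rules above are simple string transformations, and can be sketched as plain Java functions.  This is a minimal illustration under stated assumptions, not the loader's code: the class ProteinMappingSketch and all of its method names are hypothetical, the inputs are plain strings rather than the XMLBeans schema objects, and the space after the colon in rule [13] is an assumption (the rule only specifies a colon).

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helpers illustrating rules [6], [8]/[31], [11], [12], [13] and [14]. */
class ProteinMappingSketch {

    /** Rule [6]: "T" if the type attribute mentions fragment(s), otherwise "F". */
    static String fragmentFlag(String proteinType) {
        return proteinType != null && proteinType.contains("fragment") ? "T" : "F";
    }

    /** Rules [8]/[31]: the EC number if the name starts with "EC ", otherwise null. */
    static String ecNumberOrNull(String name) {
        return name.startsWith("EC ") ? name.substring(3).trim() : null;
    }

    /** Rule [11]: the comment built from the names of all components. */
    static String componentComment(List<String> componentNames) {
        return "This protein can be cleaved to produce the following functional components: "
                + String.join(", ", componentNames);
    }

    /** Rule [12]: the name whose type is "primary", else the first name. */
    static String geneName(List<String> names, List<String> types) {
        int primary = types.indexOf("primary");
        return names.get(primary >= 0 ? primary : 0);
    }

    /** Rule [13]: capitalized type, a colon, then the text (type must be non-empty). */
    static String commentText(String type, String text) {
        String capitalized = type.substring(0, 1).toUpperCase(Locale.ROOT) + type.substring(1);
        return capitalized + ": " + text; // ASSUMPTION: one space after the colon
    }

    /** Support.EvidenceType for a function comment; status is null when absent. */
    static String functionEvidenceType(String status) {
        return status == null ? "experimental" : "computational";
    }

    /** Rule [14]: the table whose WID is used for CrossReference.OtherWID. */
    static String crossReferenceTarget(String dbReferenceType) {
        switch (dbReferenceType) {
            case "EMBL":
            case "DDBJ":
            case "GenBank":
                return "Gene";
            default:
                return "Protein";
        }
    }
}
```

For example, commentText("function", "Binds DNA") yields "Function: Binds DNA", and crossReferenceTarget("EMBL") yields "Gene".  A <dbReference> of type "EC" is handled by rule [30] before rule [14] applies, so it would never reach crossReferenceTarget.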

(Requirements count = [33].  Increment this number when adding requirements so that others don't have to search through the text to see what the requirements count is up to.)
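
The feature location rules [20] and [21] and the mass conversion in rule [23] can be illustrated in the same way.  Again this is a sketch with hypothetical names, not the loader's implementation.  In particular, the mapping does not say which values of the status attribute count as approximate, so the sketch assumes that any status other than "certain" (or a missing status) marks the position as approximate, and it applies the same assumption to <position>.

```java
/** Hypothetical sketch of rules [20], [21] and [23]; fields mirror the Feature columns. */
class FeatureLocation {
    final int startPosition;
    final int endPosition;
    final boolean startApproximate;
    final boolean endApproximate;

    private FeatureLocation(int start, boolean startApprox, int end, boolean endApprox) {
        this.startPosition = start;
        this.startApproximate = startApprox;
        this.endPosition = end;
        this.endApproximate = endApprox;
    }

    /** ASSUMPTION: a status that is present and not "certain" means approximate. */
    static boolean isApproximate(String status) {
        return status != null && !"certain".equals(status);
    }

    /** Rule [20]: a begin/end pair gives separate start and end positions. */
    static FeatureLocation fromBeginEnd(int begin, String beginStatus, int end, String endStatus) {
        return new FeatureLocation(begin, isApproximate(beginStatus), end, isApproximate(endStatus));
    }

    /** Rule [21]: a single position is used for both start and end. */
    static FeatureLocation fromPosition(int position, String status) {
        boolean approximate = isApproximate(status);
        return new FeatureLocation(position, approximate, position, approximate);
    }

    /** Rule [23]: the mass attribute is in Daltons; the Warehouse stores kDa. */
    static double daltonsToKilodaltons(double massInDaltons) {
        return massInDaltons / 1000.0;
    }
}
```

A feature of length one, such as an active site at residue 42, would be built with FeatureLocation.fromPosition(42, null), giving a start and end position of 42.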

Developer Information

Loader Design

The UniProt loader handles each <entry> in three phases: parsing, processing, and storing.  During the parsing phase, the input XML data is parsed into an object model.  The processing phase translates the XML object model into the Warehouse schema model.  (In effect, the processing phase performs the schema mapping described above.)  During the storing phase, the data is stored into the Warehouse database.

Parsing

The parsing of the input XML data into a Java object model is done automatically by an XML-to-Java binding mechanism called XMLBeans.  Prior to compilation of the Java code, the XMLBeans compiler examines the UniProt schema and generates a set of classes corresponding to the elements in the schema.  These schema classes are then used by the UniProt loader.  The UniProt loader uses the SAXDOMIX XML parser to parse a single <entry> element into a DOM.  It then uses the XMLBeans schema classes to parse the DOM for the <entry> into an EntryType object.  The class that performs the parsing phase is com.sri.biospice.warehouse.uniprot.parser.UniprotParser.
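
The reason for handling one <entry> at a time is that the input files are far too large to hold in memory as a single tree.  The sketch below illustrates that entry-at-a-time pattern using the StAX API that ships with the JDK.  It deliberately does not use SAXDOMIX or the generated XMLBeans classes, so it is not the loader's parser; the class EntryStreamSketch is hypothetical, and it extracts only the dataset attribute and the first <accession> of each entry.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: stream through the entries without building one large tree. */
class EntryStreamSketch {

    /** Returns "dataset:accession" for each entry, using its first accession. */
    static List<String> summarize(String xml) {
        List<String> summaries = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String dataset = null;
            String firstAccession = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String element = reader.getLocalName();
                    if (element.equals("entry")) {
                        // A new entry begins: reset the per-entry state.
                        dataset = reader.getAttributeValue(null, "dataset");
                        firstAccession = null;
                    } else if (element.equals("accession") && firstAccession == null) {
                        firstAccession = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("entry")) {
                    // The entry is complete; a real loader would process and store it here.
                    summaries.add(dataset + ":" + firstAccession);
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return summaries;
    }

    public static void main(String[] args) {
        String xml = "<uniprot>"
                + "<entry dataset=\"Swiss-Prot\"><accession>P12345</accession></entry>"
                + "<entry dataset=\"TrEMBL\"><accession>A0A000</accession></entry>"
                + "</uniprot>";
        System.out.println(summarize(xml));
    }
}
```

Only the state of the current entry is kept between events, which is the same property the loader obtains by building a DOM for a single <entry> and discarding it after the entry is stored.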

Processing

In this phase, the loader uses the accessor methods of the schema classes and maps the data to the Warehouse schema classes.  com.sri.biospice.warehouse.uniprot.UniprotEntry performs the processing of a single <entry>.

Storing

In this phase, the loader uses the Warehouse schema classes in the Warehouse Common Java repository to store the data into an instance of the Warehouse database.

XMLBeans

It is not necessary to download XMLBeans to do development on the UniProt loader unless it is necessary to update the loader to a newer version of the UniProt schema.  The schema Java classes are generated using the XMLBeans schema compiler (scomp), which examines the UniProt schema XSD document and generates a jar file containing all the UniProt schema classes.  The generated jar file is located at lib/uniprot-schema-1.1.jar

While XMLBeans provides an Ant task to run the schema compiler, the Ant task does not currently support all the options that the command-line version of the scomp compiler does.  In particular, the "no UPA" option is not supported.  The current version of the UniProt schema (1.1) violates the "Unique Particle Attribution" (UPA) rule for XML schemas in the definition of the comment type.  Therefore it is necessary to compile the schema using the "no UPA" rule.  The command line syntax used to generate the schema jar was:

[XMLBeans home]/bin/scomp.cmd -noupa uniprot.xsd

The Javadoc API for the generated classes is located here.  This documentation was generated by using the XMLBeans schema compiler to create Java source code files for the UniProt schema classes, and then using the standard Javadoc tool to create the API documentation.  However, these source files aren't needed to compile or run the loader.

Running the loader from Ant

A developer may find it more convenient to run the program using Ant instead of the shell script.  The build.xml file loads the necessary properties from a developer.properties file located in the base directory.  This file should contain all the same properties as used by the loader, plus the run.jvm.memory property used by the run script.  The loader may then be run using the command ant run.