UniProt (Swiss-Prot/TrEMBL) Loader for the BioWarehouse
Version 4.6
(C) 2006 SRI International.
All Rights Reserved. See BioWarehouse
Overview for license details.
Introduction
Obtaining the Input Files
The latest supported data version for
the UniProt loader is listed in the
loader summary table.
The input file to the UniProt loader is
an XML file following the UniProt XML format. These files may be
downloaded from
http://www.ebi.uniprot.org/database/download.shtml.
For Swiss-Prot, select the "UniProt/Swiss-Prot" database in the "XML"
format. The downloaded file is named uniprot_sprot.xml.gz.
For TrEMBL, select the "UniProt/TrEMBL" database in the "XML"
format. The downloaded file is named uniprot_trembl.xml.gz.
This version of the UniProt loader supports UniProt version 15.2.
Later versions of UniProt can be loaded if the UniProt schema has not
changed since version 15.2. Otherwise, please contact
support@biowarehouse.org and request a patch to the UniProt
loader.
Building the Loader
Running the Loader
Dependencies:
The UniProt loader is dependent on the NCBI Taxonomy and Enzyme loaders.
The UniProt loader is run from the uniprot-loader/dist
directory.
usage: runUniprotLoader.sh
 -q,--quit-after <num entries>   Quit after parsing X number of entries.
                                 (For testing purposes only.)
 -l,--load-all                   Load all data, including data found to be
                                 suspect
 -d,--dbms <dbms>                DBMS type (mysql or oracle)
 -f,--file <file>                Name of input data file(s)
 -h,--help                       Print usage instructions
 -n,--name <name>                Name or SID of database
 -p,--properties <file>          Name of properties file
 -r,--release <release date>     Release date of the input dataset
 -s,--host <host>                Name or IP address of database server host
 -t,--port <port>                Port database server is listening at
 -u,--username <username>        Username for connection to the database
 -v,--version <version number>   Version number of the input dataset
 -w,--password <password>        Password for connection to the database
Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a properties file are specified as name-value pairs,
for example: port=1234
A template properties file (uniprot.properties) can be found
in the dist directory.
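The precedence rule can be illustrated with a minimal sketch. It assumes the properties file uses standard java.util.Properties syntax; the class LoaderSettings and its methods are hypothetical names chosen for illustration and are not part of the loader.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of the loader: it only illustrates the
// documented rule that command-line values override properties-file values.
final class LoaderSettings {

    // Parses name=value pairs written in java.util.Properties syntax (an assumption).
    static Properties parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the file values overlaid with the command-line values.
    static Properties merge(Properties fromFile, Map<String, String> fromCommandLine) {
        Properties merged = new Properties();
        merged.putAll(fromFile);
        merged.putAll(fromCommandLine); // command line wins on conflicting names
        return merged;
    }

    public static void main(String[] args) {
        Properties file = parse("host=localhost\nport=1521\n");
        Properties merged = merge(file, Map.of("port", "1234"));
        // prints localhost:1234
        System.out.println(merged.getProperty("host") + ":" + merged.getProperty("port"));
    }
}
```

Here the port given on the command line (1234) replaces the value from the file (1521), while the host is taken from the file because the command line does not set it.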
Example of specifying parameters on the command line:
./runUniprotLoader.sh -d oracle -f input.xml -n biospice -s chive.ai.sri.com -t 1234 -u myusername -w mypassword -v 123 -r "November 24, 2004"
Example: Running the loader using a properties file:
Edit uniprot.properties to have the required values:
dbms=oracle
file=input.xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=November 24, 2004
Then run the script by passing in the name of the properties file:
./runUniprotLoader.sh -p uniprot.properties
A log file is generated during the run. The log file is located at
uniprot-loader/dist/UniprotLoader.log.
If one input file is specified, the DataSet.Name is taken from the
'dataset' attribute of the '<entry>' element (e.g. "SwissProt" or
"TrEMBL"). If multiple input files are specified, DataSet.Name
will be "UniProt".
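The DataSet.Name rule can be sketched as follows. This is an illustration only, not the loader's code: it reads the 'dataset' attribute with the JDK's StAX API, takes XML text instead of file names for brevity, and the class DataSetNames and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of how DataSet.Name is chosen.
final class DataSetNames {

    // Returns the 'dataset' attribute of the first <entry> element, or null if none exists.
    static String firstEntryDataset(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    return reader.getAttributeValue(null, "dataset");
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // One input: name comes from the entry's dataset attribute. Several inputs: "UniProt".
    static String dataSetName(List<String> inputDocuments) {
        if (inputDocuments.size() == 1) {
            return firstEntryDataset(inputDocuments.get(0));
        }
        return "UniProt";
    }

    public static void main(String[] args) {
        String trembl = "<uniprot><entry dataset=\"TrEMBL\"/></uniprot>";
        String sprot = "<uniprot><entry dataset=\"SwissProt\"/></uniprot>";
        System.out.println(dataSetName(List.of(trembl)));        // prints TrEMBL
        System.out.println(dataSetName(List.of(sprot, trembl))); // prints UniProt
    }
}
```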
Schema Mappings
Key
* = zero or more
+ = one or more
Att: x, y, z = A list of the attributes for that element
The following table describes the
mapping between the UniProt XSD
schema and the Warehouse schema. The "XML Schema Elements" column
describes only the elements
and attributes parsed by the UniProt loader, and is not a complete
description of the UniProt XSD schema.
UniProt XML Schema
|
Mapping to Warehouse Schema
|
|
- [1] One entry is made in the DataSet table for the entire input
file.
|
- <entry>+ Att: dataset, created, modified
|
- [2] Each entry is parsed, processed, and stored before
parsing
the next entry.
- [3] The created and modified attributes are applied to each
Entry as Entry.CreationDate and Entry.ModificationDate ,
respectively.
- For each entry, the following rows are added to the
Warehouse database (as described below):
- Entry (one for
each row in an Object Table)
- Protein (one row
only)
- BioSource and BioSourceWIDProteinWID (one
or more rows)
- Gene and GeneWIDProteinWID (zero or more
rows)
- Feature (zero or
more rows)
- DBID (two or
more rows)
- SynonymTable (zero
or more rows)
- Term (zero or
more rows)
- Citation and CitationWIDOtherWID (one or more
rows)
|
|
- [4] The contents of <accession> are stored in
DBID.XID
with the DBID.OtherWID = the WID of the Protein for this
<entry> and DBID.Type = "Accession".
|
|
- [5] The contents of <name> are stored in SynonymTable.Syn
with the SynonymTable.OtherWID = the WID of the Protein
for this
<entry>.
|
- <protein> Att: type, evidence, ref
- <name>+
- <domain>*
- <component>*
|
- Note: We currently
do not handle the "evidence" attribute of protein.
- [6] If the "type" attribute contains "fragment" or
"fragments", then set
Protein.Fragment to "T".
Otherwise set Protein.Fragment to "F".
- [8] <recommendedName>/<fullName> is stored as
Protein.Name .
All subsequent name elements are stored as synonyms in SynonymTable .
Exception: if the
<name> starts with "EC ", then we strip off this prefix and
handle the name as an EC number as described in <dbReference>,
below.
- [31] For each <domain>, if the <name> begins
with "EC " then we strip off this prefix and handle the name as an EC
number as described in <dbReference>, below. All other names will be stored in the
SynonymTable with
SynonymTable.OtherWID = the WID of
the Protein for this <entry>
- For each <component>
- [11] For all <name> elements:
- The names listed here should be appended together and
prefixed with the following, to construct a comment: "This protein can
be cleaved to produce the following functional components: NAME1, NAME2,
..."
- Each name will be stored in the
SynonymTable with
SynonymTable.OtherWID = the WID of
the Protein for this <entry>
|
|
- [12] For each <gene>, create an entry in
Gene .
- If there is a <name> with type="primary", then that
name is stored in
Gene.Name . All other names are
stored as synonyms.
- If there is no <name> with type="primary", then the
first <name> encountered is stored as
Gene.Name .
All other names are stored as synonyms.
Gene.Type = "polypeptide"
- An entry is added to
GeneWIDProteinWID for
each Gene .
|
- <organism>+
- <name>+ Att: type (common, full, scientific,
synonym, or abbreviation)
- <dbReference>+
- <lineage>*
|
- [27] For each
<organism>
- Create an entry in
BioSource
BioSource.Name =
<name>
- If the "type"
attribute of <dbReference> is "NCBI
Taxonomy", then set
BioSource.TaxonWID equal to the WID
obtained by the
following query: "select OtherWID from DBID where XID=[the "id"
attribute of <dbReference>]"
- [29] For each entry
in
BioSource , add an entry in BioSourceWIDProteinWID
with ProteinWID = the WID of the Protein
for this uniprot <entry>. For each entry
in BioSource and each entry in Gene , add
an entry in BioSourceWIDGeneWID .
|
- <reference>+
- <citation>
- <sptrCitationGroup>
- <source>*
- <species>*
- <strain>*
- <plasmid>*
- <transposon>*
- <tissue>*
|
- [26] Map <citation> to Citation
table
Citation.Citation = list of all available
citation
attributes,
editors, and authors.
Citation.PMID = id attribute of
<dbReference> if
"type" attribute="PubMed"
CitationWIDOtherWID.OtherWID = Protein.WID
CitationWIDOtherWID.CitationWID = Citation.WID
from above
- [28] For each <source>
in each <reference>
- Create a new entry in
BioSource if the
contents of any of the tags
(strain, species, etc.) is
different from any source tag previously encountered.
BioSource.Name = contents
of <species>, if present
BioSource.Strain =
contents of <strain>/<name>, if present
- Note: If a strain is listed without a species, use the
scientific name of the organism for BioSource.Name. If there is more
than one organism, log a warning.
BioSource.Tissue = contents of
<tissue>, if present
- If a species is listed without a strain, we populate
BioSource.TaxonWID
using the following query: "select WID from Taxon where
Rank='species' and Name='species name' ".
- See also requirement
[29] under <organism>, above.
|
|
- [13] The contents of each <comment> are stored in
CommentTable.Comm .
Prefix each comment with the "type" attribute (capitalize first letter)
and follow by a colon and then the contents of <comment>
- For each comment whose type attribute is "function", store "protein function" in
Support.Type and record in Support.EvidenceType whether the function was
experimentally determined. A non-experimental qualifier in the "status" attribute
is stored as "computational" in Support.EvidenceType :
<comment type="function" status="By similarity">
<comment type="function" status="Potential">
<comment type="function" status="Probable">
A comment without a "status" attribute is assumed to be experimental and is stored
as "experimental" in Support.EvidenceType .
|
|
- The type attribute of each <proteinExistence> is stored in
Support.EvidenceType with "protein existence"
as the value of Support.Type
|
- <dbReference> Att:
type, id, evidence, key
|
- [30] If the "type" attribute is "EC" then the "id"
attribute contains an EC number.
- Add the EC number to the
SynonymTable with
SynonymTable.OtherWID = the WID of
the Protein for this <entry>
- Add an entry to the
EnzymaticReaction table
with EnzymaticReaction.ProteinWID = WID of
the Protein for this <entry> and EnzymaticReaction.ReactionWID
= the WID of the Reaction whose Reaction.ECNumber
equals the EC number found in the
uniprot file.
- [25] Note: whenever an EC number is encountered, do not add it if
the same EC number has already been encountered.
- [14] Otherwise, for each <dbReference> an entry is
made in
CrossReference:
CrossReference.XID = "id" attribute
CrossReference.Version = null
CrossReference.DatabaseName = "type"
attribute
CrossReference.Type = "Accession"
- If "type" attribute is:
CrossReference.OtherWID = Gene.WID
- Otherwise
CrossReference.OtherWID = Protein.WID
|
|
- [15] The contents of
<keyword> are stored as
Term.Name .
|
- <feature>* Att: type, status, id, description,
evidence, ref
- <variation>
- <location>
- <begin> Att: position, status
- <end> Att: position, status
- <position> Att: position, status
|
- [16] For each <feature>, create an entry in the
Feature
table
Feature.Type = the "type" attribute of
<feature>
Feature.Class = the "status" attribute of
<feature>
Feature.SubsequenceWID = Protein.WID
Feature.Description = "description"
attribute
Feature.SequenceType = "P" for protein
- [32] If the "ref" attribute is present, then the value of the "ref"
attribute is also the value of the "key" attribute of the corresponding
<reference> element. When this is encountered during parsing, add
a row to CitationWIDOtherWID that links the Feature WID with the
Citation WID that was already processed.
- A feature location either has a <begin>,
<end>
pair or a <position>.
- [20] If <begin> and <end> are present, the position
attributes of these tags are stored as Feature.StartPosition
and Feature.EndPosition , respectively. Set Feature.StartPositionApproximate
and Feature.EndPositionApproximate depending on the
status attribute.
- [21] <position> is used when the length of the
feature is one. The contents of <position> map to
Feature.StartPosition
and Feature.EndPosition .
- [33] For each <variation>, a new entry is created in the
Feature table, with Feature.Variant set to the value contained by <variation>.
|
- <sequence> Att: length, mass, checksum, updated
|
- [22] The length attribute is stored as
Protein.Length .
- [23] The mass attribute is stored as
Protein.MolecularWeightCalc .
The value in the input file is in Daltons and is converted to kDa
before being inserted into the database.
- [24] The contents of the <sequence> are stored as
Protein.AASequence .
|
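Several of the value-level rules in the table above ([6], [8]/[31], [13], the function-comment evidence rule, and [23]) can be sketched as small pure functions. This is an illustrative sketch, not the loader's code: the class MappingRules and its method names are hypothetical, and the space after the colon in the comment prefix is an assumption, since the table only says "a colon and then the contents".

```java
import java.util.Locale;

// Hypothetical helpers illustrating individual mapping rules from the table.
final class MappingRules {

    // [6] Protein.Fragment is "T" if the type attribute contains "fragment"
    // (which also covers "fragments"); otherwise "F".
    static String fragmentFlag(String proteinType) {
        return proteinType != null && proteinType.contains("fragment") ? "T" : "F";
    }

    // [8], [31] Returns the EC number if the name starts with "EC ", else null.
    static String ecNumberOrNull(String name) {
        return name.startsWith("EC ") ? name.substring(3).trim() : null;
    }

    // [13] Prefixes the comment with its capitalized type and a colon.
    // The space after the colon is an assumption.
    static String commentText(String type, String contents) {
        if (type == null || type.isEmpty()) {
            return contents;
        }
        String label = type.substring(0, 1).toUpperCase(Locale.ROOT) + type.substring(1);
        return label + ": " + contents;
    }

    // Support.EvidenceType for a function comment: a present status attribute
    // ("By similarity", "Potential", "Probable") means "computational".
    static String functionEvidenceType(String status) {
        return status == null ? "experimental" : "computational";
    }

    // [23] The mass attribute is in Daltons; Protein.MolecularWeightCalc is in kDa.
    static double daltonsToKilodaltons(int massInDaltons) {
        return massInDaltons / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(fragmentFlag("fragments"));                   // prints T
        System.out.println(ecNumberOrNull("EC 1.1.1.1"));                // prints 1.1.1.1
        System.out.println(commentText("function", "Binds DNA."));       // prints Function: Binds DNA.
        System.out.println(functionEvidenceType("By similarity"));       // prints computational
        System.out.println(daltonsToKilodaltons(52000));                 // prints 52.0
    }
}
```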
(Requirements count = [33].
Increment this number when adding
requirements so that others don't have to search through the text to
see what the requirements count is up to.)
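Two of the structural rules, gene name selection ([12]) and feature location ([20], [21]), can be sketched in the same way. The class EntryRules, the TypedName value type, and the method names are hypothetical; the handling of the "status" attribute and the Approximate columns is deliberately left out because the table does not spell out the values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helpers illustrating requirements [12], [20], and [21].
final class EntryRules {

    // One <name> child of <gene>: its "type" attribute and its text.
    static final class TypedName {
        final String type;
        final String value;

        TypedName(String type, String value) {
            this.type = type;
            this.value = value;
        }
    }

    // [12] Index of the name stored as Gene.Name: the first name with
    // type="primary", else the first name encountered; -1 if there are no names.
    static int geneNameIndex(List<TypedName> names) {
        for (int i = 0; i < names.size(); i++) {
            if ("primary".equals(names.get(i).type)) {
                return i;
            }
        }
        return names.isEmpty() ? -1 : 0;
    }

    // [12] All names other than the one chosen as Gene.Name become synonyms.
    static List<String> geneSynonyms(List<TypedName> names) {
        int chosen = geneNameIndex(names);
        List<String> synonyms = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            if (i != chosen) {
                synonyms.add(names.get(i).value);
            }
        }
        return synonyms;
    }

    // [20], [21] Returns {StartPosition, EndPosition}. A <position> yields a
    // feature of length one; otherwise the <begin>/<end> pair is used.
    static int[] featureStartEnd(Integer begin, Integer end, Integer position) {
        if (position != null) {
            return new int[] {position, position};
        }
        if (begin == null || end == null) {
            throw new IllegalArgumentException("Need a begin/end pair or a position");
        }
        return new int[] {begin, end};
    }

    public static void main(String[] args) {
        List<TypedName> names = List.of(
                new TypedName("synonym", "abcA"),
                new TypedName("primary", "xyzB"));
        System.out.println(names.get(geneNameIndex(names)).value); // prints xyzB
        System.out.println(geneSynonyms(names));                   // prints [abcA]
    }
}
```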
Developer Information
Loader Design
The UniProt loader parses each
<entry> in three steps: parsing, processing, and storing.
During the parsing phase, the input XML data is parsed into an object
model. The processing phase translates the XML object model into
the Warehouse schema model. (In effect, the processing phase
performs the schema mapping described above.) During the storing
phase, the data is stored into the Warehouse database.
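The three steps can be pictured as a small pipeline applied to one entry at a time. The following is a minimal sketch under stated assumptions: EntryPipeline and its type parameters are hypothetical and stand in for the loader's real parser, processing, and storage classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical sketch of the parse -> process -> store flow.
// X = XML object model for one <entry>, W = Warehouse schema model.
final class EntryPipeline<X, W> {
    private final Function<String, X> parser;  // raw <entry> XML -> XML object model
    private final Function<X, W> processor;    // XML object model -> Warehouse model
    private final Consumer<W> store;           // Warehouse model -> database

    EntryPipeline(Function<String, X> parser, Function<X, W> processor, Consumer<W> store) {
        this.parser = parser;
        this.processor = processor;
        this.store = store;
    }

    // Each entry is parsed, processed, and stored before the next one is parsed.
    void load(Iterable<String> rawEntries) {
        for (String raw : rawEntries) {
            X parsed = parser.apply(raw);
            W mapped = processor.apply(parsed);
            store.accept(mapped);
        }
    }

    public static void main(String[] args) {
        List<String> database = new ArrayList<>(); // stand-in for the Warehouse database
        EntryPipeline<String, String> pipeline = new EntryPipeline<String, String>(
                raw -> raw.trim(), parsed -> parsed.toUpperCase(), database::add);
        pipeline.load(List.of(" entry one ", " entry two "));
        System.out.println(database); // prints [ENTRY ONE, ENTRY TWO]
    }
}
```

Keeping the three phases behind separate functions mirrors the description above: only the processing step needs to know the schema mapping.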
Parsing
The parsing of the input XML data into
a Java object model is done automatically by an XML-to-Java binding
mechanism called XMLBeans. Prior to compilation of the Java
code, the XMLBeans compiler examines the UniProt schema and generates a
set of classes corresponding to the elements in the schema. These
schema classes are then used by the UniProt loader. The UniProt
loader uses the SAXDOMIX XML parser to parse a single <entry>
element into a DOM. It then uses the XMLBeans schema classes to
parse the DOM for the <entry> into an EntryType object. The class
that performs the parsing phase is
com.sri.biospice.warehouse.uniprot.parser.UniprotParser.
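The entry-at-a-time idea can be illustrated without SAXDOMIX or XMLBeans. The sketch below swaps in the JDK's StAX API and only collects the <accession> values of each <entry>, handing them to a callback before the next entry is read. EntryStreamer, streamAccessions, and the maxEntries parameter (in the spirit of the --quit-after option) are hypothetical names, and the input is XML text rather than a file for brevity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical StAX-based illustration; the loader itself uses SAXDOMIX and XMLBeans.
final class EntryStreamer {

    // Calls onEntry with the accessions of each <entry>, one entry at a time.
    // Stops after maxEntries entries if maxEntries > 0. Returns the number of entries seen.
    static int streamAccessions(String xml, int maxEntries, Consumer<List<String>> onEntry) {
        int count = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            List<String> accessions = null; // non-null only while inside an <entry>
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("entry".equals(name)) {
                        accessions = new ArrayList<>();
                    } else if ("accession".equals(name) && accessions != null) {
                        // getElementText leaves the reader on the closing </accession> tag
                        accessions.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "entry".equals(reader.getLocalName())) {
                    onEntry.accept(accessions); // handle this entry before reading the next
                    accessions = null;
                    count++;
                    if (maxEntries > 0 && count >= maxEntries) {
                        break;
                    }
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<uniprot>"
                + "<entry><accession>P1</accession><accession>P2</accession></entry>"
                + "<entry><accession>Q1</accession></entry>"
                + "</uniprot>";
        int seen = streamAccessions(xml, 0, System.out::println);
        System.out.println(seen + " entries");
    }
}
```

Because only one entry's data is held in memory at a time, this style of parsing suits input files that are too large to load as a single DOM.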
Processing
In this phase, the loader uses the
accessor methods of the schema classes and maps the data to the
Warehouse schema classes. com.sri.biospice.warehouse.uniprot.UniprotEntry
performs the processing of a single <entry>
.
Storing
In this phase, the loader uses the
Warehouse schema classes in the Warehouse Common Java repository to
store the data into an instance of the Warehouse database.
XMLBeans
It is not necessary to download
XMLBeans to do development on the UniProt loader unless it is necessary
to update the loader to a newer
version of the UniProt schema. The schema Java classes are
generated using the XMLBeans schema compiler (scomp), which examines
the UniProt schema XSD document and generates a jar file containing all
the UniProt schema classes. The generated jar file is located at
lib/uniprot-schema-1.1.jar.
While XMLBeans provides an Ant task to run the schema compiler, the Ant
task does not currently support all the options that the command-line
version of the scomp compiler does. In particular, the "no UPA"
option is not supported. The current version of the UniProt
schema (1.1) violates the "Unique Particle Attribution" (UPA) rule for
XML schemas in the definition of the comment type. Therefore it
is necessary to compile the schema using the "no UPA" rule. The
command line syntax used to generate the schema jar was:
[XMLBeans home]/bin/scomp.cmd -noupa uniprot.xsd
The Javadoc API for the generated classes is located
here. This documentation was
generated by using the XMLBeans schema compiler to create Java source
code files for the UniProt schema classes, and then using the standard
Javadoc tool to create the API documentation. However, these
source files aren't needed to compile or run the loader.
Running the loader from Ant
A developer may find it more convenient
to run the program using Ant instead of the shell script. The
build.xml file loads the necessary properties from a developer.properties
file located in the base directory. This file should contain all
the same properties as used by the loader, plus the run.jvm.memory
property used by the run script. The loader may then be run using
the command ant run.