Gene Ontology (GO) Loader

Version 4.6
(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.

This document describes how to build and run the Gene Ontology (GO) Loader.  The GO Loader is located in the go-loader/ subdirectory of the BioWarehouse distribution.  For more information regarding the implementation of the GO Loader, see the GO Manual.  At this time, the GO loader parses only the ontology terms and not the associations (instance data).


Obtaining the Input Files
Building the GO Loader
Running the GO Loader
Documentation

Obtaining the Input Files

The latest supported data version for the GO loader is listed in the loader summary table.  Input files to the GO loader can be obtained from http://geneontology.org/.  The latest release can be found at ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest/.  The BioWarehouse GO Loader reads TermDB data in the OBO-XML format.  Input files are named go_{date}-termdb.obo-xml.gz, where {date} is either a year and month (for example, 200503) or a descriptor such as "daily".  The files must be extracted using gunzip before running the loader.
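The extraction step can be sketched as a small shell helper. This is only an illustration: the function name extract_go_input is hypothetical and not part of BioWarehouse, and it uses gunzip -c so that the downloaded archive is left in place next to the extracted file.

```shell
# Hypothetical helper (not part of the BioWarehouse distribution).
# Decompresses <name>.gz to <name> in the same directory, keeps the
# original archive, and prints the path of the extracted file.
extract_go_input() {
    archive="$1"
    target="${archive%.gz}"    # strip the trailing .gz suffix
    gunzip -c "$archive" > "$target" && printf '%s\n' "$target"
}
```

For example, extract_go_input go_200503-termdb.obo-xml.gz would produce go_200503-termdb.obo-xml, the file name used in the examples later in this document.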

Building the GO Loader

Before building the loader, make sure the environment is configured according to the Environment Setup. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the go-loader directory. Then:

    osprompt: ant clean
    osprompt: ant build

For a list of all project targets, execute:

    osprompt: ant -projecthelp

The expected build output can be viewed here.

Running the GO Loader

The GO loader is run from the go-loader/dist directory.

usage: runGOLoader.sh
 -q,--quit-after <num entries>   Quit after parsing X number of entries.
                                 (For testing purposes only.)
 -l,--load-all                   Load all data, including data found to be
                                 suspect
 -c,--create-separate-datasets   Load data as three separate datasets
                                 (biological_process, molecular_function, cellular_component)
 -d,--dbms <dbms>                DBMS type (mysql or oracle)
 -f,--file <file>                Name of input data file
 -h,--help                       Print usage instructions
 -n,--name <name>                Name or SID of database
 -p,--properties <file>          Name of properties file
 -r,--release <release date>     Release date of the input dataset
 -s,--host <host>                Name or IP address of database server
                                 host
 -t,--port <port>                Port database server is listening at
 -u,--username <username>        Username for connection to the database
 -v,--version <version number>   Version number of the input dataset
 -w,--password <password>        Password for connection to the database

Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a property file are specified in name-value pairs. For
example: port=1234


A template properties file can be found in the dist directory (go.properties).
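Before running the loader, it can be useful to confirm that every line of an edited properties file has the name=value form described above. The following is a minimal sketch, not a BioWarehouse tool: the function name check_properties is hypothetical, and the check is deliberately stricter than what a properties parser may accept, since it allows only blank lines, lines starting with #, and simple name=value pairs.

```shell
# Hypothetical helper (not part of the BioWarehouse distribution).
# Prints every line (with its line number) that is not blank, not a
# '#' comment, and not of the form name=value.
# Exit status: 0 = all lines look fine, 1 = offending lines found,
#              2 = file missing or unreadable.
check_properties() {
    [ -r "$1" ] || return 2
    # grep -v selects lines matching neither pattern, i.e. the bad ones.
    if grep -n -v -E '^[[:space:]]*(#.*)?$|^[A-Za-z][A-Za-z0-9_.-]*=' "$1"; then
        return 1
    fi
    return 0
}
```

Running check_properties go.properties prints nothing and returns 0 when every line is well formed. A line such as "port 1521" (missing the equals sign) would be reported together with its line number.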

Example of specifying parameters on the command line:

./runGOLoader.sh -d oracle -f go_200503-termdb.obo-xml -n biospice -s chive.ai.sri.com -t 1234 -u myusername -w mypassword -v 123 -r "March 2005" -c

Example of running the loader using a properties file:

Edit go.properties to have the required values:

dbms=oracle
file=go_200503-termdb.obo-xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=March 2005
create-separate-datasets=true

Then run the script by passing in the name of the properties file:

./runGOLoader.sh -p go.properties
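Because values on the command line take precedence over those in a properties file, a single setting can be changed for one run without editing the file. The sketch below assumes that -p may be combined with other options in the same invocation, which the precedence rule implies but which is not shown explicitly in this document. The wrapper name run_with_port is hypothetical, and it must be called from the go-loader/dist directory.

```shell
# Hypothetical wrapper (not part of the BioWarehouse distribution).
# Reads all settings from go.properties but overrides the port with
# the value given as the first argument.
run_with_port() {
    ./runGOLoader.sh -p go.properties -t "$1"
}
```

For example, run_with_port 1522 would connect on port 1522 even if go.properties contains port=1521.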

A log file is generated during the run and written to go-loader/dist/GOLoader.log.  If you use the -c (--create-separate-datasets) option to create three separate DataSets for the three namespaces (biological_process, molecular_function, and cellular_component) instead of one DataSet, the namespace is appended to DataSet.Name for each of the three DataSets (e.g. "Gene Ontology biological_process").

Documentation

For more detailed information on the implementation of the GO Loader, including schema mappings, please see the GO Manual.