Gene Ontology (GO) Loader

Version 4.6

(C) 2006 SRI International. All Rights Reserved. See the BioWarehouse Overview for license details.

This document describes how to build and run the Gene Ontology (GO) Loader. The GO Loader is located in the go-loader/ subdirectory of the BioWarehouse distribution. For more information regarding the implementation of the GO Loader, see the GO Manual. At this time, the GO loader parses only the ontology terms and not the associations (instance data).

Obtaining the Input Files
Building the GO Loader
Running the GO Loader
Documentation

Obtaining the Input Files

The latest supported data version for the GO Loader is listed in the loader summary table. Input files can be obtained from http://geneontology.org/; the latest release can be found at ftp://ftp.geneontology.org/pub/go/godatabase/archive/latest/. The BioWarehouse GO Loader reads TermDB data in the OBO-XML format. Input files are named go_{date}-termdb.obo-xml.gz, where {date} is either a year and month (for example, 200503) or a descriptor such as "daily". The files must be decompressed with gunzip before running the loader.
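The decompression step can be sketched as a small shell helper. This is a minimal sketch and not part of the BioWarehouse distribution: the function name extract_go_input is hypothetical, and it uses gunzip -c so that the downloaded .gz archive is kept, whereas a plain "gunzip file.gz" replaces the archive with the extracted file.

```shell
# Hypothetical helper: decompress a GO termdb archive next to the original,
# keeping the .gz file so the download does not have to be repeated.
extract_go_input() {
    archive="$1"                 # e.g. go_200503-termdb.obo-xml.gz
    target="${archive%.gz}"      # same name without the .gz suffix
    # Write the decompressed data to the target; remove it again on failure.
    gunzip -c "$archive" > "$target" || { rm -f "$target"; return 1; }
    echo "$target"               # print the path to pass to the loader's -f option
}

# Example use (file name taken from the examples in this document):
# extract_go_input go_200503-termdb.obo-xml.gz
```

The printed path is the value to supply as the input data file when running the loader.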

Building the GO Loader

Before building the loader, make sure the environment is configured according to the Environment Setup. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the go-loader directory. Then:

osprompt: ant clean
osprompt: ant build

For a list of all project targets, execute:

osprompt: ant -projecthelp

The expected build output can be viewed here.

Running the GO Loader

The GO loader is run from the go-loader/dist directory.

usage: runGOLoader.sh
 -q,--quit-after <num entries>   Quit after parsing X number of entries.
                                 (For testing purposes only.)
 -l,--load-all                   Load all data, including data found to be
                                 suspect
 -c,--create-separate-datasets   Load data as three separate datasets
                                 (biological_process, molecular_function, cellular_location)
 -d,--dbms <dbms>                DBMS type (mysql or oracle)
 -f,--file <file>                Name of input data file
 -h,--help                       Print usage instructions
 -n,--name <name>                Name or SID of database
 -p,--properties <file>          Name of properties file
 -r,--release <release date>     Release date of the input dataset
 -s,--host <host>                Name or IP address of database server
                                 host
 -t,--port <port>                Port database server is listening at
 -u,--username <username>        Username for connection to the database
 -v,--version <version number>   Version number of the input dataset
 -w,--password <password>        Password for connection to the database



Properties may be set on the command line or in the properties file. Values on the command line take precedence over those in a properties file. Properties in a properties file are specified as name-value pairs. For example: port=1234

A template properties file can be found in the dist directory (go.properties).

Example of specifying parameters on the command line:

./runGOLoader.sh -d oracle -f go_200503-termdb.obo-xml -n biospice \
    -s chive.ai.sri.com -t 1234 -u myusername -w mypassword \
    -v 123 -r "March 2005" -c

Example: Running the loader using a properties file:

Edit go.properties to have the required values:

dbms=oracle
file=go_200503-termdb.obo-xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=March 2005
create-separate-datasets=true

Then run the script by passing in the name of the properties file:

./runGOLoader.sh -p go.properties
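Because command-line values take precedence over those in the properties file, the two mechanisms can be combined: keep the shared settings in a file and override individual values per run. The following is a minimal sketch under stated assumptions. The function name write_go_properties, the file name go-test.properties, and the override port 1522 are hypothetical; the property values are the placeholders used in this document. The loader call is guarded so the snippet does nothing harmful outside go-loader/dist.

```shell
# Hypothetical helper: write the shared connection settings to a properties file.
# All values are the placeholder values used in this document.
write_go_properties() {
    cat > "$1" <<'EOF'
dbms=oracle
file=go_200503-termdb.obo-xml
name=biospice
host=localhost
port=1521
username=myname
password=mypassword
version=123
release=March 2005
create-separate-datasets=true
EOF
}

# Use a separate file name so the go.properties template is left untouched.
write_go_properties go-test.properties

# -t on the command line overrides port=1521 from the file (1522 is an
# assumed value); -q 100 stops after 100 entries for a quick test run.
if [ -x ./runGOLoader.sh ]; then
    ./runGOLoader.sh -p go-test.properties -t 1522 -q 100
fi
```

Keeping the overrides on the command line leaves the properties file unchanged between a short test run and a full load.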

A log file is generated during the run. The log file is located at go-loader/dist/GOLoader.log. If you choose to create three separate DataSets for the three namespaces (biological_process, molecular_function, and cellular_component) instead of one DataSet, the namespace is appended to DataSet.Name for each of the three namespace DataSets (e.g. "Gene Ontology biological_process").

Documentation

For more detailed information on the implementation of the GO Loader, including schema mappings, please see the GO Manual.