BioPax Loader

Version 4.6

(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.

Introduction
Building the Loader
Running the Loader
Loader Architecture
Developer Utilities


Introduction

The BioPax Loader loads BioPAX-formatted data into the BioWarehouse. It supports only BioPAX Level 2 and loads only BioPAX physical interaction data.
This program requires that the BioWarehouse schema has been loaded and that the NCBI Taxonomy Loader has been run.

Information on the BioPAX ontology can be found at http://www.biopax.org/.

The ontology subset currently supported by the loader is shown in Ontology Instance Mapping.

The entries created by the loader are listed in BioPax to BioWarehouse Mappings.

Building the Loader

Before building the loader, make sure Java and Ant are set up according to the Environment Setup document, and that the environment tests for Java and Ant pass.

To build the loader, open a shell, navigate to the biopax-loader/ directory, and execute ant. This builds the loader and places the distribution in the biopax-loader/dist directory.
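The build step can be sketched as the shell fragment below. It is a minimal illustration, assuming that biopax-loader/ sits in the current directory and that ant is on the PATH; the LOADER_DIR variable is introduced here for illustration only and is not part of the loader.

```shell
# Assumption: biopax-loader/ is in the current directory; override if not.
LOADER_DIR="${LOADER_DIR:-biopax-loader}"

if [ -d "$LOADER_DIR" ] && command -v ant >/dev/null 2>&1; then
    # Run the default Ant target from inside the loader directory,
    # then list the distribution directory the build should have filled.
    (cd "$LOADER_DIR" && ant) && ls "$LOADER_DIR/dist"
else
    echo "Cannot build: '$LOADER_DIR' or ant was not found" >&2
fi
```

If the environment tests mentioned above have not passed, the fragment only prints a message instead of attempting the build.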

Running the Loader

Before running the loader, edit the biopax-loader/dist/bioPaxLoader.properties file for your environment.

The loader may be run using the script located at biopax-loader/dist/runBioPaxLoader.sh.

The usage of the script is described below. Note that the script expects to use the biopax-loader/dist/bioPaxLoader.properties file.


usage: runBioPaxLoader.sh
 -c,--spring-config <spring-config>                   Name of the Spring configuration file for this loader.
 -i,--input-dir <input-dir>                           Directory that contains the input data files.
 -o,--output-dir <output-dir>                         Directory to contain any output files created by this loader.
                                                      The contents of output-dir will be deleted on each run.
                                                      The directory tree will be created if it does not already exist.
 -e,--error-dir <error-dir>                           Directory to contain any error files created by this loader.
                                                      The contents of error-dir will be deleted on each run.
                                                      The directory tree will be created if it does not already exist.
 -q,--dump-query-results                              Dump the extracted SPARQL query results to the output-dir.
 -x,--input-file-extensions <input-file-extensions>   Process only files within the input directory that have these
                                                      filename extensions. Example: .owl
 -g,--ontology-files-dir <ontology-files-dir>         Directory that contains the ontology files that are to be loaded.
                                                      For example, the directory which contains the biopax ontology file
                                                      (ie: biopax-level2.owl).
 -a,--dataset-name <dataset-name>                     Name to use for all dataSets created with this application.
                                                      Example: 'BIND'
 -l,--parse-only                                      Only parse and validate the input data files.
                                                      NOTE: Using this option will cause the application to not even
                                                      attempt to connect to the database.
 -m,--dataset-home-url <dataset-home-url>             URL for the site where the data contained within the dataSets originated.
 -d,--dbms <dbms>                                     DBMS type (mysql or oracle)
 -h,--help                                            Print usage instructions
 -n,--name <name>                                     Name or SID of database
 -p,--properties <file>                               Name of properties file
 -r,--release <release date>                          Release date of the input dataset
 -s,--host <host>                                     Name or IP address of database server host
 -t,--port <port>                                     Port database server is listening at
 -u,--username <username>                             Username for connection to the database
 -v,--version <version number>                        Version number of the input dataset
 -w,--password <password>                             Password for connection to the database

Properties set on the command line take precedence over those in a properties file.
Properties in a property file are specified in name-value pairs. For example: port=1234
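
The precedence rule above can be illustrated with a short sketch. Everything here is hypothetical except the option letters and the port=... form, which come from the usage text: the property names other than port are assumed to mirror the long option names, the values and paths are placeholders, and the text does not say whether the wrapper script already supplies -p itself. The password is deliberately left out of the example file.

```shell
# Hypothetical example file; the loader's own file is
# biopax-loader/dist/bioPaxLoader.properties.
PROPS_FILE="example-bioPaxLoader.properties"

# Name-value pairs, one per line. Only "port" is confirmed by the usage
# text; the other names are assumed to mirror the long option names.
cat > "$PROPS_FILE" <<'EOF'
dbms=mysql
host=localhost
port=3306
name=biowarehouse
username=loader
EOF

LOADER_SCRIPT="biopax-loader/dist/runBioPaxLoader.sh"

# -t on the command line should take precedence over port= in the file.
set -- -p "$PROPS_FILE" -t 3307 -i data/biopax -x .owl -a BIND

if [ -x "$LOADER_SCRIPT" ]; then
    "$LOADER_SCRIPT" "$@"
else
    # Loader not built yet: just show the command that would be run.
    echo "Would run: $LOADER_SCRIPT $*"
fi
```

With these assumed values, the loader would be expected to connect on port 3307 rather than 3306, because the command-line option overrides the file entry.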


If any input files are corrupted to the point that they cannot be processed, they will be copied to the specified error-dir along with a diagnostic message.

The loader's output log file will contain information about each file that was processed.
A sample of the last several lines of a log file can be seen in bioPaxLoader.log.

Loader Architecture

The loader was written using an interface-based design, with the implementation classes wired together within Spring configuration files.
For information on Spring, see The Spring Framework.

The Spring configuration files are located in the biopax-loader/etc directory. The configurations required by the BioPax Loader are copied to the biopax-loader/dist directory by the build script. When a developer runs the loader from the build script (rather than the .sh script found in the dist directory), the configuration files within the etc directory are used.

The data is obtained from the input files using SPARQL queries. All of the queries are defined within the Spring configuration file etc/biopax-queries.xml.
[Note: When opening this file in a browser, view the page source to retain the proper formatting.]

Developer Utilities 

The following three utilities are run from the BioPax Loader's Ant build script.

Ant Task                              Description

run-query-validator                   Validates the return variable names from the SPARQL queries and the key
                                      names assigned to the query map against those defined in the constants
                                      classes.
                                      An example output log file can be seen in bioPaxLoader.log.

create-instance-report                Creates a mapping of all ontology classes/properties referenced from a
                                      set of RDF files.
                                      Run this utility to determine which ontology classes/properties need to
                                      be supported to accommodate the input data.
                                      An example listing can be seen in biopax-instance-report.
                                      The listing file will be created in the specified output-dir.

create-basedata-creation-sql-script   Creates SQL scripts that will create base table data required by the
                                      BioPax Loader for testing purposes.
                                      Eliminates the need to run the Taxon data loader, thereby reducing
                                      developer setup/testing time.
                                      Run this utility after installing the initial Warehouse DB schema.
                                      The scripts (for both Oracle and MySQL) will be created in the specified
                                      output-dir.
                                      Example scripts:
                                        insert-biopax-basedata.mysql.sql
                                        insert-biopax-basedata.oracle.sql
                                      NOTE: This utility is only to be used to facilitate initialization of
                                      an isolated DB for testing purposes.
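
As a rough sketch, the three utilities could be invoked from a shell as shown below. This assumes the Ant build script is biopax-loader/build.xml (Ant's default file name in the directory used for building) and that ant is on the PATH. How each task receives its output-dir is not stated above, so the sketch passes no extra arguments; LOADER_DIR and TASKS are names introduced here for illustration.

```shell
# Assumption: the loader's Ant build script is $LOADER_DIR/build.xml.
LOADER_DIR="${LOADER_DIR:-biopax-loader}"

# Task names as listed in the table above.
TASKS="run-query-validator create-instance-report create-basedata-creation-sql-script"

for task in $TASKS; do
    if [ -f "$LOADER_DIR/build.xml" ] && command -v ant >/dev/null 2>&1; then
        # Run one utility target; report a failure but keep going.
        (cd "$LOADER_DIR" && ant "$task") || echo "Task $task failed" >&2
    else
        echo "Would run: ant $task (in $LOADER_DIR)"
    fi
done
```

In practice a developer would usually run only the task needed at the moment; in particular, the scripts produced by the last task are meant only for an isolated test database.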

 
Twinkle

You can test SPARQL queries outside of the BioPax Loader by using a SPARQL GUI tool such as Twinkle.

To start Twinkle, execute the Ant 'run' target from the Ant script biopax-loader/tools/twinkle/build.xml.




Copy and paste the following prefix specifications into the Twinkle edit window, followed by the query text. Then load the input RDF file.
       
    PREFIX bp: <http://www.biopax.org/release/biopax-level2.owl#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
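
For illustration, the fragment below writes the prefixes and a small query to a file whose contents can then be pasted into the Twinkle edit window. The query is a hypothetical example and is not taken from etc/biopax-queries.xml; it assumes the input data uses the BioPAX Level 2 class physicalInteraction and the NAME property. The file name is also an assumption.

```shell
QUERY_FILE="example-query.rq"   # hypothetical file name

# Write the prefix specifications followed by an illustrative query.
cat > "$QUERY_FILE" <<'EOF'
PREFIX bp: <http://www.biopax.org/release/biopax-level2.owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Illustrative only: list up to ten physical interactions and their names.
SELECT ?interaction ?name
WHERE {
  ?interaction rdf:type bp:physicalInteraction .
  OPTIONAL { ?interaction bp:NAME ?name }
}
LIMIT 10
EOF

# Print the query so it can be copied into the Twinkle edit window.
cat "$QUERY_FILE"
```

Keeping the query in a file makes it easy to refine it in Twinkle first and move it into the Spring configuration afterwards.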