Creation of eco2dbase dump files

Version 4.6


(C) 2006 SRI International. All Rights Reserved.  See BioWarehouse Overview for license details.


Introduction
Limitations
Run eco2dbase loader
  • Loading source eco2dbase data into Oracle database
  • Setup environment
  • Building the eco2dbase loader
  • Running the eco2dbase loader
  • Updating shell scripts to dump eco2dbase data
  • Exporting Oracle database dump
  • Exporting MySQL database dump
  • Verify Oracle and MySQL dump files
  • Update new Oracle and MySQL dump files in CVS
  • References

    Introduction

    This process only needs to be run by SRI. It only needs to be run if there is a change to the schema that affects the eco2dbase data, or if there is an update to the eco2dbase data.

    This document describes how to run the eco2dbase Loader to create the MySQL and Oracle dump files, so that Bio-SPICE Warehouse users can quickly load (import) the data from the dump files into their databases. Bio-SPICE Warehouse users do not need to run this step; they can simply import the data from the dump files as explained in the main eco2dbase documentation.

    To ensure that there are no WID conflicts when Bio-SPICE Warehouse users import the data, it is very important to note that WIDs have been set aside and reserved for the eco2dbase data. For the eco2dbase DataSet, the reserved WID is 1001 or larger. For the rest of the eco2dbase data, the reserved WIDs range from 0 to 999,999. WIDs were reserved for the eco2dbase dataset to ensure that there are no conflicts (overlapping WIDs) between the eco2dbase dataset and other datasets. This is necessary because Bio-SPICE Warehouse users do not run the full eco2dbase loader; they instead load database dumps into their Bio-SPICE Warehouse schema, and because that schema may already contain other data, it is necessary to make sure there is no overlap with the WIDs.
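    As an illustration, before importing into an existing warehouse you can spot-check that the reserved range is unoccupied (a sketch only: Gene is just one example table, the connection parameters are placeholders, and the check assumes the BioWarehouse convention that every table carries a WID column; a result of 0 means no overlap in that table):

    osprompt: mysql -u user -p -h host -P port biospice -e "SELECT COUNT(*) FROM Gene WHERE WID <= 999999"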

    For the translation semantics of the eco2dbase tables, please see the eco2dbase translation semantics document.


    Limitations

    Running this loader requires that you've loaded the schema into the database as specified in the Schema doc. The schema must not contain any data from any other datasets.

    MySQL servers impose a limit on the maximum size of an SQL statement. The eco2dbase loader currently assumes that this limit is at least 4,000,000 bytes, and will truncate long data elements such as sequences to ensure this limit is not exceeded. This assumed limit is currently not configurable in the loader. It is therefore necessary to configure your MySQL server so that its max_allowed_packet parameter is at least 4M, as described in Environment Setup (MySQL).
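    For reference, raising the limit in the MySQL server configuration file typically looks like the following (a sketch: the file location varies by installation, and Environment Setup (MySQL) remains the authoritative reference):

    # excerpt from the MySQL server configuration file (e.g., my.cnf)
    [mysqld]
    max_allowed_packet=4M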


    Run eco2dbase loader

    Loading source eco2dbase data into Oracle database

    The eco2dbase data was given to SRI in July 2000 as an Oracle database dump. This database dump file is located at /home/rosemary1/bio/databases/eco2dbase/July-2000/eco2dbase.dmp, but has already been loaded into a schema. The connection information is as follows: username=kejariwal1; password=ke_orcl; database_name=biospice. This can be accessed on Chive. In all of the following documentation, this database containing the original eco2dbase data given to SRI will be referred to as the source database.

    If the data no longer exists in the database above, or needs to be reloaded, load the dump file into a new schema. You will need to ask the Oracle admin to load the dump file for you, due to permission issues.
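    Should a reload be needed, the import typically looks like the following (a sketch: the account, file path, and options are placeholders; the Oracle admin will supply the real ones):

    osprompt: imp user/password@database file=eco2dbase.dmp log=eco2dbase_import.log full=y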

    Setup environment

    Before loading the data, make sure both the Oracle and MySQL environments are configured according to Environment Setup (MySQL). Also, make sure to set up both an Oracle and a MySQL Bio-SPICE Warehouse schema as specified in the Schema doc. The schemas must not contain any other datasets.

    In addition, Oracle users must have the Oracle binaries imp and exp included in their PATH environment variable, and MySQL users must have the mysql binary as part of their PATH variable.

    Also, for Oracle, set the TNS_ADMIN environment variable to the directory containing your tnsnames.ora file. The tnsnames.ora file is the conventional file used to specify the connection parameters to the database. For example, your cshrc file could have: setenv TNS_ADMIN /export/home/chive1/u01/app/oracle/product/10.1.0/db_1/network/admin
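    For orientation, a minimal tnsnames.ora entry has the following shape (a sketch: the host, port, and SID shown are placeholders matching the examples later in this document):

    BIOSPICE =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
        (CONNECT_DATA = (SID = biospice))
      )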

    The shell scripts that export/import the data are run from the eco2dbase-loader/bin directory. The database dump files are in the eco2dbase-loader/data directory. Upon successfully running the loader and creating both the Oracle and MySQL dump files, you will update these dump files in CVS.

    Building the eco2dbase loader

    To build the loader, bring up a shell and navigate to the eco2dbase-loader directory. Then:

    osprompt: ant clean
    osprompt: ant build

    For a list of all project targets, execute:

        osprompt: ant -projecthelp

    Running the eco2dbase loader

    The eco2dbase loader is run from the eco2dbase-loader/dist directory.

    usage: runEco2DBaseLoader.sh
     -D,--sourcedbms       source DBMS type (mysql or oracle)
     -N,--sourcename       Name or SID of source database
     -S,--sourcehost       Name or IP address of source database server host
     -T,--sourceport       Port source database server is listening at
     -U,--sourceusername   Username for connection to the source database
     -W,--sourcepassword   Password for connection to the source database
     -d,--dbms             DBMS type (mysql or oracle)
     -h,--help             Print usage instructions
     -n,--name             Name or SID of database
     -p,--properties       Name of properties file
     -s,--host             Name or IP address of database server host
     -t,--port             Port database server is listening at
     -u,--username         Username for connection to the database
     -w,--password         Password for connection to the database

    The source database is the Oracle database containing the original eco2dbase data/schema given to SRI.

    Properties may be set on the command line or in the properties file. Values on the command line take precedence over those in a properties file. Properties in a property file are specified in name-value pairs. For example: port=1234

    You need to run the loader twice: once to load the data into an Oracle Bio-SPICE Warehouse schema, and again to load the data into a MySQL Bio-SPICE Warehouse schema.

    A template properties file can be found in the dist directory (eco2dbase.properties).

    Example of properties file for copying data into Oracle database:

    #biowarehouse schema
    host=localhost
    port=1521
    dbms=oracle
    name=biospice
    username=myname
    password=mypassword
    #data will be extracted from eco2dbase database
    sourcehost=localhost
    sourceport=3125
    sourcedbms=oracle
    sourcename=biospice
    sourceusername=kejariwal1
    sourcepassword=ke_orcl

    Example of properties file for copying data into MySQL database:

    #biowarehouse schema
    host=localhost
    port=1522
    dbms=mysql
    name=biospice
    username=myname
    password=mypassword
    #data will be extracted from eco2dbase database
    sourcehost=localhost
    sourceport=3125
    sourcedbms=oracle
    sourcename=biospice
    sourceusername=kejariwal1
    sourcepassword=ke_orcl

    You can specify the parameters on the command line:

    osprompt: ./runEco2DBaseLoader.sh -D oracle -N biospice -S localhost -T 3125 -U kejariwal1 -W ke_orcl -d dbms_type -n sid -s localhost -t 3123 -u user -w password

    Alternatively, you can use the properties file. After you edit the eco2dbase.properties file to have the required values, you can run the script by passing in the name of the properties file:

    osprompt: ./runEco2DBaseLoader.sh -p eco2dbase.properties

    Remember, you need to run the loader twice: once to load the data into an Oracle Bio-SPICE Warehouse schema, and again to load the data into a MySQL Bio-SPICE Warehouse schema.

    A log file is generated during the run. The log file is located at eco2dbase-loader/dist/eco2dbaseLoader.log. Because you need one run to load the Oracle database and one run to load the MySQL database, it is recommended that after the first run you retain the log file by renaming it, so that the second run does not overwrite it.
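    For example, after the Oracle run (the new file name is arbitrary):

    osprompt: mv eco2dbaseLoader.log eco2dbaseLoader-oracle.log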

    Updating shell scripts to dump eco2dbase data

    The scripts will need to be updated if there are any changes to the schema that affect the eco2dbase data. The export scripts, mysql-dump-eco2dbase.sh and oracle-dump-eco2dbase.sh, are located in eco2dbase-loader/bin. Looking at the scripts, you can see that the tables options specify the tables to be exported, as well as the order in which the tables must be exported. This is necessary because some tables do not need to be exported (e.g., the Enumeration table, and empty tables), and because the order is very important during the import (the Gene table depends on the DataSet table, so the DataSet table must be exported first).
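    For illustration, the export command inside oracle-dump-eco2dbase.sh has roughly the following shape (a hypothetical excerpt: the variable names and the two-table list are placeholders; the actual script contains the full, ordered table list):

    # hypothetical excerpt; the real script lists every eco2dbase table,
    # in dependency order (e.g., DataSet before Gene)
    exp $USER/$PASSWORD@$DATABASE file=eco2dbase_orcl.dmp log=orcl_export.log tables=DataSet,Gene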

    If you don't know whether any additional tables contain eco2dbase data, you can do a full export of the schema, and look at the log file:

    osprompt: exp user/password@database file=fulldump_orcl.dmp log=exportFull.log

    Then look at exportFull.log to see what tables contain data. If you need help with options for Oracle exp, use: exp help=yes. Help for mysqldump can be found via: mysqldump --help.
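    If you need the equivalent full export on the MySQL side, a generic mysqldump invocation looks like this (a sketch; the connection parameters and output file name are placeholders):

    osprompt: mysqldump --user=user --password --host=host --port=port biospice > fulldump_mysql.sql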

    Exporting Oracle database dump

    osprompt: sh ./oracle-dump-eco2dbase.sh user password database

    Upon completion, it will say: created dump file: eco2dbase_orcl.dmp. You should examine the file (e.g., with the UNIX more command) to check that it contains real data.

    The log file for this is: orcl_export.log

    Exporting MySQL database dump

    osprompt: sh ./mysql-dump-eco2dbase.sh user database host port

    Upon completion, it will say: created dump file: eco2dbase_mysql.dmp. You should examine the file (e.g., with the UNIX more command) to check that it contains real data.

    Verify Oracle and MySQL dump files

    Information on how Bio-SPICE Warehouse users will load these database dumps can be found in the main eco2dbase documentation. It is suggested that after you create the dump files, you start again with new, blank Oracle and MySQL Bio-SPICE Warehouse schemas and import the data files in exactly the same way Bio-SPICE Warehouse users will. This verifies that the dump files were created correctly. (Tip: the Oracle process gives more logging information and creates a log file, so it is recommended that you look at the Oracle log file to make sure there are no problems.)
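    The main eco2dbase documentation is authoritative for the user-facing import commands; for orientation only, the two imports typically have this shape (a sketch: accounts and connection parameters are placeholders):

    osprompt: imp user/password@database file=eco2dbase_orcl.dmp log=orcl_import.log full=y
    osprompt: mysql -u user -p -h host -P port biospice < eco2dbase_mysql.dmp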

    Update new Oracle and MySQL dump files in CVS

    After you have verified the dump files, update the dump files in the eco2dbase-loader/data directory of CVS so that Bio-SPICE Warehouse users will be able to use the new, updated data.


    References

  • eco2dbase at NCBI ftp site
  • Chapter 115 in the second edition of the book "Escherichia coli and Salmonella" by Neidhardt.