ChIP-Chip Loader for the BioWarehouse

Version 4.6

(C) 2008 SRI International. All Rights Reserved. See BioWarehouse Overview for license details.

Introduction
Input Files
Usage Modes
Building the loader
Running the loader
Relational Mapping of the Input Data to the BioWarehouse Schema
References

Introduction

This loader takes the results of a ChIP-chip biological assay and stores them in a relational database, namely BioWarehouse.

Information about a ChIP-chip (chromatin immunoprecipitation with genome-tiling DNA chip expression) experiment can be stored in a hierarchy of data files, any level of which the chip-chip loader can accept as input. The transformation of these files can be summarized as follows:

GFF --> MAGE-TAB --> MAGE-ML --> BioWarehouse

Please see the file bwh_chIP-chip_loader_dataflow.pdf to get a more detailed data-flow diagram.

The MAGE-ML files, plus some small text files, are the only required input. Alternatively, you can give the chip-chip loader input files encoded according to the MAGE-TAB standard [1]. MAGE-TAB files are generally less complex and easier to read than MAGE-ML files, and are stored as spreadsheet files in text format. Additionally, you can provide GFF files [2] with some additional meta-data to allow the chip-chip loader to auto-generate some of the required MAGE-TAB files. Many experiments have their data stored as MAGE-TAB and/or MAGE-ML files at ArrayExpress. If you are encoding the meta-data and experimental data from a novel experiment, you will need to encode your information in one of the data formats described below.

Input Files

File Formats:

This section describes the required format and naming of the various input files for the chip-chip loader.

CSV Format

For the purposes of the chip-chip loader, all CSV files should be stored as tab-delimited files, with double-quote characters (") at the beginning and end of each data field. Such files should have names ending in ".csv", even though they do not contain comma-separated values. This makes it convenient to open these documents in a spreadsheet program, which will then prompt you to specify the field separator character (i.e., the TAB character).
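As an illustration of this convention, the following minimal Python sketch writes and reads a tab-delimited file in which every field is double-quoted. The helper names and the sample column titles are hypothetical and are not part of the loader.

```python
import csv
import io


def write_quoted_tsv(rows, handle):
    """Write rows as tab-delimited text with every field wrapped in double quotes."""
    writer = csv.writer(
        handle, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writerows(rows)


def read_quoted_tsv(handle):
    """Read tab-delimited, double-quoted text back into lists of strings."""
    return list(csv.reader(handle, delimiter="\t", quotechar='"'))


# Hypothetical sample data, written to an in-memory buffer instead of a ".csv" file.
rows = [["Source Name", "Array Data File"], ["sample 1", "run1.txt"]]
buffer = io.StringIO()
write_quoted_tsv(rows, buffer)
text = buffer.getvalue()
```

To produce a real input file, the same writer could be pointed at a file opened with open(path, "w", newline="").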

XML Format

MAGE-ML files are XML files as described by the MAGE-ML format at MGED.

Required files:

Experiment Description MAGE-ML file

This file is the MAGE-ML encoding of the experimental meta-data, including the protocols used, the references for identifiers, publication links, etc.
This file needs any associated .proc data files.

Array Description MAGE-ML file

This file is the MAGE-ML encoding of the array design, including the features, reporters, composite elements, and physical chip design.

Antibody / Protein Linking file

This is a tab-delimited three-column flat-file that shows the association of elements in the MAGE-ML files, antibodies, and precipitated proteins. The first column is the identifier attribute of the element in the MAGE-ML file describing the ChIP-chip experiment that corresponds to an antibody. The second column is the UNIQUE-ID of the antibody protein as described in the proteins.dat file. The third column is the UNIQUE-ID of the protein that the antibody precipitates. Note that this format allows for many-to-many relationships between antibodies and precipitated proteins, by merely inserting additional entries that keep one or two columns the same, while changing the others.
A special format is supported for specifying existing protein entries in the BioWarehouse. Instead of the UNIQUE-ID in column two or three, the column value can be of the format ::UNIQUE-ID. For example, an entry of 'EcoCyc:12.1:GcvR-gly' in the third column would be an acceptable way to describe an existing E. coli protein that is precipitated by an antibody.
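The three-column layout and its many-to-many relationships can be sketched in Python as follows. This is only an illustration: the function names and all sample identifiers other than 'EcoCyc:12.1:GcvR-gly' are hypothetical, and the loader's own parsing may differ.

```python
import csv
import io
from collections import defaultdict


def read_antibody_links(handle):
    """Return (element_id, antibody_id, protein_id) triples from a linking file."""
    triples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or all(not field.strip() for field in row):
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        triples.append(tuple(field.strip() for field in row))
    return triples


def proteins_by_antibody(triples):
    """Group precipitated protein IDs under each antibody ID."""
    grouped = defaultdict(set)
    for _element_id, antibody_id, protein_id in triples:
        grouped[antibody_id].add(protein_id)
    return dict(grouped)


# Hypothetical sample: one antibody precipitating two proteins,
# and one protein precipitated by two antibodies.
sample = (
    "FV:anti-GcvR\tAB-1\tEcoCyc:12.1:GcvR-gly\n"
    "FV:anti-GcvR\tAB-1\tPROT-2\n"
    "FV:anti-X\tAB-2\tPROT-2\n"
)
links = read_antibody_links(io.StringIO(sample))
grouped = proteins_by_antibody(links)
```

Repeating the first two columns while changing the third, as in the first two sample rows, is how a single antibody is associated with several precipitated proteins.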

Protein Information file

This file captures meta-data about the precipitated protein(s) and the antibodies used in the ChIP-chip experiment. These are represented as Experimental Factors in the MAGE Object Model. It is stored using the same file format and file name (proteins.dat) as is used in the BioWarehouse BioCyc Loader.

chipchip.properties file

This file stores the information that the chip-chip loader needs to upload all of the input files into BioWarehouse. Each line holds one attribute-value pair, with the attribute and value separated by an equal sign (and no spaces). An attribute with multiple values lists them after the equal sign, separated by spaces. Please see the section Running the Loader below.
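The format can be illustrated with a small Python sketch that reads such lines into a dictionary. It is not the loader's own parser. The treatment of lines starting with '#' as comments and the use of 'file' as a multi-valued attribute name are assumptions made for the example.

```python
def parse_properties(text):
    """Parse 'name=value' lines; space-separated values become a list."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        name, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in line: {line!r}")
        values = value.split()
        # Keep a single value as a string, several values as a list.
        properties[name] = values if len(values) > 1 else value
    return properties


# Hypothetical file content; 'file' as an attribute name is an assumption.
sample = "host=localhost\nport=1234\nfile=exp1.xml exp2.xml\n"
props = parse_properties(sample)
```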

Optional files:

Array Description MAGE-ML file

This file is needed if you want to link data to tile coordinates.

Investigation Design File (IDF) MAGE-TAB file

Array Description File (ADF) ArrayExpress file

This file is not a MAGE-TAB format ADF file, but an earlier format created by ArrayExpress. See reference [4]. If present, this file will be converted into an Array Description MAGE-ML file.

Derived Array Data Matrix File (DADMF) MAGE-TAB file

This file is required for generating the .proc data files of the Experiment Description (ED) MAGE-ML file.

General Feature Format (GFF) file(s)

GFF Configuration File

gff file name - The name of the GFF file that is in the input directory.
experiment name - The name of the experiment, in brief. For example, 'reppas2006'.
DADMF linked column name - The name of the column to which the DADMF file is linked. Specify which column in your SDRF file precedes the 'Derived Array Data Matrix File' column. This will be placed at the top of the auto-generated DADMF file. For example, if the preceding column in the SDRF file is titled "Normalization Name", then you would put "Normalization" in this column. Please see the MAGE-TAB documentation for more details.
DADMF column ID - These are the values of the SDRF's '<COLUMN> Name' column, where <COLUMN> stands for the DADMF column name that was used for deriving the data in the GFF file. From the previous example, these would be the values in the column below 'Normalization Name'. This allows one to link the dataset as represented in the GFF file with a particular data object as described in the SDRF file. Please see the MAGE-TAB documentation for more details.
DADMF quantitation type - The ArrayExpress conversion scripts require that the user specify the quantitation type for the data in the DADMF file. They define a controlled vocabulary of acceptable terms, which are listed on their website [3]. A common example would be 'Normalized', for generic log-ratio normalized data. Note that this requirement comes from the ArrayExpress scripts and is not described in the MAGE-TAB Specification.
ADF chip provider - The manufacturer of the chip. Examples might be 'Affymetrix', 'Illumina', or 'NimbleGen'.
ADF bio-database - The source of your biological sequence. An example might be 'refseq'. Please see the ArrayExpress ADF documentation [4] for more details.
ADF genome sequence - The ID of your biological sequence. For E. coli from the refseq database, the ID would be 'U00096'. Please see the ArrayExpress ADF documentation [4] for more details.
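The fields above can be pictured as one record per GFF file. The Python sketch below assumes that each line of the configuration file holds the eight fields in the order listed, separated by TAB characters. The delimiter, the field names, and the sample values 'reppas2006.gff' and 'norm1' are assumptions made for illustration.

```python
from typing import NamedTuple


class GffConfigRow(NamedTuple):
    """One configuration record, with fields in the order described above."""
    gff_file_name: str
    experiment_name: str
    dadmf_linked_column: str
    dadmf_column_id: str
    dadmf_quantitation_type: str
    adf_chip_provider: str
    adf_bio_database: str
    adf_genome_sequence: str


def parse_gff_config_line(line):
    """Split one tab-delimited line into a GffConfigRow (delimiter is assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GffConfigRow._fields):
        raise ValueError(
            f"expected {len(GffConfigRow._fields)} fields, got {len(fields)}"
        )
    return GffConfigRow(*fields)


# Hypothetical sample line built from the example values given above.
row = parse_gff_config_line(
    "reppas2006.gff\treppas2006\tNormalization\tnorm1\t"
    "Normalized\tNimbleGen\trefseq\tU00096\n"
)
```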

Usage Modes

The ChIP-chip loader is flexible about the types of files that it accepts, which leads to a number of possible usage modes. Below we summarize some combinations of files that you can provide to the loader to load your experimental data.

Only MAGE-ML files, Array Description already loaded

Run with Experiment Description MAGE-ML file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.

Only MAGE-ML files, Array Description not loaded

Run with Experiment Description MAGE-ML file, Array Description MAGE-ML file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.

MAGE-TAB files, Array Description already loaded

Run with IDF, SDRF, and DADMF MAGE-TAB files, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.

MAGE-TAB files, Array Description not loaded

Run with ADF file, IDF, SDRF, and DADMF MAGE-TAB files, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.

Data in GFF files

Run with GFF file(s), gff_config.txt file, IDF file, SDRF file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.

This mode will auto-generate an ADF file to be transformed into an Array Description MAGE-ML file, ready to be loaded into the BioWarehouse.
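The usage modes can be summarized as a lookup table. The Python sketch below is only a checklist aid: the mode labels and function names are hypothetical, and proteins.dat is the file name given in the Input Files section.

```python
# Inputs that every usage mode needs.
COMMON_INPUTS = ("proteins.dat", "antibody-link.dat.csv", "chipchip.properties")

# Mode label (hypothetical) -> additional kinds of input needed by that mode.
USAGE_MODES = {
    "mage-ml, array loaded": ("ED MAGE-ML",),
    "mage-ml, array not loaded": ("ED MAGE-ML", "AD MAGE-ML"),
    "mage-tab, array loaded": ("IDF", "SDRF", "DADMF"),
    "mage-tab, array not loaded": ("ADF", "IDF", "SDRF", "DADMF"),
    "gff": ("GFF", "gff_config.txt", "IDF", "SDRF"),
}


def required_inputs(mode):
    """Return all kinds of input needed for the given usage mode."""
    return USAGE_MODES[mode] + COMMON_INPUTS


def missing_inputs(mode, provided):
    """List the required kinds of input that are not in 'provided'."""
    return [kind for kind in required_inputs(mode) if kind not in provided]


# Example: a GFF-mode run lacking its configuration and linking files.
missing = missing_inputs(
    "gff", {"GFF", "IDF", "SDRF", "proteins.dat", "chipchip.properties"}
)
```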

Building The Loader

Before building the loader, make sure the environment is configured according to the Environment Setup. Also make sure the schema is loaded into the database as specified in the Schema document.

To build the loader, bring up a shell and navigate to the chip-chip-loader directory. Then:

osprompt: ant clean
osprompt: ant build

For a list of all project targets, execute:

osprompt: ant -projecthelp

Running the Loader

Dependencies: The ChIP-Chip Loader does not require that any other data set be loaded. For the purpose of querying the chip-chip loader dataset, it is recommended that the corresponding genome be loaded into the BioWarehouse, using the BioCyc, CMR, or GenBank loaders.

The ChIP-Chip Loader is run from the chip-chip-loader/dist directory.

Two scripts are available for running the ChIP-Chip loader. The first script (automateChIPchipLoader.sh) automatically converts input files to the correct format and then executes the loader. The second script (runChipChipLoader.sh) runs the loader directly, provided all the input files are already in their correct formats. Note that the first script calls the second script after doing the data conversion.

Using automateChIPchipLoader.sh

The activity of the automateChIPchipLoader.sh script is summarized in the data flow diagram bwh_chIP-chip_loader_dataflow.pdf.

Pre-requisites:

Have Perl installed.
Have the Tab2MAGE Perl scripts from ArrayExpress installed, with magetab.pl and arraymage.pl in your executable path. The website for this software is: http://tab2mage.sourceforge.net/

The particular package to download from their download page is called 'Tab2MAGE'.

Place all of the input files into a directory by themselves.
Place database connection parameters into a file called "chipchip.properties" as name-value pairs (as described by the usage section for runChipChipLoader.sh below), including at least the following, where name is the database name or SID:

name=mydatabase
host=localhost
port=1234
username=myname
password=mypassword
dbms=mysql

Usage:

./automateChIPchipLoader.sh name ADaccession EDaccession directory

name - Experiment Name: the same one used in the second column for the gff_config.txt file.

ADaccession - Array Description MAGE-ML Accession number. For example, 'BWH_123'. If you are not allowing the script to auto-generate the Array Description MAGE-ML file or ADF file, this must match the file-name prefix of that file. For example, 'BWH_123.AD.mage.xml', or 'BWH_123.adf.csv'.

EDaccession - Experiment Description MAGE-ML Accession number. For example, 'BWH_123'. If you are not allowing the script to auto-generate the Experiment Description MAGE-ML file, this must match the file-name prefix of that file.

directory - Directory which stores all of the input files. It is recommended to create a new directory, and place all of the input files there. Then, give the full path to this directory as an argument to the script.
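As a sketch of how the four arguments fit together, the following Python helpers assemble the command line and list the Array Description file names implied by an accession number. The helper names and the sample argument values are hypothetical. The code only builds the argument list and does not run the script.

```python
import os


def build_automate_command(name, ad_accession, ed_accession, directory):
    """Return the argument list for automateChIPchipLoader.sh."""
    # The script expects the full path to the input directory.
    directory = os.path.abspath(directory)
    return ["./automateChIPchipLoader.sh", name, ad_accession, ed_accession, directory]


def expected_array_files(ad_accession):
    """File names that share the Array Description accession prefix."""
    return [f"{ad_accession}.AD.mage.xml", f"{ad_accession}.adf.csv"]


# Hypothetical values for illustration.
command = build_automate_command("reppas2006", "BWH_123", "BWH_124", "input-files")
array_files = expected_array_files("BWH_123")
```

The resulting list could be passed to subprocess.run from within the chip-chip-loader/dist directory.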

Troubleshooting

Fixing MAGE-TAB formatting mistakes

The Tab2MAGE package includes a Perl script called 'expt_check.pl' that is very useful for making sure that MAGE-TAB files are formatted correctly. See http://tab2mage.sourceforge.net/docs/expt_check.html

Using runChipChipLoader.sh

usage: runChipChipLoader.sh

 -a,--protein-dir <dir>           Directory where 'proteins.dat' file is
                                  located
 -b,--linking-file <file>         Antibody linking file
 -v2,--biocyc-version <version>   BioCyc data version
 -d,--dbms <dbms>                 DBMS type (mysql or oracle)
 -f,--file <file>                 Name of input data file(s)
 -h,--help                        Print usage instructions
 -n,--name <name>                 Name or SID of database
 -p,--properties <file>           Name of properties file
 -r,--release <release date>      Release date of the input dataset
 -s,--host <host>                 Name or IP address of database server
                                  host
 -t,--port <port>                 Port database server is listening at
 -u,--username <username>         Username for connection to the database
 -v,--version <version number>    Version number of the input dataset
 -w,--password <password>         Password for connection to the database

Properties may be set on the command line or in the properties file.
Values on the command line take precedence over those in a properties
file. Properties in a property file are specified in name-value pairs.
For example: port=1234
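The precedence rule stated in the usage text can be illustrated with a short Python sketch. It is not the loader's implementation: the function name is hypothetical, and the convention that an option not given on the command line is represented by None is an assumption.

```python
def effective_settings(file_properties, command_line):
    """Merge settings so that command-line values override the properties file."""
    merged = dict(file_properties)
    # Assumption: an option that was not supplied on the command line is None.
    merged.update(
        {name: value for name, value in command_line.items() if value is not None}
    )
    return merged


# Hypothetical values: the port is given in both places, the host only in the file.
settings = effective_settings(
    {"host": "localhost", "port": "1234"},
    {"host": None, "port": "3306"},
)
```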

Relational Mapping of the Input Data to the BioWarehouse Schema

All of the MAGE-ML data input is mapped in the same way as is described in the MAGE Loader documentation.
All of the protein data stored in the proteins.dat file is mapped in the same way as is described in the BioCyc Loader documentation.
From the antibody-link.dat.csv input file, CrossReference table entries are set up linking the elements from the MAGE-ML files to the Protein table entries that represent the antibody. In the CrossReference table, a row is set up between every FactorValue row and Protein table row representing a row in the antibody linking file. The FactorValue row WID is the CrossReference row's OtherWID, and the Protein row WID is the CrossReference row's CrossWID.
For the relationship between the antibodies and the proteins that they precipitate, Interaction table entries are set up for every unique antibody-precipitated protein pair. The type column is "immunoprecipitation". There is an entry in the InteractionParticipant table for the antibody, and an entry for the precipitated protein as well, linking it to the Interaction table entry.
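The mapping of the linking file can be pictured with the following in-memory Python sketch. It does not touch a database and is not the loader's code. The WID values, the 'participants' key standing in for InteractionParticipant rows, and the choice to emit one CrossReference row per distinct FactorValue/antibody pair are assumptions made for illustration.

```python
def map_links(triples, factor_value_wids, protein_wids):
    """Derive CrossReference-like and Interaction-like records from linking rows."""
    cross_references = []
    pairs = []
    for element_id, antibody_id, protein_id in triples:
        reference = {
            "OtherWID": factor_value_wids[element_id],  # FactorValue row WID
            "CrossWID": protein_wids[antibody_id],      # antibody Protein row WID
        }
        if reference not in cross_references:  # assumption: no duplicate rows
            cross_references.append(reference)
        pair = (antibody_id, protein_id)
        if pair not in pairs:  # one Interaction per unique pair
            pairs.append(pair)
    interactions = [
        {
            "Type": "immunoprecipitation",
            # Stand-in for the two InteractionParticipant rows.
            "participants": [protein_wids[antibody], protein_wids[protein]],
        }
        for antibody, protein in pairs
    ]
    return cross_references, interactions


# Hypothetical identifiers and WIDs.
triples = [("FV1", "AB1", "P1"), ("FV1", "AB1", "P2"), ("FV2", "AB1", "P1")]
cross_references, interactions = map_links(
    triples, {"FV1": 10, "FV2": 11}, {"AB1": 100, "P1": 200, "P2": 201}
)
```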

References:

[1] http://www.mged.org/mage-tab/spec1.1.html

[2] http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml

[3] http://tab2mage.sourceforge.net/docs/ArrayExpress/Datafile/QT_list.html

[4] http://www.ebi.ac.uk/miamexpress/help/adf/index.html