This loader takes the results of a ChIP-chip biological assay and stores them in a relational database, namely BioWarehouse.
The chip-chip loader accepts as input a hierarchy of data files that store information about a ChIP-chip (chromatin immunoprecipitation with genome-tiling DNA chip expression) experiment. The transformation of these files can be summarized as follows:
GFF --> MAGE-TAB --> MAGE-ML --> BioWarehouse
See the file bwh_chIP-chip_loader_dataflow.pdf for a more detailed data-flow diagram.
The MAGE-ML files, plus some small text files, are the only required input. Alternatively, you can give the chip-chip loader input files encoded according to the MAGE-TAB standard [1]. MAGE-TAB files are generally less complex and easier to read; they are stored as spreadsheet files in text format.
Additionally, you can provide GFF files [2] with some additional meta-data to allow the chip-chip loader to auto-generate some of the required MAGE-TAB files.
Many experiments have their data stored as MAGE-TAB and/or MAGE-ML files at ArrayExpress. If you are encoding the meta-data and experimental data from a novel experiment, you will need to encode your information in one of the data formats described below.
This section describes the required format and naming of the various input files for the chip-chip loader.
CSV Format
For the purposes of the chip-chip loader, all CSV files should be stored as tab-delimited files, with double-quote characters (") at the beginning and end of each data field. Such files should have names ending in ".csv", even though they do not contain comma-separated values. This makes it convenient to open them in a spreadsheet program, which will then prompt you to specify the field separator (i.e., the TAB character).
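The CSV convention described above can be produced with a standard CSV library. The following is a minimal sketch in Python, assuming only what the text states (TAB between fields, every field wrapped in double quotes); the function name and the sample rows are hypothetical and not part of the loader.

```python
import csv
import io


def write_loader_csv(rows, handle):
    """Write rows as tab-delimited text with every field double-quoted."""
    writer = csv.writer(
        handle,
        delimiter="\t",         # TAB between fields, despite the .csv suffix
        quotechar='"',
        quoting=csv.QUOTE_ALL,  # quote every field, not only those that need it
        lineterminator="\n",
    )
    writer.writerows(rows)


# Hypothetical rows, for illustration only.
buffer = io.StringIO()
write_loader_csv([["Source Name", "Factor Value"], ["sample 1", "anti-GcvR"]], buffer)
print(buffer.getvalue(), end="")
```

When writing to a real file rather than an in-memory buffer, open it with newline="" so that the csv module controls the line endings.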
XML Format
MAGE-ML files are XML files as described by the MAGE-ML format at MGED.
- Experiment Description MAGE-ML file
- This file is the MAGE-ML encoding of the experimental meta-data, including the protocols used, the references for identifiers, publication links, etc.
- Requires any associated .proc data files
- Array Description MAGE-ML file
- This file is the MAGE-ML encoding of the array design, including the features, reporters, composite elements, and physical chip design.
- Antibody / Protein Linking file
- This is a tab-delimited three-column flat-file that shows the association of elements in the MAGE-ML files, antibodies, and precipitated proteins. The first column is the identifier attribute of the element in the MAGE-ML file describing the ChIP-chip experiment that corresponds to an antibody. The second column is the UNIQUE-ID of the antibody protein as described in the proteins.dat file. The third column is the UNIQUE-ID of the protein that the antibody precipitates. Note that this format allows for many-to-many relationships between antibodies and precipitated proteins, by merely inserting additional entries that keep one or two columns the same, while changing the others.
- A special format is available for specifying existing protein entries in the BioWarehouse. Instead of the UNIQUE-ID in column two or three, the column value can be of the format <DATASET>:<VERSION>:UNIQUE-ID. For example, an entry of 'EcoCyc:12.1:GcvR-gly' in the third column would be an acceptable way to describe an existing E. coli protein that is precipitated by an antibody.
- Protein Information file
- This file captures meta-data about the precipitated protein(s) and the antibodies used in the chIP-chip experiment. These are represented as Experimental Factors in the MAGE Object Model. It is stored using the same file format and file name (proteins.dat) as is used in the BioWarehouse BioCyc Loader.
- chipchip.properties file
- This file stores the information that the chip-chip loader needs to upload all of the input files into BioWarehouse. Each line holds one attribute-value pair, separated by an equal sign (and no spaces). An attribute with multiple values lists them after the equal sign, separated by spaces. Please see the section Running the Loader below.
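As an illustration of the Antibody / Protein Linking file described above, the following Python sketch reads the three columns and groups the precipitated proteins by antibody, which shows how repeated rows express many-to-many relationships. It assumes the file follows the quoted, tab-delimited CSV convention, and that a reference to an existing BioWarehouse entry contains exactly two colons, as in the example 'EcoCyc:12.1:GcvR-gly'. All identifiers in the sample data, and both function names, are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_linking_rows(handle):
    """Yield (element_id, antibody_id, protein_id) from a tab-delimited file."""
    for row in csv.reader(handle, delimiter="\t", quotechar='"'):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
        yield tuple(field.strip() for field in row)


def is_existing_entry(protein_id):
    """Assumption: a prefixed ID has exactly two colons, as in 'EcoCyc:12.1:GcvR-gly'."""
    return protein_id.count(":") == 2


# Hypothetical content: one antibody (AB-1) that precipitates two proteins.
sample = (
    '"FV-anti-gcvR"\t"AB-1"\t"EcoCyc:12.1:GcvR-gly"\n'
    '"FV-anti-gcvR"\t"AB-1"\t"PROT-2"\n'
)
targets = defaultdict(set)
for element_id, antibody_id, protein_id in read_linking_rows(io.StringIO(sample)):
    targets[antibody_id].add(protein_id)
```

Here the first two columns stay the same across both rows while the third changes, so AB-1 ends up linked to two precipitated proteins.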
Optional files:
- Array Description MAGE-ML file
- This file is needed if you need to link data to tile coordinates.
- Investigation Description Format (IDF) MAGE-TAB file
- Array Description File (ADF) ArrayExpress file
- This file is not a MAGE-TAB format ADF file, but an earlier format created by ArrayExpress. See reference [4].
- If present, this file will be converted into an Array Description MAGE-ML file.
- Derived Array Data Matrix File (DADMF) MAGE-TAB file
- Required for generating the ED MAGE-ML file's .proc data files.
- General Feature Format (GFF) file(s)
- GFF Configuration File
The chip-chip loader needs additional meta-data to transform GFF files [2] into ADF and DADMF files. This meta-data is stored in a file named 'gff_config.txt'.
Each line holds eight fields, with white space acting as the field separator. There should be a separate line for every GFF file that needs to be processed.
The eight fields are described as follows:
- gff file name - The name of the GFF file that is in the input directory.
- experiment name - The name of the experiment, in brief. For example, 'reppas2006'.
- name of column which DADMF file is linked to - Specify which column in your SDRF file precedes the 'Derived Array Data Matrix File' column. This will be placed at the top of the auto-generated DADMF file. For example, if the preceding column in the SDRF file is titled "Normalization Name", then you would put "Normalization" in this field. Please see the MAGE-TAB documentation for more details.
- DADMF column ID - These are the values of the SDRF's '<COLUMN> Name' column, where <COLUMN> stands for the DADMF column name that was used for deriving the data in the GFF file. From the previous example, these would be the values in the column below 'Normalization Name'. This allows one to link the dataset as represented in the GFF file with a particular data object as described in the SDRF file. Please see the MAGE-TAB documentation for more details.
- DADMF quantitation type - The ArrayExpress conversion scripts require that the user specify the quantitation type for the data in the DADMF file. They define a controlled vocabulary of acceptable terms, which are listed on their website [3]. A common example would be 'Normalized', for generic log-ratio normalized data. Note that this requirement comes from the ArrayExpress scripts and is not described in the MAGE-TAB Specification.
- ADF chip provider - The manufacturer of the chip. Examples might be 'Affymetrix', 'Illumina', or 'NimbleGen'.
- ADF bio-database - The source of your biological sequence. An example might be 'refseq'. Please see the ArrayExpress ADF documentation [4] for more details.
- ADF genome sequence - The ID of your biological sequence. For E. coli from the refseq database, the ID would be 'U00096'. Please see the ArrayExpress ADF documentation [4] for more details.
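To make the eight-field layout of gff_config.txt concrete, here is a small Python sketch that splits one line into named fields. The field names in the tuple are hypothetical labels chosen for this example, and the GFF file name 'tiles.gff' and the column ID 'norm1' are made up; the remaining values are the examples given in the field descriptions. The sketch assumes that no field contains white space, since white space is the separator.

```python
from typing import NamedTuple


class GffConfigEntry(NamedTuple):
    gff_file_name: str
    experiment_name: str
    dadmf_linked_column: str
    dadmf_column_id: str
    dadmf_quantitation_type: str
    adf_chip_provider: str
    adf_bio_database: str
    adf_genome_sequence: str


def parse_gff_config_line(line):
    """Split one gff_config.txt line on white space into its eight fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return GffConfigEntry(*fields)


# 'tiles.gff' and 'norm1' are hypothetical; the other values come from the text.
line = "tiles.gff reppas2006 Normalization norm1 Normalized NimbleGen refseq U00096"
entry = parse_gff_config_line(line)
print(entry.experiment_name, entry.adf_genome_sequence)
```

A file describing several GFF files would simply contain one such line per GFF file.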
Usage Modes
The chIP-chip loader is flexible about the types of files that it accepts, so it can be used in a number of modes. Each combination of input files summarized below allows the loader to load your experimental data successfully.
Only MAGE-ML files, Array Description already loaded
Run with Experiment Description MAGE-ML file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.
Only MAGE-ML files, Array Description not loaded
Run with Experiment Description MAGE-ML file, Array Description MAGE-ML file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.
MAGE-TAB files, Array Description already loaded
Run with IDF, SDRF, and DADMF MAGE-TAB files, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.
MAGE-TAB files, Array Description not loaded
Run with ADF, IDF, SDRF, and DADMF MAGE-TAB files, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.
Data in GFF files
Run with GFF file(s), gff_config.txt file, IDF file, SDRF file, proteins.dat file, antibody-link.dat.csv file, and chipchip.properties file.
This mode will auto-generate an ADF file to be transformed into an Array Description MAGE-ML file, ready to be loaded into the BioWarehouse.
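The usage modes above amount to a table from mode to required inputs. The Python sketch below encodes that table and reports which inputs are still missing for a chosen mode. It is only an illustration: labels such as "ED MAGE-ML" or "IDF" name a kind of file rather than an actual file name, and the function is hypothetical, not part of the loader.

```python
# Inputs that every usage mode lists.
COMMON = {"proteins.dat", "antibody-link.dat.csv", "chipchip.properties"}

# Labels such as "ED MAGE-ML" name a kind of file, not an actual file name.
USAGE_MODES = {
    "Only MAGE-ML files, Array Description already loaded":
        COMMON | {"ED MAGE-ML"},
    "Only MAGE-ML files, Array Description not loaded":
        COMMON | {"ED MAGE-ML", "AD MAGE-ML"},
    "MAGE-TAB files, Array Description already loaded":
        COMMON | {"IDF", "SDRF", "DADMF"},
    "MAGE-TAB files, Array Description not loaded":
        COMMON | {"ADF", "IDF", "SDRF", "DADMF"},
    "Data in GFF files":
        COMMON | {"GFF", "gff_config.txt", "IDF", "SDRF"},
}


def missing_inputs(mode, available):
    """Return the inputs that the given mode lists but `available` lacks."""
    return sorted(USAGE_MODES[mode] - set(available))


have = {"GFF", "IDF", "SDRF", "proteins.dat", "chipchip.properties"}
print(missing_inputs("Data in GFF files", have))
```

In this example the GFF mode still lacks the antibody linking file and gff_config.txt.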
Building The Loader
Before building the loader, make sure the environment is configured according to the
Environment Setup. Also make sure the schema is loaded into the database
as specified in the Schema document.
To build the loader, bring up a shell and navigate to the chip-chip-loader directory. Then:
osprompt: ant clean
osprompt: ant build
For a list of all project targets, execute:
osprompt: ant -projecthelp
Running the Loader
Dependencies: The ChIP-Chip Loader does not require that any other data set is loaded. For the purpose of querying the chip-chip loader dataset, it is recommended that the corresponding genome is loaded into the BioWarehouse, using the BioCyc, CMR, or GenBank loaders.
The ChIP-Chip Loader is run from the chip-chip-loader/dist directory.
Two scripts are available for running the ChIP-Chip Loader. The first script (automateChIPchipLoader.sh) automatically converts the input files to the correct format and then executes the loader. The second script (runChipChipLoader.sh) runs the loader directly, provided all the input files are already in their correct formats. Note that the first script calls the second script after doing the data conversion.
Using automateChIPchipLoader.sh
Pre-requisites:
- Have Perl installed.
- Have the Tab2MAGE Perl scripts from ArrayExpress installed, with magetab.pl and arraymage.pl in your executable path. The website for this software is: http://tab2mage.sourceforge.net/
- The particular package to download from their download page is called 'Tab2MAGE'.
- Place all of the input files into a directory by themselves.
- Place database connection parameters into a file called "chipchip.properties" as name-value pairs (as described by the usage section for runChipChipLoader.sh below), including at least the following:
name=mydatabase name or SID
host=localhost
port=1234
username=myname
password=mypassword
dbms=mysql
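Before running the script, it can help to check that chipchip.properties contains all of the connection parameters listed above. The following Python sketch parses name=value lines and reports missing names. It assumes one pair per line with space-separated multiple values, as described for this file; whether the loader itself accepts comments or blank lines is not stated, so the sketch only skips blank lines. The function names are hypothetical.

```python
REQUIRED_KEYS = ("name", "host", "port", "username", "password", "dbms")


def parse_properties(text):
    """Parse name=value lines; several space-separated values become a list."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        name, separator, value = line.partition("=")
        if not separator or not name:
            raise ValueError(f"not a name=value pair: {raw_line!r}")
        values = value.split()
        properties[name] = values[0] if len(values) == 1 else values
    return properties


def missing_keys(properties):
    """List the required connection parameters that are absent."""
    return [key for key in REQUIRED_KEYS if key not in properties]


# Sample content with the password line deliberately left out.
sample = "name=mydatabase\nhost=localhost\nport=1234\nusername=myname\ndbms=mysql\n"
props = parse_properties(sample)
print(missing_keys(props))
```

For the sample above, the only missing parameter is the password.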
Usage:
./automateChIPchipLoader.sh name ADaccession EDaccession directory
name
- Experiment Name: the same one used in the second column of the gff_config.txt file.
ADaccession
- Array Description MAGE-ML Accession number. For example, 'BWH_123'. This needs to match the prefix of the Array Description MAGE-ML file or ADF file, if you are not allowing the script to auto-generate these files. For example, 'BWH_123.AD.mage.xml', or 'BWH_123.adf.csv'.
EDaccession
- Experiment Description MAGE-ML Accession number. For example, 'BWH_123'. This needs to match the prefix of the Experiment Description MAGE-ML file, if you are not allowing the script to auto-generate this file.
directory
- Directory which stores all of the input files. It is recommended to create a new directory and place all of the input files there. Then, give the full path to this directory as an argument to the script.
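The four arguments described above can be assembled programmatically. This Python sketch builds the argument list in the order of the usage line and derives the Array Description file names implied by the accession prefix, following the examples 'BWH_123.AD.mage.xml' and 'BWH_123.adf.csv'. Both helper functions and the directory '/data/reppas2006' are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex


def automate_command(name, ad_accession, ed_accession, directory):
    """Return the argument list in the order shown in the usage line."""
    return ["./automateChIPchipLoader.sh", name, ad_accession, ed_accession, directory]


def array_description_candidates(ad_accession):
    """File names the AD accession prefix implies, following the examples in the text."""
    return [f"{ad_accession}.AD.mage.xml", f"{ad_accession}.adf.csv"]


# '/data/reppas2006' is a hypothetical input directory.
argv = automate_command("reppas2006", "BWH_123", "BWH_123", "/data/reppas2006")
print(shlex.join(argv))
print(array_description_candidates("BWH_123"))
```

Since the loader is run from the chip-chip-loader/dist directory, a caller that wants to execute the command could pass the same list to subprocess.run with that directory as the working directory.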
Troubleshooting
- Fixing MAGE-TAB formatting mistakes
Using runChipChipLoader.sh
usage: runChipChipLoader.sh
 -a,--protein-dir <dir>           Directory where 'proteins.dat' file is located
 -b,--linking-file <file>         Antibody linking file
 -v2,--biocyc-version <version>   BioCyc data version
 -d,--dbms <dbms>                 DBMS type (mysql or oracle)
 -f,--file <file>                 Name of input data file(s)
 -h,--help                        Print usage instructions
 -n,--name <name>                 Name or SID of database
 -p,--properties <file>           Name of properties file
 -r,--release <release date>      Release date of the input dataset
 -s,--host <host>                 Name or IP address of database server host
 -t,--port <port>                 Port database server is listening at
 -u,--username <username>         Username for connection to the database
 -v,--version <version number>    Version number of the input dataset
 -w,--password <password>         Password for connection to the database
Properties may be set on the command line or in the properties file. Values on the command line take precedence over those in a properties file. Properties in a properties file are specified as name-value pairs. For example: port=1234
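The precedence rule stated above (command line over properties file) can be pictured as a dictionary merge. The Python sketch below is an illustration of that rule only, not the loader's actual implementation; the function name is hypothetical, and options that were not given on the command line are represented as None.

```python
def effective_settings(file_properties, command_line):
    """Merge settings so that command-line values override file values."""
    merged = dict(file_properties)
    # Options not given on the command line are represented here as None.
    merged.update({key: value for key, value in command_line.items()
                   if value is not None})
    return merged


from_file = {"host": "localhost", "port": "1234", "dbms": "mysql"}
from_cli = {"port": "3306", "host": None}
settings = effective_settings(from_file, from_cli)
print(settings)
```

Here the port comes from the command line, while the host and DBMS type fall back to the properties file.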
Relational Mapping of the Input Data to the BioWarehouse Schema
- All of the MAGE-ML data input is mapped in the same way as is described in the MAGE Loader documentation.
- All of the protein data stored in the proteins.dat file is mapped in the same way as is described in the BioCyc Loader documentation.
- From the antibody-link.dat.csv input file, CrossReference table entries are set up linking the elements from the MAGE-ML files to the Protein table entries that represent the antibody. For each row in the antibody linking file, a CrossReference row links the corresponding FactorValue row to the corresponding Protein table row. The FactorValue row WID is the CrossReference row's OtherWID, and the Protein row WID is the CrossReference row's CrossWID.
- For the relationship between the antibodies and the proteins that they precipitate, Interaction table entries are set up for every unique antibody-precipitated protein pair. The type column is "immunoprecipitation". There is an entry in the InteractionParticipant table for the antibody, and an entry for the precipitated protein as well, linking it to the Interaction table entry.
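The mapping of the antibody linking file described above can be sketched with plain Python dictionaries standing in for table rows. This is an illustration under stated assumptions, not the loader's code: only OtherWID, CrossWID, and the "immunoprecipitation" type come from the text, while the other dictionary keys, the placeholder interaction keys, and the choice to emit each CrossReference pair once are assumptions of this sketch.

```python
def map_linking_rows(rows, factor_value_wid, protein_wid):
    """Build CrossReference, Interaction and InteractionParticipant rows.

    rows: (element_id, antibody_id, protein_id) tuples from the linking file.
    factor_value_wid / protein_wid: lookups from identifier to WID.
    """
    cross_references, interactions, participants = [], [], []
    seen_links, seen_pairs = set(), {}
    for element_id, antibody_id, precipitated_id in rows:
        link = (element_id, antibody_id)
        if link not in seen_links:  # sketch choice: one row per distinct link
            seen_links.add(link)
            cross_references.append({
                "OtherWID": factor_value_wid[element_id],
                "CrossWID": protein_wid[antibody_id],
            })
        pair = (antibody_id, precipitated_id)
        if pair in seen_pairs:
            continue  # one Interaction per unique antibody/protein pair
        interaction_key = len(seen_pairs) + 1  # placeholder, not a real WID
        seen_pairs[pair] = interaction_key
        interactions.append({"key": interaction_key, "Type": "immunoprecipitation"})
        for member in pair:  # one participant for the antibody, one for the protein
            participants.append({"interaction": interaction_key,
                                 "protein": protein_wid[member]})
    return cross_references, interactions, participants


# Hypothetical identifiers and WIDs, for illustration only.
rows = [("FV-1", "AB-1", "P-1"), ("FV-1", "AB-1", "P-2"), ("FV-2", "AB-1", "P-1")]
xrefs, inters, parts = map_linking_rows(
    rows, {"FV-1": 10, "FV-2": 11}, {"AB-1": 20, "P-1": 21, "P-2": 22})
```

With these three linking rows, the sketch yields two CrossReference rows, two Interaction rows, and four InteractionParticipant rows.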
References:
[1] MAGE-TAB Specification, Version 1.1: http://www.mged.org/mage-tab/spec1.1.html
[2] Sanger GFF File Format Specification: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
[3] ArrayExpress Quantitation Types: http://tab2mage.sourceforge.net/docs/ArrayExpress/Datafile/QT_list.html
[4] ArrayExpress Array Definition File specification: http://www.ebi.ac.uk/miamexpress/help/adf/index.html