Pathway Tools Data-File Formats
Each Pathway/Genome Database (PGDB) within the BioCyc
Database Collection has been exported into a set of data files to facilitate
use of these data by other programs and database management systems. The
file formats are described below.
The data files themselves can be obtained in several ways:
- Data files for BioCyc databases can be downloaded; see the download instructions.
- You can generate data files from any PGDB using the desktop
version of Pathway Tools by invoking
the Pathway/Genome Navigator command File → Export → Entire DB to Flat Files.
- For PGDBs created by other Pathway Tools users, contact the PGDB authors or
obtain the PGDB from the PGDB registry
and then use the preceding Export command to create the data files.
Alternative Data Access Approaches
You may find it easier to obtain PGDB data via the following approaches:
Data-File Formats
The data files generated from each PGDB fall into the following categories
of formats. Note that in most cases the tabular files provide a subset
of the information found in the attribute-value format files, but are typically
easier to use.
The following terms are synonymous in
the context of data files: field, column, attribute, and slot. Most data
file slots correspond one-to-one with a PGDB slot; their semantics are
explained in "Guide to the Pathway Tools Schema", which appears as a file entitled "Pathway-Tools-Schema.pdf" in the data file download directory.
The following data file slots are not described in the Pathway Tools Schema:
- UNIQUE-ID : a series of characters that uniquely identifies an object within
a PGDB
- TYPES : the parent class(es) of an object
- SUBUNIT-COMPOSITION : a description of
the genes that encode the subunits of a multimer in a comma-delimited form.
For example, genes A, B, C with subunit coefficients m, n, o would be expressed
as:
- REACTION-EQUATION : generated from the values in the LEFT and RIGHT slots
of a reaction frame
Note that the database types for the values of the DBLINKS slots are explained at the following link:
/ptools/DBLINKS.html
Format |
Description |
Attribute-Value |
Each attribute-value file contains data for one class of objects, such
as genes or proteins. A file is divided into entries, where one entry
describes one database object.
An entry consists of a set of attribute-value pairs, which describe
properties of the object and relationships of the object to other objects.
Each attribute-value pair typically resides on a single line of the file,
although in some cases for values that are long strings, the value will
reside on multiple lines. An attribute-value pair consists of an
attribute name, followed by the string " - " and a value, for example:
LEFT - NADP
A value that requires more than one line is continued by a newline followed
by a /. Thus, literal slashes at the beginning of a line must be escaped
as //. A line that contains only // separates objects. Comment lines can
be anywhere in the file and must begin with the following symbol:
#
Starting in version 6.5 of Pathway Tools, attribute-value files can also
contain annotation-value pairs. Annotations are a mechanism for attaching
labeled values to specific attribute values. For example, we might
want to specify a coefficient for a reactant in a chemical reaction.
An annotation refers to the attribute value that immediately precedes
the annotation. An annotation-value pair consists of a caret symbol
"^" that points upward to indicate that the annotation annotates the preceding
attribute value, followed by the annotation label, followed by the string
" - ", followed by a value. The same attribute name or annotation label
with different values can appear any number of times in an object.
An example annotation-value pair that refers to the preceding attribute-value
pair is:
LEFT - NADP
^COEFFICIENT - 1
|
BioPAX |
The files contain Biological Pathways Exchange (BioPAX) Level 2 and Level 3 dumps of the database. |
FASTA |
Each object (either a polypeptide or a polynucleotide) in the file
begins with a line that begins with the following symbol:
>
On the same line as the > is a comment describing the object.
The remaining lines contain the object's amino-acid or nucleotide sequence. The sequence is typically
broken into multiple lines, each of which must have the same arbitrary
length, except the last line, which may be shorter. NCBI
describes the FASTA format in more detail. |
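A minimal Python sketch of reading such a file follows. The function name `read_fasta` is an assumption for this example, and the reader simply concatenates the sequence lines without checking that they have equal length.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # one line of the current sequence
    if header is not None:
        yield header, "".join(chunks)
```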
Mol File |
Mol files contain chemical structures for compounds in PGDBs. Mol files are provided
for MetaCyc only. |
Ocelot |
The file contains a Lisp format dump of the entire database. |
Other |
Enzyme Sequence Files
Three files unique to MetaCyc provide sequence information associated with many MetaCyc reactions
for use by the Pathway Hole Filler.
They are present within the MetaCyc flat-file distribution.
Each file is formatted as a Lisp list of lists, where a Lisp list is a
series of items enclosed in parentheses.
- protein-seq-ids-reduced-70.dat -- A list of lists, where each list
contains a MetaCyc reaction id, an EC number, and one or more protein
identifiers with a prefix of either UNIPROT for UniProt identifiers or
PID for ENTREZ-derived identifiers. In this file, sequences with more
than 70% BLAST similarity to previously processed sequences have been
removed as "redundant." The .seq file is supplied for the reduced set
only.
- protein-seq-ids-reduced-70.seq -- A list of lists, with each list consisting of a
protein id with database prefix, followed by the amino acid sequence
of the protein specified by the id. Corresponds to the preceding .dat file.
- protein-seq-ids-unreduced.dat -- Similar to protein-seq-ids-reduced-70.dat,
but no removal of similar sequences has been performed.
Atom Mapping Files
Two files within the MetaCyc flat-file distribution contain computed atom mappings for the
mass-balanced reactions of MetaCyc.
|
SBML |
The file contains an SBML dump of the reaction network within the database. |
Tabular |
Each tabular file contains data for one class of objects, such as reactions
or pathways. This type of file contains a single table of tab-delimited
columns and newline-delimited rows. The first row contains headers which
describe the data beneath them. Each of the remaining rows represents an
object, and each column is an attribute of the object. Column names that
would otherwise be the same contain a number x having values 1, 2, 3, etc.
to distinguish them. Comment lines can be anywhere in the file and must
begin with the following symbol:
#
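A tabular file can be read with a few lines of Python. The sketch below is illustrative only. The function name `read_col_file` is an assumption, and it relies on the statement above that column names are made unique by a number, so each row can be returned as a dictionary keyed by header.

```python
from typing import Dict, Iterable, Iterator


def read_col_file(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {column name: value} dictionary per object row."""
    header = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank line or comment line
        fields = line.split("\t")
        if header is None:
            header = fields  # first non-comment row holds the column names
            continue
        # Pad short rows so every column name is present in the result.
        fields += [""] * (len(header) - len(fields))
        yield dict(zip(header, fields))
```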
|
Data File List
The following table summarizes the files that are generated for each PGDB.
File Name |
Brief Description |
Format |
enzymes.col |
Enzymatic reactions and enzymes |
Tabular |
genes.col |
Genes |
Tabular |
pathways.col |
Pathways and the genes that encode enzymes in each pathway |
Tabular |
protcplxs.col |
Protein complexes and the genes that encode each subunit in a complex |
Tabular |
transporters.col |
Transporters, their subunit structures, and what they transport |
Tabular |
func-associations.col |
Various functional associations between genes (not available for most organisms) |
Tabular |
bindrxns.dat |
Binding reactions between proteins and DNA sites |
Attribute-Value |
classes.dat |
PGDB classes and their relationships |
Attribute-Value |
compounds.dat |
Chemical compounds |
Attribute-Value |
dnabindsites.dat |
DNA binding sites |
Attribute-Value |
enzrxns.dat |
Enzymatic reactions |
Attribute-Value |
genes.dat |
Genes |
Attribute-Value |
pathways.dat |
Pathways, including relationships among reactions |
Attribute-Value |
promoters.dat |
Promoters |
Attribute-Value |
protein-features.dat |
Protein features (for example, active sites) |
Attribute-Value |
proteins.dat |
Proteins |
Attribute-Value |
protligandcplxes.dat |
Complexes between proteins and small-molecule ligands |
Attribute-Value |
pubs.dat |
Publications |
Attribute-Value |
reactions.dat |
Chemical reactions |
Attribute-Value |
regulation.dat |
Regulatory interactions of all types |
Attribute-Value |
regulons.dat |
Transcription factors |
Attribute-Value |
species.dat |
List of all species (this file is present only in MetaCyc) |
Attribute-Value |
terminators.dat |
Terminators |
Attribute-Value |
transunits.dat |
Transcription units |
Attribute-Value |
protseq.fsa |
Protein sequences (renamed in 13.5; formerly protseq.fasta) |
FASTA |
dnaseq.fsa |
Nucleotide sequences for all genes (new in 13.5) |
FASTA |
[orgid].gaf |
GO Annotations |
GAF-2.2 |
[orgid]base.ocelot |
Entire database |
Ocelot |
biopax-level2.owl |
All pathways, reactions, etc. that can be represented in BioPAX level 2 format |
BioPAX |
biopax-level3.owl |
All pathways, reactions, etc. that can be represented in BioPAX level 3 format |
BioPAX |
MetaCyc-Molfiles.tgz |
MetaCyc compound mol files (molecular structures) |
This file is available as part of MetaCyc flat files |
Tabular Data Files
File Name |
Description |
enzymes.col |
For each enzymatic reaction in the PGDB, the file lists the reaction
equation, up to 4 pathways that contain the reaction, up to 4 cofactors
for the enzyme, up to 4 activators, up to 4 inhibitors, and the subunit
structure of the enzyme.
Columns (multiple columns are indicated in parentheses):
- UNIQUE-ID
- NAME
- REACTION-EQUATION
- PATHWAYS (4)
- COFACTORS (4)
- ACTIVATORS (4)
- INHIBITORS (4)
- SUBUNIT-COMPOSITION
|
genes.col |
For each gene in the PGDB, the file lists its names (including up to
4 synonyms), location, product, and up to 4 parent classes (types). Note:
Gene Type is a class in the gene ontology designed by Dr. M. Riley.
Columns (multiple columns are indicated in parentheses):
- UNIQUE-ID
- BLATTNER-ID (E. coli PGDB only; column omitted otherwise)
- NAME
- PRODUCT-NAME
- SWISS-PROT-ID
- REPLICON
- START-BASE
- END-BASE
- SYNONYMS (4)
- GENE TYPE (4)
|
pathways.col |
For each pathway in the PGDB, the file lists the genes that encode
the enzymes in that pathway.
Columns (multiple columns are indicated in parentheses; n is the maximum
number of genes for all pathways in the PGDB):
- UNIQUE-ID
- NAME
- GENE-NAME (n)
- GENE-ID (n)
|
protcplxs.col |
For each protein complex in the PGDB, the file lists the genes that
encode the subunits of the complex.
Columns (multiple columns are indicated in parentheses; n is the maximum
number of genes for all protein complexes in the PGDB):
|
transporters.col |
For each transporter in the PGDB, the file lists the transport reaction
equation and the transporter's subunit composition.
Columns:
|
func-associations.col |
This file contains all functional associations among genes in
the EcoCyc Pathway/Genome Database.
This file includes three types of functional associations, each in its
own section. The sections are separated by two rows each starting with a '#'.
The functional associations are:
- Pathway functional associations, i.e., genes coding for enzymes in
the same metabolic pathway.
- Protein complex functional associations, i.e., genes whose products
are components in heteromultimeric protein complexes.
- Transcription factor/regulated gene pairs, i.e., pairs of
genes, A and B, where the product of gene A is a component of a transcription
factor that regulates gene B.
Pathway Functional Associations
Each entry is a tab-delimited list of the pathway's ID, the pathway name
and the list of genes in the pathway.
Columns:
- UNIQUE-ID
- NAME
- GENE-ID (n, BLATTNER-ID for E. coli)
Protein Complex Functional Associations
Each entry is a tab-delimited list of the complex's ID, the complex name
and the list of component genes.
Columns:
- UNIQUE-ID
- NAME
- GENE-ID (n, BLATTNER-ID for E. coli)
Transcription Factor/Regulated Gene Pairs
Each entry is a tab-delimited list of the transcription factor's ID, name,
the transcription factor gene, and the regulated gene.
Columns:
- UNIQUE-ID
- NAME
- GENE-ID (BLATTNER-ID for E. coli)
- GENE-ID (BLATTNER-ID for E. coli)
|
Attribute-Value Data Files
The meanings of the attributes are explained in the chapter "Guide to the
Pathway Tools Schema" in the Pathway Tools User's Guide, which is
available as part of the Pathway Tools software distribution.
FASTA Files
protseq.fsa |
This file lists the amino acid sequence of each protein monomer in the PGDB. (Renamed in 13.5; formerly protseq.fasta) |
dnaseq.fsa |
This file lists the DNA nucleotide sequence of each gene in the PGDB. (New in 13.5)
Includes RNAs. The extent of each sequence is the coding region, on its coding strand. |
Data File Extraction
When you receive your data file distribution for a PGDB called "Xcyc",
it will be stored as a compressed archive file. How to extract the files
depends on which operating system you're using:
- Unix: To extract the files into the same directory in which you
saved the distribution file:
- uncompress -c -v Xcyc-flatfiles.tar.Z | tar xfp -
- PC or Mac: Use WinZip, PKUnZip, Stuffit Expander, or another unzipping
utility to extract the files from Xcyc-flatfiles.zip
Determining the Difference Between Experimentally Verified and Computationally Predicted Data
Researchers who want to use the data in the data files as a gold
standard for evaluating computational predictions need to know that all
their data is in fact experimentally verified. In the past, this was
not a problem, as all the data in EcoCyc could be reasonably expected
to be experimentally verified, and the vast majority of data in other
PGDBs could be assumed to have been predicted. However, as time goes
by, more and more computationally predicted data is being added to
EcoCyc. Thus, it becomes crucial when parsing the data files to be
able to tell the difference.
Many classes of objects (e.g. enzymatic-reactions, pathways,
transcription units, promoters, etc.) can have associated evidence
codes, which are defined in the Pathway Tools Evidence Ontology. There are
four top-level codes. All experimental evidence codes start with
EV-EXP, and all computational evidence codes start with EV-COMP. The
other top-level codes are EV-AS (author statement) and EV-IC (inferred
by curator).
Because evidence codes are typically associated with citations, they
are stored in the CITATIONS slot. Values of the citations slot can
take the following possible formats:
- citation
- citation:ev-code
- citation:ev-code:other-data
- :ev-code:other-data
where citation is a citation to the literature, usually a
PubMed ID, ev-code is one of the codes in our evidence
ontology, and other-data is zero or more additional fields
that can generally be ignored (such as a timestamp, curator identifier, etc.).
The CITATIONS slot is included in most of the attribute-value files.
Evidence for enzyme or transporter function is found in enzrxns.dat.
Evidence for non-enzymatic function of a protein is found in
proteins.dat.
Examples of CITATIONS lines include:
- CITATIONS - 94304911
- CITATIONS - 17123542:EV-EXP-IDA-PURIFIED-PROTEIN:3377353002:keseler
- CITATIONS - :EV-EXP:3277835750:pkarp
- CITATIONS - :EV-EXP-IEP-COREGULATION
- CITATIONS - :EV-COMP-AINF:3371579821:kr
- CITATIONS - 92065800:EV-COMP-HINF-FN-FROM-SEQ:3279554727:mhance
- CITATIONS - 1849603:EV-AS-NAS:3342906733:martin
The first example is a citation without an evidence code. The next
three all describe experimental evidence. The fifth and sixth
examples show computational evidence, and the last example is an
author statement.
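The value formats listed above can be split apart with a short Python sketch. The names `Citation`, `parse_citation`, and `top_level_code` are assumptions for this example, and the code assumes that the citation itself never contains a colon.

```python
from typing import NamedTuple, Optional, Tuple

TOP_LEVEL_CODES = ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC")


class Citation(NamedTuple):
    citation: Optional[str]   # literature reference, usually a PubMed ID
    ev_code: Optional[str]    # evidence code, if present
    other: Tuple[str, ...]    # timestamp, curator identifier, and so on


def parse_citation(value: str) -> Citation:
    """Split the value of a CITATIONS slot (the text after 'CITATIONS - ')."""
    parts = value.strip().split(":")
    citation = parts[0] or None
    ev_code = parts[1] if len(parts) > 1 and parts[1] else None
    return Citation(citation, ev_code, tuple(parts[2:]))


def top_level_code(ev_code: Optional[str]) -> Optional[str]:
    """Return the top-level code that ev_code falls under, or None."""
    if ev_code is None:
        return None
    for top in TOP_LEVEL_CODES:
        if ev_code == top or ev_code.startswith(top + "-"):
            return top
    return None
```

For instance, `parse_citation(":EV-COMP-AINF:3371579821:kr")` yields no citation, the evidence code `EV-COMP-AINF`, and two additional fields that can usually be ignored.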
To strip out objects with only computational evidence, search for all
the top-level code prefixes in the CITATIONS line. Any given object
may have both experimental and computational evidence, so it is not
sufficient just to look for the EV-COMP tag -- you must also make sure
that the object lacks an EV-EXP tag. Some objects may have no
attached evidence codes. In EcoCyc, it is probably safe to assume
that these are experimentally determined, as the assignments probably
predated our use of evidence codes, and were added at a time when
EcoCyc contained only experimentally determined data. In other PGDBs
of course that assumption does not hold.
For example, to find all EcoCyc pathways with experimental evidence,
you would look in the pathways.dat file for all pathways that have at
least one CITATIONS line with an EV-EXP tag (you would also have to
decide whether or not to accept EV-AS and EV-IC tags). To find all
transcription units with experimental evidence, you would do the same
with the transunits.dat file. However, you should bear in mind that
not all experimental evidence is of equal value. In particular, many
transcription units are predicted based on high-throughput gene
expression analysis which, although experimental in nature, is not
generally considered high-quality. Before extracting your data,
consider carefully which evidence codes you wish to include or exclude.
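The filtering rule described above might be sketched in Python as follows. This is one possible approach rather than a prescribed method. It assumes each object has already been parsed into a dictionary that maps slot names to lists of string values, and the function names and the `accepted` and `accept_uncoded` parameters are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

TOP_LEVEL = ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC")


def top_level_codes(citations: Iterable[str]) -> Set[str]:
    """Collect the top-level evidence codes found in CITATIONS values."""
    found = set()
    for value in citations:
        parts = value.split(":")
        code = parts[1] if len(parts) > 1 else ""
        for top in TOP_LEVEL:
            if code == top or code.startswith(top + "-"):
                found.add(top)
    return found


def has_accepted_evidence(citations: Iterable[str],
                          accepted: Tuple[str, ...] = ("EV-EXP",),
                          accept_uncoded: bool = False) -> bool:
    """True if at least one accepted top-level code is present.

    accept_uncoded controls objects with no evidence codes at all; setting it
    to True mirrors the EcoCyc-specific assumption discussed in the text.
    """
    found = top_level_codes(citations)
    if not found:
        return accept_uncoded
    return any(code in found for code in accepted)


def filter_entries(entries: Iterable[Dict[str, List[str]]],
                   **options) -> Iterator[Dict[str, List[str]]]:
    """Yield only the entries whose CITATIONS values pass the check."""
    for entry in entries:
        if has_accepted_evidence(entry.get("CITATIONS", []), **options):
            yield entry
```

Passing `accepted=("EV-EXP", "EV-AS", "EV-IC")` corresponds to the decision to also accept author statements and curator inferences.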
To find all EcoCyc genes with experimentally determined functions, the
situation is more complicated still, as the evidence codes are not
assigned to the genes themselves. To determine whether a particular
gene has an experimentally determined function, you will need to look
at its product (identified by the PRODUCT attribute) or a complex that
includes the product (identified by the COMPONENT-OF attribute on the
polypeptide). An evidence code may be attached directly to the
protein in the proteins.dat file. However, if the gene codes for an
enzyme, the evidence code will instead be found attached to the
enzymatic-reaction entry or entries (identified by the CATALYZES
attribute on the protein) in the enzrxns.dat file. If the gene encodes a
transcription factor, the evidence code will be found in the
corresponding entries (identified by the REGULATES attribute) in the
regulation.dat file. Thus, you may need to follow several links and
use several files to find the full list of evidence codes for any
given gene. To extract all genes with experimental evidence, you
might find it easiest to work backwards -- extract all objects from
enzrxns.dat, regulation.dat and proteins.dat with experimental
evidence codes, and follow them backwards (via the ENZYME, REGULATOR,
COMPONENTS and GENE attributes) to get the list of relevant genes.
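The backward traversal can be sketched in Python as shown below. This is a rough illustration under several assumptions: each file has already been parsed into a dictionary from UNIQUE-ID to a dictionary of slot name to list of string values, slot values are plain object identifiers, and the function names are invented for this example. Only EV-EXP codes are treated as experimental evidence here.

```python
from typing import Dict, Iterable, List, Set

Entry = Dict[str, List[str]]   # slot name -> list of values
Table = Dict[str, Entry]       # UNIQUE-ID -> entry


def has_exp_code(entry: Entry) -> bool:
    """True if any CITATIONS value carries an EV-EXP evidence code."""
    for value in entry.get("CITATIONS", []):
        parts = value.split(":")
        if len(parts) > 1 and (parts[1] == "EV-EXP" or parts[1].startswith("EV-EXP-")):
            return True
    return False


def genes_of(protein_ids: Iterable[str], proteins: Table) -> Set[str]:
    """Collect GENE values for the proteins, descending through COMPONENTS."""
    genes: Set[str] = set()
    seen: Set[str] = set()
    stack = list(protein_ids)
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue
        seen.add(pid)
        entry = proteins.get(pid)
        if entry is None:
            continue  # not a protein in this table (for example a small molecule)
        genes.update(entry.get("GENE", []))
        stack.extend(entry.get("COMPONENTS", []))
    return genes


def genes_with_experimental_evidence(enzrxns: Table, regulation: Table,
                                     proteins: Table) -> Set[str]:
    """Work backwards from evidence-bearing objects to the encoding genes."""
    start: Set[str] = set()
    for entry in enzrxns.values():
        if has_exp_code(entry):
            start.update(entry.get("ENZYME", []))
    for entry in regulation.values():
        if has_exp_code(entry):
            start.update(entry.get("REGULATOR", []))
    for pid, entry in proteins.items():
        if has_exp_code(entry):
            start.add(pid)
    return genes_of(start, proteins)
```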
|