Pathway Tools Data-File Formats

Each Pathway/Genome Database (PGDB) within the BioCyc Database Collection has been exported into a set of data files to facilitate use of these data by other programs and database management systems. The file formats are described below.

The data files themselves can be obtained in several ways:

Data files for BioCyc databases can be downloaded (see the download instructions).
You can generate data files from any PGDB using the desktop version of Pathway Tools by invoking the Pathway/Genome Navigator command File → Export → Entire DB to Flat Files.
For PGDBs created by other Pathway Tools users, contact the PGDB authors or obtain the PGDB from the PGDB registry, and then use the preceding Export command to create the data files.

Alternative Data Access Approaches

You may find it easier to obtain PGDB data via the following approaches:

Using BioCyc Web Services
Using Pathway Tools APIs in Python, Java, Lisp, and Perl
Using the BioCyc.org Tools → Search → Advanced Search and Tools → SmartTables

Data-File Formats

The data files generated from each PGDB fall into the following categories of formats. Note that in most cases the tabular files provide a subset of the information found in the attribute-value format files, but are typically easier to use.

The following terms are synonymous in the context of data files: field, column, attribute, and slot. Most data file slots correspond one-to-one with a PGDB slot; their semantics are explained in "Guide to the Pathway Tools Schema", which appears as a file entitled "Pathway-Tools-Schema.pdf" in the data file download directory. The following data file slots are not described in the Pathway Tools Schema:

UNIQUE-ID : a series of characters that uniquely identifies an object within a PGDB
TYPES : the parent class(es) of an object
SUBUNIT-COMPOSITION : a description of the genes that encode the subunits of a multimer in a comma-delimited form. For example, genes A, B, C with subunit coefficients m, n, o would be expressed as:

A*m,B*n,C*o

REACTION-EQUATION : generated from the values in the LEFT and RIGHT slots of a reaction frame
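
As a rough illustration, a SUBUNIT-COMPOSITION value in the comma-delimited form described above could be split into gene/coefficient pairs as follows. This is a minimal sketch, not part of Pathway Tools: the function name parse_subunit_composition is hypothetical, and it assumes that gene names contain no commas and that each coefficient is an integer following the last asterisk.

```python
def parse_subunit_composition(value):
    """Split a value like 'A*2,B*1' into [('A', 2), ('B', 1)].

    Assumptions (not stated in the format description): gene names contain
    no commas, and the coefficient is the integer after the last '*'.
    A term without '*' is given a coefficient of 1.
    """
    pairs = []
    for term in value.split(","):
        term = term.strip()
        if not term:
            continue
        gene, sep, coeff = term.rpartition("*")
        if not sep:
            # No '*' found: rpartition leaves the whole term in `coeff`.
            gene, coeff = coeff, "1"
        pairs.append((gene, int(coeff)))
    return pairs
```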

Note that the database types for the values of the DBLINKS slots are explained at the following link:
/ptools/DBLINKS.html

Format Description

Attribute-Value Each attribute-value file contains data for one class of objects, such as genes or proteins. A file is divided into entries, where one entry describes one database object.
An entry consists of a set of attribute-value pairs, which describe properties of the object and relationships of the object to other objects. Each attribute-value pair typically resides on a single line of the file, although long string values may span multiple lines. An attribute-value pair consists of an attribute name, followed by the string " - " and a value, for example:
LEFT - NADP
A value that requires more than one line is continued by a newline followed by a /. Thus, literal slashes at the beginning of a line must be escaped as //. A line that contains only // separates objects. Comment lines can be anywhere in the file and must begin with the following symbol:
#
Starting in version 6.5 of Pathway Tools, attribute-value files can also contain annotation-value pairs. Annotations are a mechanism for attaching labeled values to specific attribute values. For example, we might want to specify a coefficient for a reactant in a chemical reaction. An annotation refers to the attribute value that immediately precedes it. An annotation-value pair consists of a caret symbol "^" (which points upward to indicate that the annotation applies to the preceding attribute value), followed by the annotation label, followed by the string " - ", followed by a value. The same attribute name or annotation label can appear any number of times in an object with different values. An example annotation-value pair that refers to the preceding attribute-value pair is:
LEFT - NADP
^COEFFICIENT - 1
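
The rules above (comment lines, the // entry separator, / continuation lines, and ^ annotations) can be sketched as a small reader. This is an illustrative sketch only, not the parser used by Pathway Tools; the function name parse_attribute_value and the returned data layout are assumptions, and continuation lines that follow an annotation are simply appended to the preceding attribute value.

```python
def parse_attribute_value(lines):
    """Yield one dict per entry from an iterable of text lines.

    Each entry maps an attribute name to a list of [value, annotations]
    pairs, where annotations maps an annotation label to a list of values.
    """
    entry = {}
    last = None                      # most recent [value, annotations] pair
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):     # comment line
            continue
        if line == "//":             # end of the current entry
            if entry:
                yield entry
            entry, last = {}, None
        elif line.startswith("/"):   # continuation; '//x' yields literal '/x'
            if last is not None:
                last[0] += "\n" + line[1:]
        elif line.startswith("^"):   # annotation on the preceding value
            label, sep, value = line[1:].partition(" - ")
            if sep and last is not None:
                last[1].setdefault(label, []).append(value)
        else:
            name, sep, value = line.partition(" - ")
            if sep:
                last = [value, {}]
                entry.setdefault(name, []).append(last)
    if entry:                        # tolerate a missing final '//'
        yield entry
```

The function accepts any iterable of lines, so an open file object can be passed directly. The file encoding is not specified here, so an explicit encoding argument to open may be needed.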

BioPAX The files contain Biological Pathways Exchange (BioPAX) Level 2 and Level 3 dumps of the database.

FASTA Each object (either a polypeptide or a polynucleotide) in the file begins with a line that begins with the following symbol:
>
On the same line as the > is a comment describing the object. The remaining lines contain the object's amino-acid or nucleotide sequence. The sequence is typically broken into multiple lines, each of which must have the same arbitrary length, except the last line, which may be shorter. NCBI describes the FASTA format in more detail.
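
A FASTA file laid out as described above can be read with a few lines of code. The following is a minimal sketch with a hypothetical function name, read_fasta; it joins the sequence lines of each object and does not validate line lengths or residue symbols.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new object starts; emit the one collected so far.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```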

Mol File Mol files contain chemical structures for compounds in PGDBs. Mol files are provided for MetaCyc only.

Ocelot The file contains a Lisp format dump of the entire database.

Other Enzyme Sequence Files
Three files unique to MetaCyc provide sequence information associated with many MetaCyc reactions for use by the Pathway Hole Filler. They are present within the MetaCyc flat-file distribution. Each file is formatted as a Lisp list of lists, where a Lisp list is a series of items enclosed in parentheses.

protein-seq-ids-reduced-70.dat -- A list of lists, where each list contains a MetaCyc reaction id, an EC number, and one or more protein identifiers with a prefix of either UNIPROT for UniProt identifiers or PID for ENTREZ-derived identifiers. In this file, sequences with more than 70% BLAST similarity to previously processed sequences have been removed as "redundant." The .seq file is supplied for the reduced set only.

protein-seq-ids-reduced-70.seq -- A list of lists, with each list consisting of a protein id with database prefix, followed by the amino acid sequence of the protein specified by the id. Corresponds to the preceding .dat file.

protein-seq-ids-unreduced.dat -- Similar to protein-seq-ids-reduced-70.dat, but no removal of similar sequences has been performed.

Atom Mapping Files
Two files within the MetaCyc flat-file distribution contain computed atom mappings for the mass-balanced reactions of MetaCyc.

atom-mappings.dat -- Each atom mapping in the file follows the encoding given in the section "Reaction Atom Mappings" of the PGDB Concepts Guide.

atom-mappings-smiles.dat -- This file encodes atom mappings using the SMILES representation.

SBML The file contains an SBML dump of the reaction network within the database.

Tabular Each tabular file contains data for one class of objects, such as reactions or pathways. This type of file contains a single table of tab-delimited columns and newline-delimited rows. The first row contains headers which describe the data beneath them. Each of the remaining rows represents an object, and each column is an attribute of the object. Column names that would otherwise be the same contain a number x having values 1, 2, 3, etc. to distinguish them. Comment lines can be anywhere in the file and must begin with the following symbol:
#
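
A tabular file as described above can be loaded with the standard csv module. The sketch below is illustrative only: the function names read_col_file and values_for are hypothetical, and because the exact way numbers are embedded in repeated column names is not specified here, repeated columns are collected by a caller-supplied name prefix.

```python
import csv

def read_col_file(lines):
    """Return (headers, rows) from tab-delimited lines, skipping '#' comments."""
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    headers = next(reader, [])       # first non-comment row holds the headers
    rows = [dict(zip(headers, row)) for row in reader]
    return headers, rows

def values_for(row, headers, prefix):
    """Collect the non-empty values of all columns whose name starts with prefix."""
    return [row[h] for h in headers if h.startswith(prefix) and row.get(h)]
```

Quoting is disabled so that quotation marks inside names are read literally rather than interpreted as field delimiters.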

Data File List

The following table summarizes the files that are generated for each PGDB.

File Name Brief Description Format

enzymes.col Enzymatic reactions and enzymes Tabular

genes.col Genes Tabular

pathways.col Pathways and the genes that encode enzymes in each pathway Tabular

protcplxs.col Protein complexes and the genes that encode each subunit in a complex Tabular

transporters.col Transporters, their subunit structures, and what they transport Tabular

func-associations.col Various functional associations between genes (not available for most organisms) Tabular

bindrxns.dat Binding reactions between proteins and DNA sites Attribute-Value

classes.dat PGDB classes and their relationships Attribute-Value

compounds.dat Chemical compounds Attribute-Value

dnabindsites.dat DNA binding sites Attribute-Value

enzrxns.dat Enzymatic reactions Attribute-Value

genes.dat Genes Attribute-Value

pathways.dat Pathways, including relationships among reactions Attribute-Value

promoters.dat Promoters Attribute-Value

protein-features.dat Protein features (for example, active sites) Attribute-Value

proteins.dat Proteins Attribute-Value

protligandcplxes.dat Complexes between proteins and small-molecule ligands Attribute-Value

pubs.dat Publications Attribute-Value

reactions.dat Chemical reactions Attribute-Value

regulation.dat Regulatory interactions of all types Attribute-Value

regulons.dat Transcription factors Attribute-Value

species.dat List of all Species (this file is in only the MetaCyc DB) Attribute-Value

terminators.dat Terminators Attribute-Value

transunits.dat Transcription units Attribute-Value

protseq.fsa Protein sequences (renamed from protseq.fasta in version 13.5) FASTA

dnaseq.fsa Nucleotide sequences for all genes (new in 13.5) FASTA

[orgid].gaf GO Annotations GAF-2.2

[orgid]base.ocelot Entire database Ocelot

biopax-level2.owl All pathways, reactions, etc. that can be represented in BioPAX level 2 format BioPAX

biopax-level3.owl All pathways, reactions, etc. that can be represented in BioPAX level 3 format BioPAX

MetaCyc-Molfiles.tgz MetaCyc compound mol files (molecular structures); available as part of the MetaCyc flat files Mol File

Tabular Data Files

File Name Description

enzymes.col For each enzymatic reaction in the PGDB, the file lists the reaction equation, up to 4 pathways that contain the reaction, up to 4 cofactors for the enzyme, up to 4 activators, up to 4 inhibitors, and the subunit structure of the enzyme.
Columns (multiple columns are indicated in parentheses):

UNIQUE-ID

NAME

REACTION-EQUATION

PATHWAYS (4)

COFACTORS (4)

ACTIVATORS (4)

INHIBITORS (4)

SUBUNIT-COMPOSITION

genes.col For each gene in the PGDB, the file lists its names (including up to 4 synonyms), location, product, and up to 4 parent classes (types). Note: Gene Type is a class in the gene ontology designed by Dr. M. Riley.
Columns (multiple columns are indicated in parentheses):

UNIQUE-ID

BLATTNER-ID (E. coli PGDB only; column omitted otherwise)

NAME

PRODUCT-NAME

SWISS-PROT-ID

REPLICON

START-BASE

END-BASE

SYNONYMS (4)

GENE TYPE (4)

pathways.col For each pathway in the PGDB, the file lists the genes that encode the enzymes in that pathway.
Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all pathways in the PGDB):

UNIQUE-ID

NAME

GENE-NAME (n)

GENE-ID (n)

protcplxs.col For each protein complex in the PGDB, the file lists the genes that encode the subunits of the complex.
Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all protein complexes in the PGDB):

UNIQUE-ID

NAME

GENE-NAME (n)

GENE-ID (n)

SUBUNIT-COMPOSITION

transporters.col For each transporter in the PGDB, the file lists the transport reaction equation and the transporter's subunit composition.
Columns:

UNIQUE-ID

NAME

REACTION-EQUATION

SUBUNIT-COMPOSITION

func-associations.col This file contains all functional associations among genes in the EcoCyc Pathway/Genome Database.

There are three types of functional associations included in this file. The sections are separated by two rows each starting with a '#'. These functional associations include:

Pathway functional associations, i.e., genes coding for enzymes in the same metabolic pathway,

Protein complex functional associations, i.e., genes whose products are components in heteromultimeric protein complexes,

Transcription factor/regulated gene pairs, i.e., pairs of genes, A and B, where the product of gene A is a component of a transcription factor that regulates gene B.

Pathway Functional Associations
Each entry is a tab-delimited list of the pathway's ID, the pathway name and the list of genes in the pathway.
Columns:

UNIQUE-ID

NAME

GENE-ID (n, BLATTNER-ID for E. coli)

Protein Complex Functional Associations
Each entry is a tab-delimited list of the complex's ID, the complex name and the list of component genes.
Columns:

UNIQUE-ID

NAME

GENE-ID (n, BLATTNER-ID for E. coli)

Transcription Factor/Regulated Gene Pairs
Each entry is a tab-delimited list of the transcription factor's ID, name, the transcription factor gene, and the regulated gene.
Columns:

UNIQUE-ID

NAME

GENE-ID (BLATTNER-ID for E. coli)

GENE-ID (BLATTNER-ID for E. coli)
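
Because func-associations.col holds three sections in one file, a reader has to split it before interpreting the columns. The sketch below is one possible approach under a stated assumption: any run of two or more consecutive rows starting with '#' is treated as a section separator, while a single '#' row is treated as an ordinary comment. The function name split_sections is hypothetical.

```python
def split_sections(lines):
    """Split tab-delimited lines into sections at runs of 2+ '#' rows."""
    sections, current, hashes = [], [], 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            hashes += 1              # count consecutive '#' rows
            continue
        if hashes >= 2 and current:  # a separator was just passed
            sections.append(current)
            current = []
        hashes = 0
        if line:
            current.append(line.split("\t"))
    if current:
        sections.append(current)
    return sections
```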

Attribute-Value Data Files

The meanings of the attributes are explained in the chapter "Guide to the Pathway Tools Schema" in the Pathway Tools User's Guide, which is available as part of the Pathway Tools software distribution.

classes.dat For each class, the file lists its names and its parent classes (types). This file covers every class in the Pathway Tools ontology.
Attributes:

UNIQUE-ID

TYPES

COMMENT

COMMON-NAME

SYNONYMS

bindrxns.dat This file lists binding reactions between proteins and DNA binding sites such as promoters.
Attributes:

UNIQUE-ID

TYPES

ACTIVATORS

INHIBITORS

OFFICIAL-EC

REACTANTS

compounds.dat This file lists all chemical compounds in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

ABBREV-NAME

ATOM-CHARGES

CHEMICAL-FORMULA

CITATIONS

COFACTORS-OF

COFACTORS-OR-PROSTHETIC-GROUPS-OF

COMMENT

COMPONENT-OF

CREDITS

DATA-SOURCE

DBLINKS

INCHI

MOLECULAR-WEIGHT

N+1-NAME

N-1-NAME

N-NAME

PKA1

PKA2

PKA3

PROSTHETIC-GROUPS-OF

REGULATES

SMILES

SUPERATOMS

SYNONYMS

SYSTEMATIC-NAME

dnabindsites.dat This file lists all DNA binding sites in the PGDB.
Attributes:

UNIQUE-ID

TYPES

ABS-CENTER-POS

CITATIONS

COMMENT

COMPONENT-OF

DBLINKS

INVOLVED-IN-REGULATION

REGULATED-PROMOTER

RELATIVE-CENTER-DISTANCE

SYNONYMS

TYPE-OF-EVIDENCE

enzrxns.dat This file lists all enzymatic reactions in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

ALTERNATIVE-COFACTORS

ALTERNATIVE-SUBSTRATES

CITATIONS

COFACTOR-BINDING-COMMENT

COFACTORS

COFACTORS-OR-PROSTHETIC-GROUPS

COMMENT

ENZYME

KM

PH-OPT

PROSTHETIC-GROUPS

REACTION

REACTION-DIRECTION

REGULATED-BY

REQUIRED-PROTEIN-COMPLEX

SYNONYMS

TEMPERATURE-OPT

genes.dat This file lists all genes in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

CENTISOME-POSITION

CITATIONS

COMMENT

COMPONENT-OF

COMPONENTS

DBLINKS

IN-PARALOGOUS-GENE-GROUP

INTERRUPTED?

LAST-UPDATE

LEFT-END-POSITION

PRODUCT

PRODUCT-STRING

RIGHT-END-POSITION

SYNONYMS

TRANSCRIPTION-DIRECTION

pathways.dat This file lists all pathways in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

CITATIONS

CLASS-INSTANCE-LINKS

COMMENT

CREDITS

DBLINKS

ENZYME-USE

HYPOTHETICAL-REACTIONS

IN-PATHWAY

NET-REACTION-EQUATION

PATHWAY-INTERACTIONS

PATHWAY-LINKS

POLYMERIZATION-LINKS

PREDECESSORS

PRIMARIES

REACTION-LAYOUT

REACTION-LIST

SPECIES

SUB-PATHWAYS

SUPER-PATHWAYS

SYNONYMS

promoters.dat This file lists all promoters in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

ABSOLUTE-PLUS-1-POS

BINDS-SIGMA-FACTOR

CITATIONS

COMMENT

COMPONENT-OF

DBLINKS

PROMOTER-EVIDENCE

REGULATED-BY

SYNONYMS

protein-features.dat This file lists all the protein features (such as active sites) in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

COMMENT

ATTACHED-GROUP

CITATIONS

FEATURE-OF

HOMOLOGY-MOTIF

LEFT-END-POSITION

POSSIBLE-FEATURE-STATES

RESIDUE-NUMBER

RESIDUE-TYPE

RIGHT-END-POSITION

proteins.dat This file lists all proteins in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

CATALYZES

CHEMICAL-FORMULA

CITATIONS

COMMENT

COMPONENT-OF

CREDITS

DBLINKS

DNA-FOOTPRINT-SIZE

FEATURES

GENE

GO-TERMS

ISOZYME-SEQUENCE-SIMILARITY

LOCATIONS

MODIFIED-FORM

MOLECULAR-WEIGHT-EXP

MOLECULAR-WEIGHT-KD

MOLECULAR-WEIGHT-SEQ

PI

REGULATES

SPECIES

SPLICE-FORM-INTRONS

SYMMETRY

SYNONYMS

UNMODIFIED-FORM

protligandcplxes.dat This file lists all the complexes of proteins with small-molecule ligands in the PGDB.
Attributes:

UNIQUE-ID

TYPES

CATALYZES

COMMON-NAME

COMMENT

COMPONENTS

DBLINKS

DNA-FOOTPRINT-SIZE

MOLECULAR-WEIGHT-KD

MOLECULAR-WEIGHT-SEQ

REGULATES

SYMMETRY

SYNONYMS

pubs.dat This file lists all non-PubMed publications referenced in the PGDB.
Attributes:

UNIQUE-ID

TYPES

ABSTRACT

AUTHORS

COMMENT

MEDLINE-UID

PUBMED-ID

REFERENT-FRAME

SOURCE

TITLE

YEAR

reactions.dat This file lists all chemical reactions in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

CITATIONS

COMMENT

DELTAG0

EC-NUMBER

ENZYMATIC-REACTION

IN-PATHWAY

LEFT

OFFICIAL-EC?

ORPHAN?

RIGHT

SIGNAL

SPECIES

SPONTANEOUS?

SYNONYMS

regulation.dat This file lists all the regulatory relationships in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

ASSOCIATED-BINDING-SITE

COMMENT

MECHANISM

MODE

PHYSIOLOGICALLY-RELEVANT?

REGULATED-BY

REGULATED-ENTITY

REGULATOR

SYNONYMS

regulons.dat This file lists all transcription factors in the PGDB and the genes that they regulate by binding upstream of the transcription unit containing those genes.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

ACTIVATORS-ALLOSTERIC-OF

ACTIVATORS-NONALLOSTERIC-OF

ACTIVATORS-UNKMECH-OF

APPEARS-IN-BINDING-REACTIONS

AROMATIC-RINGS

ATOM-CHARGES

ATOM-CHIRALITY

CATALYZES

CHARGE

CHEMICAL-FORMULA

CITATIONS

COFACTORS-OF

COFACTORS-OR-PROSTHETIC-GROUPS-OF

COMMENT

COMPONENT-COEFFICIENTS

COMPONENT-OF

COMPONENTS

CREDITS

DATA-SOURCE

DBLINKS

DNA-FOOTPRINT-SIZE

FEATURES

FUNCTIONAL-ASSIGNMENT-COMMENT

FUNCTIONAL-ASSIGNMENT-STATUS

GENE

GO-TERMS

INHIBITORS-ALLOSTERIC-OF

INHIBITORS-COMPETITIVE-OF

INHIBITORS-IRREVERSIBLE-OF

INHIBITORS-NONCOMPETITIVE-OF

INHIBITORS-OTHER-OF

INHIBITORS-UNCOMPETITIVE-OF

INHIBITORS-UNKMECH-OF

INSTANCE-NAME-TEMPLATE

ISOZYME-SEQUENCE-SIMILARITY

LOCATIONS

MODIFIED-FORM

MOLECULAR-WEIGHT

MOLECULAR-WEIGHT-EXP

MOLECULAR-WEIGHT-KD

MOLECULAR-WEIGHT-SEQ

N+1-NAME

N-1-NAME

N-NAME

NEIDHARDT-SPOT-NUMBER

PI

PROSTHETIC-GROUPS-OF

REGULATED-BY

REGULATES

SPECIES

SPLICE-FORM-INTRONS

STRUCTURE-BONDS

SYMMETRY

SYNONYMS

UNMODIFIED-FORM

terminators.dat This file lists all terminators in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

APPEARS-IN-BINDING-REACTIONS

CITATIONS

COMMENT

COMPONENT-OF

COMPONENTS

CREDITS

DATA-SOURCE

DBLINKS

INSTANCE-NAME-TEMPLATE

LEFT-END-POSITION

RIGHT-END-POSITION

SYNONYMS

transunits.dat This file lists all transcription units in the PGDB.
Attributes:

UNIQUE-ID

TYPES

COMMON-NAME

APPEARS-IN-BINDING-REACTIONS

CITATIONS

COMMENT

COMPONENT-OF

COMPONENTS

CREDITS

DATA-SOURCE

DBLINKS

EXTENT-UNKNOWN?

INSTANCE-NAME-TEMPLATE

LEFT-END-POSITION

REGULATED-BY

RIGHT-END-POSITION

SYNONYMS

FASTA Files

protseq.fsa This file lists the amino acid sequence of each protein monomer in the PGDB. (Renamed from protseq.fasta in version 13.5.)

dnaseq.fsa This file lists the DNA nucleotide sequence of each gene in the PGDB, including genes for RNAs. The extent of each sequence is the coding region, on its coding strand. (New in version 13.5.)

Data File Extraction

When you receive your data file distribution for a PGDB called "Xcyc", it will be stored as a compressed archive file. How to extract the files depends on which operating system you're using:

Unix: To extract the files into the same directory in which you saved the distribution file:

uncompress -c -v Xcyc-flatfiles.tar.Z | tar xfp -

PC or Mac: Use WinZip, PKUnZip, StuffIt Expander, or another unzipping utility to extract the files from Xcyc-flatfiles.zip.

Determining the Difference Between Experimentally Verified and Computationally Predicted Data

Researchers who want to use the data in the data files as a gold standard for evaluating computational predictions need to know that all their data is in fact experimentally verified. In the past, this was not a problem, as all the data in EcoCyc could be reasonably expected to be experimentally verified, and the vast majority of data in other PGDBs could be assumed to have been predicted. However, as time goes by, more and more computationally predicted data is being added to EcoCyc. Thus, it becomes crucial when parsing the data files to be able to tell the difference.

Many classes of objects (such as enzymatic reactions, pathways, transcription units, and promoters) can have associated evidence codes, which are defined in the Pathway Tools Evidence Ontology. There are four top-level codes. All experimental evidence codes start with EV-EXP, and all computational evidence codes start with EV-COMP. The other top-level codes are EV-AS (author statement) and EV-IC (inferred by curator).

Because evidence codes are typically associated with citations, they are stored in the CITATIONS slot. Values of the citations slot can take the following possible formats:

citation
citation:ev-code
citation:ev-code:other-data
:ev-code:other-data

where citation is a citation to the literature, usually a PubMed ID, ev-code is one of the codes in our evidence ontology, and other-data is zero or more additional fields that can generally be ignored (such as a timestamp, curator identifier, etc.).

The CITATIONS slot is included in most of the attribute-value files. Evidence for enzyme or transporter function is found in enzrxns.dat. Evidence for non-enzymatic function of a protein is found in proteins.dat. Examples of CITATIONS lines include:

CITATIONS - 94304911
CITATIONS - 17123542:EV-EXP-IDA-PURIFIED-PROTEIN:3377353002:keseler
CITATIONS - :EV-EXP:3277835750:pkarp
CITATIONS - :EV-EXP-IEP-COREGULATION
CITATIONS - :EV-COMP-AINF:3371579821:kr
CITATIONS - 92065800:EV-COMP-HINF-FN-FROM-SEQ:3279554727:mhance
CITATIONS - 1849603:EV-AS-NAS:3342906733:martin

The first example is a citation without an evidence code. The next three all describe experimental evidence. The fifth and sixth examples show computational evidence, and the last example is an author statement.
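
The colon-separated layout of a CITATIONS value can be taken apart as in the following sketch. The function names parse_citation and top_level_code are hypothetical, and the code assumes that the citation field itself never contains a colon.

```python
def parse_citation(value):
    """Split a CITATIONS value into (citation, ev_code, other_data).

    Missing citation or evidence-code fields are returned as None;
    other_data is a (possibly empty) list of the remaining fields.
    """
    parts = value.strip().split(":")
    citation = parts[0] or None
    ev_code = parts[1] if len(parts) > 1 and parts[1] else None
    return citation, ev_code, parts[2:]

def top_level_code(ev_code):
    """Return the top-level prefix (EV-EXP, EV-COMP, EV-AS, EV-IC) or None."""
    if ev_code is None:
        return None
    for prefix in ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC"):
        if ev_code == prefix or ev_code.startswith(prefix + "-"):
            return prefix
    return None
```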

To strip out objects with only computational evidence, search for all the top-level code prefixes in the CITATIONS line. Any given object may have both experimental and computational evidence, so it is not sufficient just to look for the EV-COMP tag -- you must also make sure that the object lacks an EV-EXP tag. Some objects may have no attached evidence codes. In EcoCyc, it is probably safe to assume that these are experimentally determined, as the assignments probably predated our use of evidence codes, and were added at a time when EcoCyc contained only experimentally determined data. In other PGDBs of course that assumption does not hold.
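
The check described above, computational evidence present and experimental evidence absent, might be written as follows. This is a sketch under stated assumptions: the function names are hypothetical, the input is the list of CITATIONS values already extracted for one object, and the treatment of EV-AS and EV-IC codes and of objects with no evidence codes is left to the caller.

```python
def evidence_prefixes(citation_values):
    """Return the set of top-level evidence prefixes found in CITATIONS values."""
    found = set()
    for value in citation_values:
        fields = value.split(":")
        code = fields[1] if len(fields) > 1 else ""
        for prefix in ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC"):
            if code == prefix or code.startswith(prefix + "-"):
                found.add(prefix)
    return found

def only_computational(citation_values):
    """True when EV-COMP evidence is present and EV-EXP evidence is absent."""
    found = evidence_prefixes(citation_values)
    return "EV-COMP" in found and "EV-EXP" not in found
```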

For example, to find all EcoCyc pathways with experimental evidence, you would look in the pathways.dat file for all pathways that have at least one CITATIONS line with an EV-EXP tag (you would also have to decide whether or not to accept EV-AS and EV-IC tags). To find all transcription units with experimental evidence, you would do the same with the transunits.dat file. However, you should bear in mind that not all experimental evidence is of equal value. In particular, many transcription units are predicted based on high-throughput gene expression analysis which, although experimental in nature, is not generally considered high-quality. Before extracting your data, consider carefully which evidence codes you wish to include or exclude.

To find all EcoCyc genes with experimentally determined functions, the situation is more complicated still, as the evidence codes are not assigned to the genes themselves. To determine whether a particular gene has an experimentally determined function, you will need to look at its product (identified by the PRODUCT attribute) or a complex that includes the product (identified by the COMPONENT-OF attribute on the polypeptide). An evidence code may be attached directly to the protein in the proteins.dat file. However, if the gene codes for an enzyme, the evidence code will instead be found attached to the enzymatic-reaction entry or entries (identified by the CATALYZES attribute on the protein) in the enzrxns.dat file. If the gene is a transcription factor, the evidence code will be found in the corresponding entries (identified by the REGULATES attribute) in the regulation.dat file. Thus, you may need to follow several links and use several files to find the full list of evidence codes for any given gene. To extract all genes with experimental evidence, you might find it easiest to work backwards -- extract all objects from enzrxns.dat, regulation.dat and proteins.dat with experimental evidence codes, and follow them backwards (via the ENZYME, REGULATOR, COMPONENTS and GENE attributes) to get the list of relevant genes.
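
The backwards traversal suggested above could be sketched as follows. The sketch assumes that the relevant files have already been parsed into dictionaries mapping each UNIQUE-ID to a dictionary of attribute name to list of values; that layout and the function names are assumptions made for illustration. Regulators or enzymes that are not found among the parsed proteins (for example, small molecules) are skipped.

```python
def has_exp(entry):
    """True if any CITATIONS value carries an EV-EXP evidence code."""
    for value in entry.get("CITATIONS", []):
        fields = value.split(":")
        if len(fields) > 1 and (fields[1] == "EV-EXP"
                                or fields[1].startswith("EV-EXP-")):
            return True
    return False

def genes_with_exp_evidence(enzrxns, regulation, proteins):
    """Return gene IDs reachable from objects that have EV-EXP evidence.

    Each argument maps UNIQUE-ID -> {attribute: [values]}.
    """
    # Start from proteins supported directly, via an enzymatic reaction
    # (ENZYME), or via a regulatory interaction (REGULATOR).
    pending = [pid for pid, p in proteins.items() if has_exp(p)]
    for rxn in enzrxns.values():
        if has_exp(rxn):
            pending.extend(rxn.get("ENZYME", []))
    for reg in regulation.values():
        if has_exp(reg):
            pending.extend(reg.get("REGULATOR", []))

    genes, seen = set(), set()
    while pending:
        pid = pending.pop()
        if pid in seen or pid not in proteins:
            continue
        seen.add(pid)
        genes.update(proteins[pid].get("GENE", []))
        # Complexes list their subunits in COMPONENTS; descend into them.
        pending.extend(proteins[pid].get("COMPONENTS", []))
    return genes
```

Which evidence codes count as acceptable is a policy decision; has_exp here accepts any code under EV-EXP, which may be broader than a gold-standard data set requires.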