Pathway Tools Data-File Formats

Each Pathway/Genome Database (PGDB) within the BioCyc Database Collection has been exported into a set of data files to facilitate use of these data by other programs and database management systems. The file formats are described below.

The data files themselves can be obtained in several ways:

  • Data files for BioCyc databases can be downloaded; see the download instructions.

  • You can generate data files from any PGDB using the desktop version of Pathway Tools by invoking the Pathway/Genome Navigator command File  →  Export  →  Entire DB to Flat Files.

  • For PGDBs created by other Pathway Tools users, contact the PGDB authors or obtain the PGDB from the PGDB registry, and then use the preceding Export command to create the data files.

Alternative Data Access Approaches

You may find it easier to obtain PGDB data via the following approaches:

Data-File Formats

The data files generated from each PGDB fall into the following categories of formats. Note that in most cases the tabular files provide a subset of the information found in the attribute-value format files, but are typically easier to use.

The following terms are synonymous in the context of data files: field, column, attribute, and slot. Most data file slots correspond one-to-one with a PGDB slot; their semantics are explained in the "Guide to the Pathway Tools Schema", which appears as the file "Pathway-Tools-Schema.pdf" in the data file download directory. The following data file slots are not described in the Pathway Tools Schema:

  • UNIQUE-ID : a series of characters that uniquely identifies an object within a PGDB
  • TYPES : the parent class(es) of an object
  • SUBUNIT-COMPOSITION : a description of the genes that encode the subunits of a multimer in a comma-delimited form. For example, genes A, B, C with subunit coefficients m, n, o would be expressed as:
    • A*m,B*n,C*o
  • REACTION-EQUATION : generated from the values in the LEFT and RIGHT slots of a reaction frame
Note that the database types for the values of the DBLINKS slots are explained at the following link:
/ptools/DBLINKS.html
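
As a minimal sketch of reading the SUBUNIT-COMPOSITION form described above, the following Python function splits a value such as A*m,B*n,C*o into (gene, coefficient) pairs. The function name is hypothetical. The sketch assumes that coefficients are integers, that gene names contain no commas, and that an item without a "*" has a coefficient of 1; none of these is stated on this page.

```python
def parse_subunit_composition(value):
    """Split a value like 'A*2,B*1' into [('A', 2), ('B', 1)]."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last '*' so a '*' inside a gene name is tolerated.
        gene, sep, coeff = item.rpartition("*")
        if not sep:
            # Assumption: a bare gene name means a coefficient of 1.
            pairs.append((item, 1))
        else:
            pairs.append((gene, int(coeff)))
    return pairs
```

For instance, parse_subunit_composition("A*2,B*1") returns [("A", 2), ("B", 1)].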

Format Description
Attribute-Value Each attribute-value file contains data for one class of objects, such as genes or proteins.  A file is divided into entries, where one entry describes one database object.

An entry consists of a set of attribute-value pairs, which describe properties of the object and relationships of the object to other objects. Each attribute-value pair typically resides on a single line of the file, although a value that is a long string may span multiple lines. An attribute-value pair consists of an attribute name, followed by the string " - " and a value, for example:

LEFT - NADP
A value that requires more than one line is continued by a newline followed by a /. Thus, literal slashes at the beginning of a line must be escaped as //. A line that contains only // separates objects. Comment lines can be anywhere in the file and must begin with the following symbol:
#
Starting in version 6.5 of Pathway Tools, attribute-value files can also contain annotation-value pairs. Annotations are a mechanism for attaching labeled values to specific attribute values. For example, we might want to specify a coefficient for a reactant in a chemical reaction. An annotation refers to the attribute value that immediately precedes it. An annotation-value pair consists of a caret symbol "^" (which points upward to indicate that the annotation applies to the preceding attribute value), followed by the annotation label, followed by the string " - ", followed by a value. The same attribute name or annotation label can appear any number of times in an object with different values. An example annotation-value pair that refers to the preceding attribute-value pair is:
LEFT - NADP
^COEFFICIENT - 1
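
The rules above (comment lines, "//" separators, "/" continuation lines, and "^" annotations) can be sketched as a small Python reader. This is an illustrative sketch, not the parser used by Pathway Tools: the function name and the returned data structure are assumptions, a continuation line is always appended to the most recent attribute value (even if an annotation came in between), and lines without the " - " separator are skipped.

```python
def parse_attribute_value(lines):
    """Yield one dict per entry: attribute name -> list of [value, annotations].

    'annotations' is a list of (label, value) tuples attached to that value.
    """
    record, last = {}, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        if line.strip() == "//":
            # A line containing only // ends the current object.
            if record:
                yield record
            record, last = {}, None
        elif line.startswith("/") and last is not None:
            # Continuation line: drop the leading slash and append.
            last[0] += "\n" + line[1:]
        elif line.startswith("^") and last is not None:
            # Annotation on the preceding attribute value.
            label, _, value = line[1:].partition(" - ")
            last[1].append((label, value))
        else:
            name, sep, value = line.partition(" - ")
            if not sep:
                continue  # assumption: ignore lines we cannot interpret
            last = [value, []]
            record.setdefault(name, []).append(last)
    if record:
        yield record
```

Because the same attribute can appear several times in one object, each attribute maps to a list rather than a single value. The function accepts any iterable of lines, so an open file handle can be passed directly; the file encoding is not specified on this page and has to be chosen by the caller.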
BioPAX The files contain Biological Pathways Exchange (BioPAX) Level 2 and Level 3 dumps of the database.
FASTA Each object (either a polypeptide or a polynucleotide) in the file begins with a line that begins with the following symbol:
>
On the same line as the > is a comment describing the object. The remaining lines contain the object's amino-acid or nucleotide sequence. The sequence is typically broken into multiple lines, each of which must have the same arbitrary length, except the last line, which may be shorter. NCBI describes the FASTA format in more detail.
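
A minimal Python sketch of reading this layout follows. The function name is hypothetical, and the sketch simply joins the sequence lines of each object without checking that they have equal length.

```python
def read_fasta(lines):
    """Yield (comment, sequence) pairs; the comment excludes the leading '>'."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new object starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

The function accepts any iterable of lines, such as an open protseq.fsa or dnaseq.fsa file.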
Mol File Mol files contain chemical structures for compounds in PGDBs. Mol files are provided for MetaCyc only.
Ocelot The file contains a Lisp format dump of the entire database.
Other Enzyme Sequence Files
Three files unique to MetaCyc provide sequence information associated with many MetaCyc reactions for use by the Pathway Hole Filler. They are present within the MetaCyc flat-file distribution. All of the files are formatted as Lisp lists, actually as a list of lists. Each Lisp list is a series of items enclosed in parentheses.

  • protein-seq-ids-reduced-70.dat -- A list of lists, where each list contains a MetaCyc reaction ID, an EC number, and one or more protein identifiers with a prefix of either UNIPROT for UniProt identifiers or PID for Entrez-derived identifiers. In this file, sequences with more than 70% BLAST similarity to previously processed sequences have been removed as "redundant." The .seq file is supplied for the reduced set only.

  • protein-seq-ids-reduced-70.seq -- A list of lists, with each list consisting of a protein id with database prefix, followed by the amino acid sequence of the protein specified by the id. Corresponds to the preceding .dat file.

  • protein-seq-ids-unreduced.dat -- Similar to protein-seq-ids-reduced-70.dat, but no removal of similar sequences has been performed.

Atom Mapping Files
Two files within the MetaCyc flat-file distribution contain computed atom mappings for the mass-balanced reactions of MetaCyc.

SBML The file contains an SBML dump of the reaction network within the database.
Tabular Each tabular file contains data for one class of objects, such as reactions or pathways.  This type of file contains a single table of tab-delimited columns and newline-delimited rows. The first row contains headers which describe the data beneath them. Each of the remaining rows represents an object, and each column is an attribute of the object. Column names that would otherwise be the same contain a number x having values 1, 2, 3, etc. to distinguish them. Comment lines can be anywhere in the file and must begin with the following symbol:
#
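
The tabular layout can be read with Python's standard csv module, as in the sketch below. The function names are hypothetical. Because this page does not spell out exactly how the distinguishing number is written into repeated column names, the helper values_with_prefix only assumes that such columns share a common prefix (for example, that all synonym columns start with SYNONYMS).

```python
import csv


def read_col_file(lines):
    """Return (header, rows); each row is a dict keyed by column name."""
    # Drop comment lines and blank lines before handing the rest to csv.
    data = (ln for ln in lines if ln.strip() and not ln.startswith("#"))
    reader = csv.reader(data, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # first non-comment row holds the column names
    rows = [dict(zip(header, fields)) for fields in reader]
    return header, rows


def values_with_prefix(row, prefix):
    """Collect the non-empty values of all columns whose name starts with prefix."""
    return [v for k, v in row.items() if k.startswith(prefix) and v]
```

Quoting is disabled so that quotation marks inside names are kept as ordinary characters. An empty input raises StopIteration at next(reader); a caller that expects empty files should handle that case.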

Data File List

The following table summarizes the files that are generated for each PGDB. Most files are described in more detail in the sections that follow.

File Name Brief Description Format
enzymes.col Enzymatic reactions and enzymes Tabular
genes.col Genes  Tabular
pathways.col Pathways and the genes that encode enzymes in each pathway Tabular
protcplxs.col Protein complexes and the genes that encode each subunit in a complex Tabular
transporters.col Transporters, their subunit structures, and what they transport Tabular
func-associations.col Various functional associations between genes (not available for most organisms) Tabular
bindrxns.dat Binding reactions between proteins and DNA sites Attribute-Value
classes.dat PGDB classes and their relationships Attribute-Value
compounds.dat Chemical compounds Attribute-Value
dnabindsites.dat DNA binding sites Attribute-Value
enzrxns.dat Enzymatic reactions Attribute-Value
genes.dat Genes Attribute-Value
pathways.dat Pathways, including relationships among reactions Attribute-Value
promoters.dat Promoters Attribute-Value
protein-features.dat Protein features (for example, active sites) Attribute-Value
proteins.dat Proteins Attribute-Value
protligandcplxes.dat Complexes between proteins and small-molecule ligands Attribute-Value
pubs.dat Publications Attribute-Value
reactions.dat Chemical reactions Attribute-Value
regulation.dat Regulatory interactions of all types Attribute-Value
regulons.dat Transcription factors Attribute-Value
species.dat List of all Species (this file is in only the MetaCyc DB) Attribute-Value
terminators.dat Terminators Attribute-Value
transunits.dat Transcription units Attribute-Value
protseq.fsa Protein sequences (renamed from protseq.fasta in version 13.5) FASTA
dnaseq.fsa Nucleotide sequences for all genes (new in 13.5) FASTA
[orgid].gaf GO Annotations GAF-2.2
[orgid]base.ocelot Entire database Ocelot
biopax-level2.owl All pathways, reactions, etc. that can be represented in BioPAX level 2 format BioPAX
biopax-level3.owl All pathways, reactions, etc. that can be represented in BioPAX level 3 format BioPAX
MetaCyc-Molfiles.tgz MetaCyc compound mol files (molecular structures) This file is available as part of MetaCyc flat files

Tabular Data Files

File Name Description
enzymes.col For each enzymatic reaction in the PGDB, the file lists the reaction equation, up to 4 pathways that contain the reaction, up to 4 cofactors for the enzyme, up to 4 activators, up to 4 inhibitors, and the subunit structure of the enzyme.

Columns (multiple columns are indicated in parentheses):

  • UNIQUE-ID
  • NAME
  • REACTION-EQUATION
  • PATHWAYS (4)
  • COFACTORS (4)
  • ACTIVATORS (4)
  • INHIBITORS (4)
  • SUBUNIT-COMPOSITION
genes.col For each gene in the PGDB, the file lists its names (including up to 4 synonyms), location, product, and up to 4 parent classes (types). Note: Gene Type is a class in the gene ontology designed by Dr. M. Riley.

Columns (multiple columns are indicated in parentheses):

  • UNIQUE-ID
  • BLATTNER-ID (E. coli PGDB only; column omitted otherwise)
  • NAME
  • PRODUCT-NAME
  • SWISS-PROT-ID
  • REPLICON
  • START-BASE
  • END-BASE
  • SYNONYMS (4)
  • GENE TYPE (4)
pathways.col For each pathway in the PGDB, the file lists the genes that encode the enzymes in that pathway.

Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all pathways in the PGDB):

  • UNIQUE-ID
  • NAME
  • GENE-NAME (n)
  • GENE-ID (n)
protcplxs.col For each protein complex in the PGDB, the file lists the genes that encode the subunits of the complex.

Columns (multiple columns are indicated in parentheses; n is the maximum number of genes for all protein complexes in the PGDB):

transporters.col For each transporter in the PGDB, the file lists the transport reaction equation and the transporter's subunit composition.

Columns:

func-associations.col This file contains all functional associations among genes in the EcoCyc Pathway/Genome Database.

The file contains three types of functional associations. The sections are separated by two rows, each starting with '#'. The types are:
  1. Pathway functional associations, i.e., genes coding for enzymes in the same metabolic pathway.
  2. Protein complex functional associations, i.e., genes whose products are components of heteromultimeric protein complexes.
  3. Transcription factor/regulated gene pairs, i.e., pairs of genes A and B, where the product of gene A is a component of a transcription factor that regulates gene B.

Pathway Functional Associations
Each entry is a tab-delimited list of the pathway's ID, the pathway name and the list of genes in the pathway.
Columns:

  • UNIQUE-ID
  • NAME
  • GENE-ID (n, BLATTNER-ID for E. coli)

Protein Complex Functional Associations
Each entry is a tab-delimited list of the complex's ID, the complex name and the list of component genes.
Columns:

  • UNIQUE-ID
  • NAME
  • GENE-ID (n, BLATTNER-ID for E. coli)

Transcription Factor/Regulated Gene Pairs
Each entry is a tab-delimited list of the transcription factor's ID, name, the transcription factor gene, and the regulated gene.
Columns:

  • UNIQUE-ID
  • NAME
  • GENE-ID of the transcription factor gene (BLATTNER-ID for E. coli)
  • GENE-ID of the regulated gene (BLATTNER-ID for E. coli)
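
One way to separate the three sections of func-associations.col is sketched below in Python. The function name is hypothetical. The sketch treats any run of lines starting with '#' as a section boundary, which covers the two-row separator described above but would also split a section that contained a comment line in its middle. It does not assume whether a section begins with a header row; the caller must check.

```python
def split_sections(lines):
    """Return a list of sections; each section is a list of tab-split rows."""
    sections, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            # Separator (or comment) row: close the section collected so far.
            if current:
                sections.append(current)
                current = []
        elif line.strip():
            current.append(line.split("\t"))
    if current:
        sections.append(current)
    return sections
```

For the pathway and protein complex sections, the gene identifiers of a row are then row[2:]; for the transcription factor section, row[2] and row[3] are the transcription factor gene and the regulated gene.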

Attribute-Value Data Files

The meanings of the attributes are explained in the chapter "Guide to the Pathway Tools Schema" in the Pathway Tools User's Guide, which is available as part of the Pathway Tools software distribution.

classes.dat For each class, the file lists its names and its parent classes (types). This file covers every class in the Pathway Tools ontology.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMENT
  • COMMON-NAME
  • SYNONYMS
bindrxns.dat This file lists binding reactions between proteins and DNA binding sites such as promoters.

Attributes:

  • UNIQUE-ID
  • TYPES
  • ACTIVATORS
  • INHIBITORS
  • OFFICIAL-EC
  • REACTANTS
compounds.dat This file lists all chemical compounds in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • ABBREV-NAME
  • ATOM-CHARGES
  • CHEMICAL-FORMULA
  • CITATIONS
  • COFACTORS-OF
  • COFACTORS-OR-PROSTHETIC-GROUPS-OF
  • COMMENT
  • COMPONENT-OF
  • CREDITS
  • DATA-SOURCE
  • DBLINKS
  • INCHI
  • MOLECULAR-WEIGHT
  • N+1-NAME
  • N-1-NAME
  • N-NAME
  • PKA1
  • PKA2
  • PKA3
  • PROSTHETIC-GROUPS-OF
  • REGULATES
  • SMILES
  • SUPERATOMS
  • SYNONYMS
  • SYSTEMATIC-NAME
dnabindsites.dat This file lists all DNA binding sites in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • ABS-CENTER-POS
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • DBLINKS
  • INVOLVED-IN-REGULATION
  • REGULATED-PROMOTER
  • RELATIVE-CENTER-DISTANCE
  • SYNONYMS
  • TYPE-OF-EVIDENCE
enzrxns.dat This file lists all enzymatic reactions in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • ALTERNATIVE-COFACTORS
  • ALTERNATIVE-SUBSTRATES
  • CITATIONS
  • COFACTOR-BINDING-COMMENT
  • COFACTORS
  • COFACTORS-OR-PROSTHETIC-GROUPS
  • COMMENT
  • ENZYME
  • KM
  • PH-OPT
  • PROSTHETIC-GROUPS
  • REACTION
  • REACTION-DIRECTION
  • REGULATED-BY
  • REQUIRED-PROTEIN-COMPLEX
  • SYNONYMS
  • TEMPERATURE-OPT
genes.dat This file lists all genes in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • CENTISOME-POSITION
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • COMPONENTS
  • DBLINKS
  • IN-PARALOGOUS-GENE-GROUP
  • INTERRUPTED?
  • LAST-UPDATE
  • LEFT-END-POSITION
  • PRODUCT
  • PRODUCT-STRING
  • RIGHT-END-POSITION
  • SYNONYMS
  • TRANSCRIPTION-DIRECTION
pathways.dat This file lists all pathways in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • CITATIONS
  • CLASS-INSTANCE-LINKS
  • COMMENT
  • CREDITS
  • DBLINKS
  • ENZYME-USE
  • HYPOTHETICAL-REACTIONS
  • IN-PATHWAY
  • NET-REACTION-EQUATION
  • PATHWAY-INTERACTIONS
  • PATHWAY-LINKS
  • POLYMERIZATION-LINKS
  • PREDECESSORS
  • PRIMARIES
  • REACTION-LAYOUT
  • REACTION-LIST
  • SPECIES
  • SUB-PATHWAYS
  • SUPER-PATHWAYS
  • SYNONYMS
promoters.dat This file lists all promoters in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • ABSOLUTE-PLUS-1-POS
  • BINDS-SIGMA-FACTOR
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • DBLINKS
  • PROMOTER-EVIDENCE
  • REGULATED-BY
  • SYNONYMS
protein-features.dat This file lists all the protein features (such as active sites) in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • COMMENT
  • ATTACHED-GROUP
  • CITATIONS
  • FEATURE-OF
  • HOMOLOGY-MOTIF
  • LEFT-END-POSITION
  • POSSIBLE-FEATURE-STATES
  • RESIDUE-NUMBER
  • RESIDUE-TYPE
  • RIGHT-END-POSITION
proteins.dat This file lists all proteins in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • CATALYZES
  • CHEMICAL-FORMULA
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • CREDITS
  • DBLINKS
  • DNA-FOOTPRINT-SIZE
  • FEATURES
  • GENE
  • GO-TERMS
  • ISOZYME-SEQUENCE-SIMILARITY
  • LOCATIONS
  • MODIFIED-FORM
  • MOLECULAR-WEIGHT-EXP
  • MOLECULAR-WEIGHT-KD
  • MOLECULAR-WEIGHT-SEQ
  • PI
  • REGULATES
  • SPECIES
  • SPLICE-FORM-INTRONS
  • SYMMETRY
  • SYNONYMS
  • UNMODIFIED-FORM
protligandcplxes.dat This file lists all the complexes of proteins with small-molecule ligands in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • CATALYZES
  • COMMON-NAME
  • COMMENT
  • COMPONENTS
  • DBLINKS
  • DNA-FOOTPRINT-SIZE
  • MOLECULAR-WEIGHT-KD
  • MOLECULAR-WEIGHT-SEQ
  • REGULATES
  • SYMMETRY
  • SYNONYMS
pubs.dat This file lists all non-PubMed publications referenced in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • ABSTRACT
  • AUTHORS
  • COMMENT
  • MEDLINE-UID
  • PUBMED-ID
  • REFERENT-FRAME
  • SOURCE
  • TITLE
  • YEAR
reactions.dat This file lists all chemical reactions in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • CITATIONS
  • COMMENT
  • DELTAG0
  • EC-NUMBER
  • ENZYMATIC-REACTION
  • IN-PATHWAY
  • LEFT
  • OFFICIAL-EC?
  • ORPHAN?
  • RIGHT
  • SIGNAL
  • SPECIES
  • SPONTANEOUS?
  • SYNONYMS
regulation.dat This file lists all the regulatory relationships in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • ASSOCIATED-BINDING-SITE
  • COMMENT
  • MECHANISM
  • MODE
  • PHYSIOLOGICALLY-RELEVANT?
  • REGULATED-BY
  • REGULATED-ENTITY
  • REGULATOR
  • SYNONYMS
regulons.dat This file lists all transcription factors in the PGDB and the genes that they regulate by binding upstream of the transcription unit containing those genes.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • ACTIVATORS-ALLOSTERIC-OF
  • ACTIVATORS-NONALLOSTERIC-OF
  • ACTIVATORS-UNKMECH-OF
  • APPEARS-IN-BINDING-REACTIONS
  • AROMATIC-RINGS
  • ATOM-CHARGES
  • ATOM-CHIRALITY
  • CATALYZES
  • CHARGE
  • CHEMICAL-FORMULA
  • CITATIONS
  • COFACTORS-OF
  • COFACTORS-OR-PROSTHETIC-GROUPS-OF
  • COMMENT
  • COMPONENT-COEFFICIENTS
  • COMPONENT-OF
  • COMPONENTS
  • CREDITS
  • DATA-SOURCE
  • DBLINKS
  • DNA-FOOTPRINT-SIZE
  • FEATURES
  • FUNCTIONAL-ASSIGNMENT-COMMENT
  • FUNCTIONAL-ASSIGNMENT-STATUS
  • GENE
  • GO-TERMS
  • INHIBITORS-ALLOSTERIC-OF
  • INHIBITORS-COMPETITIVE-OF
  • INHIBITORS-IRREVERSIBLE-OF
  • INHIBITORS-NONCOMPETITIVE-OF
  • INHIBITORS-OTHER-OF
  • INHIBITORS-UNCOMPETITIVE-OF
  • INHIBITORS-UNKMECH-OF
  • INSTANCE-NAME-TEMPLATE
  • ISOZYME-SEQUENCE-SIMILARITY
  • LOCATIONS
  • MODIFIED-FORM
  • MOLECULAR-WEIGHT
  • MOLECULAR-WEIGHT-EXP
  • MOLECULAR-WEIGHT-KD
  • MOLECULAR-WEIGHT-SEQ
  • N+1-NAME
  • N-1-NAME
  • N-NAME
  • NEIDHARDT-SPOT-NUMBER
  • PI
  • PROSTHETIC-GROUPS-OF
  • REGULATED-BY
  • REGULATES
  • SPECIES
  • SPLICE-FORM-INTRONS
  • STRUCTURE-BONDS
  • SYMMETRY
  • SYNONYMS
  • UNMODIFIED-FORM
terminators.dat This file lists all terminators in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • APPEARS-IN-BINDING-REACTIONS
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • COMPONENTS
  • CREDITS
  • DATA-SOURCE
  • DBLINKS
  • INSTANCE-NAME-TEMPLATE
  • LEFT-END-POSITION
  • RIGHT-END-POSITION
  • SYNONYMS
transunits.dat This file lists all transcription units in the PGDB.

Attributes:

  • UNIQUE-ID
  • TYPES
  • COMMON-NAME
  • APPEARS-IN-BINDING-REACTIONS
  • CITATIONS
  • COMMENT
  • COMPONENT-OF
  • COMPONENTS
  • CREDITS
  • DATA-SOURCE
  • DBLINKS
  • EXTENT-UNKNOWN?
  • INSTANCE-NAME-TEMPLATE
  • LEFT-END-POSITION
  • REGULATED-BY
  • RIGHT-END-POSITION
  • SYNONYMS

FASTA Files

protseq.fsa This file lists the amino acid sequence of each protein monomer in the PGDB. (Renamed from protseq.fasta in version 13.5.)
dnaseq.fsa This file lists the DNA nucleotide sequence of each gene in the PGDB, including RNA genes. (New in version 13.5.)
The extent of each sequence is the coding region, on its coding strand.

Data File Extraction

When you receive your data file distribution for a PGDB called "Xcyc", it will be stored as a compressed archive file. How to extract the files depends on which operating system you're using:
  • Unix: To extract the files into the same directory in which you saved the distribution file:
    • uncompress -c -v Xcyc-flatfiles.tar.Z | tar xfp -
  • PC or Mac: Use WinZip, PKUnZip, Stuffit Expander, or another unzipping utility to extract the files from Xcyc-flatfiles.zip

Determining the Difference Between Experimentally Verified and Computationally Predicted Data

Researchers who want to use the data in the data files as a gold standard for evaluating computational predictions need to know that all their data is in fact experimentally verified. In the past, this was not a problem, as all the data in EcoCyc could be reasonably expected to be experimentally verified, and the vast majority of data in other PGDBs could be assumed to have been predicted. However, as time goes by, more and more computationally predicted data is being added to EcoCyc. Thus, it becomes crucial when parsing the data files to be able to tell the difference.

Many classes of objects (e.g. enzymatic reactions, pathways, transcription units, promoters, etc.) can have associated evidence codes. The codes are described in more detail in the Pathway Tools Evidence Ontology. There are four top-level codes. All experimental evidence codes start with EV-EXP, and all computational evidence codes start with EV-COMP. The other top-level codes are EV-AS (author statement) and EV-IC (inferred by curator).

Because evidence codes are typically associated with citations, they are stored in the CITATIONS slot. Values of the citations slot can take the following possible formats:

  • citation
  • citation:ev-code
  • citation:ev-code:other-data
  • :ev-code:other-data
where citation is a citation to the literature, usually a PubMed ID, ev-code is one of the codes in our evidence ontology, and other-data is zero or more additional fields that can generally be ignored (such as a timestamp, curator identifier, etc.).

The CITATIONS slot is included in most of the attribute-value files. Evidence for enzyme or transporter function is found in enzrxns.dat. Evidence for non-enzymatic function of a protein is found in proteins.dat. Examples of CITATIONS lines include:

  • CITATIONS - 94304911
  • CITATIONS - 17123542:EV-EXP-IDA-PURIFIED-PROTEIN:3377353002:keseler
  • CITATIONS - :EV-EXP:3277835750:pkarp
  • CITATIONS - :EV-EXP-IEP-COREGULATION
  • CITATIONS - :EV-COMP-AINF:3371579821:kr
  • CITATIONS - 92065800:EV-COMP-HINF-FN-FROM-SEQ:3279554727:mhance
  • CITATIONS - 1849603:EV-AS-NAS:3342906733:martin
The first example is a citation without an evidence code. The next three all describe experimental evidence. The fifth and sixth examples show computational evidence, and the last example is an author statement.
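
The value formats listed above can be split with a few lines of Python. This is a sketch with a hypothetical function name; it takes the text after "CITATIONS - " and assumes that neither the citation nor the evidence code contains a colon.

```python
def parse_citation(value):
    """Split a CITATIONS value into (citation, ev_code, other_data).

    citation and ev_code are None when absent; other_data is a list of
    the remaining colon-separated fields (timestamp, curator, and so on).
    """
    parts = value.strip().split(":")
    citation = parts[0] or None
    ev_code = parts[1] if len(parts) > 1 and parts[1] else None
    other_data = parts[2:]
    return citation, ev_code, other_data
```

Applied to the second example above, the function returns ("17123542", "EV-EXP-IDA-PURIFIED-PROTEIN", ["3377353002", "keseler"]).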

To strip out objects with only computational evidence, search for all the top-level code prefixes in the CITATIONS line. Any given object may have both experimental and computational evidence, so it is not sufficient just to look for the EV-COMP tag -- you must also make sure that the object lacks an EV-EXP tag. Some objects may have no attached evidence codes. In EcoCyc, it is probably safe to assume that these are experimentally determined, as the assignments probably predated our use of evidence codes, and were added at a time when EcoCyc contained only experimentally determined data. In other PGDBs of course that assumption does not hold.
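
The check described above can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be the list of CITATIONS values of one object (the text after "CITATIONS - " on each line). An object with no evidence codes is reported as not computational-only; whether to treat it as experimental is left to the caller, as discussed above.

```python
TOP_LEVEL = ("EV-EXP", "EV-COMP", "EV-AS", "EV-IC")


def evidence_classes(citation_values):
    """Return the set of top-level evidence prefixes found in the values."""
    found = set()
    for value in citation_values:
        fields = value.split(":")
        code = fields[1] if len(fields) > 1 else ""
        for prefix in TOP_LEVEL:
            # Match the bare top-level code or any of its sub-codes.
            if code == prefix or code.startswith(prefix + "-"):
                found.add(prefix)
    return found


def is_computational_only(citation_values):
    """True if the object has EV-COMP evidence and no EV-EXP evidence."""
    classes = evidence_classes(citation_values)
    return "EV-COMP" in classes and "EV-EXP" not in classes
```

Note that is_computational_only ignores EV-AS and EV-IC codes; a stricter or looser filter can be built from the set returned by evidence_classes.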

For example, to find all EcoCyc pathways with experimental evidence, you would look in the pathways.dat file for all pathways that have at least one CITATIONS line with an EV-EXP tag (you would also have to decide whether or not to accept EV-AS and EV-IC tags). To find all transcription units with experimental evidence, you would do the same with the transunits.dat file. However, you should bear in mind that not all experimental evidence is of equal value. In particular, many transcription units are predicted based on high-throughput gene expression analysis which, although experimental in nature, is not generally considered high-quality. Before extracting your data, consider carefully which evidence codes you wish to include or exclude.

To find all EcoCyc genes with experimentally determined functions, the situation is more complicated still, as the evidence codes are not assigned to the genes themselves. To determine whether a particular gene has an experimentally determined function, you will need to look at its product (identified by the PRODUCT attribute) or a complex that includes the product (identified by the COMPONENT-OF attribute on the polypeptide). An evidence code may be attached directly to the protein in the proteins.dat file. However, if the gene codes for an enzyme, the evidence code will instead be found attached to the enzymatic-reaction entry or entries (identified by the CATALYZES attribute on the protein) in the enzrxns.dat file. If the gene is a transcription factor, the evidence code will be found in the corresponding entries (identified by the REGULATES attribute) in the regulation.dat file. Thus, you may need to follow several links and use several files to find the full list of evidence codes for any given gene. To extract all genes with experimental evidence, you might find it easiest to work backwards -- extract all objects from enzrxns.dat, regulation.dat and proteins.dat with experimental evidence codes, and follow them backwards (via the ENZYME, REGULATOR, COMPONENTS and GENE attributes) to get the list of relevant genes.
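
The "work backwards" strategy can be sketched in Python as below. All names are hypothetical. Each record is assumed to be a dict that maps an attribute name to a list of string values, already read from proteins.dat, enzrxns.dat, and regulation.dat. The sketch also assumes that evidence codes in all three files appear in CITATIONS values, and it silently skips ENZYME, REGULATOR, or COMPONENTS values that are not present in the protein records supplied (for example, small-molecule regulators).

```python
def has_exp(record):
    """True if any CITATIONS value carries an EV-EXP evidence code."""
    for value in record.get("CITATIONS", []):
        fields = value.split(":")
        if len(fields) > 1 and (fields[1] == "EV-EXP" or fields[1].startswith("EV-EXP-")):
            return True
    return False


def genes_with_experimental_evidence(proteins, enzrxns, regulation):
    """Follow ENZYME, REGULATOR, COMPONENTS and GENE links back to genes."""
    by_id = {p["UNIQUE-ID"][0]: p for p in proteins}
    # Start from proteins that carry the evidence directly ...
    seeds = {p["UNIQUE-ID"][0] for p in proteins if has_exp(p)}
    # ... or that are the enzyme or regulator of an object that carries it.
    for rec in enzrxns:
        if has_exp(rec):
            seeds.update(rec.get("ENZYME", []))
    for rec in regulation:
        if has_exp(rec):
            seeds.update(rec.get("REGULATOR", []))
    genes, seen, stack = set(), set(), list(seeds)
    while stack:
        pid = stack.pop()
        if pid in seen or pid not in by_id:
            continue
        seen.add(pid)
        protein = by_id[pid]
        genes.update(protein.get("GENE", []))
        # Descend from a complex to its subunits.
        stack.extend(protein.get("COMPONENTS", []))
    return genes
```

If regulators or enzymes can be protein-ligand complexes, the records from protligandcplxes.dat can be passed in together with the protein records so that their COMPONENTS links are followed as well.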