This page is part of the DAS 1.53E specification

Structuring DAS Protein Feature Annotation

A primary aim of the BioSapiens project is the integration of protein sequence annotation from regionally distributed data providers. We have chosen to implement a distributed annotation system (DAS). DAS annotation is provided in the form of Features, composed of a feature type, a feature description, and a sequence position.

There are two important factors when integrating annotations from different sources. Firstly, the terms that are used need to be standardised so that 'like' terms can be identified and compared. Secondly, evidence must be provided to describe how each annotation was created. Some features are annotated by a curator from experimental evidence in the literature (for example, the UniProt/SwissProt annotations). Whilst these data are an accurate source of information, annotation by automatic methods provides much greater coverage.

An effective way of structuring feature annotation is to develop an ontology of protein feature types. An ontology provides a structured and precisely defined common controlled vocabulary in a dynamic environment so that changes can occur as different uses are invented and new terms added. We are proposing the new Protein Feature Ontology, jointly developed by the BioSapiens, UniProt, and GO consortia, as well as the GO evidence ontology, for adoption by BioSapiens partners. In the following sections, we describe the ontologies we recommend for use, as well as minor technical changes to the use of the DAS protocol, which will allow DAS client software to provide annotation in a much more structured, user-friendly way. An example of the layout of the final format is shown in the figure below:

The Protein Feature Ontology

The Protein Feature Ontology is available from the Ontology Lookup Service (OLS)

http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS

The Protein Feature Ontology is a set of terms that describe the features which make up protein function and form. It is divided into two parts: positional terms, which refer to a specific residue or range of residues in the protein, and non-positional terms, which refer to the whole protein sequence or structure.

Within the ontology you will find three types of ontology term IDs, beginning with SO:, BS:, or MOD:.

This is because the ontology is a composite ontology, taking terms from all three sources and linking them together to create the final ontology. Details of the structure and location of the terms are below.

Non-positional annotations

This section contains information which refers to the whole protein and its biological function. Fields include:

Positional annotations

Positional features describe primary information: features that are present and annotated on the protein sequence or structure, and that are located on a particular residue or subsequence of the peptide. These terms fall within the scope of the Sequence Ontology (SO), an ontology provided by the GO consortium for describing biological sequences, so they have been integrated into the SO for further development. The terms are filtered from the SO and added to the Protein Feature Ontology automatically. More details on the classification of these features can be seen in the ontology.

Additional terms to describe post translational modifications

In addition to these terms, members of the BioSapiens NoE also provide annotations for post-translational modifications. For these annotations, the PSI-MOD terms will be used. For ease of use, these terms have been integrated into the Protein Feature Ontology under the post_translational_modification term.

To do:

  1. For each feature type you provide, please browse the ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) and select the term which describes your feature. If your feature is not present in the ontology, or there is some other problem, please notify me immediately (gabby@ebi.ac.uk).

  2. The ontology term and the reference id (beginning with SO:, BS:, or MOD:) must be added to the TYPEs command (see figures below for correct format).

  3. The FEATURE tag (label identifier) specifies any specific information which relates to that feature in that particular protein.
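The prefix rule in step 2 can be checked mechanically before a reference id is published. The following Python sketch is only an illustration and is not part of the DAS protocol: the function name `is_feature_term_id` is hypothetical, and the assumption that the part after the colon is purely numeric is ours. Only the three prefixes come from the text above.

```python
import re

# Prefixes named in step 2 above: SO:, BS: and MOD:.
# Assumption (not stated in this document): the accession after the colon
# consists of digits only.
TERM_ID_PATTERN = re.compile(r"^(SO|BS|MOD):\d+$")


def is_feature_term_id(term_id: str) -> bool:
    """Return True if term_id looks like a Protein Feature Ontology reference id."""
    return TERM_ID_PATTERN.match(term_id.strip()) is not None


if __name__ == "__main__":
    # Illustrative ids only; they are not checked against the ontology itself.
    for candidate in ("SO:0000001", "MOD:00001", "GO:0000001", "domain"):
        print(candidate, is_feature_term_id(candidate))
```

A check like this only confirms the shape of the id. Whether the term actually exists still has to be verified by browsing the ontology as described in step 1.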

Evidence codes (ECO)

Currently, DAS annotations derived from manual curation or experimental evidence are indistinguishable from annotations which have been predicted. To allow a fine-grained attribution of data source type, terms from the Evidence Code Ontology (ECO) should be used to classify each DAS feature annotation.

To do:

  1. For each feature you provide, select an evidence code from these: http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=ECO
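A data provider could verify that every feature carries an evidence code before serving it. The sketch below is a minimal illustration under stated assumptions: features are represented as plain dictionaries with hypothetical `id` and `evidence` keys, and ECO accessions are matched as "ECO:" followed by seven digits. It does not check that a code exists in the ontology.

```python
import re
from typing import Optional

# ECO accessions are matched as "ECO:" followed by seven digits.
ECO_PATTERN = re.compile(r"\bECO:\d{7}\b")


def find_evidence_code(text: str) -> Optional[str]:
    """Return the first ECO accession found in text, or None if there is none."""
    match = ECO_PATTERN.search(text)
    return match.group(0) if match else None


def features_missing_evidence(features):
    """Return the ids of features whose 'evidence' field carries no ECO code.

    Assumption: each feature is a dict with hypothetical 'id' and 'evidence' keys.
    """
    return [
        f["id"]
        for f in features
        if find_evidence_code(f.get("evidence", "")) is None
    ]
```

For example, calling `features_missing_evidence` on a list in which one feature has an empty `evidence` field returns that feature's id, which makes the gaps easy to list before a release.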

Avoiding data redundancy

The introduction of standardised feature types and evidence codes will already allow DAS clients to provide a much more user-friendly interface. However, we still need to address the problem of high redundancy in the data provided by multiple servers.

Example:

UniProt provides many different annotations of domains, for example, the SMART domain. In this case, the server is "UniProt" and the feature type is "SMART". However, SMART also provides a DAS server, and in this case "SMART" is the server and "SMART domain" is the feature type. The same domain on the same protein will therefore be annotated twice. To allow DAS clients to detect such redundancy, we need to provide information on the primary source of each feature annotation.

In the new structure, the feature type will be "domain". The source information will be provided in the "METHOD" tag of the DAS response.

Server     Feature Type    Method

UniProt    Domain          SMART

SMART      Domain          SMART
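With the feature type and method standardised as in the table above, a client can detect redundant annotations by comparing them across servers. The Python sketch below is one possible approach, not a prescribed algorithm. It assumes features are dictionaries with hypothetical keys `server`, `protein`, `type`, `method`, `start` and `end`, and that two features are redundant when all fields except `server` agree.

```python
from collections import defaultdict


def group_redundant(features):
    """Group identical annotations that are reported by more than one server.

    Assumption: two features are redundant when they share protein, feature
    type, method (primary source), start and end. Type and method are compared
    case-insensitively.
    """
    groups = defaultdict(list)
    for f in features:
        key = (f["protein"], f["type"].lower(), f["method"].lower(),
               f["start"], f["end"])
        groups[key].append(f["server"])
    # Keep only annotations reported by more than one server.
    return {key: servers for key, servers in groups.items() if len(servers) > 1}
```

A client could then display one entry per group and list the contributing servers alongside it. Exact coordinate matching is a simplification; a real client might prefer to tolerate small differences in domain boundaries.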

To do:

  1. If you annotate by running someone else's method or by transferring data from another database, check whether the source of the annotation also has a DAS server, and name that source in the "METHOD" tag as described above.


Updates/Questions/Comments

We expect the Protein Feature Ontology to evolve over the next few weeks to reflect the needs of the BioSapiens consortium. Questions, comments, and term requests should be sent to gabby@ebi.ac.uk.

All questions and comments will be added to the BioSapiens Ontology JIRA Tracker system.

There is no need to log in to view these comments. Before sending me an email, please check the listed comments to make sure that the issue has not already been raised.

Implementation

The following schema shows how to map protein feature information to the DAS protocol. We have structured the mapping so that it is backwards compatible with existing DAS servers and clients, but will allow modern clients a much more user-friendly display of protein annotation from the BioSapiens consortium.

Changes have been implemented by UniProtKB

UniProtKB have already implemented the changes. Please see their DAS server for more help and information:

http://www.ebi.ac.uk/das-srv/uniprot/das/uniprot/features?segment=P03973