This page is part of the DAS 1.53E specification
Structuring DAS Protein Feature Annotation
A primary aim of the BioSapiens project is the integration of protein sequence annotation from regionally distributed data providers. We have chosen to implement a distributed annotation system (DAS). DAS annotation is provided in the form of Features, composed of a feature type, a feature description, and a sequence position.
There are two important factors when integrating annotations from different sources. Firstly, the terms that are used need to be standardised so that 'like' terms can be identified and compared. Secondly, evidence must be provided to describe how each annotation was created. Some features are annotated by a curator from experimental evidence in the literature (the UniProt/SwissProt annotations). Whilst these data are an accurate source of information, annotation by automatic methods provides much greater coverage.
An effective way of structuring feature annotation is to develop an ontology of protein feature types. An ontology provides a structured and precisely defined common controlled vocabulary in a dynamic environment so that changes can occur as different uses are invented and new terms added. We are proposing the new Protein Feature Ontology, jointly developed by the BioSapiens, UniProt, and GO consortia, as well as the GO evidence ontology, for adoption by BioSapiens partners. In the following sections, we describe the ontologies we recommend for use, as well as minor technical changes to the use of the DAS protocol, which will allow DAS client software to provide annotation in a much more structured, user-friendly way. An example of the layout of the final format is shown in the figure below:
The Protein Feature Ontology.
The Protein Feature Ontology is available from the Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS
The Protein Feature Ontology is a set of terms describing the features that make up protein function and form. It is divided into two parts: positional terms, which refer to a specific residue or range of residues in the protein, and non-positional terms, which refer to the whole protein sequence or structure.
Within the ontology you will find three types of ontology term IDs:
-
BS, identifying BioSapiens only terms
-
SO, identifying terms from the Sequence Ontology
-
MOD, identifying terms from the protein modification ontology PSI-MOD
This is because the ontology is a composite ontology, taking terms from all three sources and linking them together to create the final ontology. Details of the structure and location of the terms are below.
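As a minimal sketch of how a client might tell the three sources apart, the Python snippet below splits a term ID on its prefix. The function name `term_source` and the accession numbers used in the example are illustrative assumptions, not part of the specification.

```python
# Prefixes named in the text above; the descriptions paraphrase it.
TERM_SOURCES = {
    "BS": "BioSapiens-only term",
    "SO": "Sequence Ontology term",
    "MOD": "PSI-MOD protein modification term",
}


def term_source(term_id: str) -> str:
    """Return the source ontology for an ID such as 'SO:0000417'.

    Hypothetical helper: raises ValueError for a missing or unknown prefix.
    """
    prefix, sep, accession = term_id.partition(":")
    if not sep or not accession or prefix not in TERM_SOURCES:
        raise ValueError(f"not a Protein Feature Ontology term ID: {term_id!r}")
    return TERM_SOURCES[prefix]


# The accession number is a placeholder chosen for illustration.
print(term_source("SO:0000417"))
```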
Non-positional annotations
This section contains information which refers to the whole protein and its biological function. Fields include:
-
A reference to publications that exist for that protein. These are normally taken from the publications listed by the UniProt/SwissProt curators.
-
A family_annotation, a free text indication of the family to which the protein belongs.
-
Functional_annotation that is either a free text description, an EC_annotation for enzymes, or a GO_annotation.
-
Potentially mispredicted protein sequences, using the erroneous_protein and abnormal_protein categories.
Positional features describe primary information, that is, actual features that are present and annotated on the protein sequence or structure and are located on a particular residue or subsequence of the peptide. These terms clearly fall within the scope of the Sequence Ontology, an ontology provided by the GO consortium which is suitable for describing biological sequences. As a result, these terms have been integrated into the SO for further development. The terms are filtered from SO and added into the Protein Feature Ontology automatically. More details on the classification of these features can be seen in the ontology.
Additional terms to describe post-translational modifications
In addition to these terms, members of the BioSapiens NoE also provide annotations for post-translational modifications. For these annotations, the PSI-MOD terms will be used. For ease of use, these terms have been integrated into the Protein Feature Ontology under the post_translational_modification term.
To do:
-
For each feature type you provide, please browse the ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) and select the term which describes your feature. If your feature is not present in the ontology, or there is some other problem, please notify me immediately (gabby@ebi.ac.uk).
-
The ontology term and the reference id (beginning with SO:, BS:, or MOD:) must be added to the TYPEs command (see figures below for correct format).
-
The FEATURE tag (label identifier) specifies any specific information which relates to that feature in that particular protein.
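The steps above could be sketched in Python as follows. This is only an illustration under stated assumptions: it follows the general DAS 1.53 features layout (a FEATURE element with `id` and `label` attributes and a TYPE child with an `id` attribute), and it assumes that the ontology reference id goes into the TYPE `id` and the term name into the TYPE text. The function name `build_feature` and all example values are hypothetical; the figures referred to in the text remain the authoritative format.

```python
import xml.etree.ElementTree as ET


def build_feature(feature_id, label, term_id, term_name, start, end):
    """Build one FEATURE element (hypothetical helper).

    Assumption: the ontology reference id is carried in TYPE/@id and the
    ontology term name is the TYPE text.
    """
    # The label carries information specific to this feature in this protein.
    feature = ET.Element("FEATURE", {"id": feature_id, "label": label})
    type_el = ET.SubElement(feature, "TYPE", {"id": term_id})
    type_el.text = term_name
    ET.SubElement(feature, "START").text = str(start)
    ET.SubElement(feature, "END").text = str(end)
    return feature


# All values below are placeholders chosen for illustration.
feature = build_feature(
    "example_1", "Example domain in this protein",
    "SO:0000417", "polypeptide_domain", 10, 120,
)
print(ET.tostring(feature, encoding="unicode"))
```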
Evidence codes ECO
Currently, DAS annotations derived from manual curation or experimental evidence are indistinguishable from annotations which have been predicted. To allow a fine-grained attribution of data source type, terms from the Evidence Code Ontology (ECO) should be used to classify each DAS feature annotation.
To do:
-
For each feature you provide, select an evidence code from these: http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=ECO
Avoiding data redundancy
The introduction of standardised feature types and evidence codes will already allow DAS clients to provide a much more user-friendly interface. However, we still need to address the problem of high redundancy in the data provided by multiple servers.
Example:
UniProt provides many different annotations of domains, for example, the SMART domain. In this case, the server is "UniProt" and the feature type is "SMART". However, SMART also provides a DAS server, and in this case "SMART" is the server and "SMART domain" is the feature type. The same domain on the same protein will therefore be annotated twice. To allow DAS clients to detect such redundancy, we need to provide information on the primary source of each feature annotation.
In the new structure, the feature type will be "domain". The source information will be provided in the "METHOD" tag of the DAS response.
| Server | Feature Type | Method |
| --- | --- | --- |
| UniProt | Domain | SMART |
| SMART | Domain | SMART |
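A client could use the shared feature type and method to spot such duplicates. The following Python sketch groups annotations for one protein by feature type, method, and position; it assumes that redundant annotations have exactly matching coordinates, which may not hold in practice. The `Annotation` type, the `find_redundant` function, the coordinates, and the third (Pfam) entry are all hypothetical and serve only as illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    server: str
    feature_type: str
    method: str
    start: int
    end: int


def find_redundant(annotations):
    """Return groups of annotations sharing type, method and position."""
    groups = defaultdict(list)
    for ann in annotations:
        # Compare names case-insensitively; positions must match exactly.
        key = (ann.feature_type.lower(), ann.method.lower(), ann.start, ann.end)
        groups[key].append(ann)
    return [group for group in groups.values() if len(group) > 1]


# Illustrative data: the first two rows mirror the table above.
annotations = [
    Annotation("UniProt", "Domain", "SMART", 10, 120),
    Annotation("SMART", "Domain", "SMART", 10, 120),
    Annotation("UniProt", "Domain", "Pfam", 200, 310),
]
duplicates = find_redundant(annotations)
for group in duplicates:
    print([ann.server for ann in group])
```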
To do:
-
If you annotate by running someone else's method or by transferring data from another database, does the source of the annotation also have a DAS server?
-
If yes, annotate the METHOD tag with the nickname of this server as given in the DAS Registry at http://www.dasregistry.org/. Please see the figures below for actual format.
-
If no, write the method name into this field. Please be careful to use the actual name of the program as it is published and spelt.
-
If the annotation you provide in this track is derived from your own in-house method, please fill the METHOD tag with the name of your server.
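The decision steps above can be summarised in a short Python sketch. The function and parameter names are hypothetical, and the sketch assumes the caller already knows whether the source has a DAS server and what its DAS Registry nickname is; it does not query the registry.

```python
from typing import Optional


def method_tag_value(
    own_server: str,
    external_method: Optional[str] = None,
    registry_nickname: Optional[str] = None,
) -> str:
    """Choose the text for the METHOD tag (hypothetical helper).

    own_server: name of the server providing this track.
    external_method: published name of someone else's method or database,
        or None if the annotation comes from an in-house method.
    registry_nickname: DAS Registry nickname of the source's own DAS
        server, or None if the source has no DAS server.
    """
    if external_method is None:
        # In-house method: use the name of your own server.
        return own_server
    if registry_nickname:
        # The source has its own DAS server: use its registry nickname.
        return registry_nickname
    # No DAS server for the source: use the published method name.
    return external_method


# Placeholder names for illustration only.
print(method_tag_value("my_server"))
print(method_tag_value("my_server", "SomeMethod", "some_nickname"))
print(method_tag_value("my_server", "SomeMethod"))
```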
Updates/Questions/Comments
We expect the Protein Feature Ontology to evolve over the next few weeks to reflect the needs of the BioSapiens consortium. Questions, comments, and term requests should be sent to gabby@ebi.ac.uk.
All questions and comments will be added to the BioSapiens Ontology JIRA Tracker system.
There is no need to log in to view these comments. Before sending me an email, please check the listed comments to make sure that the issue has not already been raised.
Implementation
The following schema shows how to map protein feature information to the DAS protocol. We have structured the mapping so that it is backwards compatible with existing DAS servers and clients, but will allow modern clients a much more user-friendly display of protein annotation from the BioSapiens consortium.
Changes have been implemented by UniProtKB
UniProtKB have already implemented the changes. Please see their DAS server for more help and information:
http://www.ebi.ac.uk/das-srv/uniprot/das/uniprot/features?segment=P03973
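For reference, a request URL of this form can be assembled as in the Python sketch below. The helper name `features_url` is hypothetical, the base URL is taken from the link above, and the sketch only builds the URL; it does not contact the server or check that it is still available.

```python
from urllib.parse import urlencode

# Base URL of the UniProt DAS source, taken from the link above.
UNIPROT_DAS = "http://www.ebi.ac.uk/das-srv/uniprot/das/uniprot"


def features_url(base: str, segment: str) -> str:
    """Build a DAS 'features' request URL for one segment (hypothetical helper)."""
    return f"{base.rstrip('/')}/features?{urlencode({'segment': segment})}"


print(features_url(UNIPROT_DAS, "P03973"))
```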