Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks

Fenoy, Luis Emilio; Edera, Alejandro; Stegmayer, Georgina

doi:https://doi.org/10.1093/bib/bbac232

Article

Publication date: 06/2022

Publisher: Oxford University Press

Journal: Briefings In Bioinformatics

ISSN: 1467-5463

Language: English

Resource type: Published article

Subject classification:

Information Sciences and Bioinformatics

Abstract

A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered nowadays and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains; and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. We hope the results and the discussion of this study can help the community select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
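
To make the notions of a representation method and of similarity in the embedding space concrete, the following is a minimal sketch, not one of the methods evaluated in the article. It assumes a hand-crafted stand-in representation (amino acid composition, a 20-dimensional frequency vector) instead of a learned embedding, and it uses cosine similarity as the comparison measure; both choices, as well as the function names and the three example sequences, are hypothetical and were chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_embedding(sequence):
    """Toy representation method: fraction of each standard amino acid.

    This is a hand-crafted stand-in for a learned embedding, used only
    to illustrate the idea of mapping a sequence to a numeric vector.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical sequences, invented for illustration only.
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYIAKQK"  # differs from seq_a in the last residue
seq_c = "GGGGPPPPWW"  # shares no residue types with seq_a

emb_a, emb_b, emb_c = (composition_embedding(s) for s in (seq_a, seq_b, seq_c))

sim_ab = cosine_similarity(emb_a, emb_b)
sim_ac = cosine_similarity(emb_a, emb_c)
print(f"similarity(a, b) = {sim_ab:.3f}")
print(f"similarity(a, c) = {sim_ac:.3f}")
```

With a learned representation, `composition_embedding` would be replaced by a call to a pretrained model, while the comparison step in the embedding space could stay the same.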

Keywords: AUTOMATIC FUNCTION PREDICTION, EMBEDDING, MACHINE LEARNING, PROTEIN REPRESENTATION, PROTEOMICS, TRANSFER LEARNING

Associated files

Size: 1.622Mb

Format: PDF


License

Except where explicitly stated otherwise, this item is published under the following terms: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Unported (CC BY-NC-SA 2.5)

Identifiers

URI: http://hdl.handle.net/11336/213945

URL: https://academic.oup.com/bib/article-abstract/23/4/bbac232/6618242

DOI: https://doi.org/10.1093/bib/bbac232

Collections

Articles (SINC(I))
Articles of the INST. DE INVESTIGACION EN SEÑALES, SISTEMAS E INTELIGENCIA COMPUTACIONAL

Citation

Fenoy, Luis Emilio; Edera, Alejandro; Stegmayer, Georgina; Transfer learning in proteins: evaluating novel protein learned representations for bioinformatics tasks; Oxford University Press; Briefings In Bioinformatics; 23; 4; 6-2022; 1-19
