Article
Evaluating large language models for annotating proteins
Vitale, Rosario; Bugnon, Leandro Ariel; Fenoy, Luis Emilio; Milone, Diego Humberto; Stegmayer, Georgina

Publication date:
05/2024
Publisher:
Oxford University Press
Journal:
Briefings In Bioinformatics
ISSN:
1467-5463
Language:
English
Resource type:
Published article
Abstract
Motivation: To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than fifteen thousand possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate compared with protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. Results: To address this issue, we propose and evaluate a novel protocol based on transfer learning. It uses protein large language models, trained with self-supervision on large unannotated datasets, to obtain sequence embeddings. The embeddings can then be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we evaluated several cutting-edge protein large language models together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods.
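The two-stage idea described in the abstract (a frozen embedding step followed by a small supervised classifier) can be illustrated with a toy sketch. This is not the authors' implementation: the `embed` function below is a hypothetical stand-in that uses amino-acid frequencies instead of a real protein language model, the nearest-centroid classifier is a swapped-in simple technique, and the sequences and labels `family_A` and `family_K` are invented for illustration only.

```python
from collections import Counter, defaultdict
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Hypothetical stand-in for a frozen protein language model.

    Returns a fixed-length vector of amino-acid frequencies. A real
    pretrained model would return learned embeddings instead.
    """
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """Supervised step: average the embeddings of each labeled family."""
    sums = defaultdict(lambda: [0.0] * len(AMINO_ACIDS))
    sizes = Counter(labels)
    for seq, label in zip(sequences, labels):
        for i, value in enumerate(embed(seq)):
            sums[label][i] += value
    return {
        label: [value / sizes[label] for value in vec]
        for label, vec in sums.items()
    }


def predict(centroids, sequence):
    """Assign the family whose centroid is closest in embedding space."""
    vec = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Tiny invented training set: two made-up families, two sequences each.
train_seqs = ["AAAAGAAA", "AAGAAAAA", "KKKKRKKK", "KRKKKKKK"]
train_labels = ["family_A", "family_A", "family_K", "family_K"]

centroids = fit_centroids(train_seqs, train_labels)
prediction = predict(centroids, "AAAGAAAG")
```

The embedding function is never trained on the labels; only the centroids are fitted on the small annotated set, which mirrors the transfer learning setup in miniature.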
Keywords:
transfer learning, protein annotation, deep learning, large language models
Collections
Articles (SINC(I))
Articles of INST. DE INVESTIGACION EN SEÑALES, SISTEMAS E INTELIGENCIA COMPUTACIONAL
Citation
Vitale, Rosario; Bugnon, Leandro Ariel; Fenoy, Luis Emilio; Milone, Diego Humberto; Stegmayer, Georgina; Evaluating large language models for annotating proteins; Oxford University Press; Briefings In Bioinformatics; 25; 3; 5-2024; 1-12