Repositorio Institucional
CONICET Digital
Article

Evaluating large language models for annotating proteins

Vitale, Rosario; Bugnon, Leandro Ariel; Fenoy, Luis Emilio; Milone, Diego Humberto; Stegmayer, Georgina
Publication date: 05/2024
Publisher: Oxford University Press
Journal: Briefings in Bioinformatics
ISSN: 1467-5463
Language: English
Resource type: Published article
Subject classification:
Information Sciences and Bioinformatics

Abstract

Motivation: To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than fifteen thousand possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate compared with the pace of protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families.

Results: To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models, trained with self-supervision on large unannotated datasets, to obtain sequence embeddings. The embeddings can then be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein large language models together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods.
Keywords: transfer learning, protein annotation, deep learning, large language models
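
The two-stage idea described in the abstract (frozen embeddings first, then a small supervised model) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `embed` is a hypothetical stand-in that returns an amino-acid composition vector in place of a real protein language model embedding, the classifier is a simple nearest-centroid rule rather than any architecture evaluated in the paper, and the sequences and family labels `FAM_A` and `FAM_B` are invented toy data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Hypothetical stand-in for a frozen protein language model.

    Returns a 20-dimensional amino-acid composition vector. A real
    pipeline would return the embedding produced by a pretrained model.
    """
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """Supervised step: average the embeddings of each labeled family."""
    sums = {}
    sizes = Counter()
    for seq, label in zip(sequences, labels):
        vec = embed(seq)
        previous = sums.get(label, [0.0] * len(vec))
        sums[label] = [p + v for p, v in zip(previous, vec)]
        sizes[label] += 1
    return {
        label: [s / sizes[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, sequence):
    """Assign the family whose centroid is closest to the embedding."""
    vec = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy annotated dataset (invented sequences and family names).
train_sequences = ["AALLAALA", "LLAALALL", "KKEEKEKK", "EEKKEKEE"]
train_labels = ["FAM_A", "FAM_A", "FAM_B", "FAM_B"]

centroids = fit_centroids(train_sequences, train_labels)
print(predict(centroids, "ALALLAAL"))
print(predict(centroids, "KEKEEKKE"))
```

The point of the split is that only the small second stage needs labeled examples: the embedding function is fixed in advance, so a family with few annotated members can still be represented by its centroid.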
Associated files
Size: 672.9 KB
Format: PDF
License
info:eu-repo/semantics/openAccess Except where explicitly stated otherwise, this item is published under the following license: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Unported (CC BY-NC-SA 2.5)
Identifiers
URI: http://hdl.handle.net/11336/258409
URL: https://academic.oup.com/bib/article/25/3/bbae177/7665115
DOI: https://doi.org/10.1093/bib/bbae177
Collections
Articles (SINC(I))
Articles from INST. DE INVESTIGACION EN SEÑALES, SISTEMAS E INTELIGENCIA COMPUTACIONAL
Citation
Vitale, Rosario; Bugnon, Leandro Ariel; Fenoy, Luis Emilio; Milone, Diego Humberto; Stegmayer, Georgina; Evaluating large language models for annotating proteins; Oxford University Press; Briefings In Bioinformatics; 25; 3; 5-2024; 1-12
CONICET content is licensed under the Creative Commons Attribution 2.5 Argentina License

https://www.conicet.gov.ar/ - CONICET

Godoy Cruz 2290 (C1425FQB) CABA – República Argentina – Tel: +5411 4899-5400 repositorio@conicet.gov.ar