Repositorio Institucional
Repositorio Institucional
CONICET Digital
  • Inicio
  • EXPLORAR
    • AUTORES
    • DISCIPLINAS
    • COMUNIDADES
  • Estadísticas
  • Novedades
    • Noticias
    • Boletines
  • Ayuda
    • General
    • Datos de investigación
  • Acerca de
    • CONICET Digital
    • Equipo
    • Red Federal
  • Contacto
JavaScript is disabled for your browser. Some features of this site may not work without it.
  • INFORMACIÓN GENERAL
  • RESUMEN
  • ESTADISTICAS
 
Artículo

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Kim, Yohan; Sidney, John; Buus, Søren; Sette, Alessandro; Nielsen, MortenIcon ; Peters, Bjoern
Fecha de publicación: 07/2014
Editorial: BioMed Central
Revista: Bmc Bioinformatics
ISSN: 1471-2105
Idioma: Inglés
Tipo de recurso: Artículo publicado
Clasificación temática:
Otras Ciencias de la Computación e Información

Resumen

BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set. RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates. CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
Palabras clave: Benchmarking of Mhc Class I Predictors , Epitope Prediction , Sequence Similarity , Cross-Validation
Ver el registro completo
 
Archivos asociados
Thumbnail
 
Tamaño: 661.3Kb
Formato: PDF
.
Descargar
Licencia
info:eu-repo/semantics/openAccess Excepto donde se diga explícitamente, este item se publica bajo la siguiente descripción: Creative Commons Attribution 2.5 Unported (CC BY 2.5)
Identificadores
URI: http://hdl.handle.net/11336/17977
DOI: http://dx.doi.org/10.1186/1471-2105-15-241
URL: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241
Colecciones
Articulos(IQUIFIB)
Articulos de INST.DE QUIMICA Y FISICO-QUIMICA BIOLOGICAS "PROF. ALEJANDRO C. PALADINI"
Citación
Kim, Yohan; Sidney, John; Buus, Søren; Sette, Alessandro; Nielsen, Morten; et al.; Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions; BioMed Central; Bmc Bioinformatics; 15; 241; 7-2014; 1-9
Compartir
Altmétricas
 

Enviar por e-mail
Separar cada destinatario (hasta 5) con punto y coma.
  • Facebook
  • X Conicet Digital
  • Instagram
  • YouTube
  • Sound Cloud
  • LinkedIn

Los contenidos del CONICET están licenciados bajo Creative Commons Reconocimiento 2.5 Argentina License

https://www.conicet.gov.ar/ - CONICET

Inicio

Explorar

  • Autores
  • Disciplinas
  • Comunidades

Estadísticas

Novedades

  • Noticias
  • Boletines

Ayuda

Acerca de

  • CONICET Digital
  • Equipo
  • Red Federal

Contacto

Godoy Cruz 2290 (C1425FQB) CABA – República Argentina – Tel: +5411 4899-5400 repositorio@conicet.gov.ar
TÉRMINOS Y CONDICIONES