Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Kim, Yohan; Sidney, John; Buus, Søren; Sette, Alessandro; Nielsen, Morten; Peters, Bjoern

doi:10.1186/1471-2105-15-241

dc.contributor.author

Kim, Yohan

dc.contributor.author

Sidney, John

dc.contributor.author

Buus, Søren

dc.contributor.author

Sette, Alessandro

dc.contributor.author

Nielsen, Morten

dc.contributor.author

Peters, Bjoern

dc.date.available

2017-06-12T15:50:21Z

dc.date.issued

2014-07

dc.identifier.citation

Kim, Yohan; Sidney, John; Buus, Søren; Sette, Alessandro; Nielsen, Morten; et al.; Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions; BioMed Central; BMC Bioinformatics; 15; 241; 7-2014; 1-9

dc.identifier.issn

1471-2105

dc.identifier.uri

http://hdl.handle.net/11336/17977

dc.description.abstract

BACKGROUND: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set.
RESULTS: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.
CONCLUSION: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
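The two evaluation approaches named in the abstract can be contrasted with a minimal Python sketch. Everything here is an illustrative assumption rather than the authors' pipeline: the helper names (`k_fold_indices`, `cross_validated_auc`, `blind_auc`), the choice of AUC as the metric, the pooling of held-out scores across folds, and the generic `fit`/`predict` callables standing in for any binding predictor.

```python
from typing import Any, Callable, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Partition indices 0..n-1 into k folds and return (train, test) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (binder, non-binder) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X: Sequence[Any], y: Sequence[int], k: int,
                        fit: Callable, predict: Callable) -> float:
    """Score every item with a model that never saw it, then pool the scores."""
    scores = [0.0] * len(X)
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            scores[i] = predict(model, X[i])
    return auc(scores, y)


def blind_auc(X_train: Sequence[Any], y_train: Sequence[int],
              X_blind: Sequence[Any], y_blind: Sequence[int],
              fit: Callable, predict: Callable) -> float:
    """Train once on all benchmark data, then score a separately generated set."""
    model = fit(X_train, y_train)
    return auc([predict(model, x) for x in X_blind], y_blind)
```

Comparing the two returned values for the same `fit`/`predict` pair gives the kind of cross-validated vs. blind gap the abstract discusses; the sketch says nothing about how large that gap is for any real predictor.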

dc.format

application/pdf

dc.language.iso

eng

dc.publisher

BioMed Central

dc.rights

info:eu-repo/semantics/openAccess

dc.rights.uri

https://creativecommons.org/licenses/by/2.5/ar/

dc.subject

Benchmarking of MHC Class I Predictors

dc.subject

Epitope Prediction

dc.subject

Sequence Similarity

dc.subject

Cross-Validation

dc.subject.classification

Other Computer and Information Sciences

dc.subject.classification

Computer and Information Sciences

dc.subject.classification

NATURAL AND EXACT SCIENCES

dc.title

Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

dc.type

info:eu-repo/semantics/article

dc.type

info:ar-repo/semantics/artículo

dc.type

info:eu-repo/semantics/publishedVersion

dc.date.updated

2017-06-09T15:01:05Z

dc.journal.volume

15

dc.journal.number

241

dc.journal.pagination

1-9

dc.journal.pais

United Kingdom

dc.journal.ciudad

London

dc.description.fil

Affiliation: Kim, Yohan. La Jolla Institute for Allergy and Immunology; United States

dc.description.fil

Affiliation: Sidney, John. La Jolla Institute for Allergy and Immunology; United States

dc.description.fil

Affiliation: Buus, Søren. University of Copenhagen; Denmark

dc.description.fil

Affiliation: Sette, Alessandro. La Jolla Institute for Allergy and Immunology; United States

dc.description.fil

Affiliation: Nielsen, Morten. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas. Universidad Nacional de San Martín. Instituto de Investigaciones Biotecnológicas; Argentina. Technical University of Denmark; Denmark

dc.description.fil

Affiliation: Peters, Bjoern. La Jolla Institute for Allergy and Immunology; United States

dc.journal.title

BMC Bioinformatics

dc.relation.alternativeid

info:eu-repo/semantics/altIdentifier/doi/http://dx.doi.org/10.1186/1471-2105-15-241

dc.relation.alternativeid

info:eu-repo/semantics/altIdentifier/url/https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-241

Associated files

Size: 661.3Kb

Format: PDF