Article
MessIRve: A Large-Scale Spanish Information Retrieval Dataset
Valentini, Francisco Tomás; Cotik, Viviana Erica; Furman, Damián Ariel; Bercovich, Ivan; Altszyler Lemcovich, Edgar Jaim; Pérez, Juan Manuel

Publication date:
09/2024
Publisher:
Cornell University
Journal:
arXiv
ISSN:
2331-8422
Language:
English
Resource type:
Published article
Abstract
Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.
Collections
Articles (ICC)
Articles of the INSTITUTO DE INVESTIGACION EN CIENCIAS DE LA COMPUTACION
Citation
Valentini, Francisco Tomás; Cotik, Viviana Erica; Furman, Damián Ariel; Bercovich, Ivan; Altszyler Lemcovich, Edgar Jaim; et al.; MessIRve: A Large-Scale Spanish Information Retrieval Dataset; Cornell University; arXiv; 9-2024; 1-13