dc.contributor.author
Valentini, Francisco Tomás
dc.contributor.author
Kozlowski, Diego
dc.contributor.author
Lariviere, Vincent
dc.date.available
2025-10-31T13:56:08Z
dc.date.issued
2025-04
dc.identifier.citation
Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12
dc.identifier.issn
2331-8422
dc.identifier.uri
http://hdl.handle.net/11336/274494
dc.description.abstract
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
dc.format
application/pdf
dc.language.iso
eng
dc.publisher
Cornell University
dc.rights
info:eu-repo/semantics/openAccess
dc.rights.uri
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.subject
CROSS-LINGUAL INFORMATION RETRIEVAL
dc.subject
ACADEMIC SEARCH
dc.subject
MULTILINGUAL EMBEDDINGS
dc.subject
MACHINE TRANSLATION
dc.subject
EVALUATION RESOURCES
dc.subject.classification
Other Computer and Information Sciences
dc.subject.classification
Computer and Information Sciences
dc.subject.classification
NATURAL AND EXACT SCIENCES
dc.title
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
dc.type
info:eu-repo/semantics/article
dc.type
info:ar-repo/semantics/artículo
dc.type
info:eu-repo/semantics/publishedVersion
dc.date.updated
2025-10-31T10:34:26Z
dc.journal.pagination
1-12
dc.journal.pais
United States
dc.description.fil
Fil: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
dc.description.fil
Fil: Kozlowski, Diego. University of Montreal; Canada
dc.description.fil
Fil: Lariviere, Vincent. University of Montreal; Canada
dc.journal.title
arXiv.org
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/doi/https://doi.org/10.48550/arXiv.2504.16264