CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent

doi:https://doi.org/10.48550/arXiv.2504.16264

Article

Publication date: 04/2025

Publisher: Cornell University

Journal: arXiv.org

ISSN: 2331-8422

Language: English

Resource type: Published article

Subject classification:

Other Computer and Information Sciences

Abstract

Cross-lingual information retrieval (CLIR) consists of finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground-truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.

Keywords: CROSS-LINGUAL INFORMATION RETRIEVAL, ACADEMIC SEARCH, MULTILINGUAL EMBEDDINGS, MACHINE TRANSLATION, EVALUATION RESOURCES

Associated files

Size: 860.5 KB

Format: PDF

License

Except where explicitly stated otherwise, this item is published under the following license: Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Unported (CC BY-NC-SA 2.5)

Identifiers

URI: http://hdl.handle.net/11336/274494

URL: https://arxiv.org/abs/2504.16264

DOI: https://doi.org/10.48550/arXiv.2504.16264

Collections

Articles (ICC)
Articles from INSTITUTO DE INVESTIGACION EN CIENCIAS DE LA COMPUTACION

Citation

Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12
