dc.contributor.author
Valentini, Francisco Tomás  
dc.contributor.author
Kozlowski, Diego  
dc.contributor.author
Lariviere, Vincent  
dc.date.available
2025-10-31T13:56:08Z  
dc.date.issued
2025-04  
dc.identifier.citation
Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12  
dc.identifier.issn
2331-8422  
dc.identifier.uri
http://hdl.handle.net/11336/274494  
dc.description.abstract
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.  
dc.format
application/pdf  
dc.language.iso
eng  
dc.publisher
Cornell University  
dc.rights
info:eu-repo/semantics/openAccess  
dc.rights.uri
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/  
dc.subject
CROSS-LINGUAL INFORMATION RETRIEVAL  
dc.subject
ACADEMIC SEARCH  
dc.subject
MULTILINGUAL EMBEDDINGS  
dc.subject
MACHINE TRANSLATION  
dc.subject
EVALUATION RESOURCES  
dc.subject.classification
Other Computer and Information Sciences  
dc.subject.classification
Computer and Information Sciences  
dc.subject.classification
NATURAL AND EXACT SCIENCES  
dc.title
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents  
dc.type
info:eu-repo/semantics/article  
dc.type
info:ar-repo/semantics/artículo  
dc.type
info:eu-repo/semantics/publishedVersion  
dc.date.updated
2025-10-31T10:34:26Z  
dc.journal.pagination
1-12  
dc.journal.pais
United States  
dc.description.fil
Fil: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina  
dc.description.fil
Fil: Kozlowski, Diego. University of Montreal; Canada  
dc.description.fil
Fil: Lariviere, Vincent. University of Montreal; Canada  
dc.journal.title
arXiv.org  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264  
dc.relation.alternativeid
info:eu-repo/semantics/altIdentifier/doi/https://doi.org/10.48550/arXiv.2504.16264