Spanish-language text classification for environmental evidence synthesis using multilingual pre-trained models

Berdejo-Espinola, V., Hajas, Á., Cornford, R. ORCID: https://orcid.org/0000-0002-9963-3603, Ye, N., & Amano, T. (2025). Spanish-language text classification for environmental evidence synthesis using multilingual pre-trained models. Environmental Evidence 14 (1), e21. DOI: 10.1186/s13750-025-00370-9.

s13750-025-00370-9.pdf - Published Version
Available under License Creative Commons Attribution.

Abstract

Artificial intelligence (AI) is increasingly being explored as a tool to optimize and accelerate various stages of evidence synthesis. A persistent challenge in environmental evidence syntheses is that they remain predominantly monolingual (English), leading to biased results and misinformed cross-scale policy decisions. AI offers a promising opportunity to incorporate non-English-language evidence into the screening process of evidence syntheses and to help move beyond their current monolingual focus. Using a corpus of Spanish-language peer-reviewed papers on biodiversity conservation interventions, we developed and evaluated text classifiers using supervised machine learning models. Our best-performing model achieved 100% recall, meaning no relevant papers (n = 9) were missed, and filtered out over 70% (n = 867) of negative documents based only on the title and abstract of each paper. The text was encoded using a pre-trained multilingual model, and class weights were used to deal with a highly imbalanced dataset (0.79%). This research therefore offers an approach to reducing the manual, time-intensive effort required for document screening in evidence syntheses, with minimal risk of missing relevant studies. It highlights the potential of multilingual large language models and class weights to train a lightweight non-English-language classifier that can effectively filter irrelevant texts using only a small labelled non-English-language corpus. Future work could build on our approach to develop a multilingual classifier that enables the inclusion of any non-English scientific literature in evidence syntheses.
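
The two quantities reported in the abstract (recall on relevant papers and the share of irrelevant papers filtered out), together with the idea of class weights, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy labels and scores are hypothetical, the function names are invented for illustration, and the weighting formula is the common "balanced" heuristic (n_samples / (n_classes * class_count)), which the paper may or may not have used.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so the rare (relevant) class counts more during training."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def full_recall_threshold(scores, labels):
    """Highest score threshold that still retains every relevant paper."""
    return min(s for s, y in zip(scores, labels) if y == 1)


def screening_metrics(scores, labels, threshold):
    """Return (recall on relevant papers, share of irrelevant papers filtered out)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    recall = sum(s >= threshold for s in positives) / len(positives)
    filtered = sum(s < threshold for s in negatives) / len(negatives)
    return recall, filtered


# Hypothetical toy data (NOT from the paper): 2 relevant, 8 irrelevant papers,
# with made-up classifier scores for each title and abstract.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.1, 0.2, 0.7, 0.6, 0.3, 0.05, 0.4, 0.65, 0.15]

weights = balanced_class_weights(labels)
threshold = full_recall_threshold(scores, labels)
recall, filtered = screening_metrics(scores, labels, threshold)
print(weights, threshold, recall, filtered)
```

In this toy case the relevant class receives four times the weight of the irrelevant class, and the threshold that keeps both relevant papers removes six of the eight irrelevant ones. Choosing the threshold on the same data that is then evaluated is optimistic; in practice it would be set on a separate validation split.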

Item Type: Article
Research Programs: Biodiversity and Natural Resources (BNR)
Biodiversity and Natural Resources (BNR) > Biodiversity, Ecology, and Conservation (BEC)
Depositing User: Luke Kirwan
Date Deposited: 17 Nov 2025 09:04
Last Modified: 17 Nov 2025 09:04
URI: https://pure.iiasa.ac.at/20990
