An oversampling-undersampling strategy for large-scale data linkage

Hassani, H., Entezarian, M.R., Zaeimzadeh, S., Marvian, L., & Komendantova, N. (ORCID: https://orcid.org/0000-0003-2568-6179) (2025). An oversampling-undersampling strategy for large-scale data linkage. Frontiers in Big Data 8. DOI: 10.3389/fdata.2025.1542483.

fdata-1-1542483.pdf - Published Version
Available under License Creative Commons Attribution.


Abstract

Effective record linkage in big data is critical yet highly challenging, particularly when the datasets are imbalanced. This article applies an oversampling-undersampling strategy to address linkage imbalance, enabling more accurate and efficient record linkage within large-scale datasets. The strategy increases the number of minority-class instances and reduces the dominance of the majority classes, yielding a more balanced dataset for training and testing. Sensitivity testing was carried out by varying the training-test ratio and the degree of imbalance.
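
The resampling idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say which resampling technique the article uses, so this example assumes plain random resampling: classes with too few candidate pairs are oversampled with replacement, and classes with too many are undersampled without replacement. The function name `rebalance`, the parameter `n_per_class`, and the toy "match" / "non-match" labels are hypothetical and are not taken from the article.

```python
import random
from collections import Counter


def rebalance(records, labels, n_per_class, seed=0):
    """Return a resampled copy of the data with n_per_class items per label.

    Assumption: simple random resampling, not necessarily the article's method.
    """
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)

    out_records, out_labels = [], []
    for lab, items in by_class.items():
        if len(items) >= n_per_class:
            # Undersample: draw n_per_class items without replacement.
            chosen = rng.sample(items, n_per_class)
        else:
            # Oversample: keep every original, then add random duplicates.
            extra = rng.choices(items, k=n_per_class - len(items))
            chosen = items + extra
        out_records.extend(chosen)
        out_labels.extend([lab] * n_per_class)
    return out_records, out_labels


# Toy candidate pairs: 20 true matches among 1000 pairs (made-up data).
pairs = [("a%d" % i, "b%d" % i) for i in range(1000)]
labels = ["match" if i < 20 else "non-match" for i in range(1000)]

balanced_pairs, balanced_labels = rebalance(pairs, labels, n_per_class=100)
print(Counter(labels))           # class counts before resampling
print(Counter(balanced_labels))  # class counts after resampling
```

Varying `n_per_class` separately per class, or changing the share of matches in the toy data, would be one way to mimic the sensitivity testing over the degree of imbalance that the abstract mentions.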

Item Type: Article
Research Programs: Advancing Systems Analysis (ASA)
Advancing Systems Analysis (ASA) > Cooperation and Transformative Governance (CAT)
Depositing User: Luke Kirwan
Date Deposited: 12 May 2025 07:41
Last Modified: 12 May 2025 07:41
URI: https://pure.iiasa.ac.at/20573
