Global ML-ready dataset for mining areas in satellite images

Jasansky, S., Maus, V. (ORCID: https://orcid.org/0000-0002-7385-4723), Popa, M., & Wilbik, A. (2024). Global ML-ready dataset for mining areas in satellite images. DOI: 10.5281/zenodo.14195737.

Full text not available from this repository.

Abstract

This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains references to Sentinel-2 satellite images and the corresponding mining area masks and bounding boxes for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023) and were manually verified to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.

The dataset includes three mask variants:

Masks exclusively from Maus et al. (n=1,090)
Masks exclusively from Tang et al. (n=817)
A preferred mask selected from either Maus or Tang based on alignment quality determined during manual review (n=1,210)

Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.

Furthermore, dataset quality was assessed by manually re-checking the test set tiles and correcting any mismatches between the mining polygons and the mining area visible in the images, resulting in the following estimated quality metrics:

Metric      Combined   Maus    Tang
Accuracy    99.78      99.74   99.83
Precision   99.22      99.20   99.24
Recall      95.71      96.34   95.10

Note that the dataset does not contain the Sentinel-2 images themselves, only references to specific Sentinel-2 images. For any ML application, the images must therefore be downloaded and stored first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and filterable via STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, because each mask is tied to a specific timestamp, the dataset can be combined with other imagery sources from the indicated timestamp, such as Landsat or other high-resolution imagery.
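
As a rough illustration of the STAC-based retrieval mentioned above, the following sketch builds a STAC Item Search request for the sentinel-2-l2a collection and posts it to the Planetary Computer STAC endpoint. It is not the authors' download script. The function names, the bounding-box and date inputs, and the idea of searching a small date window around the referenced timestamp are assumptions made for illustration; the fields actually stored in the dataset may differ.

```python
import json
import urllib.request
from datetime import date, timedelta

# Public STAC API search endpoint of Microsoft's Planetary Computer.
STAC_SEARCH_URL = "https://planetarycomputer.microsoft.com/api/stac/v1/search"


def build_search_body(bbox, acquired, window_days=0, limit=10):
    """Build a STAC Item Search request body for Sentinel-2 L2A scenes.

    bbox: (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees (hypothetical input).
    acquired: datetime.date of the referenced image (hypothetical input).
    window_days: number of days to search on either side of that date.
    """
    start = acquired - timedelta(days=window_days)
    end = acquired + timedelta(days=window_days)
    return {
        "collections": ["sentinel-2-l2a"],
        "bbox": list(bbox),
        # Closed RFC 3339 interval covering the full start and end days.
        "datetime": f"{start.isoformat()}T00:00:00Z/{end.isoformat()}T23:59:59Z",
        "limit": limit,
    }


def search_items(body, timeout=30):
    """POST the search body and return the matching STAC items (needs network access)."""
    request = urllib.request.Request(
        STAC_SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["features"]
```

A call such as search_items(build_search_body((10.0, 45.0, 10.5, 45.5), date(2020, 3, 15))) would return the matching items for a made-up location and date. Asset links returned by Planetary Computer typically need to be signed before the image files can be downloaded, which this sketch does not cover.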

Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
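
The chipping step can be sketched in a few lines. The example below is a minimal, dependency-free illustration of cutting a square tile into non-overlapping chips and is not taken from the linked repository. The function names are hypothetical, and it assumes the chip size divides the tile size evenly, as 512 does for 2048.

```python
def chip_windows(tile_size=2048, chip_size=512):
    """Return (row_start, col_start, row_stop, col_stop) windows tiling a square image."""
    if tile_size % chip_size != 0:
        raise ValueError("chip_size must divide tile_size evenly")
    return [
        (row, col, row + chip_size, col + chip_size)
        for row in range(0, tile_size, chip_size)
        for col in range(0, tile_size, chip_size)
    ]


def split_into_chips(tile, chip_size=512):
    """Split a square 2D image (a list of pixel rows) into non-overlapping chips."""
    windows = chip_windows(len(tile), chip_size)
    return [
        [pixel_row[c0:c1] for pixel_row in tile[r0:r1]]
        for r0, c0, r1, c1 in windows
    ]
```

With the default sizes, chip_windows() yields 16 windows per tile in row-major order. The same windows can be applied to the image and its mask so that each chip stays aligned with its label.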

A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.

Item Type: Data
Additional Information: Creative Commons Attribution Share Alike 4.0 International
Research Programs: Advancing Systems Analysis (ASA)
Advancing Systems Analysis (ASA) > Novel Data Ecosystems for Sustainability (NODES)
Depositing User: Luke Kirwan
Date Deposited: 09 Jan 2025 15:19
Last Modified: 09 Jan 2025 15:19
URI: https://pure.iiasa.ac.at/20321
