Jasansky, S., Maus, V. ORCID: https://orcid.org/0000-0002-7385-4723, Popa, M., & Wilbik, A. (2024). Global ML-ready dataset for mining areas in satellite images. 10.5281/zenodo.14195737.
Full text not available from this repository.
Abstract
This dataset is a global resource for machine learning applications in mining area detection and semantic segmentation on satellite imagery. It contains references to Sentinel-2 satellite images, together with the corresponding mining area masks and bounding boxes, for 1,210 sites worldwide. Ground-truth masks are derived from Maus et al. (2022) and Tang et al. (2023) and were manually verified to ensure accurate alignment with Sentinel-2 imagery from specific timestamps.
The dataset includes three mask variants:
Masks exclusively from Maus et al. (n=1,090)
Masks exclusively from Tang et al. (n=817)
A preferred mask selected from either Maus or Tang based on alignment quality determined during manual review (n=1,210).
Each tile corresponds to a 2048x2048 pixel Sentinel-2 image, with metadata on mine type (surface, placer, underground, brine & evaporation) and scale (artisanal, industrial). For convenience, the preferred mask dataset is already split into training (75%), validation (15%), and test (10%) sets.
Furthermore, dataset quality was assessed by manually re-checking the test set tiles and correcting any mismatches between the mining polygons and the mining area visible in the images, resulting in the following estimated quality metrics:
Metric (%) | Combined | Maus | Tang |
---|---|---|---|
Accuracy | 99.78 | 99.74 | 99.83 |
Precision | 99.22 | 99.20 | 99.24 |
Recall | 95.71 | 96.34 | 95.10 |
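The record does not spell out how these figures were computed. As a minimal sketch, assuming they are pixel-wise scores that compare each dataset mask against its manually corrected version (treated as the reference), they could be derived as shown here. The function name and the flat 0/1 mask representation are illustrative assumptions and are not taken from the dataset's source code.

```python
from typing import Dict, Sequence


def mask_quality_metrics(
    dataset_mask: Sequence[int], corrected_mask: Sequence[int]
) -> Dict[str, float]:
    """Pixel-wise accuracy, precision and recall of a mask against a reference.

    Both masks are flat sequences of 0 (background) and 1 (mining area).
    ASSUMPTION: the manually corrected mask is treated as ground truth.
    """
    if len(dataset_mask) != len(corrected_mask):
        raise ValueError("masks must have the same number of pixels")

    tp = fp = fn = tn = 0
    for labelled, truth in zip(dataset_mask, corrected_mask):
        if labelled and truth:
            tp += 1  # mining pixel labelled correctly
        elif labelled and not truth:
            fp += 1  # labelled as mining, but corrected to background
        elif truth:
            fn += 1  # true mining pixel missing from the dataset mask
        else:
            tn += 1  # background in both masks

    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Under this reading, precision above 99% combined with recall around 95-96% would mean that labelled mining pixels are almost always correct, while a small share of the true mining area is missing from the masks.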
Note that the dataset does not contain the Sentinel-2 images themselves, only references to specific Sentinel-2 images. For any ML application, the images must therefore be downloaded and stored first. For example, Sentinel-2 imagery is available from Microsoft's Planetary Computer and can be filtered via its STAC API: https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a. Additionally, because each mask is tied to a specific timestamp, the data can be combined with other imagery sources from the same date, such as Landsat or other high-resolution imagery.
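As a rough illustration of the retrieval step, the sketch below builds a STAC item-search request for the `sentinel-2-l2a` collection on the Planetary Computer, using only the Python standard library. The bounding box and dates in the usage comment are hypothetical placeholders, and the helper names are assumptions; the scripts in the dataset's repository may work differently.

```python
import json
from typing import Any, Dict, List, Sequence
from urllib import request

# Public STAC search endpoint of Microsoft's Planetary Computer.
STAC_SEARCH_URL = "https://planetarycomputer.microsoft.com/api/stac/v1/search"
COLLECTION = "sentinel-2-l2a"


def build_search_payload(
    bbox: Sequence[float], start: str, end: str, limit: int = 10
) -> Dict[str, Any]:
    """Build the JSON body of a STAC item search.

    bbox is (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees;
    start and end are RFC 3339 timestamps such as "2019-06-01T00:00:00Z".
    """
    return {
        "collections": [COLLECTION],
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",
        "limit": limit,
    }


def search_items(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """POST the search and return the matching STAC items (GeoJSON features)."""
    req = request.Request(
        STAC_SEARCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["features"]


# Usage (requires network access; coordinates and dates are placeholders):
# payload = build_search_payload(
#     (10.0, 45.0, 10.2, 45.2), "2019-06-01T00:00:00Z", "2019-06-30T23:59:59Z"
# )
# for item in search_items(payload):
#     print(item["id"], item["properties"]["datetime"])
```

The search only returns item metadata. Asset URLs on the Planetary Computer generally need to be signed before the image bands can be downloaded.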
Source code used to generate this dataset and to use it for ML model training is available at https://github.com/SimonJasansky/mine-segmentation. It includes useful Python scripts, e.g. to download Sentinel-2 images via STAC API, or to divide tile images (2048x2048px) into smaller chips (e.g. 512x512px).
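The chipping step mentioned above can be sketched as follows. This is a minimal illustration that assumes square, non-overlapping chips by default; `chip_windows` is a hypothetical helper and is not the function used in the linked repository.

```python
from typing import Iterator, Optional, Tuple

# (row_offset, col_offset, height, width) of one chip inside a tile
Window = Tuple[int, int, int, int]


def chip_windows(
    tile_size: int = 2048, chip_size: int = 512, stride: Optional[int] = None
) -> Iterator[Window]:
    """Yield the pixel windows that divide a square tile into square chips.

    With the default stride (equal to chip_size) the chips do not overlap.
    A smaller stride produces overlapping chips.
    """
    if stride is None:
        stride = chip_size
    if chip_size <= 0 or stride <= 0 or chip_size > tile_size:
        raise ValueError("chip_size and stride must be positive and fit the tile")

    # Only emit windows that lie fully inside the tile.
    last_offset = tile_size - chip_size
    for row in range(0, last_offset + 1, stride):
        for col in range(0, last_offset + 1, stride):
            yield (row, col, chip_size, chip_size)


windows = list(chip_windows())  # 4 x 4 grid of 512x512 px chips
# With a NumPy image and mask, each chip would be image[r:r + h, c:c + w]
# and mask[r:r + h, c:c + w] for (r, c, h, w) in windows.
```

Applying the same windows to the image and to its mask keeps each chip aligned with its label.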
A database schema, a schematic depiction of the dataset generation process, and a map of the global distribution of tiles are provided in the accompanying images.
Item Type: | Data |
---|---|
Additional Information: | Creative Commons Attribution Share Alike 4.0 International |
Research Programs: | Advancing Systems Analysis (ASA); Advancing Systems Analysis (ASA) > Novel Data Ecosystems for Sustainability (NODES) |
Depositing User: | Luke Kirwan |
Date Deposited: | 09 Jan 2025 15:19 |
Last Modified: | 09 Jan 2025 15:19 |
URI: | https://pure.iiasa.ac.at/20321 |