Renhart, A. (2025). EU Regional Policy Payments 2007-2013. 10.5281/zenodo.16894218.
Full text not available from this repository.
Abstract
This dataset contains aggregated, temporally and spatially explicit data compiled from various national and subnational databases, covering the period from 2007 to 2013. The project-level data is presented in aggregate form. The dataset consists of six files, each corresponding to a different level of NUTS coding (NUTS 1-3) according to the 2016 NUTS specification.
Each file includes the following columns:
Identifiers:
NUTS Code: The unique identifier for the NUTS (2016) region
Year: The starting year of the projects considered for aggregation.
Category: Broad intervention field (category) of policy payment; 5 categories.
Variables:
Total eligible expenditure: The (imputed) monetary amount of funding that could be granted. The total eligible expenditure is usually larger than the realized policy payments. It can be interpreted as an upper bound. All values are expressed in Euro at current prices.
The temporal dimension is yearly, ranging from 2007 to 2013. Some observations carry a later date due to the N+3 rule. The spatial dimension is identified by NUTS codes (2016), with granularity ranging from level 1 to level 3.
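NUTS codes are hierarchical: a two-letter country prefix followed by one character per level, so a NUTS 3 code contains its NUTS 2 and NUTS 1 parents as prefixes. The following is a minimal sketch of how rows keyed by NUTS code, year, and category could be rolled up to a coarser level. The row layout, the function names, and the amounts are hypothetical and are not taken from the dataset files.

```python
from collections import defaultdict


def nuts_level(code: str) -> int:
    """Infer the NUTS level (1-3) from the code length.

    A NUTS code is a 2-letter country prefix plus one character per level.
    """
    level = len(code) - 2
    if level not in (1, 2, 3):
        raise ValueError(f"not a NUTS 1-3 code: {code!r}")
    return level


def aggregate_to_level(rows, level):
    """Sum amounts over (parent NUTS code, year, category).

    `rows` is an iterable of (nuts_code, year, category, amount) tuples;
    this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for code, year, category, amount in rows:
        if nuts_level(code) < level:
            raise ValueError(f"{code!r} is coarser than NUTS {level}")
        parent = code[: 2 + level]  # truncate to the parent region's code
        totals[(parent, year, category)] += amount
    return dict(totals)


# Made-up amounts in EUR, for illustration only.
rows = [
    ("AT130", 2008, "transport", 1_000_000.0),
    ("AT127", 2008, "transport", 250_000.0),
    ("AT211", 2008, "transport", 500_000.0),
]

nuts2 = aggregate_to_level(rows, 2)
nuts1 = aggregate_to_level(rows, 1)
```

Summing within a category is straightforward, but because a project can fall into more than one category, adding amounts across categories may double-count; that should be checked against the files before relying on such totals.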
Each project can fall into one or more categories (BIF):
R&D, innovation (rd_innov)
ICT infrastructure (ict)
Productive investment and business development (prod_business)
Energy infrastructure (energy)
Environmental infrastructure and environment (environ)
Culture, heritage, and tourism (cultur_heritage_tour)
Urban and territorial dimension (urban_territor)
Transport infrastructure (transport)
Quality employment and labour mobility (lab_market)
Social inclusion
Social, health and education infrastructure (soc_infra)
Education and training (educ_train)
Technical assistance and institutional capacity (ta_inst_cap_other)
The projects are categorized into 13 categories of intervention based on the available post-2013 data, where this categorization is present. The main features for prediction are derived from the project descriptions, which were translated into English using machine translation. First, the words contained in the text documents are tokenized and stemmed. Next, stop words, numbers, punctuation, separators, and hyphens are removed, as well as tokens occurring fewer than 50 times within and across all documents or fewer than 40 times across all documents, leaving roughly 4 000 tokens to be used as document features. In addition to the document features, three other variables that are readily available in both data sets are used: the amount of Euros allocated to the project (adjusted for differences over time), the type of fund (ESF, ERDF, …), and the country of implementation.
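The text preprocessing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used for the dataset: the stop-word list is a tiny placeholder, `naive_stem` is a crude stand-in for a real stemmer, stop words are dropped before stemming for simplicity, and a single corpus-wide count threshold (`min_count`) replaces the thresholds of 50 and 40 quoted in the text.

```python
import re
from collections import Counter

# Tiny placeholder list; a real pipeline would use a full stop-word list.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "to"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a proper stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stop words, then stem.

    Keeping only [a-z]+ runs discards numbers, punctuation, and separators,
    and splits hyphenated words at the hyphen.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def build_vocabulary(docs, min_count):
    """Keep tokens that occur at least `min_count` times across all documents."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return sorted(tok for tok, n in counts.items() if n >= min_count)


def document_features(doc, vocabulary):
    """Return the count of each vocabulary token in one document."""
    counts = Counter(tokenize(doc))
    return [counts[tok] for tok in vocabulary]


# Made-up project descriptions, for illustration only.
docs = [
    "Building the new road and bridges in 2009",
    "Road building for rural areas",
    "Training courses for road workers",
]
vocabulary = build_vocabulary(docs, min_count=2)
features = [document_features(d, vocabulary) for d in docs]
```

Each row of `features` is the count vector for one description. In a setup like the one described, such vectors would be combined with the funding amount, fund type, and country before being passed to the classifier.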
The document features and the three additional variables are used as inputs for a random forest trained on a random subsample of 50 000 observations of the available post-2013 data. Out-of-sample accuracy of the final model is 56 percent, with the largest category making up around 28 percent of observations, resulting in a Kappa value of 0.44. Because the distribution of categories is very imbalanced (some categories make up less than 1 percent of all observations), sensitivity and specificity vary widely across categories, and sensitivity for the smaller categories is often very low. Also, since the text descriptions in the data before 2013 are of poorer quality, one must expect the model to perform worse in practice. Some additional checks were performed, comparing the results to the distribution of counts in categories post 2013 and the distribution of project sums before 2013. Model scores are tweaked to overweight smaller categories so that the two distributions resemble each other more closely.
The underlying project-level data on EU regional funds contains variables on the project itself (title, description, location, and project end and/or start date), the project’s beneficiary (name, location), the policy area to which the project contributes, and monetary information (type of fund used, co-financing rate, paid sums, eligible costs, etc.). Substantial effort was made to manually check
The dataset contains observations for all EU member states and, in the case of Interreg projects, for some states neighboring the EU. Please note that this dataset is intended for research and analysis in the fields of climatology, environmental science, and related disciplines. Users are encouraged to cite this dataset appropriately if utilized in academic or scientific publications.
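The accuracy and Kappa figures reported for the classifier are linked by the standard definition of Cohen's kappa, which compares observed agreement with the agreement expected by chance from the class marginals. The sketch below computes kappa from a confusion matrix and, as a back-of-the-envelope check, derives the chance agreement implied by an accuracy of 0.56 and a Kappa of 0.44. The confusion matrix is made up for illustration, and the function names are hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: sum over classes of P(true = c) * P(predicted = c).
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


def implied_chance_agreement(accuracy, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    return (accuracy - kappa) / (1 - kappa)


# Made-up three-class confusion matrix, for illustration only.
toy_confusion = [
    [40, 5, 5],
    [10, 20, 10],
    [5, 5, 0],
]
toy_kappa = cohen_kappa(toy_confusion)

# Chance agreement implied by the figures quoted in the text.
p_e = implied_chance_agreement(accuracy=0.56, kappa=0.44)
print(f"toy kappa = {toy_kappa:.3f}, implied chance agreement = {p_e:.3f}")
```

The implied chance agreement (about 0.21) is lower than the 28 percent share of the largest category because kappa uses the product of true and predicted class shares rather than a majority-class baseline.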
| Item Type: | Data |
|---|---|
| Additional Information: | Creative Commons Attribution 4.0 International |
| Research Programs: | Biodiversity and Natural Resources (BNR); Biodiversity and Natural Resources (BNR) > Integrated Biosphere Futures (IBF) |
| Depositing User: | Luke Kirwan |
| Date Deposited: | 09 Jan 2026 10:37 |
| Last Modified: | 09 Jan 2026 10:37 |
| URI: | https://pure.iiasa.ac.at/21187 |