European Research and Innovation Program
There are many projects in the European Union that are funded by the European Research and Innovation Programme. To make sense of those projects and their results, we have to do lots of data processing and analysis.
On European level, the European Research and Innovation Program (H2020) is the main funding instrument for research and innovation. It is a very large program with a budget of 80 billion euros for the period 2014-2020. The program is divided into different parts, each with its own budget and focus. It is the consecutive program of FP7 (another large European research program).
I have tried to make sense of those research program by analyzing the projects and their results, such as who are funded and what they do. This post will try to give an overview of those projects and construct a more structured dataset from the raw data for further analysis.
I have built a dataset and save it to Kaggle. One can follow this link to download the dataset.
FP7 projects
Each research program has two datasets: projects and organizations. The projects dataset contains information about the projects, such as the title, the abstract, the budget, the start and end date, the project coordinator, the participants, the topics, the call, etc. The organizations dataset contains information about the organizations, such as the name, the country, the type, the address, etc.
The organizations dataset tells who is involved in the projects. The projects dataset tells what is done in the projects. The following code gives a glimpse of the projects dataset for FP7 (the predecessor of H2020).
project <- fread("./cordis/fp7/data/csv/project.csv")
str(project, width = 70, strict.width = "cut")
# Classes ‘data.table’ and 'data.frame': 25785 obs. of 25 variables:
# $ id : int 223925 315871 229760 286978 211948 28306..
# $ acronym : chr "LENVIS" "APT-STEP" "FORUM GMES" "BIOCA"..
# $ status : chr "CLO" "CLO" "CLO" "CLO" ...
# $ title : chr "Localised environmental and health inf"..
# $ startDate : chr "2008-09-01" "2012-10-01" "2008-04-01" "..
# $ endDate : IDate, format: "2012-01-31" ...
# $ totalCost : chr "3131818" "1150921,4" "239699" "1410933"..
# $ ecMaxContribution : chr "2232223" "1150921" "150000" "1004500" ...
# $ legalBasis : chr "FP7-ICT" "FP7-REGPOT" "FP7-SPACE" "FP7"..
# $ topics : chr "ICT-2007.6.3" "REGPOT-2012-2013-1" "SP"..
# $ ecSignatureDate : chr "" "" "" "" ...
# $ frameworkProgramme: chr "FP7" "FP7" "FP7" "FP7" ...
# $ masterCall : chr "" "" "" "" ...
# $ subCall : chr "FP7-ICT-2007-2" "FP7-REGPOT-2012-2013-"..
# $ fundingScheme : chr "CP" "CSA-SA" "CSA-SA" "BSG-SME" ...
# $ nature : chr "" "" "" "" ...
# $ objective : chr "The main goal of the LENVIS project is"..
# $ contentUpdateDate : chr "2019-07-16 12:16:47" "2017-09-24 09:38"..
# $ rcn : chr "87602" "104995" "87337" "100971" ...
# $ grantDoi : chr "" "" "" "" ...
# $ V21 : chr "" "" "" "" ...
# $ V22 : chr "" "" "" "" ...
# $ V23 : chr "" "" "" "" ...
# $ V24 : POSIXct, format: NA ...
# $ V25 : int NA NA NA NA NA NA NA NA NA NA ...
# - attr(*, ".internal.selfref")=<externalptr>
### --------------- Clean the data ------------------ ###
# convert to numberic values
project %>%
.[, totalCost := as.double(
str_replace(totalCost, ",", ".")
)] %>%
.[, ecMaxContribution := as.numeric(
str_replace(ecMaxContribution, ",", ".")
)] %>%
str()
# get start and end year
project %>%
.[, startDate := as.IDate(startDate, "%Y-%m-%d")] %>%
str()
# add project span in months, startYear, endYear
project %>%
.[, projSpan := (interval(startDate, endDate) %/% months(1))] %>%
str()
project %>%
.[, startYear := year(startDate)] %>%
.[, endYear := year(endDate)] %>%
str()
# legalBasis group
project %>%
.[, .N, by = legalBasis] %>%
.[N > 10] %>%
.[order(-rank(N))] %>%
head(10) %>%
kable("pipe")
The variable legalBasis
tells us the theme of the project, for instance, ‘FP7-SPACE’ is about space technology. The following table gives an overview of the themes.
Legal Basis | No. |
---|---|
FP7-PEOPLE | 11103 |
FP7-IDEAS-ERC | 4553 |
FP7-ICT | 2325 |
FP7-SME | 1036 |
FP7-HEALTH | 1003 |
FP7-JTI | 807 |
FP7-NMP | 805 |
FP7-TRANSPORT | 720 |
FP7-KBBE | 516 |
FP7-ENVIRONMENT | 493 |

The above table shows that the most popular theme is FP7-PEOPLE, which is about creating a more human society. The second most popular theme is FP7-IDEAS-ERC, which is about basic research. The third most popular theme is FP7-ICT, which is about information and communication technology. The fourth most popular theme is FP7-SME, which is about small and medium enterprises.
In terms of the total grant amount, the most two popular themes are FP7-ICT and FP7-IDEAS-ERC, which receive around 7.7 and 7.8 billion euros respectively. Figure 1 was produced by the following code.
options(repr.plot.width = 9, repr.plot.height = 7)
project %>%
.[grep("FP", legalBasis)] %>%
.[, .(budget = sum(ecMaxContribution)/1e6), by = legalBasis] %>%
.[, budget2 := as.character(round(budget, 0))] %>%
.[, budgetIndex := paste(legalBasis, budget2, sep = "\n")] %>%
treemap(index = "budgetIndex",
vSize = "budget",
vColor = "budget",
type = "manual",
palette = "Blues",
algorithm = "squarified",
title = "Total EU Contribution in different programmes",
title.legend = "million",
fontsize.labels = c(10), border.lwds = c(1, 1))
Since there are many themes, we will focus on ICT, which is the one that receives the most grant amount. There are 2325 projects in the ICT theme, and the average project span is around 36 months (three years). The average grant amount is 4.8 million euros. However, the maximum grant amount is 74.98 million euros. The following table gives the summary statistics of the grant amount in the ICT theme, which shows that the grant amount is highly skewed.
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
0.12 | 2.59 | 3.84 | 4.79 | 5.16 | 74.98 |
Like many variables in the business world, only the log of the grant amount is normally distributed. The following figure shows the distribution of the log of the grant amount in the ICT theme.

Here is the code to produce the above table and figure.
project %>%
.[legalBasis == "FP7-ICT"] %>%
unique(by = "id") %>%
with(summary(totalCost)/1000000) %>%
as.matrix() %>% t() %>%
as.data.frame() %>%
kable("pipe", align = "c", digits = 2)
options(repr.plot.width = 9, repr.plot.height = 5)
par(mfrow = c(1, 2))
hist(project$totalCost/1000000, breaks = 30, col = "grey",
xlab = "Total Cost (million)",
main = "Histogram of Total Cost")
hist(log(project$totalCost), breaks = 30, col = "grey",
xlab = "Log(Total Cost)",
main = "Histogram Log(Total Cost)")
Now, we will try to understand the topic of the projects in the ICT theme. The following table gives the top 10 words in the project titles in the ICT theme and wordclouds of the project titles in the ICT theme.
word | N |
---|---|
systems | 212 |
based | 157 |
future | 130 |
internet | 123 |
european | 122 |
energy | 119 |
research | 118 |
management | 113 |
data | 106 |
services | 104 |

As we can see that the top frequency words and wordclouds are not very informative. We will try to use the topic model to understand the topic of the projects in the ICT theme. The following code cleans the project objectives and gives the summary statistics of the number of words in the project objectives.
project %>%
.[legalBasis == "FP7-ICT"] %>%
unique(by = "id") %>%
.[, .(objective)] %>%
# remove punctuation
.[, objective := gsub("[[:punct:]]", "", objective)] %>%
# remove control characters
.[, objective := gsub("[[:cntrl:]]", "", objective)] %>%
# to lower case
.[, objective := tolower(objective)] %>%
# remove numbers
.[, objective := gsub("[[:digit:]]", "", objective)] %>%
# remove stop words with tm package
.[, objective := removeWords(objective, stopwords("english"))] %>%
# strip white spaces
.[, objective := stripWhitespace(objective)] %>%
# stemming
.[, objective := stemDocument(objective)] %>%
# now we have the clean text data
# we need to convert it to a document-term matrix
# add document length
.[, doc_length := str_count(objective, "\\S+")] -> project_doc
summary(project_doc$doc_length)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 28.0 166.0 177.0 174.8 186.0 318.0
The shortest project objective has 28 words, and the longest one has 318 words. The average project objective has 175 words. The following code creates a document-term matrix and a term-document matrix. We first drop the documents with less than 100 words in order to reduce the sparsity of the document-term matrix.
# create a document-term matrix
# we will filter out documents with more than 100 words
# create document index
project_doc %>%
.[doc_length >= 100] %>%
.[, doc_id := .I] %>%
# create a document-term matrix
.[, .(doc_id, objective)] %>%
unnest_tokens(word, objective) %>%
.[, .N, by = c("doc_id", "word")] %>%
.[order(doc_id, -N)] %>%
as.data.frame() %>%
cast_dtm(doc_id, word, N) -> dtm_objective
# understand the document-term matrix
inspect(dtm_objective)
# <<DocumentTermMatrix (documents: 2265, terms: 22694)>>
# Non-/sparse entries: 272039/51129871
# Sparsity : 99%
# Maximal term length: 56
# Weighting : term frequency (tf)
# selected sample docs and terms
# Terms
# Docs data develop network project research servic system technolog use will
# 103 0 3 0 4 0 0 3 4 1 13
# 145 0 1 0 4 0 0 4 0 3 3
# 1722 0 1 1 4 0 0 7 3 8 8
# 1794 0 1 0 1 0 16 6 0 3 3
# 1811 1 1 1 1 0 5 1 1 1 0
# 1925 5 2 1 1 0 2 3 1 2 4
# 1948 7 1 0 3 6 0 5 0 3 8
# 23 0 0 0 2 0 0 3 0 5 6
# 715 2 1 2 3 9 1 0 2 0 4
# 727 0 2 2 2 0 0 4 4 3 8
We have 2265 documents and 22694 terms in the document-term matrix. The table above shows the selected sample of the document-term matrix, which is a sparse matrix. Now, we will remove the terms that appear in less than 5 documents and more than 90% of the documents. Doing so will give us a collection of terms that are more informative, which balances the uniqueness and the commonality of the terms (i.e., the terms that are too unique or too common are not informative).
word_term <- findFreqTerms(dtm_objective,
lowfreq = 5,
highfreq = nrow(dtm_objective) * 0.9)
# create a new document-term matrix with the filtered words
dtm_objective_filtered <- dtm_objective[, word_term]
dtm_objective_filtered
# <<DocumentTermMatrix (documents: 2265, terms: 4851)>>
# Non-/sparse entries: 232500/10755015
# Sparsity : 98%
# Maximal term length: 20
# Weighting : term frequency (tf)
After dropping the terms that are too unique or too common, we have 4851 terms in the document-term matrix. The term-document matrix also becomes less sparse.
Now, we will use the LDA model to understand the topics of the projects in the ICT theme. First, we need to define the number of topics. We will use a package called ldatuning
to find the optimal number of topics.
lda_fit1 <- ldatuning::FindTopicsNumber(dtm_objective_filtered,
topics = seq(2, 10),
metrics = c("CaoJuan2009", "Deveaud2014"),
method = "Gibbs",
control = list(seed = 789),
verbose = TRUE)
FindTopicsNumber_plot(lda_fit1)

From the plot above, we will choose 9 as the optimal number of topics. The following code fits with LDA and plot the topics.
set.seed(789)
lda_fit2 <- LDA(dtm_objective_filtered, k = 9, method = "Gibbs",
control = list(iter = 1000,
verbose = 50))
# get values of theta
options(repr.plot.width = 11, repr.plot.height = 9)
tidy(lda_fit2, matrix = "beta") %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(word = reorder_within(term, beta, topic, sep="")) %>%
ggplot(aes(word, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE, alpha = 0.85) +
coord_flip() +
facet_wrap(~topic, scales = "free") +
labs(x = NULL, y = expression(beta)) +
theme_bw() +
# make x axis labels more readable with bold font
theme(axis.text.y = element_text(face = "bold", size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
By look at words in each topic, we can interpret the topics in Figure 5. For example, topic 1 is about quantum computing, topic 3 is probably about the smart manufacturing with mobile devices, and topic 6 is more about the economic policy on ICT.

I also found topic 4 and 9 are interesting. Topic 4 is about social media and topic 9 is about the platform for digital economy.
Figure 5 shows that Latent Dirichlet Allocation (LDA) is a useful tool to understand the topics of the projects in the ICT theme. The topics are not perfect, but they are useful to understand the projects in a very efficient way (assuming she/he will not read all the project objectives by herself/himself).
With those 9 topics we classified, we can now answer the following questions:
- What are the most popular topics in the ICT theme?
- Which topic receives the most funding?

Figure 6 shows that there are more or less equal number of projects in each topic. The most popular topic is topic 1, which is about quantum computing. The following table shows the top 10 topics in terms of the total cost of the projects and the average cost of the projects in each topic.
topic | totalCost(million) | avrgCost(million) |
---|---|---|
3 | 1747.42 | 5.64 |
1 | 1681.41 | 5.08 |
8 | 1434.17 | 5.22 |
7 | 1329.14 | 4.43 |
4 | 1318.15 | 4.76 |
9 | 1276.01 | 6.51 |
5 | 1102.56 | 5.63 |
6 | 726.44 | 2.34 |
2 | 302.68 | 4.39 |
We can see that European Commission invests more in topic 3, which is about smart manufacturing and energy with mobile devices. In general, topic 3 is all about using ICT to improve the efficiency of the manufacturing, transportation, and energy sectors. The following table gives some sample projects in topic 3.
id | topic | title |
---|---|---|
314328 | 3 | SMART MOBILITY IN SMART CITY |
223937 | 3 | SMART-antenna multimode wireless mesh Network |
257544 | 3 | Smart and Efficient Location, idEntification, and Cooperation Techniques |
287534 | 3 | Semantic Tools for Carbon Reduction in Urban Planning |
314331 | 3 | Optimized electric Drivetrain by INtegration |
H2020 projects
After analyzing the FP7 program, we will do the same analysis for the H2020 program. The H2020 program is a continuation of the FP7 program. The H2020 program has a budget of 80 billion euros for the period 2014-2020.
legalBasis | N | Note |
---|---|---|
H2020-EU.1.3. | 11700 | EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions (MSCA) |
H2020-EU.1.1. | 7792 | EXCELLENT SCIENCE - European Research Council (ERC) |
H2020-EU.2.3. | 3162 | INDUSTRIAL LEADERSHIP - Innovation In SMEs |
H2020-EU.2.1.1. | 1870 | INDUSTRIAL LEADERSHIP - ICT enabling |
H2020-EU.3.4. | 1756 | SOCIETAL CHALLENGES - Smart, Green And Integrated Transport |
H2020-EU.3.3. | 1450 | SOCIETAL CHALLENGES - Secure, clean and efficient energy |
H2020-EU.3.1. | 1196 | SOCIETAL CHALLENGES - Health, demographic change and well-being |
H2020-EU.3.2. | 914 | SOCIETAL CHALLENGES - Food security, etc. |
H2020-EU.3.5. | 739 | SOCIETAL CHALLENGES - Climate action, Environment, ect. |
H2020-EU.1.2. | 618 | EXCELLENT SCIENCE - Future and Emerging Technologies (FET) |

The following tables compares the total cost and the average cost of the projects in the H2020 program and the FP7 program for the ICT theme.
Program | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | N | TotalCost |
---|---|---|---|---|---|---|---|---|
FP7 | 0.12 | 2.59 | 3.84 | 4.79 | 5.16 | 74.98 | 2325 | 7.8 billion |
H2020 | 0.07 | 0.48 | 3.12 | 5.69 | 5.4 | 180.32 | 1870 | 6.7 billion |
We can see that the total cost of the projects in the H2020 program is lower than the total cost of the projects in the FP7 program. The average cost of the projects in the H2020 program is higher than the average cost of the projects in the FP7 program due to the fact that the distribution of the total cost of the projects in the H2020 program is more skewed than that of the FP7 program.

Now, we will find top 10 keywords in the ICT theme of the H2020 program. The following table shows the top 10 keywords in the ICT theme of the H2020 program, and compares them with the top 10 keywords in the ICT theme of the FP7 program. It shows that data and smart system becomes more popular. This might be due to the fact that AI and machine learning are becoming more popular in the ICT field.
word (H2020) | N (H2020) | word (FP7) | N (FP7) |
---|---|---|---|
platform | 179 | systems | 212 |
data | 169 | based | 157 |
smart | 133 | future | 130 |
based | 130 | internet | 123 |
european | 113 | european | 122 |
system | 108 | energy | 119 |
systems | 107 | research | 118 |
digital | 105 | management | 113 |
cloud | 100 | data | 106 |
5g | 91 | services | 104 |
We will not plot the word cloud for the H2020 program because the word cloud is not very informative. The above table gives us a glimpse of the projects in the ICT theme of the H2020 program. Instead, we will use the LDA model to find the topics in the H2020 program.
# <<DocumentTermMatrix (documents: 1792, terms: 18750)>>
# Non-/sparse entries: 217873/33382127
# Sparsity : 99%
# Maximal term length: 114
# Weighting : term frequency (tf)
# Sample :
# Terms
# Docs data develop innov market project provid system technolog use will
# 1187 7 0 0 1 1 2 0 1 0 2
# 1358 19 1 0 0 0 4 0 5 3 5
# 1756 0 2 4 1 4 1 0 3 0 4
# 1777 12 1 1 0 5 1 0 1 1 5
# 286 0 1 1 2 0 2 0 0 1 3
# 35 18 3 2 2 0 2 10 4 3 5
# 449 2 1 2 2 2 1 1 0 0 9
# 566 0 0 5 3 0 1 3 4 0 5
# 659 0 4 0 1 1 0 0 1 1 5
# 973 0 2 4 0 3 1 1 4 0 1
There are 1792 projects whose objective has length greater than 100, which is less than the number of projects in the FP7 program. This does not mean European Commission invest less in the ICT theme of the H2020 program. It might be due to the fact that some topics in the ICT theme of the FP7 becomes an independent theme in the H2020 program. For example, H2020-EU.3.4. and H2020-EU.3.3. are the topics in the ICT theme of the FP7 program. However, they are independent topics in the H2020 program.

The optimal number of topics for the H2020 program is 12, which is higher than the optimal number of topics for the FP7 program. This might be due to the fact that the topics in the H2020 program are more specific than those in the FP7 program. This also makes sense as the technology is becoming more advanced our knowledge about the technology is becoming more specific and complex.

Figure 10 tells us lots of information about the projects in the ICT theme of the H2020 program. First, topic 1 is still about quantum information and communication, which is the same as the FP7 program. This shows the consistency of European Commission in the ICT theme. Some topics like topics 7 to 12 are very similar to the topics in the FP7 program. However, some topics like topics 2, 3, 4, 5, and 6 are very different from the topics in the FP7 program. For instance, topic 4 is about AI, which is new in the H2020 program. This shows that European Commission is trying to invest more in the AI field. Topic 6 is about building up an ecosystem for small and medium enterprises (SMEs) to develop their products with ICT technology.

Figure 11 shows the number of projects in each topic for the H2020 program. It shows that topic 7 has the smallest number of projects. This makes sense as the topic 7 is about designing and evaluating the research policy on the ICT technology, which is not a very practical topic.
topic | totalCost | avrgCost | Note |
---|---|---|---|
1 | 3039.39 | 16.89 | Photonics based manufacturing |
2 | 1102.04 | 12.97 | Smart Manufacturing |
9 | 974.39 | 6.29 | 5G Infrastructure |
4 | 960.45 | 6.20 | AI |
12 | 941.58 | 7.08 | ICT Enabled Sustainable Growth |
5 | 911.34 | 6.33 | Big Data |
6 | 875.62 | 6.30 | Ecosystem for SMEs |
11 | 590.24 | 3.06 | ICT for Health |
10 | 447.26 | 2.31 | ICT for Education |
7 | 337.20 | 2.37 | Research Policy on ICT |
3 | 120.78 | 0.54 | Ecommerce |
8 | 77.07 | 1.64 | Others |
Latent Dirichlet Allocation (LDA) is really helpful and can help us to summarize topics very efficiently. For instance, topic 2 is about smart manufacturing and the following table gives 5 samples of projects in this topic. You can see that one could make inferences on the topic based on the key words in Figure 10.
id | topic | title |
---|---|---|
723145 | 2 | Ecosystem for Collaborative Manufacturing Processes – Intra- and Interfactory Integration and Automation |
692466 | 2 | Power Semiconductor and Electronics Manufacturing 4.0 |
871875 | 2 | A “Smart” Self-monitoring composite tool for aerospace composite manufacturing using Silicon photonic multi-sEnsors Embedded using through-thickness Reinforcement techniques |
723699 | 2 | Driving up Reliability and Efficiency of Additive Manufacturing |
101017284 | 2 | AI-Driven Cognitive Robotic Platform for Agile Production environments |