European Research and Innovation Program

There are many projects in the European Union that are funded by the European Research and Innovation Programme. To make sense of those projects and their results, we have to do lots of data processing and analysis.

On European level, the European Research and Innovation Program (H2020) is the main funding instrument for research and innovation. It is a very large program with a budget of 80 billion euros for the period 2014-2020. The program is divided into different parts, each with its own budget and focus. It is the consecutive program of FP7 (another large European research program).

I have tried to make sense of those research program by analyzing the projects and their results, such as who are funded and what they do. This post will try to give an overview of those projects and construct a more structured dataset from the raw data for further analysis.

I have built a dataset and save it to Kaggle. One can follow this link to download the dataset.

FP7 projects

Each research program has two datasets: projects and organizations. The projects dataset contains information about the projects, such as the title, the abstract, the budget, the start and end date, the project coordinator, the participants, the topics, the call, etc. The organizations dataset contains information about the organizations, such as the name, the country, the type, the address, etc.

The organizations dataset tells who is involved in the projects. The projects dataset tells what is done in the projects. The following code gives a glimpse of the projects dataset for FP7 (the predecessor of H2020).

project <- fread("./cordis/fp7/data/csv/project.csv")
str(project, width = 70,  strict.width = "cut")
# Classes ‘data.table’ and 'data.frame':	25785 obs. of  25 variables:
#  $ id                : int  223925 315871 229760 286978 211948 28306..
#  $ acronym           : chr  "LENVIS" "APT-STEP" "FORUM GMES" "BIOCA"..
#  $ status            : chr  "CLO" "CLO" "CLO" "CLO" ...
#  $ title             : chr  "Localised environmental and health inf"..
#  $ startDate         : chr  "2008-09-01" "2012-10-01" "2008-04-01" "..
#  $ endDate           : IDate, format: "2012-01-31" ...
#  $ totalCost         : chr  "3131818" "1150921,4" "239699" "1410933"..
#  $ ecMaxContribution : chr  "2232223" "1150921" "150000" "1004500" ...
#  $ legalBasis        : chr  "FP7-ICT" "FP7-REGPOT" "FP7-SPACE" "FP7"..
#  $ topics            : chr  "ICT-2007.6.3" "REGPOT-2012-2013-1" "SP"..
#  $ ecSignatureDate   : chr  "" "" "" "" ...
#  $ frameworkProgramme: chr  "FP7" "FP7" "FP7" "FP7" ...
#  $ masterCall        : chr  "" "" "" "" ...
#  $ subCall           : chr  "FP7-ICT-2007-2" "FP7-REGPOT-2012-2013-"..
#  $ fundingScheme     : chr  "CP" "CSA-SA" "CSA-SA" "BSG-SME" ...
#  $ nature            : chr  "" "" "" "" ...
#  $ objective         : chr  "The main goal of the LENVIS project is"..
#  $ contentUpdateDate : chr  "2019-07-16 12:16:47" "2017-09-24 09:38"..
#  $ rcn               : chr  "87602" "104995" "87337" "100971" ...
#  $ grantDoi          : chr  "" "" "" "" ...
#  $ V21               : chr  "" "" "" "" ...
#  $ V22               : chr  "" "" "" "" ...
#  $ V23               : chr  "" "" "" "" ...
#  $ V24               : POSIXct, format: NA ...
#  $ V25               : int  NA NA NA NA NA NA NA NA NA NA ...
#  - attr(*, ".internal.selfref")=<externalptr> 

### --------------- Clean the data ------------------ ###
# convert to numberic values 
project %>%
    .[, totalCost := as.double(
        str_replace(totalCost, ",", ".")
        )] %>%
    .[, ecMaxContribution := as.numeric(
        str_replace(ecMaxContribution, ",", ".")
        )] %>%
    str()


# get start and end year
project %>%
    .[, startDate := as.IDate(startDate, "%Y-%m-%d")] %>%
    str()

# add project span in months, startYear, endYear
project %>%
    .[, projSpan := (interval(startDate, endDate) %/% months(1))] %>%
    str()

project %>%
    .[, startYear := year(startDate)] %>%
    .[, endYear := year(endDate)] %>%
    str()

# legalBasis group
project %>%
    .[, .N, by = legalBasis] %>%
    .[N > 10] %>%
    .[order(-rank(N))] %>%
    head(10) %>%
    kable("pipe")

The variable legalBasis tells us the theme of the project, for instance, ‘FP7-SPACE’ is about space technology. The following table gives an overview of the themes.

Legal Basis No.
FP7-PEOPLE 11103
FP7-IDEAS-ERC 4553
FP7-ICT 2325
FP7-SME 1036
FP7-HEALTH 1003
FP7-JTI 807
FP7-NMP 805
FP7-TRANSPORT 720
FP7-KBBE 516
FP7-ENVIRONMENT 493
Airbus patents distribution
Figure 1. The table gives the top 10 themes in terms of number of projects and the figure gives the distribution of the projects in terms of grant amount (in million euros). You can zoom in the figure by clicking on it.

The above table shows that the most popular theme is FP7-PEOPLE, which is about creating a more human society. The second most popular theme is FP7-IDEAS-ERC, which is about basic research. The third most popular theme is FP7-ICT, which is about information and communication technology. The fourth most popular theme is FP7-SME, which is about small and medium enterprises.

In terms of the total grant amount, the most two popular themes are FP7-ICT and FP7-IDEAS-ERC, which receive around 7.7 and 7.8 billion euros respectively. Figure 1 was produced by the following code.

options(repr.plot.width = 9, repr.plot.height = 7)
project %>%
    .[grep("FP", legalBasis)] %>%
    .[, .(budget = sum(ecMaxContribution)/1e6), by = legalBasis] %>%
    .[, budget2 := as.character(round(budget, 0))] %>% 
    .[, budgetIndex := paste(legalBasis, budget2, sep = "\n")] %>%
    treemap(index = "budgetIndex",
            vSize = "budget",
            vColor = "budget",
            type = "manual",
            palette = "Blues",
            algorithm = "squarified",
            title = "Total EU Contribution in different programmes",
            title.legend = "million",
            fontsize.labels = c(10),  border.lwds = c(1, 1))

Since there are many themes, we will focus on ICT, which is the one that receives the most grant amount. There are 2325 projects in the ICT theme, and the average project span is around 36 months (three years). The average grant amount is 4.8 million euros. However, the maximum grant amount is 74.98 million euros. The following table gives the summary statistics of the grant amount in the ICT theme, which shows that the grant amount is highly skewed.

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.12 2.59 3.84 4.79 5.16 74.98

Like many variables in the business world, only the log of the grant amount is normally distributed. The following figure shows the distribution of the log of the grant amount in the ICT theme.

fp7 totalcost hist1
Figure 2. Histogram of the the grant amount and its log transformation in the FP7 program.

Here is the code to produce the above table and figure.

project %>%
    .[legalBasis == "FP7-ICT"] %>%
    unique(by = "id") %>%
    with(summary(totalCost)/1000000) %>%
    as.matrix() %>% t() %>% 
    as.data.frame() %>%
    kable("pipe", align = "c", digits = 2)

options(repr.plot.width = 9, repr.plot.height = 5)
par(mfrow = c(1, 2))
hist(project$totalCost/1000000, breaks = 30, col = "grey",
                xlab = "Total Cost (million)",
                main = "Histogram of Total Cost")
hist(log(project$totalCost), breaks = 30, col = "grey",
                xlab = "Log(Total Cost)",
                main = "Histogram Log(Total Cost)")

Now, we will try to understand the topic of the projects in the ICT theme. The following table gives the top 10 words in the project titles in the ICT theme and wordclouds of the project titles in the ICT theme.

word N
systems 212
based 157
future 130
internet 123
european 122
energy 119
research 118
management 113
data 106
services 104
Airbus patents distribution
Figure 3. The table gives the top 10 words in the project titles in the ICT theme and the figure gives the wordcloud of the project titles in the ICT theme.

As we can see that the top frequency words and wordclouds are not very informative. We will try to use the topic model to understand the topic of the projects in the ICT theme. The following code cleans the project objectives and gives the summary statistics of the number of words in the project objectives.

project %>%
    .[legalBasis == "FP7-ICT"] %>%
    unique(by = "id") %>%
    .[, .(objective)] %>% 
    # remove punctuation
    .[, objective := gsub("[[:punct:]]", "", objective)] %>%
    # remove control characters
    .[, objective := gsub("[[:cntrl:]]", "", objective)] %>%
    # to lower case
    .[, objective := tolower(objective)] %>%
    # remove numbers
    .[, objective := gsub("[[:digit:]]", "", objective)] %>%
    # remove stop words with tm package
    .[, objective := removeWords(objective, stopwords("english"))] %>%
    # strip white spaces
    .[, objective := stripWhitespace(objective)] %>%
    # stemming
    .[, objective := stemDocument(objective)] %>%
    # now we have the clean text data
    # we need to convert it to a document-term matrix
    # add document length
    .[, doc_length := str_count(objective, "\\S+")] -> project_doc

summary(project_doc$doc_length)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  28.0   166.0   177.0   174.8   186.0   318.0

The shortest project objective has 28 words, and the longest one has 318 words. The average project objective has 175 words. The following code creates a document-term matrix and a term-document matrix. We first drop the documents with less than 100 words in order to reduce the sparsity of the document-term matrix.

# create a document-term matrix
# we will filter out documents with more than 100 words
# create document index
project_doc %>%
    .[doc_length >= 100] %>%
    .[, doc_id := .I] %>%
    # create a document-term matrix
    .[, .(doc_id, objective)] %>%
    unnest_tokens(word, objective) %>% 
    .[, .N, by = c("doc_id", "word")] %>%
    .[order(doc_id, -N)] %>% 
    as.data.frame() %>% 
    cast_dtm(doc_id, word, N) -> dtm_objective

# understand the document-term matrix
inspect(dtm_objective)

# <<DocumentTermMatrix (documents: 2265, terms: 22694)>>
# Non-/sparse entries: 272039/51129871
# Sparsity           : 99%
# Maximal term length: 56
# Weighting          : term frequency (tf)
# selected sample docs and terms 
#        Terms
# Docs   data develop network project research servic system technolog use will
#   103     0       3       0       4        0      0      3         4   1   13
#   145     0       1       0       4        0      0      4         0   3    3
#   1722    0       1       1       4        0      0      7         3   8    8
#   1794    0       1       0       1        0     16      6         0   3    3
#   1811    1       1       1       1        0      5      1         1   1    0
#   1925    5       2       1       1        0      2      3         1   2    4
#   1948    7       1       0       3        6      0      5         0   3    8
#   23      0       0       0       2        0      0      3         0   5    6
#   715     2       1       2       3        9      1      0         2   0    4
#   727     0       2       2       2        0      0      4         4   3    8

We have 2265 documents and 22694 terms in the document-term matrix. The table above shows the selected sample of the document-term matrix, which is a sparse matrix. Now, we will remove the terms that appear in less than 5 documents and more than 90% of the documents. Doing so will give us a collection of terms that are more informative, which balances the uniqueness and the commonality of the terms (i.e., the terms that are too unique or too common are not informative).

word_term <- findFreqTerms(dtm_objective,
                        lowfreq = 5,
                        highfreq = nrow(dtm_objective) * 0.9)

# create a new document-term matrix with the filtered words
dtm_objective_filtered <- dtm_objective[, word_term]
dtm_objective_filtered
# <<DocumentTermMatrix (documents: 2265, terms: 4851)>>
# Non-/sparse entries: 232500/10755015
# Sparsity           : 98%
# Maximal term length: 20
# Weighting          : term frequency (tf)

After dropping the terms that are too unique or too common, we have 4851 terms in the document-term matrix. The term-document matrix also becomes less sparse.

Now, we will use the LDA model to understand the topics of the projects in the ICT theme. First, we need to define the number of topics. We will use a package called ldatuning to find the optimal number of topics.

lda_fit1 <- ldatuning::FindTopicsNumber(dtm_objective_filtered,
                                        topics = seq(2, 10),
                                        metrics = c("CaoJuan2009", "Deveaud2014"),
                                        method = "Gibbs",
                                        control = list(seed = 789),
                                        verbose = TRUE)

FindTopicsNumber_plot(lda_fit1)
fp7 totalcost hist1
Figure 4. the plot of the optimal number of topics based on the CaoJuan2009 and Deveaud2014 metrics.

From the plot above, we will choose 9 as the optimal number of topics. The following code fits with LDA and plot the topics.

set.seed(789)
lda_fit2 <- LDA(dtm_objective_filtered, k = 9, method = "Gibbs",
                                        control = list(iter = 1000,
                                                        verbose = 50))

# get values of theta
options(repr.plot.width = 11, repr.plot.height = 9)
tidy(lda_fit2, matrix = "beta") %>% 
    group_by(topic) %>%
    slice_max(beta, n = 10) %>%
    ungroup() %>%
    arrange(topic, -beta) %>%
    mutate(word = reorder_within(term, beta, topic, sep="")) %>%
    ggplot(aes(word, beta, fill = factor(topic))) +
    geom_col(show.legend = FALSE, alpha = 0.85) +
    coord_flip() +
    facet_wrap(~topic, scales = "free") +
    labs(x = NULL, y = expression(beta)) +
    theme_bw() +
    # make x axis labels more readable with bold font
    theme(axis.text.y = element_text(face = "bold", size = 10),
            panel.grid.major = element_blank(),
            panel.grid.minor = element_blank())

By look at words in each topic, we can interpret the topics in Figure 5. For example, topic 1 is about quantum computing, topic 3 is probably about the smart manufacturing with mobile devices, and topic 6 is more about the economic policy on ICT.

fp7 topics
Figure 5. the plot of topics based on the objectives of the projects in the ICT theme (you can click on the image to zoom in).

I also found topic 4 and 9 are interesting. Topic 4 is about social media and topic 9 is about the platform for digital economy.

Figure 5 shows that Latent Dirichlet Allocation (LDA) is a useful tool to understand the topics of the projects in the ICT theme. The topics are not perfect, but they are useful to understand the projects in a very efficient way (assuming she/he will not read all the project objectives by herself/himself).

With those 9 topics we classified, we can now answer the following questions:

  1. What are the most popular topics in the ICT theme?
  2. Which topic receives the most funding?
fp7 topics
Figure 6. histogram of the number of projects in each topic

Figure 6 shows that there are more or less equal number of projects in each topic. The most popular topic is topic 1, which is about quantum computing. The following table shows the top 10 topics in terms of the total cost of the projects and the average cost of the projects in each topic.

topic totalCost(million) avrgCost(million)
3 1747.42 5.64
1 1681.41 5.08
8 1434.17 5.22
7 1329.14 4.43
4 1318.15 4.76
9 1276.01 6.51
5 1102.56 5.63
6 726.44 2.34
2 302.68 4.39

We can see that European Commission invests more in topic 3, which is about smart manufacturing and energy with mobile devices. In general, topic 3 is all about using ICT to improve the efficiency of the manufacturing, transportation, and energy sectors. The following table gives some sample projects in topic 3.

id topic title
314328 3 SMART MOBILITY IN SMART CITY
223937 3 SMART-antenna multimode wireless mesh Network
257544 3 Smart and Efficient Location, idEntification, and Cooperation Techniques
287534 3 Semantic Tools for Carbon Reduction in Urban Planning
314331 3 Optimized electric Drivetrain by INtegration

H2020 projects

After analyzing the FP7 program, we will do the same analysis for the H2020 program. The H2020 program is a continuation of the FP7 program. The H2020 program has a budget of 80 billion euros for the period 2014-2020.

legalBasis N Note
H2020-EU.1.3. 11700 EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions (MSCA)
H2020-EU.1.1. 7792 EXCELLENT SCIENCE - European Research Council (ERC)
H2020-EU.2.3. 3162 INDUSTRIAL LEADERSHIP - Innovation In SMEs
H2020-EU.2.1.1. 1870 INDUSTRIAL LEADERSHIP - ICT enabling
H2020-EU.3.4. 1756 SOCIETAL CHALLENGES - Smart, Green And Integrated Transport
H2020-EU.3.3. 1450 SOCIETAL CHALLENGES - Secure, clean and efficient energy
H2020-EU.3.1. 1196 SOCIETAL CHALLENGES - Health, demographic change and well-being
H2020-EU.3.2. 914 SOCIETAL CHALLENGES - Food security, etc.
H2020-EU.3.5. 739 SOCIETAL CHALLENGES - Climate action, Environment, ect.
H2020-EU.1.2. 618 EXCELLENT SCIENCE - Future and Emerging Technologies (FET)
fp7 topics
Figure 7. distribution of the total cost of the projects in the H2020 program (you can click on the image to zoom in).

The following tables compares the total cost and the average cost of the projects in the H2020 program and the FP7 program for the ICT theme.

Program Min. 1st Qu. Median Mean 3rd Qu. Max. N TotalCost
FP7 0.12 2.59 3.84 4.79 5.16 74.98 2325 7.8 billion
H2020 0.07 0.48 3.12 5.69 5.4 180.32 1870 6.7 billion

We can see that the total cost of the projects in the H2020 program is lower than the total cost of the projects in the FP7 program. The average cost of the projects in the H2020 program is higher than the average cost of the projects in the FP7 program due to the fact that the distribution of the total cost of the projects in the H2020 program is more skewed than that of the FP7 program.

fp7 totalcost hist1
Figure 8. Histogram of the the grant amount and its log transformation in the H2020 program.

Now, we will find top 10 keywords in the ICT theme of the H2020 program. The following table shows the top 10 keywords in the ICT theme of the H2020 program, and compares them with the top 10 keywords in the ICT theme of the FP7 program. It shows that data and smart system becomes more popular. This might be due to the fact that AI and machine learning are becoming more popular in the ICT field.

word (H2020) N (H2020) word (FP7) N (FP7)
platform 179 systems 212
data 169 based 157
smart 133 future 130
based 130 internet 123
european 113 european 122
system 108 energy 119
systems 107 research 118
digital 105 management 113
cloud 100 data 106
5g 91 services 104

We will not plot the word cloud for the H2020 program because the word cloud is not very informative. The above table gives us a glimpse of the projects in the ICT theme of the H2020 program. Instead, we will use the LDA model to find the topics in the H2020 program.

# <<DocumentTermMatrix (documents: 1792, terms: 18750)>>
# Non-/sparse entries: 217873/33382127
# Sparsity           : 99%
# Maximal term length: 114
# Weighting          : term frequency (tf)
# Sample             :
#       Terms
# Docs   data develop innov market project provid system technolog use will
#   1187    7       0     0      1       1      2      0         1   0    2
#   1358   19       1     0      0       0      4      0         5   3    5
#   1756    0       2     4      1       4      1      0         3   0    4
#   1777   12       1     1      0       5      1      0         1   1    5
#   286     0       1     1      2       0      2      0         0   1    3
#   35     18       3     2      2       0      2     10         4   3    5
#   449     2       1     2      2       2      1      1         0   0    9
#   566     0       0     5      3       0      1      3         4   0    5
#   659     0       4     0      1       1      0      0         1   1    5
#   973     0       2     4      0       3      1      1         4   0    1

There are 1792 projects whose objective has length greater than 100, which is less than the number of projects in the FP7 program. This does not mean European Commission invest less in the ICT theme of the H2020 program. It might be due to the fact that some topics in the ICT theme of the FP7 becomes an independent theme in the H2020 program. For example, H2020-EU.3.4. and H2020-EU.3.3. are the topics in the ICT theme of the FP7 program. However, they are independent topics in the H2020 program.

fp7 totalcost hist1
Figure 9. the plot of the optimal number of topics for H2020 program.

The optimal number of topics for the H2020 program is 12, which is higher than the optimal number of topics for the FP7 program. This might be due to the fact that the topics in the H2020 program are more specific than those in the FP7 program. This also makes sense as the technology is becoming more advanced our knowledge about the technology is becoming more specific and complex.

fp7 topics
Figure 10. the plot of topics based on the objectives of the projects in the ICT theme for H2020 program (you can click on the image to zoom in).

Figure 10 tells us lots of information about the projects in the ICT theme of the H2020 program. First, topic 1 is still about quantum information and communication, which is the same as the FP7 program. This shows the consistency of European Commission in the ICT theme. Some topics like topics 7 to 12 are very similar to the topics in the FP7 program. However, some topics like topics 2, 3, 4, 5, and 6 are very different from the topics in the FP7 program. For instance, topic 4 is about AI, which is new in the H2020 program. This shows that European Commission is trying to invest more in the AI field. Topic 6 is about building up an ecosystem for small and medium enterprises (SMEs) to develop their products with ICT technology.

fp7 topics
Figure 11. histogram of the number of projects in each topic for H2020 program.

Figure 11 shows the number of projects in each topic for the H2020 program. It shows that topic 7 has the smallest number of projects. This makes sense as the topic 7 is about designing and evaluating the research policy on the ICT technology, which is not a very practical topic.

topic totalCost avrgCost Note
1 3039.39 16.89 Photonics based manufacturing
2 1102.04 12.97 Smart Manufacturing
9 974.39 6.29 5G Infrastructure
4 960.45 6.20 AI
12 941.58 7.08 ICT Enabled Sustainable Growth
5 911.34 6.33 Big Data
6 875.62 6.30 Ecosystem for SMEs
11 590.24 3.06 ICT for Health
10 447.26 2.31 ICT for Education
7 337.20 2.37 Research Policy on ICT
3 120.78 0.54 Ecommerce
8 77.07 1.64 Others

Latent Dirichlet Allocation (LDA) is really helpful and can help us to summarize topics very efficiently. For instance, topic 2 is about smart manufacturing and the following table gives 5 samples of projects in this topic. You can see that one could make inferences on the topic based on the key words in Figure 10.

id topic title
723145 2 Ecosystem for Collaborative Manufacturing Processes – Intra- and Interfactory Integration and Automation
692466 2 Power Semiconductor and Electronics Manufacturing 4.0
871875 2 A “Smart” Self-monitoring composite tool for aerospace composite manufacturing using Silicon photonic multi-sEnsors Embedded using through-thickness Reinforcement techniques
723699 2 Driving up Reliability and Efficiency of Additive Manufacturing
101017284 2 AI-Driven Cognitive Robotic Platform for Agile Production environments