Yet Another Guide to Working with EPO Patent Data

It took me a while to get into patent analytics, which is a 'wild' world. I hope my guide for patent analytics could help you to navigate through.

The first question came to my mind was ‘where should I start ?’ when I wanted to do patent analytics for my research project. Like everyone, I ended up with Google or ChatGPT (if you got a chance to ask there). I found out that the EPO Worldwide Patent Statistical Database (PATSTAT) is one of the most widely used patent databases for researchers and business analytics. Right now users could get access to PATSTAT online free for one month. Since most people need to use this database longer than one month, this guide will cover how you could access the PATSTAT database for free and use it whenever you want.

Knowing the data

Paper by (Kang & Tarasconi, 2016) presented the examples of frequently used patent statistics in the following table.

Table 1. What we can do with patent information (Kang & Tarasconi, 2016).
Analysis Patent statistics Purposes
Citation analysis Forward citations Measure technological value
  Backward citations Find knowledge source
Patent counts Patent counts Observe patent portfolio
analysis RTA (Revealed Technology Advance) Identify core technological competence
  PS (Patent Share)  
Technology class analysis Generality Measure endogenous applicability to different technological fields
  Originality Measure knowledge absorption from different technological fields
Inventor analysis Inventor counts Measure invention quality
    Measure absorptive capability
  Inventor Identify specific inventors’ info such as star engineers
    Follow mobility of R&D personnel

In terms of discussed advantages and disadvantages of using PATSTAT data, they made a summary as follows:

There are three most critical issues that users must recognize when planning to use PATSTAT (Kang & Tarasconi, 2016):

  1. Users need to harmonize names of applicants and inventors; for example, Toyota Motor Corporation appears as “Toyota Motor Corporation,” “Toyota Motor Co,” “Toyota Jidosha Kabushiki Kaisha” (“Jidosha” and “Kabushiki Kaisha” mean a car and a corporation in Japanese, respectively), “Toyota Jidosha K. K.,” and so on.
  2. Users need to be aware of the case that one entity can appear with different names in patent documents.
  3. Users need to understand how EPO records and tracks its documents, especially when those documents are from different patent offices, such as USPTO.

With those background information, I believe you are well prepared to start working with some datasets now.

Getting the dataset

If you want to do some quick search, please try Espacenet, which uses the same database with PATSTAT. However, as we have mentioned, there are name issues in this database. Moreover, you cannot do multiple searches by using SQL, R or Python with Espacenet. PATSTAT does support SQL queries (De Rassenfosse et al., 2014), but it is very time consuming to extract information you want. So, where and how could get the dataset then?

The starting point is OECD Harmonized Applicant Names (HAN) Database, which is free for non-commercial use after filling in a short online form.

The OECD HAN database provides a grouping of patent applicant’s names which has been elaborated with business register data. The names of patent applicants were originally extracted from European Patent Office’s (EPO) Worldwide Statistical Patent Database (PATSTAT, Spring 2022). The database also includes the list of patent documents filed to the EPO, the US Patent and Trademarks Office (USPTO) or through the Patent Co-operation Treaty (PCT).

Since applicants’ names could have some level of similairity but still refers to the different entity (such as Volkswagen Group vs. Volkswagen Shanghai), we need to identify those entities by matching the official business register database. OECD patent research group cleaned/harmonized names by matching against company names from business register data - ORBIS database from Bureau van Dijk.

Sometimes, names have typos but computer could not identify those names as the same entity such as “Volksgen (missing ‘w’) Shanghai”. For cases like this, we need to clean and harmonized names again.

Please read OECD’s methodology report to know how they clean, harmonize and consolidate applicant names.

Understanding OECD HAN database structure

Figure 1 gives OECD HAN database structure (open the zoomed version for more details). It has four tables: HAN_PERSON, HAN_NAMES, HARM_NAMES, and HAN_PATENTS. We will use one patent file as an example to walk through those tables. Notice that there are around 18 million rows in HAN_PATENTS table, which accounts for around 18% of EPO’s total patent document.1 This means the OECD HAN database has a European-centered bias. This is partly due to the useage of ORBIS database in the matching process, which has more register information for European firms. We will discuss how to get and analyze patent data for Chinese firms in the coming sections of this post.

OECD HAN database illustration
Figure 1. OECD HAN database structure: there are four tables, in which HAN_PERSON is the correspondence table which include all cleaned names.

Now, we chose the first six records (or entries) from HAN_PATENTS table and search those patents from EPO’s Espacenet. In the HAN_PATENTS table, the variable Appln_id refers to the patent application identifier in PATSTAT database. You can use it to retreat the document if you can access to PATSTAT database. When we search in Espacenet, we use Patent_number instead of Appln_id.

Table 2. First six rows of HAN_PATENTS, which starts at HAN_ID = 4.
HAN_ID HARM_ID Appln_id Publn_auth Patent_number
4 4 311606173 US US8668089
7 7 439191607 US US9409947
7 7 518367793 US US10836794
10 10 365204276 US US8513480
14 14 336903179 WO WO2011112122
14 14 363622722 WO WO2012064218

Let’s search for US8668089 - the first entry with HAN_ID=4. The following search result was returned by Espacenet.

Searching results for US8668089
Figure 2. Espacenet's searching returned result that includes patent name, inventor, applicant, CPC classes, and publication information and priority date.

To understand meaning of those dates, code and numbers, we need to learn some legal terms in the field of patent analytics.

One has to be very careful when you are working with patent data because you might have different reference IDs pointing to the same patent. Therefore, it is better to work only on granted patents unless you want to do patent analysis based on application documents. Having a grip on it by looking at the Figure 3.

Searching results illustated for US8668089
Figure 3. The complexity of a patent file; read the granted patent file [pdf] to have a big picture of this patent file.

It makes sense for a firm to file a patent in different judicial offices, therefore you see the patent in Figure 3 with publication number US2010096288(A1) was granted in different offices and then different document numbers such as US8668089(B2), EP2024242(B1). This might lead to the issue of overcounting patents if we do not search cautiously.

Let’s review the first from HAN_PATENTS, which we have extracted the patent document for the first entry with patent number of US8668089. Notice that the starting entry in HAN_PATENTs table is at HAN_ID=4 instead of 1, this means for companies with HAN_ID=1 (or 2-3), there is no patent information for them.

HAN_ID HARM_ID Appln_id Publn_auth Patent_number
4 4 311606173 US US8668089

Table 3 presents the first seven rows of HAN_PERSON. Compared with Table 2, it starts at HAN_ID=1, this means some names picked by algorithms do not have patent files even though they have unique person IDs. In this table, Person_id refers to applicant’s unique identifier from PATSTAT. This table is of little misleading as many of them are firms’ names. So, be aware that Person_name_clean refers to the harmonised applicant name.

Table 3. First seven rows of HAN_PERSON, which starts at HAN_ID = 1.
HAN_ID HARM_ID Person_id Person_name_clean Person_ctry_code Matched
1 1 15164069 & HAMBOURG NIENDORF DE 0
2 2 30816597 & KK JP 0
3 3 67383967 “ASTRONIT” CLOSE CORP RU 0
3 3 47772459 “ASTRONIT” CLOSE CORP RU 0
4 4 47210990 “DEUTSCHE SEE” GMBH DE 0
5 5 47367244 “EFIRNOIE” OPEN JOINT STOCK CO RU 0
5 5 64495153 “EFIRNOIE” OPEN JOINT STOCK CO RU 0

For the first four rows, here are possible reasons why they are included in Table 3 but not in Table 2:

Differences between HAN_ID and HARM_ID. In Table 4, we have selected four names with same HAN_ID but different HARM_IDs. This shows that HAN_ID was grouped based on further harmonized process. Clearly, “PERMNEFTEMASHREMONT” PUBLIC JOINT STOCK CO is very likely to referring the same entity with PUBLIC JOINT STOCK CO “PERMNEFTEMASHREMONT”.

Table 4. The selected rows from `HAN_PERSON` database.
HAN_ID HARM_ID Person_id Person_name_clean Person_ctry_code Matched
6 6 67613311 “EUROSTANDART” LTD LIABILITY CO RU 0
7 7 53655360 “IVIX” LTD RU 0
8 8 48761393 “PERMNEFTEMASHREMONT” PUBLIC JOINT STOCK CO RU 1
8 8 64875050 “PERMNEFTEMASHREMONT” PUBLIC JOINT STOCK CO RU 1
8 2462366 48754911 PUBLIC JOINT STOCK CO “PERMNEFTEMASHREMONT” RU 1
8 2462366 64543597 PUBLIC JOINT STOCK CO “PERMNEFTEMASHREMONT” RU 1

In the next section, we will do a case study to link a small dataset with OECD HAN datasets and try to answer some interesting analytical questions.

A case study with OECD HAN datasets

From now on, we will only work on three datasets: HAN_NAMES, HAN_PERSON and HAN_PATENTS as those two datasets include all information we need. With those two datasets we will try to do patent analysis for some german firms from information and communication technology industry. Here are components of this case study:

Read the dataset

Table 5 gives the top rows of our dataset german_firms.csv, which has firms’ native names and international names. We will use the column of name_international as native names might have special german characters like ö or ü.

Table 5. Top rows of German firms dataset.
name_native name_international
Airbus Defence and Space GmbH Airbus Defence and Space GmbH
EurA AG EurA AG
TuTech Innovation GmbH TuTech Innovation GmbH
FFT Produktionssysteme GmbH & Co. KG. FFT Produktionssysteme GmbH & Co. KG.
Diehl Aviation Laupheim GmbH Diehl Aviation Laupheim GmbH

Table 6 gives the top rows of HAN_NAMES. Our goal is to extract HAN_ID based on the value of Clean_name. For instance, we will try to find Airbus Defence and Space GmbH in HAN_NAMES.

Table 6. Top rows of `HAN_NAMES`.
HAN_ID Clean_name Person_ctry_code
1 & HAMBOURG NIENDORF DE
2 & KK JP
3 “ASTRONIT” CLOSE CORP RU
4 “DEUTSCHE SEE” GMBH DE
5 “EFIRNOIE” OPEN JOINT STOCK CO RU

Notice that all names in HAN_PATENTS are of upper case, we need to convert our german firms’ names into upper cases too. Then we can do our query.

# Code Block 1. query names in han_names 
airbus <- toupper('Airbus Defence and Space GmbH')
han_names %>%
    .[Clean_name %like% airbus]

The above code returns nothing as we could not find the exact match. Therefore, we need to query with less characters.

# Code Block 2. query names in han_names with less characters   
airbus <- toupper('Airbus Defence')
han_names %>%
    .[Clean_name %like% airbus]

When we query with AIRBUS DEFENCE, we got more than 10 results shown in Table 7. This shows the challenging part of doing patent analysis3. Have a look at the one with HAN_ID=62422, you will realize that why we could not find anything when we use 'AIRBUS DEFENCE AND SPACE GMBH' to do query because there is no ‘AND’ but an ‘ANG’. Moreover, we have same firm registered in different countries.

Table 7. The results of the above query
HAN_ID Clean_name Person_ctry_code
53993 AIRBUS DEFENCE & SPACE FR
60513 AIRBUS DEFENCE & SPACE GMBH DE
60514 AIRBUS DEFENCE & SPACE LTD GB
61747 AIRBUS DEFENCE & SPACE GMBH DD
61749 AIRBUS DEFENCE & SPACE SAS FR
62421 AIRBUS DEFENCE & SPACE LTD FI
62422 AIRBUS DEFENCE ANG SPACE GMBH DE
68676 AIRBUS DEFENCE & SPACE GMBH GB
81263 AIRBUS DEFENCE & SPACE NETHERLANDS BV NL
3637004 AIRBUS DEFENCE & SPACE DE
4401227 AIRBUS DEFENCE ANS SPACE GMBH DE
4527012 AIRBUS DEFENCE & AIR SPACE GMBH DE

Since we only want German firms, we will query with the extra row filter by setting Person_ctry_code == 'DE'. This gives us the following results, which means we will use all those five HAN_IDs to extract patent information for Airbus Defence and Space GmbH.

Table 8. The results for querying German firms
HAN_ID Clean_name Person_ctry_code
60513 AIRBUS DEFENCE & SPACE GMBH DE
62422 AIRBUS DEFENCE ANG SPACE GMBH DE
3637004 AIRBUS DEFENCE & SPACE DE
4401227 AIRBUS DEFENCE ANS SPACE GMBH DE
4527012 AIRBUS DEFENCE & AIR SPACE GMBH DE

By the way, thought I like Python a lot, I have to admit that the workflow of doing data analysis with packages like data.table or tidyverse is much better in R community when your dataset is quite organized and cleaned. The pipe function %>% is one of my favorite.

# Code-Block 3. Filter and extract han_ids for Airbus Defence
airbus <- toupper('Airbus Defence')
han_names %>%
    .[Person_ctry_code == 'DE'] %>%
    .[Clean_name %like% airbus] %>%
    .[,HAN_ID] -> airbus_han_ids
airbus_han_ids
#[1] 60513 62422 3637004 4401227 4527012

# Code-Block 4. Count patents for different offices 
han_patents %>%
    .[HAN_ID %in% airbus_han_ids] %>%
    .[, .N, by=Publn_auth] -> foo
    transform(adorn_totals(foo)) %>%
    transpose() %>%
    row_to_names(row_number=1)

With the airbus_han_ids, we will extract patents from HAN_PATENTS dataset. Then we found 1219 patent documents which are applied in different countries. When we use patent counting to measure how innovative of a firm is, we should be aware of the issue of over counting (Webb et al., 2005).

Table 9. Patent distribution for Airbus Defence (DE)
Office EP US WO total
Number 716 415 88 1219

Over counted or under counted? In total, we got 1219 patent documents from different patent offices. However, the same patent could be filed in different offices. In our case, we need to check whether there are duplicates for our entity - AIRBUS DEFENCE AND SPACE GMBH. For instance, among 1219 patents, document US2017113777 refers to the patent application document that the firm filed in USPTO whereas document EP3162699 refers to the same patent application file and granted patent document by EPO. However, both patent documents are recorded in HAN_PATENTS dataset.

Table 10. Some of duplicated patents
HAN_ID HARM_ID Appln_id Publn_auth Patent_number Search_link
60513 60513 470692725 EP EP3162699 Espacenet
60513 60513 477856200 US US2017113777 Espacenet

Table 10 only shows two of those duplicates, there might be more. However, it is not practical to go through all those patent numbers and search them one by one and then check whether there are some duplicates or not. For now, we will only use patents from EPO. This should be a safe choice as EPO normally grant less patents with stricter rules (see group 1 in Table 11).

Table 11. Some of duplicated patents with kind code
HAN_ID Publn_auth Patent_number Kind_code Group
60513 US US9308691 B2 (granted) 1
60513 EP EP2689872 A3 (application+search report) 1
60513 US US10703486 B2 (granted) 2
60513 EP EP3219614 B1 (granted) 2
60513 US US2017165795 A1 (application) 3
Not in the database EP EP3181711 B1 (granted) 3

For our patent analysis, we need to make sure:

If we only use patents from EPO, patents might be under counted overall as it is possible that firms got granted patents from USPTO but did not filed patent application to EPO or applied but rejected. With all those trade-offs, the best decision is to be of under counted rather than over counted.

In Table 11, there is one entry marked as ‘Not in the database’, this shows the bias4 of OECD HAN database because EP3181711 was granted by EPO and also published as US2017165795. However, only one of them is included in the database as shown in Table 11.

Using EPO patents. By using EPO patent numbers, we can avoid the issue of over count across patent offices with the risk of under count. However, we have not solved the issue of over count within EPO. Table 12 shows not all patents from HAN_PATENTS were granted.

Table 12. Selected patents for Airbus defence (DE) from EPO
HAN_ID HARM_ID Appln_id Publn_auth Patent_number Granted
60513 60513 213 EP EP2030891 No
60513 60513 65448 EP EP2025928 No
60513 60513 156990 EP EP1920908 Yes
60513 60513 161551 EP EP1972896 Yes
60513 60513 173385 EP EP2134522 Yes
60513 60513 173386 EP EP2136979 Yes

Summary. At this stage, let’s summarize what have done.

# Code-Block 5. Extract patents for one-company full code
# read dataset 1, 2, and 3 
de_firms <- fread('work/notebooks/patent/data/orbis_de_matched_l.csv')
han_names <- fread('work/notebooks/patent/data/202208_HAN_NAMES.txt')
han_patents <- fread('work/notebooks/patent/data/202208_HAN_PATENTS.txt')

# filter out germany firms from han_names 
# by setting Person_ctry_code == 'DE'
# match names "AIRBUS DEFENCE" and get their HAN_ID
airbus <- toupper('Airbus Defence')
han_names %>%
    .[Person_ctry_code == 'DE'] %>%
    .[Clean_name %like% airbus] %>%
    .[,HAN_ID] -> airbus_han_ids

# calcualte the summary statistics for AIRBUS DEFENCE
han_patents %>%
    .[HAN_ID %in% airbus_han_ids] %>%
    .[, .N, by=Publn_auth] -> foo
    transform(adorn_totals(foo)) %>%
    transpose() %>%
    row_to_names(row_number=1)

# focusing on patents from EPO
# filter with condition Publn_auth == 'EP'
han_patents %>%
    .[HAN_ID %in% airbus_han_ids] %>%
    .[Publn_auth == 'EP'] -> airbus_ep_patents

With all patent documents of AIRBUS DEFENCE AND SPACE GMBH from EPO we ready to figure out how to search for granted patents. The input for the next step should be airbus_ep_patents, which is a table with 716716 rows as shown in Table 12.

Search for granted patents

If you have registered PATSTAT online database (free trial for one month) or bought the database, you can extract more information from it based on Appln_id which is the unique identifier used in PATSTAT. For instance,

SELECT * FROM tls211_pat_publn  -- tls211_pat_publn is the publication table
WHERE appln_id = 156990  -- third row from Table 12

With the above query, I got the following results.

pat_publn_id publn_kind appln_id publn_date publn_first_grant
277148936 A1 156990 2008-05-14 N
437725522 B1 156990 2015-04-08 Y

This tells us whether the patent was granted or not. However, we need to extract this kind of information with thousands of entries (if not millions). Right now, I could not find a way from PATSTAT online to do this (please email to me if you knew how to do it, thanks in advance).

To search for the granted information in a programming environment, we need an API. Luckily, EPO provides this kind of APIs:

Since we use only EPO patents and have patent numbers and application IDs from OECD HAN database, we will use Linked open EP data.

Linked open EP data

According to EPO, “Linked open data offers you new ways of combining patent data and non-patent data in your work”. Linked open EP data can be queried, retrieved and viewed using standardized web technologies like HTTP, URI and SPARQL.

OECD HAN database illustration
Figure 1. OECD HAN database structure: there are four tables, in which HAN_PERSON is the correspondence table which include all cleaned names.

Back to our case. As we have stated from the last section, we want to search for granted patents based on Patent_number. Those patent numbers we have are publication numbers. Therefore, we should use APIs for publications. Please read the following documents from EPO:

The idea behind API is very intuitive. For each URL, you have an endpoint which returns the specific results. For instance, the following link returns all patents with publication numbers as input:

https://data.epo.org/linked-data/data/publication/{st3Code}/{publicationNumber}

# st3Code is two-letter codes of countries 
# publicaionNumber is just publication number

# try this one in your browser 
# https://data.epo.org/linked-data/data/publication/EP/1048543

With publication number (Panten_number in our database), we can just extract all relevant information from linked database of EPO. Code-Block 6 gives the sample code in R to extract all publications for patent EP1972896.

# Code-Block 6. API request
request <- GET('https://data.epo.org/linked-data/data/publication/EP/1972896.json')
response <- content(request, as = "text", encoding = "UTF-8")
json<- fromJSON(response, flatten = TRUE)

# make a table with selected variables
json$result$items %>%
    select(
        `_about`, 
        publicationDate, 
        application.applicationNumber,
        publicationKind.label) %>%
    rename(Link=`_about`, Publication_date=publicationDate, 
        Application_number=application.applicationNumber, 
        Kind_code=publicationKind.label)

Table 13 gives the results of Code-Block 6, which shows the key information we care: whether the patent is granted or not, which is indicated by the kind code. The rule of finding granted patents is to filter out B1 or B2 from kind code.

Table 13. Returned results from EPO's publication API
Link Publication_date Application_number Kind_code
EP/1972896/A2/ Wed, 24 Sep 2008 08004318 A2
EP/1972896/A3/- Wed, 07 Nov 2012 08004318 A3
EP/1972896/B1/- Wed, 06 May 2015 08004318 B1

With the right algorithm as shown in Code-Block 7, we can extract all patent information for AIRBUS DEFENCE AND SPACE GMBH. To extract patent information for 716 patent numbers of Airbus company, it takes around 6 minutes and 38 seconds. It could be speeded up if we set a smaller value of Sys.sleep().

Airbus patents distribution
Figure 4. Summary of granted patents of Airbus Defence; the decline in recent years can be explained with the fact that the application and granting procedures may take several years before a patent is granted, similar to many other fields of technology (also known as patent truncation issue); Remark: Airbus Defence and Space was founded on January 17, 2014.

With 538 granted patents of Airbus Defence and Space company, we will search for their classifications and try to find out the field of strategic orientation of this firm.

International Patent Classification (IPC). The International Patent Classification (IPC) system is a hierarchical patent classification system which is used by more than 100 patent offices on all continents. It breaks down technologies into eight sections with several hierarchical sub-levels. The IPC system has approximately 75,000 subdivisions and is updated on an annual basis. Further information on the IPC system is available at WIPO’s IPC Guide.

Cooperative Patent Classification (CPC). The Cooperative Patent Classification (CPC) system builds on the IPC system and provides a more granular and detailed classification structure. The CPC system has more than 250,000 subdivisions and is updated four times a year. It is used by more than 30 patent offices worldwide. Detailed information about the CPC system is available at USPTO’s CPC Lookup page.

With the program in Code-Block 8, we will calculate how many IPC classes that Airbus Defence (DE) applied for and what are the most frequent ones in terms of level 1, 2 and 3. For IPC, level 1 classification has 8 categories:

Level 2 of IPC gives more details of top level, for instance, B06 is everything about cleaning and H04 is about electronic communication technique. Within each sub-level, you have sub-level too. Let’s go through a case with IPC number H04H 60/18 on copying information:

Where could we get the information of IPC or CPC symbols without looking up one by one? The answer is again Linked Open EP data.

Airbus patents distribution
Figure 5. The IPC distribution of Airbus Defence (DE)'s patents; section B is about performing operations and transport; section H is electricity; section G is about physics. Remark : the total number of granted patents is 538 but the count of total IPC classes is 1763, which means some patents are assigned to multiple classes.

Figure 5 gives the IPC distribution of Airbus Defence (De)’s patents, which shows that the company has many patents in section B, H and G, which shows the key components of aircraft industry in terms of manufacturing process.

Top 5 IPC class Count
B29C70-38 14
B29C70-54 13
G08G5-00 13
B64C39-02 13
G05D1-00 12
Airbus patents distribution
Figure 6. The IPC distribution of Airbus Defence (DE)'s patents: sub-level classes; the table on the left shows the top 5 IPC class of all levels, whereas the figure on the right shows the distribution for the sub class.

Looking at those IPC classification labels, it is difficult for anyone who is not a patent expert to apprehend the meaning of those labels. Even for a patent expert, she/he has to be domain expert to interpret those labels. For instance, the top IPC class is B29c70-38, which means ‘Automated lay-up, e.g. using robots, laying filaments according to predetermined patterns’.

Word Clouds

Measuring patent quality

After knowing how to get patents for a company, we are now interested in measuring the quality of those patents. There are many indicators constructed by scholars to measure how ‘innovative’ or ‘radical’ or ‘disruptive’ (Squicciarini et al., 2013; Funk & Owen-Smith, 2017). In this post, we will learn different measurement for patents in terms of technological and economic value (Squicciarini et al., 2013).

Here is the list of patent quality measurement:

Patent truncation issue. Figure 7 shows the truncation issue of patent, which shows the time lag between application date and granted date. Paper by Lerner (2022) discussed the use and misuse of patents and addressed this issue and discussed potential biases (Lerner & Seru, 2022).

Patent truncation issue
Figure 7. The patent truncation issue for Airbus (DE), which shows that it normally takes four to five years for a patent to be granted; this means she/he has to take the lags into the consideration for using any kind quality measurement. w

Understanding OECD citation database structure

The OECD Citations database provides information on patent and non-patent literature (NPL) citations (or references) found in patent documents. Please read the OECD Citations Database report to learn how they collect the data. There are eleven tables in this database, which covers lots of information related to patent citation. We will only focus on four tables: EPO_CITATIONS, EPO_CIT_COUNTS, EPO_EQUIVALENT, and EPO_NPL_CITATIONS.

In the above sections, we have extracted all granted patents for Airbus Defence (DE) company (see Table 12). With those patent numbers, we could extract their citations from OECD Citations database. Some key figures for our dataset:

Again, we will focus on granted patents and will analyze its backward citations. I am interested in answering the question: backward citation and forward citation in terms of scale, agents, and lags, etc.

Patent truncation issue
Figure 8. Plot of frequency of Airbus' cited patents from different authorities: this shows USA is the source of innovation.
Patent truncation issue
Figure 9. Distribution of total citations made in each patent; the median is around 5/6; the distribution is also right-skewed.
Patent truncation issue
Figure 10. Plot of frequency of citation origins, which shows that firms either does not search for prior arts diligently or avoid citing prior arts strategically or has no manpower to search for prior arts (APP: citation introduced by the applicant, EXA: citation introduced during the examination, FOP: citations introduced by the opponent of the proprietor, SEA: citation introduced during the search).
Patent truncation issue
Figure 11. Distribution of citation lags; this is very interesting to see some of patent citation have more than 100 years lags.
Patent truncation issue
Figure 12. Distribution of total citation received, which shows many of patents have zero forward citations.

How does OECD citation database calculated the total forward citation? According to their report, Total_cits_Recd is calculated by summing up three citations:

It is important to know that _all counts are based on citations made during the search only (citation origin is either SEA or ISR, see Figure 10). The OECD’s way of calculating total forward citations is to calculate all citations received for the whole family. That’s why OECD citation database also provides three tables - EPO_EQUIVALENT, PCT_EQUIVALENT, and US_EQUIVALENT to give a list of patent equivalents to cited patents.

  1. there are more than 100 million patent documents from leading industrialized and developing countries in EPO’s database. 

  2. HAN_PATENTS include both patent application documents (A1 or A2) and granted patents (B1 or B2), therefore we need to figure out how to extract granted patents. 

  3. actually this kind of problem is very common in business analytics or any text analysis related to identifying entities’ names. 

  4. it is unavoidable for having bias when you work with millions of documents; as long as it is not systematic it should be okay. 

  1. De Rassenfosse, G., Dernis, H., & Boedt, G. (2014). An introduction to the Patstat database with example queries. Australian Economic Review, 47(3), 395–408.
  2. Funk, R. J., & Owen-Smith, J. (2017). A dynamic network measure of technological change. Management Science, 63(3), 791–817.
  3. Kang, B., & Tarasconi, G. (2016). PATSTAT revisited: Suggestions for better usage. World Patent Information, 46, 56–63.
  4. Lerner, J., & Seru, A. (2022). The use and misuse of patent data: Issues for finance and beyond. The Review of Financial Studies, 35(6), 2667–2704.
  5. Squicciarini, M., Dernis, H., & Criscuolo, C. (2013). Measuring patent quality: Indicators of technological and economic value.
  6. Webb, C., Dernis, H., Harhoff, D., & Hoisl, K. (2005). Analysing European and international patent citations: A set of EPO patent database building blocks.