Web Scraping: A Summary

After scraping a fair number of websites, I wrote down what I have learned.

There are many ways to do web scraping. In this post, I will group websites into three categories and describe methods for scraping each of them.

Methods for static websites

If the website you want to scrape is static, you can usually get by with requests and BeautifulSoup. A starting helper function could look like this:

from bs4 import BeautifulSoup
import requests

# default request headers; the User-Agent below is only a placeholder
HEADERS = {"User-Agent": "Mozilla/5.0"}


def url_to_soup(url, headers=HEADERS):
    try:
        # setting verify=False may work around SSL errors when making
        # many requests, but it disables certificate verification
        page = requests.get(url, headers=headers, verify=True)
        page.encoding = "utf-8"
        print(page.status_code)
        soup = BeautifulSoup(page.content, "lxml")
        return soup
    except requests.exceptions.ConnectionError:
        print("Connection refused: " + url)
        return None

Methods for dynamic websites

If the website you want to scrape is dynamic, you need to inspect the Network tab in your browser's Developer Tools and figure out how the data is loaded. Depending on what you find, you could either call the underlying endpoint directly (often an XHR request returning JSON, see [3]) or render the page with a headless browser, as sketched below.
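
For example, if the Network tab shows the page fetching its content from a JSON endpoint, you can often request that endpoint directly. The endpoint URL and parameters below are purely illustrative; copy the real ones from the Network tab:

# hypothetical XHR endpoint spotted in the Network tab
api_url = "https://example.com/api/items"
res = requests.get(api_url, params={"page": 1}, headers=HEADERS)
data = res.json()  # already structured data, no HTML parsing needed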

If the data is rendered by JavaScript rather than fetched from a convenient endpoint, I prefer using Splash, which renders the JavaScript for you and returns the resulting HTML to your Python code. For instance,

# ask a local Splash instance to render the page, waiting 2 seconds
# for the JavaScript to finish executing
res = requests.get("http://localhost:8050/render.html",
                   params={"url": url, "wait": 2})
soup = BeautifulSoup(res.content, "html.parser")
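
Note that Splash runs as a separate HTTP service; the usual way to start it locally is with Docker (docker run -p 8050:8050 scrapinghub/splash), after which the render.html endpoint used above is available on port 8050.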

Methods for single-page application websites

For single-page application websites, one approach is to write a JavaScript snippet and add it under the Sources panel of your browser's Developer Tools, as shown in the following figure.

Figure 1. Screenshot of the Developer Tools

In this example, the function getData is defined by the website we want to scrape. Once the page has loaded it, we can call it directly from the Console and get the data we want.

After that, you can use the following function, adapted from DevTools Snippets [1], to save your dataset. It wraps the data in a Blob [2] and triggers a download:

(function(console){
    // attach a save() method to the console object
    console.save = function(data, filename){

        if(!data) {
            console.error('Console.save: No data')
            return;
        }
        // fall back to a default file name
        if(!filename) filename = 'webdata.json'

        if(typeof data === "object"){
            data = JSON.stringify(data, undefined, 4)
        }
        // create a blob object to save data 
        var blob = new Blob([data], {type: 'text/json'}),
            e    = document.createEvent('MouseEvents'),
            a    = document.createElement('a')

        a.download = filename
        a.href = window.URL.createObjectURL(blob)
        a.dataset.downloadurl =  ['text/json', 
                                    a.download, a.href].join(':')
        e.initMouseEvent('click', true, false, window, 0, 0, 0, 0, 0, 
                            false, false, false, false, 0, null)
        a.dispatchEvent(e)
    }
})(console)

Every time you run console.save(your_data), for example console.save(getData()) here, your browser will automatically download the data into a JSON file called webdata.json.

JavaScript framework

If you are looking for a JavaScript framework that provides a high-level API to control Chrome or Chromium over the DevTools Protocol, you can use Puppeteer.

References:

[1] DevTools Snippets
[2] Web APIs Blob
[3] Scraping XHR