Webscraping: Guidelines for efficient & ethical web scraping with Python

Gabriele Albini
10 min read · Dec 29, 2020

1. Introduction

The web is a vast, easily accessible source of information that grows every day as new content is published.

For our corporate or personal projects, the web can be treated as an actual data source: webscraping techniques allow us to extract the information we want to analyse from an existing website and re-use it as our own data source.

In many projects, data gathering is a costly, time-consuming and challenging piece of work: webscraping lets us reuse what is already public instead of collecting data ourselves, saving time and resources and letting us focus more quickly on the most interesting parts of the project.

On the other hand, webscraping may also be seen as a risky and potentially harmful activity, raising questions about what is fair and what is allowed.

In this article, I'll recap the techniques that I've researched, verified and implemented to perform webscraping efficiently, without harming the target websites, making sure to avoid what is not allowed, and still taking some precautions.

2. Before starting

First of all, given any content we would like to get from the web, let's look at all the available alternatives in terms of websites and data sources.

There may be many websites offering the content we need: some may have APIs, some may require creating an account to unlock part or all of the content, some may disallow webscraping, some may not, … It is good to have different alternatives, investigate what is feasible and then choose our target website. But what is actually "feasible"?

To answer this question, the very first thing to do is check what is allowed and what is not.

For each interesting website we have found, let's check whether the content we would like to get is accessible only behind a username / account: if that's the case, it may not be possible to get the content with a script, so let's prefer other websites.

Another useful thing to check is whether or not the website offers an API: if it does, that is obviously the way to retrieve content. APIs are interfaces designed to let users receive information: using them should not affect the normal traffic on the website, and they are the preferred alternative to webscraping.

If no APIs are available, let's look for any terms of use related to the website content, in order to make sure nothing explicitly prevents us from performing webscraping. Although not all websites include specific terms about the information they publish, most websites include a file that lists the pages the website provider would like to keep away from web robots or crawlers. In this file, the provider lists both the sections of the website that shouldn't be crawled and the user agents that are banned: given a website domain, add "/robots.txt" to the URL, open it in the browser and check which sections are allowed or disallowed.

The robots.txt is a simple text file stored in the domain root folder. It contains a group of rules: for example, if a website wanted to completely block a user agent named "crawler123", the robots.txt would include something like this:

User-agent: crawler123

Disallow: /

Normally, if the website doesn't restrict access at all, the robots.txt will grant full access to every user agent:

User-agent: *

Allow: /

In other cases, a website may exclude only some sections for all user agents:

User-agent: *

Disallow: /section1/

Disallow: /section2/

(I recommend checking https://en.wikipedia.org/robots.txt, which contains some funny comments about certain crawlers!)

It is strongly recommended to follow the robots.txt, in order to avoid hitting the website with a lot of disallowed requests, which may cause performance issues and may get us banned from the site.
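The same check can be automated with Python's built-in urllib.robotparser module; here is a minimal sketch (the Wikipedia URLs are just examples):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://en.wikipedia.org/robots.txt")
rp.read()

# True if the "*" user agent is allowed to fetch this example page
print(rp.can_fetch("*", "https://en.wikipedia.org/wiki/Web_scraping"))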

3. Customizing requests

Once we have identified a website that has what we need and doesn't disallow our webscraping activities, the very first step is to use Python to request the html page. Once the html code of the page is downloaded, we can use some libraries (in this article I'll be using Beautiful Soup) to parse the code and extract the information we need.

Requesting the webpage is the step that involves a call to the web, to retrieve the page html; the rest of the webscraping activities are done “offline”, working with the html code.

There are a lot of precautions we can adopt in order to send these “requests” anonymously, especially if the webscraping activity will involve a lot of pages from the same website.

First of all, a very good practice is to use a VPN: this doesn't involve any coding or Python. The idea is simply to run the webscraping script from a laptop that uses a VPN connection.

“A virtual private network (VPN) gives you online privacy and anonymity by creating a private network from a public internet connection. VPNs mask your internet protocol (IP) address so your online actions are virtually untraceable. Most important, VPN services establish secure and encrypted connections to provide greater privacy.”

Getting a VPN typically requires a subscription to a VPN provider: this hides your IP and encrypts your connection. Additionally, some providers let you make the connection appear to come from a country of your choice, or offer more advanced techniques if you are trying to reach a website that is inaccessible from your country. A VPN is not a prerequisite for web scraping, but it is definitely useful to prevent the target website from discovering and eventually blocking your IP.

Another challenge we may face is the controls that websites put in place: if a lot of requests are sent from the same IP to the same domain in a very short amount of time, the website may flag the activity as suspicious and redirect us to a captcha page. One way to avoid the problem is to customise each request, making it look a little bit different: the options we have are to alter the headers and/or the IP.

3.1 Customizing Headers

“A request header is an HTTP header that can be used in an HTTP request to provide information about the request context, so that the server can tailor the response.”

In other words, the header is a kind of introduction of the origin device to the website, detailing where the request is coming from in terms of device, browser, language, etc.

Within Python, it is possible to customise the request headers; here's how.

First of all, let’s copy a set of 4–5 possible request headers: we can get them by using different devices we have available, or different browsers from the same device, or by using any user-agent switcher extension installed on our usual browser. In any case, once we have the website opened on our browser:

  • Right click on the mouse and select “Inspect”
  • In Chrome, go to the “network” tab
  • Refresh the page and detect the webpage request
  • Right click on it and select “copy all as cURL”
  • Paste the content on https://curl.trillworks.com/, which will give us a usable code in Python

By repeating these steps from different devices or browsers, we can get a set of different headers.

Lastly, when looking at the full information included in the headers, we can identify any personal content we may want to change or delete.

Within Python, we can treat each header as a dictionary, build a list of these dictionaries and randomly pick one entry for each request. This is shown in the sketch below:

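A minimal sketch of this approach (the header values are shortened, illustrative examples, not real ones):

import random

# Each dictionary is one header set copied from a real browser/device
# (values are shortened here for readability)
headers_list = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
     "Accept-Language": "en-US,en;q=0.9"},
    {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
     "Accept-Language": "en-GB,en;q=0.8"},
]

def pick_headers():
    # Randomly pick one header set for each request
    return random.choice(headers_list)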

3.2 Customizing IP

Similarly to headers, the next element that can be customised in each request is the IP: Python allows us to route a request through a proxy IP.

“A proxy server is an intermediary server through which your traffic gets routed. The internet servers you visit see only the IP address of that proxy server and not your IP address.”

There are several options to get a list of usable proxy IPs: normally they can be purchased or are provided as part of a subscription plan.

Another option is to use some websites (which you can google) that keep lists of free proxy IPs on their domains. These IPs may be used by many people, and some websites may detect them, rejecting the request or re-routing it to a captcha page. We may also find that some IPs simply don't work. However, by trying IP after IP, we can send requests through a proxy for free.

In this scenario, one possible strategy is to scrape these websites in order to build a list of 50–100 free proxy IPs. Then we can randomly pick an IP from this list, test it and use it for our requests (or discard it if it doesn't seem to work).

Below, you can find code that performs these exact steps, using pages from a website that publishes some free IPs alongside a subscription plan.

The IPs are contained in an html table, which is looped over using the ".children" Beautiful Soup method:

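Here is a simplified sketch of this idea; the listing URL below is a placeholder, standing for any page that publishes free proxies in a plain html table (the column layout is also an assumption):

import requests
from bs4 import BeautifulSoup

def get_free_proxies(url="https://example.com/free-proxy-list"):  # placeholder URL
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    table = soup.find("table")
    rows = table.find("tbody") or table   # some tables wrap their rows in a <tbody>
    proxies = []
    for row in rows.children:             # loop over the rows of the table
        if row.name != "tr":
            continue                      # skip the whitespace between tags
        cells = row.find_all("td")
        if len(cells) >= 2:               # assuming the first two columns are IP and port
            ip = cells[0].get_text(strip=True)
            port = cells[1].get_text(strip=True)
            proxies.append(f"{ip}:{port}")
    return proxies

# usage: build the list once, then randomly pick and test one proxy per request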

It is good practice to test whether these free proxy IPs actually work. A simple way to do that is to check whether we get an "ok" response when requesting a page such as https://httpbin.org/ip (which just shows your IP address). This test is performed by the function below:

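A minimal sketch of such a test using the requests library (the test_proxy name and the "ip:port" string format are assumptions of this sketch):

import requests

def test_proxy(proxy):
    # proxy is an "ip:port" string; return True if it can reach httpbin.org/ip
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
        return response.ok
    except requests.RequestException:
        return False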

3.3 Adding delay between requests

A final feature I'd add to our requests is delay.

When browsing, a human user navigates the pages of a website with some delay: spending time on the landing page, then moving to other sections, reading a few lines, looking for buttons, etc. This is not what happens with a webscraping script, which could send many requests to different pages of a domain without any delay. The infrastructure behind a website is sized to handle the first scenario, within limits on how many users can navigate the site at the same time. The second kind of traffic, generated by webscraping scripts, may cause unexpected peaks and stress the infrastructure.

For this reason, and also because the webscraping activities aren’t supposed to be harmful, it is a good practice to introduce some delay in our code, between requests.

Python allows us to introduce a delay in our script, for instance a random pause of 5 to 10 seconds between requests:

import random
import time

time.sleep(random.uniform(5, 10))

4. Extract the info needed from the html source code

4.1 Request the html

Once the request to the webpage has been sent successfully, the html of the page can be parsed with the BeautifulSoup() Python constructor: this gives us access to several methods for locating the sections and the content we want to extract from the html.

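A minimal sketch of the request-and-parse step (the URL is a placeholder, and pick_headers() refers to the header-rotation sketch above):

import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-page"                 # placeholder target page
response = requests.get(url, headers=pick_headers(), timeout=10)
response.raise_for_status()                           # stop if the request failed

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())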

Before starting to code, we should analyse the page's html directly in the browser, understanding where the information we need is located, which tags wrap it, and whether any tag class name or html hierarchy can help us.

4.2 Clean the html code

We should also look out for misleading content: some pages mix html and javascript code within the same html file. The two may overlap in terms of class names, content, etc., which can mislead our script (when looking for keywords or tags, we may find duplicate results, especially when searching by html "class"). Beautiful Soup allows us to remove any content we want from our downloaded copy of the page, with the .extract() method:

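For example, a minimal sketch (on a small invented page) that drops every <script> block from the parsed html before searching it:

from bs4 import BeautifulSoup

html = "<html><body><script>var x = 'target';</script><div class='target'>keep me</div></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <script> tag so it cannot pollute class or keyword searches
for script_tag in soup.find_all("script"):
    script_tag.extract()

print(soup)   # the <script> block is gone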

4.3 Exploit the html tag classes or id

If the information we are looking for is identifiable by a class or an id specified within the html tag, then we can look for those elements and get their content (e.g. their text).

Suppose that the information we want to get is inside different <div>, each with class = “target”. We could extract their text with this code:

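A minimal sketch, with a small invented html snippet:

from bs4 import BeautifulSoup

html = """
<div class="target">First piece of info</div>
<div class="other">Not interesting</div>
<div class="target">Second piece of info</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <div> whose class is "target"
texts = [div.get_text(strip=True) for div in soup.find_all("div", class_="target")]
print(texts)   # ['First piece of info', 'Second piece of info']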

4.4 Exploit the html hierarchy

In some other cases, the information we want to get is not simply identified by the class: classes may be missing or the information may be spread across multiple tags.

Many html pages have some kind of hierarchy of tags, such as a main <html> tag wrapper, then a <head> and a <body> section with different containers (div) inside. For instance:

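A hypothetical page with this kind of structure (invented for the example, and reused in the sketches below) might look like this:

sample_html = """
<html>
  <head><title>Example</title></head>
  <body>
    <div id="section1">
      <h2>Section 1</h2>
      <p>First relevant line</p>
      <p>Second relevant line</p>
    </div>
    <div id="section2">
      <h2>Section 2</h2>
      <p>Other content</p>
    </div>
  </body>
</html>
"""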

If a hierarchy can be identified, we could exploit it to focus our search on a specific section of the page, navigating through some elements on the same level (“siblings” in BeautifulSoup) or focusing on the “children” of an element which is easy to find.

Here's an example that extracts the "section 1" content from the above html:

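A minimal sketch, based on the hypothetical sample_html defined above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(sample_html, "html.parser")

# Locate the section 1 container by its id, then loop over its children
section1 = soup.find("div", id="section1")
for child in section1.children:
    if child.name == "p":                 # keep only the <p> children
        print(child.get_text(strip=True))
# First relevant line
# Second relevant line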

Even if the hierarchy is mostly flat, we could still loop over siblings and get the content we are aiming for. In this other example, the code will extract the lines of the first 2 articles:

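A minimal sketch, using a flat, invented structure where each article title and its lines are all siblings:

from bs4 import BeautifulSoup

flat_html = """
<div>
  <h3>Article 1</h3><p>Line 1.1</p><p>Line 1.2</p>
  <h3>Article 2</h3><p>Line 2.1</p>
  <h3>Article 3</h3><p>Line 3.1</p>
</div>
"""
soup = BeautifulSoup(flat_html, "html.parser")

articles_read = 0
for tag in soup.div.children:             # titles and lines are all on the same level
    if tag.name == "h3":
        articles_read += 1
        if articles_read > 2:             # stop once the third article starts
            break
    elif tag.name == "p":
        print(tag.get_text(strip=True))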

The result will be the text of the lines belonging to the first two articles only:

Line 1.1
Line 1.2
Line 2.1

5. Conclusions

I wanted to recap some key takeaways about performing webscraping in an efficient but ethical and safe way:

  1. Do things ethically:
  • Make sure the website is not explicitly disallowing webscraping
  • Include references / sources in your final project

  2. Take some precautions not to be harmful:
  • Whenever available, always prefer a provided API
  • Include some delay between the requests in your scripts

  3. Don't neglect your own security; take some precautions:
  • Get a VPN plan
  • Use a technique to rotate headers
  • Use a technique to rotate proxy IPs

  4. Design your script:
  • Analyse the html from your browser
  • Exploit the html structure to your advantage: tag classes, html hierarchy
