How to block resources in Selenium and Python?

by scrapecrow Jun 30, 2023

To speed up Selenium web scrapers we can block media and other non-essential background requests.

Unfortunately, Selenium by itself doesn't support request interception and blocking, so we must use a proxy to handle the blocking for us and then attach this proxy to our Selenium instance.

For example, a popular proxy for this use case is mitmproxy. We can configure it to block requests by resource type or by resource name.

First, install mitmproxy using pip install mitmproxy or the package manager available on your operating system. Then we can create a simple block.py script that extends mitmproxy with our custom blocking logic:

# block.py
from mitmproxy import http

# we can block popular 3rd party resources like tracking and advertisements.
BLOCK_RESOURCE_NAMES = [
    'adzerk',
    'analytics',
    'cdn.api.twitter',
    'doubleclick',
    'exelator',
    'facebook',
    'fontawesome',
    'google',
    'google-analytics',
    'googletagmanager',
    # or something abstract like images
    'images',
]
# or block based on resource extension
BLOCK_RESOURCE_EXTENSIONS = [
    '.gif',
    '.jpg',
    '.jpeg',
    '.png',
    '.webp',
]

# this will handle all requests going through proxy:
def request(flow: http.HTTPFlow) -> None:
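    url = flow.request.pretty_url
    # check the extension on the path without its query string,
    # so that "/logo.png?v=2" is still recognized as an image
    path = flow.request.path.split("?", 1)[0].lower()
    has_blocked_extension = path.endswith(tuple(BLOCK_RESOURCE_EXTENSIONS))
    contains_blocked_key = any(block in url for block in BLOCK_RESOURCE_NAMES)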
    if has_blocked_extension or contains_blocked_key:
        print(f"Blocked {url}")
        flow.response = http.Response.make(
            404,  # status code
            b"Blocked",  # content
            {"Content-Type": "text/html"}  # headers
        )
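Since block rules are easy to get wrong (the 'google' key above, for instance, matches every URL containing that word), it can help to try them on a few URLs before starting the proxy. The following is a minimal sketch of the same matching logic as a plain function that doesn't need mitmproxy. The should_block name and the shortened rule lists are assumptions made for illustration and are not part of block.py:

```python
from urllib.parse import urlsplit

# shortened, illustrative copies of the rule lists from block.py
BLOCK_RESOURCE_NAMES = ["doubleclick", "google-analytics", "googletagmanager"]
BLOCK_RESOURCE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".webp")


def should_block(url: str) -> bool:
    """Return True when the URL matches a blocked name or a blocked extension."""
    # look only at the path so a query string like "?v=2" doesn't hide the extension
    path = urlsplit(url).path.lower()
    has_blocked_extension = path.endswith(BLOCK_RESOURCE_EXTENSIONS)
    contains_blocked_key = any(name in url for name in BLOCK_RESOURCE_NAMES)
    return has_blocked_extension or contains_blocked_key
```

Calling should_block() on a handful of URLs collected from the browser's network tab is a quick way to see which requests a rule set would drop.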

We can run this script with mitmproxy -s block.py, which starts a proxy on localhost:8080 (mitmproxy's default port) on our machine.
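If the proxy is started from a script or in the background, the browser may launch before the proxy is ready to accept connections. Here's a small sketch that waits until something is listening on the proxy address. The wait_for_proxy helper and its default timeout are hypothetical and use only Python's standard library:

```python
import socket
import time


def wait_for_proxy(host: str = "localhost", port: int = 8080, timeout: float = 10.0) -> bool:
    """Poll until host:port accepts TCP connections or the timeout runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # a successful connection means the proxy is listening
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # not up yet (or refused), retry shortly
            time.sleep(0.2)
    return False
```

This only checks that the port is open; it doesn't verify that the block.py script was loaded.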

Now, we can attach this proxy to our Selenium instance and it'll block all unwanted requests going through it:

from selenium import webdriver

PROXY = "localhost:8080"  # IP:PORT or HOST:PORT of our mitmproxy

chrome_options = webdriver.ChromeOptions()
# this argument enables the proxy for our Selenium browser:
chrome_options.add_argument(f'--proxy-server={PROXY}')

chrome = webdriver.Chrome(options=chrome_options)
# test it by going to a page with blocked resources:
chrome.get("http://web-scraping.dev/product/1")
chrome.quit()

Using this method to block resources can significantly reduce the bandwidth used by the Selenium scraper, often by 2-10 times! This also speeds up scraping, as the browser doesn't need to download and render unnecessary resources.

🤖 Tip: to use mitmproxy with Selenium on https websites, the mitmproxy certificate needs to be installed. For that, see how to install mitmproxy certificate.
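For quick experiments where installing the certificate isn't practical, one possible workaround is Chrome's --ignore-certificate-errors flag, which makes the browser accept the certificates mitmproxy generates. This is only a sketch for throwaway scraping sessions, as it turns off certificate validation for every site. The build_chrome_arguments and make_driver helpers are hypothetical names introduced here:

```python
from typing import List


def build_chrome_arguments(proxy: str, ignore_cert_errors: bool = True) -> List[str]:
    """Build the Chrome command-line arguments for a proxied session."""
    args = [f"--proxy-server={proxy}"]
    if ignore_cert_errors:
        # assumption: acceptable only for scraping sessions, never for normal browsing
        args.append("--ignore-certificate-errors")
    return args


def make_driver(proxy: str = "localhost:8080"):
    """Create a Chrome driver routed through the given proxy."""
    # imported here so build_chrome_arguments() can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_arguments(proxy):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With mitmproxy running, make_driver() returns a browser that can be used the same way as the chrome instance above. Installing the certificate remains the safer option for anything long-lived.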
