Using requests_cache to Simulate Browser Caching in Python Web Scraping
This article explains how to use the Python requests_cache library to mimic browser caching behavior in web‑scraping tasks, reducing request latency by skipping delays when cached responses are available, and shows various configuration options, back‑ends, and practical code examples.
Scrapers typically add random delays between requests to avoid tripping anti‑scraping measures, but when a response is already cached no request reaches the server, so the delay can be omitted and the crawler finishes sooner. The requests_cache library provides the browser‑like cache: it returns a cached response when one is available, and a response hook can apply the delay only when the request actually went over the network.
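Before bringing in the library, the "pause only on a cache miss" idea can be sketched with the standard library alone. This is an illustration, not how requests_cache works internally: polite_fetch and fake_fetch are hypothetical names, the cache is a plain dict, and delays are recorded in a list rather than slept so the sketch runs instantly.

```python
import time


def polite_fetch(url, cache, fetch, delay=1.0, sleep=time.sleep):
    """Return (body, was_cached), pausing only when a live fetch happened."""
    if url in cache:
        return cache[url], True   # cache hit: no request, no delay
    body = fetch(url)             # cache miss: perform the real fetch
    cache[url] = body
    sleep(delay)                  # throttle before the next live request
    return body, False


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP call so the sketch runs offline
    return f'<html>{url}</html>'


cache = {}
pauses = []  # record each delay instead of actually sleeping
urls = ['/a', '/b', '/a', '/a', '/c']
hits = [polite_fetch(u, cache, fake_fetch, delay=1.0, sleep=pauses.append)[1]
        for u in urls]

print(hits)         # [False, False, True, True, False]
print(sum(pauses))  # 3.0 seconds of delay instead of 5.0
```

Of the five requests, only the three cache misses incur a delay.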
Basic usage
Install the library:
pip install requests-cache
The following example attaches a response hook that throttles only when a response did not come from the cache:
import time

import requests_cache


def make_throttle_hook(timeout=0.1):
    """Build a response hook that sleeps only after a live (non-cached) request."""
    def hook(response, *args, **kwargs):
        print(response.text)
        if not getattr(response, 'from_cache', False):
            print(f'Wait {timeout} s!')
            time.sleep(timeout)
        else:
            print(f'exists cache: {response.from_cache}')
        return response
    return hook


if __name__ == '__main__':
    session = requests_cache.CachedSession()  # create cached session
    session.cache.clear()  # start from an empty cache
    session.hooks = {'response': make_throttle_hook(2)}  # attach hook
    print('first requests'.center(50, '*'))
    session.get('http://httpbin.org/get')
    print('second requests'.center(50, '*'))
    session.get('http://httpbin.org/get')

httpbin.org is a useful site for testing: it provides endpoints for inspecting cookies, IP, headers, and authentication.
Why use requests_cache?
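It can follow HTTP caching rules much as a browser does (for example with cache_control=True), so repeat visits look less like a naive crawler and may get past simple anti‑scraping checks.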
Customizable cache back‑ends improve performance for repeated requests.
Note that caching applies to requests made through a CachedSession (e.g., session.get); direct calls like requests.get are cached only after requests_cache.install_cache() has patched requests.
Installation
pip install requests-cache
Comparing non‑cached and cached code
Non‑cached example:
import time

import requests

start = time.time()
session = requests.Session()
for i in range(10):
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

Cached example using requests_cache.CachedSession:
import time

import requests_cache  # pip install requests-cache

start = time.time()
session = requests_cache.CachedSession('demo_cache')
for i in range(10):
    # Only the first call waits on the server; the rest are served from cache
    session.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)

To add caching to existing code with minimal changes, call requests_cache.install_cache('demo_cache') once and continue using the normal requests API.
Cache management
Clear the cache with:
requests_cache.clear()
Check whether a response came from cache:
import requests
import requests_cache

requests_cache.install_cache()  # patch requests so requests.get is cached
url = 'http://httpbin.org/get'
res = requests.get(url)
print(f'exists cache: {res.from_cache}')
res = requests.get(url)
print(f'exists cache: {res.from_cache}')

Custom cache back‑ends
The backend parameter selects where cache data is stored: sqlite (the default), the filesystem, memory, MongoDB, or Redis, among others. Example for a filesystem cache:
requests_cache.install_cache('demo_cache', backend='filesystem')
# or with temporary directory
requests_cache.install_cache('demo_cache', backend='filesystem', use_temp=True)

Redis example (requires redis-py):
backend = requests_cache.RedisCache(host='localhost', port=6379)
requests_cache.install_cache('demo_cache', backend=backend)

Advanced options
expire_after: set the global cache TTL (by default, responses never expire).
allowable_codes and allowable_methods: restrict caching to specific HTTP status codes or methods.
cache_control=True: respect the response's Cache‑Control header.
Example of per‑URL expiration:
urls_expire_after = {'*.site1.com': 30, 'site2.com/static': -1}
requests_cache.install_cache('demo_cache2', urls_expire_after=urls_expire_after)

When Cache‑Control: no-store is sent in request headers, the cache is bypassed even if caching is enabled:
session.get('http://httpbin.org/delay/1', headers={'Cache-Control': 'no-store'})

Overall, requests_cache provides a flexible way to add HTTP response caching to Python scraping scripts, improving speed and reducing server load while offering fine‑grained control over cache behavior.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
