HTTP GET requests with the Python standard library
If you’re doing HTTP in Python, you’re probably using one of three popular libraries: requests, httpx, or urllib3; I’ve used each of them at different times. These libraries are installed with pip, live outside the standard library, and provide more features than the built-in urllib.request module – indeed, the documentation for that module recommends using requests.
Recently I’ve been looking for a new HTTP library, because my previous choice seems abandoned. I was using httpx, but the maintainer has closed issues on the GitHub repo, there’s only been one commit since January, and the last release was over a year ago. The easy choice would be switching to requests or urllib3, but I wondered: can I just use the standard library?
My usage is pretty basic – I have some manually-invoked scripts that make a handful of GET requests to public websites. I don’t have long-running processes; I’m not making thousands of requests at once; I’m not using proxies or authentication. There are plenty of features you can only get from third-party HTTP libraries – from connection pooling to HTTP/2 support – but I don’t need any of them.
I started experimenting, and what I realised is that I don’t miss the features, but I do miss the API.
Here’s how you make a basic GET request with httpx:
```python
import httpx

resp = httpx.get(
    "https://example.com",
    params={"name": "pentagon", "sides": "5"},
    headers={"User-Agent": "Shape-Sorter/1.0"}
)

print(resp.content)
```

Here’s the same request with urllib.request:
```python
import urllib.parse
import urllib.request

url = "https://example.com"
params = {"name": "pentagon", "sides": "5"}
headers = {"User-Agent": "Shape-Sorter/1.0"}

u = urllib.parse.urlsplit(url)
query = urllib.parse.urlencode(params)
url = urllib.parse.urlunsplit(
    (u.scheme, u.netloc, u.path, query, u.fragment)
)

req = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(req)

print(resp.read())
```

Verbose! I’ve wrapped it in a helper function in chives, my personal utility library. Here’s the same request a third time:
```python
from chives.fetch import fetch_url

resp = fetch_url(
    "https://example.com",
    params={"name": "pentagon", "sides": "5"},
    headers={"User-Agent": "Shape-Sorter/1.0"}
)

print(resp)
```

Much cleaner!
The code in chives does have one dependency – certifi, a lightweight package that provides Mozilla’s collection of root certificates.
There are lots of good reasons to use a third-party HTTP library, but I can do everything I need with the standard library and my personal wrapper. Let’s go through how it works.
Building the urllib.request.Request object
The first step is building the Request object. Other HTTP libraries provide helper functions or hide this step for simple requests (notice the basic httpx.get call doesn’t mention an httpx.Request), but for urllib.request we have to do it ourselves. Here’s mine:
```python
import urllib.parse
import urllib.request

QueryParams = dict[str, str] | list[tuple[str, str]]
Headers = dict[str, str]

def build_request(
    url: str,
    *,
    params: QueryParams | None = None,
    headers: Headers | None = None
) -> urllib.request.Request:
    """
    Build a urllib Request, appending query parameters and attaching headers.
    """
    if params is not None:
        params_list = list(params.items()) if isinstance(params, dict) else params

        u = urllib.parse.urlsplit(url)
        query = urllib.parse.parse_qsl(u.query) + params_list
        new_query = urllib.parse.urlencode(query)
        url = urllib.parse.urlunsplit(
            (u.scheme, u.netloc, u.path, new_query, u.fragment)
        )

    req = urllib.request.Request(url, headers=headers or {})
    return req
```

I can pass params as a dict or as a list of (key, value) tuples; I start by converting it to the list form. This means I can pass the same query parameter multiple times in a URL. That’s admittedly unusual, but I use it on a couple of my websites so I wanted to support it here.
I’m using the urllib.parse module to manipulate the URL and append the query parameters. I parse the initial URL with urlsplit, encode the query parameters, then reassemble the URL with urlunsplit. This preserves any existing query parameters and fragments, and returns a complete URL I can pass to the Request object.
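To see those steps in isolation, here’s a standalone sketch of the same split–merge–reassemble dance, using a made-up URL that already has a query string and a fragment:

```python
import urllib.parse

# A URL that already has a query string and a fragment.
url = "https://example.com/search?q=shapes#results"

u = urllib.parse.urlsplit(url)

# parse_qsl returns the existing parameters as (key, value) tuples, so new
# parameters -- including repeated keys -- can simply be appended.
query = urllib.parse.parse_qsl(u.query) + [("sides", "5"), ("sides", "6")]
new_query = urllib.parse.urlencode(query)

new_url = urllib.parse.urlunsplit(
    (u.scheme, u.netloc, u.path, new_query, u.fragment)
)

print(new_url)
# https://example.com/search?q=shapes&sides=5&sides=6#results
```

Both the original `q=shapes` parameter and the `#results` fragment survive the round trip.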
(If, like me, you’d reach for the urlparse function, you’re showing your age – one thing I learnt during this project is that urlparse is now obsolete, and urlsplit is the replacement.)
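The practical difference between the two is small: urlparse splits a URL into six components, including a rarely-used `params` field for ;-delimited parameters on the last path segment, while urlsplit returns five components and leaves those parameters in the path. A contrived example:

```python
import urllib.parse

# urlparse peels ;-delimited parameters off the last path segment...
parsed = urllib.parse.urlparse("https://example.com/path;type=a?q=1#frag")
print(parsed.params)  # 'type=a'

# ...whereas urlsplit leaves them in the path, which is almost always
# what you want for modern URLs.
split = urllib.parse.urlsplit("https://example.com/path;type=a?q=1#frag")
print(split.path)  # '/path;type=a'
```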
This function only handles GET requests, which is all I need for my scripts – but it wouldn’t be difficult to extend it to handle POST requests or form data if the need arises.
This is a pure function, so it’s easy to test thoroughly.
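A test just builds a Request and asserts on its attributes – no network, no mocking. This sketch inlines a condensed copy of build_request so it stands alone; in the real test suite it would be imported from chives.fetch:

```python
import urllib.parse
import urllib.request

def build_request(url, *, params=None, headers=None):
    # Condensed copy of build_request from above, so this snippet is
    # self-contained.
    if params is not None:
        params_list = list(params.items()) if isinstance(params, dict) else params
        u = urllib.parse.urlsplit(url)
        query = urllib.parse.parse_qsl(u.query) + params_list
        url = urllib.parse.urlunsplit(
            (u.scheme, u.netloc, u.path, urllib.parse.urlencode(query), u.fragment)
        )
    return urllib.request.Request(url, headers=headers or {})

def test_appends_params_to_existing_query():
    req = build_request("https://example.com/?a=1", params={"b": "2"})
    assert req.full_url == "https://example.com/?a=1&b=2"

def test_attaches_headers():
    req = build_request("https://example.com", headers={"User-Agent": "Test/1.0"})
    # Request normalises header names with str.capitalize()
    assert req.get_header("User-agent") == "Test/1.0"
```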
Getting a web page or an API endpoint
In most cases, I just care about getting the response body from the remote server, not the headers or URL – for example, if I’m fetching a web page or calling an API endpoint. If I want something different in a single script, I’ll eschew my wrapper and use urllib.request directly.
Here’s my fetch_url wrapper:
```python
import certifi
import ssl

def fetch_url(
    url: str,
    *,
    params: QueryParams | None = None,
    headers: Headers | None = None
) -> bytes:
    """
    Fetch the contents of a URL and return the body of the response.
    """
    req = build_request(url, params=params, headers=headers)

    ssl_context = ssl.create_default_context(cafile=certifi.where())

    with urllib.request.urlopen(req, context=ssl_context) as resp:
        data: bytes = resp.read()

    return data
```

The key function is urllib.request.urlopen, which is what actually makes the HTTP request. I’m passing it two parameters: a Request and an SSLContext.
We build the Request using the build_request function.
The SSLContext tells urllib.request which HTTPS certificates it can trust, in this case by pointing to a “cafile” (Certificate Authority file) provided by the certifi library. This file contains a list of trusted root certificates, and all valid HTTPS certificates should eventually point back to an entry in this list.
The certifi library is a lightweight wrapper around Mozilla’s list of trusted Root Certificates. It’s not in the standard library because it’s important to stay up to date with changes to the list, and you don’t want those changes coupled to Python version releases. Although this exercise is about reducing dependencies, I’m okay with certifi because it’s tiny – you can read the whole thing in less than five minutes. I know what it’s doing.
The urlopen function raises an HTTPError if it gets an error response from the server – a 4xx or 5xx status code. I considered wrapping that in another type, but for now I’m just catching HTTPError.
This function doesn’t set a timeout on HTTP requests. That would be an issue in a lot of contexts, but I’m normally using this from a script I run manually. If something gets stuck, I can stop the script and debug manually.
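If you did need one, urlopen accepts a timeout argument (in seconds), so the change would be small – fetch_with_timeout here is a sketch of what that variant could look like, not part of chives:

```python
import urllib.request

def fetch_with_timeout(req: urllib.request.Request, timeout: float = 10.0) -> bytes:
    """Like fetch_url, but a stalled request raises an error instead of hanging."""
    # The timeout applies to blocking socket operations such as the initial
    # connection; when it expires, urlopen raises an error rather than
    # waiting forever.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```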
This function doesn’t support streaming responses; it reads the whole thing into memory at once. That’s fine for web pages or API calls, but I wouldn’t use this to download large files or videos.
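If streaming ever became necessary, the response object does support incremental reads, so a version that writes straight to disk might look like this – download_to_file is hypothetical, not part of chives:

```python
import urllib.request
from pathlib import Path

def download_to_file(url: str, path: Path, chunk_size: int = 64 * 1024) -> None:
    """Stream a response to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as resp, open(path, "wb") as out_file:
        # resp.read(n) returns at most n bytes; b"" signals the end of the body
        while chunk := resp.read(chunk_size):
            out_file.write(chunk)
```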
There’s a lot of stuff this function doesn’t do, but it works well in all of my scripts, it has a friendly API, and it only has one third-party dependency.
Downloading images with format-based file extensions
As I started using fetch_url in my projects, I realised the one time I often care about response headers is when I’m downloading images. I want the filename to have the appropriate filename extension – .jpg for JPEGs, .png for PNGs, and so on. Sometimes I can guess the file format from the URL, but sometimes I need to inspect the Content-Type header.
I considered exposing the headers from fetch_url, but since I only need the headers for downloading images and that’s a pretty common operation, I decided to make a download_image helper instead.
First, I wrote a helper function that picks a filename extension based on the Content-Type header:
```python
def choose_filename_extension(content_type: str | None) -> str:
    """
    Choose a filename extension for an image downloaded with the given
    Content-Type header.
    """
    if content_type is None:
        raise ValueError(
            "no Content-Type header, cannot determine image format"
        )

    content_type_mapping = {
        "image/jpeg": "jpg",
        "image/png": "png",
        "image/gif": "gif",
        "image/webp": "webp",
    }

    try:
        return content_type_mapping[content_type]
    except KeyError:
        raise ValueError(f"unrecognised Content-Type header: {content_type}")
```

The mapping contains the four image formats I encounter in practice; it’s easy for me to add more if I try to download a newer format someday.
Then I wrote a function that takes an image URL and an “out prefix” (an initial guess at the path), downloads the image, chooses a file extension, and returns the final path:
```python
from pathlib import Path

def download_image(
    url: str,
    out_prefix: Path,
    *,
    params: QueryParams | None = None,
    headers: Headers | None = None,
) -> Path:
    """
    Download an image from the given URL to the target path, and return
    the path of the downloaded file.

    Add the appropriate file extension, based on the image's Content-Type.

    Throws a FileExistsError if you try to overwrite an existing file.
    """
    req = build_request(url, params=params, headers=headers)
    ssl_context = ssl.create_default_context(cafile=certifi.where())

    with urllib.request.urlopen(req, context=ssl_context) as resp:
        image_data: bytes = resp.read()
        image_format = choose_filename_extension(
            content_type=resp.headers["content-type"]
        )

    out_path = out_prefix.with_suffix("." + image_format)
    out_path.parent.mkdir(exist_ok=True, parents=True)

    with open(out_path, "xb") as out_file:
        out_file.write(image_data)

    return out_path
```

The first half of this function is the same as fetch_url; the second half constructs the final path and writes the downloaded image to disk. I like this approach because it allows the caller to specify a meaningful directory and filename without worrying about the filename extension (which is important but not meaningful).
The function creates the output directory if it doesn’t exist, for convenience. Nothing grinds my gears like getting a FileNotFoundError when trying to write to a file in a folder that doesn’t exist. My text editor is smart enough to auto-create missing folders; I want my code to do the same.
I open the file in xb mode to avoid overwriting existing files – if I try to write to an image I’ve already saved, I get a FileExistsError. I find that a useful safety check, and I use exclusive creation mode in a lot of my scripts now.
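The behaviour is easy to see in isolation, using a throwaway temporary directory:

```python
import tempfile
from pathlib import Path

out_path = Path(tempfile.mkdtemp()) / "cat.jpg"

# "x" is exclusive creation: the first write succeeds...
with open(out_path, "xb") as f:
    f.write(b"first version")

# ...but a second write to the same path raises FileExistsError,
# and the original contents are left untouched.
try:
    with open(out_path, "xb") as f:
        f.write(b"second version")
except FileExistsError:
    print("refusing to overwrite", out_path.name)

print(out_path.read_bytes())
# b'first version'
```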
Packaging and testing
A few months ago, I created a personal utility library chives for dealing with tiny archives, and that was a good place to keep this code.
The HTTP code is in chives.fetch, and the accompanying tests are in test_fetch.py. I’m testing it using the vcrpy library, which knows how to record responses from urllib.request.
I now use this code across all my personal scripts, and it’s been rock-solid. There are lots of good reasons to use Python’s more advanced HTTP libraries, but they’re for use cases I don’t have.