A snippet for downloading files with Python

Back in February, I wrote about the new storage service I’ve been helping to build at Wellcome. Since then, we’ve migrated almost everything from the old system into the new service. Before we can decommission the old system, we need to check that every one of the 42 million files it was storing has either been migrated successfully, or is something we’re happy to delete (for example, material that was ingested as part of a system test).

We’ve verified about 99.5% of the files so far, and it’s just the remaining half a percent that need to be checked. The best way to process the last few files is to download them and do some manual inspection, to understand whether a given set of files should be saved or deleted. I don’t want to download the files by hand, so I’ve written some Python scripts to automate the process.

The files are available over HTTP from a server that’s in the office – but of course, we’re not working from the office right now. I have to download the files to a laptop at home, over a VPN. The server is quite underpowered, and timeouts or dropped connections aren’t uncommon. When a download does complete, the bytes are correct; the problem is that downloads often fail midway.

This is a common problem I have to solve in a lot of my scripts: downloading a file over an unreliable connection, and retrying if something goes wrong partway through.

Because I do this so often, I’ve tidied up and extracted the function I use to download files. The next time I have to do this, I won’t have to write it from scratch:

import os
import sys
import uuid

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_fixed
import urllib3.exceptions


@retry(
    retry=(
        retry_if_exception_type(httpx.HTTPError) |
        retry_if_exception_type(urllib3.exceptions.HTTPError)
    ),
    stop=stop_after_attempt(10),
    wait=wait_fixed(60),
)
def download_file(*, url, path, client=None):
    """
    Atomically download a file from ``url`` to ``path``.

    If ``path`` already exists, the file will not be downloaded again.
    This means that different URLs should be saved to different paths.

    This function is meant to be used in cases where the contents of ``url``
    are immutable -- fetching it more than once should always return the same bytes.

    Returns the download path.

    """
    # If the URL has already been downloaded, we can skip downloading it again.
    if os.path.exists(path):
        return path

    if os.path.dirname(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)

    if client is None:
        client = httpx.Client()

    try:
        with client.stream("GET", url) as resp:
            resp.raise_for_status()

            # Download to a temporary path first.  That way, we only get
            # something at the destination path if the download is successful.
            #
            # We download to a path in the same directory so we can do an
            # atomic ``os.rename()`` later -- atomic renames don't work
            # across filesystem boundaries.
            tmp_path = f"{path}.{uuid.uuid4()}.tmp"

            with open(tmp_path, "wb") as out_file:
                for chunk in resp.iter_raw():
                    out_file.write(chunk)

    # If something goes wrong, it will probably be retried by tenacity.
    # Log the exception in case a programming bug has been introduced in
    # the ``try`` block or there's a persistent error.
    except Exception as exc:
        print(exc, file=sys.stderr)
        raise

    os.rename(tmp_path, path)
    return path
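
Here’s a minimal sketch of how I use it, reusing the imports above. The URLs and paths are made-up examples, not real endpoints:

# Download a single file; tenacity retries automatically if it fails.
download_file(
    url="http://storage.example.org/files/b1234.tar.gz",
    path="downloads/b1234.tar.gz",
)

# If you're downloading lots of files, you can pass in a single client,
# so connections get pooled rather than re-opened for every file.
client = httpx.Client()

for name in ["b0001.tar.gz", "b0002.tar.gz"]:
    download_file(
        url=f"http://storage.example.org/files/{name}",
        path=os.path.join("downloads", name),
        client=client,
    )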

This is more involved than just using urlretrieve in the standard library, but it gets me several improvements:

- Retries: tenacity re-runs the whole download for up to ten attempts if the connection fails, waiting a minute between attempts.
- Atomic writes: the file is downloaded to a temporary path and only renamed into place once it’s complete, so a failed download never leaves a partial file at the destination.
- Idempotence: if there’s already a file at ``path``, the download is skipped rather than fetching the same bytes again.
- Streaming: the response is written in chunks, so large files don’t have to fit in memory.
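
For comparison, here’s roughly the bare standard-library equivalent (a sketch, with the same made-up URL and path):

import urllib.request

# One-shot download: no retries, and no atomic rename -- if the
# connection drops midway, a partial file is left at the destination.
urllib.request.urlretrieve(
    "http://storage.example.org/files/b1234.tar.gz",
    "downloads/b1234.tar.gz",
)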

I don’t have any tests for this right now, but I’ve tested it in the past, and I’ve used variants of this code to download thousands of files successfully.

The next time I write a script that needs to download files, I’ll copy-paste this code into the project. It’s not the most complicated thing in the world, but it’s one less thing I’ll have to write from scratch. If this would be useful to you, feel free to copy it into your own code as well.