Using zipstream to stream new zip files to object storage with boto3
You can construct a zipstream.ZipFile, add files, then wrap it in a file-like object to upload it with S3.upload_fileobj.
I’m writing some code to build large ZIP files and store them in S3-compatible object storage.
I could build the ZIP as a local file, and then upload it to the cloud – but that means I need to have enough local disk space to store the entire file before I begin uploading. Alternatively, I could store the zip in an in-memory buffer – but now I need lots of memory instead of lots of disk. What if I could generate the ZIP file piece-by-piece, so I don’t need to hold the whole thing at once?
I found Allan Lei’s zipstream module, which does exactly what I want:
Like Python’s ZipFile module, except it works as a generator that provides the file in many small chunks.
It took a bit of work to integrate it with boto3.
(I was using Linode object storage for all of these examples, but I think it should be the same for any S3-compatible storage that works with boto3.)
I’m using boto3 1.38.19 and zipstream 1.1.4.
You need to pass a file-like object
Here’s my first attempt, based on the code in the python-zipstream README. I construct the ZipFile, add files, then pass it directly to upload_fileobj.
import boto3
import zipstream
s3_client = boto3.client("s3")
zf = zipstream.ZipFile(mode="w")
zf.writestr(arcname="greeting.txt", data=b"Hello world!")
zf.writestr(arcname="numbers.json", data=b"[1, 2, 3, 4, 5]")
# ❌ Does not work ❌
s3_client.upload_fileobj(
    Fileobj=zf,
    Bucket="my-example-bucket",
    Key="example.zip"
)
This fails with an exception:
KeyError: 'There is no item named 8388608 in the archive'
This is because upload_fileobj expects to receive a file-like object, whereas zipstream.ZipFile is giving it an iterable of bytes. We can write a wrapper class that transforms the ZipFile to get our first working upload.
class FileLikeObject:
    """
    Wrap an iterable of ``bytes`` and turn it into a file-like object
    that can be passed to ``S3Client.upload_fileobj``.
    """
    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.buffer = b""
    def read(self, size=-1) -> bytes:
        """
        Read up to ``size`` bytes from the object and return them.
        If ``size`` is unspecified or -1, all bytes until EOF are returned.
        Fewer than ``size`` bytes may be returned if there are less than
        ``size`` bytes left in the iterator.
        """
        size: int = size or -1
        # Fill the buffer with enough bytes to fulfil the request.
        while size < 0 or len(self.buffer) < size:
            try:
                chunk = next(self.iterator)
                self.buffer += chunk
            except StopIteration:
                break
        if size < 0:
            result, self.buffer = self.buffer, b""
        else:
            result, self.buffer = self.buffer[:size], self.buffer[size:]
        return result
s3_client.upload_fileobj(
    Fileobj=FileLikeObject(zf),
    Bucket="example-bucket",
    Key="example.zip",
)
This wrapper class also gives an opportunity to capture other data as the stream is being uploaded – for example, if you wanted to get the size or checksum of the ZIP file as it’s being created.
Set allowZip64=True if you want big ZIP files
By default, ZIP files have a limit of 4 GB and 65,535 files. If you want bigger zip files, you need to use the ZIP64 extension. In zipstream, this is disabled by default.
If you try to upload a ZIP file that exceeds these limits:
zf = zipstream.ZipFile(mode="w")
for i in range(100000):
    s = str(i).zfill(6)
    zf.writestr(arcname=f'numbers/{s[1]}/{s[2]}/{s}.txt', data=s.encode("utf8"))
s3_client.upload_fileobj(
    Fileobj=FileLikeObject(zf),
    Bucket="example-bucket",
    Key="numbers.zip",
)
then the upload fails with an error:
zipfile.LargeZipFile: Files count would require ZIP64 extensions
You need to set allowZip64=True:
zf = zipstream.ZipFile(mode="w", allowZip64=True)
and then the upload succeeds.