Finding the size and cost of each of your S3 buckets

Whenever I look at our AWS bill, one of the biggest costs is always S3 storage. That’s not a surprise – our account holds, among other things, two copies of Wellcome Collection’s entire digital archive, which is nearly 120TB and growing every day. If we ever got a bill and there wasn’t a big number next to S3, that would be a reason to panic.

We spend about $25,000 on S3 storage every year. That’s not nothing, but it’s also not exorbitant in the context of a large organisation. It’d be nice to find some easy wins, but developer time costs money too – it’s worth an hour to save a few thousand dollars a year, but a complete audit to squeeze out a few extra dollars is out of the question.

If we wanted to reduce this cost, we’d need to know where it’s coming from: which buckets hold the most data? And given what we use each bucket for, are any of them surprisingly large or populous? That would give us an idea of where to start deleting things first.

I wrote a script to get an overview of our buckets. It creates a spreadsheet that tells me:

Among other things, the first time I ran it I discovered:

It uses CloudWatch Metrics to get the total number of bytes in each bucket. Those figures are only updated every couple of days, but they’re accurate enough to show which buckets are worth further investigation.
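To make the CloudWatch approach concrete, here is a minimal Python sketch of how such a lookup could work. It is an illustration rather than the published script: the function names `latest_metric_value` and `main`, the three-day lookback window, and the CSV column names are all assumptions made for this example. It also assumes boto3 is installed with credentials configured, and it only counts the `StandardStorage` class when measuring size.

```python
import csv
import datetime
import sys


def latest_metric_value(cloudwatch, bucket_name, metric_name, storage_type,
                        lookback_days=3):
    """Return the most recent daily value of an S3 storage metric, or 0.

    `cloudwatch` is a boto3 CloudWatch client (or anything with a compatible
    get_metric_statistics method). The lookback window is an assumption:
    it just needs to be wide enough to catch the latest datapoint.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=now - datetime.timedelta(days=lookback_days),
        EndTime=now,
        Period=86400,  # one datapoint per day
        Statistics=["Average"],
    )
    datapoints = resp["Datapoints"]
    if not datapoints:
        return 0
    # Datapoints are not guaranteed to be sorted, so pick the newest one.
    newest = max(datapoints, key=lambda d: d["Timestamp"])
    return int(newest["Average"])


def main():
    # Imported here so the helper above can be exercised without AWS access.
    import boto3

    s3 = boto3.client("s3")
    cloudwatch = boto3.client("cloudwatch")

    writer = csv.writer(sys.stdout)
    writer.writerow(["bucket", "size_bytes", "object_count"])
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        # BucketSizeBytes is reported per storage class; this only looks at
        # Standard, so buckets using other classes will be undercounted.
        size = latest_metric_value(
            cloudwatch, name, "BucketSizeBytes", "StandardStorage")
        count = latest_metric_value(
            cloudwatch, name, "NumberOfObjects", "AllStorageTypes")
        writer.writerow([name, size, count])
```

Calling `main()` prints one CSV row per bucket, which can be redirected to a file and opened as a spreadsheet. One caveat: CloudWatch metrics are regional, so a single client only sees metrics for buckets in its own region. An account with buckets in several regions would need one client per region.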

It took about half an hour to write the initial version, and a few hours more to tidy it up. I’ve posted the script on GitHub so other people can use it to find quick wins in their own AWS accounts. If you have a big S3 bill too, you might want to try it and see if you have any unexpectedly large buckets.