Replicating Wellcome Collection’s digital archive to Azure Blob Storage

I wrote this article while I was working at Wellcome Collection. It was originally published on their Stacks blog under a CC BY 4.0 license, and is reposted here in accordance with that license.

Our cloud storage service is designed to ensure the long-term preservation of our digital collections. As an archive, we have an obligation to ourselves and to our depositors to keep our collections safe. We’ve spent millions of pounds digitising our physical objects, and some of our born-digital and audiovisual material is irreplaceable.

One way we do this is by storing multiple copies of every file. If one copy were to be corrupted or deleted, we'd still have other copies that we could use to reconstruct the complete archive.

In the initial iteration of our storage service, we kept two copies of every file in a pair of Amazon S3 buckets. We’ve recently upgraded the storage service to keep a third copy of every file in Azure Blob Storage, and in a different geographic location. In this post, I’m going to explain why this change was important, and how we made it.

A pile of blue dusty powder
Photo by Marco Almbauer, taken from Wikimedia Commons, used under CC BY‑SA 3.0.

Why do we need another storage provider?

Before we started this work, both copies of our digital collections were kept in two S3 buckets, both in the same AWS account.

Our AWS account is a single point of failure. Although Amazon invest heavily in reliability and durability — for example, objects in S3 are stored across multiple, distinct facilities (“Availability Zones”) and should survive the destruction of a single data centre — you can imagine scenarios in which we lose everything in our S3 buckets.

Amazon could close our account because of a payment issue. Somebody could accidentally run a script that empties both buckets. There could be a systemic issue that affects all content in S3.

Adding a second provider removes this single point of failure. All of these scenarios could still happen, but another copy of the data in a separate provider means they would be inconvenient rather than disastrous. There are fewer scenarios that would affect multiple providers simultaneously.

Why do we need another geographic location?

All of our data in S3 is kept in a single AWS region. An AWS region is a collection of data centres in the same geographic area. We use the eu-west-1 region, which means our data is somewhere in Ireland.

Having all our data in the same geographic location is another single point of failure. Although our data is spread across multiple data centres, you can imagine manmade or natural disasters in which many data centres are affected at the same time.

Adding a second geographic location removes this single point of failure. There are fewer disasters that would affect multiple locations at once, especially if (as our replicas are) the locations are on different landmasses.

Which location and provider did we choose?

We chose Azure Blob Storage because Wellcome is already using Azure for a number of other services, including Active Directory and SQL Databases — but the account is owned and managed by a different team. That means the account has a different payment setup, different support contacts, and a different authentication system. Even the internal processes for getting access to our AWS and Azure accounts are different.

There’s very little overlap between our AWS and Azure accounts, which minimises the chance of a systemic failure that affects both. That overlap includes people. At the time of writing, just three people have access to all three replicas, and eventually we’d like to reduce that number. That means there are very few people who could wipe out every copy of the archive (whether accidentally or maliciously).

We did look at other cloud storage providers — there are plenty of good choices, all of which would probably have worked similarly well. Azure was the best choice for us, but it may not be for you. If you have critical data in cloud storage, the important thing is to have a backup in another cloud. Which cloud you use as backup is less important.

We’re using Azure’s West Europe (Netherlands) region. We chose it because it’s a good distance from Ireland, while being roughly as far from the Wellcome Collection building as our existing data. Again, the exact choice isn’t so important — the important thing is that it’s a different geographic location from our data in S3.


How did we implement replication to Azure?

We reused the design we have for storing files in S3. The unit of storage in our digital archive is a BagIt “bag”, and each bag is passed through a series of applications. Each application does one step in the process of storing a bag — unpacking a compressed archive, verifying the bag’s contents, copying the bag to long-term storage, and so on.
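
For readers who haven’t met BagIt before: a bag is just a directory laid out according to the BagIt specification (RFC 8493), with the payload files under data/ and a manifest listing a checksum for every payload file. The sketch below shows roughly what the “verifying the bag’s contents” step involves. It isn’t our actual verifier code, just an illustration of checking a SHA-256 payload manifest:

```java
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

// An illustrative sketch of BagIt payload verification: every line of
// manifest-sha256.txt is "<checksum> <relative path>", and the file at that
// path must hash to that checksum. The names here are made up for the example.
public class BagVerifierSketch {
    public static void verify(Path bagRoot) throws Exception {
        Path manifest = bagRoot.resolve("manifest-sha256.txt");

        for (String line : Files.readAllLines(manifest)) {
            if (line.isBlank()) continue;
            String[] parts = line.trim().split("\\s+", 2);
            String expected = parts[0];
            Path payloadFile = bagRoot.resolve(parts[1]);

            // Stream the file through the digest rather than loading it into
            // memory: some payload files are hundreds of gigabytes.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            try (var in = new DigestInputStream(Files.newInputStream(payloadFile), digest)) {
                in.transferTo(OutputStream.nullOutputStream());
            }

            String actual = HexFormat.of().formatHex(digest.digest());
            if (!actual.equalsIgnoreCase(expected)) {
                throw new IllegalStateException("Checksum mismatch for " + payloadFile);
            }
        }
    }
}
```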

We had to create new versions of our applications that could manage bags in Azure Blob Storage.

We built prototypes of these new apps, and then added them to the pipeline for new bags. That meant any new bag was written to both S3 and Azure, while all the existing bags were still only stored in S3. We left this running for a few weeks to see how our new code worked with real bags.

Once we were confident these new apps were working, we replicated all of our existing bags into Azure. That took several weeks, after which our Azure storage account was byte-for-byte identical to our existing S3 buckets. As we were doing it, we kept tweaking the apps to handle unusual bags and improve reliability.


What was hard?

There were a number of challenges writing bags to Azure.

Getting used to Azure

Until now, the Platform team has worked entirely in AWS. Although S3 and Blob Storage are similar (they’re both distributed object stores), they differ in the details, and it took us time to get used to the ins and outs of Azure Blob Storage.

This is especially important when things go wrong. We have plenty of experience with the edge cases of S3, and the storage service has code to handle and retry unusual errors. We don’t have that experience with Azure, and we had to learn very quickly.

Refactoring our code to work with Azure

We’ve known we wanted to add another storage provider since the earliest designs of the storage service, and we tried to write our code in a sans I/O, provider-agnostic way. For example, the code that parses a BagIt “bag” manifest shouldn’t have to care whether the bag is in S3 or Blob Storage.

In practice, there were still plenty of cases where our code assumed that it would only ever talk to S3. We had to do a lot of refactoring to remove these assumptions before we could start adding Azure support.
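
To give a flavour of what provider-agnostic means in practice, here is a minimal sketch (in Java, with made-up names rather than our actual classes) of the kind of interface we refactored towards. Bag-handling code only ever sees the abstract store; only the implementations know about a particular cloud SDK.

```java
import java.io.InputStream;

// An illustrative provider-agnostic abstraction: one implementation wraps the
// AWS SDK, another wraps the Azure SDK, and everything else is written
// against these interfaces. The names are hypothetical.
interface ObjectLocation {
    String namespace();  // an S3 bucket or an Azure container
    String path();       // the key or blob name within it
}

interface ObjectStore {
    InputStream get(ObjectLocation location) throws Exception;

    void put(ObjectLocation location, InputStream contents, long length) throws Exception;
}

// e.g. class S3Store implements ObjectStore { ... }
//      class AzureBlobStore implements ObjectStore { ... }
```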

Although we have no immediate plans to do so, doing this work once should make it easier to add another provider in the future. The hard part was making our code generic enough to handle more than one provider; adding further providers after that should be simpler.

The cost of cross-cloud data transfer

Copying your data into a public cloud is free — but if you want to get it out again, you have to pay. Copying a gigabyte of data out of S3 costs as much as storing it for nearly four months.
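
As a rough worked example, using typical list prices of around $0.09 per GB for data transferred out of S3 to the internet and around $0.023 per GB per month for S3 Standard storage: $0.09 ÷ $0.023 ≈ 3.9, so moving a gigabyte out once costs about as much as keeping it for nearly four months.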

Our complete data transfer cost was about $10k for our 60TB archive — not eye-wateringly expensive, but enough that we didn’t want to do it more than necessary. (We paid once to replicate everything from S3 to Azure, and once for the verifier running in AWS to read everything from Azure to check we wrote the correct bytes.)

We took several steps to keep our data transfer costs down, including caching the results of the Azure bag verifier, and running our own NAT instances in AWS to avoid the bandwidth cost of Amazon’s managed NAT Gateway.

Reading and writing large files

Some of the files we have are fairly large — the largest we’ve seen so far is 166GiB, which is a preservation copy of a digitised video.

Trying to read or write such a large file in a single request can be risky. You need to hold open the same connection for a long time, and if anything goes wrong (say, a timeout or a dropped connection), you have to start from scratch. It’s generally more reliable to break a large file into smaller pieces, read or write each piece individually, then stitch the pieces back together. Both AWS and Azure encourage this approach and have limits on how much data you can write in a single piece.

We’ve had to write a lot of code so that we can reliably read and write blobs of all sizes — with our initial prototypes, we saw dropped connections and timeouts when working with larger blobs. Some of this was adapting code we’d already written for S3; some of it was entirely new code for Azure.

This is an area where the S3 SDK is more mature than Azure. The S3 Java SDK provides the TransferManager class for uploading large files, which handles the details of breaking the file into smaller pieces. The Azure Java SDK doesn’t provide a comparable class — we had to write and debug our own code for uploading large files. We’re surely not the first people to need this functionality, and it would be nice if it was provided in the SDK.
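
For illustration, here is a minimal sketch of the manual approach with the Azure Storage Java SDK (v12): stage the source in fixed-size blocks, then commit the block list to assemble the blob. This is only the easy part; the retries, bounded parallelism and connection handling that took most of our debugging time are left out.

```java
import com.azure.storage.blob.specialized.BlockBlobClient;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// An illustrative chunked upload using block blobs: stage each chunk with
// stageBlock(), then commitBlockList() assembles them into a single blob.
// Real code also needs retries for failed blocks and bounded parallelism.
public class BlockUploadSketch {
    // 100 MB per block; a blob can have up to 50,000 blocks, which is far
    // more than we need even for a 166 GiB file.
    private static final int BLOCK_SIZE = 100 * 1024 * 1024;

    public static void upload(BlockBlobClient blob, InputStream source) throws Exception {
        List<String> blockIds = new ArrayList<>();
        byte[] buffer = new byte[BLOCK_SIZE];
        int blockNumber = 0;
        int bytesRead;

        while ((bytesRead = source.readNBytes(buffer, 0, BLOCK_SIZE)) > 0) {
            // Block IDs must be Base64-encoded and all the same length.
            String blockId = Base64.getEncoder().encodeToString(
                    String.format("block-%06d", blockNumber).getBytes(StandardCharsets.UTF_8));

            blob.stageBlock(blockId, new ByteArrayInputStream(buffer, 0, bytesRead), bytesRead);
            blockIds.add(blockId);
            blockNumber++;
        }

        // Nothing appears at the blob's name until the block list is committed.
        blob.commitBlockList(blockIds);
    }
}
```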


Where we are now

Every bag in the storage service has three copies: one in Amazon S3, one in Amazon Glacier, and one in Azure Blob Storage. When we store new bags, they’re copied to each of these locations.

This means our digital archive is much more resilient and less likely to suffer catastrophic data loss. This gives us more confidence as we continue to build on it and use it for ever-more data.