We reused the design we have for storing files in S3. The unit of storage in our digital archive is a BagIt “bag”, and each bag is passed through a series of applications. Each application does one step in the process of storing a bag — unpacking a compressed archive, verifying the bag’s contents, copying the bag to long-term storage, and so on.
We had to create new versions of our applications that could manage bags in Azure Blob Storage.
We built prototypes of these new apps, and then we added them to the pipeline for new bags. That meant any new bags were being written to both S3 and Azure, but all the existing bags were only stored in S3. We left this running for a few weeks to see how our new code worked with real bags.
Once we were confident these new apps were working, we replicated all of our existing bags into Azure. That took several weeks, after which our Azure storage account was byte-for-byte identical to our existing S3 buckets. As the replication ran, we kept tweaking the apps to handle unusual bags and improve reliability.
There were a number of challenges writing bags to Azure.
Until now, the Platform team has worked entirely in AWS. Although S3 and Blob Storage are similar (they’re both distributed object stores), they differ in the details, and it took us time to get used to the ins and outs of Azure Blob Storage.
This is especially important when things go wrong. We have plenty of experience with the edge cases of S3, and the storage service has code to handle and retry unusual errors. We didn’t have that experience with Azure, so we had to learn very quickly.
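To give a flavour of the kind of code involved, here’s a minimal retry-with-backoff sketch. It isn’t the storage service’s actual error handling, and the `isRetryable` check is a hypothetical stand-in for exactly the provider-specific knowledge we’re talking about: knowing which S3 or Azure errors are transient and worth retrying.

```java
import java.time.Duration;
import java.util.function.Supplier;

/** Minimal retry-with-backoff sketch; the real value is in knowing what to retry. */
public final class Retry {
    public static <T> T withBackoff(Supplier<T> action, int maxAttempts, Duration initialDelay) {
        Duration delay = initialDelay;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Give up on non-transient errors, or when we've run out of attempts.
                if (!isRetryable(e) || attempt == maxAttempts) {
                    throw e;
                }
                sleep(delay);
                delay = delay.multipliedBy(2);  // exponential backoff
            }
        }
        throw new IllegalArgumentException("maxAttempts must be at least 1");
    }

    // Hypothetical: in practice this is where provider-specific experience lives,
    // e.g. which S3 or Azure Blob Storage error codes are transient.
    private static boolean isRetryable(RuntimeException e) {
        return true;
    }

    private static void sleep(Duration d) {
        try {
            Thread.sleep(d.toMillis());
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(ie);
        }
    }
}
```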
We’ve known we wanted to add another storage provider since the earliest designs of the storage service, and we tried to write our code in a sans I/O, provider-agnostic way. For example, the code that parses a BagIt “bag” manifest shouldn’t have to care whether the bag is in S3 or Blob Storage.
In practice, there were still plenty of cases where our code assumed that it would only ever talk to S3. We had to do a lot of refactoring to remove these assumptions before we could start adding Azure support.
Although we have no immediate plans to add another provider, doing this work once should make it easier if we ever do. The hard part was making our code generic enough to handle more than one provider; after that, adding additional providers should be simpler.
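As an illustration (a hypothetical sketch, not the storage service’s real interfaces), the shape we were aiming for is something like this: shared code, such as the BagIt manifest parser, depends on a small provider-agnostic interface, and only the implementations of that interface know about the AWS or Azure SDKs.

```java
import java.io.InputStream;

/**
 * Hypothetical provider-agnostic read interface, for illustration only.
 * Code like the BagIt manifest parser works against this and an InputStream,
 * and never sees an S3- or Azure-specific client type.
 */
public interface ObjectStore {
    /** Fetch the object at the given container/path, whichever provider it lives in. */
    InputStream get(String container, String path);
}

// In this sketch there is one implementation per provider, e.g. an S3ObjectStore
// wrapping the AWS SDK and an AzureObjectStore wrapping the Azure SDK;
// callers can't tell them apart.
```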
Copying your data into a public cloud is free — but if you want to get it out again, you have to pay. Copying a gigabyte of data out of S3 costs as much as storing it for nearly four months.
Our complete data transfer cost was about $10k for our 60TB archive — not eye-wateringly expensive, but enough that we didn’t want to do it more than necessary. (We paid once to replicate everything from S3 to Azure, and once for the verifier running in AWS to read everything from Azure to check we wrote the correct bytes.)
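The arithmetic is roughly what you’d expect: data transfer out of both AWS and Azure costs on the order of $0.08–0.09 per gigabyte, so a full pass over a roughly 60TB archive comes to something like $5k, and two full passes (the replication out of S3, plus the verification read out of Azure) add up to about $10k.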
We took several steps to keep our data transfer costs down, including caching the results of the Azure bag verifier, and running our own NAT instances in AWS to avoid the bandwidth cost of Amazon’s managed NAT Gateway.
Some of the files we have are fairly large — the largest we’ve seen so far is 166GiB, which is a preservation copy of a digitised video.
Trying to read or write such a large file in a single request can be risky. You need to hold open the same connection for a long time, and if anything goes wrong (say, a timeout or a dropped connection), you have to start from scratch. It’s generally more reliable to break a large file into smaller pieces, read or write each piece individually, then stitch the pieces back together. Both AWS and Azure encourage this approach and have limits on how much data you can write in a single piece.
We’ve had to write a lot of code so that we can reliably read and write blobs of all sizes — with our initial prototypes, we saw dropped connections and timeouts when working with larger blobs. Some of this was adapting code we’d already written for S3; some of it was entirely new code for Azure.
This is an area where the S3 SDK is more mature than Azure’s. The S3 Java SDK provides the TransferManager class for uploading large files, which handles the details of breaking the file into smaller pieces. The Azure Java SDK doesn’t provide a comparable class — we had to write and debug our own code for uploading large files. We’re surely not the first people to need this functionality, and it would be nice if it were provided in the SDK.
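For the curious, here’s a much-simplified sketch of the kind of code this involves. It isn’t our actual implementation, and the block size and helper names are our own choices, but it uses the Azure SDK’s stageBlock and commitBlockList operations on a BlockBlobClient, which is the mechanism for assembling a blob from smaller pieces.

```java
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.specialized.BlockBlobClient;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class ChunkedAzureUpload {
    // A conservative chunk size, well under the SDK's per-block limit.
    private static final int BLOCK_SIZE = 64 * 1024 * 1024;  // 64 MiB

    public static void upload(BlobContainerClient container, String blobName, InputStream source)
            throws IOException {
        BlockBlobClient blockBlob = container.getBlobClient(blobName).getBlockBlobClient();

        List<String> blockIds = new ArrayList<>();
        byte[] buffer = new byte[BLOCK_SIZE];
        int blockNumber = 0;
        int bytesRead;

        // Stage the blob one block at a time; a failed block can be retried on its own
        // without restarting the whole upload.
        while ((bytesRead = readFully(source, buffer)) > 0) {
            String blockId = Base64.getEncoder()
                .encodeToString(String.format("block-%08d", blockNumber++).getBytes());
            blockIds.add(blockId);
            blockBlob.stageBlock(blockId, new ByteArrayInputStream(buffer, 0, bytesRead), bytesRead);
        }

        // Nothing is visible until the block list is committed, which stitches the
        // staged blocks together into the final blob.
        blockBlob.commitBlockList(blockIds);
    }

    // Fill the buffer as fully as possible; returns the number of bytes read, or 0 at end of stream.
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}
```

With the S3 SDK, TransferManager takes care of the equivalent splitting and the final completion step for you.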
Every bag in the storage service has three copies: one in Amazon S3, one in Amazon Glacier, one in Azure Blob Storage. When we store new bags, they’re copied to each of these locations.
This makes our digital archive much more resilient and less likely to suffer catastrophic data loss, and it gives us more confidence as we continue to build on it and use it for ever more data.