
Wellcome Collection’s approach to digital verification

I wrote this article while I was working at Wellcome Collection. It was originally published on their Stacks blog under a CC BY 4.0 license, and is reposted here in accordance with that license.

As an archive, we have a responsibility to preserve and maintain our collections, and this is just as true of digital files as of physical objects. But digital media is fragile: we can’t just save our files on a hard drive, lock it in a cupboard, and hope for the best. Disks fail, data decays, and files get corrupted. Unlike paper, which degrades slowly and might remain readable for centuries, a digital file can be rendered completely unreadable by a single error.

To ensure our digital collections are preserved for years to come, we have a set of rigorous checks and verifications that run against every digital file we store in the archive. This includes verifying checksums, validating filenames, and checking we have the right number of files.

In this post, I’m going to explain how we decide which checks to run, and when we run them. Our approach is deliberately quite broad and general, and could apply to any digital preservation repository — regardless of size, tech stack, or storage medium.

A black and white etching of three men wearing Victorian-era clothing looking into a microscope with three tubes.
Every aspect of a file must be examined closely, to rule out the possibility of corruption or error. Image from Wellcome Collection, used under CC BY.

What we verify

Are files packaged in BagIt bags?

We use the BagIt packaging format to organise files in our digital archive. When you want to store new files in the archive, you put those files in a BagIt “bag” and send them to our storage service; files in the same bag get stored and processed together. As well as the files, a BagIt bag includes some metadata and checksums that describe the files it contains.
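
As a rough sketch, creating a bag with the Library of Congress’s bagit-python library looks something like this (the directory name and metadata values are made up for illustration):

```python
# A minimal sketch using the bagit-python library
# (https://github.com/LibraryOfCongress/bagit-python); the directory
# name and metadata values here are hypothetical.
import bagit

# Turn a directory of files into a BagIt bag, in place. The files are
# moved into a data/ subdirectory, and checksum manifests and metadata
# files are written alongside them.
bag = bagit.make_bag(
    "b12345678",
    {"Source-Organization": "Wellcome Collection"},
    checksums=["sha256"],
)
```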

We use BagIt because it’s an open standard, and we believe it will continue to be readable for a long time.

The longevity of BagIt is only useful if we’re actually storing valid BagIt bags, so we check that every bag we store matches the BagIt specification. This includes checking that we can parse the BagIt metadata files, and that the metadata is an accurate description of the bag. For example, the BagIt Payload-Oxum field describes the number and size of the files in the bag, and we check that it’s correct.
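
With the same bagit-python library, a sketch of that validation might look like this. Calling `validate(fast=True)` checks only that the Payload-Oxum matches the files on disk, while a full `validate()` also re-computes every checksum:

```python
import bagit

bag = bagit.Bag("b12345678")  # the hypothetical bag from above

try:
    bag.validate(fast=True)  # structural check: Payload-Oxum only
    bag.validate()           # full check: re-hash every payload file
except bagit.BagValidationError as err:
    print(f"Bag failed validation: {err}")
```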

Do files have the correct checksums?

A checksum is a short value that can be used to check that no errors have been introduced when copying data. You use a checksum algorithm (e.g. MD5, SHA-1) to create a checksum for a piece of data (say, a file you want to store), and the same file will always have the same checksum.
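
For example, here’s a sketch of computing a file’s checksum with Python’s standard hashlib module (using SHA-256, the algorithm we describe below):

```python
import hashlib

def sha256_checksum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks, so large files don't have to fit in memory.
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```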

If you store a file and it has the same checksum in a year’s time as it does today, you can be reasonably sure it hasn’t changed. But if you store a file and later it has a different checksum, something has changed. This tells you the file has been modified or corrupted, and you need to take action to repair it.

BagIt metadata includes a checksum for every file in a bag. We verify the checksum of every file in the archive, so we know it’s the same file we were originally sent.
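
In a BagIt bag, those checksums live in manifest files, where each line pairs a digest with a file path. A simplified verification loop, reusing the `sha256_checksum` sketch from above, might look like this:

```python
from pathlib import Path

def verify_bag_payload(bag_dir):
    """Return the paths of any payload files that don't match the manifest."""
    bag_dir = Path(bag_dir)
    failures = []
    # Each line of manifest-sha256.txt is "<hex digest>  <relative path>".
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        expected, _, rel_path = line.partition("  ")
        if sha256_checksum(bag_dir / rel_path) != expected:
            failures.append(rel_path)
    return failures
```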

We currently use the SHA-256 checksum algorithm to verify our files, although both BagIt and our storage service are flexible enough to support multiple checksum algorithms for the same file. We can add checksums with other algorithms later, if we decide that would be useful.

Explicit checks are better than implicit assumptions

Once something is in the digital archive, it’s intentionally very hard to modify or remove — and once it’s in, we have to store it, keep it safe, and support it with all our tooling.

Because of this, we’ve designed the storage service to be quite conservative about what it lets in. In certain cases, we’ve made assumptions about the files and bags in the archive (for example, “every bag has an alphanumeric identifier”), and we encode those assumptions as explicit checks. If a bag turns up that breaks our assumptions, the failing check makes it very visible.

These checks aren’t iron-clad, immutable rules; they’re an opening to a discussion. If somebody tries to store a bag or file which breaks one of these assumptions, it means there’s a misunderstanding somewhere. Either they don’t understand how the storage service works, or they have a use case the developers haven’t considered. An explicit check forces this misunderstanding into the open, so we can discuss how best to handle it — rather than the bag slipping in silently, and causing unexpected issues later.
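
As a sketch, such a check can be very simple; the rule and the wording here are illustrative, not our exact implementation:

```python
import re

ALPHANUMERIC = re.compile(r"^[a-zA-Z0-9]+$")

def check_bag_identifier(identifier):
    # Encode the assumption "every bag has an alphanumeric identifier"
    # as an explicit check, with an error message that invites discussion.
    if not ALPHANUMERIC.match(identifier):
        raise ValueError(
            f"Unexpected bag identifier {identifier!r}: we expect "
            "identifiers to be alphanumeric. If you have a use case for "
            "other identifiers, please talk to the storage service team."
        )
```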


When we verify

We verify every file before copying it to permanent storage. We only copy files to permanent storage after they’ve passed all the checks. If a file fails a check, we reject it and explain to the user why we couldn’t accept it.

If we copied unverified files directly to permanent storage, we might store a broken file. We’d then need something to record which files were broken, or design some rollback mechanism to remove that file from the storage. Both of those increase the risk of bugs or mistakes which might affect the integrity of the archive. Keeping unverified files away from the permanent storage avoids that risk, and keeps the storage service simpler.

Then we re-verify every file after copying it to permanent storage. We know we received valid files, but something could go wrong in the copying process. There could be a bug in the copying code that means it copies the wrong file, or stores an incomplete copy, or stores it in the wrong place. We re-run all our checks every time we copy a file to a new location, so we can detect any such bugs immediately.
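
Putting that ordering together, here’s a simplified sketch of the flow, reusing `verify_bag_payload` from above, with a local directory standing in for permanent storage:

```python
import shutil
from pathlib import Path

def ingest(bag_dir, permanent_root):
    # 1. Verify before copying: nothing unverified reaches permanent storage.
    failures = verify_bag_payload(bag_dir)
    if failures:
        raise ValueError(f"Rejecting bag; checksum mismatches in: {failures}")

    # 2. Copy to permanent storage (a local directory in this sketch).
    destination = Path(permanent_root) / Path(bag_dir).name
    shutil.copytree(bag_dir, destination)

    # 3. Re-run the same checks on the copy, to catch copying bugs.
    copy_failures = verify_bag_payload(destination)
    if copy_failures:
        raise RuntimeError(f"Copy failed re-verification: {copy_failures}")
    return destination
```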

Finally, we ensure our files are continually verified at rest. It’s not enough to verify files once, and leave them. Because digital storage media degrades over time, digital files need to be continually verified and repaired if you want them to remain usable.

All of the files in our digital archive are stored in the cloud, specifically Amazon S3 and Azure Blob Storage, and this continual verification process is part of the service they provide. Both services store multiple copies of each file, and they use checksums to ensure file integrity. If they ever detect corruption in one copy, they repair it using one of the other copies.
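
Checksums also guard the boundary when we hand files to these services. For example, Amazon S3 lets you supply a SHA-256 checksum with an upload, and rejects the request if the data it receives doesn’t match. Here’s a sketch with boto3 (the bucket and key names are made up):

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")

with open("b12345678/data/report.pdf", "rb") as f:
    body = f.read()

s3.put_object(
    Bucket="example-storage-bucket",
    Key="bags/b12345678/data/report.pdf",
    Body=body,
    # S3 recomputes the SHA-256 server-side and fails the upload on mismatch.
    ChecksumSHA256=base64.b64encode(hashlib.sha256(body).digest()).decode(),
)
```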

Both services are certified for compliance with a variety of international, independent standards for information security (AWS, Azure). The level of data integrity and safety they provide goes far beyond anything we could build in-house. We trust their verification process, and don’t do any additional checking of data at rest.


How we verify

Every check I’ve described happens automatically. These checks form part of the automated storage service that stores new files in our digital archive. This means they always happen reliably and consistently, and don’t rely on manual intervention or somebody remembering to do the checks. It also gives staff more time to work on other tasks that can’t be automated, like writing catalogue descriptions or appraising a new collection.

In fact, humans aren’t even allowed to write directly to the underlying storage — they have to go through the storage service. This gives us a very high degree of confidence in the integrity of the archive, because we know everything has been through a rigorous set of checks and verifications.

When we write new checks, we make sure to write a good user-facing error message. If the storage service does reject some files, we want it to explain why they were rejected and how they can be fixed. Ideally, these messages should make sense to somebody who isn’t a developer, or who doesn’t understand the inner workings of the storage service.


A digital archive doesn’t just run on computers; it runs on trust. People should trust that it’s a safe and secure store for our digital collections.

Having a robust verification process helps build that trust. It shows that the archive will ensure the integrity of the files it contains, and can act as a suitable long-term store for our files.