We verify every file before copying it to permanent storage, and we only copy a file once it has passed all the checks. If a file fails verification, we reject it and explain to the user why we couldn't accept it.
If we copied unverified files directly to permanent storage, we might store a broken file. We'd then need some way to record which files were broken, or a rollback mechanism to remove them from storage. Both of those would increase the risk of bugs or mistakes that might affect the integrity of the archive. Keeping unverified files away from permanent storage avoids that risk, and keeps the storage service simpler.
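Concretely, the gatekeeping might look something like the following sketch. This isn't our service's actual code: the names are hypothetical, the only check shown is a checksum comparison, and a real ingest would run many more checks.

```python
import hashlib
import pathlib
import shutil


class VerificationError(Exception):
    """Raised when a file fails a check; the message is shown to the user."""


def sha256(path: pathlib.Path) -> str:
    """Compute a file's SHA-256 checksum, reading it in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def ingest(path: pathlib.Path, expected_checksum: str, archive_dir: pathlib.Path) -> None:
    """Copy a file into permanent storage only if it passes every check."""
    actual = sha256(path)
    if actual != expected_checksum:
        # Reject the file, and explain to the user why we couldn't accept it
        raise VerificationError(
            f"The checksum of {path.name} doesn't match the one you supplied "
            f"(expected {expected_checksum}, got {actual}), so the file may have "
            "been corrupted in transit. Please try uploading it again."
        )
    shutil.copy2(path, archive_dir / path.name)
```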
Then we re-verify every file after copying it to permanent storage. We know we received valid files, but something could go wrong in the copying process. There could be a bug in the copying code that means it copies the wrong file, or stores an incomplete copy, or stores it in the wrong place. We re-run all our checks every time we copy a file to a new location, so we can detect any such bugs immediately.
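Continuing the sketch above, re-verification can be as simple as re-reading the new copy from disk and re-running the same comparison against it:

```python
def copy_and_reverify(src: pathlib.Path, dst: pathlib.Path, expected_checksum: str) -> None:
    """Copy a file, then re-check the copy so copying bugs surface immediately."""
    shutil.copy2(src, dst)
    # Re-read the new copy from disk, not the original. If the copy is
    # incomplete, or the wrong file ended up at dst, this raises straight away.
    actual = sha256(dst)
    if actual != expected_checksum:
        raise VerificationError(
            f"The copy of {src.name} at {dst} failed verification "
            f"(expected {expected_checksum}, got {actual})."
        )
```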
Finally, we ensure our files are continually verified at rest. It's not enough to verify files once and then leave them: because digital storage media degrade over time, files need to be continually verified and repaired if you want them to remain usable.
All of the files in our digital archive are stored in the cloud, specifically Amazon S3 and Azure Blob Storage, and this continual verification process is part of the service they provide. Both services store multiple copies of each file, and they use checksums to ensure file integrity. If they ever detect corruption in one copy, they repair it using one of the other copies.
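As an illustration of how these checksums work on S3 (a general boto3 sketch, not our service's code, with hypothetical bucket and key names): you can ask S3 to verify and store a SHA-256 checksum at upload time, then read it back later and compare it against a locally computed digest. One caveat: for multipart uploads, S3 stores a composite checksum of the parts rather than a whole-file digest, so a direct comparison like this only works for single-part uploads.

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")


def upload_with_checksum(path: str, bucket: str, key: str) -> None:
    """Upload a file, asking S3 to verify and store a SHA-256 checksum."""
    with open(path, "rb") as f:
        s3.put_object(Bucket=bucket, Key=key, Body=f, ChecksumAlgorithm="SHA256")


def stored_checksum(bucket: str, key: str) -> str:
    """Read back the base64-encoded SHA-256 checksum S3 holds for an object."""
    resp = s3.head_object(Bucket=bucket, Key=key, ChecksumMode="ENABLED")
    return resp["ChecksumSHA256"]


def local_checksum(path: str) -> str:
    """Compute a base64-encoded SHA-256 digest, matching S3's encoding."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return base64.b64encode(h.digest()).decode("ascii")
```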
Both services are certified for compliance with a variety of international, independent standards for information security (AWS, Azure). The level of data integrity and safety they provide goes far beyond anything we could build in-house. We trust their verification process, and don’t do any additional checking of data at rest.
Every check I’ve described happens automatically. These checks form part of the automated storage service that stores new files in our digital archive. This means they always happen reliably and consistently, and don’t rely on manual intervention or somebody remembering to do the checks. It also gives staff more time to work on other tasks that can’t be automated, like writing catalogue descriptions or appraising a new collection.
In fact, humans aren’t even allowed to write directly to the underlying storage — they have to go through the storage service. This gives us a very high degree of confidence in the integrity of the archive, because we know everything has been through a rigorous set of checks and verifications.
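On S3, one way to enforce that rule is a bucket policy that denies writes from every principal except the storage service's own role. This is a sketch under assumptions: the bucket name and role ARN are hypothetical, and how we actually enforce the restriction may differ.

```python
import json

import boto3

# Hypothetical names for illustration: the archive bucket, and the IAM role
# that the automated storage service runs as.
BUCKET = "example-digital-archive"
SERVICE_ROLE_ARN = "arn:aws:iam::123456789012:role/storage-service"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyWritesExceptStorageService",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:PutObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            # Deny writes from everyone except the storage service role
            "Condition": {
                "StringNotLike": {"aws:PrincipalArn": SERVICE_ROLE_ARN}
            },
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```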
When we write new checks, we make sure to write a good user-facing error message. If the storage service does reject some files, we want it to explain why they were rejected and how they can be fixed. Ideally, these messages should make sense to somebody who isn’t a developer, or who doesn’t understand the inner workings of the storage service.
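For example, each check might return a result that carries its own plain-English explanation, so the user sees a clear message rather than a stack trace. This is a hypothetical check for illustration, not one of our real checks:

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    ok: bool
    user_message: str = ""  # plain-English explanation, shown to the user


def check_filename(filename: str) -> CheckResult:
    """Reject filenames we can't store safely, and say why in plain English."""
    if "/" in filename or "\\" in filename:
        return CheckResult(
            ok=False,
            user_message=(
                f"We couldn't accept '{filename}' because its name contains a "
                "slash, which we can't store safely. Please rename the file "
                "and upload it again."
            ),
        )
    return CheckResult(ok=True)
```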
A digital archive doesn't just run on computers; it runs on trust. People should be able to trust that it's a safe and secure store for our digital collections.
Having a robust verification process helps build that trust. It shows that the archive will ensure the integrity of the files it contains, and can act as a suitable long-term store for our files.