Skip to main content

Unexpected errors in the BagIt area

Last week, James Truitt asked a question on Mastodon:

Do any #digipres folks happen to have a handy repo of small invalid bags for testing purposes?

I'm trying to automate our ingest process, and want to make sure I'm accounting for as many broken expectations as possible.

31 Jan 2025 at 19:49

The “bags” he’s referring to are BagIt bags. BagIt is an open format developed by the Library of Congress for packaging digital files. Bags include manifests and checksums that describe their contents, and they’re often used by libraries and archives to organise files before transfering them to permanent storage.

Although I don’t use BagIt any more, I spent a lot of time working with it when I was a software developer at Wellcome Collection. We used BagIt as the packaging format for files saved to our cloud storage service, and we built a microservice very similar to what James is describing.

The “bag verifier” would look for broken bags, and reject them before they were copied to long-term storage. I wrote a lot of bag verifier test cases to confirm that it would spot invalid or broken bags, and that it would give a useful error message when it did.

All of the code for Wellcome’s storage service is shared on GitHub under an MIT license, including the bag verifier tests. They’re wrapped in a Scala test framework that might not be the easiest thing to read, so I’m going to describe the test cases in a more human-friendly way.

Before diving into specific examples, it’s worth remembering: context is king.

BagIt is described by RFC 8493, and you could create invalid bags by doing a line-by-line reading and deliberately ignoring every “MUST” or “SHOULD” but I wouldn’t recommend this aproach. You’d get a long list of test cases, but you’d be overwhelmed by examples, and you might miss specific requirements for your system.

The BagIt RFC is written for the most general case, but if you’re actually building a storage service, you’ll have more concrete requirements and context. It’s helpful to look at that context, and how it affects the data you want to store. Who’s creating the bags? How will they name files? Where are you going to store bags? How do bags fit into your wider systems? And so on.

Understanding your context will allow you to skip verification steps that you don’t need, and to add verification steps that are important to you. I doubt any two systems implement the exact same set of checks, because every system has different context.

Here are examples of potential validation issues drawn from the BagIt specification and my real-world experience. You won’t need to check for everything on this list, and this list isn’t exhaustive – but it should help you think about bag validation in your own context.


The Bag Declaration bagit.txt

This file declares that this is a BagIt bag, and the version of BagIt you’re using (RFC 8493 §2.1.1). It looks the same in almost every bag, for example:

BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8

This tightly prescribed format means it can only be invalid in a few ways:

The Payload Files and Payload Manifest

The payload files are the actual content you want to save and preserve. They get saved in the payload directory data/ (RFC 8493 §2.1.2), and there’s a payload manifest manifest-algorithm.txt that lists them, along with their checksums (RFC 8493 §2.1.3).

Here’s an example of a payload manifest with MD5 checksums:

37d0b74d5300cf839f706f70590194c3  data/waterfall.jpg

This tells us that the bag contains a single file data/waterfall.jpg, and it has the MD5 checksum 37d0…. These checksums can be used to verify that the files have transferred correctly, and haven’t been corrupted in the process.

There are lots of ways a payload manifest could be invalid:

A bag can contain multiple payload manifests – for example, it might contain both MD5 and SHA1 checksums. Every payload manifest must be valid for the overall bag to be valid.

Payload filenames

There are lots of gotchas around filenames and paths. It’s a complicated problem, and I definitely don’t understand all of it.

It’s worth understanding the filename rules of any filesystem where you will be storing bags. For example, Azure Blob Storage has a number of rules around how you can name files, and Amazon S3 has different rules. We stored files in both at Wellcome Collection, and so the storage service had to enforce the superset of these rules.

I’ve listed some edge cases of filenames you might want to consider, but it’s not a comlpete list. There are lots of ways that unexpected filenames could cause you issues, but whether you care depends on the source of your bags. If you control the bags and you know you’re not going to include any weird filenames, you can probably skip most of these.

We only checked for one of these conditions at Wellcome Collection, because we had a pre-ingest step that normalised filenames. It converted filenames to ASCII, and saved a mapping between original and normalised filename in the bag. However, the normalisation was only designed for one filesystem, and produced filenames with trailing dots that were still disallowed in Azure Blob.

The Tag Manifest tagmanifest-algorithm.txt

Similar to the payload manifest, the tag manifest lists the tag files and their checksums. A “tag file” is the BagIt term for any metadata file that isn’t part of the payload (RFC 8493 §2.2.1).

Unlike the payload manifest, the tag manifest is optional. A bag without a tag manifest can still be a valid bag.

If the tag manifest is present, then many of the ways that a payload manifest can invalidate a bag – malformed contents, unreferenced files, or incorrect checksums – can also apply to tag manifests.

There are some additional things to consider:

Although the tag manifest is optional in the BagIt spec, at Wellcome Collection we made it a required file. Every bag had to have at least one tag manifest file, or our storage service would refuse to ingest it.

The Bag Metadata bag-info.txt

This is an optional metadata file that describes the bag and its contents (RFC 8493 §2.2.2). It’s a list of metadata elements, as simple label-value pairs, one per line.

Here’s an example of a bag metadata file:

Source-Organization: Lumon Industries
Organization-Address: 100 Main Street, Kier, PE, 07043
Contact-Name: Harmony Cobel

Unlike the manifest files, this is primarily intended for human readers. You can put arbitrary metadata in here, so you can add fields specific to your organisation.

Although this file is more flexible, there are still ways it can be invalid:

Although the bag metadata file is optional in a general BagIt bag, you may want to add your own rules based on how you use it.

For example, at Wellcome Collection, we required all bags to have an External-Identifier value, that matched a specific schema. This allowed us to link bags to records in other databases, and our bag verifier would reject bags that didn’t include it.

The Fetch File fetch.txt

This is an optional element that allows you to reference files stored elsewhere (RFC 8493 §2.2.3).

It tells the person reading the bag that a file hasn’t been included in this copy of the bag; they have to go and fetch it from somewhere else. The file is still recorded in the payload manifest (with a checksum you can verify), but you don’t have a complete bag until you’ve downloaded all the files.

Here’s an example of a fetch.txt:

https://topekastar.com/~daria/article.txt  1841  data/article.txt

This tells us that data/article.txt isn’t included in this copy of the bag, but we we can download it from https://topekastar.com/~daria/article.txt. (The number 1841 is the size of the file in bytes. It’s optional.)

Using fetch.txt allows you to send a bag with “holes”, which saves disk space and network bandwidth, but at a cost – we’re now relying on the remote location to remain available. From a preservation standpoint, this is scary! If topekastar.com goes away, this bag will be broken. I know some people don’t use fetch.txt for precisely this reason.

If you do use fetch.txt, here are some things to consider:

We used fetch files at Wellcome Collection to implement versioning, and we added extra rules about what remote URLs were allowed. In particular, we didn’t allow fetching a file from just anywhere – you could fetch from our S3 buckets, but not the general Internet. The bag verifier would reject a fetch file entry that pointed elsewhere.


These examples illustrate just how many ways a BagIt bag can be invalid, from simple structural issues to complex edge cases.

Remember: the key is to understand your specific needs and requirements. By considering your context – who creates your bags, where they’ll be stored, and how they fit into your wider systems – you can build a validation process to catch the issues that matter to you, while avoiding unnecessary complexity.

I can give you my ideas, but only you can build your system.