What does \d match in a regex?
Earlier tonight, I was playing with Hypothesis’s from_regex strategy. This strategy generates strings that match a given regex, and I thought it might be a good way to debug especially gnarly regexes – by seeing examples of matching strings, maybe I could understand a complex regex better.
I was trying to find strings that matched the regex \d+\(\d+\)
(a random example from Google), and it exposed a bug in my understanding.
I know that \d
matches any digits, and I’d always treated it as synonymous with [0-9]
– but it turns out that’s not always the case. There are all sorts of Unicode characters which are digits and which match \d
, but which aren’t Arabic numerals. Here are just a couple:
- ꘠ / Vai digit zero
- ೯ / Kannada digit nine
- ᧔ / New Tai Lue digit four
- ᠗ / Mongolian digit seven
Different languages handle this differently – for example, using \d
in Python 3 will match all these numeric characters that aren’t Arabic numerals, but in Python 2 it won’t. No doubt it varies among other programming languages as well.
How much this distinction matters depends on your use case, and in most cases I imagine the answer is “very little”. I present it more as an intellectual curiosity than something which needs serious consideration when writing regexes.
This isn’t the first time Hypothesis has exposed one of my faulty assumptions – language is incredibly complicated, and I still have a lot to learn.