Here’s a programming assumption I used to make, that until today I’d never really thought about: changing the case of a string won’t change its length.
Now, thanks to Hypothesis, I know better:
    >>> x = u'İ'
    >>> len(x)
    1
    >>> len(x.lower())
    2
I’m not going to pretend I understand enough about Unicode or Python’s string handling to say what’s going on here.
I discovered this while testing a moderately fiddly normalisation routine – this routine would normalise the string to lowercase, unexpectedly tripping a check that it was the right length. If you’d like to see this for yourself, here’s a minimal example:
    from hypothesis import given, strategies as st

    @given(st.text())
    def test_changing_case_preserves_length(xs):
        assert len(xs) == len(xs.lower())
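If you don't have Hypothesis to hand, a brute-force scan finds the same kind of counterexample. This is only a sketch, assuming Python 3; the name `changed` is mine, and the exact list it produces depends on the Unicode version your interpreter ships with.

```python
import sys

# Collect every code point whose lowercase form is not exactly
# one code point long. The contents of this list depend on the
# Unicode database bundled with your Python build.
changed = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if len(chr(cp).lower()) != 1
]

for ch in changed:
    lowered = " ".join("U+%04X" % ord(c) for c in ch.lower())
    print("U+%04X lowercases to %s" % (ord(ch), lowered))
```

The capital dotted İ (U+0130) should appear in the output.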
Follow-up, 2 December:
David MacIver asked whether this affects Python 2, 3, or both, which I forgot to mention. The behaviour is different: Python 2 lowercases İ to an ASCII i, whereas Python 3 adds a combining dot above (U+0307): i̇. This means that only Python 3 has the bug where the length changes when lowercasing (whereas Python 2 commits a different sin of throwing away information).
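To see exactly what Python 3 produces, you can ask `unicodedata` to name each code point in the lowercased string. A small sketch, assuming Python 3; the variable names are just for illustration.

```python
import unicodedata

upper_i = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = upper_i.lower()

# Name each code point in the result to see where the extra one comes from.
names = [unicodedata.name(ch) for ch in lowered]
print(len(upper_i), len(lowered))
print(names)
```

On Python 3 the second string is an ordinary `i` followed by a combining dot above, which is why the length goes from 1 to 2.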
Cory Benfield pointed out that the Unicode standard has explicit character mappings that add or remove characters when changing case, and highlighted a nice example in the other direction: when you uppercase the German eszett (ß), you replace it with a double S.
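The eszett case is easy to check for yourself. Again a sketch assuming Python 3, with variable names of my own choosing:

```python
eszett = "\u00df"  # LATIN SMALL LETTER SHARP S
shouted = eszett.upper()

print(repr(shouted), len(eszett), len(shouted))

# The round trip is lossy as well: lowercasing the result
# gives two ordinary letters, not the original character.
round_trip = shouted.lower()
print(repr(round_trip), round_trip == eszett)
```

So uppercasing can change the length too, and going back to lowercase doesn't undo it.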
Finally, Rob Wells wrote a follow-on post that explains this problem in more detail. He also points out the potential confusion around len(): should it count visible characters, or Unicode code points? The Swift String API does a rather good job here: if you haven’t used it, check out Apple’s introductory blog post.
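In Python 3, `len()` counts code points, so two strings that look identical on screen can report different lengths. Here's a rough illustration using the standard library's `unicodedata.normalize`; the example character is my own choice, not one from Rob's post.

```python
import unicodedata

# The same visible character, "é", in two different encodings.
composed = unicodedata.normalize("NFC", "e\u0301")  # single code point U+00E9
decomposed = unicodedata.normalize("NFD", composed)  # "e" + combining acute accent

print(len(composed), len(decomposed))
print(composed == decomposed)
```

Both strings render as one visible character, but `len()` reports 1 for the first and 2 for the second, and they don't compare equal until you normalise them to the same form.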