Text and Unicode
Storing text in computers gets tricky, especially when you want to represent more than the basic Latin alphabet A–Z. Unicode is the most widely used standard for encoding text, and it covers 172 different scripts.
These entries are about interesting things I’ve learnt while working with text and Unicode.
3 articles
Operations on strings don’t always commute
Is uppercasing then reversing a string the same as reversing and then uppercasing? Of course not.
Using fuzzy string matching to find duplicate tags
Another example of why strings are terrible
Pop quiz: if I lowercase a string, does it still have the same length as the original string?
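Both questions above can be checked in a few lines of Python. This is only a minimal sketch, not necessarily the examples used in the articles themselves: the reverse() helper and the two sample characters (the "fi" ligature U+FB01 and the dotted capital I, U+0130) are my own choices for illustration.

```python
def reverse(text: str) -> str:
    """Reverse a string by code point (not by grapheme cluster)."""
    return text[::-1]


# U+FB01 LATIN SMALL LIGATURE FI is one code point that uppercases to two.
ligature = "\ufb01"
upper_then_reverse = reverse(ligature.upper())  # "FI" reversed
reverse_then_upper = reverse(ligature).upper()  # the ligature reversed is itself
print(upper_then_reverse == reverse_then_upper)

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to "i" followed by
# U+0307 COMBINING DOT ABOVE, so the lowercased string is longer.
dotted_i = "\u0130"
print(len(dotted_i), len(dotted_i.lower()))
```

Note that str.upper() and str.lower() in Python apply the default Unicode case mappings and ignore locale, so other languages or locale-aware libraries may give different results for the same input.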
4 notes
When fixing mojibake, use ftfy.fix_and_explain() to understand how it’s fixing a piece of text
Editing a filename in Finder will convert it to NFD
Even if the filename looks the same, it may be invisibly converted to a different sequence of bytes.
How to create flag emojis for countries in Python
Use Unicode property escapes to detect emoji in JavaScript
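Two of the notes above lend themselves to a short Python illustration using only the standard library. This is a rough sketch under my own assumptions rather than the code from the notes: the flag_emoji() function name and the sample filename are hypothetical. The flag approach relies on the fact that a flag emoji is a pair of Regional Indicator Symbols (U+1F1E6 to U+1F1FF), one for each letter of a two-letter country code.

```python
import unicodedata


def flag_emoji(country_code: str) -> str:
    """Map a two-letter country code to its pair of regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    # U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; the rest follow in order.
    offset = 0x1F1E6 - ord("A")
    return "".join(chr(ord(c) + offset) for c in code)


print(flag_emoji("gb"))

# The same visible filename in two normalization forms: NFC stores the
# accented letter as one code point, NFD as a base letter plus a combining accent.
name = "caf\u00e9.txt"  # hypothetical filename
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)
print(nfc == nfd, len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))
```

Whether the pair of regional indicators renders as a flag depends on the font and platform. When comparing filenames that may have passed through different tools, normalizing both sides to the same form first avoids mismatches between strings that look identical.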