This is my third time at Monki Gras, and my second time speaking – I first went in 2018, and I gave a talk about the curb cut effect in 2019. I bought a ticket as soon as they went on sale – I enjoyed myself so much at previous events, going again was a no-brainer. (My ticket was reimbursed because I was speaking, but I’d have happily paid to go anyway.)
Monki Gras is a rare event that manages to have both good talks and a good hallway track. The first day’s talks were full of interesting ideas and well presented, and I had some thoughtful and friendly conversations in between. I know I’ll be thinking about the event for weeks to come, and there’s still another day to go!
I’m pleased with the talk I wrote, and people seemed to enjoy it. The talk wasn’t recorded, but I’ve put my slides and notes below. (I wanted to get these up quickly, so there may be silly typos or mistakes. Please let me know if you see any!)
This is the key message: being a good user of AI is about both technical skills and managing your trust in the tool. You need to know the mechanics of prompt engineering, what text you type in the box, yes. But you also need to know how much you trust the tool, and whether you can rely on its results – if you don’t, the output is useless.
There are some personal reasons why Monki Gras feels a bit special.
I went to the last Monki Gras in 2019. It was cancelled in 2020 thanks to COVID, and this is the first year it’s been back – five years later! A lot of stuff has changed in that time.
In 2019, I was starting to explore what being genderfluid might mean for me, how that might affect my professional career, and I had several meaningful conversations with now-friends at Monki Gras. In 2024, I have a better understanding of what my gender looks like, I’m much happier, and I’m more comfortable presenting as my full self.
At both events, my gender and my appearance have been a complete non-issue. People accept that I am who I say I am; there were no awkward stares or questions; I was never misgendered. I got to relax, and focus on the event rather than worrying if somebody was about to be weird.
It’s nice.
This should be the norm at professional events, but it isn’t, and it’s a bit sad that I can’t take it for granted. But it’s nice when it happens.
There are lots of books about the history of swing dancing and jazz music. I read Swing Dance by Scott Cupit as I was writing the talk because it’s what I had to hand, but there are plenty of others. It’s a fun subject!
If you’re getting started, I’d particularly recommend looking for information about Frankie Manning and Norma Miller, two of the early pioneers of this style of dancing.
Most of the photos in the talk come from the Flickr Commons, a collection of historical photographs from over 100 international cultural heritage organisations.
You can learn more about the Commons, browse the photos, and see who’s involved using the Commons Explorer https://commons.flickr.org/. (Which I helped to build!)
Introductory slide.
It’s lovely to be back at Monki Gras.
My name is Alex Chan; my pronouns are they/she; I’m a software developer.
I work at the Flickr Foundation, where we’re trying to keep Flickr’s pictures visible for 100 years – including many of the photos in the talk. I do a bunch of fun stuff around digital preservation, cultural heritage, museums and libraries, all that jazz. If you want to learn more about my work and all the other stuff I do, you can read more at alexwlchan.net.
But I’m not talking about my work today. Instead, I want to talk about dancing. A few years ago I started learning to dance; last year I started learning to use AI, and I’ve spotted a lot of parallels between the two. I want to tell you what learning to dance taught me about learning to use AI.
This all started about five years ago. I was in a theatre at Bristol, watching a musical, and as musicals are wont to do, they had some big song and dance numbers.
I was sitting in the audience, and as I watched the actors dancing, I thought “that looks fun, I want to do that”. So in the interval I got out my phone, and I started reading about dancing and what they were doing on stage.
The particular style of dancing they were doing was called swing dancing.
Swing dancing is an umbrella term for a wide variety of energetic and rhythmic dance styles, usually danced to jazz music. Probably the most well-known is lindy hop, but it also includes balboa, jitterbug, charleston, collegiate shag…
Swing dance came out of the African American jazz scene in the early 20th century, and a lot of the moves practiced today have their roots in traditional African folk dances.
Swing dancing is very popular today, and there’s a thriving swing dance scene in London, so I was able to find a beginner class just five minutes from where I was working at the time. It was closer than the nearest Starbucks!
This was a proper beginner class, no experience needed – which was good, because I didn’t have any! You could walk in having never danced before and they’d teach you from scratch.
This is great for students, but it poses a tricky challenge for the teachers. Many people are quite nervous in a dance class, unsure if they can do it, especially if it’s their first time. The teachers have to get you on board quickly, and I noticed a pattern – after a gentle warmup, they’d always start with a really simple step.
At this point I demonstrated with a simple step on stage. I don’t have any video of this, so you’ll have to make do with crude MS Paint drawings.
This is a kick, kick, step.
Stand on my left leg, and raise my right leg in the air. Bent at the knee, foot pointing back.
Swing the right leg forward into a kick.
Swing the right leg back up, completing the kick.
Swing the right leg forward again for a second kick.
Step down onto the right leg, and lift my left leg into the air.
Kick, kick, step.
This was a smart way to start the class. If you can balance on one foot, you can do this move. You get that sense of achievement, and there’s something satisfying about a whole room of people stepping in unison.
From there, the class would gradually build to more complicated things – moves, turns, routines. But we’d always build something new in small steps, not doing too much at once.
Starting small is a great way to learn a new skill, and this approach can apply in a lot of areas.
If you’re trying to learn a skill or embed a new habit, make the bar for success extremely low. This is a one-two punch: you get an early sense of achievement that keeps you coming back, and you avoid setting your expectations so high that you’re bound to fall short.
The latter is a common mistake when we learn new skills as adults. We’re used to being good at things; we have high standards for ourselves. Then we try to learn something new, and we set our goal far too high, then we fall short, we bounce off.
How many people have done this? Picked up a new skill, done it once, you weren’t instantly good at it, so you never did it again. [A lot of guilty faces in the audience for this one.]
We don’t like to be bad at things – but we can only learn if we push through the period of not being very good. When we’re a beginner and we don’t know very much, we need to set small goals and step forward gradually.
This is where we come back to AI, to prompt engineering – which have been new skills for many of us in the last year or two.
When I first tried these new generative AI tools, I was lured in by the big and flashy stuff. That’s what grabbed headlines; that’s what grabbed my attention. I was reading Twitter, I thought “that looks fun, I want to do that”.
I set a very high bar for what I wanted to achieve, and I wanted to replicate those cool results – but I didn’t know what I was doing, so I couldn’t do any of that cool stuff. So I bounced off for months. I ignored these tools, because of the bitter taste of those initial failures.
I only started doing useful stuff with these tools when I lowered my expectations. I wanted a small first step. What’s a small step for AI? It’s a simple prompt; a single question; a single sentence.
I went back and found the first ChatGPT sessions where I got something useful out of it. I was building a URL checker, I have a bunch of websites, I want to check they’re up. I wanted to write a script to help me. So I asked ChatGPT “how do I fetch a single URL”.
This is a simple task, something I could easily Google, and many of you probably already know how to do this. But that’s not the point – the point is that it gave me that initial feeling of success. It was the first time I had that sense of “ooh, this could be useful”.
I was able to build on that, and by asking more small questions I eventually got a non-trivial URL checker. A more experienced AI user could probably have written the entire program in a single prompt, and I couldn’t, but that’s okay – I still got something useful by asking a series of small, simple questions.
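To give a sense of scale, the heart of what I ended up with was nothing more exotic than this sort of loop – a rough sketch with placeholder URLs, not my actual script:

import requests

URLS = ["https://example.com", "https://example.net"]

for url in URLS:
    try:
        response = requests.get(url, timeout=10)
        status = "OK" if response.ok else f"HTTP {response.status_code}"
    except requests.RequestException as exc:
        status = f"failed ({exc})"

    print(f"{url}: {status}")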
So start small! This is a really useful idea, and applies to so many things, not just dance or AI.
But I forget it so often, because it’s easy to be lured in by the hype, the impressive, the shiny. I have to keep reminding myself: start small, don’t overreach yourself.
But what happens when you want to move past the basics?
In dancing, the next step is finding a partner. A lot of swing dances are partnered dances, there’s a leader and a follower. The leader initiates the moves, the follower follows, together they make awesome dance magic.
If you go back to the same classes and events, you dance with the same people. You get to know them, what they can do, what they like and what they don’t. You develop a sense of rapport, and this is so important when dancing. You have to find the right level of trust, what you both feel comfortable doing. Maybe you’re happy to dance a move with one person, but not with another.
Let’s look at a few examples.
In the photo above, the couple are dancing some sort of hand-to-hand jitterbug routine. This is the sort of thing that you can teach a beginner in a couple of hours; nothing weird is going on here; it’s the sort of move most dancers would happily dance with a complete stranger.
Next up, we have tandem Charleston.
In this dance, the follower stands in front of the leader, they both face forward, and they’re connected by their hands at their sides. It’s hard for them to talk to each other, and the follower can’t see their leader.
This is a fun dance that’s easy to teach to beginners, but you can see it’s a bit more of an awkward position. A lot of followers (especially women) aren’t super comfortable dancing this with strangers, and would walk away if you tried it on a social dance floor.
And finally we have aerials. This is when your feet leave the ground, you’re lifted up in the air. In this case, the leader has lifted their follow and flipped her entirely upside down – and hopefully a few seconds later, she returned to the ground safely!
This is obviously a much riskier move, and requires a lot of trust between the two partners. You would never do this with a stranger, and I know a lot of experienced dancers who don’t go near moves like this.
The point is that trust isn’t binary. You have to find the right level of trust with your partner.
What do you feel comfortable doing?
And that question applies to generative AI as much as it does to dancing. If you’re using these tools for fun stuff, to make images or videos, that’s one thing. If you’re relying on their output for knowledge work, that’s quite another. If you want to use these tools, you have to know how much you trust them.
The right level of trust isn’t absolute faith or complete scepticism; it’s somewhere in the middle. Maybe you trust it for one thing, but not another.
I’ve seen a lot of discussions of prompt engineering that focus on the mechanical skills, without thinking about trust. “Type in this text to get these results.” That’s important, but it’s no good having those skills if you can’t trust and use the results you get.
How do we learn to trust AI? I think this will be a key question as we use more and more of these tools. How do we build mental models of what we can trust? How do we help everyone find the right level of trust for them? How do we work out when we do and don’t trust them? The same techniques we’ve already discussed can help – start small, and work your way up.
When you dance with a stranger, you don’t jump straight to the most complex move in your repertoire – you start with simpler moves, and you get a sense of each other’s comfort. Are you both dancing energetically? Confidently? Does it feel safe to do something more complicated? Or is your partner nervous? Wary? Perhaps at the edge of their comfort zone?
We can do something similar with AI. One thing I’ve found useful when testing new tools is to ask it something simple, something I already know how to do. If I see it doing a good job, I can start to trust it for similar tasks. If it completely messes up, I know I can’t trust it for this.
So what do I trust AI for? I want to give you a few practical examples.
I think it’s important to work in areas where you already have a decent understanding. We know these tools can hallucinate. We know they can make stuff up. We know they can go off the rails.
The safeguard is us, the human, and we need to be able to spot when they’ve gone off the rails and need guiding back to the straight and narrow.
We all have different areas of competence and expertise – the areas where we trust AI are going to be different for each of us. So I might trust an AI to tell me about digital preservation or dance styles, but I wouldn’t ask it questions about farming or firefighting or frogs.
Let’s look at a few examples.
One thing I use AI tools for is to generate a whole bunch of ideas, a whole bunch of questions, a sort of brainstorming tool. I give it a discussion topic, and rather than asking it for answers, I ask it to tell me what sort of things I should be considering.
I used ChatGPT to help me write this talk. I described the broad premise of the talk, and I asked it to tell me what aspects I should consider. What should I discuss? What could I say? What might my audience want to hear? I didn’t use any of its output directly, but it gave me some stuff to think about. Some of it was bad, some of it I already had, some of its suggestions were useful additions to the talk.
This is an inversion of the prompter/promptee relationship: I’m not giving the computer a topic to think about; it’s giving me topics to think about.
What if I want more than ideas, what if I want some facts?
AI tools are unreliable sources of facts, and you always have to be careful. They can make up nonsense, and repeat it as fact with complete confidence and authority. It’s mansplaining as a service.
But it’s not like they don’t know any facts, and sometimes they do get them right. The right level of trust isn’t absolute faith or complete scepticism, it’s somewhere in between. How do we know when to trust them?
I’ve settled on the idea that AI is like a friend who’s read a Wikipedia article – maybe after having a few beers. There’s definitely something behind what they’re remembering, but it may not always be right. I wouldn’t rely on it for anything important, but it often contains a clue to something which is true – a name, an idea, some terminology that leads me towards more trustworthy reference material.
And finally, I use AI for writing code.
But again, I stick to areas where I already have some expertise. I use these tools to write code in languages I use, frameworks I’m familiar with, problems that I can understand.
When I’m working with a team of human developers, I sometimes have to pass on doing a code review because I don’t know enough to do a proper review. That’s my threshold for using AI tools – if I wouldn’t be comfortable reviewing the code, I don’t trust the two of us to write it together. There’s too much of a risk that I’ll miss a subtle mistake or major bug.
But that still leaves a lot of use cases!
I use it for a lot of boilerplate code. It’s a good way to get certain repetitive utility functions. And it’s particularly useful when there are tools that have a complicated interface, and I have to get the list of fiddly options correct (ffmpeg springs to mind). It’s quite tricky to get the right set of incantations, but once I’ve got them it’s easy to see if they’re behaving correctly.
So those are a few of the things I trust generative AI for. These are more evolutionary than revolutionary – these AI tools have become another thing in my toolbox, but they haven’t fundamentally changed the way I work. (Yet.)
Of course, you’ll trust them for different things. Don’t take this as a prescriptive list; take it as some ideas for how you might use AI.
So like a slow jazz number at the end of the evening, let’s wrap things up.
Being a great dancer: yes, it requires the technical skills. You need to know the footwork, the moves, the rhythm. But it’s also about trust, knowing your partner, working out what they’re comfortable with.
The same thing is true of using AI. You need to know how to write prompts, how to get information, how to get the results you want. But you also need to know if you trust those results, when you can rely on the output.
We need both of those skills to be great users of AI.
I had two accounts as a way to keep two separate watch histories. I was watching videos about gender and trans stuff before I came out, and I didn’t want them appearing in my main account – say, when I was listening to music at work. That’s less of a concern now than it was five or six years ago, and the lines between them have become blurry. I don’t need two accounts any more.
Because I only use YouTube for watching videos, and not posting, there were only three lists I really wanted to keep: my subscriptions, my Watch Later queue, and my Likes. My subs and watch later were both small enough to copy by hand; the likes were the hard bit – I had about 1500 or so.
There’s no built-in way to move Likes between YouTube accounts, so it was time to break out the YouTube API.
The first step was getting some API credentials. This uses the Google Cloud console, which I’m not super familiar with, but YouTube has a lot of quickstart guides and code samples which made the process much easier.
I used the Python quickstart guide, and went through the following steps: creating a new project in the Google Cloud console, enabling the YouTube Data API, creating OAuth client credentials and downloading them as a JSON file, and installing the Python client libraries.
At some point during this process, I had to create an OAuth consent screen. If I was publishing this app for the world to use, you’d see this as signing into the app, and it would have to be reviewed by Google. Because I was only writing scripts for me, I was able to mostly skip this step – I left the app with a “testing” status, and just listed my two YouTube accounts as “test users”:
After this, I tried to run the sample Python script from Google’s documentation.
It didn’t work – it was written for an older version of the Python libraries.
In particular, it used flow.run_console(), which uses an authentication method which has been deprecated for over a year. A Stack Overflow answer suggested I use flow.run_local_server(), and that was more successful.
Here’s the first script I got working, which is a modified version of the sample code:
import googleapiclient.discovery  # pip install google-api-python-client==1.7.2
import google_auth_oauthlib.flow  # pip install google-auth-oauthlib==0.4.1


def create_youtube_client(client_secrets_file):
    """
    Given the path to a JSON file with OAuth credentials from the
    Google Cloud console, create an authenticated client.
    """
    api_service_name = "youtube"
    api_version = "v3"
    scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

    flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
        client_secrets_file, scopes
    )
    credentials = flow.run_local_server()

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, credentials=credentials
    )

    return youtube


if __name__ == "__main__":
    youtube = create_youtube_client(
        client_secrets_file="client_secret_12345.apps.googleusercontent.com.json"
    )

    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        mine=True
    )
    response = request.execute()

    from pprint import pprint; pprint(response)
When I run this, the script kicks me out into a web browser, where I have to go through the usual Google login screen, and confirm I want to use this app. After I clicked through a few confirmation screens, my browser eventually got to a page that said:
The authentication flow has completed. You may close this window.
and back in my terminal window, the script was running and printing a list of my playlists.
Already this was further than I’d got in the past – I had an authenticated API client, and it was retrieving real data from my YouTube account. Good progress!
The authentication code above works, but it has two major issues:
It’s reading my OAuth client config from a JSON file on disk. Credentials should never be stored in plain text, so I want to put that somewhere more secure.
It doesn’t remember the credentials from flow.run_local_server() – every time I run the script, I have to go through the in-browser authentication flow. I was running the script many times as I gradually built up the code, and this quickly got annoying.
Both of these issues can be solved using the keyring module, which provides a platform-agnostic interface to the system password store (in my case, the login keychain on macOS).
I changed the function to fetch the OAuth client config from the keychain, and to store retrieved credentials in the keychain. When I run it repeatedly, it retrieves the stored credentials rather than sending me back through the in-browser flow.
After running these scripts for a while, I discovered that Google’s OAuth credentials expire after about a week. I wrote some rudimentary code to handle credential expiry – it deletes the stored credentials, and sends me back through the in-browser flow. There are almost certainly better ways to do this, but my simplistic approach worked well enough for my one-off script.
Here’s my updated function:
import datetime
import json

import google.oauth2.credentials
import googleapiclient.discovery  # pip install google-api-python-client==1.7.2
import google_auth_oauthlib.flow  # pip install google-auth-oauthlib==0.4.1
import keyring


def create_youtube_client(label: str):
    """
    Get an authenticated OAuth client for YouTube.

    It gets the OAuth config from the system keychain, and caches
    per-user credentials in the keychain under ("youtube", label).
    """
    api_service_name = "youtube"
    api_version = "v3"
    scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

    # Try to retrieve a stored OAuth access token from the keychain.
    #
    # This saves me going through the in-browser authentication flow
    # if I've already run the script.
    stored_credentials = keyring.get_password("youtube", label)

    if stored_credentials is not None:
        json_credentials = json.loads(stored_credentials)

        if "expiry" in json_credentials:
            expiry = datetime.datetime.fromisoformat(json_credentials["expiry"])
            expiry = expiry.replace(tzinfo=None)
            json_credentials["expiry"] = expiry

        credentials = google.oauth2.credentials.Credentials(**json_credentials)

    # If there are no stored credentials, fetch new ones.
    else:
        # Retrieve the OAuth client credentials from the keychain.
        #
        # This contains the contents of the JSON file that I downloaded
        # from the Google Cloud console, but now those credentials aren't
        # just saved as a plaintext file on disk.
        stored_client_secrets = keyring.get_password("youtube", "client_secrets")

        if stored_client_secrets is None:
            raise ValueError("Could not find OAuth client secrets in keychain!")

        flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_config(
            client_config=json.loads(stored_client_secrets), scopes=scopes
        )
        credentials = flow.run_local_server()

        # Save these credentials in the system keychain, so they can be
        # retrieved later.
        keyring.set_password("youtube", label, credentials.to_json())

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, credentials=credentials
    )

    # The OAuth credentials don't last forever -- they seem to expire after
    # a week. This is a slightly ropey attempt to work around that.
    #
    # If we call the API and the saved token is expired, just delete
    # it and get new creds -- sending me back through the in-browser flow.
    #
    # Notes:
    #
    #  - There are ways to refresh OAuth tokens that don't involve
    #    sending me back through the in-browser flow, but I didn't
    #    look at them as part of this project.
    #  - Catching all exceptions is a bit broad. This code should really
    #    retry only if it gets a "credentials expired" exception, and
    #    throw any other exceptions immediately.
    #
    try:
        request = youtube.channels().list(part="snippet", mine=True)
        request.execute()
    except Exception:
        keyring.delete_password("youtube", label)
        return create_youtube_client(label)
    else:
        return youtube
This function is more complicated than Google’s sample code, and there are more ways that it could be improved. Authentication is hard!
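Getting the OAuth client config into the keychain in the first place is a one-off call to keyring.set_password – something like this sketch, which reuses the placeholder filename from the first script:

import keyring

# One-off setup: copy the downloaded OAuth client secrets JSON into the
# system keychain, where create_youtube_client() can find it later.
with open("client_secret_12345.apps.googleusercontent.com.json") as f:
    keyring.set_password("youtube", "client_secrets", f.read())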
With an authenticated client, it was relatively straightforward to write code that interacts with YouTube’s APIs. I’ve lost the links, but I found snippets of sample code in Google’s documentation that I was able to adapt.
I started by wrapping the create_youtube_client in a class, and writing a function to list all the videos I’d liked:
class YouTubeClient:
    def __init__(self, label: str):
        self.youtube = self.create_youtube_client(label)

    def create_youtube_client(self, label: str):
        …

    def get_liked_videos(self):
        """
        Generate a list of videos that this YouTube account has liked.
        """
        kwargs = {"part": "snippet", "playlistId": "LL", "maxResults": "50"}

        while True:
            request = self.youtube.playlistItems().list(**kwargs)
            response = request.execute()

            yield from response["items"]

            try:
                kwargs["pageToken"] = response["nextPageToken"]
            except KeyError:
                break
[Edit, 15 February 2024: the original version of this code called the videos() endpoint and filtered for my likes, but that was only able to see the first 1000 likes. That was fine for this project, where I was gradually deleting the list, but not in general. I’ve changed it to use the playlistItems() API, which seems to return the full set.]
This generates videos in reverse order of liking them – the most recently liked video comes first. The items are large dicts which include various metadata fields about each video, of which the most interesting one to me is the ID:
{'id': 'J-u2aW7T2bw', …}
{'id': 'XPaKAh2zxgk', …}
{'id': '-q7ZVXOU3kM', …}
Then I wrote a couple of methods which like/unlike a video.
Because these are modifying data in YouTube, I had to change the scopes to https://www.googleapis.com/auth/youtube, replacing the youtube.readonly scope I’d been using previously.
class YouTubeClient:
    …

    def like_video(self, *, video_id):
        """
        Mark a video as "liked" on YouTube.
        """
        request = self.youtube.videos().rate(id=video_id, rating="like")
        response = request.execute()

    def unlike_video(self, *, video_id):
        """
        Remove the "liked" rating from a video on YouTube.
        """
        request = self.youtube.videos().rate(id=video_id, rating="none")
        response = request.execute()
Putting these functions together, I was then able to write a short script which moved my likes from one account to the other:
old_youtube = YouTubeClient(label="old_account")
new_youtube = YouTubeClient(label="new_account")

for video in old_youtube.get_liked_videos():
    video_id = video["id"]
    print(f"https://www.youtube.com/watch?v={video_id}")

    new_youtube.like_video(video_id=video_id)
    old_youtube.unlike_video(video_id=video_id)
Removing the likes from the old account wasn’t strictly necessary – I was planning to close the account when I was done – but it was an easy way to track the progress, and turned out to be helpful towards the end of the process (more on that below).
Incidentally, around the time I wrote this code, David published a post about writing good programming abstractions, and I think this is a nice example of one. Wrapping these API calls in a couple of named functions doesn’t do anything to help de-duplication, but it does make the intent of the final script much clearer.
By and large this code worked extremely well. Almost all of the videos moved across seamlessly, and I could watch it in two side-by-side browser windows – likes appeared in one account as they disappeared from the other. It was substantially quicker and easier than if I’d tried to do it by hand.
I did run into a couple of non-obvious issues:
The YouTube API has a quota, and I burnt through it pretty quickly. You get 10,000 units per day, and rating a video (aka like/unlike) costs 50 units. I had to make two calls to move each video (one like, one unlike), so I could only move about 100 videos a day.
The quota resets at midnight Pacific Time, or about 8am in London. I got into the habit of running the script once a day, every day, until I’d moved my entire list of Liked videos. It took a while, but still less than doing it by hand!
You can apply for a quota increase, but I didn’t bother – I knew I’d only run into the quota a handful of times, and it was easier to spread my runs over multiple days than fill in an application for more quota. The docs say it can take a week or so to approve quota increases, by which time I’d probably be done.
Sometimes I’d get a 403 error with the message “The owner of the video that you are trying to rate has disabled ratings for that video”.
I’m not sure what this means – if I opened the video in my web browser, I could still use the like/unlike buttons. This only affected a handful of videos in my entire list, so I just used my web browser to move them across.
The API couldn’t see the last dozen or so videos.
On the last day of running the script, the get_liked_videos() function returned an empty list, but I could still see some liked videos in the old account in my web browser.
I’m not sure why they were invisible to the API.
Again, because it was only a handful of videos, I moved them across by hand.
[Edit, 15 February 2024: I think this was caused by my use of the videos() API instead of playlistItems(); see above.]
These were relatively minor issues, and easy to work around. And once I’d finished running this script, I was able to close the old account and throw away this code – but maybe I’ll come back to these notes if I have another interesting idea for using the YouTube API.
Maximum size of a PDF, version 7: 381 km × 381 km.
https://commons.m.wikimedia.org/wiki/File:Seit…
Some version of this has been floating around the Internet since 2007, probably earlier. This tweet is pretty emblematic of posts about this claim: it’s stated as pure fact, with no supporting evidence or explanation. We’re meant to just accept that a single PDF can only cover about half the area of Germany, and we’re not given any reason why 381 kilometres is the magic limit.
I started wondering: has anybody made a PDF this big? How hard would it be? Can you make a PDF that’s even bigger?
A few years ago I did some silly noodling into PostScript, the precursor to PDF, and it was a lot of fun. I’ve never actually dived into the internals of PDF, and this seems like a good opportunity.
Let’s dig in.
These posts are often accompanied by a “well, actually” where people in the replies explain this is a limitation of a particular PDF reader app, not a limitation of PDF itself. They usually link to something like the Wikipedia article for PDF, which explains:
Page dimensions are not limited by the format itself. However, Adobe Acrobat imposes a limit of 15 million by 15 million inches, or 225 trillion in² (145,161 km²).[2]
If you follow the reference link, you find the specification for PDF 1.7, where an appendix item explains in more detail (emphasis mine):
In PDF versions earlier than PDF 1.6, the size of the default user space unit is fixed at 1/72 inch. In Acrobat viewers earlier than version 4.0, the minimum allowed page size is 72 by 72 units in default user space (1 by 1 inch); the maximum is 3240 by 3240 units (45 by 45 inches). In Acrobat versions 5.0 and later, the minimum allowed page size is 3 by 3 units (approximately 0.04 by 0.04 inch); the maximum is 14,400 by 14,400 units (200 by 200 inches).
Beginning with PDF 1.6, the size of the default user space unit may be set with the UserUnit entry of the page dictionary. Acrobat 7.0 supports a maximum UserUnit value of 75,000, which gives a maximum page dimension of 15,000,000 inches (14,400 * 75,000 * 1 ⁄ 72). The minimum UserUnit value is 1.0 (the default).
15 million inches is exactly 381 kilometres, matching the number in the original tweet. And although this limit first appeared in PDF 1.6, it’s “version 7” of Adobe Acrobat. This is probably where the original claim comes from.
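(For the conversion: 15,000,000 inches × 2.54 cm per inch = 38,100,000 cm, which is exactly 381 km.)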
What if we make a PDF that exceeds these “maximum” values?
I’ve never dived into the internals of a PDF document – I’ve occasionally glimpsed some bits in a hex editor, but I’ve never really understood how they work. If I’m going to be futzing around for fun, this is a good opportunity to learn how to edit the PDF directly, rather than going through a library.
I found a good article which explains the internal structure of a PDF, and combined with asking ChatGPT a few questions, I was able to get enough to write some simple files by hand.
I know that PDFs support a huge number of features, so this is probably a gross oversimplification, but this is the mental picture I created:
The start and end of a PDF file are always the same: a version number (%PDF-1.6) and an end-of-file marker (%%EOF).
After the version number comes a long list of objects. There are lots of types of objects, for all the various things you can find in a PDF, including the pages, the text, and the graphics.
After that list comes the xref or cross-reference table, which is a lookup table for the objects. It points to all the objects in the file: it tells you that object 1 is 10 bytes after the start, object 2 is after 20 bytes, object 3 is after 30 bytes, and so on. By looking at this table, a PDF reading app knows how many objects there are in the file, and where to find them.
The trailer contains some metadata about the overall document, like the number of pages and whether it’s encrypted.
Finally, the startxref value is a pointer to the start of the xref table. This is where a PDF reading app starts: it works from the end of the file until it finds the startxref value, then it can go and read the xref table and learn about all the objects.
With this knowledge, I was able to write my first PDF by hand.
If you save this code into a file named myexample.pdf, it should open and show a page with a red square in a PDF reading app:
%PDF-1.6
% The first object. The start of every object is marked by:
%
% <object number> <generation number> obj
%
% (The generation number is used for versioning, and is usually 0.)
%
% This is object 1, so it starts as `1 0 obj`. The second object will
% start with `2 0 obj`, then `3 0 obj`, and so on. The end of each object
% is marked by `endobj`.
%
% This is a "stream" object that draws a shape. First I specify the
% length of the stream (54 bytes). Then I select a colour as an
% RGB value (`1 0 0 RG` = red), then I set a line width (`5 w`) and
% finally I give it a series of coordinates for drawing the square:
%
%     (100, 100) ----> (200, 100)
%                           |
%   [s = start]             |
%        ^                  |
%        |                  |
%        |                  v
%     (100, 200) <---- (200, 200)
%
1 0 obj
<<
/Length 54
>>
stream
1 0 0 RG
5 w
100 100 m
200 100 l
200 200 l
100 200 l
s
endstream
endobj
% The second object.
%
% This is a "Page" object that defines a single page. It contains a
% single object: object 1, the red square. This is the line `1 0 R`.
%
% The "R" means "Reference", and `1 0 R` is saying "look at object number 1
% with generation number 0" -- and object 1 is the red square.
%
% It also points to a "Pages" object that contains the information about
% all the pages in the PDF -- this is the reference `3 0 R`.
2 0 obj
<<
/Type /Page
/Parent 3 0 R
/MediaBox [0 0 300 300]
/Contents 1 0 R
>>
endobj
% The third object.
%
% This is a "Pages" object that contains information about the different
% pages. The `2 0 R` is reference to the "Page" object, defined above.
3 0 obj
<<
/Type /Pages
/Kids [2 0 R ]
/Count 1
>>
endobj
% The fourth object.
%
% This is a "Catalog" object that provides the main structure of the PDF.
% It points to a "Pages" object that contains information about the
% different pages -- this is the reference `3 0 R`.
4 0 obj
<<
/Type /Catalog
/Pages 3 0 R
>>
endobj
% The xref table. This is a lookup table for all the objects.
%
% I'm not entirely sure what the first entry is for, but it seems to be
% important. The remaining entries correspond to the objects I created.
xref
0 4
0000000000 65535 f
0000000851 00000 n
0000001396 00000 n
0000001655 00000 n
0000001934 00000 n
% The trailer. This contains some metadata about the PDF. Here there
% are two entries, which tell us that:
%
% - There are 4 entries in the `xref` table.
% - The root of the document is object 4 (the "Catalog" object)
%
trailer
<<
/Size 4
/Root 4 0 R
>>
% The startxref marker tells us that we can find the xref table 2196 bytes
% after the start of the file.
startxref
2196
% The end-of-file marker.
%%EOF
I played with this file for a while, just doing simple things like adding extra shapes, changing how the shapes appeared, and putting different shapes on different pages. I tried for a while to get text working, but that was a bit beyond me.
It quickly became apparent why nobody writes PDFs by hand – it got very fiddly to redo all the lookup tables! But I’m glad I did it; manipulating all the PDF objects and their references really helped me feel like I understand the basic model of PDFs. I opened some “real” PDFs created by other apps, and they have many more objects and types of object – but now I could at least follow some of what’s going on.
With this newfound ability to edit PDFs by hand, how can I create monstrously big ones?
Within a PDF, the size of each page is set on the individual “Page” objects – this allows different pages to be different sizes. We’ve already seen this once:
<<
/Type /Page
/Parent 3 0 R
/MediaBox [0 0 300 300]
/Contents 1 0 R
>>
Here, the MediaBox is setting the width and height of the page – in this case, a square of 300 × 300 units. The default unit size is 1/72 inch, so the page is 300 ÷ 72 = 4.17 inches on each side. And indeed, if I open this PDF in Adobe Acrobat, that’s what it reports:
By changing the MediaBox value, we can make the page bigger. For example, if we change the value to 600 600, Acrobat says it’s now 8.33 x 8.33 in. Nice!
We can increase it all the way to 14400 14400, the max allowed by Acrobat, and then it says the page is now 200.00 x 200.00 in. (You get a warning if you try to push past that limit.)
But 200 inches is far short of 381 kilometres – and that’s because we’re using the default unit of 1/72 inch.
We can increase the unit size by adding a /UserUnit value. For example, setting the value to 2 will double the page in both dimensions:
<<
/Type /Page
/Parent 3 0 R
/MediaBox [0 0 14400 14400]
/UserUnit 2
/Contents 1 0 R
>>
And now Acrobat reports the size of the page as 400.00 x 400.00 in.
If we crank it all the way up to the maximum of UserUnit 75000, Acrobat now reports the size of our page as 15,000,000.00 x 15,000,000.00 in – 381 km along both sides, matching the original claim.
If you’re curious, you can download the PDF.
If you try to create a page with a larger size, either by increasing the MediaBox or UserUnit values, Acrobat just ignores it. It keeps saying that the size of a page is 15 million inches, even if the page metadata says it’s higher.
(And if you increase the UserUnit past 75000, this happens silently – there’s no warning or error to suggest the size of the page is being capped.)
[Edit, 1 February 2024: some extra zeroes slipped into the original version of this post – it’s a million inches, not a billion. Thanks to mrb on Hacker News for spotting the mistake!]
This probably isn’t an issue – I don’t think the UserUnit value is widely used in practice. I found one Stack Overflow answer saying as much, and I couldn’t find any examples of it online. The built-in macOS Preview.app doesn’t even support it – it completely ignores the value, and treats all PDFs as if the unit size is 1/72 inch.
But unlike Acrobat, the Preview app doesn’t have an upper limit on what we can put in MediaBox. It’s perfectly happy for me to write a width which is a 1 followed by twelve 0s:
If you’re curious, that width is approximately the distance between the Earth and the Moon. I’d have to get my ruler to check, but I’m pretty sure that’s larger than Germany.
I could keep going. And I did. Eventually I ended up with a PDF that Preview claimed is larger than the entire universe – approximately 37 trillion light years square. Admittedly it’s mostly empty space, but so is the universe. If you’d like to play with that PDF, you can get it here.
Please don’t try to print it.
Accompanying me was Flemingo, a hand-puppet flamingo who’s become a constant companion when I go to the show. (Why a flamingo? Because of one verse in a single song that has turned the humble flamingo into an icon in the fandom.)
Both of us were wearing bow ties to cosplay as Ian Fleming, who’s both a character in the musical and the co-author of a memo that laid the seeds for the real-life operation. And Ian Fleming was a part-time novelist, part-time spy, so it made sense for us to be carrying a stack of his books.
I did look in a couple of charity shops for old Ian Fleming novels, but I couldn’t find any. Lacking real books, I decided that the only alternative was to make my own.
This led to a series of props that are, if anything, over-arsed: the Collected Works of Ian Flemingo. These are a collection of postcards with covers based on the real James Bond books… sort of:
I got the idea over Christmas, and annoyed my family by coming up with increasingly tenuous ideas for James Bond Bird puns.
This was lots of fun, and I came up with so many ideas – for every one you see, there are two more I didn’t use.
The design of the covers is loosely based on a real set of hardback editions, without the fancy art. For the images, I turned to PhyloPic, a site I bookmarked years ago and have been waiting for a good reason to use. It provides free silhouette images of animals and plants, and I was able to pick a set of bird images with CC0 licenses. (Flamingo, gull, goldfinch, dodo, loon, stork, crow, magpie)
I’d like to say the rainbow spread was intentional, but really I just kept picking colours I hadn’t used yet – I didn’t arrange them in this order until I had them printed.
On the other side of the cards, I wrote pun-filled blurbs that are based loosely on the original books. I got the original blurbs from Amazon listings, then stuffed them with puns and references to the show. At one point I did consider including testimonials from characters, but I didn’t have space.
The books are published by “El Otro Editora”, a reference to “El Otro Periodico”, a fan-produced newspaper, which is in turn named after the line “el otro telefono”. The price is “6d”, which is the one mistake that really irks me – I meant the cost to be a ha’penny, but instead it’s half a shilling. Oops. (Why a ha’penny? Because of a line in the show.)
My favourite touch is the barcodes in the corner. This is what makes it feel like a “real” book to me – and every book has a unique barcode, and they can be scanned!
Initially I picked random numbers as the ISBNs, but I got worried about where they might be pointing – maybe I’d accidentally pick a real ISBN that pointed at an offensive book?
Then I looked for ISBNs that could be used safely. For example, I know there are telephone numbers that are reserved for fictional purposes, so a phone number that appears in a TV show won’t accidentally lead to a deluge of calls for the person on the other end. But although lots of ISBNs are used for fiction books, there aren’t any ranges reserved for fictional books. (And yes, I did look up the ISBN specifications to see if there were any ranges reserved for this purpose. This project is nothing if not ridiculous.)
Finally, I did the obvious thing – I just dropped in the ISBNs for the original Ian Fleming books. So if you scan the barcodes, you’ll find a James Bond book.
I printed the cards themselves using InstantPrint, who did a great job – the cards feel nice in the hand, and they were delivered quickly on a tight deadline.
Most of my work is purely digital, so I always enjoy when I get to actually hold something I’ve made. In the week leading up to the event, I found myself taking them out of my bag and just turning them over in my hands as I admired the shiny, shiny thing. I don’t expect to print more of these particular books, but I can see myself doing more custom postcards in future – it’s nice to have physical souvenirs in an increasingly online world.
I gave out a bunch of the postcards as OHT souvenirs on Monday, and people liked them! I also got to give some sets to the cast at stage door today, which got rid of the rest of the prints. I had fun making them, and I’m so glad that people are enjoying them.
Small flashes of joy, indeed.
However, there is one point of contention: how should websites ask for your age?
I’ve done some thinking, and I’ve come up with a proposal. We all know the best way to tell somebody’s age is to count the candles on their birthday cake, so I’ve built a cake-based interface.
(If these animations are distracting, you can toggle them off/on.)
Let me answer some FAQs, and then I’ll explain how it works:
Why can’t we just use <input type="number">? As many people are fond of saying, age is a state of mind, not a number.
Will this input UI work on all devices? This is definitely on the wide side, but I tried it on the 52′ DiamondVision Ultra Mega Display where I do all my web development, and it fits just fine on there. I’m sure that’s all the testing we need, and nobody would ever have a smaller display where this design doesn’t work.
Can I license this UI to use in my apps? You certainly can! Just send your mail-order form to me at the Institute of Good Ideas, Potassium Plaza, Rainy England.
What’s on your roadmap for V2? Adding a tinny MP3 of “Happy birthday” that autoplays at maximum volume whenever this UI is on screen.
I think we can all agree that this is a brilliant idea, and I’m sure all the major browsers will implement it within weeks. I look forward to getting my cheques in the post.
The cake is drawn entirely using SVG animations, which I haven’t used before.
I’m quite pleased with how well it works, and how close I was able to get to my original idea.
I know there are quite a few ways to do animation on the web; I wanted to experiment with the SVG <animate> tag.
The basic idea of the <animate> tag is that you can tell it different values that an attribute of an element can take over time. For example, here I’m animating a rectangle by increasing its width from 0 to 100, and then decreasing it back to 0 again:
<rect width="10" height="10" fill="black">
  <animate
    attributeName="width"
    values="0;100;0"
    dur="20s"
    repeatCount="5"
  />
</rect>
which looks like:
It’s pretty flexible – you can animate multiple properties on the same element; you can change non-numeric attributes like fill or stroke; you have a lot of control over how the animation behaves.
I did some brief experiments with simple shapes, enough to get a sense of how I could use it.
Now that I knew how to animate attributes in SVG, I made a small icon of a static birthday cake.
There are plenty of existing icons like this on the web, but I made my own so I could keep the shapes simple – most icon sets are just a giant <path> exported from a drawing app, which I’d have to unpick. Animating that would be harder than just creating my own icon.
I started with a little pencil-drawn sketch to work out the rough geometry, then I wrote the SVG by hand. I still find it vaguely relaxing to create pictures from code. This is what I came up with:
Most of this is fairly vanilla SVG, using stuff I’ve written about before. The candle flames and the curving line are both using SVG masks, and the curves are drawn as a collection of circular arcs.
The one interesting bit is the rounded corners on the two layers of cake, where only the top two corners are rounded.
You can set the corner radius of an SVG <rect> with the rx attribute, and you get the same curve on all four corners – unlike the CSS border-radius property, which allows you to pick different radii for each corner. To get curves on just two corners, I overlapped two rectangles – one with rounded corners, and one without. Because I’m only doing a solid fill, I’m rendering the two rectangles directly in the image – but if I wanted a more complex fill, I could use this approach to create a mask that I applied to another shape.
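As a rough sketch of the trick – the sizes and fill here are placeholders, not the values from the cake icon:

<!-- a rectangle with all four corners rounded... -->
<rect x="10" y="10" width="120" height="40" rx="10" fill="black"/>

<!-- ...and a square-cornered rectangle covering its bottom half, so the
     combined shape only has rounded corners at the top -->
<rect x="10" y="30" width="120" height="20" fill="black"/>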
Once I had my basic icon, I created an extended version that has several hundred candles on it. This is what the cake looks like when it’s fully complete:
This has something like 200 candles on it; in hindsight I was way off my estimate of how old the oldest humans are. According to Wikipedia, the oldest humans are closer to 120 years old.
I then sprinkled <animate> elements everywhere to make different parts of the cake appear at different times. For the plate and the two cake layers, I’m animating the width attributes, so they gradually get bigger. For the candles, I’m applying a mask which has an animated width attribute, so it gradually allows more and more of the candles to be seen.
That animation uses calcMode="discrete", which causes it to do a distinct step at each tick, rather than a smooth animation between the two. This means that you only ever see whole candles, rather than half-candles in the middle of the animation.
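As an illustration, a discrete animation looks something like this – the values and duration here are made up, not the ones from the cake:

<animate
  attributeName="width"
  values="0;15;30;45;60"
  dur="4s"
  calcMode="discrete"
  fill="freeze"
/>

Rather than interpolating smoothly, the width jumps straight from one value to the next.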
Finally, I added an animation to the viewBox attribute of the overall SVG – this means the width of the SVG increases as more candles become visible. This allows me to get the current state of the animation in JavaScript:
getComputedStyle(document.querySelector('svg'))['width']
// 158px
I know how far apart the candles are spaced, so I can use this to work out how many are visible at any given time. There are other ways to inspect the state of an in-progress SVG animation; tying it to the geometry was the easiest in this case.
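The calculation itself is only a couple of lines – here’s a sketch, where the margin and spacing numbers are placeholders rather than the real values from my SVG:

// How many candles are visible, based on the rendered width of the SVG?
const leftMargin = 20;     // space before the first candle, in pixels
const candleSpacing = 14;  // horizontal gap between candles, in pixels

const width = parseFloat(
  getComputedStyle(document.querySelector('svg')).width
);
const candlesVisible = Math.max(0, Math.floor((width - leftMargin) / candleSpacing));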
If you’d like to learn more, I encourage you to read the SVG file. It’s a bit repetitive in parts, but overall I think it’s fairly readable.
I learn a lot from doing mini-projects like this, and more than I would by just reading the documentation. I didn’t plan to work on this, but this particular idea – “animate a birthday cake” – sunk its teeth into my hyperfocus a few days ago, and I’ve been thinking about it ever since. Posting this article will let me call the project “done” and move on to other things.
Animation is one of those topics that’s always been just beyond what I can do – I knew that SVG animation is a thing, but I’d never actually tried it. Now I have!
For giggles, I decided to use the animal detection feature to tell me what sort of dog Ziva is. We’re not actually sure of her breed because she’s a rescue dog of unknown parentage, so it’s always fun to try the animal detection and see what it suggests.
When I tried, I was shown a button I hadn’t seen before: Look up Mammal.
And with the best will in the world, I couldn’t have guessed what it would suggest:
After we’d all had a good laugh at this suggestion, I opened the Wikipedia page and went down a rabbit hole. When you see some photos, it starts to make more sense – several of the photos featured really do show a sizable snout. Given the distortion in my photo, you can see why the algorithm thought there could be a match (and where the “hammer-headed” name comes from):
And it turns out it’s a pretty cool creature – it’s a “megabat” (great name) and its wingspan can be close to a metre wide. I think of bats as small and cuddly, but if something flew past that was that size, I’d be pretty nervous. Imagine if something that looked like this flew towards you with outstretched wings on a dark night:
And for extra scary points, it makes a pretty loud honking sound, so loud that it’s often considered a pest. This noise is how males attract females, and it’s so important that their internal organs are actually shaped around their ability to honk:
The most noticeable anatomical features of the male involve sound production. The larynx is one-half the length of the vertebral column and fills most of the thoracic cavity, pushing the heart, lungs and alimentary canal backward and sideways.
— Hypsignathus monstrosus, by Paul Langevin and Robert M. R. Barclay, Mammalian Species Issue 357, 26 April 1990.
That same paper also features a delightful description of the male nose, which sounds like somebody getting revenge for the way male authors describe women in novels:
Males have a large, square, truncated head (Tate, 1942) with enormous pendulous lips, ruffles around a warty snout and a hairless, split chin (Lang and Chapin, 1917).
I find myself down this sort of rabbit hole surprisingly often, and I’ve begun to think of it as “serendipitous search results”. Whatever Ziva is, she’s definitely not an African species of large bat, but it appeared in my search results anyway, and that was the start of some fun reading. At other times, I’ll look for a specific book at my local library, and they don’t have it, but I’ll end up reading half a dozen books with similar titles because that’s what the search could find.
If you want more pictures of cool bats, there are a bunch of them on iNaturalist. If not, I’ll see you next time I stumble upon something fun and unexpected while searching.
I did return to a couple of favourite authors, who all released new books that I enjoyed – Alexandria Bellefleur, Toshikazu Kawaguchi, Maureen Johnson – but that was more familiar feelings than new. I got more of my novelty from theatre than books this year.
About half of what I read came from my local library, including four of the books on the list below. The library was particularly helpful for the “good, but not great” books that I know I won’t read again. I was glad to return the books, and not have them taking up space in my house. I expect to continue leaning on my library next year.
I continue to write short reviews of books I’ve read at https://books.alexwlchan.net, and keep notes in Obsidian. The latter have been useful for jogging my memory, including as I write this post.
Below are the best books I read in 2023, in the order I read them. Note that it’s quite a heavy selection, several dealing with death and oppression and other unpleasant things. I do recommend them all, but you should have a cheery book to read afterwards.
The Origins of Iris, by Beth Lewis (2021)
This is a fascinating take on the “what if” concept.
Iris has run away to the forest to escape her abusive wife, and there she meets an alternate version of herself who made different choices – an Iris who married someone different, and now carries an unwanted pregnancy. They both have regrets about how their life turned out, and envy some aspects of the other’s life – but they also learn that the grass isn’t always greener.
The book switches back and forth between “Before” (the events leading up to her running away) and “After” (what she does in the woods). You know that something bad has happened to Iris, but you only find out the details slowly. I enjoy this sort of non-linear story.
It gave me a lot of thoughts and feelings, and I felt sympathy for Iris. Both Irises have flawed and imperfect lives, and there’s no implication that one is obviously better or more correct. I haven’t read it again, but I want to, because I suspect there’s a lot I’d pick up on a second time round.
Content warnings for abuse, sexual assault, rape, and suicide.
(Yes, I technically read this in 2022, but it missed the cutoff for last year’s post.)
Seven Fallen Feathers, by Tanya Talaga (2017)
This is an account of abuse, neglect, and deaths of Indigenous students in Canada’s school system. It’s not an easy book to read, but it’s well-written and worth the time.
It focuses on seven deaths in Thunder Bay, and includes the history of the residential schools which form the backdrop for those events. As I don’t know much about Canadian history or education, I found the context useful.
The titular “Seven Fallen Feathers” are seven children who died in unexplained circumstances. The book devotes a chapter to each of them, describing their history, the days leading up to their death, and the hole it left in the lives of their friends and family. It’s difficult to read but important, and the author has obviously done a lot of research and interviews.
There are certain themes that keep coming up – racism towards Indigenous people, indifference from the Thunder Bay police, the effect of moving kids to big cities – but the author is subtle about them. She doesn’t need to tell you what patterns you should be looking for, because it’s so obvious from the stories. It’s the embodiment of “show, don’t tell”.
Content warnings for death, suicide, racism and colonialism.
by Darcie Little Badger (2020)
A fun story in an America where fantasy and magic are real and commonplace, including vampires, ghosts, and evil scarecrows.
Elatsoe (“Ellie”) is a member of the Lipan Apache Tribe, and women in her family have the ability to speak to ghosts – usually animal ghosts, because human ghosts are more violent and angry. She’s investigating the death of her cousin Trevor, which she believes to be a murder.
The plot is a little slow, but the worldbuilding is gorgeous – it feels rich and real, without being heavy-handed. We get little glimpses of the magic, but it’s never treated as spectacular or obsessed over. I didn’t need to have every detail to enjoy it. I’d happily read more stories in this setting, but it also works nicely as a standalone.
This was Darcie Little Badger’s debut novel, and I read her second novel, A Snake Falls to Earth, in August. It was another good fantasy, and I’ll definitely be reading more of her work in future.
by Laura Imai Messina (2020)
This is a gorgeous story about grief and heartbreak, and two people learning to find love again after great tragedy.
The phone box itself is mundane, with no special magic or power. When I first saw the title, I wondered if this was a fantasy or sci-fi book – I was getting TARDIS vibes – but it’s not, and that’s a good thing. It’s just an ordinary phone box in a garden called Bell Gardia, a few hours from Tokyo.
People go to the phone box to have conversations with their loved ones – often they’re talking to people who have died, but not always. Some talk to estranged family, others talk to friends who are alive but mentally incapacitated or traumatised. These conversations are largely private, and we don’t get to hear many of them.
Instead, the book focuses on a handful of characters – Sui, Takeshi, and Hana – and how they interact with the phone box, and its other visitors. They’ve all suffered losses, and their visits to the phone box are what help them to start reconnecting with people.
It reminded me a lot of Toshikazu Kawaguchi’s books, which I adore – grief and sadness but with an ultimately positive message.
by Fern Brady (2023)
This is a great memoir about autism and sexuality.
I first saw Fern Brady when she appeared on Taskmaster, where she quickly became one of my favourite contestants. I loved how unapologetic she was about being herself, and how much fun she seemed to be having. When I learnt she was writing a memoir, I knew I had to read it.
It’s the story of her growing up, being autistic, and the ways her behaviours have affected her life. There’s also a lot of discussion of how being a woman meant her autism was overlooked or ignored for a long time. It felt genuine and raw, and it wasn’t unrealistically hopeful or optimistic – it was a statement about what being an autistic woman is like.
I definitely saw parallels with my own life, and it’s given me plenty to think about.
I was engrossed and read it in a single day; I was enjoying it so much I actually missed my train stop on the way home.
by Jihyun Park and Seh-Lynn Chai (2022)
This is a gripping and horrifying story of growing up in North Korea, then escaping as an adult.
North Korea is a country I was only vaguely aware of, and most of my knowledge comes from pop culture stereotypes, so I learnt a lot from this book. It’s primarily a story about Jihyun’s experience rather than North Korean politics, but even so it covers a lot of Korean history that was new to me.
It starts when Jihyun was a child, in the early 1970s, when North Korea was a less uncomfortable place to live, if not exactly prosperous. She grows up to be a school teacher as the economy declines, and then escapes through China when things get much worse – saving herself, but leaving her entire family behind. One of the most moving chapters is a farewell letter to her father.
The writing is clear and simple, with plenty of small details and individual stories. As with Seven Fallen Feathers, this is a good example of “show, don’t tell”.
$ python clean_up_text.py /path/to/text/file.md
This is mostly fine, but finding that path is a bit annoying when I want to run them on a note I have open in Obsidian. It’s not hard, it just takes a few steps – open the “More options” menu, click “Reveal in Finder”, drag the file from Finder into terminal. I wanted a way to make it a bit quicker.
I’ve written a little script which gives me a path to the note I currently have open in Obsidian, so now I can run something more like:
$ python clean_up_text.py $(path_to_frontmost_obsidian_note)
I’ve managed to do this with a bit of AppleScript and Python, even though Obsidian doesn’t have any AppleScript support.
The inspiration for this script was another script I have for getting the frontmost URL from my web browser. The crux of that script is a single line of AppleScript that controls Safari:
tell application "Safari" to get URL of document 1
Unfortunately this isn’t quite as simple in Obsidian – it doesn’t have any AppleScript support, so you can’t do anything with tell application "Obsidian".
(The lack of AppleScript is annoying, but understandable. It’s a niche technology on a marginal platform, and Apple seems to have completely forgotten it exists. Much as I find AppleScript useful, it’s hard to justify the time/effort to add support for it in a new app today.)
But even if Obsidian doesn’t have its own AppleScript dictionary, it is still visible from the AppleScript universe – as a process in System Events. We can’t see much, but we can see its windows, for example:
tell application "System Events"
tell process "Obsidian" to get title of front window
end tell
The window title has three parts, separated by hyphens: the name of the note, the name of the vault, and the Obsidian version:
Short story ideas - textfiles - Obsidian v1.4.16
This is the same title that shows up in the “Window” menu – it’s a bit of Obsidian poking into macOS where AppleScript can see it.
Because Obsidian always uses the title as the filename (e.g. this file is called Short story ideas.md), we can use this to find the path to the Markdown file.
To find the Markdown file, you match the vault name to a folder on disk, then you search for files that match the note title. There are a bunch of ways you could do this; I picked Python because that’s what I’m familiar with, but you could use another language just as easily.
This is the script I wrote, which I named obnote.
Hopefully the comments are enough to explain what’s going on:
#!/usr/bin/env python3
"""
Print the path to the Markdown file which is currently open
in Obsidian (if any).
This relies on knowing the on-disk locations of my Obsidian vaults,
so you won't be able to use this without changing it for your own setup.
Note: this will print the *first* file with the same name as your
open note, which may cause issues if you have multiple notes with
the same title.
"""
import os
import subprocess
def get_file_paths_under(root=".", *, suffix=""):
"""
Generates the absolute paths to every matching file under ``root``.
See https://alexwlchan.net/2023/snake-walker/
"""
if not os.path.isdir(root):
raise ValueError(f"Cannot find files under non-existent directory: {root!r}")
for dirpath, _, filenames in os.walk(root):
for f in filenames:
p = os.path.join(dirpath, f)
if os.path.isfile(p) and f.lower().endswith(suffix):
yield p
def get_applescript_output(script):
"""
Run an AppleScript command and return the output.
"""
cmd = ["osascript", "-e", script]
return subprocess.check_output(cmd).strip().decode("utf8")
if __name__ == "__main__":
window_title = get_applescript_output("""
tell application "System Events"
tell process "Obsidian" to get title of front window
end tell
""")
# The window title will be something of the form:
#
# Short story ideas - textfiles - Obsidian v1.4.16
#
note_title, vault_name, _ = window_title.rsplit(" - ", 2)
# Match the vault name to a path on disk.
#
# This is very specific to my setup, so if you want to use it on
# your computer, you'll need to customise this bit.
if vault_name == "textfiles":
vault_root = os.path.join(os.environ["HOME"], "textfiles")
else:
raise ValueError(f"Unrecognised vault name: {vault_name}")
# Find Markdown files that match the name of this note.
for path in get_file_paths_under(vault_root, suffix=".md"):
if os.path.basename(path) == f"{note_title}.md":
print(path, end="")
break
else: # no break
raise RuntimeError(f"Could not find note with title {note_title}")
This does assume that notes have unique titles – that I won’t, for example, have two notes in different folders both called Short story ideas.md.
That’s true in my vault, but you might want to be careful using it if you reuse note titles.
Now I can invoke my text cleanup scripts like so:
$ python clean_up_text.py $(obnote)
This is especially useful when I want to run the same cleanup script on multiple notes in quick succession.
I can run this command once, switch to Obsidian and select a new note, then return to my terminal and press up-arrow and enter to run the cleanup on my new note.
There are lots of other ways you could solve this problem – for example, I realised as I wrote this post that you could look at the .obsidian/workspace.json file – but this works for me, and I had a bit of fun while writing it.
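For what it’s worth, here’s a rough sketch of what that workspace.json approach might look like. This is untested, and it assumes the file has a lastOpenFiles list of vault-relative paths with the current note first – an undocumented detail which may change between Obsidian versions:

import json
import os

# Assumption: this is where my vault lives, matching the obnote script above.
VAULT_ROOT = os.path.join(os.environ["HOME"], "textfiles")

with open(os.path.join(VAULT_ROOT, ".obsidian", "workspace.json")) as f:
    workspace = json.load(f)

# Assumption: the first entry in "lastOpenFiles" is the most recently
# opened note, as a path relative to the vault root.
print(os.path.join(VAULT_ROOT, workspace["lastOpenFiles"][0]), end="")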
(I don’t remember much of my old approach, but I know it was messy. At one point I used Homebrew and virtual environments, but I got burnt by Homebrew unexpectedly breaking Python, so I scrapped it and started installing everything in my global Python installation. Don’t do that.)
In August I read Glyph’s post Get Your Mac Python From Python.org and it all seemed like sensible advice, so I decided to use that as my starting point. I downloaded Python on my new work laptop from Python.org, and I started using virtual environments for everything.
This worked well enough, but there were some rough edges in my new workflow. I’ve been tweaking my Fish shell config to make it a bit smoother.
One recommendation in Glyph’s post is that you always use virtual environments, and they suggest a way to enforce that:
Once you have installed Python from python.org, never pip install anything globally into that Python, even using the --user flag. Always, always use a virtual environment of some kind. In fact, I recommend configuring it so that it is not even possible to do so, by putting this in your ~/.pip/pip.conf:

[global]
require-virtualenv = true
I like the idea of always using virtualenvs, but I’m not a fan of putting config files in my home directory. I struggle to keep them up-to-date, and after a while I lose track of what’s what – is this config still in use, or is it cruft from a tool I no longer use? Plus, each config file becomes one more thing to remember when I set up a new computer.
Fortunately, this config file isn’t the only way to ensure you always use a virtual environment.
You can also set PIP_REQUIRE_VIRTUALENV, so I have the following lines in my fish shell config:
# This prevents me from installing packages with pip without being
# in a virtualenv first.
#
# This allows me to keep my system Python clean, and install all my
# packages inside virtualenvs.
#
# See https://docs.python-guide.org/dev/pip-virtualenv/#requiring-an-active-virtual-environment-for-pip
# See https://blog.glyph.im/2023/08/get-your-mac-python-from-python-dot-org.html#and-always-use-virtual-environments
#
set -g -x PIP_REQUIRE_VIRTUALENV true
Because I keep my shell config in Git, it’s easier to see when I added this variable, and when I get a new computer I’ll get the right behaviour “for free”.
The process of creating new virtual environments is ostensibly simple – just two commands.
$ python3 -m venv .venv
$ source .venv/bin/activate.fish
In practice I only ever remembered to run the first – I’d create my new virtual environment, go to pip install something, and then it would complain I hadn’t enabled a virtual environment.
I’d mutter and grumble, activate the virtualenv, and try again.
If I’m creating a virtual environment, I want to use it immediately, so I wrapped this process in a Fish function called venv:
function venv --description "Create and activate a new virtual environment"
echo "Creating virtual environment in "(pwd)"/.venv"
python3 -m venv .venv --upgrade-deps
source .venv/bin/activate.fish
# Append .venv to the Git exclude file, but only if it's not
# already there.
if test -e .git
set line_to_append ".venv"
set target_file ".git/info/exclude"
if not grep --quiet --fixed-strings --line-regexp "$line_to_append" "$target_file" 2>/dev/null
echo "$line_to_append" >> "$target_file"
end
end
    # Tell Time Machine that it doesn't need to bother backing up the
# virtualenv directory. (macOS-only)
# See https://ss64.com/mac/tmutil.html
tmutil addexclusion .venv
end
I typically run this in the root of a project directory, usually a Git repo.
When I run it, it creates a new virtual environment with an up-to-date version of pip (thanks to --upgrade-deps), then it activates it immediately. This means my next command can be a pip install, and it’ll run inside the new virtualenv.
It also adds the .venv directory to .git/info/exclude, which is a local-only gitignore file. This means that Git will ignore my virtual environment, and not try to save it. The grep command is checking that I haven’t already gitignore-d .venv, so I don’t add repeated ignore rules.
It also tells Time Machine not to bother backing up the virtual environment directory. I’d never restore a virtualenv from a backup; I’d just create a new one fresh, so backing it up is a waste of space and CPU cycles.
I often combine this with another function I have for creating temporary directories:
function tmpdir --description "Create and switch into a temporary directory"
cd (mktemp -d)
end
like so:
$ tmpdir; venv
And with two short commands, I’m in an empty directory with a fresh virtual environment. This is great for quick prototyping, experiments, and one-off projects.
Once I’ve created my virtual environments, I need to remember to activate them.
I could do this manually, or I could have the computer look for virtualenvs and (de)activate them automatically for me. There are various plugins for doing this (I used virtualfish a few years ago), but this time round I realised my needs were simple enough that I could just write my own function.
My venv function ensures a standard approach to virtualenv naming: I always call them .venv, and I put them in the root of my project directories, which are always Git repos. This means I can find if there’s a virtualenv I want to auto-activate by looking to see if I’m in a Git repo, then looking for a folder called .venv.
This is the function:
function auto_activate_venv --on-variable PWD --description "Auto activate/deactivate virtualenv when I change directories"
# Get the top-level directory of the current Git repo (if any)
set REPO_ROOT (git rev-parse --show-toplevel 2>/dev/null)
# Case #1: cd'd from a Git repo to a non-Git folder
#
# There's no virtualenv to activate, and we want to deactivate any
# virtualenv which is already active.
if test -z "$REPO_ROOT"; and test -n "$VIRTUAL_ENV"
deactivate
end
# Case #2: cd'd folders within the same Git repo
#
# The virtualenv for this Git repo is already activated, so there's
# nothing more to do.
if [ "$VIRTUAL_ENV" = "$REPO_ROOT/.venv" ]
return
end
# Case #3: cd'd from a non-Git folder into a Git repo
#
# If there's a virtualenv in the root of this repo, we should
# activate it now.
if [ -d "$REPO_ROOT/.venv" ]
source "$REPO_ROOT/.venv/bin/activate.fish" &>/dev/null
end
end
This function runs as an event handler in Fish – it runs whenever the PWD variable changes. That variable is the current working directory, so in practice this runs whenever I change directories. I find the top-level directory of the current Git repo by running git rev-parse --show-toplevel, which is a super handy command I use in lots of scripts. If I’m not in a Git repo, it returns an empty string. Then I compare that to the path of the currently-enabled virtualenv in VIRTUAL_ENV, and decide whether I need to activate or deactivate a virtualenv.
If you want the complete code, my Fish shell config is in a public repo, although the virtualenv stuff is a bit spread out.
This was the first project where I used ChatGPT to help write the code. I was initially quite sceptical of LLMs, but watching what Simon Willison has been doing persuaded me to try it. This felt like a safe project for a first attempt – it’s a minimal project with clearly defined “is the code working” criteria, and limited impact if I do something daft.
Overall I was quite impressed.
All the code seemed to work, and it was helpful for the bits of shell syntax I only half-remember – things like test -z and combining multiple conditions in a boolean.
I didn’t use any of its output directly, but it was a good starting point that I could adapt into my actual code.
I’m sure this won’t be my last project where ChatGPT lends a helping hand.
If I tried a simple example:
import mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.open("https://www.example.net/").read()
it would fail with an error:
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate
verify failed: unable to get local issuer certificate (_ssl.c:1000)
My first instinct was to check Google and GitHub; I couldn’t find any other instances of people finding and fixing this issue. The most I could find was a quickstart guide that starts with an HTTP example, then suggests disabling SSL verification to access HTTPS sites. I found a few instances of people following this suggestion on GitHub – but I wasn’t keen on that. SSL verification exists for a reason; I don’t want to get rid of it!
A bit later I found a page about changing the certificates used by your mechanize browser with browser.set_ca_data().
I knew from my work on HTTP libraries that certifi is a bundle of SSL certificates often used in Python libraries, so I decided to try pointing mechanize at certifi:
import mechanize
import certifi
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.set_ca_data(cafile=certifi.where())
browser.open("https://www.example.net/").read()
That seemed to work, and my mechanize browser was once again able to browse the HTTPS web.
If you use popular HTTP libraries like httpx or requests, they install and load SSL certificates from certifi automatically. I don’t know why mechanize doesn’t do the same, but it was just a one-line change to get it working correctly.
One of the cool things it does is autosuggestions from my shell history. As I’m typing, it suggests (in light grey) a command I’ve run before. I can press the right arrow to accept the suggestion, or I can keep typing to ignore it.
This feature is pretty smart and very useful, and it’s probably saved me thousands of keystrokes. But a few times recently it’s gone wrong, and suggested something I didn’t want – so I’ve been tweaking my fish config to make it easier for fish to “forget” certain commands.
There are a few categories of command that I don’t want fish to remember or suggest:
Typos and mistakes. If I ever mistype a command and run it, that mistake will be used for future autosuggestions. Getting a command wrong once is annoying; having my mistake be continually suggested is worse.
A while back I mistyped hdiutil attach as hdiutil atach, and I kept getting the wrong version as an autosuggestion.
(That annoyance is what led to the code in this post!)
Sensitive information. Sometimes I end up with sensitive information in my shell commands – either by accident, or because it’s the fastest way to fix a problem. I know I’m not meant to do that, but nobody’s perfect.
My fish shell history lives unencrypted in a file on disk, where anything could find it (~/.local/share/fish/fish_history).
I’d rather not have passwords, keys, and other credentials living there forever.
Potentially dangerous commands. My example of this is doing a Git force push, which can delete data if I’m not careful. It’s the right thing to do sometimes, but I never want to start typing a regular Git push and get a force push as an autosuggestion.
I never want to do a force push accidentally, and I’m willing to give up the benefits of autosuggestion for a bit of extra safety.
To help me avoid autosuggestion in these three cases, I’ve added two functions to my shell config.
This function removes the last-typed command from my history, which prevents it from being suggested again. I run this manually, whenever I mistype a command or some other one-off thing I don’t want to remember.
function forget_last_command
    set last_typed_command (history --max 1)

    history delete --exact --case-sensitive "$last_typed_command"
    history save
end
You can test this with the following steps:

1. Run echo "my password is hunter2"
2. Type echo into your shell, and see that the previous command is suggested
3. Run forget_last_command
4. Type echo again, and notice that the first command is no longer suggested
5. Type echo once more, and check that the first command isn’t suggested

The heavy lifting is done by fish’s history command – first it looks up the last command I typed, then it removes that from the history, and finally it persists that change to disk.
(I’m not entirely sure the history save should be necessary, but with fish 3.6.1 – the latest version – it is required for this to work. I think there’s something slightly funky about history delete --exact, which made this maddening to debug until I started following my fish_history file.)
Forgetting a command on a one-off basis is good for typos and accidental passwords, but what about commands I use on a semi-regular basis?
It’d be annoying if I had to type forget_last_command every time I ran git push --force.
This function looks at my last command, and if it’s dangerous, it removes it from my history. Crucially, this runs as part of my shell prompt, so it runs as soon as a command completes – I don’t need to remember to forget:
function forget_dangerous_history_commands
set last_typed_command (history --max 1)
if [ "$last_typed_command" = "git push origin (gcb) --force" ]
history delete --exact --case-sensitive "$last_typed_command"
history save
end
end
You can test this with the following steps:

1. Run git push origin (gcb) --force (here gcb is an alias for git rev-parse --abbrev-ref HEAD, which prints the name of the currently checked-out branch)
2. Type git push again, and notice that the force push isn’t suggested

A force push is the only example of a dangerous command that I use regularly, but there could be others – anything involving rm -rf, for example. If I ever find myself doing something dangerous that I never want suggested, it should be pretty easy to extend this function.
Writing for Stories was one of my “bucket list” items while working at Wellcome, and I actually submitted the pitch on the same night I decided to start looking for new jobs. It’s among the more personal things I’ve written, but I’m really pleased with the result. I’m incredibly grateful to Alice White (my editor) and Steven Pocock (who took the photographs) who helped turn my rough idea into something great, and a nice capstone for my time at Wellcome.
There are photos of several of my finished or in-progress pieces in the article, which come from a variety of artists.
You can read the story on the Wellcome Collection website.
What I’d rather do is move some big items out of my library, and get some space back. I’ve got a pretty good workflow for reviewing new photos, but what about ones from before I had my reviewing tool?
I wrote a short Swift script which prints a list of all the largest files in my Photos Library. The key part is two methods in PhotoKit: PHAsset.fetchAssets to enumerate all the files, and PHAssetResource.assetResources to retrieve the original filename and file size. The rest of the script takes the data and does some sorting and pretty-printing.
#!/usr/bin/env swift
import Photos
struct AssetData: Codable {
var localIdentifier: String
var originalFilename: String
var fileSize: Int64
}
/// Returns a list of assets in the Photos Library.
///
/// The list is sorted by file size, from largest to smallest.
func getAssetsBySize() -> [AssetData] {
var allAssets: [AssetData] = []
let options: PHFetchOptions? = nil
PHAsset.fetchAssets(with: options)
.enumerateObjects({ (asset, _, _) in
let resource = PHAssetResource.assetResources(for: asset)[0]
let data = AssetData(
localIdentifier: asset.localIdentifier,
originalFilename: resource.originalFilename,
fileSize: resource.value(forKey: "fileSize") as! Int64
)
allAssets.append(data)
})
allAssets.sort { $0.fileSize > $1.fileSize }
return allAssets
}
/// Quick extension to allow left-padding a string in Swift
///
/// By user2878850 on Stack Overflow:
/// https://stackoverflow.com/a/69859859/1558022
extension String {
func leftPadding(toLength: Int, withPad: String) -> String {
String(
String(reversed())
.padding(toLength: toLength, withPad: withPad, startingAt: 0)
.reversed()
)
}
}
let bcf = ByteCountFormatter()
for photo in getAssetsBySize() {
let size =
bcf
.string(fromByteCount: photo.fileSize)
.leftPadding(toLength: 8, withPad: " ")
print("\(size) \(photo.originalFilename)")
}
When I run the script, I combine it with head to get a list of the top N files:
$ swift get_photo_sizes.swift | head -n 5
578 MB IMG_3607.MOV
518.5 MB IMG_0794.MOV
494.1 MB IMG_9858.MOV
373.6 MB IMG_1933.MOV
372.5 MB IMG_3751.MOV
In my library of 26k items, the script takes about a minute or so to run.
I went through the first 50 or so items, one-by-one. I moved about 30 videos out of my photos library and on to an external disk, and I deleted a few more – in total I recovered about 7GB of space. It’s not a lot, but it gives me some more breathing room.
Pretty much all these files were video messages I’d made for friends and family, and sent as soon as they were recorded. Honestly, I think it’s unlikely I’ll ever watch these again – I’m keeping them just-in-case, but I definitely don’t need them in my synced-everywhere photo library.
I don’t know if I’ll use this exact script again, but it was a good opportunity to practice using Swift and PhotoKit. I’m gradually building a little collection of scripts and tools I can use to do stuff with photos, and this is another pebble on that pile.
There’s a lot of spam in the catalogue search. Somebody types in a search query which can’t possibly return any results – instead it’s a message (often not in English) promoting sketchy-sounding services and domains. Call me sceptical, but I don’t think somebody who types in:
escort girls in your area play free casino games ✔️ with chatgpt ⏩ whatsapp scamalot.xyz
is actually looking for catalogue results in a library/museum website.
I don’t know why people set up bots to do this – but whatever the reason, dealing with this sort of spam is an inevitable part of running a website on the public Internet.
Before I started this work, we were sending all these spam queries to our back-end search API and Elasticsearch cluster. Over time, the load from the spam was starting to add up, and starting to crowd out real queries on our cluster.
We wanted to find a way to identify the spam, so we could return a “no results page” ASAP, without actually sending the query to our Elasticsearch cluster. It was usually “obvious” if you read the queries as a human, but how could we teach the computer to make the same distinction?
I started by using the code from my last post to get all the CloudFront logs for our catalogue search:
import datetime
import json
class DatetimeEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, datetime.datetime):
return obj.isoformat()
log_entries = get_cloudfront_logs_from_s3(
sess,
Bucket="wellcomecollection-experience-cloudfront-logs",
Prefix="wellcomecollection.org/prod/"
)
with open("search_log_entries.json", "w") as out_file:
for entry in log_entries:
if entry["cs-uri-stem"].startswith("/search"):
out_file.write(json.dumps(entry, cls=DatetimeEncoder) + "\n")
That gave me about 7 million log entries that I could analyse. Then I started developing my spam heuristic, which had a single Python function as its interface:
def is_spam(log_entry) -> bool:
return False
To develop the heuristic, I wrote a bunch of versions of this function, trying various techniques to look at different fields in the log entry and decide if a particular request was spam. To help me evaluate the different versions, I wrote a test script I could run repeatedly as I tweaked the function:
import random
import time
import humanize
import termcolor
spam = []
legit = []
# Get all the log entries
#
# This will load them into a single list, which uses a lot of memory.
# It is possible to do this in a more memory-efficient way using a
# generator, but I had memory to spare and I didn't want the extra
# complexity here.
log_entries = [json.loads(line) for line in open("search_log_entries.json")]
# Go through all the log entries and classify them as spam/not spam
time_start = time.time()
for log_entry in log_entries:
if is_spam(log_entry):
spam.append(log_entry)
else:
legit.append(log_entry)
time_end = time.time()
# Print a brief summary of the results
print(f"Rejected = {humanize.intcomma(len(spam))}")
print(f"Allowed = {humanize.intcomma(len(legit))}")
print(f" = {len(spam) / (len(spam) + len(legit)) * 100:.2f}% marked as spam")
elapsed = time_end - time_start
print(f"Per req = ~{elapsed / len(queries)}s")
# Print a sample of the log entries marked as spam/not spam, to give
# me something to evaluate.
print("---")
for log_entry in random.sample(spam, k=min(100, len(spam))):
print(termcolor.colored(log_entry["cs-uri-query"], "red"))
print("---")
for log_entry in random.sample(legit, k=100):
print(termcolor.colored(log_entry["cs-uri-query"], "green"))
The output gives me a summary with a few statistics:
Rejected = 5,136,696
Allowed = 1,805,073
= 74.00% marked as spam
Per req = ~1.4436019812105793e-05s
The proportion of rejected traffic is there so I can see whether my proposed heuristic is actually making a difference to the volume of requests. The per-request time is for measuring performance; I didn’t want to introduce noticeable latency for legitimate users.
It also prints a random sample of the queries marked as legitimate and spam. This gave me a spot check on the heuristic – I could see if legitimate queries were being rejected, or if I wanted to add another rule for matching spam.
Repeatedly re-running this test harness gave me a workflow for developing my spam detection heuristic: I’d tweak my function, re-run, and see how it affected the results. I kept iterating until I was catching a decent proportion of spam, without penalising real users.
Most of my analysis focused on the search query, and there were several patterns I spotted which seemed to be strong indicators of spam:
Certain keywords like chatgpt, casino and crypto.
This was my first idea, because it was pretty obvious in the queries I was reading, but it was dropped from the final heuristic for two reasons. It only dropped a fairly small amount of traffic (~2%) and it was hard to agree on a list of words that were definitely spam.
Emoji, of which ⏩, ✔️, ㊙️ were particularly common examples. There’s no emoji in our catalogue data so it’s unlikely a real person would search for it (and they won’t find anything if they do!).
Long, all-Chinese queries. There is some Chinese in the catalogue, but it’s a tiny proportion – the vast majority of our data is in English.
Mangled character encodings, aka Mojibake.
Not all of the non-English text was encoded properly, and there were a lot of queries like â\x8f©â\x9c\x94ï\x8fã\x8a. I don’t think a real person would ever type this in, but a poorly coded spam bot quite likely would.
As I was going, I did tally some other fields in the logs I’d marked as spam, to see if I could spot any other patterns I could use for spam detection – for example, I counted IP addresses to see if all the spam was coming from a single IP address that we could just block.
import collections
spam_ips = collections.Counter(log_entry["c-ip"] for log_entry in spam)
Unfortunately I didn’t find any good patterns this way, so I stuck to the query-based analysis.
For the first version of our spam detection, I settled on this heuristic:
Reject queries with more than 25 characters from character ranges which are rarely used in our catalogue data (Chinese, Korean, Mojibake, emoji)
Later we reduced that threshold to 20, and so far it seems to have worked well. You can see the implementation and the associated tests on GitHub.
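To give a flavour of what that sort of check might look like, here’s an illustrative sketch – this isn’t the production code linked above, and the exact character ranges and threshold here are mine, not Wellcome’s:

def count_unusual_characters(query):
    """Count characters from ranges that rarely appear in our catalogue data."""
    unusual = 0
    for char in query:
        codepoint = ord(char)
        if 0x4E00 <= codepoint <= 0x9FFF:        # CJK Unified Ideographs
            unusual += 1
        elif 0xAC00 <= codepoint <= 0xD7AF:      # Hangul syllables
            unusual += 1
        elif 0x1F300 <= codepoint <= 0x1FAFF:    # emoji and pictographs
            unusual += 1
        elif 0x0080 <= codepoint <= 0x009F:      # C1 controls, common in mojibake
            unusual += 1
    return unusual

def is_spam(log_entry):
    query = log_entry.get("cs-uri-query", "")
    return count_unusual_characters(query) > 20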
If your query is marked as spam, we now show the “no results” page immediately in our frontend web app, rather than sending your search to our backend Elasticsearch cluster. As part of this change, we tweaked the copy on our “no results” page, asking people to email us if their search unexpectedly returned no results:
This was a hedge against mistakes in the spam heuristic – if it somehow got a false positive and binned a query from a real user who should have seen results, we’d hear about it. In practice, I don’t think there’s been a single one. We’ve managed to cut the load on our Elasticsearch cluster, without impacting real users.
I’m almost certain our spam is automated bots rather than targeted spam, which is why I can safely publish this analysis. Nobody is going to read this and adapt their spam attack to counter it, but maybe it’ll be useful if you have to analyse a spam problem of your own.
I find location data quite useful on my photos, and I was wondering if I could add it after the fact. Although my camera doesn’t know where I was, I had a walking workout running on my Apple Watch, and that was tracking my location – could I combine the photos from my camera and the location data from my watch?
The first step was to get all the data from my walking workout. I was able to export the data from the Health app on my iPhone, following the instructions in an Apple Support document:
You can export all of your health and fitness data from Health in XML format, which is a common format for sharing data between apps.
Tap your picture or initials at the top right.
If you don’t see your picture or initials, tap Summary or Browse at the bottom of the screen, then scroll to the top of the screen.
Tap Export All Health Data, then choose a method for sharing your data.
When I tried this, my iPhone said it would take “a few moments”. It took much longer than that, and the lack of progress bar made me wonder if it was broken.
But it did eventually finish, and fifteen minutes later, I had a 174MB ZIP file full of my health data. When I unzipped it, this is what it looked like inside:
apple_health_export/
├─ export.xml
├─ export_cda.xml
├─ electrocardiograms/
│ ├─ ecg_2020-12-27.csv
│ └─ ...10 other files
└─ workout-routes/
├─ route_2020-12-26_1.47pm.gpx
├─ route_2020-12-27_1.04pm.gpx
└─ ...1556 other files
The GPX files are the interesting thing here – GPX is a standard format for passing around GPS data. If I preview one of those files in Quick Look, I can see my walking route shown as a thick green line on a map:
GPX files are XML, and the format of the Apple Health workout routes isn’t especially complicated. Here’s the first few lines of a file:
<?xml version="1.0" encoding="UTF-8"?>
<gpx
version="1.1"
creator="Apple Health Export"
xmlns="http://www.topografix.com/GPX/1/1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd"
>
<metadata>
<time>2023-09-19T21:04:33Z</time>
</metadata>
<trk>
<name>Route 2023-09-17 2:38pm</name>
<trkseg>
<trkpt lon="13.887391" lat="46.277433">
<ele>532.367857</ele>
<time>2023-09-17T07:05:52Z</time>
<extensions>
<speed>1.400002</speed>
<course>287.656214</course>
<hAcc>2.032849</hAcc>
<vAcc>1.793892</vAcc>
</extensions>
</trkpt>
<trkpt lon="13.887373" lat="46.277437">
<ele>532.469812</ele>
<time>2023-09-17T07:05:53Z</time>
<extensions>
<speed>1.398853</speed>
<course>283.005353</course>
<hAcc>1.821742</hAcc>
<vAcc>1.615372</vAcc>
</extensions>
</trkpt>
…
The file is a series of trkpt (“track points”), each of which has a longitude, a latitude, an elevation and a timestamp. The timestamps are in UTC – the first timestamp is just after 7am, but I didn’t arrive at Bohinj until just after 9am. Like the rest of Slovenia, Bohinj is currently on UTC+2.
There are also a couple of data points which I think are something related to direction and speed? I’m not looking into those, but it was interesting to see they’re in there. I don’t think I’ve worked with GPS data before, and there’s a bit more than I expected – I thought I’d just be getting longitude and latitude coordinates, but these extra values make sense, particularly when I’m walking.
I used lxml to write a Python function which extracts all these track points from the file. There are dedicated libraries for dealing with GPX files, but I already know how to use lxml and it was simple enough to write something for this one-off task.
import datetime
from lxml import etree
import pytz
utc = pytz.timezone("UTC")
def get_track_points(tree: etree._ElementTree):
"""
Generate a series of track points from an Apple Health workout route.
"""
namespaces = {"gpx": "http://www.topografix.com/GPX/1/1"}
for trkpt in tree.xpath("//gpx:trkpt", namespaces=namespaces):
# e.g. 2023-09-17T07:05:52Z
time_str = trkpt.xpath(".//gpx:time", namespaces=namespaces)[0].text
        time = utc.localize(datetime.datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%SZ"))
elevation = float(trkpt.xpath(".//gpx:ele", namespaces=namespaces)[0].text)
latitude = float(trkpt.attrib["lat"])
longitude = float(trkpt.attrib["lon"])
yield {
"time": time,
"elevation": elevation,
"latitude": latitude,
"longitude": longitude
}
with open("route_2023-09-17_2.38pm.gpx") as infile:
tree = etree.parse(infile)
for track_point in get_track_points(tree):
print(track_point)
I pulled all these track points into a single Python dictionary, mapping time to location:
with open("route_2023-09-17_2.38pm.gpx") as infile:
tree = etree.parse(infile)
locations = {
track_point["time"]: track_point
for track_point in get_track_points(tree)
}
I discovered that there are some duplicate timestamps in the GPX file – although there’s second-level precision, occasionally it would record two locations for the same time. The two locations were pretty close, maybe a metre or so apart. For this sort of casual photo analysis that’s fine, but it might cause issues if you need more precision.
Pulling them all into a dictionary means picking the last location that appeared in the file. That’s somewhat arbitrary, but I didn’t want to spend too much time on this so I called it good. Because they’re so close together, either is fine for my purposes.
To tie this all together, I wrote a bit more Python which would find all the JPEG files from my camera, get the timestamp of that photo, and use exiftool to add location metadata if my workout had recorded a location at that precise timestamp:
import subprocess
def get_created_time(jpeg_path, *, camera_timezone):
"""
Returns the created time of a photo, according to ``exiftool``.
"""
created_time_str = subprocess.check_output([
"exiftool", "-s3", "-DateTimeOriginal", jpeg_path
]).decode("ascii").strip()
# e.g. 2023:09:17 10:40:49
created_time = datetime.datetime.strptime(created_time_str, "%Y:%m:%d %H:%M:%S")
# Assume the camera was set to match the timezone where the photo
# was taken; convert the timestamp to UTC first.
    return camera_timezone.localize(created_time).astimezone(utc)
def set_location(jpeg_path, *, location_info):
"""
Set the location information on a file using ``exiftool``.
"""
# The Apple Watch locations record latitude/longitude/elevation
# as a single value, whereas exiftool wants an absolute value
# and a direction.
#
# e.g. the Apple Watch might record a position as (37.3346, -122.0090),
# which exiftool wants to see as (37.3346, N, 122.0090, W).
subprocess.check_call([
"exiftool",
f"-GPSLatitude={abs(location_info['latitude'])}",
f"-GPSLatitudeRef={"N" if location_info['latitude'] > 0 else 'S'}",
f"-GPSLongitude={abs(location_info['longitude'])}",
f"-GPSLongitudeRef={"E" if location_info['longitude'] > 0 else 'W'}",
f"-GPSAltitude={abs(location_info['elevation'])}",
f"-GPSAltitudeRef={"0" if location_info['elevation'] > 0 else '1'}",
jpeg_path
])
# See https://alexwlchan.net/2023/snake-walker/ for get_file_paths_under()
for jpeg_path in get_file_paths_under("100_OLYMP", suffix=".jpg"):
slovenia = pytz.timezone("Europe/Ljubljana")
created_time = get_created_time(jpeg_path, camera_timezone=slovenia)
try:
location_info = locations[created_time]
except KeyError:
pass
else:
set_location(jpeg_path, location_info=location_info)
This code has a big assumption at its core: that my Watch will have recorded a location at the precise second I took each photo. In practice, that seems to work well enough – I don’t know if my Watch is doing second-by-second location, but I’d stand still to take my photos, and it would record at least one data point in that time. All my photos from Bohinj got tagged.
If this were an issue, you could write a looser heuristic for matching photos to location data in the workout – for example, using any location that was recorded within a few seconds of the photo being taken. But “same second” worked fine for me, so that’s all I’ve done.
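If I ever needed that, a looser lookup might look something like this – a sketch that reuses the locations dictionary from earlier, with an arbitrary five-second tolerance:

def find_nearest_location(created_time, locations, *, tolerance_seconds=5):
    """
    Return the location recorded closest to ``created_time``, as long as
    it was recorded within ``tolerance_seconds`` of it; otherwise return None.
    """
    if not locations:
        return None

    nearest_time = min(
        locations,
        key=lambda recorded_time: abs((recorded_time - created_time).total_seconds()),
    )

    if abs((nearest_time - created_time).total_seconds()) <= tolerance_seconds:
        return locations[nearest_time]

    return None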
After I ran this code, I did some spot-checking of individual photos – it took a few tries to get the timezone handling correct. I’d taken a photo of the “Welcome to Bohinj” sign right after I got off the bus, and that turned out to be super helpful – I knew exactly where it was, and I could keep tweaking my code until that photo got the right location.
I was once given a tip: when travelling between time zones, take a photo of a clock that’s correctly set to the local time. That way, you can easily correct the time offset later if your camera was configured incorrectly. If I plan to reuse this location tagging code, I’d use the same trick, but with a photo of something in a known location.
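Applying that correction is just a bit of datetime arithmetic. Here’s a sketch with made-up numbers: the difference between what the clock photo shows and what the camera recorded gives you an offset to add to every timestamp:

import datetime

# Hypothetical values: the clock in the photo reads 14:32:10 local time,
# but the camera recorded the photo at 13:32:05.
actual_time = datetime.datetime(2023, 9, 17, 14, 32, 10)
camera_time = datetime.datetime(2023, 9, 17, 13, 32, 5)

# The camera is running 1 hour and 5 seconds slow.
camera_offset = actual_time - camera_time

def corrected_time(recorded_time):
    """Apply the camera's clock offset to a timestamp from the camera."""
    return recorded_time + camera_offset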
Once this was done, I imported all the files into my Photos Library, and voila: I could see all my photos plotted on a map, even though I’d taken them on a camera without GPS support.
I’m pretty happy with this project – for half an hour’s work, I have a nicely-tagged set of photos and a better understanding of the location data recorded by my Apple Watch.
I’d already been thinking about getting a new job, so when I saw an ad for software developers at Wellcome a week later, it felt like the universe was telling me something. I applied, I was hired, and I started at the beginning of 2017.
After nearly seven years, it’s time for something new. I’m going to be leaving Wellcome at the end of September, and joining the Flickr Foundation as their new Tech Lead.
This was a bittersweet decision. I’ve loved my time at Wellcome. I got to work with a fantastic group of people, who’ve helped me grow and learn and fix my mistakes. I’m a much better person now than I was in 2017, and much of that comes from the people I met at Wellcome. (Of whom there are far, far too many to name.)
I’ve been able to work on a number of cool projects while I’ve been at Wellcome, including:
The unified collections search, which combines records from across Wellcome’s different catalogues into a single search box. This was my first exposure to AWS, Elasticsearch and large-scale data pipelines, and I learnt a lot doing this project.
The storage service, which is the permanent storage for Wellcome Collection’s digital collections. I’m proud of how robust and reliable it’s become, and working on this will be a highlight of my entire software engineering career.
Helping to write Wellcome’s first trans inclusion policy. This was a major milestone for the LGBTQ+ Staff Network, and helped me feel comfortable coming out at work in 2019.
Getting my first experience of line management. I’m immensely grateful that the team gave me feedback and patience in equal measure as I warmed up to this change – I made lots of mistakes in the first few months!
As I’ve stepped into a more managerial role, one of the most gratifying things has been to watch people take something that I worked on, and find ways to make it better. The architecture of the catalogue pipeline has been streamlined and improved, making it more reliable and efficient. The trans inclusion policy has been rewritten with more detail and clarity. People are finding new ways to use the data in the storage service.
I’m leaving behind a remarkable team of smart, thoughtful, and capable people. I hope Wellcome knows just how lucky they are to have them.
At the start of October, I’ll be joining the Flickr Foundation as their Tech Lead. The mission – keeping Flickr’s pictures visible for 100 years – is a daunting one, but I’m excited about the challenge. I start in just under a fortnight, and I’m really looking forward to it.
This is a role I couldn’t have done back in 2017 – I had no experience designing large systems, or managing a team, or doing software in a group that wasn’t just software engineers. Those are all things I learnt while working at Wellcome.
I’m immensely grateful for my time at Wellcome Collection, for all the people I got to meet there, and for everything I learnt that’s enabling me to take this next step.
I’ve grown to really like it, and I expect to keep using it for a while to come. Its approach to tagging and linked notes fits my mental model, and there’s a lot of flexibility in the plugin architecture. I can make it look nice, add a few basic features I want, and it syncs nicely across my Mac and my iPhone.
Inspired by Steph Ango’s example, I thought it might be useful to write a little about how I structure my Obsidian vaults. My setup is somewhat fluid, so consider this as more of a point-in-time snapshot than a definitive approach – I keep tweaking it as I find better ways to organise my notes.
I have two vaults: one for personal notes, and one for work.
The two vaults have the same structure, but different contents. Usually the distinction is pretty clear-cut, but occasionally there’s some overlap. For example, I learn about AWS in the course of my work. If I learn something which is generically useful and not specific to my employer’s setup, I’ll put the notes in my personal vault rather than the work vault.
I’m using the Minimal theme, and I use the Minimal Settings plugin to give each vault a distinct appearance. I’m a very visual person, and making the two vaults look different helps reinforce the distinction in my mind. I use a similar set of colour-based themes to help me distinguish between Slack workspaces.
I’m a big fan of keyword tagging, and I use it in all my notes. Every note has at least one tag; often multiple.
I tag liberally, adding all the keywords that I think I might use to search for something later – I think of my tags as a “search engine in reverse”. If I think I might look for a note in three different ways, I give it three different tags.
I create a lot of different tags – my primary vault has at least 800. The distribution is very skewed, with maybe 50 tags that I use a lot, and then a long tail of tags that are only used a handful of times. This might seem messy to some people, but it works for me – even if a tag is only used once or twice, it’s still useful for searching.
I use prefixes as a way to namespace some of my tags, like aws/amazon-s3 and python/pip. This helps keep my list of tags somewhat organised, but otherwise it’s a bit of an inconsistent mess – e.g. I don’t have any rules about singular vs plural.
I have different tags in each of my vaults, but I try to use the same tag in both places if it means the same thing.
I have a handful of top-level folders, and I put most notes in folders. Both of my vaults have the same set of top-level folders.
I try not to keep too many notes in my root – it’s mostly brand new notes, stuff I’m actively working on, or notes I refer to frequently. When I’m finished working on a note, I move it into a folder.
The folders I use:
Attachments for images, audio, PDFs, and so on. Anything that isn’t a text file. I use my image gallery plugin to browse the contents of this folder.
Ideas for anything I think of that I might like to do in the future, but don’t want to do right now. This includes ideas for projects, books I might like to read, half-finished blog posts, and more. I like being able to capture my ideas and then get them out of the way, without committing to finishing them.
Some of these entries are very long-lived, and I’ve built them up over multiple years. I’ll capture the initial spark of something, then go back and add more details as I think of them. This accumulation of thoughts can be useful if I ever go back and actually do the thing.
Journal is for all of my journal entries, or anything I’ve done that’s bound to a particular time (DIY, craft projects, holiday plans, and so on).
I have per-year folders to keep it manageable, but there’s not a lot of consistency. I have my journal entries going back as far as 2009, and I’ve had quite a few different approaches to journaling in that time!
People is for per-person notes. These files are pretty small, and usually exist for easy linking rather than for in-depth notes. For example, it’s much easier to search for all journal entries linked to “Jane Smith” than it is to search for all instances of the word “Jane”.
Occasionally I do put bits of info in somebody’s note that isn’t bound to a specific journal entry – food allergies, the names of their kids, gift preferences, and so on.
Reference is for detailed notes on anything outside my vault – books I’ve read, videos I’ve watched, podcasts I’ve listened to. I have subfolders for the different types of media.
Snippets is for little bits of information I want to save. A cool tweet, an interesting word, some trivia fact.
At least to me, it’s always obvious which of these folders a note belongs in. This has been a constant feature of all my folder setups – I want to be able to file notes immediately, without thinking. I don’t want to be wondering where a particular note should be stored on a day-to-day basis.
I found a couple of useful, new-to-me AWS APIs for doing this.
You can find the account ID using the GetAccessKeyInfo API, for example:
$ aws sts get-access-key-info --access-key-id AKIA3B6K4VLAVGRVTXJA
{
"Account": "760097843905"
}
This should work when you authenticate as any IAM entity that has the sts:GetAccessKeyInfo permission, even if it’s in a different account to the key.
This is useful because the AWS estate at work is split over a dozen accounts, and some of the accounts have overlapping use cases. Even if you know roughly what a key is used for, it may not be obvious which account it’s defined in.
Once you know the account, you can find the username with the GetAccessKeyLastUsed API.
You’ll need to authenticate as an IAM entity with the iam:GetAccessKeyLastUsed permission in that particular account. For example:
$ aws iam get-access-key-last-used --access-key-id "AKIA3B6K4VLAVGRVTXJA"
{
"UserName": "example-user-2023-08-26",
"AccessKeyLastUsed": {
"LastUsedDate": "2023-08-24T15:58:00Z",
"ServiceName": "s3",
"Region": "eu-west-1"
}
}
Note that this works even if the access key has never actually been used, for example:
$ aws iam get-access-key-last-used --access-key-id "AKIA3B6K4VLAVGRVTXJA"
{
"UserName": "example-user-2023-08-26",
"AccessKeyLastUsed": {
"ServiceName": "N/A",
"Region": "N/A"
}
}
I took these APIs and wrapped them in a Python script that takes an access key as input, and prints a bunch of information about the key and the associated user. This is what it looks like:
$ python3 describe_iam_access_key.py AKIA3B6K4VLAVGRVTXJA
access key: AKIA3B6K4VLAVGRVTXJA
account: platform (760097843905)
username: example-user-2023-08-26
key created: 26 August 2023
status: Active
IAM permissions: example-user-2023-08-26.iam_permissions.txt
console: https://us-east-1.console.aws.amazon.com/iamv2/home#/users/details/example-user-2023-08-26
terraform: https://github.com/wellcomecollection/platform-infrastructure/tree/main/terraform/users
This script won’t work for everyone – in particular, going from an AWS account ID to an authenticated IAM session is probably going to look different for every organisation, but a lot of the bigger pieces are reusable.
Because the IAM permissions can be quite long and verbose, it saves them to a separate text file. It also includes links to the IAM console and the Terraform configuration (and it can find the latter because we tag the user with that link).
This script only works with long-term credentials created for an IAM user. It doesn’t work for temporary credentials using AWS STS – if you want to find out who owns the latter, you have to review your CloudTrail logs – but for my purposes, that’s not an issue.
When writing this script, one of the things I was pleasantly surprised by was the presence of AWS APIs that feel tailor-made for this use case. I was expecting I’d have to loop through every account, every user, every access key, and look for one that matched, which could have been pretty slow. Using these APIs was much simpler and quicker!
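If you want to build something similar, the two calls at the heart of the script map onto boto3 like this. This is a sketch rather than my full script, and get_session_for_account is a stand-in for however your organisation turns an account ID into credentials:

import boto3

def describe_access_key(access_key_id):
    # Any principal with sts:GetAccessKeyInfo can look up the account ID,
    # even if the key belongs to a different account.
    sts = boto3.client("sts")
    account_id = sts.get_access_key_info(AccessKeyId=access_key_id)["Account"]

    # get_session_for_account is hypothetical – how you get credentials
    # for another account depends on your setup.
    iam = get_session_for_account(account_id).client("iam")
    last_used = iam.get_access_key_last_used(AccessKeyId=access_key_id)

    return {
        "account": account_id,
        "username": last_used["UserName"],
        "last_used": last_used["AccessKeyLastUsed"],
    }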
The show is based on the real-life wartime deception operation of the same name. British spies took a corpse, dressed him as a pilot with a suitcase full of fake documents, and let him wash up on the coast of Spain – feeding the Nazis false information about the planned invasion of Sicily.
Making a musical about Second World War espionage is a delicate balancing act. You want to write jokes so the audience have fun, but you don’t want to trivialise the horror and suffering of war. I think the show handles this pretty well – there’s lots of laughter, but every so often the theatre falls silent as we’re reminded that “these men’s lives are not a joke”. (This line was added in the West End transfer, and it’s one of my favourites.)
The emotional high point of the show is Dear Bill, a solo in which a quiet secretary (Jak Malone, Christian Andrews) writes a fake love letter for the fictional Allied pilot. All the jokes fade away, and we realise the letter isn’t as fake as we thought. It’s a moving and powerful number, and I still cry every time I hear it.
The secretary is Hester Leggatt, a real woman who really did work on Operation Mincemeat. She was the head of the MI5 secretaries, and she wrote the letter that went in the pilot’s briefcase. Several of the lines in the song are taken directly from her words. (“Darling, why did we go and meet in the middle of a war, such a silly thing for anybody to do.”)
Beautiful as it is, I don’t think Dear Bill can stand on its own. It shows us the pain in Hester’s heart; her grief for a lost lover; a tragedy she has kept from her coworkers – but that can’t be her whole arc. Nothing can ever bring Tom back, but we want to see her begin to heal.
That healing starts to come in the second act song Useful. Hester is talking to Jean Leslie, another MI5 secretary (Claire-Marie Hall, Holly Sumpton), about how they can both be useful even if their contributions aren’t recognised by the men in charge. The song calls back to Dear Bill, echoing its lyrics as Jean tells Hester how much she’s grown under Hester’s supervision. It’s a beautiful moment of realisation, as Hester sees how much Jean looks up to her.
The song has a sad ring of truth – Hester wasn’t considered important by history, and SpitLip (the writers of the show) struggled to find much information about her. Unlike the men whose histories were extensively documented and recorded, Hester died almost without trace.
But after falling in love with her, the show’s “Mincefluencer” fandom have been righting this wrong – there’s been an extensive investigation to learn more about her, and to uncover her life story. I haven’t been part of this, but I’ve been watching the #find-hester channel in the Discord, and the amount they’ve found is quite astounding.
One of the perennial topics is the idea of getting Hester a blue plaque to commemorate her life. It’s not clear whether she would qualify, and so this hasn’t happened. (Yet.)
But the other night I was sitting in the theatre, and I heard Jak sing the line “perhaps just a small plaque, something tasteful and small”. I thought of all the fan discussions, and I decided to take matters into my own hands.
I had some spare blue cross-stitch fabric from my Saturn V blueprint, and I have plenty of white thread. That’s basically a plaque, right? I started sketching out a design – blue plaques usually have the name of the person, their dates, and a line or two about why they’re worth remembering.
I already had her name, which is “Leggatt” with an “a” – at some point a transcription error had corrupted this to “Leggett” in the popular record, making her harder to track down, an error the fandom’s research uncovered. That research also gave me her dates (1905–1995), and I used one of Jean’s lines as her description (“a timeless inspiration”).
For the letters themselves, I used two fonts. Her name is written in Needlework Gazette’s Fancy Alphabet (three strands of cross-stitch), and the rest of the text is in StitchPoint’s Monaco (two strands of back-stitch). I planned out the rough shape in a spreadsheet, made a few adjustments to the spacing between letters, and then I stitched it up.
I mounted the piece in a 6″ hoop which I’d painted white, and I gave it to Jak on stage door a few weeks ago. I futzed a bit of the glue, but otherwise I’m pretty pleased with the result. I’m told that it now hangs backstage at the Fortune, among other pieces of fan art.
It’s not a garden or a grand royal park, but I hope Hester would like it all the same.
Creating the resources with infrastructure-as-code isn’t too bad; the tricky part is updating them later. If you have a large or thorny codebase, it may not be obvious where a particular resource is defined – when you want to make changes, which file should you update?
If you’re in a hurry, it’s tempting to make a manual change now, and tell yourself you’ll come back to update the code “later” – when you have more time to find the file – but “later” rarely comes.
To make this easier, I recommend tagging all your resources with a link to the file where the resource is defined.
At work, we’re managing AWS resources defined in Terraform. The Terraform AWS Provider supports setting default tags – you write them once, and then they get applied to every resource that can be tagged. This is what that looks like for us:
provider "aws" {
default_tags {
tags = {
TerraformConfigurationURL = "https://github.com/wellcomecollection/aws-account-infrastructure/tree/main/accounts/storage"
}
}
}
The TerraformConfigurationURL tag points to a specific subfolder of a GitHub repository, which is where this particular set of Terraform configuration files is stored. If we’re looking at a resource in the AWS console, we can look for the TerraformConfigurationURL tag. If it’s there, we can follow the URL to find the Terraform where the resource is defined.
This is particularly simple with Terraform and AWS, because of the support for default tags. It might be more cumbersome if you’re using a different tool or managing different types of resources, but I still think it’s worth the benefits.
I originally created these tags to solve the “where is this thing defined” problem. I’ve found something in the AWS console, I want to make a change to it, and I want to find the Terraform definition so I can manage the change using infrastructure-as-code. It has been useful for that, but it’s also been helpful in other, unexpected ways.
On one occasion, they highlighted some resources that were defined in multiple places. We could see two Terraform configurations fighting over the value of the TerraformConfigurationURL tag – one would set it to A, the other would set it to B, the first would set it back to A, and so on. This conflict helped us find and delete the duplicate definition.
It’s also been a good way to find resources that aren’t managed with infrastructure-as-code. Because this tag should be applied to everything that’s managed with Terraform, anything without this tag was probably created some other way.
Some of our AWS infrastructure predates our use of Terraform, and we’ve been trying to bring it into Terraform – looking for resources that don’t have this tag is one way to do that. I also check for this tag as part of our security audits, looking for untagged IAM users that might have been created quietly for malicious purposes.
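If you want to automate that check, one approach is the Resource Groups Tagging API. Here's a rough Python sketch, with the caveat that it only sees resources the tagging API knows about, so it's a signal rather than a complete inventory:

import boto3


def find_untagged_resources(sess):
    """
    Generate the ARNs of resources that don't have a
    TerraformConfigurationURL tag -- a rough signal that they
    weren't created by our Terraform.
    """
    tagging = sess.client("resourcegroupstaggingapi")

    for page in tagging.get_paginator("get_resources").paginate():
        for resource in page["ResourceTagMappingList"]:
            tag_keys = {t["Key"] for t in resource.get("Tags", [])}

            if "TerraformConfigurationURL" not in tag_keys:
                yield resource["ResourceARN"]


for arn in find_untagged_resources(boto3.Session()):
    print(arn)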
As with any tagging strategy, it’s not perfect. Not every resource supports tagging, and we haven’t always remembered to create these tags – but it’s good enough. We have enough resources using this tag for it to be useful, and it’s been handy on plenty of occasions.
The first is a parsing function, which gets the individual log entries from a single log file. This takes a file-like object in binary mode, so it works the same whether I’m reading the file from a local disk or directly from S3. This is what it looks like:
import datetime
import urllib.parse


def parse_cloudfront_logs(log_file):
    """
    Parse the individual log entries in a CloudFront access log file.

    Here ``log_file`` should be a file-like object opened in binary mode.

    The format of these log files is described in the CloudFront docs:
    https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat
    """
    # The first line is a version header, e.g.
    #
    #     b'#Version: 1.0\n'
    #
    next(log_file)

    # The second line tells us what the fields are, e.g.
    #
    #     b'#Fields: date time x-edge-location …\n'
    #
    header = next(log_file)
    field_names = [
        name.decode("utf8")
        for name in header.replace(b"#Fields:", b"").split()
    ]

    # For each of the remaining lines in the file, the values will be
    # tab-separated, e.g.
    #
    #     b'2023-06-26 00:05:49 DUB2-C1 618 1.2.3.4 GET …'
    #
    # Split the line into individual values, then combine with the field
    # names to generate a series of dict objects, one per log entry.
    #
    # For an explanation of individual fields, see the CloudFront docs:
    # https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat
    numeric_fields = {
        "cs-bytes": int,
        "sc-bytes": int,
        "sc-content-len": int,
        "sc-status": int,
        "time-taken": float,
        "time-to-first-byte": float,
    }

    url_encoded_fields = {
        "cs-uri-stem",
        "cs-uri-query",
    }

    nullable_fields = {
        "cs(Cookie)",
        "cs(Referer)",
        "cs-uri-query",
        "fle-encrypted-fields",
        "fle-status",
        "sc-range-end",
        "sc-range-start",
        "sc-status",
        "ssl-cipher",
        "ssl-protocol",
        "x-forwarded-for",
    }

    for line in log_file:
        values = line.decode("utf8").strip().split("\t")
        log_data = dict(zip(field_names, values))

        # Undo any URL-encoding in a couple of fields
        for name in url_encoded_fields:
            log_data[name] = urllib.parse.unquote(log_data[name])

        # Empty values in certain fields (e.g. ``sc-range-start``) are
        # represented by a dash; replace them with a proper empty type.
        for name, value in log_data.items():
            if name in nullable_fields and value == "-":
                log_data[name] = None

        # Convert a couple of numeric fields into proper numeric types,
        # rather than strings.  (A TypeError means the value was already
        # replaced with None above.)
        for name, converter_function in numeric_fields.items():
            try:
                log_data[name] = converter_function(log_data[name])
            except (TypeError, ValueError):
                pass

        # Convert the date/time from strings to a proper datetime value.
        log_data["date"] = datetime.datetime.strptime(
            log_data.pop("date") + log_data.pop("time"),
            "%Y-%m-%d%H:%M:%S"
        )

        yield log_data
It generates a dictionary per log line. The named values make it easy for me to inspect and use the log entries in my analysis code. A couple of the values are converted to more meaningful types than strings – for example, the cs-bytes field is counting bytes, so it makes sense for it to be an int rather than a str.
This is how it gets used:
for log_entry in parse_cloudfront_logs(log_file):
    print(log_entry)
    # {'c-ip': '1.2.3.4', 'c-port': '9962', 'cs-cookie': None, ...}
And then I can use my regular Python tools for analysing iterable data. For example, if I wanted to count the most commonly-requested URIs in a log file:
import collections

tally = collections.Counter(
    log_entry["cs-uri-stem"]
    for log_entry in parse_cloudfront_logs(log_file)
)

from pprint import pprint
pprint(tally.most_common(10))
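Because numeric fields like sc-bytes come back as real integers, this sort of analysis doesn't need any extra conversion. Here's a small sketch, skipping any entries where the value didn't parse cleanly:

total_bytes_sent = sum(
    log_entry["sc-bytes"]
    for log_entry in parse_cloudfront_logs(log_file)
    # sc-bytes is normally an int by this point; skip the rare entry
    # where it couldn't be converted.
    if isinstance(log_entry["sc-bytes"], int)
)

print(f"Total bytes sent to viewers: {total_bytes_sent}")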
CloudFront writes new log files a couple of times an hour. Sometimes I want to look at a single log file if I’m debugging an event which occurred at a particular time, but other times I want to look at multiple files. For that, I have a couple of additional functions which handle combining log entries from different files.
If I’m going to be working offline or I know I’m going to be running lots of different bits of analysis on the same set of log files, sometimes I download the log files directly to my local disk. Then I use my function for walking a file tree to get a single iterator for all the entries in a folder full of log files:
import gzip


def get_cloudfront_logs_from_dir(root):
    """
    Given a folder that contains CloudFront access logs, generate all
    the CloudFront log entries from all the log files.
    """
    for path in get_file_paths_under(root, suffix='.gz'):
        with gzip.open(path) as log_file:
            yield from parse_cloudfront_logs(log_file)


for log_entry in get_cloudfront_logs_from_dir("cf"):
    print(log_entry)
CloudFront logs are stored in S3, so if I’m running inside AWS, it can be faster and easier to read log files directly out of S3. For this I have a function that lists all the S3 keys within a given prefix, then opens the individual objects and parses their log entries. This gives me a single iterator for all the log entries in a given S3 prefix:
import boto3
import gzip


def list_s3_objects(sess, **kwargs):
    """
    Given an S3 prefix, generate all the objects it contains.
    """
    s3 = sess.client("s3")

    for page in s3.get_paginator("list_objects_v2").paginate(**kwargs):
        yield from page.get("Contents", [])


def get_cloudfront_logs_from_s3(sess, *, Bucket, **kwargs):
    """
    Given an S3 prefix that contains CloudFront access logs, generate
    all the CloudFront log entries from all the log files.
    """
    s3 = sess.client("s3")

    for s3_obj in list_s3_objects(sess, Bucket=Bucket, **kwargs):
        Key = s3_obj["Key"]
        body = s3.get_object(Bucket=Bucket, Key=Key)["Body"]

        with gzip.open(body) as log_file:
            yield from parse_cloudfront_logs(log_file)


sess = boto3.Session()

for log_entry in get_cloudfront_logs_from_s3(
    sess,
    Bucket="wellcomecollection-api-cloudfront-logs",
    Prefix="api.wellcomecollection.org/",
):
    print(log_entry)
A couple of years ago I watched Ned Batchelder’s talk Loop Like A Native, which is an amazing talk that I’d recommend to Python programmers of any skill level.
One of the key ideas I took from that is the idea of creating abstractions around iteration: rather than creating heavily nested for loops, use functions to work at higher levels of abstraction.
That’s what I’m trying to do with these functions (and the one in my previous post) – to abstract away the exact mechanics of finding and parsing the log files, and just get a stream of log events I can use like any other Python iterator.
I think the benefits of this abstraction will become apparent in another post I’m hoping to write soon, where I’ll go through some of the analysis I’m actually doing with these logs.
The post will jump straight into a for loop of CloudFront log events, and it won’t have to worry about exactly where those events come from.
There’s os.walk in the standard library, but it’s not quite what I want, so I have a wrapper function I use instead:
import os


def get_file_paths_under(root=".", *, suffix=""):
    """
    Generates the absolute paths to every matching file under ``root``.
    """
    if not os.path.isdir(root):
        raise ValueError(f"Cannot find files under non-existent directory: {root!r}")

    for dirpath, _, filenames in os.walk(root):
        for f in filenames:
            p = os.path.join(dirpath, f)

            if os.path.isfile(p) and f.lower().endswith(suffix):
                yield p


for path in get_file_paths_under():
    ...
This function gives me a couple of things over just using os.walk: it gives me a single iterator I can loop over, and it constructs the absolute path for me. The ability to filter by suffix is useful too; it gives me a quick way to narrow my search. I use this when I’m working in a folder tree with lots of different file types.
for path in get_file_paths_under("notes"):
...
for txt_path in get_file_paths_under("notes", suffix=".txt"):
...
The body of the function isn’t especially complicated; the only vaguely interesting bit is the ValueError. It’s there to help catch silly mistakes when I accidentally pass the name of a file as the input – if you try to os.walk over a file, you get an empty list of results, which can be a bit confusing. (I’m sure there’s a good reason for that behaviour, even if I don’t know what it is.)
An experienced Python programmer could probably write this from scratch in a few minutes, but I use it so often that I like to have it saved.
TextExpander inserts this snippet whenever I type py!pth, including both the function and the for loop. I save a few minutes, and I get a version of the function that I know doesn’t have any weird edge cases or silly mistakes.
For readers: I want images to load quickly and look good. That means looking sharp on high-resolution displays, but without forcing everyone to download massive images.
For me: I want images to be easy to manage. It should be easy for me to add images to a post, and to customise them if I want to do something special.
One way to achieve this is with vector images – SVGs. Those are great for simple diagrams and drawings, and I use them plenty, but they don’t work for photographs and screenshots.
For bitmap images, I wrote a custom Jekyll plugin.
Usually my original image is a JPEG or a PNG. I save it in _images, and then I include my custom {% picture %} tag in the Markdown source:
{%
  picture
  filename="IMG_9016.jpg"
  width="750"
  class="photo"
  alt="A collection of hot pink flowers, nestled among some dark green leaves in a greenhouse."
%}
This expands into a larger chunk of HTML, which refers to several different variants of the image:
<picture>
  <source
    srcset="/images/2023/IMG_9016_1x.avif 750w,
            /images/2023/IMG_9016_2x.avif 1500w,
            /images/2023/IMG_9016_3x.avif 2250w"
    sizes="(max-width: 750px) 100vw, 750px"
    type="image/avif"
  >
  <source
    srcset="/images/2023/IMG_9016_1x.webp 750w,
            /images/2023/IMG_9016_2x.webp 1500w,
            /images/2023/IMG_9016_3x.webp 2250w"
    sizes="(max-width: 750px) 100vw, 750px"
    type="image/webp"
  >
  <source
    srcset="/images/2023/IMG_9016_1x.jpg 750w,
            /images/2023/IMG_9016_2x.jpg 1500w,
            /images/2023/IMG_9016_3x.jpg 2250w"
    sizes="(max-width: 750px) 100vw, 750px"
    type="image/jpeg"
  >
  <img
    src="/images/2023/IMG_9016_1x.jpg"
    width="750"
    style="aspect-ratio: 3 / 4;"
    class="photo"
    alt="A collection of hot pink flowers, nestled among some dark green leaves in a greenhouse."
  >
</picture>
Let’s unpack what’s going on.
My _images directory is organised into per-year folders:
.
└─ _images/
   ├─ 2022/
   │  ├─ acme_corporation.jpg
   │  ├─ alarm_console.png
   │  ├─ alfred_search.png
   │  └─ ...164 other files
   └─ 2023/
      ├─ amazon-cheetah-listing.jpg
      ├─ avif_image_broken.png
      ├─ bedroom_layout.png
      └─ ...53 other files
Organising files per-year matches the URL structure of individual posts (/:year/:slug), and helps keep the folder just a bit more manageable. I have ~1300 images, and throwing them all in a single folder would get unwieldy. In this example, the original file is _images/2023/IMG_9016.jpg.
How does the plugin find an image in this directory structure?
I pass a filename attribute to the {% picture %} tag, which tells it the name of the image file, but notice that I don’t pass a year anywhere.
That’s because my plugin can work it out automatically – when Jekyll renders a custom liquid tag on a page, it passes the page as a context variable. That means each instance of my picture tag knows which article it’s in, and it can get the article’s publication date. Then it can construct the path to the original image.
module Jekyll
  class PictureTag < Liquid::Tag
    def render(context)
      article = context.registers[:page]

      date = article['date']
      year = date.year

      path = "_images/#{year}/#{filename}"

      …
I use this technique in a couple of plugins – it allows me to organise my files without too much hassle when using them.
I pass a width attribute to my {% picture %} tag – this tells the plugin how wide the image will appear on the page.
This mimics the HTML attribute of the same name.
I get the dimensions of the original image using the rszr gem:
require 'rszr'
image = Rszr::Image.load(source_path)
puts image.width
Then I use ImageMagick to create multiple derivative images, at different widths for different screen pixel densities – 1x, 2x, or 3x. I don’t create derivatives that are wider than the original image; that would be wasteful.
widths_to_create =
  (1..3)
    .map { |pixel_density| pixel_density * visible_width }
    .filter { |w| w <= image.width }
For example, if the original file is 250px wide, and I want to show the image at 100px wide, then the plugin would create a 1x image (100px) and a 2x image (200px) but not a 3x image (because 300px is wider than the original image).
This resizing happens as part of the Jekyll build process. An alternative would be to use a proper image CDN and create these derivative images at request time (e.g. imgix or Netlify Large Media), but I’m already doing custom steps in my Jekyll build and it was easier to extend that mechanism than add a new service. It also makes it easier to work with images in a local Jekyll server.
To tell the browser about these different sizes, I use the HTML picture and source tags, the latter with an srcset attribute:
<picture>
  …
  <source
    srcset="/images/2023/IMG_9016_1x.jpg 750w,
            /images/2023/IMG_9016_2x.jpg 1500w,
            /images/2023/IMG_9016_3x.jpg 2250w"
    sizes="(max-width: 750px) 100vw, 750px"
    type="image/jpeg"
  >
  <img
    src="/images/2023/IMG_9016_1x.jpg"
    width="750"
    …
  >
</picture>
In this example, the srcset attribute tells the browser that there are three different widths of image available, and where to find them. The sizes attribute tells it which size to use at different screen widths. If the screen is less than 750px wide, then the image fills the entire screen (100vw), otherwise the image is 750px wide.
That’s not always exactly right – sometimes margins mean it’s slightly wrong – but it’s close enough.
This is enough information for the browser to decide the best size to load. It knows your screen pixel density and the width of the window, so it can choose an image which (1) will look sharp and crisp on your display and (2) doesn’t include lots of unnecessary pixels.
If your browser doesn’t support <picture> and <source>, I include the 1x size in the <img> tag.
I figure that if your browser is that old, it’s unlikely you’re using a high pixel density display.
JPEG and PNG are fine, but they’re a bit long in the tooth – there are newer image formats that look the same but with smaller files. WebP and AVIF are modern image formats that are much smaller, which means faster loading images for you and a cheaper bandwidth bill for me.
Alongside the different sizes of image, I’m using ImageMagick to create variants in WebP and AVIF.
These get presented as alternative <source> entries in the <picture> tag, for example:
<picture>
  <source
    srcset="/images/2023/IMG_9016_1x.avif 750w,
            /images/2023/IMG_9016_2x.avif 1500w,
            /images/2023/IMG_9016_3x.avif 2250w"
    sizes="(max-width: 750px) 100vw, 750px"
    type="image/avif"
  >
  <source
    …
    type="image/webp"
  >
  <source
    …
    type="image/jpeg"
  >
  …
</picture>
Not every browser supports WebP and AVIF, which is why I’m providing all three variants. Your browser knows which formats it supports, and will choose appropriately.
The compression is pretty remarkable: the WebP images are about half the size of the originals, but the AVIF images are one sixth! When I first enabled AVIF support, I thought something was broken – the files were so small, it looked wrong to me.
(It turns out something was broken, but it was nothing to do with file sizes.)
Because I have the image dimensions from rszr, I can calculate the aspect ratio of the image and insert it as a property on the <img> tag:
<img
  src="/images/2023/IMG_9016_1x.jpg"
  width="750"
  style="aspect-ratio: 3 / 4;"
  …
>
Combined with the width, this allows a browser to completely calculate the area an image will take up on the page – before it loads the image.
This means it can lay out the page immediately, leave the right amount of space for the image, and it won’t have to rearrange the page later.
The fancy term for this is “Cumulative Layout Shift”, and too much of it can be distracting – setting these two attributes reduces it to zero.
Aside from the filename attribute, all the attributes on the {% picture %} tag get passed directly to the underlying <img> tag. I use this to include things like alt text, CSS classes and inline styles. It looks just like the equivalent HTML would.
This gives me a bunch of flexibility for tweaking the behaviour of images on a per-post basis. I get the benefits of the different sizes and image formats, and it all looks like familiar HTML.
The plugin is doing a bit of work to parse the attributes, and combine them with any attributes that it’s adding (for example, appending the aspect-ratio property to any inline styles), but this is largely invisible when I’m just writing a post.
One of the attributes I use most often is loading="lazy", which gets me browser-native lazy loading of images. This improves performance on pages with lots of images, and it’s easy for browsers to work out which images to load – they know exactly where each image will go thanks to the width and aspect-ratio properties.
When the web was young, images were much simpler. You’d upload your JPEG file to your web server, add an <IMG> tag to your HTML page, and you were done.
That still works (including the uppercase HTML tags), but there’s a lot more we can do now.
Building this plugin has been one of the more complex bits of front-end web development I’ve done for this site.
Creating the various images with ImageMagick was fairly straightforward, but setting up the srcset and sizes attributes so browsers would pick the right image was much harder.
I think it behaves correctly now, and adding images to new posts is pretty seamless – but it took a while to get there.
This was a great way for me to learn how images work in the modern web, but it’s hard to recommend my “write it from scratch” approach. There are lots of existing libraries and tools that make it easy for you to use images on your website, without all the work I had to do first.
I’m the only person who works on this website, and I’m doing it for fun. I can make very different choices than if I was working on a commercial site managed by a large team. I enjoyed writing this plugin, and I’m pleased with my snazzy new images, and for me that’s all that matters.
About three weeks ago, a new play premiered at Riverside Studios in London: Spy for Spy. I enjoyed it so much that I saw it four times – it was funny, clever, and it made my heart ache. It’s a romantic comedy with an unusual narrative twist – it’s told in the wrong order.
Unfortunately it only had a limited run, and it’s already closed. I don’t know if or when it will be staged again, but I wanted to capture a bit of why I enjoyed it so much.
“💞 Love 💞 Thank you for experiencing Molly & Sarah’s journey with us. Our limited run at @RiversideLondon ends tomorrow and we've had a blast. #SpyForSpyPlay”
The play is a two-hander starring two women – a bit of a rarity in theatre – and it’s a romance story. I’m not sure I’ve ever seen a sapphic romance on stage, and that itself was quite refreshing.
The women are Sarah, an anxious and uptight lawyer (Amy Lennox), and Molly, a free-thinking actor (Olive Gray). They’re quite different people, and not the most obvious match, but I was absolutely sold. The dialogue feels genuine and warm, and both of them did a great job of capturing their characters. There was great chemistry, and I wanted to see them be happy together.
The story hits the typical romcom beats: the meet-cute, a first date of sorts, some conflict and reconciliation. But unlike most romcoms, they don’t always come in the same order: the play is split into six scenes, and every night the audience were invited to pick a random order. There were six heart-shaped balloons: you’d pick a balloon, and the attached block would tell you the scene. Nobody knew how the play would go until half an hour before curtain up. One night it started at a wedding in a yurt; another night it opened with a breakup in their living room.
There are other plays that experiment with non-linear storytelling – Nick Payne’s Constellations springs to mind – but this is far more ambitious.
Kieron Barry has written a masterpiece of a script – it’s such a dexterous piece of writing that can be told in different orders, and still make sense. There are so many through lines, subtle callbacks, and meta self-references that fit together. I took a few notes, and a week later I’m still realising clever elements of the script. (The watch! The wines! The expressions of gratitude! Email me if you want to read my notes and see what you missed.)
And the script is good even if you ignore the random order. The dialogue feels genuine, like two people would actually talk, and it’s laugh-a-minute funny – although balanced with sombre moments. (I don’t want to give too much away, but “You’re the only person I know who can lie down bolt upright” was both hilarious and felt like a bit of a personal attack.)
One of my favourite scenes is in Carmel, when Molly and Sarah talk about having sex. So often women’s sexuality is treated as shameful or titillating, and this scene is neither of those. It’s a very matter-of-fact scene in which two women in a loving relationship talk about their sexual desires and needs. It felt totally normal, which is how it should be.
The script is backed up by a great production. Amy and Olive did a great job of bringing their characters to life, and I was hooked within seconds. One thing they did particularly well was switching demeanours – one scene they’re bickering partners, another they’re complete strangers. Subtle behaviours like the way they sit, stand, or look at each other all add up. Acting the scenes in the wrong order seems like quite a challenge, but I think they pulled it off.
There are other ways the show marks the different scenes. Small changes of costume, lighting, props. Several times, I had a sense of the new scene before a word was said. (Is this a sad scene? A happy one? Are they strangers or lovers?)
If I have a criticism, it’s that some of the transitions felt slow and unpolished. The scenes themselves are carefully directed and coordinated, but that polish dropped away in the moments in between. The lighting and sound gave the transitions a distinct look and feel, and it’s a shame that wasn’t matched by the on-stage movements.
But overall I really enjoyed it, and I’m glad I could see it as many times as I did. I’ve seen several plays that benefit from multiple viewings – once to experience it fresh, once to see how it builds to the ending. Spy for Spy definitely benefits from multiple viewings, and it can feel entirely different when the order changes. One show I went to felt quite light-hearted and happy; another ended on a dark scene that changed the whole tone of the play.
This isn’t a fluffy romance with perfect people. Sarah and Molly are messy characters with insecurities and flaws, and that feels like a key through line of the play.
Some aspects of their personality are constant, regardless of the order of events or the challenges they face. Maybe that’s what the play means – love is about finding somebody who sees those imperfections, who can see past your facade, and will make the time for you anyway. They both talk about wanting to change for one another, but maybe love is about finding somebody who doesn’t need you to change.
I went to the show based on a single line of description – “a romantic comedy told in the wrong order” – and it delivered on that promise. It’s a clever piece of writing that demands you pay attention, and if you do, that attention is richly rewarded.
On a more personal note, I got to meet several of the people involved with the production over my various trips, and they were all incredibly nice. Lucy, Kieron, Amy, Olive, Nell, Tim, and others – everyone was so willing to chat after the show. The play fell in the middle of several stressful weeks at work, and it was nice to have something to offset that.
I’m so glad I got a chance to see this play. I don’t know if or when it will run again, but if it does, I’d really recommend it.
I’d often thought about turning them off overnight, to save a bit of money, but I never quite got around to it. I always imagined it would involve a bunch of moving pieces, possibly some Lambda functions we’d have to deploy and manage, and it all felt a bit too much effort. Our bill isn’t in a precarious place, and premature cost optimisation takes away from better ways to use our time.
Then I read an article by Victor Ronin about using Terraform to create schedules in EventBridge, which is much simpler than what I was expecting. I tried rolling that pattern out to our ECS services, and it worked very well.
The core logic sits in a pair of EventBridge Schedules, created with the aws_scheduler_schedule resource. One schedule turns a service off in the evening; another turns it back on the next morning.
resource "aws_scheduler_schedule" "turn_off_in_the_evening" {
name = "${var.service_name}-turn_off_in_the_evening"
# This cron expression will run at 7pm UTC on weekdays.
schedule_expression = "cron(0 19 ? * MON,TUE,WED,THUR,FRI *)"
target {
arn = "arn:aws:scheduler:::aws-sdk:ecs:updateService"
role_arn = aws_iam_role.scheduler.arn
input = jsonencode({
Cluster = var.cluster
Service = var.service_name
DesiredCount = 0
})
}
flexible_time_window {
mode = "OFF"
}
}
resource "aws_scheduler_schedule" "turn_on_in_the_morning" {
name = "${var.service_name}-turn_on_in_the_morning"
# This cron expression will run at 7am UTC on weekdays.
schedule_expression = "cron(0 7 ? * MON,TUE,WED,THUR,FRI *)"
target {
arn = "arn:aws:scheduler:::aws-sdk:ecs:updateService"
role_arn = aws_iam_role.scheduler.arn
input = jsonencode({
Cluster = var.cluster
Service = var.service_name
DesiredCount = var.desired_task_count
})
}
flexible_time_window {
mode = "OFF"
}
}
variable "cluster" { type = string }
variable "service_name" { type = string }
variable "desired_task_count" { type = number }
They’re triggered on a schedule, according to the cron expression. UK office hours are roughly 9 to 5, and the schedules are picked to include these hours plus a bit of “slop”. This is to account for people who work slightly earlier, slightly later, or when the UK timezone doesn’t match UTC.
I do a lot of this sort of “slop” in scheduling code. I’ll accept a bit of inefficiency or redundancy if it means I can get simpler code. I could tighten these schedules so they follow UK office hours more closely, but it would add a lot of complexity for marginal gains. It’s not worth it.
The most interesting bit to me is how the schedule updates the ECS service – it calls the UpdateService API with a payload that I provide. In this case I’m just changing the DesiredCount value, but it seems like this could be used to call other AWS APIs. That feels like it has a lot of potential elsewhere.
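To make that concrete, the schedule is effectively making this UpdateService call on my behalf. Here's a rough boto3 equivalent, with illustrative cluster and service names:

import boto3

ecs = boto3.client("ecs")

# The same call the EventBridge Schedule makes via its universal target,
# just invoked directly: scale the service down to zero tasks.
ecs.update_service(
    cluster="my-cluster",    # the Cluster value from the schedule payload
    service="staging-site",  # the Service value
    desiredCount=0,          # the DesiredCount value
)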
We’ve already got a variant of these schedules that turns an EC2 instance off/on outside our working hours, and I imagine this won’t be the last time I play with EventBridge Schedules.
Alongside the two schedules, you need an IAM role that allows EventBridge to modify your ECS services when it runs. This is how our IAM role is defined:
resource "aws_iam_role" "scheduler" {
name = "${var.service_name}-office-hours-scaling"
assume_role_policy = data.aws_iam_policy_document.assume_role.json
}
data "aws_iam_policy_document" "assume_role" {
statement {
effect = "Allow"
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["scheduler.amazonaws.com"]
}
}
}
data "aws_iam_policy_document" "allow_update_service" {
statement {
actions = ["ecs:UpdateService"]
resources = [var.service_arn]
}
}
resource "aws_iam_role_policy" "allow_update_service" {
role = aws_iam_role.scheduler.name
policy = data.aws_iam_policy_document.allow_update_service.json
}
variable "service_arn" { type = string }
variable "service_name" { type = string }
This is pretty standard IAM – create the role, and allow the EventBridge Scheduler service to assume it. Then we create an IAM policy document that allows calling the UpdateService API for the service we’re turning off/on, and we attach that policy document to the role.
This isn’t a lot of Terraform, but it would be annoying to copy/paste it for every service we have. To save ourselves the hassle, we’ve included it in our standard ECS service module, and services can opt in to this behaviour with a single flag:
module "service" {
source = "git::github.com/wellcomecollection/terraform-aws-ecs-service.git//modules/service?ref=v3.15.3"
name = "staging-site"
…
turn_off_outside_office_hours = true
}
Partly this is for readability, but mostly it’s to make this behaviour quick and easy to enable – which means we’re more likely to actually do it.
We’ve already rolled this out to a dozen existing services, and there’s a nice dent in last month’s EC2 bill. As we build out new services, I expect this behaviour to spread ever further.