Four ways to underline text in LaTeX

Because I’m old-fashioned, I still write printed documents in LaTeX, and I still think hyperlinks should be underlined. In general, I’m glad that underlines as a form of emphasis have gone away (boldface or italics are much nicer) — but I have yet to be convinced to drop underlines on hyperlinks.

Sometimes I have to write printed documents that contain hyperlinks, which raises the question: how do you write underlines in LaTeX? Finding an underline I like has proven surprisingly hard, so in this post I’ll show you the different ways I’ve tried to underline text.

Using pip-tools to manage my Python dependencies

At last year’s PyCon UK, one of my favourite talks was Aaron Bassett’s session on Python dependency management. He showed us a package called pip-tools, and I’ve been using it ever since.

pip-tools is used to manage your pip dependencies. It allows you to write a top-level summary of the packages you need, for example:

$ cat requirements.in
pytest >= 1.4
requests

Here I want a version of pytest that’s at least 1.4, and any version of requests.

Then I run pip-compile, which turns that into a full requirements.txt:

$ pip-compile
$ cat requirements.txt
certifi==2017.7.27.1      # via requests
chardet==3.0.4            # via requests
idna==2.6                 # via requests
py==1.4.34                # via pytest
urllib3==1.22             # via requests

I can install these dependencies with pip install -r requirements.txt.

The generated file is pinned: every package has a fixed version. This means that I get the same versions whenever I run pip install, no matter what the new version is. If you don’t pin your dependencies, your package manager may silently install a new version when it’s released – and that’s an easy way for bugs to sneak in.

Instead, check both files into version control, so you can see exactly when a dependency version was changed. This makes it easier to see if a version bump introduced a bug.

There are also comments to explain why you need a particular package: for example, I’m installing certifi because it’s required by requests.
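To make the "pinned, with a reason" format concrete, here is a minimal Python sketch that reads lines in the style of the generated file above. It is not part of pip-tools: the parse_pinned function and the PINNED_LINE pattern are names I made up for illustration, and the pattern only handles the simple "name==version  # via parent" shape shown here.

```python
import re

# Matches lines like "certifi==2017.7.27.1      # via requests".
# This pattern is an assumption based on the sample output, not a full
# parser for every requirements.txt feature.
PINNED_LINE = re.compile(
    r"^(?P<name>[A-Za-z0-9._-]+)==(?P<version>\S+)"
    r"(?:\s+#\s*via\s+(?P<via>.+))?$"
)


def parse_pinned(text):
    """Return a list of (name, version, via) tuples, one per pinned line."""
    pins = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        match = PINNED_LINE.match(line)
        if match is None:
            # Anything that is not an exact "==" pin is rejected.
            raise ValueError(f"not an exact pin: {line!r}")
        via = match.group("via")
        pins.append(
            (match.group("name"), match.group("version"),
             via.strip() if via else None)
        )
    return pins


sample = """\
certifi==2017.7.27.1      # via requests
py==1.4.34                # via pytest
"""
pins = parse_pinned(sample)
```

Feeding it an unpinned line such as "pytest >= 1.4" raises a ValueError, which is a cheap way to check that a file really is fully pinned.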

I’ve been using pip-tools since Aaron’s recommendation, and it’s been really nice. It’s not had an earth-shattering impact on my workflow, but it shaves off a bunch of rough edges. If you do any work with Python, I recommend giving it a look.

For more about pip-tools itself, I recommend Better Package Management by Vincent Driessen, one of the pip-tools authors. This human-readable/pinned-package distinction is coming to vanilla pip in the form of Pipfile, but that was in its infancy last September. pip-tools has been stable for over two years.

Recently, I’ve been trying to push more of my tools inside Docker. Every tool I run in Docker is one less tool I have to install locally, so I can get up-and-running that much faster. Handily, there’s already a Docker image for running pip-tools.

You run it as follows:

$ docker run --volume /path/to/repo:/src --rm micktwomey/pip-tools

It looks for a .in file in /src, so we mount the repo in that directory. This gives the container the ability to read the file, and to write a requirements.txt back to the host system. I also add the --rm flag, which cleans up the container after it’s finished running.

If you already have Docker, this is a nice way to use pip-tools without installing it locally.

Alongside Docker, I’ve been defining more of my build processes in Makefiles. Having Docker commands is useful, but I don’t want to have to remember all the flags every time I use them. Writing a Makefile gives me shortcuts for common tasks.

This is the Make task I have for updating a requirements.txt:

requirements.txt: requirements.in
	docker run --volume $(CURDIR):/src --rm micktwomey/pip-tools
	touch requirements.txt

To use it, run make requirements.txt.

The first line specifies the Make target (requirements.txt), and tells Make that it depends on the .in file. So when the Make task is invoked, it checks that the .in file exists, and then whether the .in file was updated more recently than the .txt file. If yes, the .txt file needs rebuilding. If no, we’re up-to-date and there’s nothing to do.

The second line runs the Docker command explained above, using the Make variable $(CURDIR) to get the current directory.

Finally, touch ensures that the last modified time of requirements.txt is always updated. pip-tools will only change the modification time if there are changes to the dependency pins — I change it manually so that make knows the task has run, and the “should I run this task” logic explained above doesn’t spin endlessly.
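The "should I run this task" check can be sketched in Python. This is only an illustration of the modification-time comparison described above, not how Make is implemented. The needs_rebuild function is a hypothetical helper, and requirements.in is my assumed name for the .in file.

```python
import os
import tempfile


def needs_rebuild(target, dependency):
    """Mimic Make's check: rebuild if the target is missing or older."""
    if not os.path.exists(dependency):
        raise FileNotFoundError(dependency)
    if not os.path.exists(target):
        return True
    return os.path.getmtime(dependency) > os.path.getmtime(target)


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "requirements.in")   # assumed file name
    target = os.path.join(tmp, "requirements.txt")
    for path in (source, target):
        with open(path, "w") as f:
            f.write("")

    # Pretend the .in file was edited after the .txt file was built.
    os.utime(target, (1000, 1000))
    os.utime(source, (2000, 2000))
    stale_before = needs_rebuild(target, source)

    # "touch" the target, as the Make task does, so it is now newer.
    os.utime(target, (3000, 3000))
    stale_after = needs_rebuild(target, source)
```

Without the final touch, a run that produced no changes to the pins would leave the target looking stale, and the task would be re-run every time.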

Once I have this Make task, I can invoke it from other tasks — for example, build tasks that install from requirements.txt — and so it gets run when required, but without an explicit action from me. It’s just another step that happens transparently when I run make build.

If you’d like to see an example of this in use, check out the Makefile changes in the same patch as this post.

NPR: Graduation Readers At MIT Go The Extra Mile To Pronounce Names Correctly

I really enjoyed this NPR story about how MIT try to get accurate pronunciations of names at commencement. They’re really trying to make sure they get every name right:

When we feel that we just aren’t comfortable with the pronunciation of the name, we will send an email to 60, 40, 75 students, sort of depends on the year, saying I would like to pronounce your name correctly at commencement. Could you please call my voicemail and leave your name slowly twice? And, you know, thank you very much.

The piece is only short (a little under five minutes) and worth listening to in full. The transcript struggles to convey the nuances of pronunciation, which are hard to capture in written text!

My name is on the simpler side, so it’s rare for my name to be mangled — but I still notice and appreciate when somebody is trying to get the names right.

On the flipside, something I find particularly tiresome is people who don’t just mispronounce names, but try to joke about doing so. “Oh dear, isn’t Welsh so hard to pronounce, woe is me.” An inability to even try to pronounce certain names isn’t funny or clever, and laughing about it just draws attention to your laziness. Growing up near Wales, I heard a lot of jokes from English people about the unpronounceability of Welsh names – this from a country that has names like Gloucester and Worcestershire.

If you’re introducing somebody at an event, it’s polite to ask how their name is pronounced. You don’t have to be perfect, but you should at least try.

What happens when you overengineer a static site?

When I started this site, I was using Jekyll (well, Octopress). Frequent issues with maintaining a working Ruby installation caused me to look elsewhere for a Python solution. For a while, I used Pelican, but licensing issues and a sense that the project had been abandoned by maintainers led me to write my own static site generator (SSG). Recently, I’ve come full circle and returned to Jekyll.

Writing my own SSG was a fun exercise, but a bit of a time sink. So I started thinking about switching back to something I didn’t have to maintain myself – and the two popular choices seem to be Jekyll or Hugo.

Today, I’m very happy using Docker, which can handle the problems of keeping a working Ruby install. Jekyll has an edge on longevity, and the plug-in architecture offers lots of room for customisation. As far as I know, Hugo doesn’t have plug-in support – I’ve built up some pretty esoteric features on this site, so customisation is a must-have.

With the newest rewrite, I wanted to treat this like a proper software project. Builds in Docker, continuous testing in CI, and automated deployments. It took more work, but it means I don’t just have a pile of hacked-together scripts.

If you’re interested, I’ve put my entire Jekyll setup on GitHub. It has all of my Jekyll config, the Docker containers I use to build the site, and a bunch of interesting plug-ins – check out the README for more details. Over time, I might write up some of the interesting bits as standalone blog posts.

Hopefully this is the last rewrite I’ll be doing in a while – so I wanted to do this one properly.

Some useful Git commands for CI

I spend a lot of time writing build scripts for interacting with Git repos. But Git’s documentation is notoriously opaque, and a lot of my Git knowledge comes from word-of-mouth rather than reading the docs. In that spirit of sharing, these are a few of the Git commands I find useful when writing build scripts.

Ode to docopt

Every week, we have an hour of developer learning at work – maybe a talk, a workshop, or some other session about a topic of interest to developers. Last week, we did a round of lightning talks. We’re quite a mixed group of developers – in background, technical stacks, and what we actually work on – so coming up with a topic that’s useful to all can be tricky.

For my slot, I decided to wax lyrical about the docopt library. Once upon a time, I was sceptical, but it’s become my go-to library for any sort of command-line interface. Rather than fiddling with argparse and the like, I just write a docopt help string, and the hard work is done for me. I’ve used it in multiple languages, and thought it might be handy for other devs at work. Ergo, this talk.

You can download my slides as a PDF, or read the notes below.

A Python module for lazy reading of file objects

At work, we often pass data around via large files kept in Amazon S3 – XML exports from legacy applications, large log files, JSON dumps of Elasticsearch indexes – that sort of thing. The services that deal with these files run in Docker containers on AWS, and they have limited memory and local storage.

Downloading large files into memory is slow, expensive, and often unnecessary. Many of these files contain a list of records, which we want to process one-at-a-time. We only need to hold a single record in memory at a time, not the whole file.

Python can do efficient line-by-line processing of local files. The following code only reads a line at a time:

with open('very-long-file.txt') as f:
    for line in f:
        print(line, end='')  # placeholder: replace with your own per-line processing

This is more efficient, and usually results in faster code – but you can only do this for local files, and the only delimiter is a newline. You need a different wrapper if you want to do this for files in S3, or use a different delimiter – and that’s what this module does. It goes like this:

import boto3
from lazyreader import lazyread

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket='example-bucket', Key='records.txt')
body = s3_object['Body']

for doc in lazyread(body, delimiter=b';'):
    print(doc)  # placeholder: handle each record here

The code isn’t especially complicated, just a little fiddly, but I think it’s a useful standalone component.
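To give a flavour of the fiddly part, here is a rough sketch of how delimiter-based lazy reading might work. It is not the lazyreader source: lazy_split, its chunk_size parameter, and the choice to keep the delimiter on each yielded record are all assumptions made for this example. It should work with any binary file-like object that has a read(size) method.

```python
import io


def lazy_split(fileobj, delimiter=b"\n", chunk_size=1024):
    """Yield delimiter-terminated records from a binary file-like object.

    Only one chunk, plus any unfinished record, is held in memory at once.
    """
    buffer = b""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break  # end of stream
        buffer += chunk
        # Emit every complete record currently sitting in the buffer.
        while True:
            index = buffer.find(delimiter)
            if index == -1:
                break
            end = index + len(delimiter)
            yield buffer[:end]
            buffer = buffer[end:]
    if buffer:
        yield buffer  # trailing record with no final delimiter


# A tiny chunk size forces records to span several reads.
records = list(
    lazy_split(io.BytesIO(b"a;bb;ccc"), delimiter=b";", chunk_size=2)
)
```

Because the leftover bytes are carried over between reads, a record (or a multi-byte delimiter) that straddles two chunks is still returned in one piece.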

I was mildly surprised that something like this doesn’t already exist, or if it does, I couldn’t find the right search terms! If you know an existing module that does this, please let me know.

You can install lazyreader from PyPI (pip install lazyreader), or see the README for more details.

Backing up full-page archives from Pinboard

Several years ago, I blogged about a Python script I’d written to back up my Pinboard bookmarks. I’ve been using Pinboard since then – a brief homebrew solution aside – and recently, I wanted to turn my attention to my archival account.

As well as storing a list of pages you’ve saved, you can pay a small annual fee and Pinboard will save a complete copy of the pages you bookmark. This helps keep your bookmarks useful in the face of link rot – when pages change, or even disappear completely. Pages break surprisingly quickly – Maciej did some informal research on link rot a few years back, and even that post itself has now broken. Around 15% of my bookmarks no longer go anywhere useful.

Once a page is archived, it appears with a small grey checkmark next to the link – so you can easily view the archived page if the original goes away.

I’ve paid for the extra archiving for years, and having these complete backups of my bookmarks is great – but they only live on Pinboard. What if Pinboard gets acquired, or Maciej wipes the servers and runs off to Mexico? I really want a local backup of all these full-page copies.

Backing up content from SoundCloud

In the last week or so, SoundCloud have been looking pretty fragile. They closed two of their offices (firing about 40% of their staff), and given they’ve been in financial difficulties for several years, you might wonder if SoundCloud is long for this world.

If you’re a SoundCloud user, you might want to back up anything you’ve already uploaded.

I don’t have anything on SoundCloud myself, but there’s quite a lot of content from Wellcome Collection. When this news was shared in our internal Slack, I decided to pre-emptively download everything on the Collection’s account – back up now, ask questions later. (I was glad to hear that we already had local copies of all the important data. Still, better safe than sorry.)

It sounds like the Internet Archive are going after SoundCloud, but sucking down 2.5PB of data is a tall order, even for them. I thought I’d write up what I did for the Wellcome account, in case anybody else wants an extra copy of their files.

I started by installing youtube-dl, a command-line tool for backing up video and audio from a whole bunch of sites – including SoundCloud. It’s an incredibly handy program to keep around. You can install it in many ways, but I prefer to use pip:

$ pip install youtube-dl

Then, downloading the Collection’s SoundCloud account was a single command:

$ youtube-dl --write-thumbnail --write-info-json ""

youtube-dl is smart enough to recognise this as a user page, and downloads all the tracks uploaded by Wellcome. Replace wellcomecollection with your username to get your own content. For each track, it downloads an audio file, a thumbnail, and a blob of metadata.

For good measure, I then copied everything to an Amazon S3 bucket.

This probably won’t work as well for content that’s private or copyright-restricted, but if all you have is a public account, you might want to consider doing this sooner, not later.

Listing keys in an S3 bucket with Python

A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.

The AWS APIs (via boto3) do provide a way to get this information, but API calls are paginated and don’t expose key names directly. It’s a bit fiddly, and I don’t generally care about the details of the AWS APIs when using this list – so I wrote a wrapper function to do it for me. All the messiness of dealing with the S3 API is hidden in general use.

Since this function has been useful in lots of places, I thought it would be worth writing it up properly.
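As a rough idea of what such a wrapper can look like, here is a minimal sketch built on the list_objects_v2 call, following continuation tokens from page to page. It is not necessarily the function from the full post: iter_s3_keys is a name chosen for this example, and FakeS3Client is a made-up stand-in so the sketch runs without AWS credentials.

```python
def iter_s3_keys(client, bucket, prefix=""):
    """Yield every key in `bucket` that starts with `prefix`.

    `client` is expected to behave like boto3.client('s3').
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no matching objects.
        for obj in response.get("Contents", []):
            yield obj["Key"]
        token = response.get("NextContinuationToken")
        if token is None:
            break  # no more pages
        kwargs["ContinuationToken"] = token


class FakeS3Client:
    """Hypothetical stand-in that serves pre-baked pages of keys."""

    def __init__(self, pages):
        self.pages = pages

    def list_objects_v2(self, **kwargs):
        # The fake ignores Bucket and Prefix; it only tracks the page index.
        index = int(kwargs.get("ContinuationToken", 0))
        page = {"Contents": [{"Key": key} for key in self.pages[index]]}
        if index + 1 < len(self.pages):
            page["NextContinuationToken"] = str(index + 1)
        return page


fake = FakeS3Client([["a.txt", "b.txt"], ["c.txt"]])
keys = list(iter_s3_keys(fake, bucket="example-bucket"))
```

With real credentials, passing boto3.client('s3') in place of the fake should give the same flat stream of key names, with the pagination hidden from the caller.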
