Some useful Git commands for CI

I spend a lot of time writing build scripts for interacting with Git repos. But Git’s documentation is notoriously opaque, and a lot of my Git knowledge comes from word-of-mouth rather than reading the docs. In that spirit of sharing, these are a few of the Git commands I find useful when writing build scripts.

Read more →


Ode to docopt

Every week, we have an hour of developer learning at work – maybe a talk, a workshop, or some other session about a topic of interest to developers. Last week, we did a round of lightning talks. We’re quite a mixed group of developers – in background, technical stacks, and what we actually work on – so coming up with a topic that’s useful to all can be tricky.

For my slot, I decided to wax lyrical about the docopt library. Once upon a time, I was sceptical, but it’s become my go-to library for any sort of command-line interface. Rather than fiddling with argparse and the like, I just write a docopt help string, and the hard work is done for me. I’ve used it in multiple languages, and thought it might be handy for other devs at work. Ergo, this talk.
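To give a flavour of how it works (this is a made-up snippet, not one from the talk): you write the usage text as your module docstring, pass it to docopt, and get back a dictionary of parsed arguments.

"""Usage:
  greet.py <name> [--shout]
  greet.py (-h | --help)

Options:
  --shout    Print the greeting in capitals.
"""

from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__)
    greeting = 'Hello, %s!' % arguments['<name>']
    if arguments['--shout']:
        greeting = greeting.upper()
    print(greeting)

That’s the whole interface: the help text is the source of truth, and docopt handles -h/--help for free.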

You can download my slides as a PDF, or read the notes below.

Read more →


A Python module for lazy reading of file objects

At work, we often pass data around via large files kept in Amazon S3 – XML exports from legacy applications, large log files, JSON dumps of Elasticsearch indexes – that sort of thing. The services that deal with these files run in Docker containers on AWS, and they have limited memory and local storage.

Downloading large files into memory is slow, expensive, and often unnecessary. Many of these files contain a list of records, which we want to process one at a time – we only need to hold a single record in memory, not the whole file.

Python can do efficient line-by-line processing of local files. The following code only reads a line at a time:

with open('very-long-file.txt') as f:
    for line in f:
        do_stuff_with(line)

This keeps memory usage low, and usually results in faster code – but it only works for local files, and the only delimiter is a newline. You need a different wrapper if you want to do this for files in S3, or use a different delimiter – and that’s what this module does. It goes like this:

import boto3
from lazyreader import lazyread

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket='example-bucket', Key='records.txt')
body = s3_object['Body']

for doc in lazyread(body, delimiter=b';'):
    print(doc)

The code isn’t especially complicated, just a little fiddly, but I think it’s a useful standalone component.
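If you’re curious how it works under the hood, the core idea is a generator that reads fixed-size chunks from the file object and yields a piece every time it sees the delimiter. Here’s a simplified sketch of that idea (not the exact code in lazyreader):

def lazyread(f, delimiter):
    """Generator that yields delimited chunks from a file-like object."""
    running = b''
    while True:
        new_data = f.read(1024)
        if not new_data:
            # End of file: yield whatever is left over
            if running:
                yield running
            return
        running += new_data
        while delimiter in running:
            chunk, running = running.split(delimiter, 1)
            yield chunk + delimiter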

I was mildly surprised that something like this doesn’t already exist, or if it does, I couldn’t find the right search terms! If you know an existing module that does this, please let me know.

You can install lazyreader from PyPI (pip install lazyreader), or see the README for more details.


Backing up full-page archives from Pinboard

Several years ago, I blogged about a Python script I’d written to back up my Pinboard bookmarks. I’ve been using Pinboard since then – a brief homebrew solution aside – and recently, I wanted to turn my attention to my archival account.

As well as storing a list of pages you’ve saved, you can pay a small annual fee and Pinboard will save a complete copy of the pages you bookmark. This helps keep your bookmarks useful in the face of link rot – when pages change, or even disappear completely. Pages break surprisingly quickly – Maciej did some informal research on link rot a few years back, and even that post itself has now broken. Around 15% of my bookmarks no longer go anywhere useful.

Once a page is archived, it appears with a small grey checkmark next to the link – so you can easily view the archived page if the original goes away.

I’ve paid for the extra archiving for years, and having these complete backups of my bookmarks is great – but they only live on Pinboard. What if Pinboard gets acquired, or Maciej wipes the servers and runs off to Mexico? I really want a local backup of all these full-page copies.

Read more →


Backing up content from SoundCloud

In the last week or so, SoundCloud have been looking pretty fragile. They closed two of their offices (firing about 40% of their staff), and given they’ve been in financial difficulties for several years, you might wonder if SoundCloud is long for this world.

If you’re a SoundCloud user, you might want to back up anything you’ve already uploaded.

I don’t have anything on SoundCloud myself, but there’s quite a lot of content from Wellcome Collection. When this news was shared in our internal Slack, I decided to pre-emptively download everything on the Collection’s account – back up now, ask questions later. (I was glad to hear that we already had local copies of all the important data. Still, better safe than sorry.)

It sounds like the Internet Archive are trying to archive SoundCloud, but sucking down 2.5PB of data is a tall order, even for them. I thought I’d write up what I did for the Wellcome account, in case anybody else wants an extra copy of their files.

I started by installing youtube-dl, a command-line tool for backing up video and audio from a whole bunch of sites – including SoundCloud. It’s an incredibly handy program to keep around. You can install it in many ways, but I prefer to use pip:

$ pip install youtube-dl

Then, downloading the Collection’s SoundCloud account was a single command:

$ youtube-dl --write-thumbnail --write-info-json "https://soundcloud.com/wellcomecollection"

youtube-dl is smart enough to recognise this as a user page, and downloads all the tracks uploaded by Wellcome. Replace wellcomecollection with your username to get your own content. For each track, it downloads an audio file, a thumbnail, and a blob of metadata.

For good measure, I then copied everything to an Amazon S3 bucket.
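If you want to do the same, the AWS CLI makes that copy a one-liner – the bucket name here is just a placeholder:

$ aws s3 sync . s3://my-backup-bucket/soundcloud/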

This probably won’t work as well for content that’s private or copyright-restricted, but if all you have is a public account, you might want to consider doing this sooner, not later.


Listing keys in an S3 bucket with Python

A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.

The AWS APIs (via boto3) do provide a way to get this information, but the calls are paginated and don’t expose key names directly. It’s a bit fiddly, and I don’t generally care about those details when I just want a list of keys – so I wrote a wrapper function that hides all the messiness of the S3 API.
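To give a rough idea of the shape of it – this is a simplified sketch, not the exact function from the post – it’s a generator that drives one of boto3’s paginators and yields the keys one at a time:

import boto3

def get_keys(bucket, prefix=''):
    """Generate every key in an S3 bucket, optionally filtered by prefix."""
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']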

Since this function has been useful in lots of places, I thought it would be worth writing it up properly.

Read more →


A visit to the Crossness pumping station

In the early nineteenth century, the River Thames was heavily polluted. It was treated as an open sewer, with human excrement and industrial waste dumped directly into the river and left to rot. The uncleaned river led to multiple outbreaks of cholera, and made central London thoroughly unpleasant. In summer 1858, the hot weather made the smell so bad that it was dubbed “the Great Stink”. At this time, many people believed that bad smells (called miasma) were responsible for the spread of disease, so the state of the river was seen as a public health hazard.

After 1858, Parliament decided to commission a new, modern sewerage system that would carry the smell away from the centre of the city. The Metropolitan Board of Works – led by engineer Joseph Bazalgette – were tasked with building the new sewers. I first came across the story in a BBC docudrama series, which has quite a nice overview.

The design of this new system was rather elegant: a series of six main tunnels (three either side of the river) would carry the sewage east, away from the city. Smaller sewers would carry sewage from individual properties into the main tunnels. The whole system was built on a gradient, so everything is carried entirely by gravity. When it’s sufficiently far east, the sewage is pumped back up to ground level, dumped in the Thames and washed out to sea.

A map of London’s sewers, drawn in 1882. The main interceptor tunnels are highlighted in red. Image from Wikipedia.

The endpoint of the southern tunnel was at Crossness. There was a pumping station with four steam-driven pumps that pulled the waste up to ground level, and dumped it into the river on the outgoing tide. Both Crossness and the wider sewerage system were seen as major feats of Victorian engineering, and the opening of Crossness itself was a particularly prestigious event.

An invitation to the opening of Crossness in 1865. Image from the Science Museum, Wellcome Images.

Today we’re (slightly) more enlightened, and don’t just dump raw sewage into the sea. Instead, sewage is sent to treatment plants for processing, and disposed of elsewhere – which led to these old pumping stations being decommissioned. By the end of the 1950s, these stations were all but abandoned.

Since then, the other southern pumping station (Deptford) has essentially vanished, and the northern station (Abbey Mills) is a shell of its former self. But Crossness survived fairly well: the large chimney in the invitation above was demolished, but otherwise the site was left in reasonable shape. In 1985, the Crossness Engines Trust was established to preserve the site, and restore the engines to a working state. Today, the pumping station is open to the public.

Last weekend, Crossness held an open day – the pumping station was open to the public, and the restored engine was running. Given my interest, I decided to head down, have a look round, and take a few photos.

Read more →


Accessibility at AlterConf

On Saturday, I was at AlterConf London, a conference about diversity in the tech and gaming industries. If you follow me on Twitter, you’ll have seen that I was tweeting pretty effusively about it throughout the day. It was one of the friendliest, nicest conferences I’ve ever been to, with a cracking set of speakers to boot.

I was really impressed by how much the AlterConf organisers had done to make the conference accessible and inclusive. Most tech conferences are dominated by cis, white men – this was very different. Both the speaker lineup and the audience were remarkably diverse.

In this post, I want to talk about a few of the things that really stood out to me, which helped to make the conference feel more inclusive. Many of these are ideas that could be replicated elsewhere, and I’d love to see them spread. I’ll write about the talks in a separate post.

A disclaimer: I’m a cis white male, so I don’t tend to have problems at other tech conferences. Take my praise with a pinch of salt, because I’m not really the person this is aimed at helping.

Read more →


A few examples of extensions in Python-Markdown

I write a lot of content in Markdown (including all the posts on this site), and I use Python-Markdown to render it as HTML. One of Python-Markdown’s features is an Extensions API. The package provides some extensions for common tasks – abbreviations, footnotes, tables and so on – but you can also write your own extensions if you need something more specialised.
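If you haven’t seen the Extensions API before, the general shape is: you subclass one of the processor classes, then register it from an Extension subclass. Here’s a deliberately trivial, made-up example (assuming a recent Python-Markdown) that strips lines starting with '//' before rendering:

import markdown
from markdown.extensions import Extension
from markdown.preprocessors import Preprocessor

class CommentPreprocessor(Preprocessor):
    """Drop any line that starts with '//' before Markdown sees it."""
    def run(self, lines):
        return [line for line in lines if not line.startswith('//')]

class CommentExtension(Extension):
    def extendMarkdown(self, md):
        md.preprocessors.register(CommentPreprocessor(md), 'comments', 25)

html = markdown.markdown(
    'hello world\n// a note to self\nmore text',
    extensions=[CommentExtension()]
)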

After years of just using the builtin extensions, I’ve finally started to dip my toe into custom extensions. In this post, I’m going to show you a few of my tweaked or custom extensions.

Read more →


A script for backing up your Instapaper bookmarks

About three days ago, there was an extended outage at Instapaper. Luckily, it seems like there wasn’t any permanent data loss – everybody’s bookmarks are still safe – but this sort of incident can make you worry.

I have a Python script that backs up my Instapaper bookmarks on a regular basis, so I was never worried about data loss. At worst, I’d have lost an hour or so of changes – fairly minor, in the grand scheme of things. I’ve been meaning to tidy it up and share it for a while, and this outage prompted me to get on and finish that. You can find the script and the installation instructions on GitHub.