Backing up full-page archives from Pinboard

Several years ago, I blogged about a Python script I’d written to back up my Pinboard bookmarks. I’ve been using Pinboard since then – a brief homebrew solution aside – and recently, I wanted to turn my attention to my archival account.

As well as storing a list of the pages you’ve saved, Pinboard will – for a small annual fee – keep a complete copy of every page you bookmark. This helps keep your bookmarks useful in the face of link rot, when pages change or even disappear completely. Pages break surprisingly quickly – Maciej did some informal research on link rot a few years back, and even that post has since broken. Around 15% of my bookmarks no longer go anywhere useful.

Once a page is archived, it appears with a small grey checkmark next to the link – so you can easily view the archived copy if the original goes away.

I’ve paid for the extra archiving for years, and having these complete backups of my bookmarks is great – but they only live on Pinboard. What if Pinboard gets acquired, or Maciej wipes the servers and runs off to Mexico? I really want a local backup of all these full-page copies.

Read more →


Listing keys in an S3 bucket with Python

A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.

The AWS APIs (via boto3) do provide a way to get this information, but the calls are paginated, and each page returns object metadata rather than a plain list of key names. It’s a bit fiddly, and I don’t generally care about those details when all I want is the list – so I wrote a wrapper function that hides the messiness of the S3 API.

Since this function has been useful in lots of places, I thought it would be worth writing it up properly.
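
Roughly speaking, the wrapper is built on boto3’s paginators, and looks something like this (a simplified sketch, with placeholder function and bucket names – the full version is in the post):

import boto3

def get_s3_keys(bucket, prefix=''):
    """Generate the keys in an S3 bucket, hiding the pagination."""
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

# For example: count the keys that follow a particular naming scheme
print(sum(1 for key in get_s3_keys('example-bucket') if key.endswith('.xml')))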

Read more →


Backing up content from SoundCloud

In the last week or so, SoundCloud have been looking pretty fragile. They closed two of their offices (laying off about 40% of their staff), and given they’ve been in financial difficulties for several years, you might wonder whether SoundCloud is long for this world.

If you’re a SoundCloud user, you might want to back up anything you’ve already uploaded.

I don’t have anything on SoundCloud myself, but there’s quite a lot of content from Wellcome Collection. When this news was shared in our internal Slack, I decided to pre-emptively download everything on the Collection’s account – back up now, ask questions later. (I was glad to hear that we already had local copies of all the important data. Still, better safe than sorry.)

It sounds like the Internet Archive are trying to archive SoundCloud, but sucking down 2.5PB of data is a tall order, even for them. I thought I’d write up what I did for the Wellcome account, in case anybody else wants an extra copy of their files.

I started by installing youtube-dl, a command-line tool for backing up video and audio from a whole bunch of sites – including SoundCloud. It’s an incredibly handy program to keep around. You can install it in many ways, but I prefer to use pip:

$ pip install youtube-dl

Then, downloading the Collection’s SoundCloud account was a single command:

$ youtube-dl --write-thumbnail --write-info-json "https://soundcloud.com/wellcomecollection"

youtube-dl is smart enough to recognise this as a user page, and downloads all the tracks uploaded by Wellcome. Replace wellcomecollection with your username to get your own content. For each track, it downloads an audio file, a thumbnail, and a blob of metadata.

For good measure, I then copied everything to an Amazon S3 bucket.
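
If you want to do the same, something like this boto3 loop will do the upload (the bucket name and download directory here are placeholders, and this is a sketch rather than exactly what I ran):

import os
import boto3

s3 = boto3.client('s3')

# Walk the directory of downloaded tracks and upload each file,
# using the local path as the S3 key.
for root, _, filenames in os.walk('soundcloud-downloads'):
    for name in filenames:
        path = os.path.join(root, name)
        s3.upload_file(path, 'example-backup-bucket', path)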

This probably won’t work as well for content that’s private or copyright-restricted, but if everything on your account is public, you might want to consider doing this sooner rather than later.


A visit to the Crossness pumping station

In the early nineteenth century, the River Thames was heavily polluted. It was treated as an open sewer, with human excrement and industrial waste dumped directly into the river and left to rot. The filthy river caused repeated outbreaks of cholera, and made central London thoroughly unpleasant. In the summer of 1858, the hot weather made the smell so bad that it was dubbed “the Great Stink”. At the time, many people believed that disease was spread by bad smells (the miasma theory), so the state of the river was seen as a public health hazard.

After 1858, Parliament commissioned a new, modern sewerage system that would carry the sewage – and the smell – away from the centre of the city. The Metropolitan Board of Works, with Joseph Bazalgette as its chief engineer, were tasked with building the new sewers. I first came across the story in a BBC docudrama series, which gives quite a nice overview.

The design of this new system was rather elegant: a series of six main tunnels (three either side of the river) would carry the sewage east, away from the city. Smaller sewers would carry sewage from individual properties into the main tunnels. The whole system was built on a gradient, so everything is carried entirely by gravity. When it’s sufficiently far east, the sewage is pumped back up to ground level, dumped in the Thames and washed out to sea.

A map of London’s sewers, drawn in 1882. The main interceptor tunnels are highlighted in red. Image from Wikipedia.

The endpoint of the southern tunnel was at Crossness. There was a pumping station with four steam-driven pumps that pulled the waste up to ground level, and dumped it into the river on the outgoing tide. Both Crossness and the wider sewerage system were seen as major feats of Victorian engineering, and the opening of Crossness itself was a particularly prestigious event.

An invitation to the opening of Crossness in 1865. Image from the Science Museum, Wellcome Images.

Today we’re (slightly) more enlightened, and don’t just dump raw sewage into the sea. Instead, sewage is sent to treatment plants for processing, and disposed of elsewhere – which led to these old pumping stations being decommissioned. By the end of the 1950s, these stations were all but abandoned.

Since then, the other southern pumping station (Deptford) has essentially vanished, and the northern station (Abbey Mills) is a shell of its former self. But Crossness survived fairly well: the large chimney in the invitation above was demolished, but otherwise the site was left in reasonable shape. In 1985, the Crossness Engines Trust was established to preserve the site, and restore the engines to a working state. Today, the pumping station is open to the public.

Last weekend, Crossness held an open day – the pumping station was open to the public, and the restored engine was running. Given my interest, I decided to head down, have a look round, and take a few photos.

Read more →


Accessibility at AlterConf

On Saturday, I was at AlterConf London, a conference about diversity in the tech and gaming industries. If you follow me on Twitter, you’ll have seen that I was tweeting pretty effusively about it throughout the day. It was one of the friendliest, nicest conferences I’ve ever been to, with a cracking set of speakers to boot.

I was really impressed by how much the AlterConf organisers had done to make the conference accessible and inclusive. Most tech conferences are dominated by cis, white men – this was very different. Both the speaker lineup and the audience were remarkably diverse.

In this post, I want to talk about a few of the things that really stood out to me, which helped to make the conference feel more inclusive. Many of these are ideas that could be replicated elsewhere, and I’d love to see them spread. I’ll write about the talks in a separate post.

A disclaimer: I’m a cis white male, so I don’t tend to have problems at other tech conferences. Take my praise with a pinch of salt, because I’m not really the person this is aimed at helping.

Read more →


A few examples of extensions in Python-Markdown

I write a lot of content in Markdown (including all the posts on this site), and I use Python-Markdown to render it as HTML. One of Python-Markdown’s features is an Extensions API. The package provides some extensions for common tasks – abbreviations, footnotes, tables and so on – but you can also write your own extensions if you need something more specialised.
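
Using the bundled extensions is just a matter of passing their names – for example (these are the standard built-in names, nothing custom):

import markdown

source = """
The HTML spec is maintained by the W3C.

*[HTML]: Hyper Text Markup Language
*[W3C]:  World Wide Web Consortium
"""

# 'abbr', 'footnotes' and 'tables' all ship with Python-Markdown
html = markdown.markdown(source, extensions=['abbr', 'footnotes', 'tables'])
print(html)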

After years of just using the builtin extensions, I’ve finally started to dip my toe into custom extensions. In this post, I’m going to show you a few of my tweaked or custom extensions.
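
To give a flavour of the API first, here’s a toy extension – a preprocessor that strips lines starting with ‘//’ before rendering. It isn’t one of the extensions I’ll be showing in the post, just a sketch of the shape (using the Python-Markdown 3.x API):

import markdown
from markdown.extensions import Extension
from markdown.preprocessors import Preprocessor

class CommentPreprocessor(Preprocessor):
    """Drop any line that starts with '//' before Markdown parses it."""
    def run(self, lines):
        return [line for line in lines if not line.startswith('//')]

class CommentExtension(Extension):
    def extendMarkdown(self, md):
        md.preprocessors.register(CommentPreprocessor(md), 'comments', 25)

html = markdown.markdown(
    '// a private note\nHello, *world*!',
    extensions=[CommentExtension()]
)
print(html)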

Read more →


A script for backing up your Instapaper bookmarks

About three days ago, there was an extended outage at Instapaper. Luckily, it seems like there wasn’t any permanent data loss – everybody’s bookmarks are still safe – but this sort of incident can make you worry.

I have a Python script that backs up my Instapaper bookmarks on a regular basis, so I was never worried about data loss. At worst, I’d have lost an hour or so of changes – fairly minor, in the grand scheme of things. I’ve been meaning to tidy it up and share it for a while, and this outage prompted me to get on and finish that. You can find the script and the installation instructions on GitHub.


A script for backing up your Goodreads reviews

Last year, I started using Goodreads to track my reading. (I’m alexwlchan if you want to follow me.) In the past, I’ve had a couple of hand-rolled systems for recording my books, but maintaining them often became a distraction from actually reading!

Using Goodreads is quite a bit simpler, but it means my book data is stored on somebody else’s servers. What if Goodreads goes away? I don’t want to lose that data, particularly because I’m trying to be better about writing some notes after I finish a book.

There is an export function on Goodreads, but it has to be invoked by hand. I prefer backup tools that run automatically: I can set them up on a schedule, and I know my data is safe. Usually that means a script and a cron job.

That’s exactly what I’ve done for Goodreads: I’ve written a Python script that uses the Goodreads API to grab the same information as provided by the builtin export. I have this configured to run once a day, and now I have daily backups of my Goodreads data. You can find the script and installation instructions on GitHub.

This was a fun opportunity to play with the ElementTree module (normally I work with JSON), and also a reminder that the lack of yield from is the thing I most dislike about Python 2.
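
Roughly, the shape of it is something like this – a simplified sketch rather than the actual script, with the endpoint details reconstructed from memory and API_KEY and GOODREADS_USER_ID as placeholders:

import requests
import xml.etree.ElementTree as ET

API_URL = 'https://www.goodreads.com/review/list.xml'

def get_reviews(api_key, user_id):
    """Generate every <review> element for a user, one page at a time."""
    page = 1
    while True:
        resp = requests.get(API_URL, params={
            'v': 2, 'key': api_key, 'id': user_id, 'page': page,
        })
        resp.raise_for_status()
        root = ET.fromstring(resp.content)
        reviews = root.findall('reviews/review')
        if not reviews:
            break
        for review in reviews:  # in Python 3, this loop is just 'yield from reviews'
            yield review
        page += 1

for review in get_reviews(API_KEY, GOODREADS_USER_ID):
    print(review.find('book/title').text)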


A Python interface to AO3

In my last post, I talked about some work I’d been doing to scrape data from AO3 using Python. I haven’t made any more progress, but I’ve tidied up what I had and posted it to GitHub.

Currently this gives you a way to get metadata about works (word count, title, author, that sort of thing), along with your complete reading history. The latter is particularly interesting, because it lets you get a complete list of works where you’ve left kudos.

Instructions are in the README, and you can install it from PyPI (pip install ao3).

I’m not actively working on this (I have what I need for now), but this code might be useful for somebody else. Enjoy!


Experiments with AO3 and Python

Recently, I’ve been writing some scripts that need to get data from AO3¹. Unfortunately, AO3 doesn’t have an API (although it’s apparently on the roadmap), so you have to do everything by scraping pages and parsing HTML. A bit yucky, but it can be made to work.

You can get to a lot of pages without having an AO3 account – which includes most of the fic. If you want to get data from those pages, you can use any HTTP client to download the HTML, then parse or munge it as much as you like. For example, in Python:

import requests

req = requests.get('http://archiveofourown.org/works/9079264')
print(req.text)  # Prints the page's HTML

I have a script that takes this HTML and extracts metadata like word count and pairings. (I use that to auto-tag my bookmarks on Pinboard, because I’m lazy that way.)
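
For a flavour of what that looks like, here’s a stripped-down version using BeautifulSoup. The selectors are a guess at AO3’s current markup, so treat them as assumptions that could break if the site changes:

import requests
from bs4 import BeautifulSoup

req = requests.get('http://archiveofourown.org/works/9079264')
soup = BeautifulSoup(req.text, 'html.parser')

# AO3 renders the stats as <dd> elements, e.g. <dd class="words">,
# and tags as <a class="tag"> links.
word_count = soup.find('dd', attrs={'class': 'words'}).text
pairings = [a.text for a in soup.select('dd.relationship a.tag')]

print(word_count)
print(pairings)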

But there are some pages that require you to be logged in to an account. For example, AO3 can track your reading history across the site. If you try to access a private page with the approach above, you’ll just get an error message:

Sorry, you don't have permission to access the page you were trying to reach. Please log in.

Wouldn’t it be nice if you could access those pages in a script as well?

I’ve struggled with this for a while, and I had some hacky workarounds, but nothing very good. Tonight, I found quite a neat solution that seems much more reliable.

For this to work, you need an HTTP client that doesn’t just do one-shot requests. You really want to make two requests: one to log in, another for the page you actually want. You need to persist some login state from the first request to the second, so that AO3 remembers who you are. Normally this state is managed by your browser; in Python, you can do the same thing with sessions.

After a bit of poking at the AO3 login form, I’ve got the following code that seems to work:

import requests

sess = requests.Session()

# Log in to AO3
sess.post('https://archiveofourown.org/user_sessions', params={
    'user_session[login]': USERNAME,
    'user_session[password]': PASSWORD,
})

# Fetch my private reading history
req = sess.get('https://archiveofourown.org/users/%s/readings' % USERNAME)
print(req.text)

Where previously this would return an error page, now I get my reading history. There’s more work to parse this into usable data, but we’re past my previous stumbling block.

I think this is a useful milestone, and could form the basis for a Python-based AO3 API. I’ve thought about writing such a library in the past, but it’s a bit limited if you can’t log in. With that restriction lifted, there’s a lot more you can potentially do.

I have a few ideas about what to do next, but I don’t have much free time coming up. I’m not promising anything – but you might want to watch this space.


  1. Non-fannish types: AO3 is the Archive of Our Own, a popular website for sharing fanfiction.