Is a given URL from a Tumblr post?

I’ve been writing some code recently that takes a URL, and performs some special actions if that URL is a Tumblr post. The problem is working out whether a given URL points to Tumblr.

Most Tumblrs use a consistent naming scheme for their URLs, so I can detect them with a regular expression. But some Tumblrs use custom URLs, and mask their underlying platform. Unfortunately, I encounter enough of these that I can’t just hard-code them, and I really should handle them properly.

So how can I know if an arbitrary URL belongs to Tumblr?

I’ve had to do this a couple of times now, so I thought it was worth writing up what to do – partly for my future reference, partly in case anybody else finds it useful.

In the HTTP headers on a Tumblr page, there are a couple of “X-Tumblr” headers. These are custom headers, defined by Tumblr – they aren’t part of the official HTTP spec. They aren’t documented anywhere, but it’s clear who’s sending them, and I’d be quite surprised to see another site send them. For my purposes, this is a sufficiently reliable indicator.

So this is the function I use to detect Tumblr URLs:

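try:
    from urllib.parse import urlparse
except ImportError:  # Python 2
    from urlparse import urlparse

import requests

def is_tumblr_url(url):
    # Cheap check first: the default naming scheme needs no network request.
    # The '.tumblr.com' suffix is an assumption about that scheme.
    if urlparse(url).netloc.endswith('.tumblr.com'):
        return True
    else:
        # Otherwise, look for Tumblr's custom headers on the response.
        req = requests.head(url)
        return any(h.startswith('X-Tumblr') for h in req.headers)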

It’s by no means perfect, but it’s a step up from basic string matching, and accurate and fast enough that I can usually get by.
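The header check can also be pulled out into a small pure function, which makes it easy to try without touching the network. This is only a sketch under a few assumptions: has_tumblr_header and is_tumblr_url_checked are names made up for illustration, the five-second timeout is arbitrary, and following redirects and treating a failed request as “not Tumblr” are choices the function above doesn’t make.

```python
def has_tumblr_header(headers):
    """Return True if any header name starts with 'X-Tumblr', ignoring case."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    return any(name.lower().startswith('x-tumblr') for name in headers)


def is_tumblr_url_checked(url, timeout=5):
    """Hypothetical variant that treats network failures as 'not Tumblr'."""
    import requests  # imported lazily so the helper above works without it

    try:
        # Assumption: follow redirects so the final page's headers are checked.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return False
    return has_tumblr_header(resp.headers)
```

Because has_tumblr_header only needs something that iterates over header names, it works on a plain dictionary as well as on the headers of a requests response.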

Two Python scripts for cleaning up directories

Last month, I wrote about some tools I’d been using to clear disk space on my Mac. I’ve been continuing to clean up my mess of files and folders as I try to simplify my hard drive, and there are two new scripts I’ve been using to help me. Neither is particularly complicated, but I thought they were worth writing up properly.

Depending on how messy your disk is, these may or may not be useful to you – but they’ve saved a lot of time for me.

Of course, you should always be very careful of code that deletes or rearranges files on your behalf, and make sure you have good backups before you start.


Chasing redirects and URL shorteners

Quick post today. A few years back, there was a proliferation of link shorteners on Twitter: tinyurl and plenty of others. When characters are precious, you don’t want to waste them on a long URL. But link shorteners are frustrating for several reasons:

  • It becomes harder to see where a particular link goes.
  • If the link shortener goes away, all the links break, even if the pages behind the links are still up.
  • Often the same link would be wrapped multiple times: a link would redirect through one shortener, then another, before finally getting to the destination.

Twitter have tried to address this with their own link shortener. All links in Twitter get wrapped with it, so long URLs no longer penalise your character count, and they show a short preview of the destination URL. But this is still fragile – Twitter may not last forever – and people still wrap links in multiple shorteners.

When I’m storing data with shortened links, I like to record where the link is supposed to go. I keep the shortened and the resolved link, which tends to be pretty future-proof.

To find out where a shortened URL goes, I could just open it in a web browser. But that’s slow and manual, and doesn’t work if I want to save the URL as part of a scripted pipeline. So I have a couple of utility functions to help me out.

All the good link shorteners use HTTP 3XX redirects to send you to the next URL in the chain. A lot of HTTP libraries will just follow those if you make a GET request, so it’s enough to make a GET request and see where you end up. Here’s what that looks like with python-requests:

import requests

def resolve_url(base_url):
    r = requests.get(base_url)
    return r.url

if __name__ == '__main__':
    import sys
    print(resolve_url(sys.argv[1]))

When run from the command-line, this just prints the final URL.

Sometimes I also want to see the intermediate links involved in the resolution: for example, if a site “helpfully” redirects any broken pages to a generic 404. In that case, I make individual HEAD requests and follow the redirects manually:

import requests

def chase_redirects(url):
    while True:
        yield url
        r = requests.head(url)
        if 300 < r.status_code < 400:
            url = r.headers['location']
        else:
            # Not a redirect, so this is the end of the chain.
            break

if __name__ == '__main__':
    import sys
    for url in chase_redirects(sys.argv[1]):
        print(url)

This prints each URL involved in the chain. It’s useful for debugging a particular URL, or working out where a redirect chain falls over. I don’t use it as much, but it’s useful to have around.

There are definitely weird setups where these functions fall over (for example, a pair of pages which redirect to each other), but in the vast majority of cases they’re completely fine.
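For the redirect-loop case, one possible guard is to remember which URLs have already been visited and stop when one repeats. The sketch below is not what my scripts do: chase_redirects_safely, head_with_requests, the max_hops limit and the timeout are all hypothetical, and the head argument (a callable returning a status code and a Location value) exists only so the loop logic can be tried without a network connection.

```python
from urllib.parse import urljoin


def chase_redirects_safely(url, head, max_hops=20):
    """Yield each URL in a redirect chain, stopping on loops or after max_hops.

    `head` is a hypothetical callable: it takes a URL and returns a
    (status_code, location_header_or_None) pair.
    """
    seen = set()
    while url not in seen and len(seen) < max_hops:
        seen.add(url)
        yield url
        status, location = head(url)
        if not (300 < status < 400) or location is None:
            return
        # The Location header may be relative, so resolve it against
        # the URL that produced it.
        url = urljoin(url, location)


def head_with_requests(url):
    """One possible `head` implementation, using python-requests."""
    import requests  # imported lazily; only needed for real network calls

    r = requests.head(url, timeout=5)
    return r.status_code, r.headers.get('location')
```

Calling chase_redirects_safely(some_url, head_with_requests) would then walk a real chain, while a fake head function backed by a dictionary is enough to check the loop handling.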

I have these two scripts saved as resolve_url and chase_url. I can invoke them from a shell prompt, or incorporate them in scripts. They’re handy little programs: incredibly simple, quick, and perform one task very well.

Clearing disk space on OS X

Over the weekend, I’ve been trying to clear some disk space on my Mac. I’ve been steadily accumulating lots of old and out-of-date files, and I just wanted a bit of a spring clean. Partly to get back the disk space, partly so I didn’t have to worry about important files getting lost in the noise.

Over the course of a few hours, I was able to clean up over half a terabyte of old files. This wasn’t just trawling through the Finder by hand – I had a couple of tools and apps to help me do this – and I thought it would be worth writing down what I used.

Backups: Time Machine, SuperDuper! and CrashPlan

Embarking on an exercise like this without good backups would be foolhardy: what if you get cold feet, or accidentally delete a critical file? Luckily, I already have three layers of backup:

  • Time Machine backups to a local hard drive
  • Overnight SuperDuper! clones to a local hard drive
  • Backups to the CrashPlan cloud

The Time Machine backups go back to last October, the CrashPlan copies further still. I haven’t looked at the vast majority of what I deleted in months (some stuff years), so I don’t think I’ll miss it – but if I change my mind, I’ve got a way out.

For finding the big folders: DaisyDisk

DaisyDisk can analyse a drive or directory, and it presents you with a pie-chart-like diagram showing which folders are taking up the most space. You can drill down into the pie segments to see the contents of each folder. For example, this shows the contents of my home directory:

This is really helpful for making big space savings – it’s easy to see which folders have become bloated, and target my cleaning accordingly. If you want quick gains, this is a great app.

It’s also fast: scanning my entire 1TB boot drive took less than ten seconds.

For finding duplicate files: Gemini

Once I’ve found a large directory, I need to decide what (if anything) I want to delete. Sometimes I can look for big files that I know I don’t want any more, and move them straight to the Trash. But the biggest waste of space on my computer is multiple copies of the same file. Whenever I reorganise my hard drive, files get copied around, and I don’t always clean them up.

Gemini is a tool that can find duplicate files or folders within a given set of directories. For example, running it over a selection of my virtualenvs:

Once it’s found the duplicates, you can send files straight to the Trash from within the app. It has some handy filters for choosing which dupes to drop – oldest, newest, from within a specific directory – so doing so is pretty quick.

This is another fast way to reclaim space: deleting dupes saves space, but doesn’t lose any information.

Gemini isn’t perfect: it got slow when scanning large directories (100k+ files), and sometimes it missed duplicates. I often had to run it several times before it had found all of the dupes in a directory. Note that I’m only running v1: these problems may be fixed in the new version.

File-by-file comparisons: Python’s filecmp module

Sometimes I wanted to compare a couple of individual files, not an entire directory. For this, I turned to Python’s filecmp module. This module contains a number of functions for comparing files and directories. This let me write a shell function for doing the comparisons on the command-line (this is the fish shell):

function filecmp
    python -c "import filecmp; print(filecmp.cmp('''$argv[1]''', '''$argv[2]'''))"
end

Fish drops in the two arguments to the function as $argv[1] and $argv[2]. The -c flag tells Python to run a command passed in as a string, and then it’s printing the result of calling filecmp.cmp() with the two files – True if they match, False if they don’t.

I’m using triple-quoted strings in the Python, so that filenames containing quote characters don’t prematurely terminate the string. I could still be bitten by a filename that contains a triple quote, but that would be very unusual. And unlike Python, where quote characters are interchangeable, it’s important that I use double-quotes for the string in the shell: shells only expand variables inside double-quoted strings, not single-quoted strings.

Usage is as follows:

$ filecmp hello.txt hello.txt
True

$ filecmp hello.txt foo.txt
False

I have this in my fish config file, so it’s available in any Terminal window. If you drag a file from the Finder to the Terminal, it auto-inserts the full path to that file, so it’s really easy to do comparisons – I type filecmp, and then drag in the two files I want to compare.

This is great if I only want to compare a few files at a time. I didn’t use it much on the big clean, but I’m sure it’ll be useful in the future.
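The quoting worry around the fish function goes away if the filenames reach Python as command-line arguments rather than being pasted into a string. Here’s a minimal sketch of that idea as a standalone script; the name same_contents is made up, and passing shallow=False is an extra choice of mine (it asks filecmp to compare the bytes, where the one-liner relies on the default shallow comparison).

```python
import filecmp
import sys


def same_contents(path_a, path_b):
    """Return True if the two files have identical contents."""
    # shallow=False compares the file contents, rather than trusting
    # matching os.stat() signatures (file type, size and mtime).
    return filecmp.cmp(path_a, path_b, shallow=False)


# Only run the comparison when exactly two paths were supplied.
if __name__ == '__main__' and len(sys.argv) == 3:
    print(same_contents(sys.argv[1], sys.argv[2]))
```

Saved as an executable script, this could be called the same way as the shell function, and a filename full of quote characters would be passed through untouched.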

Photos library: Duplicate Photos Cleaner

Part of this exercise was trying to consolidate my photo library. I’ve tried a lot of tools for organising my photos – iPhoto, Aperture, Lightroom, a folder hierarchy – and so photos are scattered across my disk. I’ve settled on using iCloud Photo Library for now, but I still had directories with photos that I hadn’t imported.

When I found a directory with new pictures, I just loaded everything into Photos. It was faster than cherry-picking the photos I didn’t already have, and ensured I didn’t miss anything – but of course, it also ensured that I imported any duplicates.

Once I’d finished importing photos from the far corners of my disk, I was able to use this app to find duplicates in my photo library, and throw them away. It scans your entire Photo Library (it can do iPhoto and Photo Booth as well), and moves any duplicates to a dedicated album, for you to review/delete at will.

I chose the app by searching the Mac App Store; there are plenty of similar apps, and I don’t know how this one compares. I don’t have anything to particularly recommend it compared to other options, but it found legitimate duplicates, so it’s fine for my purposes.

Honourable mentions: find, du and df

There were a couple of other command-line utilities that I found useful.

If I wanted to find out which directories contain the most files – not necessarily the most space – I could use find. This isn’t about saving disk space, it’s about reducing the sheer number of unsorted files I keep. There were two commands I kept using:

  • Count all the files below the current directory: both files in this directory, and all of its subdirectories.
    $ find . | wc -l
  • Find out which of the subdirectories of the current directory contain the most files.
    $ for l in (ls); if [ -d $l ]; echo (find $l | wc -l)"  $l"; end; end
             627  _output
             262  content
              31  screenshots
               3  talks
              33  theme
              11  util

These two commands let me focus on processing directories that had a lot of files. It’s nice to clear away a large chunk of these unsorted files, so that I don’t have to worry about what they might contain.
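The same tally can be done from Python, which is handy if the counts need to feed into another script. This is a rough equivalent rather than something I used during the clean-up: count_entries and tally_subdirectories are hypothetical names, and like find, the count includes directories as well as files.

```python
import os


def count_entries(root):
    """Count files and directories under root, including root itself.

    This mirrors what `find root | wc -l` reports.
    """
    total = 1  # find lists the starting directory too
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def tally_subdirectories(parent='.'):
    """Return (name, count) pairs for each subdirectory, biggest first."""
    counts = {}
    for name in os.listdir(parent):
        path = os.path.join(parent, name)
        if os.path.isdir(path):
            counts[name] = count_entries(path)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == '__main__':
    for name, count in tally_subdirectories():
        print('%8d  %s' % (count, name))
```

Sorting by count puts the messiest directories at the top, which the shell loop doesn’t do.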

And when I’m using Linux, I can mimic the functions of DaisyDisk with df and du. The df (display free space) command lets me see how much space is free on each of my disk partitions:

$ df -h
Filesystem      Size   Used  Avail Capacity   iused     ifree %iused  Mounted on
/dev/disk2     1.0Ti  295Gi  741Gi    29%  77290842 194150688   28%   /
devfs          206Ki  206Ki    0Bi   100%       714         0  100%   /dev
map -hosts       0Bi    0Bi    0Bi   100%         0         0  100%   /net
map auto_home    0Bi    0Bi    0Bi   100%         0         0  100%   /home
/dev/disk3s4   2.7Ti  955Gi  1.8Ti    35% 125215847 241015785   34%

And du (display disk usage) lets me see what’s using up space in a single directory:

$ du -hs *
 24K    experiments
 32K    favicon-a.acorn
 48K    favicon.acorn
 24K    style
 56K    templates
 40K    touch-icon-a.acorn

I far prefer DaisyDisk when I’m on the Mac, but it’s nice to have these tools in my back pocket.
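If the numbers are needed inside a script, the Python standard library can give a rough approximation of both commands. The sketch below is an illustration only: free_space and directory_size are made-up names, and directory_size adds up file lengths rather than allocated disk blocks, so it won’t agree exactly with du.

```python
import os
import shutil


def free_space(path='/'):
    """Return (total, used, free) in bytes for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.total, usage.used, usage.free


def directory_size(root):
    """Sum the apparent sizes of the files under root, in bytes.

    Unlike du, this counts file lengths rather than allocated disk
    blocks, and it skips symbolic links.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```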

Closing thought

These days, disk space is cheap (and even large SSDs are fairly affordable). So I don’t need to do this: I wasn’t running out of space, and it would be easy to get more if I was. But it’s useful for clearing the noise, and finding old files that have been lost in the bowels of my hard drive.

I do a really big cleanup about once a year, and having these tools always makes me much faster. If you ever need to clear large amounts of disk space, I’d recommend any of them.

A two-pronged iOS release cycle

One noticeable aspect of this year’s WWDC keynote was a lack of any new features focused on the iPad. Federico Viticci has written about this on MacStories, where he said:

I wouldn’t be surprised to see Apple move from a monolithic iOS release cycle to two major iOS releases in the span of six months – one focused on foundational changes, interface refinements, performance, and iPhone; the other primarily aimed at iPad users in the Spring.

I think this is a very plausible scenario, and between iOS 9.3 and WWDC, it seems like it might be coming true. Why? Education.

Apple doesn’t release breakdowns, but a big chunk of iPad sales seems to come from the education market. Education runs to a fixed schedule: the academic year starts in the autumn, continues over winter and spring, with a long break in the summer. A lot of work happens in the summer break, which includes lesson plans for the coming year, and deploying new tech.

The traditional iOS release cycle – preview the next big release at WWDC, release in the autumn – isn’t great for education. By the time the release ships, the school year is already underway. That can make it difficult for schools to adopt new features, often forcing them to wait for the next academic year.

If you look at the features introduced in iOS 9.3 – things like Shared iPad, Apple School Manager, or Managed Apple ID – these aren’t things that can be rolled out mid-year. They’re changes at the deployment stage. Once students have the devices, it’s too late. Even smaller things, like changes to iTunes U, can’t be used immediately, because they weren’t available when lesson plans were made over the summer. (And almost no teachers are going to run developer previews.)

This means there’s less urgency to get education-specific iPad features into the autumn release, because it’s often a year before they can be deployed. In a lot of cases, deferring that work for a later release (like with iOS 9.3) doesn’t make much of a difference for schools. And if you do that, it’s not a stretch to defer the rest of the iPad-specific work, and bundle it all into one big release that focuses on the iPad. Still an annual cycle, but six months offset from WWDC.

Moving to this cycle would have other benefits. Splitting the releases gives Apple more flexibility: they can spread the engineering work across the whole year, rather than focusing on one massive release for WWDC. It’s easier to slip an incomplete feature if the next major release is six months away, not twelve. And it’s a big PR item for the spring, a time that’s usually quiet for official Apple announcements.

I don’t know if this is Apple’s strategy; I’m totally guessing. But it seems plausible, and I’ll be interested to see if it pans out into 2017.

A subscription for my toothbrush 

Last weekend, I had my annual dental check-up. Thankfully, everything was fine, and my teeth are in (reasonably) good health.

While I was at the dentist, I got the usual reminder to change my toothbrush regularly. You’re supposed to replace your toothbrush every three months or so: it keeps the bristles fresh and the cleaning effective. If left to my own devices, I would probably forget to do this.

To help me fix this, I buy my toothbrushes through Amazon’s “Subscribe & Save” program. I have a subscription to my toothbrushes: I tell Amazon what I want to buy, and how often I want it, then they remember when to send me a package. So every six months, I get a new pack of brushes.

Amazon always email you a few days before they send the next delivery, so there’s a chance to cancel it if it’s no longer relevant. It doesn’t come completely out of the blue. And there’s a place for managing your subscriptions on your account page.

I’m sure I could remember to buy toothbrushes myself if I put my mind to it, but it’s quite nice that Amazon will do it for me. It’s just one less thing for me to think about.

Reading web pages on my Kindle

Of everything I’ve tried, my favourite device for reading is still my e-ink Kindle. Long reading sessions are much more comfortable on a display that isn’t backlit.

It’s easy to get ebooks on my Kindle – I can buy them straight from Amazon. But what about content I read on the web? If I’ve got a long article on my Mac that I’d like to read on my Kindle instead, how do I push it from one to the other?

There’s a Send to Kindle Mac app available, but that can only send documents on disk. I tried it a few times – save web pages to a PDF or HTML file, then send them to my Kindle through the app – but it was awkward, and the quality of the finished files wasn’t always great. A lot of web pages have complex layouts, which didn’t look good on the Kindle screen.

But I knew the folks at Instapaper had recently opened an API allowing you to use the same parser as they use in Instapaper itself. You give it a web page, and it passes back a nice, cleaned-up version that only includes the article text. Perhaps I could do something useful with that?

I decided to write a Python script that would let me send articles to a Kindle – from any device.


Introduction to property-based testing 

On Tuesday night, I was talking about testing techniques at the Cambridge Python User Group (CamPUG for short). I was talking primarily about property-based testing and the Python library Hypothesis, but I included an introduction to the ideas of stateful testing, and fuzz testing with american fuzzy lop (afl).

I was expecting the traditional property-based testing would be the main attraction, and the stateful and fuzz testing would just be nice extras. In fact, I think interest was pretty evenly divided between the three topics.

I’ve posted the slides and my rough notes. Thanks to everybody who came on Tuesday – I had a really good evening.

Finding 404s and broken pages in my Apache logs

Sometime earlier this year, I broke the Piwik server-side analytics that I’d been using to count hits to the site. It sat this way for about two months before anybody noticed, which I took as a sign that I didn’t actually need them. I look at them for vanity, nothing more.

Since then, I’ve been using Python to parse my Apache logs, an idea borrowed from Dr. Drang. All I want is a rough view count, and if I work on the raw logs, then I can filter out a lot of noise from things like bots and referrer spam. High-level tools like Piwik and Google Analytics make it much harder to prune your results.

My Apache logs include a list of all the 404 errors: any time that somebody (or something) has found a missing page. This is useful information, because it tells me if I’ve broken something (not unlikely, see above). Although I try to have a helpful 404 page, that’s no substitute for fixing broken pages. So I wrote a script that looks for 404 errors in my Apache logs, and prints the most commonly hit pages – then I can decide whether to fix or ignore them.

The full script is on GitHub, along with some instructions. Below I’ll walk through the part that actually does the hard work.

page_tally = collections.Counter()

for line in sys.stdin:

    # Any line that isn't a 404 request is uninteresting.
    if '404' not in line:
        continue

    # Parse the line, and check it really is a 404 request; otherwise,
    # discard it.  Then get the page the user was trying to reach.
    hit = PATTERN.match(line).groupdict()
    if hit['status'] != '404':
        continue
    page = hit['request'].split()[1]

    # If it's a 404 that I know I'm not going to fix, discard it.
    if page in WONTFIX_404S:
        continue

    # If I fixed the page after this 404 came in, I'm not interested
    # in hearing about it again.
    if page in FIXED_404S:
        time, _ = hit["time"].split()
        date = datetime.strptime(time, "%d/%b/%Y:%H:%M:%S").date()
        if date <= FIXED_404S[page]:
            continue

        # But I definitely want to know about links I thought I'd
        # fixed but which are still broken.
        print('!! ' + page)

    # This is a 404 request that we're interested in; go ahead and
    # add it to the counter.
    page_tally[page] += 1

for page, count in page_tally.most_common(25):
    print('%5d\t%s' % (count, page))

I’m passing the Apache log in to stdin, and looping over the lines. Each line corresponds to a single hit.

On lines 6–7, I’m throwing away all the lines that don’t contain the string “404”. This might let through a few lines that aren’t 404 results – I’m not too fussed. This is just a cheap heuristic to avoid (relatively) slow parsing of lots of lines that I don’t care about.

On lines 11–14, I actually parse the line. My PATTERN regex for parsing the Apache log format comes from Dr. Drang’s post. Now I actually can properly filter for 404 results only, and discard the rest. The request parameter is usually something like GET /about/ HTTP/1.1 – a method, a page and an HTTP version. I only care about the page, so throw away the rest.
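To make the parsing step concrete, here’s a sketch of what a pattern for Apache’s common/combined log format might look like. It is not the regex from Dr. Drang’s post: the group names time, request and status match what the script uses, but host, user and size are my own guesses, and the log line is made up.

```python
import re

# Assumed pattern for Apache's common/combined log formats; not the
# regex from Dr. Drang's post.
PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

# A made-up log line, for illustration only.
line = ('203.0.113.7 - - [12/Jul/2016:08:15:42 +0100] '
        '"GET /about/ HTTP/1.1" 404 512 "-" "ExampleBot/1.0"')

hit = PATTERN.match(line).groupdict()

# The request is "method page version"; only the page matters here.
page = hit['request'].split()[1]
print(hit['status'], page)
```

A pattern this simple will trip over requests that contain escaped quote characters, so treat it as a starting point rather than a complete parser.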

Like any public-facing computer, my server is crawled by bots looking for unpatched versions of WordPress and PHP. They’re looking for login pages where they can brute force credentials or exploit known vulnerabilities. I don’t have PHP or WordPress installed, so they show up as 404 errors in my logs.

Once I’m happy that I’m not vulnerable to whatever they’re trying to exploit, I add those pages to WONTFIX_404S. On lines 17–18, I ignore any errors from those pages.

The point of writing this script is to find, and fix, broken pages. Once I’ve fixed the page, the hits are still in the historical logs, but they’re less interesting. I’d like to know if the page is still broken in future, but I already know that it was broken in the past.

When I fix a page, I add it to FIXED_404S, a dictionary in which the keys are pages, and the values are the date on which I think I fixed it. On lines 22–32, I throw away any broken pages that I’ve acknowledged and fixed, if they came in before the fix. But then I highlight anything that’s still broken, because it means my fix didn’t work.
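The shape of those two collections, and the date comparison, might look something like this. The entries and dates are hypothetical (the real ones live in the full script), and classify is a made-up helper that just names the four outcomes instead of using continue and print.

```python
from datetime import date, datetime

# Hypothetical entries; the real lists live in the full script.
WONTFIX_404S = {'/wp-login.php'}
FIXED_404S = {'/blog/2013/03/pinboard-backups/': date(2016, 7, 1)}


def classify(page, apache_time):
    """Sort a 404 hit into 'wontfix', 'fixed', 'still-broken' or 'new'."""
    if page in WONTFIX_404S:
        return 'wontfix'
    if page in FIXED_404S:
        # Apache timestamps look like "12/Jul/2016:08:15:42 +0100";
        # drop the UTC offset and keep only the date.
        timestamp, _offset = apache_time.split()
        hit_date = datetime.strptime(timestamp, '%d/%b/%Y:%H:%M:%S').date()
        # Hits on or before the fix date are old news; later ones mean
        # the fix did not work.
        return 'fixed' if hit_date <= FIXED_404S[page] else 'still-broken'
    return 'new'
```

In the script itself, 'wontfix' and 'fixed' correspond to skipping the line, 'still-broken' to printing the warning before counting the hit, and 'new' to counting it without comment.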

Any hit that hasn’t been skipped by now is “interesting”. It’s a 404’d page that I don’t want to ignore, and that I haven’t fixed in the past. I add 1 to the tally of broken pages, and carry on.

I’ve been using the Counter class from the Python standard library to store my tally. I could use a regular dictionary, but Counter helps clean up a little boilerplate. In particular, I don’t have to initialise a new key in the tally – it starts at a default of 0 – and at the end of the script, I can use the most_common() method to see the 404’d pages that are hit most often. That helps me prioritise what pages I want to fix.
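As a small illustration of those two conveniences, here’s Counter run over a made-up list of pages:

```python
import collections

page_tally = collections.Counter()

# A made-up sequence of 404'd pages.
for page in ['/atom.xml', '/robots.txt', '/atom.xml', '/favicon.ico', '/atom.xml']:
    # No need to check whether the key exists: missing keys count as 0.
    page_tally[page] += 1

# most_common(n) returns the n highest counts as (key, count) pairs.
for page, count in page_tally.most_common(2):
    print('%5d\t%s' % (count, page))
```

With a plain dictionary, the loop body would need an extra membership check or a call to setdefault(), and the final ranking would need an explicit sort.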

Here’s a snippet from the output when I first ran the script:

23656   /atom.xml
14161   /robots.txt
 3199   /favicon.ico
 3075   /apple-touch-icon.png
  412   /wp-login.php
  401   /blog/2013/03/pinboard-backups/

Most of the actually broken or missing pages were easy to fix. In ten minutes, I fixed ~90% of the 404 problems that had occurred since I turned on Apache last August.

I don’t know how often I’ll actually run this script. I’ve fixed the most common errors; it’ll be a while before I have enough logs to make it worth doing another round of fixes. But it’s useful to have in my back pocket for a rainy day.

A Python smtplib wrapper for FastMail

Sometimes I want to send email from a Python script on my Mac. Up to now, my approach has been to shell out to osascript, and use AppleScript to tell my mail client to compose and send the message. This is sub-optimal on several levels:

  • It relies on the mail client having up-to-date email config;
  • The compose window briefly pops into view, stealing focus from my main task;
  • Having a Python script shell out to run AppleScript is an ugly hack.

Plus it was a bit buggy and unreliable. Not a great solution.

My needs are fairly basic: I just want to be able to send a message from my email address, with a bit of body text and a subject, and optionally an attachment or two. And I’m only sending messages from one email provider, FastMail.

Since the Python standard library includes smtplib, I decided to give that a try.

After a bit of mucking around, I came up with this wrapper:

from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib

class FastMailSMTP(smtplib.SMTP_SSL):
    """A wrapper for handling SMTP connections to FastMail."""

    def __init__(self, username, password):
        super().__init__('', port=465)
        self.login(username, password)

    def send_message(self, *,
                     from_addr,
                     to_addrs,
                     msg,
                     subject,
                     attachments=None):
        msg_root = MIMEMultipart()
        msg_root['Subject'] = subject
        msg_root['From'] = from_addr
        msg_root['To'] = ', '.join(to_addrs)

        msg_alternative = MIMEMultipart('alternative')
        msg_root.attach(msg_alternative)
        msg_alternative.attach(MIMEText(msg))

        if attachments:
            for attachment in attachments:
                prt = MIMEBase('application', "octet-stream")
                prt.set_payload(open(attachment, "rb").read())
                encoders.encode_base64(prt)
                prt.add_header(
                    'Content-Disposition', 'attachment; filename="%s"'
                    % attachment.replace('"', ''))
                msg_root.attach(prt)

        self.sendmail(from_addr, to_addrs, msg_root.as_string())

Lines 7–12 create a subclass of smtplib.SMTP_SSL, and use the supplied credentials to log into FastMail. Annoyingly, this subclassing is broken on Python 2, because SMTP_SSL is an old-style class, and so super() doesn’t work. I only use Python 3 these days, so that’s okay for me, but you’ll need to change that if you want a backport.

For getting my username/password into the script, I use the keyring module. It gets them from the system keychain, which feels pretty secure. My email credentials are important – I don’t just want to store them in an environment variable or a hard-coded string.
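The lookup itself is a single call to keyring.get_password(), which returns None when nothing is stored. Here’s a minimal sketch of wrapping it; load_credentials and its get_password parameter are hypothetical, and the parameter only exists so that the lookup can be swapped out for a fake when keyring isn’t available.

```python
def load_credentials(service, username, get_password=None):
    """Look up the password for (service, username) and return both.

    `get_password` defaults to keyring.get_password; it is a parameter
    only so the lookup can be replaced, for example in tests.
    """
    if get_password is None:
        import keyring  # third-party module, imported lazily
        get_password = keyring.get_password

    password = get_password(service, username)
    if password is None:
        # keyring returns None when no matching entry is stored.
        raise LookupError('no password stored for %s / %s' % (service, username))
    return username, password
```

A call such as load_credentials('fastmail', 'me@example.com') would then return the pair to hand to the wrapper class; both the service name and the address there are placeholders.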

Lines 14–19 define a convenience wrapper for sending a message. The * in the arguments list denotes the end of positional arguments – all the remaining arguments have to be called as keyword arguments. This is a new feature in Python 3, and I really like it, especially for functions with lots of arguments. It helps enforce clarity in the calling code.
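Here’s a toy function showing the keyword-only behaviour on its own; describe_message is made up purely for the demonstration.

```python
def describe_message(*, subject, to_addrs, attachments=None):
    """Toy function: every argument after the bare * must be passed by name."""
    return '%s -> %s (%d attachments)' % (
        subject, ', '.join(to_addrs), len(attachments or []))


# Passing everything by keyword works as usual.
print(describe_message(subject='Hello', to_addrs=['a@example.com']))

try:
    describe_message('Hello', ['a@example.com'])
except TypeError as err:
    # Positional arguments are rejected outright.
    print('TypeError:', err)
```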

In lines 20–23, I’m setting up a MIME message with my email headers. I deliberately use a multi-part MIME message so that I can add attachments later, if I want.

Then I add the body text. With MIME, you can send multiple versions of the body: a plain text and an HTML version, and the recipient’s client can choose which to display. In practice, I almost always use plaintext email, so that’s all I’ve implemented. If you want HTML, see Stack Overflow.
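For reference, adding an HTML version would mean attaching a second part to the multipart/alternative container. This is a sketch of the idea, not something the wrapper above does: build_body is a hypothetical helper, and the ordering follows the MIME convention that later alternatives are the preferred ones.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_body(plain_text, html=None):
    """Return a multipart/alternative part holding the body text."""
    body = MIMEMultipart('alternative')
    body.attach(MIMEText(plain_text, 'plain'))
    if html is not None:
        # Alternatives are listed in increasing order of preference,
        # so the HTML version goes after the plain text one.
        body.attach(MIMEText(html, 'html'))
    return body
```

The returned part can be attached to the root message in the same way as msg_alternative.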

Then lines 29–37 add the attachments – if there are any. Note that I use None as the default value for the attachments argument, not an empty list – this is to avoid any gotchas around mutable default arguments.
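The gotcha is that a default value is evaluated once, when the function is defined, so a list used as a default is shared between calls. A short demonstration, using two made-up functions:

```python
def add_attachment_buggy(path, attachments=[]):
    # The default list is created once, when the function is defined,
    # and then shared by every call that omits the argument.
    attachments.append(path)
    return attachments


def add_attachment(path, attachments=None):
    # Using None as a sentinel gives each call its own fresh list.
    if attachments is None:
        attachments = []
    attachments.append(path)
    return attachments


print(add_attachment_buggy('a.txt'))
print(add_attachment_buggy('b.txt'))  # still contains 'a.txt' from the first call
print(add_attachment('a.txt'))
print(add_attachment('b.txt'))        # a new list each time
```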

Finally, on line 39, I call the sendmail method from the SMTP class, which actually dispatches the message into the aether.

The nice thing about subclassing the standard SMTP class is that I can use my wrapper class as a drop-in replacement. Like so:

with FastMailSMTP(user, pw) as server:
    server.send_message(from_addr='',
                        to_addrs=['', ''],
                        msg='Hello world from Python!',
                        subject='Sent from smtplib')

I think this is a cleaner interface to email. Mucking about with MIME messages and SMTP is a necessary evil, but I don’t always care about those details. If I’m writing a script where email support is an orthogonal feature, it’s nice to have them abstracted away.
