Tagged with “python”


Listing keys in an S3 bucket with Python

A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.

The AWS APIs (via boto3) do provide a way to get this information, but API calls are paginated and don’t expose key names directly. It’s a bit fiddly, and I don’t generally care about the details of the AWS APIs when using this list – so I wrote a wrapper function to do it for me. All the messiness of dealing with the S3 API is hidden in general use.

Since this function has been useful in lots of places, I thought it would be worth writing it up properly.
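
To give a flavour of what such a wrapper can look like, here is a minimal sketch rather than the function from the full post. It assumes the caller passes in a boto3 S3 client (for example, boto3.client("s3")); the name iter_keys and its parameters are made up for illustration. It leans on boto3's list_objects_v2 paginator, so the caller only ever sees key names.

```python
def iter_keys(s3_client, bucket, prefix=""):
    """Yield every key in an S3 bucket, hiding the pagination.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching objects has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

Because it is a generator, something like sum(1 for _ in iter_keys(client, "my-bucket")) can count the keys without holding them all in memory (the bucket name is a placeholder).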

Read more →


A few examples of extensions in Python-Markdown

I write a lot of content in Markdown (including all the posts on this site), and I use Python-Markdown to render it as HTML. One of Python-Markdown’s features is an Extensions API. The package provides some extensions for common tasks – abbreviations, footnotes, tables and so on – but you can also write your own extensions if you need something more specialised.

After years of just using the builtin extensions, I’ve finally started to dip my toe into custom extensions. In this post, I’m going to show you a few of my tweaked or custom extensions.

Read more →


A script for backing up your Instapaper bookmarks

About three days ago, there was an extended outage at Instapaper. Luckily, it seems like there wasn’t any permanent data loss – everybody’s bookmarks are still safe – but this sort of incident can make you worry.

I have a Python script that backs up my Instapaper bookmarks on a regular basis, so I was never worried about data loss. At worst, I’d have lost an hour or so of changes – fairly minor, in the grand scheme of things. I’ve been meaning to tidy it up and share it for a while, and this outage prompted me to get on and finish that. You can find the script and the installation instructions on GitHub.


A script for backing up your Goodreads reviews

Last year, I started using Goodreads to track my reading. (I’m alexwlchan if you want to follow me.) In the past, I’ve had a couple of hand-rolled systems for recording my books, but maintaining them often became a distraction from actually reading!

Using Goodreads is quite a bit simpler, but it means my book data is stored on somebody else’s servers. What if Goodreads goes away? I don’t want to lose that data, particularly because I’m trying to be better about writing some notes after I finish a book.

There is an export function on Goodreads, but it has to be invoked by hand. I prefer to have backup tools that can be run automatically: I can set them to run on a schedule, and I know my data is safe. This tends to be a script or a cron job.

That’s exactly what I’ve done for Goodreads: I’ve written a Python script that uses the Goodreads API to grab the same information as provided by the builtin export. I have this configured to run once a day, and now I have daily backups of my Goodreads data. You can find the script and installation instructions on GitHub.

This was a fun opportunity to play with the ElementTree module (normally I work with JSON), and also a reminder that the lack of yield from has become my most disliked feature in Python 2.
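
As an illustration of both points, here is a small sketch. The XML is invented for the example and is not the real Goodreads API response, and the function names are hypothetical. It parses each page with ElementTree and uses yield from (Python 3.3 and later) to chain the pages together; in Python 2 that last line has to be spelled out as a loop.

```python
import xml.etree.ElementTree as ET

# Invented sample pages; the real Goodreads XML is shaped differently.
PAGES = [
    "<reviews><review><title>Book A</title><rating>4</rating></review></reviews>",
    "<reviews><review><title>Book B</title><rating>5</rating></review></reviews>",
]


def reviews_on_page(xml_string):
    """Yield a dict for each <review> element on one page."""
    root = ET.fromstring(xml_string)
    for review in root.findall("review"):
        yield {
            "title": review.findtext("title"),
            "rating": int(review.findtext("rating")),
        }


def all_reviews(pages):
    """Yield the reviews from every page in turn."""
    for page in pages:
        # Python 2 equivalent: for r in reviews_on_page(page): yield r
        yield from reviews_on_page(page)
```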


A Python interface to AO3

In my last post, I talked about some work I’d been doing to scrape data from AO3 using Python. I haven’t made any more progress, but I’ve tidied up what I had and posted it to GitHub.

Currently this gives you a way to get metadata about works (word count, title, author, that sort of thing), along with your complete reading history. This latter is particularly interesting because it allows you to get a complete list of works where you’ve left kudos.

Instructions are in the README, and you can install it from PyPI (pip install ao3).

I’m not actively working on this (I have what I need for now), but this code might be useful for somebody else. Enjoy!


Experiments with AO3 and Python

Recently, I’ve been writing some scripts that need to get data from AO3[1]. Unfortunately, AO3 doesn’t have an API (although it’s apparently on the roadmap), so you have to do everything by scraping pages and parsing HTML. A bit yucky, but it can be made to work.

You can get to a lot of pages without having an AO3 account – which includes most of the fic. If you want to get data from those pages, you can use any HTTP client to download the HTML, then parse or munge it as much as you like. For example, in Python:

import requests

req = requests.get('http://archiveofourown.org/works/9079264')
print(req.text)  # Prints the page's HTML

I have a script that takes this HTML, and which can extract metadata like word count and pairings. (I use that to auto-tag my bookmarks on Pinboard, because I’m lazy that way.)

But there are some pages that require you to be logged in to an account. For example, AO3 can track your reading history across the site. If you try to access a private page with the approach above, you’ll just get an error message:

Sorry, you don't have permission to access the page you were trying to reach. Please log in.

Wouldn’t it be nice if you could access those pages in a script as well?

I’ve struggled with this for a while, and I had some hacky workarounds, but nothing very good. Tonight, I found quite a neat solution that seems much more reliable.

For this to work, you need an HTTP client that doesn’t just do one-shot requests. You really want to make two requests: one to log you in, another for the page you actually want. You need to persist some login state (cookies) from the first request to the second, so that AO3 remembers you on the second request. Normally, this state is managed by your browser; in Python, you can do the same thing with sessions.

After a bit of poking at the AO3 login form, I’ve got the following code that seems to work:

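import requests

# USERNAME and PASSWORD are your AO3 credentials, defined elsewhere
sess = requests.Session()

# Log in to AO3. The credentials go in the form body (data=) over HTTPS,
# rather than in the URL's query string.
sess.post('https://archiveofourown.org/user_sessions', data={
    'user_session[login]': USERNAME,
    'user_session[password]': PASSWORD,
})

# Fetch my private reading history, reusing the cookies from the login
req = sess.get('https://archiveofourown.org/users/%s/readings' % USERNAME)
print(req.text)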

Where previously this would return an error page, now I get my reading history. There’s more work to parse this into usable data, but we’re past my previous stumbling block.
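
One thing the snippet above doesn't do is check that the login actually worked. A rough way to do that, assuming AO3 keeps showing the error message quoted earlier on pages you aren't allowed to see, is to look for part of that message in the response. The helper name is made up for this sketch.

```python
# Part of the error text AO3 showed for private pages at the time of writing.
# Matching on it is an assumption, and will break if the wording changes.
# The apostrophe in "don't" is left out of the match because it may be
# HTML-escaped in the page source.
LOGIN_ERROR = "permission to access the page you were trying to reach"


def needs_login(html):
    """Return True if the page looks like AO3's 'please log in' error."""
    return LOGIN_ERROR in html
```

After the sess.get() call, a line such as "if needs_login(req.text): raise RuntimeError('AO3 login failed')" would stop a script from silently parsing the error page.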

I think this is a useful milestone, and could form the basis for a Python-based AO3 API. I’ve thought about writing such a library in the past, but it’s a bit limited if you can’t log in. With that restriction lifted, there’s a lot more you can potentially do.

I have a few ideas about what to do next, but I don’t have much free time coming up. I’m not promising anything – but you might want to watch this space.


  1. Non-fannish types: AO3 is the Archive of Our Own, a popular website for sharing fanfiction.


A tool for backing up your message history from Slack

I’ve just pushed a small tool to PyPI for backing up message history from Slack. It downloads your message history as a collection of JSON files, including public/private channels and DM threads.

This is mainly scratching my own itch: I don’t like having my data tied up in somebody’s proprietary system. Luckily, Slack provides an API that lets you get this data out into a plaintext form. This allows me to correct what I see as two deficiencies in the data exports provided by Slack:

Installation is pip install slack_history, then run slack_history --help for usage instructions.

Enjoy!


Another example of why strings are terrible

Here’s a programming assumption I used to make, that until today I’d never really thought about: changing the case of a string won’t change its length.

Now, thanks to Hypothesis, I know better:

>>> x = u'İ'
>>> len(x)
1
>>> len(x.lower())
2

I’m not going to pretend I understand enough about Unicode or Python’s string handling to say what’s going on here.

I discovered this while testing a moderately fiddly normalisation routine – this routine would normalise the string to lowercase, unexpectedly tripping a check that it was the right length. If you’d like to see this for yourself, here’s a minimal example:

from hypothesis import given, strategies as st

@given(st.text())
def test_changing_case_preserves_length(xs):
    assert len(xs) == len(xs.lower())

Update, 2 December 2016: David MacIver asked whether this affects Python 2, 3, or both, which I forgot to mention. The behaviour is different: Python 2 lowercases İ to an ASCII i, whereas Python 3 lowercases it to an i followed by a separate combining dot above (U+0307), which renders as i̇.

This means that only Python 3 has the bug where the length changes under case folding (whereas Python 2 commits a different sin of throwing away information).

Cory Benfield pointed out that the Unicode standard has explicit character mappings that add or remove characters when changing case, and highlighted a nice example in the other direction: when you uppercase the German Eszett (ß), you replace it with a double-S.

Finally, Rob Wells wrote a follow-on post that explains this problem in more detail. He also points out the potential confusion of len(): should it count visible characters, or Unicode code points? The Swift String API does a rather good job here: if you haven’t used it, check out Apple’s introductory blog post.
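
To see which code points are involved in the two examples above, this sketch (Python 3) uses the standard unicodedata module to name each character before and after the case change. The helper name is made up for the example.

```python
import unicodedata


def code_point_names(s):
    """Return the Unicode name of each code point in s."""
    return [unicodedata.name(c) for c in s]


dotted_i = "\u0130"  # İ
print(code_point_names(dotted_i))
# ['LATIN CAPITAL LETTER I WITH DOT ABOVE']
print(code_point_names(dotted_i.lower()))
# ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

eszett = "\u00df"  # ß
print(eszett.upper())
# SS
```

So len() is counting code points here: lowercasing turns one code point into two, even though it still displays as a single character.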


Use keyring to store your credentials

I write a lot of Python scripts that interact with online services, which usually means they need my passwords and API keys. But how to store them?

The simplest approach would be to save my password as a variable in my unencrypted source code:

PASSWORD = 'password!'

This is a terrible idea. Don’t do this.

This password is now trivially accessible to anybody who has access to the source code. If I ever want to share my code (and I often do), I have to remember to carefully scrub it of sensitive information. If I use a version control system like Git, the password is permanently baked into the history of the repository.

So what’s the alternative? If I don’t want to put secrets directly in the source code, how can I make them available at runtime? I use the keyring module.
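
As a taste of what that looks like, here is a minimal sketch built on keyring's get_password function. The function name and the backend parameter are additions for this sketch (the parameter lets the lookup be exercised without touching the real system keychain); they are not part of keyring itself.

```python
def load_password(service, username, backend=None):
    """Fetch a password from the system keychain via keyring.

    backend defaults to the keyring module; anything with a matching
    get_password(service, username) method can be swapped in.
    """
    if backend is None:
        import keyring  # third-party: pip install keyring
        backend = keyring

    # keyring.get_password returns None when nothing is stored.
    password = backend.get_password(service, username)
    if password is None:
        raise LookupError("No password stored for %s/%s" % (service, username))
    return password
```

The secret is stored once, from an interactive session, with keyring.set_password("example-service", "alice", the_password). After that, a script can call load_password("example-service", "alice") without the password ever appearing in the source. The service and account names here are placeholders.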

Read more →


Creating low contrast wallpapers with Pillow

In my last post, I explained how I’d been using Pillow to draw regular tilings of the plane. What I was actually trying to do was get some new desktop wallpapers, and getting to use a new Python library was just a nice bonus.

A while back, the Code Review Stack Exchange got a fresh design that featured, among other things, a low-contrast background of coloured squares:

I was quite keen on the effect, and wanted to use it as my desktop wallpaper, albeit in different colours. I like using low contrast wallpapers, and this was a nice pattern to try to mimic. My usual work is entirely text-based; this was a fun way to dip my toe into the world of graphics. And a few hours of Python later, I could generate these patterns in arbitrary colours:

In this post, I’ll explain how I went from having a tiling of the plane, to generating these wallpapers in arbitrary colours.

Read more →