Tagged with “python”


Downloading logs from Amazon CloudWatch

At work, we use Amazon CloudWatch for logging in our applications. All our logs are sent to CloudWatch, and you can browse them in the AWS Console. The web console is fine for one-off use, but if I want to do in-depth analysis of the log, nothing beats a massive log file. I’m very used to tools like grep, awk and tr, and I’m more productive using those than trying to wrangle a web interface.

So I set out to write a Python script to download all of my CloudWatch logs into a single file. The AWS SDKs give you access to CloudWatch logs, so this seems like it should be possible. There are other tools for doing this (for example, I found awslogs after I was done) — but sometimes it can be instructive to reinvent something from scratch.

In this post, I’ll explain how I wrote this script, starting from nothing and showing how I built it up. It’s also a nice chance to illustrate several libraries I use a lot (boto3, docopt and maya). If you just want the code, skip to the end of the post.
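
As a preview, here’s a minimal sketch of the core approach (not the finished script from the post): boto3’s CloudWatch Logs client has a paginator for filter_log_events, so you can walk every event in a log group and write the messages to a file. The log group name and output filename here are placeholders.

import boto3

def get_log_events(log_group):
    # Generate every event in a CloudWatch log group, page by page
    client = boto3.client('logs')
    paginator = client.get_paginator('filter_log_events')
    for page in paginator.paginate(logGroupName=log_group):
        for event in page['events']:
            yield event

if __name__ == '__main__':
    # 'my-log-group' and 'cloudwatch.log' are placeholders; use your own names
    with open('cloudwatch.log', 'w') as outfile:
        for event in get_log_events('my-log-group'):
            outfile.write(event['message'].rstrip() + '\n')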

Read more →


Using hooks for custom behaviour in requests

Recently I’ve been writing a lot of scripts with python-requests to interact with a new API. It starts off with a simple GET request:

resp = requests.get('http://example.com/api/v1/assets', params={...})

I want to make sure that the request succeeded before I carry on, so I throw an exception if I get an error response:

resp = requests.get('http://example.com/api/v1/assets', params={...})
resp.raise_for_status()

If I get an error, the server response may contain useful debugging information, so let’s log that as well (and actually, logging it might be generally useful):

resp = requests.get('http://example.com/api/v1/assets', params={...})

try:
    resp.raise_for_status()
except requests.HTTPError:
    logger.error('Received error %s', resp.text)
    raise
else:
    logger.debug('Received response %s', resp.text)

And depending on the API, I may want even more checks or logging. For example, some APIs always return an HTTP 200 OK, but embed the real response code in the JSON body. Or maybe I want to log the URL I requested.

If I’m making lots of calls to the same API, repeating this code gets quite tedious. Previously I would have wrapped requests.get in a helper function, but that relies on me remembering to use the wrapper.

It turns out there’s a better way — today I learnt that requests has a hook mechanism that allows you to provide functions that are called after every response. In this post, I’ll show you some simple examples of hooks that I’m already using to clean up my code.
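
As a taster, here’s a minimal sketch of the idea (the names are mine, just for illustration): you register a callable against the ‘response’ event on a Session, and requests calls it with every response it receives.

import logging

import requests

logger = logging.getLogger(__name__)

def check_for_errors(resp, *args, **kwargs):
    # Response hook: log the body, then raise if we got an HTTP error
    if resp.ok:
        logger.debug('Received response %s', resp.text)
    else:
        logger.error('Received error %s', resp.text)
    resp.raise_for_status()

session = requests.Session()
session.hooks['response'].append(check_for_errors)

resp = session.get('http://example.com/api/v1/assets')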

Read more →


Using pip-tools to manage my Python dependencies

At last year’s PyCon UK, one of my favourite talks was Aaron Bassett’s session on Python dependency management. He showed us a package called pip-tools, and I’ve been using it ever since.

pip-tools is used to manage your pip dependencies. It allows you to write a top-level summary of the packages you need, for example:

$ cat requirements.in
pytest >= 1.4
requests

Here I want a version of pytest that’s at least 1.4, and any version of requests.

Then I run pip-compile, which turns that into a full requirements.txt:

$ pip-compile
$ cat requirements.txt
certifi==2017.7.27.1      # via requests
chardet==3.0.4            # via requests
idna==2.6                 # via requests
py==1.4.34                # via pytest
pytest==3.2.2
requests==2.18.4
urllib3==1.22             # via requests

I can install these dependencies with pip install -r requirements.txt.

The generated file is pinned: every package has a fixed version. This means I get the same versions whenever I run pip install, even if newer versions have since been released. If you don’t pin your dependencies, your package manager may silently install a new version when it’s released – and that’s an easy way for bugs to sneak in.

Check both files into version control, so you can see exactly when a dependency version was changed. That makes it easier to tell whether a version bump introduced a bug.

There are also comments to explain why you need a particular package: for example, I’m installing certifi because it’s required by requests.

I’ve been using pip-tools since Aaron’s recommendation, and it’s been really nice. It’s not had an earth-shattering impact on my workflow, but it shaves off a bunch of rough edges. If you do any work with Python, I recommend giving it a look.

For more about pip-tools itself, I recommend Better Package Management by Vincent Driessen, one of the pip-tools authors. The same human-readable/pinned-package distinction is coming to vanilla pip in the form of Pipfile, but that was still in its infancy last September; pip-tools has been stable for over two years.


Recently, I’ve been trying to push more of my tools inside Docker. Every tool I run in Docker is one less tool I have to install locally, so I can get up-and-running that much faster. Handily, there’s already a Docker image for running pip-tools.

You run it as follows:

$ docker run --volume /path/to/repo:/src --rm micktwomey/pip-tools

It looks for a requirements.in in /src, so we mount the repo at that path – this lets the container read the file, and write the generated requirements.txt back to the host filesystem. I also add the --rm flag, which cleans up the container after it’s finished running.

If you already have Docker, this is a nice way to use pip-tools without installing it locally.


Alongside Docker, I’ve been defining more of my build processes in Makefiles. Having Docker commands is useful, but I don’t want to have to remember all the flags every time I use them. Writing a Makefile gives me shortcuts for common tasks.

This is the Make task I have for updating a requirements.txt:

requirements.txt: requirements.in
    docker run --volume $(CURDIR):/src --rm micktwomey/pip-tools
    touch requirements.txt

To use it, run make requirements.txt.

The first line specifies the Make target (requirements.txt), and tells Make that it depends on requirements.in. So when the Make task is invoked, it checks that the .in file exists, and then whether the .in file was updated more recently than .txt. If yes — the .txt file needs rebuilding. If no — we’re up-to-date, there’s nothing to do.

The second line runs the Docker command explained above, using the Make variable $(CURDIR) to get the current directory.

Finally, touch ensures that the last modified time of requirements.txt is always updated. pip-tools only changes the modification time if the dependency pins have changed – I update it manually so that make knows the task has run, and the “should I run this task?” check described above doesn’t decide the file is perpetually out of date.

Once I have this Make task, I can invoke it from other tasks — for example, build tasks that install from requirements.txt — and so it gets run when required, but without an explicit action from me. It’s just another step that happens transparently when I run make build.

If you’d like to see an example of this in use, check out the Makefile changes in the same patch as this post.


Ode to docopt

Every week, we have an hour of developer learning at work – maybe a talk, a workshop, or some other session on a topic of interest to developers. Last week, we did a round of lightning talks. We’re quite a mixed group of developers – in background, technical stacks, and what we actually work on – so coming up with a topic that’s useful to everyone can be tricky.

For my slot, I decided to wax lyrical about the docopt library. Once upon a time, I was sceptical, but it’s become my go-to library for any sort of command-line interface. Rather than fiddling with argparse and the like, I just write a docopt help string, and the hard work is done for me. I’ve used it in multiple languages, and thought it might be handy for other devs at work. Ergo, this talk.
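
To give a flavour of what that looks like, here’s a toy example (mine, not from the talk): the entire command-line interface lives in the module docstring, and docopt parses sys.argv against it.

"""greet.py - a toy example of a docopt interface.

Usage:
  greet.py <name> [--shout]
  greet.py (-h | --help)

Options:
  -h --help   Show this screen.
  --shout     Print the greeting in capitals.
"""

import docopt

if __name__ == '__main__':
    args = docopt.docopt(__doc__)
    greeting = 'Hello, %s!' % args['<name>']
    if args['--shout']:
        greeting = greeting.upper()
    print(greeting)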

You can download my slides as a PDF, or read the notes below.

Read more →


A Python module for lazy reading of file objects

At work, we often pass data around via large files kept in Amazon S3 – XML exports from legacy applications, large log files, JSON dumps of Elasticsearch indexes – that sort of thing. The services that deal with these files run in Docker containers on AWS, and they have limited memory and local storage.

Downloading large files into memory is slow, expensive, and often unnecessary. Many of these files contain a list of records, which we want to process one-at-a-time. We only need to hold a single record in memory at a time, not the whole file.

Python can do efficient line-by-line processing of local files. The following code only reads a line at a time:

with open('very-long-file.txt') as f:
    for line in f:
        do_stuff_with(line)

This is more efficient, and usually results in faster code – but you can only do this for local files, and the only delimiter is a newline. You need a different wrapper if you want to do this for files in S3, or use a different delimiter – and that’s what this module does. It goes like this:

import boto3
from lazyreader import lazyread

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket='example-bucket', Key='records.txt')
body = s3_object['Body']

for doc in lazyread(body, delimiter=b';'):
    print(doc)

The code isn’t especially complicated, just a little fiddly, but I think it’s a useful standalone component.
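
The core idea is roughly this (a simplified sketch of my own, not the exact code in lazyreader): read the file object in fixed-size chunks, keep a buffer, and yield everything up to and including each delimiter as you find it.

def lazyread_sketch(f, delimiter):
    # Yield delimited records from a file-like object, reading lazily.
    # A simplified illustration; see lazyreader for the real thing.
    buf = b''
    while True:
        new_data = f.read(1024)
        if not new_data:
            # End of the stream: yield whatever is left in the buffer
            if buf:
                yield buf
            return
        buf += new_data
        while delimiter in buf:
            record, _, buf = buf.partition(delimiter)
            yield record + delimiter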

I was mildly surprised that something like this doesn’t already exist, or if it does, I couldn’t find the right search terms! If you know an existing module that does this, please let me know.

You can install lazyreader from PyPI (pip install lazyreader), or see the README for more details.


Listing keys in an S3 bucket with Python

A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.

The AWS APIs (via boto3) do provide a way to get this information, but the calls are paginated, and you have to dig the key names out of each page of results. It’s a bit fiddly, and I don’t generally care about those details when I just want a list of keys – so I wrote a wrapper function to do it for me. All the messiness of dealing with the S3 API is hidden away from everyday use.
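
To give you an idea of the shape of it, here’s a minimal sketch of the sort of wrapper I mean, using boto3’s built-in paginator (the get_keys name is just for illustration):

import boto3

def get_keys(bucket):
    # Generate every key in an S3 bucket, handling pagination for you
    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        for s3_object in page.get('Contents', []):
            yield s3_object['Key']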

Since this function has been useful in lots of places, I thought it would be worth writing it up properly.

Read more →


A few examples of extensions in Python-Markdown

I write a lot of content in Markdown (including all the posts on this site), and I use Python-Markdown to render it as HTML. One of Python-Markdown’s features is an Extensions API. The package provides some extensions for common tasks – abbreviations, footnotes, tables and so on – but you can also write your own extensions if you need something more specialised.
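
For example, using one of the builtin extensions is just a matter of naming it when you call markdown, something like this (a quick illustration of my own, not one of the custom extensions from this post):

import markdown

text = '''
A table, which vanilla Markdown has no syntax for:

First name | Last name
---------- | ---------
Ada        | Lovelace
'''

# Without the tables extension this renders as a plain paragraph;
# with it, Python-Markdown produces a proper <table>.
html = markdown.markdown(text, extensions=['markdown.extensions.tables'])
print(html)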

After years of just using the builtin extensions, I’ve finally started to dip my toe into custom extensions. In this post, I’m going to show you a few of my tweaked or custom extensions.

Read more →


A script for backing up your Instapaper bookmarks

About three days ago, there was an extended outage at Instapaper. Luckily, it seems like there wasn’t any permanent data loss – everybody’s bookmarks are still safe – but this sort of incident can make you worry.

I have a Python script that backs up my Instapaper bookmarks on a regular basis, so I was never worried about data loss. At worst, I’d have lost an hour or so of changes – fairly minor, in the grand scheme of things. I’ve been meaning to tidy it up and share it for a while, and this outage prompted me to get on and finish that. You can find the script and the installation instructions on GitHub.


A script for backing up your Goodreads reviews

Last year, I started using Goodreads to track my reading. (I’m alexwlchan if you want to follow me.) In the past, I’ve had a couple of hand-rolled systems for recording my books, but maintaining them often became a distraction from actually reading!

Using Goodreads is quite a bit simpler, but it means my book data is stored on somebody else’s servers. What if Goodreads goes away? I don’t want to lose that data, particularly because I’m trying to be better about writing some notes after I finish a book.

There is an export function on Goodreads, but it has to be invoked by hand. I prefer backup tools that run automatically: I can set them going on a schedule, and I know my data is safe. In practice, that usually means a script and a cron job.

That’s exactly what I’ve done for Goodreads: I’ve written a Python script that uses the Goodreads API to grab the same information as provided by the builtin export. I have this configured to run once a day, and now I have daily backups of my Goodreads data. You can find the script and installation instructions on GitHub.

This was a fun opportunity to play with the ElementTree module (normally I work with JSON), and also a reminder that the lack of yield from is the thing I miss most when writing Python 2.


A Python interface to AO3

In my last post, I talked about some work I’d been doing to scrape data from AO3 using Python. I haven’t made any more progress, but I’ve tidied up what I had and posted it to GitHub.

Currently this gives you a way to get metadata about works (word count, title, author, that sort of thing), along with your complete reading history. The latter is particularly interesting, because it allows you to get a complete list of works where you’ve left kudos.

Instructions are in the README, and you can install it from PyPI (pip install ao3).

I’m not actively working on this (I have what I need for now), but this code might be useful for somebody else. Enjoy!