At last year’s PyCon UK, one of my favourite talks was Aaron Bassett’s session on Python dependency management. He showed us a package called pip-tools, and I’ve been using it ever since.
pip-tools is used to manage your pip dependencies. It allows you to write a top-level summary of the packages you need, for example:
$ cat requirements.in
pytest >= 1.4
requests
Here I want a version of pytest that's at least 1.4, and any version of requests.
Then I run pip-compile, which turns that into a full requirements.txt:
$ cat requirements.txt
certifi==2017.7.27.1 # via requests
chardet==3.0.4 # via requests
idna==2.6 # via requests
py==1.4.34 # via pytest
pytest==3.2.1
requests==2.18.4
urllib3==1.22 # via requests
I can install these dependencies with
pip install -r requirements.txt.
The generated file is pinned: every package has a fixed version. This means that I get the same versions whenever I run
pip install, no matter what new versions have since been released. If you don't pin your dependencies, your package manager may silently install a new version when it's released – and that's an easy way for bugs to sneak in.
Instead, check both files into version control, so you can see exactly when a dependency version was changed. This makes it easier to see if a version bump introduced a bug.
There are also comments to explain why you need a particular package: for example, I’m installing certifi because it’s required by requests.
I’ve been using pip-tools since Aaron’s recommendation, and it’s been really nice. It’s not had an earth-shattering impact on my workflow, but it shaves off a bunch of rough edges. If you do any work with Python, I recommend giving it a look.
For more about pip-tools itself, I recommend Better Package Management by Vincent Driessen, one of the pip-tools authors. This human-readable/pinned-package distinction is coming to vanilla pip in the form of Pipfile, but that was in its infancy last September. pip-tools has been stable for over two years.
Recently, I’ve been trying to push more of my tools inside Docker. Every tool I run in Docker is one less tool I have to install locally, so I can get up-and-running that much faster. Handily, there’s already a Docker image for running pip-tools.
You run it as follows:
$ docker run --volume /path/to/repo:/src --rm micktwomey/pip-tools
It looks for a requirements.in file in
/src, so we mount the repo at that directory — this gives the container the ability to read the file, and to write the generated
requirements.txt back to the host system. I also add the
--rm flag, which cleans up the container after it's finished running.
If you already have Docker, this is a nice way to use pip-tools without installing it locally.
Alongside Docker, I’ve been defining more of my build processes in Makefiles. Having Docker commands is useful, but I don’t want to have to remember all the flags every time I use them. Writing a Makefile gives me shortcuts for common tasks.
This is the Make task I have for updating a requirements.txt file:
requirements.txt: requirements.in
	docker run --volume $(CURDIR):/src --rm micktwomey/pip-tools
	touch requirements.txt
To use it, run make requirements.txt.
The first line specifies the Make target (requirements.txt), and tells Make that it depends on requirements.in. So when the Make task is invoked, it checks that the .in file exists, and then whether the .in file was updated more recently than the .txt file. If yes, the .txt file needs rebuilding; if no, we're up-to-date and there's nothing to do.
The second line runs the Docker command explained above, using the Make variable
$(CURDIR) to get the current directory.
touch ensures that the last modified time of
requirements.txt is always updated. pip-tools only changes the modification time if the dependency pins have changed — touching the file manually means Make knows the task has run, so the "should I run this task?" logic explained above doesn't spin endlessly.
Once I have this Make task, I can invoke it from other tasks — for example, build tasks that install from
requirements.txt — and so it gets run when required, but without an explicit action from me. It's just another step that happens transparently when I run make.
If you’d like to see an example of this in use, check out the Makefile changes in the same patch as this post.
Every week, we have an hour of developer learning at work – maybe a talk, a workshop, or some other session about a topic of interest to developers. Last week, we did a round of lightning talks. We're quite a mixed group of developers – in background, technical stacks, and what we actually work on – so coming up with a topic that's useful to all can be tricky.
For my slot, I decided to wax lyrical about the docopt library. Once upon a time, I was sceptical, but it’s become my go-to library for any sort of command-line interface. Rather than fiddling with argparse and the like, I just write a docopt help string, and the hard work is done for me. I’ve used it in multiple languages, and thought it might be handy for other devs at work. Ergo, this talk.
You can download my slides as a PDF, or read the notes below.
At work, we often pass data around via large files kept in Amazon S3 – XML exports from legacy applications, large log files, JSON dumps of Elasticsearch indexes – that sort of thing. The services that deal with these files run in Docker containers on AWS, and they have limited memory and local storage.
Downloading large files into memory is slow, expensive, and often unnecessary. Many of these files contain a list of records, which we want to process one-at-a-time. We only need to hold a single record in memory at a time, not the whole file.
Python can do efficient line-by-line processing of local files. The following code only reads a line at a time:
with open('very-long-file.txt') as f:
    for line in f:
        do_something(line)  # only one line is held in memory at a time
This is more efficient, and usually results in faster code – but you can only do this for local files, and the only delimiter is a newline. You need a different wrapper if you want to do this for files in S3, or use a different delimiter – and that’s what this module does. It goes like this:
import boto3
from lazyreader import lazyread

s3 = boto3.client('s3')
s3_object = s3.get_object(Bucket='example-bucket', Key='records.txt')
body = s3_object['Body']

for doc in lazyread(body, delimiter=b';'):
    do_something(doc)
The code isn’t especially complicated, just a little fiddly, but I think it’s a useful standalone component.
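To give a sense of the idea (this is a simplified sketch of the approach, not the library's actual code): keep a running buffer, read the handle in fixed-size chunks, and yield a record whenever the delimiter shows up.

```python
def lazyread_sketch(f, delimiter):
    """Read `f` in small chunks, yielding one delimiter-terminated
    record at a time.

    Works on anything with a bytes-returning read() method, including
    the streaming body of an S3 object.
    """
    running = b''
    while True:
        new_data = f.read(1024)
        if not new_data:
            # End of the stream: yield whatever is left in the buffer
            if running:
                yield running
            return
        running += new_data
        while delimiter in running:
            record, running = running.split(delimiter, 1)
            yield record + delimiter
```

Only one chunk plus the current record is ever held in memory, however large the underlying file.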
I was mildly surprised that something like this doesn’t already exist, or if it does, I couldn’t find the right search terms! If you know an existing module that does this, please let me know.
You can install lazyreader from PyPI (
pip install lazyreader), or see the README for more details.
A lot of my recent work has involved batch processing on files stored in Amazon S3. It’s been very useful to have a list of files (or rather, keys) in the S3 bucket – for example, to get an idea of how many files there are to process, or whether they follow a particular naming scheme.
The AWS APIs (via boto3) do provide a way to get this information, but API calls are paginated and don't expose key names directly. It's a bit fiddly, and I don't generally care about the details of the AWS APIs when using this list – so I wrote a wrapper function to do it for me, which hides all the messiness of dealing with the S3 API.
Since this function has been useful in lots of places, I thought it would be worth writing it up properly.
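The full write-up is in the linked post, but the shape of the wrapper is roughly this – a generator over boto3's paginator interface (the function name and parameters here are my own sketch, not necessarily the final version):

```python
def get_s3_keys(bucket, prefix='', client=None):
    """Generate the keys in an S3 bucket, hiding the API pagination.

    Each ListObjectsV2 call returns at most 1000 keys; the paginator
    fetches successive pages lazily as the generator is consumed.
    """
    if client is None:
        import boto3  # deferred, so the sketch can be tested without AWS
        client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # A page with no matching keys has no 'Contents' entry at all
        for obj in page.get('Contents', []):
            yield obj['Key']
```

Callers just iterate over key names, and never see the pagination:

for key in get_s3_keys('example-bucket', prefix='logs/'):
    print(key)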
I write a lot of content in Markdown (including all the posts on this site), and I use Python-Markdown to render it as HTML. One of Python-Markdown’s features is an Extensions API. The package provides some extensions for common tasks – abbreviations, footnotes, tables and so on – but you can also write your own extensions if you need something more specialised.
After years of just using the builtin extensions, I’ve finally started to dip my toe into custom extensions. In this post, I’m going to show you a few of my tweaked or custom extensions.
About three days ago, there was an extended outage at Instapaper. Luckily, it seems like there wasn’t any permanent data loss – everybody’s bookmarks are still safe – but this sort of incident can make you worry.
I have a Python script that backs up my Instapaper bookmarks on a regular basis, so I was never worried about data loss. At worst, I’d have lost an hour or so of changes – fairly minor, in the grand scheme of things. I’ve been meaning to tidy it up and share it for a while, and this outage prompted me to get on and finish that. You can find the script and the installation instructions on GitHub.
Last year, I started using Goodreads to track my reading. (I’m alexwlchan if you want to follow me.) In the past, I’ve had a couple of hand-rolled systems for recording my books, but maintaining them often became a distraction from actually reading!
Using Goodreads is quite a bit simpler, but it means my book data is stored on somebody else’s servers. What if Goodreads goes away? I don’t want to lose that data, particularly because I’m trying to be better about writing some notes after I finish a book.
There is an export function on Goodreads, but it has to be invoked by hand. I prefer backup tools that can be run automatically – typically a script run by a cron job – because then they run on a schedule, and I know my data is safe.
That’s exactly what I’ve done for Goodreads: I’ve written a Python script that uses the Goodreads API to grab the same information as provided by the builtin export. I have this configured to run once a day, and now I have daily backups of my Goodreads data. You can find the script and installation instructions on GitHub.
This was a fun opportunity to play with the ElementTree module (normally I work with JSON), and also a reminder that the lack of
yield from is the feature I miss most when I have to write Python 2.
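For anyone who hasn't met it, ElementTree makes walking an XML response quite pleasant. A sketch with a made-up fragment – the real Goodreads responses have many more fields, but the parsing code has this shape:

```python
import xml.etree.ElementTree as ET

# An invented fragment, shaped like a Goodreads API review response
XML = """
<GoodreadsResponse>
  <reviews>
    <review>
      <book><title>The Dispossessed</title></book>
      <rating>5</rating>
    </review>
  </reviews>
</GoodreadsResponse>
"""

def parse_reviews(xml_string):
    root = ET.fromstring(xml_string)
    # iter() walks the whole tree; find() takes a simple path expression
    return [
        (review.find('book/title').text, int(review.find('rating').text))
        for review in root.iter('review')
    ]
```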
In my last post, I talked about some work I’d been doing to scrape data from AO3 using Python. I haven’t made any more progress, but I’ve tidied up what I had and posted it to GitHub.
Currently this gives you a way to get metadata about works (word count, title, author, that sort of thing), along with your complete reading history. This latter is particularly interesting because it allows you to get a complete list of works where you’ve left kudos.
Instructions are in the README, and you can install it from PyPI (
pip install ao3).
I’m not actively working on this (I have what I need for now), but this code might be useful for somebody else. Enjoy!
Recently, I’ve been writing some scripts that need to get data from AO3. Unfortunately, AO3 doesn’t have an API (although it’s apparently on the roadmap), so you have to do everything by scraping pages and parsing HTML. A bit yucky, but it can be made to work.
You can get to a lot of pages without having an AO3 account – which includes most of the fic. If you want to get data from those pages, you can use any HTTP client to download the HTML, then parse or munge it as much as you like. For example, in Python:
import requests

req = requests.get('http://archiveofourown.org/works/9079264')
print(req.text)  # prints the page's HTML
I have a script that takes this HTML, and which can extract metadata like word count and pairings. (I use that to auto-tag my bookmarks on Pinboard, because I’m lazy that way.)
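The extraction is mostly pattern-matching on AO3's markup. As a minimal sketch (the class names below mirror AO3's stats list at the time of writing, but they're not a stable interface – a real script should use a proper HTML parser rather than a regex):

```python
import re

def extract_stats(html):
    """Pull a few numeric stats out of an AO3 work page.

    Illustrative only: assumes values appear in tags like
    <dd class="words">104,277</dd>, per AO3's current markup.
    """
    stats = {}
    for field in ('words', 'kudos', 'hits'):
        m = re.search(r'<dd class="%s">([\d,]+)</dd>' % field, html)
        if m:
            stats[field] = int(m.group(1).replace(',', ''))
    return stats
```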
But there are some pages that require you to be logged in to an account. For example, AO3 can track your reading history across the site. If you try to access a private page with the approach above, you’ll just get an error message:
Sorry, you don’t have permission to access the page you were trying to reach. Please log in.
Wouldn’t it be nice if you could access those pages in a script as well?
I’ve struggled with this for a while, and I had some hacky workarounds, but nothing very good. Tonight, I found quite a neat solution that seems much more reliable.
For this to work, you need an HTTP client that doesn’t just do one-shot requests. You really want to make two requests: one to log you in, another for the page you actually want. You need to persist some login state from the first request to the second, so that AO3 remembers us on the second request. Normally, this state is managed by your browser: in Python, we can do the same thing with sessions.
After a bit of poking at the AO3 login form, I’ve got the following code that seems to work:
import requests

USERNAME = 'your-username'
PASSWORD = 'your-password'

sess = requests.Session()

# Log in to AO3. The form field names below come from the AO3 login form
# as it existed when this was written; they may change if the site does.
sess.post('https://archiveofourown.org/user_sessions', params={
    'user_session[login]': USERNAME,
    'user_session[password]': PASSWORD,
})

# Fetch my private reading history
req = sess.get('https://archiveofourown.org/users/%s/readings' % USERNAME)
Where previously this would return an error page, now I get my reading history. There’s more work to parse this into usable data, but we’re past my previous stumbling block.
I think this is a useful milestone, and could form the basis for a Python-based AO3 API. I’ve thought about writing such a library in the past, but it’s a bit limited if you can’t log in. With that restriction lifted, there’s a lot more you can potentially do.
I have a few ideas about what to do next, but I don’t have much free time coming up. I’m not promising anything – but you might want to watch this space.
I’ve just pushed a small tool to PyPI for backing up message history from Slack. It downloads your message history as a collection of JSON files, including public/private channels and DM threads.
This is mainly scratching my own itch: I don’t like having my data tied up in somebody’s proprietary system. Luckily, Slack provides an API that lets you get this data out into a plaintext form. This allows me to correct what I see as two deficiencies in the data exports provided by Slack:
- They only back up public channels, not private channels or direct messages.
- They’re only available to team admins, not individual users.
To get it, pip install slack_history, then run
slack_history --help for usage instructions.