In my last post, I talked about some work I’d been doing to scrape data from AO3 using Python. I haven’t made any more progress, but I’ve tidied up what I had and posted it to GitHub.
Currently this gives you a way to get metadata about works (word count, title, author, that sort of thing), along with your complete reading history. The latter is particularly interesting because it lets you get a complete list of works where you’ve left kudos.
Instructions are in the README, and you can install it from PyPI (pip install ao3).
I’m not actively working on this (I have what I need for now), but this code might be useful for somebody else. Enjoy!
Recently, I’ve been writing some scripts that need to get data from AO3. Unfortunately, AO3 doesn’t have an API (although it’s apparently on the roadmap), so you have to do everything by scraping pages and parsing HTML. A bit yucky, but it can be made to work.
You can get to a lot of pages without having an AO3 account – which includes most of the fic. If you want to get data from those pages, you can use any HTTP client to download the HTML, then parse or munge it as much as you like. For example, in Python:
import requests

req = requests.get('http://archiveofourown.org/works/9079264')
print(req.text)  # Prints the page's HTML
I have a script that takes this HTML and extracts metadata like word count and pairings. (I use that to auto-tag my bookmarks on Pinboard, because I’m lazy that way.)
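To give a flavour of what that extraction can look like, here is a minimal sketch that pulls the word count out of a page using only the standard library. It is not the script mentioned above. It assumes the word count is rendered as a dd element with the class "words" (for example, 1,234 inside such an element); that markup is an assumption you should check against a real page, and the names WordCountParser and extract_word_count are made up for illustration.

```python
from html.parser import HTMLParser


class WordCountParser(HTMLParser):
    """Collect the text of the first <dd class="words"> element.

    Assumption: the page renders the word count as
    <dd class="words">1,234</dd>. Adjust if the real markup differs.
    """

    def __init__(self):
        super().__init__()
        self._in_words = False
        self.raw_count = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'dd' and 'words' in classes and self.raw_count is None:
            self._in_words = True

    def handle_data(self, data):
        # Keep only the first non-empty text found inside the element.
        if self._in_words and self.raw_count is None and data.strip():
            self.raw_count = data.strip()

    def handle_endtag(self, tag):
        if tag == 'dd':
            self._in_words = False


def extract_word_count(html):
    """Return the word count as an int, or None if no such element exists."""
    parser = WordCountParser()
    parser.feed(html)
    if parser.raw_count is None:
        return None
    # Strip thousands separators such as "1,234" before converting.
    return int(parser.raw_count.replace(',', ''))
```

You would pass it the HTML from the request above, as in extract_word_count(req.text). Other fields such as pairings would need their own rules, which depend on markup not covered here.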
But there are some pages that require you to be logged in to an account. For example, AO3 can track your reading history across the site. If you try to access a private page with the approach above, you’ll just get an error message:
Sorry, you don’t have permission to access the page you were trying to reach. Please log in.
Wouldn’t it be nice if you could access those pages in a script as well?
I’ve struggled with this for a while, and I had some hacky workarounds, but nothing very good. Tonight, I found quite a neat solution that seems much more reliable.
For this to work, you need an HTTP client that doesn’t just do one-shot requests. You really want to make two requests: one to log you in, another for the page you actually want. You need to persist some login state from the first request to the second, so that AO3 still remembers you on the second request. Normally, this state is managed by your browser; in Python, you can do the same thing with sessions.
After a bit of poking at the AO3 login form, I’ve got the following code that seems to work:
sess = requests.Session()
# Log in to AO3
# Fetch my private reading history
req = sess.get('https://archiveofourown.org/users/%s/readings' % USERNAME)
Where previously this would return an error page, now I get my reading history. There’s more work to parse this into usable data, but we’re past my previous stumbling block.
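To make the two-request flow concrete, here is a rough sketch of the whole thing as one function: post the credentials, fetch the reading history with the same session, and fail loudly if the "Please log in" error page comes back. The login URL and the form field names are assumptions on my part rather than anything confirmed here, so check them against the real login form (which may also want extra hidden fields). The names fetch_reading_history and LoginRequired are made up for illustration, and the error check is only a heuristic based on the message quoted earlier.

```python
# Assumptions: the login URL and form field names are placeholders.
# Inspect the real login form and adjust them before relying on this.
LOGIN_URL = 'https://archiveofourown.org/user_sessions'
LOGIN_FIELD = 'user_session[login]'
PASSWORD_FIELD = 'user_session[password]'

# The reading history URL pattern used earlier in this post.
READINGS_URL = 'https://archiveofourown.org/users/%s/readings'

# Phrase from the error page shown to logged-out visitors (heuristic).
LOGGED_OUT_MARKER = 'Please log in'


class LoginRequired(Exception):
    """Raised when the site still answers with the logged-out error page."""


def fetch_reading_history(sess, username, password):
    """Log in with sess, then return the HTML of the reading history page.

    sess can be any object with requests.Session-style post() and get()
    methods, so the same session carries the login state between calls.
    """
    # Request 1: submit the login form so the session picks up its cookies.
    sess.post(LOGIN_URL, data={
        LOGIN_FIELD: username,
        PASSWORD_FIELD: password,
    })

    # Request 2: fetch the private page using the same session.
    resp = sess.get(READINGS_URL % username)
    if LOGGED_OUT_MARKER in resp.text:
        raise LoginRequired('Login did not stick for user %r' % username)
    return resp.text
```

With the requests library, you would call this as fetch_reading_history(requests.Session(), USERNAME, PASSWORD). Taking the session as an argument also means you can pass in a stand-in object to try the logic out without touching the network.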
I think this is a useful milestone, and could form the basis for a Python-based AO3 API. I’ve thought about writing such a library in the past, but it’s a bit limited if you can’t log in. With that restriction lifted, there’s a lot more you can potentially do.
I have a few ideas about what to do next, but I don’t have much free time coming up. I’m not promising anything – but you might want to watch this space.