Use shlex.split() to parse log files quickly

I wanted to parse some nginx log files, and I know I’ve done something like this in the past (possibly with Apache logs) – but I remember that involved quite a complicated regex for extracting all the components.

I stumbled upon an article by ksndeveloper with a much simpler trick: use shlex.split().

Today I wanted to get a list of URLs from a set of log messages, which went like this:

>>> import shlex
>>> line = '127.0.0.1 - - [01/Dec/2023:12:08:23 +0000] "GET /page/with/error HTTP/1.0" 500 0 "-" "-"'

>>> shlex.split(line)
['127.0.0.1', '-', '-', '[01/Dec/2023:12:08:23', '+0000]', 'GET /page/with/error HTTP/1.0', '500', '0', '-', '-']

>>> shlex.split(line)[5]
'GET /page/with/error HTTP/1.0'

>>> shlex.split(line)[5].split()
['GET', '/page/with/error', 'HTTP/1.0']

>>> shlex.split(line)[5].split()[1]
'/page/with/error'
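
The steps above can be folded into a small helper. This is only a rough sketch, assuming every line follows the same layout as the example; `parse_line` and the dictionary keys are names made up for illustration, not part of shlex or nginx.

```python
import shlex


def parse_line(line):
    """Split one log line (same layout as the example above) into named fields."""
    parts = shlex.split(line)

    # The quoted request string survives as a single token, e.g.
    # 'GET /page/with/error HTTP/1.0'. This assumes it has exactly three parts.
    method, path, protocol = parts[5].split()

    return {
        "remote_addr": parts[0],
        # The timestamp contains a space, so shlex splits it into two tokens;
        # rejoin them and drop the surrounding square brackets.
        "time": f"{parts[3]} {parts[4]}".strip("[]"),
        "method": method,
        "path": path,
        "protocol": protocol,
        "status": int(parts[6]),
    }


line = '127.0.0.1 - - [01/Dec/2023:12:08:23 +0000] "GET /page/with/error HTTP/1.0" 500 0 "-" "-"'
entry = parse_line(line)
print(entry["path"], entry["status"])
```

Indexing by position is what keeps this short, but it also means a line with a different number of fields will quietly give wrong answers or raise an exception.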

I haven’t tested this extensively, so I don’t know if it’s robust against larger, more complex log files, but it worked well enough for some quick analysis.
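
One failure mode that is easy to anticipate: shlex.split() raises ValueError when a line has unbalanced quotes, for example a truncated line or a stray apostrophe outside a quoted field. Below is a hedged sketch of a more defensive version that skips such lines instead of crashing. `extract_path` is a hypothetical helper name, and the second sample line is invented to show the failure case.

```python
import shlex


def extract_path(line):
    """Return the request path from a log line, or None if the line doesn't parse."""
    try:
        parts = shlex.split(line)
    except ValueError:
        # Raised by shlex for unbalanced quotes ("No closing quotation").
        return None

    # Need at least six tokens for the request string to be at index 5.
    if len(parts) < 6:
        return None

    # Expect 'METHOD PATH PROTOCOL'; anything else is treated as malformed.
    request = parts[5].split()
    if len(request) != 3:
        return None

    return request[1]


lines = [
    '127.0.0.1 - - [01/Dec/2023:12:08:23 +0000] "GET /page/with/error HTTP/1.0" 500 0 "-" "-"',
    # Invented example of a truncated line with no closing quote.
    '127.0.0.1 - - [01/Dec/2023:12:08:24 +0000] "GET /truncated',
]

paths = [extract_path(line) for line in lines]
print(paths)
```

Returning None for anything unexpected makes it easy to count how many lines were skipped, which gives a rough idea of how well the trick holds up on a real log file.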