Fast and random sampling in SQLite
I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos.
This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I’m happy with the code I settled on, but it took several attempts to get right.
Approach #1: ORDER BY RANDOM()
My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results:
SELECT *
FROM photos
ORDER BY random()
LIMIT 10
This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable.
It’s slow because it forces SQLite to generate a random value for every row, sort every row by that value, and only then apply the limit. SQLite is fast, but there’s only so fast you can sort millions of values.
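If you’re curious where the time goes, you can ask SQLite to describe its plan before running the query. Here’s a rough sketch using Python’s built-in sqlite3 module – the photos.db filename is a stand-in for wherever your data lives, not the actual Explorer database:

import sqlite3

# "photos.db" is a stand-in filename – point this at your own database.
conn = sqlite3.connect("photos.db")

# EXPLAIN QUERY PLAN asks SQLite to describe how it would run the query,
# rather than running it.  For ORDER BY random() I'd expect the plan to
# show a full scan of `photos` plus a temporary B-tree for the sort,
# which is where the time goes.
for row in conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM photos ORDER BY random() LIMIT 10"
):
    print(row)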
I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I’m selecting:
SELECT *
FROM photos
WHERE id IN (
    SELECT id
    FROM photos
    ORDER BY RANDOM()
    LIMIT 10
)
This means SQLite only has to load the rows it’s returning, not every row in the database. This query was over three times faster – about 0.15s – but that’s still slower than I wanted.
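If you want to compare the two queries on your own data, a quick-and-dirty timing loop is enough. This is just a sketch (again with a stand-in photos.db filename) – the timings in this post are from my machine and my table, so your numbers will differ:

import sqlite3
import time

# "photos.db" is a stand-in filename – point this at your own database.
conn = sqlite3.connect("photos.db")

queries = {
    "ORDER BY random()": """
        SELECT * FROM photos ORDER BY random() LIMIT 10
    """,
    "WHERE id IN (SELECT id ...)": """
        SELECT * FROM photos
        WHERE id IN (
            SELECT id FROM photos ORDER BY random() LIMIT 10
        )
    """,
}

# Run each query once and report how long it took to fetch the results.
for name, query in queries.items():
    start = time.perf_counter()
    conn.execute(query).fetchall()
    print(f"{name}: {time.perf_counter() - start:.4f}s")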
Approach #2: WHERE rowid > (…)
Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach:
SELECT * FROM photos
WHERE rowid > (
    ABS(RANDOM()) % (SELECT max(rowid) FROM photos)
)
LIMIT 10
The rowid is a unique identifier that’s used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key.
This query works by picking a random point between the smallest and biggest rowid values used in the table, then returning the rows whose rowids are higher than that point. If you want to know more, Max’s answer has a more detailed explanation.
This query is much faster – around 0.0008s – but I didn’t go this route.
The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103, … – which isn’t what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I’d get one photo of an old plane, then nine more photos of other planes. I want more variety!
(This behaviour isn’t guaranteed – if you don’t add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that’s pretty common, but that may not be true in all cases. It doesn’t affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.)
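If you want to see the clustering for yourself, here’s roughly how you could check – a sketch that runs Max’s query a few times and prints the rowid values it returns, assuming the same stand-in photos.db file:

import sqlite3

# "photos.db" is a stand-in filename – point this at your own database.
conn = sqlite3.connect("photos.db")

# Run the random-offset query a few times and print the rowids we get back.
# If each run comes back as a consecutive run of numbers, the "sample" is
# really a slice of adjacent rows.
for _ in range(3):
    rowids = [
        row[0]
        for row in conn.execute(
            """
            SELECT rowid FROM photos
            WHERE rowid > (ABS(RANDOM()) % (SELECT MAX(rowid) FROM photos))
            LIMIT 10
            """
        )
    ]
    print(rowids)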
Approach #3: Select random rowid values outside SQLite
Max’s answer was the first time I’d heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less “pure” approach because I’m not doing everything in the database, but I’m happy with that if it gets the result I want.
Here’s the procedure I came up with:
1.  Create an empty list to store our sample.

2.  Find the highest rowid that’s currently in use:

        sqlite> SELECT MAX(rowid) FROM photos;
        1913389

3.  Use a random number generator to pick a rowid between 1 and the highest rowid:

        >>> import random
        >>> random.randint(1, max_rowid)
        196476

    If we’ve already got this rowid, discard it and generate a new one.

    (The rowid is a signed, 64-bit integer, so the minimum possible value is always 1.)

4.  Look for a row with that rowid:

        SELECT * FROM photos WHERE rowid = 196476

    If such a row exists, add it to our sample. If we have enough items in our sample, we’re done. Otherwise, return to step 3 and generate another rowid.

    If such a row doesn’t exist, return to step 3 and generate another rowid.
This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It’s slightly slower than approach #2, but still plenty fast enough (about 0.001s).
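Here’s a minimal sketch of that procedure in Python, using the built-in sqlite3 module. The table and column names match the queries above, but the function itself is an illustration rather than the exact code in the Commons Explorer:

import random
import sqlite3


def random_sample(conn: sqlite3.Connection, size: int) -> list:
    """Return `size` randomly-chosen rows from the `photos` table."""
    # Step 2: find the highest rowid currently in use.
    max_rowid = conn.execute("SELECT MAX(rowid) FROM photos").fetchone()[0]

    sample = []
    seen_rowids = set()

    while len(sample) < size:
        # Step 3: pick a random rowid between 1 and the highest rowid,
        # discarding any value we've already tried.
        rowid = random.randint(1, max_rowid)
        if rowid in seen_rowids:
            continue
        seen_rowids.add(rowid)

        # Step 4: look for a row with that rowid.  If it doesn't exist,
        # go round the loop and pick another one.
        row = conn.execute(
            "SELECT * FROM photos WHERE rowid = ?", (rowid,)
        ).fetchone()
        if row is not None:
            sample.append(row)

    return sample


conn = sqlite3.connect("photos.db")  # stand-in filename
print(random_sample(conn, size=10))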
This approach is best for tables where the rowid values are mostly contiguous – if there are large gaps, you might try several missing rowids before finding a valid row, which slows down the sampling. In that case, you might want to try something different, like tracking the valid rowid values separately.
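For example, here’s one way you might do that – not what I’m doing, just a sketch of the idea – by loading every rowid that exists and sampling from that list in Python:

import random
import sqlite3

# "photos.db" is a stand-in filename – point this at your own database.
conn = sqlite3.connect("photos.db")

# Read every rowid that actually exists, then sample from that list.
# No guessing at missing rowids, at the cost of loading the full list
# of IDs up front (and keeping it fresh as rows change).
rowids = [row[0] for row in conn.execute("SELECT rowid FROM photos")]
chosen = random.sample(rowids, 10)

placeholders = ", ".join("?" for _ in chosen)
rows = conn.execute(
    f"SELECT * FROM photos WHERE rowid IN ({placeholders})", chosen
).fetchall()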
This is a good fit for my use case, because photos don’t get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist.
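That 97% figure is easy to estimate, by comparing the number of rows to the highest rowid in use. Something like this sketch:

import sqlite3

# "photos.db" is a stand-in filename – point this at your own database.
conn = sqlite3.connect("photos.db")

# What fraction of the possible rowid values (1..MAX(rowid)) actually exist?
row_count, max_rowid = conn.execute(
    "SELECT COUNT(*), MAX(rowid) FROM photos"
).fetchone()

print(f"{row_count / max_rowid:.1%} of possible rowid values are in use")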
Summary
Here are the four approaches I tried:
| Approach | Performance (for 2M rows) | Notes |
|---|---|---|
| ORDER BY RANDOM() | ~0.5s | Slowest, easiest to read |
| WHERE id IN (SELECT id …) | ~0.15s | Faster, still fairly easy to understand |
| WHERE rowid > ... | ~0.0008s | Returns clustered results |
| Random rowid in Python | ~0.001s | Fast and returns varied results; requires code outside SQL; may be slower with sparsely populated rowids |
I’m using the random rowid in Python approach in the Commons Explorer, trading code complexity for speed. I’m using this random sample to render a web page, so it’s important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load.
But I’ve used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don’t care about absolute performance. It’s simpler to read and easier to see what’s going on.
Now it’s your turn – visit the Commons Explorer and see what random gems you can find. Let me know if you spot anything cool!