
Fast and random sampling in SQLite

I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos.

This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I’m happy with the code I settled on, but it took several attempts to get right.

Approach #1: ORDER BY RANDOM()

My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results:

SELECT *
FROM photos
ORDER BY random()
LIMIT 10

This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable.

It’s slow because it forces SQLite to generate a random value for every row, sort the entire table by those values, and only then apply the limit. SQLite is fast, but there’s only so fast you can sort millions of values.
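Here’s what that naive query looks like from Python’s sqlite3 module, run against a small in-memory stand-in for the photos table (the schema here is an assumption for illustration, not the Explorer’s actual schema):

```python
import sqlite3

# Toy stand-in for the real `photos` table; the schema is an
# assumption for illustration, not the Explorer's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO photos (title) VALUES (?)",
    [(f"photo-{i}",) for i in range(1_000)],
)

# The naive approach: assign every row a random sort key, sort all
# of them, and only then apply the limit.
sample = conn.execute(
    "SELECT * FROM photos ORDER BY random() LIMIT 10"
).fetchall()
```

On a table this small the sort is instant; the cost only becomes visible at millions of rows.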

I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I’m selecting:

SELECT *
FROM photos
WHERE id IN (
    SELECT id
    FROM photos
    ORDER BY RANDOM()
    LIMIT 10
)

This means SQLite only has to load the rows it’s returning, not every row in the database. This query was over three times faster – about 0.15s – but that’s still slower than I wanted.

Approach #2: WHERE rowid > (…)

Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach:

SELECT * FROM photos
WHERE rowid > (
  ABS(RANDOM()) % (SELECT max(rowid) FROM photos)
)
LIMIT 10

The rowid is a unique identifier that’s used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key.

This query works by picking a point between the biggest and smallest rowid values used in the table, then getting the rows with rowids which are higher than that point. If you want to know more, Max’s answer has a more detailed explanation.

This query is much faster – around 0.0008s – but I didn’t go this route.

The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103 – which isn’t what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I’d get one photo of an old plane, then nine more photos of other planes. I want more variety!

(This behaviour isn’t guaranteed – if you don’t add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that’s pretty common, but that may not be true in all cases. It doesn’t affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.)
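The clustering is easy to reproduce with Python’s sqlite3 module and a toy table (the schema is an assumption; the offset is fixed here so the example is reproducible, where the real query computes it with ABS(RANDOM()) % MAX(rowid)):

```python
import sqlite3

# Toy table standing in for `photos`; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO photos (title) VALUES (?)",
    [(f"photo-{i}",) for i in range(1_000)],
)

# Fixed offset so the example is reproducible; the real query uses
# ABS(RANDOM()) % (SELECT MAX(rowid) FROM photos) instead.
offset = 100
rows = conn.execute(
    "SELECT rowid FROM photos WHERE rowid > ? LIMIT 10", (offset,)
).fetchall()
rowids = [r[0] for r in rows]

# On a freshly built table this typically comes back as the contiguous
# run 101..110 – a slice, not a sample – though SQLite doesn't
# guarantee any order without an ORDER BY clause.
print(rowids)
```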

Approach #3: Select random rowid values outside SQLite

Max’s answer was the first time I’d heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less “pure” approach because I’m not doing everything in the database, but I’m happy with that if it gets the result I want.

Here’s the procedure I came up with:

  1. Create an empty list to store our sample.

  2. Find the highest rowid that’s currently in use:

    sqlite> SELECT MAX(rowid) FROM photos;
    1913389
    
  3. Use a random number generator to pick a rowid between 1 and the highest rowid:

    >>> import random
    >>> random.randint(1, max_rowid)
    196476
    

    If we’ve already got this rowid, discard it and generate a new one.

    (The rowid is a signed, 64-bit integer, but SQLite’s auto-assigned rowids are positive and start at 1, so 1 is a safe minimum.)

  4. Look for a row with that rowid:

    SELECT *
    FROM photos
    WHERE rowid = 196476
    

    If such a row exists, add it to our sample. If we have enough items in our sample, we’re done. Otherwise, return to step 3 and generate another rowid.

    If such a row doesn’t exist, return to step 3 and generate another rowid.

This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It’s a bit slower, but still plenty fast enough (about 0.001s).

This approach is best for tables where the rowid values are mostly contiguous. If there are large gaps – lots of rowids between 1 and the max that don’t exist – you may probe several missing entries before finding a valid row, which slows the sampling down. In that case you might want to try something different, like tracking the valid rowid values separately.

This is a good fit for my use case, because photos don’t get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist.
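To put a rough number on that: each probe hits an existing row with probability p, so the number of probes until a hit is geometrically distributed and the expected cost per sampled row is 1/p – a back-of-the-envelope estimate, not a benchmark:

```python
# Each random rowid exists with probability p, so probes-until-hit is
# geometrically distributed, with mean 1/p.
p = 0.97  # fraction of possible rowids that exist (from the post)
probes_per_row = 1 / p
print(round(probes_per_row, 2))  # ~1.03 probes per sampled row
```

So at 97% density, the gaps cost only about 3% extra lookups on average.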

Summary

Here are the four approaches I tried:

| Approach                    | Performance (for 2M rows) | Notes                                                                                      |
|-----------------------------|---------------------------|--------------------------------------------------------------------------------------------|
| ORDER BY RANDOM()           | ~0.5s                     | Slowest, easiest to read                                                                   |
| WHERE id IN (SELECT id …)   | ~0.15s                    | Faster, still fairly easy to understand                                                    |
| WHERE rowid > ...           | ~0.0008s                  | Fast, but returns clustered results                                                        |
| Random rowid in Python      | ~0.001s                   | Fast and returns varied results; requires code outside SQL; may be slower with sparse rowids |

I’m using the random rowid in Python approach in the Commons Explorer, trading code complexity for speed. I’m using this random sample to render a web page, so it’s important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load.

But I’ve used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don’t care about absolute performance. It’s simpler to read and easier to see what’s going on.

Now it’s your turn – visit the Commons Explorer and see what random gems you can find. Let me know if you spot anything cool!