
Fast and random sampling in SQLite

I was building a small feature for the Flickr Commons Explorer today: show a random selection of photos from the entire collection. I wanted a fast and varied set of photos.

This meant getting a random sample of rows from a SQLite table (because the Explorer stores all its data in SQLite). I’m happy with the code I settled on, but it took several attempts to get right.

Approach #1: ORDER BY RANDOM()

My first attempt was pretty naïve – I used an ORDER BY RANDOM() clause to sort the table, then limit the results:

SELECT *
FROM photos
ORDER BY random()
LIMIT 10

This query works, but it was slow – about half a second to sample a table with 2 million photos (which is very small by SQLite standards). This query would run on every request for the homepage, so that latency is unacceptable.

It’s slow because it forces SQLite to generate a random value for every row, sort the entire table by those values, and only then apply the limit. SQLite is fast, but there’s only so fast you can sort millions of values.
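Here’s what that naive query looks like from Python’s sqlite3 module, run against a small in-memory stand-in for the photos table (the schema here is an assumption for illustration, not the Explorer’s actual schema):

```python
import sqlite3

# Toy stand-in for the real `photos` table; the schema is an
# assumption for illustration, not the Explorer's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO photos (title) VALUES (?)",
    [(f"photo-{i}",) for i in range(1_000)],
)

# The naive approach: assign every row a random sort key, sort all
# of them, and only then apply the limit.
sample = conn.execute(
    "SELECT * FROM photos ORDER BY random() LIMIT 10"
).fetchall()
```

On a table this small the sort is instant; the cost only becomes visible at millions of rows.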

I found a suggestion from Stack Overflow user Ali to do a random sort on the id column first, pick my IDs from that, and only fetch the whole row for the photos I’m selecting:

SELECT *
FROM photos
WHERE id IN (
    SELECT id
    FROM photos
    ORDER BY RANDOM()
    LIMIT 10
)

This means SQLite only has to load the rows it’s returning, not every row in the database. This query was over three times faster – about 0.15s – but that’s still slower than I wanted.

Approach #2: WHERE rowid > (…)

Scrolling down the Stack Overflow page, I found an answer by Max Shenfield with a different approach:

SELECT * FROM photos
WHERE rowid > (
  ABS(RANDOM()) % (SELECT max(rowid) FROM photos)
)
LIMIT 10

The rowid is a unique identifier that’s used as a primary key in most SQLite tables, and it can be looked up very quickly. SQLite automatically assigns a unique rowid unless you explicitly tell it not to, or create your own integer primary key.

This query works by picking a point between the biggest and smallest rowid values used in the table, then getting the rows with rowids which are higher than that point. If you want to know more, Max’s answer has a more detailed explanation.

This query is much faster – around 0.0008s – but I didn’t go this route.

The result is more like a random slice than a random sample. In my testing, it always returned contiguous rows – 101, 102, 103 – which isn’t what I want. The photos in the Commons Explorer database were inserted in upload order, so photos with adjacent row IDs were uploaded at around the same time and are probably quite similar. I’d get one photo of an old plane, then nine more photos of other planes. I want more variety!

(This behaviour isn’t guaranteed – if you don’t add an ORDER BY clause to a SELECT query, then the order of results is undefined. SQLite is returning rows in rowid order in my table, and a quick Google suggests that’s pretty common, but that may not be true in all cases. It doesn’t affect whether I want to use this approach, but I mention it here because I was confused about the ordering when I read this code.)
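The clustering is easy to reproduce with Python’s sqlite3 module and a toy table (the schema is an assumption; the offset is fixed here so the example is reproducible, where the real query computes it with ABS(RANDOM()) % MAX(rowid)):

```python
import sqlite3

# Toy table standing in for `photos`; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO photos (title) VALUES (?)",
    [(f"photo-{i}",) for i in range(1_000)],
)

# Fixed offset so the example is reproducible; the real query uses
# ABS(RANDOM()) % (SELECT MAX(rowid) FROM photos) instead.
offset = 100
rows = conn.execute(
    "SELECT rowid FROM photos WHERE rowid > ? LIMIT 10", (offset,)
).fetchall()
rowids = [r[0] for r in rows]

# On a freshly built table this typically comes back as the contiguous
# run 101..110 – a slice, not a sample – though SQLite doesn't
# guarantee any order without an ORDER BY clause.
print(rowids)
```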

Approach #3: Select random rowid values outside SQLite

Max’s answer was the first time I’d heard of rowid, and it gave me an idea – what if I chose random rowid values outside SQLite? This is a less “pure” approach because I’m not doing everything in the database, but I’m happy with that if it gets the result I want.

Here’s the procedure I came up with:

  1. Create an empty list to store our sample.

  2. Find the highest rowid that’s currently in use:

    sqlite> SELECT MAX(rowid) FROM photos;
    1913389
    
  3. Use a random number generator to pick a rowid between 1 and the highest rowid:

    >>> import random
    >>> random.randint(1, max_rowid)
    196476
    

    If we’ve already got this rowid, discard it and generate a new one.

    (The rowid is a signed, 64-bit integer, but SQLite’s auto-assigned rowids are positive and start at 1, so 1 is a safe minimum.)

  4. Look for a row with that rowid:

    SELECT *
    FROM photos
    WHERE rowid = 196476
    

    If such a row exists, add it to our sample. If we have enough items in our sample, we’re done. Otherwise, return to step 3 and generate another rowid.

    If such a row doesn’t exist, return to step 3 and generate another rowid.

This requires a bit more code, but it returns a diverse sample of photos, which is what I really care about. It’s a bit slower, but still plenty fast enough (about 0.001s).

This approach is best for tables where the rowid values are mostly contiguous. If there are large gaps – lots of rowids between 1 and the max that don’t exist – you may probe several missing entries before finding a valid row, which slows the sampling down. In that case you might want to try something different, like tracking the valid rowid values separately.

This is a good fit for my use case, because photos don’t get removed from Flickr Commons very often. Once a row is written, it sticks around, and over 97% of the possible rowid values do exist.
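To put a rough number on that: each probe hits an existing row with probability p, so the number of probes until a hit is geometrically distributed and the expected cost per sampled row is 1/p – a back-of-the-envelope estimate, not a benchmark:

```python
# Each random rowid exists with probability p, so probes-until-hit is
# geometrically distributed, with mean 1/p.
p = 0.97  # fraction of possible rowids that exist (from the post)
probes_per_row = 1 / p
print(round(probes_per_row, 2))  # ~1.03 probes per sampled row
```

So at 97% density, the gaps cost only about 3% extra lookups on average.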

Summary

Here are the four approaches I tried:

| Approach                    | Performance (for 2M rows) | Notes                                                                                      |
|-----------------------------|---------------------------|--------------------------------------------------------------------------------------------|
| ORDER BY RANDOM()           | ~0.5s                     | Slowest, easiest to read                                                                   |
| WHERE id IN (SELECT id …)   | ~0.15s                    | Faster, still fairly easy to understand                                                    |
| WHERE rowid > ...           | ~0.0008s                  | Fast, but returns clustered results                                                        |
| Random rowid in Python      | ~0.001s                   | Fast and returns varied results; requires code outside SQL; may be slower with sparse rowids |

I’m using the random rowid in Python approach in the Commons Explorer, trading code complexity for speed. I’m using this random sample to render a web page, so it’s important that it returns quickly – when I was testing ORDER BY RANDOM(), I could feel myself waiting for the page to load.

But I’ve used ORDER BY RANDOM() in the past, especially for asynchronous data pipelines where I don’t care about absolute performance. It’s simpler to read and easier to see what’s going on.

Now it’s your turn – visit the Commons Explorer and see what random gems you can find. Let me know if you spot anything cool!