Mysql – Simple Random Samples from a Sql database

mysqlpostgresqlrandomsql

How do I take an efficient simple random sample in SQL? The database in question is running MySQL; my table is at least 200,000 rows, and I want a simple random sample of about 10,000.

The "obvious" answer is to:

SELECT * FROM table ORDER BY RAND() LIMIT 10000

For large tables, that's too slow: it calls RAND() for every row (which already puts it at O(n)), and sorts them, making it O(n lg n) at best. Is there a way to do this faster than O(n)?

Note: As Andrew Mao points out in the comments, If you're using this approach on SQL Server, you should use the T-SQL function NEWID(), because RAND() may return the same value for all rows.

EDIT: 5 YEARS LATER

I ran into this problem again with a bigger table, and ended up using a version of @ignorant's solution, with two tweaks:

  • Sample the rows to 2-5x my desired sample size, to cheaply ORDER BY RAND()
  • Save the result of RAND() to an indexed column on every insert/update. (If your data set isn't very update-heavy, you may need to find another way to keep this column fresh.)

To take a 1000-item sample of a table, I count the rows and sample the result down to, on average, 10,000 rows with the the frozen_rand column:

SELECT COUNT(*) FROM table; -- Use this to determine rand_low and rand_high

  SELECT *
    FROM table
   WHERE frozen_rand BETWEEN %(rand_low)s AND %(rand_high)s
ORDER BY RAND() LIMIT 1000

(My actual implementation involves more work to make sure I don't undersample, and to manually wrap rand_high around, but the basic idea is "randomly cut your N down to a few thousand.")

While this makes some sacrifices, it allows me to sample the database down using an index scan, until it's small enough to ORDER BY RAND() again.

Best Solution

I think the fastest solution is

select * from table where rand() <= .3

Here is why I think this should do the job.

  • It will create a random number for each row. The number is between 0 and 1
  • It evaluates whether to display that row if the number generated is between 0 and .3 (30%).

This assumes that rand() is generating numbers in a uniform distribution. It is the quickest way to do this.

I saw that someone had recommended that solution and they got shot down without proof.. here is what I would say to that -

  • This is O(n) but no sorting is required so it is faster than the O(n lg n)
  • mysql is very capable of generating random numbers for each row. Try this -

    select rand() from INFORMATION_SCHEMA.TABLES limit 10;

Since the database in question is mySQL, this is the right solution.