Friday, November 19, 2010

random subsets of data

I recently had a request to return a random subset of contacts from a database, but to be sure that there were some contacts from each group represented. Let us say that you have a database with employees and you want to get a list of some random people from each department. Here is how I went about doing it:

declare @s1 varchar(max)

select @s1 = coalesce(@s1 + ' union all ', '') +
e from (

select distinct
'select * from (select top 11 percent * from
Employees where Department = '''
+ Department + ''' order by newid())
as [t' + Department + ']' as e
from Employees) T1
exec (@s1)

What we are doing here is generating the text for a query that will select the top N contacts from the Employee list for each department. The coalesce function will put all of these queries together into one query that will union the results into a single table. The exec function will execute the query.

Note that this is a TSQL example to run on SQL Server. You may have to translate this, depending on your database system.

If you want a fixed number then what I would do is tweak the percent to get a number than is slightly more than the number desired then just select the top N from that. This runs the risk of not getting somebody from every department. Another method would be to select a number slightly less than your target, then select some random number of records to add to the result. This is more complex, but may be what you need.