Janitor Programmer: random subsets of data

I recently had a request to return a random subset of contacts from a database, but to be sure that there were some contacts from each group represented. Let us say that you have a database with employees and you want to get a list of some random people from each department. Here is how I went about doing it:



declare @s1 varchar(max)

select @s1 = coalesce(@s1 + ' union all ', '') +
e from (

select distinct
'select * from (select top 11 percent * from 
Employees where Department = '''
+ Department + ''' order by newid()) 
as [t' + Department + ']'  as e
from Employees) T1
exec (@s1)

What we are doing here is generating the text for a query that will select the top N contacts from the Employee list for each department. The coalesce function will put all of these queries together into one query that will union the results into a single table. The exec function will execute the query.

Note that this is a TSQL example to run on SQL Server. You may have to translate this, depending on your database system.

If you want a fixed number then what I would do is tweak the percent to get a number than is slightly more than the number desired then just select the top N from that. This runs the risk of not getting somebody from every department. Another method would be to select a number slightly less than your target, then select some random number of records to add to the result. This is more complex, but may be what you need.

Janitor Programmer

Friday, November 19, 2010

random subsets of data

No comments:

Pages

Followers

Blog Archive