Thursday, December 2, 2010

random subsets of data ii

In doing more work to get random subsets, the previous solution fails miserably if you are going to be using a view. You can run the EXEC statement in a stored procedure but cannot use it in a view. And furthermore, calling a stored procedure from a view is somewhere between extremely problematic and impossible, depending on who you talk to. Using the RANK function and partitioning the data you can get a similar result.

Let us use the same idea as the previous example and assume we want at least one employee from every department. The only requirement for the following query is that you are asking for more employees than you have departments.


select top 100 T1.* from

(select

RANK() over
(PARTITION by Department order by newid()) as r,

*
from Employees
where Active = 1

) as T1
order by t1.r, t1.Department


Note, as before, that this is a sql server query. Modifications may need to be made for different flavors of SQL. What this is doing is for every department the employees are getting randomly ranked thanks to order by NEWID(). By selecting the top 100 and ordering by rank and then department, you'll get all of the 1's from every department, then all of the 2's and so on until you get 100.

The order by department is strictly unnecessary. I did it to make viewing and verifying the results easier.

No comments: