Perform operation on n random distinct elements from Collection using Streams API

Question

I'm trying to retrieve n unique random elements from a Collection for further processing, using the Streams API in Java 8, but so far without any luck. More precisely, I'd want something like this:

I want to do it as efficiently as possible. Can this be done?

Edit: My second attempt, although not exactly what I was aiming for:

Accepted Answer

The shuffling approach works reasonably well, as suggested by fge in a comment and by ZouZou in another answer. Here’s a generified version of the shuffling approach:

    static <E> List<E> shuffleSelectN(Collection<? extends E> coll, int n) {
        assert n <= coll.size();
        List<E> list = new ArrayList<>(coll);
        Collections.shuffle(list);
        return list.subList(0, n);
    }

I’ll note that using subList is preferable to getting a stream and then calling limit(n), as shown in some other answers, because the resulting stream has a known size and can be split more efficiently.

The shuffling approach has a couple of disadvantages. It needs to copy out all the elements, and then it needs to shuffle all the elements. This can be quite expensive if the total number of elements is large and the number of elements to be chosen is small.

An approach suggested by the OP and by a couple of other answers is to choose elements at random, while rejecting duplicates, until the desired number of unique elements has been chosen. This works well if the number of elements to choose is small relative to the total, but as the number to choose rises, this slows down quite a bit because the likelihood of choosing duplicates rises as well.

Wouldn’t it be nice if there were a way to make a single pass over the space of input elements and choose exactly the number wanted, with the choices made uniformly at random? It turns out that there is, and as usual, the answer can be found in Knuth. See TAOCP Vol 2, sec 3.4.2, Random Sampling and Shuffling, Algorithm S.

Briefly, the algorithm is to visit each element and decide whether to choose it based on the number of elements visited and the number of elements chosen. In Knuth’s notation, suppose you have N elements and you want to choose n of them at random.
The next element should be chosen with probability

    (n - m) / (N - t)

where t is the number of elements visited so far, and m is the number of elements chosen so far.

It’s not at all obvious that this will give a uniform distribution of chosen elements, but apparently it does. The proof is left as an exercise to the reader; see Exercise 3 of this section.

Given this algorithm, it’s pretty straightforward to implement it in “conventional” Java by looping over the collection and adding to the result list based on the random test. The OP asked about using streams, so here’s a shot at that.

Algorithm S doesn’t lend itself obviously to Java stream operations. It’s described entirely sequentially, and the decision about whether to select the current element depends on a random decision plus state derived from all previous decisions. That might make it seem inherently sequential, but I’ve been wrong about that before. I’ll just say that it’s not immediately obvious how to make this algorithm run in parallel.

There is a way to adapt this algorithm to streams, though. What we need is a stateful predicate. This predicate will return a random result based on a probability determined by the current state, and the state will be updated (yes, mutated) based on this random result. This seems hard to run in parallel, but at least it’s easy to make thread-safe in case it’s run from a parallel stream: just make it synchronized. It’ll degrade to running sequentially if the stream is parallel, though.

The implementation is pretty straightforward. Knuth’s description uses random numbers between 0 and 1, but the Java Random class lets us choose a random integer within a half-open interval.
Thus all we need to do is keep counters of how many elements are left to visit and how many are left to choose, et voilà:

    /**
     * A stateful predicate that, given a total number
     * of items and the number to choose, will return 'true'
     * the chosen number of times distributed randomly
     * across the total number of calls to its test() method.
     */
    static class Selector implements Predicate
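
As a rough illustration of the idea described above, here is a minimal sketch of how such a stateful predicate could be written and plugged into a stream. Everything beyond the name Selector is an assumption made for the example: the field names remaining and toChoose, the wrapper class AlgorithmSSketch, the helper selectN, and the argument checks are hypothetical and not taken from the original listing. The two counters play the roles of (N - t) and (n - m) in Knuth’s notation.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical wrapper class for this sketch; not part of the original answer.
class AlgorithmSSketch {

    /**
     * Sketch of a stateful predicate in the spirit of Algorithm S.
     * Over exactly 'total' calls to test(), it returns true exactly
     * 'toChoose' times.
     */
    static class Selector implements Predicate<Object> {
        private int remaining;  // elements not yet visited, i.e. N - t
        private int toChoose;   // elements still to be chosen, i.e. n - m
        private final Random random = new Random();

        Selector(int total, int toChoose) {
            if (toChoose < 0 || toChoose > total) {
                throw new IllegalArgumentException("need 0 <= toChoose <= total");
            }
            this.remaining = total;
            this.toChoose = toChoose;
        }

        // synchronized so that the shared counters stay consistent
        // even if the predicate is called from a parallel stream.
        @Override
        public synchronized boolean test(Object ignored) {
            if (remaining <= 0) {
                throw new IllegalStateException("test() called more than 'total' times");
            }
            // nextInt(remaining) is uniform on [0, remaining), so this is
            // true with probability toChoose / remaining.
            boolean chosen = random.nextInt(remaining) < toChoose;
            remaining--;
            if (chosen) {
                toChoose--;
            }
            return chosen;
        }
    }

    // Hypothetical helper: filters the collection's stream through a fresh Selector.
    static <E> List<E> selectN(Collection<? extends E> coll, int n) {
        return coll.stream()
                   .filter(new Selector(coll.size(), n))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Prints three distinct elements of the input, in encounter order.
        System.out.println(selectN(input, 3));
    }
}
```

Once toChoose equals remaining, every later call returns true, and once toChoose reaches zero, every later call returns false, so the filter passes exactly n elements. Keep in mind that the java.util.stream documentation asks for stateless behavioral parameters, so a predicate like this is a pragmatic workaround rather than by-the-book stream usage.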

Answer