in [Python]

Prev: Sockets and xml problem
Next: TypeError: _new_() takes exactly 3 arguments (2 given) - what doesit mean?
From: Mark Dickinson on 28 May 2010 16:44 For a lazy Friday evening, here's a Python algorithm that seemed so cute that I just had to share it with everyone. I'm sure it's well known to many here, but it was new to me. Skip directly to the 'sample2' function to see the algorithm and avoid the commentary... Suppose that you want to select a number of elements, k, say, from a population, without replacement. E.g., selecting 3 elements from range(30) might give you: [13, 3, 27] Order matters, so the above is considered distinct from [3, 13, 27]. And you want to be sure that each possible selection has equal probability of occurring (to within the limits of the underlying PRNG). One solution is to select elements from the population one-by-one, keep track of the indices of already-selected elements in a set, and if you end up selecting something that's already in your set, simply try again. Something like this (code stolen and adapted from Random.sample in Python's standard library 'random' module): from random import randrange def sample1(population, k): n = len(population) result = [None] * k selected = set() for i in range(k): j = randrange(n) # retry until we get something that's not already selected while j in selected: j = randrange(n) selected.add(j) result[i] = population[j] return result N.B. The above is Python 3 code; for Python 2, replace range with xrange. All that's required of 'population' here is that it implements __len__ and __getitem__. The method works well for k significantly smaller than n, but as k approaches n the number of reselections required increases. So for larger k, Random.sample uses a different method: roughly, make a copy of 'population', do a partial in-place shuffle of that copy that randomizes the first k elements, and return those. This second method isn't so great when k is small and n is huge, since it ends up being O(n) from the list copy, but it works out that the two methods complement each other nicely. Looking at the above code, I was idly wondering whether there was a way to alter 'sample1' to avoid the need for resampling, thus giving a single algorithm that works reasonably efficiently regardless of the population size and requested sample size. And it turns out that there is. The code below is similar to 'sample1' above, except that instead of using a set to keep track of indices of already-selected members of the population, it uses a dict; for an index i (corresponding to a member of the population), d[i] gives the position that population[i] will occupy in the resulting sample. from random import randrange def sample2(population, k): n = len(population) d = {} for i in reversed(range(k)): j = randrange(i, n) if j in d: d[i] = d[j] d[j] = i result = [None] * k for j, i in d.items(): result[i] = population[j] return result Note that no resampling is required, and that there's no copying of the population list. The really clever bit is the 'if j in d: ...' block. If you stare at the algorithm for long enough (and it does take some staring), you can convince yourself that after the first 'for' loop, d can be any of the n*(n-1)*...*(n-k+1) mappings-with-no-repeated-elements from some set of k elements of range(n) to range(k), and that each one of these mappings is equally likely to occur. In a sense, this d is the inverse of the desired sample, which would be a map with no repetitions from range(k) to range(n). So inverting d, and replacing d's keys by the corresponding population elements, gives the random sample. N.B. I don't claim any originality for the algorithm; only for the implementation: the algorithm is based on an algorithm attributed to Robert Floyd, and appearing in Jon Bentley's 'Programming Pearls' book (though that algorithm produces a set, so doesn't worry about the ordering of the sample). But I was struck by its beauty and simplicity, and thought it deserved to be better known. Happy Friday! -- Mark |