Number of elements of numpy arrays inside specific bins
I have an ensemble of sorted (one-dimensional) arrays of unequal lengths (say
M2). I want to find how many elements of each of these arrays is inside specific number ranges (where the number ranges are specified by neighboring elements from another sorted array, say
zbin). I want to know what is the fastest way to achieve this.
Here, I am giving a small example of the task that I want to do (and also the method that I am following presently to achieve the desired functionality):
""" Function to do search query """ def search(numrange, lst): arr = np.zeros(len(lst)) for i in range(len(lst)): probe = lst[i] count = 0 for j in range(len(probe)): if (probe[j]>numrange): break if (probe[j]>=numrange) and (probe[j]<=numrange): count = count + 1 arr[i] = count return arr """ Some example of sorted one-dimensional arrays of unequal lengths """ M0 = np.array([5.1, 5.4, 6.4, 6.8, 7.9]) M1 = np.array([5.2, 5.7, 8.8, 8.9, 9.1, 9.2]) M2 = np.array([6.1, 6.2, 6.5, 7.2]) """ Implementation and output """ lst = [M0, M1, M2] zbin = np.array([5.0, 5.5, 6.0, 6.5]) zarr = np.zeros( (len(zbin)-1, len(lst)) ) for i in range(len(zbin)-1): numrange = [zbin[i], zbin[i+1]] zarr[i,:] = search(numrange, lst) print zarr
[[ 2. 1. 0.] [ 0. 1. 0.] [ 1. 0. 3.]]
Here, the final output
zarr gives me the number of elements of each of the arrays (
M2) inside each of the bins possible from
[5.5, 6.0] and
[6.0, 6.5].) For example consider the bin
[5.0, 5.5]. The array
M0 has 2 elements inside that bin (
M1 has 1 element (
M2 has 0 elements in that bin. This gives the first row of
[2,1,0]. One can get the other rows of
zarr in a similar manner.
In my actual task, I will be dealing with
zbin of lengths much larger than what I have given in this example, and also bigger and many more arrays like
Ms and the array
zbin would be sorted always. I am wondering if the function that I have designed (
search()), and the method that I am following are the most optimum and the fastest ways to achieve the desired functionality. I will really appreciate any help.
I would guess it would be difficult to beat the performance of simply looping over each array and calling numpy.histogram. I’m guessing you haven’t tried this or you’d have mentioned it!
It’s certainly possible that you could exploit the sorted nature to come up with a faster solution, but I’d start by comparing the timing of that.
We could make use of the sorted nature and hence use
np.searchsortedfor this task, like so -
out = np.empty((len(zbin)-1, len(lst)),dtype=int) for i,l in enumerate(lst): left_idx = np.searchsorted(l, zbin[:-1], 'left') right_idx = np.searchsorted(l, zbin[1:], 'right') out[:,i] = right_idx - left_idx