Number of elements of numpy arrays inside specific bins

I have an ensemble of sorted (one-dimensional) arrays of unequal lengths (say M0, M1 and M2). I want to find how many elements of each of these arrays is inside specific number ranges (where the number ranges are specified by neighboring elements from another sorted array, say zbin). I want to know what is the fastest way to achieve this.

Here, I am giving a small example of the task that I want to do (and also the method that I am following presently to achieve the desired functionality):

""" Function to do search query """
def search(numrange, lst):
    arr = np.zeros(len(lst))        
    for i in range(len(lst)):
        probe = lst[i]
        count = 0
        for j in range(len(probe)):
            if (probe[j]>numrange[1]): break
            if (probe[j]>=numrange[0]) and (probe[j]<=numrange[1]): count = count + 1   

        arr[i] = count
    return arr

""" Some example of sorted one-dimensional arrays of unequal lengths """
M0 = np.array([5.1, 5.4, 6.4, 6.8, 7.9])
M1 = np.array([5.2, 5.7, 8.8, 8.9, 9.1, 9.2])
M2 = np.array([6.1, 6.2, 6.5, 7.2])

""" Implementation and output """
lst = [M0, M1, M2]
zbin = np.array([5.0, 5.5, 6.0, 6.5])
zarr = np.zeros( (len(zbin)-1, len(lst)) )
for i in range(len(zbin)-1):
    numrange = [zbin[i], zbin[i+1]]
    zarr[i,:] = search(numrange, lst)

print zarr


[[ 2.  1.  0.]
 [ 0.  1.  0.]
 [ 1.  0.  3.]] 

Here, the final output zarr gives me the number of elements of each of the arrays (M0, M1 and M2) inside each of the bins possible from zbin (viz. [5.0, 5.5], [5.5, 6.0] and [6.0, 6.5].) For example consider the bin [5.0, 5.5]. The array M0 has 2 elements inside that bin (5.1 and 5.4), M1 has 1 element (5.2) and M2 has 0 elements in that bin. This gives the first row of zarr i.e. [2,1,0]. One can get the other rows of zarr in a similar manner.

In my actual task, I will be dealing with zbin of lengths much larger than what I have given in this example, and also bigger and many more arrays like M0, M1, ... Mn. All Ms and the array zbin would be sorted always. I am wondering if the function that I have designed (search()), and the method that I am following are the most optimum and the fastest ways to achieve the desired functionality. I will really appreciate any help.

2 answers

  • answered 2018-05-16 04:36 Ethan Coon

    I would guess it would be difficult to beat the performance of simply looping over each array and calling numpy.histogram. I’m guessing you haven’t tried this or you’d have mentioned it!

    It’s certainly possible that you could exploit the sorted nature to come up with a faster solution, but I’d start by comparing the timing of that.

  • answered 2018-05-16 04:53 Divakar

    We could make use of the sorted nature and hence use np.searchsorted for this task, like so -

    out = np.empty((len(zbin)-1, len(lst)),dtype=int)
    for i,l in enumerate(lst):
        left_idx = np.searchsorted(l, zbin[:-1], 'left')
        right_idx = np.searchsorted(l, zbin[1:], 'right')
        out[:,i] = right_idx - left_idx