How to filter a numpy array of points by another array

How can I filter a numpy array a by the elements of a numpy array b, so that I get all the points in a that are not in b?

import numpy as np

a = np.array([[1,2],[1,3],[1,4]])
b = np.array([[1,2],[1,3]])
c = np.array([ d for d in a if d not in b])
print(c)

# actual outcome
# []
# desired outcome
# np.array([[1,4]])

3 answers

  • answered 2022-05-04 10:47 oda

    This probably will not be the most efficient (though it turns out to be faster than the other approaches presented here for this input; see below), but one thing you can do is convert the rows of a and b to lists of tuples and then take their set difference:

    # Method 1
    tmp_1 = [tuple(i) for i in a]    # -> [(1, 2), (1, 3), (1, 4)]
    tmp_2 = [tuple(i) for i in b]    # -> [(1, 2), (1, 3)]
    
    c = np.array(list(set(tmp_1).difference(tmp_2)))
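
A set difference does not keep the original row order and collapses duplicate rows. If either matters, one possible variant keeps the set only for lookups. This is a minimal sketch, assuming the rows become hashable once converted to tuples; the name `b_rows` is illustrative and not part of the answer above:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Build a set of b's rows once, so each membership lookup is cheap.
b_rows = {tuple(row) for row in b}

# Keep the rows of a, in their original order, that are absent from b.
c = np.array([row for row in a if tuple(row) not in b_rows])
print(c)
```

Note that if no row survives, `np.array([])` has shape `(0,)` rather than `(0, 2)`, so a reshape may be needed in that case.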
    

    As noted by @EmiOB, this post offers some insights into why [ d for d in a if d not in b ] in your question does not work. Drawing from that post, you can use

    # Method 2
    c = np.array([d for d in a if all(any(d != i) for i in b)])
    

    Remarks


    The implementation of array_contains(PyArrayObject *self, PyObject *el) (in C) says that calling array_contains(self, el) (in C) is equivalent to

    (self == el).any()
    

    in Python, where self is a pointer to an array and el is a pointer to a Python object.

    In other words:

    1. if arr is a numpy array and obj is some arbitrary Python object, then
    obj in arr
    

    is the same as

    (arr == obj).any()
    
    2. if arr is a typical Python container such as a list, tuple, dictionary, and so on, then
    obj in arr
    

    is the same as

    any(obj is _ or obj == _ for _ in arr)
    

    (see membership test operations).

    All of which is to say, the meaning of obj in arr is different depending on the type of arr.

    This explains why the list comprehension that you proposed, [d for d in a if d not in b], does not have the desired effect.

    This can be confusing because it is tempting to reason that since a numpy array is a sequence (though not a standard Python one), membership test semantics should be the same. This is not the case.

    Example:

    a = np.array([[1,2],[1,3],[1,4]])
    print((a == [1,2]).any())          # same as [1, 2] in a
    # outputs True
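
To connect this to the question, the following sketch applies the same rule to the row [1, 4], which is not a row of b. The variable names `elementwise` and `row_match` are illustrative only:

```python
import numpy as np

b = np.array([[1, 2], [1, 3]])
d = np.array([1, 4])

# Broadcasting compares d against every row of b, element by element.
elementwise = (b == d)
print(elementwise)   # rows: [True, False] and [True, False]

# `d in b` reduces that mask with .any(), so a single matching element is enough.
print(d in b)        # True, although [1, 4] is not a row of b

# A row-wise test requires all elements of some row to match.
row_match = (b == d).all(axis=1).any()
print(row_match)     # False
```

Because the first column matches, `d in b` is True for every row of a, so `d not in b` filters everything out and the result is empty.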
    

    Timings


    For your input, I found my approach to be the fastest, followed by Method 2 obtained from the post @EmiOB suggested, followed by @DanielF's approach. I would not be surprised if changing the input size changed the ordering of the timings, so take them with a grain of salt.

    # Method 1
    5.96 µs ± 8.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    # Method 2
    6.45 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    # @DanielF's answer
    16.5 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
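
Since the ordering may depend on input size, it can be worth re-running the comparison on data closer to your own. Below is a rough sketch using the standard `timeit` module; the array sizes, the random data, and the function names `method_1` and `method_2` are assumptions made for illustration, not the setup used for the numbers above:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only for illustration.
a = rng.integers(0, 50, size=(200, 2))
b = rng.integers(0, 50, size=(50, 2))

def method_1(a, b):
    # Set difference on tuples of rows (order and duplicates are not preserved).
    return set(map(tuple, a)).difference(map(tuple, b))

def method_2(a, b):
    # Nested Python loops over rows with element-wise comparison.
    return [d for d in a if all(any(d != i) for i in b)]

for fn in (method_1, method_2):
    runs = 3
    seconds = timeit.timeit(lambda: fn(a, b), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```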
    

  • answered 2022-05-04 10:59 Abhyuday Vaish

    Use this:

    c = np.array([a_elem for a_elem in a if all(any(a_elem != b_elem) for b_elem in b)])
    

    Output:

    array([[1, 4]])
    

    Explanation:

    We loop over each row a_elem of a and compare it with every row b_elem of b. any(a_elem != b_elem) returns True if at least one value in a_elem differs from the corresponding value in b_elem. all(any(a_elem != b_elem) for b_elem in b) returns True only if a_elem differs from every row of b.

    Example:

    We take [1,2] from a and check whether any of its elements differ from [1,2] and [1,3] from b, one by one. The result is False for [1,2] and True for [1,3]. This creates a list [False, True]

    Next, we take [1,3] from a. It'll return True for [1,2] and False for [1,3]. This creates another list [True, False].

    Lastly, we take [1,4] from a. It'll return True for both [1,2] and [1,3]. This creates a list [True, True]

    Now, when we run all() on each of the above lists, it returns True only when every value in the list is True, which holds only for the last one. Hence, we add only [1,4] to our array.
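
The walkthrough can be reproduced by printing the intermediate lists. This is only an illustrative sketch; the names `flags` and `differs` are not part of the answer's code:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

flags = []
for a_elem in a:
    # One flag per row of b: True when a_elem differs from that row somewhere.
    differs = [bool(np.any(a_elem != b_elem)) for b_elem in b]
    flags.append(differs)
    print(a_elem.tolist(), differs, "keep" if all(differs) else "drop")

# Prints:
# [1, 2] [False, True] drop
# [1, 3] [True, False] drop
# [1, 4] [True, True] keep
```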

  • answered 2022-05-04 11:28 Daniel F

    When comparing row-wise like this, I tend to use @Jaime's recipe for converting to a void view:

    vview = lambda a:np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    
    a[~np.isin(vview(a), vview(b)).squeeze()]
    Out[]: array([[1, 4]])
    

    This avoids the slow for loops of the other answers and doesn't create any intermediate data structures.
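
For comparison, another loop-free option is plain broadcasting. This is a different technique from the void view above, and unlike it, it does allocate an intermediate boolean array of shape (len(a), len(b), number of columns), so it is a sketch best suited to small or moderate inputs:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Compare every row of a with every row of b: result has shape (3, 2, 2).
pairwise = a[:, None, :] == b[None, :, :]

# A row of a is "in b" if all of its elements match some row of b.
in_b = pairwise.all(axis=2).any(axis=1)

c = a[~in_b]
print(c)  # [[1 4]]
```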
