How to filter a numpy array of points by another array
How can I filter a NumPy array `a` by the elements of a NumPy array `b`, so that I get all the points in `a` that are not in `b`?
```python
import numpy as np

a = np.array([[1,2],[1,3],[1,4]])
b = np.array([[1,2],[1,3]])
c = np.array([d for d in a if d not in b])
print(c)
# actual outcome
# []
# desired outcome
# np.array([[1,4]])
```
3 answers

This is probably not the most efficient approach (though for this input it turns out to be faster than the other approaches presented here; see Timings below), but one thing you can do is convert the rows of `a` and `b` to tuples and then take their set difference:

    # Method 1
    tmp_1 = [tuple(i) for i in a]  # -> [(1, 2), (1, 3), (1, 4)]
    tmp_2 = [tuple(i) for i in b]  # -> [(1, 2), (1, 3)]
    c = np.array(list(set(tmp_1).difference(tmp_2)))

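One caveat with the set difference above is that a Python set does not keep the rows in their original order, and an empty result comes back as a 1-D array. As a minimal sketch of an order-preserving variation (not the method above; `b_rows` and `mask` are names introduced here for illustration), the same tuple idea can build a boolean mask instead:

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Hash every row of b once so each lookup is a set membership test.
b_rows = {tuple(row) for row in b}

# One boolean per row of a: True when that row does not occur in b.
mask = np.array([tuple(row) not in b_rows for row in a], dtype=bool)

# Boolean indexing keeps the original row order and the 2-D shape.
c = a[mask]
print(c)  # [[1 4]]
```

Because the result is produced by indexing `a`, it stays two-dimensional even when no rows survive the filter.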
As noted by @EmiOB, this post offers some insights into why `[d for d in a if d not in b]` in your question does not work. Drawing from that post, you can use

    # Method 2
    c = np.array([d for d in a if all(any(d != i) for i in b)])

Remarks
The implementation of `array_contains(PyArrayObject *self, PyObject *el)` (in C) says that calling `array_contains(self, el)` (in C) is equivalent to `(self == el).any()` in Python, where `self` is a pointer to an array and `el` is a pointer to a Python object. In other words:
- if `arr` is a NumPy array and `obj` is some arbitrary Python object, then `obj in arr` is the same as `(arr == obj).any()`
- if `arr` is a typical Python container such as a list, tuple, dictionary, and so on, then `obj in arr` is the same as `any(obj is _ or obj == _ for _ in arr)` (see membership test operations).

All of which is to say, the meaning of `obj in arr` differs depending on the type of `arr`. This explains why the list comprehension that you proposed, `[d for d in a if d not in b]`, does not have the desired effect: `[1, 4] in b` is `True` because the first coordinate, 1, matches an element of `b`, so every row of `a` is filtered out.

This can be confusing because it is tempting to reason that since a NumPy array is a sequence (though not a standard Python one), membership test semantics should be the same. This is not the case.
Example:

    a = np.array([[1,2],[1,3],[1,4]])
    print((a == [1,2]).any())  # same as [1, 2] in a
    # outputs True

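To tie this back to the question, here is a small check that contrasts the two membership semantics on the row that should have survived the filter. The name `d` is introduced here as a stand-in for the row `[1, 4]`.

```python
import numpy as np

b = np.array([[1, 2], [1, 3]])
d = np.array([1, 4])

# NumPy semantics: compare elementwise, then ask "is anything True?"
print((b == d).any())  # True, because the first coordinate 1 matches
print(d in b)          # True, the same test under the hood

# Plain Python lists compare whole items, so the row is not found.
print(d.tolist() in b.tolist())  # False
```

The last line shows the behaviour the question expected, which is what the tuple and `all(any(...))` approaches reproduce.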
Timings
For your input, I found my approach (Method 1) to be the fastest, followed by Method 2 obtained from the post @EmiOB suggested, followed by @DanielF's approach. I would not be surprised if changing the input size changes the ordering of the timings, so take them with a grain of salt.

    # Method 1
    5.96 µs ± 8.92 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    # Method 2
    6.45 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
    # @DanielF's answer
    16.5 µs ± 276 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

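If you want to repeat the comparison on your own data, a rough sketch with the standard `timeit` module could look like the following. The wrapper names `method_1` and `method_2` and the repetition count are assumptions made for this example, and the absolute numbers will vary by machine and NumPy version.

```python
import timeit
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

def method_1():
    # Set difference on rows converted to tuples.
    tmp_1 = [tuple(i) for i in a]
    tmp_2 = [tuple(i) for i in b]
    return np.array(list(set(tmp_1).difference(tmp_2)))

def method_2():
    # Keep a row only if it differs from every row of b.
    return np.array([d for d in a if all(any(d != i) for i in b)])

repeats = 10_000  # arbitrary choice for this sketch
for fn in (method_1, method_2):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e6:.2f} us per call")
```

Swapping in larger arrays for `a` and `b` is the quickest way to see whether the ordering changes with input size.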

Use this:

    c = np.array([a_elem for a_elem in a if all(any(a_elem != b_elem) for b_elem in b)])

Output:

    array([[1, 4]])

Explanation:
We loop over each row `a_elem` of `a` and compare it against every row `b_elem` of `b`. `any(a_elem != b_elem)` returns `True` if at least one value of `a_elem` differs from the corresponding value of `b_elem`. `all(any(a_elem != b_elem) for b_elem in b)` returns `True` only if `a_elem` differs from every row of `b`. For example:
- We take `[1,2]` from `a` and check, one by one, whether any of its elements differ from `[1,2]` and `[1,3]` in `b`. The result is `False` for `[1,2]` and `True` for `[1,3]`, which gives `[False, True]`.
- Next, we take `[1,3]` from `a`. The result is `True` for `[1,2]` and `False` for `[1,3]`, which gives `[True, False]`.
- Lastly, we take `[1,4]` from `a`. The result is `True` for both `[1,2]` and `[1,3]`, which gives `[True, True]`.

Now, when we run `all()` on each of the lists above, it returns `True` only when both values are `True`, which is the case only for `[1,4]`. Hence, we add `[1,4]` to our array.
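The same row-by-row logic can also be written without Python-level loops by using broadcasting. This is a different technique from the comprehension above, offered as a sketch; `equal` and `row_in_b` are names introduced here, and the intermediate array has `len(a) * len(b) * 2` entries, so it is best suited to moderately sized inputs.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])
b = np.array([[1, 2], [1, 3]])

# Shape (len(a), len(b), 2): every row of a compared with every row of b.
equal = a[:, None, :] == b[None, :, :]

# A row of a matches a row of b only if all coordinates agree;
# it is "in b" if that happens for at least one row of b.
row_in_b = equal.all(axis=2).any(axis=1)

c = a[~row_in_b]
print(c)  # [[1 4]]
```

Here `row_in_b` is `[True, True, False]`, the negation of the three `all()` results described above.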
When comparing row-wise like this, I tend to use @Jaime's recipe for converting to a void view:

    vview = lambda a: np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    a[~np.isin(vview(a), vview(b)).squeeze()]
    Out[]: array([[1, 4]])

This avoids the slow `for` loops of the other answers and doesn't create any intermediate data structures.
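To make the void view less opaque, the sketch below only inspects what the view looks like; it does not perform the filtering itself. `void_view` is a hypothetical function name that restates the lambda above, and `row_bytes` is introduced for readability.

```python
import numpy as np

a = np.array([[1, 2], [1, 3], [1, 4]])

def void_view(arr):
    # Each row is reinterpreted as one opaque block of raw bytes.
    arr = np.ascontiguousarray(arr)
    row_bytes = arr.dtype.itemsize * arr.shape[1]
    return arr.view(np.dtype((np.void, row_bytes)))

va = void_view(a)
print(va.shape)       # (3, 1): one void element per original row
print(va.dtype.kind)  # V

# Viewing the bytes with the original dtype recovers the original rows.
print(np.array_equal(va.view(a.dtype), a))  # True
```

The trailing dimension of length 1 is why the recipe calls `.squeeze()` before using the result as a row mask. Since rows are compared as raw bytes, both arrays should share the same dtype and number of columns.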