How to compare two DataFrames of different lengths without getting the "identically-labeled Series" error
I have two DataFrames holding my train and test data respectively. Both have a column called 'location'. I want to see which values in the 'location' column appear in both the train and test DataFrames and which do not. For example, given these two DataFrames:
df_train:
i loc
0 10
1 11
2 12
df_test:
i loc
0 10
1 12
2 13
3 17
I would need it to return that 10 and 12 are in both DataFrames, that 11 is only in df_train, and that 13 and 17 are only in df_test.
Below is what I have tried:
df_t["match_location"] = np.where(df_tst["location_remapped"] == df_t["location_remapped"], "True", "False")
However, I run into this error because the two DataFrames have different lengths:
ValueError Traceback (most recent call last)
<ipython-input-49-51941d90b84e> in <module>()
----> 1 df_t["match_location"] = np.where(df_tst["location_remapped"] == df_t["location_remapped"], "True", "False")
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/ops/common.py in new_method(self, other)
67 other = item_from_zerodim(other)
68
---> 69 return method(self, other)
70
71 return new_method
/usr/local/lib/python3.7/dist-packages/pandas/core/arraylike.py in __eq__(self, other)
30 @unpack_zerodim_and_defer("__eq__")
31 def __eq__(self, other):
---> 32 return self._cmp_method(other, operator.eq)
33
34 @unpack_zerodim_and_defer("__ne__")
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _cmp_method(self, other, op)
5494
5495 if isinstance(other, Series) and not self._indexed_same(other):
-> 5496 raise ValueError("Can only compare identically-labeled Series objects")
5497
5498 lvalues = self._values
ValueError: Can only compare identically-labeled Series objects
Does anyone have a way around this?
1 answer
answered 2022-05-04 10:07
jezrael
If there are no duplicates in the loc columns, use DataFrame.merge with an outer join and the indicator parameter:
df = df_train.merge(df_test, on='loc', indicator='match_location', how='outer')
print (df)
   loc match_location
0   10           both
1   11      left_only
2   12           both
3   13     right_only
4   17     right_only
For a boolean column, compare against both:
df['match_location'] = df['match_location'].eq('both')
print (df)
   loc  match_location
0   10            True
1   11           False
2   12            True
3   13           False
4   17           False
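The merge approach can be checked end to end with the sample data from the question. This is only a minimal sketch: the frames are rebuilt by hand, and the names merged, in_both, train_only and test_only are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample frames rebuilt from the question; the column name 'loc' follows the answer.
df_train = pd.DataFrame({"loc": [10, 11, 12]})
df_test = pd.DataFrame({"loc": [10, 12, 13, 17]})

# The outer join keeps every distinct value; the indicator column records
# whether each value came from the left frame, the right frame, or both.
merged = df_train.merge(df_test, on="loc", how="outer", indicator="match_location")

# Split the values by origin (illustrative variable names).
in_both = merged.loc[merged["match_location"] == "both", "loc"].tolist()
train_only = merged.loc[merged["match_location"] == "left_only", "loc"].tolist()
test_only = merged.loc[merged["match_location"] == "right_only", "loc"].tolist()

print(in_both, train_only, test_only)
```

Here left_only corresponds to df_train and right_only to df_test, because df_train is the frame that merge is called on.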
If duplicates are possible, remove them first:
df = (df_train.drop_duplicates('loc')
              .merge(df_test.drop_duplicates('loc'),
                     on='loc', indicator='match_location', how='outer'))
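As a possible alternative to merging (not part of the answer above), Series.isin tests membership by value rather than by index label, so it should work on frames of different lengths and leave each frame's rows untouched, duplicates included. A minimal sketch, assuming the column is named loc as in the example; the repeated 12 is added here only to show the duplicate case.

```python
import pandas as pd

# Sample frames; 12 is repeated in df_train on purpose to show duplicates are kept.
df_train = pd.DataFrame({"loc": [10, 11, 12, 12]})
df_test = pd.DataFrame({"loc": [10, 12, 13, 17]})

# isin compares values, not index labels, so the two Series may differ in length.
df_train["match_location"] = df_train["loc"].isin(df_test["loc"])
df_test["match_location"] = df_test["loc"].isin(df_train["loc"])

print(df_train)
print(df_test)
```

Unlike the outer merge, this adds the flag to each original frame separately, which is closer to what the np.where attempt in the question was trying to do.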