How to compare two DataFrames of different lengths without getting the "identically-labeled Series" error

I have two dataframes holding my train and test data respectively. Both have a column called 'Location'. I want to see which values in the 'Location' column appear in both the train and test dataframes and which do not. For example, given two dataframes:

df_train:
i  loc
0  10  
1  11  
2  12  

df_test:
i  loc
0  10  
1  12  
2  13
3  17 

I would need it to return that 10 and 12 are in both dataframes, that 11 is only in df_train, and that 13 and 17 are only in df_test. Below is what I have tried:

df_t["match_location"] = np.where(df_tst["location_remapped"] == df_t["location_remapped"], "True", "False")

However I run into this error as both df are different lengths:

ValueError                                Traceback (most recent call last)
<ipython-input-49-51941d90b84e> in <module>()
----> 1 df_t["match_location"] = np.where(df_tst["location_remapped"] == df_t["location_remapped"], "True", "False")

2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

/usr/local/lib/python3.7/dist-packages/pandas/core/arraylike.py in __eq__(self, other)
     30     @unpack_zerodim_and_defer("__eq__")
     31     def __eq__(self, other):
---> 32         return self._cmp_method(other, operator.eq)
     33 
     34     @unpack_zerodim_and_defer("__ne__")

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _cmp_method(self, other, op)
   5494 
   5495         if isinstance(other, Series) and not self._indexed_same(other):
-> 5496             raise ValueError("Can only compare identically-labeled Series objects")
   5497 
   5498         lvalues = self._values

ValueError: Can only compare identically-labeled Series objects

Does anyone have a way around this?

1 answer

  • answered 2022-05-04 10:07 jezrael

    If there are no duplicates in the loc columns, use DataFrame.merge with an outer join and the indicator parameter:

    df = df_train.merge(df_test, on='loc', indicator='match_location', how='outer')
    print (df)
       loc match_location
    0   10           both
    1   11      left_only
    2   12           both
    3   13     right_only
    4   17     right_only
    

    For a boolean column, compare the indicator column with 'both':

    df['match_location'] = df['match_location'].eq('both')
    print (df)
       loc  match_location
    0   10            True
    1   11           False
    2   12            True
    3   13           False
    4   17           False
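
The two steps above can be put together as a self-contained sketch. The frames are rebuilt from the sample data in the question; the explicit sort and the extra column name `in_both` are assumptions added here for illustration, not part of the answer.

```python
import pandas as pd

# Sample frames rebuilt from the question (only the 'loc' column matters here).
df_train = pd.DataFrame({"loc": [10, 11, 12]})
df_test = pd.DataFrame({"loc": [10, 12, 13, 17]})

# The outer join keeps every distinct 'loc'; the indicator column records
# whether the value came from the left frame, the right frame, or both.
merged = df_train.merge(df_test, on="loc", how="outer", indicator="match_location")

# Sort explicitly so the row order does not depend on the pandas version.
merged = merged.sort_values("loc", ignore_index=True)

# Boolean flag: True only where the value occurs in both frames.
merged["in_both"] = merged["match_location"].eq("both")

print(merged)
```

Because the merge aligns rows on the values of `loc` rather than on the index, the differing lengths of the two frames no longer matter.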
    

    If duplicates are possible, remove them first:

    df = (df_train.drop_duplicates('loc')
                 .merge(df_test.drop_duplicates('loc'), 
                        on='loc', 
                        indicator='match_location',
                        how='outer'))
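
The merge returns one row per distinct loc value. If the goal is instead to flag every row of the original frames, as the attempt in the question does, Series.isin may be a simpler option. This is a different technique from the merge in the answer, shown here only as a hedged alternative; the column names `in_test` and `in_train` and the duplicated 11 are made up for illustration.

```python
import pandas as pd

# Sample frames based on the question; 11 is duplicated on purpose
# to show that duplicates do not need to be removed for this approach.
df_train = pd.DataFrame({"loc": [10, 11, 12, 11]})
df_test = pd.DataFrame({"loc": [10, 12, 13, 17]})

# isin tests membership by value, so the two frames may have
# different lengths and different index labels.
df_train["in_test"] = df_train["loc"].isin(df_test["loc"])
df_test["in_train"] = df_test["loc"].isin(df_train["loc"])

print(df_train)
print(df_test)
```

Each frame keeps its original length and row order, which a row-wise `==` between the two columns cannot provide when the indexes differ.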
    
