Problem with finding difference between left and outer join in Pandas

I am merging two dataframes based on two strings of and my presumption is that these strings will have many instances that logically should match but do not because of different formatting (country names)

df['citizenship'].dtypes
Out[46]: dtype('O')

df2['Country_Name'].dtypes
Out[47]: dtype('O')

I read up on this fantastic ressource (Pandas Merging 101) and decided to do an Anti-join to find the country names of the first (main) dataframe that did not merge (and then look for df2 in the excel):

nonmerge = pd.merge(df, df2, left_on='citizenship', right_on='Country_Name', \
            how='left', indicator=True).query('_merge != "left_only"').drop('_merge', 1)
nonmerge['citizenship'].unique()

which gives me a list of countries that should be only in the left data frame:

array(['India, Republic of', 'Russian Federation', 'Ukraine', 'Japan',
       'Malaysia', 'Burkina Faso', 'Yemen', 'Bosnia and Herzegovina',
       'United States of America', 'Canada', 'Ireland',
       'Iran, Islamic Republic of', 'United Arab Emirates',
       "China, People's Republic of", 'New Zealand', 'Western Sahara',
       'Moldova, Republic of', 'Mongolia', 'Romania', 'Georgia',
       'Tanzania, United Republic of', 'Dominican Republic',
       'Turkmenistan', 'Syrian Arab Republic', 'Jamaica', 'Taiwan',
       'Korea, Republic of', 'Reunion', 'Saint Lucia',
       'Antigua and Barbuda', 'Saint Pierre and Miquelon',
       'American Samoa', 'Aruba', 'Barbados', 'Greenland',
       'Solomon Islands', 'Anguilla', 'Belize', 'Norfolk Island',
       'Saint Helena', 'Grenada', 'Guadeloupe', 'Cayman Islands',
       'Central African Republic', 'Bermuda', 'Turks and Caicos Islands',
       'Christmas Island', 'New Caledonia', 'Mayotte',
       'Wallis and Futuna', 'Niue', 'Tuvalu', 'Brunei Darussalam',
       'Cook Islands', 'Tokelau', 'Martinique', 'Montserrat',
       'Cocos (Keeling) Islands', 'Guam'], dtype=object)

What really surprised me however is checking one example of this list and finding out that this value is actually present in both original dataframes

'India, Republic of' in df['citizenship'].values
Out[49]: True

'India, Republic of' in df2['Country_Name'].values
Out[50]: True

Anybody knows what I am missing here? PS: Also keen for tips on how to do this more efficiently. Coming from Stata, this way of getting to work merges with messy data seems very tedious... Thanks!

1 answer

  • answered 2020-11-20 11:40 oskros

    If I understand correctly, you wish to know which values differ between the column 'citizenship' in df and the column 'Country_Name' in df2. That can be done with something simple like the following

    set(df['citizenship']) - set(df2['Country_Name'])