Fuzzywuzzy merge on multiple columns - pandas

I have two dataframes:

Dataframe 1:

    path                      hierarchy
0  path3              path1/path2/path3
1  path2                    path1/path2
2  path6  path1/path2/path4/path5/path6

DataFrame 2:

    path                       hierarcy  unique_id
0  path2    path1/<random_string>/path2          1
1  Path3   <random_string>/Path_2/path3          2
2  path2                    path1/Path2          3
3  pAth6        path1/path2/path4/path6          4
4  path6  path1/path2/path4/path5/path6          5

Expected output:

    path                      hierarchy  unique_id
0  path3              path1/path2/path3          2
1  path2                    path1/path2          3
2  path6  path1/path2/path4/path5/path6          5

Now, I want to fill in the unique_ids from the second dataframe into the first one, but there are some issues:

  1. I can't simply merge (left join) the two dataframes on the path column, because the correct match depends on the hierarchy column as well.
  2. The strings in the hierarchy column are not exactly the same; there are differences such as Path_2 versus path2. The same holds for the path column.

I tried fuzzy matching, but it's the hierarchy column that creates the issue. I want to start matching from the right side of the hierarchy column and move towards the left, applying fuzzy matching at each level. I'm not sure whether that's a good approach or not.

Basically, the requirement is to identify the best match based on both the path and hierarchy columns.
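The right-to-left, level-by-level idea can be sketched without fuzzywuzzy at all. In the sketch below, `difflib.SequenceMatcher` from the standard library stands in for fuzzywuzzy's `fuzz.ratio`, and the `1/(level+1)` weighting that makes rightmost levels count most is an assumption, not an established recipe:

```python
# Minimal sketch: fuzzy-match on path plus hierarchy, weighting hierarchy
# levels from the right. difflib stands in for fuzzywuzzy here.
from difflib import SequenceMatcher


def similarity(a, b):
    # Case-insensitive ratio in [0, 1], ignoring underscores.
    return SequenceMatcher(None,
                           a.lower().replace('_', ''),
                           b.lower().replace('_', '')).ratio()


def hierarchy_score(h1, h2):
    # Compare hierarchies level by level, starting from the rightmost level.
    p1, p2 = h1.split('/')[::-1], h2.split('/')[::-1]
    total = 0.0
    for i in range(max(len(p1), len(p2))):
        a = p1[i] if i < len(p1) else ''   # missing levels compare as ''
        b = p2[i] if i < len(p2) else ''
        total += similarity(a, b) / (i + 1)  # rightmost levels weigh most
    return total


def best_match(path, hierarchy, candidates):
    # candidates: iterable of (path, hierarchy, unique_id) tuples from df2.
    return max(candidates,
               key=lambda c: similarity(path, c[0]) + hierarchy_score(hierarchy, c[1]))
```

On the sample data this picks unique_ids 2, 3 and 5, but the weighting scheme would need re-checking on deeper or noisier hierarchies.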

Dictionaries to create the dataframes:

import pandas as pd

df1 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
 'hierarchy': {0: 'path1/path2/path3',
  1: 'path1/path2',
  2: 'path1/path2/path4/path5/path6'}})

df2 = pd.DataFrame({'path': {0: 'path2', 1: 'Path3', 2: 'path2', 3: 'path6', 4: 'path6'},
 'hierarcy': {0: 'path1/<random_string>/path2',
  1: '<random_string>/Path_2/path3',
  2: 'path1/Path2',
  3: 'path1/path2/path4/path6',
  4: 'path1/path2/path4/path5/path6'},
 'unique_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})

# df3 is the expected output
df3 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
 'hierarchy': {0: 'path1/path2/path3',
  1: 'path1/path2',
  2: 'path1/path2/path4/path5/path6'},
 'unique_id': {0: 2, 1: 3, 2: 5}})

Any help is appreciated.

2 answers

  • answered 2021-07-27 16:31 Piotr Żak

    Transform text in df1 and df2 to lowercase:

    df1['path'] = df1['path'].str.lower()
    df2['path'] = df2['path'].str.lower()
    

    Merge:

    result = pd.merge(df1, df2, on=['path'])
    

    Result (note that merging on path alone keeps every same-path row from df2, so unique_ids 1 and 4 appear alongside the expected 2, 3 and 5):

        path                      hierarchy                       hierarcy  unique_id
     0  path3              path1/path2/path3   <random_string>/Path_2/path3          2
     1  path2                    path1/path2    path1/<random_string>/path2          1
     2  path2                    path1/path2                    path1/Path2          3
     3  path6  path1/path2/path4/path5/path6        path1/path2/path4/path6          4
     4  path6  path1/path2/path4/path5/path6  path1/path2/path4/path5/path6          5

  • answered 2021-07-28 07:55 RJ Adriaansen

    I tried fuzzy matching, but didn't get satisfying results, because the order of the hierarchy levels is crucial here. Let me suggest an alternative approach that treats the hierarchies as vectors and then retrieves the index of the vector in df2 with the closest spatial proximity. In the example below the paths are cleaned, split on / and each assigned a unique integer value. Subsequently, the spatial distance of every hierarchy in df1 is calculated against the vectors of the same length in df2:

    import pandas as pd
    import numpy as np
    from scipy import spatial
    
    df1 = pd.DataFrame({'path': {0: 'path3', 1: 'path2', 2: 'path6'},
     'hierarchy': {0: 'path1/path2/path3',
      1: 'path1/path2',
      2: 'path1/path2/path4/path5/path6'}})
    
    
    df2 = pd.DataFrame({'path': {0: 'path2', 1: 'Path3', 2: 'path2', 3: 'path6', 4: 'path6'},
     'hierarchy': {0: 'path1/<random_string>/path2',
      1: '<random_string>/Path_2/path3',
      2: 'path1/Path2',
      3: 'path1/path2/path4/path6',
      4: 'path1/path2/path4/path5/path6'},
     'unique_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}})
    
    #clean up paths
    df1['hierarchy'] = df1['hierarchy'].str.lower().str.replace('_','').str.split('/')
    df2['hierarchy'] = df2['hierarchy'].str.lower().str.replace('_','').str.split('/')
    
    #create unique integer ids per path and map them to the dataframes
    unique_dirs = set(list(np.concatenate(df1['hierarchy'].to_list() + df2['hierarchy'].to_list())))
    dir_map = dict(zip(unique_dirs, range(len(unique_dirs))))
    
    df1 = df1.assign(mapped=[list(map(dir_map.get, x)) for x in df1.hierarchy])
    df2 = df2.assign(mapped=[list(map(dir_map.get, x)) for x in df2.hierarchy])
    
    #calculate spatial distance to vectors from df2 with the same length, then return the matched hierarchy and idx
    def check(row):
      value = row['mapped']
      arr = np.array(df2[df2['mapped'].str.len() == len(value)]['mapped'].to_list())
      tree = spatial.KDTree(arr)
      distances, indices = tree.query(value)
      matches = df2[df2['mapped'].astype(str) == str(tree.data[indices].tolist())]
      return matches['hierarchy'].tolist()[0], matches['unique_id'].tolist()[0]
    
    df1[['match', 'idx']] = df1.apply(check, axis=1, result_type="expand")
    

    Result df1:

       path                                      hierarchy           mapped                                          match  idx
    0  path3                    ['path1', 'path2', 'path3']        [0, 1, 3]           ['<randomstring>', 'path2', 'path3']    2
    1  path2                             ['path1', 'path2']           [0, 1]                             ['path1', 'path2']    3
    2  path6  ['path1', 'path2', 'path4', 'path5', 'path6']  [0, 1, 6, 5, 4]  ['path1', 'path2', 'path4', 'path5', 'path6']    5

    The code could likely be optimized computationally, and it will not work if multiple similar hierarchies with random paths in the same position exist. In that case I wouldn't assign arbitrary integers, but would instead retrieve embeddings for the paths using a language model before calculating the distance.
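For that fallback, full language-model embeddings may be overkill: character n-gram vectors are a cheap, embedding-like stand-in that already places 'Path_2' next to 'path2' while leaving a random string far from everything. This is only a hedged sketch of the idea, not part of the answer above, and the choice of bigrams is an assumption (trigrams or real embeddings would slot in the same way):

```python
# Sketch: character-bigram count vectors as a cheap, embedding-like
# representation for path names, so nearest-neighbour distances reflect
# string similarity rather than arbitrary integer ids.
from collections import Counter
from math import sqrt


def bigram_vector(name):
    # Lowercase, drop underscores, count overlapping character bigrams.
    s = name.lower().replace('_', '')
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(v1, v2):
    dot = sum(v1[k] * v2[k] for k in v1)
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

With this, 'Path_2' and 'path2' map to identical vectors, while '<random_string>' shares no bigram with either, so a nearest-neighbour search over these vectors (per hierarchy level) would not be fooled by unlucky integer assignments.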
