Fuzzy Lookup In Python

I have two CSV files: one contains Vendor data and the other contains Employee data. Similar to what "Fuzzy Lookup" does in Excel, I'm looking to do two types of matches and output all columns from both CSV files, plus a new column with the similarity ratio for each row. In Excel, I would use a 0.80 threshold. The data below is a sample; my actual data has 2 million rows in one of the files, which would be a nightmare to process in Excel.

Match 1: From the Vendor file, fuzzy match "Vendor Name" with "Full Name" from the Employee file.
Match 2: From the Vendor file, fuzzy match "SSN" with "SSN" from the Employee file.

Dataframe 1: Vendor Data

Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN
15       58421      CLIFFORD BROWN  854             500              Misc         668419628
150      9675       GREEN           7412            70               One Time     774801971
200      15789      SMITH, JOHN     80              40               Employee     965214872
200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821

Dataframe 2: Employee Data

Full Name        Employee ID  Manager    SSN
BROWN, CLIFFORD  1            Manager 1  668-419-628
BLUE, CITY       2            Manager 2  874126487
SMITH, JOHN      3            Manager 3  965-21-4872
HAROON, SIMON    4            Manager 4  741-98-7820

Expected output 1 - Match Name

Full Name        Employee ID  Manager    SSN          Company  Vendor ID  Vendor Name     Invoice Number  Transaction Amt  Vendor Type  SSN          Similarity Ratio
BROWN, CLIFFORD  1            Manager 1  668-419-628  150      58421      CLIFFORD BROWN  854             500              Misc         668419628    1.00
SMITH, JOHN      3            Manager 3  965-21-4872  200      15789      SMITH, JOHN     80              40               Employee     965214872    1.00
HAROON, SIMON    4            Manager 4  741-98-7820  200      69997      HAROON, SIMAN   964             100              Misc         741-98-7821  0.96
BLUE, CITY       2            Manager 2  874126487                                                                                                   0.00

Expected output 2 - Match SSN

Full Name        Employee ID  Manager    SSN          Company  Vendor ID  Vendor Name      Invoice Number  Transaction Amt  Vendor Type  SSN          Similarity Ratio
BROWN, CLIFFORD  1            Manager 1  668-419-628  150      58421      CLIFFORD, BROWN  854             500              Misc         668419628    0.97
SMITH, JOHN      3            Manager 3  965-21-4872  200      15789      SMITH, JOHN      80              40               Employee     965214872    0.97
BLUE, CITY       2            Manager 2  874126487                                                                                                    0.00
HAROON, SIMON    4            Manager 4  741-98-7820                                                                                                  0.00

I've tried the below code:

import pandas as pd
from fuzzywuzzy import fuzz

df1 = pd.read_excel(r'Directory\Sample Vendor Data.xlsx')
df2 = pd.read_excel(r'Directory\Sample Employee Data.xlsx')

matched_names = []

# Compare every vendor name against every employee name
for row1 in df1.index:
    name1 = df1.at[row1, 'Vendor Name']
    for row2 in df2.index:
        name2 = df2.at[row2, 'Full Name']
        score = fuzz.ratio(name1, name2)  # Score is on a 0-100 scale
        if score > 80:  # This is the threshold
            matched_names.append([name1, name2, score])

df_ratio = pd.DataFrame(columns=['Vendor Name', 'Full Name', 'match'], data=matched_names)
df_ratio.to_csv(r'directory\MatchingResults.csv', encoding='utf-8')

I'm just not getting the results I want and am ready to rewrite the whole script. Any suggestions to improve it would help. Please note, I'm fairly new to Python, so be gentle. I am totally open to a new approach for this example.

1 answer

  • answered 2021-09-22 02:02 ResidentSleeper

    Try the following function, which uses process.extract to match a string against a list of choices.

    from fuzzywuzzy import fuzz, process

    def match_string(string, choices, scorer, threshold=80):
        # Score the string against every key in choices (limit=None keeps all results)
        results = process.extract(string, choices.keys(), scorer=scorer, limit=None)
        # Keep only the candidates at or above the threshold
        results = [v for v in results if v[1] >= threshold]

        if not results:
            return (None, 0)

        # Return the mapped value and the score of the best candidate
        best = max(results, key=lambda x: x[1])
        return (choices[best[0]], best[1])
    

    You can also change the scorer function to tune the matching. The matching call in this answer uses fuzz.token_set_ratio.
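To see why the choice of scorer matters for names such as "CLIFFORD BROWN" versus "BROWN, CLIFFORD", here is a minimal sketch using only the standard library. It is not fuzzywuzzy's implementation: `plain_ratio` and `sorted_token_ratio` are hypothetical helpers built on `difflib.SequenceMatcher`, and the token-sorting step is a simplified stand-in for what token-based scorers do. Scores here are on a 0 to 1 scale rather than 0 to 100.

```python
import re
from difflib import SequenceMatcher


def plain_ratio(a, b):
    """Character-level similarity of the raw strings, between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def sorted_token_ratio(a, b):
    """Similarity after lowercasing, dropping punctuation, and sorting the words."""
    def normalize(s):
        tokens = re.findall(r"[a-z0-9]+", s.lower())
        return " ".join(sorted(tokens))
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


vendor_name = "CLIFFORD BROWN"
full_name = "BROWN, CLIFFORD"

# Word order and the comma lower the plain score; sorting tokens removes both effects
print(plain_ratio(vendor_name, full_name))
print(sorted_token_ratio(vendor_name, full_name))
```

The second score is a perfect match because both names reduce to the same sorted tokens, while the first is lower. A plain character ratio can therefore fall under the threshold for names that differ only in word order.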

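    # emp >> Your employee dataframe
    # vendor >> Your vendor dataframe
    emp_map = dict(zip(emp['Full Name'], emp['Employee ID']))  # Name -> Employee ID, used for the name match
    ssn_map = dict(zip(emp['SSN'], emp['Employee ID']))  # SSN -> Employee ID, for the SSN match (not used below)

    # Matching: each vendor name yields an (Employee ID, score) pair, unpacked into two new columns
    vendor['Employee ID'], vendor['Similarity Ratio'] = zip(
        *vendor['Vendor Name'].apply(lambda x: match_string(x, emp_map, fuzz.token_set_ratio))
    )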
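The SSN match (Match 2 in the question) is not shown above. One possible approach, sketched below with the standard library only, is to strip everything but digits before comparing, since the sample data mixes "965-21-4872" and "965214872". The helpers `normalize_ssn` and `best_ssn_match` are hypothetical names, `difflib.SequenceMatcher` stands in for a fuzzywuzzy scorer, and the 0.80 threshold is taken from the question.

```python
import re
from difflib import SequenceMatcher


def normalize_ssn(value):
    """Keep only the digits, so '965-21-4872' and '965214872' compare equal."""
    return re.sub(r"\D", "", str(value))


def best_ssn_match(ssn, choices, threshold=0.80):
    """Return (employee_id, ratio) for the closest SSN, or (None, 0.0) if below threshold."""
    target = normalize_ssn(ssn)
    best_id, best_ratio = None, 0.0
    for candidate, employee_id in choices.items():
        ratio = SequenceMatcher(None, target, normalize_ssn(candidate)).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = employee_id, ratio
    if best_ratio < threshold:
        return (None, 0.0)
    return (best_id, best_ratio)


# SSN -> Employee ID, using the sample employee data from the question
ssn_map = {"668-419-628": 1, "874126487": 2, "965-21-4872": 3, "741-98-7820": 4}

print(best_ssn_match("965214872", ssn_map))  # (3, 1.0)
```

With a fuzzy threshold, SSNs that differ by a single digit can still match, which may or may not be what you want for an identifier. If only exact matches should count, comparing the normalized strings for equality is enough.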
    

    Join with the employee dataframe to obtain your desired output.

    emp.merge(vendor, on='Employee ID', how='left').fillna('')
    
             Full Name  Employee ID    Manager        SSN_x Company Vendor ID     Vendor Name Invoice Number Transaction Amt Vendor Type        SSN_y Similarity Ratio
    0  BROWN, CLIFFORD            1  Manager 1  668-419-628    15.0   58421.0  CLIFFORD BROWN          854.0           500.0        Misc    668419628            100.0
    1       BLUE, CITY            2  Manager 2    874126487                                                                                                           
    2      SMITH, JOHN            3  Manager 3  965-21-4872   200.0   15789.0     SMITH, JOHN           80.0            40.0    Employee    965214872            100.0
    3    HAROON, SIMON            4  Manager 4  741-98-7820   200.0   69997.0   HAROON, SIMAN          964.0           100.0        Misc  741-98-7821             92.0
    
