Using the Jaccard index to find the best match between students' required skills and teachers

I have a set of students, each with a list of skills they want to learn, and a set of teachers, each with a list of skills they are ready to teach.

Based on this information I built the two tables below, one for the students and one for the teachers. A '1' means the student wants to learn that skill (or the teacher is willing to teach it); a '0' means the opposite.

|  Students  |  Skill 1  |  Skill 2  |  Skill 3 |  Skill 4 |  Skill 5  |
|------------|-----------|-----------|----------|----------|-----------|
|      A     |      1    |      0    |     0    |     1    |     0     |
|      B     |      1    |      1    |     0    |     0    |     1     |
|      C     |      0    |      0    |     1    |     1    |     0     |
|      D     |      1    |      1    |     0    |     1    |     1     |
|      E     |      0    |      1    |     1    |     0    |     1     |


|  Teachers  |  Skill 1  |  Skill 2  |  Skill 3 |  Skill 4 |  Skill 5  |
|------------|-----------|-----------|----------|----------|-----------|
|      F     |      1    |      1    |     1    |     1    |     1     |
|      G     |      0    |      1    |     0    |     0    |     0     |
|      H     |      0    |      0    |     1    |     1    |     1     |
|      I     |      1    |      1    |     0    |     0    |     0     |
|      J     |      0    |      0    |     1    |     0    |     1     |

I am trying to match the teachers with the appropriate students, and one suggestion I have seen is to use the Jaccard index. However, I am not sure whether the Jaccard index works correctly on binary data.

I tried it on a small example, shown below, but I am not getting the correct result.

import numpy as np

a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]

#define Jaccard Similarity function

def jaccard(list1, list2):
    intersection = len(list(set(list1).intersection(list2)))
    union = (len(list1) + len(list2)) - intersection
    return float(intersection) / union

#find Jaccard Similarity between the two sets 

jaccard(a, b)

0.16666 is the output even though the binary lists are exactly the same.

Any suggestions on how to correctly use the Jaccard Index in this case or any other way to match the teachers to the students? Thanks!

1 answer

  • answered 2022-05-04 11:38 mozway

    If I understand correctly, you want to compute the maximum skill overlap using the Jaccard index and assign the "best" teacher to each student.

    The first step is to compute a matrix of Jaccard indices:

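    import pandas as pd
    from itertools import product
    
    # Assumption: df1 / df2 hold the Students / Teachers tables from the question.
    # For each student (teacher), collect the skills flagged with 1 as a frozenset.
    S = (df1.melt(id_vars='Students')
            .query('value==1')
            .groupby('Students')['variable']
            .agg(frozenset)
         )
    T = (df2.melt(id_vars='Teachers')
            .query('value==1')
            .groupby('Teachers')['variable']
            .agg(frozenset)
         )
    
    def jaccard(s1, s2):
        # |intersection| / |union| of two skill sets
        return len(s1 & s2) / len(s1 | s2)
    
    # Jaccard index for every (student, teacher) pair, students as rows
    df = (pd
       .Series({(s, t): jaccard(S[s], T[t]) for s, t in product(S.index, T.index)})
       .unstack()
       .rename_axis(index='student', columns='teacher')
    )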
    
    # df
    teacher    F         G         H         I         J
    student                                             
    A        0.4  0.000000  0.250000  0.333333  0.000000
    B        0.6  0.333333  0.200000  0.666667  0.250000
    C        0.4  0.000000  0.666667  0.000000  0.333333
    D        0.8  0.250000  0.400000  0.500000  0.200000
    E        0.6  0.333333  0.500000  0.250000  0.666667
    
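    On the question's own attempt: `set(list1)` applied to a list of 0/1 flags collapses to `{0, 1}`, so the intersection always has size 2, while the union is computed from the list lengths (7 + 7 - 2 = 12), hence 2/12 ≈ 0.1667. A minimal sketch of a Jaccard function that works directly on equal-length binary vectors is below, in plain Python so it runs without pandas. The name `jaccard_binary` and the choice to return 0.0 for two all-zero vectors are my own assumptions, not part of any library.

```python
def jaccard_binary(u, v):
    """Jaccard index of two equal-length 0/1 vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    # positions where both flags are set / where at least one is set
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    # assumption: two all-zero vectors are treated as having similarity 0.0
    return both / either if either else 0.0


a = [0, 1, 1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 1, 0, 0]
print(jaccard_binary(a, b))  # 1.0 for identical vectors

student_A = [1, 0, 0, 1, 0]
teacher_F = [1, 1, 1, 1, 1]
print(jaccard_binary(student_A, teacher_F))  # 0.4, the first entry of the matrix above
```

    The set-based `jaccard` in the answer gives the same numbers; this version just skips the conversion from flags to skill names.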

    Then, we can solve the assignment problem using scipy.optimize.linear_sum_assignment:

    from scipy.optimize import linear_sum_assignment
    
    x, y = linear_sum_assignment(df, maximize=True)
    
    # rows of df are students (index), columns are teachers
    out = pd.DataFrame({'student': df.index[x], 'teacher': df.columns[y]})
    
    # out
      student teacher
    0       A       G
    1       B       I
    2       C       H
    3       D       F
    4       E       J
    
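    For a problem this small, the result can be cross-checked by brute force. The sketch below is only a sanity check, not a replacement for `linear_sum_assignment`: it enumerates all 5! = 120 one-to-one assignments with `itertools.permutations` and keeps those with the highest total Jaccard score. It uses plain Python with `fractions.Fraction` so that ties compare exactly; the dictionaries `students` and `teachers` are simply the question's tables retyped as skill sets.

```python
from fractions import Fraction
from itertools import permutations

# Skill sets retyped from the question's tables (skills numbered 1-5)
students = {'A': {1, 4}, 'B': {1, 2, 5}, 'C': {3, 4},
            'D': {1, 2, 4, 5}, 'E': {2, 3, 5}}
teachers = {'F': {1, 2, 3, 4, 5}, 'G': {2}, 'H': {3, 4, 5},
            'I': {1, 2}, 'J': {3, 5}}

def jaccard(s1, s2):
    # exact |intersection| / |union| as a fraction
    return Fraction(len(s1 & s2), len(s1 | s2))

s_names = sorted(students)
t_names = sorted(teachers)

best_total = None
optimal = []  # every assignment that reaches the best total
for perm in permutations(t_names):
    total = sum(jaccard(students[s], teachers[t]) for s, t in zip(s_names, perm))
    if best_total is None or total > best_total:
        best_total, optimal = total, [dict(zip(s_names, perm))]
    elif total == best_total:
        optimal.append(dict(zip(s_names, perm)))

print(float(best_total))  # 2.8
for assignment in optimal:
    print(assignment)
```

    On this data two different assignments reach the same total (student A with teacher G and B with I, or A with I and B with G), so an optimizer is free to return either one.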

    Alternatively, if you just want the best teacher for each student, even if this means potentially having teachers without students and others with many students, use idxmax:

    df.idxmax(axis=1)
    
    student
    A    F
    B    I
    C    H
    D    F
    E    J
    dtype: object
    
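    The same per-student choice can be sketched without pandas, which also makes it easy to see how unevenly the students are spread over the teachers. This is an illustrative sketch only: the skill sets are retyped from the question's tables, and ties are broken by taking the first teacher in alphabetical order, which mirrors how `idxmax` returns the first maximum.

```python
from collections import Counter

# Skill sets retyped from the question's tables (skills numbered 1-5)
students = {'A': {1, 4}, 'B': {1, 2, 5}, 'C': {3, 4},
            'D': {1, 2, 4, 5}, 'E': {2, 3, 5}}
teachers = {'F': {1, 2, 3, 4, 5}, 'G': {2}, 'H': {3, 4, 5},
            'I': {1, 2}, 'J': {3, 5}}

def jaccard(s1, s2):
    # |intersection| / |union| of two skill sets
    return len(s1 & s2) / len(s1 | s2)

# best teacher per student; max() keeps the first teacher on ties
best = {
    s: max(sorted(teachers), key=lambda t: jaccard(skills, teachers[t]))
    for s, skills in sorted(students.items())
}
print(best)  # {'A': 'F', 'B': 'I', 'C': 'H', 'D': 'F', 'E': 'J'}

# number of students each teacher would receive (0 if never chosen)
load = Counter(best.values())
print({t: load[t] for t in sorted(teachers)})  # {'F': 2, 'G': 0, 'H': 1, 'I': 1, 'J': 1}
```

    Here teacher F is picked by two students and teacher G by none, which is the imbalance mentioned above.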
