Memory error when computing cosine similarity on large text dataset

I'm trying to compute the cosine similarity between a large number of sentences (around 400k), but I get a MemoryError with my current method (see below).

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

test_df = pd.read_csv(path)
k = test_df['text']

# Vectorise the data
vec = TfidfVectorizer()
X = vec.fit_transform(k)

# Calculate the pairwise cosine similarities; this is where the MemoryError is raised
S = cosine_similarity(X)
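
For context, the dense result alone cannot fit in RAM at this scale; a quick back-of-the-envelope check (assuming the default float64 output):

# Each pairwise similarity is one float64 (8 bytes), and the result is n x n
n = 400_000
print(n * n * 8 / 1e12)  # ~1.28 TB for the dense similarity matrix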

EDIT:

I have since tried the following, which also gives a MemoryError:

from sklearn.metrics.pairwise import cosine_similarity
X = cosine_similarity(X, X)

Ideally, I would like to end up with a dataframe that shows the cosine similarity from one sentence to all the others, and for this I use pandas. The issue is that the dataframe gets really large even with a small dataset.

# T is the similarity matrix computed above
df = pd.DataFrame.from_records(T)
print(len(T))
print(df.head())
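
If only the similarities from a single sentence are needed, computing one row at a time avoids materialising the full matrix. A minimal sketch, assuming X and k from above (the query index i is a placeholder):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

i = 0  # hypothetical index of the query sentence
row_sims = cosine_similarity(X[i], X).ravel()  # shape (n,), a single row of the full matrix
df = pd.DataFrame({'text': k, 'similarity': row_sims})
print(df.sort_values('similarity', ascending=False).head())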

Any solutions are appreciated.

1 answer

  • answered 2020-09-24 15:31 büşra çelik

    I think you can reduce the memory use by casting the matrix to float32, such as

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    
    # normalized_df here is the TF-IDF matrix from the question (X in the code above);
    # float32 halves the memory of the float64 default
    normalized_df = normalized_df.astype(np.float32)
    cosine_sim = cosine_similarity(normalized_df, normalized_df)
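
    Even in float32 the full 400k x 400k result is still around 640 GB, so if all pairs matter you may additionally need to work in row chunks and keep only the best matches per row. A rough sketch under that assumption (chunk_size and top_k are arbitrary choices):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_per_row(X, top_k=10, chunk_size=1000):
        # Process rows in chunks so only a (chunk_size x n) dense block
        # exists at any one time, then keep the top_k columns per row
        n = X.shape[0]
        for start in range(0, n, chunk_size):
            block = cosine_similarity(X[start:start + chunk_size], X)
            # argpartition selects the top_k columns per row without a full sort
            cols = np.argpartition(block, -top_k, axis=1)[:, -top_k:]
            rows = np.arange(block.shape[0])[:, None]
            yield start + rows, cols, block[rows, cols]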
    
