How to loop through a list of documents' text and calculate TF-IDF Cosine Similarity
I have a list of text documents that I converted into a dataframe with two columns: Filename and Text. For the Text column, I have already removed floats, numbers, punctuation, and stopwords, and lowercased all the text.
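A minimal sketch of that kind of cleaning, for anyone reproducing the setup. The `clean_text` helper and the tiny `STOPWORDS` set are hypothetical stand-ins, not the exact steps or stopword list used on the real data:

```python
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only
STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'is', 'in'}


def clean_text(text):
    """Lowercase, strip numbers and punctuation, and drop stopwords."""
    text = text.lower()
    # Remove integers and floats such as 42 or 3.14
    text = re.sub(r'\d+(\.\d+)?', ' ', text)
    # Remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Drop stopwords; split/join also collapses repeated whitespace
    return ' '.join(word for word in text.split() if word not in STOPWORDS)


print(clean_text('The price of 3.14 apples is 42, and rising!'))
```

On a dataframe, something like this could be applied with `df['Text'].apply(clean_text)`.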
My documents dataframe:
      Filename  Text
    0 xxx.txt   blah1 blah1
    1 xyz.txt   blah2 blah2
    2 xzz.txt   blah3 blah3
    3 zzy.txt   blah4 blah4
    :
    n aad.txt   blah5 blah5
I wrote code that calculates the cosine similarity, but what I really want is to automate it so that it calculates the cosine similarity scores for every document in the dataframe. I want
xxx.txt to be compared with xyz.txt, xzz.txt, zzy.txt, ... and aad.txt, and likewise
xyz.txt to be compared with xxx.txt and so on. For each file, I then want to keep only the highest cosine score and its corresponding filename. Finally, the results should be converted back into a dataframe.
The output dataframe that I want:
      Filename  Cosine Similarity Score  Corresponding Filename
    0 xxx.txt   0.85                     fdd.txt
    1 xyz.txt   0.91                     giu.txt
    2 xzz.txt   0.97                     diu.txt
    3 zzy.txt   0.90                     oil.txt
    :
    n aad.txt   0.88                     sti.txt
The code that I wrote:
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Convert the Text column into a list
    txt_list = df['Text'].values.tolist()

    # Create the document-term matrix
    tfidf = TfidfVectorizer()
    sparse_matrix = tfidf.fit_transform(txt_list)

    # Compute cosine similarity
    # The line below only compares the first txt file with all the txt files in the dataframe
    cosine_sim = cosine_similarity(sparse_matrix[0:1], sparse_matrix)

    # Convert the results into a dataframe
    cosine_sim_df = pd.DataFrame(cosine_sim)
    cosine_sim_df = cosine_sim_df.transpose()

    # Isolate the filenames and bind the two dataframes together
    index_col = pd.DataFrame(df['Filename'])
    cosine_sim_df = pd.concat([index_col, cosine_sim_df], axis=1)
    cosine_sim_df.columns = ['Filename', 'Cosine Similarity Score']

    # Sort the dataframe from highest sim score to lowest
    cosine_sim_df.sort_values(by=['Cosine Similarity Score'], ascending=False, inplace=True)

    # Drop the row where the first file is compared with itself (sim score of 1.0);
    # its index label is still 0 after sorting
    cosine_sim_df = cosine_sim_df.drop(index=0)
I am wondering how I can fix my code so that it outputs the dataframe that I want (like the one listed above).
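One possible direction, as a rough sketch rather than a definitive solution: calling `cosine_similarity` on the whole TF-IDF matrix gives every pairwise score at once, so no explicit loop is needed. The `best_matches` function name and the toy data below are made up for illustration, and the sketch assumes the dataframe has at least two rows and the `Filename` and `Text` columns described above:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def best_matches(df):
    """For each row, find the other document with the highest TF-IDF cosine similarity."""
    # Build the document-term matrix once for all documents
    tfidf_matrix = TfidfVectorizer().fit_transform(df['Text'])

    # n x n array: entry [i, j] is the similarity between document i and document j
    sim = cosine_similarity(tfidf_matrix)

    # Mask the diagonal so a document is never matched with itself
    np.fill_diagonal(sim, -1.0)

    # Position of the best-scoring other document in each row
    best = sim.argmax(axis=1)

    filenames = df['Filename'].to_numpy()
    return pd.DataFrame({
        'Filename': filenames,
        'Cosine Similarity Score': sim[np.arange(len(df)), best],
        'Corresponding Filename': filenames[best],
    })


# Toy data standing in for the real dataframe (made-up text)
toy_df = pd.DataFrame({
    'Filename': ['a.txt', 'b.txt', 'c.txt', 'd.txt'],
    'Text': ['apple banana cherry', 'apple banana date',
             'dog cat mouse', 'dog cat horse'],
})
result = best_matches(toy_df)
print(result)
```

The full matrix takes n x n memory, so for a very large number of documents it may be necessary to process the rows in chunks instead.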
Thank you so much!!!