# How to loop through a list of documents' text and calculate TF-IDF Cosine Similarity

I have a list of text documents that I converted into a dataframe with two columns: a Filename column and a Text column. For the Text column, I have already removed floats, numbers, punctuation, and stopwords, and lowercased all the text.

My documents dataframe:

``````
    Filename     Text
0   xxx.txt   blah1 blah1
1   xyz.txt   blah2 blah2
2   xzz.txt   blah3 blah3
3   zzy.txt   blah4 blah4
:
``````

I wrote code that calculates the cosine similarity, but what I really want is to automate it so that it calculates the cosine similarity scores for every document in the dataframe. I want `xxx.txt` to be compared with `xyz.txt`, `xzz.txt`, `zzy.txt`, ... and `aad.txt`, and likewise `xyz.txt` to be compared with `xxx.txt` and so on. For each file, it should keep only the highest cosine score and the corresponding filename. Finally, the results should be converted back into a dataframe.

The output dataframe that I want:

``````
    Filename     Cosine Similarity Score     Corresponding Filename
0   xxx.txt          0.85                            fdd.txt
1   xyz.txt          0.91                            giu.txt
2   xzz.txt          0.97                            diu.txt
3   zzy.txt          0.90                            oil.txt
:
``````

The code that I wrote:

``````
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert the Text column into a list
txt_list = df['Text'].values.tolist()

# Create the document-term matrix
tfidf = TfidfVectorizer()
sparse_matrix = tfidf.fit_transform(txt_list)

# Compute cosine similarity.
# This only compares the first text file with every text file in the dataframe.
cosine_sim = cosine_similarity(sparse_matrix[0:1], sparse_matrix)

# Convert the results into a dataframe (one row per document)
cosine_sim_df = pd.DataFrame(cosine_sim)
cosine_sim_df = cosine_sim_df.transpose()

# Isolate the filenames and bind the two dataframes together
index_col = pd.DataFrame(df['Filename'])
cosine_sim_df = pd.concat([index_col, cosine_sim_df], axis=1)
cosine_sim_df.columns = ['Filename', 'Cosine Similarity Score']
# Sort the dataframe from highest sim score to lowest
cosine_sim_df.sort_values(by=['Cosine Similarity Score'], ascending=False, inplace=True)
# Drop the row where the first file is compared with itself (sim score of 1.0)
cosine_sim_df = cosine_sim_df.drop([0])
``````

How can I fix my code so that it outputs the dataframe I want (like the one shown above)?

Thank you so much!!!
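
One possible approach, as a minimal sketch rather than a definitive answer: instead of slicing out the first row, pass the whole TF-IDF matrix to `cosine_similarity` to get the full pairwise matrix, overwrite the diagonal so a document cannot match itself, and take the row-wise maximum. The function name `best_matches` is hypothetical. The sketch assumes `df` has the `Filename` and `Text` columns described in the question and at least two rows.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def best_matches(df):
    """For each document, find the other document with the highest
    TF-IDF cosine similarity. (Hypothetical helper, not from the question.)"""
    # Reset the index so positional indices from argmax line up with row labels
    df = df.reset_index(drop=True)

    # Document-term matrix, one row per document
    tfidf_matrix = TfidfVectorizer().fit_transform(df['Text'])

    # Full n x n matrix: sim[i, j] is the similarity of document i and document j
    sim = cosine_similarity(tfidf_matrix)

    # TF-IDF similarities are never negative, so -1 on the diagonal
    # guarantees a document is never picked as its own best match
    np.fill_diagonal(sim, -1.0)

    # Column position of the highest score in each row
    best_idx = sim.argmax(axis=1)

    return pd.DataFrame({
        'Filename': df['Filename'],
        'Cosine Similarity Score': sim[np.arange(len(df)), best_idx],
        'Corresponding Filename': df['Filename'].to_numpy()[best_idx],
    })
```

Calling `result = best_matches(df)` should return one row per file in the original order. If several files tie for the highest score, `argmax` keeps the first one. The dense n x n matrix may become a memory concern for a very large number of documents.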