How do I compare pairs within a list in Python?

I'm trying to loop through a concatenation of two lists that is essentially a bag of words. An example of the list is [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5), ..., ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1), ... ]

Having gathered all the text inputs from .txt files, I've removed the stop words and used stemming to collapse duplicates caused by different tenses.

The next step I want to take is to loop through the list and find the differences in the number of appearances of a given word. I would want 'brexit', 'say' and 'uk' to be flagged as significant words, with either the two numbers of appearances or just the difference. The start of my code (partly Python, partly pseudocode) is below.

def findSimilarities (word, count):
    for (word, count) in biasDict:
        if word == word and count != count:
            print (word, count - count)
        elif word ==word and count == count:
            del (word, count)
        (word, count)++

Any advice on how to approach this and edit the code to work? If it would be better, I can have the words come from two separate lists (which is how they are created; I concatenated them after they were created).

Many thanks.

3 answers

  • answered 2019-02-10 12:48 f.wue

    This would be an option. It's not efficient, but the output is as desired, assuming you want to delete words with the same count (as shown in your code). If you want to keep those entries, just skip the biasDict.remove() part. If you're only interested in the words that occur twice with a different count, you could append the tuples to a new list instead of printing the difference, and return the new list afterwards.

    def findSimilarities (biasDict):
        remove_later = []
        for i in range(0, len(biasDict)):
            word, count = biasDict[i][0], biasDict[i][1]
            for c in range(0, len(biasDict)):
                word_compare, count_compare = biasDict[c][0], biasDict[c][1]
                if c==i:
                    pass #Same entry
                elif word == word_compare and count != count_compare:
                    print (word, count - count_compare)
                elif word == word_compare and count == count_compare and (word, count) not in remove_later:
                    remove_later.append((word, count))
        for entry in remove_later:
            biasDict.remove(entry)
        return biasDict
    biasDict =  [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5), ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
    print(findSimilarities(biasDict))
    

    Output:

    brexit -24
    say -17
    uk 6
    brexit 24
    say 17
    uk -6
    [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
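
    The "append to a new list instead of printing" variant mentioned above could be sketched roughly as follows. This is not the code from the answer: it groups counts per word in a single pass, the names find_differing_counts and bias_pairs are made up for illustration, and it assumes each word appears at most twice (once per source list).

```python
from collections import defaultdict

def find_differing_counts(pairs):
    """Return (word, first_count, second_count) for words whose two counts differ."""
    counts_by_word = defaultdict(list)
    for word, count in pairs:
        counts_by_word[word].append(count)

    flagged = []
    for word, counts in counts_by_word.items():
        # Assumption: a word appears at most twice, once per source list.
        if len(counts) == 2 and counts[0] != counts[1]:
            flagged.append((word, counts[0], counts[1]))
    return flagged

bias_pairs = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5),
              ('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
print(find_differing_counts(bias_pairs))
# [('brexit', 11, 35), ('say', 11, 28), ('uk', 7, 1)]
```

    Each word is reported once with both counts, so the difference can be computed in whichever direction is needed.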
    

  • answered 2019-02-10 13:04 Maged Saeed

    The idea of combining occurrences seems fine to me. Here is my implementation. Any comments or optimizations are appreciated.

    def merge_list(words_count_list):
        updated_list = list()
        words_list = list()
        for i in range(len(words_count_list)):
            word = words_count_list[i][0]
            count = words_count_list[i][1]
            if word not in words_list:
                words_list.append(word)
                for j in range(i+1,len(words_count_list),1):
                    if word == words_count_list[j][0]:
                        count += words_count_list[j][1]
                updated_list.append((word,count))
        return updated_list
    
    print(merge_list([('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5), 
                                                    ('brexit', 35), ('say', 28),('may', 5), ('uk', 1)]))
    

    Output:

    [('brexit', 46), ('say', 39), ('uk', 8), ('eu', 6), ('deal', 5), ('may', 10)]
    

    Now, you can specify a threshold on the word count, sort by the count, then remove the most significant words.
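
    One possible reading of that last step, as a rough sketch: keep only the words whose merged count reaches a threshold and list them from most to least frequent. The function name top_words and the threshold value of 8 are arbitrary choices for illustration, not something from the question.

```python
def top_words(merged_counts, threshold):
    """Keep words whose merged count is at least `threshold`, highest count first."""
    kept = [(word, count) for word, count in merged_counts if count >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# The merged list produced by merge_list above.
merged = [('brexit', 46), ('say', 39), ('uk', 8), ('eu', 6), ('deal', 5), ('may', 10)]
print(top_words(merged, threshold=8))
# [('brexit', 46), ('say', 39), ('may', 10), ('uk', 8)]
```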

  • answered 2019-02-10 13:16 Devesh Kumar Singh

    Assuming you have two separate lists of words, you can do:

    # Converts a list of tuples to a dictionary.
    # [('a', 1), ('b', 2)] => {'a': 1, 'b': 2}
    def tupleListToDict(tuple_list):
        dictobj = {}
        for item in tuple_list:
            dictobj[item[0]] = item[1]
        return dictobj
    
    def findSimilarities(list1, list2):
        dict1 = tupleListToDict(list1)
        dict2 = tupleListToDict(list2)
        dict3 = {}  # To store the difference
        # Look up each key in the 2nd dict; if found, calculate the difference
        for key, value in dict1.items():
            if key in dict2:
                dict3[key] = abs(value - dict2[key])
        return dict3
    

    Example output

    list1 = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5)]
    list2 = [('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
    print(findSimilarities(list1, list2))
    {'brexit': 24, 'say': 17, 'uk': 6, 'may': 0}
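
    Note that 'may' shows up above with a difference of 0. If only words whose counts actually differ should be flagged, and both counts are wanted rather than the absolute difference, a variant along these lines might work. It is only a sketch; find_count_pairs is a made-up name, and it relies on dict() accepting a list of (word, count) tuples directly.

```python
def find_count_pairs(list1, list2):
    """Map each shared word to (count_in_list1, count_in_list2) when the counts differ."""
    counts1, counts2 = dict(list1), dict(list2)
    return {
        word: (count, counts2[word])
        for word, count in counts1.items()
        if word in counts2 and count != counts2[word]
    }

list1 = [('brexit', 11), ('say', 11), ('uk', 7), ('eu', 6), ('deal', 5), ('may', 5)]
list2 = [('brexit', 35), ('say', 28), ('may', 5), ('uk', 1)]
print(find_count_pairs(list1, list2))
# {'brexit': (11, 35), 'say': (11, 28), 'uk': (7, 1)}
```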