Insert file contents into a dictionary in Python

Each line in the file (around 18 million lines) has the form word->docID,freqID. I am trying to load it into a dictionary as d[word] = [docID, freqID]. Here is my code:

lex = dict()
with open('word.txt') as f:
    for a in f:
        # tab = []
        word = a.split("-")[0]
        freqID = int(a.split(",")[1])
        docID = int(a[a.find(">")+1:a.find(",")])
        lex[word] = [docID, freqID]

It's taking a lot of time. How can I speed up the process so that it reads all the contents and stores them in the dictionary in less than a minute?

1 answer

  • answered 2018-11-08 19:58 J-L

    Try using a simple regular expression:

    import re
    lineRegExp = re.compile(r'(\w+)->(\d+),(\d+)')
    
    lex = dict()
    with open('blah.txt') as f:
        for line in f:
            try:
                # Groups follow the line layout: word->docID,freqID
                word, docId, freqId = lineRegExp.match(line).groups()
                lex[word] = [int(docId), int(freqId)]
            except AttributeError:
                print("No match found in line:", line, end='')
    
    print(lex)
    

    You might think a regular expression would be slow, but don't knock it until you try it out. It might be a lot faster than you think. (Then again, maybe not!)

    Using split() can create extra lists and strings that you don't use and so immediately discard. A single regular expression match avoids most of that: apart from the match object and the tuple returned by groups(), the only strings created are the ones you use to populate your dict.
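
To see which approach is actually faster on your data, it helps to time the candidates side by side. The following is a minimal sketch rather than a definitive benchmark. The sample lines are synthetic, and the function names are made up for illustration. parse_split mirrors the code in the question and parse_regex mirrors the regular expression above. parse_partition is a third variant that appears in neither, using str.partition to cut each line once at "->" and once at ",".

```python
import re
import timeit

LINE_RE = re.compile(r'(\w+)->(\d+),(\d+)')

def parse_split(line):
    # Mirrors the question: several split/find passes over the same line.
    word = line.split("-")[0]
    freq_id = int(line.split(",")[1])
    doc_id = int(line[line.find(">") + 1:line.find(",")])
    return word, [doc_id, freq_id]

def parse_regex(line):
    # Mirrors the answer: one regex match per line, groups in file order.
    word, doc_id, freq_id = LINE_RE.match(line).groups()
    return word, [int(doc_id), int(freq_id)]

def parse_partition(line):
    # Hypothetical alternative: cut once at "->" and once at ",".
    word, _, rest = line.partition("->")
    doc_id, _, freq_id = rest.partition(",")
    # int() ignores the trailing newline left on freq_id.
    return word, [int(doc_id), int(freq_id)]

# Synthetic stand-in for the real file; the real data may behave differently.
sample = [f"word{i}->{i},{i % 7}\n" for i in range(20_000)]

for parser in (parse_split, parse_regex, parse_partition):
    seconds = timeit.timeit(lambda: dict(map(parser, sample)), number=3)
    print(f"{parser.__name__}: {seconds:.3f}s for 3 passes")
```

The timings depend on the Python version and on what the real lines look like, so it is worth running this against a slice of the actual file before committing to one parser.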