Python: Convert a dictionary with tuple keys to a sparse matrix using pandas

I have a dictionary where the key is a tuple of length 2, and the value is a number, like this:

{('Alf', '2012.xlsx'): 600}

I want to create a sparse matrix where Alf is the name of a row, 2012.xlsx is the name of a column, and 600 is the value where those two meet. The same should happen for every other entry in my dictionary. There may be keys like ('Alf', '2013.xlsx') and ('Elf', '2012.xlsx').

The dictionary can be of any size, so my plan was to loop through it after creating it and build a DataFrame cell by cell, but I'm struggling to do that.

Here's the code I've written to create this dictionary (ing_dict). I'm open to approaching this problem in a different (better) way.

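import os
from openpyxl import load_workbook

# inv_folder and inv_file are defined earlier in the script (not shown)
ing_dict = {}        # (row name, filename) -> value
recipe_files = []
file_counter = 0

def inventory(sheet, file, file_counter):
    print('\n', file)
    # rows 2-18, columns 1-3
    for row in sheet.iter_rows(min_row=2, max_row=18, min_col=1, max_col=3):
        if row[0].value:
            ing_dict[(row[0].value, file)] = row[2].value

# collect every .xlsx file in the folder except the inventory file itself
for filename in os.listdir(inv_folder):
    name, ext = os.path.splitext(filename)
    if ext == '.xlsx' and filename != inv_file:
        recipe_files.append(filename)

# loop through list of files, load each workbook, and send it to the inventory function
for file in recipe_files:
    file_counter += 1
    file_path = os.path.join(inv_folder, file)
    wb = load_workbook(file_path, data_only=True)
    sheet = wb.active
    inventory(sheet, file, file_counter)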
            

Thank you

1 answer

  • answered 2020-10-30 00:18 Sztyler

    The following code should do what you want. I added inline comments to explain how I transform your data.

    import numpy as np
    import pandas as pd
    
    # The expected input data
    data = {('Alf', '2012.xlsx'): 600, ('Elf', '2012.xlsx'): 400, ('Alf', '2013.xlsx'): 200, ('Tim', '2014.xlsx'): 150}
    
    row_to_pos = {}  # maps a row name to an actual position
    data_new = {}    # We need to reformat the data structure
    
    # loop through the data
    for (row, column), value in data.items():  # e.g., row='Alf', column='2012.xlsx'
        # if a row name is new, we add it to our mapper
        if row not in row_to_pos:
            row_to_pos[row] = len(row_to_pos)

        # if a column name is new, we add a new entry in `data_new`
        if column not in data_new:
            data_new[column] = [[], []]

        # store our data, key=column_name, value=a list of two lists
        data_new[column][0].append(row_to_pos[row])  # store the position
        data_new[column][1].append(value)  # store the actual value
    
    # the number of unique row names is only known after the first pass,
    # so a second pass builds one zero-filled column per file
    for key, value in data_new.items():
        tmp = np.zeros(len(row_to_pos))
        tmp[value[0]] = value[1] # value[0] are the positions, value[1] the corresponding values
        data_new[key] = tmp
    
    # create our dataframe
    data_new['Name'] = list(row_to_pos.keys())
    df = pd.DataFrame(data_new)
    df = df.set_index(['Name'])
    print(df)
    

    This results in the following output:

          2012.xlsx  2013.xlsx  2014.xlsx
    Name
    Alf       600.0      200.0        0.0
    Elf       400.0        0.0        0.0
    Tim         0.0        0.0      150.0
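
A shorter route to the same table, offered as a sketch rather than as part of the answer above: pandas turns a dictionary with tuple keys into a Series with a two-level index, and `unstack` pivots the second level into columns. The sample data is copied from the answer; the final conversion to a sparse dtype is an optional extra that assumes 0 is the right fill value for missing (row, file) pairs.

```python
import pandas as pd

# Same sample data as in the answer's example
data = {
    ('Alf', '2012.xlsx'): 600,
    ('Elf', '2012.xlsx'): 400,
    ('Alf', '2013.xlsx'): 200,
    ('Tim', '2014.xlsx'): 150,
}

# Tuple keys become a two-level MultiIndex: (row name, file name)
series = pd.Series(data)

# Pivot the second level into columns; missing pairs are filled with 0
df = series.unstack(fill_value=0)

# Optional: use a sparse dtype so the zeros are not stored explicitly
sparse_df = df.astype(pd.SparseDtype("int64", 0))

print(df)
print(sparse_df.sparse.density)  # fraction of cells that hold a non-fill value
```

With `ing_dict` from the question in place of `data`, the same two lines (`pd.Series(...)` followed by `unstack`) should apply unchanged, provided every key is a 2-tuple and no (row, file) pair occurs twice.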