Python3: How to improve the following Term Frequency algorithm?

I have various files with thousands of lines and no headers. The content on each line has the following structure:

[image: one sample line of the file, not reproduced here]

The first element ('LGAV') is the origin airport and the second ('EGGW') is the destination airport; these are the only relevant fields. My goal is to create a top-ten ranking of the busiest airports. I'd like to store the ranking in a dictionary of the form:

'nameAirport': [totalNumberOfMovements, numberTakeOffs, numberLandings]

Here numberTakeOffs is the number of times an airport appears as origin, numberLandings is the number of times it appears as destination, and totalNumberOfMovements is the sum of the two counts.

As an example:

TAKEOFFS:  {'EHAM': 55, 'EGLL': 46, 'LOWW': 44, 'LFPG': 43, 'LTBA': 38, 'EDDF': 37, 'LEMD': 34, 'EKCH': 33, 'EGKK': 31, 'LFPO': 30, ....

LANDINGS:  {'LEMD': 37, 'EDDM': 35, 'LEBL': 34, 'LFPO': 33, 'LFPG': 32, 'EKCH': 29, 'LTBA': 27, 'LSZH': 25, 'ENGM': 25, 'LTFJ': 24, 'LOWW': 23, 'EHAM': 23, ....

FINAL_DICT: {'EHAM': [78,55,23], 'EGLL': [67,46,21], 'LOWW': [67,44, 23], 'LFPG': [75,43, 32], .... 
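
To make the target format concrete, here is a minimal sketch that derives the combined entries from the two count dictionaries. It uses only the three airports whose takeoff and landing counts are both visible in the sample above; the names `takeoffs`, `landings` and `final_dict` are illustrative, not part of my actual code.

```python
# Counts copied from the sample output above (only airports visible in both lists)
takeoffs = {'EHAM': 55, 'LOWW': 44, 'LFPG': 43}
landings = {'EHAM': 23, 'LOWW': 23, 'LFPG': 32}

final_dict = {}
# Union of keys, so an airport seen in only one role still gets an entry
for airport in takeoffs.keys() | landings.keys():
    t = takeoffs.get(airport, 0)   # 0 if the airport never appears as origin
    l = landings.get(airport, 0)   # 0 if the airport never appears as destination
    final_dict[airport] = [t + l, t, l]

print(final_dict['EHAM'])  # [78, 55, 23]
```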

I'm not satisfied with the code I have so far: it takes too long on the biggest file. I also don't know how to obtain the desired output format:

'nameAirport': [totalNumberOfMovements, numberTakeOffs, numberLandings]

Code so far:

# Libraries
import pandas as pd
import collections
from itertools import chain
from collections import defaultdict

# START

# Load Data from file
df = pd.read_csv('traffic1day.exp2', header=None, sep=';', usecols=[0,1])

# Dictionary 1 for aircraft takeOffs
# 'LPPD' header for origin airports
takeOffs = df[0].value_counts()
dict_1 = takeOffs.to_dict()
print("TAKEOFFS: ", dict_1)

# Dictionary 2 for aircraft landings
# 'LEMD' header for destination airports
landings  = df[1].value_counts()
dict_2 = landings.to_dict()
print("\nLANDINGS: ", dict_2)

dict_3 = defaultdict(list)

for key, value in chain(dict_1.items(), dict_2.items()):
    dict_3[key].append(value)

# Combine dict_1 and dict_2 keys:values
combined_dict = collections.Counter(dict_3).most_common(10)
print("\n------ Combined dictionary from dict_1 and dict_2 values: \n\n", combined_dict)

# Sum values from same key
new_dict = defaultdict(list)
for key, value in combined_dict:
    new_dict[key] = {"Total": sum(value), "T, L": value}
print("\n------ Sum of values from combined dictionary for each key: \n\n", new_dict)
# END

1 answer

  • answered 2020-02-19 12:19 Christian Breinholt

    If I have understood your issue correctly, then I think it can be solved like this:

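    import pandas as pd

    def data_loader(filename_as_string):
      try:
         # First attempt: let pandas treat the first row as a header
         df = pd.read_csv(filename_as_string, sep=';')
      except Exception:
         # Fallback if the first attempt fails: read with no header row
         df = pd.read_csv(filename_as_string, sep=';', header=None)

      return df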
    

    Alternatively, since you only need the first two columns, you can read just those:

    df = pd.read_csv('traffic1hour.exp2', sep=';', header = None, usecols=[0,1])
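
For the ranking itself, one possible approach is to count both columns in a single pass and build the `[total, takeoffs, landings]` lists directly. The sketch below swaps pandas for the standard library (`csv` and `collections.Counter`), which is a different technique from the code above. The function name `airport_ranking` and the sample rows are hypothetical and only illustrate the `origin;destination;...` layout.

```python
import csv
import io
from collections import Counter

def airport_ranking(lines, top=10):
    """Return {airport: [total, takeoffs, landings]} for the `top` busiest airports."""
    takeoffs, landings = Counter(), Counter()
    for row in csv.reader(lines, delimiter=';'):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        takeoffs[row[0]] += 1   # first field: origin airport
        landings[row[1]] += 1   # second field: destination airport
    totals = takeoffs + landings  # per-airport sum of both counts
    # A Counter returns 0 for missing keys, so one-role airports are handled
    return {a: [n, takeoffs[a], landings[a]] for a, n in totals.most_common(top)}

# Hypothetical sample data, not real traffic
sample = io.StringIO("LGAV;EGGW;x\nEHAM;LEMD;x\nEHAM;EGGW;x\nLEMD;EHAM;x\n")
ranking = airport_ranking(sample)
print(ranking)
```

With a real file, the same function should accept an open file handle, for example `airport_ranking(open('traffic1day.exp2', newline=''))`. Dictionaries preserve insertion order in Python 3.7+, so the result is ordered from busiest to least busy.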