How to read a file from a directory and convert it to a table?

I have a class that takes positional arguments (startDate, endDate, unmappedDir, and fundCodes). It has the following methods.

The method below is supposed to take in an array of fundCodes, look in a directory, and check whether it contains files matching a certain format:

def file_match(self, fundCodes):
    # Get a list of the files in the unmapped directory
    files = os.listdir(self.unmappedDir)

    # loop through all the files and search for matching fund code
    for check_fund in fundCodes:

        # set a file pattern
        file_match = 'unmapped_positions_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # look in the unmappeddir and see if there's a file with that name
        if file_match in files:
            # if there's a match, load unmapped positions as etl
            return self.read_file(file_match)
        else:
            Logger.error('No file found with those dates/funds')

The other method is simply supposed to create an etl table from that file.

def read_file(self, filename):
    loadDir = Path(self.unmappedDir)
    for file in loadDir.iterdir():
        print('*' *40)
        Logger.info("Found a file : {}".format(filename))
        print(filename)
        unmapped_positions_table = etl.fromcsv(filename)
        print(unmapped_positions_table)
        print('*' * 40)
        return unmapped_positions_table

When running it, I'm able to retrieve the filename:

Found a file : unmapped_positions_PUPSFF_2018-07-01_2018-07-11.csv
unmapped_positions_PUPSFF_2018-07-01_2018-07-11.csv

But when trying to create the table, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'unmapped_positions_PUPSFF_2018-07-01_2018-07-11.csv'

Is it expecting a full path to the filename or something?

2 answers

  • answered 2018-07-11 20:06 Jean-François Fabre

    with this:

    files = os.listdir(self.unmappedDir)
    

    you're getting only the names of the files in self.unmappedDir, without the directory part.

    So when a generated name matches, you have to read the file by passing the full path (otherwise the routine looks for the file in the current working directory):

    return self.read_file(os.path.join(self.unmappedDir,file_match))
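
To illustrate the point, here is a minimal sketch using only the standard library. The temporary directory and the names `unmapped_dir`, `bare_exists`, and `full_exists` are assumptions made up for the demo; they stand in for `self.unmappedDir` and the real data.

```python
import os
import tempfile

# Hypothetical stand-in for self.unmappedDir, created just for this demo.
unmapped_dir = tempfile.mkdtemp()
name = "unmapped_positions_PUPSFF_2018-07-01_2018-07-11.csv"
with open(os.path.join(unmapped_dir, name), "w") as f:
    f.write("fund,position\nPUPSFF,100\n")

# os.listdir returns bare names, with no directory component.
files = os.listdir(unmapped_dir)

# A bare name is resolved against the current working directory,
# so this is only True if the script happens to run inside unmapped_dir.
bare_exists = os.path.exists(name)

# Joining the directory back on gives a path that works from anywhere.
full_path = os.path.join(unmapped_dir, name)
full_exists = os.path.exists(full_path)
```

Unless the script is started from inside the data directory, `bare_exists` should be false while `full_exists` is true, which is the same mismatch behind the FileNotFoundError in the question.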
    

    Aside: use a set here:

    files = set(os.listdir(self.unmappedDir))
    

    so that each filename lookup is faster than with a list (constant time on average instead of a linear scan)
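
A rough sketch of the set-based lookup follows. The function name `find_matches`, the fund codes, and the temporary directory are hypothetical and exist only to make the example runnable. With a handful of files the difference is negligible; it only starts to matter with many fund codes and a large directory.

```python
import os
import tempfile

# Hypothetical directory with two matching files, created for the demo.
unmapped_dir = tempfile.mkdtemp()
pattern = "unmapped_positions_{fund}_{start}_{end}.csv"
for fund in ("AAA", "BBB"):
    name = pattern.format(fund=fund, start="2018-07-01", end="2018-07-11")
    open(os.path.join(unmapped_dir, name), "w").close()

# Build the set once; each "in" test is then a hash lookup.
files = set(os.listdir(unmapped_dir))


def find_matches(fund_codes, start, end):
    """Return the expected file names that are present in the listing."""
    matches = []
    for fund in fund_codes:
        candidate = pattern.format(fund=fund, start=start, end=end)
        if candidate in files:
            matches.append(candidate)
    return matches


found = find_matches(["AAA", "ZZZ", "BBB"], "2018-07-01", "2018-07-11")
```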

    And your read_file method (which I didn't see earlier) should just open the file, instead of scanning the directory again (and returning at first iteration anyway, so it doesn't make sense):

    def read_file(self, filepath):
        print('*' * 40)
        Logger.info("Found a file : {}".format(filepath))
        print(filepath)
        unmapped_positions_table = etl.fromcsv(filepath)
        print(unmapped_positions_table)
        print('*' * 40)
        return unmapped_positions_table
    

    Alternatively, leave your main code unchanged (except for the set part) and prepend the directory name inside read_file; since it's an instance method, self.unmappedDir is readily available there.
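
That alternative could look roughly like the sketch below. The class name `UnmappedLoader` is hypothetical and trimmed down to the one attribute that matters, and `csv.reader` stands in for `etl.fromcsv` purely so the example runs without third-party packages.

```python
import csv
import os
import tempfile


class UnmappedLoader:
    """Hypothetical, trimmed-down stand-in for the asker's class."""

    def __init__(self, unmappedDir):
        self.unmappedDir = unmappedDir

    def read_file(self, filename):
        # Prepend the directory here so callers can keep passing bare names.
        filepath = os.path.join(self.unmappedDir, filename)
        # csv.reader stands in for etl.fromcsv(filepath) in this sketch.
        with open(filepath, newline="") as f:
            return list(csv.reader(f))


# Demo data in a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "positions.csv"), "w", newline="") as f:
    f.write("fund,position\nPUPSFF,100\n")

rows = UnmappedLoader(demo_dir).read_file("positions.csv")
```

The trade-off is that read_file then only accepts names relative to self.unmappedDir, whereas the version that takes a full path can read a file from anywhere.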

  • answered 2018-07-11 20:15 abarnert

    The proximate problem is that you need a full pathname.

    The filename that you're trying to call fromcsv on is passed into the function, and ultimately came from listdir(self.unmappedDir). This means it's a path relative to self.unmappedDir.

    Unless that happens to also be your current working directory, it's not going to be a valid path relative to the current working directory.

    To fix that, you'd want to use os.path.join(self.unmappedDir, filename) instead of just filename. Like this:

    return self.read_file(os.path.join(self.unmappedDir, file_match))
    

    Or, alternatively, you'd want to use pathlib objects instead of strings, as you do with the for file in loadDir.iterdir(): loop. If file_match is a Path instead of a dumb string, then you can just pass it to read_file and it'll work.
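
A minimal sketch of the pathlib route, under the assumption that the directory is wrapped in a `Path` up front. The temporary directory and the variable names are made up for the demo. Converting with `str()` at the boundary is a conservative choice in case the consuming function expects a plain string.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for self.unmappedDir.
unmapped_dir = Path(tempfile.mkdtemp())
name = "unmapped_positions_PUPSFF_2018-07-01_2018-07-11.csv"
(unmapped_dir / name).write_text("fund,position\nPUPSFF,100\n")

# The / operator joins components, so the directory travels with the name.
candidate = unmapped_dir / name

path_for_reader = None
if candidate.is_file():
    # Hand this string to the reader (etl.fromcsv in the question).
    path_for_reader = str(candidate)
```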


    But, if that's what you actually want, you've got a lot of useless code. In fact, the entire read_file function should just be one line:

    def read_file(self, path):
        return etl.fromcsv(path)
    

    What you're doing instead is looping over every file in the directory, then ignoring that file and reading filename, and then returning early after the first one. So, if there's 1 file there, or 20 of them, this is equivalent to the one-liner; if there are no files, it returns None. Either way, it doesn't do anything useful except to add complexity, wasted performance, and multiple potential bugs.

    If, on the other hand, the loop is supposed to do something meaningful, then you should be using file rather than filename inside the loop, and you almost certainly shouldn't be doing an unconditional return inside the loop.
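
If the loop really is meant to process several files, one possible shape is sketched below. The function name `read_matching`, the prefix filter, and the demo files are all assumptions, and `csv.reader` again stands in for `etl.fromcsv`. The point is only that the loop variable is the thing being read and that the return happens once, after the loop.

```python
import csv
import tempfile
from pathlib import Path


def read_matching(load_dir, prefix="unmapped_positions_"):
    """Read every CSV in load_dir whose name starts with prefix."""
    tables = {}
    for file in sorted(Path(load_dir).iterdir()):
        # Use the loop variable itself, and skip files that do not match.
        if file.suffix != ".csv" or not file.name.startswith(prefix):
            continue
        with file.open(newline="") as f:
            # csv.reader stands in for etl.fromcsv(str(file)).
            tables[file.name] = list(csv.reader(f))
    # Return once, after the loop has seen every file.
    return tables


# Demo data: two matching files and one that should be ignored.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "unmapped_positions_AAA.csv").write_text("fund\nAAA\n")
(demo_dir / "unmapped_positions_BBB.csv").write_text("fund\nBBB\n")
(demo_dir / "notes.txt").write_text("ignore me\n")

tables = read_matching(demo_dir)
```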