How to tag and store files, by metadata, in Python?

I want to build a manual file tagging system like this. Given that a folder contains these files:

data/
    budget.xls
    world_building_budget.txt
a.txt
b.exe
hello_world.dat
world_builder.spec

I want to write a tagging system where executing

py -3 tag_tool.py -filter=world -tag="World-Building Tool"

will output

These files were tagged with "World-Building Tool":
    data/world_building_budget.txt
    hello_world.dat
    world_builder.spec

Another example. If I execute:

py -3 tag_tool.py -filter="\.txt" -add_tag="Human Readable"

It will output

These files were tagged with "Human Readable":
    data/world_building_budget.txt
    a.txt

I am not asking "Do my homework for me". I want to know what approach I can take to build something like this. What data structure should I use? How should I tag the contents of a directory?

1 answer

  • answered 2022-05-07 03:33 rfportilla

    Nice question!

    First, I am not clear if this is actually homework, but my first recommendation is always to see if it's already done (and it seems to be): https://pypi.org/project/pytaggit/

    If I were to ignore that and build it myself, I would consider what a tagging system's structure is. Long story (skip ahead if not interested): consider a simple file system. It has exactly one path to every file. You can do a string search by file name or even by properties, but the organization is such that a file can only exist in one place. Bring in file links (shortcuts). Soft links make a file appear as though it were in multiple locations. This is like being able to file "Soccer" under "S" and creating another file called "Football" under "F" that just says "see Soccer". Hard links actually make it so that the file is effectively in multiple locations. This would be like being able to pull the exact same file "Soccer" from both "F" and "S". If someone makes a change to one, the change is made to both. This is still a very limited organization, restricted to file location. If you wanted to be nimble and apply arbitrary organizations, hard links become heavy to maintain. Tagging is a way to accomplish this without too much overhead.

    ...... Past the skipped part ......

    There is more than one way to accomplish this, but here is a generic look at what is needed. Files and tags need a many-to-many relationship, i.e. you should be able to look at a file and see all tags associated with it AND you should be able to look at a tag and see all files associated with it. If you want to store the data once, you will have to choose which direction to optimize, because you are organizing your data only one way. Forward lookup will then be natural, and reverse lookup will require processing. If you want to maintain two data sets (or indexes), you can store both forward and reverse lookups. If you know that your data won't grow past a certain size and/or the usage will typically only require one direction, then one index should be fine. Otherwise, I would choose two. The trade-off is the overhead of keeping them in sync.
    If you want to optimize for tags(filename), then you would probably use a dict with something like

    filenameTags = {'myFileName': ['tag1', 'tag2', ...]}
    

    Getting filenames from tags with this structure would require a process of searching all of the embedded lists and returning the key associated, if there is a match. You can reverse this structure (filenames(tag)) if you want to optimize the other way. You can also create both file structures, but then you have the overhead of keeping both in sync.
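
    As a rough sketch of the two options above (not a definitive implementation), the following keeps a forward and a reverse index side by side and also shows the scan needed for reverse lookup when only the forward dict exists. The names `TagIndex`, `tags_by_file`, `files_by_tag` and `files_for_tag_single_index` are made up for illustration, and sets are used instead of lists only to avoid duplicate tags.

```python
class TagIndex:
    """Hypothetical helper: forward (file -> tags) and reverse (tag -> files) indexes."""

    def __init__(self):
        self.tags_by_file = {}   # forward lookup: tags(filename)
        self.files_by_tag = {}   # reverse lookup: filenames(tag)

    def add(self, filename, tag):
        # Update both indexes in one place so they cannot drift apart.
        self.tags_by_file.setdefault(filename, set()).add(tag)
        self.files_by_tag.setdefault(tag, set()).add(filename)

    def remove(self, filename, tag):
        # discard() is a no-op if the pair was never stored.
        self.tags_by_file.get(filename, set()).discard(tag)
        self.files_by_tag.get(tag, set()).discard(filename)

    def tags(self, filename):
        return sorted(self.tags_by_file.get(filename, ()))

    def files(self, tag):
        return sorted(self.files_by_tag.get(tag, ()))


def files_for_tag_single_index(filename_tags, tag):
    """Reverse lookup on a forward-only dict: scan every entry for the tag."""
    return sorted(name for name, tags in filename_tags.items() if tag in tags)


if __name__ == "__main__":
    index = TagIndex()
    index.add("a.txt", "Human Readable")
    index.add("data/world_building_budget.txt", "Human Readable")
    index.add("data/world_building_budget.txt", "World-Building Tool")
    print(index.files("Human Readable"))
    print(index.tags("data/world_building_budget.txt"))
```

    With two indexes, both directions are a single dict access; with one index, `files_for_tag_single_index` has to visit every file, which is the processing cost mentioned above.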
    Lastly, to make this persistent, save it to a file or a DB. Redis supports this nicely.
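
    For the "save to a file" option, one minimal sketch uses only the standard library `json` module. The function names `save_tags` and `load_tags` and the file name `tags.json` are assumptions for illustration, and the forward dict is assumed to map file names to collections of tag strings.

```python
import json
from pathlib import Path


def save_tags(filename_tags, path):
    """Write the forward index (file -> tags) to a JSON file."""
    # Sets are not JSON-serializable, so store each tag collection as a sorted list.
    serializable = {name: sorted(tags) for name, tags in filename_tags.items()}
    Path(path).write_text(json.dumps(serializable, indent=2), encoding="utf-8")


def load_tags(path):
    """Read the forward index back; return an empty dict if nothing was saved yet."""
    path = Path(path)
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {name: set(tags) for name, tags in raw.items()}


if __name__ == "__main__":
    tags = load_tags("tags.json")  # hypothetical file name
    tags.setdefault("a.txt", set()).add("Human Readable")
    save_tags(tags, "tags.json")
```

    Only the forward index is stored here; a reverse index, if you keep one, can be rebuilt from it after loading, so the two cannot disagree on disk.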
