What (event-driven) architecture should I choose for a modular data processing pipeline?
I'm working on the architecture for a modular data processing platform. The gist is that the platform should accept many documents (millions) of different types as input and process them through a pipeline of different data science modules. An example of a pipeline would be:
In the above diagram, I call each block a module. I have the following requirements:
- The platform should facilitate multiple projects (= multiple pipelines)
- A module can be written in any programming language, but mostly Python will be used.
- Each module handles a single document/model/entity at a time.
- It should be easy to add new modules to the platform and define pipelines.
- Building a new module should be easy for data scientists, e.g. they should be able to run/debug it locally with a test pipeline.
- A developer should be able to proxy to modules that are already available (running on-premise).
- The pipeline needs to be interruptible, and it should be straightforward to catch any failures.
- It should be possible to resume the pipeline, restart it at any point, or redo the work of a module.
- All the work needs to be traceable through logs.
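To make the "run/debug it locally" and traceability requirements concrete, here is a minimal sketch of what a module contract plus a local test harness could look like. The `Document` shape and the two toy modules are hypothetical, purely for illustration:

```python
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)

@dataclass
class Document:
    doc_id: str
    payload: dict
    trace: list = field(default_factory=list)  # records which modules ran (traceability)

def run_local_pipeline(doc, modules):
    """Run one document through a list of modules sequentially, logging each step.

    Each module is a plain callable Document -> Document, so a data scientist
    can debug it locally without any broker or worker infrastructure.
    """
    log = logging.getLogger("pipeline")
    for module in modules:
        log.info("doc=%s module=%s", doc.doc_id, module.__name__)
        doc = module(doc)
        doc.trace.append(module.__name__)
    return doc

# Two toy modules; each handles a single document at a time.
def normalise(doc):
    doc.payload["text"] = doc.payload["text"].lower()
    return doc

def word_count(doc):
    doc.payload["n_words"] = len(doc.payload["text"].split())
    return doc

result = run_local_pipeline(Document("d1", {"text": "Hello World"}), [normalise, word_count])
```

The same callables could later be wrapped by whatever transport you pick (Celery task, Kafka consumer), keeping the module logic itself transport-agnostic.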
We are currently using Celery for a first version of this platform, but there are a number of issues:
- Celery has no clear manager strategy, so we have to chain and queue all the tasks up front.
- In addition to the above, there is no strategy for when a task fails, so we have to inspect all the logs manually (currently using Graylog).
- Speaking of logging, Celery doesn't include metadata with the default logger, so we are manually overriding Celery internals to add it.
- Overall, I find Celery really frustrating to work with in non-happy-path scenarios.
I'm thinking of building a frontend GUI (e.g. using GoJS) to make it easy to manage the pipelines. It would be great to have some form of auto-discovery of modules to assist a user in building a flow; it should also be possible to use modules that are not (yet) registered.
I'm now considering an event-driven microservice architecture in which a manager service supervises the pipelines and sends events to the modules (containerised microservices) to do the work; the modules then send events back to the manager, which in turn signals the next modules. The pipeline definitions would be stored in a database (e.g. MongoDB).
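A minimal sketch of that manager idea, using in-memory queues as a stand-in for per-module Kafka topics (the module names and pipeline layout are illustrative, not from your setup):

```python
from collections import defaultdict, deque

# Pipeline definition as it might be stored in MongoDB:
# for each module, the module(s) the manager triggers next.
PIPELINE = {
    "ocr": ["language_detect"],
    "language_detect": ["classify"],
    "classify": [],  # terminal module
}

class Manager:
    """Routes 'done' events from one module to the input queue of the next."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.queues = defaultdict(deque)  # stand-in for per-module Kafka topics
        self.log = []  # traceable record of every completed (module, doc) step

    def submit(self, doc_id, module="ocr"):
        self.queues[module].append(doc_id)

    def on_done(self, module, doc_id):
        """Event from a module back to the manager; manager signals successors."""
        self.log.append((module, doc_id))
        for nxt in self.pipeline[module]:
            self.queues[nxt].append(doc_id)

    def drain(self):
        # Simulate workers: pop each queue and report completion to the manager.
        progress = True
        while progress:
            progress = False
            for module in list(self.queues):
                while self.queues[module]:
                    self.on_done(module, self.queues[module].popleft())
                    progress = True

mgr = Manager(PIPELINE)
mgr.submit("doc-1")
mgr.drain()
```

Because the manager owns routing and the per-step log, resume/redo becomes a matter of re-enqueueing a document at a given module rather than re-chaining the whole task graph up front, which is the Celery pain point you describe.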
I've looked at Nameko and am now looking at Kafka, but in both cases it seems that I need to define all the available services/topics in the code up front. Ideally I want to add and remove streams for different modules (i.e. a topic per module?) without modifying code; I am looking for a more dynamically defined event bus. The classic pub-sub architecture doesn't seem to fit because I don't know up front what all the microservices (data processing modules) are going to be.
Any thoughts on how to tackle this? Thanks a lot!
Dynamic spooling of your data pipelines is something you can do with event-driven microservices talking over Kafka.
Topics can be created dynamically for your pipelines, and Kafka takes care of replication and resilience of those topics.
You also get to stream-process your data, and because Kafka retains messages, consumers can come back at any time for data they haven't pulled yet.
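For example, one way to avoid declaring topics in code up front is to derive a topic name per (project, module) pair by convention, so registering a new module just means creating its topic at runtime. A sketch (the naming scheme is an assumption; the commented-out admin calls require the kafka-python package and a reachable broker, so they are not executed here):

```python
def topic_for(project, module):
    """Derive a per-module topic name so topics need not be declared in code."""
    return f"{project}.{module}".lower().replace(" ", "-")

def pipeline_topics(project, modules):
    """Topic names for every module in one project's pipeline."""
    return [topic_for(project, m) for m in modules]

topics = pipeline_topics("invoices", ["OCR", "Language Detect", "Classify"])

# With a broker available, the topics can then be created on the fly:
# from kafka.admin import KafkaAdminClient, NewTopic
# admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# admin.create_topics(
#     [NewTopic(t, num_partitions=3, replication_factor=2) for t in topics]
# )
```

Your manager service can read the module list from the pipeline definition in MongoDB and create/delete topics as pipelines are added or removed, which gives you the dynamically defined event bus you're after without code changes per module.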