Multistep data pipeline with Kafka?
I'm evaluating Kafka as a potential tool for performing ETL jobs at my company. We have an existing workflow that goes something like this:
Pull all CSVs in a specific AWS S3 directory, parse those files line by line, and insert the records into a database.
Once the entire directory has been processed, another job pulls a distinct list of IDs from that database and begins a machine learning analysis step for each ID.
If I am looking to replace this current process, would Kafka be capable of the above, and if so how should I set it up? From what I have read, my first thought was:
To replace the first step, I would create a Kafka producer that reads every file in the S3 directory and sends each line to a stream (would I call this the S3 topic?). I would then create a Kafka consumer that takes each record from that stream and inserts it into the database (let's call it the device DB).
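To make sure I'm describing this clearly, here is a minimal sketch of step 1. A plain Python list stands in for the "S3 topic" (with real Kafka this would be kafka-python's `KafkaProducer.send`/`KafkaConsumer` loop), a hypothetical dict of file bodies stands in for the S3 objects (boto3's `list_objects_v2`/`get_object` would supply those), and an in-memory SQLite table stands in for the device DB:

```python
import csv
import io
import sqlite3

# Stand-in for the "S3 topic" stream; with real Kafka the producer side
# would be producer.send("s3-topic", row) and the consumer side a
# KafkaConsumer("s3-topic") poll loop.
s3_topic = []

def produce_csv_lines(files):
    """Producer side: parse each CSV file line by line and publish each row."""
    for body in files.values():
        for row in csv.DictReader(io.StringIO(body)):
            s3_topic.append(row)  # producer.send("s3-topic", row)

def consume_into_db(conn):
    """Consumer side: take each record off the stream and insert it."""
    conn.execute("CREATE TABLE IF NOT EXISTS device (id TEXT, value TEXT)")
    for row in s3_topic:
        conn.execute("INSERT INTO device VALUES (?, ?)",
                     (row["id"], row["value"]))
    conn.commit()

# Hypothetical file contents in place of real S3 objects.
files = {
    "a.csv": "id,value\n1,foo\n2,bar\n",
    "b.csv": "id,value\n1,baz\n",
}
conn = sqlite3.connect(":memory:")
produce_csv_lines(files)
consume_into_db(conn)
```

The file names, column names, and `device` table are made up for illustration; the point is just the producer/consumer split around one topic.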
To replace the second step, I would create another producer that fires once the first producer/consumer pair has completed. This producer would gather the distinct set of IDs from the table and send them to a second stream (the Analysis topic?). Then, as each record arrives on this stream, a second consumer would pull the ID and perform the needed analysis.
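Step 2 in the same sketch style, again assuming in-memory stand-ins (a list for the Analysis topic, SQLite for the table that step 1 populated, and a placeholder function for the ML analysis):

```python
import sqlite3

analysis_topic = []  # stand-in for the "analysis-topic" stream

def produce_distinct_ids(conn):
    """Second producer: fires once step 1 finishes; publishes each distinct ID."""
    for (device_id,) in conn.execute("SELECT DISTINCT id FROM device"):
        analysis_topic.append(device_id)  # producer.send("analysis-topic", ...)

def analyze(device_id):
    """Placeholder for the per-ID machine learning step."""
    return f"analyzed {device_id}"

def consume_analysis():
    """Second consumer: run the analysis for each ID on the stream."""
    return [analyze(device_id) for device_id in analysis_topic]

# Hypothetical table standing in for what step 1 inserted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE device (id TEXT, value TEXT)")
conn.executemany("INSERT INTO device VALUES (?, ?)",
                 [("1", "foo"), ("2", "bar"), ("1", "baz")])
conn.commit()
produce_distinct_ids(conn)
results = consume_analysis()
```

Note the "fires once the first set had completed" part is the awkward bit: Kafka itself has no built-in end-of-topic signal, so this trigger would have to come from outside (a sentinel record, a scheduler, etc.).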
This was my first thought on how to tackle the problem, but I still have some questions that I'm hoping someone with more experience can answer.
Does the Kafka setup I described seem like the proper way to use the tool, or are there improvements I should consider?
Is inserting records into the device DB that I mentioned even necessary, or would it be possible to store all of those records in a stream? This would essentially merge the two steps and eliminate the consumer from step 1 and the producer from step 2.
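The merged version I'm imagining, as a sketch: every row is published to one stream keyed by its ID, and a single consumer tracks which IDs it has already seen and kicks off the analysis for each new one, with no database in between. (With real Kafka, keying by ID is what `producer.send(topic, key=..., value=...)` does; the dedup here is a stand-in for something like a Kafka Streams `groupByKey`, and it gives up the "directory fully processed" barrier that the DB step provided.)

```python
import csv
import io

s3_topic = []  # one stream carrying every parsed row, keyed by device ID

def produce(files):
    """Publish each CSV row keyed by its ID, instead of inserting into a DB."""
    for body in files.values():
        for row in csv.DictReader(io.StringIO(body)):
            s3_topic.append((row["id"], row))  # send(..., key=row["id"])

def consume_and_analyze():
    """Single consumer: remember seen IDs, analyze each ID the first time."""
    seen, results = set(), []
    for device_id, _row in s3_topic:
        if device_id not in seen:
            seen.add(device_id)
            results.append(f"analyzed {device_id}")  # placeholder ML step
    return results

files = {"a.csv": "id,value\n1,foo\n2,bar\n1,baz\n"}  # hypothetical data
produce(files)
results = consume_and_analyze()  # → ["analyzed 1", "analyzed 2"]
```

The trade-off is that the consumer now sees IDs as they arrive rather than after the whole directory is loaded, so if the analysis must only start once every file has been ingested, something still has to signal completion.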