What is your approach for querying Cassandra with Spark (in R or Python)?
I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference would be to abstract the Cassandra table I'm querying as a Spark RDD (using sparklyr and Spark SQL) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slow-down.
Instead, I'm using their preferred method: write a closure that extracts the data for a single id in the partition key over a JDBC connection, then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to run that closure in parallel across the executors. I also set spark.executor.cores to 1 so I get a lot of parallelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (the IN clause issue described above)?
A few points here:
- Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as code you write yourself.
- Check the logs carefully as you work: always verify how your high-level queries are translated into CQL queries. In particular, make sure you avoid a full table scan if you can.
- If you are joining on the partition key, you should look into leveraging the methods repartitionByCassandraReplica and joinWithCassandraTable. Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
- Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads/writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
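To make the log-checking point concrete, here is roughly the difference you want to see in the CQL that your Spark job generates (hypothetical keyspace, table, and id values, for illustration only):

```sql
-- What a naive Spark SQL join over 200k ids can translate into --
-- one huge multi-partition query that the coordinator must fan out:
SELECT * FROM my_keyspace.my_table WHERE id IN ('id1', 'id2', 'id3', ...);

-- What a partition-key-aware join issues instead: one targeted
-- single-partition read per id, routed to a replica that owns it:
SELECT * FROM my_keyspace.my_table WHERE id = 'id1';
```

If the logs show a query with no restriction on the partition key at all, that is the full table scan to avoid.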
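For the join-on-partition-key case, a minimal Scala sketch of the two connector methods mentioned above, assuming the DataStax spark-cassandra-connector is on the classpath and using hypothetical keyspace/table/column names (these RDD methods are Scala-side; there is no direct sparklyr equivalent):

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names: keyspace "ks", table "events", partition key "id".
case class Key(id: String)

val conf = new SparkConf()
  .setAppName("partition-key-join")
  .set("spark.cassandra.connection.host", "127.0.0.1") // assumption: local node

val sc = new SparkContext(conf)

// The ids of interest (200k in the question), keyed like the partition key.
val ids = sc.parallelize(Seq(Key("a"), Key("b"), Key("c")))

// 1) repartitionByCassandraReplica moves each key to an executor that is
//    co-located with a replica owning that partition;
// 2) joinWithCassandraTable then performs one targeted single-partition
//    read per key -- no giant IN clause, no full table scan.
val joined = ids
  .repartitionByCassandraReplica("ks", "events")
  .joinWithCassandraTable("ks", "events")

joined.take(10).foreach(println)
```

This replaces both the IN-clause join and the 200k-invocation spark_apply closure with reads that Cassandra is designed to serve efficiently.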
Hope it helps!