What is your approach for querying Cassandra with Spark (in R or Python)?

I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).

My preference for querying the data would be to abstract the Cassandra table I'm querying as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slow-down.
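Roughly what I have in mind, sketched in Scala for concreteness (I'd do the equivalent through sparklyr/spark-sql; the keyspace, table, column, and path names below are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-join-sketch")
  .config("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
  .getOrCreate()

// Expose the Cassandra table as a DataFrame via the spark-cassandra-connector.
val events = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholders
  .load()

// The ~200k ids of interest; "id" stands in for the partition key column.
val ids = spark.read.parquet("/path/to/ids").select("id")

// The inner join I'd like to express directly.
val result = events.join(ids, Seq("id"))

// Inspect the physical plan to see how the connector actually executes the join.
result.explain()
```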

Instead I'm using their preferred method: write a closure that extracts the data for a single id in the partition key over a JDBC connection, then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to apply that closure in parallel on each executor. I also set spark.executor.cores to 1 so I get a lot of parallelization.
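The pattern looks roughly like this, sketched in Scala with the connector's CassandraConnector standing in for spark_apply plus the JDBC connection (keyspace, table, and column names are placeholders):

```scala
import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// The ~200k partition-key ids of interest (idList is assumed to exist already).
val ids = sc.parallelize(idList)              // sc: an existing SparkContext

val connector = CassandraConnector(sc.getConf)

// One CQL session per Spark partition, one SELECT per id -- a rough analogue of
// the per-id closure that spark_apply runs on each executor.
val rows = ids.mapPartitions { part =>
  connector.withSessionDo { session =>
    val stmt = session.prepare(
      "SELECT * FROM my_keyspace.my_table WHERE id = ?")   // placeholder names
    // Materialize inside the block so the session isn't used after it is released.
    part.flatMap(id => session.execute(stmt.bind(id)).asScala).toList
  }.iterator
}
```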

I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?

1 answer

  • answered 2018-03-14 00:15 Christophe Schmitz

    A few points here:

    • Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as a query you write yourself.
    • Check the logs carefully as you work, and always verify how your high-level queries are translated into CQL. In particular, make sure you avoid a full table scan if you can.
    • If you're joining on the partition key, you should look into leveraging the connector methods repartitionByCassandraReplica and joinWithCassandraTable (see the sketch after this list). Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
    • Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads/writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
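    A minimal sketch of that join, assuming the partition key is a single column (keyspace, table, and column names are placeholders, and the case class field must match the partition key column name):

    ```scala
    import com.datastax.spark.connector._

    // One field per partition key column of the target table (placeholder name).
    case class Key(id: String)

    // The ids of interest, as an RDD of keys.
    val keys = sc.parallelize(idList.map(Key(_)))

    val joined = keys
      // Group keys onto Spark partitions co-located with their Cassandra replicas.
      .repartitionByCassandraReplica("my_keyspace", "my_table")
      // Pull only the matching rows with targeted per-key queries,
      // instead of an IN clause or a full table scan.
      .joinWithCassandraTable("my_keyspace", "my_table")

    // joined is an RDD[(Key, CassandraRow)]
    ```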

    Hope it helps!