How does Apache Spark read data from non-distributed storage systems?

If I create a pipeline of Spark transformations on a DataFrame that reads from non-distributed storage, such as a non-partitioned table in a relational database, and then call an action on it, will the whole table be pulled at once and stored as partitions across my cluster, or will the data arrive one row (or a few rows) at a time and not be partitioned at all?
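
To make the question concrete, here is a minimal sketch of the kind of pipeline I mean (the JDBC URL, table, column names, and credentials are just placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("plain-jdbc-read")
      .getOrCreate()

    // Read a non-partitioned table from a single relational database.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // placeholder connection string
      .option("dbtable", "orders")                          // placeholder table
      .option("user", "spark")
      .option("password", "secret")
      .load()

    // Transformations are lazy; nothing is pulled until an action runs.
    val bigOrders = df.filter("amount > 100")
    println(bigOrders.count())  // the action that triggers the read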

What I'm not clear about is:

  1. If my data is not already partitioned, then I think there is no benefit from Spark's parallel processing.
  2. In that case, would it make sense, as a pre-processing step, to first pull the data into the cluster and then run Spark on that partitioned data?
  3. Does it make sense to use Spark for non-big data at all?

2 answers

  • answered 2018-11-08 07:09 Prasan Karunarathne

    Spark gives us a comprehensive, unified framework for managing big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data).

    For more details, please refer to the links below:

    https://www.infoq.com/articles/apache-spark-introduction

    https://www.quora.com/How-can-I-run-Spark-without-HDFS

  • answered 2018-11-08 07:24 Siddhesh Rane

    It does not matter in what form your data is stored, whether a text file, a database, or some network service. When you import data into Spark, you tell it how to parse records (rows) from your data, and it then groups multiple records together into partitions.

    A record can be a line in a file, an entire file, an object created in your programming language, and so on. When you create a DataFrame, you will mostly be dealing with primitive data such as strings, numbers, and dates, or arrays of them. In the case of an RDD, a record can be any kind of programming-language object.
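
    For illustration, here is a rough sketch of that distinction in Scala (assuming a spark-shell style session where spark and sc already exist; the Sensor type is made up):

        // RDDs can hold arbitrary JVM objects; DataFrames hold rows of columns
        // such as strings, numbers, dates, and arrays of them.
        case class Sensor(id: String, readings: Array[Double])  // made-up record type

        // RDD of plain Scala objects.
        val rdd = sc.parallelize(Seq(
          Sensor("a", Array(1.0, 2.0)),
          Sensor("b", Array(3.0))
        ))

        // The same data as a DataFrame: named columns of primitives and arrays.
        import spark.implicits._
        val df = rdd.toDF()
        df.printSchema()  // id: string, readings: array<double>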

    Whatever your records are, they get serialized into a format that can be sent over the network and that is fully self-contained: if you serialize an object, the serialization also includes the entire transitive closure of objects it references. Records grouped together in this way form a partition. The number of records in a partition depends on both the number and the size of the serialized records, and you can explicitly state how many partitions you want or how large one partition should be.
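
    As a sketch of controlling partition count and size (paths and numbers are illustrative, and again a spark-shell style session is assumed):

        // Ask for at least 8 partitions when reading a text file.
        val lines = sc.textFile("data/events.log", minPartitions = 8)
        println(lines.getNumPartitions)

        // For DataFrames, repartition/coalesce set the partition count explicitly.
        val events = spark.read.json("data/events.json")
        val wide = events.repartition(16)  // full shuffle into 16 partitions
        val narrow = wide.coalesce(4)      // narrow merge down to 4 partitions
        println(narrow.rdd.getNumPartitions)

        // For file-based sources, target partition size is a config setting (~64 MB here).
        spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)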

    Now to answer your questions:

    If my data is not already partitioned, then I think there is no benefit from Spark's parallel processing.

    As I wrote above, Spark partitions the data for you. However, you may not get useful parallelism automatically; you have to be somewhat explicit in your code. With a JDBC source, for instance, Spark reads the whole table through a single connection into a single partition unless you tell it how to split the read.
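
    Here is a sketch of what being explicit looks like for JDBC (URL, table, column, and bounds are hypothetical):

        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "orders")
          .option("user", "spark")
          .option("password", "secret")
          .option("partitionColumn", "order_id")  // numeric/date/timestamp column to split on
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")           // 8 parallel connections, 8 partitions
          .load()

        // Without partitionColumn/lowerBound/upperBound/numPartitions, Spark uses a
        // single connection and the whole table lands in one partition.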

    In that case, would it make sense, as a pre-processing step, to first pull the data into the cluster and then run Spark on that partitioned data?

    I think you mean pulling the data into HDFS, where it will be partitioned. No, you don't need to do this, but it can be worthwhile if you want to rerun your Spark jobs on the same data over and over again.
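
    A sketch of that pre-processing step: pull the table once, write it out as Parquet (to HDFS or any path Spark can reach), and let later jobs read the already-partitioned copy (paths are illustrative):

        // One-time load from the database.
        val fromDb = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://dbhost:5432/mydb")
          .option("dbtable", "orders")
          .load()

        // Write a splittable, columnar copy.
        fromDb.write.mode("overwrite").parquet("hdfs:///staging/orders")

        // Repeated jobs read the Parquet copy and get partitions automatically.
        val orders = spark.read.parquet("hdfs:///staging/orders")
        println(orders.rdd.getNumPartitions)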

    Does it make sense to use Spark for non-big data at all?

    You can use Spark to 'save' your computations to disk. If you are doing exploratory data science work, you can save snapshots of intermediate results, go back to them later, or start a new branch from some base work: think of it as revision control for your data. That is possible with other frameworks too, but with Spark you get an efficiently serialized, resilient format that forms a DAG of computation, and it also future-proofs your pet projects for big data if they ever take off.
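
    Here is a sketch of that "revision control" idea, using persist and checkpoint (column names and paths are made up):

        import org.apache.spark.storage.StorageLevel

        spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")

        // An intermediate result worth keeping around.
        val base = spark.read.parquet("hdfs:///staging/orders").filter("amount > 100")

        // Keep a snapshot on disk for reuse within this application...
        val cached = base.persist(StorageLevel.DISK_ONLY)
        cached.count()  // materialize it once

        // ...or checkpoint it to cut the lineage and write a durable copy.
        val snapshot = base.checkpoint()

        // Both branches below reuse the saved work instead of recomputing it.
        val byCustomer = cached.groupBy("customer_id").count()
        val byDay = cached.groupBy("order_date").count()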

    It was confusing for me too when I started with Spark. You can check out this section on input and output in Spark, which will be relevant for you.