I cannot see the .csv file I have saved using Spark & Scala

I am trying to save the result of a SQL query I ran over a Parquet file, held in a DataFrame, into a .csv file.

This is how I wrote my query:

[screenshot of the query code]

And this is how I save the data:

[screenshot of the save code]

But the supposed .csv output file is nowhere to be seen:

[screenshot of the output directory]

2 answers

  • answered 2018-05-16 06:03 Yayati Sule

    You can try writing to file:///home/hadoop/apr2.csv.gz instead of /home/hadoop/apr2.csv. The following snippet is adapted from Databricks' spark-csv module on GitHub (Spark CSV):

    import org.apache.spark.sql.SQLContext
    
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true") // Use first line of all files as header
        .option("inferSchema", "true") // Automatically infer data types
        .load("cars.csv")
    
    val selectedData = df.select("year", "model")
    selectedData.write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("codec", "org.apache.hadoop.io.compress.GzipCodec") // Compress the output with gzip
        .save("newcars.csv.gz")
    

    As for the file:/// prefix: we add it when we want to read from or write to the local filesystem instead of HDFS.
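
    For instance, here is a minimal sketch contrasting the two schemes (the paths are hypothetical, chosen only for illustration):

    // file:/// targets the local filesystem; hdfs:// targets HDFS.
    val localDf = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load("file:///home/hadoop/cars.csv") // read from the local filesystem

    localDf.write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("hdfs:///user/hadoop/cars_out") // write the result to HDFS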

  • answered 2018-05-16 10:15 Mugdha

    To add the dependency, start your Spark shell with the following command:

    spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

    Read your Parquet file using:

    val df = sqlContext.read.parquet("file:///home/mugdha/users.parquet")
    

    Apply filter and select operations:

    val filteredDF = df.select("name","favorite_color")
    

    To save filteredDF, you can use the following code:

    filteredDF.write.format("com.databricks.spark.csv").save("file:///home/mugdha/testSave")
    

    Inside the testSave folder, you can check out your stored CSV. Note that Spark writes the output as a directory of part files (e.g. part-00000), not as a single testSave.csv file, which is usually why the expected file seems to be missing.
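
    If you need a single file rather than a folder of part files, a minimal sketch (assuming the result is small enough to fit in one partition; the output path is hypothetical) is to coalesce the DataFrame first:

    // Collapse to one partition so the output directory contains a single
    // part file. Only suitable for small results, since the write is no
    // longer parallel.
    filteredDF.coalesce(1)
        .write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save("file:///home/mugdha/testSaveSingle")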