spark-shell commands throwing error : “error: not found: value spark”
:14: error: not found: value spark
       import spark.implicits._
              ^
:14: error: not found: value spark
       import spark.sql
              ^
Here is my environment configuration. I have tried many different settings, but I keep getting this error. Does anyone know the reason? I saw a similar question, but the answers did not solve my problem.
JAVA_HOME : C:\Program Files\Java\jdk1.8.0_51
HADOOP_HOME : C:\Hadoop\winutils-master\hadoop-2.7.1
SPARK_HOME : C:\Hadoop\spark-2.2.0-bin-hadoop2.7
PATH :%JAVA_HOME%\bin;%SCALA_HOME%\bin;%HADOOP_HOME%\bin;%SPARK_HOME%\bin;
See also questions close to this topic
- Apache Spark Dataframe - Get length of each column
Question: In an Apache Spark Dataframe, using Python, how can we get the data type and length of each column? I'm using the latest version of Python. Using a pandas dataframe, I do it as follows:

df = pd.read_csv(r'C:\TestFolder\myFile1.csv', low_memory=False)
for col in df:
    print(col, '->', df[col].str.len().max())
- Spark: retrieving old values of rows after casting made invalid input nulls
I am having trouble retrieving the old value of a column in Spark before a cast. Initially, all my inputs are strings, and I want to cast the column num1 to a double type. However, when the cast is applied to anything that is not a double, Spark changes it to null.
Currently, I have dataframe df1:

num1  unique_id
1     id1
a     id2
2     id3

and a copy of df1, df1_copy, where the cast is made.
When running
df1_copy = df1_copy.select(df1_copy.col('num1').cast('double'), df1_copy.col('unique_id'))
it returns df1_copy:
num1  unique_id
1     id1
null  id2
2     id3

I have tried putting it into a different dataframe using select and when but get an error about not being able to find the column num1. The following is what I tried:
df2 = df1_copy.select(when(df1_copy.col("unique_id").equalTo(df1.col("unique_id")), df1.col('num1')).alias("invalid"), df1_copy.col('unique_id'))
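One possible approach, sketched below purely as an illustration (it assumes df1 still holds the original string values; the alias names num1_double and invalid are made up here): compute the cast once, keep the original column next to it, and use when to pick out rows where the cast turned a non-null input into null.

from pyspark.sql import functions as F

casted = F.col('num1').cast('double')

df2 = df1.select(
    F.col('unique_id'),
    casted.alias('num1_double'),
    # keep the original string wherever the cast failed on a non-null input
    F.when(casted.isNull() & F.col('num1').isNotNull(), F.col('num1')).alias('invalid')
)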
- Computing number of business days between start/end columns
I have two Dataframes:

- facts: columns data, start_date and end_date
- holidays: column holiday_date
What I want is a way to produce another Dataframe that has columns data, start_date, end_date and num_holidays.
Where num_holidays is computed as: the number of days between start and end that are not weekends or holidays (as in the holidays table).

The solution is here if we wanted to do this in PL/SQL. The crux is this part of the code:

--Calculate and return the number of workdays using the input parameters.
--This is the meat of the function.
--This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
RETURN (
    SELECT
        --Start with total number of days including weekends
        (DATEDIFF(dd, @StartDate, @EndDate) + 1)
        --Subtract 2 days for each full weekend
        - (DATEDIFF(wk, @StartDate, @EndDate) * 2)
        --If StartDate is a Sunday, Subtract 1
        - (CASE WHEN DATENAME(dw, @StartDate) = 'Sunday' THEN 1 ELSE 0 END)
        --If EndDate is a Saturday, Subtract 1
        - (CASE WHEN DATENAME(dw, @EndDate) = 'Saturday' THEN 1 ELSE 0 END)
        --Subtract all holidays
        - (SELECT COUNT(*) FROM [dbo].[tblHolidays]
           WHERE [HolDate] BETWEEN @StartDate AND @EndDate)
)
END
I'm new to pyspark and was wondering what's the efficient way to do this? I can post the UDF I'm writing if it helps, though I'm going slow because I feel it's the wrong thing to do:
- Is there a better way than creating a UDF that reads the holidays table in a Dataframe and joins with it to count the holidays? Can I even join inside a UDF? (See the join sketch after this list.)
- Is there a way to write a pandas_udf instead? Would it be fast enough?
- Are there some optimizations I can apply, like caching the holidays table somehow on every worker?
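As referenced above, here is a join-based sketch of the holiday-counting piece rather than a UDF. It is only an illustration under the assumption that the two Dataframes are named facts and holidays with the columns listed above; it computes the "subtract all holidays" term of the SQL formula, i.e. the number of weekday holidays falling in each start_date/end_date range.

from pyspark.sql import functions as F

# drop weekend holidays first; dayofweek returns 1 for Sunday and 7 for Saturday
weekday_holidays = holidays.filter(~F.dayofweek('holiday_date').isin(1, 7))

joined = facts.join(
    F.broadcast(weekday_holidays),  # the holidays table is small, so ship it to every worker
    (F.col('holiday_date') >= F.col('start_date')) &
    (F.col('holiday_date') <= F.col('end_date')),
    'left'
)

# count() ignores the nulls produced by unmatched left-join rows, so ranges with no
# holidays correctly get num_holidays = 0
per_range = joined.groupBy('data', 'start_date', 'end_date') \
                  .agg(F.count('holiday_date').alias('num_holidays'))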
- What is the right memory allocation that can be given to multiple Spark streaming jobs if they are being processed in a single EMR cluster (m5.xlarge)?
I have 12 Spark streaming jobs, and each receives a small amount of data at any time. These scripts have Spark transformations and joins.

What are the right memory allocations for these Spark streaming jobs if they are processed in a single EMR cluster (m5.xlarge, not using EMR steps)? The memory allocations include num-executors, executor-memory, etc.

Please explain how these Spark jobs work in the cluster. How will the cluster split resources between these jobs? Please help me with the basics.
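Not an answer, just a hedged illustration of where these knobs live: an m5.xlarge has 4 vCPUs and 16 GB of RAM, so 12 concurrent jobs can only get a small slice each. The values below are placeholder assumptions showing how one small streaming job might be sized from code, not a recommendation; the same settings can equally be passed on spark-submit.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("small-streaming-job")           # one of the 12 jobs
    .config("spark.executor.instances", "1")  # num-executors
    .config("spark.executor.cores", "1")
    .config("spark.executor.memory", "1g")    # executor-memory
    .getOrCreate()
)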
- Error Converting RDD to DataFrame in PySpark
I am trying to turn an RDD into a DataFrame. The operation seems to be successful, but when I then try to count the number of elements in the DataFrame I get an error. This is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.functions import col

sc = SparkContext(appName='ANALYSIS', master='local')

rdd = sc.textFile('file.csv')
rdd = rdd.filter(lambda line: line != header)  # header is assumed to be defined earlier (not shown), e.g. header = rdd.first()
rdd = rdd.map(lambda line: line.rsplit(',', 6))

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ANALYSIS") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

feature = ['to_drop', 'watched', 'watching', 'wantwatch', 'dropped', 'rating', 'votes']
df = spark.createDataFrame(rdd, schema=feature)

rdd.collect()  # it works
df.show()      # it works
df.count()     # does not work
Can someone kindly point out any errors to me? Thanks.

The error I encounter during execution is the following:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-15-3c9a60fd698f> in <module>
----> 1 df.count()

/opt/conda/lib/python3.8/site-packages/pyspark/sql/dataframe.py in count(self)
    662         2
    663         """
--> 664         return int(self._jdf.count())
    665
    666     def collect(self):

/opt/conda/lib/python3.8/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/opt/conda/lib/python3.8/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/opt/conda/lib/python3.8/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)
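Since df.show() only pulls a handful of rows while df.count() forces every row through the schema, a plausible (but unconfirmed) culprit is a row whose rsplit(',', 6) does not yield exactly 7 fields, or a header line slipping past the filter. Below is a sketch of a workaround that skips the manual RDD parsing entirely and lets Spark handle the CSV and the header itself.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("ANALYSIS")
    .getOrCreate()
)

# Spark parses the CSV, drops the header, and infers column types
df = spark.read.csv('file.csv', header=True, inferSchema=True)

df.show()
print(df.count())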