Hive Python user-defined function (UDF) error
add file /home/sdev/yanan/udf.py;

select TRANSFORM (news_entry_id) USING 'python udf.py' AS (comb)
from tmp.yanan_gbdt
where p_date='20180708'
limit 10;
I have defined a function to process a field of the tmp.yanan_gbdt table, which uses p_date as its partition column. The code works well. But if I remove the condition where p_date='20180708', i.e. do not specify a partition, I get this error:
FAILED: NullPointerException null
So what is wrong with this?
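For reference, a minimal sketch of what a Hive TRANSFORM script such as udf.py typically looks like (the actual transformation logic is not shown in the question, so the process function here is a placeholder): Hive streams one tab-separated row per line on stdin and expects tab-separated rows on stdout.

```python
import sys

def process(news_entry_id):
    # Placeholder transformation; the real udf.py logic is not shown
    # in the question.
    return news_entry_id.strip()

def main():
    # Hive's TRANSFORM feeds one tab-separated row per line on stdin.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print(process(fields[0]))

if __name__ == "__main__":
    main()
```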
See also questions close to this topic
Python Kafka Streaming API - Binning
I am using the Python Kafka stream binning example given in this: Python Kafka Streaming API
I am able to generate the data using the generator.py file under winton-kafka-streams/examples/binning/, but when I run the binning.py file from the same folder, I get the issue below. Could someone help me resolve this?
Change color of missing values in Seaborn heatmap
Consider the example of missing values in the Seaborn documentation:
corr = np.corrcoef(np.random.randn(10, 200))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, mask=mask, vmax=.3, square=True)
How do I change the color of the missing values to, for example, black? The color of the missing values should be specified independently of the heatmap's color scheme, since it may not be present in that scheme.
I tried adding facecolor='black', but that didn't work. The color can be affected by e.g. sns.axes_style("white"), but it isn't clear to me how that can be used to set an arbitrary color.
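One hedged sketch (not verified across all seaborn versions): masked cells are simply not drawn, so they show the axes background, and setting the axes facecolor therefore colors the missing cells independently of the colormap.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless use
import seaborn as sns

corr = np.corrcoef(np.random.randn(10, 200))
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
# Masked cells are not drawn, so they show the axes background color.
ax.set_facecolor("black")
```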
Xpath + Scrapy + Python : data point couldn't be scraped
This is the XML structure:
<tr>
  <td>
    <font size="3">
      <strong>Location:</strong> Hiranandani Gardens, Powai
    </font>
  </td>
</tr>
I want to extract : Hiranandani Gardens, Powai
I tried with these:
Both returned an empty list.
Note: we must use the text of the strong tag, i.e. "Location:". Otherwise, since the same XML structure is used in many other places on the site, the query would fetch many unnecessary values besides the desired one.
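One hedged approach: anchor on the text of the strong tag and take the text node that follows it, since the text after </strong> is a following-sibling text node of <strong>. It is sketched here with lxml; the same XPath expression should work with Scrapy's response.xpath.

```python
from lxml import etree

xml = (
    '<tr><td><font size="3">'
    '<strong>Location:</strong> Hiranandani Gardens, Powai '
    '</font></td></tr>'
)
tree = etree.fromstring(xml)
# Anchor on the "Location:" label, then grab the first text node after it.
value = tree.xpath(
    '//strong[text()="Location:"]/following-sibling::text()[1]'
)[0].strip()
print(value)  # Hiranandani Gardens, Powai
```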
how to speed up sort in hive
I would like to speed up a Hive process, but I do not know how. The data is about 200 GB and about 300,000,000 lines of text, and I split it into 50 files in advance, so each file is about 4 GB. I would like to get one file as the result of the sort, so I set the number of reducers to 1 and the number of mappers to 50. Each line of the data consists of a word and a frequency. The same words should be grouped and their frequencies summed. All of the files are gzip files. The process takes a few days to complete, and I would like to speed it up to a few hours if I can. Which parameters should I change to speed up the process?
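A hedged HiveQL sketch (the column names word and freq and the table name words are assumptions): the single reducer is only unavoidable for the final total ordering, so letting the GROUP BY aggregation run with many reducers first and sorting only the aggregated output shrinks the data the lone reducer has to handle.

```sql
-- Stage 1: the GROUP BY aggregation runs with many reducers in parallel
-- (Hive picks the count via hive.exec.reducers.bytes.per.reducer when
-- mapreduce.job.reduces is left at -1).
-- Stage 2: ORDER BY still uses a single reducer, but only on the
-- aggregated, much smaller result.
SELECT word, SUM(freq) AS freq
FROM words
GROUP BY word
ORDER BY word;
```

Note also that gzip files are not splittable, so each 4 GB file is processed by a single mapper; more, smaller input files (or a splittable compression codec) allow more map parallelism.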
Hive table map type: specified ":" as the separator for each K-V pair, but it still shows the default separator "="
create table like this:
create table xxx (
    user_id bigint comment '',
    order_count map<string,string> comment ''
)
comment ''
PARTITIONED BY (`partition_date` string comment "")
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
The insert SQL is this:
str_to_map(concat_ws(",",collect_set(concat_ws(':', datetime_type, cast(order_count as string)))),',',':')
A select then shows:
I had specified ":" as the separator for each K-V pair, so why does it still show the default separator "=", and how do I make it display with ":" instead?
How to parse XML attributes in Hive?
I can parse out values like 9531 since they're enclosed in the StatusValue tag. But how do I do the same for the attributes in the following XML?
<PhoneCallEvent DateTime="2018-09-10T12:51:33.743-04:00" FromAppId="200002" MessageId="3407802">
  <LoanNumber>307375</LoanNumber>
  <StatusValue>9531</StatusValue>
  <StatusUserCommonID>2561550</StatusUserCommonID>
  <CallDirection>Inbound</CallDirection>
  <CallStartTime>2018-09-10T12:49:37.000-04:00</CallStartTime>
  <CallEndTime>2018-09-10T12:51:28.000-04:00</CallEndTime>
  <VectorDirectoryNumber/>
</PhoneCallEvent>
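Hive's built-in xpath UDFs accept standard XPath, in which attributes are addressed with @. As a hedged sketch (assuming the XML sits in a string column named xml_col): xpath_string(xml_col, 'PhoneCallEvent/@MessageId'). The XPath expression itself can be sanity-checked outside Hive, for example with lxml:

```python
from lxml import etree

xml = (
    '<PhoneCallEvent DateTime="2018-09-10T12:51:33.743-04:00" '
    'FromAppId="200002" MessageId="3407802">'
    '<StatusValue>9531</StatusValue>'
    '</PhoneCallEvent>'
)
root = etree.fromstring(xml)
# Attributes are addressed with @ in XPath.
message_id = root.xpath('/PhoneCallEvent/@MessageId')[0]
print(message_id)  # 3407802
```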
Dataframe null values transformed to 0 after UDF. Why?
How can nulls be handled when accessing DataFrame Row values? Does the NullPointerException really have to be handled manually? There must be a better solution.
case class FirstThing(id: Int, thing: String, other: Option[Double])

val df = Seq(
  FirstThing(1, "first", None),
  FirstThing(1, "second", Some(2)),
  FirstThing(1, "third", Some(3))
).toDS
df.show

val list = df.groupBy("id")
  .agg(collect_list(struct("thing", "other")).alias("mylist"))
list.show(false)
This fails with NPE:
val xxxx = udf((t: Seq[Row]) => t.map(elem => elem.getDouble(1)))
list.withColumn("aa", xxxx(col("mylist"))).show(false)
This strangely gives 0:
val xxxx = udf((t: Seq[Row]) => t.map(elem => elem.getAs[Double]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)

+---+-----------------------------------------+---------------+
|id |mylist                                   |aa             |
+---+-----------------------------------------+---------------+
|1  |[[first,null], [second,2.0], [third,3.0]]|[0.0, 2.0, 3.0]|
+---+-----------------------------------------+---------------+
Sadly this approach which works fine with data frames/datasets fails as well:
val xxxx = udf((t: Seq[Row]) => t.map(elem => elem.getAs[Option[Double]]("other")))
list.withColumn("aa", xxxx(col("mylist"))).show(false)
ClassCastException: java.lang.Double cannot be cast to scala.Option
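A null-safe sketch, reusing the list Dataset and imports from the code above (an assumption, not a verified answer): check Row.isNullAt before reading, and return an Option so Spark emits null instead of a boxed default value.

```scala
// Guard each access with isNullAt; Option is translated back to a
// nullable column by Spark.
val safe = udf((t: Seq[Row]) =>
  t.map(elem =>
    if (elem.isNullAt(1)) None else Some(elem.getDouble(1))
  )
)
list.withColumn("aa", safe(col("mylist"))).show(false)
```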
Spark Python UDF: how to unregister, or re-register to overwrite a function of the same name
f = lambda x: str(x)

with SparkContext("local", "HelloWorld") as sc:
    spark = SQLContext(sc)
    spark.udf.register("f", f)
This code registers the Python UDF once, so that it can be called, e.g. with:
%sql "select f(col_name) from table_name"
But the function does not change the next time this gets called (after f has been redefined)! How do you redefine a UDF, i.e. re-register it so as to overwrite the old one? Is there a drop_udf function, etc.?
Excel function to mimic RMS calculation returns error
All, below is an Excel VBA function I wrote to calculate RMS in Excel. The avg variable is a single cell's value; setter is a range. This should roughly mimic a root-mean-square error function. It needs to be applied on a rolling basis throughout the sheet, not just over one static dataset, so it needs to be a UDF.
To be clear, this returns a #NAME? error. The formula is entered as =runs_test(S86,T66:T86); all of the S and T columns are formulas that return numbers.
Any advice is appreciated, thanks!
' "Option Explicit" must be the first line of the module, not inside a
' procedure; placing it inside the function body is a compile error,
' which makes Excel report #NAME? for every UDF in that module.
Option Explicit

Function runs_test(avg As Double, setter As Range) As Double
    Dim i As Variant
    Dim counter As Double
    Dim er As Double
    Dim total As Double

    For Each i In setter
        er = (avg - i.Value) ^ 2
        counter = counter + er
    Next i

    total = setter.Cells.Count
    er = counter / total
    er = Sqr(er)        ' same as er ^ (1 / 2)
    runs_test = er      ' assign to the function name, not "runs"
End Function
Columns to Rows in Hive
I have the following table structure in Hive,
Date        ID  x1  x1_value  x2  x2_value
2018-09-17  1   a   10        b   20
2018-09-17  2   b   20        c   30
I want to convert this to:
Date        ID  x   x_value
2018-09-17  1   a   10
2018-09-17  1   b   20
2018-09-17  2   b   20
2018-09-17  2   c   30
I want to do this in Hive. Can anybody please give an idea of how to solve this?
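One hedged HiveQL sketch (the table name src is an assumption; stack is a built-in UDTF, though in some older Hive versions it may only work directly in the SELECT list rather than in a LATERAL VIEW): stack interleaves the column pairs into rows.

```sql
-- stack(2, ...) emits 2 rows per input row: (x1, x1_value) and (x2, x2_value).
SELECT t.`date`, t.id, s.x, s.x_value
FROM src t
LATERAL VIEW stack(2, x1, x1_value, x2, x2_value) s AS x, x_value;
```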
Hive jobs getting stuck after log initialization in a specified queue
It seems to be a lack of resources due to other jobs running in the same queue. Is there any workaround to prioritize some jobs over the already running jobs in the same queue, so that they execute first?