Hive: create table partitioned by year and month of TIMESTAMP column
I was wondering whether it is possible to create a table partitioned by the YEAR and MONTH of a TIMESTAMP column, for example with a command like the following:
USE database;

CREATE TABLE credit_transactions (
  processdate TIMESTAMP,
  requestprocess STRING,
  cardno_hash STRING
)
PARTITIONED BY (YEAR(processdate) INT, MONTH(processdate) INT)
CLUSTERED BY (cardno_hash) INTO 50 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");
Then, is it possible to simply load data from a CSV file and have Hive automatically partition the data? For example:
LOAD DATA LOCAL INPATH '/root/usr/transact.csv' OVERWRITE INTO TABLE credit_transactions;
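For reference, Hive does not accept expressions such as YEAR(processdate) inside PARTITIONED BY; partition columns have to be plain columns of their own. The usual pattern is to declare year/month partition columns and fill them with a dynamic-partition insert from a staging table that the CSV is loaded into first. A minimal sketch under that assumption (the staging table name is illustrative):

-- dynamic partitioning has to be enabled for the insert below
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE credit_transactions (
  processdate TIMESTAMP,
  requestprocess STRING,
  cardno_hash STRING
)
PARTITIONED BY (`year` INT, `month` INT)
CLUSTERED BY (cardno_hash) INTO 50 BUCKETS
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

-- plain-text staging table matching the CSV layout (name is illustrative)
CREATE TABLE credit_transactions_staging (
  processdate TIMESTAMP,
  requestprocess STRING,
  cardno_hash STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/root/usr/transact.csv'
OVERWRITE INTO TABLE credit_transactions_staging;

-- Hive derives the partition values from the last columns of the SELECT
INSERT INTO TABLE credit_transactions PARTITION (`year`, `month`)
SELECT processdate, requestprocess, cardno_hash,
       YEAR(processdate) AS `year`, MONTH(processdate) AS `month`
FROM credit_transactions_staging;

LOAD DATA only moves files into place, so it cannot populate a partitioned, bucketed ORC table directly; the INSERT ... SELECT step is what actually assigns rows to partitions.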
See also questions close to this topic
Does SQL Workbench support Kerberos login?
Does SQL Workbench support Kerberos login? I want to connect to a Hive server using a Kerberos login from SQL Workbench.
Combining the data of a table in Hive
I need to combine the data in a Hive table into one row. The intention is to capture the value other than 'N', i.e. whatever value is present other than 'N' should be captured for all the columns:
col1 col2 col3 col4 col5 col6
-----------------------------
GHY  BG   Q    N    N    N
GHY  BG   N    T    N    N
GHY  BG   N    N    A    N
GHY  BG   N    N    N    Z
Tried with the following query:
Select col1, col2,
       array(
         max(CASE WHEN col3 == 'Q' THEN 'Q' ELSE 'None' END),
         max(CASE WHEN col4 == 'T' THEN 'T' ELSE 'None' END),
         max(CASE WHEN col5 == 'A' THEN 'A' ELSE 'None' END),
         max(CASE WHEN col6 == 'Z' THEN 'Z' ELSE 'None' END))
FROM table1
GROUP BY col1, col2;
and got the below:
GHY BG ['None','None','A','None']
GHY BG ['Q','T','A','Z']
I am not seeing where the error is :(
After removing 'max' from the query:
FAILED: SemanticException [Error 10025]: Line 2:11 Expression not in GROUP BY key 'Q'
select col1, col2, collect_set(col)
from (
  select col1, col2, t.col
  from tbl lateral view explode(array(col3, col4, col5, col6)) t as col
  where t.col <> 'N'
) t
FAILED: SemanticException [Error 10025]: Line 1:7 Expression not in GROUP BY key 'col1'
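For what it's worth, both errors point the same way: max() and collect_set() are aggregate functions, so the query needs a GROUP BY on every non-aggregated column. A minimal sketch of the lateral-view version with the missing GROUP BY added (table and column names taken from the question):

select col1, col2, collect_set(col)
from (
  select col1, col2, t.col
  from tbl lateral view explode(array(col3, col4, col5, col6)) t as col
  where t.col <> 'N'
) s
group by col1, col2;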
Unable to connect to Hive using Python with impyla/dbapi.py
I am trying to connect to Hive (with the default Derby DB) using Python:
from impala.dbapi import connect

conn = connect(host='localhost', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM employee')
print cursor.description  # prints the result set's schema
results = cursor.fetchall()
but I am getting this error:
Traceback (most recent call last):
  File "hivetest_b.py", line 2, in <module>
    conn = connect( host='localhost', port=10000)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/impala/dbapi.py", line 147, in connect
    auth_mechanism=auth_mechanism)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/impala/hiveserver2.py", line 758, in connect
    transport.open()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 149, in open
    return self.__trans.open()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 101, in open
    message=message)
thrift.transport.TTransport.TTransportException: Could not connect to localhost:10000
The entry in my /etc/hosts is:
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
I am using the default hive-site.xml and the default Derby database for running Hive. When I run Hive through the shell, it shows me that table:
hive> show databases;
OK
default
test
test_db
Time taken: 0.937 seconds, Fetched: 3 row(s)
hive> show tables;
OK
employee
Time taken: 0.054 seconds, Fetched: 1 row(s)
hive> describe employee;
OK
empname     string
age         int
gender      string
income      float
department  string
dept        string

# Partition Information
# col_name   data_type   comment
dept        string
Time taken: 0.451 seconds, Fetched: 11 row(s)
I am not sure what exactly I am missing here. Any quick references/pointers would be appreciated.
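For what it's worth, this particular TTransportException usually just means nothing is listening on localhost:10000; the hive shell talks to the metastore directly, so it working does not prove the HiveServer2 Thrift service is up. A minimal sketch of what to check, assuming a default unsecured HiveServer2 (the auth_mechanism value is an assumption about your hive-site.xml):

# Start the Thrift service first; the hive CLI bypasses it:
#   hive --service hiveserver2 &
# then confirm something is listening on port 10000, e.g. with netstat.

from impala.dbapi import connect

# 'PLAIN' matches an unsecured HiveServer2 that still uses SASL;
# use auth_mechanism='NOSASL' if hive.server2.authentication is set to NOSASL.
conn = connect(host='localhost', port=10000, auth_mechanism='PLAIN')
cursor = conn.cursor()
cursor.execute('SELECT * FROM employee')
print cursor.fetchall()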
Hive - Issue with a Hive sub-query
My problem statement is:
"Find top 2 districts per state with the highest population"
The data is like:
My expected output is:
I tried this with a lot of queries and sub-queries, but they result in a SQL error in the sub-query.
Can anyone help me with getting this result?
Thanks in advance.
Queries I tried:

- Select state_name,
         (select concat_ws(',', collect_set(dist_name as string))
          from population
          where state_name = state_name
          group by state
          order by population desc 2)
  from population
  group by state_name

- Select state_name, concat_ws(',', collect_set(cast(dist_name as string)))
  from population
  where population.dist_name in (
      select dist_name
      from (
          select dist_name, max(b.population) as total
          from population b
          where state_name = b.state_name
          group by b.dist_name, b.dist_name
          order by total desc
          limit 2
      ) as dist_name
  )
  group by state_name
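For reference, a common way to express top-N per group in Hive is a row_number() window function; a minimal sketch, assuming the table is population(state_name, dist_name, population):

select state_name, dist_name, population
from (
    select state_name, dist_name, population,
           row_number() over (partition by state_name order by population desc) as rn
    from population
) ranked
where rn <= 2;

If one row per state is wanted, this result can then be grouped by state_name and collapsed with concat_ws(',', collect_set(cast(dist_name as string))), as in the attempts above.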
Which one is faster in Hive? "in" or "or"?
select * from t where something in ('a', 'b', 'c')
select * from t where something='a' or something='b' or something='c'
Is there an efficiency difference between these two? Or they are the same under the hood?
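One way to see how a particular Hive version handles the two forms is to compare their plans; a quick sketch using the table and column from the question:

explain select * from t where something in ('a', 'b', 'c');
explain select * from t where something = 'a' or something = 'b' or something = 'c';

If the optimizer normalizes one form into the other, the filter predicates in the two plans will be identical; if not, the plans will show exactly where they differ.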
Partition discovery in Spark is not showing the right number of partitions
Spark partition discovery is not partitioning the data based on the folder structure.
I have a directory called list; inside it there is a folder for each country, labelled like COUNTRY=US, and inside each country folder there is a folder per region, labelled like REGION=NORTH. Inside each region folder there are multiple CSV files.
Now I would like to read this data with the Spark 2.3 DataFrame API, and while reading the data I am using the basePath option so that Spark automatically discovers the underlying partitions.
Since I have 2 country-level folders, with 5 regions inside one country and 2 regions inside the second, I have 7 regions in total. Therefore my DataFrame should show the number of partitions as 7, but instead it is showing the number of partitions as 5.
Here is the code for your reference:
start = time.time()
path = "D:\\Sonika\\Propcount\\*\\*\\*.txt"
df_probe_count_base = spark.read.option("header", "true")\
    .option("basePath", "D:\\Sonika\\Propcount")\
    .option("Delimiter", ",").csv(path)
print("df_probe_count_base", df_probe_count_base.rdd.getNumPartitions())
end = time.time()
I also tried to see which row belongs to which partition, and was surprised to find that the data is distributed across 4 partitions seemingly at random, i.e. in one partition I can see rows belonging to 2 different countries with 2 different regions:
print(df_probe_count_base.rdd.glom().collect())

[
 Row(_c0='txt', _c1='RowNumber', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='1', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='2', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='3', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='4', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='5', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='6', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='7', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='8', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='9', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='0', _c1='10', COUNTRY='CAN', REGION='Reg2'),
 Row(_c0='txt', _c1='RowNumber', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='1', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='2', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='3', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='4', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='5', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='6', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='7', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='8', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='9', COUNTRY='ABC', REGION='Reg1'),
 Row(_c0='0', _c1='10', COUNTRY='ABC', REGION='Reg1')
]
This is different from a Hive partition, because in Hive you will have a single file for that partition, whereas in this case you may have multiple files.
Can anyone please clarify:
- how partition discovery works in Spark 2.3 with CSV files
- why there are 4 partitions and not 7 (FYI, I have 4 cores on my machine)
- the whole logic of doing this: I have 2 different DataFrames with the same level of partitioned data, and later on I would like to join the 2 DataFrames on country, region and one primary key. Therefore I would like to partition the data by country and region to avoid shuffling across nodes (shuffling within a node may occur, but I am not bothered by that). Can you please let me know whether this understanding is correct (see the sketch after this list)?
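For what it's worth, a minimal PySpark sketch of the distinction, using the COUNTRY/REGION columns and paths from the question (the second DataFrame df2 is hypothetical): getNumPartitions() counts task partitions, which Spark derives from file sizes and parallelism, not from the 7 discovered COUNTRY=/REGION= directories. If the goal is to line two DataFrames up on the join keys, repartitioning on those columns is the explicit way to do it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Directory discovery turns COUNTRY and REGION into columns, but the
# number of *task* partitions below is unrelated to the 7 folders.
df = (spark.read.option("header", "true")
      .option("basePath", "D:\\Sonika\\Propcount")
      .csv("D:\\Sonika\\Propcount\\*\\*\\*.txt"))

print(df.rdd.getNumPartitions())                  # task partitions, file/size driven
df.select("COUNTRY", "REGION").distinct().show()  # the 7 discovered directory partitions

# Explicitly hash-partition on the join keys so Spark can reuse this
# partitioning at join time instead of shuffling again.
df_keyed = df.repartition("COUNTRY", "REGION")
# joined = df_keyed.join(df2.repartition("COUNTRY", "REGION"),
#                        on=["COUNTRY", "REGION", "primary_key"])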
CombineFileInputFormat in Hadoop
I am using CombineFileInputFormat in a MapReduce job to process small files (KB in size) and large files (hundreds of MB and some GB). I have mapreduce.input.fileinputformat.split.maxsize set to 64 MB and setMaxSplitSize(67108864). When the mappers start, this line is printed in the syslog:
2018-12-29 10:26:10,138 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: Paths: /input/file.csv-m-00002:0+908250, /input/file_68171.txt-m-00000:0+36589, /input/file_27138.txt-m-00000:0+62929, /input/file_62783.txt-m-00000:0+77776, /input/file_26540.txt-m-00001:0+50115, /input/file_12282018.txt-m-00007:0+65766888, /input/file_12282018.txt-m-00007:65766888+65766889.
Can someone explain the above processing split? When I add up these splits, the total is more than the split size.
I have some questions regarding file splits:
- Which value is used when CombineFileInputFormat is used: mapreduce.input.fileinputformat.split.maxsize or setMaxSplitSize() from the CombineFileInputFormat class?
- How does setMaxSplitSize() work for files larger than maxSplitSize?
- What is the difference between mapreduce.input.fileinputformat.split.maxsize and setMaxSplitSize()?
Is it possible to virtually divide a Hadoop cluster into smaller clusters?
We are working to build a big cluster of 100 nodes with 300 TB of storage. We then have to serve it to different users (clients) with restricted resource limits, i.e., we do not want to expose the complete cluster to each user. Is this possible? If not, what are other ways to do it? Are there any built-in solutions available? It is essentially cluster partitioning on demand.
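For what it's worth, one built-in way to carve up a shared cluster is to keep a single physical cluster and give each tenant a capped resource queue rather than a separate cluster: YARN's Capacity Scheduler limits each tenant's share of compute, and HDFS space quotas cap storage per tenant directory. A minimal capacity-scheduler.xml sketch, with hypothetical queue names tenant_a and tenant_b:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>tenant_a,tenant_b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.tenant_a.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.tenant_b.capacity</name>
  <value>40</value>
</property>

<!-- storage side (illustrative): cap a tenant's HDFS directory with
     hdfs dfsadmin -setSpaceQuota 50t /user/tenant_a -->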