Load a JSON file into a Hive table: SELECT COUNT(*) is not working
I have a huge JSON file. This is the data for just one user:
{"user_id":"EZmocAborM6z66rTzeZxzQ","name":"Rob","review_count":761,"yelping_since":"2009-09-12","friends":["iJg9ekPzF9lkMuvjKYX6uA","ctWAuzS04Xu0lke2Rop4lQ","B8CqppjOne8X4RSJ5KYOvQ","_K9sKlA4fVkWI4hyGSpoPA","Ec-epOsAWvjI6e90IlM8jw","r2UUCzGxqI6WPsiWPgqG2A","3ybkL7N63UdSn4wepINzUw","d-lzusSagnkDuiyLlfF5pw","Ydh2zA5wUlD-UbApp8toGA","DeZhnC-RsNFmKSlI0lUksw","NTuvVb-ZwQ_rFn6W9Krm7A","PCdUS3L8LhQOereIyQ6_RA","_SZGgg8xSk7v_E4TPLfXEg"],"useful":18456,"funny":12316,"cool":17579,"fans":298,"elite":["2017","2015","2016","2014","2011","2013","2012"],"average_stars":3.59,"compliment_hot":3904,"compliment_more":305,"compliment_profile":207,"compliment_cute":79,"compliment_list":19,"compliment_note":4705,"compliment_plain":2617,"compliment_cool":4192,"compliment_funny":4192,"compliment_writer":1147,"compliment_photos":1347,"type":"user"}
I wrote the following SQL to create a table:
CREATE TABLE Data1 (
user_id STRING,
name STRING,
review_count INT,
yelping_since STRING,
friends ARRAY<STRING>,
useful INT,
funny INT,
cool INT,
fans INT,
elite ARRAY<STRING>,
average_stars INT,
compliment_hot INT,
compliment_more INT,
compliment_profile INT,
compliment_cute INT,
compliment_list INT,
compliment_note INT,
compliment_plain INT,
compliment_cool INT,
compliment_funny INT,
compliment_writer INT,
compliment_photos INT,
type STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
LOAD DATA LOCAL INPATH 'file.json' INTO TABLE Data1;
The problem is that when I run SELECT COUNT(*) FROM Data1;
I get this error:
Status: Failed
Vertex failed, vertexName=Map 1, vertexId=vertex_1613999331066_0005_1_00, diagnostics=[Task failed, taskId=task_1613999331066_0005_1_00_000001, diagnostics=[TaskAttempt 0
failed, info=[Error: Error while running task ( failure ) : attempt_1613999331066_0005_1_00_000001_0:java.lang.RuntimeException: java.lang.RuntimeException: Map operator
initialization failed
Could anyone please explain what the problem is and how I can solve it?
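Two details in the setup above are worth double-checking. The sample record stores "average_stars":3.59 while the table declares average_stars INT, and the org.apache.hive.hcatalog.data.JsonSerDe class ships in hive-hcatalog-core, which is often not available to the Tez tasks even when the CREATE TABLE itself succeeds; a missing SerDe jar is a common cause of "Map operator initialization failed". A minimal sketch of those two adjustments (the jar path is an assumption about the install layout):

-- The jar path below is hypothetical; adjust it to wherever hive-hcatalog-core lives in your install.
ADD JAR /opt/hive/hcatalog/share/hcatalog/hive-hcatalog-core.jar;

-- average_stars is 3.59 in the sample record, so DOUBLE matches the data better than INT.
ALTER TABLE Data1 CHANGE average_stars average_stars DOUBLE;

-- Sanity-check that the SerDe parses a single record before counting the whole file.
SELECT user_id, name, average_stars FROM Data1 LIMIT 1;

This SerDe also expects exactly one JSON object per line of the input file, which the sample record already satisfies.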
See also questions close to this topic
-
Azure Api call taking long time
I am currently working on a project in which I need to get the list of SQL databases.
I used this API call:
https://docs.microsoft.com/en-us/rest/api/sql/databases/listbyserver
It takes three parameters, namely the subscription id, resource group, and server name.
I used another API call to list the resource groups:
https://docs.microsoft.com/en-us/rest/api/resources/resourcegroups/list
For the server names and subscription id, I have created a dictionary for manual insertion.
My code structure looks something like:
For loop: loop through the server names
    Call the resource groups API
    For loop: loop through the resource groups
        Call the listByServer API
        For loop: read the values from the listByServer API response
The problem I am facing is that it takes almost an hour to get the result.
I have worked on another project with a different API call (mentioned below); its code structure and parameters are similar, and the call completes within 5 minutes.
API call: https://docs.microsoft.com/en-us/rest/api/compute/virtualmachines/list
I want to understand why it takes so long in the case of the listByServer API call.
Is there any other way I can get the expected results?
Thanks in advance!
-
SQL: How to join tables with 1+ million records
I want to join two tables (the "products" table has 1.5 million records) using the following query, but after 15 minutes the query was still running and my PC was overheating (it's a Lenovo V330-14IKB with 8 GB of RAM), so I stopped it.
I am very new to indexes, and I tried creating the following:
- CREATE INDEX customer_id_idx1 ON orders (customer_id)
- CREATE INDEX customer_id_idx2 ON products (customer_id)
- CREATE INDEX customer_id_revenues_idx ON orders(customer_id,revenues)
- CREATE INDEX customer_id_costs_idx ON products(customer_id,costs)
This is the query:
SELECT a.customer_id,
       (SUM(a.revenues) / SUM(b.costs)::FLOAT) AS roi
FROM orders a
JOIN products b ON a.customer_id = b.customer_id
WHERE a.customer_id IN (
    SELECT customer_id
    FROM (
        SELECT customer_id, COUNT(*) AS n_products
        FROM products
        GROUP BY 1
        ORDER BY 2 DESC
        LIMIT 5
    ) x
)
GROUP BY a.customer_id
ORDER BY roi DESC
The output should return the ratio of revenues to costs for the top 5 customers by the number of products they bought.
I am using pgAdmin. Can someone explain how to speed this up so the query actually finishes? Thank you in advance.
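For reference, a common way to restructure a query like this is to aggregate each table per customer first and only then join the two small per-customer results, so the join never multiplies millions of rows (the original join also inflates both sums whenever a customer has several orders and several products). A sketch under the same table and column names, untested against the real data:

-- Aggregate orders and products per customer first, then join the small results.
WITH top_customers AS (
    SELECT customer_id
    FROM products
    GROUP BY customer_id
    ORDER BY COUNT(*) DESC
    LIMIT 5
),
order_totals AS (
    SELECT customer_id, SUM(revenues) AS total_revenues
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM top_customers)
    GROUP BY customer_id
),
product_totals AS (
    SELECT customer_id, SUM(costs) AS total_costs
    FROM products
    WHERE customer_id IN (SELECT customer_id FROM top_customers)
    GROUP BY customer_id
)
SELECT o.customer_id,
       o.total_revenues / p.total_costs::FLOAT AS roi
FROM order_totals o
JOIN product_totals p ON o.customer_id = p.customer_id
ORDER BY roi DESC;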
-
MySQL/Laravel: how to multiply by an array of values
I have a products table with the columns id, price, name, and I am receiving an array of product ids and quantities from the front end. What I want is to calculate the sum of price * quantity. I know that I can use a PHP foreach, but I am looking for a database way.
// Laravel / PHP way
$sum = 0;
foreach ($request->items as $item) {
    $product = Product::find($item['product_id']);
    $sum += $product->price * $item['quantity'];
}
What I want is to pass the array to MySQL and have MySQL handle the calculation.
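For reference, one database-side approach is to have the application render the request array into a derived table of (product_id, quantity) pairs and let MySQL do the multiplication and the sum. A sketch with hard-coded values standing in for the array from the front end (in Laravel these would normally be bound parameters built by the query builder):

-- The (product_id, quantity) rows below stand in for $request->items.
SELECT SUM(p.price * q.quantity) AS total
FROM products p
JOIN (
    SELECT 1 AS product_id, 2 AS quantity
    UNION ALL SELECT 5, 3
    UNION ALL SELECT 9, 1
) AS q ON q.product_id = p.id;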
-
JSONDecodeError when using a for loop in Python
I've been trying to run some API queries to fill in missing data in my DataFrame. I'm using the grequests library to send multiple requests and build a list of response objects, and then a for loop to load each response as JSON and retrieve the missing data. What I noticed is that loading the data with .json() directly by indexing the list, e.g. s[0].json(), works fine, but when I iterate over the list and then load the response as JSON, this error comes up: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here's my code :
import requests
import json
import grequests

ls = []
for i in null_data['name']:
    url = 'https://pokeapi.co/api/v2/pokemon/' + i.lower()
    ls.append(url)

rs = (grequests.get(u) for u in ls)
s = grequests.map(rs)

# This line works
print(s[0].json()['weight'] / 10)

for x in s:
    # This one fails
    js = x.json()
    peso = js['weight'] / 10
    null_data.loc[null_data['name'] == i.capitalize(), 'weight_kg'] = peso
<ipython-input-21-9f404bc56f66> in <module>
     13
     14 for x in s:
---> 15     js = x.json()
     16     peso = js['weight']/10
     17     null_data.loc[null_data['name'] == i.capitalize(), 'weight_kg'] = peso

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
-
How can I read nested properties and push them into an array using JavaScript?
I want to read the JSON properties. If the type is array, object, or jsonp, then I have to read the nested properties and push them into one array, which should also be nested, like:
{
  name: "test",
  type: "array",
  [name]: {
    name: "test1",
    type: "..."
  }
}
JSON object to read:
{ "properties": { "value": { "type": "array", "items": { "required": [ "@odata.etag", "id", "createdDateTime", "lastModifiedDateTime", "changeKey", "originalStartTimeZone", "originalEndTimeZone", "iCalUId", "reminderMinutesBeforeStart", "isReminderOn", "hasAttachments", "subject", "bodyPreview", "importance", "sensitivity", "isAllDay", "isCancelled", "isOrganizer", "responseRequested", "showAs", "type", "webLink", "isOnlineMeeting", "onlineMeetingProvider", "allowNewTimeProposals", "isDraft", "hideAttendees" ], "properties": { "id": { "type": "string", "minLength": 1 }, "end": { "type": "object", "required": [ "dateTime", "timeZone" ], "properties": { "dateTime": { "type": "string", "minLength": 1 }, "timeZone": { "type": "string", "minLength": 1 } } }, "body": { "type": "object", "required": [ "contentType", "content" ], "properties": { "content": { "type": "string", "minLength": 1 }, "contentType": { "type": "string", "minLength": 1 } } }, "type": { "type": "string", "minLength": 1 }, "start": { "type": "object", "required": [ "dateTime", "timeZone" ], "properties": { "dateTime": { "type": "string", "minLength": 1 }, "timeZone": { "type": "string", "minLength": 1 } } }, "showAs": { "type": "string", "minLength": 1 }, "iCalUId": { "type": "string", "minLength": 1 }, "isDraft": { "type": "boolean" }, "subject": { "type": "string", "minLength": 1 }, "webLink": { "type": "string", "minLength": 1 }, "isAllDay": { "type": "boolean" }, "location": { "type": "object", "required": [ "displayName", "locationType", "uniqueIdType", "address", "coordinates" ], "properties": { "address": { "type": "object", "required": [], "properties": {} }, "coordinates": { "type": "object", "required": [], "properties": {} }, "displayName": { "type": "string" }, "locationType": { "type": "string", "minLength": 1 }, "uniqueIdType": { "type": "string", "minLength": 1 } } }, "attendees": { "type": "array", "items": { "required": [ "type" ], "properties": { "type": { "type": "string", "minLength": 1 }, "status": { "type": "object", "required": [ "response", "time" ], "properties": { "time": { "type": "string", "minLength": 1 }, "response": { "type": "string", "minLength": 1 } } }, "emailAddress": { "type": "object", "required": [ "name", "address" ], "properties": { "name": { "type": "string", "minLength": 1 }, "address": { "type": "string", "minLength": 1 } } } } }, "minItems": 1, "uniqueItems": true }, "changeKey": { "type": "string", "minLength": 1 }, "locations": { "type": "array", "items": { "required": [], "properties": {} } }, "organizer": { "type": "object", "required": [ "emailAddress" ], "properties": { "emailAddress": { "type": "object", "required": [ "name", "address" ], "properties": { "name": { "type": "string", "minLength": 1 }, "address": { "type": "string", "minLength": 1 } } } } }, "categories": { "type": "array", "items": { "required": [], "properties": {} } }, "importance": { "type": "string", "minLength": 1 }, "recurrence": {}, "@odata.etag": { "type": "string", "minLength": 1 }, "bodyPreview": { "type": "string", "minLength": 1 }, "isCancelled": { "type": "boolean" }, "isOrganizer": { "type": "boolean" }, "sensitivity": { "type": "string", "minLength": 1 }, "isReminderOn": { "type": "boolean" }, "hideAttendees": { "type": "boolean" }, "onlineMeeting": {}, "transactionId": {}, "hasAttachments": { "type": "boolean" }, "responseStatus": { "type": "object", "required": [ "response", "time" ], "properties": { "time": { "type": "string", "minLength": 1 }, "response": { "type": "string", "minLength": 1 } 
} }, "seriesMasterId": {}, "createdDateTime": { "type": "string", "minLength": 1 }, "isOnlineMeeting": { "type": "boolean" }, "onlineMeetingUrl": {}, "responseRequested": { "type": "boolean" }, "originalEndTimeZone": { "type": "string", "minLength": 1 }, "lastModifiedDateTime": { "type": "string", "minLength": 1 }, "allowNewTimeProposals": { "type": "boolean" }, "onlineMeetingProvider": { "type": "string", "minLength": 1 }, "originalStartTimeZone": { "type": "string", "minLength": 1 }, "reminderMinutesBeforeStart": { "type": "number" } } }, "minItems": 1, "uniqueItems": true }, "@odata.context": { "type": "string", "minLength": 1 }, "@odata.nextLink": { "type": "string", "minLength": 1 } } }
-
Using Jackson to deserialize with Lombok builder
The Question
Lombok's @Builder annotation creates a builder class for a class. To support deserialization of JSON items (using Jackson's ObjectMapper), I've added the following annotations:
@Builder
@JsonDeserialize(builder = Item.ItemBuilder.class)
@JsonPOJOBuilder(withPrefix = "")
public class Item {
    @Getter
    String partitionvalue;
}
This is based on the @Jacksonized documentation. I use the deserializer on a JSON file stored in an AWS S3 bucket whose content is simply {"partitionvalue": "test"}; my code is:
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withCredentials(new DefaultAWSCredentialsProviderChain())
        .withRegion(region)
        .build();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(bucket, key));
Item item = objectMapper.readValue(s3Object.getObjectContent(), Item.class);
However, when running on that JSON file, Jackson fails with the message:
Unrecognized field "partitionvalue" (class com.example.Test$TestBuilder), not marked as ignorable (0 known properties: ]) at [Source: com.amazonaws.services.s3.model.S3ObjectInputStream@2ca47471; line: 1, column: 21] (through reference chain: com.example.TestBuilder["partitionvalue"])
Extra Details
Using the @Jacksonized annotation directly didn't work either, and since it is Lombok-experimental I used the annotations I needed together with @Builder directly. I verified that Lombok does what I expect of a builder class by using the "delombok" option in the Lombok IntelliJ plugin:
public class Item {
    String partitionvalue;

    Item(String partitionvalue) {
        this.partitionvalue = partitionvalue;
    }

    public static ItemBuilder builder() {
        return new ItemBuilder();
    }

    public String getPartitionvalue() {
        return this.partitionvalue;
    }

    public static class ItemBuilder {
        private String partitionvalue;

        ItemBuilder() {
        }

        public Item.ItemBuilder partitionvalue(String partitionvalue) {
            this.partitionvalue = partitionvalue;
            return this;
        }

        public Item build() {
            return new Item(partitionvalue);
        }

        public String toString() {
            return "Item.ItemBuilder(partitionvalue=" + this.partitionvalue + ")";
        }
    }
}
- Without the @Builder annotation (and with @NoArgsConstructor + @AllArgsConstructor + @Setter added instead) it worked fine, so the problem isn't with the file from the S3 bucket or the way it is parsed.
-
java.lang.ClassCastException error (please help me with this)
Exception in thread "main" java.lang.ClassCastException: class jdk.internal.loader.ClassLoaders$AppClassLoader cannot be cast to class java.net.URLClassLoader (jdk.internal.loader.ClassLoaders$AppClassLoader and java.net.URLClassLoader are in module java.base of loader 'bootstrap')
    at org.apache.hadoop.hive.ql.session.SessionState.<init>(SessionState.java:413)
    at org.apache.hadoop.hive.ql.session.SessionState.<init>(SessionState.java:389)
    at org.apache.hadoop.hive.cli.CliSessionState.<init>(CliSessionState.java:60)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:705)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:564)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
-
MetaException(message:MetaException(message:java.io.IOException: java.lang.reflect.InvocationTargetException
I am facing this issue when creating an external table from Hive to HBase. I am using Hadoop 3.2.2, Hive 2.3.8, and HBase 2.3.4 with JDK 11. I start Hadoop and HBase, and all services (HMaster, RegionServer, ZooKeeper, etc.) are running fine, but I get this error:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:MetaException(message:java.io.IOException: java.lang.reflect.InvocationTargetException
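For context, the Hive-to-HBase mapping this error refers to is usually created with the HBaseStorageHandler; a rough sketch of such a DDL, with made-up table, column, and column-family names:

-- Hypothetical mapping: the Hive columns rowkey/value map to the HBase row key and cf:value.
CREATE EXTERNAL TABLE hbase_events (
    rowkey STRING,
    value  STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:value')
TBLPROPERTIES ('hbase.table.name' = 'events');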
-
Hive, Impala - get the email addresses that have only 2 characters before @
I have email addresses like
ab@gmail.com, abc@gmail.com, xyz@hotmail.com, cd@gamil.com, ...
etc. I want a Hive SELECT that keeps only the addresses whose user name has exactly 2 characters before the '@'. Expected result:
ab@gmail.com, cd@gamil.com
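A sketch of one way to express that filter in Hive or Impala, assuming the addresses live in a table named emails with a column email (both names are made up here):

-- Keep only addresses whose user name is exactly two characters long,
-- i.e. the '@' sits in the third position (instr is 1-based).
SELECT email
FROM emails
WHERE instr(email, '@') = 3;
-- Equivalently, a regex form: WHERE email RLIKE '^[^@]{2}@'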
-
Implementing a complex custom struct serializer
I'm experimenting with Rust and am trying to write a wrapper for Elasticsearch queries. There's a query which I have implemented and it works correctly; however, I truly dislike the way I did it with the json! macro.
fn main() {
    let actual = Query {
        field: "field_name".into(),
        values: vec![1, 2, 3],
        boost: Some(2),
        name: Some("query_name".into()),
    };
    let expected = serde_json::json!({
        "terms": {
            "field_name": [1, 2, 3],
            "boost": 2,
            "_name": "query_name"
        }
    });
    let actual_str = serde_json::to_string(&actual).unwrap();
    let expected_str = serde_json::to_string(&expected).unwrap();
    assert_eq!(actual_str, expected_str);
}

#[derive(Debug)]
struct Query {
    field: String,
    values: Vec<i32>,
    boost: Option<i32>,
    name: Option<String>,
}

impl serde::Serialize for Query {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: serde::Serializer,
    {
        let field = self.field.as_str();
        let values = self.values.as_slice();
        let value = match (&self.boost, &self.name) {
            (None, None) => serde_json::json!({ field: values }),
            (None, Some(name)) => serde_json::json!({ field: values, "_name": name }),
            (Some(boost), None) => serde_json::json!({ field: values, "boost": boost }),
            (Some(boost), Some(name)) => {
                serde_json::json!({ field: values, "boost": boost, "_name": name })
            }
        };
        serde_json::json!({ "terms": value }).serialize(serializer)
    }
}
I would like to know how I could implement such a serializer using serde's built-in traits, such as SerializeStruct, SerializeMap, etc. Basically I'd like to avoid using the json! macro or creating intermediate data structures.
-
Using `DeserializedOwned` trait in a oneshot channel: object safety error
The following code:
trait ClientResponse: DeserializeOwned + Send + Sized {}

struct ClientMsg {
    ...
    resp: oneshot::Sender<Box<dyn ClientResponse>>
}

async fn client_thread(
    rx: mpsc::Receiver<ClientMsg>,
    client: reqwest::Client,
    base_url: Url,
) -> Result<(), Box<dyn Error>> {
    while let Some(msg) = rx.recv().await {
        ...
        let response = client.get(url).send().await?.json().await?;
        msg.resp.send(response);
    }
}
Fails with error:
error[E0038]: the trait `ClientResponse` cannot be made into an object
  --> src/main.rs:16:11
   |
16 |     resp: oneshot::Sender<Box<dyn ClientResponse>>
   |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ `ClientResponse` cannot be made into an object
   |
note: for a trait to be "object safe" it needs to allow building a vtable to allow the call to be resolvable dynamically; for more information visit <https://doc.rust-lang.org/reference/items/traits.html#object-safety>
  --> src/main.rs:12:23
   |
12 | trait ClientResponse: DeserializeOwned + Send + Sized {}
   |       --------------  ^^^^^^^^^^^^^^^^          ^^^^^ ...because it requires `Self: Sized`
   |       |               |
   |       |               ...because it requires `Self: Sized`
   |       this trait cannot be made into an object...
As you can see, I tried adding the Sized trait as a supertrait after reading the compiler error, but it still gives me the same error. I'm not sure how else to approach this problem, since I want a client thread that can deserialize responses into types decided by the senders.