CTAS on External Hive Metastore vs COPY Command: Redshift Performance
I am working on a migration project where I want to migrate jobs from Hive/Presto to Redshift, with performance improvement and data consistency as the top priorities. I ran a POC comparing a COPY command from S3 against CTAS on external tables for roughly 16 tables and found very little difference between the two, although CTAS on external tables showed better performance than COPY on average in the graphs. I want to be fully sure about this, because in production we receive billions of records each day and reporting queries need to return results for complex queries within milliseconds, while the test so far was run on at most around 200k records. Can we conclude from this data-load POC that CTAS on an external Hive metastore would also perform well at billions of records compared to the COPY command?
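For reference, these are the two load patterns being compared; a minimal sketch with placeholder table, schema, bucket, and IAM role names:

    -- Pattern 1: COPY from S3 into a native Redshift table
    COPY analytics.events
    FROM 's3://my-bucket/events/2021/10/28/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS PARQUET;

    -- Pattern 2: CTAS reading the same data through an external (Hive metastore / Spectrum) schema
    CREATE TABLE analytics.events_ctas AS
    SELECT *
    FROM external_hive_schema.events
    WHERE event_date = '2021-10-28';

Both statements end up writing a native Redshift table; the difference is whether S3 is read by the COPY loader or by the external-table (Spectrum) scan layer, which is part of why results at 200k rows may not extrapolate directly to billions.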
See also questions close to this topic
- Upload file from HTML when block public access is true
I am using django-s3direct for file upload (https://github.com/bradleyg/django-s3direct).
I am using an IAM role because I upload the file from the server on an ECS container.
Now I have set the blockPublicAccess setting of S3 to true. When uploading images from HTML, this error occurs:

    https://s3.ap-northeast-1.amazonaws.com/static-resource-v/images/c64d6e593de44aa5b10dcf1766582547/_origin.jpg?uploads 403 (Forbidden)
    initiate error: static-resource-v/line-assets/images/c64d6e593de44aa5b10dcf1766582547/_origin.jpg
    AWS Code: AccessDenied, Message: Access Denied, status: 403

OK, that is understandable: the browser tries to access the bucket to initiate the upload. However, is there any way to upload a file from the browser when blockPublicAccess is true?
- Linux on Lightsail instance is asking for a password and it's not working
I'm trying to restart mariaDB on Ubuntu but it's not letting me. I enter:

    systemctl restart mariadb

and get:

    ==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
    Authentication is required to restart 'mariadb.service'.
    Authenticating as: Ubuntu (ubuntu)
    Password:
    polkit-agent-helper-1: pam_authenticate failed: Authentication failure
    ==== AUTHENTICATION FAILED ===
I have the same password for all functions so I do not understand why it is not working. What can I do?
- AWS Pinpoint sendMessages() Addresses param field error
I'm having trouble replicating the format of the params object's Addresses field in a way where I can easily add to the object.
If I use this as the params, with destinationNumber[0] and destinationNumber[1] in the format of 1 + a 9-digit number (i.e. 13334535667), then it sends the message to both numbers with no problem:

    const params = {
      ApplicationId: applicationId,
      MessageRequest: {
        Addresses: {
          [destinationNumber[0]]: { ChannelType: 'SMS' },
          [destinationNumber[1]]: { ChannelType: 'SMS' }
        },
        MessageConfiguration: {
          SMSMessage: {
            Body: message,
            Keyword: registeredKeyword,
            MessageType: messageType,
            OriginationNumber: originationNumber
          }
        }
      }
    };

I'm trying to replicate this format for Addresses, but I'm getting:

    Unexpected key '13334535667' found in params.MessageRequest.Addresses['0']

The format my console output shows for Addresses is:

    [ { '12345678910': { ChannelType: 'SMS' } }, { '12345678911': { ChannelType: 'SMS' } } ]

I'm using a map to call this function:

    function createPhoneMessagingObject(phoneNumber: string) {
      return {
        [phoneNumber]: { ChannelType: 'SMS' }
      };
    }

I tried wrapping the key in an array like in the phone object, but per the output the brackets go away, so maybe there's an easier/more correct way of doing this. I appreciate any help!
- Assign ELB Account to S3 Bucket Policy
I used the AWS console, from the load balancer's edit-attributes screen, to create a bucket to use for access logging. I'm using that policy to form CDK code in TypeScript that stands up new S3 buckets for access logging in higher-level environments where I cannot use the console. This is the policy I need to somehow produce in TypeScript CDK code:

    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": "arn:--ELB-arnstuff--:root"
        },
        "Action": "s3:PutObject",
        "Resource": "arn:--S3-Bucket-arnstuff--/AWSLogs/123456789/*"
      }
    ]

I've managed to get the CDK code figured out to this point:

    bucket.addToResourcePolicy(
      new cdk.aws_iam.PolicyStatement({
        effect: awsIam.Effect.ALLOW,
        principals: // **This is the part I haven't figured out**
        actions: ['s3:PutObject'],
        resources: [`${bucket.bucketArn}/*`]
      })
    );
At this point I don't care if it's hard coded in the CDK, I just need something to help keep the ball rolling forward. Any help is appreciated, thanks
- How can I get file metadata like the date created attribute from an S3 file?
I have a lot of LogicPro files (.logicx) stored in an S3 bucket, and I want to extract the creation date from all of these files. This should not be the creation date of the object on s3, but the date for when it was created on my MacBook (the "Created" attribute in Finder).
I've tried to retrieve the metadata from the object using the HEAD action:

    aws s3api head-object --bucket <my-bucket> --key <my-object-key>
The output did not contain any information about the creation date of the actual file.
{ "AcceptRanges":"bytes", "LastModified":"2021-10-28T13:22:33+00:00", "ContentLength":713509, "ETag":"\"078c18ff0ab5322ada843a18bdd3914e\"", "VersionId":"9tseZuMRenKol1afntNM8mkRbeXo9n2W", "ContentType":"image/jpeg", "ServerSideEncryption":"AES256", "Metadata":{}, "StorageClass":"STANDARD_IA" }
Is it possible to extract the file creation metadata attribute from an S3 object, without having to download the whole object?
- Updating RDS snapshot export into S3
We have some data in our MySQL RDS which slows down our application but is no longer needed. So we want to remove the old records but keep them somewhere so our data engineers can run analytic queries.
I've exported my MySQL RDS snapshot into S3 and connected it to Athena. It works great. But if I remove the old records from my DB and we need to repeat the process in 6 months, our analytics team will need to query two DBs to get the data. Is there a way to make it simpler for them and update the S3 snapshot? Or maybe we should use something different, such as Hive?
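One option that keeps things simple for the analysts is to leave each snapshot export where it lands and hide the set of exports behind a single Athena view. A sketch, assuming hypothetical database and table names (archive, orders_export_2021_10, orders_export_2022_04) for two successive exports:

    -- A single name for analysts to query, regardless of how many exports exist
    CREATE OR REPLACE VIEW archive.orders_all AS
    SELECT * FROM archive.orders_export_2021_10
    UNION ALL
    SELECT * FROM archive.orders_export_2022_04;

The view would need to be re-created with one more UNION ALL branch after each new export, but the analysts keep querying a single table-like name.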
- Hive/Impala group by query for total success and failed records
I am trying to use a group by clause on an Impala/Hive table but it's not working.
I have a job details table which has a job name and a status column.

    Table jobs_details:
    ---------------------
    Job name   status
    ---------------------
    A          failed
    B          Failed
    A          success
    A          failed
    ---------------------

I want output of the type below:

    ---------------------------------------
    Job name   failed_count   success_count
    ---------------------------------------
    A          2              1
    B          1              0

I tried to use the group by clause on job name, but it shows me the total count (failed + success).
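Conditional aggregation is the usual way to split a single GROUP BY count into per-status columns. A sketch, assuming the table is jobs_details, the columns are actually named job_name and status (adjust to the real names), and that status values differ only in letter case:

    SELECT
      job_name,
      SUM(CASE WHEN lower(status) = 'failed'  THEN 1 ELSE 0 END) AS failed_count,
      SUM(CASE WHEN lower(status) = 'success' THEN 1 ELSE 0 END) AS success_count
    FROM jobs_details
    GROUP BY job_name;

Both Hive and Impala accept this form; it keeps one row per job name while counting each status separately.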
- impala/hive show file format
How can I have Impala or Hive return the file format of the underlying files on HDFS for a table?
I tried:

    SHOW FILES database.table_name

This lists the files, but the problem is that some people stored parquet files as .parq and others as .parquet. Is there any way to return the file format, such that one could use it in a new create statement?
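The table-level format lives in the metastore rather than in the file extensions, so it can usually be read back with standard metadata statements; a sketch, assuming the table really is database.table_name:

    -- The "Storage Information" section lists the SerDe, InputFormat and OutputFormat
    DESCRIBE FORMATTED database.table_name;

    -- Full DDL, including the STORED AS clause, which can be reused in a new CREATE statement
    SHOW CREATE TABLE database.table_name;

Both statements work in Hive and Impala; note that they describe the format declared for the table (or partition), not each individual file.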
- SQL query for view contribution percent
I have to calculate the % contribution for each category.

    select portfolio, Portfolio_views, Portfolio_views/total_views*100 as perc_contribution
    from (
      select category, sum(views) as Portfolio_views,
             select sum(portfolio_views) from gold.user_daily_osv as total_views
      from gold.user_daily_osv
      group by category
    )

but this throws an error: line 2:40: mismatched input 'select'. Expecting: '*',
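The parse error comes from the inner scalar subquery sitting in the middle of a select list, which the engine does not accept there. A corrected sketch, assuming the column names from the query above and using a window function to get the grand total:

    select category as portfolio,
           portfolio_views,
           100.0 * portfolio_views / sum(portfolio_views) over () as perc_contribution
    from (
      select category, sum(views) as portfolio_views
      from gold.user_daily_osv
      group by category
    ) t;

The sum(...) over () computes the total views across all categories without a second scan, and multiplying by 100.0 first keeps the division from being treated as integer division.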
- AWS Glue 3.0 PySpark: different behavior when installing dependencies using wheels vs installing the same dependencies with Glue itself
I'm having a problem launching a PySpark job that connects to Redshift via the awswrangler lib. Everything works fine when using the --additional-python-modules: awswrangler==2.10.0 parameter (which I suppose makes Glue run pip install awswrangler==2.10.0 under the hood). But this approach is restricted for us because we're using the company's artifactory as the dependency repo.
However, if I provide the awswrangler wheel (and its connected dependencies as wheels) using Glue's 'Python libraries path', I get a Redshift connection error (NotADirectoryError, presumably caused by SSL settings). The question is: why is the behavior different? I obtained the list of wheels by running 'pip freeze' after installing awswrangler in a clean virtual env.
I would appreciate any clues/ideas.
Update: the second question is: can a custom artifacts repo be configured for Glue to pull dependencies from?
- Unload Redshift data to S3 in parquet format
I'm trying to unload Redshift data to S3, but it's unloading in CSV format. How can I unload a Redshift table to an S3 bucket in parquet format using Java?
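The output format is decided by the UNLOAD statement itself, so from Java it is just a matter of sending the right SQL over the JDBC connection. A minimal sketch, where the table, S3 prefix, and IAM role ARN are placeholders:

    UNLOAD ('SELECT * FROM my_schema.my_table')
    TO 's3://my-bucket/my-prefix/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    FORMAT AS PARQUET;

Without the FORMAT AS PARQUET clause, UNLOAD defaults to delimited text output, which is why the files currently look like CSV.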
- Version problem while reading parquet file data from Redshift Spectrum
I created an external table on top of a partitioned parquet file in Redshift Spectrum. The table has been created and I have added the partitions as well, but when I try to access/read the data I get the error below.

    error: Spectrum Scan Error
    code: 15007
    context: File has an invalid version number.

My main intention is to read partitioned data from Redshift Spectrum.
- Does AWS Redshift Spectrum support JSON?
I saw the text below in a blog, and it did not mention JSON. Does AWS Redshift Spectrum support processing of JSON? In normal Redshift, we need to create the table structure before processing JSON.
Amazon Redshift Spectrum supports structured and semi-structured data formats which include Parquet, Textfile, Sequencefile and Rcfile. Amazon recommends using a columnar format because it will allow you to choose only the columns you need to transfer data from S3.
Thanks
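For what it's worth, Spectrum can read JSON files through a JSON SerDe declared on the external table; a sketch, assuming an external schema named spectrum_schema already exists and that the column names and S3 prefix are placeholders:

    CREATE EXTERNAL TABLE spectrum_schema.events_json (
      event_id   varchar(64),
      event_type varchar(32),
      created_at timestamp
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/events-json/';

As with the other Spectrum formats, the columns still have to be declared up front; the SerDe only maps matching JSON keys onto them at scan time.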
- AWS Redshift with external schema error after migration to another AWS account
I have an AWS Redshift cluster restored from a snapshot taken in another AWS account. It has several external schemas. The bad thing is that when you create an external schema you need to specify an IAM role that lets Redshift reach the external data store, such as the Athena data catalog and S3.
The restored schemas have an IAM role from the original AWS account where they were initially created. Do you know if it is safe to somehow recreate those external tables? Are we sure that Redshift doesn't store any metadata on external schemas? Or is it possible to change the IAM role ARN in the external schema definition?
Thanks.
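A sketch of the recreate approach, assuming the external schemas point at an Athena/Glue data catalog (where the external table definitions actually live, so dropping the schema on the Redshift side would not delete them) and that the schema, database, region, and role ARN below are placeholders:

    DROP SCHEMA IF EXISTS my_external_schema;

    CREATE EXTERNAL SCHEMA my_external_schema
    FROM DATA CATALOG
    DATABASE 'my_external_db'
    REGION 'us-east-1'
    IAM_ROLE 'arn:aws:iam::111122223333:role/my-new-spectrum-role';

Since the IAM role is attached at the external schema level, dropping and recreating the schema with a role from the new account is a common way to repoint it.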