Getting Spectrum Scan Error code 15007 on select query on redshift external table
I have created an external table in Redshift Spectrum. Upon running SELECT * FROM table_name, I get the following error:
SQL Error [XX000]: ERROR: Spectrum Scan Error
Detail:
-----------------------------------------------
error:   Spectrum Scan Error
code:    15007
context: Forbidden: HTTP response error code: 403 Message: AccessDenied Access Denied
Please let me know what the issue could be. I am able to run aws s3 ls and aws s3 cp on the same S3 location.
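In case it's relevant: a Spectrum scan authenticates to S3 with the IAM role attached to the external schema (or to the cluster), not with the credentials my local aws CLI uses, so the two can behave differently. This is a sketch of how I check which role the schema was created with; my_external_schema is a placeholder for my actual schema name:

-- Lists the options (including IAM_ROLE) each external schema was created with.
SELECT schemaname, esoptions
FROM svv_external_schemas
WHERE schemaname = 'my_external_schema';  -- placeholder name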
See also questions close to this topic
Best way for a Lambda to start processing messages in a SQS Queue at a specific time of day
I have an SQS queue that fills up with messages throughout the day, and I want to start processing all the messages at a specific time. The scenario would be:
- Between 9AM and 5PM the queue would receive messages
- At 6PM the messages should be processed by a lambda
I was thinking of:
- Enabler: Lambda A, which will be executed by an EventBridge rule at 6PM. This lambda would create an SQS trigger for Lambda C.
- Disabler: Lambda B, which will be executed by an EventBridge rule at 8PM. This lambda would remove the SQS trigger of Lambda C.
- Executor: Lambda C, which processes the messages in the queue.
Is this the best way to do this?
Returning a custom status code from an AWS proxy Lambda function based on a certain condition, like 204 or 404, but I always get a 200 code as the response
We are using the below setup for our AWS Lambda function:
1. Lambda type - Proxy lambda
2. Handler - org.springframework.cloud.function.adapter.aws.FunctionInvoker
3. Using Spring Cloud Function for implementing the core logic
4. Tried returning the below objects as the return type, explicitly setting a custom HTTP status code: APIGatewayProxyResponseEvent, Message<...>

I always get 200 irrespective of whatever I set in the response object.
Does AWS Lambda support multi-threading?
I am writing an AWS Lambda function that reads the past 1 minute of data from a DB, converts it to JSON, and then pushes it to a Kafka topic. The lambda will run every minute. So consider this scenario: at t1, a lambda process P1 is invoked, and it reads the data from t1-1 to t1. At t2, if P1 has not finished, will a new lambda process be invoked, or do we wait for P1 to finish before invoking another process? I understand that Lambda supports up to 1000 concurrent executions in a region, but in this situation the lambda function will already be running in a process.
Duplicate rows instead of overwriting row data when using the COPY command to load data from Amazon S3 to Redshift
COPY Agent1
FROM 's3://my-bucket/Reports/Historical Metrics Report (1).csv'
IAM_ROLE 'arn:aws:iam::my-role:role/RedshiftRoleForS3'
CSV NULL AS '\000' IGNOREHEADER 1;
I am using the above to pull data from S3 into a Redshift table. It works fine, except for one problem: the very first time the data is copied, it is inserted into the table, but when the data in the S3 file is updated and we run the same query, it appends entirely new rows instead of overwriting the already-created rows.
How do I stop the duplication? I just want the existing rows to be overwritten and replaced with the new data when I run the COPY command after the S3 file is updated.
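One pattern I have seen suggested, since COPY can only append, is to COPY into a staging table and then delete-and-insert in one transaction. A minimal sketch; the staging table name agent1_staging and the key column id are assumptions about my schema:

BEGIN;

-- Stage the fresh file next to the target table.
CREATE TEMP TABLE agent1_staging (LIKE Agent1);

COPY agent1_staging
FROM 's3://my-bucket/Reports/Historical Metrics Report (1).csv'
IAM_ROLE 'arn:aws:iam::my-role:role/RedshiftRoleForS3'
CSV NULL AS '\000' IGNOREHEADER 1;

-- Delete the rows being replaced; assumes "id" uniquely identifies a row.
DELETE FROM Agent1
USING agent1_staging
WHERE Agent1.id = agent1_staging.id;

-- Insert the fresh versions.
INSERT INTO Agent1
SELECT * FROM agent1_staging;

END;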
SQL to get the levels for a manager and his supervisors
I have an employee table with data in the below format:
emplid   supervisor   employee_level
------------------------------------------
A        xyz          5
B        abc          5
xyz      def          6
abc      zzz          5
zzz      xxx          6
emplid   report_1   report_2   report_3   report_4
-----------------------------------------------------------
A        yyy        xdc        def        xyz
B        xxx        zzz        abc
This is the data that I need:
emplid   supervisor   level
-------------------------------------
A        xyz          6
B        zzz          6
Can anyone help me? I need to find each employee's manager, then find each manager's level, and only choose the manager whose level is 6; if the employee's manager is level 5, then choose that manager's boss.
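To make the intent concrete, here is a self-join sketch of the logic I am after, assuming the first table is named employee (a placeholder) with the columns shown; it climbs exactly one level when the direct manager is level 5:

SELECT e.emplid,
       -- If the direct manager is level 6, keep them; otherwise take their boss.
       CASE WHEN m.employee_level = 6 THEN m.emplid ELSE m.supervisor END AS supervisor,
       6 AS level   -- rows are filtered so the chosen manager is always level 6
FROM employee e
JOIN employee m
  ON m.emplid = e.supervisor            -- the employee's direct manager
LEFT JOIN employee m2
  ON m2.emplid = m.supervisor           -- the direct manager's boss
WHERE m.employee_level = 6
   OR (m.employee_level = 5 AND m2.employee_level = 6);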
Count distinct value and group by another value in the last 24 hours SQL Redshift
I would like to count the distinct values of B seen in the preceding 24 hours, grouped by DATE and A. Given this input:
| DATE                | A     | B        |
|---------------------|-------|----------|
| 2021-12-01 00:00:00 | John  | device 1 |
| 2021-12-01 01:00:00 | Maria | device 1 |
| 2021-12-01 01:00:00 | John  | device 2 |
| 2021-12-01 02:00:00 | John  | device 3 |
| 2021-12-01 02:00:00 | Maria | device 2 |
| 2021-12-03 05:00:00 | John  | device 4 |
| 2021-12-03 09:00:00 | John  | device 5 |
The expected output is:

| DATE                | A     | devices_last_24h |
|---------------------|-------|------------------|
| 2021-12-01 00:00:00 | John  | 1                |
| 2021-12-01 01:00:00 | Maria | 1                |
| 2021-12-01 01:00:00 | John  | 2                |
| 2021-12-01 02:00:00 | John  | 3                |
| 2021-12-01 02:00:00 | Maria | 1                |
| 2021-12-03 05:00:00 | John  | 1                |
| 2021-12-03 09:00:00 | John  | 2                |
I'm using a Redshift database. Can someone help me, please?
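Redshift does not accept COUNT(DISTINCT ...) as a window function, so here is a self-join sketch of what I understand the computation to be; the table name events is a placeholder for mine:

SELECT e."date",
       e.a,
       COUNT(DISTINCT e2.b) AS devices_last_24h
FROM events e
JOIN events e2
  ON  e2.a = e.a
  AND e2."date" >  e."date" - INTERVAL '24 hours'   -- trailing 24-hour window
  AND e2."date" <= e."date"
GROUP BY e."date", e.a
ORDER BY e."date", e.a;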
CTAS on External Hive Metastore vs COPY Command: Redshift Performance
I am working on a migration project where I want to migrate jobs from Hive/Presto to Redshift, with performance improvement and data consistency as the top priorities. I have done a POC comparing a COPY command from S3 against a CTAS on external tables for roughly 16 tables, and found very little difference between the two, though CTAS on external tables showed slightly better performance than COPY on average in the graphs. However, I want to be completely sure, because each day we receive billions of records, and reporting queries need to return results for complex queries within milliseconds; the test so far covered at most 200k records. Can we conclude from this data-load POC that CTAS on an external Hive metastore would also perform well for billions of records compared to the COPY command?
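For reference, the two load paths I am comparing look roughly like this; the bucket, role, and table names are placeholders:

-- 1) COPY straight from S3 into a local Redshift table
COPY local_table
FROM 's3://my-bucket/path/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;

-- 2) CTAS that reads the same data through the external (Hive metastore) schema
CREATE TABLE local_table_ctas AS
SELECT *
FROM external_schema.external_table;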
Error when creating external table in Redshift Spectrum with dbt: cross-database reference not supported
I want to create an external table in Redshift Spectrum from CSV files. When I try doing so with dbt, I get a strange error. But when I manually remove some double quotes from the SQL generated by dbt and run it directly, I get no such error.
First I run this in Redshift Query Editor v2, on the default database dev in my cluster:
CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
dev now has an external schema named example_schema (and the Glue catalog registers the external database example_db).
I then upload example_file.csv to the S3 bucket s3://example_bucket. The file looks like this:
col1,col2
1,a,
2,b,
3,c
Then I run dbt run-operation stage_external_sources in my local dbt project and get this output with an error:
21:03:03  Running with dbt=1.0.1
21:03:03  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.example_project.example_models
21:03:03  1 of 1 START external source example_schema.example_table
21:03:03  1 of 1 (1) drop table if exists "example_db"."example_schema"."example_table" cascade
21:03:04  Encountered an error while running operation: Database Error
  cross-database reference to database "example_db" is not supported
I try running the generated SQL in Query Editor:
DROP TABLE IF EXISTS "example_db"."example_schema"."example_table" CASCADE
and get the same error message:
ERROR: cross-database reference to database "example_db" is not supported
But when I run this SQL in Query Editor, it works:
DROP TABLE IF EXISTS "example_db.example_schema.example_table" CASCADE
Note that I just removed some quotes.
What's going on here? Is this a bug in dbt_external_tables, or just a mistake on my part?
To confirm, I can successfully create the external table by running this in Query Editor:
DROP SCHEMA IF EXISTS example_schema DROP EXTERNAL DATABASE CASCADE;

CREATE EXTERNAL SCHEMA example_schema
FROM DATA CATALOG
DATABASE 'example_db'
REGION 'us-east-1'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE example_schema.example_table (
    col1 SMALLINT,
    col2 CHAR(1)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://example_bucket'
TABLE PROPERTIES ('skip.header.line.count'='1');
dbt config files
models/example/schema.yml (modeled after this example):
version: 2
sources:
  - name: example_source
    database: dev
    schema: example_schema
    loader: S3
    tables:
      - name: example_table
        external:
          location: 's3://example_bucket'
          row_format: >
            serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
            with serdeproperties (
              'strip.outer.array'='false'
            )
        columns:
          - name: col1
            data_type: smallint
          - name: col2
            data_type: char(1)
dbt_project.yml:

name: 'example_project'
version: '1.0.0'
config-version: 2
profile: 'example_profile'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"
models:
  example_project:
    example:
      +materialized: view
packages.yml:

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0
Load Redshift Spectrum external table with CSVs with differing column order
This question got no answer, and I have a similar question, though I'll expand on it.
Suppose I have 3 CSV files in s3://test_path/. I want to create an external table and populate it with the data in these CSVs. However, not only does column order differ across the CSVs, but some columns may be missing from some of them. Is Redshift Spectrum capable of doing what I want? Two of the files look like this:
id,name,type
a1,apple,1
a2,banana,2
type,id,name
1,b1,orange
2,b2,lemon
I create the external database/schema and table by running this in Redshift Query Editor v2 on my Redshift cluster:
CREATE EXTERNAL SCHEMA test_schema
FROM DATA CATALOG
DATABASE 'test_db'
REGION 'region'
IAM_ROLE 'iam_role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE test_schema.test_table (
    "id" VARCHAR,
    "name" VARCHAR,
    "type" SMALLINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/'
TABLE PROPERTIES ('skip.header.line.count'='1');
I expect SELECT * FROM test_schema.test_table to yield:
id   name     type
a1   apple    1
a2   banana   2
b1   orange   1
b2   lemon    2
c1   kiwi     NULL
Instead I get:
id     name     type
a1     apple    1
a2     banana   2
1      b1       NULL
2      b2       NULL
kiwi   c1       NULL
It seems Redshift Spectrum cannot match columns by name across files the way pandas.concat() can with data frames that have differing column order.
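The mapping appears purely positional: OpenCSVSerde assigns each file's columns to the table definition by position and ignores header names. One workaround I am considering (a sketch only; the per-layout S3 prefixes, table names, and view name below are hypothetical) is to give each column layout its own prefix and external table, then reconcile by name in a late-binding view:

-- One external table per column layout.
CREATE EXTERNAL TABLE test_schema.layout_a (
    "id" VARCHAR, "name" VARCHAR, "type" SMALLINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/layout_a/'
TABLE PROPERTIES ('skip.header.line.count'='1');

CREATE EXTERNAL TABLE test_schema.layout_b (
    "type" SMALLINT, "id" VARCHAR, "name" VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 's3://test_path/layout_b/'
TABLE PROPERTIES ('skip.header.line.count'='1');

-- Views over external tables must be late-binding and live in a local schema.
CREATE VIEW public.test_table_all AS
SELECT "id", "name", "type" FROM test_schema.layout_a
UNION ALL
SELECT "id", "name", "type" FROM test_schema.layout_b
WITH NO SCHEMA BINDING;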