Query by multiple values from an AWS Athena bucketed table

I have a bucketed table from which I want to query by multiple values. Here is an example:

SELECT *
FROM my_bucketed_table
WHERE bucketed_column IN (value1, value2)

The result is a full scan of the table, instead of using the index.

When I used UNION to query one value at a time, it worked as expected in terms of data scanned:

SELECT *
FROM my_bucketed_table
WHERE bucketed_column = value1
UNION
SELECT *
FROM my_bucketed_table
WHERE bucketed_column = value2

but I want the list to be dynamic, so this solution is not good enough for me.

I expect a query using the IN operator, or a JOIN with another table, to scan the same amount of data as the UNION solution.

2 answers

  • answered 2019-10-15 10:57 Gordon Linoff

    This is a bit long for a comment.

    I think you are referring to partition pruning, which is a bit different from "using an index". You want the query to only read the relevant partitions.

    Partition pruning is quite tricky. The basic problem is that the query needs to know what data to read before it starts executing the query. This is usually handled by requiring explicit comparisons on the partitioning column.

    Identifying the right partitions should work correctly with =, >, >=, <, and <=. It might get more complicated with IN and NOT IN. It probably will not work when you use a join on one table and don't explicitly include the partition for both tables in the join.
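
    If explicit equality comparisons are what keep the scan small, one way to make the value list dynamic is to generate the per-value query text in client code. The following is only a sketch: the helper name build_union_query is hypothetical, the ? placeholders assume a client that supports positional query parameters, and the table and column names are interpolated directly, so they must come from trusted input. Whether the generated query scans as little data as the hand-written version would need to be checked against your table.

```python
def build_union_query(table, column, values):
    """Build one equality branch per value, joined with UNION ALL.

    Returns (sql, params). build_union_query is a hypothetical helper,
    not part of any Athena client library.
    """
    if not values:
        raise ValueError("values must not be empty")

    # Drop duplicate values (keeping first-seen order) so that
    # UNION ALL does not return the same rows twice.
    unique_values = list(dict.fromkeys(values))

    # Assumption: the client binds positional "?" parameters.
    # table and column are inserted as-is and must be trusted identifiers.
    branch = f"SELECT * FROM {table} WHERE {column} = ?"
    sql = "\nUNION ALL\n".join(branch for _ in unique_values)
    return sql, unique_values


if __name__ == "__main__":
    sql, params = build_union_query(
        "my_bucketed_table", "bucketed_column", ["value1", "value2"]
    )
    print(sql)
    print(params)
```

    The sketch uses UNION ALL rather than UNION because the branches cannot overlap once duplicate values are removed, so no extra deduplication pass is needed.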

  • answered 2019-10-15 11:02 Zaynul Abadin Tuhin

    You can try something like the query below, which may help it use the index:

    SELECT *
    FROM my_bucketed_table
    WHERE bucketed_column = 0000 or bucketed_column IN (value1, value2)

    This assumes that the value 0000 never appears in your column.