Strategy to filter and group by for a String column in AWS Redshift Database

What is a good strategy for filtering and grouping by a string column in an AWS Redshift database? The table looks like this:

Table_Id | Categories          | Value
<ID>     | AAA1; AAA1-1; AAA2  | 10
<ID>     | AAA1; AAA1-2; AAA2  | 15
<ID>     | AAA2                | 5
.....

Now I want to filter records by an individual category such as 'AAA1', or by a combination such as 'AAA1 and AAA2'. The expected output of the query would look like:

Table_Id | Categories         | Value
<ID>     | AAA1               | 25
<ID>     | AAA1-1             | 10
<ID>     | AAA1-2             | 15
<ID>     | AAA2               | 30
.....

So I need to group the results by individual category. Please note that this question does not satisfy my use case, because running a regex or split_part over such a huge number of records is not feasible: that solution takes 4+ hours to fetch the data.

Other alternatives that we have tried:

  1. Generate a hash value for each possible combination and then lookup using this hash. However, this results in an extremely large number of hash values.
  2. Assign a distinct prime number to each category and then store the product of the primes against the value. However, this results in very large numbers that cannot be stored in the database.
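
To make the size problem in option 2 concrete, here is a rough sketch of how quickly the product of primes grows. It assumes the smallest primes are assigned to the categories and that the product is stored in a signed 64-bit integer column such as BIGINT; both are assumptions for illustration, and the function names are hypothetical.

```python
# Largest value a signed 64-bit integer column (e.g. BIGINT) can hold.
BIGINT_MAX = 2**63 - 1


def first_primes(count):
    """Return the first `count` primes using simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def max_categories_in_bigint():
    """Count how many distinct primes can be multiplied before BIGINT overflows."""
    product = 1
    used = 0
    for prime in first_primes(64):
        if product * prime > BIGINT_MAX:
            break
        product *= prime
        used += 1
    return used, product


count, product = max_categories_in_bigint()
print(count, product)
```

Under these assumptions, a row tagged with the 16 smallest primes already overflows the column, and rows using larger primes overflow sooner.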

Is there any other strategy, mathematical or otherwise, that can be applied to resolve this issue?

1 answer

  • answered 2018-05-16 05:45 John Rotenstein

    You need the data in a better format for querying. There are two potential designs:

    Single table with a column for each attribute

    Table_Id | Categories          | Value | CAT-AAA1 | CAT-AAA1-1 | CAT-AAA2
    <ID>     | AAA1; AAA1-1; AAA2  | 10    | TRUE     | TRUE       | TRUE
    <ID>     | AAA1; AAA1-2; AAA2  | 15    | TRUE     | FALSE      | TRUE
    <ID>     | AAA2                | 5     | FALSE    | FALSE      | TRUE
    .....
    

    This would involve adding a column for each attribute, then running some UPDATE commands to populate the columns, such as:

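    -- Hyphenated column names must be double-quoted; the added delimiters stop 'AAA1' from matching 'AAA1-1'
    UPDATE <table> SET "CAT-AAA1" = TRUE WHERE '; ' || Categories || ';' LIKE '%; AAA1;%'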
    

    Then, it would be easy to query the table:

    SELECT SUM(Value) FROM <table> WHERE "CAT-AAA1" AND "CAT-AAA1-2";
    

    Redshift can handle up to 1600 columns per table. It is quite normal to have wide tables in a Data Warehouse.
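
    As a small, runnable illustration of this design, the sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift, so the syntax is SQLite's rather than Redshift's. The table name `facts`, the integer ids, and the helper `total_for` are all made up for the example; the sample rows are the ones from the question.

```python
import sqlite3

# Sample rows from the question; the integer ids are invented for illustration.
ROWS = [
    (1, "AAA1; AAA1-1; AAA2", 10),
    (2, "AAA1; AAA1-2; AAA2", 15),
    (3, "AAA2", 5),
]
# Trusted, hard-coded category names (they are interpolated into SQL below).
CATEGORIES = ["AAA1", "AAA1-1", "AAA1-2", "AAA2"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (table_id INTEGER, categories TEXT, value INTEGER)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", ROWS)

for cat in CATEGORIES:
    col = f'"CAT-{cat}"'  # hyphenated column names must be double-quoted
    conn.execute(f"ALTER TABLE facts ADD COLUMN {col} INTEGER DEFAULT 0")
    # Wrap the list in delimiters so 'AAA1' does not also match 'AAA1-1'.
    conn.execute(
        f"UPDATE facts SET {col} = 1 "
        "WHERE '; ' || categories || ';' LIKE ?",
        (f"%; {cat};%",),
    )


def total_for(*cats):
    """Sum `value` over rows that carry every category in `cats`."""
    where = " AND ".join(f'"CAT-{c}" = 1' for c in cats)
    return conn.execute(f"SELECT SUM(value) FROM facts WHERE {where}").fetchone()[0]


print(total_for("AAA1"), total_for("AAA1", "AAA1-2"))
```

    The string matching runs only once, when the flag columns are populated; later queries test plain boolean-style columns.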

    One-to-Many table

    This option would involve creating a new table that links each row to multiple categories:

    Table_Id | Category
    1        | AAA1
    1        | AAA1-1
    1        | AAA1-2
    2        | AAA1
    

    You could then query by joining to this lookup table to find the right rows, such as:

    SELECT SUM(Value)
    FROM <table>
    JOIN <lookup-table> USING (Table_Id)
    WHERE Category = 'AAA1';
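
    The sketch below shows one way the one-to-many design could produce the per-category totals from the question. It again uses Python's built-in sqlite3 module in place of Redshift, and the names `facts`, `fact_categories`, and `total_for_all`, along with the integer ids, are hypothetical. The delimited string is split once at load time, so no regex or split_part runs at query time.

```python
import sqlite3

# Sample rows from the question; the integer ids are invented for illustration.
ROWS = [
    (1, "AAA1; AAA1-1; AAA2", 10),
    (2, "AAA1; AAA1-2; AAA2", 15),
    (3, "AAA2", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (table_id INTEGER PRIMARY KEY, value INTEGER)")
conn.execute("CREATE TABLE fact_categories (table_id INTEGER, category TEXT)")

for table_id, categories, value in ROWS:
    conn.execute("INSERT INTO facts VALUES (?, ?)", (table_id, value))
    # Split the delimited string once, at load time, into one row per category.
    pairs = [(table_id, c.strip()) for c in categories.split(";") if c.strip()]
    conn.executemany("INSERT INTO fact_categories VALUES (?, ?)", pairs)

# Per-category totals: a plain join plus GROUP BY.
totals = conn.execute(
    """
    SELECT category, SUM(value)
    FROM facts
    JOIN fact_categories USING (table_id)
    GROUP BY category
    ORDER BY category
    """
).fetchall()


def total_for_all(cats):
    """Sum `value` over rows that carry every category in `cats`."""
    placeholders = ", ".join("?" for _ in cats)
    sql = f"""
        SELECT SUM(value) FROM facts WHERE table_id IN (
            SELECT table_id FROM fact_categories
            WHERE category IN ({placeholders})
            GROUP BY table_id
            HAVING COUNT(DISTINCT category) = ?
        )
    """
    return conn.execute(sql, (*cats, len(cats))).fetchone()[0]


print(totals)
print(total_for_all(["AAA1", "AAA2"]))
```

    With the sample data, the grouped totals match the expected output in the question (AAA1 = 25, AAA1-1 = 10, AAA1-2 = 15, AAA2 = 30), and the "AAA1 and AAA2" filter is handled by the HAVING clause.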