Presto - how to perform correlations on between all columns in one query

I have a table in the following format:

A   B   C   D 
7   7   2   12
2   2   3   4
2   2   2   4
2   2   2   3
5   5   2   7

I would like to calculate correlations between each of the columns using the build-in correlation function ( corr(y, x) → double)

I could run over all the columns and perform the corr calculation each time with: select corr(A,B) from table but I would like to reduce the number of times I access presto and run it in one query if its possible.

Would it be possible to get as a result the column names that pass a certain threshold or at least the correlation scores between all possible combinations in one query?


2 answers

  • answered 2019-12-08 18:14 GMB

    I would like to calculate correlations between each of the columns

    Correlation involves two series of data (in SQL, two columns). So I understand your question as: how to compute the correlation for each and every possible combination of columns in the table. That would look like:

        corr(a, b) corr_a_b,
        corr(a, c) corr_a_c,
        corr(a, d) corr_a_d,
        corr(b, c) corr_b_c,
        corr(b, d) corr_c_d,
        corr(c, d) corr_c_d
    from mytable

  • answered 2019-12-08 20:13 Gordon Linoff

    You can use a lateral join to unpivot the table, then a self join and aggregation:

    with v as (
          select v.*,
          from (select t.*,
                       row_number() over (order by a) as id
                from t
               ) t cross join lateral
               (values ('a', a), ('b', b), ('c', c), ('d', d)
               ) v(col, val)
    select v1.col, v2.col, corr(v1.val, v2.val)
    from v v1 join
         v v2
         on = and v1.which < v2.which
    group by v1.col, v2.col;

    The row_number() is only to generate a unique id for each row, which is then used for the self-join. You may already have a column with this information, so that might not be necessary.