SQL - Find records in continuous time series as groups in PostgreSQL

I have a timeseries data in which I'm trying to find continous streaks of records in order within a certain time interval, grouping by each series of continual records. For example, if the time interval is 5 minutes between records for each set (group) any record next record that was within 5 minutes would be in the set and any record over 5 minutes would end the set (group). A new set (group) would emerge from the next two records within 5 minutes.

**Minute seperation is not in the table, it is calculated in the query

|    |                     |                                   |          |                                                            | 
|----|---------------------|-----------------------------------|----------|------------------------------------------------------------| 
| id | timestamp           | minute separation (Calculated **) | group    | notes                                                      | 
| 1  | 2018-02-13T01:18:00 | 0                                 | Group 1  |                                                            | 
| 2  | 2018-02-13T01:22:00 | 4                                 | Group 1  |                                                            | 
| 3  | 2018-02-13T01:25:00 | 3                                 | Group 1  |                                                            | 
| 4  | 2018-02-13T01:31:00 | 6                                 | No Group | breaks group 1                                             | 
| 5  | 2018-02-13T01:38:00 | 7                                 | No Group | not within interval on either side                         | 
| 6  | 2018-02-13T01:44:00 | 6                                 | Group 2  | Start of group 2                                           | 
| 7  | 2018-02-13T01:47:00 | 3                                 | Group 2  |                                                            | 
| 8  | 2018-02-13T01:48:00 | 1                                 | Group 2  |                                                            | 
| 9  | 2018-02-13T01:49:00 | 1                                 | Group 2  |                                                            | 
| 10 | 2018-02-13T01:51:00 | 2                                 | Group 2  |                                                            | 
| 11 | 2018-02-13T01:57:00 | 6                                 | Group 3  | Breaks Group 2, included in next group as start of group 3 | 
| 12 | 2018-02-13T01:59:00 | 2                                 | Group 3  |                                                            | 
| 13 | 2018-02-13T02:01:00 | 2                                 | Group 3  |                                                            | 
| 14 | 2018-02-13T02:02:00 | 1                                 | Group 3  |                                                            | 
| 15 | 2018-02-13T02:08:00 | 6                                 | No Group | Breaks group 3                                             | 
| 16 | 2018-02-13T02:15:00 | 7                                 | No Group |                                                            | 
| 17 | 2018-02-13T02:22:00 | 7                                 | No Group |                                                            | 

I can find the difference between 2 rows in order and extract them, but I'm not sure how I would extract a series of n+1 rows.

My Working sample is here: http://sqlfiddle.com/#!17/e9fa1/7

Sample Data SQL Insert:

CREATE TABLE time_series (
  id SERIAL UNIQUE, 
  name TEXT,
  timestamp TIMESTAMPTZ
);

INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:18:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:22:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:25:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:31:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:38:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:44:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:47:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:48:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:49:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:51:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:57:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T01:59:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T02:01:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T02:02:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T02:08:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T02:15:00');
INSERT into time_series (name,timestamp) VALUES ('Test','2018-02-13T02:22:00');

Working Query:

WITH my_series AS
(
  SELECT *, ROW_NUMBER() OVER (ORDER BY timestamp) AS rn
    FROM time_series 
)
SELECT o1.id AS id1, o1.timestamp AS date1, o2.id AS id2, o2.timestamp  AS date2, ABS(EXTRACT(EPOCH FROM (o1.timestamp - o2.timestamp))) AS diff

  FROM my_series as o1 JOIN my_series  as o2
    ON o1.rn + 1 = o2.rn
  WHERE ABS(EXTRACT(EPOCH FROM (o1.timestamp - o2.timestamp))) < 300;

I need to either return each group as one set (that will work as I should be able to repeat the query noting the timestamp that the query ended to start the next one, although that does not seem very efficent), or return all the groups labeled as group, like this:

Desired result:

|    |                     |                                   |          |                                                            | 
|----|---------------------|-----------------------------------|----------|------------------------------------------------------------| 
| id | timestamp           | minute separation (Calculated **) | group    | notes                                                      | 
| 1  | 2018-02-13T01:18:00 | 0                                 | 1        |                                                            | 
| 2  | 2018-02-13T01:22:00 | 4                                 | 1        |                                                            | 
| 3  | 2018-02-13T01:25:00 | 3                                 | 1        |                                                            | 
| 6  | 2018-02-13T01:44:00 | 6                                 | 2        |                                                            | 
| 7  | 2018-02-13T01:47:00 | 3                                 | 2        |                                                            | 
| 8  | 2018-02-13T01:48:00 | 1                                 | 2        |                                                            | 
| 9  | 2018-02-13T01:49:00 | 1                                 | 2        |                                                            | 
| 10 | 2018-02-13T01:51:00 | 2                                 | 2        |                                                            | 
| 11 | 2018-02-13T01:57:00 | 6                                 | 3        |                                                            | 
| 12 | 2018-02-13T01:59:00 | 2                                 | 3        |                                                            | 
| 13 | 2018-02-13T02:01:00 | 2                                 | 3        |                                                            | 
| 14 | 2018-02-13T02:02:00 | 1                                 | 3        |                                                            | 

1 answer

  • answered 2018-02-13 03:41 Gordon Linoff

    You want to use lag() and lead(). I would enumerate all the groups, even the ones with only one row:

    select s.*,
           sum( (timestamp >= prev_timestamp + interval '5 minute' or prev_timestamp is null)::int ) over (order by timestamp) as grp
    from (select s.*,
                 lag(timestamp) over (order by timestamp) as prev_timestamp,
                 lead(timestamp) over (order by timestamp) as next_timestamp
          from my_series s
         ) s;
    

    The logic is a bit more complicated for your actual question. I think this does the trick:

    select s.*,
           (case when timestamp > prev_timestamp + interval '5 minute' and
                      timestamp < next_timestamp - interval '5 minute' and
                 then NULL
                 else sum( (timestamp >= prev_timestamp + interval '5 minute' and next_timestamp < timestamp + interval '5 minute')::int ) over (order by timestamp) 
            end) as grp
    from (select s.*,
                 lag(timestamp) over (order by timestamp) as prev_timestamp,
                 lead(timestamp) over (order by timestamp) as next_timestamp
          from my_series s
         ) s;