SQL query to 'collapse' rows which are close together

We have an application that captures searches that are made by users. Because of the nature of our search (we serve results after a few characters) and the speed in which people type, we are getting a log entry for every search/letter. This looks like this:

Search log

(it looks like a up-side-down xmas tree...)

We need this data internally for counting searches (aka API calls) but for reporting to our customers it is not very nice to report on 'half' queries.

I'm looking for a way to collapse these rows into one that has the longest/last term(s) for a search.

There is a catch: a user (cid) can make more than 1 searches in a session but we can separate that if we look at timestamps I guess..

It has to be something like:

1) Group rows which are no more than 2 seconds apart

2) Order by length (or last) query to get the final query

3) Group by terms to get a count of how often a term is used to report back

Data as text:

2019-12-09  2019-12-09 12:58:45 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cleaner
2019-12-09  2019-12-09 12:58:45 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cleane
2019-12-09  2019-12-09 12:58:44 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum clean
2019-12-09  2019-12-09 12:58:43 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum clea
2019-12-09  2019-12-09 12:58:43 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cle
2019-12-09  2019-12-09 12:58:42 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum cl
2019-12-09  2019-12-09 12:58:41 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum c
2019-12-09  2019-12-09 12:58:40 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuum
2019-12-09  2019-12-09 12:58:39 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacuu
2019-12-09  2019-12-09 12:58:38 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vacu
2019-12-09  2019-12-09 12:58:37 5dea585477c94502b52c43fb    92cd6cef-3ed8-4416-ac2d-cc347780b976    search  1   search  query   vac
2019-12-09  2019-12-09 12:58:15 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widget
2019-12-09  2019-12-09 12:58:14 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widge
2019-12-09  2019-12-09 12:58:13 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue widg
2019-12-09  2019-12-09 12:58:12 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue wid
2019-12-09  2019-12-09 12:58:12 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue wi
2019-12-09  2019-12-09 12:58:11 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue w
2019-12-09  2019-12-09 12:58:10 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blue
2019-12-09  2019-12-09 12:58:09 5dea585477c94502b52c43fb    9b41fb1d-59d2-4a12-8974-b2261b2fe484    search  0   search  query   blu
2019-12-09  2019-12-09 12:57:38 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widget
2019-12-09  2019-12-09 12:57:37 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widge
2019-12-09  2019-12-09 12:57:36 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   widg
2019-12-09  2019-12-09 12:57:35 5dea585477c94502b52c43fb    f96305d9-590b-4a10-95a2-2d49a9fc63a3    search  1   search  query   wid

Expected result:

vacuum cleaner  1
blue widget     1
widget          1

1 answer

  • answered 2019-12-09 21:47 vladimir

    I suppose that the user can not only add new characters but remove ones so 'xmas tree'-rule not applicable for each final query.

    This query returns the latest search input in session (final query) that maybe not the longest one in session.

    SELECT search_input, count()
    FROM (
      SELECT 
        /* create group of pairs (input_seconds, input_text). */
        groupArray((toInt32(ts), ev)) inputs,
        /* define the end of each session. */
        arrayMap((x, index) -> index = 1 ? 1 : (inputs[index - 1].1 - x.1 > 2 /* 2 is max delay between inputs */ ? index : 0), inputs, arrayEnumerate(inputs)) session_end_points,
        /* take the latest input in each session. */
        arrayMap(x -> inputs[x].2, arrayFilter(x -> x > 0, session_end_points)) search_inputs,
        arrayJoin(search_inputs) search_input
      FROM (
        /* test data, sorted by DESCending */
        SELECT toDateTime(data.1) ts, data.2 cid, data.3 ev
        FROM (
          SELECT arrayJoin([
          ('2019-12-09 12:58:55', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'X'),    
    
          ('2019-12-09 12:58:45', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleaner'),
          ('2019-12-09 12:58:45', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleane'),
          ('2019-12-09 12:58:44', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
          ('2019-12-09 12:58:43', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clea'),
          ('2019-12-09 12:58:43', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cle'),
          ('2019-12-09 12:58:42', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cl'),
          ('2019-12-09 12:58:41', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum c'),
          ('2019-12-09 12:58:40', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum'),
          ('2019-12-09 12:58:39', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuu'),
          ('2019-12-09 12:58:38', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacu'),
          ('2019-12-09 12:58:37', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),
    
          ('2019-12-09 12:58:15', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widget'),
          ('2019-12-09 12:58:14', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widge'),
          ('2019-12-09 12:58:13', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue widg'),
          ('2019-12-09 12:58:12', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue wid'),
          ('2019-12-09 12:58:12', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue wi'),
          ('2019-12-09 12:58:11', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue w'),
          ('2019-12-09 12:58:10', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blue'),
          ('2019-12-09 12:58:09', '9b41fb1d-59d2-4a12-8974-b2261b2fe484', 'blu'),
    
          ('2019-12-09 12:57:38', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widget'),
          ('2019-12-09 12:57:37', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widge'),
          ('2019-12-09 12:57:36', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'widg'),
          ('2019-12-09 12:57:35', 'f96305d9-590b-4a10-95a2-2d49a9fc63a3', 'wid'),
    
          ('2019-12-09 12:58:34', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vaX'),  
    
          ('2019-12-09 12:58:30', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
          ('2019-12-09 12:58:28', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cleane'),
          ('2019-12-09 12:58:26', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clean'),
          ('2019-12-09 12:58:24', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum clea'),
          ('2019-12-09 12:58:22', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacuum cle'),
    
          ('2019-12-09 12:58:15', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),
          ('2019-12-09 12:58:14', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vacu'),
          ('2019-12-09 12:58:13', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vac'),
    
          ('2019-12-09 12:57:34', '92cd6cef-3ed8-4416-ac2d-cc347780b976', 'vaX')]) data)
        ORDER BY ts DESC)
      GROUP BY cid)
    GROUP BY search_input;
    
    /* Result:
    ┌─search_input───┬─count()─┐
    │ widget         │       1 │
    │ vacuum cleaner │       1 │
    │ blue widget    │       1 │
    │ vac            │       1 │
    │ vaX            │       2 │
    │ X              │       1 │
    │ vacuum clean   │       1 │
    └────────────────┴─────────┘
    */