A simple mySQL query + Statistical Analysis

I'm looking for patterns in a database with around 1 million records. I've been experimenting a bit using Keras and TensorFlow, specifically LSTM,

However, as I'm really new in the field, on it, I found better results doing some very specific queries.

Having the following table with the following data:

round   value   class       creacion
1       15.49   H       2018-01-27 14:03:54
2       7.42    H       2018-01-27 14:04:42
3       1.04    L       2018-01-27 14:39:28
4       2.71    H       2018-01-27 14:39:36
5       1.95    L       2018-01-27 14:39:59
6       4       H       2018-01-27 14:40:17
7       4.4     H       2018-01-27 14:40:45
8       1.52    L       2018-01-27 14:41:14
9       28.69   H       2018-01-27 14:41:28
10      7.44    H       2018-01-27 14:42:25
11      1.1     L       2018-01-27 14:43:02
12      1.1     L       2018-01-27 14:43:12
13      1.41    L       2018-01-27 14:43:21
14      1.04    L       2018-01-27 14:53:10
15      1.66    L       2018-01-27 14:53:19
16      8.44    H       2018-01-27 14:53:34
17      1.55    L       2018-01-27 14:54:13
18      2.39    H       2018-01-27 14:55:29
19      2.9     H       2018-01-27 14:55:50
20      1.66    L       2018-01-27 14:56:13
21      2.7     H       2018-01-27 14:56:29
22      7.53    H       2018-01-27 14:56:51
23      2.04    H       2018-01-27 14:57:28
24      1.97    L       2018-01-27 14:57:47
25      1.35    L       2018-01-27 14:58:05

As you can see, I'm classifying all values below 2, as 'L' (low) values, and bigger as H (high) values.

So the main goal here is trying to predict the next value.

I have been using this query, which sums 100 values, considering high values as 2 and low values as 1. The following query sums the last 100 results and provide one number as output, assuming that the number is lower than the median, we can predict that the chances of a high value are increased.

SELECT SUM(n)
FROM (
SELECT *, IF(value < 2, @nvalue := 1, @nvalue := 2) AS n
FROM crawler
ORDER BY round DESC
LIMIT 0, 100
) AS sub

So, the first question is about the query:

I would like to create a new column, adding the sum of the previous 100 values. Do you know how this could be done?

I can replicate the results doing the following query:

SELECT round, value, class, creacion, sum(n)
FROM (
SELECT *, if(value < 2, @nvalue := 1, @nvalue := 2) AS n
FROM crawler
ORDER BY round DESC
LIMIT 0, 100
) AS sub

However, it obviously displays the last record alone:

round   value   class   creacion                sum(n)
560894  3.24    hi      2018-06-22 22:58:59     162

When I'm actually looking for the same result, but with every single record with a limit to avoid the large loading times.

1 answer

  • answered 2018-06-22 23:03 Gordon Linoff

    The naive way to get the last hundred values is:

    select c.*,
           (select sum(c2.value)
            from (select c3.*
                  from c3
                  where c3.creation <= c.creation
                  order by c3.creation desc
                  limit 100
                 ) c2
           ) as sum_last100
    from crawler c;
    

    Because the correlation clause is two levels deep, MySQL does not accept this.

    In MySQL 8+, this is much easier:

    select c.*,
           sum(value) over (order by creation rows between 99 preceding and current row) as sum_last100
    from crawler c;
    

    At this point, I might suggest that you switch to either MySQL 8 or to some other database (such as Postgres). Getting your desired query to work efficiently on a million rows may not be worth the effort in older versions of MySQL.