pandas groupby hour and calculate stockout time

I have a time series like below:

| datetime_create         | quantity_old | quantity_new | quantity_diff | is_stockout |
| ----------------------- | ------------ | ------------ | ------------- | ----------- |
| 2018-02-15 08:12:54.289 | 16           | 15           | -1            | False       |
| 2018-02-15 08:14:10.619 | 15           | 13           | -2            | False       |
| 2018-02-15 08:49:15.962 | 13           | 9            | -4            | False       |
| 2018-02-15 08:51:04.740 | 9            | 8            | -1            | False       |
| 2018-02-15 08:56:37.086 | 8            | 7            | -1            | False       |
| 2018-02-15 09:23:22.858 | 7            | 5            | -2            | False       |
| 2018-02-15 10:16:50.324 | 5            | 4            | -1            | False       |
| 2018-02-15 10:19:25.071 | 4            | 3            | -1            | False       |
| 2018-02-15 10:33:22.788 | 3            | 2            | -1            | False       |
| 2018-02-15 10:33:34.125 | 2            | 0            | -2            | True        |
| 2018-02-15 16:45:24.747 | 0            | 1            | 1             | False       |
| 2018-02-15 16:48:29.996 | 1            | 0            | -1            | True        |
| 2018-02-17 10:42:58.325 | 0            | 42           | 42            | False       |
| 2018-02-17 10:47:07.380 | 42           | 41           | -1            | False       |
| 2018-02-17 11:42:31.008 | 41           | 40           | -1            | False       |
| 2018-02-17 11:48:31.070 | 40           | 39           | -1            | False       |
| 2018-02-17 12:39:13.681 | 39           | 38           | -1            | False       |
| 2018-02-17 12:48:00.286 | 38           | 37           | -1            | False       |
| 2018-02-17 12:56:59.203 | 37           | 36           | -1            | False       |
| 2018-02-17 13:18:12.285 | 36           | 35           | -1            | False       |
| 2018-02-17 13:29:53.465 | 35           | 34           | -1            | False       |
| 2018-02-17 14:54:55.810 | 34           | 33           | -1            | False       |
| 2018-02-17 15:53:38.816 | 33           | 32           | -1            | False       |
| 2018-02-17 16:28:08.076 | 32           | 31           | -1            | False       |
| 2018-02-17 16:45:18.965 | 31           | 30           | -1            | False       |
| 2018-02-17 16:59:11.111 | 30           | 29           | -1            | False       |
| 2018-02-17 17:18:53.646 | 29           | 27           | -2            | False       |
| 2018-02-17 17:44:43.508 | 27           | 26           | -1            | False       |
| 2018-02-17 19:34:49.701 | 26           | 25           | -1            | False       |
| 2018-02-17 20:49:00.205 | 25           | 24           | -1            | False       |
| 2018-02-18 07:14:22.207 | 24           | 22           | -2            | False       |
| 2018-02-18 08:35:41.560 | 22           | 20           | -2            | False       |
| 2018-02-18 10:22:18.825 | 20           | 19           | -1            | False       |
| 2018-02-18 10:28:33.909 | 19           | 18           | -1            | False       |
| 2018-02-18 10:37:30.427 | 18           | 17           | -1            | False       |
| 2018-02-18 10:50:55.265 | 17           | 16           | -1            | False       |
| 2018-02-18 11:17:53.359 | 16           | 15           | -1            | False       |
| 2018-02-18 11:42:29.214 | 0            | 30           | 30            | False       |
| 2018-02-18 11:58:19.113 | 15           | 14           | -1            | False       |
| 2018-02-18 11:58:56.432 | 14           | 13           | -1            | False       |
| 2018-02-18 12:06:48.438 | 13           | 12           | -1            | False       |
| 2018-02-18 12:21:43.634 | 12           | 11           | -1            | False       |
| 2018-02-18 12:44:46.288 | 11           | 9            | -2            | False       |
| 2018-02-18 13:26:01.952 | 9            | 8            | -1            | False       |
| 2018-02-18 13:26:40.940 | 8            | 9            | 1             | False       |
| 2018-02-18 13:27:34.090 | 9            | 8            | -1            | False       |
| 2018-02-18 13:27:52.443 | 8            | 9            | 1             | False       |
| 2018-02-18 13:28:58.832 | 9            | 8            | -1            | False       |
| 2018-02-18 14:56:49.105 | 8            | 7            | -1            | False       |
| 2018-02-18 16:00:32.212 | 7            | 6            | -1            | False       |
| 2018-02-18 16:28:20.175 | 6            | 5            | -1            | False       |
| 2018-02-18 16:31:48.741 | 5            | 3            | -2            | False       |
| 2018-02-18 16:40:33.922 | 3            | 2            | -1            | False       |
| 2018-02-18 16:56:17.864 | 2            | 1            | -1            | False       |
| 2018-02-18 17:15:01.065 | 1            | 2            | 1             | False       |
| 2018-02-18 17:40:43.062 | 2            | 1            | -1            | False       |
| 2018-02-18 17:55:50.520 | 1            | 0            | -1            | True        |
| 2018-02-18 18:20:21.664 | 30           | 29           | -1            | False       |
| 2018-02-18 21:38:10.645 | 29           | 28           | -1            | False       |
| 2018-02-19 06:36:04.564 | 28           | 27           | -1            | False       |
| 2018-02-19 08:49:23.080 | 27           | 26           | -1            | False       |

I want to calculate the total stockout time for every hour of each day, like:

|    date    |  0  |  1  |  2  |  3  | ... | 23  |
| ---------- | --- | --- | --- | --- | --- | --- |
| 2018-02-15 | 10  | 0   | 0   | 10  | ... | 13  |
| 2018-02-16 | 6   | 0   | 7   | 10  | ... | 20  |
| 2018-02-17 | 6   | 0   | 0   | 10  | ... | 20  |

The rules:

  1. Group by hour.
  2. Within each group I can access all rows of that hour.
  3. Calculate the time between

    • a start point: is_stockout changes from False to True
    • an end point: is_stockout changes from True to False

    Within one hour there may be many start points and end points (a small sketch of this transition detection follows this list).

  4. Change the index to the day and the columns to the 24 hours.
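
To make rule 3 concrete, the transitions could be marked roughly like below (only a sketch of the edge detection on the columns above, not yet split per hour):

    # 0/1 instead of False/True so that diff() exposes the transitions
    s = df['is_stockout'].astype(int)

    starts = s.diff().eq(1)    # False -> True : a stockout begins
    ends = s.diff().eq(-1)     # True -> False : a stockout ends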

It looks a little like new-syntax-to-window-and-resample-operations from the pandas docs.

I think I need to use

df.resample('H').apply(calculate_time_in_hour)

But this does not seem to be enough:

  1. df.resample('H') puts the hours in the index, not in columns.
  2. How do I write a proper calculate_time_in_hour? I think apply can't handle this.

    I wrote some pseudo-code:

    import pandas as pd

    def calculate_time_in_hour(item):
        # note: item here is the stock count (quantity_new), not just True or False

        global last_time
        global is_stockout
        global data

        cur_time = item.name   # the timestamp of this row

        # I need pandas to return a row for every hour, even an hour without data,
        # so that I don't have to check how many hours have elapsed.

        if pd.isna(item):
            # an empty hour: the whole hour is either a stockout or not
            if is_stockout:
                data[cur_time.hour] = 60 * 60
            else:
                data[cur_time.hour] = 0
        elif is_stockout:
            # we have been in a stockout since last_time
            data[cur_time.hour] += (cur_time - last_time).total_seconds()
            if item > 0:
                is_stockout = False   # restocked: the stockout ends here
        else:
            if item == 0:
                is_stockout = True    # stock just ran out

        last_time = cur_time

        return data.copy()
    

    How do I know that this item is the last one in the hour, so that I can return the data? This is the problem with apply. Maybe I need pandas to hand me all rows of an hour at once before doing apply (see the small experiment below).
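
At least getting at all rows of one hour at a time seems possible with a plain groupby; a small experiment (just a sketch to look at the hourly groups, not a solution):

    for hour_start, rows in df.groupby(pd.Grouper(freq='H')):
        # rows is a DataFrame holding every record that falls in that hour
        print(hour_start, len(rows))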

I just wonder whether I can do all of the above with pandas built-in functions, without looping over all rows to construct the new DataFrame.


For example, suppose 2018-02-14 ~ 2018-02-21 has the records below:

| datetime_create     | quantity_old | quantity_new | quantity_diff | is_stockout |
| ------------------- | ------------ | ------------ | ------------- | ----------- |
| 2018-02-14 00:45:00 | 40           | 10           | -30           | False       |
| 2018-02-15 12:45:00 | 10           | 2            | -8            | False       |
| 2018-02-15 13:45:00 | 2            | 1            | -1            | False       |
| 2018-02-15 16:45:00 | 1            | 0            | -1            | True        |
| 2018-02-16 10:42:00 | 0            | 42           | 42            | False       |
| 2018-02-16 13:42:00 | 42           | 40           | -2            | False       |
| 2018-02-16 19:42:00 | 40           | 38           | -2            | False       |
| 2018-02-17 20:42:00 | 38           | 40           | 2             | False       |
(the rows below repeat the pattern above, shifted four days later)
| 2018-02-18 00:45:00 | 40           | 10           | -30           | False       |
| 2018-02-19 12:45:00 | 10           | 2            | -8            | False       |
| 2018-02-19 13:45:00 | 2            | 1            | -1            | False       |
| 2018-02-19 16:45:00 | 1            | 0            | -1            | True        |
| 2018-02-20 10:42:00 | 0            | 42           | 42            | False       |
| 2018-02-20 13:42:00 | 42           | 40           | -2            | False       |
| 2018-02-20 19:42:00 | 40           | 38           | -2            | False       |
| 2018-02-21 20:42:00 | 38           | 40           | 2             | False       |

csv:

datetime_create,quantity_old,quantity_new,quantity_diff,is_stockout
2018-02-14 00:45:00,40,10,-30,False
2018-02-15 12:45:00,10,2,-8,False
2018-02-15 13:45:00,2,1,-1,False
2018-02-15 16:45:00,1,0,-1,True
2018-02-16 10:42:00,0,42,42,False
2018-02-16 13:42:00,42,40,-2,False
2018-02-16 19:42:00,40,38,-2,False
2018-02-17 20:42:00,38,40,2,False
2018-02-18 00:45:00,40,10,-30,False
2018-02-19 12:45:00,10,2,-8,False
2018-02-19 13:45:00,2,1,-1,False
2018-02-19 16:45:00,1,0,-1,True
2018-02-20 10:42:00,0,42,42,False
2018-02-20 13:42:00,42,40,-2,False
2018-02-20 19:42:00,40,38,-2,False
2018-02-21 20:42:00,38,40,2,False

The expected result would be (the time unit here is minutes, for readability):

date,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
2018-02-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-15,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0
2018-02-16,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-19,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0
2018-02-20,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,60.0,42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-02-21,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
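
For reference, the csv above can be read into a DataFrame with a DatetimeIndex like this (just a setup sketch; the file name stock.csv is an assumption):

    import pandas as pd

    df = pd.read_csv('stock.csv',
                     parse_dates=['datetime_create'],
                     index_col='datetime_create')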


1 answer

  • answered 2018-07-12 07:47 jezrael

    I think you need to first resample by minute with forward filling of the NaNs, convert to integers, and turn the one-column DataFrame into a Series with DataFrame.squeeze.

    Then aggregate by date and hour with sum, and finally reshape with unstack:

    # one value per minute: within a stockout is_stockout stays 1, so summing these
    # per (date, hour) counts the stockout minutes in each hour
    s = df[['is_stockout']].resample('T').ffill().astype(int).squeeze()
    df1 = s.groupby([s.index.date, s.index.hour]).sum().unstack(fill_value=0)
    print (df1)
    datetime_create  0   1   2   3   4   5   6   7   8   9  ...  14  15  16  17  \
    2018-02-14        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
    2018-02-15        0   0   0   0   0   0   0   0   0   0 ...   0   0  15  60   
    2018-02-16       60  60  60  60  60  60  60  60  60  60 ...   0   0   0   0   
    2018-02-17        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
    2018-02-18        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
    2018-02-19        0   0   0   0   0   0   0   0   0   0 ...   0   0  15  60   
    2018-02-20       60  60  60  60  60  60  60  60  60  60 ...   0   0   0   0   
    2018-02-21        0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   
    
    datetime_create  18  19  20  21  22  23  
    2018-02-14        0   0   0   0   0   0  
    2018-02-15       60  60  60  60  60  60  
    2018-02-16        0   0   0   0   0   0  
    2018-02-17        0   0   0   0   0   0  
    2018-02-18        0   0   0   0   0   0  
    2018-02-19       60  60  60  60  60  60  
    2018-02-20        0   0   0   0   0   0  
    2018-02-21        0   0   0   0   0   0
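
    The result is in minutes because of the 'T' (minute) resample; resampling with 'S' would count seconds instead, at the cost of a much larger intermediate Series. A small addition on top of this (not part of the original answer) guarantees all 24 hour columns even if some hour never occurs in the resampled range, and names the index like the expected output:

    df1 = df1.reindex(columns=range(24), fill_value=0)   # make sure columns 0..23 all exist
    df1.index.name = 'date'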