SQL query: Difference between EXTRACT(DATE FROM timestamp) vs. WHERE timestamp

I'm querying for the number of trips throughout the hours of the day with the avg mph, within a given date range.

What exactly is the difference between these two functions for working with timestamp data and can someone please explain why the first value in the num_trips column is different? See below

   (
   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
   trip_seconds, trip_miles
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
   WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
   AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
           AND trip_seconds > 0 
           AND trip_miles > 0
       )
       SELECT hour_of_day, 
       COUNT(1) as num_trips,
       (3600 * SUM(trip_miles) / SUM(trip_seconds)) as avg_mph
               FROM RelevantRides1
               GROUP BY hour_of_day
               ORDER BY hour_of_day 

   (
   SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
   trip_miles, trip_seconds
   FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
                   WHERE trip_start_timestamp > '2017-01-01' AND 
                         trip_start_timestamp < '2017-07-01' AND 
                         trip_seconds > 0 AND 
                         trip_miles > 0
         )
         SELECT hour_of_day, 
         COUNT(1) AS num_trips, 
         3600 * SUM(trip_miles) / SUM(trip_seconds) AS avg_mph
             FROM RelevantRides
             GROUP BY hour_of_day
             ORDER BY hour_of_day

I expected both queries to return the same result but when I print the dataframe, the first query gives a different result for num_trips for the first hour of the day, where I used Extract, the rest of the results are identical.

2 answers

  • answered 2019-08-13 05:06 Mikhail Berlyant

    I think the issue is in trip_start_timestamp > '2017-01-01'
    To fix discrepancy you should use trip_start_timestamp >= '2017-01-01'

    Below simplified example reproduces the issue

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT CURRENT_TIMESTAMP() trip_start_timestamp UNION ALL
      SELECT TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY), DAY) UNION ALL
      SELECT TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 2 DAY)
    )
    SELECT trip_start_timestamp
    FROM `project.dataset.table`
    WHERE trip_start_timestamp > '2019-08-12'
    

    returns

    Row trip_start_timestamp     
    1   2019-08-13 05:04:34.747114 UTC   
    

    while

    SELECT trip_start_timestamp
    FROM `project.dataset.table`
    WHERE trip_start_timestamp >= '2019-08-12'   
    

    returns

    Row trip_start_timestamp     
    1   2019-08-13 05:05:38.784956 UTC   
    2   2019-08-12 00:00:00 UTC     
    

  • answered 2019-08-13 05:19 Spasnof

    So I cannot make the original example run because I get a Syntax error: Unexpected keyword SELECT at [10:8] if I attempt to copy paste the query in a new project using the bigquery-public-data.chicago_taxi_trips table. That said I will attempt an answer using the first select statements given...

    User Mikhail Berlyant gave part of the solution which is that you are mixing >= and > so you have two different queries and are comparing apples with oranges.

    If you run the following queries you will notice the same record count being returned.

    SELECT count(*) from
     (
      SELECT 
        EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
        trip_seconds, trip_miles
      FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE EXTRACT(DATE FROM trip_start_timestamp) >= '2017-01-01'
        AND EXTRACT(DATE FROM trip_start_timestamp) < '2017-07-01'
        AND trip_seconds > 0 
        AND trip_miles > 0
     ) t;
    -- returns 11460748
    
    SELECT count(*) from 
    (
       SELECT EXTRACT(HOUR FROM trip_start_timestamp) AS hour_of_day, 
       trip_miles, trip_seconds
       FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
      WHERE trip_start_timestamp >= '2017-01-01' 
        AND trip_start_timestamp < '2017-07-01'  
        AND trip_seconds > 0 AND 
        trip_miles > 0
    ) t1
    -- returns 11460748
    

    However if you do > on both ends you don't get the same results. (11409890 vs 11460748 then Why is that? Well that is because when you use EXTRACT(DATE FROM trip_start_timestamp) you floor the field prior to comparison. So for example:

    select '2017-01-01 00:15:00' > '2017-01-01'; --true
    select extract(date from timestamp '2017-01-01 00:15:00'); --2017-01-01
    select extract(date from timestamp '2017-01-01 00:15:00') > '2017-01-01'; --false
    
    1. In the first statement we say that 2017-01-01 00:15:00 is greater than 2017-01-01 00:00:00 (the extra hour/minute/second of precision is implied)
    2. In the second statement you see what your value floors to (2017-01-01 or 2017-01-01 00:00:00 with implied precision)
    3. When we compare the floored value the given value and factor in the implied precision 2017-01-01 00:00:00 > 2017-01-01 00:15:00 you return false so you exclude any records on the same day. 2017-01-01 06:15:00, 2017-01-01 22:15:00etc

    I would recommend using the Minimal, reproducible, viable example docs available here.