When to use and when not to use an alias for JOINs in BigQuery

I used citibike_stations and citibike_trips from the public dataset and copied those tables. So I have a dataset, Citibike_stations1, with two tables: Stations and Trips.

Below is the query where I get the error - "Unrecognized name:

SELECT st.station_id, st.name, number_of_rides AS number_of_rides_fromstation,
Doubt 1: number_of_rides is not a column, so how will SQL select this?
FROM ( SELECT start_station_id, COUNT(*) number_of_rides FROM leafy-racer-348015.citibike_stations1.trips tr GROUP BY start_station_id ) AS station_num_trips INNER JOIN leafy-racer-348015.citibike_stations1.stations st ON st.station_id = tr.start_station_id ORDER BY number_of_rides DESC

Doubt 2: When I run this query, I get an Unrecognized name error on the line st.station_id = tr.start_station_id. But if I remove this alias, it works fine.

This question comes from the Google Data Analytics course, Week 3, Nested Queries module. Earlier, in the JOIN module, I understood that aliases are necessary for a JOIN to work. But here it is the opposite. Why?

1 answer

  • answered 2022-05-04 10:50 Caius Jard

    OK, I think I've managed to work out what this is asking. First, let's format your query so it's legible (but still has errors):

    SELECT 
      st.station_id, 
      st.name, 
      number_of_rides AS number_of_rides_fromstation,
    FROM 
      ( 
        SELECT start_station_id, COUNT(*) as number_of_rides 
        FROM leafy-racer-348015.citibike_stations1.trips tr 
        GROUP BY start_station_id 
      ) station_num_trips 
    
      INNER JOIN leafy-racer-348015.citibike_stations1.stations st 
      ON 
        st.station_id = tr.start_station_id 
    
    ORDER BY number_of_rides DESC
    

    number_of_rides is not a column so how will SQL select this

    Sure it is. The subquery aliased by station_num_trips generates it from COUNT(*).

    SQL is just "rectangular blocks of data in, rectangular blocks of data out" - every SELECT produces a rectangular block of data, just like a table, that can be fed into another operation, like a JOIN, FROM or WHERE. It can even be fed into a SELECT if it's a single value. Your subquery here:

      ( 
        SELECT start_station_id, COUNT(*) as number_of_rides 
        FROM leafy-racer-348015.citibike_stations1.trips tr 
        GROUP BY start_station_id 
      ) station_num_trips 
    

    ..took all the data in trips, threw all the columns away except the station id, counted the number of trips from that station, and produced a new block of data that was just station id, and the count: it measures 2 columns wide by N rows high (N is the number of unique station IDs). This is what gets joined in when you do the join; your 2 wide by N high block of data.

    Trips might have had 1000 rows for the same station ID and 10 columns; they're all gone, collapsed into 1 row with 2 columns: the station ID and a count of 1000. This now behaves like a new table called station_num_trips.
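
    To watch that collapse happen without touching BigQuery, here is a minimal sketch that runs the same subquery against SQLite through Python's sqlite3 module. The miniature trips table, its extra columns and its six rows are made up for illustration; only the shape of the result matters.

```python
import sqlite3

# Assumption: a tiny stand-in for the trips table (the real one has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (trip_id INTEGER, start_station_id INTEGER, duration INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, 10, 300), (2, 10, 450), (3, 10, 120), (4, 20, 600), (5, 30, 90), (6, 30, 200)],
)

# The subquery on its own: six trip rows collapse to one row per station,
# and only two columns survive (the station id and the count).
station_num_trips = conn.execute(
    """
    SELECT start_station_id, COUNT(*) AS number_of_rides
    FROM trips
    GROUP BY start_station_id
    ORDER BY start_station_id
    """
).fetchall()

print(station_num_trips)  # [(10, 3), (20, 1), (30, 2)]
```

    Six rows and three columns went in; three rows and two columns came out, and number_of_rides is now a perfectly ordinary column of that result.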

    --

    At every step columns and rows are added or removed, so basically all you're ever doing in SQL is cutting up, and joining together blocks of data, to form new blocks of data.

    Tables often form the starting blocks, but they don't have to:

    SELECT * FROM (SELECT 1 as A UNION ALL SELECT 2) x
    

    This is perfectly valid; you could apply a WHERE clause to remove rows, JOIN it to a table to add more columns, and then SELECT those new columns or use them to produce more calculated values.
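
    As a rough illustration of that, the sketch below builds the two-row block from literals and then filters and joins it. It runs on SQLite via Python's sqlite3 rather than BigQuery, the labels table is a hypothetical one invented just to have something to join to, and the set operator is written as UNION ALL because BigQuery wants ALL or DISTINCT spelled out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: a made-up lookup table, present only so the JOIN has a partner.
conn.execute("CREATE TABLE labels (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO labels VALUES (?, ?)", [(1, "one"), (2, "two")])

# A block of data built from literals alone; no table is involved.
block = conn.execute(
    "SELECT * FROM (SELECT 1 AS A UNION ALL SELECT 2) x ORDER BY x.A"
).fetchall()

# The same block with a WHERE to drop a row and a JOIN to add a column.
filtered_and_joined = conn.execute(
    """
    SELECT x.A, labels.label
    FROM (SELECT 1 AS A UNION ALL SELECT 2) x
    JOIN labels ON labels.id = x.A
    WHERE x.A > 1
    """
).fetchall()

print(block)                # [(1,), (2,)]
print(filtered_and_joined)  # [(2, 'two')]
```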


    There's no point aliasing your table in the subquery; there is only one table so it's obvious what you're referring to:

      ( 
        SELECT start_station_id, COUNT(*) as number_of_rides 
        FROM leafy-racer-348015.citibike_stations1.trips       --removed alias
        GROUP BY start_station_id 
      ) station_num_trips 
    

    Aliasing it also then led you to another mistake:

    When I run this query, I get Unrecognized error in the line st.station_id=tr.start_station_id. But if I remove this alias, then it works fine.

    tr is an alias inside the subquery. It doesn't exist where you're using it. The whole subquery is aliased as station_num_trips, so it's station_num_trips.start_station_id you need to use in your join condition, not tr.start_station_id. The tr alias is gone; you cannot access a subquery's alias from the parent query that wraps it.
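
    The scoping rule is easy to try out. The sketch below runs your query twice against SQLite (through Python's sqlite3), once with tr in the join condition and once with station_num_trips. The tables, rows and station names are invented, and SQLite words its error differently from BigQuery's "Unrecognized name", but the cause is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: tiny stand-ins for the trips and stations tables.
conn.execute("CREATE TABLE trips (start_station_id INTEGER)")
conn.execute("CREATE TABLE stations (station_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO trips VALUES (?)", [(10,), (10,), (10,), (20,)])
conn.executemany(
    "INSERT INTO stations VALUES (?, ?)", [(10, "Broadway"), (20, "Canal St")]
)

# Same query both times; only the alias used in the ON clause changes.
QUERY = """
    SELECT st.station_id, st.name, number_of_rides AS number_of_rides_fromstation
    FROM (
        SELECT start_station_id, COUNT(*) AS number_of_rides
        FROM trips tr
        GROUP BY start_station_id
    ) station_num_trips
    INNER JOIN stations st
        ON st.station_id = {join_alias}.start_station_id
    ORDER BY number_of_rides DESC
"""

# tr only exists inside the subquery, so the outer ON clause cannot see it.
try:
    conn.execute(QUERY.format(join_alias="tr"))
    inner_alias_error = None
except sqlite3.OperationalError as exc:
    inner_alias_error = str(exc)

# The subquery's own alias is what the outer query is allowed to use.
rows = conn.execute(QUERY.format(join_alias="station_num_trips")).fetchall()

print(inner_alias_error)
print(rows)  # [(10, 'Broadway', 3), (20, 'Canal St', 1)]
```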

    You can, however, access a parent query's alias in a subquery (the other way round from what you have):

    SELECT *              
    FROM table1 x         <-- this is the parent query
    WHERE 
        EXISTS(SELECT null FROM table2 y WHERE x.id = y.id)
                                              ^^^
                                      from the parent query
    

    Doing this is a common way to coordinate a subquery's data with the main query's data (a subquery written this way is usually called a correlated subquery). Here the query asks "get me only x records where there is a matching record in table2"; it needs that coordination between the subquery and the main query in order to work.
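
    A small runnable version of that EXISTS pattern, again on SQLite via Python's sqlite3 and with made-up rows, might look like this. table1 and table2 are the placeholder names from the query above; the note column is an extra added here so there is something to look at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: placeholder tables with invented rows.
conn.execute("CREATE TABLE table1 (id INTEGER, note TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO table2 VALUES (?)", [(1,), (3,), (3,)])

# The subquery reaches out to the parent's alias x; the parent never
# refers to the subquery's alias y.
matched = conn.execute(
    """
    SELECT x.id, x.note
    FROM table1 x
    WHERE EXISTS (SELECT NULL FROM table2 y WHERE x.id = y.id)
    ORDER BY x.id
    """
).fetchall()

print(matched)  # [(1, 'a'), (3, 'c')]
```

    Row 2 is dropped because table2 has no id 2, and row 3 comes back once even though table2 holds two matching rows, since EXISTS only asks whether a match is there.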


    Make use of indentation to keep straight in your mind what is a parent and what is a subquery. Small subqueries might fit all on one line, or in a compact form where each line is led by the block keyword. Larger queries should be spread out:

    SELECT
      smiths.columnX,
      counted_things.column3,
      SUM(x.yz)
    
    FROM
      (SELECT * FROM People WHERE name = 'SMITH') smiths
    
      JOIN (
        SELECT name, age, column3, COUNT(*) as columnX
        FROM TableA JOIN tableB On x = y
        WHERE column4 = 'something'
        GROUP BY name, age, column3
      ) counted_things
      ON 
        counted_things.name = smiths.Name AND 
        counted_things.age = smiths.Age AND
        ..
    
      JOIN sometable x ON x.whatever = counted_things.whatever
    
      JOIN anothertable y
      ON
        x.id = y.id AND
        ...
    
    WHERE
      x.thing = 123
    
    GROUP BY
      smiths.columnX,
      counted_things.column3
    

    See how I'm using indentation to describe what level everything is operating at - the SELECT, FROM, WHERE, GROUP BY etc. are aligned; they belong to the same query. The data blocks in the FROM are indented and aligned with each other; subqueries have parentheses that align with other parentheses and table names. It's easy to pick out a subquery. It's easy to see what it joins to; smiths, counted_things, sometable and anothertable are indented the same, they're joined together and operate at the same level. Subqueries are indented inside their brackets to clarify that they're subqueries.

    All this helps you use rules like "subqueries can access parent aliases but not the other way round" and "subqueries produce rows and columns that behave like new tables".
