When to use, and when not to use, an alias for JOINs in BigQuery
I copied the citibike_stations and citibike_trips tables from the public dataset. So I have: Dataset: Citibike_stations1; Tables: Stations and Trips.
Below is the query where I get error - "Unrecognized name:
number_of_rides AS number_of_rides_fromstation,
Doubt 1 - number_of_rides is not a column, so how will SQL select this?
FROM ( SELECT start_station_id, COUNT(*)number_of_rides FROM
GROUP BY start_station_id ) AS station_num_trips
st.station_id = tr.start_station_id
Doubt 2 - When I run this query, I get an Unrecognized name error in the line st.station_id = tr.start_station_id. But if I remove this alias, then it works fine.
This question comes from the Google Data Analytics Course, Wk3 Nested Queries module. Earlier, in the JOIN module, I understood that aliases are necessary for a JOIN to work. But here it is the opposite. Why?
OK, I think I've managed to work out what this is asking. First, let's format your query so it's legible (but still has errors):
    SELECT
      st.station_id,
      st.name,
      number_of_rides AS number_of_rides_fromstation,
    FROM
      (
        SELECT start_station_id, COUNT(*) as number_of_rides
        FROM leafy-racer-348015.citibike_stations1.trips tr
        GROUP BY start_station_id
      ) station_num_trips
      INNER JOIN leafy-racer-348015.citibike_stations1.stations st
        ON st.station_id = tr.start_station_id
    ORDER BY number_of_rides DESC
number_of_rides is not a column so how will SQL select this
Sure it is. The subquery aliased by station_num_trips generates it from COUNT(*) as number_of_rides.
SQL is just "rectangular blocks of data in, rectangular blocks of data out": every SELECT produces a rectangular block of data, just like a table, that can be fed into another operation, like a JOIN, FROM or WHERE. It can even be fed into a SELECT if it's a single value. Your subquery here:
    (
      SELECT start_station_id, COUNT(*) as number_of_rides
      FROM leafy-racer-348015.citibike_stations1.trips tr
      GROUP BY start_station_id
    ) station_num_trips
..took all the data in trips, threw all the columns away except the station id, counted the number of trips from that station, and produced a new block of data that was just the station id and the count: it measures 2 columns wide by N rows high (N is the number of unique station IDs). This is what gets joined in when you do the join: your 2 wide by N high block of data.
Trips might have had 1000 rows of the same station ID and 10 columns; they're all gone, collapsed into 1 row with 2 columns: the station ID and a count of 1000. This now behaves like a new table called station_num_trips.
At every step columns and rows are added or removed, so basically all you're ever doing in SQL is cutting up and joining together blocks of data to form new blocks of data.
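If it helps to see that block for yourself, run the subquery on its own. Here's a minimal sketch; the trips data is a made-up stand-in I've invented for illustration (five rides from two stations), not your real table:

```sql
-- Hypothetical stand-in for the trips table: five rides from two stations.
WITH trips AS (
  SELECT 101 AS start_station_id, 'a' AS bike UNION ALL
  SELECT 101, 'b' UNION ALL
  SELECT 101, 'c' UNION ALL
  SELECT 202, 'd' UNION ALL
  SELECT 202, 'e'
)
-- The same shape as your subquery: keep the station id, count the rows.
SELECT start_station_id, COUNT(*) AS number_of_rides
FROM trips
GROUP BY start_station_id
ORDER BY start_station_id
-- Result is 2 columns wide by 2 rows high:
--   101 | 3
--   202 | 2
```

The bike column is thrown away and the five input rows collapse to one row per station, which is exactly the block that gets handed to the JOIN.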
Tables often form the starting blocks, but they don't have to:
    SELECT * FROM (SELECT 1 as A UNION ALL SELECT 2) x
This is perfectly valid (BigQuery wants UNION ALL or UNION DISTINCT rather than a bare UNION); you could apply a WHERE clause to remove rows, a JOIN to a table to add more columns, and then SELECT those new columns or use them to produce more calculated values.
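As a rough sketch of that idea, here the inline block x is filtered and joined to a second inline block. The name labels and all the values are made up for the example:

```sql
SELECT x.A, labels.label
FROM
  -- First block: three rows, one column, no table involved.
  (SELECT 1 AS A UNION ALL SELECT 2 UNION ALL SELECT 3) x
  -- Second block: adds a label column for some of the values.
  JOIN (SELECT 1 AS A, 'one' AS label UNION ALL SELECT 2, 'two') labels
    ON labels.A = x.A
WHERE x.A > 1
-- Result: a single row, 2 | two
-- (3 has no match in labels, and 1 is removed by the WHERE)
```

Neither side of the join is a real table; each is just a block of rows and columns with an alias on it.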
There's no point aliasing your table in the subquery; there is only one table so it's obvious what you're referring to:
    (
      SELECT start_station_id, COUNT(*) as number_of_rides
      FROM leafy-racer-348015.citibike_stations1.trips --removed alias
      GROUP BY start_station_id
    ) station_num_trips
Aliasing it also then led you to another mistake:
When I run this query, I get Unrecognized error in the line st.station_id=tr.start_station_id. But if I remove this alias, then it works fine.
tr is an alias inside the subquery. It doesn't exist where you're using it. The whole subquery is aliased as station_num_trips, so station_num_trips.start_station_id is what you need to use in your join condition, not tr.start_station_id. Outside the subquery the tr alias is gone; you cannot access a subquery's alias from the parent query that wraps it.
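Putting that together, a corrected version might look like the sketch below. To keep it runnable on its own I've used two tiny made-up tables (trips and stations, with invented values) in place of your real ones; swap in your full table names, and note I'm assuming station_id and start_station_id have comparable types:

```sql
-- Hypothetical stand-ins for your two tables.
WITH trips AS (
  SELECT 101 AS start_station_id UNION ALL
  SELECT 101 UNION ALL
  SELECT 202
),
stations AS (
  SELECT 101 AS station_id, 'Broadway' AS name UNION ALL
  SELECT 202, 'Canal St'
)
SELECT
  st.station_id,
  st.name,
  station_num_trips.number_of_rides AS number_of_rides_fromstation
FROM
  (
    SELECT start_station_id, COUNT(*) AS number_of_rides
    FROM trips                 -- no alias needed: only one table in here
    GROUP BY start_station_id
  ) station_num_trips
  INNER JOIN stations st
    -- Use the subquery's alias here; tr does not exist at this level.
    ON st.station_id = station_num_trips.start_station_id
ORDER BY number_of_rides_fromstation DESC
-- Result:
--   101 | Broadway | 2
--   202 | Canal St | 1
```

Qualifying number_of_rides with station_num_trips is optional because only one block has a column by that name, but it makes it obvious where the column comes from.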
You can, however, access a parent query's alias in a subquery (the other way round from what you have):
    SELECT *
    FROM table1 x                          <-- this is the parent query
    WHERE EXISTS(SELECT null FROM table2 y WHERE x.id = y.id)
                                                 ^^^^ from the parent query
Doing this is a common way to coordinate a subquery's data with the main query's data. Here the query asks "get me only x records where there is a matching record in table2"; it needs that coordination between the subquery and the main query in order to work.
Make use of indentation to keep straight in your mind what is a parent and what is a subquery. Small subqueries might fit all on one line, or in a compact form where each line is led by the block keyword. Larger queries should be spread out:
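Here's a small runnable sketch of that kind of correlated EXISTS, with table1 and table2 faked as inline data (the ids are made up):

```sql
-- Hypothetical stand-ins for table1 and table2.
WITH table1 AS (
  SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3
),
table2 AS (
  SELECT 2 AS id UNION ALL SELECT 3 UNION ALL SELECT 4
)
SELECT x.id
FROM table1 x                  -- parent query, aliased x
WHERE EXISTS (
  SELECT NULL
  FROM table2 y
  WHERE x.id = y.id            -- x comes from the parent query
)
ORDER BY x.id
-- Result: 2 and 3, the only table1 ids that also appear in table2
```

The subquery can see x because it sits inside the query that defines x. The reverse (referring to y from the outer WHERE) would fail for the same reason tr did in your query.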
    SELECT
      smiths.columnX,
      counted_things.column3,
      SUM(x.yz)
    FROM
      (SELECT * FROM People WHERE name = 'SMITH') smiths
      JOIN
      (
        SELECT name, age, column3, COUNT(*) as columnX
        FROM
          TableA
          JOIN tableB On x = y
        WHERE column4 = 'something'
        GROUP BY name, age, column3
      ) counted_things
      ON
        counted_things.name = smiths.Name AND
        counted_things.age = smiths.Age AND ..
      JOIN sometable x ON x.whatever = counted_things.whatever
      JOIN anothertable y ON x.id = y.id AND ...
    WHERE
      x.thing = 123
    GROUP BY
      smiths.columnX,
      counted_things.column3
See how I'm using indentation to describe what level everything is operating at: the SELECT, FROM, WHERE, GROUP etc. are aligned; they belong to the same query. The data blocks in the FROM are indented and aligned with each other; subqueries have parentheses that align with other parentheses and table names. It's easy to pick out a subquery. It's easy to see what it joins to; smiths, counted_things, sometable and anothertable are indented the same, so they're joined together and operate at the same level. Subqueries are indented inside their brackets to clarify that they're subqueries.
All this helps you use rules like "subqueries can access parent aliases but not the other way round" and "subqueries produce rows and columns that behave like new tables".