Having a deeper understanding of the different JOIN flavors is critical. For example, I don't think many people realize how deceptively dangerous OUTER joins are:
SELECT users.user_id
FROM users
RIGHT JOIN purchases
  ON purchases.user_id = users.user_id
  AND users.user_id = 123
By leaving the user_id=123 constraint in the JOIN instead of putting it in the WHERE, you've just exposed everyone's purchase data to the user. Easy to miss too if your tests don't fully create fixtures for multiple users. This becomes even more dangerous as we're entering the world of LLMs generating SQL[1]
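For contrast, this is what the safe version looks like, a sketch using the same tables: with the constraint in WHERE, the non-matching rows produced by the RIGHT JOIN are filtered out instead of returned.

```sql
SELECT users.user_id
FROM users
RIGHT JOIN purchases
  ON purchases.user_id = users.user_id
WHERE users.user_id = 123;
```

Here the WHERE clause runs after the join, so any purchase row that didn't match user 123 (and therefore has a NULL users.user_id) is dropped from the result.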
I mean yes writing the wrong query will give you the wrong results. But there’s nothing inherently risky with a left/right/full join.
I could make the same argument about inner joins giving the risk of dropping records you intended to keep (very common in analytics). Just have to know what you’re doing and what data you got.
Outer joins are inherently risky because they make it very easy to put conditions in a place that looks correct (read: passes review) but isn't, and the results are disastrous (information leak).
In this case the query doesn't make much sense to me and hopefully wouldn't pass review. If you're only looking up purchases for a single user, why would you be joining on the user table at all? I suppose I could see a bad ORM or query builder somehow come up with a bad query like this.
I disagree. As the previous commenter said, it's not the tool, it's the user: if you don't know what you're doing, you're prone to doing damage. Non-trivial amounts of auth code are written by "junior" engineers. I find that a lot more disastrous than a potential outer join + WHERE accident *that also requires* the application code to leak the data to the outside world *and* the underlying data to be sensitive.
So tools carry the same level of risk if the only way they can do harm is user error, even when that user error is easier to miss by design? I guess I just fundamentally disagree, because I think risk analysis is holistic: is the tool fundamentally correct (free of bugs), and how easy is it to misuse?
I'm going to look at an outer join FAR more critically, regardless of who wrote it, because it's easy to mess up the conditions.
The same could be said about clueless programming in general. Tbh I didn't immediately catch what that query does, because it's irregular-looking, so there's no quick pattern to match it against, but it does make you stop and think about what it actually does. And no, there's no way I would think of putting a filter into a right join. It's a join. It's like… idk. Being clueless.
> This becomes even more dangerous as we're entering the world of LLMs generating SQL[1]
I am beginning to push back against the idea of LLMs authoring SQL. Determinism is really important for most realistic use cases. SQL is one of those "all or nothing" experiences. Pretty much the last thing you want to have a temperature/spiciness slider over.
Granted, if your use case is to simply memorize a handful of canonical queries and to provide a UX-friendly way for business administrators to retrieve these queries, then have at it.
But, if you are hoping to ask the LLM to write novel, sophisticated, domain-specific queries involving joins across 10+ tables, aggregations, recursion, windowing, etc., you are going to have a very bad time. In my experience, this is what everyone is dreaming about doing, and I think it is an unproductive fantasy at this point. Fine tuning will not give you what you seek, but don't let me stop you from trying. We burned over 200 hours on it with nothing to show for our time.
For me, LLMs are far more interesting for determining the user's intentions than emitting a final query. Classification seems like the actual superpower here. Classification can feed a very powerful deterministic query building engine that will actually give you what you want and you can prove it will work. It just takes a lot more tedious work than most humans are willing to endure, so here we are talking about various shortcuts.
>We burned over 200 hours on it with nothing to show for our time.
Can you talk more about this? I'm working in this space now, but not from the "make an LLM give you the right query" perspective, but from the "make the LLM-produced query be safe" perspective[1]
How about an LLM talking to a semantic layer that then writes the SQL? I added experimental support for that in a project of mine, but using a semantic layer makes most querying so simple that natural language based approaches just seem kinda silly unless you have interface constraints (like on a mobile device).
why do people use right joins. i'll never understand it. especially in this case, when you can just query purchases directly and joining to users is pointless.
It takes all purchases records and joins a corresponding user to each record, but only if it has a specific id. Otherwise it's still every purchases record; only one user gets joined. A join condition is not a filter, it is a join condition. A left join will always yield all left-table rows; a right join will always yield all right-table rows. WHERE filters the result of the FROM + JOIN step (before GROUP BY), so if you want to filter by user id, the condition should go in WHERE.
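And if the goal really is just one user's purchases, the users table doesn't need to appear at all; a minimal sketch, assuming purchases carries a user_id column:

```sql
SELECT *
FROM purchases
WHERE user_id = 123;
```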
I worked with a programmer who decided, one day, that all his joins were from now on going to be “natural joins”, and code like this started appearing:
select job_name, comp_name
from jobs
natural join companies;
This is a foot gun just waiting for someone to unsuspectingly add a new column to “jobs” with the same name as one already in “companies”, thereby silently breaking the query logic.
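To make the failure mode concrete, a sketch with a hypothetical schema (not the original code), where comp_id starts out as the only shared column name:

```sql
create table jobs (job_id int, comp_id int, job_name text);
create table companies (comp_id int, comp_name text);

-- Joins on comp_id, the only column name the two tables share:
select job_name, comp_name
from jobs
natural join companies;

-- Later, someone adds a status column to both tables...
alter table jobs add column status text;
alter table companies add column status text;

-- ...and the same query now silently joins on (comp_id, status).
-- Rows quietly stop matching, with no error anywhere.
```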
Sadly, the boss declined to rein in such shenanigans, and I left soon after.
Natural join is actually a nice mental test of DB semantics: if it doesn't just work, you probably have a mixup of column names (the same name meaning different things). If you have really clean and consistent naming (I know you don't, and you won't discover that until the major incident), it would work fine.
I do feel like joining on declared foreign keys is an obvious thing for SQL to allow and databases to support. Given that it's not part of the standard (and NATURAL JOIN is), I can only assume that there's a compelling reason I haven't thought of.
If anyone out there knows why this isn't a thing, please chime in!
As soon as you have multiple relationships to a table relying on different foreign / composite keys, I’m pretty sure this completely breaks down. And the second you add an additional foreign key, you break all queries — so you’ve turned foreign key constraints from an internally managed constraint to a public interface whose addition/removal at any point is a breaking change.
Natural joins however depend on already publicly exposed metadata (column names), and if that changed you’d break queries anyways.
At least they could make a syntax for joining on a “foreign key references” clause from create statement. E.g. “from a join b by a.b_id[, a.b_compound_id_2]”.
But that’s just a sugar for still too low-level SQL. I’d better have well named relations and use these names in code. This way, there would be no situation when you add/remove a constraint but forget to add/remove a condition.
create table a (id, b_id);
create table b (id);
create relation atob from a to b on a.b_id = b.id;
select * from a left join b using atob;
This could also define classes of relationships to check at runtime. E.g. I've never leaked all purchases with a right join in my life, but I've had enough exploding relationships where 1:1 was expected.
But 99.99% of my child tables have only a single foreign key linking them to their parent and will never gain more. And breaking queries is much better than silently doing the wrong thing like NATURAL JOIN does.
For the edge cases it could support taking the name of the foreign key as an optional parameter.
I’ve done a lot of interviewing and one thing I’ve noticed is that many people are pretty confused about how joins work.
They think the foreign keys put some sort of link on a parent table to let it access rows of a child table as if they were an array.
There’s also a massive misunderstanding around ordering where a lot of people think that by reordering the joins you can control the order in which the db is going to execute the query.
A good planner is going to pick the best strategy however you order them, because its job is to return the results fastest no matter how you write the query (all things being equal).
Consider that there’s a unique index on one of the 2 columns. It’s almost certainly optimal to use that to find a single row and then execute your other filters no matter the order.
Maybe there’s a situation where you can adjust the filters so ones that remove the most rows with the least cpu cycles are first? I’m going to try to create an example to see if I can make it behave differently.
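One way to eyeball this in Postgres, as a sketch; the table and values are made up, and actual plans will vary with statistics:

```sql
-- Hypothetical table: email gets a unique index via the UNIQUE constraint.
create table accounts (
  id serial primary key,
  email text unique,
  region text
);

-- Predicate order in WHERE should make no difference: the planner
-- will almost certainly use the unique index on email to locate the
-- single candidate row first, then check region, however the
-- conditions are written.
explain
select * from accounts
where region = 'eu' and email = 'a@example.com';

explain
select * from accounts
where email = 'a@example.com' and region = 'eu';
```

Both EXPLAIN outputs should show the same index scan on email, which is the planner reordering the predicates for you.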
This is a nice overview, but am I the only one who found the graphic more confusing than helpful? I can decipher its meaning by using my knowledge of joins, but it in no way improves my understanding of joins.
I always enjoy how the planner explicitly calls it an anti join but for those unfamiliar with the tricks the sql looks pretty weird on first glance.
Traditionally I've always done it as LEFT JOIN ... IS NULL, but I saw a case a while back where the NOT EXISTS version performed better (Postgres). I feel like I must have missed something, because I thought they optimized to the same thing (maybe there were other criteria in the subquery that allowed it to use a better index).
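For anyone unfamiliar with the trick, these are the two spellings being compared, sketched against the thread's users/purchases tables:

```sql
-- Anti-join, LEFT JOIN flavor: users with no purchases at all.
SELECT u.user_id
FROM users u
LEFT JOIN purchases p ON p.user_id = u.user_id
WHERE p.user_id IS NULL;

-- Anti-join, NOT EXISTS flavor: this is the form Postgres's planner
-- typically labels explicitly as an "Anti Join" in EXPLAIN output.
SELECT u.user_id
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM purchases p WHERE p.user_id = u.user_id
);
```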
1. https://docs.heimdallm.ai/en/main/attack-surface/sql.html#ou...