Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
SQL Join Flavors (antonz.org)
94 points by nalgeon on Sept 21, 2023 | hide | past | favorite | 37 comments


Having a deeper understanding of the different JOIN flavors is critical. For example, I don't think many people realize how deceptively dangerous OUTER joins are:

  SELECT user.user_id
  FROM users
  RIGHT JOIN purchases
    ON purchases.user_id = user.user_id
    AND user.user_id=123
By leaving the user_id=123 constraint in the JOIN instead of putting it in the WHERE, you've just exposed everyone's purchase data to the user. Easy to miss too if your tests don't fully create fixtures for multiple users. This becomes even more dangerous as we're entering the world of LLMs generating SQL[1]

1. https://docs.heimdallm.ai/en/main/attack-surface/sql.html#ou...


I mean yes writing the wrong query will give you the wrong results. But there’s nothing inherently risky with a left/right/full join.

I could make the same argument about inner joins giving the risk of dropping records you intended to keep (very common in analytics). Just have to know what you’re doing and what data you got.


Outer joins are inherently risky because they make it very easy to put conditions in a place that looks correct (read: passes review) but isn't, and the results are disastrous (information leak).


In this case the query doesn't make much sense to me and hopefully wouldn't pass review. If you're only looking up purchases for a single user, why would you be joining on the user table at all? I suppose I could see a bad ORM or query builder somehow come up with a bad query like this.


I disagree, as the previous commenter said it’s not the tool, it’s the user and if you don’t know what you’re doing you’re prone to doing damage. Non trivial amounts of auth are made by “junior” engineers. I find that a lot more disastrous than a potential outer join + where accident *that also requires* the application code to leak the data to the outside world to be damaging *and* the underlying data to be sensitive.


So tools carry the same level of risk if the only way they can do harm is user error, even if that user error is easier to miss by design? I guess I just fundamentally disagree because I think risk analysis is holistic. Is the tool fundamentally correct (free of bugs) and is it easier to misuse.

I'm going to look at an outer join FAR more critically, regardless of who wrote it, because it's easy to mess up the conditions.


The same could be said about clueless programming in general. Tbh I didn’t immediately catch what that query does, because it’s non-regular(?) looking so there’s no quick pattern for it, but it still makes you think what it does actually. And no, there’s no way I could think of putting a filter into right join. It’s a join. It’s like… idk. Being clueless.


> This becomes even more dangerous as we're entering the world of LLMs generating SQL[1]

I am beginning to push back against the idea of LLMs authoring SQL. Determinism is really important for most realistic use cases. SQL is one of those "all or nothing" experiences. Pretty much the last thing you want to have a temperature/spiciness slider over.

Granted, if your use case is to simply memorize a handful of canonical queries and to provide a UX-friendly way for business administrators to retrieve these queries, then have at it.

But, if you are hoping to ask the LLM to write novel, sophisticated, domain-specific queries involving joins across 10+ tables, aggregations, recursion, windowing, etc., you are going to have a very bad time. In my experience, this is what everyone is dreaming about doing, and I think it is an unproductive fantasy at this point. Fine tuning will not give you what you seek, but don't let me stop you from trying. We burned over 200 hours on it with nothing to show for our time.

For me, LLMs are far more interesting for determining the user's intentions than emitting a final query. Classification seems like the actual superpower here. Classification can feed a very powerful deterministic query building engine that will actually give you what you want and you can prove it will work. It just takes a lot more tedious work than most humans are willing to endure, so here we are talking about various shortcuts.


>We burned over 200 hours on it with nothing to show for our time.

Can you talk more about this? I'm working in this space now, but not from the "make an LLM give you the right query" perspective, but from the "make the LLM-produced query be safe" perspective[1]

1. https://docs.heimdallm.ai/en/main/blog/posts/safe-sql-execut...


> "make an LLM give you the right query"

> "make the LLM-produced query be safe"

I think these are the same problem.


They're not. A query produced by an LLM may do exactly what the user asks for, but it is not safe to execute.


You're right. Now that I'm thinking about it, making a query that works is less difficult than also proving it is safe.

Good luck with your project.


How about an LLM talking to a semantic layer that then writes the SQL? I added experimental support for that in a project of mine, but using a semantic layer makes most querying so simple that natural language based approaches just seem kinda silly unless you have interface constraints (like on a mobile device).

https://github.com/totalhack/zillion


SQLite doesn't support right joins, which is a virtue in this case?

The article really ought to filter everything through SQLite, for how pervasive it is.


Not any more. SQLite received support for RIGHT and FULL OUTER joins in 3.39.0, released over a year ago.


why do people use right joins. i'll never understand it. especially in this case you when you can just query purchases directly and joining to users is pointless.


Why doesn’t that work?


It takes all purchases records and joins a corresponding user to each record, but only if it has a specific id. Otherwise it’s still every purchases record. But only one user gets joined. Join condition is not a filter, it is a join condition. Left join will always yield all left-table rows. Right join will always yield all right-table rows. “Where” works on a final result of select+joins (but before group by), so if you want to filter by user id, the condition should go in where.


I worked with a programmer who decided, one day, that all his joins were from now on going to be “natural joins”, and code like this started appearing:

select job_name, comp_name from jobs natural join companies;

This is a foot gun just waiting for someone to unsuspectingly add a new column to “jobs” with the same name as one already in “companies”, thereby silently breaking the query logic.

Sadly, the boss declined to rein in such shenanigans, and I left soon after.


Natural join is actually nice mental test of db semantics - if it does not just work, you have probably mixup of column names (same name meaning different things). If you have really clean and good patterns (I know you dont and you dont discover it before this major incident) it would work fine.


I've been longing for our db to add support for something ala

    select j.job_name, c.comp_name 
    from companies c
    join jobs j on foreign key
Could take optional name in case there's more than one foreign key linking the two.


I do feel like this is an obvious thing for SQL to allow and databases to support. Given that it's not part of the standard (and NATURAL JOIN is) I can only assume that there's a compelling reason that I haven't thought about.

If anyone out there knows why this isn't a thing, please chime in!


As soon as you have multiple relationships to a table relying on different foreign / composite keys, I’m pretty sure this completely breaks down. And the second you add an additional foreign key, you break all queries — so you’ve turned foreign key constraints from an internally managed constraint to a public interface whose addition/removal at any point is a breaking change.

Natural joins however depend on already publicly exposed metadata (column names), and if that changed you’d break queries anyways.


At least they could make a syntax for joining on a “foreign key references” clause from create statement. E.g. “from a join b by a.b_id[, a.b_compound_id_2]”.

But that’s just a sugar for still too low-level SQL. I’d better have well named relations and use these names in code. This way, there would be no situation when you add/remove a constraint but forget to add/remove a condition.

  create table a (id, b_id);
  create table b (id);
  create relation atob from a to b on a.b_id = b.id;

  select from a left join b using atob;
This could also define classes of relationships to check in runtime. E.g. I’ve never sent all purchases with right join in my life, but had enough exploding relationships where 1:1 was expected.


But 99.99% of my child tables have only a single foreign keys linking it to its parent and will not have more. And breaking queries is much better than silently doing the wrong thing like NATURAL JOIN.

For the edge cases it could support taking the name of the foreign key as an optional parameter.


I’ve done a lot of interviewing and one thing I’ve noticed is that many people are pretty confused about how joins work.

They think the foreign keys put some sort of link on a parent table to let it access rows of a child table as if they were an array.

There’s also a massive misunderstanding around ordering where a lot of people think that by reordering the joins you can control the order in which the db is going to execute the query.


just for clarity ordering your where clauses can matter right? I've tested it and it made a difference, i'm not hallucinating right


You’re probably hallucinating!

A good planner is going to pick the best strategy however you order them, because it’s job is to give the results fastest no matter how you write the query (all things being equal).

Consider that there’s a unique index on one of the 2 columns. It’s almost certainly optimal to use that to find a single row and then execute your other filters no matter the order.

Maybe there’s a situation where you can adjust the filters so ones that remove the most rows with the least cpu cycles are first? I’m going to try to create an example to see if I can make it behave differently.


Indeed it can, at least with sqlserver.


This is a nice overview, but am I the only one who found the graphic more confusing than helpful? I can decipher its meaning by using my knowledge of joins, but it in no way improves my understanding of joins.


On the section about partitioned joins, how is that different from

SELECT sale_dt, name, Sum(quantity) FROM sales LEFT JOIN product ON sales.product_id = product.id GROUP BY sale_dt, name;


I just want first class ASOF join. The lateral queries with limit 1 are so long to type out


I could get a job at https://hoogle.haskell.org/ ? Where do I apply?


Love me some lateral join / cross apply. Correlated subqueries and visibility in the where clause?! Whoo.


LATERAL is part of the SQL standard (since 1999 if I'm not mistaken)


Maybe someday SQL will let us do ANTI JOIN

A man can dream


I always enjoy how the planner explicitly calls it an anti join but for those unfamiliar with the tricks the sql looks pretty weird on first glance.

Traditionally I’ve always done it as left join null but I saw a case a while back where the not exists version performed better (Postgres). I feel like I must have missed something because I thought they optimised to the same thing (maybe there were other criteria in the subquery that allowed it to use a better index).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: