I don't think most SQL flavours support the MINUS operator.
Bing Chat says:
> The MINUS operator is not supported in all SQL databases. It can be used in databases like MySQL and Oracle. For databases like SQL Server, PostgreSQL, and SQLite, use the EXCEPT operator to perform this type of query
Does this work with the bag/multiset distinction that the author uses? Like, if table1 has two copies of some row and table2 has a single copy of that row, won't this query return that they're the same? But they're not: table1 has two copies of the same row, whereas table2 just has one.
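To make the concern concrete, here's a minimal sketch using SQLite via Python (table and column names are made up for illustration): EXCEPT is set-based, so the extra copy in table1 is invisible to it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t1 (v INTEGER)")
cur.execute("CREATE TABLE t2 (v INTEGER)")
# t1 holds TWO copies of the row, t2 holds one
cur.executemany("INSERT INTO t1 VALUES (?)", [(1,), (1,)])
cur.execute("INSERT INTO t2 VALUES (1)")

# EXCEPT deduplicates before comparing, so both directions come back empty
diff_ab = cur.execute("SELECT v FROM t1 EXCEPT SELECT v FROM t2").fetchall()
diff_ba = cur.execute("SELECT v FROM t2 EXCEPT SELECT v FROM t1").fetchall()
print(diff_ab, diff_ba)  # [] [] -- the tables look "identical"
```

So yes: a plain EXCEPT (or MINUS) diff in both directions reports no difference here, even though the row counts differ.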
I found that a weird edge case for the original author to fixate on. In mathematics or academia, sure, but in "real" SQL tables that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.
Exactly. If you have two absolutely duplicate rows in a table, you're going to have problems with a lot of your queries. It's usually an indication that you are lacking a unique constraint or have a bug somewhere.
It's exactly something I need to support, because the only reason I am ever diffing two tables is to check that my new and improved SQL query worked as expected against a known-correct table. I don't want my new code to accidentally duplicate rows on some silly join without automated testing picking up on it.
I have had it happen in real data from real systems, but it's not good: "true duplicates" are a sign that something is missing from your schema or something has gone wrong earlier in the pipeline.
(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)
Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.
I handle this by having a guid field for a primary key on such tables where there isn't a naturally unique index in the shape of the data. So something is unique, and you can delete or ignore other rows relative to that. (Just don't make your guid PK clustered; I use create-date or log-date for that.)
Your log tables don't have timestamps or line numbers on them?
More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be good options.
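The second option above (rolling copies up into a quantity field) can be sketched in a few lines of SQLite via Python; the `bag`/`rolled` names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# an unnormalized "bag": duplicate rows carry no identity of their own
cur.execute("CREATE TABLE bag (item TEXT)")
cur.executemany("INSERT INTO bag VALUES (?)", [("a",), ("a",), ("b",)])

# roll the copies up into a quantity column; (item, qty) is now a proper relation
cur.execute("""
    CREATE TABLE rolled AS
    SELECT item, COUNT(*) AS qty FROM bag GROUP BY item
""")
rows = cur.execute("SELECT item, qty FROM rolled ORDER BY item").fetchall()
print(rows)  # [('a', 2), ('b', 1)]
```

After the rollup, `item` can carry a unique constraint and the multiplicity is explicit data rather than implicit duplication.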
A primary key or quantity field might not be a perfect solution from a performance point of view. Timestamps do not guarantee uniqueness.
Look, I'm not trying to win the argument. In most cases you're definitely right; my point is that sometimes you have to work with a working legacy code/system, and sometimes that system has some unique features.
The duplicate-row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE as the original article suggests (it's not supported in all databases, and it's scarier for junior developers)... You can accomplish the same thing, and handle the dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:
It will catch it if you have 1 row for a set of values in one table and 0 in the other... or vice versa... or 1 row for a set of values in one table and 2+ (dupes) in the other.
I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.
Pseudocode summary: create (via GROUP BY) a rowcount for every distinct set of column values you give it from table A, create that same rowcount for every set of column values from table B, then compare whether those rowcounts match for all rows, and report the rows where they don't (sorted, to make the differences easier to read). A nice fast set-based operation.
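A minimal sketch of that summary, using SQLite via Python; the table and column names (`table_a`, `table_b`, `id`, `val`) are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table_a (id INTEGER, val TEXT)")
cur.execute("CREATE TABLE table_b (id INTEGER, val TEXT)")
cur.executemany("INSERT INTO table_a VALUES (?, ?)",
                [(1, "x"), (1, "x"), (2, "y")])          # note the dupe row
cur.executemany("INSERT INTO table_b VALUES (?, ?)",
                [(1, "x"), (2, "y"), (3, "z")])

# rowcount per value-set in each table, stacked with UNION ALL,
# then keep only the value-sets whose counts disagree
diff = cur.execute("""
    SELECT id, val, SUM(cnt_a) AS in_a, SUM(cnt_b) AS in_b
    FROM (
        SELECT id, val, COUNT(*) AS cnt_a, 0 AS cnt_b
        FROM table_a GROUP BY id, val
        UNION ALL
        SELECT id, val, 0 AS cnt_a, COUNT(*) AS cnt_b
        FROM table_b GROUP BY id, val
    )
    GROUP BY id, val
    HAVING SUM(cnt_a) <> SUM(cnt_b)
    ORDER BY id, val
""").fetchall()
print(diff)  # [(1, 'x', 2, 1), (3, 'z', 0, 1)]
```

The (2, 'y') row matches 1-to-1 and drops out; the dupe of (1, 'x') and the row missing from table_a both surface, counts included.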
In theory, yes, however the vast majority of tables will have some form of unique ID in each record... so in practice, there’s usually no difference. But if it must work for all tables...
Log analytics or warehouse tables often have no simple useful key for this sort of comparison.
Also, in a more general case you might be comparing tables that contain the same data but were constructed from different sources. Or perhaps a distributed dataset became disconnected, saw updates in both partitions, and you have brought the partitions together to decide which to keep, or whether it is worth trying to merge them. In those and other circumstances there may be a key, but if it is a surrogate key it will be meaningless for comparing data from the two sets of updates, so you would have to disregard it and compare on the other data (which might not include useful candidate keys).
It happens. I'm currently working on a project where the CRM tool I need to access for data actually does not have a unique id in its db. I have no idea if I will be able to successfully complete the project yet.
Is there any chance that the rows actually do have a unique id, but it's not being displayed without some magic incantation?
Asking because I've seen that before in some software, where it tries to "keep things simple" by default. But that behaviour can be toggled off so it shows the full schema (and data) for those with the need. :)
That's not all that unusual when something gets implemented, as people tend to take the easy approach for things that meet the desired goal.
It just sounds like the spec they were writing to wasn't very clear, or it was just a checkbox list of features provided to them by marketing. So "let's get this list done, then ship it". ;)