I don't think most SQL flavours support the MINUS operator.
Bing Chat says:
> The MINUS operator is not supported in all SQL databases. It can be used in databases like MySQL and Oracle. For databases like SQL Server, PostgreSQL, and SQLite, use the EXCEPT operator to perform this type of query
Does this work with the bag/multiset distinction that the author uses? Like, if table1 has two copies of some row and table2 has a single copy of that row, won't this query return that they're the same? But they're not: table1 has two copies of the same row, whereas table2 just has one.
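To make the concern concrete, here's a minimal sketch using SQLite via Python (table and column names are made up for illustration): EXCEPT is set-based, so the extra copy in table1 is invisible to it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t1 (v INTEGER)")
cur.execute("CREATE TABLE t2 (v INTEGER)")
# t1 holds TWO copies of the row, t2 holds one
cur.executemany("INSERT INTO t1 VALUES (?)", [(1,), (1,)])
cur.execute("INSERT INTO t2 VALUES (1)")

# EXCEPT deduplicates before comparing, so both directions come back empty
diff_ab = cur.execute("SELECT v FROM t1 EXCEPT SELECT v FROM t2").fetchall()
diff_ba = cur.execute("SELECT v FROM t2 EXCEPT SELECT v FROM t1").fetchall()
print(diff_ab, diff_ba)  # [] [] -- the tables look "identical"
```

So yes: a plain EXCEPT (or MINUS) diff in both directions reports no difference here, even though the row counts differ.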
I found that a weird edge case for the original author to fixate on. In mathematics or academia, sure, but in "real" SQL tables that serve any kind of purpose, duplicate rows are not something you need to support, let alone go to twice the engineering effort to support. Duplicates are more likely to be something you deliberately eradicate (before the comparison) than preserve and respect.
Exactly. If you have two absolutely duplicate rows in a table, you're going to have problems with a lot of your queries. It's usually an indication that you are lacking a unique constraint or have a bug somewhere.
It's exactly something I need to support, because the only reason I am ever diffing two tables is to check that my new and improved SQL query worked as expected against a known-correct table. I don't want my new code to accidentally duplicate rows on some silly join without automated testing picking up on it.
I have had it happen in real data from real systems, but it's not good: "true duplicates" are a sign that something is missing from your schema or something has gone wrong earlier in the pipeline.
(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)
Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.
I handle this by having a guid field for a primary key on such tables where there isn't a naturally unique index in the shape of the data. So something is unique, and you can delete or ignore other rows relative to that. (Just don't make your guid PK clustered; I use create-date or log-date for that.)
Your log tables don't have timestamps or line numbers on them?
More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be good options.
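The second option above (rolling copies up into a quantity field) can be sketched in a few lines of SQLite via Python; the `bag`/`rolled` names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# an unnormalized "bag": duplicate rows carry no identity of their own
cur.execute("CREATE TABLE bag (item TEXT)")
cur.executemany("INSERT INTO bag VALUES (?)", [("a",), ("a",), ("b",)])

# roll the copies up into a quantity column; (item, qty) is now a proper relation
cur.execute("""
    CREATE TABLE rolled AS
    SELECT item, COUNT(*) AS qty FROM bag GROUP BY item
""")
rows = cur.execute("SELECT item, qty FROM rolled ORDER BY item").fetchall()
print(rows)  # [('a', 2), ('b', 1)]
```

After the rollup, `item` can carry a unique constraint and the multiplicity is explicit data rather than implicit duplication.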
A primary key or quantity field might not be a perfect solution from a performance point of view. Timestamps do not guarantee uniqueness.
Look, I'm not trying to win the argument. In most cases you're definitely right; my point is that sometimes you have to work with a working legacy code/system, and sometimes that system has some unique features.
The duplicate-row issue is part of why I don't use MINUS for table value comparisons, nor RECURSIVE as the original article suggests (it's not supported in all databases, and it's scarier for junior developers)... You can accomplish the same thing, and handle the dupes scenario too, with just GROUP BY/UNION ALL/HAVING, using the following technique:
It will catch it if you have 1 row for a set of values in one table and 0 in the other... or vice versa... or 1 row for a set of values in one table and 2+ (dupes) in the other.
I have compared every row + every column value of billion-row tables in under a minute on a columnar database with this technique.
Pseudocode summary: create (via GROUP BY) a rowcount for every distinct set of column values you give it from table A, create that same rowcount for every set of column values from table B, then compare whether those rowcounts match for all rows, and report the rows where they don't (sorted, to make the differences easier to read). A nice fast set-based operation.
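A minimal sketch of that summary, using SQLite via Python; the table and column names (`table_a`, `table_b`, `id`, `val`) are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table_a (id INTEGER, val TEXT)")
cur.execute("CREATE TABLE table_b (id INTEGER, val TEXT)")
cur.executemany("INSERT INTO table_a VALUES (?, ?)",
                [(1, "x"), (1, "x"), (2, "y")])          # note the dupe row
cur.executemany("INSERT INTO table_b VALUES (?, ?)",
                [(1, "x"), (2, "y"), (3, "z")])

# rowcount per value-set in each table, stacked with UNION ALL,
# then keep only the value-sets whose counts disagree
diff = cur.execute("""
    SELECT id, val, SUM(cnt_a) AS in_a, SUM(cnt_b) AS in_b
    FROM (
        SELECT id, val, COUNT(*) AS cnt_a, 0 AS cnt_b
        FROM table_a GROUP BY id, val
        UNION ALL
        SELECT id, val, 0 AS cnt_a, COUNT(*) AS cnt_b
        FROM table_b GROUP BY id, val
    )
    GROUP BY id, val
    HAVING SUM(cnt_a) <> SUM(cnt_b)
    ORDER BY id, val
""").fetchall()
print(diff)  # [(1, 'x', 2, 1), (3, 'z', 0, 1)]
```

The (2, 'y') row matches 1-to-1 and drops out; the dupe of (1, 'x') and the row missing from table_a both surface, counts included.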
In theory, yes, however the vast majority of tables will have some form of unique ID in each record... so in practice, there’s usually no difference. But if it must work for all tables...
Log analytics or warehouse tables often have no simple useful key for this sort of comparison.
Also, in a more general case you might be comparing tables that contain the same data but were constructed from different sources. Or perhaps a distributed dataset became disconnected, saw updates in both partitions, and you have brought the partitions together to decide which to keep, or whether it is worth trying to merge them. In those and other circumstances there may be a key, but if it is a surrogate key it will be meaningless for comparing data from the two sets of updates, so you would have to disregard it and compare on the other data (which might not include useful candidate keys).
It happens. I'm currently working on a project where the CRM tool I need to access for data actually does not have a unique id in its db. I have no idea if I will be able to successfully complete the project yet.
Is there any chance that the rows actually do have a unique id, but it's not being displayed without some magic incantation?
Asking because I've seen that before in some software, where it tries to "keep things simple" by default. But that behaviour can be toggled off so it shows the full schema (and data) for those with the need. :)
That's not all that unusual when something gets implemented, as people tend to take the easy approach for things that meet the desired goal.
It just sounds like the spec they were writing to wasn't very clear, or it was just a checkbox list of features provided to them by marketing. So "let's get this list done, then ship it". ;)