Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> duplicate rows are not something you need to support

I can imagine that you want to have duplicates rows in a logging. If some events happens twice - you definitely want to log it twice.



I have had it happen in real data from real systems - but it’s not good — “true duplicates” are a sign that something is missing from your schema or something has gone wrong earlier in the pipeline.

Trying to delete duplicates (but leave 1 behind) is tricky in itself. I recorded notes on it one time here — using “row_number()” to act as the iniquitie, https://til.secretgeek.net/sql_server/delete_duplicate_rows....


(I just noticed that I emitted the word “iniquitie” above, where I thought I’d written “uniquifier” — and have just confirmed that autocorrect was responsible. Strangely, in context, the word “iniquitie” sort of “appears” to be an appropriate word, when its meaning is certainly not. I found that interesting.)


Logs usually have a timestamp that would differentiate the two events.


Not necessarily - the clock source for logging is often at millisecond resolution, but at the speed of modern systems you could pile up quite a few log entries in a millisecond.

I handle this by having a guid field for a primary key on such tables where there isn't a naturally unique index in the shape of the data. So something is unique, and you can delete or ignore other rows relative to that. (Just don't make your guid PK clustered; I use create-date or log-date for that.)


Your log tables don't have timestamps or line numbers on them?

More generally (and formally) speaking, multisets violate normalization. Either you add information to the primary key to identify the copies or you roll the copies up into a quantity field. I can't think of any kind of data where neither of these would be good options.


Primary key and quantity field could be not a perfect solution from performance point of view. Timestamps does not guarantee uniqueness.

Look, Im not trying to win the argument. In most cases you definitely right, my point is that sometime you have to work with working legacy code/system, and sometime this system could have some unique features.


It's still not something you "want to have", it's something you might be forced to deal with because someone screwed up the design.

And ensuring you have a real primary key should only be good for performance, in the realm of SQL databases.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: