
Very well written article. I'll add that there are tricks to manage that behavior.

The dataflow is a diamond:

    transactions ----> credits 
      \                   \
       ----> debits -----JOIN----> balance ---> total
In Flink, the SQL query planner may instantiate two different consumer groups on transactions [1], so that topic might be read twice. The two scans can progress at different rates, which is partly why some inconsistencies can appear.

Now, the trick: converting transactions to a DataStream and then back to SQL introduces a materialization barrier, and you benefit from high temporal locality. There might be some inconsistencies left, though, as the JOIN requires some shuffling and that can introduce processing skew. For example, the Netflix account will be the target of many individual accounts; the instance responsible for Netflix might become a hot spot and process things differently (probably by backpressuring a little, making the data arrive in larger micro-batches).
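Roughly, the round trip looks like this (a sketch against the Flink 1.13+ Table API; the class name and column names are illustrative, not the article's exact job, and `transactions` is assumed to be registered already, e.g. as a Kafka-backed table):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class SingleScanJob {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Drop down to the DataStream API: the topic is now scanned once...
            Table transactions = tEnv.from("transactions");
            DataStream<Row> txOnce = tEnv.toDataStream(transactions);

            // ...and go straight back to SQL. Both branches of the diamond now
            // read this one materialized stream instead of two source scans.
            tEnv.createTemporaryView("transactions_once", txOnce);

            tEnv.executeSql(
                "CREATE TEMPORARY VIEW credits AS "
                    + "SELECT to_account AS account, SUM(amount) AS credits "
                    + "FROM transactions_once GROUP BY to_account");
            // ...same idea for debits, balance and total.
        }
    }
Whether the planner really reuses the single scan depends on the version and configuration, so treat this as the shape of the idea rather than a guarantee.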

Anyway, when processing financial data you might want to make the transaction IDs tag along with the processing, and maybe pair them back together in tumbling event-time windows at the total (roughly as sketched below). Like he said: 'One thing to note is that this query does not window any of its inputs, putting it firmly on the low temporal locality side of the map where consistency is more difficult to maintain.' Also, windowing would have introduced some end-to-end latency.
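For the window part, roughly this (continuing from the same `tEnv`; a sketch only — it assumes `ts` is declared as an event-time attribute with a watermark on the source and that transaction ids increase monotonically, and it only shows the tagging, not the full pairing logic):

    // Tag each windowed aggregate with the last transaction id it absorbed,
    // so the ids can be compared again downstream at `total`.
    Table windowedCredits = tEnv.sqlQuery(
        "SELECT to_account AS account, "
            + "SUM(amount) AS credits, "
            + "MAX(id) AS last_tx, "
            + "TUMBLE_END(ts, INTERVAL '1' MINUTE) AS w_end "
            + "FROM transactions_once "
            + "GROUP BY to_account, TUMBLE(ts, INTERVAL '1' MINUTE)");
That is where the extra end-to-end latency comes from: nothing is emitted for an account until its window closes.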

This makes me think: do streaming systems introduce a new letter into the CAP trade-off?

I humbly propose: The CLAP trade-off, with L for Latency.

[1] https://issues.apache.org/jira/browse/FLINK-15775?filter=-2



> I humbly propose: The CLAP trade-off, with L for Latency.

Daniel Abadi beat you to the punch with PACELC

https://en.wikipedia.org/wiki/PACELC_theorem


> converting transactions to datastream then back to SQL will introduce a materializing barrier

It seems that this would still leave the problem where each transaction causes two deletes and two inserts to `balance`, and then `total` sums those one by one? These errors are very transient, but they still cause at least 3/4 of the outputs to be incorrect.
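For a concrete (made-up) example with just two accounts: A pays B 10, starting from A = 100 and B = 50, so the correct total is 150 both before and after. If `total` folds in the four balance changelog records one at a time, the emitted values are:

    retract (A, 100)  -> total =  50   (wrong)
    insert  (A,  90)  -> total = 140   (wrong)
    retract (B,  50)  -> total =  90   (wrong)
    insert  (B,  60)  -> total = 150   (correct)
Three of the four outputs never corresponded to any real state of the accounts.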

> tumbling event-time windows

Will this work for the join? For a given account, the last update to each of credits and debits may have arbitrarily disparate event times and transaction ids. It doesn't seem like there is a window you could set that would be guaranteed to connect them?


> this would still leave the problem where each transactions causes two deletes and two inserts

Indeed! You might be interested in the '6.5 Materialization Controls' section of this paper:

https://arxiv.org/pdf/1905.12133.pdf

They intend to add these extensions:

    SELECT ... EMIT STREAM AFTER DELAY INTERVAL '6' MINUTES ; -- to fix the output multiplication you mention
    SELECT ... EMIT STREAM AFTER WATERMARK ; -- this should fix the two deletes and two inserts
I don't know if this is still in active development, though. There has been no news since the paper was published two years ago.

> It doesn't seem like there is window you could set that would be guaranteed to connect them?

No, there isn't. But I'm quite confident that a transaction-centric approach can be expressed in SQL. Still, your use case is perfect for illustrating the problem.

I'd try something like:

    CREATE VIEW credits AS
    SELECT to_account     AS account,
           sum(amount)    AS credits,
           last_value(id) AS updating_tx
    FROM transactions
    GROUP BY to_account;

And then try to join credits and debits together by updating_tx.
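Spelled out in Table API terms (a sketch only; I'm assuming a `from_account` column on transactions for the debit side, mirroring `to_account`, and `credits` is the view above):

    // Sketch only; `from_account` is an assumed column name.
    tEnv.executeSql(
        "CREATE TEMPORARY VIEW debits AS "
            + "SELECT from_account AS account, "
            + "SUM(amount) AS debits, "
            + "LAST_VALUE(id) AS updating_tx "
            + "FROM transactions GROUP BY from_account");

    Table balance = tEnv.sqlQuery(
        "SELECT c.account, c.credits - d.debits AS balance "
            + "FROM credits c JOIN debits d "
            + "ON c.account = d.account AND c.updating_tx = d.updating_tx");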

No idea how to check that accounts don't go negative, though.


> And then try to join credits and debits together by updating_tx.

You can't join on updating_tx because the credits and debits per account are disjoint sets of transactions - that join will never produce output.

I did try something similar with timestamps - https://github.com/jamii/streaming-consistency/blob/main/fli.... This is also wrong (because the timestamps don't have to match between credits and debits) but it at least produces output. It had a very similar error distribution to the original.

Plus the join is only one of the problems here - the sum in `total` also needs, at a minimum, to process all the balance updates from a single transaction atomically.


You could instead put the global max seen id into every row, but then you would have to update all the rows on every transaction. That is not great performance-wise, and it would also massively exacerbate the non-atomic sum problem downstream in `total`.


Latency has always been part of the CAP theorem. There is a strictly stronger generalization of it which says that, in a consistent system, the latency of a transaction is bounded below by the latency of the links in the network. In the case of a partition the network latency is infinite, so the transaction latency is too, and thus there is no availability.



