
Very well written article. I'll add that there are tricks to manage that behavior.

The dataflow is a diamond:

    transactions ----> credits 
      \                   \
       ----> debits -----JOIN----> balance ---> total
In Flink, the SQL query planner may instantiate two different consumer groups on transactions [1], so that topic might be read twice. The two scans can progress at different rates, which is partly why some inconsistencies can appear.

Now, the trick: converting transactions to a DataStream and then back to SQL introduces a materialization barrier, and you benefit from high temporal locality. There might be some inconsistencies left, though, as the JOIN requires some shuffling and that can introduce processing skew. For example, the Netflix account will be the target of many individual accounts; the instance responsible for Netflix might become a hot spot and process things differently (probably by backpressuring a little, making the data arrive in larger micro-batches).
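Roughly, the round trip looks like this (a sketch against the Flink 1.13+ Table API; the class name and column names are illustrative, not the article's exact job, and `transactions` is assumed to be registered already, e.g. as a Kafka-backed table):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;

    public class SingleScanJob {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Drop down to the DataStream API: the topic is now scanned once...
            Table transactions = tEnv.from("transactions");
            DataStream<Row> txOnce = tEnv.toDataStream(transactions);

            // ...and go straight back to SQL. Both branches of the diamond now
            // read this one materialized stream instead of two source scans.
            tEnv.createTemporaryView("transactions_once", txOnce);

            tEnv.executeSql(
                "CREATE TEMPORARY VIEW credits AS "
                    + "SELECT to_account AS account, SUM(amount) AS credits "
                    + "FROM transactions_once GROUP BY to_account");
            // ...same idea for debits, balance and total.
        }
    }
Whether the planner really reuses the single scan depends on the version and configuration, so treat this as the shape of the idea rather than a guarantee.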

Anyway, when processing financial data you might want to make the transaction IDs tag along with the processing, and maybe pair them back together in tumbling event-time windows at the total (roughly as sketched below). Like he said: 'One thing to note is that this query does not window any of its inputs, putting it firmly on the low temporal locality side of the map where consistency is more difficult to maintain.' Also, windowing would have introduced some end-to-end latency.
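For the window part, roughly this (continuing from the same `tEnv`; a sketch only — it assumes `ts` is declared as an event-time attribute with a watermark on the source and that transaction ids increase monotonically, and it only shows the tagging, not the full pairing logic):

    // Tag each windowed aggregate with the last transaction id it absorbed,
    // so the ids can be compared again downstream at `total`.
    Table windowedCredits = tEnv.sqlQuery(
        "SELECT to_account AS account, "
            + "SUM(amount) AS credits, "
            + "MAX(id) AS last_tx, "
            + "TUMBLE_END(ts, INTERVAL '1' MINUTE) AS w_end "
            + "FROM transactions_once "
            + "GROUP BY to_account, TUMBLE(ts, INTERVAL '1' MINUTE)");
That is where the extra end-to-end latency comes from: nothing is emitted for an account until its window closes.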

This makes me think: do streaming systems introduce a new letter into the CAP trade-off?

I humbly propose: The CLAP trade-off, with L for Latency.

[1] https://issues.apache.org/jira/browse/FLINK-15775?filter=-2



> I humbly propose: The CLAP trade-off, with L for Latency.

Daniel Abadi beat you to the punch with PACELC

https://en.wikipedia.org/wiki/PACELC_theorem


> converting transactions to datastream then back to SQL will introduce a materializing barrier

It seems that this would still leave the problem where each transaction causes two deletes and two inserts to `balance`, and then `total` sums those one by one? These errors are very transient, but they still cause at least 3/4 of the outputs to be incorrect.
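For a concrete (made-up) example with just two accounts: A pays B 10, starting from A = 100 and B = 50, so the correct total is 150 both before and after. If `total` folds in the four balance changelog records one at a time, the emitted values are:

    retract (A, 100)  -> total =  50   (wrong)
    insert  (A,  90)  -> total = 140   (wrong)
    retract (B,  50)  -> total =  90   (wrong)
    insert  (B,  60)  -> total = 150   (correct)
Three of the four outputs never corresponded to any real state of the accounts.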

> tumbling event-time windows

Will this work for the join? For a given account, the last update to each of credits and debits may have arbitrarily disparate event times and transaction ids. It doesn't seem like there is a window you could set that would be guaranteed to connect them?


> this would still leave the problem where each transactions causes two deletes and two inserts

Indeed! You might be interested in the '6.5 Materialization Controls' section of this paper:

https://arxiv.org/pdf/1905.12133.pdf

They intend to add these extensions:

    SELECT ... EMIT STREAM AFTER DELAY INTERVAL '6' MINUTES ; -- to fix the output multiplication you mention
    SELECT ... EMIT STREAM AFTER WATERMARK ; -- this should fix the two deletes and two inserts
I don't know if this is still in active development, though. There has been no news since the paper was published two years ago.

> It doesn't seem like there is window you could set that would be guaranteed to connect them?

No, there isn't. But I'm quite confident that a transaction-centric approach can be expressed in SQL. Still, your use case is perfect for illustrating the problem.

I'd try something like:

    CREATE VIEW credits AS
    SELECT to_account     AS account,
           sum(amount)    AS credits,
           last_value(id) AS updating_tx
    FROM transactions
    GROUP BY to_account;

And then try to join credits and debits together by updating_tx.
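Spelled out in Table API terms (a sketch only; I'm assuming a `from_account` column on transactions for the debit side, mirroring `to_account`, and `credits` is the view above):

    // Sketch only; `from_account` is an assumed column name.
    tEnv.executeSql(
        "CREATE TEMPORARY VIEW debits AS "
            + "SELECT from_account AS account, "
            + "SUM(amount) AS debits, "
            + "LAST_VALUE(id) AS updating_tx "
            + "FROM transactions GROUP BY from_account");

    Table balance = tEnv.sqlQuery(
        "SELECT c.account, c.credits - d.debits AS balance "
            + "FROM credits c JOIN debits d "
            + "ON c.account = d.account AND c.updating_tx = d.updating_tx");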

No idea how to check that accounts don't go negative, though.


> And then try to join credits and debits together by updating_tx.

You can't join on updating_tx because the credits and debits per account are disjoint sets of transactions - that join will never produce output.

I did try something similar with timestamps - https://github.com/jamii/streaming-consistency/blob/main/fli.... This is also wrong (because the timestamps don't have to match between credits and debits) but it at least produces output. It had a very similar error distribution to the original.

Plus the join is only one of the problems here - the sum in `total` also needs, at a minimum, to process all the balance updates from a single transaction atomically.


You could instead put the global max seen id into every row, but then you would have to update all the rows on every transaction. That is not great performance-wise, and it would also massively exacerbate the non-atomic sum problem downstream in `total`.


Latency has always been part of the CAP theorem. There is a strictly stronger generalization of it which says that, in a consistent system, the latency of a transaction is bounded below by the latency of the links in the network. In the case of a partition the network latency is infinite, so the transaction latency is too, and thus there is no availability.



