I wrote a really long reply to a comment, but the comment was deleted before I finished. In case other people have the same questions:
> What is a valid subset to consider?
For these purposes, any subset.
> What is "correct output"?
I absolutely agree that deciding whether an output is correct requires actually defining the intended semantics of the streaming program. I don't believe such a definition exists for Flink or ksqlDB. (It does for differential dataflow, to some degree - https://link.springer.com/content/pdf/10.1007/978-3-662-4667...). If the outcome of this testing was that streaming systems gave a clear definition of their intended semantics which allowed for this degree of error, I think that would be excellent. That's what I was getting at towards the end of the post: "Ideally the behavior would be sufficiently well defined that it's possible to write tests that examine intermediate outputs, because those are the outputs that real deployments will experience."
But in the meantime, for the deterministic SQL queries here I think "run the same computation in a consistent batch system and compare the output" is a highly reasonable set of semantics by which we can judge correctness. This framing is laid out more explicitly in the intro to the previous article - https://scattered-thoughts.net/writing/an-opinionated-map-of...
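As a rough sketch of what that oracle looks like, assuming the bank-transfer schema from the post and with sqlite3 standing in for the consistent batch system (run_streaming_query is a hypothetical stand-in for running the same SQL in the system under test and waiting for its output to settle):

    import sqlite3

    # Batch oracle: compute per-account balances from the full input in a
    # consistent batch system. `transactions` is a list of
    # (from_account, to_account, amount) tuples.
    def batch_oracle(transactions):
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE transactions "
                   "(from_account INT, to_account INT, amount INT)")
        db.executemany("INSERT INTO transactions VALUES (?, ?, ?)", transactions)
        credits = dict(db.execute(
            "SELECT to_account, SUM(amount) FROM transactions GROUP BY to_account"))
        debits = dict(db.execute(
            "SELECT from_account, SUM(amount) FROM transactions GROUP BY from_account"))
        accounts = set(credits) | set(debits)
        return {a: credits.get(a, 0) - debits.get(a, 0) for a in accounts}

    # `run_streaming_query` is hypothetical - feed the same inputs to the
    # streaming system and return its settled output.
    def check_eventual_consistency(transactions, run_streaming_query):
        assert run_streaming_query(transactions) == batch_oracle(transactions)

Whatever the streaming system emits along the way, once it has consumed all the input its settled output should match the batch result.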
> Why would the example be internally inconsistent (per the definition)? The out of date debits data is a subset of the input so far, is it not?
The debits data is using one subset and the credits data is using another. So if we ask "what subset of the inputs is the balance computed from?" the answer is sort of this one and sort of that one. The end result is that there is no single subset of the data that could produce those results in a consistent system.
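Here's a toy illustration (hand-rolled, not output from any particular system): a single transfer of 100 from account 1 to account 2, where the credits view has processed it but the debits view hasn't:

    # One transfer of 100 from account 1 to account 2.
    txns = [{"from": 1, "to": 2, "amount": 100}]

    credits_seen = txns[:1]  # the credits view has seen the transfer
    debits_seen = txns[:0]   # the debits view hasn't yet

    def balance(account):
        credit = sum(t["amount"] for t in credits_seen if t["to"] == account)
        debit = sum(t["amount"] for t in debits_seen if t["from"] == account)
        return credit - debit

    # Account 1 shows 0 (its debit is missing), account 2 shows 100, so the
    # total is 100 - money appeared from nowhere. A consistent system
    # computing from the empty subset gives balances (0, 0); computing from
    # the full subset gives (-100, 100). Either way the total is 0, so no
    # single subset produces the observed outputs.
    assert balance(1) + balance(2) == 100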
> the goal is not to permit arbitrary subsets of data but to restrict the considered subsets of data to a consistent selection.
Yes. The goal is that the final output is computed from a well-defined subset of the data, not from a mishmash of different subsets. I think we're roughly in agreement on the intended behavior and just not on the language used to describe it?
> systems that implement full/immediate consistency.
I don't know what full consistency would mean for a streaming system if not internal consistency + eventual consistency. If you have both, you can send an input and then just block until you see the output for the corresponding timestamp.
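Roughly like this, where send, output_frontier, and read_output are hypothetical stand-ins for whatever the system actually exposes:

    import time

    # Sketch only: given internal + eventual consistency, you get
    # read-your-writes by blocking until the output covers your input's
    # timestamp. All `system` methods here are hypothetical.
    def write_then_read(system, event, poll_interval=0.01):
        ts = system.send(event)               # timestamp assigned to our input
        while system.output_frontier() < ts:  # output hasn't caught up yet
            time.sleep(poll_interval)
        return system.read_output()           # now reflects our write

Internal consistency is what makes the returned output reflect exactly the inputs up to ts; eventual consistency is what guarantees the loop terminates.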
Part of the trouble is that all our language for describing consistency comes from read-write transactions in databases and doesn't translate directly into streaming systems. The closest analogue is probably snapshot isolation - https://jepsen.io/consistency/models/snapshot-isolation - but you can't just port the definitions over to a system that doesn't have a concept of transactions.