At Estuary, we’re creating a real-time data streaming platform that doesn’t rely on Kafka and uses JSON as the primary data format, stored in object storage. Many people are interested in how we achieve millisecond-level latency in our data streams, so we will be publishing a series of articles on this topic!
Imagine a system monitoring payment transactions. Each transaction stream (e.g., purchase events) could be joined with customer account data (e.g., past purchasing patterns or blacklist flags). Streaming joins make it possible to flag potentially fraudulent transactions by leveraging live context, as in the sketch below.
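Here is a minimal Python sketch of that idea, not Estuary's implementation: each incoming transaction is joined against materialized customer state before a fraud rule is applied. The `accounts` table, the `flag_fraud` rules, and all field names are illustrative assumptions.

```python
# Illustrative stream-table join for fraud flagging (not Estuary's code).
# `accounts` stands in for materialized customer state; `transactions`
# stands in for the live purchase stream.

accounts = {
    "cust-1": {"avg_purchase": 40.0, "blacklisted": False},
    "cust-2": {"avg_purchase": 25.0, "blacklisted": True},
}

transactions = [
    {"customer_id": "cust-1", "amount": 42.50},
    {"customer_id": "cust-2", "amount": 10.00},
    {"customer_id": "cust-1", "amount": 900.00},
]

def flag_fraud(txn, account):
    # Hypothetical rules: unknown or blacklisted customer, or an amount
    # far above that customer's historical norm.
    if account is None:
        return True
    return account["blacklisted"] or txn["amount"] > 10 * account["avg_purchase"]

for txn in transactions:
    account = accounts.get(txn["customer_id"])  # the join against live context
    if flag_fraud(txn, account):
        print("FLAGGED:", txn)
```

In a real deployment the `accounts` dict would be continuously updated state derived from another stream, but the join-then-evaluate shape is the same.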
Gazette is at the core of Estuary Flow (https://estuary.dev), a real-time data platform. Gazette's architecture is simpler to reason about and operate than Kafka's. It plays well with k8s and is backed by S3 (or any object storage).
With the two-pass strategy, we can write arbitrarily large row groups while using a fixed amount of memory: probably 100-200 MiB of overhead for the Parquet file processing, depending on how large the metadata is for the scratch file. Without the two-pass strategy, memory usage is proportional to the size of the row group.
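A conceptual sketch of why the memory stays bounded, under my assumptions about the mechanics (this is not the actual Parquet encoder): pass 1 spills each column's data to its own scratch segment as rows stream in, so only small per-column buffers are ever resident; pass 2 concatenates the spilled column chunks in column-major order, which is the layout a row group needs, copying in fixed-size blocks. All names and the toy "encoding" are illustrative.

```python
# Two-pass row-group write with O(1) memory (conceptual stand-in, not real
# Parquet encoding). Memory is bounded by FLUSH_THRESHOLD per column plus
# the copy block in pass 2, regardless of row group size.

import os
import tempfile

FLUSH_THRESHOLD = 64 * 1024  # flush per-column buffers at 64 KiB (illustrative)

def write_row_group(rows, columns, out_path):
    # Pass 1: stream rows, appending each column's values to its own
    # scratch file. Only the small in-memory buffers are held at once.
    scratch = {c: tempfile.TemporaryFile() for c in columns}
    buffers = {c: bytearray() for c in columns}
    for row in rows:
        for c in columns:
            buffers[c] += (repr(row[c]) + "\n").encode()  # toy "encoding"
            if len(buffers[c]) >= FLUSH_THRESHOLD:
                scratch[c].write(buffers[c])
                buffers[c].clear()
    for c in columns:
        scratch[c].write(buffers[c])

    # Pass 2: stitch the column chunks together in column-major order.
    # Copying in fixed-size blocks keeps this pass O(1) in memory too.
    with open(out_path, "wb") as out:
        for c in columns:
            scratch[c].seek(0)
            while block := scratch[c].read(1 << 20):
                out.write(block)
            scratch[c].close()

rows = ({"id": i, "name": f"row-{i}"} for i in range(100_000))
write_row_group(rows, ["id", "name"], "rowgroup.bin")
print(os.path.getsize("rowgroup.bin"), "bytes written")
```

The single-pass alternative would have to buffer every column of the whole row group in memory before it could emit the column-major layout, which is exactly the proportional cost described above.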
I wrote git-genie to automate commit message writing with GPT & pre-commit hooks; it works surprisingly well (most of the time) - https://github.com/danthelion/git-genie
Always interesting to hear how others do things. There are only a handful of usernames that I know, but when I do recognize them it matters because (for example) they are the author/creator of a tool. For example, just a couple of minutes ago, reading the Elixir post[1], I saw that the comments from josevalim are from the creator of the language and the author of TFA. Not knowing that would, I think, make the thread read very differently (IMHO in a bad way), but I can definitely see the appeal of anonymizing at other times.
Maybe the extension should also randomize the ordering of comments, since otherwise I can infer that the top comment is more likely to be from someone influential in the space.