This is interesting. Just wondering about your traffic volume and how long you have been running localpdf?
For us it's more like 5% of traffic from GEO, but we have been running the company for 2 years and have created a lot of handwritten content for devs.
Volume is modest (~180 visitors/month), but the 50/50 split is what's interesting.
Been in production since August 2025, so ~4 months.
The strategy was intentional from the start: there's no point competing with Adobe, Smallpdf, ILovePDF for Google rankings. They have 10+ years of backlinks, massive marketing budgets, and domain authority I'll never match as a solo dev.
So I made a bet on GEO from day one:
- Semantic HTML that LLMs can parse
- Clear technical docs (GitHub README as primary content)
- Honest about limitations
- Privacy-first architecture (client-side processing)
Your 5% GEO makes sense for a 2-year-old company optimizing for traditional SEO. The difference: I skipped the SEO game entirely. When you're competing in an established niche, GEO-first might be the only viable strategy for bootstrapped products.
Curious: what type of dev content are you creating? And have you tested how LLMs cite it vs your traditional marketing content?
Totally fair point. For stable, known workloads, you can get really far with something lightweight on a single machine. The challenge comes when you need fault tolerance, scaling, and delivery guarantees without constantly jumping in to fix things. We often hear from data teams about load peaks they can't easily predict. But yes, a lot of existing tools make you pay a steep efficiency cost for that. At GlassFlow we are trying to hit that sweet spot...efficient but still resilient.
I think your benchmark may miss the mark a bit if this is your angle.
20m records and 9k/sec isn’t very impressive. I would imagine most prospective customers have larger workloads, as you could throw this behind Postgres and call it a day.
FWIW I was interested but your metrics made me second guess and wonder what was wrong.
Fair point. Thanks for calling it out! To clarify, we’re focused on a specific use case: Kafka to ClickHouse pipelines with exactly-once guarantees. Kafka can’t provide exactly-once out of the box when writing to external systems like ClickHouse. You could use something like Flink, but there’s no native Flink-to-ClickHouse connector, and Flink demands real ops effort from teams.
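To make that gap concrete, here is a minimal Python sketch (purely illustrative, not our implementation; the topic, table, and client libraries are just assumptions) of a naive Kafka-to-ClickHouse sink. It is at-least-once by construction: a crash between the insert and the offset commit replays the message on restart.

    # Illustrative only: a naive Kafka -> ClickHouse sink is at-least-once.
    import json
    from kafka import KafkaConsumer      # pip install kafka-python
    import clickhouse_connect            # pip install clickhouse-connect

    consumer = KafkaConsumer(
        "events",                        # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="ch-sink",
        enable_auto_commit=False,        # commit manually, after the write
    )
    ch = clickhouse_connect.get_client(host="localhost")

    for msg in consumer:
        event = json.loads(msg.value)
        ch.insert("events_raw",          # hypothetical table
                  [[event["event_id"], event["payload"]]],
                  column_names=["event_id", "payload"])
        # A crash right here means the offset was never committed:
        # on restart the same message is re-delivered and re-inserted.
        consumer.commit()

Deduplicating somewhere (in the sink, inside ClickHouse, or in a stage in front of it) is what turns this into effectively exactly-once.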
Our goal was to show users a very easy-to-reproduce load test to validate the results. As a next step, we’re actively working on a Kubernetes-ready version that will scale horizontally and plan to share those higher-throughput results with the HN community soon.
Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.
Here are two more detailed examples:
Real-Time fraud detection in logistics:
Let's say you are streaming events from multiple sources (payments, GPS devices, user actions) into a dashboard that should trigger alerts when anomalies happen. Now you have duplicates (retries, partial system failures, etc.). Relying on RMT means incorrect counts until merges happen, which can lead to missed fraud, late interventions, etc.
Event collection from multiple systems like CRM + e-commerce + tracking:
Similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse will store all of them, and since it doesn't enforce primary keys, you end up with misleading results until RMT resolves them.
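To make the "slightly different formats" point concrete, here is a toy Python sketch (field names and shapes are invented) showing how the same purchase seen by two systems collapses to one logical event once you derive a canonical key, which is exactly where a raw insert would double-count:

    # Toy example: the same purchase emitted by two systems in different shapes.
    shopify_event = {"order_id": "1001", "email": "A.User@example.com", "total": "49.90"}
    crm_event     = {"orderId": 1001, "customer_email": "a.user@example.com", "amount": 49.9}

    def canonical_key(e):
        # Normalize the fields that identify the event, whatever the source calls them.
        order = str(e.get("order_id") or e.get("orderId"))
        email = (e.get("email") or e.get("customer_email")).lower()
        return (order, email)

    # Both records map to the same key, so inserting both rows double-counts the order.
    assert canonical_key(shopify_event) == canonical_key(crm_event)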
Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees delivery at least once, so it doesn't remove duplicates.
ClickHouse doesn't enforce primary keys. It stores whatever you send. ReplacingMergeTree and FINAL are ClickHouse features, but they are not ideal for real-time streams because query results are only correct once the background merge has finished.
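If you want to see this locally, here is a small sketch (table and column names are made up) against a local ClickHouse using the clickhouse-connect Python client:

    # Illustration: ReplacingMergeTree keeps duplicate rows until a merge runs.
    import clickhouse_connect
    ch = clickhouse_connect.get_client(host="localhost")

    ch.command("""
        CREATE TABLE IF NOT EXISTS events_rmt (
            event_id String,
            amount   Float64
        ) ENGINE = ReplacingMergeTree
        ORDER BY event_id
    """)

    # The same event arrives twice (e.g. a producer retry).
    row = [["evt-1", 9.99]]
    ch.insert("events_rmt", row, column_names=["event_id", "amount"])
    ch.insert("events_rmt", row, column_names=["event_id", "amount"])

    print(ch.query("SELECT count() FROM events_rmt").result_rows)
    # Typically [(2,)] until a background merge collapses the parts.

    print(ch.query("SELECT count() FROM events_rmt FINAL").result_rows)
    # [(1,)] -- correct, but FINAL merges at read time, which costs query performance.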
With GlassFlow, you clean the data streams before they hit ClickHouse, ensuring correct query results and less load for ClickHouse.
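For intuition, here is a toy version of "clean before it hits ClickHouse" (this is not how GlassFlow is implemented; a real pipeline needs persistent, time-windowed dedup state, and the source/sink helpers here are hypothetical):

    # Toy upstream dedup: drop events whose ID was already seen within a time window.
    import time
    from collections import OrderedDict

    class DedupWindow:
        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self.seen = OrderedDict()    # event_id -> first-seen timestamp

        def is_duplicate(self, event_id):
            now = time.time()
            # Evict IDs older than the window so memory stays bounded.
            while self.seen and next(iter(self.seen.values())) < now - self.ttl:
                self.seen.popitem(last=False)
            if event_id in self.seen:
                return True
            self.seen[event_id] = now
            return False

    dedup = DedupWindow(ttl_seconds=3600)
    for event in stream_of_kafka_events():    # hypothetical source
        if not dedup.is_duplicate(event["event_id"]):
            write_to_clickhouse(event)        # hypothetical sink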
In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.
Great to hear that you are considering it for zenskar. We don't have a publicly available load test, but in internal checks it handled about 15k requests per second (locally on a MacBook Pro M2, in Docker). What load are you expecting? Happy to connect.
Good question! RMT does deduplication, but it depends on background merges you can't control, which can lead to incorrect query results until the merge completes. We wanted something that cleans the duplicates in real time, so GlassFlow moves deduplication upstream, before data hits ClickHouse. From a pipeline perspective we think this is easier to reason about: it's simply a stage that sits in front of ClickHouse.
RMT does not depend on background merges completing to give correct results, as long as you use FINAL to force a merge on read. The tradeoff is that performance suffers.
I'm a fan of what you are trying to do but there are some hard tradeoffs in dedup solutions. It would be helpful if your site defined exactly what you mean by deduplication and what tradeoffs you have made to solve it. This includes addressing failures in clustered Kafka / ClickHouse, which is where it becomes very hard to ensure consistency.