This is interesting. Just wondering about your traffic volume and how long you have been running localpdf?
For us it's more like 5% of traffic from GEO, but we have been running the company for 2 years and have created a lot of handwritten content for devs.
Volume is modest (~180 visitors/month), but the 50/50 split is what's interesting.
Been in production since August 2025, so ~4 months.
The strategy was intentional from the start: there's no point competing with Adobe, Smallpdf, ILovePDF for Google rankings. They have 10+ years of backlinks, massive marketing budgets, and domain authority I'll never match as a solo dev.
So I made a bet on GEO from day one:
- Semantic HTML that LLMs can parse
- Clear technical docs (GitHub README as primary content)
- Honest about limitations
- Privacy-first architecture (client-side processing)
Your 5% GEO makes sense for a 2-year-old company optimizing for traditional SEO. The difference: I skipped the SEO game entirely. When you're competing in an established niche, GEO-first might be the only viable strategy for bootstrapped products.
Curious: what type of dev content are you creating? And have you tested how LLMs cite it vs your traditional marketing content?
Totally fair point. For stable, known workloads, you can get really far with something lightweight on a single machine. The challenge comes when you need fault tolerance, scaling, and delivery guarantees without constantly jumping in to fix things. We often hear from data teams about load peaks they can't easily predict. But yes, a lot of existing tools make you pay a steep efficiency cost for that. At GlassFlow we are trying to hit that sweet spot...efficient but still resilient.
I think your benchmark may miss the mark a bit if this is your angle.
20m records and 9k/sec isn’t very impressive. I would imagine most prospective customers have larger workloads, as you could throw this behind Postgres and call it a day.
FWIW I was interested but your metrics made me second guess and wonder what was wrong.
Fair point. Thanks for calling it out! To clarify, we’re focused on a specific use case: Kafka to ClickHouse pipelines with exactly-once guarantees. Kafka can’t provide exactly-once out of the box when writing to external systems like ClickHouse. You could use something like Flink, but there’s no native Flink-to-ClickHouse connector, and Flink demands real ops effort from teams.
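To make that gap concrete, here is a minimal Python sketch (purely illustrative, not our implementation; the topic, table, and client libraries are just assumptions) of a naive Kafka-to-ClickHouse sink. It is at-least-once by construction: a crash between the insert and the offset commit replays the message on restart.

    # Illustrative only: a naive Kafka -> ClickHouse sink is at-least-once.
    import json
    from kafka import KafkaConsumer      # pip install kafka-python
    import clickhouse_connect            # pip install clickhouse-connect

    consumer = KafkaConsumer(
        "events",                        # hypothetical topic
        bootstrap_servers="localhost:9092",
        group_id="ch-sink",
        enable_auto_commit=False,        # commit manually, after the write
    )
    ch = clickhouse_connect.get_client(host="localhost")

    for msg in consumer:
        event = json.loads(msg.value)
        ch.insert("events_raw",          # hypothetical table
                  [[event["event_id"], event["payload"]]],
                  column_names=["event_id", "payload"])
        # A crash right here means the offset was never committed:
        # on restart the same message is re-delivered and re-inserted.
        consumer.commit()

Deduplicating somewhere (in the sink, inside ClickHouse, or in a stage in front of it) is what turns this into effectively exactly-once.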
Our goal was to show users a very easy-to-reproduce load test to validate the results. As a next step, we’re actively working on a Kubernetes-ready version that will scale horizontally and plan to share those higher-throughput results with the HN community soon.
Great question! RMT can work well when eventual consistency is acceptable and real-time accuracy isn't critical. But in use cases where results need to be correct immediately (dashboards, alerts, monitoring, etc.), waiting on background merges doesn't work.
Here are two more detailed examples:
Real-Time fraud detection in logistics:
Let's say you are streaming events from multiple sources (payments, GPS devices, user actions) into a dashboard that should trigger alerts when anomalies happen. Now you have duplicates (retries, partial system failures, etc.). Relying on RMT means incorrect counts until merges happen, which can lead to missed fraud, late interventions, etc.
Event collection from multiple systems like CRM + e-commerce + tracking:
Similar user or transaction data can come from multiple systems (e.g., CRM, Shopify, internal event logs). The same action might appear in slightly different formats across streams, causing duplicates in Kafka. ClickHouse will store all of them, and since it doesn't enforce primary keys, you end up with misleading results until RMT resolves them.
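To make the "slightly different formats" point concrete, here is a toy Python sketch (field names and shapes are invented) showing how the same purchase seen by two systems collapses to one logical event once you derive a canonical key, which is exactly where a raw insert would double-count:

    # Toy example: the same purchase emitted by two systems in different shapes.
    shopify_event = {"order_id": "1001", "email": "A.User@example.com", "total": "49.90"}
    crm_event     = {"orderId": 1001, "customer_email": "a.user@example.com", "amount": 49.9}

    def canonical_key(e):
        # Normalize the fields that identify the event, whatever the source calls them.
        order = str(e.get("order_id") or e.get("orderId"))
        email = (e.get("email") or e.get("customer_email")).lower()
        return (order, email)

    # Both records map to the same key, so inserting both rows double-counts the order.
    assert canonical_key(shopify_event) == canonical_key(crm_event)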
Thanks for asking those questions. Duplicates often come from how systems interact with Kafka, not from Kafka itself. For example, if a service retries sending a message after a timeout or if you collect similar data from multiple sources (like CRMs and web apps), you can end up with the same event multiple times. Kafka guarantees delivery at least once, so it doesn't remove duplicates.
ClickHouse doesn't enforce primary keys. It stores whatever you send. ReplacingMergeTree and FINAL are ClickHouse features, but they are not ideal for real-time streams because query results are only correct once the background merge has finished.
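If you want to see this locally, here is a small sketch (table and column names are made up) against a local ClickHouse using the clickhouse-connect Python client:

    # Illustration: ReplacingMergeTree keeps duplicate rows until a merge runs.
    import clickhouse_connect
    ch = clickhouse_connect.get_client(host="localhost")

    ch.command("""
        CREATE TABLE IF NOT EXISTS events_rmt (
            event_id String,
            amount   Float64
        ) ENGINE = ReplacingMergeTree
        ORDER BY event_id
    """)

    # The same event arrives twice (e.g. a producer retry).
    row = [["evt-1", 9.99]]
    ch.insert("events_rmt", row, column_names=["event_id", "amount"])
    ch.insert("events_rmt", row, column_names=["event_id", "amount"])

    print(ch.query("SELECT count() FROM events_rmt").result_rows)
    # Typically [(2,)] until a background merge collapses the parts.

    print(ch.query("SELECT count() FROM events_rmt FINAL").result_rows)
    # [(1,)] -- correct, but FINAL merges at read time, which costs query performance.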
With GlassFlow, you clean the data streams before they hit ClickHouse, ensuring correct query results and less load for ClickHouse.
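For intuition, here is a toy version of "clean before it hits ClickHouse" (this is not how GlassFlow is implemented; a real pipeline needs persistent, time-windowed dedup state, and the source/sink helpers here are hypothetical):

    # Toy upstream dedup: drop events whose ID was already seen within a time window.
    import time
    from collections import OrderedDict

    class DedupWindow:
        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self.seen = OrderedDict()    # event_id -> first-seen timestamp

        def is_duplicate(self, event_id):
            now = time.time()
            # Evict IDs older than the window so memory stays bounded.
            while self.seen and next(iter(self.seen.values())) < now - self.ttl:
                self.seen.popitem(last=False)
            if event_id in self.seen:
                return True
            self.seen[event_id] = now
            return False

    dedup = DedupWindow(ttl_seconds=3600)
    for event in stream_of_kafka_events():    # hypothetical source
        if not dedup.is_duplicate(event["event_id"]):
            write_to_clickhouse(event)        # hypothetical sink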
In your IoT case, a scenario I can imagine is batch replays (you might resend data already ingested). But if you're sure the data is clean and only sent once, you may not need this.
Great to hear that you are considering it for zenskar. We don't have a publicly available load test, but in internal checks it handled about 15k requests per second (locally on a MacBook Pro M2, in Docker). What load are you expecting? Happy to connect.
Good question! RMT does deduplication, but it depends on background merges you can't control, which can lead to incorrect query results until the merge completes. We wanted something that cleans the duplicates in real time, so GlassFlow moves deduplication upstream, before data hits ClickHouse. From a pipeline perspective we think this is easier to reason about: it's simply a stage that sits in front of ClickHouse.
RMT does not depend on background merges completing to give correct results, as long as you use FINAL to force a merge on read. The tradeoff is that performance suffers.
I'm a fan of what you are trying to do but there are some hard tradeoffs in dedup solutions. It would be helpful if your site defined exactly what you mean by deduplication and what tradeoffs you have made to solve it. This includes addressing failures in clustered Kafka / ClickHouse, which is where it becomes very hard to ensure consistency.