
I think this is not a battle that can be won in this way.

Scraping content for an LLM is not a hyper-time-sensitive task. You don't need to scrape every page every day. Sam Altman does not need a synchronous replica of the internet to achieve his goals.



That is one view of the problem, but the one people are fixing with proof-of-work systems is the (unintentional) DDoS that LLM scrapers are running against these sites. Just reducing the traffic to manageable levels lets me get back to whatever my site is supposed to be doing. I personally don't care if Sam Altman has a copy of my git server's rendition of the blame of every commit in my open-source repo, because he could have just cloned the repo and gotten the same result.
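
To make the mechanism concrete, here is a minimal sketch of the proof-of-work idea in Python, assuming a SHA-256 leading-zero-bits puzzle (the difficulty constant and function names are illustrative, not Anubis's actual API): the server hands out a challenge that is cheap to verify, and the client has to burn CPU before it gets a page.

    import hashlib, itertools, os

    DIFFICULTY = 16  # leading zero bits; illustrative, not a real default

    def issue_challenge() -> str:
        # Server side: a fresh random challenge per visitor.
        return os.urandom(16).hex()

    def solve(challenge: str) -> int:
        # Client side: brute-force a nonce; expected cost is about
        # 2**DIFFICULTY hashes, which is what makes mass scraping expensive.
        for nonce in itertools.count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
                return nonce

    def verify(challenge: str, nonce: int) -> bool:
        # Server side: a single hash, so legitimate traffic stays cheap.
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

One solution per visitor is noise for a human, but multiplied across millions of scraped URLs it adds up, and that asymmetry is the whole point.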


I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.


> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

It is very real and the reason why Anubis was created in the first place. It is not plain hostility towards LLMs; it is *first and foremost* DDoS protection against their scrapers.

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

https://social.kernel.org/notice/AsgziNL6zgmdbta3lY

https://xeiaso.net/notes/2025/amazon-crawler/


I've set up a few honeypot servers. Right now OpenAI alone accounts for 4 hours of compute for one of the honeypots in a span of 24 hours. It's not hypothetical.


25k+ hits/minute here. And that's only counting the scrapers that don't disguise themselves as browsers.

Not sure why you believe massive repeated scraping isn't a problem. It's not like there's just one actor out there, and ignoring robots.txt seems to be the norm nowadays.
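
For reference, opting out is supposed to be as simple as the snippet below (GPTBot, ClaudeBot, and CCBot are the documented user agents for OpenAI, Anthropic, and Common Crawl respectively), but it only works if the crawler chooses to honor it:

    # robots.txt -- advisory only; nothing enforces it
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /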


> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

Yes, there are sites being DDoSed by scrapers for LLMs.

> If you can't handle the traffic of one request per page per week or month, I think there are bigger problems to solve.

This isn't about one request per week or per month. Many sites have reported being hit by scrapers that spread their requests across a large pool of IP addresses, one request per address.
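
That access pattern is exactly what breaks conventional per-IP rate limiting. As a hedged sketch (the names and constants are made up for illustration), a textbook token bucket keyed on source address never triggers when every address only ever sends a single request:

    import time
    from collections import defaultdict

    RATE = 1.0   # tokens refilled per second (illustrative)
    BURST = 5.0  # bucket capacity (illustrative)

    buckets = defaultdict(lambda: [BURST, time.monotonic()])

    def allow(ip: str) -> bool:
        # Classic token bucket, keyed on the client IP.
        tokens, last = buckets[ip]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        if tokens < 1.0:
            buckets[ip] = [tokens, now]
            return False
        buckets[ip] = [tokens - 1.0, now]
        return True

    # A scraper fanned out across thousands of addresses sends one request
    # per IP, so every request sees a full bucket and nothing gets blocked.

That's why people reach for proof-of-work or other per-request costs instead of per-source throttling.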


> I'm a bit confused. Is anyone's website currently being DDOS'd by scrapers for LLMs or is this a hypothetical?

There are already tens of thousands of scrapers out there right now trying to get even more training data.

It will only get worse. We all want more training data. I want more training data. You want more training data.

We all want the most up to date data there is. So, yeah, it will only get worse as time goes on.


For the model it's not. But I think many of these bots are also driven by tool use or "research" or whatever they call it these days. And for that, it does matter.



