Uncharitable. Robots.txt is *already* the understood mechanism for getting *robo...

simonw · 2025-07-17T22:52:57 1752792777

People often use specific user agents in there, which is hard if you don't know what the user agents are in advance!

lxgr · 2025-07-18T02:05:03 1752804303

That seems like a potentially very useful addition to the robots.txt "standard": Crawler categories.

Wanting to disallow LLM training (or optionally only that of closed-weight models), but encouraging search indexing or even LLM retrieval in response to user queries, seems popular enough.

wat10000 · 2025-07-17T22:56:43 1752793003

If you're using a specific user agent, then you're saying "I want this specific user agent to follow this rule, and not any others." Don't be surprised when a new bot does what you say! If you don't want any bots reading something, use a wildcard.
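
To make the difference concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The bot names `KnownBot` and `BrandNewBot` and the `allowed` helper are made up for illustration, and this only shows how one parser interprets the rules, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Rule aimed at one named crawler only (hypothetical name).
NAMED_ONLY = """\
User-agent: KnownBot
Disallow: /
"""

# Rule aimed at every crawler.
WILDCARD = """\
User-agent: *
Disallow: /
"""

URL = "https://example.com/page"

# The named rule blocks KnownBot but says nothing about a bot nobody knew about.
named_blocks_known = not allowed(NAMED_ONLY, "KnownBot", URL)
named_blocks_new = not allowed(NAMED_ONLY, "BrandNewBot", URL)

# The wildcard rule covers both, including bots announced later.
wildcard_blocks_known = not allowed(WILDCARD, "KnownBot", URL)
wildcard_blocks_new = not allowed(WILDCARD, "BrandNewBot", URL)

print(named_blocks_known, named_blocks_new)
print(wildcard_blocks_known, wildcard_blocks_new)
```

With the named rule, a crawler that shows up later under a different user agent is simply not covered; only the wildcard rule applies to it.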

lxgr · 2025-07-18T02:06:46 1752804406

Yes, but given the lack of generic "robot types" (e.g. "allow algorithmic search crawlers, allow archival, deny LLM training crawlers"), neither opt-in nor opt-out seems like a particularly great option in an age where new crawlers are appearing rapidly (and often, such as here, are announced only after the fact).

simonw · 2025-07-17T23:46:28 1752795988

Sure, but I still think it's OK to look at Apple with a raised eyebrow when they say "and our previously secret training data crawler obeys robots.txt so you can always opt out!"

wat10000 · 2025-07-18T13:54:39 1752846879

I've been online since before the web existed, and this is the first time I've ever seen this idea of some implicit obligation to give people advance notice before you deploy a crawler. Looks to me like people are making up new rules on the fly because they don't like Apple and/or LLMs.

simonw · 2025-07-18T16:12:47 1752855167

I stand by what I said.

Apple are saying you can opt out of their training data collection using robots.txt.

But... they collected their training data before they told people how to opt out.

I don't understand why me pointing that out as "eyebrow raising" is controversial here.

hn_go_brrrrr · 2025-07-18T16:44:08 1752857048

It's not controversial, it's just not how the ecosystem works. There has never been an expectation that anyone announce a crawler before deploying it.

It might be nice if there were categories that well-behaved bots could follow, as noted above, but even then the problem exists for bots doing new things that don't fall into existing categories.

simonw · 2025-07-18T17:08:38 1752858518

My complaint here isn't what they did. It's that they explain it as "here's how to opt out" when the information was too late to allow people to opt out.

I think that's disingenuous of them.

wat10000 · 2025-07-18T18:33:36 1752863616

It's been common knowledge for anyone running a web server since 1994.

simonw · 2025-07-18T19:12:23 1752865943

I don't think you are reading my posts in full.

pjmlp · 2025-07-18T13:27:03 1752845223

Assuming well-behaved robots.