Hi all, I made this site. I wasn't expecting this post, but happy to answer questions and take feedback.
Hopefully it's pretty self-explanatory but I made this website as a simple resource for people who want to stay up to date with the ever-changing cast of AI user agents.
Feel free to sign up with the Google Form to get notified when this list is updated. And if you know of any agents I'm missing, please submit them. Thanks!
> cohere-ai is an unconfirmed agent possibly dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content on the internet.
This is almost certainly the Coral web grounding feature available as an option on coral.cohere.ai.
Great resource! I'm wondering why you put links to your site in the robots.txt example? It's like there's a clear utility you've made but then the links cheapen it a bit. Maybe I'm missing the reason?
The term "AI Agent" is currently being used a lot to describe the setup where you have an LLM doing multiple generation rounds using tools etc so it can interact with an environment or other LLMs or whatever. Feel free to pass your own judgement on whether that's going to go anywhere, but that's what I think of when I hear the term, rather than "web crawler for company making LLM".
I'm not sure I'd classify most of these as AI agents, it's mostly web crawlers. (Though the definition of agent is fuzzy enough that I suppose you could lump them in).
I'm also curious, will adding "Common Crawl" to the user-agent disallow list in your robots.txt actually do anything?
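Probably not as written: Common Crawl's crawler announces itself with the user-agent token "CCBot", so a Disallow rule needs to target that token rather than the words "Common Crawl". A minimal robots.txt entry would look like:

```
User-agent: CCBot
Disallow: /
```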
Added context: I run a tracker of vetted AI agents with verticalized use cases https://staf.ai
Maybe if you're worried that your hypothetical My Little Pony erotic fanfic site might accidentally poison one of these AIs. Maybe. Hypothetically speaking. Of course, neither you nor I have such a site, and if we did, the writing would be of high enough quality, and the characters depicted faithfully enough to be good for AI training!
In today's world of PR speak, where reading a company's main page leaves me with no concept of what they do, maybe an AI could explain it better. Or they've been planning for this all along and knew they could confuse the future AI overlords by filling their websites with word vomit.
There's a difference between a human reading something on the internet and an LLM absorbing something into its model on a mass scale. Mainly that LLMs are packaged up into nice commercial products that, in essence, sell other people's ideas for profit (or in aid of hugely inflated stock options) with no recompense to the original author of the idea.
I just stumbled on it while searching for AI crawler user agents.
I'm not sure how often it's updated. It seems fairly complete from other sources I've been looking at (mainly robots.txt files for well-known publishers).
If you want to avoid having your site contents used to train AI and don't want to make your website unavailable to the open web, then blocking Common Crawl (among others) is absolutely mandatory.
We have very few tools to protect ourselves here, and need to make the most of the ones we have.
This is kind of a question of etiquette and legality more than it is a question of what is technically possible. People can and do ignore robots.txt. That doesn't mean copyright infringement is now legal.
People steal cars and break into houses; both have their versions of locks and alarms, which are about as useless against a skilled "attacker" as a right-click-disable fix on a website.
Some people want to make their stuff freely available to people, but don't want it to be used to train AI. That's the group my comment was talking about. No monopolistic megalomaniacs here.
Personally, I think that such efforts are pointless -- all it takes is one crawler to make it through your defenses and you may as well have not done a thing. The alternative, though, is to remove your sites from the open web entirely. Either way, Common Crawl needs to be excluded if you want to avoid your stuff being used to train AI.
I think most traffic on the internet will eventually come from "ChatGPT-User" or similar. Who wants to sift through things when ChatGPT can do this for you? What is needed is a way to monetize this information it seems.
I was thinking the same. Wouldn’t it be better to let agents buy access to your site? It wasn’t feasible with the Google monopoly. Now with dozens of agents competing for info it seems feasible to put a tollbooth in front of your site.
It would be great to have a raw JSON file or something of this list so we can just start blocking the requests instead of relying on robots.txt. Do they really respect it? Probably not.
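In the meantime, an application-layer blocklist check is easy to sketch. This is a minimal, hypothetical Python example; the JSON schema (a plain array of tokens) and the sample tokens are assumptions, not the site's actual export format:

```python
import json

# Hypothetical blocklist of AI crawler user-agent tokens.
# A JSON export from the site could be loaded in its place.
BLOCKLIST = ["GPTBot", "CCBot", "ChatGPT-User", "anthropic-ai"]

def load_blocklist(path: str) -> list[str]:
    # Assumes a plain JSON array of tokens, e.g. ["GPTBot", "CCBot"]
    with open(path) as f:
        return json.load(f)

def is_ai_agent(user_agent: str, blocklist=BLOCKLIST) -> bool:
    """Case-insensitive substring match of the User-Agent header
    against known AI agent tokens."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocklist)
```

Substring matching is deliberate: real User-Agent headers embed the token inside a longer string (e.g. "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0)").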
Huh, didn't know about "ChatGPT-User"... I've always been a fan of a slightly blacker-hat approach to this problem: rather than block those user agents outright, wouldn't it be way more fun to serve them nonsense or irrelevant garbage-filled versions of your pages?
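For anyone tempted by that route, here's a toy sketch of the idea: return the real page to ordinary visitors and a word-scrambled version to known AI crawlers. The token list is an illustrative subset, and this is a hypothetical sketch, not a recommendation:

```python
import random

# Illustrative subset of AI crawler tokens (lowercased for matching).
AI_TOKENS = ["gptbot", "ccbot", "chatgpt-user"]

def page_for(user_agent: str, real_page: str) -> str:
    """Serve the real page to humans, a scrambled version to AI crawlers."""
    if any(token in user_agent.lower() for token in AI_TOKENS):
        words = real_page.split()
        random.shuffle(words)  # same vocabulary, destroyed meaning
        return " ".join(words)
    return real_page
```

Whether shuffled text actually "poisons" anything is debatable, but it illustrates the mechanism: branch on the User-Agent header instead of returning a 403.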
As a user of a non-Google search engine (which has come a long way), I'm pretty disappointed by how many website owners explicitly grant monopoly powers to Google.
Modern browsers already have LLMs built in so anti-scraping is a folly.