Hi all, I made this site. I wasn't expecting this post, but happy to answer questions and take feedback.
Hopefully it's pretty self-explanatory but I made this website as a simple resource for people who want to stay up to date with the ever-changing cast of AI user agents.
Feel free to sign up with the Google Form to get notified when this list is updated. And if you know of any agents I'm missing, please submit them. Thanks!
> cohere-ai is an unconfirmed agent possibly dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content on the internet.
This is almost certainly the Coral web grounding feature available as an option on coral.cohere.ai.
Great resource! I'm wondering why you put links to your site in the robots.txt example? It's like there's a clear utility you've made but then the links cheapen it a bit. Maybe I'm missing the reason?
The term "AI Agent" is currently being used a lot to describe the setup where you have an LLM doing multiple generation rounds using tools etc so it can interact with an environment or other LLMs or whatever. Feel free to pass your own judgement on whether that's going to go anywhere, but that's what I think of when I hear the term, rather than "web crawler for company making LLM".
I'm not sure I'd classify most of these as AI agents, it's mostly web crawlers. (Though the definition of agent is fuzzy enough that I suppose you could lump them in).
I'm also curious, will adding "Common Crawl" to the user-agent disallow list in your robots.txt actually do anything?
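Probably not as written: Common Crawl's crawler announces itself with the user-agent token "CCBot", so a Disallow rule needs to target that token rather than the words "Common Crawl". A minimal robots.txt entry would look like:

```
User-agent: CCBot
Disallow: /
```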
Added context: I run a tracker of vetted AI agents with verticalized use cases https://staf.ai
Maybe if you're worried that your hypothetical My Little Pony erotic fanfic site might accidentally poison one of these AIs. Maybe. Hypothetically speaking. Of course, neither you nor I have such a site, and if we did, the writing would be of high enough quality, and the characters depicted faithfully enough to be good for AI training!
In today's world of PR speak, where reading a company's main page leaves me with no concept of what they do, maybe an AI could explain it better. Or they've been planning for this all along and knew they could confuse the future AI overlords by filling their websites with word vomit.
There's a difference between a human reading something on the internet and an LLM absorbing something into its model on a mass scale. Mainly that LLMs are packaged up into nice commercial products that, in essence, sell other people's ideas for profit (or in aid of hugely inflated stock options) with no recompense to the original author of the idea.
I just stumbled on it while searching for AI crawler user agents.
I'm not sure how often it's updated. It seems fairly complete from other sources I've been looking at (mainly robots.txt files for well-known publishers).
If you want to avoid having your site contents used to train AI and don't want to make your website unavailable to the open web, then blocking Common Crawl (among others) is absolutely mandatory.
We have very few tools to protect ourselves here, and need to make the most of the ones we have.
This is kind of a question of etiquette and legality more than it is a question of what is technically possible. People can and do ignore robots.txt. That doesn't mean copyright infringement is now legal.
People steal cars and break into houses; both have their versions of locks and alarms, which are about as useless against a skilled "attacker" as a right-click-disable fix on a website.
Some people want to make their stuff freely available to people, but don't want it to be used to train AI. That's the group my comment was talking about. No monopolistic megalomaniacs here.
Personally, I think that such efforts are pointless -- all it takes is one crawler to make it through your defenses and you may as well have not done a thing. The alternative, though, is to remove your sites from the open web entirely. Either way, Common Crawl needs to be excluded if you want to avoid your stuff being used to train AI.
I think most traffic on the internet will eventually come from "ChatGPT-User" or similar. Who wants to sift through things when ChatGPT can do this for you? What is needed is a way to monetize this information it seems.
I was thinking the same. Wouldn’t it be better to let agents buy access to your site? It wasn’t feasible with the Google monopoly. Now with dozens of agents competing for info it seems feasible to put a tollbooth in front of your site.
It would be great to have a raw JSON file or something of this list so we can just start blocking the requests instead of relying on robots.txt. Do they really respect it? Probably not.
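In the meantime, an application-layer blocklist check is easy to sketch. This is a minimal, hypothetical Python example; the JSON schema (a plain array of tokens) and the sample tokens are assumptions, not the site's actual export format:

```python
import json

# Hypothetical blocklist of AI crawler user-agent tokens.
# A JSON export from the site could be loaded in its place.
BLOCKLIST = ["GPTBot", "CCBot", "ChatGPT-User", "anthropic-ai"]

def load_blocklist(path: str) -> list[str]:
    # Assumes a plain JSON array of tokens, e.g. ["GPTBot", "CCBot"]
    with open(path) as f:
        return json.load(f)

def is_ai_agent(user_agent: str, blocklist=BLOCKLIST) -> bool:
    """Case-insensitive substring match of the User-Agent header
    against known AI agent tokens."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocklist)
```

Substring matching is deliberate: real User-Agent headers embed the token inside a longer string (e.g. "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0)").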
Huh, didn't know about "ChatGPT-User"... I've always been a fan of a slightly blacker-hat approach to this problem: rather than block those user agents outright, wouldn't it be way more fun to serve them nonsense or irrelevant garbage-filled versions of your pages?
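For anyone tempted by that route, here's a toy sketch of the idea: return the real page to ordinary visitors and a word-scrambled version to known AI crawlers. The token list is an illustrative subset, and this is a hypothetical sketch, not a recommendation:

```python
import random

# Illustrative subset of AI crawler tokens (lowercased for matching).
AI_TOKENS = ["gptbot", "ccbot", "chatgpt-user"]

def page_for(user_agent: str, real_page: str) -> str:
    """Serve the real page to humans, a scrambled version to AI crawlers."""
    if any(token in user_agent.lower() for token in AI_TOKENS):
        words = real_page.split()
        random.shuffle(words)  # same vocabulary, destroyed meaning
        return " ".join(words)
    return real_page
```

Whether shuffled text actually "poisons" anything is debatable, but it illustrates the mechanism: branch on the User-Agent header instead of returning a 403.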
As a user of a non-Google search engine (which has come a long way), I'm pretty disappointed by how many website owners explicitly grant monopoly powers to Google.
Modern browsers already have LLMs built in so anti-scraping is a folly.