Hacker News
Dark Visitors – A list of known AI agents on the internet (darkvisitors.com)
134 points by johneth on Dec 28, 2023 | 66 comments


Hi all, I made this site. I wasn't expecting this post, but happy to answer questions and take feedback.

Hopefully it's pretty self-explanatory but I made this website as a simple resource for people who want to stay up to date with the ever-changing cast of AI user agents.

Feel free to sign up with the Google Form to get notified when this list is updated. And if you know of any agents I'm missing, please submit them. Thanks!


> cohere-ai is an unconfirmed agent possibly dispatched by Cohere's AI chat products in response to user prompts when it needs to retrieve content on the internet.

This is almost certainly the coral web grounding found as an option on coral.cohere.ai


Thank you, will try to confirm this.


Is there an "export all to robots.txt disallow" feature? If not, that would be great to have.


Yup, here is an example with all listed agents, although you probably want to modify it to not fully block all of them: https://darkvisitors.com/robots-txt-builder
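
For anyone who would rather generate the file than copy it, here is a minimal Python sketch of what such a builder might do. It is not the site's actual implementation: the function name `build_robots_txt` is hypothetical, and the agent list is only the names mentioned in this thread plus `CCBot` (the token Common Crawl's crawler uses), so check the site for the current set.

```python
def build_robots_txt(agents, disallow="/"):
    """Return robots.txt text with one User-agent group per agent."""
    groups = []
    for agent in agents:
        # Each group names one agent and the path prefix it should not fetch.
        groups.append(f"User-agent: {agent}\nDisallow: {disallow}\n")
    # Groups are separated by a blank line.
    return "\n".join(groups)


# Illustrative list only (an assumption, not the full Dark Visitors list).
AGENTS = ["GPTBot", "ChatGPT-User", "CCBot", "cohere-ai"]

if __name__ == "__main__":
    print(build_robots_txt(AGENTS))
```

Passing a narrower `disallow` prefix, or a shorter agent list, is the easy way to avoid fully blocking everything.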


Great resource! I'm wondering why you put links to your site in the robots.txt example? It's like there's a clear utility you've made but then the links cheapen it a bit. Maybe I'm missing the reason?


You're not missing anything, they're just for any humans reading the file who want to know more about the agent. This is also just an example file.


At least they are being “good net citizens” [0] and setting their User Agent honestly.

0: https://x.com/pm/status/1391462220861566977?s=46


This is a bit confusing since "AI Agent" has come to mean a specific thing, whereas this is about values of the User-Agent field.


Can you explain what you mean?


The term "AI Agent" is currently being used a lot to describe the setup where you have an LLM doing multiple generation rounds using tools etc so it can interact with an environment or other LLMs or whatever. Feel free to pass your own judgement on whether that's going to go anywhere, but that's what I think of when I hear the term, rather than "web crawler for company making LLM".


I think it's too early to reserve the word agent. Especially when we could bring back Daemon.


Kind of a funny choice of names, "dark visitor" is a common English translation of 黑客, the title of a book[1] about hacking in China, for example.

1: https://taosecurity.blogspot.com/2008/02/review-of-dark-visi...


Oh that is a good point! 黑客 could also be 駭客!


I'm not sure I'd classify most of these as AI agents; they're mostly web crawlers. (Though the definition of agent is fuzzy enough that I suppose you could lump them in.)

I'm also curious, will adding "Common Crawl" to the user-agent disallow list in your robots.txt actually do anything?

Added context: I run a tracker of vetted AI agents with verticalized use cases https://staf.ai


Yeah they respect robots.txt: https://commoncrawl.org/faq
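
If you want to see what a compliant crawler would conclude from your file, Python's standard `urllib.robotparser` can evaluate it. A small sketch, assuming the Common Crawl token is `CCBot`; the sample file and the helper name `is_allowed` are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block CCBot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a robots.txt-respecting agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "SomeBrowser", "https://example.com/page"))
```

This only models a well-behaved crawler. It does nothing to stop one that ignores the file.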


I'm kinda unclear on why I wouldn't want these to be able to access my site, what's the reasoning?


Maybe if you're worried that your hypothetical My Little Pony erotic fanfic site might accidentally poison one of these AIs. Maybe. Hypothetically speaking. Of course, neither you nor I have such a site, and if we did, the writing would be of high enough quality, and the characters depicted faithfully enough to be good for AI training!


And that's the story of CelestAI


Ask the New York Times


Copyright…exfiltration of ideas…industrial espionage…

YMMV


If I can industrially espionage your company using publicly accessible data, I think you should ask chatgpt for some advice on how to maintain secrecy


In today's world of PR speak, where reading a company's main page leaves me with no concept of what they do, maybe an AI could explain it better. Or they've been planning for this all along and knew they could confuse the future AI overlords by using word vomit for their websites.


> exfiltration of ideas

Isn't that generally the point of putting something on the internet?


There's a difference between a human reading something on the internet and an LLM absorbing something into its model on a mass scale. Mainly that LLMs are packaged up into nice commercial products that, in essence, sell other people's ideas for profit (or in aid of hugely inflated stock options) with no recompense to the original author of the idea.


I guess I can understand that POV even if I don't hold it myself.


Great idea & execution!

There is a typo in one of the classification texts:

  not currently classified as artificailly intelligent


Fixed, thanks!


This seems like a useful index, but I just wanted to say that I love the name. A good name is so much fun.


I just stumbled on it while searching for AI crawler user agents.

I'm not sure how often it's updated. It seems fairly complete from other sources I've been looking at (mainly robots.txt files for well-known publishers).


It's updated daily. Thanks for posting by the way.


Thanks for making it, a great resource!


Ha thanks, it was a shower thought


Does Cloudflare have an equivalent?

I assume they'd have visibility into a huge amount of this


This is helpful.

An updating list of IPs for these agents wouldn't go amiss. I might have access to some site logs I could mine for that.
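
A rough sketch of how that log mining might look in Python, assuming access logs in the common "combined" format where the user agent is the last quoted field. The token list is just the agents named in this thread plus `CCBot`, and `ips_by_agent` is a hypothetical helper name. User agents can be spoofed, so anything collected this way is unverified.

```python
import re
from collections import defaultdict

# First field is the client IP; the last quoted field is the user agent.
LINE_RE = re.compile(r'^(?P<ip>\S+) .* "(?P<ua>[^"]*)"\s*$')

# Illustrative tokens only (an assumption, not a complete list).
AGENT_TOKENS = ("GPTBot", "ChatGPT-User", "CCBot", "cohere-ai")


def ips_by_agent(lines):
    """Group client IPs by the AI agent token found in the user agent."""
    found = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # Skip lines that are not in the expected format.
        ua = match.group("ua").lower()
        for token in AGENT_TOKENS:
            if token.lower() in ua:
                found[token].add(match.group("ip"))
    return {agent: sorted(ips) for agent, ips in found.items()}
```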


Don’t block Common Crawl! They’re a charity and do great work producing an open dataset that everyone can use.


If you want to avoid having your site contents used to train AI and don't want to make your website unavailable to the open web, then blocking Common Crawl (among others) is absolutely mandatory.

We have very few tools to protect ourselves here, and need to make the most of the ones we have.


I want to publish it on the internet, but I want to keep control of it.

Honestly, it seems like we've been here before. I hope y'all have got your right-click-disable scripts in place too.


This is kind of a question of etiquette and legality more than it is a question of what is technically possible. People can and do ignore robots.txt. That doesn't mean copyright infringement is now legal.


Doesn't mean the chatbots are infringing copyright, either. Or even that copyright is respected worldwide.

It's going to be an interesting half-decade.


People steal cars and break into houses even though both have their versions of locks and alarms, and those are just as useless against a skilled "attacker" as a right-click-disable fix is on a website.


Why use a deliberately open standard then? Create an app like every other successful monopolistic megalomaniac?


Some people want to make their stuff freely available to people, but don't want it to be used to train AI. That's the group my comment was talking about. No monopolistic megalomaniacs here.

Personally, I think that such efforts are pointless -- all it takes is one crawler to make it through your defenses and you may as well have not done a thing. The alternative, though, is to remove your sites from the open web entirely. Either way, Common Crawl needs to be excluded if you want to avoid your stuff being used to train AI.


yeah - more cases than those alluded to ..

https://apnews.com/article/dungeons-dragons-ai-artificial-in...

a variety of stakeholders involved - not just "individuals doing pointless things" .. new tech is making winners and losers right now.


Them being a charity isn't an automatic license to use anything on the internet?


Including all of the other crawlers. Interesting dilemma.


I think most traffic on the internet will eventually come from "ChatGPT-User" or similar. Who wants to sift through things when ChatGPT can do this for you? What is needed is a way to monetize this information it seems.


I was thinking the same. Wouldn’t it be better to let agents buy access to your site? It wasn’t feasible with the Google monopoly. Now with dozens of agents competing for info it seems feasible to put a tollbooth in front of your site.


doesn't that depend on the purpose of the visit? Is it not the choice of a publisher?


Because the sift may be biased and unfair.


Great list.

Would be nice if they had a fully filled robots.txt for download (I could only find the example file).


Right here: https://darkvisitors.com/robots-txt-builder

This example blocks everything, which is probably not what you want, but it's meant to be a starting point.


Is this for crawling requests used to train/update models, or for real-time requests made in response to user queries?


This idea is awesome! I'm definitely going to be referencing this for some of my upcoming projects.

Love the design and name, too.


Why even bother adding those to a purely declarative robots.txt, when you can just block them outright?


Defence-in-depth? Why not both?
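
For the "block them outright" half, one possible shape is a small WSGI middleware that answers 403 when the User-Agent header contains a known token. This is only a sketch: the token list is an assumption drawn from agents named in this thread plus `CCBot`, the names `is_blocked` and `block_ai_agents` are hypothetical, and it only catches agents that identify themselves honestly.

```python
# Lowercase tokens to look for in the User-Agent header (illustrative only).
BLOCKED_TOKENS = ("gptbot", "chatgpt-user", "ccbot", "cohere-ai")


def is_blocked(user_agent):
    """Return True if the user agent string contains a blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def block_ai_agents(app):
    """Wrap a WSGI app so blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware
```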


Looks like we might end up with the new web designed entirely for machine intelligence.


What RDF was supposed to be for, though an extra to the web not an alternative.


It would be great to have a raw JSON file or something of this list so we can just start blocking the requests instead of using robots.txt. Do they really respect it? Probably not.


Most of these bots already publish their IP ranges, so it would be just a simple firewall rule.

For ChatGPT-User:

https://platform.openai.com/docs/plugins/bot

And for GPTBot:

https://platform.openai.com/docs/gptbot https://openai.com/gptbot.json
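
Once you have the published CIDR ranges in hand, the check itself is short with Python's standard `ipaddress` module. A minimal sketch: the ranges below are placeholders from the documentation address space, not real crawler ranges, and I am not assuming anything about the format of the linked files, so loading them is left out. `in_published_ranges` is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (documentation ranges), NOT real crawler ranges.
EXAMPLE_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def in_published_ranges(ip, cidrs):
    """Return True if the IP address falls inside any of the CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


if __name__ == "__main__":
    print(in_published_ranges("192.0.2.17", EXAMPLE_RANGES))
    print(in_published_ranges("203.0.113.5", EXAMPLE_RANGES))
```

The same list of blocks could feed a firewall rule directly; the function is just the application-level equivalent.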


It would be cool if you could download this list as a robots.txt



Oh sick


Huh, didn't know about "ChatGPT-User"... I've always been a fan of a slightly blacker-hat approach to this problem: rather than block those user agents outright, wouldn't it be way more fun to serve them nonsense or irrelevant garbage-filled versions of your pages?


As a user of a non-Google search engine (which has come a long way), I'm pretty disappointed by how many website owners are explicitly granting monopoly powers to Google.

Modern browsers already have LLMs built in so anti-scraping is a folly.


I don't understand anything from this comment. Is there something wrong with me?



