Improved ways to operate a rude crawler (marginalia.nu)
75 points by doruk101 10 months ago | 12 comments


I had to lock down my private Gitea server when I noticed my commits were taking forever, because my meager 2-CPU instance was pegged.

Tail the nginx logs, sure enough some jerk is asking for every URL for every git commit ever done, with no delays/backoffs/anything. Just hammer the ever-loving crap out of me. Lovely, GTFO!

The simplest thing to do: add HTTP Basic auth. Now my git server is no longer accessible to the public. Thanks AI startups! Maybe I'll re-enable after this craze is over.
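
For contrast, here is a rough sketch of the "delays/backoffs" that crawler was missing. It is only an illustration under assumptions: the class name `PoliteScheduler`, the one-second base delay, and the choice to treat 429 and 5xx responses as "slow down" signals are all made up for this example, not taken from any particular crawler. It does no networking; it only decides how long to wait before touching a host again.

```python
import time
from urllib.parse import urlparse


class PoliteScheduler:
    """Hypothetical helper: tracks, per host, when the next request is allowed."""

    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay  # assumed minimum gap between requests to one host
        self.max_delay = max_delay    # cap so the backoff cannot grow without bound
        self._failures = {}           # host -> consecutive "slow down" responses
        self._next_ok = {}            # host -> earliest time of the next request

    def delay_for(self, host):
        # Exponential backoff: double the gap for each consecutive failure.
        failures = self._failures.get(host, 0)
        return min(self.base_delay * (2 ** failures), self.max_delay)

    def wait_time(self, url, now=None):
        # Seconds the caller should sleep before requesting this URL.
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        return max(0.0, self._next_ok.get(host, 0.0) - now)

    def record(self, url, status, now=None):
        # Call after each response so the next gap reflects how the server is coping.
        host = urlparse(url).netloc
        now = time.monotonic() if now is None else now
        if status == 429 or status >= 500:
            self._failures[host] = self._failures.get(host, 0) + 1
        else:
            self._failures[host] = 0
        self._next_ok[host] = now + self.delay_for(host)
```

A crawler loop would call `time.sleep(scheduler.wait_time(url))` before each request and `scheduler.record(url, status)` after it, so a struggling 2-CPU box sees fewer requests instead of more.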


Don't forget to set "Accept-Encoding" to "identity", you wouldn't want to waste valuable CPU cycles on decompression. You need those for training!


> This text is satirical in nature.

I usually reject this prefix/postfix as damaging to the spirit of the post. Ruining the art as it were.

Unfortunately, I think in this case, it's required. I run a novel git host and I've seen bots that lie about their UA and crawl code exclusively, ignoring the git history and only following links with an extension. It's a git host: if you choose to crawl the web interface instead of cloning the repo, you're too stupid to also pick up that this is satire, and would likely follow the other suggestions intentionally. Same goes for those bots that crawl Wikipedia instead of downloading one of the prepackaged archives. Bot authors: "Why are you the way you are?"

It's refreshing to read some humor about the state of things. There's too much frustration, vitriol and anger. Justified as it may be, this is a nice change of pace. So my heartfelt thanks to the author, I laughed. :)


As someone who runs a very simple crawler, I hope these actions will not affect me that much. I want to be able to collect data and be able to share it.

Results of my crawling

https://github.com/rumca-js/Internet-Places-Database


Turn the tables by having your crawler send snippy emails to webmasters when their site slows down under your barrage. Try: “Your server failed to support our cutting-edge AI training. Please upgrade your pathetic infrastructure.” Blaming them for your bad behavior not only shifts responsibility but also proves your startup’s fearless attitude.


These types of satirical posts are great, and it's great that they can not only be entertaining but also provide new information (I had never heard of TCP SACK).

P.S: all I have to say to this guy spamming HN at the moment is mentioned in this (great) article: GET over it.


(author) Yeah this originally started out as more of an educational post, but it got progressively more unhinged until I more or less gave up on the idea.


ps: he's mad because bad... but don't feed the trolls


Of course he is, but at some point I will still showdead=false.


meh, everyone is entitled to have a bad day eventually... I've been there... never enough to want to win clout as an edgelord, but I understand being so angry you don't know how to cope


Imagine going to jail because you got rejected by YC lmao


"You want that new TCP handshake smell" really got to me.



