
Nice, what are you using to crawl the web?


It's pretty much all bespoke.

I use external libraries for parsing HTML (JSoup) and robots.txt, but that's about it.
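
For readers wondering what those two library-backed steps look like, here is a minimal sketch. It is not the crawler's actual code: it uses Python's standard library in place of JSoup and the robots.txt library, and the names `LinkExtractor`, `extract_links`, `allowed_links` and the `example-bot` user agent are made up for illustration. It also assumes every link lives on the host the robots.txt came from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the outgoing links of a page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def allowed_links(links, robots_txt, user_agent="example-bot"):
    """Keep only the links that the given robots.txt text permits.

    Assumption: all links belong to the host this robots.txt was served from.
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [link for link in links if rules.can_fetch(user_agent, link)]
```

A real crawler would fetch and cache one robots.txt per host before filtering that host's links.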


What was the starting site you fed to the crawler to follow the links from to build the index?


Just my (Swedish) personal website. The first iteration of the search engine was probably mainly seeded by these links:

https://www.marginalia.nu/00-l%C3%A4nkar/

But I've since expanded my websites, so I think these now play a decent role in later iterations, although virtually all of them are pages I found by eating my own dogfood:

https://memex.marginalia.nu/links/fragments-old-web.gmi

https://memex.marginalia.nu/links/bookmarks.gmi
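
To illustrate what "seeding" means here, the following is a rough sketch of a breadth-first crawl that starts from a handful of seed URLs and follows links outward. It is only an assumption about the general technique, not a description of how this search engine does it; `crawl`, `fetch_links`, and `max_pages` are hypothetical names, and `fetch_links` stands in for whatever downloads a page and returns its outgoing links.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first from the seeds; return URLs in visit order.

    fetch_links is a caller-supplied function (hypothetical) that takes a URL
    and returns the URLs that page links to.
    """
    frontier = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            frontier.append(seed)

    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Queue each newly discovered URL exactly once.
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

With a link page as the only seed, everything the crawl reaches is whatever that page links to, then whatever those pages link to, and so on, which is why the choice of seed shapes the early index so strongly.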



