
As always, working on search.marginalia.nu.

Search itself is a fractal of interesting problems. I haven't had much time to write about it lately, but I've pretty much doubled the size of the index and rewritten a lot of the query logic to make it much better, faster, and more accurate. I'll do a write-up eventually, since it may be relatively explainable without getting so far into the weeds that the audience dwindles entirely.

I keep having breakthroughs that make it better in one way or another, but as soon as I do, I find something new that could be improved.

Kinda bonkers it's been possible to build this alone and run the entire thing on what amounts to a souped up PC :P



How big is your index and how many sites do you cover?


1 million websites, a bit above 60 million documents in the index; the crawl is a couple of hundred million but a lot of it gets filtered out for various reasons.

The crawler itself is aware of 470 million URLs.

I've actually had it up to 50 million before, but that was a lot noisier data with fewer keywords per document. The current 60 million is significantly "bigger" than the old 50 million. Index size is not actually a great metric for how comprehensive a search engine is. A small index with good signal-to-noise ratio is much more useful than a large one where 95% is chaff.
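To put rough numbers on the signal-to-noise point, here is a minimal sketch. Only the 60 million index size and the 95% chaff figure come from the comment above; the 20% chaff rate for the small index and the 500 million size for the large one are made-up assumptions for illustration, and `useful_documents` is a hypothetical helper, not anything from the actual search engine.

```python
def useful_documents(index_size: int, chaff_percent: int) -> int:
    """Documents remaining once the given percentage of chaff is discounted."""
    if not 0 <= chaff_percent <= 100:
        raise ValueError("chaff_percent must be between 0 and 100")
    # Integer arithmetic keeps the result exact.
    return index_size * (100 - chaff_percent) // 100


# 60 million documents, with an ASSUMED 20% chaff (illustrative only).
small_clean = useful_documents(60_000_000, 20)

# An ASSUMED 500 million document index where 95% is chaff.
large_noisy = useful_documents(500_000_000, 95)

print(small_clean)  # 48000000
print(large_noisy)  # 25000000
```

Under these assumed rates, the smaller index holds nearly twice as many useful documents, and a query against it has far less noise to rank past.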

100 million is my current goal. I think that's about what's doable on my current hardware. It also gets increasingly unwieldy to deal with the data. I've already got processes that require several days of non-stop computation.


For sure, a large index by itself doesn't mean anything. I was more curious about the size on disk and how you manage it on a single machine.

Also curious now, why you say half a 470m URLs? :)


Size on disk is something like 300-400 GB, I think. Fairly manageable. I think it would require significantly more hardware with a multi-node approach. Locality is hella efficient.

I accidentally a word while editing the sentence.


I really appreciate your work.


I love this :)



