Search itself is a fractal of interesting problems. Haven't had much time to write about it lately, but I've pretty much doubled the size of the index and rewritten a lot of the query logic to make it much better, faster, and more accurate. Will do a write-up eventually, since it may be relatively explainable without getting so far into the weeds that the audience dwindles entirely.
I keep having breakthroughs that make it better in one way or another, but as soon as I do I find something new that could be improved.
Kinda bonkers it's been possible to build this alone and run the entire thing on what amounts to a souped up PC :P
1 million websites, a bit above 60 million documents in the index; the crawl is a couple of hundred million but a lot of it gets filtered out for various reasons.
The crawler itself is aware of 470 million URLs.
I've actually had it up to 50 million before, but that was a lot noisier data with fewer keywords per document. The current 60 million is significantly "bigger" than the old 50 million. Index size is not actually a great metric for how comprehensive a search engine is. A small index with good signal-to-noise ratio is much more useful than a large one where 95% is chaff.
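To put rough numbers on that, here's a back-of-the-envelope sketch. The 95% chaff figure is the one from above; the 50% chaff share for the smaller index is purely a made-up illustration, not a measurement, and `useful_documents` is just a hypothetical helper name.

```python
def useful_documents(index_size: int, chaff_fraction: float) -> int:
    """Documents left once the chaff share of the index is discounted."""
    return round(index_size * (1.0 - chaff_fraction))


# A large index where 95% is chaff.
big_noisy = useful_documents(100_000_000, 0.95)

# A smaller index with a hypothetical 50% chaff share (assumed for illustration).
small_clean = useful_documents(60_000_000, 0.50)

print(big_noisy, small_clean)
```

Under those assumptions the smaller index ends up with several times more useful documents than the bigger one, which is why raw document count on its own says little.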
100 million is my current goal. I think that's about what's doable on my current hardware. It also gets increasingly unwieldy to deal with the data. I've already got processes that require several days of non-stop computation.
Size on disk is like 300-400 GB I think. Fairly manageable. I think it would require significantly more hardware with a multi-node approach. Locality is hella efficient.