
I use Java with a simple task queue and multiple worker threads (Scrapy is single-threaded, although it uses async I/O). Failed tasks are collected into a second queue and restarted when needed. I used Jsoup [1] for parsing, and proxychains and HAProxy + Tor [2] for distributing requests across multiple IPs.

[1] https://jsoup.org/ [2] https://github.com/mattes/rotating-proxy
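For illustration only, here is a minimal sketch of the "second queue for failed tasks" idea. It is not the commenter's code: the class `RetryQueueSketch`, the method `runWithRetries`, the `maxRounds` limit, and the choice to treat any `RuntimeException` as a failure are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

class RetryQueueSketch {
    // Hypothetical helper: runs every task on a fixed pool, collects the ones
    // whose handler throws into a second queue, and re-runs that queue for at
    // most maxRounds passes. Returns the tasks that were still failing.
    static <T> List<T> runWithRetries(List<T> tasks, int threads, int maxRounds,
                                      Consumer<T> handler) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<T> current = new ArrayList<>(tasks);
        try {
            for (int round = 0; round < maxRounds && !current.isEmpty(); round++) {
                Queue<T> failed = new ConcurrentLinkedQueue<>(); // the "second queue"
                List<Callable<Void>> jobs = new ArrayList<>();
                for (T task : current) {
                    jobs.add(() -> {
                        try {
                            handler.accept(task);
                        } catch (RuntimeException e) {
                            failed.add(task); // assumption: any exception means "retry later"
                        }
                        return null;
                    });
                }
                pool.invokeAll(jobs);              // blocks until this round is finished
                current = new ArrayList<>(failed); // next round retries only the failures
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // give up early, keep the interrupt flag
        } finally {
            pool.shutdown();
        }
        return current;
    }
}
```

In a real scraper the handler would be the fetch-and-parse step, and the retry would probably wait or switch proxy between rounds; neither is shown here.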



The hardest part was synchronization:

- to end the main thread only if all tasks are done

- when every running task can produce multiple new tasks

- with limiting the maximum number of running threads

- always running the maximum number of threads if possible

Semaphores to the rescue.
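One way the four requirements above could be met with semaphores is sketched below. This is a guess at the shape of such a solution, not the original code: `SemaphoreCrawl`, `run`, and the `expand` function (which maps a task to the new tasks it produces) are hypothetical names. One semaphore caps the running threads, a second one wakes the main thread whenever a task is queued or all work is finished, and a counter tracks tasks that are queued or running.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class SemaphoreCrawl {
    // Processes the seed and every task it transitively produces.
    // Returns the number of tasks that completed without throwing.
    static <T> int run(T seed, int maxThreads, Function<T, List<T>> expand) {
        Queue<T> queue = new ConcurrentLinkedQueue<>();
        Semaphore slots = new Semaphore(maxThreads);   // caps concurrently running threads
        Semaphore events = new Semaphore(0);           // one permit per queued task, plus one at the end
        AtomicInteger pending = new AtomicInteger(1);  // tasks queued or running (seed counted)
        AtomicInteger processed = new AtomicInteger();

        queue.add(seed);
        events.release();

        while (true) {
            // Uninterruptible acquires keep the sketch free of checked exceptions.
            events.acquireUninterruptibly();           // wait for a task or for the end signal
            T task = queue.poll();
            if (task == null) break;                   // permit without a task: everything is done
            slots.acquireUninterruptibly();            // wait for a free thread slot
            new Thread(() -> {
                try {
                    List<T> children = expand.apply(task);
                    pending.addAndGet(children.size()); // count children before the parent finishes
                    for (T child : children) {
                        queue.add(child);
                        events.release();
                    }
                    processed.incrementAndGet();
                } finally {
                    slots.release();
                    if (pending.decrementAndGet() == 0) {
                        events.release();              // end signal for the main thread
                    }
                }
            }).start();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // Toy task tree: 1 -> 2, 3; 2 -> 4, 5; 3 -> 6, 7; leaves produce nothing.
        Function<Integer, List<Integer>> expand =
            n -> n < 4 ? List.of(2 * n, 2 * n + 1) : List.of();
        System.out.println(run(1, 3, expand)); // prints 7
    }
}
```

Because children are added to `pending` before the parent decrements it, the counter can only reach zero when nothing is queued or running, which is what lets the main thread exit safely.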


Doesn't ThreadPoolExecutor take care of all of that if you store the returned Future from the submit method? Then you just have the main thread wait for those.
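A rough sketch of that suggestion, under the assumption that tasks can submit further tasks while running: the pool from `Executors.newFixedThreadPool` (a `ThreadPoolExecutor`) caps the thread count, every `Future` goes into a shared queue, and the main thread drains that queue. `ExecutorCrawl`, `run`, `submit`, and `expand` are hypothetical names for illustration.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class ExecutorCrawl {
    // Processes the seed and every task it transitively produces.
    // Returns the number of tasks that completed without throwing.
    static <T> int run(T seed, int maxThreads, Function<T, List<T>> expand) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        Queue<Future<?>> futures = new ConcurrentLinkedQueue<>();
        AtomicInteger processed = new AtomicInteger();
        try {
            submit(pool, futures, processed, seed, expand);
            Future<?> next;
            // A task registers its children's futures before it completes, so an
            // empty queue after the last get() means no work is left anywhere.
            while ((next = futures.poll()) != null) {
                try {
                    next.get();
                } catch (ExecutionException e) {
                    // The task failed; a real crawler might re-queue it here.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    private static <T> void submit(ExecutorService pool, Queue<Future<?>> futures,
                                   AtomicInteger processed, T task,
                                   Function<T, List<T>> expand) {
        futures.add(pool.submit(() -> {
            for (T child : expand.apply(task)) {
                submit(pool, futures, processed, child, expand); // new tasks from a running task
            }
            processed.incrementAndGet();
        }));
    }

    public static void main(String[] args) {
        // Toy task tree: 1 -> 2, 3; 2 -> 4, 5; 3 -> 6, 7; leaves produce nothing.
        Function<Integer, List<Integer>> expand =
            n -> n < 4 ? List.of(2 * n, 2 * n + 1) : List.of();
        System.out.println(run(1, 3, expand)); // prints 7
    }
}
```

The extra care compared with a flat list of futures is that the collection keeps growing while the main thread waits, which is why it has to be a concurrent queue drained in a loop rather than a fixed list.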



