
There is some info about archiving here: https://www.gwern.net/Archiving-URLs

It covers lots of jobs and scripts, plus use of archive.org as well. It's an interesting read.



That page is a bit outdated because I am still fine-tuning the on-site archive system before I do a writeup.

I still use archiver-bot etc.; they're just not how I do the on-site archives. See https://github.com/gwern/gwern.net/blob/master/build/LinkArc... https://github.com/gwern/gwern.net/blob/master/build/linkArc... for that.

The quick summary: PDFs are automatically downloaded and hosted locally, and links to them are rewritten to point at the local PDF. Other URLs, after a delay, are passed to the CLI version of https://github.com/gildas-lormeau/SingleFile, which runs headless Chrome to dump a snapshot. I manually review the snapshots and improve them as necessary, and then the links get rewritten to the snapshot HTML. The archived copies get some no-crawl HTTP headers and robots.txt exclusions to try to reduce copyright trouble.
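
For anyone who wants to picture the flow described above, here is a rough sketch in Python. It is not the actual gwern.net code (that lives in the build scripts linked earlier); the archive root, the 60-day delay, the hash-based filenames, and every function name here are made up for illustration. The only external tool assumed is the SingleFile CLI in its basic `single-file <url> <output>` form, and the command is only built, not run.

```python
import hashlib
from datetime import date, timedelta
from urllib.parse import urlparse

# Hypothetical settings; the comment above only says "hosted locally" and "after a delay".
ARCHIVE_ROOT = "/doc/www"
DELAY = timedelta(days=60)


def is_pdf(url):
    """Guess from the URL path whether the link points at a PDF."""
    return urlparse(url).path.lower().endswith(".pdf")


def local_path(url):
    """Map a URL to a stable local path: <root>/<host>/<sha1 of URL>.<ext>."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    ext = "pdf" if is_pdf(url) else "html"
    return f"{ARCHIVE_ROOT}/{host}/{digest}.{ext}"


def snapshot_command(url, output_file):
    """Build (but do not run) a SingleFile CLI call for one page."""
    return ["single-file", url, output_file]


def plan(url, first_seen, today, reviewed=False):
    """Decide the next step for one link, returning (action, local target)."""
    if is_pdf(url):
        # PDFs are mirrored right away and the link is rewritten to the copy.
        return ("download", local_path(url))
    if today - first_seen < DELAY:
        # Too new: leave the original link alone for now.
        return ("wait", None)
    if not reviewed:
        # Old enough: take a snapshot, which still needs a manual look.
        return ("snapshot", local_path(url))
    # Snapshot exists and was reviewed: point the link at the local HTML.
    return ("rewrite", local_path(url))
```

For example, `plan("https://example.com/post", date(2020, 1, 1), date(2020, 1, 10))` gives `("wait", None)`, while the same link checked months later gives a `"snapshot"` step whose target can be handed to `snapshot_command`. The no-crawl part is left out of the sketch; one plausible way to do it (again an assumption, not something stated above) would be an `X-Robots-Tag: noindex` response header plus a `Disallow` rule for the archive directory in robots.txt.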


THANK YOU for scratching that itch.


Just what the doctor ordered. Thank you.



