There is some info about archiving here: https://www.gwern.net/Archiving-URLs Lo...

gwern · on April 7, 2022

That page is a bit outdated because I am still fine-tuning the on-site archive system before I do a writeup.

I still use archiver-bot etc.; they're just not how I do the on-site archives. See https://github.com/gwern/gwern.net/blob/master/build/LinkArc... https://github.com/gwern/gwern.net/blob/master/build/linkArc... for that.

The quick summary is that PDFs are automatically downloaded and hosted locally, and links to them are rewritten to point to the local PDF. Other URLs, after a delay, are passed to the CLI version of https://github.com/gildas-lormeau/SingleFile, which runs headless Chrome to dump a snapshot. I manually review the snapshots and improve them as necessary, and then the links get rewritten to the snapshot HTML. The local copies get some no-crawl HTTP headers and robots.txt exclusions to try to reduce copyright trouble.
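
For readers who want the decision logic spelled out, here is a minimal Python sketch of the flow described above. It is not the actual gwern.net code (which lives in the linked build scripts). The function names, the `/archive/` path layout, the SHA-1 naming scheme, and the 60-day delay are all assumptions made for illustration, and the SingleFile invocation itself is left out rather than guessed at.

```python
import hashlib
from datetime import date, timedelta
from urllib.parse import urlparse

# Assumed value: the comment only says "after a delay".
ARCHIVE_DELAY = timedelta(days=60)


def is_pdf(url: str) -> bool:
    """Crude check: treat a URL as a PDF if its path ends in .pdf."""
    return urlparse(url).path.lower().endswith(".pdf")


def local_path(url: str) -> str:
    """Hypothetical naming scheme: hash the URL to get a stable local filename."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    ext = "pdf" if is_pdf(url) else "html"
    return f"/archive/{digest}.{ext}"


def plan(url: str, first_seen: date, today: date) -> str:
    """Decide what to do with a link on this build."""
    if is_pdf(url):
        return "download-pdf"      # PDFs are mirrored immediately
    if today - first_seen >= ARCHIVE_DELAY:
        return "snapshot-html"     # hand off to a snapshot tool, then review by hand
    return "wait"                  # too new; keep linking to the live page


def rewrite(url: str, archived: bool) -> str:
    """Point the link at the local copy only once an archive exists."""
    return local_path(url) if archived else url
```

Keeping `plan` separate from `rewrite` mirrors the manual-review step: a snapshot can exist on disk without the link being switched over until it has been checked.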

tomcam · on April 7, 2022

THANK YOU for scratching that itch.

tomcam · on April 6, 2022

Just what the doctor ordered. Thank you.