alele-opathic ago
Friendly reminder that, unless you have an archive of your own, the gems of the internet are as good as lost. For those of you who wondered why censored sites also disappeared from the (((internet archive))), this is why.
Learn to make local mirrors of websites.
Deceneu ago
Do you recommend any tools for doing just that ?
alele-opathic ago
Hey! I'll assume you are on Windows, since nearly every Linux distro ships with wget (one such tool, and quite a powerful one) right on the command line.
BTW, the name of the tool we're looking for is a 'web spider' - it saves pages (much as your browser can if you right click), but then it reads them and follows links to other pages. You need to give them boundaries to keep them on the sites/pages of interest, but beyond that they automate the whole process of mirroring an endangered site.
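For example, a minimal sketch of fencing wget in (the address and path here are just placeholders): recurse at most five links deep and never climb above the starting directory - and by default it won't wander off to other hosts either.

wget --recursive --level=5 --no-parent https://example.com/archive/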
As to which tools:
* If you are more daring, look for a torrent for a program called Teleport Ultra - there is nothing better.
* If you prefer freeware and don't want to torrent, HTTrack is your best bet (there's a quick command-line sketch just below this list).
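HTTrack also ships a command-line binary alongside the GUI. A minimal sketch, where the URL, output directory and filter are all placeholders:

httrack "https://www.example.com/" -O "./example-mirror" "+*.example.com/*" -v

The '+' filter keeps it on the site of interest, -O sets where the mirror lands, and -v just makes it chatty.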
If you actually happened to be on Linux (and my guess as to your OS was wrong earlier), then simply open a terminal and type:

wget -mkEpnp www.nameofyoursitehere.com

wget obeys the Robots Exclusion Standard, so if you need something that people are intentionally trying to keep from spiders (which usually means Google), then pass the '-e robots=off' option as well, before the site address.
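For reference, the same command spelled out with long options and with the robots override added (the site address is the same placeholder as above):

wget -e robots=off --mirror --convert-links --adjust-extension --page-requisites --no-parent www.nameofyoursitehere.com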
Deceneu ago
Thanks for the refresher. Actually, I'm on both.
Yes, I was referring to web spiders. They also used to be called 'web grabbers' back when I ran them in the early '00s to archive sites offline and offset dial-up phone line costs (33-56k). The naming paralleled 'CD grabbing' from the same period. Broadband being what it is, I stopped using both ;)
I tried HTTrack about a year ago with unsatisfactory results at the time (I don't remember the details).
Thank you for the wget params. I'll try them, then carefully study the man page for them, and maybe try mapping them (where possible) onto HTTrack's numerous settings (if I recall its options correctly). That way I can use whichever system I find myself on.