wecanhelp ago

Thank you so much for your work; this project is a huge relief for the community.

As for the assets: Voat seems to be using relative URLs. So when you archive a given page, could you parse (or grep) the HTML for <script /> and <link rel="stylesheet" /> tags pointing to the static assets requested by the page, and make an up-to-date copy of those assets every day, maintaining the folder structure as found in the src/href attributes? That way, when opening one of the .html files locally, the browser would look up the appropriate local copies of the scripts and stylesheets, and load them.
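
Something along these lines might do it, assuming the archived pages sit in a ./pages/ folder and the asset paths all start with a slash (both of those are just guesses on my part):

```bash
# Rough sketch: pull the src/href paths out of the archived pages and
# re-download those assets, keeping the same folder structure locally.
grep -ohE '<(script|link)[^>]*(src|href)="[^"]*"' pages/*.html \
  | grep -oE '(src|href)="[^"]*"' \
  | sed -E 's/^(src|href)="([^"]*)"$/\2/' \
  | sort -u \
  | while read -r asset; do
      # -x (--force-directories) recreates the folder structure on disk,
      # -nc skips anything already grabbed on an earlier run
      wget -nc -x "https://voat.co${asset}"
    done
```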

I'm sure this is overly simplistic and problems will arise as you go, especially with assets that are loaded on the fly. But do you see a problem with the initial logic?


Edit: Have you tried wget -r -nc https://voat.co/v/pizzagate/new?page={1..19}? In theory, this downloads each page recursively (with all its assets), and -nc stops wget from overwriting a file that already exists. The open question is whether wget will actually skip the superfluous HTTP request when a given file already exists, or whether it will carry out the request anyway and simply discard the duplicate. The latter behavior would, of course, result in a lot of unwanted traffic, but if the former is the case, this could be a good starting point.
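
If the -r recursion turns out to be too aggressive, a variant worth testing (untested against Voat, so treat it as a sketch) is to loop over the pages and lean on --page-requisites instead, which only fetches what each page needs rather than following every link:

```bash
# --page-requisites grabs the CSS/JS/images each page references,
# --no-clobber skips files that already exist locally,
# --wait=1 pauses a second between requests to go easy on the server.
for p in $(seq 1 19); do
  wget --no-clobber --page-requisites --wait=1 \
       "https://voat.co/v/pizzagate/new?page=${p}"
done
```

From my reading of the wget docs, -nc checks for the local file before issuing the request, so the duplicate traffic shouldn't happen, but it's worth watching the transfer output to confirm.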

gittttttttttttttttt ago

Thanks for the pointers. Will give this a go in a bit and test it out.

Sonic_fan1 ago

If anyone wants to try this on their end, another interesting one to look at is WebHTTrack... it can be set to recursively follow links, it'll handle CSS and all that (at least it used to... haven't used it in a while), and it can be told to follow links however many levels deep you want. I've used it to fully download a friend's website, and it'll even change links from absolute (http://me.com/img/1.jpg) to relative to the folder structure (foldername/img/1.jpg), so it gives you a browsable site. But what you have now is awesome! If you don't have to change it, don't. Everyone is right, having any sort of complete archive of all this is the biggest thing, even if whoever looks at it has to wade through a little HTML. Thumbs up
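
For anyone who would rather script it than click through the WebHTTrack GUI, the command-line httrack takes roughly this shape (the output directory, filter, and depth here are just examples, not tested against Voat):

```bash
# Hypothetical httrack invocation -- adjust the output dir, filter and depth.
# -O sets where the mirror goes, the "+..." pattern keeps the crawl on voat.co,
# and -r3 limits it to three link levels deep.
httrack "https://voat.co/v/pizzagate/new" \
        -O ./voat-mirror \
        "+*.voat.co/*" \
        -r3
```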

Also, don't know if it's possible... maybe have the bot just sit and monitor the site for any time something changes, because if the site disables everything after page 19 and we have a busy day around here, important stuff might get bumped off by the newer stuff. Maybe have the bot compare the front page of 'New' to the last archived copy of 'New' (dates, or maybe thread titles), and if there's any difference, have it just slurp down the newest posts or pages.
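
A dumb polling loop could probably cover that. Something like this, where the file names and the five-minute interval are only placeholders:

```bash
# Hypothetical watcher: re-fetch the 'new' listing and only archive it
# when it differs from the last copy we saved.
mkdir -p archive
while true; do
  wget -q -O new.latest.html "https://voat.co/v/pizzagate/new"
  if ! cmp -s new.latest.html new.last.html; then
    cp new.latest.html "archive/new.$(date +%Y%m%d-%H%M%S).html"
    cp new.latest.html new.last.html
  fi
  sleep 300   # check again in five minutes
done
```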

And this would be a great use for that old DLT4 tape drive someone has sitting around (which reminds me, I should get that tower from ma's at some point)... 800 gigs on a single tape, as long as I could get Windows to recognize it... nightly or weekly backups of the GitHub repo. And I know it's possible to get a tape drive working under Win10 (I have a Travan 10/20 that works for backup, using ZDatDump Free... the software limits you to 12 gigs backed up without paying, but it works).