Hello people. I am the person behind the GitHub repos that were originally on the Reddit collab tools list and are now on the sidebar, and that have been getting a lot of downloads/views. I ran into an issue with saving Voat posts on this subverse since there are so many coming in daily. I noticed that as a community we fell behind on archiving every post to archive.is; my guess is we were only archiving about 30% of them. The top posts were getting archived, but a few diamond-in-the-rough posts fell through the cracks. Another big issue I ran into when trying to retrieve older posts is that the Voat admins recently disabled pagination past page 19. There is a lot of talk about it on /v/voatdev/ and it may get restored. The API is also not ready for production use, so I was not able to get a key. I am also working with one of the people on /v/voatdev/ to get a full backup of the older posts, so that we can be sure 100% of the data is backed up all over the world and to multiple sites.
The bot runs on a cron job every day, goes through pages 1-19 of /new, and makes a folder for that day. Once done, it pushes to the git repos. Every HTML page is downloaded with wget and saved under its post ID in that day's posts folder. There is also a file called ids.txt in each day's folder that lists the unique post IDs. Each post is also automatically submitted to archive.is through a POST request.
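In case anyone wants to see roughly what a run like that looks like, here is a simplified sketch in bash. The folder names, the Voat URL/ID pattern, and the archive.is form field are assumptions for illustration; the real bot may differ.

#!/bin/bash
# Simplified sketch of a daily scrape (names, paths, and URL patterns are placeholders).
DAY=$(date +%Y-%m-%d)
mkdir -p "$DAY/posts"

for page in $(seq 1 19); do
    # Pull the listing page and grep out post IDs (the ID pattern is an assumption).
    curl -s "https://voat.co/v/pizzagate/new?page=$page" \
        | grep -oE '/v/pizzagate/comments/[0-9]+' \
        | grep -oE '[0-9]+$' >> "$DAY/ids.txt"
done
sort -u -o "$DAY/ids.txt" "$DAY/ids.txt"

while read -r id; do
    # Save each post's HTML under its ID, then submit the URL to archive.is
    # (the exact form fields archive.is expects may differ).
    wget -q -O "$DAY/posts/$id.html" "https://voat.co/v/pizzagate/comments/$id"
    curl -s -d "url=https://voat.co/v/pizzagate/comments/$id" https://archive.is/submit/ > /dev/null
done < "$DAY/ids.txt"

git add "$DAY" && git commit -m "backup $DAY" && git push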
One thing I discovered last week about http://archive.is/https://voat.co/v/pizzagate/* is that they also have pagination issues. If someone could send an email about this to [email protected] I would really appreciate it; make sure to post below that you sent one so they do not receive multiple emails. We should request to be able to view all of the snapshots, since showing only 950-1000 is not enough. The good thing is that the posts are still archived even though they do not show up in the pagination (I checked with a few older posts), so as long as we have all the post IDs we can easily backtrack. I am going to try to create a master-post-ids.txt file in the main folder of the repo that will have every post ID ever posted here. I brought this up just so you are all aware.
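If the per-day ids.txt files end up laid out as described above, building that master list could be as simple as something like this (just a sketch; it assumes the day folders sit at the repo root):

# Merge every per-day ids.txt into one deduplicated master list.
cat */ids.txt | sort -u > master-post-ids.txt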
NOTE: PLEASE STILL USE ARCHIVE.IS YOURSELF, BECAUSE WE NEED TO BACK UP POSTS WITH MULTIPLE SNAPSHOTS AS PEOPLE ADD COMMENTS, DELETE COMMENTS, ETC. THE BOT WON'T BE ABLE TO CATCH THE NEWEST ACTIVITY, SO PLEASE KEEP ARCHIVING WHEN POSTS GET COMMENTS, ETC. ALSO KEEP SAVING POSTS LOCALLY. DO NOT JUST RELY ON ME AND MY BOT.
Here are the repos: https://github.com/pizzascraper/pizzagate-voat-backup and https://gitlab.com/pizzascraper/pizzagate-scraper
TO DO: Need to figure out CSS/JS/IMG assets. Viewing a saved HTML post locally currently does not load any stylesheets/scripts/images, since the URLs in the HTML files are relative rather than absolute, so the pages look pretty plain. This is not critical and can always be fixed later; what is important is preserving the data. If you have an idea on how to fix this, please file an issue or comment here. Also, if you have any suggestions or ideas on how to improve this, please let me know. I really appreciate all the help I can get.
The repos can be cloned with:
git clone https://github.com/pizzascraper/pizzagate-voat-backup.git
or
git clone https://gitlab.com/pizzascraper/pizzagate-scraper.git
Non-technical users can download everything as a zip by going to https://github.com/pizzascraper/pizzagate-voat-backup/archive/master.zip.
wecanhelp ago
Thank you so much for your work, this project is a huge relief for the community.
As for the assets: Voat seems to be using relative URLs. So when you archive a given page, could you parse (or grep) the HTML for <script /> and <link rel="stylesheet" /> tags pointing to the static assets requested by the page, and make an up-to-date copy of those assets every day, maintaining the folder structure as found in the src/href attributes? That way, when opening one of the .html files locally, the browser would look up the appropriate local copies of the scripts and stylesheets, and load them.
I'm sure this is overly simplistic, and problems will arise as you go, specifically with assets that are loaded on the fly. But do you see a problem with the initial logic?
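As a rough illustration of that idea, something like the following might work (only a sketch, assuming bash, and assuming the asset paths in the HTML are root-relative; the example file name is hypothetical):

# Sketch: pull the asset paths referenced by a saved page and mirror them locally,
# keeping the same folder structure so relative links resolve offline.
page="posts/123456.html"   # hypothetical saved post
grep -oE '(src|href)="/[^"]+\.(css|js|png|jpg|gif)"' "$page" \
    | sed -E 's/^(src|href)="//; s/"$//' \
    | sort -u \
    | while read -r path; do
        mkdir -p ".$(dirname "$path")"
        [ -f ".$path" ] || wget -q -O ".$path" "https://voat.co$path"
    done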
Edit: Have you tried
wget -r -nc https://voat.co/v/pizzagate/new?page={1..19}
? Theoretically, this should download a page recursively (with all its assets), and prevent wget from overwriting a file if it already exists. Now, there seems to be a question as to whether wget will actually prevent the superfluous HTTP request from happening if a given file already exists, or it will carry out the request nonetheless and simply not do anything with the downloaded file if it is a duplicate. The latter behavior would, of course, result in a lot of unwanted traffic, but if the former is the case then this could be a good starting point.
gittttttttttttttttt ago
Thanks for the pointers. Will give this a go in a little and test it out.
Sonic_fan1 ago
If anyone wants to try this on their end, another interesting one to try is WebHTTrack... it can be set to recursively follow links, it'll handle CSS and all that (at least, it used to... haven't used it for a while), and can be set to follow links however many levels deep you want. I've used it to fully download a friend's website, and it'll even change links from absolute (http://me.com/img/1.jpg) to relative to the folder structure (foldername/img/1.jpg), and it'll give you a browsable site. But what you have now is awesome! If you don't have to change it, don't. Everyone is right, having any sort of complete archive of all this is the biggest thing, even if someone who looks at it has to wade through a little HTML. Thumbs up
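For anyone who prefers the command line, the httrack CLI (which WebHTTrack wraps) can do roughly the same thing. A sketch, where the output folder, filter, and depth are only examples:

# Mirror the subverse a few levels deep; httrack rewrites links so the copy is browsable offline.
httrack "https://voat.co/v/pizzagate/" -O ./pizzagate-mirror "+voat.co/*" -r3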
Also, I don't know if it's possible... maybe have the bot just sit and monitor the site for any time something changes, because if everything after page 19 is disabled and we have a busy day around here, important stuff might get bumped off by newer stuff before the daily run. Maybe have the bot compare the front page of 'New' to the last archived page of 'New' (dates, or maybe thread titles), and if there's any difference, have it just slurp down the newest posts or pages.
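Something along these lines might work as a polling check (just a sketch, assuming the bot already keeps the master-post-ids.txt described above, and that the listing page can be grepped for post IDs; the URL/ID pattern is an assumption):

# Poll the front page of /new and fetch only posts we have not seen before.
curl -s "https://voat.co/v/pizzagate/new?page=1" \
    | grep -oE '/v/pizzagate/comments/[0-9]+' \
    | grep -oE '[0-9]+$' \
    | sort -u > /tmp/front-page-ids.txt

# Anything not already in master-post-ids.txt is new; grab it right away.
comm -23 /tmp/front-page-ids.txt <(sort -u master-post-ids.txt) \
    | while read -r id; do
        wget -q -O "posts/$id.html" "https://voat.co/v/pizzagate/comments/$id"
        echo "$id" >> master-post-ids.txt
    done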
And, this would be a great use of that old DLT4 tape drive someone has sitting around (which reminds me, I should get that tower from ma's at some point)... 800 gigs on a single tape, as long as I could get Windows to recognize it... nightly or weekly backups of the GitHub repo. And I know it's possible to get a tape drive working under Win10 (I have a Travan 10/20 that works for backup, uses ZDatDump Free... the software limits you to 12 gigs backed up without paying, but it works).