Hello people. I'm the person behind the GitHub repos that were originally on the reddit collab tools list and are now on the sidebar, and that have been getting a lot of downloads/views. I ran into an issue with saving Voat posts on this subverse since there are so many coming in daily. I noticed that as a community we fell behind on archiving every post to archive.is; my guess is we were only archiving about 30% of them. The top posts were getting archived, but a few diamond-in-the-rough posts were falling through the cracks.
Another big issue I ran into when trying to retrieve older posts is that the Voat admins recently disabled pagination past page 19. There is a lot of talk about it on /v/voatdev/ and it may get restored. The API is also not ready for production use, so I was not able to get a key. I am working with one of the people on /v/voatdev/ to get a full backup of the older posts, so that we can be sure 100% of the data is backed up around the world and on multiple sites.
The bot runs on a cron job: every day it goes through pages 1-19 of /new and makes a folder for that day. Every HTML page is downloaded with wget and saved under its post ID in that day's posts folder. Each day's folder also contains a file called ids.txt listing the unique post IDs. Each post is also automatically submitted to archive.is through a POST request. Once done, the bot pushes everything to the git repos. A rough sketch of the daily job is below.
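For anyone curious, here is roughly what that cron job boils down to. This is a simplified sketch, not the bot's actual code: the voat.co URL patterns, the grep-based ID scraping, and the archive.is submit endpoint/field name are my assumptions.
#!/bin/sh
# Simplified sketch of the daily backup job (not the bot's exact code).
DAY=$(date +%Y-%m-%d)
mkdir -p "$DAY/posts"
# Walk pages 1-19 of /v/pizzagate/new and scrape the post IDs out of the HTML.
for PAGE in $(seq 1 19); do
  wget -qO- "https://voat.co/v/pizzagate/new?page=$PAGE" \
    | grep -oE '/v/pizzagate/[0-9]+' | grep -oE '[0-9]+$' >> "$DAY/ids.txt"
done
sort -un "$DAY/ids.txt" -o "$DAY/ids.txt"
# Save each post's HTML under its post ID and submit it to archive.is.
while read -r ID; do
  wget -q "https://voat.co/v/pizzagate/$ID" -O "$DAY/posts/$ID"
  curl -s -d "url=https://voat.co/v/pizzagate/$ID" "https://archive.is/submit/" > /dev/null
done < "$DAY/ids.txt"
# Push the day's folder to the repos once done.
git add "$DAY" && git commit -m "backup $DAY" && git push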
One thing I discovered last week about http://archive.is/https://voat.co/v/pizzagate/* is that they also have pagination issues. If someone could send an email about this to [email protected] I would really appreciate it. Make sure to post below that you sent an email so they don't receive multiple copies. We should request the ability to view all of the archives; being capped at 950-1000 is not enough. The good news is that posts are still archived even when they don't show up in the pagination (I checked with a few older posts). As long as we have all the post IDs we can easily backtrack. I am going to try to create a master-post-ids.txt file in the main folder of the repo that will contain every post ID ever submitted here. I brought this up just so you are all aware.
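Given the per-day ids.txt layout described above, building that master list should be as simple as this (a sketch, run from the repo root):
cat */ids.txt | sort -un > master-post-ids.txt   # merge every daily ID list, numeric sort, drop duplicates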
NOTE: PLEASE STILL USE ARCHIVE.IS, BECAUSE WE NEED TO BACK UP POSTS WITH MULTIPLE SNAPSHOTS AS PEOPLE ADD COMMENTS, DELETE COMMENTS, ETC. THE BOT WON'T BE ABLE TO GET THE NEWEST ACTIVITY, SO PLEASE KEEP ARCHIVING WHEN POSTS GET COMMENTS. ALSO KEEP SAVING POSTS LOCALLY. DO NOT JUST RELY ON ME AND MY BOT.
Here are the repos:
https://github.com/pizzascraper/pizzagate-voat-backup
https://gitlab.com/pizzascraper/pizzagate-scraper
TO DO: Figure out CSS/JS/IMG assets. Viewing the HTML posts locally currently doesn't load any stylesheets/scripts/images, since the URLs in the HTML files are not absolute, so the pages look pretty plain. This is not critical and can always be fixed later; what matters is preserving the data. If you have an idea on how to fix this, please file an issue or comment here. Also, if you have any suggestions or ideas on how to improve this, please let me know. I really appreciate all the help I can get. Two untested ideas are sketched below.
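Two untested sketches (reusing the $DAY/$ID names from the cron sketch above): either let wget fetch the page assets and rewrite the links itself, or inject a <base> tag into each saved file so relative URLs resolve against voat.co:
# Option 1: download page requisites and rewrite links for offline viewing
wget -E -H -k -p "https://voat.co/v/pizzagate/$ID" -P "$DAY/posts"
# Option 2: add a <base href> so the browser resolves relative URLs against voat.co
sed -i 's|<head>|<head><base href="https://voat.co/">|' "$DAY/posts/$ID"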
They can be cloned:
git clone https://github.com/pizzascraper/pizzagate-voat-backup.git
or
git clone https://gitlab.com/pizzascraper/pizzagate-scraper.git
Non-technical users can download a zip by going to https://github.com/pizzascraper/pizzagate-voat-backup/archive/master.zip.
IWishIWasFoxMulder ago
Everything you've just done is going to make it at least 100 times more difficult to take down this site and this subverse. You are weaponized autism at its finest and you inspire me to be a better autist, so thank you. This is what web and Silicon Valley people are talking about when they say redundancy in the truest sense. Is there any way for you to back up videos? I'm trying to figure out a way we could back up James Alefantis' short film from Sundance, which appears to be on Vimeo at the moment.
gittttttttttttttttt ago
Haha thanks :)
The best way to back up videos, and I encourage you to do so, is a command-line tool called youtube-dl. There are ways to download many videos at once by specifying a keyword (check the readme for all the options), and it can download from Vimeo and many other sites, not just YouTube. If you have the bandwidth and extra space on your machine, this would be a great idea. Then what I would recommend is finding some smaller video sites with less security/bot checking built in and figuring out how to send POST requests, or creating a bot/macro to upload all the videos as mirrors. If this sounds doable to you and you just need a little help getting it going, DM me. A couple of starter commands are below.
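For example (the Vimeo URL is a placeholder, not the actual film):
# Grab a single video, keep the title in the filename, and save its metadata too
youtube-dl -o '%(title)s.%(ext)s' --write-info-json 'https://vimeo.com/VIDEO_ID'
# Keyword downloads work through search prefixes on supported sites, e.g. the
# first 10 YouTube results for a query:
youtube-dl 'ytsearch10:pizzagate'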