Hello people. I am the person behind the GitHub repos that were originally on the Reddit collab tools and are now on the sidebar, and that have been getting a lot of downloads/views. I was running into an issue with saving Voat posts on this subverse since there are so many coming in daily. I noticed that as a community we fell behind on archiving every post to archive.is; my guess is we were only archiving about 30% of them. The top posts were getting archived, but a few diamond-in-the-rough posts fell through the cracks.

Another big issue I ran into when trying to retrieve older posts is that the Voat admins recently disabled pagination past page 19. There is a lot of talk about it on /v/voatdev/ and it may get restored. The API is also not ready for production use, so I was not able to get a key. I am also working with one of the people on /v/voatdev/ to get a full backup of the older posts, so that we will be sure to have 100% of the data backed up all over the world and on multiple sites.
The bot will go through pages 1-19 of /new every day on a cron job and make a folder for that day, then push to the git repos once done. Every HTML page will be downloaded with wget and saved under its post ID in that day's posts folder. Each day folder also contains a file called ids.txt with that day's unique post IDs. Each post will also be automatically submitted to archive.is through a POST request.
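For anyone curious, the daily job boils down to something like the sketch below. The date-based folder naming, the grep pattern for pulling IDs out of Voat's listing markup, and the archive.is /submit/ endpoint with its "url" form field are illustrative assumptions on my part, not exact production details:

#!/bin/bash
# Sketch of the daily backup job (run from cron inside the repo).
# Folder naming, the ID grep pattern, and the archive.is endpoint
# are assumptions for illustration.

DAY=$(date +%Y-%m-%d)
mkdir -p "$DAY/posts"

# Collect post IDs from pages 1-19 of /new.
for PAGE in $(seq 1 19); do
    curl -s "https://voat.co/v/pizzagate/new?page=$PAGE" \
        | grep -oP '/v/pizzagate/\K[0-9]+' \
        >> "$DAY/ids.txt"
done
# Deduplicate the IDs in place.
sort -un -o "$DAY/ids.txt" "$DAY/ids.txt"

# Download each post's HTML and submit it to archive.is.
while read -r ID; do
    wget -q -O "$DAY/posts/$ID" "https://voat.co/v/pizzagate/$ID"
    curl -s -d "url=https://voat.co/v/pizzagate/$ID" "https://archive.is/submit/" > /dev/null
done < "$DAY/ids.txt"

# Push the day's folder to both repos
# (assumes a second remote named "gitlab" for the GitLab mirror).
git add "$DAY"
git commit -m "backup $DAY"
git push origin master
git push gitlab master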
One thing I discovered last week about http://archive.is/https://voat.co/v/pizzagate/* is that they also have pagination issues. If someone could send an email about this issue to [email protected] I would really appreciate it. Make sure to post below that you sent an email so the person does not receive multiple. We should request to be able to view all of them, since 950-1000 is not enough. The good thing, though, is that the posts are still archived even when they do not show up in the pagination (I checked with a few older posts). As long as we have all the post IDs we can easily backtrack. I am going to try to create a master-post-ids.txt file in the main folder of the repo that will have every post ID ever posted here. I brought this up just so you are all aware.
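Building that master list should be simple once the per-day folders exist; something along these lines, assuming the day folders sit at the top of the repo as described above:

# Merge every per-day ids.txt into one deduplicated master list.
cat */ids.txt | sort -un > master-post-ids.txt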
NOTE: PLEASE STILL USE ARCHIVE.IS, BECAUSE WE NEED TO BACK UP POSTS WITH MULTIPLE SCREENSHOTS AS PEOPLE ADD COMMENTS, DELETE COMMENTS, ETC. THE BOT WON'T BE ABLE TO GET THE NEWEST ACTIVITY, SO PLEASE KEEP ARCHIVING WHEN POSTS GET COMMENTS. ALSO KEEP SAVING POSTS LOCALLY. DO NOT JUST RELY ON ME AND MY BOT.
Here are the repos:
https://github.com/pizzascraper/pizzagate-voat-backup
https://gitlab.com/pizzascraper/pizzagate-scraper
TO DO: Need to figure out CSS/JS/IMG assets. Viewing the HTML posts locally currently does not load any stylesheets/scripts/images, since the URLs in the HTML files are not absolute, so the pages look pretty plain. This is not critical and can always be fixed later; what is important is preserving the data. If you have an idea on how to fix this, please file an issue or comment here (one possible approach is sketched below). Also, if you have any suggestions or ideas on how to improve this, please let me know. I really appreciate all the help I can get.
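One option I have not tested yet is letting wget grab the assets and rewrite the links itself; whether --convert-links plays nicely with Voat's markup is an open question:

# Untested idea: have wget fetch page requisites (CSS/JS/images)
# and rewrite links to relative local paths so the page renders offline.
# wget recreates the host's directory structure under the -P prefix.
wget -q --page-requisites --convert-links --adjust-extension \
    -P "$DAY/posts/$ID" "https://voat.co/v/pizzagate/$ID"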
They can be cloned:
git clone https://github.com/pizzascraper/pizzagate-voat-backup.git
or
git clone https://gitlab.com/pizzascraper/pizzagate-scraper.git
Non-tech users can download the backup by going to https://github.com/pizzascraper/pizzagate-voat-backup/archive/master.zip.
soundbyyte ago
Would there be a way for you to send me the folders for each day, and then the older posts once you're able to archive those, just so I can skim through and see if anything external needs to be archived? As you've mentioned, a lot of the daily posts do fall through the cracks. Plus, if more people have the data, the more people we can share it with and the less likely it is for anyone to try to shut us down.
gittttttttttttttttt ago
Hey, all external links will be archived automatically. Just did 1600 super fast from today.
soundbyyte ago
Awesome, that's fantastic to hear!