Germ22

So how exactly could I download ALL of Voat? And how much data would it be? And could the result still be used like it was a website, just on my PC?

hang_em_high

Any ballpark idea on how much space that would take up? Not sure I have the HDD space even if I knew how to do it.

thor7

Probably a few gigabytes if you're scraping it in an optimized/efficient way. As in, parsing the data and feeding it into your own SQL database. You'd essentially be rebuilding the backend database through scraping. It would put a fair load on the server, but you'd only have to do it once, and you could then share it with whoever. Even in the event of complete noncompliance/non-help from Voat admins, it would absolutely work. If it wasn't such short notice, this is something I would've happily built.
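The core loop is just "fetch a page, parse it, insert rows." A minimal sketch using the requests and beautifulsoup4 libraries (the URL, CSS selectors, and table layout below are made-up placeholders, not Voat's real markup):

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    # Minimal "fetch page -> parse -> insert rows" sketch. The selectors and URL
    # are placeholders; you'd swap in whatever the real pages actually use.
    db = sqlite3.connect("voat_demo.db")
    db.execute("CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, author TEXT, body TEXT)")

    def archive_thread(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for node in soup.select("div.comment"):              # hypothetical selector
            db.execute(
                "INSERT OR IGNORE INTO comments (id, author, body) VALUES (?, ?, ?)",
                (node.get("data-id"),
                 node.select_one(".author").get_text(strip=True),
                 node.select_one(".body").get_text(strip=True)),
            )
        db.commit()

    archive_thread("https://voat.co/v/some_subverse/1234567")   # placeholder URL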

hang_em_high

Great info. Sounds like it would have been a fun project. I figure if I have to ask, 3 days probably isn't going to be enough. Maybe at the next place I go I can build something like this. It would be really cool to be able to archive only the subs I care about. What sort of database pattern would you use for forums? Some sort of star schema with users, forums, submissions, etc. tables?

thor7

I'd do something like this: https://files.catbox.moe/gc6wpn.png It may not be 100% correct, as I'm running on five hours and change of sleep, but it should definitely point you in the right direction. There are only a handful of values you need to account for.
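In case that image link dies: the gist is a handful of tables along these lines (a rough SQLite sketch; the exact tables and columns in the diagram may differ):

    import sqlite3

    # Rough guess at a forum-archive schema; names and columns are illustrative,
    # not necessarily identical to the diagram linked above.
    schema = """
    CREATE TABLE IF NOT EXISTS users       (username TEXT PRIMARY KEY, parsed INTEGER DEFAULT 0);
    CREATE TABLE IF NOT EXISTS subverses   (name     TEXT PRIMARY KEY, parsed INTEGER DEFAULT 0);
    CREATE TABLE IF NOT EXISTS submissions (
        id       TEXT PRIMARY KEY,
        subverse TEXT REFERENCES subverses(name),
        author   TEXT REFERENCES users(username),
        title    TEXT,
        body     TEXT,
        created  TEXT,
        parsed   INTEGER DEFAULT 0
    );
    CREATE TABLE IF NOT EXISTS comments (
        id            TEXT PRIMARY KEY,
        submission_id TEXT REFERENCES submissions(id),
        parent_id     TEXT,                 -- NULL for top-level comments
        author        TEXT REFERENCES users(username),
        body          TEXT,
        created       TEXT
    );
    """
    sqlite3.connect("voat_archive.db").executescript(schema)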

The structure would be pretty simple. You'd just have the scraper adding the usernames/threads/subverses it sees into a queue (another table), then checking whether the thread/user/subverse in question has already been parsed (which could be a boolean flag on each of those tables). If this wasn't just a one-time tool and you wanted to archive a site over time, you could add other attributes like datetime_checked to prioritize older threads, up until a certain age when they're archived and "locked in." Other things can be added as well, depending on the size of the site and how ambitious you want to be. Anyway, I'm not trying to ramble, just to show the layers of sophistication you could bolt on if you want it to be increasingly "smart" in how it runs.
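In code, that queue logic might look something like this (a minimal sketch assuming the tables above plus a queue table; scrape_item is a stub standing in for the real fetching/parsing):

    import sqlite3

    db = sqlite3.connect("voat_archive.db")
    db.execute("""CREATE TABLE IF NOT EXISTS queue (
        kind   TEXT,                    -- 'user', 'thread', or 'subverse'
        key    TEXT,                    -- username, thread id, or subverse name
        parsed INTEGER DEFAULT 0,
        PRIMARY KEY (kind, key)
    )""")

    def enqueue(kind, key):
        # INSERT OR IGNORE makes re-seeing an item a no-op, so nothing gets parsed twice.
        db.execute("INSERT OR IGNORE INTO queue (kind, key) VALUES (?, ?)", (kind, key))

    def scrape_item(kind, key):
        # Stub: fetch and parse the page for this item, store its data, and
        # return any (kind, key) pairs it mentions so they can be queued too.
        return []

    def run():
        while True:
            row = db.execute("SELECT kind, key FROM queue WHERE parsed = 0 LIMIT 1").fetchone()
            if row is None:
                break                                    # queue drained, we're done
            kind, key = row
            for new_kind, new_key in scrape_item(kind, key):
                enqueue(new_kind, new_key)
            db.execute("UPDATE queue SET parsed = 1 WHERE kind = ? AND key = ?", (kind, key))
            db.commit()

    enqueue("subverse", "technology")    # seed the queue with a starting point
    run()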

hang_em_high

Thanks man. That's good info. I'm more front-end, so database stuff is out of my comfort zone. I did a scraping project a few years back and they would block me when I requested too fast. I think they did an IP ban after repeated blocks as well. Any idea how to get around something like that?

thor7

It really comes down to the level of protection a site has, ranging from nothing at all to 10/10 (Google, for instance). The simplest way a site detects bot traffic is by noticing someone slamming the server with requests. The easiest countermeasure is to space out your requests, perhaps sleeping a few seconds between them. More sophisticated detectors can see through that, though, so you'll want to randomize your activity, add random "breaks," and not run it 24 hours straight. Basically, imagine how a normal human would use the website, and try to fit within that. All of this can easily be done in a language such as Python.
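A sketch of that pacing idea (the delay values are arbitrary, just to show the shape):

    import random
    import time
    import requests

    def polite_get(url):
        # Jittered gap between requests, plus an occasional longer "break",
        # so the traffic isn't machine-regular. Numbers are arbitrary examples.
        time.sleep(random.uniform(3, 10))
        if random.random() < 0.02:
            time.sleep(random.uniform(60, 300))
        return requests.get(url, timeout=30)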

They can also check your user agent, which reports browser info and can quickly give away whether you're a bot or a real user. But that too can be spoofed. Finally (at the far end of sophistication), they can block IPs that are known datacenter or proxy IPs. This is why residential IP proxies are big money: you can do whatever you want without the site suspecting you're a bot (assuming you're not doing the stand-out stuff I described above).
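For example, with the requests library (the user-agent string and proxy address are placeholders, not anything real):

    import requests

    session = requests.Session()
    session.headers.update({
        # Pretend to be a normal desktop browser instead of python-requests.
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
    })
    # Optional: route traffic through a (residential) proxy instead of your own IP.
    # The address below is a placeholder; uncomment and fill in a real one to use it.
    # session.proxies.update({
    #     "http":  "http://user:pass@proxy.example.com:8080",
    #     "https": "http://user:pass@proxy.example.com:8080",
    # })

    response = session.get("https://example.com/")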

But yeah... just start by spacing out your requests by a few seconds and see if that works. Test this on some public Wi-Fi network like a library, a university, or a Dunkin' Donuts; no need to burn out your home IP while figuring this stuff out. I made an automated, distributed Craigslist scraper a while back that would refresh a few pages and instantly alert me if there were new posts containing certain keywords, so when flipping stuff I could be the first person to contact the seller without spending all day searching at the computer. Their anti-bot protection is among the best, so it took a lot of figuring things out and overengineering to make sure none of the nodes got blocked (and they never did). Look into what I talked about; there are some good resources and articles online.
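A bare-bones sketch of that "refresh a page and alert on keyword matches" idea, nothing like the real distributed setup (placeholder URL and keywords):

    import time
    import requests

    KEYWORDS = ["free", "thinkpad"]          # whatever you're hunting for
    seen = set()

    while True:
        # Placeholder URL; the real thing would hit the actual listings pages.
        page = requests.get("https://example.org/listings", timeout=30).text
        for line in page.splitlines():
            if any(k in line.lower() for k in KEYWORDS) and line not in seen:
                seen.add(line)
                print("New match:", line.strip())    # swap in an email/push alert here
        time.sleep(120)                              # check every couple of minutes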

hang_em_high

Thanks for all the information. Kind of fired up to work on this. Maybe trying it on Poal or Ruffus could be a decent idea.