Germ22

So how exactly could I download ALL of Voat? And how much data would it be? And could the result still be used like it was a website, just on my PC?

hang_em_high

Any ballpark idea on how much space that would take up? Not sure I have the HDD space even if I knew how to do it.

thor7

Probably a few gigabytes if you're scraping it in an optimized/efficient way. As in, parsing the data and feeding it into your own SQL database. You'd essentially be rebuilding the backend database through scraping. It would put a fair load on the server, but you'd only have to do it once, and you could then share it with whoever. Even in the event of complete noncompliance/non-help from Voat admins, it would absolutely work. If it wasn't such short notice, this is something I would've happily built.
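The core loop is just "fetch a page, parse it, insert rows." A minimal sketch using the requests and beautifulsoup4 libraries (the URL, CSS selectors, and table layout below are made-up placeholders, not Voat's real markup):

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    # Minimal "fetch page -> parse -> insert rows" sketch. The selectors and URL
    # are placeholders; you'd swap in whatever the real pages actually use.
    db = sqlite3.connect("voat_demo.db")
    db.execute("CREATE TABLE IF NOT EXISTS comments (id TEXT PRIMARY KEY, author TEXT, body TEXT)")

    def archive_thread(url):
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for node in soup.select("div.comment"):              # hypothetical selector
            db.execute(
                "INSERT OR IGNORE INTO comments (id, author, body) VALUES (?, ?, ?)",
                (node.get("data-id"),
                 node.select_one(".author").get_text(strip=True),
                 node.select_one(".body").get_text(strip=True)),
            )
        db.commit()

    archive_thread("https://voat.co/v/some_subverse/1234567")   # placeholder URL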

hang_em_high

Great info. Sounds like it would have been a fun project. I figure if I have to ask, 3 days probably isn't going to be enough. Maybe at the next place I go I can build something like this. It would be really cool to be able to archive only the subs I care about. What sort of database pattern would you use for forums? Some sort of star schema with users, forums, submissions, etc. tables?

thor7

I'd do something like this: https://files.catbox.moe/gc6wpn.png It may not be 100% correct, as I'm running on five hours and change of sleep, but it should definitely point you in the right direction. There are only a handful of values you need to account for.
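In case that image link dies: the gist is a handful of tables along these lines (a rough SQLite sketch; the exact tables and columns in the diagram may differ):

    import sqlite3

    # Rough guess at a forum-archive schema; names and columns are illustrative,
    # not necessarily identical to the diagram linked above.
    schema = """
    CREATE TABLE IF NOT EXISTS users       (username TEXT PRIMARY KEY, parsed INTEGER DEFAULT 0);
    CREATE TABLE IF NOT EXISTS subverses   (name     TEXT PRIMARY KEY, parsed INTEGER DEFAULT 0);
    CREATE TABLE IF NOT EXISTS submissions (
        id       TEXT PRIMARY KEY,
        subverse TEXT REFERENCES subverses(name),
        author   TEXT REFERENCES users(username),
        title    TEXT,
        body     TEXT,
        created  TEXT,
        parsed   INTEGER DEFAULT 0
    );
    CREATE TABLE IF NOT EXISTS comments (
        id            TEXT PRIMARY KEY,
        submission_id TEXT REFERENCES submissions(id),
        parent_id     TEXT,                 -- NULL for top-level comments
        author        TEXT REFERENCES users(username),
        body          TEXT,
        created       TEXT
    );
    """
    sqlite3.connect("voat_archive.db").executescript(schema)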

The structure would be pretty simple. You'd just have the scraper adding the usernames/threads/subverses it sees into a queue (another table), then checking whether the thread/user/subverse in question has already been parsed (which could be a boolean flag on each of those tables). If this wasn't just a one-time tool and you wanted to archive a site over time, you could add other attributes like datetime_checked to prioritize older threads, up until a certain age when they're archived and "locked in." Other things can be added as well, depending on the size of the site and how ambitious you want to be. Anyway, I'm not trying to ramble, just to show the layers of sophistication you could bolt on if you want it to be increasingly "smart" in how it runs.
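In code, that queue logic might look something like this (a minimal sketch assuming the tables above plus a queue table; scrape_item is a stub standing in for the real fetching/parsing):

    import sqlite3

    db = sqlite3.connect("voat_archive.db")
    db.execute("""CREATE TABLE IF NOT EXISTS queue (
        kind   TEXT,                    -- 'user', 'thread', or 'subverse'
        key    TEXT,                    -- username, thread id, or subverse name
        parsed INTEGER DEFAULT 0,
        PRIMARY KEY (kind, key)
    )""")

    def enqueue(kind, key):
        # INSERT OR IGNORE makes re-seeing an item a no-op, so nothing gets parsed twice.
        db.execute("INSERT OR IGNORE INTO queue (kind, key) VALUES (?, ?)", (kind, key))

    def scrape_item(kind, key):
        # Stub: fetch and parse the page for this item, store its data, and
        # return any (kind, key) pairs it mentions so they can be queued too.
        return []

    def run():
        while True:
            row = db.execute("SELECT kind, key FROM queue WHERE parsed = 0 LIMIT 1").fetchone()
            if row is None:
                break                                    # queue drained, we're done
            kind, key = row
            for new_kind, new_key in scrape_item(kind, key):
                enqueue(new_kind, new_key)
            db.execute("UPDATE queue SET parsed = 1 WHERE kind = ? AND key = ?", (kind, key))
            db.commit()

    enqueue("subverse", "technology")    # seed the queue with a starting point
    run()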

hang_em_high

Thanks man. That's good info. I'm more front-end, so database stuff is out of my comfort zone. I did a scraping project a few years back and they would block me when I requested too fast. I think they did an IP ban after repeated blocks as well. Any idea how to get around something like that?

thor7

It really comes down to the level of protection a site has, ranging from nothing at all to 10/10 (Google, for instance). The simplest way a site detects bot traffic is by noticing someone slamming the server with requests. The easiest countermeasure is to space out your requests, perhaps sleeping a few seconds between them. More sophisticated detectors can see through that, though, so you'll want to randomize your activity, add random "breaks," and not run it 24 hours straight. Basically, imagine how a normal human would use the website, and try to fit within that. All of this can easily be done in a language such as Python.
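A sketch of that pacing idea (the delay values are arbitrary, just to show the shape):

    import random
    import time
    import requests

    def polite_get(url):
        # Jittered gap between requests, plus an occasional longer "break",
        # so the traffic isn't machine-regular. Numbers are arbitrary examples.
        time.sleep(random.uniform(3, 10))
        if random.random() < 0.02:
            time.sleep(random.uniform(60, 300))
        return requests.get(url, timeout=30)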

They can also check your user agent, which reports browser info and can quickly give away whether you're a bot or a real user. But that too can be spoofed. Finally (at the far end of sophistication), they can block IPs that are known datacenter or proxy IPs. This is why residential IP proxies are big money: you can do whatever you want without the site suspecting you're a bot (assuming you're not doing the stand-out stuff I described above).
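For example, with the requests library (the user-agent string and proxy address are placeholders, not anything real):

    import requests

    session = requests.Session()
    session.headers.update({
        # Pretend to be a normal desktop browser instead of python-requests.
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0.0.0 Safari/537.36"),
    })
    # Optional: route traffic through a (residential) proxy instead of your own IP.
    # The address below is a placeholder; uncomment and fill in a real one to use it.
    # session.proxies.update({
    #     "http":  "http://user:pass@proxy.example.com:8080",
    #     "https": "http://user:pass@proxy.example.com:8080",
    # })

    response = session.get("https://example.com/")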

But yeah... just start by spacing out your requests by a few seconds and see if that works. Test this on some public Wi-Fi network like a library, a university, or a Dunkin' Donuts; no need to burn out your home IP while figuring this stuff out. I made an automated, distributed Craigslist scraper a while back that would refresh a few pages and instantly alert me if there were new posts containing certain keywords, so when flipping stuff I could be the first person to contact the seller without spending all day searching at the computer. Their anti-bot protection is among the best, so it took a lot of figuring things out and overengineering to make sure none of the nodes got blocked (and they never did). Look into what I talked about; there are some good resources and articles online.
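A bare-bones sketch of that "refresh a page and alert on keyword matches" idea, nothing like the real distributed setup (placeholder URL and keywords):

    import time
    import requests

    KEYWORDS = ["free", "thinkpad"]          # whatever you're hunting for
    seen = set()

    while True:
        # Placeholder URL; the real thing would hit the actual listings pages.
        page = requests.get("https://example.org/listings", timeout=30).text
        for line in page.splitlines():
            if any(k in line.lower() for k in KEYWORDS) and line not in seen:
                seen.add(line)
                print("New match:", line.strip())    # swap in an email/push alert here
        time.sleep(120)                              # check every couple of minutes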

hang_em_high

Thanks for all the information. Kind of fired up to work on this. Maybe trying it on Poal or Ruffus could be a decent idea.