Hey folks, just wanted to post an update on a couple of things you will see from PizzagateBot and get some feedback, since it's been active for 30 days now and I have some free time this week.
Improvements:
1.) Added a function to verify that a page found is actually valid by opening the page and finding the original link. Sometimes I would get results where the search was properly quoted but the quoting got ignored, so you may have clicked on a post link and not been able to find the link or how it was related.
2.) Added a function to get the post date and include it in the table column PostDate.
3.) Added a function to determine whether the link I found is from the OP or from a comment and, if it is from a comment, post a link to that comment under the table column LinkOrigin.
Here is an example for your reference - https://voat.co/v/pizzagate/1755477/8598437
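For anyone curious how checks like 1.) and 3.) could work, here is a rough Python sketch. It is not the bot's actual code. It assumes the OP body and each comment body have already been pulled out of the page as separate HTML fragments, and the names HrefCollector, normalize, contains_link and link_origin are made up for this illustration. The normalization rule (ignore the scheme, a leading "www." and a trailing slash) is also just an assumption about what should count as the same link.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(url):
    """Assumed rule: drop the scheme, a leading 'www.' and a trailing slash."""
    url = url.strip()
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    if url.startswith("www."):
        url = url[4:]
    return url.rstrip("/")


def contains_link(html_fragment, target):
    """Return True if the fragment has an <a href> that matches target."""
    collector = HrefCollector()
    collector.feed(html_fragment)
    wanted = normalize(target)
    return any(normalize(href) == wanted for href in collector.hrefs)


def link_origin(op_html, comments, target):
    """Return ("OP", None), ("comment", comment_id) or (None, None).

    comments is a list of (comment_id, comment_html) pairs.
    (None, None) means the link was not found, so the result is not valid.
    """
    if contains_link(op_html, target):
        return ("OP", None)
    for comment_id, comment_html in comments:
        if contains_link(comment_html, target):
            return ("comment", comment_id)
    return (None, None)
```

A result of (None, None) would be dropped as a bad search hit, "OP" would fill LinkOrigin directly, and a comment ID could be turned into a link to that comment.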
Feedback:
I appreciate any feedback, but I'm specifically looking for answers to these questions...
1.) I can make pizzagatebot do anything you can do with a web browser. Is there anything PG research related that you find yourself doing repetitively that you'd like to see automated?
2.) Is there anything you would consider broken at the moment?
3.) Anything you don't like in particular about the results? (if you find a bad result, please reply to the comment instead of just downvoating, I will look into it)
4.) What search engine do you use, besides Google, to research PG? I'm going to do a side-by-side comparison next.
Thanks and cheers!
anonOpenPress ago
1) Automated archiving & automated webpage screenshot
PizzagateBot ago
:)
I'm about 10k posts into a full archive of some 23k posts that I discovered by crawling ~400k possible ID numbers and checking whether each one is a post or not. I'll upload it somewhere once it's done, and maybe seed a torrent of it. It saves the CSS in each WARC, so in total it'll be bloated to something like 8GB.
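The ID crawl is roughly the idea below, as a minimal Python sketch and not the actual crawler. The URL pattern is only guessed from the example link in the OP, "HTTP 200 means it is a post" is an assumption, and fetch_status and find_posts are names invented for this example.

```python
import urllib.error
import urllib.request

# Assumed URL pattern, modeled on the example link in the OP; not confirmed.
URL_TEMPLATE = "https://voat.co/v/pizzagate/{post_id}"


def fetch_status(post_id, timeout=10):
    """Return the HTTP status code for a candidate ID (makes a network request)."""
    url = URL_TEMPLATE.format(post_id=post_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def find_posts(candidate_ids, get_status=fetch_status):
    """Keep the IDs whose page answers 200 (assumption: 200 means 'is a post')."""
    return [post_id for post_id in candidate_ids if get_status(post_id) == 200]
```

Passing get_status as a parameter makes it easy to dry-run the loop against a dict of known answers before pointing it at the real site, and a real crawl would also want a delay between requests.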
I am using the WARC proxy that webrecorder.io provides because it is the first proxy that archives everything, including javascript, so that you can click on and expand comments. I tried others that claim the same thing, but they never worked. Also, I like that webrecorder.io allows you to download the WARC, so I could include a WARC link in each post and, if you thought that post was interesting or worth saving, you could just click and download the WARC.
I use webarchiveplayer - https://github.com/ikreymer/webarchiveplayer - to replay WARC files like this - https://files.catbox.moe/vgd5z5.png. I can also combine multiple WARCs into a single index so that it is searchable by link, like this - https://files.catbox.moe/g8tiqd.png. To open a browser on the index of WARCs that I merged for testing, I just ran this command:
webarchiveplayer WARCMerge20170329215008930044.warc
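If you want to try merging yourself, one simple approach is sketched below in Python. This is an assumption on my part about a workable method, not a description of how the merged file above was produced. It relies on a WARC file being a plain sequence of records, so files that use the same compression (all .warc or all .warc.gz) can be appended byte for byte. merge_warcs and the file names are made up for the example.

```python
import shutil


def merge_warcs(sources, destination):
    """Append each source file to destination, byte for byte, in the given order.

    Assumes every source uses the same compression; do not mix .warc and .warc.gz.
    """
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return destination
```

Something like merge_warcs(["a.warc", "b.warc"], "merged.warc") would then give a single file to pass to webarchiveplayer.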