It's not easy to explore the 33,000+ Clinton emails at https://file.wikileaks.org/file/clinton-emails/
The file dates are set to default values (1970 or 1984), so it's hard to tell whether anything is new; see for example here.
These emails will probably come back into the spotlight soon. There are probably no newly released emails among the WikiLeaks files, but it's surely a good idea to check that. And maybe not everyone has read all the emails so far.
Sometimes their search site https://wikileaks.org/clinton-emails/ helps. But I prefer to have the files on my own system.
So I downloaded all files in that folder using the "Cyotek WebCopy" downloader. That took about 25 hours. Then I zipped the files and uploaded them as 5 parts. You can unpack all 5 zip files into one "clinton-emails" folder. The unpacked size is about 800 MB.
Part 1: https://files.catbox.moe/l9enpi.zip - Content
Part 2: https://files.catbox.moe/mjxapq.zip - Content
Part 3: https://files.catbox.moe/u0sb3k.zip - Content
Part 4: https://files.catbox.moe/ezw5fi.zip - Content
Part 5: https://files.catbox.moe/zwt9st.zip - Content
It would be helpful if someone can mirror the files somewhere else (like Mega.nz).
The size and hash values for the 5 zip files:

Part 1: 156769570 Bytes (149 MiB)
SHA256: 2F043544CC2480DBAA585441ED77E111282FA130FECD8621DEFE6EC6068E6528
MD5: 9e1ee1d161e6e67ee56d2b0c9ceed1c6

Part 2: 159475607 Bytes (152 MiB)
SHA256: 798FA9E04B899DC315024340A6B95850B7088ED5599F4180D49B5DA99907AE25
MD5: ea174cedbf2e7cd44234b9ca2dcabdbd

Part 3: 183435301 Bytes (174 MiB)
SHA256: E7293531A7E53AA7C24C6D7971064A466C20EC70AE736FADA6CA1EE5C9C621D7
MD5: 2af7c385f7f276bd04ca7b62576ef443

Part 4: 169683520 Bytes (161 MiB)
SHA256: C4F465CC096CA30D64B600FC215120EDEA6BC78092847F70C1F92490EAD6FEDD
MD5: 6a415e7063dae65e919682038e5cd6f0

Part 5: 83217801 Bytes (79 MiB)
SHA256: 211E1970E210658F507AF8A9D4050E301840D061219C4F1F5D1C1CE4F692ADE1
MD5: dc1d04c49e4c25e8a18bae1102f29b22
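If you want to check a download against the hashes above before unpacking, a few lines of Python will do it (the local file name here is just an example; use whatever name your download saved as):

```python
# Compute the SHA-256 of a downloaded zip so it can be compared against
# the published hash. Reads in 1 MiB chunks to handle large files.
import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# Example for Part 1 (assumes the file was saved as l9enpi.zip):
# assert sha256_of("l9enpi.zip") == "2F043544CC2480DBAA585441ED77E111282FA130FECD8621DEFE6EC6068E6528"
```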
murface ago
Has anyone heard of language analysis being done on the archives?
Even simple work, like counting words and phrases? This could help consolidate the information, letting one quickly find related content and the rare occurrences where more extraordinary information might be found.
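The simple word-counting idea could look something like this, a minimal sketch assuming the emails sit as text/HTML files in a local "clinton-emails" folder (folder name and file layout are assumptions):

```python
# Count word frequencies across every file in a folder tree.
# Rare words then point at unusual content worth reading.
import collections
import pathlib
import re

def count_words(folder):
    """Return a Counter of lowercase words (3+ letters) over all files."""
    counts = collections.Counter()
    for path in pathlib.Path(folder).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        counts.update(re.findall(r"[a-z']{3,}", text))
    return counts

# Usage: show the 20 most common words.
# for word, n in count_words("clinton-emails").most_common(20):
#     print(f"{n:8d}  {word}")
```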
satisfyinghump ago
This is a great idea!
One that I've played around with is using word clouds on various WikiLeaks emails.
A different approach I've been attempting is comparing writing samples, based on a DEF CON presentation I came across a while back.
A separate project has been to automate a process/script that does the following:
1. Sample the current news being published by a list of websites.
   a) Include Twitter/Facebook/etc. sources for trends.
   b) Provide a method of supplying a word/phrase as input.
2. Take the sampled word/phrase and run it through all news sites, producing a list of those that have used it (look for natural vs. created trends, i.e. fake and planned news).
3. Take the same word/phrase and feed it into a search/query that crawls through all of the WikiLeaks emails, and other provided 'databases' of leaked files/emails/documents, etc.
4. Create a list of emails/documents that mention the word/phrase sampled in (1), and include in that list, if applicable, the from/to/people involved with those emails/documents, including those mentioned within them.
Any criticism would be appreciated. Going to attempt it first with Python.
If you have any suggestions, I'd welcome them!
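Since you mention Python: the email-search half of that pipeline could start as simply as this. It's a sketch under assumptions, i.e. that the archive is a local folder of text/HTML files and that sender/recipient appear as "From:"/"To:" lines; the WikiLeaks pages are HTML, so a real version would need proper parsing:

```python
# Find every file in a local archive that mentions a phrase, and pull out
# From:/To: header lines where present (header format is an assumption).
import pathlib
import re

def find_phrase(folder, phrase):
    """Return a list of (file, from, to) for files containing the phrase."""
    hits = []
    needle = phrase.lower()
    for path in pathlib.Path(folder).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            sender = re.search(r"^From:\s*(.+)$", text, re.M | re.I)
            rcpt = re.search(r"^To:\s*(.+)$", text, re.M | re.I)
            hits.append((str(path),
                         sender.group(1).strip() if sender else None,
                         rcpt.group(1).strip() if rcpt else None))
    return hits
```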
murface ago
That sounds awesome! It's a bit above my head to implement myself. But on the performance side, Perl handles strings notably faster than Python and may speed up some tasks.
Decentralized work may be useful too. Use something like a redis/mongo instance to index findings and pointers. This can let you have multiple small workers filling in the data.
Doing something like this in the cloud can then be pretty easy and cost-effective: grow with more workers while ingesting, then shrink once the majority of the work is done to keep operational costs down.
Document linking:
Perhaps each email/doc/article gets a UUID, and a redis sorted set keyed by a word/phrase could hold those UUIDs. The UUIDs can be scored by relevance and then re-linked to the actual source.
This should provide a rapid lookup which can be indexed in a multi-worker environment.
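A rough sketch of that linking scheme, using plain Python structures to stand in for redis (a dict of {phrase: {uuid: score}} plays the role of one sorted set per phrase; the class and method names are made up for illustration):

```python
# Phrase -> scored-UUID index, mirroring redis ZADD / ZREVRANGE semantics.
# Multiple workers could write to the same index keys independently.
import uuid

class PhraseIndex:
    def __init__(self):
        self.sources = {}   # uuid -> original document path/URL
        self.index = {}     # phrase -> {uuid: relevance score}

    def add(self, phrase, source, score):
        """Register a document under a phrase (like: ZADD phrase score uuid)."""
        doc_id = str(uuid.uuid4())
        self.sources[doc_id] = source
        self.index.setdefault(phrase, {})[doc_id] = score
        return doc_id

    def lookup(self, phrase, n=10):
        """Top-n sources by relevance (like: ZREVRANGE phrase 0 n-1)."""
        members = self.index.get(phrase, {})
        ranked = sorted(members.items(), key=lambda kv: kv[1], reverse=True)
        return [(self.sources[d], s) for d, s in ranked[:n]]
```

With real redis the same pattern maps onto ZADD for writes and ZREVRANGE for ranked reads, plus a hash mapping each UUID back to its source.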