The Wikileaks Clinton emails (>33,000) as 5 zip files for easier download and offline research

The Wikileaks Clinton emails (>33,000) as 5 zip files for easier download and offline research (GreatAwakening)

submitted 4.8 years ago by suomy_the_nona

It's not so easy to explore the 33,000+ Clinton emails on https://file.wikileaks.org/file/clinton-emails/
The file dates are set to default values (1970 or 1984), so it's hard to tell whether anything new has been added; see for example here.

These emails will probably come back into the spotlight soon. There are probably no newly released emails among the Wikileaks files, but it is surely a good idea to check. And maybe not everyone has read all the emails so far.

Sometimes their search site https://wikileaks.org/clinton-emails/ helps. But I prefer to have the files on my own system.
So I downloaded all files of that folder using the "Cyotek WebCopy" downloader. That took about 25 hours. Then I zipped the files and uploaded them as 5 parts. You can unpack all 5 Zip files into one "clinton-emails" folder. The unpacked size is about 800 MB.

Part 1: https://files.catbox.moe/l9enpi.zip - Content

Part 2: https://files.catbox.moe/mjxapq.zip - Content

Part 3: https://files.catbox.moe/u0sb3k.zip - Content

Part 4: https://files.catbox.moe/ezw5fi.zip - Content

Part 5: https://files.catbox.moe/zwt9st.zip - Content

It would be helpful if someone can mirror the files somewhere else (like Mega.nz).

The size and hash values for the 5 zip files:

Part 1: 156769570 Bytes (149 MiB)

SHA256: 2F043544CC2480DBAA585441ED77E111282FA130FECD8621DEFE6EC6068E6528

MD5: 9e1ee1d161e6e67ee56d2b0c9ceed1c6

Part 2: 159475607 Bytes (152 MiB)

SHA256: 798FA9E04B899DC315024340A6B95850B7088ED5599F4180D49B5DA99907AE25

MD5: ea174cedbf2e7cd44234b9ca2dcabdbd

Part 3: 183435301 Bytes (174 MiB)

SHA256: E7293531A7E53AA7C24C6D7971064A466C20EC70AE736FADA6CA1EE5C9C621D7

MD5: 2af7c385f7f276bd04ca7b62576ef443

Part 4: 169683520 Bytes (161 MiB)

SHA256: C4F465CC096CA30D64B600FC215120EDEA6BC78092847F70C1F92490EAD6FEDD

MD5: 6a415e7063dae65e919682038e5cd6f0

Part 5: 83217801 Bytes (79 MiB)

SHA256: 211E1970E210658F507AF8A9D4050E301840D061219C4F1F5D1C1CE4F692ADE1

MD5: dc1d04c49e4c25e8a18bae1102f29b22
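
For anyone who wants to check their downloads against the values above, here is a minimal Python sketch. The sizes and SHA256 values are copied from the list; the local file names are an assumption (they are simply the names used in the catbox.moe links), and the helper names `sha256_of` and `check_file` are made up for this example.

```python
import hashlib
from pathlib import Path

# Expected size in bytes and SHA256 per part, copied from the list above.
# Assumption: the downloaded files keep their catbox.moe names.
EXPECTED = {
    "l9enpi.zip": (156769570, "2F043544CC2480DBAA585441ED77E111282FA130FECD8621DEFE6EC6068E6528"),
    "mjxapq.zip": (159475607, "798FA9E04B899DC315024340A6B95850B7088ED5599F4180D49B5DA99907AE25"),
    "u0sb3k.zip": (183435301, "E7293531A7E53AA7C24C6D7971064A466C20EC70AE736FADA6CA1EE5C9C621D7"),
    "ezw5fi.zip": (169683520, "C4F465CC096CA30D64B600FC215120EDEA6BC78092847F70C1F92490EAD6FEDD"),
    "zwt9st.zip": (83217801, "211E1970E210658F507AF8A9D4050E301840D061219C4F1F5D1C1CE4F692ADE1"),
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path, expected_size, expected_sha256):
    """Return 'ok', 'missing', 'size mismatch' or 'hash mismatch'."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size != expected_size:
        return "size mismatch"
    # Hex digests are compared case-insensitively (the list uses uppercase).
    if sha256_of(path).lower() != expected_sha256.lower():
        return "hash mismatch"
    return "ok"


if __name__ == "__main__":
    for name, (size, sha) in EXPECTED.items():
        print(name, check_file(name, size, sha))
```

Run it from the folder that holds the 5 zip files. The size is checked first because it is cheap, so a truncated download is reported without hashing the whole file.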

12 comments

RecceRat 4.8 years ago

Just a thought, but have you searched the keywords using some of the deliberate misspellings that were apparent in the IG report? There, COMEY was actually spelt CORNEY, using slightly different italics and spacing. CLINTON was written with a D, as in DINTON, and the same with HUMA ABEDIN (I think they squashed the D and I together). Maybe some of these might turn something up. I believe PODESTA was also written differently. When you read it quickly, your brain recognises the word despite the misspelling.

pizzaequalspedo 4.8 years ago

You guys (or gals) are doing very valuable work.

10x more important than shit posting like the rest of us.

Iheartcatfood 4.8 years ago

How is Julian doing? Haven't heard about him in a while...

satisfyinghump 4.8 years ago

I also enjoy having a copy of files, such as these, on my own hardware.

I have some experience with grabbing files: it's mundane when done manually and effortless with a tool that automates it, but it's ALWAYS tiresome and time-consuming.

You did this for yourself, but you went the extra few steps to upload it and provide us with info and links. Thank you very much! Greatly appreciate it.

suomy_the_nona 4.8 years ago

You're welcome! I tried several tools I had used before but they didn't work with https or had other problems. With "Cyotek WebCopy" it worked, but in the end about 100 files were missing because the server didn't respond from time to time. That software has a useful error log which I used to complete the download.

Since we're all in this together, I think it's better if you all can save that time and use it for exploring the emails or doing some other research.

murface 4.8 years ago

Has anyone heard of language analysis being completed on the archives?

Even simple works, like counting words and phrases? This may be useful in consolidating the information to allow one to quickly find related content and rare occurrences where more extraordinary information could be found.
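
A rough Python sketch of that kind of counting, assuming the emails have first been converted to plain-text files (the `*.txt` pattern and the function names are only assumptions for this example, not something the archive provides):

```python
import re
from collections import Counter
from pathlib import Path

# Very simple tokenizer: runs of ASCII letters and apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def count_words(text):
    """Count lowercase words in one piece of text."""
    return Counter(WORD_RE.findall(text.lower()))


def count_phrases(text, n=2):
    """Count n-word phrases (n-grams) in one piece of text."""
    words = WORD_RE.findall(text.lower())
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


def count_folder(folder, pattern="*.txt"):
    """Sum word counts over all matching files below a folder."""
    total = Counter()
    for path in Path(folder).rglob(pattern):
        total.update(count_words(path.read_text(encoding="utf-8", errors="ignore")))
    return total


def rare_words(counts, max_count=1):
    """Return words that occur at most max_count times, sorted."""
    return sorted(word for word, count in counts.items() if count <= max_count)
```

`count_folder("clinton-emails").most_common(50)` would then list the most frequent words, and `rare_words(...)` the ones that occur only once, which is where the rare occurrences mentioned above would show up.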

satisfyinghump 4.8 years ago

This is a great idea!

It's one I've played around with, for example by using word clouds on various Wikileaks emails.

A different approach I've been attempting is comparing writing samples, based on a DefCon presentation I came across a while back.

A separate project has been to automate a process/script that does the following:

1. Sample the current news as published by a list of websites
   a) Include Twitter/Facebook/etc. sources for trends
   b) Provide a way to supply a word/phrase as input
2. Take the sampled word/phrase and run it through all news sites, producing a list of those that have used it (look for natural vs. created trends, i.e. fake and planned news)
3. Take the same word/phrase and feed it into a search/query that crawls through all of the Wikileaks emails and other provided 'databases' of such leaked files/emails/documents, etc.
4. Create a list of emails/documents that mention the word/phrase sampled in (1) and include in that list, if applicable, the from/to/people involved with those emails/documents, including those mentioned within them.

Any criticism would be appreciated. Going to attempt it first with Python.

Any suggestions, I'd welcome them!
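
The last two steps above (search the emails for a phrase, then list who was involved) could start out roughly like this in Python. This is only a sketch under assumptions: it presumes each email is available as a text file with standard `From:`/`To:`/`Subject:` header lines, and the `*.eml` pattern and the name `find_mentions` are made up for the example.

```python
from email import message_from_string
from pathlib import Path


def find_mentions(folder, phrase, pattern="*.eml"):
    """Return one dict per file below folder that contains phrase.

    The match is a plain case-insensitive substring search over the
    whole file, headers included.
    """
    needle = phrase.lower()
    hits = []
    for path in sorted(Path(folder).rglob(pattern)):
        raw = path.read_text(encoding="utf-8", errors="ignore")
        if needle not in raw.lower():
            continue
        # Parse the headers with the standard library email parser.
        msg = message_from_string(raw)
        hits.append({
            "file": str(path),
            "from": msg.get("From", ""),
            "to": msg.get("To", ""),
            "subject": msg.get("Subject", ""),
        })
    return hits
```

People mentioned only inside the body would need a separate step (for example a list of known names to search for); this sketch only reports the header fields.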

murface 4.8 years ago

That sounds awesome! It's a bit above my head to implement myself. But on the performance side, Perl operates on strings notably faster than Python and may speed up some tasks.

Decentralized work may be useful too. Use something like a Redis/Mongo instance to index findings and pointers. This lets you have multiple small workers filling in the data.

Doing something like this in the cloud can then be pretty easy/cost effective, allowing it to grow with more workers to ingest at an hourly rate, then shrink once the majority of work has been done to keep operational costs down.

Document linking:

Perhaps each email/doc/article gets a UUID, and each word/phrase maps to a Redis sorted set of those UUIDs. The UUIDs can be scored by relevance and then re-linked to the actual source.

This should provide a rapid lookup which can be indexed in a multi-worker environment.
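
To make the document-linking idea concrete without needing a running server, here is a small in-memory stand-in in Python. It is not Redis code: the class `PhraseIndex` and its methods are hypothetical, and the comments only point out which sorted-set command each step would roughly correspond to.

```python
import uuid
from collections import defaultdict


class PhraseIndex:
    """In-memory stand-in for one sorted set per word/phrase."""

    def __init__(self):
        self.sources = {}                # doc UUID -> path or URL of the source
        self.scores = defaultdict(dict)  # phrase -> {doc UUID: relevance score}

    def add_document(self, source):
        """Register a source and return its new UUID."""
        doc_id = str(uuid.uuid4())
        self.sources[doc_id] = source
        return doc_id

    def add_hit(self, phrase, doc_id, score):
        """Record that a document matches a phrase with a given score.

        Roughly what `ZADD <phrase> <score> <doc_id>` would do in Redis.
        """
        self.scores[phrase.lower()][doc_id] = score

    def lookup(self, phrase, limit=10):
        """Return up to `limit` (source, score) pairs, best score first.

        Roughly `ZREVRANGE <phrase> 0 <limit-1> WITHSCORES`, followed by
        re-linking each UUID to its source.
        """
        members = self.scores.get(phrase.lower(), {})
        ranked = sorted(members.items(), key=lambda item: item[1], reverse=True)
        return [(self.sources[doc_id], score) for doc_id, score in ranked[:limit]]
```

With a real Redis instance, several workers could write into the same sorted sets at once; in this sketch everything lives in one process, so it only illustrates the data layout.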

omnimattymattymatt 4.8 years ago

Research her penance to Lady de Rothschild. That one always interested me.

HalmoniKim 4.8 years ago

wow....a lot of work....thank you!

showbobandvagene 4.8 years ago

What brings this up? Is it a re-release by Wikileaks, or is there potentially something new in here?

suomy_the_nona 4.8 years ago

From time to time there are rumours of additional releases. But because of the inconvenient date settings it's difficult to find out if there's something new. With that complete offline copy it will be easy to detect when subdirectories are added.

And I, at least, haven't read all the emails yet.
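
One way such a comparison could look in Python, assuming there are two local copies to compare (for example the unpacked offline copy and a fresh download made later). The function names are made up for this sketch, and it compares file paths and sizes only, since the file dates are not reliable here.

```python
from pathlib import Path


def snapshot(folder):
    """Map each file's path (relative to folder) to its size in bytes."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def diff_snapshots(old, new):
    """Return sorted lists of (added, removed, changed) relative paths."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # "Changed" here means same path but different size.
    changed = sorted(key for key in set(old) & set(new) if old[key] != new[key])
    return added, removed, changed
```

`diff_snapshots(snapshot("clinton-emails"), snapshot("clinton-emails-new"))` would list files that exist only in the newer copy, which also reveals any new subdirectories. The second folder name is just a placeholder.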
