RecceRat ago

Just a thought, but have you searched for the keywords using some of the deliberate misspellings that were apparent in the IG report? COMEY was actually spelt CORNEY, using slightly different italics and spacing techniques. CLINTON was written with a D, as in DINTON, and the same with HUMA ABEDIN (I think they squashed the D and I together). Maybe some of them might throw something up. I believe PODESTA was also written differently. When you read it quickly, your brain recognises the word despite the misspelling.
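
If it helps, here is a minimal Python sketch of generating those look-alike spellings so each one can be searched for. The substitution table is an assumption: only M to RN (COMEY/CORNEY) and CL to D (CLINTON/DINTON) come from the examples above, and the function name is made up for illustration.

```python
# Hypothetical look-alike substitutions. Only these two come from the
# examples in this comment; extend the table as more are spotted.
CONFUSIONS = {"M": "RN", "CL": "D"}


def lookalike_variants(keyword, confusions=CONFUSIONS):
    """Return the keyword plus every variant made by ONE substitution at ONE position."""
    word = keyword.upper()
    variants = {word}
    for original, replacement in confusions.items():
        start = word.find(original)
        while start != -1:
            # Replace only this occurrence, leaving the rest of the word alone.
            variants.add(word[:start] + replacement + word[start + len(original):])
            start = word.find(original, start + 1)
    return variants


print(sorted(lookalike_variants("Comey")))    # ['COMEY', 'CORNEY']
print(sorted(lookalike_variants("Clinton")))  # ['CLINTON', 'DINTON']
```

Each variant would then be run through whatever search is being used on the archive, alongside the correct spelling.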

pizzaequalspedo ago

You guys (or gals) are doing very valuable work.

10x more important than shit posting like the rest of us.

Iheartcatfood ago

How is Julian doing? Haven't heard about him in a while...

satisfyinghump ago

I also enjoy having a copy of files, such as these, on my own hardware.

I have had some experience with grabbing files: mundane when done manually, no sweat with a tool that automates it, but it's ALWAYS tiresome and time-consuming.

You did this for yourself, but you went the extra few steps to upload it and provide us with info and links. Thank you very much! Greatly appreciate it.

suomy_the_nona ago

You're welcome! I tried several tools I had used before but they didn't work with https or had other problems. With "Cyotek WebCopy" it worked, but in the end about 100 files were missing because the server didn't respond from time to time. That software has a useful error log which I used to complete the download.

Since we're all in this together, I think it's better if you all can save that time and use it for exploring the emails or doing some other research.

murface ago

Has anyone heard of language analysis being completed on the archives?

Even simple work, like counting words and phrases? This may be useful in consolidating the information, so one can quickly find related content and the rare occurrences where more extraordinary information could be found.
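
As a rough idea of what the simple version could look like, here is a small Python sketch that counts words or n-word phrases across a pile of texts. The function name and the tokenising rule (lowercase letters, digits and apostrophes) are my own assumptions, not anything that has been run on the archive.

```python
import re
from collections import Counter


def count_phrases(texts, n=1):
    """Count every run of n consecutive words across all texts (case-insensitive)."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Slide a window of n words over the text and count each phrase.
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


counts = count_phrases(["The cat sat", "the cat ran"], n=2)
print(counts["the cat"])  # 2
# Phrases seen only once are candidates for the "rare occurrences".
rare = sorted(phrase for phrase, seen in counts.items() if seen == 1)
print(rare)  # ['cat ran', 'cat sat']
```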

satisfyinghump ago

This is a great idea!

It's one I've played around with, for example by using word clouds on various WikiLeaks emails.

A different approach I've been attempting is comparing writing samples, based on a DefCon presentation I came across a while back.

A separate project has been to automate a process/script that does the following:

  1. Sample the current news as being published by a list of websites

    a) Include twitter/facebook/etc sources for trends

    b) Provide a way to supply a word/phrase as input

  2. Take the sampled word/phrase and run it through all news sites, providing a list of them that have used it (look for natural vs created trends, i.e. fake and planned news)

  3. Take the same phrase/word and input it into a search/query that gophers/crawls through all of the WikiLeaks emails, and other provided 'databases' of such leaked files/emails/documents, etc.

  4. Create a list of emails/documents which mentioned the word/phrase that we sampled from (1) and include in that list, if applicable, the from/to/people involved with those emails/documents, including those mentioned within them.
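
Steps 3 and 4 could start out as something like the sketch below: take raw email text, look for the phrase in the subject and plain-text body, and list who each hit was from and to. It only uses Python's standard `email` module; the function name and the idea of passing emails in as a dict of id to raw text are assumptions for illustration, and it does not yet pull out people mentioned inside the body.

```python
from email import message_from_string


def find_mentions(raw_emails, phrase):
    """Return (doc_id, from, to) for every email whose subject or body contains the phrase."""
    needle = phrase.lower()
    hits = []
    for doc_id, raw in raw_emails.items():
        msg = message_from_string(raw)
        # Collect text from plain-text parts only; attachments are ignored.
        body_parts = [
            part.get_payload()
            for part in msg.walk()
            if part.get_content_type() == "text/plain" and not part.is_multipart()
        ]
        haystack = " ".join([msg.get("Subject", "")] + body_parts).lower()
        if needle in haystack:
            hits.append((doc_id, msg.get("From", ""), msg.get("To", "")))
    return hits


# Made-up sample messages, just to show the shape of the input and output.
sample = {
    "1": "From: a@example.com\nTo: b@example.com\nSubject: Lunch\n\nLet us get pizza tomorrow.\n",
    "2": "From: c@example.com\nTo: d@example.com\nSubject: Notes\n\nNothing here.\n",
}
print(find_mentions(sample, "pizza"))  # [('1', 'a@example.com', 'b@example.com')]
```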


Any criticism would be appreciated. Going to attempt it first with Python.

Any suggestions, I'd welcome them!

murface ago

That sounds awesome! It's a bit above my head to implement myself. But on the performance side, Perl operates on strings notably faster than Python, which may help for some tasks.

Decentralized work may be useful too. Use something like a Redis/MongoDB instance to index findings and pointers. This lets you have multiple small workers filling in the data.

Doing something like this in the cloud can then be pretty easy/cost effective, allowing it to grow with more workers to ingest at an hourly rate, then shrink once the majority of work has been done to keep operational costs down.

Document linking:

Perhaps each email/doc/article gets a UUID, and each word/phrase maps to a Redis sorted set of those UUIDs. The UUIDs can be scored by relevance and then linked back to the actual source.

This should provide a rapid lookup which can be indexed in a multi-worker environment.
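
To make the linking idea concrete, here is a small in-memory Python stand-in, not a real Redis client. One dict per phrase plays the role of a sorted set (UUID to score); with an actual Redis instance the `link` and `lookup` steps would roughly correspond to the ZADD and ZREVRANGE commands. The class and method names are hypothetical.

```python
import uuid
from collections import defaultdict


class PhraseIndex:
    """In-memory stand-in for one sorted set per phrase (phrase -> {uuid: score})."""

    def __init__(self):
        self.sources = {}                # uuid -> path or URL of the original document
        self.scores = defaultdict(dict)  # phrase -> {uuid: relevance score}

    def add_document(self, source):
        """Give a document a UUID and remember where the original lives."""
        doc_id = str(uuid.uuid4())
        self.sources[doc_id] = source
        return doc_id

    def link(self, phrase, doc_id, score):
        """Attach a document to a phrase with a relevance score."""
        self.scores[phrase.lower()][doc_id] = score

    def lookup(self, phrase, limit=10):
        """Return (source, score) pairs for a phrase, most relevant first."""
        entries = self.scores.get(phrase.lower(), {})
        ranked = sorted(entries.items(), key=lambda item: item[1], reverse=True)
        return [(self.sources[doc_id], score) for doc_id, score in ranked[:limit]]


index = PhraseIndex()
first = index.add_document("emails/1.eml")
second = index.add_document("emails/2.eml")
index.link("pizza", first, 1.0)
index.link("Pizza", second, 3.0)
print(index.lookup("pizza"))  # [('emails/2.eml', 3.0), ('emails/1.eml', 1.0)]
```

Because workers only ever add a UUID with a score under a phrase, several of them can fill the index at once without stepping on each other, which is the point of using a shared store.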

omnimattymattymatt ago

Research her penance to Lady de Rothschild. That one always interested me.

HalmoniKim ago

wow....a lot of work....thank you!

showbobandvagene ago

What brings this up? Is it a re-release by WikiLeaks, or is there potentially something new in here?

suomy_the_nona ago

From time to time there are rumours that there are additional releases. But because of the inconvenient date settings it's difficult to find out if there's something new. With that complete offline copy it will be easy to detect when subdirectories are added.

And anyway, I haven't read all the emails yet.
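
For the "detect when subdirectories are added" part, a minimal Python sketch: record the set of subdirectories in the offline copy once, then compare against it after a fresh download. The function names and the example path are placeholders, not what I actually use.

```python
from pathlib import Path


def list_subdirectories(root):
    """Return every subdirectory under root as a relative, forward-slash path."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()}


def new_subdirectories(old_snapshot, root):
    """Return the subdirectories present now that were not in the old snapshot."""
    return sorted(list_subdirectories(root) - set(old_snapshot))


# Usage (hypothetical path): take a snapshot, re-download later, then compare.
# snapshot = list_subdirectories("offline_copy")
# ... later, after updating the copy ...
# print(new_subdirectories(snapshot, "offline_copy"))
```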