There are 7544 emails sent from [email protected]
I wanted to make these faster to peruse, so I scraped them from WikiLeaks and tabulated the information into two bodies of data:
- A reference table containing a link to each email, its subject, who it was sent to, the date it was sent, links to any attachments with their total size, and keywords (compiled by occurrence across all emails): http://pastebin.com/0V1cq5xw (click the raw option to get tab-delimited format)
- A quick list of keywords and their frequency of occurrence: http://pastebin.com/3WRMEETY
Considerations
- Given the amount of data for all emails to and from JP, I figured it was most useful to focus on the ones he sent. Of course, this won't catch everything, because he probably followed up a lot of the most important ones with a call, but it made the body of data manageable for processing.
- Because of some performance issues I was having with my browser during the scrape, I missed about 30 emails, for a total of 7512/7544. I apologize for this. I wanted to correct it, but the scrape took a while, so I decided to publish what I had and fix the gap if people found this work useful.
Keywords
- The keywords again use the methodology of matching a capitalized sequence of words or numbers. I put the most logic into this part, including a list of text to ignore. This greatly improved the results, but it's not perfect, and you will see some unimportant phrases show up as a result.
- I traversed all 7500+ emails to create the list of keywords, and I have some logic for correcting them as the data aggregates (pluralization, all caps, etc.).
- The resulting list was enormous (30k+ items), so I made a rule that a keyword had to occur at least 3 times to be included, which reduced the list size to a bit over 10k.
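For anyone who wants to try something similar, here is a minimal sketch of the keyword steps described above: match runs of capitalized words or numbers, skip an ignore list, merge all-caps and plural variants, and keep only phrases seen at least 3 times. This is not the actual script. The ignore list, the regular expression, and the names `IGNORE`, `normalize`, and `keyword_counts` are all assumptions made for illustration, and the plural handling is deliberately crude.

```python
import re
from collections import Counter

# Hypothetical ignore list; the real one used for the scrape is not published.
IGNORE = {"the", "re", "fw", "fwd", "sent", "from", "to", "subject"}

# One token is a capitalized word or a number; a phrase is a run of tokens
# separated by single spaces.
TOKEN = r"(?:[A-Z][A-Za-z]*|\d+)"
PHRASE = re.compile(rf"\b{TOKEN}(?: {TOKEN})*\b")

def normalize(phrase):
    """Fold case and strip a trailing plural 's' from the last word."""
    words = [w.lower() for w in phrase.split()]
    last = words[-1]
    # Crude rule: will also clip names like "Texas"; a real list needs exceptions.
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)

def keyword_counts(emails, min_count=3):
    """Count normalized phrases across all email bodies, then apply the cutoff."""
    counts = Counter()
    for body in emails:
        for match in PHRASE.findall(body):
            key = normalize(match)
            if key in IGNORE:
                continue
            counts[key] += 1
    return {k: n for k, n in counts.items() if n >= min_count}
```

Calling `keyword_counts(list_of_email_bodies)` returns a dict of phrase to count. Raising `min_count` trims the list further, at the cost of dropping rarer names.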
Final thoughts
There are some good insights in the current data, imo: the frequency of who he talks to is pretty clear, and the ranking of keywords is based on enough data to establish actual relevance.
But at the very least, this tabulated data may assist investigators who want to zip around the emails.
Next, I'm planning on scraping [email protected], which has 495 emails sent. I can do more of this type of work if the data proves fruitful to the community. Hope it helps!
God bless you for fighting for justice.
SIMONBARROW ago
During the course of preparing this, did you happen to notice if he received a lot of spam? I'm wondering if his spam was real (regular) spam or whether some material may have been sent to him disguised as spam.
Jeremy20_9 ago
He receives so much email, tons of fundraising stuff, that I couldn't even approach his inbox. I had to look at it from the sent email perspective.
However, I did notice that there is very minimal use of this address until 2006. Then there is a lot of use, and then look at his usage after 2008. These are the dates of sent emails:
11/21/2008
2/22/2009
2/22/2009
2/22/2009
5/4/2009
11/29/2009
11/29/2009
11/29/2009
12/6/2009
4/19/2010
11/1/2011
11/1/2011
11/1/2011
11/1/2011
1 email sent in 2010.
That's curious imo.
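If anyone wants to check the tally, here is a minimal sketch that counts the dates listed above by year. It only uses the 14 dates from this comment, and the names `sent_dates` and `emails_per_year` are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# The 14 sent dates listed in this comment, in month/day/year form.
sent_dates = """
11/21/2008 2/22/2009 2/22/2009 2/22/2009 5/4/2009 11/29/2009 11/29/2009
11/29/2009 12/6/2009 4/19/2010 11/1/2011 11/1/2011 11/1/2011 11/1/2011
""".split()

def emails_per_year(dates):
    """Parse each date string and count how many fall in each year."""
    return Counter(datetime.strptime(d, "%m/%d/%Y").year for d in dates)

for year, count in sorted(emails_per_year(sent_dates).items()):
    print(year, count)
```

The same function should work on the date column of the reference table, assuming the dates there use the same month/day/year format.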