
murface ago

Has anyone heard of language analysis being performed on the archives?

Even simple work, like counting word and phrase frequencies? This could be useful for consolidating the information, making it quick to find related content and to spot rare occurrences where more extraordinary information might be found.

satisfyinghump ago

This is a great idea!

It's one I've played around with, e.g. generating word clouds from various WikiLeaks emails.

A different approach I've been attempting is comparing writing samples, based on a DEF CON presentation I came across a while back.
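For the writing-sample comparison, a simple baseline worth noting (not necessarily what the talk used) is cosine similarity over word-frequency profiles:

```python
# A loose baseline for comparing writing samples: cosine similarity
# over word-frequency profiles. A stand-in, not the talk's method.
import math
import re
from collections import Counter

def profile(text):
    """Word-frequency profile of a writing sample."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two profiles: 1.0 = identical mix."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    mag = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return dot / mag if mag else 0.0

# cosine(profile(sample_one), profile(sample_two)) -> 0.0 .. 1.0
```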

A separate project has been to automate a process/script that does the following (rough Python sketch after the list):

  1. Sample the current news being published by a list of websites

    a) Include Twitter/Facebook/etc. sources for trends

    b) Provide a way to supply a word/phrase directly as input

  2. Take the sampled word/phrase and run it through all the news sites, producing a list of the ones that have used it (to spot natural vs. manufactured trends, i.e. fake and planned news)

  3. Feed the same word/phrase into a search/query that crawls through all of the WikiLeaks emails and any other provided 'databases' of leaked files/emails/documents, etc.

  4. Create a list of emails/documents that mention the word/phrase sampled in (1), and include in that list, where applicable, the from/to/people involved with those emails/documents, including anyone mentioned within them.
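Here's that rough sketch, assuming a local folder of leaked emails saved as plain-text files and a hand-picked list of news front pages. The URLs, paths, and the five-letter word filter are placeholders, not settled choices:

```python
# Rough sketch of steps 1-4. NEWS_SITES, LEAK_DIR, and the regexes
# are placeholders to swap for real sources.
import os
import re
from collections import Counter

import requests  # pip install requests

NEWS_SITES = [
    "https://example-news-one.com",   # placeholder front pages
    "https://example-news-two.com",
]
LEAK_DIR = "leaks/"                   # one email/document per .txt file

def page_text(url):
    """Fetch a page and crudely strip tags -- fine for phrase counting."""
    html = requests.get(url, timeout=10).text
    return re.sub(r"<[^>]+>", " ", html).lower()

def trending_words(n=20):
    """Step 1: most common words across the sampled front pages."""
    words = Counter()
    for url in NEWS_SITES:
        # ignore short filler words by requiring 5+ letters
        words.update(re.findall(r"[a-z]{5,}", page_text(url)))
    return [w for w, _ in words.most_common(n)]

def sites_using(phrase):
    """Step 2: which sites currently carry the phrase?"""
    return [url for url in NEWS_SITES if phrase.lower() in page_text(url)]

def search_leaks(phrase):
    """Steps 3-4: scan the local archive, pulling From:/To: headers."""
    hits = []
    for root, _, files in os.walk(LEAK_DIR):
        for name in files:
            text = open(os.path.join(root, name), errors="ignore").read()
            if phrase.lower() in text.lower():
                people = re.findall(r"^(?:From|To):\s*(.+)$", text, re.M)
                hits.append((name, people))
    return hits

# for phrase in trending_words():
#     print(phrase, sites_using(phrase), search_leaks(phrase))
```

Once the email format is pinned down, real parsing with the stdlib email module would be more robust than the From:/To: regex.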


Any criticism would be appreciated. I'm going to attempt it first in Python.

I'd welcome any suggestions as well!

murface ago

That sounds awesome! It's a bit above my head to implement myself, but on the performance side, Perl handles string operations notably faster than Python and may speed up some of these tasks.

Distributing the work may be useful too. Use something like a Redis/Mongo instance to index findings and pointers; that lets multiple small workers fill in the data.

Doing something like this in the cloud can then be pretty easy and cost-effective: grow the worker pool while ingestion is heavy, then shrink it once the majority of the work is done to keep operational costs down.
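As a sketch of the worker side, assuming Redis and the redis-py client (queue and key names are placeholders):

```python
# A shared Redis list as a work queue: add workers on cheap cloud
# instances while the backlog is large, remove them once it drains.
import redis  # pip install redis

r = redis.Redis()

def enqueue(doc_path):
    r.lpush("work:docs", doc_path)

def process(doc_path):
    """Placeholder: parse and index the document (see linking sketch below)."""
    print("indexed", doc_path)

def worker():
    while True:
        item = r.brpop("work:docs", timeout=30)  # block until work arrives
        if item is None:                          # queue has drained
            break
        process(item[1].decode())
```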

Document linking:

Perhaps each email/doc/article gets a UUID, and each word/phrase maps to a Redis sorted set of those UUIDs, scored by relevance and re-linked back to the actual source.

This should provide rapid lookups and can be populated safely by multiple workers at once.
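A minimal sketch of that layout with redis-py (the key names and occurrence-count scoring are my assumptions, not a settled design):

```python
# Each document gets a UUID; a hash maps UUIDs back to their source,
# and each word keys a sorted set of UUIDs scored by occurrence count.
import uuid
from collections import Counter

import redis  # pip install redis

r = redis.Redis()

def index_document(text, source_path):
    doc_id = str(uuid.uuid4())
    r.hset("doc:sources", doc_id, source_path)
    for word, count in Counter(text.lower().split()).items():
        r.zincrby(f"word:{word}", count, doc_id)  # naive relevance = count
    return doc_id

def lookup(word, top=10):
    """Highest-scored docs for a word, re-linked to their sources."""
    ids = r.zrevrange(f"word:{word}", 0, top - 1)
    return [r.hget("doc:sources", i).decode() for i in ids]
```

ZINCRBY is atomic on the Redis side, so multiple workers can index into the same sorted sets without coordinating with each other.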