Investigating thousands (or millions) of documents by visualizing clusters

February 28, 2011 · Jonathan Stray · computational journalism, computational linguistics, document dumps, visualization

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

“A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
“Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story, and it gives a method to compare clusterings (though I don’t think it will scale well to millions of docs).
Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
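
For readers who want to see how the basic text processing techniques listed above fit together, here is a minimal sketch in Python. It is not the code we used at the AP. The three sample documents are invented for illustration, the function names are hypothetical, and the weighting shown (term frequency times log of inverse document frequency) is only one common tf-idf variant.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring word order (the bag-of-words model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} dict using tf-idf."""
    bags = [bag_of_words(d) for d in docs]
    n = len(bags)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            # Frequent-in-this-document terms score high; terms that are
            # common across the whole set are discounted by the log factor.
            term: (count / total) * math.log(n / df[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample documents, not taken from the war logs.
docs = [
    "roadside bomb explosion injures two soldiers",
    "roadside bomb found and cleared by soldiers",
    "detainee transferred to police custody",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print("related pair:", sim_related)
print("unrelated pair:", sim_unrelated)
```

The pairwise similarities are what a layout or clustering step would consume. With this particular weighting, a term that appears in every document gets weight zero, which is the intended effect of the idf factor.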

12 thoughts on “Investigating thousands (or millions) of documents by visualizing clusters”

fabricant et fournisseur de moteur de la bicyclette says:

March 1, 2011 at 7:17 am

The subject of your article is well written, and I just thought I should leave a small compliment here. Well done, and keep it up! I have been thinking of starting a WordPress blog too. Do you know of other sites that teach you how?
Pingback: How can newsrooms become better listeners and enablers? » Article » OWNI.eu, Digital Journalism
Pingback: Jonathan Stray » Knight News Challenge 2011 Finalists
James says:

June 13, 2011 at 4:54 am

Concentrate legal discovery databases.
Make anyone able to do the same thing as a lawyer.
Lawyers were able to stop this fifteen years ago, as I recall. It’s time to take that information and make it available to everyone. It’s not secret, but it is guarded.
Pingback: ขุดข่าวจากวิกิลีกส์ ด้วยโปรแกรมคอมพิวเตอร์
Pingback: Investigating thousands (or millions) of documents by visualizing clusters « Another Word For It
Teresa (@PDXsays) Boze says:

June 25, 2011 at 9:20 am

Jonathan,

Ward Cunningham, best known for inventing the code that creates wikis, is working on parsing with PEG in a C++ wrapper. He is opening up the code, if he has not done so already. He spoke to many of us about it at the Open Source Bridge conference 2011 (#OSBridge11 or #OSB11). He is testing it at large scale on Wikipedia, and will be speaking with archive.org about a potential application. The testing process is fast and reliable. I was thinking this might be useful, even if just conceptually, for your objective of investigating journalism documents.

Best,

Teresa Boze

Concept | Connections, NW
Pingback: Investigating thousands (or millions) of documents by visualizing clusters — Nogiets
Codecave says:

September 2, 2015 at 6:48 am

Really really awesome thank you!
Andreas from germany
