Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference last week, where I discussed some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story, and it gives a method to compare clusterings (though I don’t think it will scale well to millions of docs).
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It handles graphs of up to a few tens of thousands of nodes.
  • Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
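
For readers who want to see how the three basic text processing techniques in the list fit together, here is a minimal sketch in plain Python. It is not the code we ran on the war logs: the sample documents are made up, the function names are hypothetical, and the weighting (term frequency times log(N/df)) is just one common tf-idf variant.

```python
import math
from collections import Counter


def tokenize(text):
    # Bag of words: lowercase and split on whitespace, discarding word order.
    return text.lower().split()


def tfidf_vectors(docs):
    # One sparse vector (dict of term -> weight) per document.
    bags = [Counter(tokenize(d)) for d in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    # Cosine of the angle between two sparse vectors; 0.0 if either is all zeros.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample reports, for illustration only.
docs = [
    "ied explosion reported near checkpoint",
    "ied explosion near convoy route",
    "detainee transferred to police custody",
]
vecs = tfidf_vectors(docs)

if __name__ == "__main__":
    print("report 0 vs report 1:", cosine_similarity(vecs[0], vecs[1]))
    print("report 0 vs report 2:", cosine_similarity(vecs[0], vecs[2]))
```

The pairwise similarities (or the distances derived from them) are the kind of input that a layout tool such as Gephi or a multi-dimensional scaling algorithm works from. Computing every pair this way takes quadratic time, so this sketch is only practical for small document sets.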

11 thoughts on “Investigating thousands (or millions) of documents by visualizing clusters”

  1. Concentrate legal discovery databases.
    Make anyone able to do the same thing as a lawyer.
    Lawyers were able to stop this fifteen years ago, as I recall. It’s time to take that information and make it available to everyone. It’s not secret, but it is guarded.

  2. Jonathan,

    Ward Cunningham, best known for inventing the wiki, is working with PEG in a C++ wrapper for parsing. He is opening up the code, if he has not done so already. He spoke to many of us about it at the Open Source Bridge conference 2011 (#OSBridge11 or #OSB11). He is testing it at large scale using Wikipedia, and will be speaking with for potential application. The testing process is fast and reliable. I was thinking this might be useful, even if just conceptually, for your objective of investigating journalism documents.


    Teresa Boze

    Concept | Connections, NW
