Feb 28 2011

Investigating thousands (or millions) of documents by visualizing clusters

This is a recording of my talk at the NICAR (National Institute for Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs.

References cited in the talk:

  • “A full-text visualization of the Iraq war logs”, a detailed writeup of the technique used to generate the first set of maps presented in the talk.
  • The Glimmer high-performance, parallel multi-dimensional scaling algorithm, which is the software I presented in the live demo portion. It will be the basis of our clustering work going forward. (We are also working on other large-scale visualizations which may be more appropriate for e.g. email dumps.)
  • “Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology.” Justin Grimmer, Gary King, 2009. A paper that everyone working in document clustering needs to read. It clearly makes the point that there is no “best” clustering, just different algorithms that correspond to different pre-conceived frames on the story, and it gives a method to compare clusterings (though I don’t think it will scale well to millions of docs).
  • Wikipedia pages for bag of words model, tf-idf, and cosine similarity, the basic text processing techniques we’re using.
  • Gephi, a free graph visualization system, which we used for the one-month Iraq map. It will work up to a few tens of thousands of nodes.
  • Knight News Challenge application for “Overview,” the open-source system we’d like to build for doing this and other kinds of visual explorations of large document sets. If you like our work, why not leave a comment on our proposal?
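
For readers who want to see how the three basic text processing techniques in the list fit together, here is a minimal sketch in plain Python. It is not the code used for the AP maps: the function names, the simple tokenizer, the particular tf-idf weighting, and the three sample sentences are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} tf-idf vector."""
    # Bag of words: keep only the count of each term, ignoring word order.
    bags = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(bags)

    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())

    vectors = []
    for bag in bags:
        total = sum(bag.values())
        # Term frequency times log inverse document frequency (one common
        # variant). A term that appears in every document gets weight 0.
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample text, not taken from the war logs.
docs = [
    "roadside bomb explodes near convoy",
    "bomb found near convoy route",
    "detainee transferred to local police",
]
vectors = tfidf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # shares "bomb", "near", "convoy"
print(cosine_similarity(vectors[0], vectors[2]))  # shares no terms
```

Pairwise similarities like these are the kind of input that a layout or clustering tool works from: documents that share distinctive vocabulary score close to 1, and documents with no words in common score 0.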

9 Responses to “Investigating thousands (or millions) of documents by visualizing clusters”

  1. The subject of your article is well written, and I just thought I should leave a small compliment here. Well done and keep it up! I have been thinking of starting a WordPress blog too. Do you know of other sites that teach you how?

  2. [...] of the comments on their own stories: it’s a big job. People like Jonathan Stray are exploring how to find patterns in huge datasets, others like Adam Marcus are looking at how to use systems like Twitter to identify and verify [...]

  3. [...] Overview (that’s me and colleagues) – announcement, proposal, demo [...]

  4. James on 13 Jun 2011 at 4:54 am

    Concentrate legal discovery databases.
    Make anyone able to do the same thing as a lawyer.
    Lawyers were able to stop this fifteen years ago, as I recall. It’s time to take that information and make it available to everyone. It’s not secret, but it is guarded.

  5. [...] by drawing pictures from clusters of data [...]

  6. [...] Investigating thousands (or millions) of documents by visualizing clusters [...]

  7. Teresa (@PDXsays) Boze on 25 Jun 2011 at 9:20 am

    Jonathan,

    Ward Cunningham, best known for inventing the wiki, is working on parsing with PEG in a C++ wrapper. He is opening up the code, if he has not done so already. He spoke to many of us about it at the Open Source Bridge conference 2011 (#OSBridge11 or #OSB11). He is testing it at large scale on Wikipedia, and will be speaking with archive.org about a potential application. The testing process is fast and reliable. I was thinking this might be useful, even if just conceptually, for your goal of investigating journalism documents.

    Best,

    Teresa Boze

    Concept | Connections, NW

  8. [...] Jonathan Stray » Investigating thousands (or millions) of documents by visualizing clusters Investigating thousands (or millions) of documents by visualizing clusters Published by Jonathan Stray at 8:59 am. Tags: computational journalism, computational linguistics, document dumps, visualization This is a recording of my talk at the NICAR (National Institute of Computer-Assisted Reporting) conference last week, where I discuss some of our recent work at the AP with the Iraq and Afghanistan war logs. [...]

  9. Vance on 25 Jul 2014 at 11:10 am

    I see a lot of interesting articles on your page. You have to spend a lot of time writing. I know how to save you a lot of work: there is a tool that creates high quality, SEO friendly posts in a couple of seconds, just type in google – k2 unlimited content
