What do Journalists do with Documents?

Many people have realized that natural language processing (NLP) techniques could be extraordinarily helpful to journalists who need to deal with large volumes of documents or other text data. But although there have been many experiments and much speculation, almost no one has built NLP tools that journalists actually use. In part, this is because computer scientists haven’t had a good description of the problems journalists actually face. This talk and paper, presented at the Computation + Journalism Symposium, are one attempt to remedy that. (Talk slides here.)

This all comes out of my experience both building and using Overview, an open source document mining system built specifically for investigative journalists. The paper summarizes every story completed with Overview, and also discusses the five cases I know of where journalists wrote custom NLP code to get the story done.

Stories done with Overview

The talk is more focused on the lessons learned — all the things I wish I had known when I started writing NLP code for journalism six years ago. I recommend six research themes for computer scientists who want to help journalists:

Robust import. Preparing documents for analysis is a much bigger problem than is generally appreciated. Even structured data like email is often delivered on paper.

Robust analysis. Journalists routinely deal with unbelievably dirty documents. OCR errors confound classic algorithms. Shorthand and jargon break dictionaries and parsers.
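To see why even one misread character matters, here is a toy sketch (the OCR'd sentence and the 0.8 threshold are made up for illustration): exact substring search misses a word with a single character substitution, while a cheap similarity comparison still finds it.

```python
from difflib import SequenceMatcher

def fuzzy_contains(term, text, threshold=0.8):
    """Check each word in the text for a near-match of `term`."""
    term = term.lower()
    for word in text.lower().split():
        if SequenceMatcher(None, term, word).ratio() >= threshold:
            return True
    return False

# OCR has turned "o" into "0" and "l" into "1".
ocr_text = "The subp0ena was de1ivered on Tuesday"

print("subpoena" in ocr_text)           # exact match fails
print(fuzzy_contains("subpoena", ocr_text))  # fuzzy match succeeds
```

Real systems use more sophisticated approaches (character-level language models, OCR-aware indexing), but the failure mode is exactly this simple.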

Search, not exploration. Reporters are usually looking for something, but it may not be something that is easy to express in a keyword search. The ultimate example is “corruption,” which you can’t just type into a search box.
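One way to think about this gap is query expansion from an abstract concept to concrete, searchable terms. The sketch below is purely illustrative — the concept dictionary is hand-made, and in practice you would want the machine to help build it:

```python
# Hypothetical mapping from an abstract concept to terms that
# actually appear in documents; real corruption stories mention
# kickbacks and shell companies, not the word "corruption".
CONCEPT_TERMS = {
    "corruption": ["kickback", "bribe", "no-bid contract", "shell company"],
}

def expand_query(concept):
    """Turn one abstract concept into an OR of concrete search terms."""
    terms = CONCEPT_TERMS.get(concept.lower(), [concept])
    return " OR ".join(f'"{t}"' for t in terms)

print(expand_query("corruption"))
# "kickback" OR "bribe" OR "no-bid contract" OR "shell company"
```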

Quantitative summaries. Journalists have long produced stories by counting the number of documents of a certain type. How can we make this easy, flexible, and accurate?
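The basic operation is categorize-then-count. A minimal sketch, assuming a hand-written keyword rule per category (the categories and sample documents are invented):

```python
from collections import Counter

# Illustrative rules: the first category whose keywords appear wins.
RULES = {
    "use of force": ["taser", "baton", "force"],
    "harassment": ["harassment", "threat"],
}

def categorize(doc):
    text = doc.lower()
    for category, keywords in RULES.items():
        if any(k in text for k in keywords):
            return category
    return "other"

complaints = [
    "Officer used a taser during the stop",
    "Repeated threats reported by the complainant",
    "Lost property claim",
]
counts = Counter(categorize(d) for d in complaints)
print(counts)
```

The hard research problems are hiding in the rules: making them easy for a reporter to write, and quantifying how wrong the resulting counts might be.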

Interactive methods. Even with NLP, document-based reporting requires extensive human reading. How do we best integrate machine and human intelligence in an interactive loop?
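One common shape for such a loop is uncertainty sampling: the machine scores every document, and the human reads the ones the machine is least sure about, feeding labels back in. A toy sketch of just the prioritization step (the scores are made up; a real system would retrain after each label):

```python
def uncertainty(score):
    # Relevance scores near 0.5 are the ones the machine is least sure about.
    return 1 - abs(score - 0.5) * 2

# Hypothetical model relevance scores, keyed by document id.
scores = {"doc1": 0.95, "doc2": 0.48, "doc3": 0.10, "doc4": 0.62}

# Queue documents for human review, most uncertain first.
review_queue = sorted(scores, key=lambda d: uncertainty(scores[d]), reverse=True)
print(review_queue)  # ['doc2', 'doc4', 'doc3', 'doc1']
```

The design question is where the human's reading time does the most good; spending it on confidently-scored documents teaches the machine almost nothing.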

Clarity and accuracy. Journalists are accountable to the public for their results. They must be able to explain how they got their answer, and how they know the answer is right.

I am currently compiling test sets of real-world documents that journalists have encountered, to help researchers who want to work on these problems. Contact me if you’re interested! I’d also like to take this opportunity to point out that Overview has an analysis plugin API, so if you’re doing work that you want journalists to use, this is an easy way to put a UI around it and ship it inside a widely used tool.