Many people have realized that natural language processing (NLP) techniques could be extraordinarily helpful to journalists who need to deal with large volumes of documents or other text data. But although there have been many experiments and much speculation, almost no one has built NLP tools that journalists actually use. In part, this is because computer scientists haven’t had a good description of the problems journalists actually face. This talk and paper, presented at the Computation + Journalism Symposium, are one attempt to remedy that. (Talk slides here.)
This all comes out of my experience both building and using Overview, an open source document mining system built specifically for investigative journalists. The paper summarizes every story completed with Overview, and also discusses the five cases I know where journalists used custom NLP code to get the story done.