A computational journalism reading list

[Last updated: 18 April 2011 — added statistical NLP book link]

There is something extraordinarily rich in the intersection of computer science and journalism. It feels like there’s a nascent field in the making, tied to the rise of the internet. The last few years have seen calls for a new class of “programmer journalist” and the birth of a community of hacks and hackers. Meanwhile, several schools are now offering joint degrees. But we’ll need more than competent programmers in newsrooms. What are the key problems of computational journalism? What other fields can we draw upon for ideas and theory? For that matter, what is it?

I’d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of “reporting” — because information not published is information not known — but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.

“Computational journalism” has no textbooks yet. In fact, the term is barely recognized. The phrase seems to have emerged at Georgia Tech in 2006 or 2007. Nonetheless I feel like there are already important topics and key references.

Data journalism
Data journalism is obtaining, reporting on, curating and publishing data in the public interest. The practice is often more about spreadsheets than algorithms, so I’ll suggest that not all data journalism is “computational,” in the same way that a novel written on a word processor isn’t “computational.” But data journalism is interesting and important and dovetails with computational journalism in many ways.

Visualization
Big data requires powerful exploration and storytelling tools, and increasingly that means visualization. But there’s good visualization and bad visualization, and the field has advanced tremendously since Tufte wrote The Visual Display of Quantitative Information. There is lots of good science that is too little known, and many open problems here.

  • Tamara Munzner’s chapter on visualization is the essential primer. She puts visualization on rigorous perceptual footing, and discusses all the major categories of practice. Absolutely required reading for anyone who works with pictures of data.
  • Ben Fry invented the Processing language and wrote his PhD thesis on “computational information design,” which is his powerful conception of the iterative, interactive practice of designing useful visualizations.
  • How do we make visualization statistically rigorous? How do we know we’re not just fooling ourselves when we see patterns in the pixels? This amazing paper by Wickham et al. has some answers; a rough sketch of their “lineup” idea follows this list.
  • Is a visualization a story? Segel and Heer explore this question in “Narrative Visualization: Telling Stories with Data.”
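To make the Wickham et al. point concrete, here is a rough sketch of the “lineup” protocol their paper describes (my own toy version in Python, not their code, which uses R): the plot of the real data is hidden among plots of shuffled data, and if readers can’t pick it out, the pattern we thought we saw probably isn’t there. All the data below is synthetic.

```python
# A toy "lineup": the real scatterplot hides among panels where the y values
# have been shuffled, destroying any genuine relationship. The data is made up.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 0.3 * x + rng.normal(0, 1.5, 100)      # a weak but real relationship

n_panels = 20
real_panel = int(rng.integers(n_panels))   # where the real data hides

fig, axes = plt.subplots(4, 5, figsize=(10, 8), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    if i == real_panel:
        ax.scatter(x, y, s=5)                    # real data
    else:
        ax.scatter(x, rng.permutation(y), s=5)   # null data
    ax.set_title(str(i), fontsize=8)

plt.tight_layout()
plt.show()
print("The real data was in panel", real_panel)
```

If a viewer can reliably point to the real panel, the visual pattern is meaningful in roughly the same sense that a low p-value is statistically significant.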

Computational linguistics
Data is more than numbers. Given that the web is designed to be read by humans, it makes heavy use of human language. And then there are all the world’s books, and the archival recordings of millions of speeches and interviews. Computers are slowly getting better at dealing with language.

Communications technology and free speech
Code is law. Because our communications systems use software, the underlying mathematics of communication lead to staggering political consequences — including whether or not it is possible for governments to verify online identity or remove things from the internet. The key topics here are networks, cryptography, and information theory.

  • The Handbook of Applied Cryptography is a classic, and free online. But despite the title it doesn’t really explain how crypto is used in the real world, like Wikipedia does.
  • It’s important to know how the internet routes information, using TCP/IP and BGP, or at a somewhat higher level, things like the BitTorrent protocol. The technical details determine how hard it is to do things like block websites, suppress the dissemination of a file, or remove entire countries from the internet.
  • Anonymity is deeply important to online free speech, and very hard. The Tor project is the outstanding leader in anonymity-related research.
  • Information theory is stunningly useful across almost every technical discipline. Pierce’s short textbook is the classic introduction, while Tom Schneider’s Information Theory Primer seems to be the best free online reference.
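As a tiny taste of why information theory turns up everywhere, here is a toy calculation of Shannon entropy, the quantity that bounds how compactly any message can be encoded. This is my own illustration, not drawn from the references above, and the example strings are arbitrary.

```python
# Shannon entropy of a string, in bits per character. Entropy is a hard lower
# bound on how compactly a source can be encoded, which is one sense in which
# the mathematics of communication constrains what any channel can carry.
from collections import Counter
from math import log2

def entropy_bits_per_char(text: str) -> float:
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(entropy_bits_per_char("aaaaaaaa"))                # 0.0, totally predictable
print(entropy_bits_per_char("abcdabcd"))                # 2.0, four equally likely symbols
print(entropy_bits_per_char("the state of the union"))  # somewhere in between
```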

Tracking the spread of information (and misinformation)
What do we know about how information spreads through society? Very little. But one nice side effect of our increasingly digital public sphere is the ability to track such things, at least in principle.

  • Memetracker was (AFAIK) the first credible demonstration of whole-web information tracking, following quoted soundbites through blogs and mainstream news sites and everything in between. Zach Seward has cogent reflections on their findings.
  • The Truthy Project aims for automated detection of astro-turfing on Twitter. They specialize in covert political messaging, or as I like to call it, computational propaganda.
  • We badly need tools to help us determine the source of any given online “fact.” There are many existing techniques that could be applied to the problem, as I discussed in a previous post.
  • If we had information provenance tools that worked across a spectrum of media outlets and feed types (web, social media, etc.) it would be much cheaper to do the sort of information ecosystem studies that Pew and others occasionally undertake. This would lead to a much better understanding of who does original reporting.

Filtering and recommendation
With vastly more information than ever before available to us, attention becomes the scarcest resource. Algorithms are an essential tool in filtering the flood of information that reaches each person. (Social media networks also act as filters.)

  • The paper on preference networks by Turyen et al. is probably as good an introduction as anything to the state of the art in recommendation engines, those algorithms that tell you what articles you might like to read or what movies you might like to watch. A toy version of the underlying idea is sketched after this list.
  • Before Google News there was Columbia Newsblaster, which incorporated a number of interesting algorithms such as multi-lingual article clustering, automatic summarization, and more, as described in this paper by McKeown et al.
  • Anyone playing with clustering algorithms needs to have a deep appreciation of the ugly duckling theorem, which says that there is no categorization without preconceptions. King and Grimmer explore this with their technique for visualizing the space of clusterings.
  • Any digital journalism product which involves the audience to any degree — that should be all digital journalism products — is a piece of social software, well defined by Clay Shirky in his classic essay, “A Group Is Its Own Worst Enemy.” It’s also a “collective knowledge system” as articulated by Chris Dixon.
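For a sense of what is inside those recommendation engines, here is a toy item-based collaborative filter. It is a generic illustration, not the method of the Turyen et al. paper: articles a user hasn’t read are scored by their similarity to articles that user already liked, where similarity is computed from everyone else’s reading patterns. The article names and “ratings” are invented.

```python
# A toy item-based recommender: article-to-article similarity comes from
# co-reading patterns, then unread articles are ranked by similarity to what
# the user already liked. All names and ratings below are invented.
import numpy as np

articles = ["budget", "election", "sports", "science"]
# rows = users, columns = articles; 1 = read and liked, 0 = not
ratings = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# item-item cosine similarity
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user_row, k=2):
    scores = sim @ user_row           # similarity-weighted sum over liked articles
    scores[user_row > 0] = -np.inf    # never re-recommend something already read
    return [articles[i] for i in np.argsort(-scores)[:k]]

print(recommend(ratings[0]))  # suggestions for the first (hypothetical) user
```

Real systems add a great deal on top of this (implicit feedback, matrix factorization, recency), but the core move of ranking unseen items by similarity to past behavior is the same.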

Measuring public knowledge
If journalism is about “informing the public” then we must consider what happens to stories after publication — this is the “last mile” problem in journalism. There is almost none of this happening in professional journalism today, aside from basic traffic analytics. The key question here is, how does journalism change ideas and action? Can we apply computers to help answer this question empirically?

  • World Public Opinion’s recent survey of misinformation among American voters solves this problem in the classic way, by doing a randomly sampled opinion poll. I discuss their bleak results here.
  • Blogosphere maps and other kinds of visualizations can help us understand the public information ecosystem, such as this interactive visualization of Iranian blogs. I have previously suggested using such maps as a navigation tool that might broaden our information horizons.
  • UN Global Pulse is a serious attempt to create a real-time global monitoring system to detect humanitarian threats in crisis situations. They plan to do this by mining the “data exhaust” of entire societies — social media postings, online records, news reports, and whatever else they can get their hands on. Sounds like key technology for journalism.
  • Vox Civitas is an ambitious social media mining tool designed for journalists. Computational linguistics, visualization, and more.

Research agenda
I know of only one work which proposes a research agenda for computational journalism.

This paper presents a broad vision and is really a must-read. However, it deals almost exclusively with reporting, that is, finding new knowledge and making it public. I’d like to suggest that the following unsolved problems are also important:

  • Tracing the source of any particular “fact” found online, and generally tracking the spread and mutation of information.
  • Cheap metrics for the state of the public information ecosystem. How accurate is the web? How accurate is a particular source?
  • Techniques for mapping public knowledge. What is it that people actually know and believe? How polarized is a population? What is under-reported? What is well reported but poorly appreciated?
  • Information routing and timing: how can we route each story to the set of people who might be most concerned about it, or best in a position to act, at the moment when it will be most relevant to them?

This sort of attention to the health of the public information ecosystem as a whole, beyond just the traditional surfacing of new stories, seems essential to the project of making journalism work.

The state of The State of the Union coverage, online

The State of the Union is a big pre-planned event, so it’s a great place to showcase new approaches and techniques. What do digital news organizations do when they go all out? Here’s my roundup of online coverage Tuesday night.

Live coverage

The Huffington Post, the New York Times, the Wall Street Journal, ABC, CNN, Mashable, and many others, including even Mother Jones, had live web video. But you can get live video on television, so perhaps the digitally native form of the live blog is more interesting. This can include commentary from multiple reporters, reactions from social media, link round-ups, etc. The New York Times, the Boston Globe, the Wall Street Journal, CNN, MSNBC, and many others had a live blog. The Huffington Post’s effort was particularly comprehensive, continuing well into Wednesday afternoon.

Multi-format, socially-aware live coverage is now standard, and by my reckoning makes television look meagre. But the experience is not really available on tablet and mobile yet. For example, almost all of the live video feeds were in Flash and therefore unavailable on Apple devices, as CNET reports.

As far as tools go, there was some use of CoveritLive, but most live blogs seemed to be using nondescript custom software.

Visualizations

Lots of visualization love this year. But visualizations take time to create, so most of them were rooted in previously available SOTU information. The Wall Street Journal did an interactive topic and keyword breakdown of Obama’s addresses to Congress since 2009, which moved about an hour after Tuesday’s speech concluded.

The New York Times had a snazzy graphic comparing the topics of 75 years of SOTU addresses, by looking at the rates of certain carefully chosen words. Rollovers give individual counts, but it’s mostly a flat graphic.

The Guardian Data Blog took a similar historical approach, with Wordles for SOTU speeches from Obama and seven other presidents back to Washington. Being the Data Blog, they also put the word frequencies for these speeches into a downloadable spreadsheet. The graphic itself is huge, definitely intended for big print pages.

A shout-out to my AP colleagues for all their hard work on our SOTU interactive, which included the video, a fact-checked transcript, and an animated visualization of Twitter responses before, during, and after the State of the Union.

But it’s not clear what, if anything, we can actually learn from such visualizations. In terms of solid journalism content, possibly the best visualization came not from a news organization but from Nick Diakopoulos and co. at Rutgers University. Their Vox Civitas tool does filtering, search, and visualization of over 100,000 tweets captured during the address.

I find this interface a little too complex for general audience consumption — definitely a power user’s tool. But the algorithms are second to none. For example, Vox Civitas detects “relevance” by comparing each tweet to the text of the speech delivered within the previous two minutes, and the automated keyword extraction — the keywords are shown at the bottom of the interface — is based on tf-idf and seems to choose really interesting and relevant words. The interactive graph of keyword frequency over time clearly shows the sort of information that I had hoped to reveal with the AP’s visualization.
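For the curious, the core of tf-idf fits in a few lines. The sketch below is a generic illustration of the scoring idea, not the Vox Civitas code: a word scores highly in one time slice of tweets if it is frequent there but rare across the other slices. The example “tweets” are invented.

```python
# Generic tf-idf keyword scoring over time slices of tweets (illustrative only).
# tf = how often a word appears in this slice; idf = how rare it is across slices.
from collections import Counter
from math import log

slices = [
    "jobs jobs economy recovery small business tax",
    "salmon joke salmon interior commerce laughing",
    "education teachers win the future sputnik moment",
]
docs = [s.split() for s in slices]
df = Counter(word for doc in docs for word in set(doc))  # document frequency

def top_keywords(doc, k=2):
    tf = Counter(doc)
    scores = {w: (c / len(doc)) * log(len(docs) / df[w]) for w, c in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

for doc in docs:
    print(top_keywords(doc))
```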

Fact Checking

A number of organizations did real-time or near real-time fact checking, as Yahoo reports. The Sunlight Foundation used its Sunlight Live system for real-time fact checks and commentary. This platform, incorporating live video, social media monitoring, and other components, is expected to be available as an open-source web app, for the use of other news organizations, by mid-2011.

The Associated Press published a long fact check piece (also integrated into the AP interactive), ABC had their own story, and CNN took a stab at it.

But the heaviest hitter was PolitiFact, which had a number of fact-check rulings within hours and several more by Wednesday evening. These are collected in a nice summary article, but as is their custom, the individual fact checks are extensively documented and linked to primary sources.

Audience engagement

Pretty much every news organization had some SOTU action on social media, though with varying degrees of aggressiveness and creativity. Some of the more interesting efforts involved solicitation of audience responses of a specific kind. NPR asked people to describe their reaction to the State of the Union in three words. This was promoted aggressively on Twitter and Facebook. They also asked for political affiliation, and split out the 4,000 responses into Democratic and Republican word clouds.

Apparently, Obama’s salmon joke went down well. The Wall Street Journal went live Tuesday morning with “The State of the Union is…”, asking viewers to leave a one-word answer. This was also promoted on Twitter. Their results were presented in the same interactive, as a popularity-sorted list.

Aside from this type of interactive, we saw lots of aggressive social media engagement in general. The more social-media-savvy organizations were all over this, promoting their upcoming coverage and responding to their audiences. As usual, the Huffington Post was pretty seriously tweeting the event, posting about updates to their live blog, etc., and going well into Wednesday morning. Perhaps inspired by NPR, they encouraged people to tweet their #3wordreaction to the speech. They also collected and highlighted reaction from teachers, Sarah Palin, etc.

But as an AP colleague of mine asked, engagement to what end? Getting people’s attention is great, but then how do we, as journalists, focus that attention in a way that makes people think or act?

The White House

No online media roundup of the SOTU would be complete without a discussion of the White House’s own efforts, including web and mobile app presences. Fortunately, Nieman Journalism Lab has done this for us. Here I’ll just add that the White House livestreamed a Q&A session in front of an audience immediately after the speech, in which White House Office of Public Engagement’s Kal Penn (aka Kumar) read questions from social media. Then Obama himself did an interview Thursday afternoon in which he answered questions submitted as videos on YouTube.

What is news when the audience is editor?

This is a paper I wrote in December 2009. I’ve decided to post it now, partially because it contains a previously unreported 30-day content comparison of Digg versus the New York Times. Looking back on this work, I think that its greatest weakness is an under-appreciation of the importance of production processes in determining what gets reported and how. In other words, I believe now that the intense pressure of daily deadlines shapes the news far more than external influences such as political and commercial pressures — at least in countries where the press is relatively free. Also available as a PDF.

Abstract
There are now several websites which allow users to assemble news content from around the internet by means of voting systems. The result is a new kind of front page that directly reflects what the audience believes to be salient, as opposed to what the editorial staff of a newsroom believes the audience should know. Content analyses of such sites show that they have little overlap with mainstream media agendas (5% in a previous study). In fact, many of the items selected by users would not traditionally be considered “news” at all. This paper examines the shift from editor to audience agendas in the context of previous theories of news production, discusses existing content analysis work on the subject, and reports on a new 30 day study of Digg.com versus NYTimes.com.

Introduction
No news organization can cover everything. Traditionally, it is ultimately the editor of a news publication who decides what is newsworthy: what stories reporters will follow, and what stories will be published. It has been considered part of the value of a news organization to determine what its audiences need to know about.

It’s never been entirely clear how professional journalists decide which events are worth reporting, out of all the events taking place in the world. Neither has it been obvious how editorial choices relate to the audience’s personal judgments about what is important, but such questions were largely theoretical before the advent of the web. “I own a newspaper, you do not” was always the implicit end to discussions about who got to decide what was news.

Today, publishing is near-free and the news package has been disaggregated. An online audience member can select single stories that interest them, without reading or even really being aware of the traditional news package. Alongside this disaggregation we find a new class of online applications that re-aggregate content from multiple sources. Readers vote on pages from across the web, and the top-rated items are displayed on the aggregator’s home page.

News consumers are literally tearing the world’s newspapers apart and re-assembling them to fit their own agendas, including lots of content not traditionally considered news at all.

This paper examines what we can learn about the online audience’s judgment not only of what is important but what is news at all, and how it differs from that of traditional newsrooms. I review previous work on “news values” and “news agenda” in professional journalism, look at measurements of what audiences view online, and report on my own 30 day quantitative study of Digg as compared to the New York Times.

Features of the audience-generated agenda