A computational journalism reading list

[Last updated: 18 April 2011 — added statistical NLP book link]

There is something extraordinarily rich in the intersection of computer science and journalism. It feels like there’s a nascent field in the making, tied to the rise of the internet. The last few years have seen calls for a new class of “programmer journalist” and the birth of a community of hacks and hackers. Meanwhile, several schools are now offering joint degrees. But we’ll need more than competent programmers in newsrooms. What are the key problems of computational journalism? What other fields can we draw upon for ideas and theory? For that matter, what is it?

I’d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of “reporting” — because information not published is information not known — but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.

“Computational journalism” has no textbooks yet. In fact the term is barely recognized. The phrase seems to have emerged at Georgia Tech in 2006 or 2007. Nonetheless I feel like there are already important topics and key references.

Data journalism
Data journalism is obtaining, reporting on, curating and publishing data in the public interest. The practice is often more about spreadsheets than algorithms, so I’ll suggest that not all data journalism is “computational,” in the same way that a novel written on a word processor isn’t “computational.” But data journalism is interesting and important and dovetails with computational journalism in many ways.

Visualization
Big data requires powerful exploration and storytelling tools, and increasingly that means visualization. But there’s good visualization and bad visualization, and the field has advanced tremendously since Tufte wrote The Visual Display of Quantitative Information. There is lots of good science that is too little known, and many open problems here.

  • Tamara Munzner’s chapter on visualization is the essential primer. She puts visualization on rigorous perceptual footing, and discusses all the major categories of practice. Absolutely required reading for anyone who works with pictures of data.
  • Ben Fry invented the Processing language and wrote his PhD thesis on “computational information design,” which is his powerful conception of the iterative, interactive practice of designing useful visualizations.
  • How do we make visualization statistically rigorous? How do we know we’re not just fooling ourselves when we see patterns in the pixels? This amazing paper by Wickham et al. has some answers.
  • Is a visualization a story? Segel and Heer explore this question in “Narrative Visualization: Telling Stories with Data.”
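
The core idea of Wickham et al.’s graphical inference can be sketched without any plotting library: hide the real dataset in a “lineup” of decoys generated under a null hypothesis, and see whether a viewer can still pick it out. Here is a minimal illustration in plain Python (the function name `lineup` and the shuffle-based null are my choices; the real protocol renders actual plots for human judges):

```python
import random

def lineup(real_data, num_decoys=19, seed=0):
    """Hide the real dataset among decoys generated under a null
    hypothesis: shuffling the y-values destroys any x-y relationship.
    If a viewer can still spot the real panel among the decoys, the
    pattern is unlikely to be a fluke."""
    rng = random.Random(seed)
    xs, ys = zip(*real_data)
    decoys = []
    for _ in range(num_decoys):
        shuffled = list(ys)
        rng.shuffle(shuffled)
        decoys.append(list(zip(xs, shuffled)))
    position = rng.randrange(num_decoys + 1)  # where the real panel hides
    panels = decoys[:position] + [list(real_data)] + decoys[position:]
    return panels, position

data = [(x, 2 * x + 1) for x in range(10)]  # a strong linear pattern
panels, position = lineup(data)
print(len(panels))  # 20 panels: 19 decoys plus the real one
```

With 19 decoys, a viewer who correctly identifies the real panel has done so with probability 1/20 under the null, which is exactly the logic of a p < 0.05 significance test.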

Computational linguistics
Data is more than numbers. Given that the web is designed to be read by humans, it makes heavy use of human language. And then there are all the world’s books, and the archival recordings of millions of speeches and interviews. Computers are slowly getting better at dealing with language.
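
To make “dealing with language” concrete: even the crudest technique, counting words after discarding common stopwords, already surfaces what a text is about. This is a toy sketch (the stopword list and `keywords` function are mine); real computational linguistics goes far beyond frequency counts:

```python
import re
from collections import Counter

# A tiny stopword list for illustration; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def keywords(text, n=3):
    """Return the n most frequent non-stopword terms: a crude first
    approximation of what a document is about."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

speech = ("The council voted to raise the water rates. Residents protested "
          "the rates, and the council promised to revisit water billing.")
print(keywords(speech))  # → ['council', 'water', 'rates']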

Communications technology and free speech
Code is law. Because our communications systems use software, the underlying mathematics of communication lead to staggering political consequences — including whether or not it is possible for governments to verify online identity or remove things from the internet. The key topics here are networks, cryptography, and information theory.

  • The Handbook of Applied Cryptography is a classic, and free online. But despite the title it doesn’t really explain how crypto is used in the real world; Wikipedia does a better job of that.
  • It’s important to know how the internet routes information, using TCP/IP and BGP, or at a somewhat higher level, things like the BitTorrent protocol. The technical details determine how hard it is to do things like block websites, suppress the dissemination of a file, or remove entire countries from the internet.
  • Anonymity is deeply important to online free speech, and very hard. The Tor project is the outstanding leader in anonymity-related research.
  • Information theory is stunningly useful across almost every technical discipline. Pierce’s short textbook is the classic introduction, while Tom Schneider’s Information Theory Primer seems to be the best free online reference.
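
As a taste of why information theory turns up everywhere: Shannon entropy, the average number of bits per symbol needed to encode a message given its symbol frequencies, fits in a few lines. This is a self-contained sketch, not taken from any of the references above:

```python
import math
from collections import Counter

def entropy_bits(message):
    """Shannon entropy in bits per symbol: the theoretical lower
    bound on how compactly the symbols can be encoded, given only
    their relative frequencies."""
    total = len(message)
    counts = Counter(message)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

print(entropy_bits("aaaa"))  # 0.0: a single repeated symbol carries no information
print(entropy_bits("abab"))  # 1.0: two equally likely symbols cost one bit each
print(entropy_bits("abcd"))  # 2.0: four equally likely symbols cost two bits each
```

The same quantity bounds compression ratios, channel capacities, and how much an eavesdropper can learn, which is why the concept crosses so many technical disciplines.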

Tracking the spread of information (and misinformation)
What do we know about how information spreads through society? Very little. But one nice side effect of our increasingly digital public sphere is the ability to track such things, at least in principle.

  • Memetracker was (AFAIK) the first credible demonstration of whole-web information tracking, following quoted soundbites through blogs and mainstream news sites and everything in between. Zach Seward has cogent reflections on their findings.
  • The Truthy Project aims for automated detection of astro-turfing on Twitter. They specialize in covert political messaging, or as I like to call it, computational propaganda.
  • We badly need tools to help us determine the source of any given online “fact.” There are many existing techniques that could be applied to the problem, as I discussed in a previous post.
  • If we had information provenance tools that worked across a spectrum of media outlets and feed types (web, social media, etc.) it would be much cheaper to do the sort of information ecosystem studies that Pew and others occasionally undertake. This would lead to a much better understanding of who does original reporting.
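
A toy version of what Memetracker does, extracting quoted soundbites and counting how many outlets repeat them, fits in a few lines. This is only a sketch (the helper names are mine, and the real system also clusters mutated variants of each phrase):

```python
import re
from collections import Counter

def extract_quotes(text, min_words=3):
    """Pull quoted phrases out of a document. Very short quotes are
    mostly noise (scare quotes, titles), so require a minimum length."""
    quotes = re.findall(r'"([^"]+)"', text)
    return [q.lower().strip(".,") for q in quotes if len(q.split()) >= min_words]

def track(documents):
    """Count how many documents each soundbite appears in."""
    counts = Counter()
    for doc in documents:
        counts.update(set(extract_quotes(doc)))  # count each doc once
    return counts

docs = [
    'The senator said "we will not raise taxes" on Monday.',
    'Critics pounced after she repeated "we will not raise taxes."',
    'An aide offered "no comment" when reporters called.',
]
print(track(docs).most_common(1))  # → [('we will not raise taxes', 2)]
```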

Filtering and recommendation
With vastly more information available than ever before, attention becomes the scarcest resource. Algorithms are an essential tool for filtering the flood of information that reaches each person. (Social media networks also act as filters.)

  • The paper on preference networks by Turyen et al. is probably as good an introduction as anything to the state of the art in recommendation engines, those algorithms that tell you what articles you might like to read or what movies you might like to watch.
  • Before Google News there was Columbia News Blaster, which incorporated a number of interesting algorithms, such as multilingual article clustering and automatic summarization, as described in this paper by McKeown et al.
  • Anyone playing with clustering algorithms needs to have a deep appreciation of the ugly duckling theorem, which says that there is no categorization without preconceptions. King and Grimmer explore this with their technique for visualizing the space of clusterings.
  • Any digital journalism product which involves the audience to any degree — that should be all digital journalism products — is a piece of social software, well defined by Clay Shirky in his classic essay, “A Group Is Its Own Worst Enemy.” It’s also a “collective knowledge system” as articulated by Chris Dixon.
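
For a feel of how recommendation engines work, here is a minimal item-based collaborative filter: score each unread item by its rating-vector similarity to the items a user has already rated. This is an illustrative sketch only (the cosine variant over shared users is one common choice); production systems use matrix factorization and far richer signals:

```python
import math

def cosine(a, b):
    """Cosine similarity between two item rating vectors ({user: rating}),
    taking the dot product over users who rated both items."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(ratings, user):
    """Rank unrated items by similarity-weighted scores from the
    items the user has rated. `ratings` maps item -> {user: rating}."""
    liked = {item: r[user] for item, r in ratings.items() if user in r}
    scores = {}
    for item, item_ratings in ratings.items():
        if item in liked:
            continue  # already seen
        scores[item] = sum(cosine(item_ratings, ratings[seen]) * rating
                           for seen, rating in liked.items())
    return sorted(scores, key=scores.get, reverse=True)

ratings = {
    "budget story": {"ann": 5, "bob": 4},
    "tax analysis": {"ann": 4, "bob": 5},
    "sports recap": {"eve": 5, "bob": 1},
    "game preview": {"eve": 4, "bob": 2},
}
print(recommend(ratings, "ann"))  # → ['game preview', 'sports recap']
```

Even on this tiny matrix the logic is visible: bob’s tastes overlap with ann’s, and bob rated the game preview above the sports recap, so it ranks first for ann.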

Measuring public knowledge
If journalism is about “informing the public” then we must consider what happens to stories after publication — this is the “last mile” problem in journalism. There is almost none of this happening in professional journalism today, aside from basic traffic analytics. The key question here is, how does journalism change ideas and action? Can we apply computers to help answer this question empirically?

  • World Public Opinion’s recent survey of misinformation among American voters solves this problem in the classic way, by doing a randomly sampled opinion poll. I discuss their bleak results here.
  • Blogosphere maps and other kinds of visualizations can help us understand the public information ecosystem, such as this interactive visualization of Iranian blogs. I have previously suggested using such maps as a navigation tool that might broaden our information horizons.
  • UN Global Pulse is a serious attempt to create a real-time global monitoring system to detect humanitarian threats in crisis situations. They plan to do this by mining the “data exhaust” of entire societies — social media postings, online records, news reports, and whatever else they can get their hands on. Sounds like key technology for journalism.
  • Vox Civitas is an ambitious social media mining tool designed for journalists. Computational linguistics, visualization, and more.

Research agenda
I know of only one work that proposes a research agenda for computational journalism.

This paper presents a broad vision and is really a must-read. However, it deals almost exclusively with reporting, that is, finding new knowledge and making it public. I’d like to suggest that the following unsolved problems are also important:

  • Tracing the source of any particular “fact” found online, and generally tracking the spread and mutation of information.
  • Cheap metrics for the state of the public information ecosystem. How accurate is the web? How accurate is a particular source?
  • Techniques for mapping public knowledge. What is it that people actually know and believe? How polarized is a population? What is under-reported? What is well reported but poorly appreciated?
  • Information routing and timing: how can we route each story to the set of people who might be most concerned about it, or best in a position to act, at the moment when it will be most relevant to them?

This sort of attention to the health of the public information ecosystem as a whole, beyond just the traditional surfacing of new stories, seems essential to the project of making journalism work.

FMRI “Mind Reading” Doesn’t Yet Threaten Humanity

Visual image reconstruction from fMRI

It is now possible to see what a person is looking at by scanning their brain. The technique, published last November by a team of Japanese neuroscientists, uses fMRI to reconstruct a digital image of the picture entering the eye, albeit at very low resolution and only after hundreds of training runs. Still, it’s an awesome development, and many articles covering this research have called it “mind reading” (1, 2, 3, 4, 5). But it really isn’t, and it’s fun to explore what real “mind reading” would imply.

When I hear “mind reading” I want psychic abilities. I want to be able to know what number you’re thinking of, where you were on the night of March 4th, and what you actually think of my soufflé. This is the sort of technology that could be badly misused, as the comments on one blog note:

Am I the only one finding this DEEPLY disturbing? It opens the doors to some of the scariest 1984-style total-control future predictions. Imagine you can’t hide your f#&%!ng MIND!

Fortunately, we’re not there yet. Moreover, if we did have the technology to read minds, we’d have much bigger societal issues than privacy to deal with. The existence of “mind reading machines” would imply that we possessed good formal models of the human mind, and that is a can of worms.


Minds Are Tricky Things — Part III

Everybody thinks they know how their mind works, but they don’t. You can ask someone why they like their boyfriend, or why they chose a job, or whether a book changed their opinion of global warming, and they’ll think about it for a moment and happily give you an answer. But they’re making it up.

The experiments were done ages ago, and the research is still going on, continuing to tease apart actual cause and psychological effect. We know now that what people tell us about their own mental processes is quite thoroughly inaccurate. We all believe that we have this magic thing called “introspection” that lets us see what is going on in our own minds, but in reality we don’t. It’s a fictional superpower.

The research on this point is really quite good. It’s not even a new finding, having been understood for at least the last fifty years. And yet this simple but important fact has never quite managed to make it into popular culture.

Perhaps no one wants to believe it.


Minds Are Tricky Things — Part II

In a fit of recursion, I am going to begin my discussion of the scientific understanding of the mind by bringing up a piece of psychology research into how people perceive neuroimaging. This not only gives a taste of what different types of research can be like, but reveals something rather disturbing: merely adding a brain scan image or two makes people more likely to rate an article as scientifically sound. This gets us into questions of what is and isn’t a good reason to believe any particular research conclusion, which is ultimately what I want to talk about in this series of articles.

At the present time there are basically two technologies that can give us some idea of the activity of a working brain: positron emission tomography (PET) and functional magnetic resonance imaging (fMRI). They both have important limitations in terms of resolution, what they actually measure, and many other things besides, but they’re also pretty amazing technologies. They produce detailed 3D maps of the “activity” of a whole brain, which are often represented like this:

A Functional Magnetic Resonance Imaging (fMRI) image


Minds Are Tricky Things — Part I

I’ve been reading the literature on neuroscience, cognitive linguistics, psychology and such for a long time now, and the temptation to write about what’s new is overwhelming. There are so many exciting things being learned, and equally there are so many subtle problems of how we can know anything at all about the subjective world. But before I can bombard you with chewy words like “affect” and “epistemology,” I need to explain why any of this matters. It matters because people matter.

It is a difficult and ancient fact that we as conscious beings don’t live in the real world. There are boundaries to what we know and what we can know. I am right now sitting on a couch in my house in Oakland, California. Across the ocean, there is a woman sitting on the floor of her Tokyo apartment. I have never met her, but she is just as much a part of the world as I am. Not my world though. There seem to be boundaries to the things I perceive. Figuring out those boundaries and how things get into and out of them is the process of figuring out me, and everyone else too.
