A computational journalism reading list

January 31, 2011 | May 13, 2025 | Jonathan Stray | belief, computational journalism, journalism, knowledge, media, minds, misinformation, politics, social media

[Last updated: 18 April 2011 — added statistical NLP book link]

There is something extraordinarily rich in the intersection of computer science and journalism. It feels like there’s a nascent field in the making, tied to the rise of the internet. The last few years have seen calls for a new class of “programmer journalist” and the birth of a community of hacks and hackers. Meanwhile, several schools are now offering joint degrees. But we’ll need more than competent programmers in newsrooms. What are the key problems of computational journalism? What other fields can we draw upon for ideas and theory? For that matter, what is it?

I’d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of “reporting” — because information not published is information not known — but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.

“Computational journalism” has no textbooks yet. In fact the term is barely recognized. The phrase seems to have emerged at Georgia Tech in 2006 or 2007. Nonetheless I feel like there are already important topics and key references.

Data journalism
Data journalism is obtaining, reporting on, curating and publishing data in the public interest. The practice is often more about spreadsheets than algorithms, so I’ll suggest that not all data journalism is “computational,” in the same way that a novel written on a word processor isn’t “computational.” But data journalism is interesting and important and dovetails with computational journalism in many ways.

The Nieman Journalism Lab’s interview with Guardian Data Blog editor Simon Rogers remains a solid introduction to (one kind of) contemporary practice.
The best practical guides I know are Rogers’ “How to: get to grips with data journalism” and Dan Nguyen’s series of data-scraping tutorials at ProPublica.
Stanford’s Journalism in the Age of Data is an hour-long documentary on data journalism and visualization.
The web is a linked system of human-readable documents. Now Tim Berners-Lee wants to create a web of machine-readable linked data. The full potential is unclear, but it’s a big idea that may come to be the backbone of semantic web visions. The New York Times, The Guardian, and others are experimenting with open data APIs.
Everyblock creator Adrian Holovaty seems to have been the first to suggest that reporters file structured data in his 2006 “A Fundamental Way Newspaper Websites Need to Change.” This idea is beautifully expanded in Stijn Debrouwere’s “Information Architecture for News Websites” series.

Visualization
Big data requires powerful exploration and storytelling tools, and increasingly that means visualization. But there’s good visualization and bad visualization, and the field has advanced tremendously since Tufte wrote The Visual Display of Quantitative Information. There is lots of good science that is too little known, and many open problems here.

Tamara Munzner’s chapter on visualization is the essential primer. She puts visualization on rigorous perceptual footing, and discusses all the major categories of practice. Absolutely required reading for anyone who works with pictures of data.
Ben Fry invented the Processing language and wrote his PhD thesis on “computational information design,” which is his powerful conception of the iterative, interactive practice of designing useful visualizations.
How do we make visualization statistically rigorous? How do we know we’re not just fooling ourselves when we see patterns in the pixels? This amazing paper by Wickham et al. has some answers.
Is a visualization a story? Segel and Heer explore this question in “Narrative Visualization: Telling Stories with Data.”

Computational linguistics
Data is more than numbers. Given that the web is designed to be read by humans, it makes heavy use of human language. And then there are all the world’s books, and the archival recordings of millions of speeches and interviews. Computers are slowly getting better at dealing with language.

Word frequency techniques like tf-idf and the vector space document model are very simple and very useful. See also stemming. Lots more in the wonderful (and free!) Introduction to Information Retrieval. This book explains how search engines are built, and discusses tf-idf etc. in great technical detail.
Statistical language models are increasingly important for all kinds of applications. Michael Nielsen has a great introduction to statistical machine translation. Google’s Peter Norvig discusses how he implemented statistical spelling correction on his laptop during a long plane flight. For the full deal, see the book Foundations of Statistical Natural Language Processing.
On a related note, Google N-gram viewer lets you look at the frequency of short phrases within 4% of all books published, ever. The excellent paper gives examples of how to use this for cultural research. Dan Cohen has important criticisms.
Speech-to-text algorithms enable automated transcription, and Matt Thompson explores the huge implications for journalism.
Reuters maintains the OpenCalais entity extraction service, which parses text to contextually determine who and what is referenced.
IBM’s Watson project built a question-answering system that reads reference books and wins at Jeopardy. Imagine how useful to journalists and curious readers this could be! This paper on the DeepQA system describes how they did it.
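The tf-idf weighting and vector space model mentioned at the top of this list are simple enough to sketch in a few lines. What follows is a minimal illustration, not the formulation from any particular textbook: it assumes a crude whitespace tokenizer, raw term counts for tf, and log(N / df) for idf, and the three example sentences are invented. Documents become sparse vectors of term weights, and cosine similarity compares them.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()  # number of documents each term appears in
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented example documents, for illustration only.
docs = [
    "The council approved the budget for road repairs.",
    "The city council voted on the road repair budget.",
    "The team won the championship game last night.",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two stories about the council budget
sim_02 = cosine(vecs[0], vecs[2])  # a budget story and a sports story
```

Because “the” appears in every document, its idf is zero and it contributes nothing, so the two council stories come out similar while the sports story does not. Note that “repairs” and “repair” count as different terms here, which is exactly the problem stemming addresses.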

Communications technology and free speech
Code is law. Because our communications systems use software, the underlying mathematics of communication lead to staggering political consequences — including whether or not it is possible for governments to verify online identity or remove things from the internet. The key topics here are networks, cryptography, and information theory.

The Handbook of Applied Cryptography is a classic, and free online. But despite the title, it doesn’t really explain how crypto is used in the real world, the way Wikipedia does.
It’s important to know how the internet routes information, using TCP/IP and BGP, or at a somewhat higher level, things like the BitTorrent protocol. The technical details determine how hard it is to do things like block websites, suppress the dissemination of a file, or remove entire countries from the internet.
Anonymity is deeply important to online free speech, and very hard. The Tor project is the outstanding leader in anonymity-related research.
Information theory is stunningly useful across almost every technical discipline. Pierce’s short textbook is the classic introduction, while Tom Schneider’s Information Theory Primer seems to be the best free online reference.
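For a concrete taste of information theory, the central quantity is Shannon entropy, H = -Σ p·log2(p), the average number of bits needed per symbol. The sketch below estimates it from the symbol frequencies of a single string; treating each character as an independent symbol is a simplifying assumption, and the function name is my own.

```python
import math
from collections import Counter

def entropy_bits(message):
    """Estimate Shannon entropy in bits per symbol from symbol frequencies."""
    counts = Counter(message)
    total = len(message)
    # Sum p * log2(p) over the observed symbols, then negate.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

one_symbol = entropy_bits("aaaa")    # no uncertainty at all
two_symbols = entropy_bits("abab")   # two equally likely symbols
four_symbols = entropy_bits("abcd")  # four equally likely symbols
```

A message that repeats one symbol carries no information per symbol, two equally likely symbols carry one bit, and four carry two bits.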

Tracking the spread of information (and misinformation)
What do we know about how information spreads through society? Very little. But one nice side effect of our increasingly digital public sphere is the ability to track such things, at least in principle.

Memetracker was (AFAIK) the first credible demonstration of whole-web information tracking, following quoted soundbites through blogs and mainstream news sites and everything in between. Zach Seward has cogent reflections on their findings.
The Truthy Project aims for automated detection of astro-turfing on Twitter. They specialize in covert political messaging, or as I like to call it, computational propaganda.
We badly need tools to help us determine the source of any given online “fact.” There are many existing techniques that could be applied to the problem, as I discussed in a previous post.
If we had information provenance tools that worked across a spectrum of media outlets and feed types (web, social media, etc.) it would be much cheaper to do the sort of information ecosystem studies that Pew and others occasionally undertake. This would lead to a much better understanding of who does original reporting.
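To make the provenance idea a little more concrete, here is a toy sketch of tracing a quoted phrase back to its earliest appearance. This is not how Memetracker works; it is a much simpler stand-in that compares overlapping three-word sequences (shingles) and picks the earliest matching document. The documents, field names, and threshold are all hypothetical, and dates are assumed to be ISO strings so that they sort correctly as text.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in the text, ignoring case and punctuation."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def overlap(quote, text, k=3):
    """Fraction of the quote's shingles that also occur in the text."""
    q = shingles(quote, k)
    if not q:
        return 0.0
    return len(q & shingles(text, k)) / len(q)

def earliest_source(quote, documents, threshold=0.5):
    """Return the earliest-published document that contains most of the quote, or None."""
    matches = [d for d in documents if overlap(quote, d["text"]) >= threshold]
    return min(matches, key=lambda d: d["published"], default=None)

# Hypothetical documents, invented for illustration.
documents = [
    {"source": "blog-a", "published": "2011-01-03",
     "text": "Officials said the bridge will reopen by spring, according to reports."},
    {"source": "wire-b", "published": "2011-01-01",
     "text": "The mayor announced that the bridge will reopen by spring after repairs."},
    {"source": "blog-c", "published": "2011-01-02",
     "text": "Local sports roundup and weather for the week."},
]
quote = "the bridge will reopen by spring"
first = earliest_source(quote, documents)
```

A real system would also have to cope with paraphrase, mutation of the quote over time, and unreliable timestamps, which is where the hard problems are.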

Filtering and recommendation
With vastly more information than ever before available to us, attention becomes the scarcest resource. Algorithms are an essential tool in filtering the flood of information that reaches each person. (Social media networks also act as filters.)

The paper on preference networks by Truyen et al. is probably as good an introduction as anything to the state of the art in recommendation engines, those algorithms that tell you what articles you might like to read or what movies you might like to watch.
Before Google News there was Columbia News Blaster, which incorporated a number of interesting algorithms such as multi-lingual article clustering, automatic summarization, and more as described in this paper by McKeown et al.
Anyone playing with clustering algorithms needs to have a deep appreciation of the ugly duckling theorem, which says that there is no categorization without preconceptions. King and Grimmer explore this with their technique for visualizing the space of clusterings.
Any digital journalism product which involves the audience to any degree — that should be all digital journalism products — is a piece of social software, well defined by Clay Shirky in his classic essay, “A Group Is Its Own Worst Enemy.” It’s also a “collective knowledge system” as articulated by Chris Dixon.
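For readers new to recommendation engines, the basic intuition can be shown with a far simpler technique than preference networks: a neighborhood-style collaborative filter that suggests articles read by people whose reading history overlaps with yours. Everything in this sketch is an assumption made for illustration, including the reader names, the article labels, and the scoring rule (each unread article is scored by how many articles its reader shares with you).

```python
from collections import Counter

def recommend(history, reader, top_n=2):
    """Suggest unread articles, weighted by how much each other reader overlaps with `reader`."""
    seen = history[reader]
    scores = Counter()
    for other, articles in history.items():
        if other == reader:
            continue
        shared = len(seen & articles)  # articles both readers have read
        if shared == 0:
            continue  # no common ground, so ignore this reader
        for article in articles - seen:
            scores[article] += shared
    return [article for article, _ in scores.most_common(top_n)]

# Hypothetical reading histories: reader -> set of article topics read.
history = {
    "ana": {"budget", "transit", "schools"},
    "ben": {"budget", "transit", "housing"},
    "cy": {"sports", "weather"},
    "dee": {"budget", "elections"},
}
suggestions = recommend(history, "ana")
```

Here “housing” ranks above “elections” because the reader who saw it shares two articles with ana rather than one. The same sketch also shows how such filters narrow what we see: nothing read only by cy will ever be suggested to ana.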

Measuring public knowledge
If journalism is about “informing the public” then we must consider what happens to stories after publication — this is the “last mile” problem in journalism. There is almost none of this happening in professional journalism today, aside from basic traffic analytics. The key question here is, how does journalism change ideas and action? Can we apply computers to help answer this question empirically?

World Public Opinion’s recent survey of misinformation among American voters solves this problem in the classic way, by doing a randomly sampled opinion poll. I discuss their bleak results here.
Blogosphere maps and other kinds of visualizations can help us understand the public information ecosystem, such as this interactive visualization of Iranian blogs. I have previously suggested using such maps as a navigation tool that might broaden our information horizons.
UN Global Pulse is a serious attempt to create a real-time global monitoring system to detect humanitarian threats in crisis situations. They plan to do this by mining the “data exhaust” of entire societies — social media postings, online records, news reports, and whatever else they can get their hands on. Sounds like key technology for journalism.
Vox Civitas is an ambitious social media mining tool designed for journalists. Computational linguistics, visualization, and more.

Research agenda
I know of only one work which proposes a research agenda for computational journalism.

“Computational Journalism: A Call to Arms for Database Researchers” by Sarah Cohen et al. raises the very intriguing possibility of building systems that automatically or semi-automatically scan databases for stories, document the rationale for believing certain facts, etc.

This paper presents a broad vision and is really a must-read. However, it deals almost exclusively with reporting, that is, finding new knowledge and making it public. I’d like to suggest that the following unsolved problems are also important:

Tracing the source of any particular “fact” found online, and generally tracking the spread and mutation of information.
Cheap metrics for the state of the public information ecosystem. How accurate is the web? How accurate is a particular source?
Techniques for mapping public knowledge. What is it that people actually know and believe? How polarized is a population? What is under-reported? What is well reported but poorly appreciated?
Information routing and timing: how can we route each story to the set of people who might be most concerned about it, or best in a position to act, at the moment when it will be most relevant to them?

This sort of attention to the health of the public information ecosystem as a whole, beyond just the traditional surfacing of new stories, seems essential to the project of making journalism work.

87 thoughts on “A computational journalism reading list”

binx says:

January 31, 2011 at 8:09 pm

Bravo, Jonathan, for compiling all of these resources into one categorized post. For anyone interested in the field of computational journalism, it is important to understand the wide breadth of disciplines from which it draws.

Best wishes to you in further developing the field!
Bente Kalsnes says:

February 1, 2011 at 9:06 am

I loved this reading list! I’m looking forward to checking out your links. You inspired me to write a blog post about the topic (in Norwegian, unfortunately, but try Google translate), you’ll find it here: http://blogg.origo.no/-/bulletin/show/626679_leseliste-aapne-data-og-datastoettet-journalistikk
Jonathan Stray says:

February 1, 2011 at 11:29 am

Thanks for that Bente. But your blog post seems to say that “data journalism” and “computational journalism” are the same thing. I believe that it’s useful to distinguish between the two terms, as I describe in the introduction to the “data journalism” section above.

– Jonathan
Bente Kalsnes says:

February 1, 2011 at 2:18 pm

I see what you mean regarding the definition of computational journalism vs. data journalism. I’ve fixed that now. Thanks!

In the Norwegian blog post, I used the term “datastøttet journalistikk”, which translates more or less to computer assisted journalism (CAR). This term is used quite frequently in Norwegian to describe this field (more Norwegian articles about the topic here: http://bit.ly/idOokf).

Do you regard CAR and computational journalism as two terms with similar meaning?
Devin Walker says:

February 1, 2011 at 2:35 pm

Great article and resource compilation. I graduated with a Journalism degree 4 years ago and since that time I have been doing nothing but web development. I knew a bit in college, but just enough to get by knowing the basic fundamentals of HTML, CSS, PHP and JavaScript. Since graduating I started working in the IT department at a major corporation. I’ve often thought my Journalism degree was essentially worthless, but I see it applied now the more I look everyday. Thanks for this excellent post and keep up the great work on your website.
Jonathan Stray says:

February 1, 2011 at 3:03 pm

Bente– no, “computer-assisted reporting” is about reporting, that is, finding new stories. I’m talking about also using computers for many other problems related to public information and knowledge. Not all of these areas have historically been part of what journalism does, but I am arguing that they should be.

– Jonathan
Ari Lacenski says:

February 1, 2011 at 10:45 pm

Your last point, on information routing and timing, calls out an interesting predictive relationship between story and audience — that news itself will precipitate new events. Makes me think of the DDOSes against Mastercard and Visa a few weeks ago, which wouldn’t have happened if not for all the reporting on WikiLeaks’ difficulties in staying online. I’m sure there are infinite examples. Do you think reporters or agencies have a specific responsibility to anticipate the effects of their work, to whatever extent is possible?
Jonathan Stray says:

February 1, 2011 at 11:52 pm

Yes, I do think reporters and publishers have a responsibility to consider the effects of their work. First of all, without that consideration there’s no way that journalism can do much of anything at all. How do you steer the profession without longstanding goals of some sort? Somewhere, somehow, you and your public have to decide what stories are “important” and should be worked on.

Journalism has also long dealt with possible negative effects, giving rise to conventions such as the anonymous source (when the person really is kept anonymous for the protection of important rights).

Another entanglement is the question of secrecy. The fundamental social responsibility of journalism gives rise to situations where information must be kept secret for some period of time– but exactly which secrets are legitimate is a question of enormous social importance. No one really knows where the right social contracts around the transparency of various entities lie, but many people feel that our institutions, including journalism, ought to be a great deal more transparent than they are now.

This has mostly been sold under the rubric of “accountability,” and that’s important but there’s also this: open institutions allow anyone with an interest to try to care for their workings. That is what the maker ethic brings to this question — a renewed, geeky sense of participation in society.
Jane Briggs-Bunting says:

February 2, 2011 at 8:42 am

Thanks for the thorough list and the amazing links. This will be a great resource for professionals and students.
Jeremy Antley says:

February 2, 2011 at 9:56 am

Great list here- I’m sending this to my other history grad buddies as many of the approaches outlined above could have definite impact on the future of the ‘digital humanities’. I feel like the tools and methods developed by journalists out of need to interpret new digital source materials will eventually be used for historical subjects. Perhaps it’s time to form a collaborative venture between the professions? Journalists could provide analysis of the current situation while historians provide context rooted in the past. Take Egypt for example- no one has mentioned how Egypt and the US have had a tenuous relationship dating back to Nasser and his decision to arm Egypt with Soviet Bloc weapons and aid, or even touched upon the reign of Sadat. But now I ramble- excellent list and thanks for taking the time to gather these papers/reactions in one place.
Nick Diakopoulos says:

February 2, 2011 at 12:14 pm

Thanks for this up-to-date list Jonathan. Just wanted to point out that Computational Journalism has been taught at Georgia Tech since 2007 and that the syllabus for the most recent offering of the course is here: http://compjournalism.wordpress.com/schedulesyllabus/ It’s a bit outdated but also has some more classic reading relating to many of the themes you bring up.

As for research, there’s some great work being done at U. Michigan (Balance project: http://www.smunson.com/portfolio/projects.php), Berkeley (http://opinion.berkeley.edu/landing/), Northwestern (http://infolab.northwestern.edu/) and of course at Rutgers where I do research in this field (http://www.nickdiakopoulos.com/research-and-projects/). The J-Schools just haven’t caught up yet, partly because they do not have an engineering culture and so have difficulty investing in creating new technologies.
Reg Chua says:

February 2, 2011 at 5:44 pm

Jonathan, hey, great list. And more importantly, categorizations. I’ll post on this in a bit as well.

I do like the distinction you draw between computational and data journalism, which is an important one. Ditto the need for much more study into visualization as a communications/narrative form.

And while this is outside the scope of your definition – much the way that understanding the traditional business models of print journalism was outside the scope of traditional journalism studies – I suspect it’s important to bring in some notions of business/revenues etc at least as an adjunct to this discussion. Traditional journalism shied away from understanding its broader position as an industry as well as a public service – to its cost, I believe.

Reg
David Locke says:

February 2, 2011 at 6:15 pm

In Argentina a human rights organization built a data warehouse from military promotions, orders, and disappearances. They uncovered who did what. Those that did had immunity, but were still forced out of the military.

In terms of computational journalism, you can build data warehouses to see what we cannot see. When is someone going to aim one at campaign contributions?
Reg Chua says:

February 2, 2011 at 7:22 pm

David,

There’s a fair amount on campaign contributions out there – from opensecrets.org to maplight.org to poligraft.org. Although it’s true that we could use more meaning put into some of this.

Reg
Hertzel Karbasi says:

February 13, 2011 at 10:38 am

Hello

This ” free textbook on information retrieval” URL is not correct!

Hertzel
Jonathan Stray says:

February 13, 2011 at 5:00 pm

Hertzel — you are right! Fixed. Thanks.
Mark T says:

February 15, 2011 at 12:33 pm

This is a great list. I just graduated with a Journalism degree with a minor in Computer Science and was wondering what I could possibly do with it. Now I know. Thank You.
Liliana says:

February 21, 2011 at 2:01 pm

Great list! Here is another paper on data journalism which documents best practices in the field and provides a list of data tools and DDJ innovators: http://mediapusher.eu/datadrivenjournalism/pdf/ddj_paper_final.pdf
Mark says:

March 10, 2011 at 12:15 am

Thanks so much, as a student interested in journalism I am fairly convinced that understanding this field will be extremely important in the future. I came upon your post while looking for information on the topic (I plan on reading each of these documents).
Jaine Stockler says:

April 2, 2011 at 5:35 pm

Loved your reading list – will continue to recommend it to interested people.
One aspect that interests me particularly is the current discussion around the role of journalists as ‘curators’ – I’m tempted to suggest that the word they are looking for is actually ‘librarians’.
The field of library and information science has massive intersections with the functions and resources and philosophies and ideas/theories of both computer science and journalism, as you’ve outlined them in the piece, especially under ‘filtering and recommendation’ – so that might be a fertile field for further investigation and another ‘collaborative venture’.
Yolanda Ma says:

December 20, 2011 at 11:05 am

finally i got to read this post carefully after almost a year after you wrote it!
it was a bit too much for me at that time, but now i feel like it’s a very pleasant journey following the links and readings mentioned in the post – maybe that tells sth about my learning in the past year.
anyway, one year is enough to update the post, isn’t it? 🙂
catch up soon, merry xmas!
yol/hk