March | 2011 | Jonathan Stray

It’s impossible to build a computer system that helps people find or filter information without at some point making editorial judgements. That’s because search and collaborative filtering algorithms embody human judgement about what is important to know. I’ve been pointing this out for years, and it seems particularly relevant to the journalism profession today as it grapples with the digital medium. It’s this observation which is the bridge between the front page and the search results page, and it suggests a new generation of digital news products that are far more useful than just online translations of a newspaper.

It’s easy to understand where human judgement enters into information filtering algorithms, if you think about how such things are built. At some point a programmer writes some code for, say, a search engine, and tests it by looking at the output on a variety of different queries. Are the results good? In what way do they fall short of the social goals of the software? How should the code be changed? It’s not possible to write a search engine without a strong concept of what “good” results are, and that is an editorial judgement.

I bring this up now for two reasons. One is an ongoing, active debate over “news applications” — small programs designed with journalistic intent — and their role in journalism. Meanwhile, for several years Google’s public language has been slowly shifting from “our search results are objective” to “our search results represent our opinion.” The transition seems to have been completed a few weeks ago, when Matt Cutts spoke to Wired about Google’s new page ranking algorithm:

In some sense when people come to Google, that’s exactly what they’re asking for — our editorial judgment. They’re expressed via algorithms. When someone comes to Google, the only way to be neutral is either to randomize the links or to do it alphabetically.

There it is, from the mouth of the bot. “Our editorial judgment” is “expressed via algorithms.” Google is saying that they have and employ editorial judgement, and that they write algorithms to embody it. They use algorithms instead of hand-curated lists of links, which was Yahoo’s failed web navigation strategy of the late 1990s, because manual curation doesn’t scale to whole-web sizes and can’t be personalized. Yet hand selection of articles is what human editors do every day in assembling the front page. It is valuable, but can’t fulfill every need.

Informing people takes more than reporting
Like a web search engine, journalism is about getting people the accurate information they need or want. But professional journalism is built upon pre-digital institutions and economic models, and newsrooms are geared around content creation, not getting people information. The distinction is important, and journalism’s lack of attention to information filtering and organization seems like a big omission, an omission that explains why technology companies have become powerful players in news.

I don’t mean to suggest that going out and getting the story (aka “reporting”) isn’t important. Obviously, someone has to provide the original report that then ricochets through the web via social media, links, and endless reblogging. Further, there is evidence that very few people do original reporting. Last year I counted how many news outlets did their own reporting on one big story, and found that only 13 of the 121 stories listed on Google News did not simply copy information found elsewhere. A contemporaneous Pew study of the news ecosystem of Baltimore found that most reporting was still done by print newspapers, with very little contributed by “new media,” though this study has been criticized for a number of potentially serious category problems. I’ve also repeatedly experienced the power that a single original report can have, as when I made a few phone calls to discover that Jurgen Habermas is not on Twitter, or worked with AP colleagues to get the first confirmation from network operators that Egypt had dropped off the internet. Working in a newsroom, obsessively watching the news propagate through the web, I see this every day: it’s amazing how few people actually pump original reports into the ecosystem.

But reporting isn’t everything. It’s not nearly enough. Reporting is just one part of ensuring that important public information is available, findable, and known. This is where journalism can learn something from search engines, because I suspect what we really want is a hybrid of human and algorithmic judgement.

As conceived in the pre-digital era, news is a non-personalized, non-interactive stream of updates about a small number of local or global stories. The first and most obvious departure from this model would be the ability to search within a news product for particular stories of interest. But the search function on most news websites is terrible, and mostly fails at the core task of helping people find the best stories about a topic of interest. If you doubt this, try going to your favorite news site and searching for that good story that you read there last month. Partially this is technical neglect. But at root this problem is about newsroom culture: the primary product is seen to be getting the news out, not helping people find what is there. (Also, professional journalism is really bad at linking between stories, and most news orgs don’t do fine-grained tracking of social sharing of their content. Links and sharing are two of the primary signals that search engines use to determine which articles are the most relevant.)

Story-specific news applications
We are seeing signs of a new kind of hybrid journalism that is as much about software as it is about reporting. It’s still difficult to put names to what is happening, but terms like “news application” are emerging. There has been much recent discussion of the news app, including a session at the National Institute for Computer-Assisted Reporting conference in February, and landmark posts on the topic at Poynter and NiemanLab. Good examples of the genre include ProPublica’s dialysis facility locator, which combines investigative reporting with a search engine built on top of government data, and the Los Angeles Times’s real-time crime map, which plots LAPD data across multiple precincts and automatically detects statistically significant spikes. Both can be thought of as story-specific search engines, optimized for particular editorial purposes.

Yet the news apps of today are just toes in the water. It is no disrespect to all of the talented people currently working in the field to say this, because we are at the beginning of something very big. One common thread in recent discussion of news apps has been a certain disappointment at the slow rate of adoption of the journalist-programmer paradigm throughout the industry. Indeed, with Matt Waite’s layoff from Politifact, despite a Pulitzer Prize for his work, some people are wondering if there’s any future at all in the form. My response is that we haven’t even begun to see the full potential of software combined with journalism. We are under-selling the news app because we are under-imagining it.

I want to apply search engine technology to tell stories. “Story” might not even be the right metaphor, because the experience I envision is interactive and non-linear, adapting to the user’s level of knowledge and interest, worth return visits and handy in varied circumstances. I don’t want a topic page, I want a topic app. Suppose I’m interested in, or have been directed via headline to, the subject of refugees and internal migration. A text story about refugees due to war and other catastrophes is an obvious introduction, especially if it includes maps and other multimedia. And that would typically be the end of the story by today’s conventions. But we can go deeper. The International Organization for Migration maintains detailed statistics on the topic. We could plot that data, make it searchable and linkable. Now we’re at about the level of a good news app today. Let’s go further by making it live, not a visualization of a data set but a visualization of a data feed, an automatically updating information resource that is by definition evergreen. And then let’s pull in all of the good stories concerning migration, whether or not our own newsroom wrote them. (As a consumer, the reporting supply chain is not my problem, and I’ve argued before that news organizations need to do much more content syndication and sharing.) Let’s build a search engine on top of every last scrap of refugee-related content we can find. We could start with classic keyword search techniques, augment them by link analysis weighted toward sources we trust, and ingest and analyze the social streams of whichever communities deal with the issue. Then we can tune the whole system using our editorial-judgment-expressed-as-algorithms to serve up the most accurate and relevant content not only today, but every day in the future. Licensed content we can show within our product, and all else we can simply link to, but the search engine needs to be a complete index.
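To make “editorial judgment expressed as algorithms” concrete, here is a minimal sketch in Python of the kind of ranking described above: a keyword score, a link score weighted by how much we trust the linking source, and a social score, blended with weights that an editor chooses. Everything in it is an assumption made for illustration, including the Doc fields, the source names and trust values, the sample documents, and the particular scoring formulas. A real system would use far better versions of each signal.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    url: str
    source: str                                    # publisher of the document
    text: str
    links_to: list = field(default_factory=list)   # URLs this document links to
    shares: int = 0                                # social sharing count


# Hypothetical editorial trust in each source, from 0 to 1.
SOURCE_TRUST = {"iom.int": 1.0, "example-wire.org": 0.8, "random-blog.net": 0.2}
DEFAULT_TRUST = 0.1


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_scores(query, docs):
    """Crude TF-IDF: term frequency in the document times a rarity weight."""
    counts = {d.url: Counter(tokenize(d.text)) for d in docs}
    scores = {}
    for d in docs:
        total = sum(counts[d.url].values()) or 1
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for c in counts.values() if term in c)
            if df:
                score += (counts[d.url][term] / total) * math.log(1 + len(docs) / df)
        scores[d.url] = score
    return scores


def link_scores(docs):
    """Each inbound link counts in proportion to our trust in the linking source."""
    scores = {d.url: 0.0 for d in docs}
    for d in docs:
        trust = SOURCE_TRUST.get(d.source, DEFAULT_TRUST)
        for target in d.links_to:
            if target in scores and target != d.url:
                scores[target] += trust
    return scores


def normalize(scores):
    """Scale scores so the largest is 1, keeping the signals comparable."""
    top = max(scores.values(), default=0.0)
    return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}


def rank(query, docs, weights):
    """Blend the three signals; the weights are the editorial knobs."""
    kw = normalize(keyword_scores(query, docs))
    ln = normalize(link_scores(docs))
    so = normalize({d.url: math.log1p(d.shares) for d in docs})
    combined = {
        d.url: weights["keyword"] * kw[d.url]
        + weights["links"] * ln[d.url]
        + weights["social"] * so[d.url]
        for d in docs
        if kw[d.url] > 0  # drop documents that never mention the query terms
    }
    return sorted(combined, key=combined.get, reverse=True)


SAMPLE_DOCS = [
    Doc("a", "iom.int",
        "Refugees displaced by conflict: migration statistics for refugees", [], 10),
    Doc("b", "random-blog.net", "refugees refugees refugees migration", ["a"], 500),
    Doc("c", "example-wire.org",
        "Report on internal migration and refugees after the flood", ["a"], 40),
    Doc("d", "example-wire.org", "Football results", [], 1000),
]

keyword_heavy = {"keyword": 0.5, "links": 0.3, "social": 0.2}
link_heavy = {"keyword": 0.3, "links": 0.5, "social": 0.2}
print(rank("refugees migration", SAMPLE_DOCS, keyword_heavy))
print(rank("refugees migration", SAMPLE_DOCS, link_heavy))
```

With the keyword-heavy weights the keyword-stuffed blog post comes out on top. Shifting weight toward trusted links puts the statistics page first. Choosing between those two orderings is an editorial decision, even though it is made by setting three numbers.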

Rather than (always, only) writing stories, we should be trying to solve the problem of comprehensively informing the user on a particular topic. Web search is great, and we certainly need top-level “index everything” systems, but I’m thinking of more narrowly focussed projects. Choose a topic and start with traditional reporting, content creation, in-house explainers and multimedia stories. Then integrate a story-specific search engine that gathers together absolutely everything else that can be gathered on that topic, and applies whatever niche filtering, social curation, visualization, interaction and communication techniques are most appropriate. We can shape the algorithms to suit the subject. To really pull this off, such editorially-driven search engines need to be both live in the sense of automatically incorporating new material from external feeds, and comprehensive in the sense of being an interface to as much information on the topic as possible. Comprehensiveness will keep users coming back to your product and not someone else’s, and the idea of covering 100% of a story is itself powerful.
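The “live” and “comprehensive” requirements can be sketched in a few lines of Python. The sketch assumes feed items arrive in batches from some external fetcher that is not shown. It keeps every on-topic item exactly once, shows licensed items in full, and reduces everything else to a link. The FeedItem and TopicIndex names, the topic terms, and the example.org URLs are hypothetical, and the on-topic test is a crude word match standing in for real filtering.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedItem:
    url: str
    title: str
    body: str
    licensed: bool = False   # True if we hold rights to show the full text


class TopicIndex:
    """A story-specific index that grows as new feed batches arrive."""

    def __init__(self, topic_terms):
        self.topic_terms = {t.lower() for t in topic_terms}
        self.items = {}  # url -> FeedItem, so each URL is stored once

    def is_on_topic(self, item):
        words = set(re.findall(r"[a-z0-9]+", (item.title + " " + item.body).lower()))
        return bool(words & self.topic_terms)

    def ingest(self, batch):
        """Add new on-topic items from a batch; return how many were added."""
        added = 0
        for item in batch:
            if item.url in self.items or not self.is_on_topic(item):
                continue
            self.items[item.url] = item
            added += 1
        return added

    def render(self, url):
        """Show licensed content in full; link out to everything else."""
        item = self.items[url]
        if item.licensed:
            return f"{item.title}\n{item.body}"
        return f"{item.title} -> {item.url}"


index = TopicIndex(["refugees", "migration", "displaced"])
first = [
    FeedItem("https://example.org/1", "Refugees cross border", "Full report.", True),
    FeedItem("https://example.org/2", "Migration data released", "Summary."),
    FeedItem("https://example.org/3", "Football results", "Scores."),
]
second = [
    FeedItem("https://example.org/2", "Migration data released", "Summary."),
    FeedItem("https://example.org/4", "Families displaced by flood", "Details."),
]
print(index.ingest(first), index.ingest(second), len(index.items))
print(index.render("https://example.org/2"))
```

Calling ingest again whenever a feed updates is what keeps the resource evergreen, and the licensed flag is the only place where the reporting supply chain shows up in the product.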

Other people’s content is content too
The brutal economics of online publishing dictate that we meet the needs of our users with as little paid staff time as possible. That drives the production process toward algorithms and outsourced content. This might mean indexing and linking to other people’s work, syndication deals that let a news site run content created by other people, or a blog network that bright people like to contribute to. It’s very hard for the culture of professional journalism to accept the idea that they should leverage other people’s work as far as they possibly can, as cheaply as they can possibly get it, because many journalists and publishers feel burned by aggregation. But aggregation is incredibly useful, while the feelings and job descriptions of newsroom personnel are irrelevant to the consumer. As Sun Microsystems co-founder Bill Joy put it, “no matter who you are, most of the smartest people work for someone else,” and the idea that a single newsroom can produce the world’s best content on every topic is a damaging myth. That’s the fundamental value proposition of aggregation: all of the best stuff in one place. The word “best” represents editorial judgement in the classic sense, still a key part of a news organization’s brand, and that judgement can be embodied in whatever algorithms and social software are designed to do the aggregation. I realize that there are economic issues around getting paid for producing content, but that’s the sort of thing that needs to be solved by better content marketplaces, not lawsuits and walled gardens.

None of this means that reporters shouldn’t produce regular stories on their beats, or that there aren’t plenty of topics which require lots of original reporting and original content. But asking who did the reporting or made the content misses the point. A really good news application/interactive story/editorial search engine should be able to teach us as much as we care to learn about the topic, regardless of the state of our previous knowledge, and no matter who originally created the most relevant material.

What I am suggesting comes down to this: maybe a digital news product isn’t a collection of stories, but a system for learning about the world. For that to happen, news applications are going to need to do a lot of algorithmically-enhanced organization of content originally created by other people. This idea is antithetical to current newsroom culture and the traditional structure of the journalism industry. But it also points the way to more useful digital news products: more integration of outside sources, better search and personalization, and story-specific news applications that embody whatever combination of original content, human curation, and editorial algorithms will best help the user to learn.

[Updated 27 March with more material on social signals in search, Bill Joy’s maxim, and other good bits.]
[Updated 1 April with section titles.]


The editorial search engine