Oct 12 2009

The Search Problem vs. The News Problem

I think I’ve found a useful distinction between the “search” and “news” problems. News organizations like to complain that search engines are taking their business, but that’s only because no one has yet built a passable news engine.

Search is when the user asks the computer for a particular type of information, and the computer finds it.

News is when the computer has to figure out, by itself, what information a user wants in each moment.

This definition has useful consequences. For example, it says that accurately modeling the user and their needs is going to be absolutely essential for news, because the news problem doesn’t have a query to go on. All a news selection algorithm can know is what the user has done in the past. For this reason, I don’t believe that online news systems can truly be useful until they take into account everything of ourselves that we’ve put online, including Facebook profiles, emails, and viewing histories.

And yes, I do want my news engine to keep track of cool YouTube uploads and recommend videos to me. This in addition to telling me that Iran has a secret uranium enrichment facility. In the online era, “news” probably just means recently published useful information, of which journalistic reporting is clearly a very small segment.

It’s worth remembering that keyword web search wasn’t all that useful until Google debuted in 1998 with an early version of the now-classic PageRank algorithm.  I suspect that we have not yet seen the equivalent for news. In other words, the first killer news app has yet to be deployed. Because such an app will need to know a great deal about you, it will probably pull in data from Facebook and Gmail, at a minimum. But no one really knows yet how to turn a pile of emails into a filter that selects from the best of the web, blogosphere, Twitter, and mainstream media.

Classic journalism organizations are at a disadvantage in designing modern news apps, because broadcast media taught them bad habits. News organizations still think in terms of editors who select content for the audience. This one-size-fits-all attitude seems ridiculous in the internet era, a relic of the age when it would have been inconceivably expensive to print a different paper for each customer.

Of course, there are some serious potential problems with the logical end-goal of total customization. The loss of a socially shared narrative is one; the Daily Me effect where an individual is never challenged by anything outside of what they already believe is another. But shared narratives seem to emerge in social networks regardless of how we organize them — this is the core meaning of something “going viral.” And I believe the narcissism problem can be addressed through information maps. In fact, maps are so important that we should add another required feature to our hypothetical killer news app: it must in some way present a useful menu of the vast scope of available information. This is another function that existing search products have hardly begun to address.

Not that we have algorithms today that are as good as human editors at putting together a front page. But we will. Netflix’s recent million-dollar award for a 10% improvement in their film recommendation system is a useful reminder of how seriously certain companies are taking the problem of predicting user preferences.

The explosion of blog, Twitter, and Wikipedia consumption demonstrates that classic news editors may not have been so good at giving us what we want, anyway.

Sep 25 2009

Rating Items by Number of Votes: Ur Doin It Rong

Digg, YouTube, Slashdot, and many other sites employ user voting to generate collaborative rankings for their content. This is a great idea, but simply counting votes is a horrible way to do it. Fortunately, the fix is simple.

A basic ranking system allows each user to add a vote to the items they like, then builds a “top rated” list by counting votes. The problem with this scheme is that users can only vote on items they’ve seen, and they are far more likely to see items near the top of the list. In fact, anything off the front page may get essentially no views at all, and therefore has virtually no chance of rising to the top.
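
To make the feedback loop concrete, here is a toy simulation. It is only a sketch under simplifying assumptions of my own: every item is equally likeable, users see nothing but the front page, and the function name, parameters, and numbers are invented for illustration.

```python
import random

def simulate_vote_counting(num_items=20, front_page_size=5, num_users=1000,
                           like_probability=0.3, seed=0):
    """Simulate a 'top rated' list where users only see the front page.

    Every item is equally likeable, so any difference in the final vote
    counts comes purely from visibility, not from quality.
    """
    rng = random.Random(seed)
    votes = [0] * num_items
    for _ in range(num_users):
        # Rank by vote count; ties keep their original order (sorted is stable).
        ranking = sorted(range(num_items), key=lambda i: -votes[i])
        front_page = ranking[:front_page_size]
        for item in front_page:
            # Each visitor votes for a front-page item with a fixed probability.
            if rng.random() < like_probability:
                votes[item] += 1
    return votes

votes = simulate_vote_counting()
print(votes)  # items that start off the front page never collect a vote
```

In this extreme version the items that begin off the front page are never seen at all, so they finish with zero votes even though they are exactly as good as the leaders.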

[Image: Digg]

This is rather serious if the content being rated is serious. It’s fine for Digg to have weird positive-feedback popularity effects, but it’s not fine if we are trying to decide what goes on the front page of a news site. Potentially important stories might never make it to the top simply because they started a little lower in the rankings for whatever reason.

Slightly more sophisticated systems allow users to rate items on a scale, typically 1-5 stars.  This seems better, but still introduces weird biases. Adding up the stars assigned by all users to a single item doesn’t work, because users still have to see an item to vote on it. Averaging all the ratings assigned to a single item doesn’t work either, because it can push something permanently to the bottom of the list, if the first user to view it rates it only one star.

There are lots of subtle hacks that one can make to try to fix the system, but it turns out there might actually be a right way to do things.

If every item were rated by every user, there would be no problem with popularity feedback effects.

That’s completely impractical with thousands or even millions of items. But we can actually get close to the same result with much less work, if we take random samples. Like a telephone poll, the opinion of a small group of randomly selected people will be an accurate indicator, to within a few percent, of the result that we would get if we asked everyone.
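
The "within a few percent" figure can be checked with the usual margin-of-error formula for a random sample. This is a sketch, assuming the question is a simple yes/no proportion, a 95% confidence level (z = 1.96), and the normal approximation; none of those specifics come from the argument above.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a random sample.

    proportion=0.5 is the worst case; z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

for n in (100, 1000, 10000):
    print(n, round(margin_of_error(n), 3))

# With 1000 randomly chosen raters the margin is about 0.031,
# i.e. roughly 3 percentage points, no matter how large the full audience is.
```

The margin shrinks with the square root of the sample size, which is why a small random sample gets so close to the answer we would get by asking everyone.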

In practice, this would mean adding a few select “sampling” stories to each front page served, different every time. Items can then be ranked simply by their average rating, with no skewing due to who got to the front page first. (In fact, basic sampling math will tell us which items have the most uncertain ratings and need to be seen with the highest priority.) In effect, we are distributing the work of rating a huge body of items across a huge body of users: true collaborative filtering, using sampling methods to remove the “can’t see it can’t vote on it” bias.
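
One way this might look in code is sketched below. It assumes ratings are stored as a list of star values per item and uses the standard error of the mean as the measure of uncertainty; the function names and sample data are hypothetical, and a real system would still need to show the chosen items to randomly selected visitors so the sample stays unbiased.

```python
import math
import statistics

def standard_error(ratings):
    """Standard error of the mean rating; infinite with fewer than two ratings."""
    if len(ratings) < 2:
        return math.inf
    return statistics.stdev(ratings) / math.sqrt(len(ratings))

def pick_sampling_items(ratings_by_item, num_slots):
    """Choose the items whose average rating is least certain."""
    by_uncertainty = sorted(
        ratings_by_item,
        key=lambda item: standard_error(ratings_by_item[item]),
        reverse=True,
    )
    return by_uncertainty[:num_slots]

def rank_by_average(ratings_by_item):
    """Rank items that have at least one rating by their mean rating."""
    rated = {item: r for item, r in ratings_by_item.items() if r}
    return sorted(rated, key=lambda item: statistics.mean(rated[item]), reverse=True)

# Hypothetical data: star ratings collected so far for four stories.
ratings = {
    "a": [5, 4, 5, 4, 5, 4, 5, 4],
    "b": [1],
    "c": [3, 5],
    "d": [],
}
print(pick_sampling_items(ratings, 2))  # the two stories with almost no ratings
print(rank_by_average(ratings))         # ordering by average stars
```

Story "b" is the case described earlier: one unlucky rating would bury it under plain averaging, but here its high uncertainty is exactly what earns it a sampling slot.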

This is not an end-all solution to the problem of distributed agenda-setting. User ratings are not necessarily the ideal criterion for measuring “relevance.” One problem is that not every user is going to take the trouble to assign a rating, so you will only be sampling from particularly motivated individuals. Other metrics such as length of time on page might be better — did this person read the whole thing?

Even more fundamentally, it’s not clear that popularity, however defined, is really the right way to set a news agenda in the public interest.

However, any attempt to use user polling for collaborative agenda setting needs to be aware of basic statistical bias issues. Sampling is a simple and very well-developed way to think about such problems.

Sep 23 2009

Mapping the Daily Me

If we deliver to each person only what they say they want to hear, maybe we end up with a society of narrow-minded individualists. It’s exciting to contemplate news sources that (successfully) predict the sorts of headlines that each user will want to read, but in the extreme case we are reduced to a journalism of the Daily Me: each person isolated inside their own little reflective bubble.

The good news is, specialized maps can show us what we are missing. That’s why I think they need to be standard on all information delivery systems.

For the first time in history, it is possible to map with some accuracy the information that free-range consumers choose for themselves. A famous example is the graph of political book sales produced by orgnet.com:

Social network graph of Amazon sales of political books, 2008

Here, two books are connected by a line if consumers tended to buy both. What we see is what we always suspected: a stark polarization. For the most part, each person reads either liberal or conservative books. Each of us lives in one information world but not the other. Despite the Enlightenment ideal of free debate, real-world data shows that we do not seek out contradictory viewpoints.

Which was fine, maybe, when the front page brought them to us. When information distribution was monopolized by a small number of newspapers and broadcasters, we had no choice but to be exposed to stories that we might not have picked for ourselves. Whatever charges one can press against biased editors of the past, most of them felt that they had a duty to diversity.

In the age of disaggregation, maybe the money is in giving people what they want. Unfortunately, there is a real possibility that what we want is to have our existing opinions confirmed. You and I and everyone else are going to be far more likely to click through from a headline that confirms what we already believe than from one which challenges us. “I don’t need to read that,” we’ll say, “it’s clearly just biased crap.” The computers will see this, and any sort of recommendation algorithm will quickly end up as a mirror to our preconceptions.

It’s a positive feedback loop that will first split us along existing ideological cleavages, then finer and finer. In the extreme, each of us will be alone in a world that never presents information to the contrary.

We could try to design our systems to recommend a more diverse range of articles (an idea I explored previously) but the problem is, how? Any sort of agenda-setting system that relies on what our friends like will only amplify polarities, while anything based on global criteria is necessarily normative: it makes judgements on what everyone should be seeing. This gets us right back into all the classic problems of ideology and bias. How do we measure diversity of viewpoint? And even if we could agree on a definition of what a “healthy” range of sources is, no one likes to be told what to read.

I think that maps are the way out. Instead of trying to decide what someone “should” see, just make clear to them what they could see.

An information consumption system — an RSS reader, online newspapers, Facebook — could include a map of the infosphere as a standard feature. There are many ways to draw such a map, but the visual metaphor is well-established: each node is an information item (an article, video, etc.) while the links between items indicate their “similarity” in terms of worldview.

[Image: map of the Iranian blogosphere]

This is less abstract than it seems, and with good visual design these sorts of pictures can be immediately obvious. Popular nodes could be drawn larger; closely related nodes could be clustered. The links themselves could be generated from co-consumption data: when one user views two different items, the link between those items gets slightly stronger. There are other ways of classifying items as related — as belonging to similar worldviews — but co-consumption is probably as good a metric as any, and in fact co-purchasing data is at the core of Amazon’s successful recommendation system.
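
The co-consumption idea can be sketched in a few lines. This assumes viewing histories are available as a list of item ids per user; the function name and the sample data are made up, and a real map would also need node sizes, clustering, and a layout step.

```python
from collections import Counter
from itertools import combinations

def co_consumption_links(views_by_user):
    """Count, for every pair of items, how many users viewed both.

    The returned Counter maps (item_a, item_b) pairs to a link strength.
    """
    links = Counter()
    for items in views_by_user.values():
        # Each pair of distinct items viewed by the same user gets slightly stronger.
        for a, b in combinations(sorted(set(items)), 2):
            links[(a, b)] += 1
    return links

# Hypothetical viewing histories.
views = {
    "user1": ["x", "y", "z"],
    "user2": ["x", "y"],
    "user3": ["z"],
}
links = co_consumption_links(views)
print(links.most_common())
```

Pairs that never appear in the counter have no link at all, which is what lets separate clusters of worldview show up on the map.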

The concepts involved are hardly new, and many maps have been made at the site level where each node is an entire blog, such as the map of the Iranian blogosphere above. However, we have never had a map of individual news items, and never in real-time for everyone to see.

Each map also needs a “you are here” indicator.

This would be nothing more than some way of marking items that the user has personally viewed. Highlight them, center them on the map, and zoom in. But don’t zoom in too much. The whole purpose of the map is to show each of us how small, how narrow and unchallenging our information consumption patterns actually are. We will each discover that we live in a particular city-cluster of information sources, on a particular continent of language, ideology, or culture. A map literally lets you see this at a glance — and you can click on far-away nodes for instant travel to distant worldviews.

Giving people only what they like risks turning journalism into entertainment or narcissism. Forcing people to see things that they are not interested in is a losing strategy, and there isn’t any obvious way to decide what we should see. Showing people a map of the broader world they live in is universally acceptable, and can only encourage curiosity.

Sep 22 2009

Requiem For The Front Page

Oh Front Page, your days are clearly numbered. For generations all eyes were upon you; you set the public agenda, and advertisers loved you best. In the tumult of the world, your voice carried above all others, and we needed you. You told us when the war ended, and when The Beatles came to town.

But you are in your autumn now.

[Image: The Front Page is Dead]

We know that your children killed you, though they did not mean it. In the age of the scribe, it seemed that anyone could own a printing press. But now, Front Page, we talk online about the monopoly you once claimed. Some will pine for newsprint, but paper is just too expensive, too heavy and static.

But this is not about paper. This is about the way you lived your life, your insistence on a space that you and you alone controlled. You tried to move online, Front Page, but your model would not yield and your children ate your lunch. Google News chooses from the best, while Digg lets us choose for ourselves. There will always be reporters — those who assemble the narratives —  but there may not always be editors. Your stubborn insistence on one for all made us question your purpose.

We loved you and you ignored us! Advertisers deserted you first; they were very quick to understand that reader information could be leveraged into relevance. Google itself was built on this model. Meanwhile Amazon and iTunes grasped that efficiencies of delivery had moved the money to the infinite niche. But you admitted none of this, Front Page, and also you did not see that people live in networks, that our friends know what is important to us.

Why would you not give us what we wanted? No one questions your integrity, the standards of journalism you uphold. No one questions that we, the public, need to be told at least as much as we need to be listened to. But suddenly we could talk back, and you weren’t listening. You insisted that we go to you instead of just coming to us. Why did you not use our input to customize the agenda? You could have spawned Facebook applications and iPhone applications and even innovative social RSS readers that determined our interests and automatically delivered ten million personalized headlines! (And their ads.)

You had everything you needed, and this was your unforgivable sin. A hundred years ago you built the Associated Press to feed you, the prototype of distributed journalism. This could have been the beginning, if you had embraced more than the cream of international stories, if you had realized how cheap local reporting could be. Those long tail stories could be vastly cheaper, Front Page, if you embraced more sources, if you fought for transparency instead of access, if you taught citizens to be journalists instead of insisting that they can’t. You could have set the standards and franchised the platform. But instead of finding innovative ways to gather the news and innovative ways to deliver it to us, even now you fight hard to be seen less!

Instead of owning the aggregators and bringing to them the wisdom of an old hand, you scoffed at Digg, at Google, at Memeorandum. Why are there still so many news sites without a panel of  “Share This” links beneath each story? Why are we not allowed to speak to the New York Times with user ratings buttons? Your mannerisms are quaint as hoop-skirts, Front Page.

We know also that your less reputable cousin is only slightly younger, and the world will never listen to Television as their parents did. The internet will devour Broadcast too; in only a few more years bandwidth will be cheap enough for anyone to run their own station. We know that upcoming content analysis algorithms will soon make video search a reality, and we know that the RSS future will soon disaggregate Television News just as it only recently disaggregated you.

Front Page, your children are brash, but they are filled with the energy of youth. They have inherited a world you never foresaw, and they are hopeful in a way you are not. It is their world now. You must guide them, but you must let them have it.

Much as we loved you, your time has passed.

Sep 16 2009

American Press Covers Debate, Not Health Care

Representative Joe Wilson yelled “you lie!” at the president, and the papers loved it. Unfortunately, by a count of more than two to one, the major media articles covering the event did not bother to comment on the substance of the issue that provoked Wilson’s outburst: whether or not illegal immigrants would be provided health care under proposed reforms. There is no health care debate in the mainstream American press. There is only political drama.

The president did not lie. All of the proposed health care reform bills contain language excluding those residing illegally in the US from government-subsidized coverage. This single-sentence fact check was entirely absent from 50 of the 70 articles mentioning “wilson” and “lie” on the New York Times and Washington Post websites as of Monday night. Of the 20 which discussed actual policy, only nine articles mentioned it in the first two paragraphs. (Spreadsheet here.)

Wilson’s outburst will be forgotten long after millions of Americans are insured, or not, under Obama’s plan. It’s just noise and heat. Yet some of the most reputable newspapers in the world have led with it for the last five days. In fact, the press has in some cases actively dodged the underlying issue. Consider this exchange from an online Q&A session with Dana Milbank of the Washington Post:

Cincinnati: Are you saying the President wasn’t lying when he said illegal immigrants won’t be covered? Why not look at the House bill and tell us whether or not it allows illegals to be covered? The Congressional Research service issued a report last week saying there was NOTHING in the House bill that excludes illegals from receiving government-run health care. In other words, be a REPORTER instead of a hack for Barack.

Dana Milbank:  Actually I wasn’t addressing the factual nature of Obama’s speech. The issue wasn’t that Wilson thought the president wasn’t telling the truth; part of the presidential job description calls for expertise in truth shading. The issue was shouting “you lie!” at the president on the House floor during an address to a joint session of Congress.

(For the record, the CRS report in question notes that HR 3200 says “Nothing in this subtitle shall allow Federal payments for affordability credits on behalf of individuals who are not lawfully present in the United States.” Which has, oddly, been spun as meaning that illegals would be subsidized!)

It should be no surprise that there is actually substance to the question of coverage for illegal immigrants. Only nine of the 70 pieces get into it: yes, a few undocumented workers could end up getting subsidized health care. No, it’s not worth taxpayer money to add an enforcement mechanism.

But even this is one level removed, and only one article grappled with the fundamental question: would it really be so bad if the poorest workers in America got a break? In fact we might even owe it to them. On average, migrant labor is thought to be a small net gain to the American economy.

I get that Wilson’s little moment is a great story, right up there with the guy who threw a shoe at Bush (who was imprisoned for his prank, with far less coverage). And I do understand the logic of a populist press as the paper ship sinks. What cannot be excused is the omission of any mention of the substantive content of the debate from the majority of coverage: 50 out of 70 articles said nothing at all about anything that will last.

We are reporting on court theatrics while the citizens starve.
