knowledge – Jonathan Stray

What should the digital public sphere do?

Jonathan Stray — Wed, 30 Nov 2011 01:12:46 +0000

Earlier this year, I discovered there wasn’t really a name for the thing I wanted to talk about. I wanted a word or phrase that includes journalism, social media, search engines, libraries, Wikipedia, and parts of academia, the idea of all these things as a system for knowledge and communication. But there is no such word. Nonetheless, this is an essay asking what all this stuff should do together.

What I see here is an ecosystem. There are narrow real-time feeds such as expertly curated Twitter accounts, and big general reference works like Wikipedia. There are armies of reporters working in their niches, but also colonies of computer scientists. There are curators both human and algorithmic. And I have no problem imagining that this ecosystem includes certain kinds of artists and artworks. Let’s say it includes all public acts and systems which come down to one person trying to tell another, “I didn’t just make this up. There’s something here of the world we share.”

I asked people what to call it. Some said “media.” That captures a lot of it, but I’m not really talking about the art or entertainment aspects of media. Also I wanted to include something of where ideas come from, something about discussions, collaborative investigation, and the generation of new knowledge. Other people said “information” but there is much more here than being informed. Information alone doesn’t make us care or act. It is part of, but only part of, what it means to connect to another human being at a distance. Someone else said “the fourth estate” and this is much closer, because it pulls in all the ideas around civic participation and public discourse and speaking truth to power, loads of stuff we generally file under “democracy.” But the fourth estate today means “the press” and what I want to talk about is broader than journalism.

I’m just going to call this the “digital public sphere”, building on Jürgen Habermas’ idea of a place for the discussion of shared concerns, public yet apart from the state. Maybe that’s not a great name — it’s a bit dry for my taste — but perhaps it’s the best that can be done in three words, and it’s already in use as a phrase to refer to many of the sorts of things I want to talk about. “Public sphere” captures something important, something about the societal goals of the system, and “digital” is a modifier that means we have to account for interactivity, networks, and computation. Taking inspiration from Michael Schudson’s essay “Six or seven things that news can do for democracy,” I want to ask what the digital public sphere can do for us. I think I see three broad categories, which are also three goals to keep in mind as we build our institutions and systems.

1. Information. It should be possible for people to find things out, whatever they want to know. Our institutions should help people organize to produce valuable new knowledge. And important information should automatically reach each person at just the right moment.

2. Empathy. The vast majority of people in the world, we will only know through media. We must strive to represent the “other” to each other with compassion and reality. We can’t forget that there are people on the other end of the wire.

3. Collective action. What good is public deliberation if we can’t eventually come to a decision and act? But truly enabling the formation of broad agreement also requires that our information systems support conflict resolution. In this age of complex overlapping communities, this role spans everything from the local to the global.

Each of these is its own rich area, and each of these roles already cuts across many different forms and institutions of media.

Information
I’d like to live in a world where it’s cheap and easy for anyone to satisfy the following desires:

“I want to learn about X.”
“How do we know that about X?”
“What are the most interesting things we don’t know about X?”
“Please keep me informed about X.”
“I think we should know more about X.”
“I know something about X and want to tell others.”

These desires span everything from mundane queries (“what time does the store close?”) to complex questions of fact (“what will be the effects of global climate change?”). And they apply at all scales; I might have a burning desire to know how the city government is going to deal with bike lanes, or I might be curious about the sum total of humanity’s knowledge of breast cancer — everything we know today, plus all the good questions we can’t yet answer. Different institutions exist to address each of these needs in various ways. Numbering the desires above from #1 to #6 in order: libraries have historically served the need to answer specific questions, #1 and #2, but search engines also do this. Journalism strives to keep people abreast of current events, the essence of #4. Academia has focused on how we know and what we don’t yet know, which is #2 and #3.

This list includes two functions related to the production of new knowledge, because it seems to me that the public information ecosystem should support people working together to become collectively smarter. That’s why I’ve included #5, which is something like casting a vote for an unanswered question, and #6, the peer-to-peer ability to provide an answer. These seem like key elements in the democratic production of knowledge, because the resources which can be devoted to investigating answers are limited. There will always be a finite number of people well placed to answer any particular question, whether those people are researchers, reporters, subject matter experts, or simply well-informed. I like to imagine that their collective output is dwarfed by human curiosity. So efficiency matters, and we need to find ways to aggregate the questions of a community, and route each question to the person or people best positioned to find out the answer.

In the context of professional journalism, this amounts to asking what unanswered questions are most pressing to the community served by a newsroom. One could devise systems of asking the audience (like Quora and StackExchange) or analyze search logs (à la Demand Media). That newsrooms don’t frequently do these things is, I think, an artifact of industrial history — and an unfilled niche in the current ecosystem. Search engines know where the gaps between supply and demand lie, but they’re not in the business of researching new answers. Newsrooms can produce the supply, but they don’t have an understanding of the demand. Today, these two sides of the industry do not work together to close this loop. Some symbiotic hybrid of Google and The Associated Press might be an uncannily good system for answering civic questions.

When new information does become available, there’s the issue of timing and routing. This is #4 again, “please keep me informed.” Traditionally, journalism has answered the question “who should know when?” with “everyone everything as fast as possible” but this is ridiculous today. I really don’t want my phone to vibrate for every news article ever written, which is why only “important” stories generate alerts. But taste and specialization dictate different definitions of “important” for each person, and old answers delivered when I need them might be just as valuable as new information delivered hot and fresh. Google is far down this track with its thinking on knowing what I want before I search for it.

Empathy
There is no better way to show one person to another, across a distance, than the human story. These stories about other people may be informative, sure, but maybe their real purpose is to help us feel what it is like to be someone else. This is an old art; one journalist friend credits Homer with the last major innovation in the form.

But we also have to show whole groups to each other, a very “mass media” goal. If I’ve never met a Cambodian or hung out with a union organizer, I only know what I see in the media. How can and should entire communities, groups, cultures, races, interests or nations be represented?

A good journalist, anthropologist, or writer can live with a community for a while, observing and learning, then articulate generalizations. This is important and useful. It’s also wildly subjective. But then, so is empathy. Curation and amplification can also be empathetic processes: someone can direct attention to the genuine voices of a community. This “don’t speak, point” role has been articulated by Ethan Zuckerman and practiced by Andy Carvin.

But these are still at the level of individual stories. Who is representative? If I can only talk to five people, which five people should I know? Maybe a human story, no matter how effective, is just a single sample in the sense of a tiny part standing for the whole. Turning this notion around, making it personal, I come to an ideal: If I am to be seen as part of some group, then I want representations of that group to include me in some way. This is an argument that mass media coverage of a community should try to account for every person in that community. This is absurd in practical terms, but it can serve as a signpost, a core idea, something to aim for.

Fortunately, more inclusive representations are getting easier. Most profoundly, the widespread availability of peer-to-peer communication networks makes it easier than ever for a single member of a community to speak and be heard widely.

We also have data. We can compile the demographics of social movements, or conduct polls to find “public opinion.” We can learn a lot from the numbers that describe a particular population, which is why surveys and censuses persist. But data are terrible at producing the emotional response at the core of empathy. For most people, learning that 23% of the children in some state live in poverty lacks the gut-punch of a story about a child who goes hungry at the end of every month. In fact there is evidence that making someone think analytically about an issue actually makes them less compassionate.

The best reporting might combine human stories with broader data. I am impressed by CNN’s interactive exploration of American casualties in Iraq, which links mass visualization with photographs and stories about each individual. But that piece covers a comparatively small population, only a few thousand people. There are emerging techniques to understand much larger groups, such as by visualizing the data trails of online life, all of the personal information that we leave behind. We can visualize communities, using aggregate information to see the patterns of human association at all scales. I suspect that mass data visualization represents a fundamentally new way of understanding large groups, a way that is perhaps more inclusive than anecdotes yet richer than demographics. Also, visualization forces us into conversations about who exactly is a member of the community in question, because each person is either included in a particular visualization or not. Drawing such a hard boundary is often difficult, but it’s good to talk about the meanings of our labels.

And yet, for all this new technology, empathy remains a deeply human pursuit. Do we really want statistically unbiased samples of a community? My friend Quinn Norton says that journalism should “strive to show us our better selves.” Sometimes, what we need is brutal honesty. At other times, what we need is kindness and inspiration.

Collective action

What a difficult challenge advances in communication have become in recent decades. On the one hand they are definitely bringing us closer to each other, but are they really bringing us together?

– Ryszard Kapuściński, The Other

I am sensitive to the idea of filter bubbles and concerns about the fragmentation of media, the worry that the personalization of information will create a series of insular and homogenous communities, but I cannot abide the implied nostalgia for the broadcast era. I do not see how one-size-fits-all media can ever serve a diverse and specialized society, and so: let a million micro-cultures bloom! But I do see a need for powerful unifying forces within the public sphere, because everything from keeping a park clean to tackling global climate change requires the agreement and cooperation of a community.

We have long had decision making systems at all scales — from the neighborhood to the United Nations — and these mechanisms span a range from very lightweight and informal to global and ritualized. In many cases decision-making is built upon voting, with some majority required to pass, such as 51% or 66%. But is a vicious, hard-fought 51% in a polarized society really the best we can do? And what about all the issues that we will not be voting on — that is to say, most of them?

Unfortunately, getting agreement among even very moderate numbers of people seems phenomenally difficult. People disagree about methods, but in a pluralistic society they often disagree even more strongly about goals. Sometimes presenting all sides with credible information is enough, but strongly held disagreements usually cannot be resolved by shared facts; experimental work shows that, in many circumstances, polarization deepens with more information. This is the painful truth that blows a hole in ideas like “informed public” and “deliberative democracy.”

Something else is needed here. I want to bring the field of conflict resolution into the digital public sphere. As a named pursuit with its own literature and community, this is a young subject, really only begun after World War II. I love the field, but it’s in its infancy; I think it’s safe to say that we really don’t know very much about how to help groups with incompatible values find acceptable common solutions. We know even less about how to do this in an online setting.

But we can say for sure that “moderator” is an important role in the digital public sphere. This is old-school internet culture, dating back to the pre-web Usenet days, and we have evolved very many tools for keeping online discussions well-ordered, from classic comment moderation to collaborative filtering, reputation systems, online polls, and various other tricks. At the edges, moderation turns into conflict resolution, and there are tools for this too. I’m particularly intrigued by visualizations that show where a community agrees or disagrees along multiple axes, because the conceptually similar process of “peace polls” has had some success in real-world conflict situations such as Northern Ireland. I bet we could also learn from the arduously evolved dispute resolution processes of Wikipedia.

It seems to me that the ideal of legitimate community decision making is consensus, 100% agreement. This is very difficult, another unreachable goal, but we could define a scale from 51% agreement to 100%, and say that the goal is “as consensus as possible” decision making, which would also be “as legitimate as possible.” With this sort of metric — and always remembering that the goal is to reach a decision on a collective action, not to make people agree for the sake of it — we could undertake a systematic study of online consensus formation. For any given community, for any given issue, how fragmented is the discourse? Do people with different opinions hang out in different places online? Can we document examples of successful and unsuccessful online consensus formation, as has been done in the offline case? What role do human moderators play, and how can well-designed social software contribute? How do the processes of online agreement and disagreement play out at different scales and under different circumstances? How do we know when the process has converged to a “good” answer, and when it has degraded into hegemony or groupthink? These are mostly unexplored questions. Fortunately, there’s a huge amount of related work to draw on: voting systems and public choice theory, social network analysis, cognitive psychology, information flow and media ecosystems, social software design, issues of identity and culture, language and semiotics, epistemology…

I would like conflict resolution to be an explicit goal of our media platforms and processes, because we cannot afford to be polarized and grid-locked while there are important collective problems to be solved. We may have lost the unifying narrative of the front page, but that narrative was neither comprehensive nor inclusive: it didn’t always address the problems of concern to me, nor did it ask me what I thought. Effective collective action, at all relevant scales, seems a better and more concrete goal than “shared narrative.” It is also an exceptionally hard problem — in some ways it is the problem of democracy itself — but there’s lots to try, and our public sphere must be designed to support this.

Why now?
I began writing this essay because I wanted to say something very simple: all of these things — journalism, search engines, Wikipedia, social media and the lot — have to work together to common ends. There is today no one profession which encompasses the entirety of the public sphere. Journalism used to be the primary bearer of these responsibilities — or perhaps that was a well-meaning illusion sprung from near monopolies on mass information distribution channels. Either way, that era is now approaching two decades gone. Now what we have is an ecosystem, and in true networked fashion there may not ever again be a central authority. From algorithm designers to dedicated curators to, yes, traditional on-the-scene pro journalists, a great many people in different fields now have a part in shaping the digital public sphere. I wanted to try to understand what all of us are working toward. I hope that I have at least articulated goals that we can agree are important.

Visualizing communities

Jonathan Stray — Mon, 01 Aug 2011 21:54:02 +0000

There are in fact no masses; there are only ways of seeing people as masses.
–Raymond Williams

Who are the masses that the “mass media” speaks to? What can it mean to ask what “teachers” or “blacks” or “the people” of a country think? These words are all fiction, a shorthand which covers over our inability to understand large groups of unique individuals. Real people don’t move in homogeneous herds, nor can any one person be neatly assigned to a single category. Someone might view themselves simultaneously as the inhabitant of a town, a new parent, and an active amateur astronomer. Now multiply this by a million, and imagine trying to describe the overlapping patchwork of beliefs and allegiances.

But patterns of association leave digital traces. Blogs link to each other, we have “friends” and “followers” and “circles,” we share interesting tidbits on social networks, we write emails, and we read or buy things. We can visualize this data, and each type of visualization gives us a different answer to the question “what is a community?” This is different from the other ways we know how to describe groups. Anecdotes are tiny slices of life that may or may not be representative of the whole, while statistics are often so general as to obscure important distinctions. Visualizations are unique in being both universal and granular: they have detail at all levels, from the broadest patterns right down to individuals. Large scale visualizations of the commonalities between people are, potentially, a new way to represent and understand the public — that is, ourselves.

I’m going to go through the major types of community visualizations that I’ve seen, and then talk about what I’d like to do with them. Like most powerful technologies, large scale visualization is a capability that can also be used to oppress and to sell. But I imagine social ends, worthwhile ways of using visualization to understand the “public” not as we imagine it, but as something closer to how we really exist.

Social networks
Social networking services seem like an obvious place to go looking for communities, and I’m sure everyone has seen a social network visualization by now; they’re great eye candy. There are a lot of problems with social network visualizations — for example, what does it really mean to say that two people are “connected”? But let’s dive right in and see what we can see.

Here’s a visualization of the connections between my Facebook friends, which I created with the “social graph” Facebook application. Every person I am friends with is included in this visualization. The layout algorithm tries to put people with lots of mutual friends close together; otherwise, the positions are random. Nothing can be learned from the fact that “Amy” is to the left or the right of “Ramone,” but clusters of people are reliable structures.

On this diagram I can see the following clusters: San Francisco personal friends, Hong Kong personal friends, Toronto personal friends, University of Hong Kong classmates, SF circus people, HK circus people, former Adobe colleagues, and a few others. The independent nodes floating around are mostly people I met traveling but never got to know too well, while clusters form when lots of people know their friends’ friends. Clusters are so fundamental to this type of analysis that this Facebook app has tried to identify them by overlaying colored circles. I can see a lot more here than the algorithm can, which is a warning about the limitations of blind, acontextual analysis. Nonetheless, several major aspects of my life and personal history are immediately apparent. When you think about it, this is pretty amazing.

But this is a tiny little world. Rather than centering the visualization on a single person, you can make up some other sort of rule that determines which nodes are included. Here’s part of a lovely visualization of the visualization community on Twitter:

Creator Moritz Stefaner chose who appears on this graph with a simple algorithm: starting with a small list of people he considered central to the visualization community, he included every person who followed or was followed by at least five of those people to produce a larger set. Within this limited network, the size of each node represents the number of followers. Which shows, again, the importance of context. Hans Rosling may not be a big fish in the larger Twitter universe — he’s no Ashton Kutcher — but he’s a superstar in the visualization community.
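The selection rule described above is simple enough to sketch in a few lines of Python. This is only an illustration under assumptions of my own, not Stefaner's actual code: the function names, the `(follower, followed)` pair format, and the toy accounts are all hypothetical, and node size here is just a follower count taken from whatever edge data is supplied. The example lowers the threshold from five to two so it fits a tiny data set.

```python
from collections import Counter


def expand_from_seeds(seeds, follows, min_links=5):
    """Return the seeds plus every account linked to at least `min_links` seeds.

    `follows` is a collection of (follower, followed) pairs. A link in either
    direction between an account and a seed counts once per seed.
    """
    seeds = set(seeds)
    linked_seeds = {}  # account -> set of seeds it follows or is followed by
    for follower, followed in follows:
        if followed in seeds and follower not in seeds:
            linked_seeds.setdefault(follower, set()).add(followed)
        if follower in seeds and followed not in seeds:
            linked_seeds.setdefault(followed, set()).add(follower)
    members = {a for a, s in linked_seeds.items() if len(s) >= min_links}
    return seeds | members


def follower_counts(members, follows):
    """Node size: how many followers each member has in the supplied data."""
    counts = Counter(followed for _, followed in follows if followed in members)
    return {m: counts[m] for m in members}


# Hypothetical toy data: three seed accounts and three candidates.
seeds = {"s1", "s2", "s3"}
follows = {
    ("alice", "s1"), ("alice", "s2"),  # alice follows two seeds
    ("s3", "bob"),                     # bob is followed by only one seed
    ("s1", "carol"), ("carol", "s2"),  # carol is linked to two seeds
}
community = expand_from_seeds(seeds, follows, min_links=2)
sizes = follower_counts(community, follows)
```

With this data, alice and carol join the seeds while bob is left out, which makes the point of the next paragraph concrete: who ends up on the map depends entirely on the seed list and the threshold.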

But is there really one “visualization community”? I’m involved in visualization and know many of the folks on this map, and it looks like a pretty good map to me, but it seems to skew heavily toward the design, art, and infographics world. That’s probably because of the seed accounts chosen, and this chart misses a number of folks coming at visualization from the open government, journalism, scientific, and academic points of view. It also certainly excludes many prominent visualizers who don’t use Twitter. This is a universal problem: a visualization must either include or exclude each node; it’s a binary, black-and-white sort of decision process about a fixed set of nodes drawn from available data, but reality isn’t like that. Real communities are porous and overlapping and span multiple communication networks.

Co-consumption
We can also map “communities” by what they read or view or buy. This was first done by large online merchants, such as Amazon. Their famous “customers who bought this also bought that” feature, and indeed all automated recommendation engines, can be viewed as cluster detection algorithms. In this case, people are clustered by the books they bought or the movies they watched. Your personal recommendations are nothing more than the patterns of the cluster you fall into. Google News’ personalization system represents these clusters explicitly in its core algorithm.

To make this a little more concrete, here’s an analysis of US political book sales on Amazon during the 2008 presidential election, as plotted by Orgnet. Rather than representing people, each node is now a book, and the arrows represent Amazon’s “customers who bought A also bought B” recommendations. The striking finding is that people bought red books or blue books, but not both.

This amazing visualization is political polarization made manifest. There is little overlap in the networks of political books, so the “left” and the “right” emerge as features of reality in this context, which is fascinating. But a word of caution: this chart actually shows three clusters, two of which are assigned to the “left.” What do we make of that? Are there actually three “sides” here? Also, the visualization includes only books deemed “political” from the outset. This looks at the world through a very narrow lens because it ignores all other books — and therefore the rest of the network structure around the books shown here, which is presumably dense and interesting. But how do we decide what is “political?” And what about every other way we could examine the relationships between books and people? Is this kind of polarization apparent and important in broader contexts? We need to be very wary of projecting our preconceptions onto the interpretation of a visualization.

Also note that this map doesn’t depend at all on the “content” of books or blogs or articles — there’s no text processing or semantic analysis here. Amazon infers similarity in an entirely social fashion, based on how groups of people show similar buying behavior. iTunes’ Genius playlists and Netflix’s movie recommendations work the same way — but we can’t see the structures of any of these data sets, because they aren’t visualized.
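To see how similarity can come from buying behavior alone, here is a minimal co-purchase sketch in Python. It is not Amazon's algorithm, which is not described in this essay; the function name `also_bought`, the `min_shared` threshold, and the toy baskets are assumptions made up for illustration. Note that the code never looks at a title's text, only at which customers bought which items.

```python
from collections import Counter
from itertools import combinations


def also_bought(baskets, min_shared=2):
    """Map each item to the items bought with it by at least `min_shared` customers.

    `baskets` maps a customer id to the list of items that customer bought.
    """
    pair_counts = Counter()
    for items in baskets.values():
        # Count each unordered pair of distinct items once per customer.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    related = {}
    for (a, b), n in pair_counts.items():
        if n >= min_shared:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


# Hypothetical purchase histories.
baskets = {
    "c1": ["Red Book 1", "Red Book 2"],
    "c2": ["Red Book 1", "Red Book 2"],
    "c3": ["Blue Book 1", "Blue Book 2"],
    "c4": ["Blue Book 1", "Blue Book 2"],
    "c5": ["Red Book 1", "Blue Book 1"],
}
links = also_bought(baskets)
```

In this toy data a single customer bought across the divide, which is below the threshold, so the red and blue books end up in two separate groups with no link between them.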

Communication networks
There’s often a difference between what people say and what they do. Looking at social network connections is a little like asking someone who their friends are — relevant, but subject to little white lies, perceptual biases, the limitations of memory, and complicated personal judgements. Better, perhaps, to look at the data streams generated by online activity. For example, email.

Email network analysis seems to have come to popularity with the Enron emails released in 2003. The simplest way to visualize a huge pile of emails is to plot each email address as a node and draw edges when one person emailed another. Here’s such an image from Jeffrey Heer’s Exploring Enron project:

There’s more going on in this picture, such as some color coding via the modularity algorithm, which claims to be about “detecting community structure” but is actually about detecting clusters. But no matter how you visualize it, there’s something interesting here. Analysis of email networks within an organization can reveal organizational structure that varies significantly from formal hierarchies, and there’s at least one book which claims that this informal structure is how things actually get done.

The main email analysis techniques are all based around counting the number of emails exchanged by each pair of people. This is a powerful idea, even if it’s not necessarily a very clear one. We don’t really know how to interpret facts such as “Joanna emails Hugo more than anyone else.” Are they colleagues, or lovers, or does Hugo work in tech support? But again, in almost every visualization of this type we get clusters, more or less tight groups of people who talk or act more with each other than they do with others. There is at least one research technique which attempts to detect conspiracies based partially on this type of network structure. There has also been some interesting analysis of the dynamic structure of the network — how people’s communication patterns changed as the crisis deepened. I like that, because time is so often overlooked in network analysis. Ideally, every network visualization would include a time slider that allows the user to scrub back and forth to see how things evolved.
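The pair-counting idea at the heart of these techniques can be sketched as follows. This is an illustrative Python example, not code from any of the projects mentioned: the `(sender, recipient)` record format and the function names are assumptions, and direction is deliberately ignored so that each pair of people gets a single count.

```python
from collections import Counter


def pair_counts(messages):
    """Count emails per unordered pair of addresses, ignoring direction."""
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:  # skip notes to self
            counts[frozenset((sender, recipient))] += 1
    return counts


def top_contact(counts, person):
    """Return (other, n) for the address `person` exchanges the most email with.

    Returns None if `person` never appears. Ties go to the first pair seen.
    """
    best = None
    for pair, n in counts.items():
        if person in pair:
            (other,) = pair - {person}
            if best is None or n > best[1]:
                best = (other, n)
    return best


# Hypothetical message log as (sender, recipient) pairs.
messages = [
    ("joanna", "hugo"), ("hugo", "joanna"), ("joanna", "hugo"),
    ("joanna", "ana"), ("ana", "hugo"),
]
counts = pair_counts(messages)
favorite = top_contact(counts, "joanna")
```

The counts become edge weights in a network drawing. What they cannot tell you is why Joanna and Hugo write to each other so much, which is exactly the interpretive gap described above. Adding a timestamp to each record and counting per time window would be one way to support the time slider.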

Web structure
What if we take “website” instead of “person” as the atomic component of a community? The first maps of the web were made in the late 1990s by spidering the links between pages. My favorite modern example is the 2008 map of the Persian-language blogosphere by John Kelly and Bruce Etling of the Berkman Center. Every node is a blog. The size represents the number of other blogs that link to it. The color shows the subject of the blog, as categorized by a Persian-speaking researcher. Again, the visualization algorithm places blogs that frequently link to each other closer together. And like people, blogs tend to form clusters.

In this map, humans chose the color for each dot — each blog — by manually reading the blog and coding the topic. The researchers didn’t know that the blogs on similar topics would be in the same cluster when they were reading them, and the computer didn’t know the assigned topics when clustering them. In other words, there is an amazing discovery here: an algorithm that can tell that two blogs have a different perspective — say, secular vs. religious politics, or perhaps poetry instead — just by looking at where these two blogs sit in the web of links. Link structure is here a proxy for worldview. It may also be a proxy for information flow, which must be closely related.
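Two of the quantities in this map are easy to compute from a list of links, and a short Python sketch may make them concrete. This is not the researchers' method: the function names, the link format, and the toy blogs are assumptions for illustration. The first function gives the node size described above; the second is a rough check of my own on how often a link joins two blogs that a human coder gave the same topic.

```python
from collections import Counter


def inlink_counts(links):
    """Node size: number of distinct other blogs linking to each blog."""
    distinct = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in distinct)


def same_topic_share(links, topic):
    """Fraction of links between labeled blogs that join two same-topic blogs."""
    labeled = [(s, d) for s, d in links if s != d and s in topic and d in topic]
    if not labeled:
        return 0.0
    same = sum(1 for s, d in labeled if topic[s] == topic[d])
    return same / len(labeled)


# Hypothetical blogs: (source, destination) links and hand-coded topics.
links = [
    ("a", "b"), ("b", "a"), ("c", "a"),
    ("d", "e"), ("e", "d"), ("c", "d"),
]
topic = {"a": "secular", "b": "secular", "c": "secular",
         "d": "poetry", "e": "poetry"}
sizes = inlink_counts(links)
share = same_topic_share(links, topic)
```

Here five of the six links stay within a topic. A high share is only suggestive on its own, since it needs to be compared with what random linking among the same blogs would produce, but it is the kind of number that would back up the claim that link structure tracks worldview.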

It would also be possible to visualize the web in terms of language. I imagine that this would reveal a geography of continent-clusters separated not by oceans but by language, so that Spain and Mexico would be neighbors, somewhat apart from the United States. As far as I know, no one has done this yet. It might tell us something about how information flows between cultures, or reveal useful bridge-bloggers.

Location-based community
By this I don’t mean where you live, though that’s part of it. Rather, I mean what can be inferred by analyzing people’s real-time location history. There are many sources for this information: check-in apps like FourSquare, tracking services like Google Latitude, geo-Tweets, or just the location recorded by mobile phone companies and individual phones. Suppose you had millions of these person-at-location-at-time data points. Could you segment users into different groups based on, say, the bars they hang out in? There’s money betting that the answer is yes, because Sense Networks is aiming to do this commercially. For more, see this patent. In 2008 they released the CitySense app showing, collectively, where everyone is within a city:

But this little phone app is just a demonstration, a toy. The point of this work isn’t to say where people are, but how their patterns of movement relate over time. This is another type of clustering, of understanding who people are and how they are the same or different. I bet you could locate the members of, say, an underground party community by looking for a cluster of people who frequently gathered together in supposedly abandoned warehouses in industrial areas. Sense Networks CTO Tony Jebara has written about visualizing these path clusters directly, but I haven’t been able to find any examples.
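As a rough illustration of clustering by movement rather than by position, the Python sketch below counts how often two people turn up at the same place in the same time window. It is not Sense Networks' method, which is not described here; the `(person, place, timestamp)` record format, the one-hour bucket, and the function names are all assumptions made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def colocation_counts(points, bucket_seconds=3600):
    """Count, per pair of people, how many (place, time bucket) slots they share.

    `points` is an iterable of (person, place, timestamp_in_seconds) records.
    """
    present = defaultdict(set)  # (place, bucket) -> people seen there
    for person, place, timestamp in points:
        present[(place, timestamp // bucket_seconds)].add(person)
    pairs = Counter()
    for people in present.values():
        for a, b in combinations(sorted(people), 2):
            pairs[(a, b)] += 1
    return pairs


def frequent_companions(pairs, min_meetings=2):
    """Keep only the pairs seen together at least `min_meetings` times."""
    return {pair for pair, n in pairs.items() if n >= min_meetings}


# Hypothetical check-in data.
points = [
    ("ana", "warehouse", 0), ("ben", "warehouse", 600), ("cy", "cafe", 900),
    ("ana", "warehouse", 90000), ("ben", "warehouse", 90500),
    ("cy", "warehouse", 200000),
]
pairs = colocation_counts(points)
regulars = frequent_companions(pairs)
```

Only ana and ben are at the warehouse together on two separate occasions, so they form the one pair that survives. Fixed time buckets are a crude choice, since two people a minute apart can land on either side of a boundary, but the sketch shows how repeated co-presence, not any single location, is what defines the group.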

What’s a community?
I believe that I am part of many communities, and that each of these communities has enriched my life tremendously. But there’s no simple definition. Very often when someone says “community” what they mean is “geographic community,” the set of people who live in the same town or polity. I’d like to include this definition, because face-to-face contact is very important, and many of the most pressing collective action problems are local. But community must now mean something more. Consider Clay Shirky’s anecdote about the Boston Globe’s coverage of sexual abuse within the Catholic Church, in 1992 versus 2002:

In April of 2002 … this Spotlight story was most largest, most global thing that ever came out of Boston.com, the Boston Globe website, and the circulation for that one story was larger than the nominal circulation of The Boston Globe. Because [when] the stories in the ’90s had come out, the audience of the story was Bostonians, whether Catholic or no. But in 2002, the audience was Catholics whether Bostonian or no.

The only definition of “community” that makes any sense to me is “a group of people who think or act collectively.” This is the central theme of these visualizations. People don’t act truly independently, randomly spreading themselves out across geography and belief and behavior. Our lives are clustered along many disparate dimensions, which is just another way of saying that humans are social creatures. There must be as many different ways to visualize communities as there are types of human action. Each is an answer to “what is a community?” How these different answers relate, and how they relate to our intuitive, experiential understanding of face-to-face communities, I don’t think anyone really knows. Many people are trying to understand this right now, from industry to academia, and no doubt intelligence and law enforcement.

Note that many of these types of visualizations can group people who are not in contact with one another. This is particularly true of co-consumption and co-location visualizations. Maybe everyone has read the same books, or they hang out at the same coffee shop but have never met. There is some similarity between people that the algorithm reveals, a pattern we can see, but these people don’t necessarily know that they’re similar. We might call this a “latent community,” a group of like-minded folks who might act together if they were to come into communication — and the internet is great at allowing people to self-organize and define a common identity.

What do we do with this?
The list of people working on identifying communities through data is long. Finance, intelligence, law enforcement, politics, and especially marketing are already hard at work in these areas. Marketing is starting to turn away from classic measures like age, location, and gender because they are not terribly good predictors of purchasing, and there are already social-network based influence predictors. Advertising based on personal data is powerful, but imagine advertising based on an analysis of where you fit into the broader fabric of society. I imagine scarily good predictions of what your friends will find cool. Or your colleagues, or your family.

But I’m more concerned with the public-interest applications, which I see all over the social sciences: in journalism, sociology, conflict resolution, representative governance, urban planning, epidemiology. This is especially true when the clusters in these visualizations are good proxies for belief or worldview, as they seemed to be in the Iran blogosphere map. Knowing who believes what seems a critical building block for collective action of all types.

Being a journalist I’ve thought about journalism most, and I’d like to use community visualization to target journalism to actual people. If I had a live map of the web broken down by interests in some way, there would be all sorts of ways I could focus my reporting. I could look at the map to see where the people affected by the story congregate online, and find sources there. When I had something to publish, I would know where to post the link. And I could discover who I’m missing, who I’m not thinking of and not serving in my reporting, and challenge the categories by which I group people. I tend to think of journalism in terms of empowerment; it’s a service we perform for members of a community. Community used to mean “town” but the definition is and must be more complex now. I want to get closer to the audience and farther from “mass media.” Media became mass during the broadcast era because of technology and economics, not because it was the right way to do journalism.

In the most general sense, I am concerned with community visualization because I am concerned with representation. That is why I want these maps of the masses to be available to all. It is vital to represent the public to itself, and mapping how people are already acting together, out there in the world, seems like a critical feature for anyone who wants to participate broadly in society. It is especially critical because we can expect that various interests will expend huge sums pursuing this mapping for their own ends; in these maps there is the ability to influence, and to divide or unite, and I don’t think we want that entirely in a few powerful hands. But there is also the ability to understand who we, collectively, are. It’s easy to toss around labels like “left” and “right” or “Hispanic” or “drug abuser” but who are these people, actually, and what other identities do they have? And who are we not thinking of at all?

A computational journalism reading list

Jonathan Stray — Tue, 01 Feb 2011 02:29:28 +0000

[Last updated: 18 April 2011 — added statistical NLP book link]

There is something extraordinarily rich in the intersection of computer science and journalism. It feels like there’s a nascent field in the making, tied to the rise of the internet. The last few years have seen calls for a new class of “programmer journalist” and the birth of a community of hacks and hackers. Meanwhile, several schools are now offering joint degrees. But we’ll need more than competent programmers in newsrooms. What are the key problems of computational journalism? What other fields can we draw upon for ideas and theory? For that matter, what is it?

I’d like to propose a working definition of computational journalism as the application of computer science to the problems of public information, knowledge, and belief, by practitioners who see their mission as outside of both commerce and government. This includes the journalistic mainstay of “reporting” — because information not published is information not known — but my definition is intentionally much broader than that. To succeed, this young discipline will need to draw heavily from social science, computer science, public communications, cognitive psychology and other fields, as well as the traditional values and practices of the journalism profession.

“Computational journalism” has no textbooks yet. In fact the term is barely recognized. The phrase seems to have emerged at Georgia Tech in 2006 or 2007. Nonetheless I feel like there are already important topics and key references.

Data journalism
Data journalism is obtaining, reporting on, curating and publishing data in the public interest. The practice is often more about spreadsheets than algorithms, so I’ll suggest that not all data journalism is “computational,” in the same way that a novel written on a word processor isn’t “computational.” But data journalism is interesting and important and dovetails with computational journalism in many ways.

The Nieman Journalism Lab’s interview with Guardian Data Blog editor Simon Rogers remains a solid introduction to (one kind of) contemporary practice.
The best practical guides I know are Rogers’ “How to: get to grips with data journalism” and Dan Nguyen’s series of data-scraping tutorials at ProPublica.
Stanford’s Journalism in the Age of Data is an hour-long documentary on data journalism and visualization.
The web is a linked system of human-readable documents. Now Tim Berners-Lee wants to create a web of machine-readable linked data. The full potential is unclear, but it’s a big idea that may come to be the backbone of semantic web visions. The New York Times, The Guardian, and others are experimenting with open data APIs.
Everyblock creator Adrian Holovaty seems to have been the first to suggest that reporters file structured data in his 2006 “A Fundamental Way Newspaper Websites Need to Change.” This idea is beautifully expanded in Stijn Debrouwere’s “Information Architecture for News Websites” series.

Visualization
Big data requires powerful exploration and storytelling tools, and increasingly that means visualization. But there’s good visualization and bad visualization, and the field has advanced tremendously since Tufte wrote The Visual Display of Quantitative Information. There is lots of good science that is too little known, and many open problems here.

Tamara Munzner’s chapter on visualization is the essential primer. She puts visualization on rigorous perceptual footing, and discusses all the major categories of practice. Absolutely required reading for anyone who works with pictures of data.
Ben Fry invented the Processing language and wrote his PhD thesis on “computational information design,” which is his powerful conception of the iterative, interactive practice of designing useful visualizations.
How do we make visualization statistically rigorous? How do we know we’re not just fooling ourselves when we see patterns in the pixels? This amazing paper by Wickham et al. has some answers.
Is a visualization a story? Segel and Heer explore this question in “Narrative Visualization: Telling Stories with Data.”

Computational linguistics
Data is more than numbers. Given that the web is designed to be read by humans, it makes heavy use of human language. And then there are all the world’s books, and the archival recordings of millions of speeches and interviews. Computers are slowly getting better at dealing with language.

Word frequency techniques like tf-idf and the vector space document model are very simple and very useful. See also stemming. Lots more in the wonderful (and free!) Introduction to Information Retrieval. This book explains how search engines are built, and discusses tf-idf etc. in great technical detail.
Statistical language models are increasingly important for all kinds of applications. Michael Nielsen has a great introduction to statistical machine translation. Google’s Peter Norvig discusses how he implemented statistical spelling correction on his laptop during a long plane flight. For the full deal, see the book Foundations of Statistical Natural Language Processing.
On a related note, Google N-gram viewer lets you look at the frequency of short phrases within 4% of all books published, ever. The excellent paper gives examples of how to use this for cultural research. Dan Cohen has important criticisms.
Speech-to-text algorithms enable automated transcription, and Matt Thompson explores the huge implications for journalism.
Reuters maintains the OpenCalais entity extraction service, which parses text to contextually determine who and what is referenced.
IBM’s Watson project built a question-answering system that reads reference books and wins at Jeopardy. Imagine how useful to journalists and curious readers this could be! This paper on the DeepQA system describes how they did it.
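Since tf-idf and the vector space model come up so often, here is a minimal sketch of both. The three sample “documents” are invented, and the weighting shown (raw term count times log(N/df)) is only one of several common variants. Introduction to Information Retrieval covers the alternatives.

```python
import math
from collections import Counter

# Invented sample corpus, for illustration only.
docs = {
    "budget": "the budget office released the deficit estimate",
    "climate": "the climate report says the climate is warming",
    "health": "the health law will reduce the deficit",
}


def tokenize(text):
    """Split on whitespace; a real system would also stem and strip punctuation."""
    return text.lower().split()


def tf_idf(corpus):
    """Return {doc_id: {term: weight}} with weight = term count * log(N / df)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each term once per document
    weights = {}
    for doc_id, words in tokens.items():
        term_freq = Counter(words)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


weights = tf_idf(docs)
print(cosine(weights["budget"], weights["health"]))
print(cosine(weights["budget"], weights["climate"]))
```

A word like “the” that appears in every document gets weight zero, so two documents only look similar when they share a distinctive term, such as “deficit” here.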

Communications technology and free speech
Code is law. Because our communications systems use software, the underlying mathematics of communication lead to staggering political consequences — including whether or not it is possible for governments to verify online identity or remove things from the internet. The key topics here are networks, cryptography, and information theory.

The Handbook of Applied Cryptography is a classic, and free online. But despite the title it doesn’t really explain how crypto is used in the real world, like Wikipedia does.
It’s important to know how the internet routes information, using TCP/IP and BGP, or at a somewhat higher level, things like the BitTorrent protocol. The technical details determine how hard it is to do things like block websites, suppress the dissemination of a file, or remove entire countries from the internet.
Anonymity is deeply important to online free speech, and very hard. The Tor project is the outstanding leader in anonymity-related research.
Information theory is stunningly useful across almost every technical discipline. Pierce’s short textbook is the classic introduction, while Tom Schneider’s Information Theory Primer seems to be the best free online reference.

Tracking the spread of information (and misinformation)
What do we know about how information spreads through society? Very little. But one nice side effect of our increasingly digital public sphere is the ability to track such things, at least in principle.

Memetracker was (AFAIK) the first credible demonstration of whole-web information tracking, following quoted soundbites through blogs and mainstream news sites and everything in between. Zach Seward has cogent reflections on their findings.
The Truthy Project aims for automated detection of astro-turfing on Twitter. They specialize in covert political messaging, or as I like to call it, computational propaganda.
We badly need tools to help us determine the source of any given online “fact.” There are many existing techniques that could be applied to the problem, as I discussed in a previous post.
If we had information provenance tools that worked across a spectrum of media outlets and feed types (web, social media, etc.) it would be much cheaper to do the sort of information ecosystem studies that Pew and others occasionally undertake. This would lead to a much better understanding of who does original reporting.

Filtering and recommendation
With vastly more information than ever before available to us, attention becomes the scarcest resource. Algorithms are an essential tool in filtering the flood of information that reaches each person. (Social media networks also act as filters.)

The paper on preference networks by Truyen et al. is probably as good an introduction as anything to the state of the art in recommendation engines, those algorithms that tell you what articles you might like to read or what movies you might like to watch.
Before Google News there was Columbia News Blaster, which incorporated a number of interesting algorithms such as multi-lingual article clustering, automatic summarization, and more as described in this paper by McKeown et al.
Anyone playing with clustering algorithms needs to have a deep appreciation of the ugly duckling theorem, which says that there is no categorization without preconceptions. King and Grimmer explore this with their technique for visualizing the space of clusterings.
Any digital journalism product which involves the audience to any degree — that should be all digital journalism products — is a piece of social software, well defined by Clay Shirky in his classic essay, “A Group Is Its Own Worst Enemy.” It’s also a “collective knowledge system” as articulated by Chris Dixon.

Measuring public knowledge
If journalism is about “informing the public” then we must consider what happens to stories after publication — this is the “last mile” problem in journalism. There is almost none of this happening in professional journalism today, aside from basic traffic analytics. The key question here is, how does journalism change ideas and action? Can we apply computers to help answer this question empirically?

World Public Opinion’s recent survey of misinformation among American voters solves this problem in the classic way, by doing a randomly sampled opinion poll. I discuss their bleak results here.
Blogosphere maps and other kinds of visualizations can help us understand the public information ecosystem, such as this interactive visualization of Iranian blogs. I have previously suggested using such maps as a navigation tool that might broaden our information horizons.
UN Global Pulse is a serious attempt to create a real-time global monitoring system to detect humanitarian threats in crisis situations. They plan to do this by mining the “data exhaust” of entire societies — social media postings, online records, news reports, and whatever else they can get their hands on. Sounds like key technology for journalism.
Vox Civitas is an ambitious social media mining tool designed for journalists. Computational linguistics, visualization, and more.

Research agenda
I know of only one work which proposes a research agenda for computational journalism.

“Computational Journalism: A Call to Arms for Database Researchers” by Sarah Cohen et al. raises the very intriguing possibility of building systems that automatically or semi-automatically scan databases for stories, document the rationale for believing certain facts, etc.

This paper presents a broad vision and is really a must-read. However, it deals almost exclusively with reporting, that is, finding new knowledge and making it public. I’d like to suggest that the following unsolved problems are also important:

Tracing the source of any particular “fact” found online, and generally tracking the spread and mutation of information.
Cheap metrics for the state of the public information ecosystem. How accurate is the web? How accurate is a particular source?
Techniques for mapping public knowledge. What is it that people actually know and believe? How polarized is a population? What is under-reported? What is well reported but poorly appreciated?
Information routing and timing: how can we route each story to the set of people who might be most concerned about it, or best in a position to act, at the moment when it will be most relevant to them?

This sort of attention to the health of the public information ecosystem as a whole, beyond just the traditional surfacing of new stories, seems essential to the project of making journalism work.

By the numbers, American journalism failed to inform voters

Jonathan Stray — Thu, 30 Dec 2010 00:40:01 +0000

A recent study by WorldPublicOpinion.org shows that the majority of the American population believed false things about basic national issues, right before the 2010 mid-term elections. I don’t know how to interpret this as anything other than a catastrophic failure of American journalism, in its most fundamental, clichéd, “inform the public” role.

The most damning section of the report (PDF) is titled “Evidence of Misinformation Among Voters.”

The poll found strong evidence that voters were substantially misinformed on many of the issues prominent in the election campaign, including the stimulus legislation, the healthcare reform law, TARP, the state of the economy, climate change, campaign contributions by the US Chamber of Commerce and President Obama’s birthplace. In particular, voters had perceptions about the expert opinion of economists and other scientists that were quite different from actual expert opinion.

This study also found that Fox viewers were significantly more misinformed than average on many issues, which is mostly how this survey was covered in the blogosphere and mainstream news outlets. I think this Fox thing is a terrible diversion from the core problem: the American press did not succeed in informing the public. Not even right before an election, not even on the narrow set of issues that, by survey, voters cared to base their votes on.

The travesty here is that the relevant facts were instantly available from primary sources, such as the Congressional Budget Office and the Intergovernmental Panel on Climate Change. I interpret this failure in the following way: for many kinds of issues, the web makes it easy to find true information. But it doesn’t solve the problem of making people go look. That, perhaps, is a key role for modern journalism. Unfortunately, modern American journalism seems to be very bad at it. I imagine the same problem exists in the journalism of many other countries.

What the study actually says
The study compares what voters think experts believe with what those experts actually believe. This is a bit tricky, and the study isn’t saying that the experts are necessarily right, but we’ll get to that. First, some example findings:

68% of voters thought that “most economists” believe that the stimulus package “saved or created a few jobs” and 20% thought most economists believe that the stimulus caused job losses, whereas only 8% correctly said that most economists think it “saved or created several million jobs.” (The Congressional Budget Office estimates that the stimulus saved several million jobs, as do 75% of economists interviewed by the Wall Street Journal.)
53% of voters thought that economists believe that Obama’s health care reform plan will increase the deficit, while 29% said that economists were evenly divided on this issue. Only 13% said correctly that a majority of economists think that health care reform will not increase the deficit. (The Congressional Budget Office estimates a net reduction in deficits of $143 billion over 2010-2019, and the Boards of Trustees of the Medicare Fund also believe that the Affordable Care Act will “postpone the exhaustion of … trust fund assets.”)
12% of voters thought that “most scientists believe” that climate change is not occurring, while 33% thought scientists were evenly divided on the issue. That’s 45% with an incorrect perception, as opposed to the 54% who said, correctly, that most scientists think climate change is occurring. (Aside from the IPCC reports and virtually every governmental study of the issue worldwide, an April 2010 survey of climate scientists showed that 97% believe that human-caused climate change is occurring.)

A fussy but necessary digression: all of this rests on the reliability of the WorldPublicOpinion.org survey results. The survey was conducted by Knowledge Networks, Inc. using an online response panel randomly selected from the US population. Those without internet access were apparently provided it for free. I have been unable to find any serious independent evaluation of Knowledge Networks’ methodology, but their many research papers on sample design certainly talk the talk. All of the basic sampling errors, such as self-selection and language bias (what about Hispanics?) are at least addressed on paper. The margin of error is reported as 3.9%.

So let’s take these survey results as accurate, for the moment. This means that the majority of the American public had an incorrect conception of expert opinion on the issues that they voted on. That’s a mouthful. It’s not the same as “believed false things,” and in fact asking “what do you think experts believe” deliberately dodges the tricky question of what is true. If there is some misperception of expert belief, then in the strictest terms the public is misinformed. The study addresses this point as follows:

In most cases we inquired about respondents’ views of expert opinion, as well as the respondents’ own views. While one may argue that a respondent who had a belief that is at odds with expert opinion is misinformed, in designing this study we took the position that some respondents may have had correct information about prevailing expert opinion but nonetheless came to a contrary conclusion, and thus should not be regarded as ‘misinformed.’

So this study does not say “the American public are wrong about the economy and climate change.” It says that they haven’t really looked into it. I’m all for questioning authority’s claim to truth — anyone who follows my work knows that I’m generally a fan of Wikipedia, for example — but I believe we must take lifelong study and rigorous methodology seriously. To put it another way: voting contrary to the opinions of economists may be a fine thing, but voting without any awareness of their work is just silly. Yet that seems to be exactly what happened in the last election.

The role of the press, then and now
Of course, voting is hard and stuff is complex, which is why we rely on the media to break it all down for us. The sad part is that economics and climate change are familiar ground for journalists. It’s not like the facts of these issues were not published in mainstream news outlets. For that matter, journalists were not even necessary here. Any citizen with a web browser could have found out exactly what the Affordable Care Act was predicted to do to the deficit. The Congressional Budget Office published their report and then blogged about it in plain language.

Maybe publishing the truth was never enough. Maybe journalism never actually “informed the public,” but merely created conditions where the curious could get themselves informed by diligently reading the news. But on big issues like whether a piece of national legislation will affect the deficit, we no longer need professionals to enable this kind of self-motivated discovery. The sources go direct in such cases, as the Congressional Budget Office did. And do we really expect that the social media sphere — that’s all of us — will remain silent about the next big global warming study? We’re all going to use Facebook etc. to share links to the next IPCC report when it comes out.

If the problem of having access to true information about these sorts of “votable issues” is solved by the web, what isn’t solved by the web is getting every voter to go look at least once. That might be a job for informed professionals at the helm of big media channels. This is a big responsibility for a news organization to try to take, but I don’t see how it’s anything but the corollary to the responsibility to only publish true information. Presumably some of that information is important enough to know, so consumers would probably appreciate the idea that your mission is to ensure they are informed.

I suspect that paper-based habits are holding journalism back here. There is a deeply ingrained newsroom emphasis on reporting only what’s “new.” A budget report only gets to be news once, even if what it says is relevant for years. But there are no “editions” online; the same headline can float on the hot topics list for as long as it’s relevant. There is even more reason to keep directing attention to an issue if people are actively discussing it, if it is greatly polarized, or if there’s a lot of spin around it (see: the rise of fact-check journalism). In any case, journalists have long been good at keeping an issue in the news, by advancing the story daily in one way or another. But first they have to know what the public doesn’t know.

So the burning question that the World Public Opinion study leaves me with is just this: why wasn’t it a news organization that commissioned this survey?

See also: Does journalism work?

The world cannot be represented in machine-readable form

Jonathan Stray — Thu, 15 Apr 2010 08:54:41 +0000

UPDATE: Debrouwere continues the conversation with a response to the key points here, in the comments to his original post.

Dutch journalist/coder Stijn Debrouwere has written a very thorough post describing the ways in which standard tags, like the ones on this blog or on Flickr, fall short when applied to news articles. There are lots of things we might like to know about a story, such as where and when it happened and who was involved. This additional information, sort of like the index to a book, is known as “metadata”, and there is within the online journalism community a great call for its use, including by Debrouwere:

Each story could function as part of a web of knowledge around a certain topic, but it doesn’t.

So here’s a well-intentioned idea you’ve heard before: journalists should start tagging. Jay Rosen insists that “Getting disciplined and strategic about tagging may be one way professional journalism separates itself from the flood of cheap content online.” Tags can show how a news article relates to broader themes and topics. Just the ticket.

News metadata is a major topic, and many people have speculated deeply about the value of creating news metadata at the time of reporting, such as the ever-sarcastic Xark and the thoughtful Martin Belam who writes about why “linked data” is good for journalism. But I’m going to respond to Debrouwere because I read him today, because he has lovely diagrams that explain his good ideas, and because, in criticizing “tags” as a form of metadata, I think he misses some very important points.

And he’s not alone. My sense is that many of the coder-journalists of today have not learned from the mistakes of generations of technically-minded people who wished to talk about the world in more precise ways.

Moving forward from simple tagging, Debrouwere imagines more sophisticated annotation schemes that start to pick up on what the tags actually mean. For starters, the tags could be drawn from separate “vocabularies.” Does a tag refer to a person, or a place, or perhaps an event? Debrouwere uses the following picture, which I’m going to borrow here because it explains the idea so nicely:

But, he says, we can get even more sophisticated. What did the story actually say? If it mentioned a person, what did it say about them? Was it an interview? A profile? Did it criticize them? Here’s the diagram he draws:

He imagines using this information to perform chains of inferences, like so:

Barack Obama belongs to the Democratic Party and he’s from Chicago. If we tag an article with Barack Obama, it’s likely that the article also has something to do with the Democratic Party. If we’ve specified that the article is about Obama, and we’ve specified that Obama is part of the DP, the system now has all the necessary information to suggest our article about Obama as a possibly interesting related read on the topical page for the democratic party, even if we didn’t explicitly indicate that link.

First of all, note that this sort of thing is already possible, quite often, using tags as they exist today. Simple analysis of co-tagging information will tell us that Obama is related to the Democratic party, because many articles will be tagged with both. Which is not to say that encoding such relationships explicitly isn’t a good idea. We can do this sort of thing using “triples,” which are fundamental to the nascent evolution of the internet into a web of “linked data”:

```
<Barack Obama> belongs-to-party <Democratic Party>
```

Here, “Barack Obama” is an object from, say, the “people” vocabulary, and “Democratic Party” is from, perhaps, the “political party” vocabulary, or maybe just from “groups.” Essentially, these are tags that have been pre-categorized. The relationship between the two is expressed by the “belongs-to-party” predicate.
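Here is a minimal sketch of how a system might use such triples to make the one-step inference Debrouwere describes. The article names, the “is-from” predicate, and the function names are all hypothetical. A real linked data system would use URIs and a proper triple store instead of Python tuples.

```python
# Hypothetical triples: (subject, predicate, object).
triples = {
    ("Barack Obama", "belongs-to-party", "Democratic Party"),
    ("Barack Obama", "is-from", "Chicago"),
}

# Hypothetical articles and the entities they are explicitly tagged with.
article_tags = {
    "obama-speech": {"Barack Obama"},
    "city-budget": {"Chicago"},
}


def related_topics(tags, facts, predicate):
    """Return the objects reachable from the given tags via one predicate."""
    return {obj for subj, pred, obj in facts if pred == predicate and subj in tags}


def suggest_for_topic(topic, articles, facts, predicate="belongs-to-party"):
    """List articles tagged with the topic directly, or linked to it by one inference step."""
    return sorted(
        article
        for article, tags in articles.items()
        if topic in tags or topic in related_topics(tags, facts, predicate)
    )


# The Obama article shows up on the party's topic page even though
# nobody tagged it with "Democratic Party".
print(suggest_for_topic("Democratic Party", article_tags, triples))
```

Note that the inference is only as good as the triple behind it: the code treats party membership as a plain yes-or-no fact.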

But I argue that this is a rigged example. The world is normally much more messy.

“Are you now or have you ever been a member of the communist party?” was a killer question in its day, with complex answers like “I only attended one meeting.” And if parsing politicians’ statements were easy, then Politifact wouldn’t devote entire articles to the question of whether a single sentence was true or false. Further, they distinguish between different “grades” of truth, like “mostly true.” Mathematical logic, which is the basis of the sort of news inferences that Debrouwere and others discuss, doesn’t deal with “mostly true.”

The problem is that the world is not neatly categorizable.

Don’t get me wrong: vocabularies and relationships (à la linked data triples) are surely a good idea. But they have some serious drawbacks that relate to very deep issues in knowledge representation.

Debrouwere says, “Events happen at a certain place and at a certain time.” Sometimes. For a house fire or a shooting, maybe, but how “long” were the post-election protests in Iran last summer? They continued at varying intensity for several days, then flared up weeks later. Was that one protest or two? And what about a Facebook protest that gathered supporters over the course of a week? “When” and “where” did that happen?

Or, take the example of describing what an article says about someone. How do we decide when a story “criticizes” someone? There will always be boundary cases — lots of them in professional reporting. How do we ensure inter-rater reliability? Can we extract any real data from analyses of this tag if we have no other reference points with which to interpret it?

Something is always lost in categorization. That is the point! To say that two things are like one another is to ignore their differences, for the purposes of the present discussion. Unfortunately, what can safely be ignored depends on the discussion. Simple date and place notations work for some purposes, and fail miserably for others. They are not very rich, and even worse, we don’t know exactly how much has been lost in each case. Knowledge of that error is sometimes critical, especially when trying to make chains of inferences, where errors multiply.
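A back-of-the-envelope sketch shows how quickly errors multiply. It assumes, purely for illustration, that every step in a chain of inference is right 90% of the time and that the errors are independent. Neither number nor assumption comes from any real tagging system.

```python
def chain_reliability(step_reliabilities):
    """Probability that every step in a chain holds, assuming independent errors."""
    result = 1.0
    for p in step_reliabilities:
        result *= p
    return result


# Hypothetical: each inference step is correct 90% of the time.
for length in (1, 3, 5, 10):
    print(length, round(chain_reliability([0.9] * length), 3))
```

Under these assumptions a five-step chain is right only about 59% of the time, and a ten-step chain is wrong more often than it is right. And this is the optimistic case, where we actually know the per-step error.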

The reason we use text for reporting is that it’s good at representing these sorts of ambiguities. Strict adherence to the religion of finite relationship vocabularies leads one to believe that the world can be modeled in first-order logic (predicate logic), and this just isn’t true. Chains of automatic inference fail very quickly when applied even to very simple “real world” situations. The Artificial Intelligence research community went down that road for decades and found it really problematic, which is why we’re now seeing the rise of “statistical” AI techniques, such as statistical machine translation. This approach tries to find patterns in vast amounts of data rather than working out hard underlying rules; the categorization comes after you look at all available data, not before.

And therein lies the great virtue of tags: they are just about the simplest possible way of saying something, and don’t imply or require any particular inferential framework. They’re much harder to get wrong than more complex associations, and they make sense only in aggregate, and this makes them much more robust than predicate sentences. A tag says only, “there’s some association.” Full stop. I find this ambiguity a virtue. The meaning comes out of the relationships between the tags, articles, and users. Meaning is always relative, and tags force us to understand this, because there’s nothing else to go on.
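As a rough illustration of meaning coming out of relationships, the sketch below counts how often pairs of tags appear on the same article. The article IDs and tags are made up for the example, and co-occurrence counting is just one simple way of looking at tags in aggregate, not a claim about how any particular system works.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tagged articles; a tag asserts only "there's some association."
ARTICLES = {
    "story-1": {"iran", "protest", "election"},
    "story-2": {"iran", "election"},
    "story-3": {"protest", "facebook"},
    "story-4": {"iran", "protest"},
}


def co_occurrence(tagged_articles):
    """Count how many articles carry each unordered pair of tags."""
    counts = Counter()
    for tags in tagged_articles.values():
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    for pair, n in co_occurrence(ARTICLES).most_common():
        print(pair, n)
```

No single tag here says anything about how "iran" relates to "election"; the association only shows up across the collection, which is the point.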

Tags allow (or force) what we might call the “Google solution”: let humans describe it in a way that makes sense to them, then sort it all out later algorithmically. There are limits to this, of course, which is why metadata has value. But ultimately, computers serve humans, so the Google solution will always be a win when it is possible.

Linked data will be valuable because of the links. I predict that its main use will be as a sort of “super tagging” system: we still have “tags” in the linked data world, it’s just that they’re now all “uniform resource identifiers” that are visible to everyone on the web. This means that tags can be shared between systems and maintained by communities, which only makes them more powerful. In fact, this is exactly what we’re already seeing, with the Wikipedia-derived DBpedia at the center of all those linked data “bubble diagrams.”

Linked data also supports predicates that say what the relationship between the tags is, like the “Barack Obama is a member of the Democratic Party” example. But I predict that these will be much less useful, offering almost none of the “machine understanding” that’s supposed to come with the semantic web. I don’t know what “understanding” means if not the ability to draw inferences of some sort, and predicates are just too fragile, too subject to mis-categorization, too limited to capture the rich relationships of the real world. I do believe that we’ll see amazing new “artificial intelligence”-like applications built on top of linked data, but they’ll be built statistically: they’ll ignore the predicates or use them only in special cases, or only in aggregate.

Having said all this, I am fully in support of adding better metadata to news stories. I believe the “entity recognition” performed by OpenCalais is valuable, and that carefully managed tag vocabularies are essential. Often “location” will be a genuinely useful tag, and I can see the possibility for some wonderful news mashups.

But please, let’s not imagine that we can capture even the “essential” details of real journalism with any fixed vocabulary. And let’s not oversell the potential of machine reasoning or data-mining based on carefully-annotated news metadata.

We’re a very long way from understanding how to represent reality in machine-readable form.

For more on this topic, I recommend:

“Ontology is overrated” by Clay Shirky
“Metacrap” by Cory Doctorow
“What is a knowledge representation?” by Davis et al. at MIT

Brain-Sharing, Illustrated

Jonathan Stray — Sat, 13 Feb 2010 07:05:42 +0000

I found this awesome little video exploring the idea of plugging in someone else’s brain for a while to see how they see the world.

I sort of feel like this is what I’m doing when I hang out with certain people, or when I read certain authors or watch certain films. It’s always exhilarating to step inside someone else’s exquisitely constructed universe. Communication excites me.

This is from TV Ontario’s YouTube channel — that would be in Canada, folks, and the purveyor of my childhood television. My mom used to direct shows for them. Glad to see they’re still doing the occasional interesting thing.

We Were Wrong About Giraffes

Jonathan Stray — Wed, 19 Aug 2009 04:28:54 +0000

I was told in grade school that the giraffe’s neck evolved to be long because taller giraffes could reach more tasty tree leaves in times of drought. It’s a lovely example of natural selection, and also completely wrong, as I discovered when researching an edit to the Wikipedia article. Eventually, someone just went and checked: it turns out that during times of drought or food scarcity, giraffes eat from low bushes.

There is an important lesson here about what it means to “explain something.”

Rudyard Kipling wrote a children’s book of myths about the origins of animals titled Just-So Stories. In it he explains the origin of the elephant’s trunk, how the camel got his hump, and where the leopard’s spots came from (they were drawn by an Ethiopian from the leftover black of his own dark skin, so that the leopard would better blend into the background when they hunted zebra together.) Clearly, making sense is not the criterion for truth. It’s very easy to forget this, when someone gives you a complex explanation and you get that “aha! I understand” feeling. Human beings constantly confuse congruence with truth.

Sensible and false explanations are such a problem in science that the term “just-so story” has come to refer to any sort of explanation that fits the facts, but cannot be verified. Scientific theories are supposed to differ from literary criticism and other forms of creative writing by demanding explanations that are true. This means testing them against reality.

A crucial point here: you can’t test a theory against the same facts that you used to come up with the theory to begin with. Of course a theory is going to fit the facts that inspired it! Instead, a theory — an explanation of something — needs to predict things that haven’t been observed yet. Prediction is the essence of science; it is the ability to say what will happen before it happens that makes it possible to “design” a bicycle rather than just gluing random objects together until they roll. If our aim is to come up with a true theory about evolution, we need to use the length of the giraffe’s neck to make predictions about something else, something we can go check (repeatedly, if we are serious about testing the theory.)

This seemingly philosophical notion is incredibly useful for spotting subtle bullshit that sounds like science.

Consider, for example, the trial of a vitamin for preventing the common cold. Let’s say it’s even a controlled trial. One hundred volunteers are given Vitamin Z daily, while another hundred are (unknowingly) given a placebo. At the end of the study, the Vitamin Z group had the same number of colds. But, the researchers discover as they analyze the data, they had fewer headaches. Does this mean Vitamin Z prevents headaches? Not necessarily, because the theory “Vitamin Z prevents headaches” was formulated by noticing a pattern, any pattern, then making up a story about how that pattern came to be. That doesn’t make the story true. And there will always be patterns. If the volunteers can suffer from hundreds of different ailments, then by sheer dumb chance the Vitamin Z group will be found to suffer from less of at least one of them. (Applied to controlled experiments, this notion can be made mathematically precise, by the way. See post-hoc analysis.)
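Here is a small sketch of the "sheer dumb chance" arithmetic. It assumes that Vitamin Z does nothing at all, that each ailment is tested independently, and that a comparison counts as a "finding" when its p-value falls below a threshold of 0.05. The function names and the numbers are mine, chosen for illustration.

```python
import random


def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """Chance of at least one false "finding" across independent tests
    when no real effect exists anywhere."""
    return 1.0 - (1.0 - alpha) ** comparisons


def simulate_false_findings(alpha, comparisons, trials, seed=0):
    """Estimate the same quantity by simulation.

    With no real effect, a p-value is uniformly distributed on [0, 1],
    so each comparison is modelled as one uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(comparisons)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for m in (1, 20, 100):
        exact = family_wise_error_rate(0.05, m)
        approx = simulate_false_findings(0.05, m, trials=2000)
        print(m, round(exact, 3), round(approx, 3))
```

With a hundred ailments to look through, the chance that at least one of them shows a "significant" difference by accident is above 99% under these assumptions.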

Put another way, if you keep turning over rocks you will eventually find something. The whole point of a theory — an explanation, a model, a statement of the causal relationships of reality — is to say what you will find before the rock is turned over. Otherwise you only have a story that fits the facts, a just-so story.

I have found just-so stories to be most common in alternative medicine, economics, and evolutionary explanations of human behavior. If nothing testable has been predicted, then nothing has been “explained.”

We Have No Maps of The Web

Jonathan Stray — Mon, 04 May 2009 01:17:44 +0000

We dream the internet to be a great public meeting place where all the world’s cultures interact and learn from one another, but it is far less than that. We are separated from ourselves by language, culture and the normal tendency to seek out only what we already know. In reality the net is cliquish and insular. We each live in our own little corner, only dimly aware of the world of information just outside. In this the internet is no different from normal human life, where most people still die within a few kilometers of their birthplace. Nonetheless, we all know that there is something else out there: we have maps of the world. We do not have maps of the web.

I have met people who have never seen a world map. I once had a conversation with herders in the south Sahara who asked me if Canada was in Europe. As we talked I realized that the patriarch of the settlement couldn’t name more than half a dozen countries, and had no idea how long it might take to get to any of the ones he did know. He simply had no notion of how big the planet was. And to him, the world really is small: he lives in the desert, occasionally catches a ride to town for supplies, and will never leave the country in which he was born.

Online, we are all that man. Even the most global and sophisticated among us does not know the true scope of our informational world. Statistics on the “size” of the web are surprisingly hard to come by and even harder to grasp; learning that there are a trillion unique URLs is like being told that the land area of the Earth is 148 million square kilometers. We really have no idea what we’re missing, no visceral experience that teaches our ignorance.

We can remedy this.

First, language. When asked about the Chinese internet, the best most Westerners can manage is “here there be dragons.” Although machine translation is coming along and Google now includes it standard, we do not yet appreciate that the web in other languages could be important. In fact, unless you have twiddled your preferences, the multi-lingual web will not normally appear in your search results. There must have been a point in history when European maps did not show China, and Chinese maps did not show Europe; this is where we live today. The result is a strange sort of online invisibility between the major cultures of the world.

Another kind of invisibility results from gaps in media coverage. Even without the effects of censorship (of both press and internet varieties) there is the question of what counts as news; a famous example is the paucity of world events coverage in the American media. Although blogs can fill the reporting gap, a terrific story means nothing if no one knows where to read it.

Within the limitations of what we can view there are the limits of what we do view. A map of the Iranian blogosphere shows one cluster of sites frequented by reformists and expats, and another frequented by conservatives and religious youth. In the United States, Amazon book sales data shows that liberals and conservatives don’t read each other’s books. Ideology aside, each person has particular interests; not everyone can be concerned with colony collapse disorder, Polish cinema, or the oil pipelines of Turkmenistan.

It’s not that everyone should care about everything; that’s ridiculous and impossible. I am also not concerned about finding things specifically sought; we have search engines for that. Rather, the point of a map is to know that something is there at all. I want school-children to see the web from space. I want maps of the web and its various resources, online, up to date, for everyone.

We understand, in a general sense, how to make such maps. There have already been a number of large-scale maps of online information, such as the blogosphere visualizations of Matthew Hurst. In his images, each dot is a blog and each arc represents a hyperlink. Automatic layout minimizes the distance between clusters of interlinked blogs, translating nearness on the web into nearness on the map. Looking at these incredibly detailed images, where each tiny dot is a blog, I am overwhelmed by how big just this one corner of the internet can be, and how little of it I can ever perceive. I am also deeply impressed by the Places and Spaces charts of science and other fields, and the phenomenal Scientific Method: Relationships Among Scientific Paradigms. Browsing these maps, I am struck everywhere by the existence of large-scale patterns, the continents of a geography I didn’t know existed.

But these views are partial, specialized, and require enormous one-time resources to produce. They are curiosities, not navigation instruments. Until such maps exist in real-time in every browser they are just the toys of academics.

Imagine, then, an online newsreader (RSS reader, feed reader) with a map. I imagine all the world’s feeds drawn out in multiple colors, perhaps mapped out on a sphere. If each of your subscribed feeds was marked with a colored dot on the surface of this abstract Earth (which would include news and blogs from other cultures, ideologies, and languages) then it would be possible to see at a glance just where you stand in information space, and how wide or narrow your perspective. We would finally be able to put a finger down and say “you are here” in the world of what could be learned from the web.

The point is to engage curiosity, to encourage ourselves to leave the house online. In “Intelligent News Agents, With Real News” I envisioned a system that monitors what you read and automatically suggests topics that are as “different” as possible from your usual fare. This is a well-intended attempt to help you escape from the informational ghetto you grew up in, but I now think that such a system would be an utter failure. No one likes to be told what to read. Anyway, how is a programmer to decide what we “should” be viewing? Instead of trying to direct attention, let’s just make people aware of the geography.

There are many things that could be mapped. RSS feeds now include all the major news media, plus blogs, so they are an obvious place to start. A larger whole-web map seems essential for its sheer scope, and another “you are here” moment might arise from plotting personal browser history against such a map. All sorts of global patterns might also become apparent if we visually coded sites by language or topic, as I suggested in “How Many World Wide Webs are There?” Maps of academic publications or books, such as the maps of science discussed above, would reveal more slowly changing patterns in the world’s knowledge. Maps of corporate or political connections – something like a whole-world social network, or akin to the remarkable corporation browser of theyrule.net – would be difficult to generate, requiring considerable data-mining of public information, but could provide an up-to-date snapshot of global economic and power structures.

In all cases, our maps must be drawn very carefully, especially with regard to what counts as a link, because a map of something which is not fundamentally spatial can only be a metaphor. When well chosen, metaphors are powerful because they allow reasoning about one domain through the more familiar concepts of another; when poorly chosen, metaphors are unclear or deceptive. A map also engages our spatial reasoning faculties, the ability to grasp shape and structure at a glance. When we draw maps of information, we are seeking a visual representation of abstract properties such as the number of connecting links between blogs, co-authorship of books, or similarity of word vectors. This can be done well or poorly, as Edward Tufte has spent his life demonstrating.

Along this line, I feel that our web maps should be spheres and not planes. Not only does a sphere suggest the Earth, but there is no center on a sphere, no privileged continent. A sphere also provides the concept of an antipode, the point farthest away from wherever you stand. It is good to wonder what is on the other side of the world.

The maps I want are also live. They are not snapshots, nothing like the “blogosphere as recorded by web crawl in August 2007” that we see in captions today. Instead, they must be continually updated, just as our search engines continually re-crawl the web. Our internet also needs history, as The Internet Archive and Google Trends know. I want a time slider on every map, a little widget that lets one scroll back and forth through history and actually watch new blogs rise to prominence, or see the polarization that occurred after 9/11. I want to see the continental drift.

Technologically, none of this is especially difficult, at least not in concept. A whole-web map of all accessible pages does require work with very large datasets, perhaps hundreds of terabytes, but there are many corporations that know how to do this, often under the label of cloud computing. It also requires whole-web indices, and this is a trickier problem because only the search engine companies currently have the required infrastructure (and are willing to pay for it). The sorts of maps I propose are fundamentally expensive to maintain, which is probably part of why they don’t already exist. This implies centralization, and Google could certainly do the job, if they wanted to, or if they were willing to let others access their data. (Update: more on the economics of web indices.) But details follow need; like Stewart Brand, maybe we first need to want to see the whole world from space.

I live with very idealistic hopes. I believe that being aware of our world truly enables us to live better at all scales, from where to brunch to national policy options for desertification. I also believe that communication can reduce bigotry, intolerance, and ultimately conflict, at least if the next generation is exposed young enough. But information that we do not even know exists cannot help us, and the ability to communicate with someone anywhere in the world means nothing if we are never tempted to do it. It is not our fault that we all live in informational ghettoes, but we need to make it obvious that we do.

We Can’t Learn About Economics

Jonathan Stray — Thu, 26 Mar 2009 00:23:19 +0000

Despite spending the last several days reading up on Treasury Secretary Geithner’s plan to buy bad bank assets, I now feel only marginally better prepared to judge whether this is a good idea or not. Of course, no one is asking me, but I still think it’s a big problem that I can’t evaluate this plan, because the fact that we live in a democracy means that citizens need to be able to understand what their government is doing.

Now, I am no economist and I have no idea how to run a bank — much less all the banks. However, I am smart, interested, and I’ve done my homework, including previously reading a first year economics textbook (covering both micro- and macro-economics) and several other interesting books (1,2,3) on how markets work or don’t. In short I have been the model of a concerned citizen, and I still have no idea what is going on. This is partially because the situation is very complex, but it is also because there is no way a private citizen can get access to the data that would clarify matters — large banks will barely share their balance sheets with the government, much less me.

This is a problem. It means that the government, financial, and academic communities have not paid nearly enough attention both to basic economics education, and to transparency in real-world business. It is therefore impossible for anyone else to check their assumptions and restrain their huge power. Lest this sounds like unhelpful complaining, I promise to make a concrete suggestion for improvement by the end of this post.

I find it useful to compare the question of bank bailouts to climate change. Like the worldwide recession, climate change presents a policy problem of global scope that needs to be handled right the first time. And like the global economy, climate change is a ridiculously complex problem, depending on the long-term interplay of hundreds of variables interacting through physical laws that most people have never heard of. Because I was once a physicist, I can appreciate the basic science involved, and I can even make credible sense of specialist papers in the field. However, I don’t have to: the most cursory investigation of the topic quickly leads to the IPCC reports, a series of massively collaborative international summaries of current knowledge. There is no such authoritative primer for the economic crisis.

The best I have been able to find is Baseline Scenario’s Financial Crisis for Beginners, a collection of articles and interviews on various aspects of the problem. On the Geithner plan in particular, I have also found Brad DeLong’s FAQ to be quite helpful. Unfortunately, his (good) opinion of the plan is not universal: noted (and Nobel-winning) economist Paul Krugman disagrees. There’s a lively debate in the New York Times editorial pages today, and so it goes…

The core of the plan is this: the US Treasury will provide loans to allow investors to buy up bad bank assets (mortgages that might not be repaid, that sort of thing) as a long term speculative investment. This will, in theory, make the troubled banks more solvent, which should encourage them to start lending again. The potential problem is this: the loans will be “non-recourse”, which means that investors don’t have to repay the loans if these assets really are worth little or nothing. It amounts to the government insuring financiers to take more risk, for free.

This galls me, but it actually makes sense if you believe that the root of the current problem is that everyone is scared to take risk, and thus businesses can’t get loans, etc., which is the well-reasoned position of yet another prominent economist. It doesn’t make sense if you believe that some of these bad assets are just that: worthless debt that will never be repaid, such as mortgages to people who couldn’t afford them in the first place. In this view, the pre-crash boom was just a pyramid scheme that finally collapsed.

Which view is right? I have no idea. Neither, it seems, does anyone else. The major papers seem more interested in covering the politics of who said what (see, e.g. this from the NYT) than getting deep into the complexities of what is actually going on. This isn’t helpful; it’s little more than celebrity gossip, yapping about what the stars are doing. It’s akin to talk-show pundits arguing about climate change — which, sadly, also happens frequently.

So where is the IPCC report for the financial crisis? Where is the graph that shows that lending has collapsed? Where is the excruciatingly careful analysis of the relationship between credit and unemployment? What percentage of which assets at which banks are now considered to be bad? I know these things must exist — I surely hope they do — but I do not have the time, expertise, or connections to track everything down. Thus I want a careful overview, but I also want the citations, because requiring that claims be documented is a really good way of keeping people intellectually honest. Lacking this, what reason do I have to believe anything that anyone is saying?

One of the clearer, more careful resources I’ve found is the Congressional Research Service report “Causes of the Financial Crisis.” Ironically, this document, like all CRS reports, was not meant for members of the general public, and it had to be leaked to the fabulous OpenCRS.com for us to get at it. And in this summary I find the most refreshingly honest sentence:

While some may insist that there is a single cause, and thus a simple remedy, the sheer number of causal factors that have been identified tends to suggest that the current financial situation is not yet understood in its full complexity.

If we are at all serious about democracy, then public education about complex topics must be a specific goal. This is a goal that goes hand-in-hand with greater transparency, because it is not possible to make convincing arguments about data that is secret. We need to be treated like adults, not children: the level of economic discourse in the current press is scarcely high-school level, complete with whispered secrets and gossip about who is popular. At best, the major players in this game — the US government and financial institutions — have been negligent in educating the public on the functioning and dys-functioning of the economy; at worst they are playing politics to protect their power and money. Nor have academia or well-informed bloggers stepped in to fill the vacuum.

Again, what is needed is a meticulously documented argument as to 1) what happened and 2) what can be expected to fix it. The truth may be that we simply don’t know what happened or what will happen next — but even if this is so, I want to know why those in charge believe what they believe. Lacking such a detailed, evidence-based narrative, it is simply not possible to be an informed citizen. The worst part is, I’m not sure that our professional legislators currently understand any more than I do. Transparency and good education benefit us all.

The Censored Story of Wikileaks

Jonathan Stray — Thu, 01 Jan 2009 23:13:26 +0000

Wikileaks is often in the news, but for the wrong reasons. The web site provides a highly public outlet for “classified, censored, or otherwise restricted material of political, diplomatic, or ethical significance.” It is designed to be a journalistic tool for whistle-blowers and citizens of oppressive government and corporate regimes, a place of first and last resort for sensitive information from sources who need protection. It is a great irony, then, that an organization which specializes in censored information only makes the news when somebody violently objects.

I first stumbled upon Wikileaks about a year ago and have been watching it closely ever since. Despite its mission of openness, the site has a certain mystery about it: nowhere on the site are the principals publicly named. I was delighted, then, to attend a talk by two of the Wikileaks founders at the 25th Annual Chaos Communication Congress in Berlin. The 50-minute presentation was titled Wikileaks vs. The World, or “a talk about some conclusions observing Wikileaks.”

You may have heard about some of the things we’ve done in the media, but what you hear about tends to be what is frequently of greatest salacious interest to the Western media and to people in general. That doesn’t tend to be our everyday work.

A look at the front page of Wikileaks today shows all sorts of topics: The un-redacted report of Abu Ghraib whistleblower Samuel Provance. The German Foreign Secret Service report on Kosovo, 2005. Alperin vs. Vatican Bank, 2008 concerning Nazi assets allegedly laundered in 1946. A Scientology Department of Special Affairs lecture. Documentation showing that Swiss Bank Julius Baer put USD $300 million through the Cayman Islands in 1999. “The secret internet censorship list of Thailand’s Ministry of Information and Communication Technology (MICT).”

Wikileaks posts anything submitted to it complete and unaltered; that is the point. In this policy they represent the purest possible interpretation of the ideals of transparency and freedom of speech. Usually, the documents they post are applauded or at least ignored, but sometimes they draw the ire of those who feel that there is a case for certain secrets. A few weeks ago Wikileaks posted a list of Danish web-sites ostensibly censored for child-pornography; this summer they released a document describing the technical details of the Warlock signal jammers used by American forces in Iraq. They defend both choices, and indeed all of their leaks, with the same argument:

Who’s to judge the relevance, the political relevance? If it’s us who is to judge the relevance, then are we robust enough to judge this for all of society? … This is something for the public to do, and the political groups in the public, and not us.

Fighting censorship is what they’re all about. They believe deeply in the “fourth estate,” the role of the press and public cognizance as a check against tyranny. Like Wikipedia, they place great trust in the intelligence and enthusiasm of the public at large, who are asked to vet, analyze, and publicize the anonymously submitted documents. This ultimately represents a different model of society, an almost ridiculously open and transparent society. I did not hear the Wikileaks speakers ever concede that secrecy sometimes has its purposes, that there are legitimate reasons for knowledge to be hidden; instead, they repeatedly articulated the dangers of censorship.

The question is not what we need to be told. The question is what we need not to be told and who decides. Secret censorship systems are unaccountable and dangerous.

But again we are distracted. The possible mistakes and harm of Wikileaks cannot be judged in a vacuum, but only against the overall activities of the project. And sadly, sometimes it is the successes that draw the least attention.

There are a lot of things we do routinely that are very serious, but still get little attention. For example we have exposed many, many political assassinations. We released only three months ago a very important report on Kenya documenting 500 extra-judicial assassinations that had occurred in the past 18 months. There was some pickup in the Kenyan press, but the rest of the world, nothing. So getting leaked documents out is extremely important, but it’s not the only thing. Sometimes there is no interest group to care to spread the information.

The speakers urged the audience to get involved: to read, to analyze, to disclose. Our collective reality is only information, they said. “Everyone here is what he knows.” Every decision we make about what to say to someone else or what to write on our blogs defines the future world we live in, and defines what actually happened. It is not an absolute world; it is malleable. And, they claim, it is being changed in all sorts of ways with or without our knowledge or consent. Contrary to popular belief, “no medium is easier to censor than the internet.”

There is a complete eradication of certain parts of history going on. This is much easier than anyone in this crowd here most likely will think. We can see that censorship is being implemented systematically and globally. … George Orwell said that ‘he who controls the present controls the past, and he who controls the past controls the future,’ and this is never more true than with electronic archives. We have seen many, many examples of major newspapers pull material from the archives permanently … For example, this year there were seven stories removed from The Guardian, The Telegraph, and the New Statesman, in response to fear over legal costs. If you go to the URLs for those stories, you won’t see that this story has been removed by legal action, you will see ‘not found’, and if you search the index you will see ‘not found’. Those stories not only have ceased to exist, they have ceased to have ever existed. So the centralization that is occurring in archive repositories means that censorship is very easy.

Speaking to an audience of hundreds of hackers, researchers, anarchists and artists at the CCC in Berlin, they reminded everyone that Wikileaks is real. At the CCC I learned about the flaws in proposed cryptographic technologies for electronic voting; I even learned that SSL itself has been compromised. But technology is not people. And this, perhaps, is the key point of the entire lecture, and the entire project:

All these documents are real. It is hard fact that is documented. And all these documents reflect some facets of something that is happening at some point somewhere in the world. This is reality. … These documents pertain to violence that is caused by truth being told, by documents surfacing to the society. So it is important to understand that is not a hypothetical construct, some project that is dealing with something very obscure. We are actually dealing with information that reflects a very important facet of lives all over the world, and that has an influence on the quality, the freedom, and all other aspects of lives, living beings that we all need to have compassion for, and care for. This is very important in the mission that we try to bring across.

The streaming video of the complete talk has been archived in WMV format (859MB) here and here, and in OGG video format (445MB) here and here.