Visualizing communities

There are in fact no masses; there are only ways of seeing people as masses.
–Raymond Williams

Who are the masses that the “mass media” speaks to? What can it mean to ask what “teachers” or “blacks” or “the people” of a country think? These words are all fiction, a shorthand which covers over our inability to understand large groups of unique individuals. Real people don’t move in homogeneous herds, nor can any one person be neatly assigned to a single category. Someone might view themselves simultaneously as the inhabitant of a town, a new parent, and an active amateur astronomer. Now multiply this by a million, and imagine trying to describe the overlapping patchwork of beliefs and allegiances.

But patterns of association leave digital traces. Blogs link to each other, we have “friends” and “followers” and “circles,” we share interesting tidbits on social networks, we write emails, and we read or buy things. We can visualize this data, and each type of visualization gives us a different answer to the question “what is a community?” This is different from the other ways we know how to describe groups. Anecdotes are tiny slices of life that may or may not be representative of the whole, while statistics are often so general as to obscure important distinctions. Visualizations are unique in being both universal and granular: they have detail at all levels, from the broadest patterns right down to individuals. Large scale visualizations of the commonalities between people are, potentially, a new way to represent and understand the public — that is, ourselves.

I’m going to go through the major types of community visualizations that I’ve seen, and then talk about what I’d like to do with them. Like most powerful technologies, large scale visualization is a capability that can also be used to oppress and to sell. But I imagine social ends, worthwhile ways of using visualization to understand the “public” not as we imagine it, but as something closer to how we really exist.

Social networks
Social networking services seem like an obvious place to go looking for communities, and I’m sure everyone has seen a social network visualization by now; they’re great eye candy. There are a lot of problems with social network visualizations — for example, what does it really mean to say that two people are “connected”? But let’s dive right in and see what we can see.

Here’s a visualization of the connections between my Facebook friends, which I created with the “social graph” Facebook application. Every person I am friends with is included in this visualization. The layout algorithm tries to put people with lots of mutual friends close together; otherwise, the positions are random. Nothing can be learned from the fact that “Amy” is to the left or the right of “Ramone,” but clusters of people are reliable structures.

On this diagram I can see the following clusters: San Francisco personal friends, Hong Kong personal friends, Toronto personal friends, University of Hong Kong classmates, SF circus people, HK circus people, former Adobe colleagues, and a few others. The independent nodes floating around are mostly people I met traveling but never got to know too well, while clusters form when lots of people know their friends’ friends. Clusters are so fundamental to this type of analysis that this Facebook app has tried to identify them by overlaying colored circles. I can see a lot more here than the algorithm can, which is a warning about the limitations of blind, acontextual analysis. Nonetheless, several major aspects of my life and personal history are immediately apparent. When you think about it, this is pretty amazing.
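To make the idea of "lots of mutual friends" concrete, here is a minimal sketch of the quantity a layout algorithm like this could use to pull people together. The names and friendships are made up for illustration, and I am not claiming this is how the Facebook app computes its layout.

```python
from itertools import combinations

# Hypothetical friendship data: each person maps to the set of people they know.
friends = {
    "Amy": {"Ramone", "Bea", "Chen"},
    "Ramone": {"Amy", "Bea", "Chen"},
    "Bea": {"Amy", "Ramone", "Chen"},
    "Chen": {"Amy", "Ramone", "Bea"},
    "Dana": {"Eli"},
    "Eli": {"Dana"},
    "Finn": set(),  # met while traveling, knows none of the others
}


def mutual_friend_counts(friends):
    """Count the friends shared by every pair of people."""
    counts = {}
    for a, b in combinations(sorted(friends), 2):
        counts[(a, b)] = len(friends[a] & friends[b])
    return counts


counts = mutual_friend_counts(friends)
print(counts[("Amy", "Ramone")])  # shared friends inside a tight cluster
print(counts[("Amy", "Finn")])    # an independent node shares none
```

Pairs with high counts are the ones a layout would draw close together; people like Finn, who share nobody with anyone, end up floating on their own.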

But this is a tiny little world. Rather than centering the visualization on a single person, you can make up some other sort of rule that determines which nodes are included. Here’s part of a lovely visualization of the visualization community on Twitter:

Creator Moritz Stefaner chose who appears on this graph with a simple algorithm: starting with a small list of people whom he considered central to the visualization community, he added every person who followed or was followed by at least five of them, producing a larger set. Within this limited network, the size of each node represents the number of followers, which shows, again, the importance of context. Hans Rosling may not be a big fish in the larger Twitter universe (he’s no Ashton Kutcher), but he’s a superstar in the visualization community.
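The selection rule is simple enough to sketch in a few lines. This is my own reading of the rule as described above, not Stefaner's actual code: the function names, the tiny example data, and the choice to keep the seed accounts in the result are all assumptions. The threshold is a parameter so the toy example can use 2 instead of 5.

```python
def select_community(seeds, follows, min_links=5):
    """Return the seeds plus every account linked to at least min_links seeds.

    follows is a collection of (follower, followed) pairs. An account is
    "linked" to a seed if it follows that seed or is followed by it.
    """
    seeds = set(seeds)
    linked_seeds = {}  # account -> set of seeds it follows or is followed by
    for follower, followed in follows:
        if followed in seeds and follower not in seeds:
            linked_seeds.setdefault(follower, set()).add(followed)
        if follower in seeds and followed not in seeds:
            linked_seeds.setdefault(followed, set()).add(follower)
    added = {acct for acct, s in linked_seeds.items() if len(s) >= min_links}
    return seeds | added


def follower_counts(members, follows):
    """Count each member's followers, looking only inside the selected set."""
    counts = {m: 0 for m in members}
    for follower, followed in follows:
        if follower in members and followed in members:
            counts[followed] += 1
    return counts


# Hypothetical data: three seed accounts and a handful of follow relations.
seeds = {"s1", "s2", "s3"}
follows = {
    ("a", "s1"), ("a", "s2"),   # a follows two seeds
    ("s3", "b"), ("b", "s1"),   # b is followed by one seed and follows another
    ("c", "s1"),                # c is linked to only one seed
    ("s1", "s2"),
}

members = select_community(seeds, follows, min_links=2)
sizes = follower_counts(members, follows)
```

Note that node size here depends entirely on who made it into the set: an account with a huge following elsewhere can still be a small dot on this map.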

But is there really one “visualization community”? I’m involved in visualization and know many of the folks on this map, and it looks like a pretty good map to me, but it seems to skew heavily toward the design, art, and infographics world. That’s probably because of the seed accounts chosen, and this chart misses a number of folks coming at visualization from the open government, journalism, scientific, and academic points of view. It also certainly excludes many prominent visualizers who don’t use Twitter. This is a universal problem: a visualization must either include or exclude each node; it’s a binary, black-and-white sort of decision process about a fixed set of nodes drawn from available data, but reality isn’t like that. Real communities are porous and overlapping and span multiple communication networks.

Co-consumption
We can also map “communities” by what they read or view or buy. This was first done by large online merchants, such as Amazon. Their famous “customers who bought this also bought that” feature, and indeed all automated recommendation engines, can be viewed as cluster detection algorithms. In this case, people are clustered by the books they bought or the movies they watched. Your personal recommendations are nothing more than the patterns of the cluster you fall into. Google News’ personalization system represents these clusters explicitly in its core algorithm.

To make this a little more concrete, here’s an analysis of US political book sales on Amazon during the 2008 presidential election, as plotted by Orgnet. Rather than representing people, each node is now a book, and the arrows represent Amazon’s “customers who bought A also bought B” recommendations. The striking finding is that people bought red books or blue books, but not both.

This amazing visualization is political polarization made manifest. There is little overlap in the networks of political books, so the “left” and the “right” emerge as features of reality in this context, which is fascinating. But a word of caution: this chart actually shows three clusters, two of which are assigned to the “left.” What do we make of that? Are there actually three “sides” here? Also, the visualization includes only books deemed “political” from the outset. This looks at the world through a very narrow lens because it ignores all other books — and therefore the rest of the network structure around the books shown here, which is presumably dense and interesting. But how do we decide what is “political?” And what about every other way we could examine the relationships between books and people? Is this kind of polarization apparent and important in broader contexts? We need to be very wary of projecting our preconceptions onto the interpretation of a visualization.

Also note that this map doesn’t depend at all on the “content” of books or blogs or articles — there’s no text processing or semantic analysis here. Amazon infers similarity in an entirely social fashion, based on how groups of people show similar buying behavior. iTunes’ Genius playlists and Netflix’s movie recommendations work the same way — but we can’t see the structures of any of these data sets, because they aren’t visualized.
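Here is a minimal sketch of what "entirely social" similarity can look like: count how often two books show up in the same customer's purchase history, without ever looking at the text. The customers and titles are invented, and this is a simple co-occurrence count, not Amazon's actual recommendation algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: customer -> set of books bought.
baskets = {
    "cust1": {"Red Book A", "Red Book B"},
    "cust2": {"Red Book A", "Red Book B"},
    "cust3": {"Blue Book A", "Blue Book B"},
    "cust4": {"Blue Book A", "Blue Book B", "Red Book A"},
}


def co_purchase_counts(baskets):
    """Count how many customers bought each pair of books."""
    counts = Counter()
    for books in baskets.values():
        for a, b in combinations(sorted(books), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(book, counts, top_n=3):
    """List the books most often bought by customers who bought `book`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == book:
            related[b] += n
        elif b == book:
            related[a] += n
    return [title for title, _ in related.most_common(top_n)]


pair_counts = co_purchase_counts(baskets)
print(also_bought("Red Book A", pair_counts, top_n=1))
```

Each nonzero pair count is a potential edge in a map like Orgnet's. If most edges run red-to-red or blue-to-blue, the clusters appear on their own.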

Communication networks
There’s often a difference between what people say and what they do. Looking at social network connections is a little like asking someone who their friends are — relevant, but subject to little white lies, perceptual biases, the limitations of memory, and complicated personal judgements. Better, perhaps, to look at the data streams generated by online activity. For example, email.

Email network analysis seems to have come to popularity with the Enron emails released in 2003. The simplest way to visualize a huge pile of emails is to plot each email address as a node and draw edges when one person emailed another. Here’s such an image from Jeffrey Heer’s Exploring Enron project:

There’s more going on in this picture, such as some color coding via the modularity algorithm, which claims to be about “detecting community structure” but is actually about detecting clusters. But no matter how you visualize it, there’s something interesting here. Analysis of email networks within organizations can reveal organizational structure that varies significantly from formal hierarchies, and there’s at least one book which claims that this informal structure is how things actually get done.

The main email analysis techniques are all based around counting the number of emails exchanged by each pair of people. This is a powerful idea, even if it’s not necessarily a very clear one. We don’t really know how to interpret facts such as “Joanna emails Hugo more than anyone else.” Are they colleagues, or lovers, or does Hugo work in tech support? But again, in almost every visualization of this type we get clusters, more or less tight groups of people who talk or act more with each other than they do with others. There is at least one research technique which attempts to detect conspiracies based partially on this type of network structure. There has also been some interesting analysis of the dynamic structure of the network — how people’s communication patterns changed as the crisis deepened. I like that, because time is so often overlooked in network analysis. Ideally, every network visualization would include a time slider that allows the user to scrub back and forth to see how things evolved.
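The pair-counting idea can be sketched like this. The addresses and messages are invented, and treating each pair as unordered (so Joanna-to-Hugo and Hugo-to-Joanna count together) is my assumption. Real analyses may keep direction.

```python
from collections import Counter

# Hypothetical (sender, recipient) records, one per email.
emails = [
    ("joanna", "hugo"),
    ("joanna", "hugo"),
    ("hugo", "joanna"),
    ("joanna", "li"),
    ("li", "omar"),
]


def pair_counts(emails):
    """Count emails exchanged by each unordered pair of addresses."""
    counts = Counter()
    for sender, recipient in emails:
        if sender != recipient:  # ignore notes to self
            counts[frozenset((sender, recipient))] += 1
    return counts


def top_correspondent(person, counts):
    """Return the address this person exchanges the most email with."""
    best, best_n = None, 0
    for pair, n in counts.items():
        if person in pair and n > best_n:
            (other,) = pair - {person}
            best, best_n = other, n
    return best


counts = pair_counts(emails)
print(top_correspondent("joanna", counts))
```

The counts become edge weights in the network drawing. They tell you that Joanna and Hugo talk a lot, and nothing about why.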

Web structure
What if we take “website” instead of “person” as the atomic component of a community? The first maps of the web were made in the late 1990s by spidering the links between pages. My favorite modern example is the 2008 map of the Persian-language blogosphere by John Kelly and Bruce Etling of the Berkman Center. Every node is a blog. The size represents the number of other blogs that link to it. The color shows the subject of the blog, as categorized by a Persian-speaking researcher. Again, the visualization algorithm places blogs that frequently link to each other closer together. And like people, blogs tend to form clusters.

In this map, humans chose the color for each dot — each blog — by manually reading the blog and coding the topic. The researchers didn’t know that the blogs on similar topics would be in the same cluster when they were reading them, and the computer didn’t know the assigned topics when clustering them. In other words, there is an amazing discovery here: an algorithm that can tell that two blogs have a different perspective — say, secular vs. religious politics, or perhaps poetry instead — just by looking at where these two blogs sit in the web of links. Link structure is here a proxy for worldview. It may also be a proxy for information flow, which must be closely related.
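Two of the quantities in this map are easy to sketch: the node size (how many other blogs link to a blog) and a rough check of whether links stay within a hand-coded topic. The blogs, links, and topic labels below are invented, and the "same-topic share" is only a crude stand-in I chose for illustration. It is not the method Kelly and Etling used.

```python
from collections import Counter

# Hypothetical crawl results: (linking blog, linked blog), repeats allowed.
links = [
    ("a", "b"), ("b", "a"), ("c", "a"), ("c", "a"),
    ("d", "e"), ("e", "d"), ("c", "d"),
]

# Hypothetical topics assigned by a human reader, independent of the links.
topics = {"a": "secular", "b": "secular", "c": "secular",
          "d": "poetry", "e": "poetry"}


def unique_links(links):
    """Collapse repeated links and drop self-links."""
    return {(src, dst) for src, dst in links if src != dst}


def in_degree(links):
    """Number of distinct other blogs that link to each blog."""
    return Counter(dst for _, dst in unique_links(links))


def same_topic_share(links, topics):
    """Fraction of distinct links that connect two blogs with the same topic."""
    pairs = unique_links(links)
    if not pairs:
        return 0.0
    same = sum(1 for src, dst in pairs if topics[src] == topics[dst])
    return same / len(pairs)


sizes = in_degree(links)
share = same_topic_share(links, topics)
```

A share close to 1 means the link structure and the human-assigned topics agree, which is the surprising thing about this map: neither the readers nor the algorithm knew what the other had found.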

It would also be possible to visualize the web in terms of language. I imagine that this would reveal a geography of continent-clusters separated not by oceans but by language, so that Spain and Mexico would be neighbors, somewhat apart from the United States. As far as I know, no one has done this yet. It might tell us something about how information flows between cultures, or reveal useful bridge-bloggers.

Location-based community
By this I don’t mean where you live, though that’s part of it. Rather, I mean what can be inferred by analyzing people’s real-time location history. There are many sources for this information: check-in apps like FourSquare, tracking services like Google Latitude, geo-Tweets, or just the location recorded by mobile phone companies and individual phones. Suppose you had millions of these person-at-location-at-time data points. Could you segment users into different groups based on, say, the bars they hang out in? There’s money betting that the answer is yes, because Sense Networks is aiming to do this commercially. For more, see this patent. In 2008 they released the CitySense app showing, collectively, where everyone is within a city:

But this little phone app is just a demonstration, a toy. The point of this work isn’t to say where people are, but how their patterns of movement relate over time. This is another type of clustering, of understanding who people are and how they are the same or different. I bet you could locate the members of, say, an underground party community by looking for a cluster of people who frequently gathered together in supposedly abandoned warehouses in industrial areas. Sense Networks CTO Tony Jebara has written about visualizing these path clusters directly, but I haven’t been able to find any examples.
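I can at least sketch the simplest version of this idea: count how often two people are seen at the same place in the same hour, then group people connected by repeated co-presence. The sightings are invented and the method is a deliberately naive assumption of mine. I have no knowledge of how Sense Networks actually does it.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (person, place, hour index) sightings.
sightings = [
    ("ana", "warehouse", 23), ("ben", "warehouse", 23), ("cal", "warehouse", 23),
    ("ana", "warehouse", 47), ("ben", "warehouse", 47), ("cal", "warehouse", 47),
    ("dee", "cafe", 9), ("eve", "cafe", 9), ("ana", "cafe", 9),
    ("dee", "cafe", 33), ("eve", "cafe", 33),
]


def co_location_counts(sightings):
    """Count how often each pair shared a place during the same hour."""
    present = defaultdict(set)
    for person, place, hour in sightings:
        present[(place, hour)].add(person)
    counts = Counter()
    for people in present.values():
        for a, b in combinations(sorted(people), 2):
            counts[(a, b)] += 1
    return counts


def frequent_groups(counts, min_meetings=2):
    """Group people linked by at least min_meetings co-locations."""
    neighbors = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= min_meetings:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # walk everyone reachable through frequent pairs
            person = stack.pop()
            if person not in group:
                group.add(person)
                stack.extend(neighbors[person] - group)
        seen |= group
        groups.append(group)
    return groups


groups = frequent_groups(co_location_counts(sightings))
```

In this toy data Ana visits the cafe once, but a single shared hour isn't enough to put her in the cafe group. The threshold is doing the work of separating coincidence from habit.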

What’s a community?
I believe that I am part of many communities, and that each of these communities has enriched my life tremendously. But there’s no simple definition. Very often when someone says “community” what they mean is “geographic community,” the set of people who live in the same town or polity. I’d like to include this definition, because face-to-face contact is very important, and many of the most pressing collective action problems are local. But community must now mean something more. Consider Clay Shirky’s anecdote about the Boston Globe’s coverage of sexual abuse within the Catholic Church, in 1992 versus 2002:

In April of 2002 … this Spotlight story was most largest, most global thing that ever came out of Boston.com, the Boston Globe website, and the circulation for that one story was larger than the nominal circulation of The Boston Globe. Because [when] the stories in the ’90s had come out, the audience of the story was Bostonians, whether Catholic or no. But in 2002, the audience was Catholics whether Bostonian or no.

The only definition of “community” that makes any sense to me is “a group of people who think or act collectively.” This is the central theme of these visualizations. People don’t act truly independently, randomly spreading themselves out across geography and belief and behavior. Our lives are clustered along many disparate dimensions, which is just another way of saying that humans are social creatures. There must be as many different ways to visualize communities as there are types of human action. Each is an answer to “what is a community?” How these different answers relate, and how they relate to our intuitive, experiential understanding of face-to-face communities, I don’t think anyone really knows. Many people are trying to understand this right now, from industry to academia, and no doubt intelligence and law enforcement.

Note that many of these types of visualizations can group people who are not in contact with one another. This is particularly true of co-consumption and co-location visualizations. Maybe everyone has read the same books, or they hang out at the same coffee shop but have never met. There is some similarity between people that the algorithm reveals, a pattern we can see, but these people don’t necessarily know that they’re similar. We might call this a “latent community,” a group of like-minded folks who might act together if they were to come into communication — and the internet is great at allowing people to self-organize and define a common identity.

What do we do with this?
The list of people working on identifying communities through data is long. Finance, intelligence, law enforcement, politics, and especially marketing are already hard at work in these areas. Marketing is starting to turn away from classic measures like age, location, and gender because they are not terribly good predictors of purchasing, and there are already social-network based influence predictors. Advertising based on personal data is powerful, but imagine advertising based on an analysis of where you fit into the broader fabric of society. I imagine scarily good predictions of what your friends will find cool. Or your colleagues, or your family.

But I’m more concerned with the public-interest applications, which I see all over the social sciences: in journalism, sociology, conflict resolution, representative governance, urban planning, epidemiology. This is especially true when the clusters in these visualizations are good proxies for belief or worldview, as they seemed to be in the Iran blogosphere map. Knowing who believes what seems a critical building block for collective action of all types.

Being a journalist I’ve thought about journalism most, and I’d like to use community visualization to target journalism to actual people. If I had a live map of the web broken down by interests in some way, there are all sorts of ways I could focus my reporting. I could look at the map to see where the people affected by the story congregate online, and find sources there. When I had something to publish, I would know where to post the link. And I could discover who I’m missing, who I’m not thinking of and not serving in my reporting, and challenge the categories by which I group people. I tend to think of journalism in terms of empowerment; it’s a service we perform for members of a community. Community used to mean “town” but the definition is and must be more complex now. I want to get closer to the audience and farther from “mass media.” Media became mass during the broadcast era because of technology and economics, not because it was the right way to do journalism.

In the most general sense, I am concerned with community visualization because I am concerned with representation. That is why I want these maps of the masses to be available to all. It is vital to represent the public to itself, and mapping how people are already acting together, out there in the world, seems like a critical feature for anyone who wants to participate broadly in society. It is especially critical because we can expect that various interests will expend huge sums pursuing this mapping for their own ends; in these maps there is the ability to influence, and to divide or unite, and I don’t think we want that entirely in a few powerful hands. But there is also the ability to understand who we, collectively, are. It’s easy to toss around labels like “left” and “right” or “Hispanic” or “drug abuser” but who are these people, actually, and what other identities do they have? And who are we not thinking of at all?

Jonathan Stray

