information visualization – Jonathan Stray

A full-text visualization of the Iraq War Logs

Jonathan Stray — Fri, 10 Dec 2010 21:38:51 +0000

Update (Apr 2012): the exploratory work described in this post has since blossomed into the Overview Project, an open-source large document set visualization tool for investigative journalists and other curious people, and we’ve now completed several stories with this technique. If you’d like to apply this type of visualization to your own documents, give Overview a try!

Last month, my colleague Julian Burgess and I took a shot at peering into the Iraq War Logs by visualizing them in bulk, as opposed to using keyword searches in an attempt to figure out which of the 391,832 SIGACT reports we should be reading. Other people have created visualizations of this unique document set, such as plots of the incident locations on a map of Iraq, and graphs of monthly casualties. We wanted to go a step further, by designing a visualization based on the richest part of each report: the free text summary, where a real human describes what happened, in jargon-inflected English.

Also, we wanted to investigate more general visualization techniques. At the Associated Press we get huge document dumps on a weekly or sometimes daily basis. It’s not unusual to get 10,000 pages from a FOIA request — emails, court records, meeting minutes, and many other types of documents, most of which don’t have latitude and longitude that can be plotted on a map. And all of us are increasingly flooded by large document sets released under government transparency initiatives. Such huge files are far too large to read, so they’re only as useful as our tools to access them. But how do you visualize a random bunch of documents?

We’ve found at least one technique that yields interesting results, a graph visualization where each document is a node, and edges between them are weighted using cosine-similarity on TF-IDF vectors. I’ll explain exactly what that is and how to interpret it in a moment. But first, the journalism. We learned some things about the Iraq war. That’s one sense in which our experiment was a success; the other valuable lesson is that there is a boatload of research-grade visual analytics techniques just waiting to be applied to journalism.


Interpreting the Iraq War, December 2006
This is a picture of the 11,616 SIGACT (“significant action”) reports from December 2006, the bloodiest month of the war. Each report is a dot. Each dot is labelled by the three most “characteristic” words in that report. Documents that are “similar” have edges drawn between them. The location of the dot is abstract, and has nothing to do with geography. Instead, dots with edges between them are pulled closer together. This produces a series of clusters, which are labelled by the words that are most “characteristic” of the reports in that cluster. I’ll explain precisely what “similar” and “characteristic” mean later, but that’s the intuition.

We colored each report/dot by the “incident type”, which is an existing field in the SIGACT, entered by military personnel. It’s important to note that the incident type field was not used to place the reports in the diagram; the placement depends only on the text of the document. This plots one variable (color, which is incident type) against another (position, which depends on the summary text).

And it works. The central cluster is blue, the color for the “criminal event” type, and the documents within it all include the word “corpse.” There are a heartbreaking number of them, because this was the height of the Iraqi civil war. Sub-clusters include various modifiers such as “shot.” (Click any image for hi-res version.)

Above this, the blue murders merge into the green “enemy action” reports. At the interface we have “civ, killed, shot,” which are apparently reports of civilians wounded in battle. Enemy actions also have their own clusters labelled with “mortar,” “female,” “officer,” and “injured.” We haven’t looked into the “female”/“enemy action” cluster yet, and I wonder if there’s a story there.

There is a red cluster off to the side. Red signifies that the military coded these reports as “explosive hazard,” and the documents here all include the words “tanker truck.” Sure enough, there are contemporaneous press reports of tankers being used as explosive weapons, and this cluster shows that there were at least several dozen such incidents throughout Iraq in Dec 2006, though it doesn’t immediately distinguish between explosions and attempted or threatened explosions.

There’s another cluster of blue criminal action reports, labelled “blindfolded, feet, hands.” Bound feet and hands were common in sectarian violence at the time, and some reports include the word “torture.” There’s a nearby cluster of abductions.

It goes on. December 2006 was a vicious and disturbing and complicated time in Iraq, and the visualization has patterns at all scales, especially if you look at the hi-res image and read the tiny single-report labels. There are some dark green “friendly action” reports labelled “convoy,” and other “friendly actions” which mention the troublesome town of Hadithah (near bottom left). And there is the oil connection, a group of reports which include the word “pipeline.”

How we did it, and what we can and can’t learn from this picture
Visualization is metaphor. Certain details are thrown away, others are emphasized. The algorithms used to produce the visualization have their own sensitivities and blind spots. Without understanding these, a viewer will make false inferences. I’m going to explain in some detail how this picture was produced, both so that others can replicate this research, and so that those looking at such visualizations can interpret them honestly.

We used standard text-analytics techniques, borrowed from information retrieval: the bag-of-words model, TF-IDF term weighting, and cosine similarity to compare documents. This is the stuff from which search engines are built, among other things. The geeky among us can learn as much as they could ever want to know from this wonderful free information retrieval textbook.

We start by turning each document into a fixed-length vector of numbers. There are as many numbers in this vector as there are words in the vocabulary of all the documents, over 17,000 distinct terms in the case of the Iraq War Logs. If “pipeline” appears three times in a report, we put a three in the count for “pipeline.” Of course the reports are much shorter than 17,000 words, usually just a couple hundred words, so most of the numbers in each document vector are zero.
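For the geeky, here is a minimal sketch of that first step in Python. It is an illustration, not the code we ran: the whitespace `tokenize` function and the two toy reports are assumptions made up for the example.

```python
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()

def count_vectors(docs):
    # The vocabulary is every distinct term across all documents, sorted.
    vocab = sorted({term for doc in docs for term in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One slot per vocabulary term; terms absent from the document get 0.
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

# Toy reports, invented for illustration.
docs = ["pipeline attack near pipeline", "corpse found shot"]
vocab, vectors = count_vectors(docs)
print(vocab)
print(vectors)
```

Every vector has the same length as the vocabulary, which is why real document vectors are mostly zeros. A practical implementation would likely store only the non-zero entries.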

We also don’t quite store the count of each word. Instead we store the frequency, that is, we divide the counts by the number of words in the document. If the document is 100 words long then “‘pipeline’ appeared three times” becomes “3% of the words in this document are ‘pipeline.'” This is “term frequency,” the TF part of TF-IDF.
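The 3% example can be checked with a few lines of Python. This is a sketch under the same simplifying assumption of whitespace tokenization; the 100-word report is synthetic.

```python
from collections import Counter

def term_frequencies(text):
    # Term frequency: each word's count divided by the document length.
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {term: n / total for term, n in counts.items()}

# A synthetic 100-word report in which "pipeline" appears three times.
report = " ".join(["pipeline"] * 3 + ["filler"] * 97)
tf = term_frequencies(report)
print(tf["pipeline"])
```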

Then we normalize again by how commonly the word appears across documents. It’s not enough to know that “pipeline” is common in a document. We need to know that “pipeline” is unusually common in this document. So we count the fraction of documents where “pipeline” appears, and divide the term frequency by this document frequency. (Technically, we multiply by the logarithm of the inverse of the document frequency.) This has the effect of de-emphasizing terms which appear in almost every document, and it’s the “inverse document frequency” or IDF part of TF-IDF.
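Put together, the weighting can be sketched as follows. This uses the common textbook form, term frequency times log(N / document frequency); the exact variant, the natural logarithm, and the toy reports are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for words in tokenized for term in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # Sparse vector: only terms that occur in the document are stored.
        vectors.append({
            term: (n / len(words)) * math.log(n_docs / df[term])
            for term, n in counts.items()
        })
    return vectors

# Toy reports: "report" appears in every one, so its weight drops to zero.
docs = ["report pipeline fire", "report corpse found", "report convoy attacked"]
vectors = tf_idf(docs)
print(vectors[0])
```

A term that shows up in every document gets log(1) = 0, which is exactly the de-emphasis described above.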

This is the sense in which the labels on the documents and the clusters are “characteristic” words: they are words that occur frequently in those specific documents, but don’t appear at all in most other documents.
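Picking a label for a single report then amounts to taking its highest-weighted terms. A minimal sketch, assuming the TF-IDF weights are already in a dictionary; the numbers below are invented, and the alphabetical tie-break is an arbitrary choice for stable output.

```python
def characteristic_words(weights, k=3):
    # Rank terms by weight, highest first; break ties alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Made-up TF-IDF weights for one report, for illustration only.
weights = {"the": 0.0, "corpse": 0.41, "shot": 0.27, "found": 0.12, "at": 0.01}
label = characteristic_words(weights)
print(label)
```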

But by turning each document into a list of numbers, the order of the words is lost. Once we crunch the text in this way, “the insurgents fired on the civilians” and “the civilians fired on the insurgents” are indistinguishable. Both will appear in the same cluster. This is why a vector of TF-IDF numbers is called a “bag of words” model; it’s as if we cut out all the individual words and put them in a bag, losing their relationships before further processing. And so we get to:

Important caveat #1: any visualization based on a bag-of-words model cannot show distinctions that depend on word order.

Once we have all the documents encoded as TF-IDF vectors, we compare every pair of documents to determine how similar they are. We call two documents similar if their characteristic words overlap, and we determine this by taking the dot product of the two document vectors, each scaled to unit length. Why? The dot product multiplies the corresponding numbers at each position in the two vectors and adds up the results. If two documents both have a big number for “pipeline”, the dot product will be large. If one document has a big number for “pipeline” but zero for “abducted”, while the other has a large number for “abducted” but zero for “pipeline”, then the dot product will be zero. This is called the cosine similarity method of comparing documents, because of geometrical relationships between the cosine function and the dot product. Cosine similarity assigns a number to every pair of documents, from zero for “they are completely different” to one for “they are the same.” (At least, the same as far as the bag of words model is concerned.)
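In code, the comparison might look like the sketch below. A raw dot product only equals the cosine when both vectors have length one, so this version divides by the two lengths. The sparse dictionary representation and the example weights are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    # Dot product over the terms of a; terms missing from b contribute zero.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Invented weights: the two reports share no characteristic terms.
pipeline_doc = {"pipeline": 0.9, "fire": 0.2}
abduction_doc = {"abducted": 0.8}
print(cosine_similarity(pipeline_doc, abduction_doc))
print(cosine_similarity(pipeline_doc, pipeline_doc))
```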

Each document is a dot in the visualization. To this we add edges, and the “weight” or strength of the edge — which shows up as line width in this visualization — is the cosine similarity. But we don’t put edges between every pair of documents, only those that are above some threshold of similarity. For this visualization, that threshold was 0.6.
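Building the edge list is then a loop over every pair of documents. A minimal sketch, using the 0.6 threshold from this visualization; the brute-force all-pairs loop and the three toy vectors are assumptions for illustration, not a description of our actual pipeline.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_edges(vectors, threshold=0.6):
    # Keep an edge only when the pair is more similar than the threshold.
    edges = []
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine_similarity(vectors[i], vectors[j])
        if weight > threshold:
            edges.append((i, j, weight))
    return edges

# Invented weights: two "corpse" reports and one unrelated "pipeline" report.
vectors = [
    {"corpse": 0.9, "shot": 0.3},
    {"corpse": 0.8, "shot": 0.5},
    {"pipeline": 1.0},
]
edges = similarity_edges(vectors)
print(edges)
```

Comparing every pair grows quadratically: 11,616 reports means roughly 67 million comparisons, which is slow but still feasible for a single month of data.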

And then we lay out the graph. We used Gephi, a free graph visualization tool. Generally, graph layout algorithms try to bring nodes with strong edges closer together. We found the Fruchterman-Reingold algorithm gave the clearest layout in this case, but the general idea is that points with strong ties gradually move closer as the algorithm runs. But there are conflicting demands; a node marked “corpse” and “abducted” may be pulled towards both clusters. Where a node ends up also depends a lot on where it started, and the nodes start in essentially random positions.
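To give a feel for what such a layout algorithm does, here is a toy force-directed layout in the spirit of Fruchterman-Reingold. It is not Gephi's implementation. Scaling the attraction by edge weight, the linear cooling schedule, and all the constants are assumptions chosen for a small example.

```python
import math
import random

def fruchterman_reingold(n_nodes, edges, iterations=200, seed=0):
    rng = random.Random(seed)
    # Nodes start in essentially random positions inside the unit square.
    pos = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    k = math.sqrt(1.0 / n_nodes)  # ideal spacing between nodes
    temperature = 0.1             # maximum step size, shrinks every iteration
    cooling = temperature / (iterations + 1)
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Every pair of nodes repels with force k^2 / distance.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force
        # Connected nodes attract with force weight * distance^2 / k.
        for i, j, weight in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = weight * dist * dist / k
            disp[i][0] -= dx / dist * force
            disp[i][1] -= dy / dist * force
            disp[j][0] += dx / dist * force
            disp[j][1] += dy / dist * force
        # Move each node along its net force, capped by the temperature.
        for i in range(n_nodes):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature -= cooling
    return pos

# Two pairs of similar documents with no edges between the pairs.
pos = fruchterman_reingold(4, [(0, 1, 1.0), (2, 3, 1.0)])
print(pos)
```

Changing `seed` changes the starting positions and therefore the final picture, which is one reason the absolute coordinates should not be over-interpreted.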

Cosine similarity-weighted graph layout is not the only way to view the relationships between thousands of documents in a 17,000-dimensional space. There are other techniques such as multi-dimensional scaling. But however the documents are visualized, we are trying to understand the structure of something very complicated in only two dimensions, like trying to guess an object from its shadow. Depending on which angle you take, the shadow is going to be more or less revealing, and perhaps more or less misleading. This is:

Important caveat #2: the positions of the dots are sort of arbitrary, though we hope that nearby dots actually represent similar documents.

In other words, quantitative measurements of distances on this visualization won’t mean much. Arguing that “these events are unrelated because they are on opposite sides of the image” is similarly fallacious.

What can we learn from this visualization technique? Clusters are fairly reliable structures. Using color to plot one type of information against another can reveal patterns. And we believe that this visualization captures some important macro-scale aspects of the War Logs. This picture isn’t a story in the usual sense, but we find it insightful nonetheless, and maybe it tells us where to look further. A search tool can only answer the questions we ask, but a visualization tool lets us make maps.

Much more is possible
To begin with, we’d like to try coloring each dot according to the number of casualties, another field already available in the SIGACTs. We know that over 4000 U.S. forces and 100,000 civilians died in Iraq, but what were the circumstances of their deaths? Perhaps we can start to answer that question. We also want to find a way to animate this diagram through time, so we can see how the war changed as it progressed.

But there are plenty of other visualization techniques waiting to be applied to journalism, and plenty of other document sets to apply them to. It seems likely that TF-IDF and cosine similarity will be generally useful for full-text visualizations of a variety of document types, but they won’t always work. Threaded displays might be much more revealing for things like emails, where it’s important to identify and isolate conversations. In other contexts, entity-relationship diagrams can be insightful; theyrule.net is the granddaddy of this type of analysis, today being seriously pursued by Muckety.

Visualization is also only one part of the problem. This is a static image, but what we really need is an interactive system where a computer draws the pictures and a human directs the exploration. Visualization has to be combined with filtering and selection tools to allow an investigator to “zoom in” on only those documents of interest. Such complete systems exist in other fields, such as the Jigsaw visual analytics software, but there’s currently nothing that really works well for journalism. Performance is a huge issue when dealing with very large document sets, and data import and clean-up are often the real-world bottlenecks. Clean-up is often the most time consuming part of document set analysis, and new tools such as Google Refine give us hope that it can be streamlined.

The potential applications of an industrial-strength journalistic visual analytics system are far broader than document dumps. We got interested in visual analytics because we faced document sets that were so large that they were completely opaque without special tools. But a newsroom also has its archives, and the data and stories it generates every day. We’ve heard interest from historians, and at the other end of the immediacy scale are potential real-time monitoring applications, technologies that are being seriously pursued by organizations such as UN Global Pulse.

We see so much potential that we — the Associated Press in conjunction with several top-notch researchers — are embarking on a serious attempt to build an open-source system for journalistic visualization of very large document sets, be they document dumps, news archives, or the streams of data that now surround civilization. We have preliminary designs for a system called Overview, and we have applied for a Knight News Challenge grant to hire full-time developers to create it. I’ll soon post a more detailed description of the system we’d like to build. We’re going to need help from the journalist-programmer community.

Best Visualizations of 2009

Jonathan Stray — Wed, 16 Dec 2009 17:41:47 +0000

FlowingData just did a roundup of the top 5 prettiest, awesomest, interestingest data visualizations of the year. I think it’s wonderful, because I think visualizations are important. The amount of data in the world is exploding, but human sense abilities are not.

It was a huge year for data. There’s no denying it. Data is about to explode.

Applications sprung up left and right that help you understand your data – your Web traffic, your finances, and your life. There are now online marketplaces that sell data as files or via API. Data.gov launched to provide the public with usable, machine-readable data on a national scale. State and local governments followed, and data availability expands every day.

At the same time, there are now tons of tools that you can use to visualize your data. It’s not just Excel anymore, and a lot of it is browser-based. Some of the tools even have aesthetics to boot.

It’s exciting times for data, indeed.

Data has been declared sexy, and the rise of the data scientist is here.

With all the new projects this year, it was hard to filter down to the best, but here they are: two honorable mentions and the five best data visualization projects of 2009. Visualizations were chosen based on analysis, aesthetics, and most importantly, how well they told their story (or how well they let you tell yours).

Go here for the rest.

Since all I ever seem to write about these days is journalism (what with the journalism school, and currently interning at a newspaper), here’s the tie-in:
Data is news now.

Deep, huh?

More pretty pictures at visualcomplexity.com, my very favorite infoviz site.

Why We Need Open Search, and How to Make Money Doing It

Jonathan Stray — Sun, 27 Sep 2009 09:51:01 +0000

Anything that’s hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We’ve all lived with Google for so long that most of us can’t even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results which are “conceptually similar” to the user’s query, even if the relevant documents don’t contain any of the search terms. I also believe that full-scale maps of the online world are important; I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.

The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it’s just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because

We believe the internet should be open to everyone. Currently, only a select few corporations have access to an index of the world wide web. Our intention is to change that.

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages so far crawled, I estimate that they currently have 234 terabytes of data. At today’s storage technology prices of about $100 per terabyte, it would cost $24,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network — it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
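The extrapolation is simple enough to check in a few lines. This sketch assumes decimal units (a gigabyte is 10^9 bytes, a terabyte 10^12) and that the average page size in the first part of the index holds for the whole crawl.

```python
sample_pages = 600_000
sample_bytes = 3.2e9   # 3.2 gigabytes, decimal units assumed
total_pages = 44e9     # "more than 44 billion pages"

bytes_per_page = sample_bytes / sample_pages
total_terabytes = bytes_per_page * total_pages / 1e12

raw_cost = total_terabytes * 100       # about $100 per terabyte of bare disk
managed_cost = total_terabytes * 1000  # about $1000 per terabyte in a data center

print(f"{bytes_per_page:.0f} bytes per page")
print(f"{total_terabytes:.0f} terabytes in total")
print(f"${raw_cost:,.0f} for bare storage, ${managed_cost:,.0f} managed")
```

Under these assumptions the bare-disk figure comes out a little under the rounded $24,000 above, and the managed figure is ten times larger.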

Full web indices are just too big to play with casually; there will always be a very small number of them.

I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, other companies buy fine-grained access at market rates. We’ve had this model for years in the telecommunications industry, where big companies own the lines and lease access to anyone who is willing to pay.

The key to the whole proposition is a precise definition of access. Google’s keyword “access” is very narrow. Something like SQL queries would expand the space of expressible questions, but you still couldn’t run image comparison algorithms or do the computational linguistics processing necessary for true semantic search. The right way to extract the full potential of a database is to run arbitrary programs on it, and that means the data has to be local.

The only model for open search that works both technologically and financially is to store the web index on a cloud, let your users run their own software against it, and sell the compute cycles.

It is my hope that this is what DotBot is up to. The pieces are all in place already: Amazon and others sell cheap cloud-computing services, and the basic computer science of large-scale parallel data processing is now well understood. To be precise, I want an open search company that sells map-reduce access to their index. Map-reduce is a standard framework for breaking down large computational tasks into small pieces that can be distributed across hundreds or thousands of processors, and Google already uses it internally for all their own applications — but they don’t currently let anyone else run it on their data.
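To make the idea concrete, here is a toy, single-process imitation of a map-reduce job that counts pages by language. The record format, the `lang` field, and the `run_map_reduce` helper are all hypothetical. A real framework would spread the map and reduce calls across many machines.

```python
from collections import defaultdict

def map_page(page):
    # Map step: emit one (language, 1) pair for each crawled page.
    yield page["lang"], 1

def reduce_counts(key, values):
    # Reduce step: add up all the values emitted for one key.
    return key, sum(values)

def run_map_reduce(records, mapper, reducer):
    # Stand-in for the framework: group mapper output by key, then reduce.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Hypothetical crawl records, invented for illustration.
pages = [
    {"url": "http://example.com/a", "lang": "zh"},
    {"url": "http://example.com/b", "lang": "en"},
    {"url": "http://example.com/c", "lang": "zh"},
]
language_counts = run_map_reduce(pages, map_page, reduce_counts)
print(language_counts)
```

Because each page is mapped independently and each key is reduced independently, the same two small functions could in principle run unchanged over billions of pages.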

I really think there’s money to be made in providing open search infrastructure, because I really think there’s money to be made in better search. In fact I see an entire category of applications that hasn’t yet been explored outside of a few very well-funded labs (Google, Bellcore, the NSA): “information engineering,” the question of what you can do with all of the world’s data available for processing at high speed. Got an idea for better search? Want to ask new questions of the entire internet? Working on an investigative journalism story that requires specialized data-mining? Code the algorithm in map-reduce, and buy the compute time in tenth-of-a-second chunks on the web index cloud. Suddenly, experimentation is cheap — and anyone who can figure out something valuable to do with a web index can build a business out of it without massive prior investment.

The business landscape will change if web indices do become infrastructure. Most significantly, Google will lose its search monopoly. Competition will probably force them to open up access to their web indices, and this is good. As Google knows, the world’s data is exceedingly valuable, too valuable to leave in the hands of a few large companies. There is an issue of public interest here. Fortunately, there is money to be made in selling open access. Just as energy drives change in physical systems, money drives changes in economic systems. I don’t know who is going to do it or when, but open search infrastructure is probably inevitable. If Google has any sense, they’ll enter the search infrastructure market long before they’re forced (say, before Yahoo and Bing do it first).

Let me know when it happens. There are some things I want to do with the internet.

Mapping the Daily Me

Jonathan Stray — Thu, 24 Sep 2009 06:57:52 +0000

If we deliver to each person only what they say they want to hear, maybe we end up with a society of narrow-minded individualists. It’s exciting to contemplate news sources that (successfully) predict the sorts of headlines that each user will want to read, but in the extreme case we are reduced to a journalism of the Daily Me: each person isolated inside their own little reflective bubble.

The good news is, specialized maps can show us what we are missing. That’s why I think they need to be standard on all information delivery systems.

For the first time in history, it is possible to map with some accuracy the information that free-range consumers choose for themselves. A famous example is the graph of political booksales produced by orgnet.com:

Here, two books are connected by a line if consumers tended to buy both. What we see is what we always suspected: a stark polarization. For the most part, each person reads either liberal or conservative books. Each of us lives in one information world but not the other. Despite the Enlightenment ideal of free debate, real-world data shows that we do not seek out contradictory viewpoints.

Which was fine, maybe, when the front page brought them to us. When information distribution was monopolized by a small number of newspapers and broadcasters, we had no choice but to be exposed to stories that we might not have picked for ourselves. Whatever charges one can press against biased editors of the past, most of them felt that they had a duty to diversity.

In the age of disaggregation, maybe the money is in giving people what they want. Unfortunately, there is a real possibility that what we want is to have our existing opinions confirmed. You and I and everyone else are going to be far more likely to click through from a headline that confirms what we already believe than from one which challenges us. “I don’t need to read that,” we’ll say, “it’s clearly just biased crap.” The computers will see this, and any sort of recommendation algorithm will quickly end up as a mirror to our preconceptions.

It’s a positive feedback loop that will first split us along existing ideological cleavages, then finer and finer. In the extreme, each of us will be alone in a world that never presents information to the contrary.

We could try to design our systems to recommend a more diverse range of articles (an idea I explored previously) but the problem is, how? Any sort of agenda-setting system that relies on what our friends like will only amplify polarities, while anything based on global criteria is necessarily normative: it makes judgements on what everyone should be seeing. This gets us right back into all the classic problems of ideology and bias: how do we measure diversity of viewpoint? And even if we could agree on a definition of what a “healthy” range of sources is, no one likes to be told what to read.

I think that maps are the way out. Instead of trying to decide what someone “should” see, just make clear to them what they could see.

An information consumption system — an RSS reader, online newspapers, Facebook — could include a map of the infosphere as a standard feature. There are many ways to draw such a map, but the visual metaphor is well-established: each node is an information item (an article, video, etc.) while the links between items indicate their “similarity” in terms of worldview.

This is less abstract than it seems, and with good visual design these sorts of pictures can be immediately obvious. Popular nodes could be drawn larger; closely related nodes could be clustered. The links themselves could be generated from co-consumption data: when one user views two different items, the link between those items gets slightly stronger. There are other ways of classifying items as related — as belonging to similar worldviews — but co-consumption is probably as good a metric as any, and in fact co-purchasing data is at the core of Amazon’s successful recommendation system.
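As a rough sketch of how co-consumption links could be tallied: every time one user has viewed two different items, the link between that pair gets one unit stronger. The data layout and the names below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_consumption_links(view_history):
    # view_history maps each user to the list of items they have viewed.
    links = Counter()
    for items in view_history.values():
        # Each unordered pair of distinct items counts once per user.
        for a, b in combinations(sorted(set(items)), 2):
            links[(a, b)] += 1
    return links

# Invented viewing data, for illustration only.
view_history = {
    "user1": ["story_a", "story_b"],
    "user2": ["story_a", "story_b", "story_c"],
    "user3": ["story_c"],
}
links = co_consumption_links(view_history)
print(links.most_common())
```

The resulting pair counts can serve directly as edge weights for the map, and the number of users who viewed each item can set the size of its node.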

The concepts involved are hardly new, and many maps have been made at the site level where each node is an entire blog, such as the map of the Iranian blogosphere above. However, we have never had a map of individual news items, and never in real-time for everyone to see.

Each map also needs a “you are here” indicator.

This would be nothing more than some way of marking items that the user has personally viewed. Highlight them, center them on the map, and zoom in. But don’t zoom in too much. The whole purpose of the map is to show each of us how small, how narrow and unchallenging our information consumption patterns actually are. We will each discover that we live in a particular city-cluster of information sources, on a particular continent of language, ideology, or culture. A map literally lets you see this at a glance — and you can click on far-away nodes for instant travel to distant worldviews.

Giving people only what they like risks turning journalism into entertainment or narcissism. Forcing people to see things that they are not interested in is a losing strategy, and there isn’t any obvious way to decide what we should see. Showing people a map of the broader world they live in is universally acceptable, and can only encourage curiosity.

How Your Friends Affect You, Now With Math

Jonathan Stray — Thu, 17 Sep 2009 05:20:32 +0000

The New York Times Magazine and Wired both have major articles this week on recent empirical work in social networks, including significant research on how things like obesity, smoking, and even happiness spread among groups of people. The Wired piece has better pictures, while the NYT piece is more thorough and thoughtful, and covers both the potential and the pitfalls of this kind of analysis.

For decades, sociologists and philosophers have suspected that behaviors can be “contagious.” … Yet the truth is, scientists have never successfully demonstrated that this is really how the world works. None of the case studies directly observed the contagion process in action. They were reverse-engineered later, with sociologists or marketers conducting interviews to try to reconstruct who told whom about what — which meant that people were potentially misrecalling how they were influenced or whom they influenced. And these studies focused on small groups of people, a few dozen or a few hundred at most, which meant they didn’t necessarily indicate much about how a contagious notion spread — if indeed it did — among the broad public. Were superconnectors truly important? How many times did someone need to be exposed to a trend or behavior before they “caught” it? Certainly, scientists knew that a person could influence an immediate peer — but could that influence spread further? Despite our pop-cultural faith in social contagion, no one really knew how it worked.

We Have No Maps of The Web

Jonathan Stray — Mon, 04 May 2009 01:17:44 +0000

We dream the internet to be a great public meeting place where all the world’s cultures interact and learn from one another, but it is far less than that. We are separated from ourselves by language, culture and the normal tendency to seek out only what we already know. In reality the net is cliquish and insular. We each live in our own little corner, only dimly aware of the world of information just outside. In this the internet is no different from normal human life, where most people still die within a few kilometers of their birthplace. Nonetheless, we all know that there is something else out there: we have maps of the world. We do not have maps of the web.

I have met people who have never seen a world map. I once had a conversation with herders in the south Sahara who asked me if Canada was in Europe. As we talked I realized that the patriarch of the settlement couldn’t name more than half a dozen countries, and had no idea how long it might take to get to any of the ones he did know. He simply had no notion of how big the planet was. And to him, the world really is small: he lives in the desert, occasionally catches a ride to town for supplies, and will never leave the country in which he was born.

Online, we are all that man. Even the most global and sophisticated among us does not know the true scope of our informational world. Statistics on the “size” of the web are surprisingly hard to come by and even harder to grasp; learning that there are a trillion unique URLs is like being told that the land area of the Earth is 148 million square kilometers. We really have no idea what we’re missing, no visceral experience that teaches our ignorance.

We can remedy this.

First, language. When asked about the Chinese internet, the best most Westerners can manage is “here there be dragons.” Although machine translation is coming along and Google now includes it standard, we do not yet appreciate that the web in other languages could be important. In fact, unless you have twiddled your preferences, the multi-lingual web will not normally appear in your search results. There must have been a point in history when European maps did not show China, and Chinese maps did not show Europe; this is where we live today. The result is a strange sort of online invisibility between the major cultures of the world.

Another kind of invisibility results from gaps in media coverage. Even without the effects of censorship (of both press and internet varieties) there is the question of what counts as news; a famous example is the paucity of world events coverage in the American media. Although blogs can fill the reporting gap, a terrific story means nothing if no one knows where to read it.

Within the limitations of what we can view there are the limits of what we do view. A map of the Iranian blogosphere shows one cluster of sites frequented by reformists and expats, and another frequented by conservatives and religious youth. In the United States, Amazon book sales data shows that liberals and conservatives don’t read each other’s books. Ideology aside, each person has particular interests; not everyone can be concerned with colony collapse disorder, Polish cinema, or the oil pipelines of Turkmenistan.

It’s not that everyone should care about everything; that’s ridiculous and impossible. I am also not concerned about finding things specifically sought; we have search engines for that. Rather, the point of a map is to know that something is there at all. I want school-children to see the web from space. I want maps of the web and its various resources, online, up to date, for everyone.

We understand, in a general sense, how to make such maps. There have already been a number of large-scale maps of online information, such as the blogosphere visualizations of Matthew Hurst. In his images, each dot is a blog and each arc represents a hyperlink. Automatic layout minimizes the distance between clusters of interlinked blogs, translating nearness on the web into nearness on the map. Looking at these incredibly detailed images, where each tiny dot is a blog, I am overwhelmed by how big just this one corner of the internet can be, and how little of it I can ever perceive. I am also deeply impressed by the Places and Spaces charts of science and other fields, and the phenomenal Scientific Method: Relationships Among Scientific Paradigms. Browsing these maps, I am struck everywhere by the existence of large-scale patterns, the continents of a geography I didn’t know existed.

But these views are partial, specialized, and require enormous one-time resources to produce. They are curiosities, not navigation instruments. Until such maps exist in real-time in every browser they are just the toys of academics.

Imagine, then, an online newsreader (RSS reader, feed reader) with a map. I imagine all the world’s feeds drawn out in multiple colors, perhaps mapped out on a sphere. If each of your subscribed feeds was marked with a colored dot on the surface of this abstract Earth — which would include news and blogs from other cultures, ideologies, and languages — then it would be possible to see at a glance just where you stand in information space, and how wide or narrow your perspective. We would finally be able to put a finger down and say “you are here” in the world of what could be learned from the web.

The point is to engage curiosity, to encourage ourselves to leave the house online. In “Intelligent News Agents, With Real News” I envisioned a system that monitors what you read and automatically suggests topics that are as “different” as possible from your usual fare. This is a well-intended attempt to help you escape from the informational ghetto you grew up in, but I now think that such a system would be an utter failure. No one likes to be told what to read. Anyway, how is a programmer to decide what we “should” be viewing? Instead of trying to direct attention, let’s just make people aware of the geography.

There are many things that could be mapped. RSS feeds now include all the major news media, plus blogs, so they are an obvious place to start. A larger whole-web map seems essential for its sheer scope, and another “you are here” moment might arise from plotting personal browser history against such a map. All sorts of global patterns might also become apparent if we visually coded sites by language or topic, as I suggested in “How Many World Wide Webs are There?” Maps of academic publications or books, such as the maps of science discussed above, would reveal more slowly changing patterns in the world’s knowledge. Maps of corporate or political connections – something like a whole-world social network, or akin to the remarkable corporation browser of theyrule.net – would be difficult to generate, requiring considerable data-mining of public information, but could provide an up-to-date snapshot of global economic and power structures.

In all cases, our maps must be drawn very carefully, especially with regard to what counts as a link, because a map of something which is not fundamentally spatial can only be a metaphor. When well chosen, metaphors are powerful because they allow reasoning about one domain through the more familiar concepts of another; when poorly chosen, metaphors are unclear or deceptive. A map also engages our spatial reasoning faculties, the ability to grasp shape and structure at a glance. When we draw maps of information, we are seeking a visual representation of abstract properties such as the number of connecting links between blogs, co-authorship of books, or similarity of word vectors. This can be done well or poorly, as Edward Tufte has spent his life demonstrating.

Along this line, I feel that our web maps should be spheres and not planes. Not only does a sphere suggest the Earth, but there is no center on a sphere, no privileged continent. A sphere also provides the concept of an antipode, the point farthest away from wherever you stand. It is good to wonder what is on the other side of the world.

The maps I want are also live. They are not snapshots, nothing like the “blogosphere as recorded by web crawl in August 2007” that we see in captions today. Instead, they must be continually updated, just as our search engines continually re-crawl the web. Our internet also needs history, as The Internet Archive and Google Trends know. I want a time slider on every map, a little widget that lets one scroll back and forth through history and actually watch new blogs rise to prominence, or see the polarization that occurred after 9/11. I want to see the continental drift.

Technologically, none of this is especially difficult, at least not in concept. A whole-web map of all accessible pages does require work with very large datasets, perhaps hundreds of terabytes, but there are many corporations that know how to do this, often under the label of cloud computing. It also requires whole-web indices, and this is a trickier problem because only the search engine companies currently have the required infrastructure (and are willing to pay for it). The sorts of maps I propose are fundamentally expensive to maintain, which is probably part of why they don’t already exist. This implies centralization, and Google could certainly do the job — if they wanted to, or if they were willing to let others access their data. (Update: more on the economics of web indices.) But details follow need; like Stewart Brand, maybe we first need to want to see the whole world from space.

I live with very idealistic hopes. I believe that being aware of our world truly enables us to live better at all scales, from where to brunch to national policy options for desertification. I also believe that communication can reduce bigotry, intolerance, and ultimately conflict, at least if the next generation is exposed young enough. But information that we do not even know exists cannot help us, and the ability to communicate with someone anywhere in the world means nothing if we are never tempted to do it. It is not our fault that we all live in informational ghettoes, but we need to make it obvious that we do.

Escaping the News Hall of Mirrors

Jonathan Stray — Thu, 05 Mar 2009 07:40:35 +0000

We live in a cacophony of news, but most of it is just echoes. Generating news is expensive; collecting it is not. This is the central insight of the news aggregator business model, be it a local paper that runs AP Wire and Reuters stories between ads, or web sites like Topix, Newser, and Memeorandum, or for that matter Google News. None of these sites actually pay reporters to research and write stories, and professional journalism is in financial crisis. Meanwhile there are more bloggers, but even more re-blogging. Is there more or less original information entering the web this year than last year? No one knows.

A computer could answer this question. A computer could trace the first, original source of any particular article or statement. The effect would be like donning special glasses in the hall of mirrors that is current news coverage, being able to spot the true sources without distraction from reflections. The required technology is nearly here.

This is more than geekery if you’re in a position of needing to know the truth of something. Last week I was researching a man named Michael D. Steele, after reading a newly leaked document containing his name. Steele gained fame as one of the stranded commanders in Black Hawk Down, but several of his soldiers later killed three unarmed Iraqi men. I rapidly discovered many news stories (1, 2, 3, 4, 5, 6, 7, etc.) claiming that Steele had ordered his men to “kill all military-age males.” This is a serious accusation, and widely reprinted — but no number of news articles, blog posts, and reblogs can make a false statement more true. I needed to know who first reported this statement, and its original source.

First I had to deal with straight-up duplication of stories. The first reference above is an Associated Press (AP) story which included the quote, saying it was from “sworn statements obtained by the Associated Press.” The subsequent MSNBC article is in fact just a reprint of the AP story. There were other reprints, each on a different outlet but credited to AP in the standard practice of newswire syndication. I can’t argue with the ethics or legality of the practice, but this type of mirroring does amplify a story’s apparent significance on the web.

The second level of indirection is the hyperlink. One of the references above is an ABC News Blog story which refers to the AP article, linking to it and one other related story. Although the text is new, this article is nothing more than a rehashing of facts presented elsewhere. For research or authentication purposes, it’s basically worthless.

Finally, there is the uncredited reblog, exemplified by a post on the blog Caffeinated Politics where the key phrase is repeated without links or attribution. Even the article on CounterPunch — headlined “Kill All Military Age Men” — does not provide any sources at all.

In my manual analysis, only the AP article and a piece in the New York Times were original research. Out of the dozens (or hundreds?) of articles, blog posts, and screaming headlines, only two people/organizations had actually bothered to obtain original information. This doesn’t mean that Michael D. Steele did not, in fact, order his troops to “kill all military-age males.” In fact, the NYT article names four soldiers under his command who testified, on August 2nd in a military Article 32 hearing, that he did. This is what makes the statement reliable, not the ten thousand reblogs.

I shouldn’t have to do this sort of analysis by hand.

We’re getting there. In 2007, Google News introduced a feature that eliminates duplicated stories from its default results display. This is simple elimination of textual duplicates, a reaction to newswire syndication. Slightly more advanced algorithms can be used to detect and cull near duplicates, such as the techniques Google has long used for web pages (near duplicates shouldn’t count as more than one item in the results list.)
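To make "near duplicate" concrete, here is a minimal sketch of one common approach: break each article into overlapping word windows ("shingles") and compare the resulting sets with Jaccard similarity. This is an illustration only, not a description of what Google actually runs; the function names, the three-word window, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets overlap at or above threshold.

    docs maps a (hypothetical) document id to its text.
    """
    ids = sorted(docs)
    sets = {d: shingles(docs[d], k) for d in ids}
    pairs = []
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if jaccard(sets[x], sets[y]) >= threshold:
                pairs.append((x, y))
    return pairs
```

A syndicated reprint with a word or two changed scores close to 1.0 against the wire story, while an unrelated article scores 0. Comparing every pair like this is quadratic, so a whole-web version would need something cleverer, but the idea is the same.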

I want more. For any particular paragraph, phrase or statement, I want to know exactly who said it first and where they got it from. I want automatic culling of cut-and-paste “reporting” and unattributed quotations (and plagiarism.) I want my computer to automatically track back through hyperlinks when they’re present, and do deep textual analysis to determine who references whom even when the content is unattributed. The software should also analyze publication dates, where available, to see who said what first.

What I want is the phylogenetic tree of any particular story or post, a graph which shows which articles “evolved” from which ancestors, and therefore which article or articles constitute the originals, the raw input of real-world information into the ‘net. In fact, phylogenetic trees have already been applied to documents. In an article published in Scientific American in 2003, the authors analyzed 33 different versions of a chain letter with algorithms originally designed to track evolutionary changes in genetic sequences, and were able to deduce which was the original version.

This type of analysis only works with identical snippets of text — copied articles that are modified, paragraphs cut and pasted, quotations. More sophisticated text analysis algorithms will be able to handle paraphrased reports, where an article is rewritten without adding substantial new information. General semantic analysis of news stories is coming, even for audio and video, at which point it will be possible to track a single statement through all its rewritings and rewordings as it passes from article to article to blog. Combined with information from hyperlinks and posting times, we will be able to construct a “source tree” like the one above for any given story.
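As a rough sketch of how hyperlinks, posting times, and copied text could be combined, the following assigns each article a single most likely parent: an explicit link to an earlier article wins, otherwise the earlier article sharing the most verbatim text. Articles left without a parent are the candidate originals. Everything here is an assumption made for illustration (the record layout, the four-word n-grams, the 0.3 threshold), and it handles only copied text, not paraphrase.

```python
from datetime import date

def word_ngrams(text, n=4):
    """Set of n-word sequences in text, used to spot verbatim copying."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b):
    """Fraction of the smaller n-gram set that also appears in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def source_tree(articles, threshold=0.3):
    """Map each article id to its most likely parent id, or None for an original.

    articles: {id: {"date": date, "text": str, "links": [ids]}} (hypothetical layout)
    """
    grams = {k: word_ngrams(v["text"]) for k, v in articles.items()}
    parents = {}
    for aid, art in articles.items():
        earlier = [o for o in articles
                   if o != aid and articles[o]["date"] < art["date"]]
        # An explicit hyperlink to an earlier article wins outright.
        linked = [o for o in earlier if o in art["links"]]
        if linked:
            parents[aid] = min(linked, key=lambda o: articles[o]["date"])
            continue
        # Otherwise fall back to shared text with earlier articles.
        scored = [(overlap(grams[aid], grams[o]), o) for o in earlier]
        scored = [s for s in scored if s[0] >= threshold]
        parents[aid] = max(scored)[1] if scored else None
    return parents
```

The primary reports are then simply the ids whose parent is `None`, which is the filtered view I would want in a news aggregator.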

We’ll finally be able to tell how much content we actually have, and where it came from. (We could even track the evolution of memes.)

It’s not that repeated coverage and discussion of the same story adds nothing. A major story should be covered by multiple outlets, and quotation, paraphrasing, and reblogging is how interesting or important stories spread; telling others about what we know is fundamentally how societal awareness comes to be. However, yelling something louder doesn’t make it more significant, or more true. In the balance between awareness and vacuous repetition, I refer to Ethan Zuckerman’s web 2.0 maxim: don’t speak, point.

But I don’t want to set guidelines for authors. I want software that is smart enough to parse the anarchy of the web and tell me what is a reflection and what is not, and I want everyone else to have this software too. I want to be able to see the source tree for every article or fact of interest to me, and I want filtered views on my news aggregators that show only the primary reports. It’s not important (or remotely realistic) that every reader scrutinize the sources for every article, but it is important that it is possible to do so easily. The interested amateur should be able to trace statements in a few clicks; this should be a deterrent to the spreading of un-sourced lies as truth, and a stumbling block for would-be propaganda campaigns. In traditional journalism, the tracking and validation of sources was the responsibility of the media monopolies. If we are witnessing the dawning of the era where we all get to have our say — if the infosphere is going to be radically democratized and expanded a million fold — then it is suddenly the responsibility of all of us in general to monitor the quality of our information. For this we need tools.

UPDATE (October 2010): Since I wrote this, the Memetracker project demonstrated a whole-web news tracking service that has much of the capability I wished for. It even works by building text mutation trees. More on Memetracker and what it means for news at the Nieman Journalism Lab. My original post also missed the significance of social networking tools for the spread of news. There is now a fascinating project that aims to detect and track the source of political smear campaigns on Twitter, the Truthy project. We’re getting there technologically speaking. Now we just need to get the technology into our everyday news reading apps.

How Many World Wide Webs Are There?

Jonathan Stray — Wed, 04 Feb 2009 23:53:51 +0000

How much overlap is there between the web in different languages, and what sites act as gateways for information between them? Many people have constructed partial maps of the web (such as the blogosphere map by Matthew Hurst, above) but as far as I know, the entire web has never been systematically mapped in terms of language.

Of course, what I actually want to know is, how connected are the different cultures of the world, really? We live in an age where the world seems small, and in a strictly technological sense it is. I have at my command this very instant not one but several enormous international communications networks; I could email, IM, text message, or call someone in any country in the world. And yet I very rarely do.

Similarly, it’s easy to feel like we’re surrounded by all the international information we could possibly want, including direct access to foreign news services, but I can only read articles and watch reports in English. As a result, information is firewalled between cultures; there are questions that could very easily be answered by any one of tens or hundreds of millions of native speakers, yet are very difficult for me to answer personally. For example, what is the journalistic slant of al-Jazeera, the original one in Arabic, not the English version which is produced by a completely different staff? Or, suppose I wanted to know what the average citizen of Indonesia thinks of the sweatshops there, or what is on the front page of the Shanghai Times today– and does such a newspaper even exist? What is written on the 70% of web pages that are not in English?

We all live on the same physical planet, but the information worlds we inhabit must be vastly different. There are many reasons for this other than language, but language alone is enough to isolate humanity from itself.

And so, my question: how many islands are there in our multi-cultural information space, and how are they connected? I am willing to bet that a full-scale web map would show several large networks in the main languages of the web — English, Chinese, Spanish, Japanese, German, etc. — but few connections between them, web sites frequented by bilingual or bi-cultural individuals, who after all are the true gateways between cultures. The structure of the interconnections might tell us something about the relationships between cultures, and the actual number of links might provide some measure of how close or how far apart we actually are. The individual URLs themselves would also be extremely valuable information, representing high-bandwidth links between cultures, the trans-oceanic fiber between continents in the infosphere.

There is a second geography to the world that we’ve never seen. I don’t even know what I’m missing.

Creating such a map would be a trick, but by no means out of the reach of an academic project or a small company. Google says there are currently over one trillion (10^12) unique web pages (for their particular definition of “unique”, which is more complex than it might seem.) Unlike a search engine, a language-based web map does not require the full contents of every page, merely the outgoing URLs and a discrete categorization of the language (which can be automatically determined even without any document meta-data.) Assuming that each URL is assigned a unique 32 bit ID, another 32 bits for language and other info, and then links to an average of 20 other pages (estimates vary), this is about 100 terabytes of data — or perhaps $15000 worth of storage at current prices. This index could be created from a fresh crawl, or by parsing an existing one, such as from the folks at the brand new and very awesome DotBot open index of the web.
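The back-of-the-envelope figure can be checked in a few lines. This sketch uses only the numbers stated above; the constant and function names are mine. One caveat worth stating: a 32 bit ID can only label about 4.3 billion distinct URLs, far fewer than a trillion, so a real index would need wider IDs, and the second call shows that 64 bit IDs roughly double the total.

```python
PAGES = 10**12            # Google's figure for unique pages, as quoted above
ID_BYTES = 4              # 32 bit ID per URL (the assumption in the text)
INFO_BYTES = 4            # 32 bits for language and other info
LINKS_PER_PAGE = 20       # average outgoing links (estimates vary)

def index_bytes(pages, id_bytes, info_bytes, links_per_page):
    """Bytes for a links-only index: one record per page plus its link targets."""
    per_page = id_bytes + info_bytes + links_per_page * id_bytes
    return pages * per_page

total = index_bytes(PAGES, ID_BYTES, INFO_BYTES, LINKS_PER_PAGE)
print(total / 10**12, "TB")   # 88.0 TB, i.e. "about 100 terabytes"

# Assumption, not in the original estimate: 64 bit IDs, enough for 10^12 URLs.
wide = index_bytes(PAGES, 8, INFO_BYTES, LINKS_PER_PAGE)
print(wide / 10**12, "TB")    # 172.0 TB
```

Either way the order of magnitude holds: this is a storage problem measured in hundreds of terabytes, not petabytes.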

The next step would be to generate the visualization of such a massive data set. The complete graph could be laid out in two or three dimensions using existing clustering methods. The resulting map could be traversed using GPU-accelerated rendering techniques for very large data sets, probably after some sort of hierarchical pre-processing that produces proxies for zoomed-out views of the network. A usable UI would be crucial; the entire map needs to be navigable at multiple scales and composed of live, hyperlinked objects. The right visualization also depends on what you are trying to discover; ultimately, there can be no single map because the choice of visualization is dependent upon usability and aesthetics, as the huge variety of beautiful maps at Visual Complexity demonstrates.

The analysis could go much deeper with more computing power. Machine translation is currently poor, but it is probably good enough to detect whether one document is a translation of another. With this capability, we would actually be able to quantify the percentage of (public) textual information that makes it from one language into another and identify the key organizations that act as conduits. Further study might reveal fascinating things, such as selection biases in the types of news or information that get translated. The implications for differences in belief between cultures are obvious.

Yet even a “links only” data set could still answer some highly revealing questions, such as “what percentage of web sites are visited by people from multiple cultures?” or even “what is the best gateway between Polish and English film reviews?” This could be done without visualization, but it would be a mistake not to draw the actual maps. Not only do pictures engage our spatial reasoning in a way that raw bits never can, but such a map would re-make an obvious point that is too often lost: in terms of communication between cultures, the world is not nearly as small or interconnected as we’d like to think it is.
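Here is a small sketch of the kind of question a "links only" data set can answer, assuming the index has been reduced to two things: a language label per page and a list of (source, destination) links. The function names and the toy data layout are hypothetical; a real index at this scale would not fit in Python dictionaries.

```python
from collections import Counter

def cross_language_links(lang_of, links):
    """Count links between each ordered pair of distinct languages."""
    counts = Counter()
    for src, dst in links:
        a, b = lang_of.get(src), lang_of.get(dst)
        if a and b and a != b:
            counts[(a, b)] += 1
    return counts

def gateways(lang_of, links, from_lang, to_lang):
    """Rank pages in from_lang by how many pages in to_lang they link to."""
    score = Counter()
    for src, dst in links:
        if lang_of.get(src) == from_lang and lang_of.get(dst) == to_lang:
            score[src] += 1
    return score.most_common()
```

The first function gives the raw number of bridges between two language islands; the second lists the individual pages that act as those bridges, most connected first.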

Social Network of US Counterinsurgency Policy Authors

Jonathan Stray — Tue, 27 Jan 2009 04:22:08 +0000

Who is writing the major policies of the wars in Iraq and Afghanistan, and what is the Obama administration likely to do? There have been many analyses and news reports of individual policies and events, but it’s hard to wade into this flood of information, and besides, how would I know who to listen to? In an effort to get some perspective on at least one major aspect of American military strategy, I decided to plot out all the authors of (public) counterinsurgency policy over the last decade, and the relationships between them, as evidenced by co-authorship of articles and papers.

The resulting network shows that the Obama administration is relying heavily on the talents of a group called the Center for A New American Security (CNAS), which has close ties to the authors of the most recent US Army counterinsurgency manual. This means that Obama is unlikely to break with the current military strategies in Iraq and Afghanistan — but even if he wanted to, could he? Counterinsurgency is difficult, and many, many people die when you do it wrong; you can’t simply make this stuff up, so the choices are necessarily among existing clusters of people and policy.

The graph also suggests that the only quasi-independent body of COIN policy is centered around the RAND Corporation, who may not hold a terribly different opinion. If this analysis is correct, then Obama cannot rapidly change the military’s course in fighting these wars, because there simply do not exist credible alternative policies at this time. His only options for change in America’s handling of Iraq and Afghanistan lie outside of the scope of military strategy — perhaps through high level political or economic interventions.

Counterinsurgency Policy

American troops are shooting at someone in Iraq and Afghanistan, but the designated enemy is not another army. After the Taliban was decimated in Afghanistan and Saddam’s main forces were defeated in Iraq, dozens of armed groups stepped up to fill the power vacuum in both countries, ranging from militias to (of course) terrorists. Beset on all sides, the US military lashed out, conducting increasingly intrusive operations in the civilian population, such as house-by-house searches. The bad guys were no longer wearing uniforms, and worse, there was often popular sympathy for them. The US military ended up shooting at the people it had claimed to be liberating.

At the start of these wars, the US military was poorly prepared in counterinsurgency (COIN) tactics, a product of the Cold War strategies and the painful memory of Vietnam, which was also a counterinsurgency war. In fact, the standard COIN manuals of that time (which you can now read courtesy of Wikileaks) were stagnant for 25 years until 2006, when a major review was undertaken by Lieutenant General David Petraeus and others. The resulting revision of the FM 3-24 Counterinsurgency Manual was widely publicized, in contrast to previous secret revisions, with co-author Lieutenant Colonel John Nagl even appearing on The Daily Show to discuss it. Clearly, this revision was as much about building public support and confidence at home as it was about an actual change in strategy. The whole manual has been extensively discussed elsewhere, but the core of the new doctrine is the notion that an insurgency is as much a political as it is a military problem:

The integration of civilian and military efforts is crucial to successful COIN operations. All efforts focus on supporting the local populace and HN [host nation] government. Political, social, and economic programs are usually more valuable than conventional military operations in addressing the root causes of conflict and undermining an insurgency.

This is hardly a new idea, as FM 3-24 freely admits, but — as one interpretation of the new manual and the publicity surrounding it goes — it represents a fundamentally new role for the military, who are now faced with a problem that cannot be solved by force. Depending on who you believe, the decrease in violence in Iraq over the last two years may be due to these radically new policies, the surge, or other factors entirely.

Three years later the US is still in Iraq, and Afghanistan — if anything, an even messier place — is finally starting to return to public consciousness. Obama has to make some decisions about the military strategy in these wars. What will his answers be?

Social network analysis can help us answer this question because ideas always live among a community of minds; those who develop ideas together tend to share them. The clusters in a social network are therefore proxies for distinct worldviews, or possible answers to a question.

COIN Policy Social Network

Without further ado, here is the social graph of those writing public counterinsurgency policy over the last decade (click for a larger image, or pdf).

Each node is a person, and an edge is drawn between any two individuals who collaborated on a document, or worked in the same group — hopefully a reasonable proxy for policy similarity. (Although this is a “social” graph, merely having been in the same place at the same time does not count as an edge.) I have assigned colors to larger clusters: the Center for a New American Security (CNAS) think-tank is red, the authors of the revised FM 3-24 are green, and those from the RAND Corporation are cyan. CNAS and the FM 3-24 authors overlap in the person of John Nagl, in yellow.

Crucially, Obama (in blue) has selected Michèle Flournoy as his top Pentagon appointment, and Flournoy founded CNAS. There are other CNAS links: Colin Kahl was a military advisor to Obama during his campaign, and Lt. Nathaniel Fick spoke at the DNC in August in support of Obama. Fick and Nagl also recently co-authored a major policy paper on counterinsurgency in Afghanistan, which includes an interview with Petraeus, another FM 3-24 author.

In other words: CNAS has adopted the military’s FM 3-24 strategy, and Obama has adopted CNAS. Therefore we should expect little change in the way that the military component of the American wars is currently prosecuted.

One of the advantages of CNAS is that it represents a unified, prolific, and highly visible body of strategic and policy thought. For Obama to choose another course, there has to be another course to choose. There are individual critiques of the Nagl/Fick position such as this by Afghanistan social scientist Christian Bleuer, individual detractors such as the experienced but often wonky Edward Luttwak, and even the very sharp and compassionate Samantha Power, who was part of the Obama campaign until (apparently) she referred to Hillary as “a monster.” But what Obama needs is a credible body of alternative policy.

Counterinsurgency as “Good Governance”

There is only one other major cluster on the graph. The RAND Corporation (in cyan) is of course old-school defence establishment, but, unlike the FM 3-24 authors, they are not actually military. They have written a number of counterinsurgency documents since the start of the Iraq war, such as a long report by O’Connell and Pirnie which is summarized here. The RAND reports do not differ all that sharply from the Petraeus/Nagl policies, except that they see counterinsurgency as fundamentally more than a military operation:

Strategy should be developed at the highest level of government, by the President, his closest advisors, and his Cabinet officials, with advice from the Director of National Intelligence and regional experts, the Chairman of the Joint Chiefs of Staff, and unified commanders. … Counterinsurgency is a political-military effort that requires both good governance and military action. It follows that the entire U.S. government should conduct that effort.

This is exactly the sort of “nation building” that Bush had hoped to dispense with. If implemented thoroughly, it might also amount to little more than a classic colonial government. To this I can only say: what did you expect when invading a foreign country?

About Building the Graph

This graph represents four days of very manual web-surfing, and it is very much a work in progress. Starting with Fick and Nagl, each name was googled on the web, in the news, and in scholarly publications, and the top 20 or so results in each category were read to determine co-authorship of policy papers and organizational affiliations. Two people were connected if they had ever co-authored an article together, or worked in the same group at the same time. Doubtless, there are connections that I have missed, such as some of the other authors on today’s New York Times blogs piece which I will have to add. Also, because this process was so time-consuming, I had to make many choices not to include individuals or follow links. It is therefore entirely possible that the graph I have drawn is actually embedded into a larger network in such a way that it invalidates my conclusions; or that there is a credible cluster of people working on alternative policy that I simply never found (but then again, Obama hasn’t found them either.)

The graph proper was built by collecting a text file of web references, and manually entering people and connections in a .dot file for use with Graphviz. Again, this took days — I cannot stress how manual the process was. These difficulties highlight the dire necessity of better information visualization tools for journalism.
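Even a few lines of script would have spared some of the hand entry. The sketch below is not the process I used; it assumes the research notes have already been reduced to a mapping from each document (or group) to the people attached to it, and it emits plain Graphviz DOT text. The function names, the graph name `coin`, and the optional color table are all made up for the example.

```python
from itertools import combinations

def coauthor_edges(documents):
    """Return the set of undirected edges implied by shared authorship.

    documents maps a document or group name to a list of people.
    """
    edges = set()
    for authors in documents.values():
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

def to_dot(edges, colors=None):
    """Render edges as an undirected Graphviz graph in DOT syntax."""
    colors = colors or {}
    lines = ["graph coin {"]
    for name, color in sorted(colors.items()):
        lines.append(f'  "{name}" [style=filled, fillcolor={color}];')
    for a, b in sorted(edges):
        lines.append(f'  "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned string to a `.dot` file and running one of the Graphviz layout programs on it (for example `neato -Tpdf`) produces the drawing. The slow part, reading the sources and deciding who belongs on which document, remains entirely manual.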

[Update: this work has come to the attention of the COIN community, in particular the folks at Abu Muqawama who were kind enough to discuss what it might mean. Aside from the fact that the original version contained not one but three misspelled names (calling Mr. Flick!) they have pointed out that many important people and links are missing. No doubt — as I discuss above and in my response on AbuM. I’d like to stress that this is a work in progress, but I would also like to ask those in the know if they feel my conclusions on this restricted graph are substantially correct even so. — js]