Jonathan Stray’s blog

Who Wants to Hack Twitter With Me?

I want to modify the open-source, multiplatform, iPhone-capable Spaz client so that it has a mode to automatically translate all tweets into the user’s chosen language.


I had intended to do this myself. But I’ve discovered that I’m back in graduate school full time, so I’m looking for a collaborating programmer who wants to do the majority of the coding. If you have some programming skill and you want to get into web apps, drop me a line!

But mostly, you’ll do this because you think that the world needs better multi-lingual communication. In particular, you want people to be able to keep track of news from places with oppressive internet censorship regimes (Iran, China, parts of the Middle East), and you want the people who live there to be able to have public, real-time conversations with the rest of the world.

(Getting an uncensored internet connection in these places, one that can actually reach Twitter, is a different problem. But believe me, that problem has an active community around it.)

Spaz is written in Adobe AIR and will need to call the Google Translate APIs.
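If it helps to see the shape of it, here is a rough sketch of the core translation call. Spaz itself is HTML/JavaScript running on Adobe AIR, so this Python version only illustrates the HTTP request; the endpoint and parameter names are my recollection of the Google AJAX Language API of the time, so treat them as assumptions and check the current documentation.

```python
# Rough sketch only: the real client code would be JavaScript inside Spaz/AIR.
# The endpoint and parameters below are my recollection of Google's AJAX
# Language API and may have changed; verify against the official docs.
import json
import urllib.parse
import urllib.request

def translate(text, target_lang, source_lang=""):
    # An empty source language asks the service to auto-detect it.
    params = urllib.parse.urlencode({
        "v": "1.0",
        "q": text,
        "langpair": "%s|%s" % (source_lang, target_lang),
    })
    url = "https://ajax.googleapis.com/ajax/services/language/translate?" + params
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))
    return data["responseData"]["translatedText"]

# e.g. translate("Bonjour tout le monde", "en")
```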

“Low-Profit” Corporations Enable Social Venture Capital

A “low-profit limited liability company” is allowed to make money, but can also accept tax-deductible loans. Michigan, Vermont, Wyoming, Utah, and Illinois passed laws this year defining this new type of company, called an L3C, creating interesting new investment incentives and legal protections for socially-conscious market entities.

Chicago tax attorney Mark Lane, who helped get low-profit legislation passed in Illinois, suggests that we think of the L3C as a structure for socially-conscious venture capital.

In this video, Lane painstakingly presents the details of this new type of company. At first glance, it does seem like a plan that only a tax attorney could love. The L3C is, in some sense, little more than a carefully designed IRS category — but it could channel a lot of new money to social entrepreneurs. The trick is that private foundations would be allowed to deduct certain kinds of loans to L3Cs just like donations to non-profits, but the L3C would hope to make a profit and eventually return the money with interest. Given that foundations have to give 5% of their net worth to charity every year anyway, it’s a huge win for them to give it as a speculative loan rather than a grant.

The “low-profit” structure could also ease a classic dilemma between public and private operations. Public entities and non-profit corporations are legally obligated to operate in the public interest, but can be horribly inefficient due to the lack of competitive pressures and, often, shielding from financial accountability. Meanwhile, private corporations live or die on their efficiency but can be sued by investors if they fail to maximize raw profit. The L3C is something in between: a business that is required to operate in the public interest and can accept tax-deductible investments, but can also pay a profit to partners and investors.

This hybrid funding model may prove especially useful in the production of public goods, the things that benefit everyone but are difficult to get anyone to pay for. The L3C has been proposed for groups working in education, small industry, biotech, arts, and journalism.

For more, see overviews from Social Earth and the Non Profit Law Blog.

“Risky” Interactive Art Returns to Tate Modern After 38 Years

“Bodyspacemotionthings” is a playground-as-art, and it got completely trashed in 1971 when it premiered at the Tate Gallery in London. Now it’s back at Tate Modern, rebuilt slightly stronger and safer. And I think it’s awesome, and I want to swing on the rope and push that huge ball around.

Art you can fall off of will be familiar to anyone in the San Francisco independent arts scene (yes, I’m trying not to say “Burning Man” here), but it fascinates me to see how a very public institution in a notoriously uptight country handles safety for an installation in a gallery which draws 100,000 people in a weekend.

The BBC report above focuses on splinters. Have we really become that lame?

Then again, I wonder if this piece could be shown at all in the US, a country with strong tort law and poor health insurance.

The Search Problem vs. The News Problem

I think I’ve found a useful distinction between the “search” and “news” problems. News organizations like to complain that search engines are taking their business, but that’s only because no one has yet built a passable news engine.

Search is when the user asks the computer for a particular type of information, and the computer finds it.

News is when the computer has to figure out, by itself, what information a user wants in each moment.

This definition has useful consequences. For example, it says that accurately modeling the user and their needs is going to be absolutely essential for news, because the news problem doesn’t have a query to go on. All a news selection algorithm can know is what the user has done in the past. For this reason, I don’t believe that online news systems can truly be useful until they take into account everything of ourselves that we’ve put online, including Facebook profiles, emails, and viewing histories.
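To make that concrete, the simplest version of “rank by history instead of by query” might look something like the sketch below. The TF-IDF profile and cosine similarity are my own illustrative choices, not a description of any existing news engine.

```python
# Illustrative sketch: build a "user model" from previously read articles and
# score new candidates against it. No query is involved anywhere.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_candidates(read_articles, candidate_articles):
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(read_articles + candidate_articles)
    history = matrix[: len(read_articles)]
    candidates = matrix[len(read_articles):]
    # The profile is just the centroid of everything the user has read so far.
    profile = np.asarray(history.mean(axis=0))
    scores = cosine_similarity(candidates, profile).ravel()
    return sorted(zip(candidate_articles, scores), key=lambda pair: -pair[1])
```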

And yes, I do want my news engine to keep track of cool YouTube uploads and recommend videos to me. That’s in addition to telling me that Iran has a secret uranium enrichment facility. In the online era, “news” probably just means recently published useful information, of which journalistic reporting is clearly a very small segment.

It’s worth remembering that keyword web search wasn’t all that useful until Google debuted in 1998 with an early version of the now-classic PageRank algorithm.  I suspect that we have not yet seen the equivalent for news. In other words, the first killer news app has yet to be deployed. Because such an app will need to know a great deal about you, it will probably pull in data from Facebook and Gmail, at a minimum. But no one really knows yet how to turn a pile of emails into a filter that selects from the best of the web, blogosphere, Twitter, and mainstream media.

Classic journalism organizations are at a disadvantage in designing modern news apps, because broadcast media taught them bad habits. News organizations still think in terms of editors who select content for the audience. This one-size-fits-all attitude seems ridiculous in the internet era, a relic of the age when it would have been inconceivably expensive to print a different paper for each customer.

Of course, there are some serious potential problems with the logical end-goal of total customization. The loss of a socially shared narrative is one; the Daily Me effect where an individual is never challenged by anything outside of what they already believe is another. But shared narratives seem to emerge in social networks regardless of how we organize them — this is the core meaning of something “going viral.” And I believe the narcissism problem can be addressed through information maps. In fact, maps are so important that we should add another required feature to our hypothetical killer news app: it must in some way present a useful menu of the vast scope of available information. This is another function that existing search products have hardly begun to address.

Not that we have algorithms today that are as good as human editors at putting together a front page. But we will. Netflix’s recent million-dollar award for a 10% improvement in their film recommendation system is a useful reminder of how seriously certain companies are taking the problem of predicting user preferences.

The explosion of blog, Twitter, and Wikipedia consumption demonstrates that classic news editors may not have been so good at giving us what we want, anyway.

iPhone Augmented Reality Arrives — But When Will We Make Art With It?

Last year I imagined an iPhone app that superimposed virtual objects over video from the phone’s camera. With the advent of the iPhone 3GS and its built-in compass, it’s now happening.

This video shows NearestWiki, which tags nearby landmarks/objects and guides you to them. I am aware of a few other AR apps, as this post on Mashable and this AP story discuss. Many of these apps do building/object recognition, and one even recognizes faces and displays a sort of business card. We’re already seeing annotation with data from Wikipedia, Twitter and Yelp, and I suspect that we’re going to see these tools get very deep in the very near future, with Wikipedia-style tagging of  the entire history and context of any object.

Just a moment while I get over the fact that the future is already here.

Ok, I’m properly jaded again. Yeah, it’s an app platform, and that’s cool — but imagine the possibilities for art. Bets on who’s going to make the first “alternate reality spyglass” piece? Bets on how much Matthew Barney will sell it for in the app store?

Advertisers Smoking Crack, and the Future of Journalism According to Leo Laporte

Leo Laporte of This Week in Tech gave a truly marvelous talk on Friday about how his online journalism model works. The first half of the talk is all about how TWIT moved from TV to podcasting and became profitable, and includes such gems as

Advertisers have been smoking the Google and Facebook crack. And they no longer want that shakeweed that the [TV] networks are offering.

The second half is in many ways even better, when Leo takes questions from the audience and discusses topics such as the future of printing news on dead trees

Maybe there will always be [paper] news, but it will be brought to you by your butler who has ironed it out carefully for you. It will be the realm of the rich person.

and the “holy calling” of being a journalist:

You reporters are really the monks of the information world. You labour in obscurity. You have to be driven by passion because  you’re paid nothing. And you sleep on rocks.

He goes on to discuss the necessity of bidirectional communication, Twitter as the “emerging nervous system” of the net, etc. — all the standard new media stuff, but put very succinctly by someone who has deep experience in both old and new media. Very information-dense and enlightening!

Know Your Enemy

In America, the enemy is Terrorism. It used to be the Russians, or more generically Communists. We discussed the history of this concept in class today. And then I asked: In the state-controlled Chinese media, who is the enemy today?

I got three immediate answers:

“The West.”

“Japan.”

“Separatists.” (E.g. Tibetans, Uighurs.)

There was instant consensus on this list, among the PRC students. Good to know.

Why We Need Open Search, and How to Make Money Doing It

Anything that’s hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We’ve all lived with Google for so long that most of us can’t even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the be-all and end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results that are “conceptually similar” to the user’s query even if the relevant documents don’t contain any of the search terms. I also believe that full-scale maps of the online world are important: I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.
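As a toy illustration of the latent semantic analysis idea: project the documents and the query into a low-dimensional “concept” space and match there, rather than on raw keywords. The corpus and parameters below are invented for the example, and scikit-learn is just a convenient modern stand-in.

```python
# Toy LSA demo: TF-IDF followed by a truncated SVD gives each document a
# position in "concept" space; queries are matched there by cosine similarity.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

docs = [
    "The aristocracy of Georgian Britain had its own colourful slang.",
    "Eighteenth-century London gentry coined elaborate cant and insults.",
    "Modern web search relies on keyword matching and link analysis.",
]
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
doc_vectors = lsa.fit_transform(docs)
query_vector = lsa.transform(["18th century British aristocratic slang"])
print(cosine_similarity(query_vector, doc_vectors))  # related documents score highest
```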


The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it’s just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because

We believe the internet should be open to everyone. Currently, only a select few corporations have access to an index of the world wide web. Our intention is to change that.

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages so far crawled, I estimate that they currently have 234 terabytes of data. At today’s storage technology prices of about $100 per terabyte, it would cost $24,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network — it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
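For anyone who wants to check my arithmetic, here is the back-of-the-envelope version; the only inputs are the numbers quoted above.

```python
# Back-of-the-envelope storage estimate from the figures quoted in this post.
sample_pages = 600_000
sample_bytes = 3.2e9           # 3.2 GB for the first published chunk of the index
crawled_pages = 44e9           # pages DotBot reports having crawled so far

bytes_per_page = sample_bytes / sample_pages             # about 5.3 KB per page
total_terabytes = bytes_per_page * crawled_pages / 1e12  # about 235 TB

print(round(total_terabytes))          # ~235 TB, the ~234 TB figure above
print(round(total_terabytes * 100))    # ~$23,500 at $100 per TB, storage only
print(round(total_terabytes * 1000))   # ~$235,000 at $1000 per TB, data-center style
```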

Full web indices are just too big to play with casually; there will always be a very small number of them.

I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, other companies buy fine-grained access at market rates. We’ve had this model for years in the telecommunications industry, where big companies own the lines and lease access to anyone who is willing to pay.

The key to the whole proposition is a precise definition of access. Google’s keyword “access” is very narrow. Something like SQL queries would expand the space of expressible questions, but you still couldn’t run image comparison algorithms or do the computational linguistics processing necessary for true semantic search. The right way to extract the full potential of a database is to run arbitrary programs on it, and that means the data has to be local.

The only model for open search that works both technologically and financially is to store the web index on a cloud, let your users run their own software against it, and sell the compute cycles.

It is my hope that this is what DotBot is up to. The pieces are all in place already: Amazon and others sell cheap cloud-computing services, and the basic computer science of large-scale parallel data processing is now well understood. To be precise, I want an open search company that sells map-reduce access to their index. Map-reduce is a standard framework for breaking down large computational tasks into small pieces that can be distributed across hundreds or thousands of processors, and Google already uses it internally for all their own applications — but they don’t currently let anyone else run it on their data.
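Here is a sketch of what “map-reduce access” might look like from the customer’s side: you write a mapper and a reducer, and the provider runs them over its copy of the crawl. The hypothetical job below counts Chinese-language pages, answering one of the questions from the top of this post. The crude language check and the single-machine runner are stand-ins to keep the example self-contained; a real provider would run the same two functions across thousands of machines.

```python
# Hypothetical map-reduce job against a web index (illustrative sketch only).
from collections import defaultdict

def looks_chinese(text):
    # Stand-in for a real language detector: fraction of CJK characters.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return cjk > 0.3 * max(len(text), 1)

def map_page(url, html):
    # Map step: emit one (key, value) pair per crawled page.
    yield ("zh" if looks_chinese(html) else "other"), 1

def reduce_counts(key, values):
    # Reduce step: combine every value emitted under a single key.
    yield key, sum(values)

def run_job(pages, mapper, reducer):
    # Toy single-machine runner. The point of map-reduce is that this loop can
    # be split across as many machines as the customer is willing to pay for.
    grouped = defaultdict(list)
    for url, html in pages:
        for key, value in mapper(url, html):
            grouped[key].append(value)
    results = {}
    for key, values in grouped.items():
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

# e.g. run_job(pages_from_index, map_page, reduce_counts) -> {"zh": 1200, "other": 8800}
```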

I really think there’s money to be made in providing open search infrastructure, because I really think there’s money to be made in better search. In fact I see an entire category of applications that hasn’t yet been explored outside of a few very well-funded labs (Google, Bellcore, the NSA): “information engineering,” the question of what you can do with all of the world’s data available for processing at high speed. Got an idea for better search? Want to ask new questions of the entire internet? Working on an investigative journalism story that requires specialized data-mining? Code the algorithm in map-reduce, and buy the compute time in tenth-of-a-second chunks on the web index cloud. Suddenly, experimentation is cheap — and anyone who can figure out something valuable to do with a web index can build a business out of it without massive prior investment.

The business landscape will change if web indices do become infrastructure. Most significantly, Google will lose its search monopoly. Competition will probably force them to open up access to their web indices, and this is good. As Google knows, the world’s data is exceedingly valuable — too valuable to leave in the hands of a few large companies. There is an issue of public interest here. Fortunately, there is money to be made in selling open access. Just as energy drives change in physical systems, money drives change in economic systems. I don’t know who is going to do it or when, but open search infrastructure is probably inevitable. If Google has any sense, they’ll enter the search infrastructure market long before they’re forced to (say, before Yahoo or Bing does it first).

Let me know when it happens. There are some things I want to do with the internet.

Rating Items by Number of Votes: Ur Doin It Rong

Digg, YouTube, Slashdot, and many other sites employ user voting to generate collaborative rankings for their content. This is a great idea, but simply counting votes is a horrible way to do it. Fortunately, the fix is simple.

A basic ranking system allows each user to add a vote to the items they like, then builds a “top rated” list by counting votes. The problem with this scheme is that users can only vote on items they’ve seen, and they are far more likely to see items near the top of the list. In fact, anything off the front page may get essentially no views at all — and therefore has virtually no chance of rising to the top.


This is rather serious if the content being rated is serious. It’s fine for Digg to have weird positive-feedback popularity effects, but it’s not fine if we are trying to decide what goes on the front page of a news site. Potentially important stories might never make it to the top simply because they started a little lower in the rankings for whatever reason.

Slightly more sophisticated systems allow users to rate items on a scale, typically 1-5 stars. This seems better, but still introduces weird biases. Adding up the stars assigned by all users to a single item doesn’t work, because users still have to see an item to vote on it. Averaging all the ratings assigned to a single item doesn’t work either, because a single early one-star rating can push an item so far down the list that it never gets the additional views it would need for its average to recover.

There are lots of subtle hacks that one can make to try to fix the system, but it turns out there might actually be a right way to do things.

If every item was rated by every user, there would be no problem with popularity feedback effects.

That’s completely impractical with thousands or even millions of items. But we can actually get close to the same result with much less work, if we take random samples. Like a telephone poll, the opinion of a small group of randomly selected people will be an accurate indicator, to within a few percent, of the result that we would get if we asked everyone.

In practice, this would mean adding a few select “sampling” stories to each front page served, different every time. Items can then be ranked simply by their average rating, with no skewing due to who got to the front page first. (In fact, basic sampling math will tell us which items have the most uncertain ratings and need to be seen with the highest priority.) In effect, we are distributing the work of rating a huge body of items across a huge body of users — true collaborative filtering, using sampling methods to remove the “can’t see it, can’t vote on it” bias.
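Here is a minimal sketch of what that could look like in code. It is my own illustration, not anyone’s production system: a few randomly chosen “sampling” slots go on every front page, the least-rated items are sampled first, and the ranking itself is a plain average of the stars each item has received.

```python
# Illustrative sampling-based ranker.
import random
from collections import defaultdict

class SampledRanker:
    def __init__(self, item_ids):
        self.items = list(item_ids)
        self.ratings = defaultdict(list)   # item id -> list of 1-5 star ratings

    def sampling_slots(self, n=3):
        # Pick n items for the sampling slots on the next front page served,
        # favouring items whose ratings are still the most uncertain.
        least_known = sorted(self.items, key=lambda i: len(self.ratings[i]))
        pool = least_known[: max(10 * n, n)]
        return random.sample(pool, min(n, len(pool)))

    def record_rating(self, item_id, stars):
        self.ratings[item_id].append(stars)

    def ranked(self):
        # Rank purely by mean rating; visibility no longer skews the result.
        def mean(values):
            return sum(values) / len(values) if values else 0.0
        return sorted(self.items, key=lambda i: mean(self.ratings[i]), reverse=True)
```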

This is not an end-all solution to the problem of distributed agenda-setting. User ratings are not necessarily the ideal criterion for measuring “relevance.” One problem is that not every user is going to take the trouble to assign a rating, so you will only be sampling from particularly motivated individuals. Other metrics such as length of time on page might be better — did this person read the whole thing?

Even more fundamentally, it’s not clear that popularity, however defined, is really the right way to set a news agenda in the public interest.

However, any attempt to use user polling for collaborative agenda setting needs to be aware of basic statistical bias issues. Sampling is a simple and very well-developed way to think about such problems.

Mapping the Daily Me

If we deliver to each person only what they say they want to hear, maybe we end up with a society of narrow-minded individualists. It’s exciting to contemplate news sources that (successfully) predict the sorts of headlines that each user will want to read, but in the extreme case we are reduced to a journalism of the Daily Me: each person isolated inside their own little reflective bubble.

The good news is, specialized maps can show us what we are missing. That’s why I think they need to be standard on all information delivery systems.

For the first time in history, it is possible to map with some accuracy the information that free-range consumers choose for themselves. A famous example is the graph of political book sales produced by orgnet.com:

[Image: social network graph of Amazon sales of political books, 2008]

Here, two books are connected by a line if consumers tended to buy both. What we see is what we always suspected: a stark polarization. For the most part, each person reads either liberal or conservative books. Each of us lives in one information world but not the other. Despite the Enlightenment ideal of free debate, real-world data shows that we do not seek out contradictory viewpoints.

Which was fine, maybe, when the front page brought them to us. When information distribution was monopolized by a small number of newspapers and broadcasters, we had no choice but to be exposed to stories that we might not have picked for ourselves. Whatever charges one can press against biased editors of the past, most of them felt that they had a duty to diversity.

In the age of disaggregation, maybe the money is in giving people what they want. Unfortunately, there is a real possibility that what we want is to have our existing opinions confirmed. You and I and everyone else are going to be far more likely to click through from a headline that confirms what we already believe than from one that challenges us. “I don’t need to read that,” we’ll say, “it’s clearly just biased crap.” The computers will see this, and any sort of recommendation algorithm will quickly end up as a mirror to our preconceptions.

It’s a positive feedback loop that will first split us along existing ideological cleavages, then finer and finer. In the extreme, each of us will be alone in a world that never presents information to the contrary.

We could try to design our systems to recommend a more diverse range of articles (an idea I explored previously), but the problem is, how? Any sort of agenda-setting system that relies on what our friends like will only amplify polarities, while anything based on global criteria is necessarily normative — it makes judgements about what everyone should be seeing. This gets us right back into all the classic problems of ideology and bias — how do we measure diversity of viewpoint? And even if we could agree on a definition of what a “healthy” range of sources is, no one likes to be told what to read.

I think that maps are the way out. Instead of trying to decide what someone “should” see, just make clear to them what they could see.

An information consumption system — an RSS reader, an online newspaper, Facebook — could include a map of the infosphere as a standard feature. There are many ways to draw such a map, but the visual metaphor is well-established: each node is an information item (an article, video, etc.) while the links between items indicate their “similarity” in terms of worldview.

[Image: map of the Iranian blogosphere]

This is less abstract than it seems, and with good visual design these sorts of pictures can be immediately obvious. Popular nodes could be drawn larger; closely related nodes could be clustered. The links themselves could be generated from co-consumption data: when one user views two different items, the link between those items gets slightly stronger. There are other ways of classifying items as related — as belonging to similar worldviews — but co-consumption is probably as good a metric as any, and in fact co-purchasing data is at the core of Amazon’s successful recommendation system.
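For concreteness, building those links could be as simple as the sketch below (an illustration only, not a description of any existing system).

```python
# Build a co-consumption graph: every time one user has viewed two items,
# the edge between those items gets a little heavier.
from collections import Counter, defaultdict
from itertools import combinations

def build_coconsumption_graph(view_histories):
    """view_histories: an iterable of per-user lists of item ids."""
    edge_weights = defaultdict(float)
    view_counts = Counter()
    for items_viewed in view_histories:
        unique = sorted(set(items_viewed))
        view_counts.update(unique)          # node size: how many users saw each item
        for a, b in combinations(unique, 2):
            edge_weights[(a, b)] += 1.0     # strengthen the link between a and b
    return edge_weights, view_counts

# The "you are here" marker is then just the set of nodes (and the edges touching
# them) that appear in one particular user's own viewing history.
```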

The concepts involved are hardly new, and many maps have been made at the site level where each node is an entire blog, such as the map of the Iranian blogosphere above. However, we have never had a map of individual news items, and never in real-time for everyone to see.

Each map also needs a “you are here” indicator.

This would be nothing more than some way of marking items that the user has personally viewed. Highlight them, center them on the map, and zoom in. But don’t zoom in too much. The whole purpose of the map is to show each of us how small, how narrow and unchallenging our information consumption patterns actually are. We will each discover that we live in a particular city-cluster of information sources, on a particular continent of language, ideology, or culture. A map literally lets you see this at a glance — and you can click on far-away nodes for instant travel to distant worldviews.

Giving people only what they like risks turning journalism into entertainment or narcissism. Forcing people to see things that they are not interested in is a losing strategy, and there isn’t any obvious way to decide what we should see anyway. Showing people a map of the broader world they live in is universally acceptable, and can only encourage curiosity.