Why We Need Open Search, and How to Make Money Doing It

Anything that’s hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th-century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We’ve all lived with Google for so long that most of us can’t even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the be-all and end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results that are “conceptually similar” to the user’s query even if the relevant documents don’t contain any of the search terms. I also believe that full-scale maps of the online world are important: I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.


The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it’s just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because

We believe the internet should be open to everyone. Currently, only a select few corporations have access to an index of the world wide web. Our intention is to change that.

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages so far crawled, I estimate that they currently have 234 terabytes of data. At today’s storage technology prices of about $100 per terabyte, it would cost $24,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network — it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
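For the curious, here is the back-of-envelope arithmetic behind those figures, written out as a tiny script. All of the inputs are the post’s own rough estimates (sample size, page count, per-terabyte prices), so the outputs are nothing more than a sanity check of the extrapolation:

    # Back-of-envelope check of the storage figures above.
    sample_pages = 600_000        # pages in the first DotBot chunk
    sample_size_gb = 3.2          # size of that chunk, in gigabytes
    total_pages = 44e9            # pages DotBot reports having crawled

    total_tb = (sample_size_gb / sample_pages) * total_pages / 1000
    raw_disk_cost = total_tb * 100        # ~$100 per terabyte of bare disk
    datacenter_cost = total_tb * 1000     # ~$1000/TB with backups and maintenance

    print(f"{total_tb:.0f} TB, ${raw_disk_cost:,.0f} raw disk, "
          f"${datacenter_cost:,.0f} in a data center")
    # -> roughly 235 TB, about $23,000 in bare disk, about $235,000 hosted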

Full web indices are just too big to play with casually; there will always be a very small number of them.

I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, other companies buy fine-grained access at market rates. We’ve had this model for years in the telecommunications industry, where big companies own the lines and lease access to anyone who is willing to pay.

The key to the whole proposition is a precise definition of access. Google’s keyword “access” is very narrow. Something like SQL queries would expand the space of expressible questions, but you still couldn’t run image comparison algorithms or do the computational linguistics processing necessary for true semantic search. The right way to extract the full potential of a database is to run arbitrary programs on it, and that means the data has to be local.

The only model for open search that works both technologically and financially is to store the web index on a cloud, let your users run their own software against it, and sell the compute cycles.

It is my hope that this is what DotBot is up to. The pieces are all in place already: Amazon and others sell cheap cloud-computing services, and the basic computer science of large-scale parallel data processing is now well understood. To be precise, I want an open search company that sells map-reduce access to their index. Map-reduce is a standard framework for breaking down large computational tasks into small pieces that can be distributed across hundreds or thousands of processors, and Google already uses it internally for all their own applications — but they don’t currently let anyone else run it on their data.
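To make the idea concrete, here is a minimal single-machine sketch of the kind of job I have in mind: counting pages by language, which answers the Chinese-pages question from the top of this post. The (url, html) record format, the detect_language heuristic, and the toy driver are all assumptions for illustration; a real index cloud would define its own record schema and run the same map and reduce functions on a cluster.

    from collections import defaultdict

    def detect_language(html_text):
        # Stand-in heuristic: call a page "zh" if it contains CJK characters.
        return "zh" if any("\u4e00" <= ch <= "\u9fff" for ch in html_text) else "other"

    def map_page(url, html_text):
        # Map step: emit one (language, 1) pair per crawled page.
        yield detect_language(html_text), 1

    def reduce_counts(language, counts):
        # Reduce step: sum the partial counts for each language.
        return language, sum(counts)

    def run_mapreduce(pages):
        # Toy single-machine driver standing in for a cluster scheduler.
        grouped = defaultdict(list)
        for url, html in pages:
            for key, value in map_page(url, html):
                grouped[key].append(value)
        return dict(reduce_counts(k, v) for k, v in grouped.items())

    if __name__ == "__main__":
        sample = [
            ("http://example.com/a", "<html>hello world</html>"),
            ("http://example.cn/b", "<html>\u4f60\u597d\u4e16\u754c</html>"),
        ]
        print(run_mapreduce(sample))  # {'other': 1, 'zh': 1}

On a real index cloud, the same map and reduce functions would be handed to the provider’s scheduler (Hadoop is the usual open-source choice) and billed by compute time rather than run in a loop on one machine.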

I really think there’s money to be made in providing open search infrastructure, because I really think there’s money to be made in better search. In fact I see an entire category of applications that hasn’t yet been explored outside of a few very well-funded labs (Google, Bellcore, the NSA): “information engineering,” the question of what you can do with all of the world’s data available for processing at high speed. Got an idea for better search? Want to ask new questions of the entire internet? Working on an investigative journalism story that requires specialized data-mining? Code the algorithm in map-reduce, and buy the compute time in tenth-of-a-second chunks on the web index cloud. Suddenly, experimentation is cheap — and anyone who can figure out something valuable to do with a web index can build a business out of it without massive prior investment.

The business landscape will change if web indices do become infrastructure. Most significantly, Google will lose its search monopoly. Competition will probably force them to open up access to their web indices, and this is good. As Google knows, the world’s data is exceedingly valuable — too valuable to leave in the hands of a few large companies. There is an issue of public interest here. Fortunately, there is money to be made in selling open access. Just as energy drives change in physical systems, money drives change in economic systems. I don’t know who is going to do it or when, but open search infrastructure is probably inevitable. If Google has any sense, they’ll enter the search infrastructure market long before they’re forced to (say, before Yahoo and Bing do it first).

Let me know when it happens. There are some things I want to do with the internet.

The New York Times Doesn’t Understand Twitter and Iran

In the editorial “New Tweets, Old Needs,” experienced journalist Roger Cohen says that Twitter isn’t journalism, and that Iran “has gone opaque” without its mainstream media correspondents. He may be right about the recent paucity of good journalism out of Iran, but he misses some really crucial points about how information flows in the absence of a distribution monopoly (like a printing press). In particular, he seems to assume that only professional journalists are capable of producing professional journalism.

It is absolutely true that journalism is much more than random tweeting or blogging. I have been particularly inspired by the notion that “journalism is a discipline of verification,” and a tweet or a blog post neither requires nor undergoes the fact-checking and truthfulness standards that we expect of our more traditional news media. I also agree that search engines are simply not a substitute for being there. Someone must be a witness. Someone has to feed their experience into the maw of the internet at some point.

However, when Cohen says “the mainstream media — expelled, imprisoned, vilified — is missed” he is implicitly arguing that only the mainstream media can produce good journalism. Traditionally, “journalist” was a distinct, easily defined class: a journalist was someone who worked for a news organization. There weren’t many such organizations, because a distribution monopoly is an expensive thing. All this has changed with the advent of nearly free and truly democratic information distribution, and we are seeing a rapid erosion of the distinction between professional and amateur or “citizen” journalists. The result is confusion, uncertainty and fear — especially on the part of those who have staked their careers or their fortunes on the clarity of this distinction.

But I see a big difference between journalists and journalism, and this is where Cohen and I part ways.

In my view the failure of journalism in Iran was not the failure of the mainstream media to hold their ground (or their funding, or their audiences) but rather the failure of the journalism profession to educate the public about what exactly it does, and how to do it. When Cohen asks questions such as

But who is there to investigate these deaths — or allegations of wholesale rape of hundreds of arrested men and women — and so shed light?

my answer is, the Iranians, of course!

Naturally, a young activist-turned-reporter does not have the experience or connections of an old-school foreign correspondent. But such a person is there, and they care enormously. What they lack is guidance. What is and is not journalism, exactly? What are the expected standards and daily, on-the-ground procedures of verification? Where can someone turn for advice on covering the struggles they are immersed in? And what, actually, differentiates the New York Times from a blogger? We need clear answers, because the newspapers are no longer the only ones declaiming the news.

Perhaps the mainstream media couldn’t be in Iran, but they could have been mentoring and collaborating from afar, and yes, publishing the journalism of non-career journalists. And such a project needs to begin long before times of crisis, in every region, so that those who are there are ready.

If “citizen journalism” has so far been somewhat underwhelming, it is because we have not taught our citizens to be journalists.

We Can’t Learn About Economics


Despite spending the last several days reading up on Treasury Secretary Geithner’s plan to buy bad bank assets, I now feel only marginally better prepared to judge whether this is a good idea or not. Of course, no one is asking me, but I still think it’s a big problem that I can’t evaluate this plan: in a democracy, citizens need to be able to understand what their government is doing.

Now, I am no economist and I have no idea how to run a bank — much less all the banks. However, I am smart, interested, and I’ve done my homework, including previously reading a first-year economics textbook (covering both micro- and macro-economics) and several other interesting books (1,2,3) on how markets work or don’t. In short, I have been the model of a concerned citizen, and I still have no idea what is going on. This is partially because the situation is very complex, but it is also because there is no way a private citizen can get access to the data that would clarify matters — large banks will barely share their balance sheets with the government, much less me.

This is a problem. It means that the government, financial, and academic communities have not paid nearly enough attention both to basic economics education and to transparency in real-world business. It is therefore impossible for anyone else to check their assumptions and restrain their huge power. Lest this sound like unhelpful complaining, I promise to make a concrete suggestion for improvement by the end of this post.


What Foxmarks Knows about Everyone

I recently installed Foxmarks, a Firefox extension that automatically synchronizes your web bookmarks across all the computers you might use. Refreshingly, the developers got it right: the plug-in is idiot-simple and works flawlessly.

This is accomplished through a central server, which means a lot of bandwidth, hardware, reliability costs, etc. In short, it’s not a completely cheap service to provide. As there is no advertising either in the plug-in or on the site (yet?), I began to wonder how they planned to pay for all this. I found my answer on their About Us page:

We are hard at work analyzing over 300 million bookmarks managed by our systems to help users discover sites that are useful to them. By combining algorithmic search with community knowledge-sharing and the wisdom of crowds, our goal is to connect users with relevant content.

Of course.

There is a lesson here: knowledge of something about someone is fundamentally different from knowledge of something about everyone. As with Google, Amazon, or really any very large database of information over millions of users, there are extremely valuable patterns that only occur between people. The idea is as old as filing, but the web takes this to a whole new level, especially if you can convince huge numbers of people to voluntarily give up their information.

So far, I haven’t said anything new. What I am suggesting is a shift in thinking. Rather than being concerned primarily about our individual privacy rights when we fill out a form full of personal details, perhaps we should be pondering what powers we are handing over by letting a private entity see these large-scale inter-individual patterns — patterns that they can choose to hide from everyone else’s view, naturally.

I am beginning to wonder very seriously about the growing disparity between public and private data-mining capability. Is this an acceptable concentration of power? What effects does this have on a society?