The publishing industry needs a lightweight, open, paid syndication technology

I see a key open standard missing, one crucial to the development of both the user reading experience and publisher business models. Users want the “best” content from wherever it may have been published, presented beautifully and seamlessly within a single interface. Publishers want users to go to their site/app so that they can get paid, whether that’s through a subscription or by showing the user ads.

This tension is hugely visible in the content economy of today. It’s why Flipboard, Zite, and Google News are so loved by consumers yet so hated by publishers. It’s a manifestation of the “producers vs. aggregators” spat, a senseless culture war which reflects a legacy publishing industry structure that is ill-equipped to serve a digital public. This has spawned many lawsuits. These battles make no sense to the consumer, and indeed, the content supply chain is not the consumer’s problem. Nonetheless there is a real problem here, and lawyers alone cannot solve it.

What I start from is user experience. My user experience. Your user experience. I want whatever I want, all in one convenient cross-platform place. The product itself might be an expertly curated selection of articles, an algorithmic aggregator (Google News), a socially-filtered stream of content (Flipboard), or a system that tries to learn my content preferences over time (Zite). The best method of content selection is far from settled, but it’s clear that it’s going to be very hard for a general-interest publishing brand to reliably attract my attention if all they can offer me is what they can create in-house. To adapt Bill Joy’s maxim, “most of the best content is on someone else’s site.”

The practice of pointing to content created by someone else within your product has come to be known, for better or for worse, as “aggregation”, though “curation” has also been used to describe the more manual version. (Personally I suspect the distinction is meaningless, because algorithms embody editorial judgement, and there are strong hybrid approaches.) Because of the way the internet developed, many people have conflated aggregated content with free content. But this is not necessarily so. Aggregation has mostly been done by using links, and it’s not the aggregator who decides whether the page on the other end of the link is free to view.

In an era of massive information abundance, filtering creates enormous value, and that’s what aggregation is. Aggregation in all its guises is really, really useful to all of us, and it’s here to stay. But linking as an aggregation method is starting to fall apart in important ways. It doesn’t provide a great user experience, and it doesn’t get producers paid. I strongly believe that we don’t want to discourage the linked, open nature of the internet, because widespread linking is an important societal good. Linking is both embodied knowledge and a web navigation system, and linking is incredibly valuable to journalism. Nonetheless, I see an alternative to linking that aligns the interests of publishers and consumers.

When Google News sends you to read an article, that article has a different user interface on each publisher’s site. When the Twitter or Flipboard apps show you an article they display only a stub, then require you to open a Safari window for the rest. This is a frustrating user experience, which Zite tried to remedy by using a readability engine to show you the clean article text right within the application. But of course this strips the ads from the original page, so the publisher doesn’t get paid, hence the cease and desist letter that Zite received. For many kinds of content, somebody needs to get paid somewhere. (I’m not going to step today into the minefield of amateur-free vs. professional-paid content, except to say that both are valuable and will always be with us.) Payment means taking either some cash or some attention from the consumer. Lots of companies are working on payment systems to collect money from the consumer, and there have long been ad networks that distribute advertising to wherever it might be most valuable. What is missing is a syndication technology that moves content to where the user is, and money to the producer. The user gets an integrated, smooth user experience that puts content from anywhere within one UI, and the publisher gets paid wherever their content goes.

This would be a B2B technology; payment would be between “aggregators” and “content creators,” though of course both roles are incredibly fluid and lots of companies do both at different times. To succeed, it needs to be a simple, open standard. Both complexity and legal encumbrances can be hurdles to wide adoption, and without wide adoption you can’t give consumers a wide choice of integrated sources. I’m imagining something like RSS, but with purchased authentication tokens. For the sake of simplicity and flexibility, both payment and rights enforcement need to be external to the standard. A publisher can use whatever e-business platform they like to sell authentication tokens at whatever pricing structure suits them; as for rights, merely expressing them online — let alone enforcing them — is an incredibly complicated unsolved problem. Those problems will have to be worked on, but meanwhile, there’s no reason we can’t leverage our existing payment and rights infrastructures and solve just the technical problem of a simple, open, authenticated B2B content syndication system.
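
To make that concrete, here’s a sketch of what the aggregator’s side might look like. Everything specific (the endpoint, the Bearer header, the feed schema) is invented for illustration; the only real commitments are RSS and a purchased token.

```python
# A sketch of the aggregator (client) side of such a standard. The feed
# URL, the token, and the Bearer header are hypothetical stand-ins for
# whatever the standard would actually specify.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://publisher.example.com/paid-feed.xml"  # hypothetical
TOKEN = "tok_7f3a9c"  # bought out-of-band through the publisher's store

def fetch_paid_feed(url, token):
    # The token proves this aggregator has paid; everything else is plain RSS.
    req = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return ET.fromstring(resp.read())

feed = fetch_paid_feed(FEED_URL, TOKEN)
for item in feed.iter("item"):
    title = item.findtext("title")
    # Full article bodies travel inside the feed (content:encoded), so the
    # aggregator can render them natively instead of bouncing to a browser.
    body = item.findtext(
        "{http://purl.org/rss/1.0/modules/content/}encoded")
```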

What I am trying to create is a fluid content marketplace which allows innovation in content packaging, filtering, and presentation. There is no guarantee that such a market will pay publishers anything like what they used to get for their content, and in fact I expect it won’t. But nothing can change the fact that there is way, way more content than there used to be, and much of it is both high-quality and legally free. If publishers want to extract the same level of revenue from you and me, they’re going to have to offer us something better than what we had before — such as an app that learns what I like to read and assembles that material into one clean interface. But it’s clear by now that no one content creator can ever satisfy the complete spectrum of consumer demand, so we need a mechanism to separate creation and packaging, while allowing revenue to flow from the companies that build the consumer-facing interfaces to the companies that supply the content. That means a paid syndication marketplace, which requires a paid syndication standard.

This idea has close links with the notion of content APIs, and what I am proposing amounts to an industry-standard paid-content API. Let’s make it possible for those who know what consumers want to give it to them, while getting producers paid.
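
The publisher’s half of the earlier sketch is just as small. Again, the details (the token store, the expiry scheme) are invented stand-ins for whatever commerce platform the publisher already runs:

```python
# The publisher (server) side of the same hypothetical scheme: check the
# token against whatever e-commerce backend sold it, then decide whether
# to serve full articles or teaser stubs.
import time

# Stand-in for the publisher's commerce backend: token -> expiry timestamp.
SOLD_TOKENS = {"tok_7f3a9c": time.time() + 30 * 24 * 3600}

def render_full_feed():
    return "<rss><!-- complete article bodies --></rss>"

def render_stub_feed():
    return "<rss><!-- headlines and links only --></rss>"

def feed_for(token):
    expiry = SOLD_TOKENS.get(token)
    if expiry is not None and expiry > time.time():
        return render_full_feed()  # paying aggregator
    return render_stub_feed()      # everyone else
```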

Further reading:

Why We Need Open Search, and How to Make Money Doing It

Anything that’s hard to put into words is hard to put into Google. What are the right keywords if I want to learn about 18th century British aristocratic slang? What if I have a picture of someone and I want to know who it is? How do I tell Google to count the number of web pages that are written in Chinese?

We’ve all lived with Google for so long that most of us can’t even conceive of other methods of information retrieval. But as computer scientists and librarians will tell you, boolean keyword search is not the be-all and end-all. There are other classic search techniques, such as latent semantic analysis, which tries to return results that are “conceptually similar” to the user’s query, even if the relevant documents don’t contain any of the search terms. I also believe that full-scale maps of the online world are important: I would like to know which web sites act as bridges between languages, and I want tools to track the source of statements made online. These sorts of applications might be a huge advance over keyword search, but large-scale search experiments are, at the moment, prohibitively expensive.
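
For anyone who hasn’t met it, here’s latent semantic analysis in miniature. The corpus and the numbers are toy inventions (real systems use tf-idf weighting and vastly larger matrices), but the mechanics are the same: factor the term-document matrix, then match queries to documents in the reduced concept space.

```python
# A toy latent semantic analysis, assuming nothing beyond numpy.
import numpy as np

terms = ["car", "auto", "engine", "flower", "petal"]
# Term-document count matrix: rows are terms, columns are three tiny docs.
A = np.array([
    [2, 0, 0],  # car
    [0, 2, 0],  # auto
    [1, 1, 0],  # engine
    [0, 0, 2],  # flower
    [0, 0, 1],  # petal
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                 # keep two latent "concept" dimensions
docs = Vt[:k].T       # each document as a k-dimensional concept vector

def search(query_words):
    q = np.array([w in query_words for w in terms], dtype=float)
    q_hat = (q @ U[:, :k]) / s[:k]   # fold the query into concept space
    return docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))

# Document 1 never contains the word "car", yet it matches a "car" query
# as strongly as document 0, because the shared term "engine" pulls both
# documents into the same latent concept.
print(search({"car"}))
```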


The problem is that the web is really big, and only a few companies have invested in the hardware and software required to index all of it. A full crawl of the web is expensive and valuable, and all of the companies who have one (Google, Yahoo, Bing, Ask, SEOmoz) have so far chosen to keep their databases private. Essentially, there is a natural monopoly here. We would like a thousand garage-scale search ventures to bloom in the best Silicon Valley tradition, but it’s just too expensive to get into the business.

DotBot is the only open web index project I am aware of. They are crawling the entire web and making the results available for download via BitTorrent, because

We believe the internet should be open to everyone. Currently, only a select few corporations have access to an index of the world wide web. Our intention is to change that.

Bravo! However, a web crawl is a truly enormous file. The first part of the DotBot index, with just 600,000 pages, clocks in at 3.2 gigabytes. Extrapolating to the more than 44 billion pages so far crawled, I estimate that they currently have 234 terabytes of data. At today’s storage technology prices of about $100 per terabyte, it would cost $24,000 just to store the file. Real-world use also requires backups, redundancy, and maintenance, all of which push data center costs to something closer to $1000 per terabyte. And this says nothing of trying to download a web crawl over the network — it turns out that sending hard drives in the mail is still the fastest and cheapest way to move big data.
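
Here’s that arithmetic spelled out, using only the figures already quoted:

```python
# Back-of-envelope check of the numbers above.
pages_sampled = 600_000
sample_bytes  = 3.2e9     # 3.2 GB for the first chunk of the DotBot index
pages_total   = 44e9      # pages crawled so far

bytes_per_page = sample_bytes / pages_sampled   # ~5.3 KB per page
total_tb = pages_total * bytes_per_page / 1e12
print(round(total_tb))          # ~235 TB of raw crawl data

print(round(total_tb * 100))    # bare drives at $100/TB: ~$23,000-24,000
print(round(total_tb * 1000))   # managed storage at $1000/TB: ~$235,000
```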

Full web indices are just too big to play with casually; there will always be a very small number of them.

I think the solution to this is to turn web indices and other large quasi-public datasets into infrastructure: a few large companies collect the data and run the servers, other companies buy fine-grained access at market rates. We’ve had this model for years in the telecommunications industry, where big companies own the lines and lease access to anyone who is willing to pay.

The key to the whole proposition is a precise definition of access. Google’s keyword “access” is very narrow. Something like SQL queries would expand the space of expressible questions, but you still couldn’t run image comparison algorithms or do the computational linguistics processing necessary for true semantic search. The right way to extract the full potential of a database is to run arbitrary programs on it, and that means the data has to be local.
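
To see the difference in expressive power, imagine the index exposed as a relational table. (The schema is invented, and sqlite stands in for whatever query layer a vendor might actually offer.)

```python
import sqlite3

# A hypothetical slice of a web index, exposed as a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, domain TEXT, lang TEXT, bytes INT)")
conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", [
    ("http://a.example/1", "a.example", "zh", 9100),
    ("http://a.example/2", "a.example", "en", 4300),
    ("http://b.example/1", "b.example", "zh", 7800),
])

# "How many pages are written in Chinese, per domain?" -- easy in SQL.
for row in conn.execute(
        "SELECT domain, COUNT(*) FROM pages WHERE lang = 'zh' GROUP BY domain"):
    print(row)

# But "which pages contain the same photograph?" cannot be phrased as a
# query at all: it needs arbitrary code running next to the raw documents.
```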

The only model for open search that works both technologically and financially is to store the web index on a cloud, let your users run their own software against it, and sell the compute cycles.

It is my hope that this is what DotBot is up to. The pieces are all in place already: Amazon and others sell cheap cloud-computing services, and the basic computer science of large-scale parallel data processing is now well understood. To be precise, I want an open search company that sells map-reduce access to their index. Map-reduce is a standard framework for breaking down large computational tasks into small pieces that can be distributed across hundreds or thousands of processors, and Google already uses it internally for all their own applications — but they don’t currently let anyone else run it on their data.
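
As a concrete example, here is the “how many pages are written in Chinese” question from above, phrased as a map-reduce job. The single-machine simulation and the toy language detector are my own stand-ins; a real job would run against the vendor’s copy of the index through whatever API they expose.

```python
from collections import defaultdict

def detect_language(html):
    # Toy stand-in: any CJK character means "zh". A real job would run a
    # proper classifier -- exactly the per-document computation that a
    # query language cannot express.
    return "zh" if any("\u4e00" <= c <= "\u9fff" for c in html) else "en"

def mapper(url, html):
    # Called once per crawled page, in parallel across the whole index.
    yield detect_language(html), 1

def reducer(language, counts):
    # Called once per distinct key with every value the mappers emitted.
    yield language, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Single-machine simulation of the shuffle between map and reduce.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reducer(k, vs))

crawl = [("http://a.example/1", "<p>你好，世界</p>"),
         ("http://b.example/1", "<p>hello, world</p>")]
print(run_mapreduce(crawl, mapper, reducer))   # {'zh': 1, 'en': 1}
```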

I really think there’s money to be made in providing open search infrastructure, because I really think there’s money to be made in better search. In fact I see an entire category of applications that hasn’t yet been explored outside of a few very well-funded labs (Google, Bellcore, the NSA): “information engineering,” the question of what you can do with all of the world’s data available for processing at high speed. Got an idea for better search? Want to ask new questions of the entire internet? Working on an investigative journalism story that requires specialized data-mining? Code the algorithm in map-reduce, and buy the compute time in tenth-of-a-second chunks on the web index cloud. Suddenly, experimentation is cheap — and anyone who can figure out something valuable to do with a web index can build a business out of it without massive prior investment.

The business landscape will change if web indices do become infrastructure. Most significantly, Google will lose its search monopoly. Competition will probably force them to open up access to their web indices, and this is good. As Google knows, the world’s data is exceedingly valuable — too valuable to leave in the hands of a few large companies. There is an issue of public interest here. Fortunately, there is money to be made in selling open access. Just as energy drives change in physical systems, money drives change in economic systems. I don’t know who is going to do it or when, but open search infrastructure is probably inevitable. If Google has any sense, they’ll enter the search infrastructure market long before they’re forced to (say, before Yahoo and Bing do it first).

Let me know when it happens. There are some things I want to do with the internet.