I, Robot.txt
Tom Foremski gives voice to a growing sentiment in the blogosphere: for many sites, search engine crawling is a net negative. The bandwidth and computing costs of servicing crawlers are often hard to justify in light of the identifiable traffic they bring.
The problem is much more pronounced in the blogosphere than in the wider web. If a user navigates to http://www.siliconvalleywatcher.com/ once, initially from a Google search page, and returns via bookmark 15 times over the next month, the traffic from Google would show as 6.25% of this user’s traffic (1 visit out of 16). If the user in question returns another 15 times over the following month, the Google referral traffic for that month drops to zero, and the two-month share is reduced to just over 3% (1 visit out of 31). The specific figures aren’t important here. What is important is the fact that search engine referrals, while just a tiny part of the overall traffic, can claim responsibility for an important share of a blog’s traffic beyond the inbound referral. Thinking about the blogs I have subscribed to (including Tom Foremski’s), I visit many of them daily, but found them through a search engine. One inbound referral, hundreds of bookmarked visits. Without the search engine, though, perhaps no visits at all.
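To make the arithmetic above concrete, here is a small sketch that reproduces the two figures. The function name referral_share is made up for illustration; the visit counts are the ones used in the example above.

```python
def referral_share(search_visits: int, direct_visits: int) -> float:
    """Return the fraction of all visits that arrived via a search referral."""
    total = search_visits + direct_visits
    return search_visits / total if total else 0.0


# Month one: 1 visit from a search page, 15 visits from a bookmark.
month_one = referral_share(1, 15)

# Two months combined: still 1 search visit, now 30 bookmark visits.
two_months = referral_share(1, 30)

print(f"{month_one:.2%}")   # 6.25%
print(f"{two_months:.2%}")  # 3.23%
```

The referral share keeps shrinking as the reader keeps coming back, even though every one of those return visits traces back to the single search referral.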
That said, there are problems with crawling. Over time, the number of crawlers grows. There used to be just a handful that got broad coverage; now there are a dozen or more. In the future, there may be a hundred or more. At some point, Tom Foremski’s argument will become indisputable, and crawlers will have to be managed much more carefully than they are now, through the use of Robots.txt or some other means.
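As a rough illustration of what per-crawler management through Robots.txt can look like today, the sketch below checks a small rule set with Python's standard urllib.robotparser module. The crawler name HypotheticalBot, the /archives/ path, and the example.com URLs are assumptions invented for this example, not rules from any real site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical Robots.txt: one crawler fully allowed, one fully blocked,
# and every other crawler kept out of the archives.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: HypotheticalBot
Disallow: /

User-agent: *
Disallow: /archives/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse Robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
post = "http://example.com/archives/2005/10/post.html"
home = "http://example.com/index.html"

print(parser.can_fetch("Googlebot", post))        # True
print(parser.can_fetch("HypotheticalBot", home))  # False
print(parser.can_fetch("SomeOtherBot", post))     # False
print(parser.can_fetch("SomeOtherBot", home))     # True
```

A list like this has to be maintained by hand for every crawler the publisher cares about, which is part of why it scales poorly as the number of crawlers grows.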
Another problem is latency. For much of the web, and the blogosphere in particular, freshness counts. The useful “half-life” for a lot of blog content is measured in days and hours, not weeks and months. Traditional crawling techniques often do find the information that people want, but too late to matter. In the blogosphere, pings can provide a crucial bit of timely information – fresh content available here! While this provides a marked improvement over the old discovery methods, search engines, which have their own constraints in terms of bandwidth and compute cycles, have to prioritize which sites get immediate crawling. Only the content sources that will produce significant traffic from the crawl will get the ASAP treatment.
So, are the days of the search engine content distribution model numbered, as Tom Foremski suggests? Rather than clamping down on crawlers, and therefore distribution (however meager it may seem in the stats logs), let me suggest an alternative: full content pings.
Full Content Pings
The idea behind full content pings is to bundle the content itself into the message that tells a ping server new content has been published on your site. Set aside for the moment the problem that once you’ve committed to sending the full article to a ping server, it hardly seems right to call it a “ping server” any more. Why would publishers want to do this? Because it provides wider, faster distribution, and it solves the crawling problem on both ends – yours and the search engine’s. For example, if this post were submitted in full as part of the ping, Googlebot and the gang wouldn’t need to come fetch this post to analyze it for inclusion in their databases and indices. It would be available from the ping server directly. Search engines could maintain a high-bandwidth, always-on connection to the ping server, and have the full content of newly published articles in hand, without having to do any fetching at all from the origin server.
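Here is one way such a message might be assembled, as a sketch only. Conventional pings are XML-RPC calls that carry just the site name and URL; the method name weblogUpdates.fullContentPing and the struct fields url, title, and content below are invented for illustration and are not part of any existing ping specification. The sketch builds the request body with Python's standard xmlrpc.client and decodes it again, without contacting any server.

```python
import xmlrpc.client

# Hypothetical method name, made up for this sketch.
FULL_CONTENT_METHOD = "weblogUpdates.fullContentPing"


def build_full_content_ping(site_name: str, site_url: str,
                            post_url: str, title: str, body: str) -> str:
    """Serialize a full content ping as an XML-RPC request body."""
    post = {"url": post_url, "title": title, "content": body}
    params = (site_name, site_url, post)
    return xmlrpc.client.dumps(params, methodname=FULL_CONTENT_METHOD)


request_xml = build_full_content_ping(
    "Example Blog",
    "http://example.com/",
    "http://example.com/2005/10/i-robot-txt.html",
    "I, Robot.txt",
    "Full text of the post goes here.",
)

# A ping server (or a search engine reading from it) would decode the
# request and have the article in hand without fetching the origin page.
params, method = xmlrpc.client.loads(request_xml)
print(method)
print(params[2]["title"])
```

In this shape the first two parameters stay the same as in a conventional ping, so the full content rides along as one extra argument.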
If you talk to the people who run search engines and their crawlers, they’ll tell you that the network access segment of the process of indexing content is the major bottleneck. In our own work here at VeriSign, the URL dereferencing step for some of our splog analysis is by far the slowest, most resource-intensive part of the process. If I pre-fetch 10,000 blog posts for analysis, the fetching will typically take an order of magnitude more time than the analysis of the content itself.
Full content pings represent a much more efficient process for publishers. Publish it once to one or more ping servers – maybe now renamed “content distribution servers” – and you’re done. Search engines and others that want to know about what you have published can get it from this central nexus. In theory, no bots should have to bother your web server at all, except possibly a validating crawler that may need to fetch your content just to make sure what’s been submitted actually exists on your origin server.
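One possible shape for that validation step, sketched under some loud assumptions: the validator already has a way to fetch the post and extract its article text (passed in here as a fetch function), and a match is judged by comparing digests of whitespace-normalized, lowercased text. All of the names below are hypothetical.

```python
import hashlib
from typing import Callable


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences don't matter."""
    return " ".join(text.split()).lower()


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def is_submission_valid(post_url: str, submitted: str,
                        fetch: Callable[[str], str]) -> bool:
    """Check that submitted content matches what the origin server serves."""
    return digest(submitted) == digest(fetch(post_url))


# Stand-in for a real fetcher: pretends the origin server returned this text.
def fake_fetch(url: str) -> str:
    return "  Full   content pings\nsolve the crawling problem. "


submitted_text = "Full content pings solve the crawling problem."
print(is_submission_valid("http://example.com/post", submitted_text, fake_fetch))
```

A validator along these lines would need only one fetch per post, and could sample posts rather than check every one.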
The downside? The resistance to this concept is anchored around anxieties about control over the content. If I publish my full content to a “content distribution server”, then why would I expect anyone to come to my server to view my content? That’s a solid point if we assume there are no policy controls for the content being submitted. Matt Mullenweg and David Galbraith have suggested in version 2 of their rssping proposal that full content be stripped of “stop words” – commonly used words that are discarded by search engines like “and”, “but”, “the”, “or”, etc. Under this proposal, search engines would have “full content”, or at least something close in terms of what they will use. And the content stripped of the stop words is of marginal to no use to those who would be tempted to re-publish this content on their own site. Words like “the”, “but”, and “not” are much more important for human consumption than they are for search engines.
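To show what stripped content looks like, here is a minimal sketch. The stop word list is a small sample chosen for this example, not the list from the rssping proposal or from any search engine, and the tokenizer simply drops punctuation.

```python
import re

# Illustrative sample only; real stop word lists are much longer.
STOP_WORDS = {"and", "but", "the", "or", "not", "a", "an", "of", "to", "is"}


def strip_stop_words(text: str) -> str:
    """Remove stop words (case-insensitively) and punctuation from text."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    kept = [token for token in tokens if token.lower() not in STOP_WORDS]
    return " ".join(kept)


sentence = "The bots are not the problem, but the bandwidth is."
print(strip_stop_words(sentence))  # bots are problem bandwidth
```

The keywords survive for indexing, but the sentence now reads as the opposite of what was written, which is the sense in which the stripped text is of little use to a human reader.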
This strategy has some subtle and fundamental problems of its own, however. More on that in a later post. Ultimately, what publishers need is a trustworthy platform that enables them to publish content to the cloud with the confidence that they can achieve the efficient and broad distribution they desire, while retaining control over where and how that content is presented and consumed. That’s not in place currently, but it is a solvable problem.

Comments
Michael, this is an interesting proposal for a new type of ping server. But I think the content ownership issue will not allow such a setup. Search engines provide distribution, but at some point, once a web site is established and has a large enough audience, the bots could be turned away. Or they pay a fee to syndicate the content.
Most blogs are written by hobbyists--as more are written by professional journalists seeking to make a living from their work, then letting anyone republish their full content won't work.
Posted by: Tom Foremski | October 30, 2005 12:23 PM
Stripping stop words a la rssping seems attractive at an initial glance, but it's a fallacy to believe these stop words will never be important to the indexing engine. Indeed, they may already be, for all we know.
Posted by: Hans Granqvist | October 31, 2005 12:12 PM
Good idea. Really, there have recently been too many search engine crawlers. Another option is still possible: extending the standard that describes Robots.txt so that it can specify which search engine crawlers are forbidden to index a site at all and which are authorized.
Posted by: Bruce | January 1, 2006 07:46 AM