
October 31, 2005

Content Sourcing

Ever since this post, the idea of content sourcing has been bugging me. The “silver bullet” for splogs is content theft. For example, in Doc Searls’ original post on that topic, he was reading quickly on some unknown blog, and failed to realize immediately that the words he read there were in fact Dave Winer’s. Now, there are two problems with this phenomenon – what Doc charitably calls blogs “of unclear provenance.”

 

First, it’s a simple violation of copyright. Now complaints about copyright tend to cause snickers in some quarters, but this is exactly the kind of abuse that needs to be protected against. Dave Winer’s commentary isn’t just being copied without his permission – it’s being monetized, and without even acknowledging that Dave Winer was the author.

 

Second, and more important, I think, is the realization that content can be lifted arbitrarily from anywhere, and used to “cloak” a splog (“Spam Blog”). As any SEO black-hat worth their salt will tell you, doorway pages full of links are increasingly hard to get noticed and indexed by the search engines. Splogs need to look like real blogs, or at least like real content, in order to have any hope of attracting visitors through searching.

 

The problem is exacerbated by the casual conventions of the blogosphere. I know for several of my past posts, the entire post was quoted in someone else’s blog. That doesn’t bother me, and in fact it’s probably a good thing as the posts in question were points I wanted distributed as broadly as possible. I’m not blogging here for the ad revenue though, so I’m different than others who do, and can see that this practice may be a problem for some. As it is, though, part of the culture of blogs is linking, and attribution, but also “fair use” republishing of important or salient parts of articles from elsewhere. That being the case, then as long as content can be trivially copied from legitimate sites, it hardly matters how sophisticated the content analyzer crawling a splog is, as the content is a perfect copy of legitimate content from somewhere else. Keyword mapping, Bayesian filtering -- all that is futile if you’re just scanning Dave Winer’s commentary taken from scriptingnews.com.

 

That’s a feature of the system that can be and will be exploited heavily. Say I’m a splogger. I’ve got a tool that takes keywords I enter and finds blog posts (via one or more search engines) that have content related to the keywords I enter.  The tool picks several dozen recent posts from the thousands available, combines them into a blog, and adorns the aggregated content with my ads, and links to my offer pages. It costs me virtually nothing – the domain name is less than $10, the blog hosting is free – and it is so automated that it’s nearly as simple as entering the keywords to start with and pressing the “Submit” button. Sure, it’s wholesale copyright violation, but when’s the last time you heard of someone catching trouble for that on a blog? The original authors can complain, and I’ll drop their content from my splog – there’s a million others to copy from just as easily.

 

As Doc Searls’ experience points out, it can be hard for a human to detect when this happens. For a crawler that’s analyzing the content to identify splogs, this makes the situation nearly hopeless, at least in terms of content analysis. The identifying characteristics of a splog in this case no longer can be found in the content – it’s full of rich, stolen text – but rather in the sites and URLs it points to. Content theft, then,  is the Vulcan Nerve Pinch the black-hats can use whenever needed against the system. Just scrounge up some fresh content from somebody else, and the whole splog detection framework collapses, like the hapless victim in Mr. Spock’s grip.

 

Should we, then, throw our hands up, and declare defeat? No, but there are two things I think need to be pursued in addition to (or maybe instead of) enhanced content analysis for blogs. First, we should look at semantics that can be embedded in (X)HTML that identify (tag) parts of the content that are “original” – copyright by the author of the post, and also those parts that are quoted or cited from elsewhere. I’ve done a quick tour with the search engine about this kind of markup, and besides some interesting ideas that use META tags – which are inadequate as they apply to the whole page – I haven’t found a framework for doing this. It’s not a complex concept, however.  (If anyone reading has experience with an existing framework that does this, please email me to let me know, thanks.)

 

The second idea that should be pursued in tandem is the enhancement of our publishing tools to make the “ownership/quotation” markup simple and quick for users to work into the process of creating content (writing a post). Something like a style tag that doesn’t just visually identify quoted material, but semantically identifies it as external content, along with proper source attribution. Properly marked up posts, then, can be quickly sorted through to determine what parts of the content are original and which are imported. That’s not a panacea, but it would be a good step forward. Clear assertions about what content is being produced vs. what is not will be a powerful asset in filtering splogs from the content stream.
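To make the idea concrete, here is a minimal sketch of how a crawler might sort a marked-up post into original and imported parts. It assumes, purely for illustration, that quoted material is wrapped in the standard HTML blockquote element with a cite attribute naming the source; the class and function names are hypothetical, and this is not an existing framework.

```python
from html.parser import HTMLParser


class ProvenanceParser(HTMLParser):
    """Split a post's text into original text and quoted text.

    Assumption: quoted material is wrapped in <blockquote cite="...">.
    Everything outside a blockquote is treated as the author's own.
    """

    def __init__(self):
        super().__init__()
        self.original = []     # text chunks written by the post's author
        self.quoted = []       # (source_url, text) pairs for imported text
        self._cite_stack = []  # cite values of the blockquotes we are inside

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self._cite_stack.append(dict(attrs).get("cite"))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._cite_stack:
            self._cite_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cite_stack:
            self.quoted.append((self._cite_stack[-1], text))
        else:
            self.original.append(text)


def split_provenance(html):
    """Return (original_chunks, quoted_pairs) for one post body."""
    parser = ProvenanceParser()
    parser.feed(html)
    parser.close()
    return parser.original, parser.quoted


# Hypothetical post body used only to exercise the sketch.
post = (
    '<p>My own commentary.</p>'
    '<blockquote cite="http://example.com/source-post">'
    '<p>Text quoted from elsewhere.</p></blockquote>'
    '<p>More of my own words.</p>'
)
original, quoted = split_provenance(post)
```

A post whose text is almost entirely in the quoted list, or that carries no markup at all, would be a candidate for closer inspection. Markup like this only helps if honest publishing tools emit it by default, which is the point of the second idea above.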

 

As an aside, those who monitor the tech side of the blogosphere will remember the minor uproar started by Mark Cuban’s protest against splogs a couple weeks ago. There was a distinct surge in the splog traffic, but it was nothing that unusual, given what I can see looking back a ways at the logs for weblogs.com. What was different this time, though, was that the bad guys tried something beyond the usual keywords (“loans”, “hair loss”, etc.), and tried using some new terms, like “Jeff Jarvis” and “Scoble”. Some splogs simply used the names of bloggers as so much flypaper to catch users who were looking for the bloggers, in hopes that the odd one might click through an ad. Others appropriated whole posts, even large collections of posts by popular bloggers, creating a sort of counterfeit blog. These tactics have been used before, but it was interesting to note that the blogosphere really got upset this time because they started seeing splogs regularly in their normal searches.

 

I can’t think that this tactic was particularly effective from a commercial standpoint – it seems unlikely that people searching for “Scoble” are good clickthrough candidates. It seems more like the black-hats having some fun at the expense of popular bloggers. If so, it may have backfired, as discussions and efforts around the splog issue have definitely picked up momentum in the last couple weeks.

October 27, 2005

I, Robot.txt

Tom Foremski gives voice to a growing sentiment in the blogosphere: search engine crawling for many sites is a net negative. The bandwidth and computing costs of servicing crawlers are often hard to justify in light of the identifiable traffic they bring.

 

The problem is much more pronounced in the blogosphere than in the wider web. If a user navigates to http://www.siliconvalleywatcher.com/ once, initially from a Google search page, and returns via bookmark 15 times over the next month, the traffic from Google would show as 6.25% of this user’s traffic. If the user in question returns another 15 times over the next month, the Google referral traffic for that month drops to zero, and the two-month share is reduced to just over 3%. The specific figures aren’t important here; what is important is the fact that search engine referrals, while just a tiny part of the overall traffic, can claim responsibility for an important share of a blog’s traffic beyond the inbound referral. Thinking about a number of blogs I have subscribed to (including Tom Foremski’s), many of them I visit daily, but found through a search engine. One inbound referral, hundreds of bookmarked visits. Without the search engine, though, no visits at all, perhaps.
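The arithmetic behind those percentages is easy to check. A small sketch (the function name is mine, purely for illustration):

```python
def referral_share(search_visits, bookmark_visits):
    """Fraction of one reader's visits that show up as search referrals."""
    total = search_visits + bookmark_visits
    return search_visits / total


# Month one: 1 visit from a search page, 15 from the bookmark.
month_one = referral_share(1, 15)

# Two months: still 1 search visit, now 30 bookmark visits.
two_months = referral_share(1, 30)

print(f"{month_one:.2%}")   # 6.25%
print(f"{two_months:.2%}")  # 3.23%
```

The share keeps shrinking with every return visit, even though every one of those visits traces back to the single search referral.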

 

That said, there are problems with crawling. Over time, the number of crawlers grows. There used to be just a handful that got broad coverage; now there’s a dozen or more. In the future, there may be a hundred or more. At some point, Tom Foremski’s argument will become indisputable, and crawlers will have to be managed much more carefully than they are now through the use of robots.txt or some other means.

 

Another problem is latency. For much of the web, and the blogosphere in particular, freshness counts. The useful “half-life” for a lot of blog content is measured in days and hours, not weeks and months. Traditional crawling techniques often do find the right information that people want, but too late to matter. In the blogosphere, pings can provide a crucial bit of timely information – fresh content available here! While this provides a marked improvement over the old discovery methods, search engines, who have their own constraints in terms of bandwidth and compute cycles, have to prioritize which sites get immediate crawling. Only the content sources that will produce significant traffic from the crawl will get the ASAP treatment.

 

So, are the days of the search engine content distribution model numbered, as Tom Foremski suggests? Rather than clamping down on crawlers, and therefore distribution (however meager it may seem in the stats logs), let me suggest an alternative: full content pings.

 

Full Content Pings

The idea behind full content pings is to bundle the content itself into a message to a ping server indicating that new content has been published on your site. Now setting aside for the moment the problem that once you’ve committed to sending the full article to a ping server, it hardly seems right to call it a “ping server” any more, why would publishers want to do this? Because it provides wider, faster distribution, and it solves the crawling problem, on both ends – yours and the search engine’s. For example, if this post were submitted in full as part of the ping, Googlebot and the gang wouldn’t need to come fetch this post to analyze it for inclusion in their databases and indices. It would be available from the ping server directly. Search engines could maintain a high-bandwidth, always-on connection to the ping server, and have the full content of newly published articles in hand, without having to do any fetching at all from the origin server.
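As a rough illustration of what such a message might look like, here is a sketch that marshals a post into an XML-RPC request body using Python's standard xmlrpc.client module. XML-RPC is how pings are conventionally sent, but the method name weblogUpdates.fullContentPing and the field names in the payload are invented for this example; no ping server is claimed to accept them.

```python
import xmlrpc.client


def build_full_content_ping(name, url, permalink, content):
    """Marshal a hypothetical full content ping as an XML-RPC request body."""
    payload = {
        "name": name,            # weblog name
        "url": url,              # weblog URL
        "permalink": permalink,  # URL of the new post
        "content": content,      # full text of the post travels with the ping
    }
    # The method name below is an assumption made for illustration only.
    return xmlrpc.client.dumps(
        (payload,), methodname="weblogUpdates.fullContentPing"
    )


request_body = build_full_content_ping(
    "Example Blog",
    "http://blog.example.com/",
    "http://blog.example.com/2005/10/full-content-pings",
    "The full text of the newly published post goes here.",
)

# A subscriber on the other side can unpack the message without ever
# fetching anything from the origin server.
params, method = xmlrpc.client.loads(request_body)
```

Nothing here touches the network. The point is only that the content rides along with the notification, so a consumer holding the message already has what a crawler would otherwise have to come and fetch.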

 

If you talk to the people who run search engines and their crawlers, they’ll tell you that the network access segment of the process of indexing content is the major bottleneck. In our own work here at VeriSign, the URL dereferencing step for some of our splog analysis is by far the slowest, most resource-intensive part of the process. If I pre-fetch 10,000 blog posts for analysis, the fetching will typically take an order of magnitude more time than the analysis of the content itself.

 

Full content pings represent a much more efficient process for publishers. Publish it once to one or more ping servers – maybe now renamed “content distribution servers” – and you’re done. Search engines and others that want to know about what you have published can get it from this central nexus. In theory, no bots should have to bother your web server at all, except possibly a validating crawler that may need to fetch your content just to make sure what’s been submitted actually exists on your origin server.

 

The downside? The resistance to this concept is anchored around anxieties about control over the content. If I publish my full content to a “content distribution server”, then why would I expect anyone to come to my server to view my content? That’s a solid point if we assume there are no policy controls for the content being submitted. Matt Mullenweg and David Galbraith have suggested in version 2 of their rssping proposal that full content be stripped of “stop words” – commonly used words that are discarded by search engines like “and”, “but”, “the”, “or”, etc. Under this proposal, search engines would have “full content”, or at least something close in terms of what they will use. And the content stripped of the stop words is of marginal to no use to those who would be tempted to re-publish this content on their own site. Words like “the”, “but”, and “not” are much more important for human consumption than they are for search engines.
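Here is a minimal sketch of the stop-word idea. The stop list is a tiny illustrative one of my own choosing, not the list from the rssping proposal, and the tokenizer is deliberately crude.

```python
import re

# Illustrative stop list only; a real search engine's list would differ.
STOP_WORDS = {"and", "but", "the", "or", "not", "a", "an", "of", "to", "is"}


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words and punctuation, keeping the remaining words in order."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return " ".join(w for w in words if w.lower() not in stop_words)


stripped = strip_stop_words("The dog is not in the house, but the cat is.")
print(stripped)  # dog in house cat
```

The stripped text still carries the terms an indexer cares about, but a human reader can no longer tell whether the dog is in the house or not, which is what makes it unattractive for republishing.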

 

There are some subtle and fundamental problems with this strategy, too, however. More on that in a later post. Ultimately, what publishers need is a trustable platform that enables them to publish content to the cloud with the confidence that they can achieve the efficient and broad distribution they desire, while retaining control over where and how that content is presented and consumed. That’s not in place currently, but it is a solvable problem.

October 25, 2005

Weblogs.com problem this morning...

Maybe you noticed, maybe you didn’t, but early this morning, around midnight eastern standard time, the home page of recent pings for Weblogs.com stopped displaying new pings. Inbound pings continued to be accepted and recorded throughout; the problems encountered affected the service’s ability to publish the received pings out to subscribers.


The problem was diagnosed and remedied early this morning. If you click on the “hourly” update links for early this morning (at the bottom of the weblogs.com home page), you will see a large number of pings published at the 7am hourly update (WARNING: this is a huge file).  These are the accumulated pings that were received while the “output” side of the service was having problems.  Systems and subscribers that have been consuming changes.xml and shortChanges.xml during this period should be up to date.

October 17, 2005

VeriSign+Moreover

Today, VeriSign announced the acquisition of Moreover.com. For almost a year, we’ve been thinking, watching and discussing (internally and externally) what’s been happening in the blogosphere. By early spring, several trends emerged that were important to us:

  • rapid, sustained growth of blogs
  • convergence of mainstream news and corporate data with feed-based publishing
  • increasing levels of spam in the blogosphere

The blogosphere was growing fast, and would soon outgrow its own infrastructure, and at the same time it was beginning to transcend the term “blogosphere” and establish itself as the new framework for Internet publishing for all kinds of information and content. In short, the blogosphere was going supernova.

It isn’t that we anticipated that the Internet would undergo “blogification” – although the ramifications of blogging are deeper and farther reaching than is generally acknowledged. Rather, the underlying technologies and processes which had proven themselves with bloggers were beginning to demonstrate their usefulness in areas that went far beyond blogging. While there’s something wonderfully humble about a simple ping, it represents a fundamental change, a re-organizing force in the way publishing occurs on the Internet.

If pings represented a new process for network publishing, then there would definitely be a need for a commercial, carrier-grade provider of ping services. Ping services are not a profitable business, in and of themselves. Pings are free by tradition and by necessity. Attempts to introduce cost or latency into the ping layer would be self-defeating; the network simply routes around such problems. A free, open, scalable service fabric for pings is a powerful base for us to build value-added services on, however. It also happens to help with a growing problem for the blogosphere, which is a good thing, too. Double whammy.

Why Moreover?
Moreover represents a way to catalyze, accelerate and lead with this vision of RSS and feed-based publishing services. A great number of publishers we talked to had similar comments about the blogosphere: the model was interesting but the plumbing looked shaky, and some of the basic mechanisms for managing content distribution were immature, if present at all. Moreover has been an innovator in aggregating and syndicating news content, and has a lot of credibility with publishers in providing solutions to the problem. We believe that as part of VeriSign they will be able to do the things they’ve already been doing, but even better, and on a larger scale. In addition, we believe that the combination of the two companies will help us pursue excellence in these areas:
 

  • completeness – widest, most diverse set of developed publishing sources anywhere
  • richness – the deepest contextual and descriptive knowledge of content available
  • freshness – the fastest, cleanest signaling and distribution of new content anywhere
  • reliability – always on, always there, just works

The Moreover team has been successfully pursuing these goals for quite a while, of course. They have a proven business model, and hundreds of happy customers, to whom they currently provide billions of headlines every month. And it’s important to point out that Moreover doesn’t just provide search indexing; its harvesting technology, combined with a refined editorial process, makes it an unequaled authority in relevance. Moreover doesn’t just know about keywords and terms in the content it processes. It develops a layer of rich metadata that is state-of-the-art for the industry. We believe the trends mentioned above offer an opportunity not only to leverage their existing efforts, but also to meet a much broader and emerging need – migration to a more efficient, more flexible and more consumer-friendly publishing framework.

Changing the Game
VeriSign and Moreover will combine to provide even better reach and better value through their existing array of services. But when combined with weblogs.com, and the shifts that are taking place in the marketplace, the combination will be uniquely suited to provide intelligent infrastructure for a new era of publishing. The status quo – publishers write stories, push them onto their web server and wait for customers to find them through direct access or through search engines eventually crawling and indexing their pages – will remain a convention for some time to come. But increasingly, publishing in mass media as well as the corporate world will start to look a lot more like the blogosphere, at least in terms of the architecture. And it won’t be because of the hype of “Web 2.0” or the trendiness of blogging, but because the underlying model – publish > ping > analyze > aggregate – simply works better than the old one.

 

Also, as has been made apparent recently, the issue of spam blogs -- splogs -- is reaching a crisis point. Splog filtering was one of the key services that got us involved in the acquisition of Weblogs.com's ping server. Moreover also runs a significant ping server, which means that VeriSign will be a gateway for an even wider share of the ping stream.  By combining VeriSign's own efforts toward splog filtering and Moreover's demonstrated skills in the area of contextual analysis and content validation, we believe we have the necessary elements to provide some real therapy here.

 

As part of VeriSign, Moreover will have the resources and operational platform to do what they do now even better, and on a bigger scale. More than that, though, the combined teams will have the assets and skills needed to lead in a more general way  -- by providing intelligent infrastructure for the new world of Internet publishing, from the once-a-month blogger to the manufacturing business that syndicates product information all the way up to the world's largest media organizations. In content signalling, metadata, feed management and distribution, the VeriSign+Moreover combination will provide customers with quality, manageable infrastructure that scales cleanly far into the future.

 

So, to the whole Moreover team let me say welcome aboard!

 

Now let's get to it.



 

Weblogs.com Cutover

This Thursday (10/20/05)  VeriSign will be switching over the weblogs.com ping service to an upgraded system. The new platform will offer much better performance and reliability for the service. The service will function the same way it always has, with the following exceptions:

  • IP Addresses for the servers have changed. You should not be using hardwired IP addresses to access the weblogs.com ping service – the correct way is to use the DNS server names – but if you are connecting by IP address, these will change on Thursday morning.
  • Duplicate Pings will be ignored, rather than rejected. Previously, there were time limitations on how frequently pings could be sent from a particular source. These limitations no longer exist.

You can test against the new system now. Again, production pings should be sent to the existing ping service at weblogs.com until Thursday Oct 20 at 10am. If you want to exercise the new system with your publishing tools before then, please do, using the following information:

Test Servers Until 10/20/05
test-weblogs.com 
test-www.weblogs.com 
test-rpc.weblogs.com 
test-audio.weblogs.com 
test-audiorpc.weblogs.com 

 

On Thursday, the test servers will become production servers, and the DNS names will be switched from their current IP addresses to the new ones:

Production Servers on 10/20/05    IP Address
weblogs.com                       209.112.113.105
www.weblogs.com                   209.112.113.105
rpc.weblogs.com                   209.112.113.106
audio.weblogs.com                 209.112.113.107
audiorpc.weblogs.com              209.112.113.108

 

As always, please use the DNS server names for accessing the weblogs.com ping service if at all possible. We understand that in some cases, IP addresses may need to be used, so we’re providing the information above to allow those parties to accommodate the upcoming change.
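If you are not sure how your own tools are configured, a quick check along these lines can flag endpoints that bypass DNS. This is only a sketch: the function name is hypothetical, and the /RPC2 path in the sample URLs is an assumption on my part rather than something specified above.

```python
import ipaddress
from urllib.parse import urlsplit


def uses_hardcoded_ip(endpoint):
    """Return True if the endpoint URL names its host by IP address."""
    host = urlsplit(endpoint).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so it is a DNS name.
        return False
    return True


# Sample endpoints; the /RPC2 path is assumed for illustration.
configured = [
    "http://rpc.weblogs.com/RPC2",
    "http://209.112.113.106/RPC2",
]
needs_attention = [url for url in configured if uses_hardcoded_ip(url)]
```

Anything that ends up in needs_attention would have to be updated by hand for the cutover; endpoints that use the DNS names should follow the change automatically.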


The website gets a new look, and will now show only the most recent 100 pings on the home page. At some times of day, the current site’s home page displays more than 6,000 pings from the previous 5 minutes of traffic.


Publishers that are pinging weblogs.com right now should not have to do anything different to accommodate this change, provided they aren’t using hard-coded IP addresses to access the weblogs.com ping server. Everything should just get smoother, faster and more reliable. See the weblogs.com website’s information for more on this if you are interested.

October 13, 2005

BART RSS: What Is and What Should Be

Dave Winer points out that BART now offers an RSS Feed. Here's a screenshot of a recent Yahoo! render:

bart_rss_is.gif


That's progress, I guess. But here's the type of info I wish BART would stream in an RSS feed:


bart_rss_should_be.gif

BART has a stream of events and information that is of interest to BART riders that they should surface through RSS. I'm not interested in subscribing to a BART email list, but I would like to have a BART feed with the latest "traffic tips" to check before I head home. If I'm up in the morning and I see entries of significant delays on the Pleasanton-Daly City line, I might decide to drive into the city -- parking hassle and all -- rather than ride BART.


It's good that organizations like BART are embracing RSS. But they're still mostly stuck in the "newsletter" mentality. Many of these organizations have useful information their customers would like to have broadcast in a timely manner, in a way that doesn't require registration or sign-ups. Instead of (or in addition to) marketing messages being broadcast on their feeds, they should supply the runtime info that will help make their products and services better -- avoiding a long delay on the Blue Line, for example.
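For what it's worth, a feed like that is not much work to produce. The sketch below builds a bare-bones RSS 2.0 document of service advisories with Python's standard library. The feed title, the example.com link, and the advisory text are all made up for illustration; this is not BART's actual feed or data.

```python
import xml.etree.ElementTree as ET


def build_advisory_feed(title, link, advisories):
    """Build a minimal RSS 2.0 feed from (headline, detail) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Current service advisories"
    for headline, detail in advisories:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = headline
        ET.SubElement(item, "description").text = detail
    return ET.tostring(rss, encoding="unicode")


# Hypothetical advisory, not real service data.
feed_xml = build_advisory_feed(
    "Transit advisories (hypothetical)",
    "http://transit.example.com/advisories",
    [
        (
            "Delay on the Pleasanton-Daly City line",
            "Hypothetical entry: expect significant delays this morning.",
        )
    ],
)
```

Each operational event becomes one item, so a rider's feed reader shows the current state of the system rather than the latest press release.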

October 12, 2005

Keyword-based Ads, Contextual Analysis, and RSS ad-splicing

Qumana has a series of posts that report on the results of a survey they’ve run across their users about advertising on blogs. Two interesting questions they are asking:

 

1. Which is more effective: contextual ads or keyword-based ads?

2. Do ads in RSS feeds work, or are ads only effective on the blog pages?

 

Keyword-based ads are ads that are matched against author-supplied tags for the content. Contextual ads are the same, only software analyzers are used to automatically deduce the contextual metadata. All things being equal, a human will provide more authoritative contextual information than a crawler every time, right? Probably, but there are a couple things to keep in mind:

 

a)      In some cases, machines using Bayesian filtering techniques (and other heuristics) may be better at matching your content with the most effective ads than you are, even if you are the author. If you write a blog on, say, fly fishing, you might naturally offer the following tags for keyword-based ad matching for your latest post: “fishing”, “Blackfoot River”, “Montana”, and “lodge”. But a sophisticated contextual analyzer might catch something you missed. Using a rich ad matching knowledge base, it may determine that because your post mentions Orvis and Sage fly rods, an ad for Lexus or Lands’ End is statistically your most effective advertising opportunity. For you, it may have made sense to add the tags “orvis” and “sage”, or even to request ads for the products you mentioned directly. That may improve things, but often the best matches aren’t intuitive or direct at all. In this case, the mention of a Sage fly rod might flag content that is statistically compelling for (prospective) Lexus customers.

 

b)      Human-added keywords aren’t hard to provide, but only if they are fairly vague. It’s quite useful for an ad-matching engine to know that your post revolves around, say, “Business: Accounting: Tax Negotiation and Representation: VAT Related”, to use a node from the DMOZ taxonomy. But that’s a lot to expect from your average blogger. Supplying the keywords “taxes” and “VAT” will help, but the granularity of author tags is likely to be quite coarse for some time, barring some big advances in blogging/authoring tools in the near future.

 

As for the question of whether ads in RSS feeds themselves are effective, Qumana suggests they’re not currently effective, and I don’t have reliable facts to the contrary on hand. However, I think bloggers, and publishers in general, would clearly prefer to use RSS feeds simply as pointers that bring the user back to their content pages, where they control the ads, and everything else about the content presentation. If I recall Dave Winer’s comments on this correctly, RSS is the advertisement; it refers users to the content on your page, which then may be adorned with whatever ads you wish to present there. There’s definitely something to this idea. The other way to look at this is that RSS is sort of transitional in that sense: it was designed to serve simply as interesting pointers back to the content on your site. What’s been happening though is that RSS is becoming somewhat of a victim of its burgeoning success. Its popularity has gotten the community to start thinking of it as the basis for distributing content, not just pointing back to it.

 

If the trend is toward feeds as a distribution channel for your content, then the question of whether ads in feeds work is somewhat academic: they simply must. If they don’t, it’s self-defeating to put your full content into a feed. Neither of the questions raised above is really an either/or dichotomy. Over time, a blend will develop between author-supplied keywords and machine-based contextual analysis as the basis for establishing the best match between available advertisements and your content. Similarly, successful bloggers and online publishers will devise a hybrid approach to advertising on their origin sites and in/through feeds. This will be interesting to watch evolve over the next year or so.

Cadenhead: King of Ping

Via Dave Winer, here’s an interesting post by Rogers Cadenhead, who worked on weblogs.com with Dave over the past several months.  He closes with this:


Rogers Cadenhead

 

One thing I'd like to see is a real-time search engine built only on the last several hours of pings, which could be a terrific current news service if compiled intelligently. While I was running Weblogs.Com, I wanted to use my brief moment as the king of pings to extend the API, which VeriSign appears to be considering, but Dave didn't want to mess with things while companies were loading a truck with money and asking for directions to his house.
I want to pursue these ideas, either independently or in concert with VeriSign and Yahoo Blo.gs. No knock intended, but big companies tend to sit on purchases like this rather than implementing new features. Blogger still lacks category support two years after being purchased by Google, an omission so basic you have to wonder whether it's serious about fending off competition from Six Apart, UserLand, and WordPress.

 

 

We’re not just considering extending the API, we’re hard at work on it now. There’s a definite sense of urgency in developing a rich set of extensions to the existing ping semantics. The goal is to support and deliver highly reliable infrastructure for basic ping for free to the community. That’s a good idea, but that effort isn’t cheap, and will be increasingly expensive as the traffic loads continue to grow. Free basic pings in part depend on VeriSign's ability to deploy premium services on top of them that customers need and will pay for. That’s a strong motivator to get the API extended. 

 

In any event, I appreciate the great resource Rogers has been throughout the acquisition and transition processes, and it’s good to hear that Rogers wants to go further in this area. We’d be happy to work with Rogers on helping him realize his vision for ping. I think our goals for the future of ping are highly congruent.

Word of the Day: Pingwidth

A tip o’ the hat to Stowe Boyd who has coined the (now obvious) term pingwidth.  In his comments replying to my earlier post:

I agree. It is very bad mojo. But we still are going to wind up with a 'pro' version -- for extra cash -- with all the fancy bells and whistles (geolocation, etc.) and more 'pingwidth' than the basic stuff. (Yes, I did say 'pingwidth'. You heard it here first.)

I agree with this. We’ve been talking about ‘fat pings’ versus ‘thin pings’ for some time. The narrowest ping in terms of pingwidth would be just a URL indicating the feed that changed. Current pings submitted through weblogs.com are only marginally thicker. The basic ping through weblogs.com has:

  • Weblog Name
  • Weblog URL
  • Permalink URL [optional]
  • Category Name [Optional]

An extended ping on weblogs.com adds an RSS URL, which points the ping server at the RSS feed related to the post.

In terms of pingwidth, that’s minimal. As Stowe Boyd suggests in his reply, there is demand and usefulness for fat pings. Pings that come not just with the URL-based information contained in the basic ping above, but also metadata like:

  • geo-location (where the blogger is posting from)
  • geo-referencing (places mentioned in content)
  • people names
  • author’s tags
  • trackbacks/pingbacks
  • comment notification
  • digital signatures & trust assertions
  • media/attachment metadata

That’s not an exhaustive list, but you get the idea – much more pingwidth.  All of this information is useful in some consumption context, and much more efficient if submitted with a ping rather than having to be discovered by URL dereferencing and crawling.
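One way to picture the difference is to model a ping as a record with a handful of required fields and an open-ended bag of optional metadata, then measure how big it is on the wire. The sketch below does that with JSON, which is my choice for brevity and not the format weblogs.com uses; the class, the pingwidth method, and the metadata keys are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Ping:
    # Fields of the basic ping listed above.
    weblog_name: str
    weblog_url: str
    permalink_url: Optional[str] = None
    category_name: Optional[str] = None
    # The extended ping adds the feed URL.
    rss_url: Optional[str] = None
    # "Fat ping" metadata; the keys used below are illustrative only.
    metadata: dict = field(default_factory=dict)

    def pingwidth(self):
        """Size in bytes of the JSON-serialized ping, skipping unset fields."""
        populated = {
            key: value
            for key, value in asdict(self).items()
            if value not in (None, {})
        }
        return len(json.dumps(populated).encode("utf-8"))


thin = Ping("Example Blog", "http://blog.example.com/")
fat = Ping(
    "Example Blog",
    "http://blog.example.com/",
    permalink_url="http://blog.example.com/2005/10/pingwidth",
    rss_url="http://blog.example.com/index.xml",
    metadata={
        "author_tags": ["pings", "rss"],
        "geo_location": "hypothetical city",
        "people": ["Stowe Boyd"],
    },
)

print(thin.pingwidth(), fat.pingwidth())
```

The fat ping costs more bytes up front, but every field it carries is one a consumer no longer has to discover by dereferencing the URL and crawling.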

‘Pingwidth’. Wish I’d thought of that…

October 11, 2005

Yahoo! News+Blog Search: Blurring the Line

Reviews of Yahoo!’s launch of its blog search have been mixed. Most of the discussion has been about the quality of the blog search itself, which obviously is an important consideration. But what’s interesting to me is Yahoo!’s decision to integrate blog search right into their regular news search. This has two important effects. First, it will introduce blog search to millions who are unaware of Technorati, Ice Rocket, Bloglines or a host of other blog search tools. Yahoo! has brought feeds to the masses on their My Yahoo! pages. Now, the search tools will address the content they are reading from their subscribed RSS feeds. Second, it blurs the line between news and blogs. The Yahoo! team sees what I see: that over time the news/blog distinction will become increasingly arbitrary.

eBay-VRSN deal

VeriSign announced a deal with eBay yesterday. VeriSign's payment business, which is how I came to VeriSign, has been acquired by eBay. Good luck to all in the payments team!

October 07, 2005

Comments on Weblogs.com Discussion

A couple of comments in reaction to the evolving discussion of VeriSign’s acquisition of the weblogs.com ping service…

 

DISCLAIMER: My comments on this blog are not official VeriSign service announcements. Repeat, not official service announcements. They reflect my understanding and attitude on the matter, but just that: I’m just one person on a large team.  Please keep that in mind generally on this blog, and specifically for the comments below.

 

  • Niall Kennedy writes:

I expect VeriSign will introduce an authentication certificate for ping submissions to its servers. One possible upsell on the listening side is the ability to be alerted to a blog update before anyone else, similar to how stock market systems delay stock quotes to non-premium customers. VeriSign could also sell more personal authentication keys to bloggers using stand-alone services such as Movable Type or WordPress to allow for the rebroadcast of a ping submission.

 

He’s right, these are possible ideas for premium services. But let me say that I really have never liked the 20-minute delay for NASDAQ quotes, as much as I understand the rationale behind it. I believe the consensus of the team working on the “real-time web” at VeriSign is focused on the “real-time” part. We are looking to build “intelligent infrastructure” that offers powerful building blocks from which innovative network apps can be made. I think forcing additional latency and delays into the information stream would be extremely counterproductive – it would degrade the quality of the services that could be built on top of the infrastructure.  Everyone, including VeriSign, is better off if the system is geared around clean, low-latency signaling – for all pings, all the time.

 

  • Stowe Boyd from CORANTE writes:

I guess I am a bit slow on these developments, but the notion that pings could be separately from the weblogs.com service as a whole and discussed like an additional element of the service -- kind of like call forwarding for your cell phone -- seemed strange. I mean, aren't pings just an essential? But then I noticed the subtly important word "basic" that precedes ping in the first sentence. Basic pings will remain free, so I am intuiting that non-basic pings are going to cost. So if you need a quicker ping cycle, or if your blog receives more than some basic number of pings, you are going to pay. Perhaps you will purchase a basic plan with so many pings in it, and you will pay for extra pings. Especially during peak hours.

Given Niall’s comments above and Mr. Boyd’s here, I fear I’ve given the wrong impression.  Slowing down “ping cycles” or otherwise degrading the performance of the service isn’t appealing to me at all. The goal is for all pings to circulate through the system quickly and accurately. Rather than thinking about changes in latency or timing, I’m thinking the “value-add” here will pivot around the depth or richness of the ping itself.

For example: suppose a blog submits a “full content ping”, one that carries not just the URL notification of a new post but the full content of the post itself. The infrastructure layer, either as an extension of the ping server itself or perhaps in conjunction with a partner, can then skip the URL dereferencing and crawling process, provided it establishes nominal trust with the submitter. So, if the whole post is attached to the ping, including really useful metadata like that addressed in the Atom 1.0 spec, the post can be processed, indexed, and therefore surfaced to the user much more quickly and cost-effectively than with the “basic ping”.
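Here is a minimal sketch of that decision, assuming a ping is a plain dictionary and that “nominal trust” is represented by nothing more than membership in a set of known submitters. The function name, the dictionary keys, and the `fetch` callable are all hypothetical. A real trust check would be considerably more involved.

```python
def resolve_content(ping, trusted_submitters, fetch):
    """Return (content, how) for a ping.

    `ping` is a dict with "url", "submitter", and optionally "content".
    `fetch` is any callable that dereferences a URL and returns its text.
    """
    has_content = bool(ping.get("content"))
    if has_content and ping.get("submitter") in trusted_submitters:
        # Full content ping from a trusted source: no crawl needed.
        return ping["content"], "from_ping"
    # Basic ping, or an untrusted submitter: fall back to dereferencing the URL.
    return fetch(ping["url"]), "crawled"


if __name__ == "__main__":
    def fake_fetch(url):
        return "fetched:" + url

    ping = {"url": "http://blog.example.com/1",
            "submitter": "blog.example.com",
            "content": "Hello, world"}
    print(resolve_content(ping, {"blog.example.com"}, fake_fetch))
    print(resolve_content(ping, set(), fake_fetch))
```

Note that an untrusted submitter's attached content is simply ignored rather than rejected, so the ping still works as a basic ping.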

 

On the outbound side, if a service is offered that not only provides ping signals but attaches a rich set of metadata to them – tags, keywords, place names, geo-references, etc. – that would be a highly useful upgrade from the information provided right now, which is basically a title and a URL for the source content. That may be an area where service and application builders will find that the metadata delivered with pinged content is easily worth the fee charged by the service.

 

So, think about pings becoming deeper and richer as a way to add value that can be charged for, rather than “dumbing down” or “slowing down” the existing basic pings so that what is now considered a basic ping can be monetized as a “premium”. That’s not what I'm talking about at all. That’s bad mojo, IMHO.

Weblogs.com Press Release

Here's a link to the official VeriSign announcement about the purchase of weblogs.com assets...

October 06, 2005

Weblogs 2.0

Word is out, and it’s true: VeriSign has acquired the assets of Dave Winer’s weblogs.com. I’m sure Dave will have plenty to say on the subject, but weblogs.com this past year has reached a point where Dave needed to either a) invest significant capital into the development of Weblogs 2.0 – a ping server to handle the next several years of traffic growth, b) sell it to someone else who would do the same, or c) watch as the current system slowly (or maybe quickly) succumbed to the ever-growing stream of pings. Last Thursday weblogs.com processed just under 2 million (1.96M) pings for the day. When we started talking with Dave, a couple months back, the ping totals were barely half of that, and the load even then on the servers made pinging weblogs a chancy proposition during peak posting times (late morning and mid-evening in the US).  For a long time, ping servers could be stood up as a single box running on a fast business DSL connection. Those days have passed at least for the popular ping servers; pings are well on their way to requiring serious infrastructure.

 

Why VeriSign?

That’s where VeriSign comes in. Not only are we running the DNS Registry and the largest TLDs (.com/.net), we handle hundreds of millions of transactions every month in the areas of mobile telephony, ecommerce payments, and instant messaging among other things. As we look ahead a few years, we see a future in which pings are generated not just by the millions per day, but by the tens and hundreds of millions. The blogosphere will continue to grow – rapidly – but we already note signs that RSS and the mechanics of feed-based publishing will extend well beyond the blogging perimeter, and be adopted as an enabling technology in areas like mainstream media publishing and corporate data distribution. In short, we believe that it won’t be long before terms like ping, feed, and trackback become part of the conventional lexicon for Internet publishing as a whole, not just the realm of blogs.

 

That’s an exciting view of the future, with a host of new opportunities for delivering network services in a user-friendly (and often user-powered) way. In order for that to happen though, there’s a lot of work to be done underneath the application layer. The blogosphere has benefited from a burst of innovations and advances in blogging tools, aggregation services, and social networking applications. The plumbing underneath all this activity hasn’t kept up, however. In the area of pings and ping servers, we have what it takes to keep up with the vigorous growth “up the stack”.

 

Pings, as their number grows and grows, start to look a lot like the other kinds of messaging operations we run. It’s what we excel at.

 

Our Vision for Weblogs.com

First, we want to see weblogs.com remain what it is, and maintain how it works for the long term. There’s enormous value for the ecosystem in realizing Dave’s original vision for his ping server: a free, standards-based service that is easy to use, and effective in signaling to the world at large that you’ve submitted new content into the system. Here are some attributes that we intend to preserve and extend for weblogs.com:

 

1.      Free

Basic pings, the messages processed by weblogs.com, will remain free to submit, and free to retrieve from the service. Over time, we plan to offer value-added services to publishers and consumers that we can charge a fee for, in much the same way companies like Yahoo! provide basic email services for free, and offer premium “upgrades” for a fee (e.g. extra storage, domain hosting, integrated website, etc.) But pings will remain free; our goal is to make weblogs.com the best, most widely used ping server available.

 

2.      Open

We are strong believers in standards and open computing. We’ll keep the XML-RPC format Dave Winer built weblogs.com around, and add to it, with additional services that leverage and extend the usefulness of pings. In all cases, we endorse open formats, freely available, freely implementable by the rest of the community.  Competing services are a good thing – ultimately they will provide a much stronger basis for innovation and growth in the ecosystem. We want to excel in our execution and implementation of our services, rather than building a walled garden around a proprietary platform.

 

3.      Solid

We have the skills, resources and experience in highly-scaled, high-performance infrastructure to deploy ping server services that will serve the blogosphere (and beyond) for the next stages of growth. As latency and accuracy become increasingly important issues for the blogosphere, weblogs.com will provide a reliable “dial-tone” for sending and receiving publishing signals on the Internet. Like other high quality infrastructure, we expect that over time pings and related services from VeriSign will become transparent – it just works, so often and so well that you won’t give it much thought in the future.

 

4.      Informative

I know from talking to Dave Winer that this was part of his vision, if not part of his current implementation, but we would like to make weblogs.com – the website – a useful destination for checking in on the infrastructure side of the blogosphere. We anticipate it being a handy place to check in for aggregated metrics: how many pings were processed today? How many feeds are active in the last week? How many different languages are being used for ping submission? There’s a great number of stats and measurements we can deliver that we’d find useful as members of the blogosphere. We think you will too.

What Happens Now?

Weblogs.com version 2.0 will be a significant improvement in performance and features, but will remain fully backwards compatible. If your publishing tools are configured to ping weblogs.com, you should not have to change anything. Everything will just continue to work, only faster, and all of the time. As we develop additional services, we’ll do our best to make sure they are easy and reliable to use in the tools of your choice.

 

As for additional services, there’s a wide variety that we’re looking at and working on right now, but I’ll focus on one that we’re committed to in the near term and that we believe is a compelling problem for the blogosphere in general: blog spam. If you read back through my previous posts a ways, it’s not hard to deduce that we spend a lot of time thinking about this problem.  I noted this morning, while searching for something on Technorati, that they are telling us that we can search more than 18 million blogs now. I believe that’s true, but only if we’re fairly charitable in what we’d call a blog.

 

We’ve just begun doing some analysis on just how many blogs out there are real – the work of real humans crafting posts – rather than simply splogs – web pages that are generated automatically by scripts and programs to look just like (or much like) real blogs, but serve only as a place to park keywords that will hopefully be found in a search, and advertisements that hopefully will be clicked on by humans who happen to somehow land on that page. In talking with us, Google has confirmed what our initial scan tells us: there are an enormous number of splogs out there, and the number is growing faster than the number of real blogs. By a good margin.

 

This problem is fraught with many of the same problems that plague the email world in its struggle against spam: Who is the source? What is the content about? Is it a copy? What does its distribution look like? Is this purely a solicitation? These are not easy questions, and a robust solution is not readily available. However, at the infrastructure level, very little is currently being done, and there are remedies that can be deployed that will provide significant, if not thorough relief. As a first “killer app” to deploy on top of weblogs.com ping services, we’d like to make progress in improving the “signal-to-noise ratio” in the blogosphere.  Does that mean censorship? No. As above, we’re committed to maintaining the integrity of the free and open ping stream, in all its wild and chaotic glory. But we believe that many will want to take advantage of filtering services – screen out the splogs based on a threshold value in the analysis – in much the same way that mail users see value in spam filters for their email inbox.
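To show what I mean by filtering on a threshold without touching the stream itself, here is a small sketch. It assumes some analysis has already assigned each ping a splog score between 0 and 1. The scoring function, the 0.8 default, and the names are all hypothetical, since how the score gets computed is the hard part and is not shown.

```python
def filter_pings(pings, splog_score, threshold=0.8):
    """Yield pings whose splog score is below the threshold.

    `pings` is any iterable of pings; it is read, never modified.
    `splog_score` is a callable returning a value in [0, 1], where
    higher means "more likely a splog".
    """
    for ping in pings:
        if splog_score(ping) < threshold:
            yield ping


if __name__ == "__main__":
    # Stand-in scores keyed by URL; a real service would compute these.
    scores = {"http://real.example/": 0.05, "http://splog.example/": 0.97}
    stream = [{"url": u} for u in scores]
    for ping in filter_pings(stream, lambda p: scores[p["url"]]):
        print(ping["url"])
```

The filter sits on the consumer's side of the stream. Each subscriber picks a threshold, or none at all, and the raw ping stream stays as it is.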

 

That’s a tough task, and one we won’t be able to make much progress on alone. We’re already working with a number of parties in the ecosystem on this subject, and believe that as part of a community effort, VeriSign can help lead the way to much better “signal” at the infrastructure layer of the blogosphere. Which will improve the user experience for everyone. Which is why we got involved in the first place.

Split

This will have long term effects on network infrastructure, and the Internet as a whole...

Got Stickers?

I didn’t attend Web 2.0 today, but was over at the Argent Hotel for a meeting at the hotel bar. While it wasn’t clear from talking to several who’d been in the conference all day what was emerging as salient new ideas from the presentations, it was clear from surveying the crowd at Jesters that the “hacker laptop” thing has become de rigueur with the Web 2.0 crowd. Either that or the hax0r set has started wearing Armani and Rolex. It was bad enough that I was hanging out there without a conference badge, but carrying around a naked ThinkPad…

All Your Result Are Belong To Us

Threadwatch has this today about Matt Cutts’ blog being googlewashed – virtually removed from results pages by the presence of a large number of duplicates of the same article on other sites. This is problematic for Google, and for all the search engines, because it stems from the difficulty not only of identifying duplicate items, but of determining what’s “original” and what’s a ripped-off (or even legitimate) duplicate.
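The two halves of that difficulty are worth separating. Spotting exact copies is the easier half, and a sketch of it fits in a few lines: normalize the text, hash it, and group URLs that share a hash. The function names and the normalization rules (collapse whitespace, lowercase) are my own assumptions, and this is not a description of how any search engine actually does it.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages):
    """Map fingerprint -> list of URLs, keeping only fingerprints seen more than once.

    `pages` is an iterable of (url, text) pairs.
    """
    groups = defaultdict(list)
    for url, text in pages:
        groups[fingerprint(text)].append(url)
    return {fp: urls for fp, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    pages = [
        ("http://original.example/post", "Some   article text."),
        ("http://copy.example/stolen", "some article text."),
        ("http://other.example/", "Something else entirely."),
    ]
    for urls in group_duplicates(pages).values():
        print(urls)
```

What the grouping cannot tell you is which URL in a group is the original. Nothing in the content itself says so, and that is exactly the gap being exploited.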

 

SEO black hats are having a little fun at Matt Cutts’ expense, but to prove a point: without a base mechanism for asserting one’s ownership over content published on the web, there’s currently no way to keep that content from being used against you to diminish your page rank. Previously, there was concern about copyright – how does one protect and enforce the author’s rights over web content? This concern was typically driven by fear of lost ad revenue on one’s origin server, and lost syndication revenue for content that was distributed through paid content networks.

 

That’s still a valid concern, but with the advent of googlewashing, the emerging problem is that your content can be used to make you invisible to the search engines, while simultaneously boosting the results ranking for bad guys who are ripping off your work. In the case of Matt Cutts’ blog, Matt’s not in the top results for his own content, but the folks at DarkSEO are.

October 05, 2005

Empowered Users, Powered by Users

Acquisitions like this signal the increasing momentum towards consolidation in the "Web 2.0 space" -- for lack of a better term. Upcoming.org is an example of the trend toward participatory software. Back in the day, when software was king, we talked a lot about "empowering users".  Now, successful companies don't just empower users, they are powered by users.

Memeorandum and the Demise of (Real) Trackbacks

Currently, the blogosphere is unable to defend itself against the onslaught of trackback spam. One of the most useful and under-utilized aspects of blogging is increasingly being disabled across the blogging landscape as popular blogs succumb to an avalanche of automated trackback links meant to boost someone else’s page/blog rank.

 

Several interesting things are happening behind the scenes that should provide some remedial therapy for the blogosphere in this area, but for now, several sites are stepping in to provide informal trackback functions to blogreaders. Memeorandum has become a nexus for blog readers of late not just because it keeps a “hot list” of current memes floating around the blogosphere, but because it is as close to a trackback function as we’re likely to get for now. For example, Tim O’Reilly has had to turn off trackbacks to his blog. Clearly, this week, his What is Web 2.0  post is generating a lot of discussion in the community. Before the ascendancy of trackback spam, a popular post like this would also become a “wiki” of sorts – a self assembling index of references to related posts and links.  O’Reilly’s post has a good long set of interesting comments by now, but how does one quickly find out who else has been blogging about this post? Memeorandum does the job for now.

 

An obvious shortcoming here is that while Memeorandum does provide a quick index to who’s talking about what in the blogosphere, it can only practically address the hottest subjects and the biggest names in a limited slice of the ecosystem. Memeorandum, for example, has two “sweet spots”: technology and news. Those are fairly broad categories, but you’re out of luck if your interests lie outside the scope of Memeorandum and similar services (string theory, say), or even if the topic in question never quite makes it “supernova” on the meme charts.

 

Real trackbacks are self-asserted links, and thus susceptible to being abused.  Memeorandum is (among other things) a form of mediated trackback system: They provide a cloud of “trackbacks” to related content for their hot topics. If real trackbacks are destined to be disabled indefinitely – if trust and authentication frameworks do not emerge to insulate the trackback feature from abuse – then eventually, I expect that mediated trackbacks – a compiled list of related links provided by a trusted third party – will become a significant opportunity for serving the blogosphere. To be really useful, a mediated trackback service would provide broad analysis of content links in the ecosystem, not just the hottest several dozen from the tech and news categories.
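Here is a rough sketch of the core of a mediated trackback service, under the assumption that the mediator crawls posts itself and records the links it actually finds, rather than accepting links that sites assert about themselves. The function name, the input shape, and the `is_blocked` check are hypothetical.

```python
from collections import defaultdict


def build_trackback_index(posts, is_blocked=lambda url: False):
    """Map each linked-to URL to the list of posts observed linking to it.

    `posts` is an iterable of (post_url, outbound_links) pairs gathered
    by the mediator's own crawl. `is_blocked` lets the mediator drop
    sources it has judged to be spam.
    """
    index = defaultdict(list)
    for post_url, links in posts:
        if is_blocked(post_url):
            continue
        for target in links:
            # Skip self-links and avoid listing the same post twice.
            if target != post_url and post_url not in index[target]:
                index[target].append(post_url)
    return dict(index)


if __name__ == "__main__":
    posts = [
        ("http://a.example/1", ["http://target.example/post"]),
        ("http://spam.example/x", ["http://target.example/post"]),
        ("http://b.example/2", ["http://target.example/post"]),
    ]
    index = build_trackback_index(posts, is_blocked=lambda u: "spam" in u)
    print(index["http://target.example/post"])
```

Because the index is built from every link the crawl sees, it covers any post that anyone links to, not just the hottest few dozen. The hard parts, crawl coverage and the spam judgment behind `is_blocked`, are left out.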

Disclaimer: Opinions expressed here and in any corresponding comments are the personal opinions of the original authors, not of VeriSign.
