
September 30, 2005

URL Anchors in Audio/Video?

This seems like a problem that must have been solved already: How does one create a URL that links to a specific frame/time in an audio or video stream? HTML pages are fine-grained; you can link right to a specific spot in the text with an anchor. But I can't find a way to make something like this work:

http://www.randomnews.com/2005/09/25/broadcast/audio/seg_04.mp3#00-02-21-03

This supposes something like a "virtual anchor" in the mp3 audio that would let the user navigate to a point 2 minutes 21.03 seconds into the stream. It's not really useful for indexes, search engines and other navigation tools to simply load up a 1 hour podcast or other media blob containing the reference you're looking for, then leave you to your own devices to find the point of interest.

It seems likely that Windows Media Player has execution parameters that allow you to specify the start point in the stream on launch, but my goal is to find a general syntactical solution via URLs to this problem, not a player-specific parameter. Has anyone seen this problem solved with URLs in an elegant way?
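To make the idea concrete, here is a minimal sketch of how a client might interpret the fragment in the example URL above. It assumes the fragment is laid out as hours-minutes-seconds-hundredths, which is only my reading of the example and not an existing standard, and the function name fragment_to_offset is made up for illustration.

```python
from datetime import timedelta
from urllib.parse import urlsplit


def fragment_to_offset(url):
    """Return the time offset encoded in the URL fragment, or None.

    Assumed (hypothetical) fragment layout: HH-MM-SS-hh, where hh is
    hundredths of a second.
    """
    fragment = urlsplit(url).fragment
    parts = fragment.split("-")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        return None
    hours, minutes, seconds, hundredths = (int(p) for p in parts)
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     milliseconds=hundredths * 10)


url = "http://www.randomnews.com/2005/09/25/broadcast/audio/seg_04.mp3#00-02-21-03"
offset = fragment_to_offset(url)
print(offset.total_seconds())  # 141.03
```

A player or proxy would still have to map that offset onto a position in the stream, which is the part that has no general answer yet.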

 

September 29, 2005

"Unclear Provenance"

Dave Winer and Doc Searls are talking about a mixup where Doc quotes Winer via a spam blog named Joape. Doc didn’t immediately recognize Dave Winer’s comments, which had been “repurposed” (as Dave charitably describes it) on a blog of “unclear provenance” (an equally charitable characterization from Doc). In a hurry that seems easy enough to do, but it poses a question and a problem to the blogosphere.

 

Dave is not worried about re-publishing of his ideas – at least in this case – but is simply asking for attribution. But even if the spam blog in question had bothered to provide the proper attribution and links to the original content, the real problem here would remain: how to avoid having legitimate content “re-purposed” for inclusion in splogs?

 

Currently most splogs are identified through textual and link analysis. The content of these pages is typically saturated with keywords, in the hope of being found and clicked through to from a search engine.

 

As Doc Searls suggests in the title of his post, this is evolving into a Turing Test of sorts. The real Turing Test was easier for the questioner, as the questioner could interact with the subject of the test, choose topics, and tailor subsequent questions dynamically based on previous responses. Systems that analyze blog pages to identify them as splogs don’t have that ability – it’s just a chunk of text, links and images to analyze.

 

Even so, companies like Google have proven until now that they are equal to the task. With fairly good precision (erring on the side of admitting marginal splogs into the system rather than risking excluding real blogs) they can determine by crawling the blog whether it is of dubious provenance or not. That is, based on the current content of most splogs. The Joape site above is an example that thwarts even the best algorithms written by Google or anyone else to detect splogs; it cheats on the Turing Test by including real human commentary on its pages that has been copied – re-purposed – from legitimate blogs. On some level, a splog is useless even as a splog if it doesn’t provide advertising links which can generate revenue. But it’s going to be difficult to distinguish between an automated splog and an authentic blog on that basis, since many authentic blogs have advertising links as well.

 

I’ll expand more on this over the next few weeks – this is an important issue for blogosphere infrastructure. For now, though, let me note how simple and effective a tactic like the one Joape is ostensibly relying on is. By grabbing full posts from Dave Winer and other legitimate bloggers, a spam blog can attract viewers, possibly more effectively than it previously could by keyword density and link games. The “content column” is all legitimate commentary lifted from other sources, and the “links column” is chock full of the advertising links that the splog owner is depending on for impression/click-through revenues. It looks just like an authentic blog. All the commentary is real, human-authored content. Current algorithms for splog detection won’t properly identify this kind of splog.

Spings

I know the term ping spam has garnered some currency already, but given the ascendancy of the term splog – short for ‘spam blog’ – over the last couple months, I think the term sping – short for ‘spam ping’ – might just win out, for consistency with splog if for no other reason. I created a wiki for sping yesterday.

Why Signed Pings Matter

We’ve been socializing the idea of digital signatures for pings for some time now, and have had to hone our arguments in the face of inevitable and understandable push-back we get in some quarters on this issue.  On the one hand, there’s a clear value proposition to publishers who already have an established, verifiable brand. Signed pings convey the integrity and authority of the publisher with each ping submitted. In cases where a signature can be mapped to an established SSL certificate, already in use by the publisher for ecommerce or cryptographic features on their website, little needs to be done to enable these publishers to assert their ownership over submitted content.

On the other end of the scale, though, are blog hosts like Blogger and LiveJournal that host millions of blogs under one domain. What good does a signed ping do for Blogger.com? Knowing it came from Google may be nominally useful, but that really wasn’t a part of the problem in the first place.

 

Signed pings from blog hosts are valuable because they enable pings to be delivered with extra metadata that can be used by syndicators and aggregators to separate the signal from the noise. For example, if a blog host signs a ping that carries with it the following tag:

 

<splog_filter link="89" content="63"/>

 

consumers of this ping can then apply local policy in processing this ping. Since the blog host signed the ping, they can be confident about the information supplied in the tag above (confident, that is, that the information is unchanged and really came from the blog host). If the consumer in question here has a policy of accepting pings that exceed scores of 50 on both the link and content analysis, this ping would be accepted for distribution. If the policy requires that the content analysis scores a 75 or better, this ping would be rejected.
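The local-policy step described above can be sketched in a few lines. This assumes the signature has already been verified, that the scores have been pulled out of the tag into a dictionary, and that a higher score means "more likely a real blog"; the function name accept_ping and the policy layout are illustrative only, not part of any ping specification.

```python
def accept_ping(scores, policy):
    """Accept a ping only if every score named in the policy meets its minimum.

    scores: values taken from the (already verified) splog_filter tag
    policy: the consumer's own minimum score per metric (hypothetical layout)
    A metric missing from the ping is treated as a score of 0.
    """
    return all(scores.get(name, 0) >= minimum
               for name, minimum in policy.items())


scores = {"link": 89, "content": 63}       # values from the example tag
lenient = {"link": 50, "content": 50}      # first policy in the text
strict = {"link": 50, "content": 75}       # second policy in the text

print(accept_ping(scores, lenient))  # True
print(accept_ping(scores, strict))   # False
```

The point of keeping the policy on the consumer's side is that the blog host only has to publish scores; each aggregator decides for itself where to draw the line.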

 

So the signature is important. If the splog_filter scores are going to drive processing policy for aggregators and other ping consumers, then there will be significant interest from the black hat crowd in “spoofing” these values for a submitted ping. Signing a ping based on a private key with a matching public key published as part of a DNS zone file makes this kind of spoofing impractical for the bad guys.

 

For “branded domains” that have editorial control over everything that is published from that domain – like washingtonpost.com, for example – the ability to sign a ping for submitted content provides immediate and tangible value; WaPo signed pings will be easily identified and trusted as sourced from the Washington Post, no matter where their pings get routed in the cloud. For service providers that host all manner of content from their domains, the signature of the service provider provides confidence in the integrity and source of the ping from the host, including supplied scores and metrics that can assist in separating authentic blogs from splogs.

 

Over time, additional authentication can be applied where appropriate to publishing domains. Publishers who use the Yahoo! platform to publish, for example, may see value in submitting to an additional authentication process whereby Yahoo! vets the identity and attributes of the publisher/blogger. In this case, the investment in this authentication can be leveraged by signed pings. If Yahoo! then asserts (with digital signatures) that pings for this publisher are now backed by Yahoo!’s authentication assertions, those pings can be confidently accorded the appropriate status in terms of trust by aggregators and consumers.


 

Signed Pings aren’t the solution, then, but an enabling technology that provides the basis for delivering any of a number of possible solutions and applications related to identity, authentication and trust in the blogosphere.

September 27, 2005

Blogger.com and Splog Filtering

One thing I’ve noticed is that Blogger.com seems to have a pretty good handle on which of their own blogs are in fact splogs – spam blogs. As I mentioned in a previous post, an analysis of the ping traffic that comes into Weblogs.com’s ping server indicates that not only do popular ping servers process a lot of pings from splogs, but the lion’s share of these splogs appear to come from Google’s Blogger.com. This makes perfect sense, as blogger.com is a well-known, free service that lends itself fairly well to automated blog creation and deployment.

[Image: blogger_next_button.JPG]

If you go to www.blogger.com and click the “Next Blog” button, you’re taken to a random blog in the blogger.com family. At the top of each blog is this familiar control bar:

 

Click “Next Blog” a few times, and see what you get. If you click through enough blogs, you will probably navigate to a splog or two, but notice how few and far between they are, if you get any splogs at all. Clearly, there is some filtering going on here. Blogger is performing some analysis on their blogs and suppressing the blogs it believes to be splogs. Also, there’s the “Flag” button, which according to Google plays a role in identifying splogs as well,  but at the rate at which splogs are currently being spawned at blogger.com, human flagging has to be just a small part of the equation.

 

This isn’t a complaint; having tried it out a bit, I’m impressed with the apparent efficiency of whatever heuristics Google is using to identify the splogs in their midst. Dave Winer and I had a conversation last week about what it would take to effectively (and quickly) identify splogs. It’s not an easy problem. It used to be that link analysis was a pretty strong indicator. In past months though, black hat SEO types have upped the ante. State-of-the-art splogs now mix in a healthy portion of links to legitimate sites, and now sport “imported” content for posts that makes them hard to easily identify as search engine spamming – posts aren’t just long lists of keywords anymore. Distinguishing real blogs from splogs is getting harder by the day, but apparently Google/Blogger are ahead of the curve, at least compared to everyone else.

Patchcasting

Colleague Billy Sylvester has an interesting idea for RSS feeds and pings – patchcasting. It’s not quite as media-hip as podcasting, but could be a nice new part of the blogosphere on its own. The idea is to host a “feed portal” where software publishers of all kinds could host feeds chronicling releases of their software. When a new update ships, a new entry is added, and the appropriate ping is generated and sent out to the ping server.

 

The updated entry information would be nominally human-readable – version and descriptive information about the release would be useful to interested readers – but for the most part these feeds would be best read and consumed by other machines. Users would be able to choose which software packages they want to monitor via patchcasting, and the operating system (eventually?) or a helper software agent would check for updates to the subscribed feeds, and react accordingly, depending on the policy set by the user. For critical OS updates from Microsoft with a valid trust assertion, say, new entries would be downloaded and installed automatically. For other packages, the pending patches, updates and upgrades would simply be presented for user action when the user checked in.
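Here is a rough sketch of what the core of such a helper agent might look like, under some loud assumptions: the feed is plain RSS 2.0, each release entry carries a guid, and the per-feed policy is just a string ("auto" or anything else). The feed contents, the XYZ Software URLs and the function names are all invented for illustration, and checking the trust assertion is left out entirely.

```python
import xml.etree.ElementTree as ET

# A made-up patchcast feed for a hypothetical package.
FEED = """<rss version="2.0"><channel>
  <title>XYZ Software updates</title>
  <item><title>XYZ 2.1.1</title><link>http://example.com/xyz/2.1.1</link><guid>xyz-2.1.1</guid></item>
  <item><title>XYZ 2.1.0</title><link>http://example.com/xyz/2.1.0</link><guid>xyz-2.1.0</guid></item>
</channel></rss>"""


def new_entries(feed_xml, seen_guids):
    """Return the feed items whose guid the agent has not seen before."""
    channel = ET.fromstring(feed_xml).find("channel")
    return [
        {"title": item.findtext("title"),
         "link": item.findtext("link"),
         "guid": item.findtext("guid")}
        for item in channel.findall("item")
        if item.findtext("guid") not in seen_guids
    ]


def plan_actions(entries, policy):
    """Map each new entry to an action according to the user's policy."""
    action = "download" if policy == "auto" else "notify"
    return [(action, entry["title"], entry["link"]) for entry in entries]


pending = new_entries(FEED, seen_guids={"xyz-2.1.0"})
print(plan_actions(pending, policy="notify"))
# [('notify', 'XYZ 2.1.1', 'http://example.com/xyz/2.1.1')]
```

A real agent would fetch the feed over HTTP on a schedule and remember the seen guids between runs; the decision logic itself stays about this small.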

 

One of the nice features of subscribing to feeds is that they are anonymous, or more precisely, unidirectional. I’m loath to sign up for email notification from software vendors regarding product updates and other information. At best, I get way more “update” information than I really need. At worst, my inbox begins to mysteriously fill up with unsolicited mail from any number of soliciting parties. With an RSS feed that provides product update information, I’m in control. I check in to see what’s new when I want, and only when I want. When I check in to see what’s new on the update feed for XYZ Software, they have no idea who I am – I’m unspammable, unmarketable in this setup.

 

VersionTracker has something a little like this. If you look here you can see they are spinning the updates for a wide variety of packages into an RSS feed, in this case, grouped by platform. What Billy Sylvester envisions is something much more granular, I think: one feed for a specific software package. For example, if you own Adobe Photoshop CS for Windows, you would want to monitor a feed Adobe maintains just for this product. You may also want to subscribe to the “Plugins” feed which streams information on available plugins and updates for them. In any case, the right granularity would be that which made managing policy for updates to the feeds easy and effective. For Adobe Photoshop CS, any new updates that appear in its feed might be downloaded automatically, available for quick installation the next time the user checks in.

 

Currently, there’s obviously no OS support for such a thing as patchcasting. It would take some time and proof that the idea is sound before software agents could be written to implement the policy aspects for the patchcasting user – auto-downloads, pop-up notifiers, etc. For now, though, patchcasting seems to be useful enough just as a function of your favorite newsreader. Instead of giving your favorite software vendors a convenient way to contact you whenever they’d like, subscribe to feeds for the specific packages you’re interested in. A new update will offer a link that will bring you to the appropriate download page, no email needed, thank you very much. If this succeeds at all, tools for patchcasting will surely follow.

September 26, 2005

Eventful as Non-Blog Feed Source

Eventful (aka EVDB.com) is a good example of both a useful network service and a source of ping/feed data that is not blog-oriented. I've subscribed to their feed for the Target Center, which has at least been useful for keeping me informed of local happenings that I'm too busy to enjoy, like the U2 concert there last Friday.

Comet: A blog tool for my mother-in-law?

SixApart is talking about Comet, their new blogging platform. Given the simplicity of the tools already out there, like TypePad.com and MovableType, also from SixApart, how much easier can Comet make things? Apparently Comet will let you do more, while maintaining ease of use. Digital media management gets a boost, as does "community management" for your blog.

September 25, 2005

Splog Filtering

A casual look at the front page over at Weblogs.com will show that the bad guys rule the battlefield currently in the struggle to combat splogs – spam blogs.

[Image: weblogs_screen.JPG]

Of the first 20 listed on the page (taken around 2:30pm Central on 9/25/05), all but six (#2, #3, #11, #12, #16 and #20) are splogs. So in this small sample 70% of the submitted pings have splogs behind them. We’ve monitored this issue over a period of time and a much broader sampling of data. On the whole, the average proportion of splog pings is much lower than is reflected here. But this sample does make the point: there’s a lot of noise in the channel already, and it’s growing much faster than the signal.

 

Splogs are one of the reasons we are getting into the ping server business. We support open infrastructure, so at the lowest level, we will process everyone’s pings, including the splogs.  There’s significant demand, however, among aggregators and consumers to provide filtering services that will identify and isolate splogs from the publishing stream. A host of different tools and techniques are being evaluated as a means to quickly and accurately distinguish splogs from “real” blogs. One of the running themes of this blog will be to chronicle our development and service efforts in this area. Eventually, we envision that consumers of our outbound ping stream will be able to configure our output to filter out splogs, given a “splog quotient” threshold, or some other equivalent set of metrics.

 

Also, if you look at the ads in the right hand side of the picture, note the amazing contextual awareness Google demonstrates in addressing me with their ads. Suck at golf? How’d they know?

 

Pings and Tokenized Payloads

One of the intriguing ideas offered by the RSS Ping proposal is the concept of “tokenized” content being submitted as part of a ping. The goal of this idea is to enable publishers to submit a full content payload as part of a ping message to a ping server, without having to worry about the content being propagated around the Internet, beyond the publisher’s control and ability to monetize. A full content ping would allow the ping server provider to analyze the post in situ – no need to invoke a harvesting agent to dereference the URIs supplied in a normal ping to retrieve the content. This doesn’t make much difference for a single post, but when you imagine that millions of pings every day might arrive at the ping server carrying the full content of the post, ready to analyze, the operational efficiencies over the conventional approach would be significant. For any search engine, there are often more resources expended in accessing and retrieving the source content than are spent on indexing and analyzing it, once it is in hand. Skipping the crawling/harvesting process represents a huge gain in the efficiency and performance of metadata extraction systems.

 

The RSS Ping approach, then, suggests that publishers might be willing to submit their full content with their pings, so long as the full content is altered in such a way that makes it virtually useless for unauthorized redistribution. RSS Ping proposes that stop words – words like “and”, “or”, “the”, “but” and “of” – be stripped from the full content, yielding a tokenized payload. Stripped of stop words, the tokenized payload wouldn’t be readable by humans, and would therefore be of little value for illicit redistribution. However, since stop words are ignored by search engines and crawlers that extract keywords and metadata, the tokenized payload should be just as useful for purposes of categorization and navigation.
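For a feel of what a tokenized payload looks like, here is a small sketch of the stripping step. The stop-word list is just the five examples named above (a real list would be much longer), and lowercasing and dropping punctuation are my own assumptions rather than anything the RSS Ping proposal specifies.

```python
import re

# Only the five examples from the text; real stop-word lists are far longer.
STOP_WORDS = {"and", "or", "the", "but", "of"}


def tokenize_payload(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [word for word in words if word not in stop_words]


post = "The goal of the proposal is simple, but the details are thorny."
print(" ".join(tokenize_payload(post)))
# goal proposal is simple details are thorny
```

With such a short stop-word list the output is still fairly readable, which hints at the trade-off: the longer the list, the less useful the payload is to a copier, and the more an indexer has to do without.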

 

Let me equivocate here when I say “just as useful” above; stop words are actually a thorny little issue for search engines, and the absence of stop words in the payload raises a host of questions about how advanced searches will be performed across such content. But setting that aside for now, the stop-word-stripping idea is an interesting one, as it gives aggregators and search engines nearly all they need to skip the crawling process and process the content directly upon submission of the ping. Typically, encryption has been suggested as the remedy for submitting full content to the cloud, a solution that is likely to cause far more problems than it solves.

 

September 24, 2005

Structured Blogging: Emerging

I spent some time this summer talking with Bob Wyman of PubSub.com about “structured blogging”. While we’ve come at it from different angles, we both see potential for some exciting innovations in the way posts might be structured in the future. Bob has a website and blog up on the subject at www.structuredblogging.org, although it looks like it’s been pretty quiet there lately (Bob’s got a lot of other things going on right now, I know).

 

The promise of structured blogging is two-fold; on the publishing side of things, structuring different kinds of posts (e.g. a movie review, a recipe, or a job listing) provides a means to intelligently apply meaningful style templates to different object types. The real power of structured blogging, though, is realized on the back end, in the aggregation and navigation of structured posts.  If all the movie reviews out there were published according to a simple XML schema, it would be trivial for aggregators to index meaningful elements of these posts (say, the overall rating for the movie, or its name).  Currently, brute force search can be used to find movie reviews in the blogosphere, but these are subject to the known problems of searching unstructured data – you get a lot of noise mixed in with the desired signal.
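To illustrate why indexing becomes trivial, here is a sketch that pulls the name and rating out of a couple of structured posts. The movie-review element and its fields are a schema I made up for this example; it is not the format used by structuredblogging.org or any of the tools mentioned here.

```python
import xml.etree.ElementTree as ET

# Two posts in a hypothetical movie-review schema, plus one unrelated post.
POSTS = [
    "<movie-review><name>Example Movie A</name><rating>4.0</rating>"
    "<body>Free-form commentary goes here.</body></movie-review>",
    "<movie-review><name>Example Movie B</name><rating>2.5</rating>"
    "<body>More free-form commentary.</body></movie-review>",
    "<recipe><name>Example Soup</name></recipe>",
]


def index_reviews(posts):
    """Build a name -> rating index from the posts that are movie reviews."""
    index = {}
    for xml_text in posts:
        root = ET.fromstring(xml_text)
        if root.tag != "movie-review":
            continue  # some other kind of structured post
        index[root.findtext("name")] = float(root.findtext("rating"))
    return index


print(index_reviews(POSTS))
# {'Example Movie A': 4.0, 'Example Movie B': 2.5}
```

Compare that with trying to guess from free text whether a post is a movie review at all, let alone what rating it gave.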

 

I’ve had the opportunity to preview a couple different tools under development in this area in the last few weeks, and I’m impressed. The beauty of this kind of technology is that it is so transparent to the user, and so powerful on the back end. For the user of a structured blogging tool, the posting process is simple and intuitive. For a movie review, instead of starting with just a text edit box, a “Movie Review” form is selected that combines fields for the structured elements – name, rating, date, reviewer, etc. – with fields that contain the free-form commentary for the review.

 

In a subsequent post, I’ll see if I can get permission to show a couple screenshots of the tool that will illustrate the “friendliness” of this technology. For now, it’s got me thinking a lot about what the infrastructure on the back end of this technology would look like in order to leverage the capabilities of these new tools. Stay tuned.

Tagging Posts with Rojo

I’ve spent some time using the Rojo service lately. I haven’t invested the time to configure it for what it’s meant for – social networking as blog-reading aid – but am planning on it. However, I have been using a handy feature they offer: the ability to quickly tag a post. In “expanded” mode, Rojo presents a small type-in field that I can use to tag that post. I’ve been using this feature to label posts I’d like to come back to, and I’ve found it quite useful. It’s quicker than adding bookmarks and keeping them organized. Better, I can share ‘em.

I know Technorati has addressed both author-assigned tagging (via Flickr and Buzznet) and reader-assigned tagging (via del.icio.us and Furl), but it’s just so much simpler with Rojo. Type in the tag, hit enter and you’re done. Cool.

Disclaimer: Opinions expressed here and in any corresponding comments are the personal opinions of the original authors, not of VeriSign.
