
"Unclear Provenance"

Created: 29 Sep 2005
Tim Callan

Dave Winer and Doc Searls are talking about a mixup in which Doc quoted Winer via a spam blog named Joape. Doc didn’t immediately recognize Dave Winer’s comments, which had been “repurposed” (as Dave charitably describes it) on a blog of “unclear provenance” (an equally charitable characterization from Doc). In a hurry, that seems an easy enough mistake to make, but it poses a question and a problem to the blogosphere.

 

Dave is not worried about re-publishing of his ideas – at least in this case – but is simply asking for attribution. But even if the spam blog in question had bothered to provide proper attribution and links to the original content, the real problem here would remain: how to avoid having legitimate content “re-purposed” for inclusion in splogs?

 

Currently most splogs are identified through textual and link analysis. The content of these pages is typically saturated with keywords, in the hope of being found through a search engine and clicked through to.
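To make the "saturated with keywords" idea concrete, here is a minimal sketch of a keyword-density check in Python. It is only an illustration, not how Google or any real splog detector works: the function names, the tiny stopword list, and the 0.5 threshold are all assumptions made up for this example, and it ignores link analysis entirely.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "and", "the", "to", "in", "of", "is", "not",
             "about", "his", "but", "for"}

def keyword_density(text, top_n=3):
    """Share of non-stopword words taken up by the top_n most frequent words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    if not words:
        return 0.0
    counts = Counter(words)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(words)

def looks_keyword_stuffed(text, threshold=0.5):
    """Flag text whose vocabulary is dominated by a few repeated keywords.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return keyword_density(text) >= threshold

stuffed = "cheap loans cheap loans best cheap loans online cheap loans now"
natural = ("Dave is not worried about re-publishing of his ideas "
           "but is simply asking for attribution")

print(looks_keyword_stuffed(stuffed))  # True
print(looks_keyword_stuffed(natural))  # False
```

A check like this only looks at how words are distributed, so very short texts can score high by accident, and it says nothing about who actually wrote the words.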

 

As Doc Searls suggests in the title of his post, this is evolving into a Turing Test of sorts. The real Turing Test was easier for the questioner, as the questioner could interact with the subject of the test, choose topics, and tailor subsequent questions dynamically based on previous responses. Systems that analyze blog pages to identify them as splogs don’t have that ability – they have just a chunk of text, links and images to analyze.

 

Even so, companies like Google have so far proven equal to the task. With fairly good precision (erring on the side of admitting marginal splogs into the system rather than risking excluding real blogs) they can determine by crawling a blog whether it is of dubious provenance or not. That holds, at least, for the current content of most splogs. The Joape site above is an example that thwarts even the best splog-detection algorithms written by Google or anyone else: it cheats on the Turing Test by including real human commentary on its pages that has been copied – re-purposed – from legitimate blogs. On some level, a splog is useless even as a splog if it doesn’t provide advertising links which can generate revenue. But the presence of advertising links won’t make it easy to distinguish an automated splog from an authentic blog, since many authentic blogs carry advertising links as well.

 

I’ll expand more on this over the next few weeks – this is an important issue for blogosphere infrastructure. For now, though, let me note how simple and effective the tactic Joape is ostensibly relying on can be. By grabbing full posts from Dave Winer and other legitimate bloggers, a spam blog can attract viewers, possibly more effectively than it previously could through keyword density and link games. The “content column” is all legitimate commentary lifted from other sources, and the “links column” is chock full of the advertising links that the splog owner is depending on for impression/click-through revenues. It looks just like an authentic blog. All the commentary is real, human-authored content. Current algorithms for splog detection won’t properly identify this kind of splog.
