One of the intriguing ideas offered by the RSS Ping proposal is the concept of “tokenized” content being submitted as part of a ping. The goal of this idea is to enable publishers to submit a full content payload as part of a ping message to a ping server, without having to worry about the content being propagated around the Internet, beyond the publisher’s control and ability to monetize. A full content ping would allow the ping server provider to analyze the post in situ, with no need to invoke a harvesting agent to dereference the URIs supplied in a normal ping and retrieve the content. This doesn’t make much difference for a single post, but when you imagine that millions of pings might arrive at the ping server every day, each carrying the full content of the post, ready to analyze, the operational efficiencies over the conventional approach would be significant. For any search engine, there are often more resources expended in accessing and retrieving the source content than in indexing and analyzing it once it is in hand. Skipping the crawling/harvesting process represents a huge gain in the efficiency and performance of metadata extraction systems.
The RSS Ping approach, then, suggests that publishers might be willing to submit their full content with their pings, so long as the full content is altered in a way that makes it virtually useless for unauthorized redistribution. RSS Ping proposes that stop words (words like “and”, “or”, “the”, “but” and “of”) be stripped from the full content, yielding a tokenized payload. Stripped of stop words, the tokenized payload wouldn’t be readable by humans, and would therefore be of little value for illicit redistribution. However, since stop words are ignored by search engines and crawlers that extract keywords and metadata, the tokenized payload should be just as useful for purposes of categorization and navigation.
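To make the stripping step concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not the proposal's actual algorithm: the function name `tokenize_payload`, the lowercasing and punctuation handling, and the stop-word list (the five examples above plus a few common extras) are all hypothetical choices.

```python
import re

# Hypothetical stop-word list: the five examples from the text plus a few
# common extras. A real ping service would have to agree on its own list.
STOP_WORDS = frozenset({"and", "or", "the", "but", "of", "a", "an", "in", "to", "is"})


def tokenize_payload(text, stop_words=STOP_WORDS):
    """Return the text lowercased, with punctuation and stop words removed."""
    # Pull out runs of letters, digits and straight apostrophes as words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Keep every word that is not on the stop-word list, in original order.
    return " ".join(w for w in words if w not in stop_words)


post = "The goal of the proposal is to strip stop words, but keep the keywords."
print(tokenize_payload(post))
# prints: goal proposal strip stop words keep keywords
```

The keywords survive in their original order, which is what a categorization engine cares about, while the connective tissue that makes the sentence readable is gone.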
Let me hedge on the “just as useful” claim above: stop words are actually a thorny little issue for search engines, and their absence from the payload raises a host of questions about how advanced searches will be performed across such content. But setting that aside for now, the stop-word-stripping idea is an interesting one, as it gives aggregators and search engines nearly all they need to skip the crawling process and work on the content directly upon submission of the ping. Typically, encryption has been suggested as the remedy for submitting full content to the cloud, a solution that is likely to cause far more problems than it solves.
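As one example of the advanced-search questions mentioned above, consider exact-phrase matching. The sketch below is my own illustration, not part of the RSS Ping proposal; the stop-word list (which deliberately includes “to”, “be” and “not”) and both helper functions are hypothetical.

```python
# Hypothetical stop-word list; real lists vary from engine to engine.
STOP_WORDS = frozenset({"and", "or", "the", "but", "of", "to", "be", "not"})


def strip_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop stop words, returning a list of tokens."""
    return [w for w in text.lower().split() if w not in stop_words]


def phrase_in_tokens(phrase, tokens):
    """Return True if the phrase's words appear contiguously in tokens."""
    query = phrase.lower().split()
    n = len(query)
    return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))


original = "to be or not to be that is the question"
payload = strip_stop_words(original)

print(payload)                                               # ['that', 'is', 'question']
print(phrase_in_tokens("to be or not to be", payload))          # False
print(phrase_in_tokens("to be or not to be", original.split())) # True
```

A phrase made up entirely of stop words can be found in the original post but not in the tokenized payload, so a search engine working only from the ping could never answer that query.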