DOCTYPE in RSS is causing some parsers to fail [#41165]

If you want to include a copyright symbol in your node content, you just type in &copy; and everything works dandily. However, you can't just have &copy; in an RSS feed, because &copy; doesn't exist as an entity within the XML doctype. There are three solutions:

1. Import HTML entities into the XML file.
2. Encode the entities as &amp;copy;.
3. Turn entities into numeric equivalents, like &#169;.
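
To make the third option concrete, here is a rough Python sketch (Python is only a stand-in, this is not Drupal code) that rewrites named HTML entities as numeric references and checks which forms parse as XML when no DTD is present. The helper names to_numeric_entities and parses are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def to_numeric_entities(text):
    """Replace named HTML entities with numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML-native or unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

def parses(fragment):
    """Return True if the fragment is well-formed XML without any DTD."""
    try:
        ET.fromstring("<description>%s</description>" % fragment)
        return True
    except ET.ParseError:
        return False

print(parses("&copy;"))                       # bare HTML entity
print(parses("&amp;copy;"))                   # option 2
print(parses(to_numeric_entities("&copy;")))  # option 3
```

With Python's bundled parser, only the bare entity should be rejected; the escaped and numeric forms need nothing beyond plain XML.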

In node.module, we're currently doing option #1; that is, we're including <!DOCTYPE rss [<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">]> at the top of our RSS feeds. This is perfectly valid.

However, importing other doctypes is an optional feature of XML. It is up to the XML consumer, the parser, to decide whether it wants to support that or not. If it doesn't, the HTML entities aren't imported and, in some cases, the XML document is considered invalid. I'm unclear on whether this happens all the time, or only if the parser attempts to validate the document (as opposed to just accepting it as "well-formed"). Regardless, the DOCTYPE that node.module is adding is causing some readers to fail. This is no one's fault: the XML standards say we can include the DOCTYPE as a producer, but they also say that the consumer doesn't need to support it.
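
As a quick illustration of a consumer that does not import the entities, here is a small Python sketch. It assumes Python's standard xml.etree.ElementTree parser, which does not fetch external entity files; the feed content and the try_parse helper are invented for the example, and only the DOCTYPE string is taken from this issue.

```python
import xml.etree.ElementTree as ET

# The DOCTYPE quoted in this issue, followed by a made-up minimal feed.
DOCTYPE = (
    '<!DOCTYPE rss [<!ENTITY % HTMLlat1 PUBLIC '
    '"-//W3C//ENTITIES Latin 1 for XHTML//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">]>'
)
FEED = DOCTYPE + "<rss><channel><title>&copy; Example</title></channel></rss>"

def try_parse(xml_text):
    """Return the channel title, or the parser's error message on failure."""
    try:
        return ET.fromstring(xml_text).findtext("channel/title")
    except ET.ParseError as err:
        return "parse error: %s" % err

# The parser never loads xhtml-lat1.ent, so &copy; stays undefined.
print(try_parse(FEED))
```

Other parsers may behave differently (some do fetch the entity file), which is exactly the inconsistency between readers described above.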

The authoritative source on RSS validation, FeedValidator.org, says:

Support for the HTML and XHTML doctypes is now widespread — in browsers. While the Feed Validator will validate feeds which make use of DTDs specifically defined for use with Atom or RSS, the support for such advanced — and optional — XML features is not widespread in feed readers. As such, this approach is not recommended.

It turns out, however, that Drupal is also doing option #2 above - if you add &copy; to a node, or any of the other HTML entities, it is "double-encoded" in the RSS feed to &amp;copy;. The attached patch removes the DOCTYPE (#1) entirely and just does double-encoding (#2). Having both doesn't harm the feed at all, but losing #1 will give more parsers, namely those that have chosen not to support the DOCTYPE, access to our data.
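
For anyone unsure why double-encoding is enough on its own, here is a rough round trip in Python. It is a sketch under assumptions, not Drupal's actual PHP code: build_item is a hypothetical helper, and html.unescape stands in for whatever HTML rendering the feed reader does.

```python
import html
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def build_item(node_body):
    """Wrap node content in an RSS <description>, escaping it for XML (option #2)."""
    return "<item><description>%s</description></item>" % escape(node_body)

node_body = "&copy; 2005 Example"   # what the author typed into the node
item_xml = build_item(node_body)    # the '&' becomes '&amp;' in the feed

# A plain XML parser undoes the XML layer of escaping, no DOCTYPE needed.
feed_text = ET.fromstring(item_xml).findtext("description")

# The reader then treats the text as HTML and undoes the HTML layer.
displayed = html.unescape(feed_text)

print(item_xml)
print(feed_text)
print(displayed)
```

The XML layer only ever sees &amp;, which every XML parser understands, so the HTML entity survives untouched until something that speaks HTML decodes it.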

I'm setting this to "ready for commit", as this is less about coding, and more about knowledge of external standards.

Comment	File	Size	Author
	_rssdoctype.patch	992 bytes	morbus iff

Comments

Comment #1

dries commented 14 December 2005 at 18:07

I guess this needs fixing in aggregator.module too?

Contributed modules:

$ grep -r DTD * | grep rss | awk -F : {'print $1'} | uniq | sort
commentrss/commentrss.module
cvslog/cvs.module
event/event.module
jobsearch/patches/node.module-4.6.2
jobsearch/patches/node.module-4.6.3
news_page/news_page.module
project/issue.inc
task/task.module