If you want to include a copyright symbol in your node content, you just type in &copy; and everything works dandily. However, you can't just have &copy; in an RSS feed, because &copy; doesn't exist as an entity within the XML doctype. There are three solutions:

  1. Import HTML entities into the XML file.
  2. Encode the entities as &amp;copy;.
  3. Turn entities into numeric equivalents, like &#169;.
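
As a rough illustration of options #2 and #3, here is a minimal Python sketch (not Drupal's actual code; the function names double_encode and to_numeric are made up for this example). It assumes the input is a text string that may contain named HTML entities.

```python
import re
from html.entities import name2codepoint
from xml.sax.saxutils import escape

# Matches named entity references such as &copy; or &nbsp;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# The five entities that XML itself predefines; these are safe as-is.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def double_encode(text):
    """Option #2: escape the markup characters, so &copy; becomes &amp;copy;."""
    return escape(text)


def to_numeric(text):
    """Option #3: swap named HTML entities for numeric character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        codepoint = name2codepoint.get(name)
        # Unknown names are left untouched rather than guessed at.
        return match.group(0) if codepoint is None else "&#%d;" % codepoint
    return ENTITY_RE.sub(replace, text)
```

With these definitions, double_encode("&copy; 2005") gives "&amp;copy; 2005" and to_numeric("&copy; 2005") gives "&#169; 2005". Neither result depends on a DOCTYPE being present in the feed.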

In node.module, we're currently doing option #1; that is, we're including <!DOCTYPE rss [<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">]> at the top of our RSS feeds. This is perfectly valid.

However, importing other doctypes is an optional feature of XML. It is up to the XML consumer, the parser, to decide whether it wants to support that or not. If it doesn't, the HTML entities aren't imported and, in some cases, the XML document is considered invalid. I'm unclear on whether this happens all the time, or only if the parser attempts to validate the document (as opposed to just accepting it as "well-formed"). Regardless, the DOCTYPE that node.module is adding is causing some readers to fail. This is no one's fault: the XML standards say we can include the DOCTYPE as a producer, but they also say that the consumer doesn't need to support it.
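
To make the consumer side concrete, here is a small Python sketch using the standard library's ElementTree as one example of a parser that has no HTML entity definitions available. The parse_title helper and the tiny feed skeleton are made up for this example, and other feed readers may well behave differently.

```python
import xml.etree.ElementTree as ET


def parse_title(body):
    """Parse a tiny RSS-like document and return its title text.

    Returns None if the parser rejects the document.
    """
    xml = "<rss><channel><title>%s</title></channel></rss>" % body
    try:
        return ET.fromstring(xml).findtext("channel/title")
    except ET.ParseError:
        return None


# A bare named entity is undefined without a DTD, so parsing fails.
print(parse_title("&copy; 2005"))
# A numeric character reference needs no declaration.
print(parse_title("&#169; 2005"))
# A double-encoded entity parses fine and yields the literal text "&copy; 2005".
print(parse_title("&amp;copy; 2005"))
```

The first call returns None because the parser reports an undefined entity. The other two succeed, and a feed reader that renders the title as HTML would turn the decoded "&copy;" back into the copyright symbol.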

The authoritative source on RSS validation, FeedValidator.org, says:

"Support for the HTML and XHTML doctypes is now widespread - in browsers. While the Feed Validator will validate feeds which make use of DTDs specifically defined for use with Atom or RSS, the support for such advanced - and optional - XML features is not widespread in feed readers. As such, this approach is not recommended."

It turns out, however, that Drupal is also doing option #2 above - if you add &copy; to a node, or any of the other HTML entities, it is "double-encoded" in the RSS feed to &amp;copy;. The attached patch removes the DOCTYPE (#1) entirely and just does double-encoding (#2). While having both doesn't harm the feed at all, losing #1 will give more parsers, namely those that have chosen not to support DOCTYPE, access to our data.

I'm setting this to "ready for commit", as this is less about coding, and more about knowledge of external standards.

Attached file: _rssdoctype.patch (992 bytes), uploaded by Morbus Iff

Comments

Dries:

I guess this needs fixing in aggregator.module too?

Contributed modules:

$ grep -r DTD * | grep rss | awk -F : {'print $1'} | uniq | sort
commentrss/commentrss.module
cvslog/cvs.module
event/event.module
jobsearch/patches/node.module-4.6.2
jobsearch/patches/node.module-4.6.3
news_page/news_page.module
project/issue.inc
task/task.module

Dries:

Committed to HEAD. Thanks.

Morbus Iff:

Status: Reviewed & tested by the community » Fixed

Anonymous:

Status: Fixed » Closed (fixed)