Hi folks, I'm having a bit of a problem this morning and was hoping for some help or at least some advice. Last night I began getting errors parsing an RSS news feed. It began with no changes to my site, and the news hadn't even been updated on the remote site I was pulling from. But all of a sudden I began getting this error:

/?q=admin/aggregator/update/6

Aggregator: failed to parse RSS feed PCLinuxOnline: not well-formed (invalid token) at line 38.

I'm still able to receive the same feed using the same backend in kontact so I figured it must be me/my site. But after deleting it and reconfiguring it with no luck, I deleted the whole table from the database. After rebuilding my feeds from scratch, I'm still getting that error. Other feeds function normally.

So, I'm wondering if y'all reckon that's an error on my end or theirs?

And if it's on my end, how might I correct this issue?

thanks so much,
susan

Comments

Greg Delisle’s picture

I've seen this happen when the feed being read contains a raw character like é or ü that hasn't been escaped as an entity or encoded properly. Many parsers handle these deftly, or at least fail gracefully, but apparently not Drupal's parser, which barfs on the entire feed.

The only way I've found to correct it is to wait until the feed no longer contains that character and then it'll start working.

I'm hoping that the 4.6 upgrades to the aggregator module introduce something to fix this issue, though technically it's not a bug because the error isn't in the parser, it's in the feed which in order to conform to spec shouldn't have these characters. But it would be nice to see Drupal's aggregator handle these and other parsing errors with more grace. If it's not fixed in the upgrade, it should be made an issue or feature request.

Steven’s picture

RSS feeds and XML in general can contain Unicode characters perfectly well, provided they are /encoded/ correctly. If you use a Unicode encoding like UTF-8, no entity escaping is required except for the special markup characters (< > & and, inside attribute values, ").

The Drupal Talk feed on this site in fact often contains non-English posts, which are aggregated correctly. (Note, however, that this particular feed is generated by Feedster, which messes up Unicode characters in titles. I've mailed them about it 4 or 5 times already and they still haven't fixed it properly. I gave up.)

Looking at the PCLinuxOnline feed, it has no encoding attribute or byte order mark in it, which means (according to the XML spec) that it must be interpreted as UTF-8. The main site however uses the ISO-8859-1 encoding, and they're probably not doing any conversion. Looking at it in Firefox shows that the "Extreme Makeover" article has a bad character in it.

In short: mail the site, get them to fix their feed. It should be as simple as adding encoding="ISO-8859-1" to the <?xml ?> prolog. We are parsing those feeds 100% according to specs. Drupal can handle Unicode characters just fine.
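To see why the missing declaration matters, here is a minimal sketch in PHP. It assumes the mbstring extension is available; the sample title and the helper name feed_text_to_utf8() are made up for illustration and are not part of Drupal. A byte that is a perfectly good é in ISO-8859-1 is not valid UTF-8 on its own, so a parser that has been told nothing about the encoding has little choice but to reject it:

```php
<?php
// Hypothetical helper, not part of Drupal: reinterpret text that was really
// written in $declared_encoding and return it as UTF-8. Assumes ext/mbstring.
function feed_text_to_utf8($text, $declared_encoding) {
  return mb_convert_encoding($text, 'UTF-8', $declared_encoding);
}

// Made-up sample: 0xE9 is "é" in ISO-8859-1, but a lone 0xE9 byte is not a
// complete UTF-8 sequence.
$title = "Extr\xE9me Makeover";

// Read as UTF-8 (the XML default when nothing is declared): not valid.
var_dump(mb_check_encoding($title, 'UTF-8'));

// Read as ISO-8859-1 and converted: valid UTF-8, with é stored as C3 A9.
$converted = feed_text_to_utf8($title, 'ISO-8859-1');
var_dump(mb_check_encoding($converted, 'UTF-8'));
```

Declaring the encoding in the prolog gives the parser the same information that the second argument gives this helper.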

--
If you have a problem, please search before posting a question.

srlinuxx’s picture

Wow, wonderful answers. You guys are so smart. Thanks so much.

_____
--You talk the talk, but do you waddle the waddle?

Greg Delisle’s picture

I understand that the aggregator is parsing according to the spec, it would just be nice if there was a bit more tolerance in the way it handled the error. As you point out, the error is on the part of the feed publisher, but asking them to correct it is usually not productive.

As it is, whenever one of my feeds drops a bad character, that feed goes offline for my users. It would be a better user experience if, say, a gremlin were displayed instead of that character, or perhaps if that item were dropped, at least the users could get part of the information instead of nothing. While it's true that a malformed XML document is a malformed XML document even if it's only one character, it would be smarter if the Drupal aggregator didn't just give up -- ideally it's machines talking to machines so there shouldn't be errors, but in the real world there are errors and all we can choose is how we behave when there is an error.

It might be that the Drupal aggregator is relying entirely on the behavior of the PHP XML parser and so we don't have any control over this. But if there is the possibility of control, I think we should have softer error responses because they provide for a better UI.
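To make the "drop the bad item" idea concrete, here is a rough sketch in PHP. It is not how the aggregator works; keep_valid_items() is a hypothetical name, the items are assumed to be already split into plain strings (which the real module would still have to manage somehow), and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper, not part of Drupal's aggregator: keep only the items
// whose bytes are valid UTF-8, so one bad character costs one item instead of
// the whole feed. Assumes ext/mbstring.
function keep_valid_items(array $items) {
  $kept = array();
  foreach ($items as $item) {
    if (mb_check_encoding($item, 'UTF-8')) {
      $kept[] = $item;
    }
  }
  return $kept;
}

// Made-up sample data: the second title contains a stray ISO-8859-1 byte.
$items = array('First story', "Caf\xE9 opens downtown", 'Third story');
print_r(keep_valid_items($items));
```

With this sample, the first and third items survive and only the one with the bad byte is lost.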

Steven’s picture

UTF-8 validation can be done, but it would only be useful for languages written in Latin script with few accents or other non-ASCII letters. Otherwise so much text will be lost that the result will be unreadable. Fixing the error on their site is a whopping 5 minutes of work and involves adding the right encoding="..." attribute to their feed. Nothing more.

If you really want to validate UTF-8, you can use this snippet which I wrote. You could invoke it in drupal_xml_parser_create().

function validate_utf8($text) {
  // U+FFFD REPLACEMENT CHARACTER, written as raw UTF-8 bytes so it does not
  // depend on the encoding this file is saved in.
  $bad = "\xEF\xBF\xBD";

  // First pass: find bad sequences
  $text = preg_replace('@(?:'.
    // Invalid bytes
    '[\xFE-\xFF]|'.
    // Ranges not part of Unicode spec
    '\xF4[\x90-\xBF][\x80-\xBF]{2}|'.  //  U+110000 -   U+1FFFFF
    '[\xF5-\xF7][\x80-\xBF]{3}|'.
    '[\xF8-\xFB][\x80-\xBF]{0,4}|'.    //  U+200000 -  U+3FFFFFF
    '[\xFC-\xFD][\x80-\xBF]{0,5}|'.    // U+4000000 - U+7FFFFFFF
    // Unnecessarily long sequences
    '[\xC0-\xC1][\x80-\xBF]|'.
    '\xE0[\x80-\x9F][\x80-\xBF]|'.
    '\xF0[\x80-\x8F][\x80-\xBF]{2}|'.
    // Sequences that are too short
    '[\xC0-\xF7](?![\x80-\xBF])|'.
    '[\xE0-\xF7][\x80-\xBF](?![\x80-\xBF])|'.
    '[\xF0-\xF7][\x80-\xBF]{2}(?![\x80-\xBF])|'.
    // Invalid characters
    '\xED[\xA0-\xBF][\x80-\xBF]|'.     // UTF-16 surrogates
    '\xEF\xBF[\xBE-\xBF]'.             // U+FFFE / U+FFFF
    ')@', $bad, $text);

  // Second pass: clean up invalid continuation bytes
  $text = preg_replace_callback('@'.
     // Any complete character sequence: a lead byte plus exactly the number
     // of continuation bytes it calls for (0, 1, 2 or 3)
     '(^|[\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|'.
     '[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})'.
     // ...followed directly by continuation bytes.
     '([\x80-\xBF]+)@', '_validate_utf8', $text);

  return $text;
}

function _validate_utf8($matches) {
  // Keep the valid character and replace each stray continuation byte.
  return $matches[1] . str_repeat("\xEF\xBF\xBD", strlen($matches[2]));
}


Greg Delisle’s picture

I confess that this is technically over my head and I'm not 100% sure what this code is meant to do. Is it attempting to replace "bad" characters with gremlins, or remove them altogether? That's really nice. If that's the case, would you consider contributing it as an issue/patch to the 4.6 aggregator module?

I was trying to raise my point from the non-technical user's perspective, that is, my site's readers don't give a fig whether the charset is specified correctly, and the NYT (or whoever) doesn't give a fig whether I say their feed is set wrong. All anyone cares about is getting the news from NYT to my readers, and one bad character shouldn't be enough to block the flow of that information, it's not fault-tolerant enough. Sixteen articles with one missing accented e is a better offering for my readers than zero articles and a red line in my watchdog.

Even if it doesn't make it into the aggregator module, this code is really helpful. Thanks a lot!

mahtin’s picture

With Drupal 4.6 and PHP 5.0.4 I'm still not getting this feed read into the news aggregator ...

http://msdn.microsoft.com/rss.xml

...which is strange. The start of the data is 0xef 0xbb 0xbf followed by the standard <?xml ...> and drupal_xml_parser_create() and drupal_convert_to_utf8() don't zap it! I get "Failed to parse RSS feed MSDN Just Published: Empty document at line 1." in the logs.

What's up here?

Steven’s picture

Those three bytes are the UTF-8 byte order mark. As far as I know, the PHP XML parser should handle that, but I could be mistaken (the one in PHP4 completely ignores encodings).

I'll do some tests later to see if this is a general problem.
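In the meantime, a workaround to try is stripping the mark before the data reaches the parser. This is only a sketch: strip_utf8_bom() is a hypothetical helper, not something in Drupal, and whether the parser should need this at all is exactly the open question above.

```php
<?php
// Hypothetical helper, not part of Drupal: remove a leading UTF-8 byte order
// mark (EF BB BF) so the data starts directly with the XML declaration.
function strip_utf8_bom($data) {
  if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
    return substr($data, 3);
  }
  return $data;
}

// Made-up sample: a tiny feed that starts with a BOM.
$feed = "\xEF\xBB\xBF<?xml version=\"1.0\"?><rss></rss>";
echo strip_utf8_bom($feed), "\n";
```

Data without the mark is returned untouched, so it should be safe to run on every feed.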


jasonhendry’s picture

I get the same error using the RSS 2.0 feed from Brucey's weblog.

http://www.schneier.com/blog/index.xml

"Failed to parse RSS feed Bruce Schneier and Security: Empty document at line 1."

Did I miss something in the setup of the aggregator? php_xmlrpc is not enabled in PHP5 on my site. This is Drupal 4.6.3, by the way.

I can see the XML in Firefox. Can anyone suggest why Drupal is reporting an empty document?

thanks,
loungeroom.

jasonhendry’s picture

I had the http:// protocol prefix on the URL input box associated with the feed.

As soon as I removed that, I got a new error,

Failed to parse RSS feed Schneier on Security: invalid schema .

I tried the Feed Validator http://feedvalidator.org/ and it complained a little about text representation but still passed conformance checks.

well, that's one step closer.

loungeroom.

sangamreddi’s picture

Hi,

I opened up the aggregator module, found drupal_xml_parser_create, and copied in the script you provided, but it's not working. How do I use this snippet? Am I going about this the right way? Please help.

Thanks in Advance

Sunny
www.gleez.com

stevryn’s picture

I have scoured these forums and can find nothing. I tried to use the aggregator to put an RSS feed block for Slashdot on our site, and I get this error:
Failed to parse RSS feed Slashdot:202 Accepted

using drupal 4.6.1

Any ideas here?

Webcreature’s picture

I had the same problem. It seems to be an oversight in common.inc.

See the code for format_rss_item() (Drupal 4.6.5), line 689:

$output .= ' <'. $key .'>'. check_plain($value) ."</$key>\n";

should be

if (!is_numeric($key)) {
  $output .= ' <'. $key .'>'. check_plain($value) ."</$key>\n";
}

If $key is 0 you will get the error "invalid qualifier in XML syntax" ...
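Here is a small, self-contained sketch of the check so it can be tried outside Drupal. It is a simplified stand-in, not the real format_rss_item(): format_extra_elements() is a made-up name, the sample data is invented, and htmlspecialchars() is used in place of check_plain() so it runs on its own.

```php
<?php
// Simplified stand-in for the loop discussed above, not the real
// format_rss_item(). htmlspecialchars() replaces Drupal's check_plain() here
// so the sketch has no dependencies.
function format_extra_elements(array $args) {
  $output = '';
  foreach ($args as $key => $value) {
    // A numeric key such as 0 would produce "<0>", which is not a legal XML
    // element name, so skip it.
    if (!is_numeric($key)) {
      $output .= ' <'. $key .'>'. htmlspecialchars($value) ."</$key>\n";
    }
  }
  return $output;
}

// Made-up sample: one named element and one stray value with a numeric key.
echo format_extra_elements(array('category' => 'News & Views', 0 => 'stray'));
```

Only the named element is written out; the value stored under key 0 is silently skipped.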

Any objections, or reasons not to do it this way?