My installation of Aggregator is stripping out all < and > characters, leaving the HTML markup bare on the page. The result is pretty much gibberish.

I have tried to play with the allowed HTML tag setting, but it has no effect on the output.

Comments

jonwatson

I think I have identified where the stripping is occurring, but I don't know how to stop it. Line 915 of aggregator.modules.inc is:

function aggregator_filter_xss($value) {
  return filter_xss($value, preg_split('/\s+|<|>/', variable_get('aggregator_allowed_html_tags', '<a> <b> <br> <dd> <dl> <dt> <em> <i> <li> <ol> <p> <strong> <u> <ul>'), -1, PREG_SPLIT_NO_EMPTY));
}

I'm kind of confused by this function. It seems that the allowed_html_tags are hardcoded into this function rather than being taken from the Aggregator settings, but in either case I have modified the allowed tags in both this file and the aggregator functions to no avail.
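
A note on the function above: in Drupal 6, the second argument to variable_get() is only the default returned when the variable has never been saved, so the hardcoded list acts as a fallback rather than replacing the Aggregator setting. The preg_split() call then turns the tag string into a plain array of tag names for filter_xss(). A minimal standalone sketch of that split, using a shortened tag string chosen purely for illustration:

```php
<?php
// Shortened, illustrative tag string in the same format as the module default.
$allowed_html = '<a> <b> <br> <em>';

// Split on whitespace, "<" or ">" and discard the empty pieces.
$tags = preg_split('/\s+|<|>/', $allowed_html, -1, PREG_SPLIT_NO_EMPTY);

print implode(', ', $tags) . "\n"; // prints "a, b, br, em"
```

Since this step only builds the list of permitted tag names, it seems unlikely on its own to explain markup arriving without its angle brackets.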

Here is a sample of the output:

table border=0 width= valign=top cellpadding=2 cellspacing=7trtd valign=top class=jfont style=font-size:85%;font-family:arial,sans-serifbrdiv style=padding-top:0.8em;img alt= height=1 width=1/divdiv class=lha href=http://news.google.com/news/url?sa=Tct=us/7-0fd=Rurl=http://www.mlive.com/businessreview/oakland/index.ssf/2008/12/clean_energy_activist_t_boone.htmlcid=1280628287ei=JpBGSeaBMIaqNr-1xYYPusg=AFQjCNGzQffrI8CAznmDXVieFa2u_sg0qg• Clean energy activist T. Boone Pickens to higlight 2009 Mackinac b.../b/abrfont size=-1font color=#6f6f6fMLive.com,nbsp;MInbsp;-/font nobr1 hour ago/nobr/fontbrfont size=-1His speech on the island will include a moderated discussion about how Michigan can become a leader in the balternative energy/b industry. b.../b/font/div/font/td/tr/table 

It seems obvious that the problem is that all of the < and > characters are being removed, so the result is no longer valid HTML and is displayed as-is.

That's about the extent of my skills, though. I can't find out where in the code this error is being produced.

Does anyone have any pointers for me?

Thanks

Jon

jonwatson

After much mucking about, I have positively identified line 717 as the culprit:

717:  if (!xml_parse($xml_parser, $data, 1)) {
718:    watchdog('aggregator', 'The feed from %site seems to be broken, due to an error "%error" on line %line.', array('%site' => $feed['title'], '%error' => xml_error_string(xml_get_error_code($xml_parser)), '%line' => xml_get_current_line_number($xml_parser)), WATCHDOG_WARNING);
719:    drupal_set_message(t('The feed from %site seems to be broken, because of error "%error" on line %line.', array('%site' => $feed['title'], '%error' => xml_error_string(xml_get_error_code($xml_parser)), '%line' => xml_get_current_line_number($xml_parser))), 'error');
720:    return 0;
721:  }
722:  xml_parser_free($xml_parser);

When the global $items array is populated by the call to xml_parse, the < and > characters are stripped out of it.

However, since xml_parse seems to be an internal PHP function, I don't have a clue what to do about this issue.

Any help?

Jon

msielski

Subscribing to this. This is very much still an active bug in 6.9's Feed Aggregator. I too confirmed that it is PHP's xml_parse doing it, and am trying to isolate how I can fix it. In particular, Google's Blog and News RSS/Atom feeds make heavy use of embedded HTML, escaped with what are valid XML predefined entities (&amp; &lt; &gt; &quot;).

damien tournoud

Status: Active » Postponed (maintainer needs more info)

I can't reproduce any of the behavior you are describing on PHP 5.2.4. If this is really an issue in PHP's XML parser, please report your PHP version.

jbsarma

PHP version 5.2.8 and Drupal 6.9. This is very much a problem. Appreciate urgent attention.

jbsarma

Version: 6.6 » 6.9

dave reid

Status: Postponed (maintainer needs more info) » Closed (won't fix)

We had this same problem on drupal.org's aggregator, and I'm pretty sure it was identified as a problem with PHP's libxml. See #362294: Drupal.org aggregator stores news posts broken, Drupal Planet broken.

dave reid

See the PHP bug report at http://bugs.php.net/bug.php?id=45996 for the affected versions.

slimandslam

Title: Aggregator Module Stripping out all < and > » Aggregator Module broken under PHP 5.2.8 -- HTML entities are ignored

To clarify, the expat parser in PHP (this one: http://www.php.net/manual/en/book.xml.php)
is broken in PHP 5.2.8. The bug is in libxml. The issue is that the parser ignores HTML entities
during parsing, resulting in XML with the entities stripped out of the parsed content.

This means that any Drupal module that uses expat is broken if you're running under PHP 5.2.8. The
aggregator module is broken (if your content has HTML entities in it). The fix is in PHP 5.2.9 (to be released).

More details: http://drupal.org/node/384060
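
One way to check whether a particular PHP build shows the behavior described above is to feed the XML parser a tiny document containing escaped markup and look at what the character data handler receives. The script below is only a sketch and is not part of Aggregator; the class name EntityProbe and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: accumulates the character data reported by the parser.
class EntityProbe {
  public $text = '';

  public function cdata($parser, $data) {
    // The handler may be called several times for one element, so append.
    $this->text .= $data;
  }
}

$probe = new EntityProbe();
$parser = xml_parser_create();
xml_set_character_data_handler($parser, array($probe, 'cdata'));

// Escaped HTML inside an element, similar to what feeds put in <description>.
$xml = '<description>&lt;b&gt;bold&lt;/b&gt; &amp; more</description>';
xml_parse($parser, $xml, TRUE);

// A working build should report: <b>bold</b> & more
// On an affected build the "<", ">" and "&" characters are expected to be missing.
print phpversion() . ': ' . $probe->text . "\n";
```

Running this from the command line and through the web server may be worthwhile, since the two can be built against different library versions.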

slimandslam

PHP 5.2.9 was just released. This problem is fixed: http://www.php.net/ChangeLog-5.php#5.2.9 (Issue #45996)

Lakeside

Hmm... PHP 5.2.9 hasn't fixed the problem on my system.