Feed Aggregator problem

By rj.seward on 26 Jan 2009 at 21:36 UTC

I have an Aggregator up and functioning on my site, and until recently it worked quite well. Right now there seems to be a problem with the way it parses the text of a news feed into a teaser. For example, here is a headline and teaser displayed today on my site:

Iceland’s Government Collapses
NY Times - Front Page - 55 min 41 sec ago
Large anti-government demonstrations in Iceland have been mirrored elsewhere in Europe, but the largest economies have been spared.br/br/span class="advertisement" a href="http://www.pheedo.com/click.phdo?x=ba6699eca028480181189b6a17db1074u=htt..."img src="http://www.pheedo.com/img.phdo?x=ba6699eca028480181189b6a17db1074u=http:..." border="0"//a/span

It appears that the < and > symbols are removed somewhere between the source and my page.

Interestingly, if you go to the XML page at http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml and view the source, in the section where this feed info is, you will see the text "<" and ">" where there should be < and >, but this is not consistent throughout the page.

Does anyone else have this problem? Suggestions?

Comments

Clarification

rj.seward commented 26 January 2009 at 22:00

...in the section where this feed info is, you will see the text "& lt;" and "& gt;" (without a space between the & and the lt; and gt;) where there should be actual symbols < , >

In other words there is text rather than symbols.
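To make the distinction concrete, here is a minimal PHP sketch showing the same fragment in its entity-encoded form, as it would appear in the feed source, and in its decoded form. The sample string is made up for illustration; htmlspecialchars_decode() is a standard PHP function.

```php
<?php
// Hypothetical fragment, written the way it might appear inside a feed's
// description element (escaped text, not real tags).
$encoded = 'spared.&lt;br/&gt;&lt;span class="advertisement"&gt;ad&lt;/span&gt;';

// Decoding turns the entity text back into literal angle brackets.
$decoded = htmlspecialchars_decode($encoded);

echo $encoded, "\n";
echo $decoded, "\n";
```

The first line printed is what "view source" shows; the second is the markup a browser would act on.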

Use search first

MidGe48 commented 26 January 2009 at 23:07

A search would have shown you a lot of similar issues including this one: http://drupal.org/node/311511

www.ZuNOB.com

Thanks for the pointer, but

rj.seward commented 27 January 2009 at 16:00

Thank you for the link. I did actually spend a good bit of time searching the forums and documentation seeking a resolution to this problem, but I did not find this post until you pointed it out.

I tried the "workaround" suggested in the post at http://drupal.org/node/311511:
"in function aggregator_save_item($edit) {...}
in aggregator.module for the title of the feed item. Just replace $edit['title'] with strip_tags($edit['title']) then remove all items and updates items."

As per the post I replaced the occurrences of $edit['title'] with strip_tags($edit['title']) at lines 831 and 838.

Unfortunately this does not appear to correct my problem. I also tried replacing $edit['description'] with strip_tags($edit['description']), but this did not do the trick either.
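A possible reason strip_tags() made no difference, assuming the angle brackets are already gone by the time the item is saved (as the teaser in the original post suggests): strip_tags() only removes text delimited by literal < and >, so bracket-less remnants pass through unchanged. The strings below are illustrative samples, not real aggregator data.

```php
<?php
// Sample teaser with its markup intact.
$intact = 'have been spared.<br/><span class="advertisement">ad</span>';
// Sample teaser after the angle brackets were lost somewhere upstream.
$mangled = 'have been spared.br/span class="advertisement"ad/span';

$a = strip_tags($intact);   // tags are recognised and removed
$b = strip_tags($mangled);  // nothing looks like a tag, so nothing changes

echo $a, "\n";
echo $b, "\n";
```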

I am currently attempting to write my own function to strip out the code from the teaser... Unless anyone else has a better suggestion.

My solution

rj.seward commented 30 January 2009 at 19:14

Here is the function I created to deal with this problem. I simply added these lines at the end of aggregator.module:

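# Find entity-escaped tags (&lt; ... &gt;) in the raw feed data and delete
# them along with the markup they enclose.
function clean_tags($code) {
  # The pattern must match the escaped text &lt; and &gt;, not literal angle
  # brackets, which would strip the feed's own XML tags.
  $code = preg_replace('/&lt;(.*?)&gt;/', " ", $code);
  # Optional tidying: drop carriage returns and tabs, collapse whitespace.
  $code = preg_replace('/\r/', '', $code);
  $code = preg_replace('/\t/', '', $code);
  $code = preg_replace('/\s\s+/', ' ', $code);
  return $code;
}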

I inserted a call to this function on line 703 of aggregator.module with this code:
$data = clean_tags($data);

Now the teaser (or description as it is referred to in the code) displays perfectly.

The preg_replace calls for \r, \t, and \s just tidy up the data a bit before it is fed into the XML parser; they are not essential to fixing the garbage output problem.
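As a rough illustration of what the cleanup does, here is a standalone sketch run on a made-up feed fragment. It assumes the unwanted markup arrives as &lt;...&gt; entities inside the raw feed data; the function name clean_escaped_tags and the sample string are hypothetical and not part of aggregator.module.

```php
<?php
// Hypothetical standalone variant of the cleanup, for illustration only.
function clean_escaped_tags($data) {
  // Remove entity-escaped tags such as &lt;br/&gt; (non-greedy, one tag at a time).
  $data = preg_replace('/&lt;(.*?)&gt;/', ' ', $data);
  // Collapse the runs of whitespace left behind.
  return preg_replace('/\s\s+/', ' ', $data);
}

// Made-up raw feed fragment: real XML tags plus escaped HTML in the text.
$data = '<item><description>Economies spared.&lt;br/&gt;'
  . '&lt;span class="ad"&gt;Buy now&lt;/span&gt;</description></item>';

$clean = clean_escaped_tags($data);
echo $clean, "\n";
```

The feed's own XML tags are left alone, so the data can still be parsed. Text between the escaped tags, such as the "Buy now" label here, is kept; only the escaped tags themselves are removed.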

I hope this will help anyone else who might run into this same problem.

this is an interesting problem

justageek commented 30 January 2009 at 20:02

...because those encoded entities are supposed to be in the feed to make the feed valid, although rss feeds are not supposed to have html, so the nytimes is spitting out invalid stuff. But, I guess aggregator should decode them prior to display, or strip them for you, which is essentially the code you added, I think.
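The "decode, then strip" idea mentioned here could look roughly like the sketch below. The helper name plain_teaser and the sample description are hypothetical; html_entity_decode() and strip_tags() are standard PHP functions. This is not how aggregator.module is known to behave, only one way the suggestion might be implemented.

```php
<?php
// Hypothetical helper: decode escaped markup, then drop the tags to get
// a plain-text teaser.
function plain_teaser($description) {
  $decoded = html_entity_decode($description, ENT_QUOTES, 'UTF-8');
  return trim(strip_tags($decoded));
}

// Made-up description containing entity-escaped HTML.
$teaser = plain_teaser(
  'Economies spared.&lt;br/&gt;&lt;a href="http://example.com/"&gt;More&lt;/a&gt;'
);
echo $teaser, "\n";
```

Unlike a regular expression over the raw feed, this works on a single item's text after parsing, so it cannot damage the feed's XML structure.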

Thanks, this code fixed a

egm commented 19 May 2009 at 18:25

Thanks, this code fixed a whole bunch of feeds that we would otherwise have had to drop. Did you submit this as a fix for the module?

I am probably the dumbest

doc101 commented 18 November 2009 at 15:17

I am probably the dumbest person on the internet, but when I updated aggregator.module with the code above I get the following errors:

The feed from blah blah seems to be broken, because of error "Invalid document end" on line 1.
The feed from blah blah seems to be broken, because of error "200 feed not parseable".

Any advice? Is there a patch or downloadable aggregator.module out there that could save me a huge headache?

Thanks!
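One possible explanation for errors like these, offered as an assumption rather than a diagnosis: if the pattern is applied to the raw feed with literal angle brackets instead of the escaped forms &lt; and &gt;, it removes the feed's own XML tags and the parser has nothing valid left to read. The sketch below uses a made-up feed and a hypothetical helper parses_ok(), and it relies on PHP's XML extension being available.

```php
<?php
// Made-up minimal feed.
$feed = '<?xml version="1.0"?><rss><channel><title>Example</title></channel></rss>';

// Hypothetical helper: true when PHP's XML parser accepts the whole document.
function parses_ok($data) {
  $parser = xml_parser_create();
  return xml_parse($parser, $data, true) === 1;
}

// A literal-bracket pattern strips every XML tag, including the root element.
$stripped = preg_replace('/<(.*?)>/', ' ', $feed);

var_dump(parses_ok($feed));
var_dump(parses_ok($stripped));
```

If that is what happened, using the escaped forms in the pattern should leave the XML structure intact.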
