I have an Aggregator up and functioning on my site, and until recently it has been working quite well. Right now there seems to be a problem with the way it parses the text output from a news feed into a teaser. For example, here is a headline and teaser displayed today on my site:
Iceland’s Government Collapses
NY Times - Front Page - 55 min 41 sec ago
Large anti-government demonstrations in Iceland have been mirrored elsewhere in Europe, but the largest economies have been spared.br/br/span class="advertisement" a href="http://www.pheedo.com/click.phdo?x=ba6699eca028480181189b6a17db1074u=htt..."img src="http://www.pheedo.com/img.phdo?x=ba6699eca028480181189b6a17db1074u=http:..." border="0"//a/span
It appears that the < and > symbols are removed somewhere between the source and my page.
Interestingly, if you go to the XML page at http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml and view the source, in the section where this feed info is, you will see the text "<" and ">" where there should be < and >, but this is not consistent throughout the page.
Does anyone else have this problem? Suggestions?
Comments
Clarification
...in the section where this feed info is, you will see the text "& lt;" and "& gt;" (without a space between the & and the lt; and gt;) where there should be actual symbols < , >
In other words there is text rather than symbols.
Use search first
A search would have shown you a lot of similar issues including this one: http://drupal.org/node/311511
www.ZuNOB.com
Thanks for the pointer, but
Thank you for the link. I did actually spend a good bit of time searching the forums and documentation seeking a resolution to this problem, but I did not find this post until you pointed it out.
I tried the "workaround" suggested in the post at http://drupal.org/node/311511:
"in function aggregator_save_item($edit) {...}
in aggregator.module for the title of the feed item. Just replace $edit['title'] with strip_tags($edit['title']) then remove all items and updates items."
As per the post I replaced the occurrences of $edit['title'] with strip_tags($edit['title']) at lines 831 and 838.
Unfortunately this does not appear to correct my problem. I also tried replacing $edit['description'] with strip_tags($edit['description']), but this did not do the trick either.
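A quick check of why strip_tags() might have no effect here. This is only a sketch, assuming the angle brackets are already gone by the time the text reaches aggregator_save_item(), as the teaser output above suggests; the sample string is shortened from that teaser.

```php
<?php
// Teaser text roughly as it appeared on the site: the angle brackets are already missing.
$teaser = 'the largest economies have been spared.br/br/span class="advertisement" /span';

// strip_tags() only removes literal <...> tags, so this text passes through untouched.
$after = strip_tags($teaser);
var_dump($after === $teaser);

// For comparison, the same text with its brackets intact is cleaned as expected.
$intact = 'the largest economies have been spared.<br/><br/><span class="advertisement"></span>';
echo strip_tags($intact), "\n";
```

If that assumption holds, the markup has to be dealt with earlier, before the brackets are lost.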
I am currently attempting to write my own function to strip out the code from the teaser... Unless anyone else has a better suggestion.
My solution
Here is the function I created to deal with this problem. I simply added these lines at the end of aggregator.module:
# function to find the escaped strings &lt; and &gt;
# and delete these and the code enclosed within
function clean_tags($code) {
  // remove each entity-escaped tag, e.g. &lt;br/&gt;
  $code = preg_replace('/&lt;(.*?)&gt;/', " ", $code);
  // drop carriage returns and tabs, then collapse runs of whitespace
  $code = preg_replace('/\r/', '', $code);
  $code = preg_replace('/\t/', '', $code);
  $code = preg_replace('/\s\s+/', ' ', $code);
  return $code;
}
I inserted a call to this function on line 703 of aggregator.module with this code:
$data = clean_tags($data);
Now the teaser (or description as it is referred to in the code) displays perfectly.
The preg_replace calls for \r, \t, and \s are just there to clean up the code a bit before feeding it into the XML parser; they are not essential to fixing the garbage output problem.
I hope this will help anyone else who might run into this same problem.
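To see what the function does before touching a live site, it can be run on a small string. The snippet below is a sketch: it carries its own copy of clean_tags() so it runs standalone, and the $data fragment is a made-up stand-in for raw feed XML.

```php
<?php
// Standalone copy of clean_tags() so the snippet runs on its own.
function clean_tags($code) {
  $code = preg_replace('/&lt;(.*?)&gt;/', ' ', $code);
  $code = preg_replace('/\r/', '', $code);
  $code = preg_replace('/\t/', '', $code);
  $code = preg_replace('/\s\s+/', ' ', $code);
  return $code;
}

// Hypothetical raw feed fragment, as it looks before the XML parser sees it.
$data = "<description>Economies have been spared.&lt;br/&gt;&lt;br/&gt;"
      . "&lt;span class=\"advertisement\"&gt;&lt;/span&gt;</description>";

// The entity-escaped tags are removed; the real <description> element survives.
echo clean_tags($data), "\n";
```

Note that the pattern has to match the entity text &lt; and &gt;. With literal angle brackets in the pattern, the same call would also remove the feed's own XML tags and leave the parser with nothing valid to read.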
this is an interesting problem
...because those encoded entities are supposed to be in the feed to make the feed valid, although RSS feeds are not supposed to have HTML, so the NY Times is spitting out invalid stuff. Still, I guess aggregator should decode them prior to display, or strip them for you, which is essentially what the code you added does, I think.
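For what it's worth, here is a rough sketch of the "decode, then strip" idea using PHP's html_entity_decode() and strip_tags(). The helper name teaser_text() and the sample string are made up for illustration; this is not what aggregator.module actually does.

```php
<?php
// Hypothetical helper: decode entity-escaped markup, then strip the resulting tags.
function teaser_text($description) {
  $decoded = html_entity_decode($description, ENT_QUOTES, 'UTF-8');
  return trim(strip_tags($decoded));
}

// Made-up description with markup stored as entities.
$description = 'Economies have been spared.&lt;br/&gt;'
             . '&lt;a href="http://example.com/"&gt;ad&lt;/a&gt;';

echo teaser_text($description), "\n";
```

Unlike removing everything between &lt; and &gt;, this keeps the text inside the tags (the link text "ad" here), so which approach fits depends on what you want left in the teaser.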
Thanks, this code fixed a
Thanks, this code fixed a whole bunch of feeds that we would otherwise have had to drop. Did you submit this as a fix for the module?
I am probably the dumbest
I am probably the dumbest person on the internet, but when I updated aggregator.module with the code above I get the following errors:
The feed from blah blah seems to be broken, because of error "Invalid document end" on line 1.
The feed from blah blah seems to be broken, because of error "200 feed not parseable".
Any advice? Is there a patch or downloadable aggregator.module out there that could save me a huge headache?
Thanks!