Feed aggregator strips characters from rss feeds, breaking summaries and links

Project:Drupal core
Version:6.16
Component:base system
Category:support request
Priority:normal
Assigned:Unassigned
Status:postponed (maintainer needs more info)

Issue Summary

When parsing RSS XML, the feed aggregator is stripping out HTML character entities, causing broken summaries and links. For example, I've started having problems with Google News, using the following RSS feed: http://news.google.com/news?pz=1&ned=us&hl=en&output=rss

The link URL that Google News sends me uses the "&amp;" character entity to separate its query parameters. Instead of storing the decoded "&" separator, the feed aggregator strips it out entirely, breaking the URL.

Here is an excerpt of an item Google News sends me when I manually issue the above HTTP request:

http://www.google.com/news/url?sa=T&ct=us/0-1-0&fd=R&url=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhome&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHj-3q-T54lxYDsGKz59Z1SYNwWeA

Meanwhile, the database value aggregator_item.link shows:

http://news.google.com/news/url?sa=Tct=us/0-1-0fd=Rurl=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhomecid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHnvnZsA6MZcAfcjhbnJsK8jgEg3A
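
For comparison, here is a minimal sketch of the behavior I would expect, written in Python rather than the aggregator's PHP. The shortened item and the helper names (parsed_link, stripped_link) are made up for illustration. A conforming XML parser should decode "&amp;" to "&", not drop it:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened RSS item; the real Google News link is much longer.
SAMPLE_ITEM = (
    "<item>"
    "<link>http://www.google.com/news/url?sa=T&amp;ct=us/0-1-0&amp;fd=R</link>"
    "</item>"
)


def parsed_link(item_xml):
    """Return the text of <link>, with XML entities decoded by the parser."""
    return ET.fromstring(item_xml).findtext("link")


def stripped_link(item_xml):
    """Mimic the reported symptom: the '&' separators vanish entirely."""
    return parsed_link(item_xml).replace("&", "")


print(parsed_link(SAMPLE_ITEM))    # what should be stored
print(stripped_link(SAMPLE_ITEM))  # what the database value looks like
```

The second line printed has the same shape as the aggregator_item.link value above: the parameters run together with no separator.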

Article summaries are unreadable as well, because the character entities &lt;, &gt;, and &amp; are all being stripped out, exposing raw tags and attributes to the user as if they were actual content. Here is that same article's summary, first fetched manually, then as extracted from aggregator_item.description.

Manual fetch:

<font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class=lh><table border=0 align=right cellspacing=0 cellpadding=0cellpadding=3 style="font-size:100%;font-family:arial,sans-serif"><tr><td width=80 align=center style="padding-left:6px;" valign=top><a href="http://www.google.com/news/url?sa=T&ct=us/0-1i-0&fd=R&url=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052700229.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNE9LqRyNbPJvaqoyIXmlAdXf0MppA"><img src=http://nt1.ggpht.com/news?imgefp=3eZ5RfaBt_oJ&imgurl=media3.washingtonpost.com/wp-dyn/content/photo/2009/05/27/PH2009052700231.jpg width=77 height=80 alt="" border=1><br><font size=-2>Washington Post</font></a></td></tr></table><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-0&fd=R&url=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhome&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHj-3q-T54lxYDsGKz59Z1SYNwWeA"><b>Obama N. 
Korea Options May Be Limited by Regime Shift</b></a><br><font size=-1><b><font color=#6f6f6f>Bloomberg</font></b></font><br><font size=-1>By Indira AR Lakshmanan and Heejin Koo May 27 (Bloomberg) -- The Obama administration&#39;s ability to pressure North Korea&#39;s insular leadership to abandon nuclear weapons may be hamstrung by internal jockeying and unease in the communist state over who <b>...</b></font><br><font size=-1><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-1&fd=R&url=http://www.youtube.com/watch%3Fv%3D95v9d3w4tzM&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNGpRYBBToDRC7HOJpkjh0W4CV9b_g">Video: Reaction: Will North Korea&#39;s nukes lead to war?</a> <font size=-1 color=#6f6f6f><nobr>UPI</nobr></font><object width="448" height="356"><param name="movie" value="http://www.youtube.com/v/95v9d3w4tzM"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/95v9d3w4tzM"type="application/x-shockwave-flash"wmode="transparent"width="448"height="356"></embed></object><br></font><font size=-1><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-2&fd=R&url=http://english.aljazeera.net/news/asia-pacific/2009/05/2009527195524608822.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNEcKrutKflbrDY5vKsuWnTrm404Nw">N Korea warned of &#39;consequences&#39;</a> <font size=-1 color=#6f6f6f><nobr>Aljazeera.net</nobr></font></font><br><font size=-1 class=p><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-3&fd=R&url=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052702353.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNETTwK5v1ZU5ca8k9wTk7P7RMkMCg"><nobr>Washington Post</nobr></a>&nbsp;- <a href="http://www.google.com/news/url?sa=T&ct=us/0-1-4&fd=R&url=http://www.reuters.com/article/topNews/idUSTRE54Q5R620090527&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNFM8UcQSuznoxM9BowXrJyFuPDgXA"><nobr>Reuters</nobr></a>&nbsp;- <a 
href="http://www.google.com/news/url?sa=T&ct=us/0-1-5&fd=R&url=http://www.upi.com/Top_News/2009/05/27/Clinton-N-Korea-must-face-consequences/UPI-36601243462350/&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNGHdmvRqQSTBEXH-dvEoX5LaJV_fA"><nobr>United Press International</nobr></a>&nbsp;- <a href="http://www.google.com/news/url?sa=T&ct=us/0-1-6&fd=R&url=http://www.google.com/hostednews/afp/article/ALeqM5iwb9ioSkAJOAJKRCgQfv4s9aWMfw&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHQWg5AsGKcuXGG7moNPErpZcZM3A"><nobr>AFP</nobr></a></font><br/><font class=p size=-1><a class=p href=http://www.google.com/news?pz=1&ned=us&hl=en&ncl=dj3-cRGcAtrLCiMjO4SV67_oe2RqM><nobr><b>all 12,560 news articles</b></nobr></a></font><br clear=all> </div></font>

And from aggregator_item.description:

font style=font-size:85%;font-family:arial,sans-serifbrdiv style=padding-top:0.8em;img alt= height=1 width=1/divdiv class=lhtable border=0 align=right cellspacing=0 cellpadding=0cellpadding=3 style=font-size:100%;font-family:arial,sans-seriftrtd width=80 align=center style=padding-left:6px; valign=topa href=http://news.google.com/news/url?sa=Tct=us/0-1i-0fd=Rurl=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052700229.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNG0rv_TekZcwrpLxlrbLAy2KdFltgimg src=http://nt1.ggpht.com/news?imgefp=3eZ5RfaBt_oJimgurl=media3.washingtonpost.com/wp-dyn/content/photo/2009/05/27/PH2009052700231.jpg width=77 height=80 alt= border=1brfont size=-2Washington Post/font/a/td/tr/tablea href=http://news.google.com/news/url?sa=Tct=us/0-1-0fd=Rurl=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhomecid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHnvnZsA6MZcAfcjhbnJsK8jgEg3AbObama N. Korea Options May Be Limited by Regime Shift/b/abrfont size=-1bfont color=#6f6f6fBloomberg/font/b/fontbrfont size=-1By Indira AR Lakshmanan and Heejin Koo May 27 (Bloomberg) -- The Obama administration#39;s ability to pressure North Korea#39;s insular leadership to abandon nuclear weapons may be hamstrung by internal jockeying and unease in the communist state over who b.../b/fontbrfont size=-1a href=http://news.google.com/news/url?sa=Tct=us/0-1-1fd=Rurl=http://www.youtube.com/watch%3Fv%3D95v9d3w4tzMcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHumhczb7vl0CI71D4t0tARoKU5ngVideo: Reaction: Will North Korea#39;s nukes lead to war?/a font size=-1 color=#6f6f6fnobrUPI/nobr/fontobject width=448 height=356param name=movie value=http://www.youtube.com/v/95v9d3w4tzM/paramparam name=wmode value=transparent/paramembed src=http://www.youtube.com/v/95v9d3w4tzMtype=application/x-shockwave-flashwmode=transparentwidth=448height=356/embed/objectbr/fontfont size=-1a 
href=http://news.google.com/news/url?sa=Tct=us/0-1-2fd=Rurl=http://english.aljazeera.net/news/asia-pacific/2009/05/2009527195524608822.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNE5hdYoN7qNEdBjjj6OuolDDmnoigN Korea warned of #39;consequences#39;/a font size=-1 color=#6f6f6fnobrAljazeera.net/nobr/font/fontbrfont size=-1 class=pa href=http://news.google.com/news/url?sa=Tct=us/0-1-3fd=Rurl=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052702353.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHU6JKN2yIBgXhck1oUbz22rXCi8AnobrWashington Post/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-4fd=Rurl=http://www.reuters.com/article/topNews/idUSTRE54Q5R620090527cid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNE_a9umAbB5q8utbbiJKgw5DspyewnobrReuters/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-5fd=Rurl=http://www.upi.com/Top_News/2009/05/27/Clinton-N-Korea-must-face-consequences/UPI-36601243462350/cid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNFcPFh6qwvi48m7D3Gi1iJ_iN8CMAnobrUnited Press International/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-6fd=Rurl=http://www.google.com/hostednews/afp/article/ALeqM5iwb9ioSkAJOAJKRCgQfv4s9aWMfwcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNH1zK4oeEQT7yxox24fZGwrIqMIywnobrAFP/nobr/a/fontbr/font class=p size=-1a class=p href=http://news.google.com/news?pz=1ned=ushl=enncl=dj3-cRGcAtrLCiMjO4SV67_oe2RqMnobrball 12,560 news articles/b/nobr/a/fontbr clear=all /div/font

Any chance I've got something misconfigured? I can't imagine what though...

Comments

#1

Status:active» closed (fixed)

Sorry, the character entities in this post are being rendered as the actual characters...
Please "view source" of the blockquotes to see what the feed aggregator is picking up.

Thanks for any help!

#2

Status:closed (fixed)» active

#3

Priority:normal» critical

HELP THIS IS HAPPENING TO ME
(update)
http://www.twincityscene.com/aggregator/

#4

Version:6.12» 6.13

also experiencing this issue.
anyone have a solution?

#5

Same problem! There must be a workaround for this. The Drupal.org aggregator has a google news feed that works just fine....

#6

This issue seems to have died down but I am still having this problem. Anyone have the fix for this problem?

#7

we still have this problem as well.

#8

are we the only ones experiencing this?
is it a bug? bad feed? misconfiguration somewhere in our setup?

#9

Still happening in 6.14...

#10

Version:6.13» 6.14

#11

Subscribe! The feed is looking good in Google RSS reader but not in Drupal

#12

Facing this bug now in D6.15. So far I have found that the tags are already broken by the time the DESCRIPTION hits the aggregator_parse_feed() function. But those tags are not part of the node body. The nodes are getting rendered somewhere along the line with a bunch of DIV tags (mostly DIV CLASS='field' and related) that later get their anglebrackets removed, causing a bunch of crap to appear in the aggregator_item table. Still digging. Anyone else finding anything here?

#13

Progress. First I tried playing with the default RSS view for frontpage, which is related but not actually the culprit. Then I managed to make *almost* all of the unwanted bracketless DIVs disappear by EXCLUDING all fields from the "Display Fields" (RSS) page. However there is still one that is not being affected: just before the final attribution gets added, there is embedded code that looks like this: div class=og_rss_groups/div (notice the anglebrackets are missing). Looks like OG is getting in on this action. Oy. Well, that's enough progress for one day!

#14

Spoke too soon. This does not work in all cases. But I have discovered something else while playing with print_r() and the function aggregator_parse_feed(). The variable $data has the proper formatting all the way through the function. The messed-up stuff is in the variable $items, after it gets filled by the parser. *sigh*
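
One way to pin down which side of the parser the damage happens on is to compare the raw text going in with the text coming out. The sketch below is in Python, not the aggregator's PHP, and the names classify and ENTITY are hypothetical. It only checks whether a parsed string matches "entities decoded" or "entities deleted":

```python
import html
import re

# The five predefined XML entities. Assumption: these are the ones being lost,
# which matches the stored text above (e.g. "administration#39;s").
ENTITY = re.compile(r"&(?:lt|gt|amp|quot|apos);")


def classify(raw_escaped, parsed):
    """Say whether `parsed` is the decoded or the stripped form of `raw_escaped`."""
    if parsed == html.unescape(raw_escaped):
        return "decoded"   # healthy parser output
    if parsed == ENTITY.sub("", raw_escaped):
        return "stripped"  # the symptom reported in this issue
    return "other"


raw = "The administration&amp;#39;s &lt;b&gt;plan&lt;/b&gt;"
print(classify(raw, "The administration&#39;s <b>plan</b>"))
print(classify(raw, "The administration#39;s bplan/b"))
```

If the contents of $items come out as "stripped" while $data still holds the escaped text, the loss is happening inside the XML parsing step, not in the fetch or in later filtering.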

#15

Version:6.14» 6.16

still happening in 6.16 :(

#16

I'm having this problem too. I've spent weeks studying it and found nothing.

#17

Could you explain more in detail how you fixed it. I'm a newbie to Drupal and really need this fixed.

I'd really appreciate your help.

Pav

#18

@Pav: from what i can tell, it's not fixed yet...

#19

More progress, but it's not good news. I was playing with a client's RSS feed and also subscribing to it from the same site when I noticed this issue and began my explorations. Several hours of messing with the code of the aggregator module didn't do the trick, and only yielded more questions (see above). Today I tried doing the same setup (subscribing to my own RSS feed) on another D6.16 site AND IT WORKED FLAWLESSLY. The anglebrackets did NOT get stripped and the feed items displayed as expected without any hacking.

This leads me to wonder if the problem is related to another module; a code-crash between aggregator and something else. Time to start deactivating modules and see what happens...

#20

thanks for your investigation work on this, and please keep us posted on any developments...

#21

same issue for me. showing in my footer at www.speakerpulse.com . Would love to be able to have this look clean!!

#22

Note: In my case I am trying to subscribe via aggregator to the rss feed coming from the same site (this is so we can get the "blog this" links for our users). If you examine a preview of the RSS Feed in the "frontpage" view, you will see that all special HTML characters within the "description" are translated to special entities. For instance, < gets changed to &lt; etc. Then apparently when the aggregator reads them, it fails to return these characters back to their original HTML state and just strips them out. HOWEVER, as mentioned above, this problem does NOT happen on all sites where I attempt it. On some sites it works just fine. So there seems to be another module involved. Or perhaps it's a filter. Still digging...
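
The round trip described here can be sketched as follows. This is Python, not Drupal code, and build_item and read_description are made-up names for the publishing and subscribing sides. On a healthy setup, the description that comes out of the parser is identical to the markup that went in:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_item(description_html):
    """Publisher side: escape markup so it is valid character data in RSS."""
    return "<item><description>" + escape(description_html) + "</description></item>"


def read_description(item_xml):
    """Consumer side: the XML parser turns the entities back into characters."""
    return ET.fromstring(item_xml).findtext("description")


# Hypothetical node body, similar to the DIV wrappers mentioned earlier.
original = '<div class="field">Story &amp; more</div>'
item = build_item(original)

print(item)                               # entities visible in the feed
print(read_description(item) == original)  # should be True on a working parser
```

On the sites where the problem appears, the second step is the one that misbehaves: the entities are removed instead of being converted back.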

#23

Update (no good news):
- installing views_rss beta 4 did not help
- rebuilding the frontpage view did not help
- changing the input format of the stories to filtered html did not help
- removing the html corrector filter did not help
- The feed looks great in Feedburner, so the problem is definitely in Aggregator (when it comes back in).

#24

Category:bug report» support request
Priority:critical» normal
Status:active» postponed (maintainer needs more info)

I tested the Google News URL in the summary and it worked for me. There isn't enough information here to troubleshoot the issue. What would help is an example feed URL that triggers the problem, and step-by-step directions to reproduce the issue.