Feed aggregator strips characters from RSS feeds, breaking summaries and links

valderost - May 27, 2009 - 23:13
Project:Drupal
Version:6.14
Component:base system
Category:bug report
Priority:critical
Assigned:Unassigned
Status:active
Description

When parsing RSS XML, the feed aggregator is stripping out HTML character entities, causing broken summaries and links. For example, I've started having problems with Google News, using the following RSS feed: http://news.google.com/news?pz=1&ned=us&hl=en&output=rss

The link URL that Google News sends me uses the "&amp;" HTML character entity to separate a number of query parameters. Instead of storing the separator, the feed aggregator strips it out, breaking the URL.

Here is an excerpt of an item Google News sends me when I manually issue the above HTTP request:

http://www.google.com/news/url?sa=T&ct=us/0-1-0&fd=R&url=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhome&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHj-3q-T54lxYDsGKz59Z1SYNwWeA

Meanwhile, the database value aggregator_item.link shows:

http://news.google.com/news/url?sa=Tct=us/0-1-0fd=Rurl=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhomecid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHnvnZsA6MZcAfcjhbnJsK8jgEg3A
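
The difference between the two URLs can be reproduced with a small sketch. This is Python, not Drupal's aggregator code, and drop_entities is only a hypothetical model of the symptom (whole entity references disappearing), not a claim about where the bug lives. RAW_LINK is an assumed sample, shortened from the URL above and written the way it would appear inside the RSS XML.

```python
import html
import re

# Assumed sample: the <link> text as it sits in the raw RSS XML,
# with "&" escaped as "&amp;" (shortened from the URL in this report).
RAW_LINK = "http://www.google.com/news/url?sa=T&amp;ct=us/0-1-0&amp;fd=R"


def decode_entities(text: str) -> str:
    """Expected behaviour: replace each entity with the character it stands for."""
    return html.unescape(text)


def drop_entities(text: str) -> str:
    """Hypothetical model of the symptom: the entity references simply vanish."""
    return re.sub(r"&(amp|lt|gt|quot);", "", text)


print(decode_entities(RAW_LINK))
print(drop_entities(RAW_LINK))
```

The first result keeps "&" between the query parameters; the second runs them together in the same way as the value stored in aggregator_item.link.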

Article summaries are unreadable as well, because the character entities &lt;, &gt;, and &amp; are all being stripped out, exposing raw tags and attributes to the user as if they were actual content. Here is that same article's summary, first fetched manually, then as extracted from aggregator_item.description.

Manual fetch:

<font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class=lh><table border=0 align=right cellspacing=0 cellpadding=0cellpadding=3 style="font-size:100%;font-family:arial,sans-serif"><tr><td width=80 align=center style="padding-left:6px;" valign=top><a href="http://www.google.com/news/url?sa=T&ct=us/0-1i-0&fd=R&url=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052700229.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNE9LqRyNbPJvaqoyIXmlAdXf0MppA"><img src=http://nt1.ggpht.com/news?imgefp=3eZ5RfaBt_oJ&imgurl=media3.washingtonpost.com/wp-dyn/content/photo/2009/05/27/PH2009052700231.jpg width=77 height=80 alt="" border=1><br><font size=-2>Washington Post</font></a></td></tr></table><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-0&fd=R&url=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhome&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHj-3q-T54lxYDsGKz59Z1SYNwWeA"><b>Obama N. 
Korea Options May Be Limited by Regime Shift</b></a><br><font size=-1><b><font color=#6f6f6f>Bloomberg</font></b></font><br><font size=-1>By Indira AR Lakshmanan and Heejin Koo May 27 (Bloomberg) -- The Obama administration&#39;s ability to pressure North Korea&#39;s insular leadership to abandon nuclear weapons may be hamstrung by internal jockeying and unease in the communist state over who <b>...</b></font><br><font size=-1><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-1&fd=R&url=http://www.youtube.com/watch%3Fv%3D95v9d3w4tzM&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNGpRYBBToDRC7HOJpkjh0W4CV9b_g">Video: Reaction: Will North Korea&#39;s nukes lead to war?</a> <font size=-1 color=#6f6f6f><nobr>UPI</nobr></font><object width="448" height="356"><param name="movie" value="http://www.youtube.com/v/95v9d3w4tzM"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/95v9d3w4tzM"type="application/x-shockwave-flash"wmode="transparent"width="448"height="356"></embed></object><br></font><font size=-1><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-2&fd=R&url=http://english.aljazeera.net/news/asia-pacific/2009/05/2009527195524608822.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNEcKrutKflbrDY5vKsuWnTrm404Nw">N Korea warned of &#39;consequences&#39;</a> <font size=-1 color=#6f6f6f><nobr>Aljazeera.net</nobr></font></font><br><font size=-1 class=p><a href="http://www.google.com/news/url?sa=T&ct=us/0-1-3&fd=R&url=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052702353.html&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNETTwK5v1ZU5ca8k9wTk7P7RMkMCg"><nobr>Washington Post</nobr></a>&nbsp;- <a href="http://www.google.com/news/url?sa=T&ct=us/0-1-4&fd=R&url=http://www.reuters.com/article/topNews/idUSTRE54Q5R620090527&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNFM8UcQSuznoxM9BowXrJyFuPDgXA"><nobr>Reuters</nobr></a>&nbsp;- <a 
href="http://www.google.com/news/url?sa=T&ct=us/0-1-5&fd=R&url=http://www.upi.com/Top_News/2009/05/27/Clinton-N-Korea-must-face-consequences/UPI-36601243462350/&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNGHdmvRqQSTBEXH-dvEoX5LaJV_fA"><nobr>United Press International</nobr></a>&nbsp;- <a href="http://www.google.com/news/url?sa=T&ct=us/0-1-6&fd=R&url=http://www.google.com/hostednews/afp/article/ALeqM5iwb9ioSkAJOAJKRCgQfv4s9aWMfw&cid=1246822686&ei=mMAdSrXbGtKdlQf65LCJBw&usg=AFQjCNHQWg5AsGKcuXGG7moNPErpZcZM3A"><nobr>AFP</nobr></a></font><br/><font class=p size=-1><a class=p href=http://www.google.com/news?pz=1&ned=us&hl=en&ncl=dj3-cRGcAtrLCiMjO4SV67_oe2RqM><nobr><b>all 12,560 news articles</b></nobr></a></font><br clear=all> </div></font>

And from aggregator_item.description:

font style=font-size:85%;font-family:arial,sans-serifbrdiv style=padding-top:0.8em;img alt= height=1 width=1/divdiv class=lhtable border=0 align=right cellspacing=0 cellpadding=0cellpadding=3 style=font-size:100%;font-family:arial,sans-seriftrtd width=80 align=center style=padding-left:6px; valign=topa href=http://news.google.com/news/url?sa=Tct=us/0-1i-0fd=Rurl=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052700229.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNG0rv_TekZcwrpLxlrbLAy2KdFltgimg src=http://nt1.ggpht.com/news?imgefp=3eZ5RfaBt_oJimgurl=media3.washingtonpost.com/wp-dyn/content/photo/2009/05/27/PH2009052700231.jpg width=77 height=80 alt= border=1brfont size=-2Washington Post/font/a/td/tr/tablea href=http://news.google.com/news/url?sa=Tct=us/0-1-0fd=Rurl=http://www.bloomberg.com/apps/news%3Fpid%3D20601087%26sid%3DaiGdyvcQTLLA%26refer%3Dhomecid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHnvnZsA6MZcAfcjhbnJsK8jgEg3AbObama N. Korea Options May Be Limited by Regime Shift/b/abrfont size=-1bfont color=#6f6f6fBloomberg/font/b/fontbrfont size=-1By Indira AR Lakshmanan and Heejin Koo May 27 (Bloomberg) -- The Obama administration#39;s ability to pressure North Korea#39;s insular leadership to abandon nuclear weapons may be hamstrung by internal jockeying and unease in the communist state over who b.../b/fontbrfont size=-1a href=http://news.google.com/news/url?sa=Tct=us/0-1-1fd=Rurl=http://www.youtube.com/watch%3Fv%3D95v9d3w4tzMcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHumhczb7vl0CI71D4t0tARoKU5ngVideo: Reaction: Will North Korea#39;s nukes lead to war?/a font size=-1 color=#6f6f6fnobrUPI/nobr/fontobject width=448 height=356param name=movie value=http://www.youtube.com/v/95v9d3w4tzM/paramparam name=wmode value=transparent/paramembed src=http://www.youtube.com/v/95v9d3w4tzMtype=application/x-shockwave-flashwmode=transparentwidth=448height=356/embed/objectbr/fontfont size=-1a 
href=http://news.google.com/news/url?sa=Tct=us/0-1-2fd=Rurl=http://english.aljazeera.net/news/asia-pacific/2009/05/2009527195524608822.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNE5hdYoN7qNEdBjjj6OuolDDmnoigN Korea warned of #39;consequences#39;/a font size=-1 color=#6f6f6fnobrAljazeera.net/nobr/font/fontbrfont size=-1 class=pa href=http://news.google.com/news/url?sa=Tct=us/0-1-3fd=Rurl=http://www.washingtonpost.com/wp-dyn/content/article/2009/05/27/AR2009052702353.htmlcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNHU6JKN2yIBgXhck1oUbz22rXCi8AnobrWashington Post/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-4fd=Rurl=http://www.reuters.com/article/topNews/idUSTRE54Q5R620090527cid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNE_a9umAbB5q8utbbiJKgw5DspyewnobrReuters/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-5fd=Rurl=http://www.upi.com/Top_News/2009/05/27/Clinton-N-Korea-must-face-consequences/UPI-36601243462350/cid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNFcPFh6qwvi48m7D3Gi1iJ_iN8CMAnobrUnited Press International/nobr/anbsp;- a href=http://news.google.com/news/url?sa=Tct=us/0-1-6fd=Rurl=http://www.google.com/hostednews/afp/article/ALeqM5iwb9ioSkAJOAJKRCgQfv4s9aWMfwcid=1246822686ei=zMAdSvWdFtqdlQfG8J2OBwusg=AFQjCNH1zK4oeEQT7yxox24fZGwrIqMIywnobrAFP/nobr/a/fontbr/font class=p size=-1a class=p href=http://news.google.com/news?pz=1ned=ushl=enncl=dj3-cRGcAtrLCiMjO4SV67_oe2RqMnobrball 12,560 news articles/b/nobr/a/fontbr clear=all /div/font

Any chance I've got something misconfigured? I can't imagine what though...
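
One way to separate a bad feed from a local problem is to run a sample item through an independent XML parser and compare the result with what ends up in the database. The sketch below uses Python's standard xml.etree.ElementTree; SAMPLE is an assumed, cut-down item in the shape Google News sends, not a verbatim copy of the feed.

```python
import xml.etree.ElementTree as ET

# Assumed sample: a minimal RSS item with the HTML in <description>
# escaped as XML entities, as in the feed described in this report.
SAMPLE = """<rss version="2.0"><channel><item>
<link>http://www.google.com/news/url?sa=T&amp;ct=us/0-1-0&amp;fd=R</link>
<description>&lt;b&gt;Obama&amp;#39;s options&lt;/b&gt;</description>
</item></channel></rss>"""

item = ET.fromstring(SAMPLE).find("channel/item")
link = item.findtext("link")
description = item.findtext("description")

# A correctly working parser turns each entity back into its character.
print(link)
print(description)
```

If a parser outside Drupal returns "&", "<" and ">" intact for the same feed, the feed itself is probably well-formed and the loss is happening somewhere in the local parsing stack.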

#1

valderost - May 28, 2009 - 00:09
Status:active» closed

Sorry, the character entities in this post are being rendered as the actual characters...
Please "view source" of the blockquotes to see what the feed aggregator is picking up.

Thanks for any help!

#2

valderost - May 28, 2009 - 00:09
Status:closed» active

#3

fluidicmethod - July 13, 2009 - 15:50
Priority:normal» critical

HELP THIS IS HAPPENING TO ME
(update)
http://www.twincityscene.com/aggregator/

#4

chadd - July 21, 2009 - 16:35
Version:6.12» 6.13

also experiencing this issue.
anyone have a solution?

#5

bixwilson - July 22, 2009 - 20:32

Same problem! There must be a workaround for this. The Drupal.org aggregator has a Google News feed that works just fine...

#6

Pathol2187 - September 3, 2009 - 17:36

This issue seems to have died down but I am still having this problem. Anyone have the fix for this problem?

#7

chadd - September 3, 2009 - 21:21

we still have this problem as well.

#8

chadd - September 15, 2009 - 17:53

are we the only ones experiencing this?
is it a bug? bad feed? misconfiguration somewhere in our setup?

#9

valderost - October 3, 2009 - 02:18

Still happening in 6.14...

#10

chadd - October 6, 2009 - 14:03
Version:6.13» 6.14

#11

Lenn-art - December 3, 2009 - 08:20

Subscribe! The feed is looking good in Google RSS reader but not in Drupal.

 
 
