Aggregator titles display quotes and other characters with HTML entity equivalents badly

Steve Dondley - May 2, 2006 - 17:07
Project:Drupal
Version:7.x-dev
Component:aggregator.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs review
Description

The aggregator modules uses check_plain() on the title of a feed item but this results in html entity ugliness with quotes and other characters.

Patch attached used php function strip_tags() instead. Not sure if this is the way to go.

AttachmentSizeStatusTest resultOperations
aggregator_2.patch828 bytesIgnoredNoneNone

#1

Steve Dondley - May 2, 2006 - 17:41

Important clarification. This occurs when the feed already has html entities in it. For example, if the feed has a title of:

It's an Honor

(It's an Honor)

When it gets run through check_plain, the ampersand in the html entity get converted to an html entity, and it becomes:

It&amp#39;s an Honor

So the final output is:

It's an Honor

in the title.

#2

Steve Dondley - May 2, 2006 - 17:48

I can't properly display the html entities in the comments. It's all screwed up. Let me try with spaces before the semicolon. You should mentally remove them.

It&#39 ;s an Honor

gets converted to

It&amp ;#39; an Honor

which results in the following output:

It&#39 ;s an Honor

#3

Steven - May 7, 2006 - 15:02
Status:needs review» needs work

I'm pretty sure this is correct behaviour. Most feed fields contain regular, unmarked up text. Double-escaping like that in the source feed is only valid if the field contains escaped, marked up HTML text (such as the body).

In any case, this needs to be solved at the parsing stage, not the output stage. Drupal takes feeds in various formats.

Finally, using strip_tags() on output as a validation measure is disallowed. It should only be used as a conscious filtering op (such as done for taxonomy terms in title attributes).

#4

Steve Dondley - May 8, 2006 - 16:16

> Most feed fields contain regular, unmarked up text.

Well, Google uses htmlentities. They are a pretty major supplier of feeds.

#5

Roadskater.net - May 27, 2006 - 03:05

yes google news is pretty important and i like it, and it would be nice to have at least a checkbox available in aggregator setup to specify whether to unescape the title. i know nothing about this, and can't do this myself, i'm sorry to say. i had other google news related problems (duplicate news items) so i'm open to suggestions from those who may have good reasons for thinking google news is not important, or not best. thanks for anything anyone does to help us. please share if there's an adopted fix. thanks.

#6

kastaway - July 14, 2006 - 17:56
Version:x.y.z» 4.7.2

Does anybody have brainstorming ideas of how to work around this? I'd settle for just dropping the entities from the RSS feed, which would be better than the current mash of ascii characters on the front page of the site.....

#7

matt westgate - July 21, 2006 - 21:47

Rather than strip_tags(), you'll want to use filter_xss($item->title, array()) around the item title. It's been shown on the security mailing list that strip_tags() can be bypassed in terms of XSS exploits.

#8

rosenblum68 - September 18, 2006 - 17:16
Version:4.7.2» 4.7.3

I downloaded newest version of 4.7.3 aggregator.module, and it still has the problem. I has to modify the file in multiple locations (using advise for filter_xss) to get it to work both in the block AND the "more" summary page... just a heads up. http://wellweight.org

#9

edmund.kwok - October 5, 2006 - 18:14
Version:4.7.3» x.y.z
Assigned to:Steve Dondley» edmund.kwok
Status:needs work» needs review

Changed strip_tags to filter_xss as suggested and in two other places, block_item and summary_item, including the original page_item.

Btw, issue also applies in HEAD, http://drupal.org/node/63459, marking that duplicate because this was reported first. Also, changing version to cvs to get more attention.

Patch is for cvs version.

AttachmentSizeStatusTest resultOperations
aggregator_3_0.patch1.88 KBIgnoredNoneNone

#10

edmund.kwok - October 5, 2006 - 18:15

Patch for 4.7 cvs.

AttachmentSizeStatusTest resultOperations
aggregator_47.patch1.88 KBIgnoredNoneNone

#11

ahoeben - October 9, 2006 - 08:26

In this issue (which got sidetracked), I suggested cleaning up the title before it is inserted in the database. This has the obvious benefit of cleaning it up only once, making my patch a lot smaller than the one proposed here.

The argument against cleaning up the tile before insertion in the database is that it is not the Drupal way to do so. For example, when adding a title to a node, the title is inserted into the database as is (after making it database safe); no removal of tags etc. The rationale behind this is that when the user goes to edit the node, he would expect his/her original input and not a parsed/filtered version of it. There is an obvious need to keep the original input, for consistency to the user.

In the case of the aggregator module however, there is no need to store the original input, which is machine generated and which will never be edited. So I'ld say: filter before insertion in the database, and be done with it.

#12

jacauc - November 22, 2006 - 12:35
Version:x.y.z» 5.x-dev

Will this patch be part of drupal 5.0 core?
I am running drupal 5 CVS on a test site, and I still see & # 3 9 ; (spaces inserted there to prevent it from being parsed)

#13

drumm - November 27, 2006 - 00:47
Status:needs review» needs work

Tags need to be stripped from titles.

#14

edmund.kwok - November 27, 2006 - 08:42
Assigned to:edmund.kwok» Anonymous

Releasing this issue back into the wild. I'm not sure what's the best way to fix this..

#15

msmiffy - February 22, 2007 - 01:30

I see that this problem still exists in 5.1. Any sign of a permanent, main-stream resolution? I'll just go hack the code for now as this is UGLY.

#16

jacauc - February 22, 2007 - 05:33

I agree 100%

#17

csevb10 - April 10, 2007 - 21:14
Version:5.x-dev» 6.x-dev
Assigned to:Anonymous» csevb10

I repatched this for Drupal 6.
There's an aggregator_filter_xss so you can actually filter the feed content independent of the site content which seems like a slightly better solution than invoking the filter_xss directly. I patched the same 3 areas as before.
What does everyone think of this as a viable solution moving forward?

AttachmentSizeStatusTest resultOperations
aggregator_checkplain.patch1.86 KBIgnoredNoneNone

#18

csevb10 - April 11, 2007 - 16:24
Status:needs work» needs review

#19

agentrickard - June 11, 2007 - 02:53

I'm going to make a run at closing the Aggregator queue for D6.

Personally, I like http://drupal.org/files/issues/aggregator-striptags.patch and agree with ahoeben from #11.

This is machine-text, so keeping the "integrity" of the original title string is not relevant. Stripping the entities seems perfectly fine during the parsing stage.

#20

catch - October 30, 2007 - 11:56
Status:needs review» needs work

No longer applies.

#21

ezra-g - November 10, 2007 - 21:48

Since it was only one line, in D5.3 I just made the quick edit, removed the items from the feed, updated the feed and had the same #39 business.

#22

underpressure - January 7, 2008 - 21:00

What is the status of this problem? I use Google news and get this ugly output as well.

#23

alpha2zee - February 12, 2008 - 08:25

Drupal's architecture needs to be changed to allow admins complete control over filtering of both input (what is stored in the database) and output (the final, displayed HTML). Further, this filtering should be customizable as per every content type (newsfeed, user comment, and so on). E.g., machine-generated newsfeed titles (database stored) can be filtered for XSS, appropriate HTML tags, etc., during the input stage to avoid the overhead of similar filtering everytime the title is output as web page content.

Also, appropriate code/principles should be used from better (than Drupal's Filter module) filtering scripts like htmLawed and HTMLPurifier. Though the mentioned scripts can be used in plugged-in modules, they then either cannot have certain functionalities (like, to address the newsfeed-entity issue), or re-do some actions that Drupal's core does anyway (increasing the processing time).

#24

Arancaytar - February 12, 2008 - 08:48

What is the valid standard in XML - " or " or & #39;? I don't know how much more attention people pay to feed validity than website validity - would it be enough to require validity and reject quirky feeds, or do we lose a lot of functionality?

Edit: I note that whatever filter is here on d.o. is one whose code I would probably yell upon seeing. You do not unescape & amp; # 39; twice to make '. You just don't.

#25

alpha2zee - February 13, 2008 - 00:24

Lines 289, 1179, 1267, 1352, 1367 and 1394 of the drupal/modules/aggregator/aggregator.module file (version timed 1-10-08 22:14) have code like:

... check_plain($item->title) ...

Looks like replacing all those six check_plain occurrences with aggregator_filter_xss should take care of the issue.

#26

jdefay - June 10, 2008 - 22:58

I implemented the changes suggested by ahoeben above in his patch in post #11. Then I tried changing all the check_plain references to aggregator_filter_xss, but still got funky characters in Google News feeds on my site .

ahoeben's patch seems to clean titles that show up in feed blocks, but not in the Drupal aggregator category view which was still giving me funky single quotes in feed titles. It looks like the funky symbols are still being stored in the database, and ahoeben's patch fixes them when they are displayed in blocks.

Since the funky stuff was still getting displayed I wanted to remove them before they go into the database & thus make ahoeben's display patch unnecessary. So I added htmlspecialchars_decode() in the aggregator.module aggregator_parse_feed function.

It works great, whenever a "& #39;" shows up in the title of my news feeds, it's replaced with a single quote and then inserted into the database for later use.

Here's the line before:

$title = $item['TITLE'];

Here it is after:

$title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);

#27

TimAlsop - August 9, 2008 - 22:00

Can somebody let me know what patch/changes are recommended for a Drupal 6.3 environment ? I am happy to modify code, but wanted to know which code changes to make so that these special characters are changed before the google news feed items are written to the database.

#28

dominich - August 11, 2008 - 19:39

For me on 6.3 it's as the above post. I have a google news feed displayed in a block on a site, and the title links were showing &blah; style text - much in the same way that Firefox RSS feed reader does for sites like Slashdot et all interestingly.

Anyway - all I had to do was as jdefay says:

add htmlspecialchars_decode() in the aggregator.module aggregator_parse_feed function.

Here's the line before:

$title = $item['TITLE'];

Here it is after:

$title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);

jdefay rox.

D

#29

TimAlsop - August 11, 2008 - 22:16

I tried this, but when I update feed I get:

Fatal error: Call to undefined function: htmlspecialchars_decode() in /data01/mysite/public_html/drupal/modules/aggregator/aggregator.module on line 738

Is there something else I need to do so that this function is available ?

Thanks,
Tim

#30

TimAlsop - August 11, 2008 - 22:24

I just noticed that the htmlspecialchars_decode() function is only available in PHP >= 5.1.0. I am using PHP 4.4.4 which is supported by Drupal 6.3. Unfortunately I am unable to upgrade PHP because my site is hostsed on a server managed by my ISP and uses cpanel. So, the version of PHP is out of my control.

Is there a PHP 4.4.4 function which will do same/similar, or do I need to suffer this until PHP 5.1.0 or later is available ?

Thanks,
Tim

#31

TimAlsop - August 11, 2008 - 22:34

I found some suggestions at http://uk.php.net/htmlspecialchars_decode

So, I have now changed code as shown below:

if ( !function_exists('htmlspecialchars_decode') )
{
    function htmlspecialchars_decode($text)
    {
        return strtr($text, array_flip(get_html_translation_table(HTML_SPECIALCHARS)));
    }
}

    // Resolve the item's title. If no title is found, we use up to 40
    // characters of the description ending at a word boundary but not
    // splitting potential entities.
    if (!empty($item['TITLE'])) {
      $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
    }

I don't get any error now, so I will see if I get any news items with special characters included. I will report my findings in next few days.

#32

bradnana - November 25, 2008 - 04:32

This bug still exists as of D6.6 in core aggregator.module. Any plans for a patch to be brought into core?

--Brad

#33

GetActive - January 14, 2009 - 21:53
Version:6.x-dev» 6.8

It also/still happens on the "domain.com/user" page where the title is generated dynamically below the login form.

#34

domesticat - January 22, 2009 - 14:27

I can confirm that @jdefay's patch (#26) works well for >PHP5; I've been using it for months. I have not tested the other version, nor can I vouch for whether or not doing this is 'the Drupal way.' (I just know that my users need the feeds to work properly!)

#35

thiokol - February 6, 2009 - 11:31

thank you jdefay in #26, after much useless information your advice has finally solved my problem :)

#36

ztnews - February 26, 2009 - 15:59

jdefay,

I just looked at your site, which has a Google News Feed here: http://jason.defay.org/aggregator/categories (titled 'UCSD In the News').

How did you get that to work? When I setup something similar, all the ampersands (&) are stripped out of the links, which breaks them. Example:

http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://www3.sign...

http://news.google.com/news/url?sa=Tct=us/0-0fd=Rurl=http://www3.signons...

Any suggestions? Thx.

-- Cronin

#37

jdefay - February 26, 2009 - 17:11

Hi Cronin,

I just looked at the Google News Feed 'UCSD In the News' you mentioned on my site and noticed it does in fact have the ampersands in the links. Are you seeing something different on my site?

I just upgraded to 6.10 and haven't reimplemented any of these patches described in this thread. So I can't say for certain if the patches will actually strip the Google tags or not.

What I can say is that I got frustrated with the way Google feeds were causing frequent problems with the Drupal aggregator. I switched over to Yahoo news feeds and am getting consistently clean results. It's too bad because I like the Google news service better and would prefer to use it over Yahoo. I just didn't want to deal with headaches anymore.

Jason

#38

jdefay - February 26, 2009 - 19:05

All,

Since Cronin asked about it, I decided to go in and re-implement the simple patch I described in my earlier post to this thread.

I added the htmlspecialchars_decode function back into the aggregator.module, removed all the old items from the news feed, then refreshed it. Viola, no more funky characters in the news feed titles.

Since several new updates/releases of Drupal have happened since my original patch, and all of them still contain the old problematic PHP code, perhaps someone can volunteer to help me get this fix added to the next version of Drupal so we don't all have to keep re-patching every time a new release comes out.

Please feel free to respond to this thread and/or email me if you can help make this happen.

Cheers,

Jason

#39

jdefay - February 26, 2009 - 19:45
Status:needs work» reviewed & tested by the community

Tim,

Have you had any errors on this yet? If not, your aggregator patch seems to be a better, more backward compatible, improvement over mine.

I implemented your version and tested it on my site. It works great so I'd appreciate it if any of you who are reading this & willing to help would do the same and report back here.

Cheers,

Jason

#40

jdefay - February 26, 2009 - 20:39
Status:reviewed & tested by the community» needs work

Oops, I shouldn't have changed the patch status since it looks like the original patch isn't the one we trying to implement. I'm setting back to 'code needs work' until someone can create and submit a patch that meets the Drupal patch standards. Here is what I'm proposing someone help me do:

Write a valid patch that replaces line 737 through 741 with the following:

if ( !function_exists('htmlspecialchars_decode') )
{
    function htmlspecialchars_decode($text)
    {
        return strtr($text, array_flip(get_html_translation_table(HTML_SPECIALCHARS)));
    }
}

    // Resolve the item's title. If no title is found, we use up to 40
    // characters of the description ending at a word boundary but not
    // splitting potential entities.
    if (!empty($item['TITLE'])) {
      $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
    }

As soon as that patch is up and tested I'll work on getting it into the queue for the next Drupal release.

Cheers,

Jason

#41

domesticat - February 26, 2009 - 19:48

Been using this patch for a while with no issues; would be glad to see it get in the next version.

#42

ztnews - February 27, 2009 - 01:41

Jason,

"I just looked at the Google News Feed 'UCSD In the News' you mentioned on my site and noticed it does in fact have the ampersands in the links. Are you seeing something different on my site?"

No, your site looks fine. But my site (http://www.zts.com, see the section "Google Web Search Filtered for Recent Thermoelectric Content"). Check the links: "&"s are stripped out, and broken. I checked and they are stored in the DB that way too.

I just upgraded to drupal 6.9 (a day before 6.10 came out!).

Any thoughts on what's going on? Thanks for the replies.

-- Cronin

#43

jdefay - February 27, 2009 - 20:10

Sounds like the patch isn't working for some reason. Have you done all of the following?

1. Checked to make sure you have PHP version 5.1.0 or greater installed on your server
2. Changed line 741 in the aggregator.module from $title = $item['TITLE']; to $title = htmlspecialchars_decode($item['TITLE'], ENT_QUOTES);
3. Removed all previous posts from your DB and re-imported them.

You might also try out the patch I just created below. Please give me feedback on the patch, it's in need of QA testing!

Thanks,

Jason

#44

jdefay - February 27, 2009 - 20:09
Version:6.8» 6.10
Assigned to:csevb10» jdefay

Hi everyone,

I've created and tested the attached patch on my Drupal 6.10 site and am looking for volunteers to do additional patch testing.

The patch chages the aggregator.module as generally descibed in my posts above. It first checks to see if the server has the htmlspecialcharacters_decode() function available (PHP 5.1.0 and greater). If so it uses the function to clean up HTML in the article titles and saves the results to the database. If the function isn't there for some reason, it defaults to the prior method for saving article titles.

I'll give it a few weeks for testing, then if I don't hear anything I'll assume it's good to go and try to get it into the next patch release.

Thanks,

Jason

P.S. Thanks to domesticat for turning me on to WinMerge!

AttachmentSizeStatusTest resultOperations
aggregator_final.patch1.05 KBIdleFailed: Failed to apply patch.View details | Re-test

#45

ztnews - February 28, 2009 - 20:06

Jason,

Thanks much for the help, but after further testing & searching I'm pretty sure my problem is distinct from this thread.

A drupal thread (Aggregator Module Shows HTML Gibberish) which described my problem quite closely (dropping of key symbols like "&") is at this link: http://drupal.org/node/346990

It seems there is a known bug in one of the php libraries, libxml2 2.7.1,which strips out key html symbols (like <, >, & etc.). The bug is described here: http://bugs.php.net/bug.php?id=45996

My host's php installation has libxml2 2.7.2, which (if I'm reading the libxml release notes right) doesn't have the fix in it. But the latest release, libxml2 2.7.3, should. So only a few systems will be affected by this particular bug, and it will go away as the php libraries get updated.

In mean time, guess I'll use some work-around.

Thanks again .

#46

jdefay - March 18, 2009 - 04:42
Status:needs work» reviewed & tested by the community

I fear there may not be sufficient traffic on this thread to get additional patch testing done. After waiting a few weeks for testers to evaluate the final version of the patch, and not getting any responses, I've gone ahead and updated the status to "reviewed & tested by the community".

While nobody seems to have tested the final version, it was tested and improved by a number of users since it was first created in May of 2006 by Steve Dondley. Since those earlier iterations led us to this final version of the patch, I think it has been sufficiently tested to be committed to the next version of Drupal.

I apologize in advance for not being able to get other developers to provide a supporting opinion on the patch. However, I'm confident this patch will work just fine, particularly since the final patch checks to make sure the htmlspecialchars_decode() function is available on the server before running it.

Cheers,

Jason

AttachmentSizeStatusTest resultOperations
aggregator_final.patch1.05 KBIdleFailed: Failed to apply patch.View details | Re-test

#47

jdefay - March 18, 2009 - 16:54
Version:6.10» 6.x-dev

After re-reading the Drupal PATCH instructions, it appears that I should be upgrading this bug to Version "6.x-dev", so that's what I just did :-)

#48

packdragon - March 25, 2009 - 00:52

I've been having problems with my aggregator in that my feeds are displaying HTML code. I've got Drupal 6.10 installed. This patch sounds like exactly what I need. I clicked on the attachment, but it appears to display some text in my browser. How do I download and install this patch?

#49

catch - March 25, 2009 - 01:19
Version:6.x-dev» 7.x-dev
Status:reviewed & tested by the community» needs work

The patch seems to be reversed - htmlspeciachars_decode() is removed, not added by #46.

Additionally the current development version of Drupal is 7.x rather than 6.x - where possible we fix issues in 7 first, then backport. http://api.drupal.org/api/function/aggregator_parse_feed hasn't changed that much (although it's in a different file now), so it'd be roughly the same change to make, although Drupal 7 requires PHP 5.2 or greater, so no need for the function_exists() there. Only did a very cursory review of the patch, but if you could re-roll against 7 that'd help to get some more reviewers.

#50

alex_b - March 26, 2009 - 03:13
Status:needs work» needs review

This might work.

AttachmentSizeStatusTest resultOperations
61456_html_decode_titles.patch1.04 KBIdleFailed: Failed to apply patch.View details | Re-test

#51

Arancaytar - May 17, 2009 - 09:24
Status:needs review» needs work

The patch looks good, and this is a great idea, but we definitely need a test case for it.

#52

jdefay - July 11, 2009 - 05:27
Status:needs work» needs review

Thanks for re-rolling this Alex_b, sounds like we're ready to test & patch

#53

System Message - July 11, 2009 - 05:50
Status:needs review» needs work

The last submitted patch failed testing.

#54

jeffschuler - August 17, 2009 - 20:14
Assigned to:jdefay» Anonymous
Status:needs work» needs review

Re-rolled for current D7 HEAD.

#55

jeffschuler - August 17, 2009 - 20:15

Re-rolled for current D7 HEAD.

AttachmentSizeStatusTest resultOperations
61456_html_decode_titles_2.patch1021 bytesIdleFailed: 12389 passes, 4 fails, 0 exceptionsView details | Re-test

#56

System Message - August 26, 2009 - 11:49
Status:needs review» needs work

The last submitted patch failed testing.

#57

jeffschuler - August 31, 2009 - 20:34
Status:needs work» needs review

Testbot says that the Error handlers test failed. It passes on my machine.
Patch still applies to HEAD.
Re-setting to Needs Review for another shot.

 
 

Drupal is a registered trademark of Dries Buytaert.