The HTML Filter does a great job of stripping malicious and malformed elements. However, for certain types of elements such as <script></script>
and <style></style>
users should have the option of stripping out text between these elements as well, since it makes no sense to display style code or scripts in most situations. I can only think of these two examples right now, but in general this might apply to all elements appearing in the document head...
Specifically, I am having a problem dealing with HTML code where style code and scripts are not already commented out between <!-- -->
, which I'm assuming looks like a malformed tag to HTML Filter. Thus, an alternative solution might be to create a filter that just goes through and comments out text between these tags, and runs before HTML Filter, which will then remove the text.
Comment | File | Size | Author |
---|---|---|---|
#34 | html_filter_does_not-447684-34.patch | 1.5 KB | cilefen |
#34 | interdiff-447684.txt | 1.1 KB | cilefen |
#31 | html_filter_does_not-447684-31.patch | 2.24 KB | cilefen |
Comments
Comment #1
Dane Powell CreditAttribution: Dane Powell commentedIs there a reason that this hasn't been commented on? It's a relatively large issue on our site. I am happy to try to put in the legwork on this, but I am not very familiar with the module, and could use some guidance.
Additionally, because I believe the filter is mistaking (albeit conveniently) HTML comments for invalid tags, I am changing this to a bug report.
Comment #2
Dane Powell CreditAttribution: Dane Powell commentedComment #3
Dane Powell CreditAttribution: Dane Powell commentedbump... this is still an issue with the latest version. Please let me know if there is a better place to post this, or if it's just not going to be fixed.
Comment #4
j0rd CreditAttribution: j0rd commentedSame issue. I would like to make sure these styles are stripped and not displayed.
I believe the people who are having issues with this are people who are using mailhandler, in which an entire HTML Email will get inputted into the body, which usually contains
and tags along with tags and need the output to look semi-decent.Comment #5
apadernoI agree it doesn't make sense to strip script tags but leave their content; the only reason I can think of to show the script code is to allow an administrator user to notice there is content that must be removed (if showing the JavaScript code mixed with normal content would really allow to notice it).
Comment #6
Dane Powell CreditAttribution: Dane Powell commentedAs I mentioned in #75229: Compatibility with Outlook-generated HTML, I built a little input filter that can run before HTML Filter and will cause it to strip out this gunk. It's hackish, but if no one is going to try to patch the HTML Filter then mine might be the next best option.
Comment #7
Dane Powell CreditAttribution: Dane Powell commentedI have created a less-hackish patch for #75229: Compatibility with Outlook-generated HTML that comments out uncommented style definitions, thus allowing them to be stripped out by HTML filter. This patch could probably be easily adapted to apply to HTML filter directly, but I'm not sure how to do it.
Comment #8
Dane Powell CreditAttribution: Dane Powell commentedCome on maintainers, it's clear that a lot of people are affected by this issue, and we're having to resort to fixing it ourselves multiple times in individual modules. Obviously it should really be taken care of here. If you're not willing to fix this for some reason at least say so.
Comment #9
apadernoI guess it is not considered a critical bug.
Comment #10
j0rd CreditAttribution: j0rd commentedI believe it's considered a critical bug by everyone in this thread.
I personally can't see any reason to leave the stuff in between the tags in the output, but I can see many reasons why people would want it removed.
Issue is probably that with a filter that's so widely used, the maintainers are unwilling to change something that might effect many already existing installs. In which case this patch should at least be rolled into something less stable like DRUPAL-7 and for Drupal 6 we can use the extra module provided by Dane Powell.
I would still like to see the maintainers ring in on this, so us people who are effected by thing, can provide a proper solution for ourselves.
Comment #11
apadernoI meant that the issue seems to not be considered critical from the maintainers.
I am the first to think the content of the script tag should be removed together the script tag. To only remove the tag works for all the other tags, which have text content that can be shown; it is not the case for the script tag, which has content that is not text, but script code that should be executed by the browser.
If I would forget that the currently set input format does not allow me to use the script tags, I would like better if the script code would not be shown in the page.
Comment #12
lyricnz CreditAttribution: lyricnz commentedDrupal's "input formats" are made up of an ordered list of "filters" which are applied (in order) to the text of a particular format. If you have a site-specific requirement to remove content from *inside* elements, you should use a module that provides a filter that does this - then add this filter to your input format.
Perhaps there's one already at http://drupal.org/taxonomy/term/63
For a good background on Filters and Formats see http://www.lullabot.com/articles/drupal_input_formats_and_filters
Comment #13
apadernoThe problem here is not choosing an input filter to use, but a Drupal input filter that is not doing completely its task (or it is not doing what users expect from it).
If it makes sense to only strip the HTML tags when the tag is something like
<h2>
,<a>
(compare "Test" with "Test"), it doesn't make sense to do the same with a tag like<script>
, where the content of the tag is not text that should be shown.Comment #14
Damien Tournoud CreditAttribution: Damien Tournoud commentedTo all people in that thread that consider this issue as "critical": nothing in Drupal happens unless *someone* actually do something for it. Reassigning to Drupal 7, where it should first be considered, before an (eventual) back-port to Drupal 6.
Comment #15
Dane Powell CreditAttribution: Dane Powell commentedI think the first step in determining how best to fix this is to determine whether it is a bug or a feature request.
To me, it is quite clearly a bug - if I was the one designing this filter, I would certainly have it strip out text that would otherwise be hidden (such as scripts and styles) rather than dumping it (and there is sometimes a LOT of it) onto a page.
If you don't agree with this, please speak up. But if this is what everyone else would expect as well, then this issue quite clearly fits the definition of a bug and should be fixed in 6.x.
Comment #16
lyricnz CreditAttribution: lyricnz commentedIt seems like a feature request to me - the filter does what it says it does: remove *tags*. The fact that you want it to remove the *stuff between some of the tags*, is new/extra - the definition of a feature request. But, this distinction needn't effect whether it gets done.
Seriously, this would be pretty easy in a contrib module.
Comment #17
Dane Powell CreditAttribution: Dane Powell commentedThere's clearly a discrepancy between the filter's prescribed and expected behavior, at least for many users. Given that, I'm not exactly sure what side of the line this issue falls on.
At this point I'm just interested in getting this patched up in 6.x, so I'll probably roll a module based on the patch in #75229: Compatibility with Outlook-generated HTML. If others are able to add an option to strip text between disallowed tags to core in 7.x, that would be great.
Comment #18
apadernoI think that normally by "HTML tag", one means "the tag, and its content"; the content of a tag is not seen as an entity separated from the tag that contains it.
Comment #19
lyricnz CreditAttribution: lyricnz commented#18: so are you saying that <em>something important</em> should be removed entirely? What about <b> or <span> or <div> etc? (the actual selection of tags is up to the site admin). So clearly you don't mean "remove all contents from all removed tags".
Comment #20
Dane Powell CreditAttribution: Dane Powell commentedlyricnz, I don't think anyone is saying that this is an all-or-nothing affair. The header tags that we are talking about (style, script, meta, title) are functionally and categorically very different from content and styling tags (such as the ones that you mentioned).
Comment #21
sjtout CreditAttribution: sjtout commentedHi Dane -- have you tried your regular expression with the Custom Filter module? -- I'm thinking about putting it in there -- it would be nice to have it as a separate filter, so it could be assigned to run before HTML Corrector, but HTML Corrector could run before HTML Filter...
http://drupal.org/project/customfilter
Comment #22
Dane Powell CreditAttribution: Dane Powell commentedI hadn't heard of that module before, but it looks promising; as I mentioned in the other thread I'll try that. It's the next-best solution to actually getting this patched in core.
Comment #23
Dane Powell CreditAttribution: Dane Powell commentedFYI, I could not get other filter modules to deal with this so I created my own. Check out the Office HTML filter if you want to "fix" this problem yourself.
Comment #24
sunsubscribing
Anyone up for implementing this? Shouldn't be too hard.
Comment #25
Dane Powell CreditAttribution: Dane Powell commentedI have two ideas of how to do it but not the time at the moment to follow through - basically just adding a preg_replace or two to the filter_xss function:
1) Use a preg_replace like the following to remove offending tags and content (originally suggested by NancyDru):
$text = preg_replace('/<style.*?<\/style>/xmsi', '', $text);
However this sort of departs from the current architecture which only acts on individual tags (which is exactly what causes this problem)
2) Use the strategy of the office_html filter module to turn the offending content into a giant malformed tag that'll get stripped out in the normal process of the function:
Comment #26
Alan D. CreditAttribution: Alan D. commented#25-2 These will not match all tags. Office filter now uses this
Comment #27
kscheirerVerified the behavior described in the issue is still present in 8.0-alpha4.
sun (#24) and kiamlaluno (#13) agree this should be considered a bug - the default expectation is that text inside a script or style tag would be removed along with the tag itself.
This would also need a test.
Comment #28
star-szrD7 d.o does funky things to HTML tags in issue titles so re-titling to fix.
Tagging for tests per the last comment.
Comment #29
Wim Leers#28: issue titles are not filtered by text formats, but by
check_plain()
?Comment #30
star-szr@Wim Leers, no idea. All I know is that issue titles with HTML get messed up in certain issue views, can't remember exactly how atm but it might just strip out the tags.
Comment #31
cilefen CreditAttribution: cilefen commentedComment #32
cilefen CreditAttribution: cilefen commented#31 is inefficient because it is serializing the html an additional time. It would maybe be preferable to add some kind of option to XSS::filter() to strip the inner tag contents.
Comment #34
cilefen CreditAttribution: cilefen commentedComment #42
apadernoComment #43
kscheirerThis probably needs a test, but otherwise the patch in #34 seems solid.
Comment #49
irsar CreditAttribution: irsar commentedI am having the same issue in drupal 8.9.16. In one of the view we are loading the content which has script in it. And when we strip html it strips just the
tags and leaves the content inside the script tag. So we end up showing just the script content and not the actual html content of the page. I am not able to find "function _filter_html($text, $filter)" in "/core/modules/filter/filter.module" as mentioned in the patch #34. Any help would be greatly appreciated and thanks in advance.Comment #50
Alan Delval CreditAttribution: Alan Delval commentedI need to do more testing before publishing
Comment #54
smustgrave CreditAttribution: smustgrave at Mobomo commentedThis came up as a daily bugsmash target.
Testing with 10.1 I added
and confirmed alert still appears.
This ticket will still need a test case (or expand existing ones).
Plus an issue summary update using the default template
thanks.
Comment #56
smustgrave CreditAttribution: smustgrave at Mobomo commentedSo brought this up in #bugsmash with @larowlan and @longwave and consensus seems to be this is a feature we may want to keep. Shows that a user has done something they shouldn't.
Also the code that would probably have to be edited is Xss.php which is a nightmare, and would impact other tags as well.