I am trying this module to get rid of scripts, and it is not doing so. The tags are gone, but the script text remains. I have set it to run first. Yes, I edited the node and yes, I cleared cache.
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | forrester_filter.zip | 1016 bytes | nancydru |
Comments
Comment #1
danepowell commented
Hi NancyDru
Can you please post the raw input you are having problems filtering as a text file?
FYI, this module only acts on style tags at the moment, because those are the only ones I've had problems with, but I'd be happy to expand it to include script tags as well if it's not too difficult.
Also note that this module doesn't "get rid" of anything on its own - it simply HTML-comments-out offending sections of code, which can then be stripped by the core HTML filter.
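The two-stage idea described above (comment out the block, then let a later filter strip it) can be sketched roughly as follows. This is only an illustration, not the module's actual code: the function names `example_comment_out_style()` and `example_strip_comments()` are hypothetical, the second one is a simple stand-in for whatever the core HTML filter does with comments, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Hypothetical sketch: wrap each <style>...</style> block in an HTML comment.
function example_comment_out_style($text) {
  return preg_replace_callback(
    '/<style\b[^>]*>.*?<\/style\s*>/is',
    function ($m) {
      // Word often puts its own comment markers inside <style>; remove them
      // so they cannot terminate the wrapping comment early.
      $inner = str_replace(array('<!--', '-->'), '', $m[0]);
      return '<!--' . $inner . '-->';
    },
    $text
  );
}

// Stand-in for the second stage (assumption: a later filter removes comments).
function example_strip_comments($text) {
  return preg_replace('/<!--.*?-->/s', '', $text);
}

$html = '<p>Hi</p><style><!-- p { color: red; } --></style><p>Bye</p>';
$commented = example_comment_out_style($html);
$clean = example_strip_comments($commented);
```

The cost of this design is that the output depends on a second filter being enabled and ordered after this one; if it is not, the commented-out text is still sent to the browser.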
Comment #2
nancydru
Ah, so it doesn't take care of <script> tags, or the many crappy things Word does (like <o:p> or ”)? If that's the case, it doesn't help me at all. I really want all that Word stuff gone.
Comment #3
danepowell commented
As I mentioned, no, because I have yet to see them in Office-generated content. However, if you are having problems with them, send me a copy of your raw input so I can take a crack at adding support for them.
Those should be taken care of using other filters such as the core HTML filter, by either whitelisting other tags or blacklisting and stripping the offending ones. If that's not working for you (i.e. the core HTML filter is broken in yet another way...) let me know.
I'm beginning to think this module was poorly named. It is not a turnkey solution for killing Office HTML gunk; it is simply meant to hide the content contained within header tags (style, script, meta, etc.), thus filling a gaping hole left by Drupal's core HTML filter (see #447684: HTML Filter does not strip text between 'style' and 'script' elements). In combination with that filter (or others), it is very easy to hide Office-generated gunk in a very general way. Perhaps I will update the description page to highlight that fact.
FWIW, if a turnkey solution is what you are looking for, I don't think such a module would be practical or represent best practice, as we'd constantly be chasing new variants of HTML crud as they are introduced by Microsoft, not to mention that on any production site, whitelisting (supported by the core HTML filter) should always be used over blacklisting.
Comment #4
danepowell commented
Comment #5
danepowell commented
Comment #6
nancydru
As far as I'm concerned you can close this. I have abandoned this module and created my own.
Comment #7
danepowell commented
I hope I have not generated any antipathy here. I'd be interested to know more about the module you created and to discuss whether it would be beneficial to the Drupal community for us to work together on this, or whether we are really trying to fill different niches.
Comment #8
nancydru
No, I just didn't understand this module before. I had hoped someone had already dealt with the crap I get when people copy and paste from Word and then add scripts. Here's what I have.
Comment #9
danepowell commented
Okay, I think we should work together on this and incorporate the features you've added into this module, if that's alright with you. A few thoughts:
1) Your module deals with script tags as well as style tags. Awesome. I'd like to get rid of xml tags as well, to deal with #735496: Filter XML tags.
2) You take a much more direct approach, by stripping the tags and contents altogether instead of commenting them out. I like that, it's probably easier and has less overhead than relying on a second filter to strip the content.
3) You also decode a bunch of HTML entities that aren't "law-abiding", but they all look legitimate to me, and I can only imagine them causing trouble if you're trying to view something in plain text. But it seems like in that case you should enable a "plain text" filter to decode all HTML entities. What are your thoughts on this?
Comment #10
nancydru
Actually, the biggest problem with the entities is that when they get into a title, Drupal can go bonkers and include only part of the title or even none at all. So the other module that calls check_markup also uses that filter on the title.
Beyond that, maybe all browsers can handle them now, but that has not always been the case, nor do I know what other languages do with them.
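As a rough illustration of decoding entities before text is used as a title, here is a minimal sketch. It uses PHP's built-in `html_entity_decode()` rather than the specific entity list from the attached module (which is not reproduced in this thread), and the function name `example_clean_title()` is hypothetical.

```php
<?php
// Hypothetical helper: turn named entities such as &ldquo; back into
// literal UTF-8 characters so a title field holds plain text.
function example_clean_title($title) {
  return html_entity_decode($title, ENT_QUOTES, 'UTF-8');
}

$title = example_clean_title('Q1 &ldquo;Results&rdquo; &amp; more');
```

Decoded text should still be escaped again at output time (for example with check_plain) if it is printed into HTML.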
My users don't save their content as HTML from Word and then paste the whole thing in, as it looks in that other issue. They are just copying the text straight from Word and pasting that. Then they add some scripts, mostly Google Analytics tracking. But allowing any scripts is a security nightmare, hence my desire to get rid of them.
There might be faster ways to scan the content. I would think that the technique I use would work with XML as well.
$text = preg_replace('/<xml.*?<\/xml>/xmsi', '', $text);
I don't care for the commenting technique because I have sites that I have developed where many of the users are still using dial-up, so I don't send any more text than I have to.
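The same preg_replace pattern can be applied to several tag names in a loop. The sketch below is an assumption about how that might look, not the code from either module; `example_strip_blocks()` and its default tag list are made up for illustration.

```php
<?php
// Hypothetical helper: remove each listed element together with its contents.
function example_strip_blocks($text, array $tags = array('script', 'style', 'xml')) {
  foreach ($tags as $tag) {
    $name = preg_quote($tag, '/');
    // s: let "." match newlines; i: match <SCRIPT> as well as <script>.
    // \b keeps "xml" from matching a longer tag name such as <xmlfoo>.
    $text = preg_replace('/<' . $name . '\b.*?<\/' . $name . '\s*>/si', '', $text);
  }
  return $text;
}

$dirty = '<p>A</p><SCRIPT type="text/javascript">alert(1);</SCRIPT><xml><o:x/></xml><p>B</p>';
$stripped = example_strip_blocks($dirty);
```

A regex like this only catches well-formed open/close pairs, so it is best treated as a cleanup step in front of a whitelisting filter rather than as a security boundary.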
Feel free to use any of my code. My module that calls the filter is easily reconfigurable, so I can test the result when it's ready.
Comment #11
danepowell commented
Okay, check out the latest dev release (should roll in the next 12 hours). It uses your style/script/xml tag and HTML entity filters. Let me know if you have any more suggestions. If there are no complaints, I'll roll 6.x-1.1.
Comment #12
nancydru
It will take me a bit to test again as I uninstalled it. We also just had a big installation on my customer's site that is probably going to have some fixing to do in a few hours.