| Project: | Views Bonus Pack |
| Version: | 6.x-1.x-dev |
| Component: | Views Export |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
Issue Summary
In views_bonus_export.theme.inc are these lines, starting at line 63:
<?php
foreach ($row as $field => $content) {
$vars['themed_rows'][$num][$field] = str_replace(
array('&', '<', '>'),
array('&', '<', '>'),
$content);
}
?>A few problems.
First, there is no need to check for >. The W3C's XML spec sec. 2.4 only says to worry if it's within a CDATA section (because it could conflict with ]]>), but those were removed per #494848: CDATA is pointless.
Second, & and < should not be altered unless they are not part of "markup delimiters, or within a comment." For example, if you come across <b>, the opening < should not be altered. Also, if you come across across &, that should not be altered. Unfortunately the current code will convert the tag to <b> and the entity to &amp;. And if either is within a comment, they should be skipped entirely.
Comments
#1
Although trying to use Drupal's theme rendering system to build the XML document is a laudable goal, it is fraught with difficulties. Rather than re-inventing the wheel I propose the following strategy:
I doubt many people want to theme the XML export (but probably have too at the moment).
I've attached a patch that cleans up the XML export of VBP, it does:
#2
#3
@Steven - thanks a lot for the help. The exports are "lovely" :)
I'm testing the module for integration with other app and I'll be post more about this, for information.
Great idea, views_bonus!
#4
The double encoding we're seeing results from Views cleaning data with check_plain(). The HTML entities include the & character, which is then encoded again by _views_bonus_export_shared_preprocess().
Depending on SimpleXML for D7 (where we can count on PHP 5.x) is a good idea. But meantime we need a fix for D6.
The answer is to use htmlspecialchars() with the optional third argument to prevent double encoding. Patch attached.
#5
This time with the right arguments. But the $double_encode argument wasn't added until PHP 5.2 so we're back to where we started.
#6
This time using a simple regex to exclude HTML entities produced by check_plain().
#7
@nedjo a valiant effort, I think we need to merge your patch with mine, we only use simplexml if it us around, so we don't require it. Also your patch doesn't address the other issues that mine does.
#8
@Steven Jones:
Yes, your patch addresses several issues, all of which deserve attention. Mixing them all in a single issue though ties this bugfix to a significant refactoring of the code.
To sort this out, I've moved the SimpleXML refactoring to a new issue, #744104: Use SimpleXML for XML handling and rendering in export. If you could follow up and redo your patch there that would be great.
The question of tag handling is addressed in this issue: #369940: Spaces in field labels create invalid XML export file.
The questions of root node name and handling of pluralization in rows probably also deserve their own issues. Please go ahead and open them if there aren't existing issues already.
Thanks!
Meantime I'm restoring the original title, focused on the specific bug reported in this issue.
#9
Nedjo's patch looks great - the only thing we might want to do, from the point of view of security (check_plain being better tested that the regex) would be to decode any encoded portions the string first (using just the characters acted on by htmlspecialchars), and then let check_plain reencode everything.
There is a handy tmlspecialchars_decode function that we could use, except that it is PHP 5.1+ only, so for now we could use something like this:
<?php// This can be replaced with tmlspecialchars_decode() in future versions supporting PHP 5.1+.
strtr($text, array_flip(get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES)));
?>
#10
BTW, Drupal has decode_entities
You are right, Owen. The patch above didn't do the trick by itself, but using your idea did:
$vars['themed_rows'][$num][$field] = htmlspecialchars(htmlspecialchars_decode($content), ENT_NOQUOTES, TRUE);I know that I have PHP 5.2+, so I didn't worry about that function.
#11
I'm generally against decoding the content because its going to end up with a follow up "XML export decodes X".
I'm seriously considering committing Nedjo's patch btw. Its pretty straight forward and just avoids escaping anything that has already been escaped by check_plain/htmlspecialchars.
#12
I applied Nedjo's patch from #5 and it did not resolve the problem until I added the decode (i.e., "avoids escaping anything that has already been escaped" is not what happens in real life). I agree that this is not the ideal solution. It would be far better if Views didn't do its encoding to start with, but that would require a more far-reaching patch.
#13
@NancyDru Not sure what you mean by "real life." Its a regex so it either works or its got a bug that needs to be fixed. Also, views only does the encoding you tell it to. It sounds like you're having some other escaping issues other than is the point of this bug. Can you provide something that I can do to break the regex?
That said, I've put a ton of content through it on my local machine through several different field types and I've been getting exactly what I expected in the XML output. This seems like a reasonable fix for the time being so I'm going to commit it for the time being.
#14
I had a simple blog node with a title of "Marketing & Strategy." The XML string had
Marketing &amp; Strategybefore I added the decode.#15
Running the current dev with nedjo's patch:
http://views.nerdpalace.net/export_example/xml?attach=page_1
http://views.nerdpalace.net/node/3
The & is double escaped like it should be and the & is only escaped once. This is what I see everywhere I've tried. I still think there's something odd going on on you system. Can you investigate further?
#16
Here's a patch that combines the patches from #5 and #6.
Uses the $double_encode argument to htmlspecialchars when available and, when not, defaults to the regex workaround.
For sites with PHP 5.2.3 and above, this should prevent all double encoding. Another potential advantage is that this facilitates a D7 upgrade.
#17
Seems reasonable. Will give it a shot this weekend.
#18
Unfortunately there seems to be a problem with using htmlspecialchars for this in my test. I added ©right; which is a valid HTML entity but not a valid XML entity. Because its a valid HTML entity htmlspecialchars doesn't escape it again and we get an invalid entity. This threw an error in chromium, firefox, and simplexml.
#19
Is this Drupal core issue related? #882298: XML-RPC request (string) values are not safe for UTF-8 While the topic is certainly different, should filtering of invalid characters and converting to HTML-safe entities happen at the same time?
#20
+++ export/views_bonus_export.theme.inc 9 Apr 2010 18:04:59 -0000@@ -67,14 +67,25 @@ function template_preprocess_views_bonus
+ // The fourth argument to htmlspecialchars was added in PHP 5.2.3.
+ if (version_compare(PHP_VERSION, '5.2.3') >= 0) {
+ // Convert & < and > characters but not quotes. Special characters may have been converted ¶
+ // already to HTML entities by e.g. check_plain(), so we need the fourth argument to prevent
+ // double encoding.
+ $vars['themed_rows'][$num][$field] = htmlspecialchars($content, ENT_NOQUOTES, 'ISO-8859-1', FALSE);
+ }
Not entirely sure what exactly this code is doing (the comments rather document PHP, but not underlying reasonings), but #882438: Globally prevent double encoding in check_plain() by raising minimum PHP to 5.2.3 contains very good reasons for not using the $double_encode parameter.
+++ export/views_bonus_export.theme.inc 9 Apr 2010 18:04:59 -0000@@ -67,14 +67,25 @@ function template_preprocess_views_bonus
+ // TODO: remove this workaround when upgrading to D7.
Note that D7 requires 5.2.0 only. Of course, views_bonus_export can require 5.2.3 through its info file.
Powered by Dreditor.