| Project: | Apache Solr Search Integration |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | major |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
current environment
Drupal 6.13
Hi,
I'm using apache solr in 2 environments
Apache Solr Framework 6.x-1.0-rc2
Apache Tomcat - 5.5.27
Solr Nightly Build - solr-2009-05-27.tgz
Apache Solr Framework - 6.x-1.0-beta10
Apache Tomcat - 5.5.28
Solr Nightly Build - solr-2009-08-25.tgz
Now after upgrading Apachesolr module, Tomcat and Solr the links on the apachesolr blocks are generating incorrect links for fields which have an ampersand '&' in them.
For example when the Architect Name contains
Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO
it generates the following 2 links
[root@bfc33 tmp]# cat 1.txt
working - http://www.old.com/search/apachesolr_search/lipski?filters=sm_cck_field_...
broken - http://www.new.com/search/apachesolr_search/lipski?filters=sm_cck_field_...
it generates %2526amp instead of %26amp
thx.
yashesh bhatia
Comments
#1
i did a diff between the b10 and rc2 versions of apachesolr.index.inc
[root@bfc33 tmp]# diff apachesolr_b10/apachesolr.index.inc apachesolr_rc2/apachesolr.index.inc2c2
< // $Id: apachesolr.index.inc,v 1.1.2.4 2009/04/30 17:38:40 pwolanin Exp $
---
> // $Id: apachesolr.index.inc,v 1.1.2.6 2009/07/01 22:43:50 pwolanin Exp $
31,34d30
< return _apachesolr_strip_decode(preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text));
< }
<
< function _apachesolr_strip_decode($text) {
38c34,37
< return htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8', FALSE);
---
> $text = htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
> // We must strip low bytes second in case there was an encoded
> // low-byte character.
> return apachesolr_strip_ctl_chars($text);
91c90
< $document->body = _apachesolr_strip_decode($text);
---
> $document->body = apachesolr_clean_text($text);
[root@bfc33 tmp]#
and noticed the htmlspecialchars in b10 is using double_encode=FALSE, but that is not the case in rc2. would this be causing the double encoding attempt for & to & ?
php version is 5.2.9
thx.
yashesh bhatia
#2
hi, i changed
$text = htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8')to
$text = htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8', FALSE)deleted the index and rebuilt it but am getting the same problem. back to debugging.
thx
yashesh bhatia
#3
I think & has been an issue in all versions -I'd be surprised if you ever saw it working.
#4
u can try it at
http://publicart.callforentry.org/
type 'lipski' in the search box
it gives u - http://publicart.callforentry.org/search/apachesolr_search/lipski
click on the 3rd link in the 3rd block.
"Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO (1)"
and it goes to
http://publicart.callforentry.org/search/apachesolr_search/lipski?filter...
works fine for 6.x-1.0-beta10
#5
digging a bit more into the file apachesolr.module and specifically the functions apachesolr_facet_block and apachesolr_l it seems the problem is in the apachesolr_l. here's the breakpoints output from the function
----------------------------------------------------------------------------------------------------------------------------------------------------------------
WORKING CODE - 6.x-1.0-beta10
Sep 15 11:39:04 [info] entering function apachesolr_facet_block() in apachesolr.module
Sep 15 11:39:04 [info] $facet_field = sm_cck_field_work_agent_name
Sep 15 11:39:04 [info] Breakpoint 10
Sep 15 11:39:04 [info] entering function apachesolr_l() in apachesolr.module
Sep 15 11:39:04 [info] $text = Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO (1)
Sep 15 11:39:04 [info] $path = search/apachesolr_search/lipski
Sep 15 11:39:04 [info] $options = Array
(
[query] => filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20Martin%20%26amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22
)
Sep 15 11:39:04 [info] $retval = <a href="/search/apachesolr_search/lipski?filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20Martin%20%26amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22">Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin &amp; Martin, Engineers, Lakewood, CO (1)</a>
----------------------------------------------------------------------------------------------------------------------------------------------------------------
NOT WORKING CODE - 6.x-1.0-rc2
Sep 15 17:11:57 [info] $facet_field = sm_cck_field_work_agent_name
Sep 15 17:11:57 [info] Breakpoint 10
Sep 15 17:11:57 [info] entering function apachesolr_l() in apachesolr.module
Sep 15 17:11:57 [info] $text = Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO (1)
Sep 15 17:11:57 [info] $path = search/apachesolr_search/lipski
Sep 15 17:11:57 [info] $options = Array
(
[query] => Array
(
[filters] => sm_cck_field_work_agent_name:"Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO"
)
[attributes] => Array
(
[class] => apachesolr-facet
)
)
Sep 15 17:11:57 [info] $retval = <a href="/search/apachesolr_search/lipski?filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20Martin%20%2526amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22" class="apachesolr-facet">Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin &amp; Martin, Engineers, Lakewood, CO (1)</a>
----------------------------------------------------------------------------------------------------------------------------------------------------------------
The above debugging statements are the parameters and the return value from the function apachesolr_l in apachesolr.module
i'll dig more deeper, it's not solved yet but wanted to share the debugging information.
yashesh bhatia
#6
sorry i forgot to mention the drupal versions clearly.
Working code - Drupal 6.10, apachesolr 6.x-1.0-beta10
Broken code - Drupal 6.13, apachesolr 6.x-1.0-rc2
i'll debug the common.inc for the function url() for drupal 6.10 and 6.13
yashesh
#7
some more debugging. it's breaking down in includes/common.inc in function url().
at the place
if (is_array($options['query'])) {
$options['query'] = drupal_query_string_encode($options['query']);
}
------------------------------------------------
working code
Sep 15 13:24:51 [info] entering function url() in common.inc
Sep 15 13:24:51 [info] $path = search/apachesolr_search/lipski
Sep 15 13:24:51 [info] $options = Array
(
[query] => filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20\
Martin%20%26amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22
[attributes] => Array
(
)
[html] =>
)
------------------------------------------------
not working
Sep 15 18:29:53 [info] entering function url() in common.inc
Sep 15 18:29:53 [info] $path = search/apachesolr_search/lipski
Sep 15 18:29:53 [info] $options = Array
(
[query] => Array
(
[filters] => sm_cck_field_work_agent_name:"Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO"
)
[attributes] => Array
(
[class] => apachesolr-facet
)
[html] =>
)
------------------------------------------------
now need to find out why the two $options are different or is it a bug with the function drupal_query_string_encode
#8
some more debugging
it seems the way the links are generated in the function apachesolr_facet_block is causing the problem
------------------------------------------------
working code - Beta 10 - 6.x-1.0-beta10
line 658
else {
$new_query->add_filter($facet_field, $facet, $exclude);
$path = $new_query->get_path();
$querystring = $new_query->get_url_querystring();
}
if ($count || $active) {
$items[$sortpre . '*' . $facet_text] = theme('apachesolr_facet_item', $facet_text, $count, $path, $querystring, $active, $unclick_link, $response->response->numFound, $options);
}
------------------------------------------------
broken code - RC2 - 6.x-1.0-rc2
line 694
else {
$new_query->add_filter($facet_field, $facet, $exclude);
$options['query'] = $new_query->get_url_queryvalues();
$link = theme('apachesolr_facet_link', $facet_text, $new_query->get_path(), $options, $count, $active, $response->response->numFound);
}
if ($count || $active) {
$items[$sortpre . '*' . $facet_text] = $link;
}
------------------------------------------------
the way the links are generated is different for the theme functions apachesolr_facet_link and apachesolr_facet_item
#9
This all comes down to the following line:
<?phpfunction apachesolr_l($text, $path, $options = array()) {
...
return '<a href="'. check_url(url($path, $options)) .'"'. drupal_attributes($options['attributes']) .'>'. ($options['html'] ? $text : check_plain($text)) .'</a>';
?>
The check_plain($text) does it. To fix I'm sending $options['html'] = TRUE; into the apachesolr_l call.
#10
#11
hi. i applied the patch. but it's not fixing the url yet. here's the debugging information in the function apachesolr_l
Sep 16 16:56:44 [info] entering file /work/projects/panmvng/devel/panmvng/trunk/index.php
Sep 16 16:56:44 [info] entering function apachesolr_l() in apachesolr.module
Sep 16 16:56:44 [info] $text = Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO (1)
Sep 16 16:56:44 [info] $path = search/apachesolr_search/lipski
Sep 16 16:56:44 [info] $options = Array
(
[query] => Array
(
[filters] => sm_cck_field_work_agent_name:"Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO"
)
[attributes] => Array
(
[class] => apachesolr-facet
)
[html] => 1
)
// $options after the array merge.
Sep 16 16:56:44 [info] $options = Array
(
[query] => Array
(
[filters] => sm_cck_field_work_agent_name:"Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO"
)
[attributes] => Array
(
[class] => apachesolr-facet
)
[html] => 1
)
// $genurl = check_url(url($path, $options));
Sep 16 16:56:44 [info] $genurl = /search/apachesolr_search/lipski?filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20Martin%20%2526amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22
// $retval = '<a href="'. check_url(url($path, $options)) .'"'. drupal_attributes($options['attributes']) .'>'. ($options['html'] ? $text : check_plain($text)) .'</a>';
Sep 16 16:56:44 [info] $retval = <a href="/search/apachesolr_search/lipski?filters=sm_cck_field_work_agent_name%3A%22Brooke%20Michl-Smith%2C%20Senior%20Architectural%20Designer%2C%20Karlsberger%2C%20Architects%2C%20Columbus%2C%20Ohio%20Martin%20%2526amp%3B%20Martin%2C%20Engineers%2C%20Lakewood%2C%20CO%22" class="apachesolr-facet">Brooke Michl-Smith, Senior Architectural Designer, Karlsberger, Architects, Columbus, Ohio Martin & Martin, Engineers, Lakewood, CO (1)</a>
Sep 16 16:56:44 [info] leaving function apachesolr_l() in apachesolr.module
here's the theme function theme_apachesolr_facet_link after the patch was applied
function theme_apachesolr_facet_link($facet_text, $path, $options = array(), $count, $active = FALSE, $num_found = NULL) {$options['attributes']['class'][] = 'apachesolr-facet';
if ($active) {
$options['attributes']['class'][] = 'active';
}
$options['attributes']['class'] = implode(' ', $options['attributes']['class']);
$options['html'] = TRUE; // added by patch
return apachesolr_l($facet_text ." ($count)", $path, $options);
}
thx
yashesh bhatia
#12
#13
Setting correct status
#14
Needs confirmation that this is still a bug when using the latest code.
#15
I think this is still an issue with the latest 6.x-2.x-dev code, but I haven't tried with 6.x-1.x-dev.
I updated to the 6.x-2.x-dev 2010-Jul-26 and still get ampersands displayed as
&in the current search and filter blocks.Forcing $options['html'] = TRUE; does seem to fix it, but I'm not sure exactly where this is best done.
In the apachesolr.module there's the theme_apachesolr_unclick_link funtion.
function theme_apachesolr_unclick_link($facet_text, $path, $options = array()) {apachesolr_js();
if (empty($options['html'])) {
$facet_text = check_plain($facet_text);
}
else {
// Don't pass this option as TRUE into apachesolr_l().
unset($options['html']);
}
$options['attributes']['class'] = 'apachesolr-unclick';
return apachesolr_l("(-)", $path, $options) . ' '. $facet_text;
}
This seems to unset the $options['html'] value - presumably because the apachesolr_l function adds $options['html'] = FALSE;
So adding $options['html'] = TRUE to every call to theme_apachesolr_unclick_link would simply prevent it from running the link text through check_plain().
But in Robert's solution to the initial problem above in #9 we do pass $options['html'] = TRUE; to apachesolr_l().. contrary to what is suggested in
theme_apachesolr_unclick_link().
So I'm unclear as to which way is best.
I think the two theme functions theme_apachesolr_unclick_link() and theme_apachesolr_facet_link() should probably be consistent in how they deal with $options['html'] = TRUE;
So should they:
1) Force $options['html'] = TRUE; in both functions and pass this to apachesolr_l()
2) Add the check for $options['html'] to both functions, and only check_plain() if html is FALSE or not set (as in theme_apachesolr_unclick_link).
Currently I have simply passed $options['html'] = TRUE; to all calls to theme_apachesolr_unclick_link and it seems to work for me.
Any suggestions or am I missing something obvious?
#16
forcing html to true in all cases may subject your site to a serious security vulnerability.
#17
In Robert's patch in #9 above he adds $options['html'] = TRUE; for everything passing through theme_apachesolr_facet_link() - does this introduce a security vulnerability?
#18
I am still having this issue with the double encoding of ampersands using 6.x-2.0-beta3 and am looking for a solution that doesn't force $options['html'] = TRUE to prevent check_plain() on the output, as I would like to avoid the potential security vulnerabilities mentioned above.
I am not 100% sure how many times check_plain() or htmlspecialchars() are being called on the values in question, but it must be more than once.
Anyway, my current solution is as follows:
In the apachesolr.module - the function apachesolr_l() calls check_plain() when returning the link. I add $text = html_entity_decode($text); just before this to decode any encoded ampersands (or other html entities... )
<?php
// decode html entities before sendiong through check_plain
$text = html_entity_decode($text); // added this line
return '<a href="'. check_url(url($path, $options)) .'"'. drupal_attributes($options['attributes']) .'>'. ($options['html'] ? $text : check_plain($text)) .'</a>'
?>
Then, in theme_apachesolr_unclick_link() I added a similar line:
<?phpfunction theme_apachesolr_unclick_link($facet_text, $path, $options = array()) {
apachesolr_js();
if (empty($options['html'])) {
$facet_text = html_entity_decode($facet_text); // added this line
$facet_text = check_plain($facet_text);
}
else {
// Don't pass this option as TRUE into apachesolr_l().
unset($options['html']);
}
$options['attributes']['class'] = 'apachesolr-unclick';
return apachesolr_l("(-)", $path, $options) . ' '. $facet_text;
}
?>
Now all my ampersands are ampersands again, which is nice.
I am not sure if this is the right way to go and would welcome any pointers or suggestions.
Patch attached (patch created against 6.x-2.0-beta3).
#19
The problem still exists in the current 6.x-2.x-dev release. I tried the patch in #18 but that didn't fix the issue.
For example: "walk&talk" becomes "walk%2526talk" in the browser's address bar and "walk%26talk" in the search bar.
#20
#21
@wim - so it's going from the path to the search box that's the only problem?
NO ONE HAS GIVEN ME STEPS TO REPRODUCE! Marking postponed until someone posts steps to reproduce.
#22
I find it easy to reproduce: In the search field I enter "walk&talk" and click the search button. I get redirected to the search results, in which the search field now contains "walk%26talk", and no results are found for this search term.
#23
Ok, so that test seems to work correctly in 6.x-1.x, so let's compare the code.
#24
(Can we put the status back to active to attract more eyes on this issue?)
#25
#26
Note that in the same way a "+" also changes into "%2B". I think these are the only characters that cause this kind of behavior.
#27
Please note that all solutions so far focussed on apachesolr_l and theme_apachesolr_unclick_link. But this doesn't change the behavior when entering "this&that" in the search field and performing the search.
#28
It would be advisable for everyone partaking in this issue to also state if they're using Apache or Nginx, and if they're using Tomcat or not (and if so, what the server.xml looks like). It could very well be that the problem is not just apachesolr, or at least not for everyone.
#29
Please ignore the problems I reported since #19. The behavior I was experiencing were specific for Nginx, I don't experience them on Apache. Fixing these issues should go in another issue (which I will create and post a patch for later this week).
The one remaining problem in this issue is that the links in a filter block display
&as&. To reproduce:I tested the patch in #18 which fixes the problem completely.
#30
tested the patch and it works for me. Problem solved.
#31
We want to be use the same solution in 6.x-1.x and 6.x-2.x. Please compare the code and tell me what's different and make a patch to fix it in that way.
Is that what the patch in #18 does? The poster doesn't say.
#32
That is exactly what the patch in #18 does: It adds the necessary decoding so that an ampersand in $text shows up as
&and not as&.Here are the patches for 5.x-2.x, 6.x-1.x and 6.x-2.x.
#33
Would I be correct in assuming that pw's comment in #31 is moot then, since this issue appears in 6-1.x as well as 6-2.x?
If that is the case, I'm willing to endorse the addition of this to the 2.x branch
#34
Exactly, and that's why the patches in #32 are for. Thanks!
#35
This has gone on long enough. Fixed in 5.2 (forgot to create patch), 6.1, 6.2 and 7.
#36
Sorry I didn't review in more detail before, but seems like we ought to decode more selectively, e.g. using htmlspecialchars_decode()?
#37
Won't check_plain re-encode the decoded entities? I don't want to see future issues about double encoded open-angle bracket (&lt;).
#38
Yes, check_plain will re-encode them - that's the point. We want the text to be safe, but not double encoded.
#39
I defer to you. My understanding is that check_plain(html_entity_decode($text)) is safe.
#40
yes, it should be safe, just wondering if we are overly decoding. Anyhow, let's leave it for now.
#41
Automatically closed -- issue fixed for 2 weeks with no activity.