Posted by jgraham on July 22, 2009 at 11:43pm
| Project: | Apache Solr Search Integration |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Two issues:
1. If you have content with HTML entities, you can search via the entity code but not the actual string.
2. On the search results page, HTML entities are displayed in their encoded form rather than as the characters they represent.
Explanation:
1. Assuming you have content containing &gt; (or any other HTML entity):
Searching for >, which is what is displayed to the user, yields no results. However, searching for &gt; does yield results.
2. More importantly, on the search results page you will see the entities as though they were escaped. For example, &gt; displays as > in Drupal content, but Solr search results display it as &gt; rather than the expected >.
Comments
#1
I got the same problem. I have customized polls results using FusionCharts (showing poll results on a flash chart) and the object HTML code is shown on search results.
Like this: "= new FusionCharts("/sites/all/themes/my_theme/flash/poll.swf", "FusionCharts_1", "680", "340", "0", ...".
Is there a way to filter HTML / JavaScript code?
#2
It's supposed to be filtered.
Kontrol_x, are you also using 6.x-1.0-beta11 like the issue version?
#3
This is still a problem in recent versions of 6.x-1.x-dev. I've attached a screenshot and the Solr response that references this particular node. In many cases, the entities are double encoded (&amp;amp;) and in some cases they're triple encoded (&amp;amp;quot;). Should apachesolr be handling these entities better, and if so, at which layer (i.e. indexing, preprocessing of search results, presentation of search results)?
#4
For what it's worth, this is also happening on drupal.org. Screenshot attached.
#5
We make text safe before we index it, in apachesolr.index.inc:
/**
 * Strip html tags and also control characters that cause Jetty/Solr to fail.
 */
function apachesolr_clean_text($text) {
  // Add spaces before stripping tags to avoid running words together.
  $text = filter_xss(str_replace(array('<', '>'), array(' <', '> '), $text), array());
  // Decode entities and then make safe any < or > characters.
  $text = htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
  // We must strip low bytes second in case there was an encoded
  // low-byte character.
  return apachesolr_strip_ctl_chars($text);
}
So we should make the same transformation to search terms, and make sure not to encode the result again.
Some of the double encoding may come from: http://api.drupal.org/api/function/template_preprocess_search_result/6
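A minimal sketch of what "the same transformation to search terms" could look like. The helper name example_normalize_keys() is hypothetical and not part of the module, and the Drupal-specific steps (filter_xss(), control character stripping) are left out so the snippet runs on its own:

```php
<?php
// Hypothetical helper (not in apachesolr): normalize search keys the same
// way the indexed text is normalized, so both sides compare equal.
function example_normalize_keys($keys) {
  // Decode any entities, then re-escape only &, < and >.
  // ENT_NOQUOTES leaves both quote styles untouched in each direction.
  return htmlspecialchars(html_entity_decode($keys, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
}

// Text as it might be stored in a node body, and what a user would type.
$indexed = example_normalize_keys('a &gt; b');
$typed   = example_normalize_keys('a > b');

// Both end up in the same escaped form.
var_dump($indexed === $typed);
```

The result must then be printed as-is; running it through htmlspecialchars() or check_plain() again would reintroduce the double encoding.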
#6
Also, PHP 5.2.3 or later is required to turn off double encoding: http://us3.php.net/manual/en/function.htmlspecialchars.php
So it's not something we can rely on.
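For illustration only, here is roughly how the double_encode argument could be used with a fallback for older PHP. The function name example_escape_once() is made up for this sketch, and the two branches are not strictly equivalent (the fallback decodes every entity first, including ones like &nbsp;):

```php
<?php
// Hypothetical helper: escape text without re-encoding existing entities.
function example_escape_once($text) {
  if (version_compare(PHP_VERSION, '5.2.3', '>=')) {
    // Fourth argument is double_encode; FALSE leaves "&amp;" as it is.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8', FALSE);
  }
  // Older PHP: decode first so the single encode pass cannot double up.
  return htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
}

echo example_escape_once('Tom &amp; Jerry <b>'), "\n";
```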
#7
Note: the dev version of the module pushes all the encoding into the PHP client.
#8
ok, well we must (I think) index the entities to avoid breaking the xml documents
The incorrect display is due to: http://api.drupal.org/api/function/template_preprocess_search_result/6
#9
Please try this patch.
#10
also opened upstream bug http://code.google.com/p/solr-php-client/issues/detail?id=30
#11
Patch above is incorrect - see discussion on upstream issue.
This patch should be more correct, but still safe. Requires reindexing.
#12
I know I'm failing to grasp something, but why check_plain? Why not check_markup?
#13
check_markup already ran (we assume) when the node body was rendered for indexing.
The situation here is that we now decode entities before we put that in the search index - that means I can inject XSS via encoded entities like &lt;script&gt; (I tried it - it works absent the check_plain() in the patch).
check_plain just calls htmlspecialchars() and does utf-8 validation. We could just call htmlspecialchars() directly I think, but for consistency doing it this way since the preprocess function does check_plain() for the title.
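A small standalone sketch of the risk described here, using plain htmlspecialchars() in place of check_plain() so it runs outside Drupal. The sample string is made up:

```php
<?php
// Text that is harmless while its entities are still encoded.
$indexed = '&lt;script&gt;alert(1)&lt;/script&gt;';

// Decoding turns it back into live markup.
$decoded = html_entity_decode($indexed, ENT_QUOTES, 'UTF-8');

// Escaping again before output is what makes the decoded text safe to print.
$safe = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
echo $safe, "\n";
```

Printing $decoded directly into a results page would execute the script; printing $safe shows the literal text instead.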
#14
Confirmed this patch solves the entity issue when applied to the 6.x-1.x-Dev branch.
Search Results
http://skitch.com/kennysilanskas/nmqw7/results-w-ampersand
Facet Display
http://skitch.com/kennysilanskas/nmqib/facets
Nice work, folks.
#15
With this patch these entities no longer interfere with search, but I'm not sure they are actually being matched.
#16
I'm also concerned about custom code out in the wild - or custom theming. I wonder if we need to make safe all the $doc properties except title?
#17
This tries to make all relevant keys safe - wish we had some better way...
#18
+++ apachesolr_search.module 10 Dec 2009 18:14:09 -0000
@@ -408,10 +408,15 @@ function apachesolr_process_response($re
+ $snippet = strtr(check_plain($snippet), array('&lt;strong&gt;' => '<strong>', '&lt;/strong&gt;' => '</strong>'));
if (!isset($doc->body)) {
This particular part smells like a hack to me...
Are we thinking too much for users here? What if I want to index and return HTML, perhaps we should let them. Seems to me that I should be able to. It's not like we stop people from putting <script> into their database tables. We do however provide the option to filter it out when displaying. Perhaps we should do the same here?
Why does having html in the xml doc sent for indexing cause a problem? we have [!CDATA elements, so I think it should be kosher... I guess 90% of people would want just the text and not the elements anyway, so perhaps not an issue.
Anyway, doing this is better than the current problem, so I'm in favor of a commit for that reason anyway, but we should reconsider the approach IMO.
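To make the approach under discussion concrete, here is a standalone sketch of "escape everything, then put the highlight tags back". example_safe_snippet() is a hypothetical name, and htmlspecialchars() stands in for check_plain():

```php
<?php
// Hypothetical helper: escape a snippet but keep Solr's highlight tags.
function example_safe_snippet($snippet) {
  // Escape all markup, including the <strong> tags added by highlighting.
  $escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');
  // Restore only the two known-safe tags.
  return strtr($escaped, array(
    '&lt;strong&gt;' => '<strong>',
    '&lt;/strong&gt;' => '</strong>',
  ));
}

echo example_safe_snippet('<strong>Region</strong> <script>x</script>'), "\n";
```

The weakness is visible in the sketch: any literal <strong> in the indexed content is restored as well, which is harmless for this tag but shows why it reads as a hack.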
Best,
Jacob
#19
Ugh, yeah. That's a nasty hack. I don't have a better idea at the moment, though, so +0.5 in favor of committing.
#20
With a note in the README and a changed function name.
This breaks any custom code using this function, on purpose, to make developers aware of the security implications.
#21
committed after discussion w/ Robert
#22
For comparison, here is the patch without the API change. Note that because we encode the search keys, there are some cases where searches will fail when they should match.
Also breaks any code trying to use something other than dismax to handle the request since lucene syntax may use &&
#23
rolled back the prior patch and committed #22 as less likely to cause problems for existing users
#24
Has been committed to 6.2.
#25
Fixed in 5.x-2.x-dev... Not so sure if it is OK for D5.
Patch attached.
#26
Automatically closed -- issue fixed for 2 weeks with no activity.
#27
In 6.x-2.x-dev quotes in snippets are shown as &quot;. Is it possible to fix this?
Also, searching for the single character & produces a URL like search/apachesolr_search/%2526, and the query box on the result page contains %26 (and the result is irrelevant).
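The %2526 in that URL is what you get when an already URL-encoded key is encoded a second time. A quick standalone check (where in the module the second pass happens is not established here):

```php
<?php
$key   = '&';
$once  = rawurlencode($key);   // first pass
$twice = rawurlencode($once);  // second pass encodes the "%" itself

echo $once, "\n";
echo $twice, "\n";

// A single decode pass only undoes one layer, leaving the encoded form
// where the original character was expected.
echo rawurldecode($twice), "\n";
```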
#28
#29
HTML entities are showing up in search results in 6.2 and 6.1 branches.
#30
Is the behavior different in 6.x-1.x vs 6.x-2.x? Are you running the latest version?
More details and steps to reproduce are needed before we can make any progress.
#31
Hi,
Just to add some detail to the issue I can post some sample data which will hopefully help with figuring out this issue.
From our database:
@James -
Outside of MRSS XML files, there's no supported way to do this directly in the embed. You may be able to load your playlist in dynamically using Javascript:
<code>var playlist = new Array (
{
'title': 'First Video',
'levels': new Array(
{'file': 'video1_500.flv', 'bitrate': '500', 'width': '420'},
{'file': 'video1_1000.flv', 'bitrate': '1000', 'width': '720'},
{'file': 'video1_2000.flv', 'bitrate': '2000', 'width': '1080'}
)
},
{
'title': 'Second Video',
'levels': new Array(
{'file': 'video2_500.flv', 'bitrate': '500', 'width': '420'},
{'file': 'video2_1000.flv', 'bitrate': '1000', 'width': '720'},
{'file': 'video2_2000.flv', 'bitrate': '2000', 'width': '1080'}
)
}
);
var player = document.getElementById('player');
player.sendEvent('LOAD', playlist);
Note that for each playlist item, instead of setting the 'file' property, you set the 'levels' property, and set that to an array of objects containing "file", "bitrate", and "width" properties.
For now, this is still experimental / unsupported, but give it a try and let us know what you think.
The search snippet:
could put the url and bitrate directly in the embed or javascript that would make it much easier ... this directly in the embed. You may be able to load your playlist in dynamically using Javascript: var playlist = new Array ( { 'title': 'First Video', & ...
In this case we would not want the <code> tag to show up since this is affecting how Google ranks keywords on our site. If you view the node itself the <code> tag is not being escaped.
My gut feeling is the snippet needs to have extra processing done to it in apachesolr_process_response().
Note: We're using 1.0RC5 of this module.
Thanks.
#32
A number of escaping issues were fixed in the 1.0. Please try updating to 1.0 and see if this persists.
#33
Hi pwolanin,
As far as I can tell these are the fixes for 1.0:
Apache Solr Search Integration 6.x-1.0, 2010-03-03
------------------------------
#686390 by justindodge, pwolanin don't overwite existing class for facets.
#623046 by robertDouglass, pwolanin make the results array more useful.
#666936 by justindodge, pwolanin make Drupal.behaviors.apachesolr respect context.
#672882 by David Lesieur, Fix Show more links on taxonomy facets.
#679522 by pwolanin, Add gettableFiles to solr admin interface config.
I don't see any reference to escaping fixes. Were the additional changes to escaping not included as part of the changelog or am I looking at the wrong set of changes?
Thanks.
#34
The changelog is not exactly reflective of every fix made. Please try updating to 1.0.
#35
Actually the fix I'm thinking of went into RC4:
Apache Solr Search Integration 6.x-1.0-RC4, 2009-12-17
------------------------------
#528086 by pwolanin, better (but still problematic) handling of entities.
#36
Are you using the Drupal codefilter, or just code tags?
#37
We are using the codefilter module (http://drupal.org/project/codefilter).
#38
Hi, I am also experiencing a variant of this issue. In particular, &quot; entities in my HTML are both searchable and visible in search results. Please see http://markleybros.com/search/apachesolr_search/quot
Obviously, entities like &quot; should not appear in search results as &quot;, they should appear as "
Also, other character entities (such as &hellip;, &rsquo;, etc.) are all searchable, even though they are not appearing in search results. Please see http://markleybros.com/search/apachesolr_search/hellip
Aside from the results display issue, I think it's an issue for character entities to be indexed as keywords. They should definitely be resolved down to their UTF-8 counterparts before being indexed. Otherwise words like naïveté (stored in the database as "na&iuml;vet&eacute;") will be indexed unreliably or not at all.
I am running the latest Drupal 6, the latest Apache SOLR 1.4.1, and the latest version of this apachesolr module.
I notice that this issue is marked as "maintainer needs more information". Please let me know what information is still needed so I can supply it.
#39
There are a few possibilities - including that this issue is specific to your theme.
Can you reproduce this in garland theme?
Can you attach a file containing text that when indexed exhibits the problem?
#40
I was able to fix this by implementing theme_apachesolr_search_snippets(). I tried doing it in hook_apachesolr_search_result_alter() by getting and setting the htmlspecialchars_decode-ed body field. But, for some reason, when setting the decoded body field content html special chars returned to their encoded state. If this is an acceptable fix, then it seems it would be simple enough to roll the change into the default theme implementation.
<?php
function MYTHEME_apachesolr_search_snippets(&$doc, &$snippets) {
  $result = '';
  // Decode the body and teaser snippets, which arrive with entities escaped.
  if (isset($snippets['body'])) {
    $result .= htmlspecialchars_decode($snippets['body']);
    unset($snippets['body']);
  }
  if (isset($snippets['teaser'])) {
    $result .= (strlen($result) > 0) ? ' ... ' : '';
    $result .= htmlspecialchars_decode($snippets['teaser']);
    unset($snippets['teaser']);
  }
  $result .= (strlen($result) > 0) ? ' ... ' : '';
  // Any remaining snippets are appended unchanged.
  return $result . implode(' ... ', $snippets) . ' ...';
}
?>
#41
@malex you are running 6.x-2.x or 6.x-1.x?
Until someone posts the SOURCE text (i.e. what you see when editing the node, preferably as an attachment) that's indexed and exhibits the issue, with full info on the input format, exact apachesolr module version info, and with confirmation as to whether the problem appears in Garland I cannot possibly debug it and this will continue to be postponed.
#42
I haven't confirmed this on a clean install yet but it looks like this happens when you provide escaped HTML in your bodies. This is particularly common when you have a WYSIWYG installed. apachesolr "correctly" encodes html going out to solr so as to not provide bad XML and this is when &quot; becomes &amp;quot;. When we get it back, we obviously don't want to decode the string coming back (safe strings) so we're stuck with this double encoded string.
Thanks to csevb10 for looking through this as well and figuring out some of the details.
Edit: wrap entities in code tags so they're readable.
#43
Ok, confirmed it on a clean install today
Apachesolr module:
version = "6.x-1.2"
Node body:
"Lorem ipsizzle dolizzle sit dawg, go to hizzle adipiscing yippiyo. Go to hizzle da bomb own yo', volutpizzle, suscipizzle own yo', gravida vizzle, crunk. I'm in the shizzle ma nizzle tortizzle. Sed erizzle. Izzle at dolor dapibizzle turpis tempus bizzle. Crunk pellentesque nibh izzle turpizzle. We gonna chung izzle tortor. Mammasay mammasa mamma oo sa eleifend rhoncizzle rizzle. In hac its fo rizzle platea dictumst. Bizzle dapibizzle. Curabitizzle dawg fo shizzle my nizzle, pretizzle fo shizzle my nizzle, shut the shizzle up boom shackalack, eleifend vitae, nunc. Dang suscipit. Integizzle sempizzle fo shizzle sizzle purus."
Solr result:
"Lorem ipsizzle dolizzle sit dawg, go to hizzle adipiscing yippiyo. Go to hizzle da bomb own yo', volutpizzle, suscipizzle own yo', gravida vizzle, crunk. I'm in the shizzle ma nizzle tortizzle. Sed erizzle. Izzle at dolor dapibizzle turpis tempus bizzle. Crunk pellentesque nibh izzle ...
Definitely repeatable in Garland.
I can also give you access to the amazon instance I used in testing this if it would help.
#44
Oh other requested information. I was using the default core Filtered HTML filter.
#45
I can confirm this issue. And as with #42, we really don't want to filter out the returned HTML.
#46
We should be decoding all entities before re-escaping some of them at index time. Is that code broken?
apachesolr.index.inc:16: return htmlspecialchars(html_entity_decode($text, ENT_NOQUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
#47
I guess so. I wrote a quick test script that called apachesolr_clean_text with a quote entity, and I get the double-escaped quote entity out.
Looks to me like html_entity_decode doesn't decode the quote entity because of ENT_NOQUOTES, and then htmlspecialchars encodes the & in the entity because that's what it does. We aren't setting the PHP 5.2.3 double_encode parameter, so that's doing exactly what it's told, but I'm guessing it's not what we want.
What is the reason it's not decoding quote entities?
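The behaviour described in this comment can be reproduced without Drupal. A minimal sketch, with a made-up input string:

```php
<?php
$text = 'say &quot;hi&quot; &amp; go';

// With ENT_NOQUOTES the quote entities are left alone; only &amp; is decoded.
$decoded = html_entity_decode($text, ENT_NOQUOTES, 'UTF-8');

// htmlspecialchars() then escapes every "&", including the one that
// starts each surviving "&quot;".
$encoded = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

echo $decoded, "\n";
echo $encoded, "\n";
```

So the quote entity comes out double encoded even though the plain ampersand round-trips correctly.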
#48
I was trying to make the 2 calls symmetric, but perhaps that's not right?
#49
I *think* the fix might be as simple as changing line 25 in apachesolr.index.inc to:
return htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_NOQUOTES, 'UTF-8');
#50
or perhaps it should be:
htmlspecialchars(html_entity_decode($text, ENT_QUOTES, 'UTF-8'), ENT_QUOTES, 'UTF-8');
which is symmetric, but should decode quotes initially
see: http://api.drupal.org/api/drupal/includes--bootstrap.inc/function/check_...
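For comparison, a standalone sketch of how the two candidate expressions treat quote entities. The input string is made up, and neither variant is claimed here to be the one finally committed:

```php
<?php
$text = 'a &quot;b&quot; it&#039;s';

// Both candidates decode with ENT_QUOTES, so quote entities become characters.
$decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

// Candidate from #49: re-encode without touching quotes.
$asymmetric = htmlspecialchars($decoded, ENT_NOQUOTES, 'UTF-8');

// Candidate from #50: re-encode both quote styles.
$symmetric = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo $asymmetric, "\n";
echo $symmetric, "\n";
```

The asymmetric form stores literal quotes; the symmetric form stores single-encoded quote entities. Either way the double encoding from the ENT_NOQUOTES decode is gone.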
#51
so this seems to be close to a full fix in combination with the change in #50 for the indexing code. You have to re-index again to make it effective.
So, the fact that you can even have a protected words file was missing from the Solr wiki (I just updated it), but is included in the LucidWorks Solr reference PDF, and indeed seems to work - here's a highlighted snippet example:
search: http://localhost:8983/solr/ad/select/?q=%26amp%3B
<lst name="deeb44902dd8/node/70721"><arr name="body">
<str>
Region <strong>&amp;</strong> Visibility settings - For more information on blocks, theme regions and block visibility
</str>
</arr>
</lst>
So you see we avoid the problem of getting the highlighting as
&<strong>amp</strong>;
The relevant change to the drupal-1.1 schema.xml is attached. The contents of delim-protwords.txt is simply:
&amp;
&lt;
&gt;
&#039;
&quot;
#52
For simplicity, we can use just the one protected words file for both stemming and word delimiter, perhaps?
#53
Here's full patch for 7.x-1.x
#54
oops patch in #52 changes the schema in the wrong place. Here's a better D6 patch.
#55
committed to 6.x-1.x and 7.x-1.x
#56
What about 6.x-2.x version?
#57
#58
oops - the changes to 7.x-1.x and 6.x-1.x broke phrase searches on dismax as described in #1085142: Using quotes returns no results
#59
I think we have to double-pass since htmlspecialchars() has no option for doing single not double quotes.
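One reading of "double-pass", sketched as a standalone function. The name example_escape_keys() is hypothetical, and the assumption is that search keys should have single quotes escaped while double quotes stay literal so phrase queries still work:

```php
<?php
// Hypothetical helper: escape search keys in two passes.
function example_escape_keys($keys) {
  // Pass 1: escape &, < and > but leave both quote styles alone.
  $keys = htmlspecialchars($keys, ENT_NOQUOTES, 'UTF-8');
  // Pass 2: escape single quotes only. Double quotes stay literal so
  // phrase syntax reaches the query parser intact.
  return str_replace("'", '&#039;', $keys);
}

echo example_escape_keys('"it\'s" & more'), "\n";
```

The order matters: doing the str_replace() first would let htmlspecialchars() re-encode the "&" in "&#039;", which is exactly the kind of double encoding this thread is about.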
#60
@James - your patch is CNW I think since it's missing the added protwords.txt file and does something funny to the CHANGELOG.
#61
patch above has the 2 string transformations in the wrong order so caused double encoding.
committing this to 7.x
#62
#63
Looks good for 6.1 (committed)
#64
Re-roll for 6.2, with protwords.txt this time, and without CHANGELOG.txt.
#65
Looks like a clean port from the diff.
#66
#67
Great! Works in 6.2.
#68
Automatically closed -- issue fixed for 2 weeks with no activity.
#69
I have my protwords.txt file with the words above and I have already re-indexed my data, but if I search for "quots" I get results with
... received &<strong>quot</string>;Sending....
Because we have '... received "Sending...' in our searchable data.
&amp;
&lt;
&gt;
&#039;
&quot;
thanks
#70
#71
We need much more information for a thread this old
What version?
What is the reason you are using protwords.txt for these replacements?
How can we replicate (as in add this content to a node, index the node, search for this word, voila)
Thanks!
#72
Drupal pressflow 6.17.85
Apache Solr 1.4.1
drupal apachesolr module with http://drupal.org/files/issues/528086-64.patch
schema.xml - schema name="drupal-1.4" version="1.2"
This happens in any node with quotes in the body, for instance. If we have a text like:
we have very information on our "database" but some...
When we literally search for "quots" we get this result:
we have very information on our &<strong>quot</string>;database&<strong>quot</string>; but some...
We set protwords.txt with
&amp;
&lt;
&gt;
&#039;
&quot;
to avoid this behavior, but we were not successful.
Thanks!
#73
#74
#75