Download & Extend

Search result trimming should not fall inside HTML entities/tags

Project:Drupal core
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed (fixed)

Issue Summary

Search result should not be trimmed inside word, and nowise on html code. It could look awful.

example:
készítése

looks like:
készít&eac ...

Comments

#1

Version:6.2» 7.x-dev

Just assigning this to D7, since that is where this will have to be fixed.

Search results are trimming at a given character limit. They should be trimmed at a word boundary. Should be an easy fix - if nobody else takes it on, I will get to it eventually.

#2

Looks like this still needs to be fixed, and can we have a test for it too?

#3

Status:active» needs review

I believe this this will fix it, but I haven't tested. There should be a test for this.

AttachmentSizeStatusTest resultOperations
269911-3-wordsafe_trim_of_search_excerpts-d7.patch654 bytesIdlePASSED: [[SimpleTest]]: [MySQL] 17,973 pass(es).View details

#4

The patch looks like it should work in the case that the exerpt didn't find the keywords... A couple of questions:

a) Is it possible for search_exerpt to truncate incorrectly if it did find a keyword? ... I looked at the code: No. So that's OK.

b) Is putting " ..." after the excerpt the right thing to do in all languages? I doubt it. However, that's probably a separate issue.

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

#5

#6

Interesting question about word boundary. Suppose $boundary is language specific so should be defined in iso.inc

#7

Perhaps (word boundary definitions in iso.inc)... But the search excerpt function is making a search excerpt of the page for search results, which is trucated content (e.g. node title/body truncated to N characters), and the content could be in any language. Search doesn't know about languages, in its current incarnation. So having a language-aware set of boundary characters would not help until search becomes more language aware. (There are separate issues about that, but at this point it's D8 material.)

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese -- I know just enough about languages and UTF to know that I don't know much and bring up questions. Who should we ask?

#8

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

If this is indeed an issue, and it might very well be, it should be filed as a separate issue against the truncate_utf8() function.

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese [...]

truncate_utf8() is a Drupal function, not a PHP function. (There's also no such thing as "UTF-9".)

#9

I think it's good idea to post about this in http://groups.drupal.org/japan

#10

Sorry, my mistake about PHP vs Drupal function on trucate_utf8.

Obviously utf9 was a typo. There is really no need to be so snarky in the issue queue when someone makes a typo. Thanks!

Can someone find an answer about whether spaces are valid word boundaries in all languages? I am extremely busy this week and don't have time to research it on IRC.

#11

@jhodgon:

Taking Chinese as an example, there are no characters which denote word boundaries, since a word is often wholly contained within a given character.

I've attached two scripts I've found that handle Chinese reasonably well using different approaches. However, keep in mind that this is just one language - the same applies more or less to Japanese, Korean and other similar languages which require somewhat different algorithms to parse.

So, short form, if we want truncate_utf8() to handle languages that don't use a space for word separation, it's going to have to get a whole lot more complicated and resource hungry. If you think this is worth pursuing further, let's do so in a separate issue. In the meantime, I think we can disregard the improper handling of CJK results for the purpose of this issue, at least.

AttachmentSizeStatusTest resultOperations
cwordseg_fast.txt5.47 KBIgnored: Check issue status.NoneNone
cwordseg_mp.txt7.69 KBIgnored: Check issue status.NoneNone

#12

Status:needs review» needs work

Thanks brianV for confirming what I suspected: a space is not a word boundary in all languages.

So the patch in #3 is not going to work for us in general. And since the search module unfortunately doesn't know anything about what language it's dealing with, I don't know what will. Not an easy problem.

Actually, we could fix the actual reported issue about HTML entities by converting HTML entities in this function to characters before truncating (not on a word boundary). It would still stop in a word, but it at least wouldn't make half-entities.

So I also took a quick look at the the 12 functions reported by api.drupal.org in D7 that call truncate_utf8() [note that there could be other calls in classes, since api.drupa.org is not displaying them yet!]. Most do not use the word boundary functionality. Those that do:
http://api.drupal.org/api/function/_menu_parents_recurse/7
http://api.drupal.org/api/function/_book_toc_recurse/7
http://api.drupal.org/api/function/theme_dblog_message/7
http://api.drupal.org/api/function/comment_submit/7

Those are also suspect.

Also I think the "..." after truncating that is in truncate_utf8 as well (almost all the calls either use that or put their own "..." afterwards) is possibly also not correct? I don't know about that.

We have some CJK word boundary code in the search module already, by the way. I don't know how well it works.

Not too sure what to do about all of this.

#13

Status:needs work» needs review

@jhodgon:

That is a separate issue about trancate_utf8(), and should be filed as such. It in no way affects this issue. The problems with CJK have to be fixed there, not in this issue.

#14

Status:needs review» needs work

Well, until truncate_utf8() is fixed, the solution suggested here is not acceptable, because it will likely result in zero-length excerpts in some languages, which would be worse than what we have currently (which is excerpts that break in the middle of a word). So I'm marking this back to "needs work".

Plus, I still don't know if putting "..." at the end of an excerpt is the right thing for all languages. Any ideas on that?

So, let's get back to basics here. The problem reported here is that the current way of doing things is broken in that it can break in the middle of an HTML entity. I think that one solution to that would be to use http://php.net/manual/en/function.html-entity-decode.php to decode the entities, then truncate without the word boundary restriction). That way we wouldn't be breaking up entities, and that would solve the reported issue.

And by the way, there are about 25 issues that come up when you search for truncate_utf8 (including this one). I will go through them later today and file a new one if this is a new issue about it not working for CJK.

But this one is almost certainly relevant:
#200185: truncate_utf8() is used as a substring function
As noted by Gabor:
http://drupal.org/node/200185#comment-662567
truncate_utf8() is NOT meant to be used as a substring function. So we should not be using it here.

#15

#16

Title:wrong trim used for search result » Wrong trim used for search result
Priority:minor» normal
Status:needs work» reviewed & tested by the community
Issue tags:+needs backport to D5, +needs backport to D6

#3 is ok.

@jhodgdon: in the current form of truncate_utf8(), if there is no space anywhere it will simply truncate at the indicated character length. This will never result in a zero-length string.

This likely should be backported to D5 and D6.

#17

Title:Wrong trim used for search result » wrong trim used for search result
Priority:normal» minor
Status:reviewed & tested by the community» needs work
Issue tags:-needs backport to D5, -needs backport to D6

New truncate_utf8() issue:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

#18

Title:wrong trim used for search result » Wrong trim used for search result
Priority:minor» normal
Status:needs work» reviewed & tested by the community
Issue tags:+needs backport to D5, +needs backport to D6

#19

Here's the patch for #3 against latest DRUPAL-5 and DRUPAL-6.

Also, jhodgon, the UTF-9 comment was not meant to be snarky. I really thought you thought there would be a truncate_utf9() function. Sorry about that. :/

AttachmentSizeStatusTest resultOperations
269911-3-wordsafe_trim_of_search_excerpts-d6.patch664 bytesIgnored: Check issue status.NoneNone
269911-3-wordsafe_trim_of_search_excerpts-d5.patch665 bytesIgnored: Check issue status.NoneNone

#20

Status:reviewed & tested by the community» needs review

Sorry about the cross-posting there, and about being prickly, Freso. :)

And Damien is correct about the truncate function working OK if a space isn't found (#16), although in this case it will still be broken as far as possibly breaking in the middle of an entity or HTML tag. (Hopefully we'll also fix the truncate function to work better for non-Latin languages, on that other issue).

But what about #14 suggestion -- shouldn't we be doing an entity decode and maybe a tag removal before we truncate?

#21

Version:7.x-dev» 6.x-dev
Status:needs review» reviewed & tested by the community

Committed to CVS HEAD. Thanks!

#24

Version:6.x-dev» 7.x-dev
Status:reviewed & tested by the community» needs work

See #20. Shouldn't we be decoding entitites and removing html tags before truncating?

#25

Yes, we should probably decode entities in search_excerpt(), just after the call to strip_tags().

#26

Status:needs work» needs review

Here's a patch that does an entity decode/encode.

It appears to work. I was able to test both the "I found keywords to highlight" and "I didn't" paths in the code by using the Porter Stemmer module, and both are displaying OK on the search results page.

AttachmentSizeStatusTest resultOperations
269911.patch1.75 KBIdleFAILED: [[SimpleTest]]: [MySQL] 19,164 pass(es), 2 fail(s), and 0 exception(es).View details

#27

Status:needs review» needs work

The last submitted patch, 269911.patch, failed testing.

#28

Looks like those test failures are real... will need to investigate.

#29

comment search test is fragile because it depends on exact string comparison in search results

#30

Yeah, probably the exact string comparison changed a bit and the test needs a rewrite if we go with this patch.

#33

Status:needs work» needs review

Here's a new patch. On my test box it passes the search tests... let's see what the testbot says.

AttachmentSizeStatusTest resultOperations
269911fixtests.patch1.82 KBIdlePASSED: [[SimpleTest]]: [MySQL] 19,164 pass(es).View details

#34

Addresses the issue in the OP. Can we have a test for this?

#35

Status:needs review» needs work

Yeah, a test would be a good idea indeed. A node with lots of entities would need to be constructed so we could verify the old function would break it wrong. Probably the test would also need to use a preprocessing function in order to exercise the "I didn't actually find the keyword, just take the first N characters" path in the code (the preprocess function changes the text so the exact search keyword isn't found in the node text, and then the excerpt function doesn't work well, see #493270: search_excerpt() doesn't work well with stemming, which I would have loved to get fixed for D7 but it was moved to D8, sigh, maybe I should move it back, since all sorts of other API changes have gotten in after the deadline, sigh, and that's a really ugly bug.)

That aside, we've also been discussing on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) that '...' is not the correct thing to do for non-Latin languages. So we should be passing the ... that is used in the search_excerpt() function through t() before pasting together the ranges.

#36

Status:needs work» needs review

OK.

I fixed the above issues and I wrote a test. It (of course) uncovered some other problems in this function, so I had to fix them too in order to get the test to pass.

AttachmentSizeStatusTest resultOperations
269911-withtests.patch5.02 KBIdlePASSED: [[SimpleTest]]: [MySQL] 19,203 pass(es).View details

#37

Status:needs review» needs work

Patch works for me, a couple minor comments:

  • Is the space after the $text not needed in strpos(' ' . $text) ... it's in original source, should it be strpos(' ' . $text . ' ')?
  • Avoid running strlen() twice, by using min(), [untested] something like $s = min(strlen($end) - 1, $s);
  • ... if you do the above min(), it makes sense to do same thing using max() in line above it.
  • Please add comment above htmlspecialchars (around +1117) that says something about re-encoding previously decoded html entities.
  • Similarly, comment // If we didn't find anything, return the beginning. doesn't make sense now that we're re-encoding it, and not just returning the beginning string (and "return the beginning" isn't a complete enough sentence anyways).

#38

Regarding space before/after:

        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);

The space before is in the first line, and the space after is in the second line.

Otherwise, I think I've addressed your comments - thanks!

AttachmentSizeStatusTest resultOperations
269911-cleanup.patch5.19 KBIdlePASSED: [[SimpleTest]]: [MySQL] 19,592 pass(es).View details

#39

Status:needs work» needs review

#40

Status:needs review» reviewed & tested by the community

Looks great! I'm not ecstatic about the comments, but let's address those separately .. maybe we can work together on all search comments and have a single cleanup patch.

#41

Status:reviewed & tested by the community» needs work

Could we better comment that test a little?

#42

Status:needs work» needs review
Issue tags:+Needs Documentation

I added the following as an introduction to the test case:

'Passes keywords and a sample marked up string, "The quick brown fox jumps over the lazy dog", and compares it to the correctly marked up string. The correctly marked up string contains either highlighted keywords or the original marked up string if no keywords matched the string.'

AttachmentSizeStatusTest resultOperations
269911-cleanup-docs.patch5.71 KBIdleFAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 269911-cleanup-docs.patch.View details

#43

Looks fine to me...

#44

Status:needs review» reviewed & tested by the community

#45

Status:reviewed & tested by the community» needs work

Does not apply anymore.

#46

Status:needs work» needs review

#42: 269911-cleanup-docs.patch queued for re-testing.

#47

Status:needs review» needs work

The last submitted patch, 269911-cleanup-docs.patch, failed testing.

#48

Status:needs work» needs review

I re-rolled the patch, which didn't apply anymore. It seems my editor caught some spaces that should have been removed.

AttachmentSizeStatusTest resultOperations
269911_cleanup_docs_rerolled.patch10.72 KBIdleFAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 269911_cleanup_docs_rerolled.patch.View details

#49

Status:needs review» needs work

The last submitted patch, 269911_cleanup_docs_rerolled.patch, failed testing.

#50

Status:needs work» needs review

Let's try once more.

AttachmentSizeStatusTest resultOperations
269911_cleanup_docs_cleaned.patch5.74 KBIdlePASSED: [[SimpleTest]]: [MySQL] 22,747 pass(es).View details

#51

Status:needs review» reviewed & tested by the community

Looks good to me.

#52

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

#53

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

#54

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

#55

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

#56

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

#57

Title:Wrong trim used for search result » Search result trimming should not fall inside HTML entities/tags

pwolanin asked me to change the title to something better

#58

Status:reviewed & tested by the community» needs work

Please use our own decode_entities() instead of html_entity_decode(), which is buggy. And check_plain() instead of htmlspecialchars().

Also, this is not really great for translatability purposes:

<?php
$text = (isset($newranges[0]) ? '' : t('... ')) . implode(t(' ... '), $out) . t(' ...');
?>

I suggest we do it this way:

<?php
$separators
= explode('!excerpt', t('... !excerpt ... !excerpt ...'));
$text = (isset($newranges[0]) ? '' : $separators[0]) . implode($separators[1]), $out) . $separators[2]);
?>

That will give more context to translators.

#59

Good suggestions. I'll make another patch later.

#60

Status:needs work» needs review

Here we go (I hope)... let's see if it still passes the tests at least.

AttachmentSizeStatusTest resultOperations
269911-drupalway.patch5.59 KBIdleFAILED: [[SimpleTest]]: [MySQL] 22,745 pass(es), 1 fail(s), and 0 exception(es).View details

#61

Status:needs review» needs work

The last submitted patch, 269911-drupalway.patch, failed testing.

#62

Hm... Failed "Entities are converted in excerpt". Perhaps the substitution of the drupal functions wasn't successful. Will need to investigate.

#63

Status:needs work» needs review

Ah. Looks like there was a screwed-up character in that last patch. This should work better.

AttachmentSizeStatusTest resultOperations
269911-fixchars.patch5.59 KBIdlePASSED: [[SimpleTest]]: [MySQL] 22,762 pass(es).View details

#64

jhodgdon, could you use cvs diff -up to see which functions are changing

While you on it, is it possible to make highlighting swappable?
As maintainer of http://drupal.org/project/rustemmer I need to patch search.module because I need highlight a stemmed words which mostly different from words contained in fragment

#65

andypost: I wish. I use Eclipse for patching and they don't have a -p option as far as I can tell. I'm too lazy to learn another tool...

Regarding making highlighting swappable: I tried, and it was rejected for D7. So it's been pushed off to D8. I know what you're saying -- I am the maintainer of Porter Stemmer for English... Anyway, it's a separate issue. See
#493270: search_excerpt() doesn't work well with stemming
If you're interested, I do have a working version of this in the Search by Page contrib module and Porter Stemmer (with a hook in SBP that lets Porter Stemmer do its own highlighting). Your stemming module could implement the same hook... I have both D6 and D7 versions in SBP/Porter.

And I would still love to get that into D7, but you can read the history on that other issue...

#66

Status:needs review» reviewed & tested by the community

I think it's possible we can rip decode_entities() in favor of the php built-in html_entity_decode() since Drupal 7 requires PHP 5.2 and utf8 support was added in 5.0. http://www.php.net/manual/en/function.html-entity-decode.php

However, that should be a separate issue. This patch looks good.

#67

Status:reviewed & tested by the community» fixed

Committed to CVS HEAD. Thanks.

#68

@pwolanin: I can confirm that html_entity_decode($input, ENT_QUOTES, 'UTF-8') works correctly and passes our tests. This charset parameter must have hidden in a cave or something... we refactored the whole decode_entities() in D7 to workaround supposed limitations in the entity table set of PHP :)

#69

ok, well assuming the built-in is dramatically faster, let's have a follow-up for that conversion: #878408: Replace decode_entities() with built-in html_entity_decode()

#70

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

nobody click here