Search result trimming should not fall inside HTML entities/tags [#269911]

Comment	File	Size	Author
#63	269911-fixchars.patch	5.59 KB	jhodgdon
#60	269911-drupalway.patch	5.59 KB	jhodgdon
#50	269911_cleanup_docs_cleaned.patch	5.74 KB	avpaderno
#48	269911_cleanup_docs_rerolled.patch	10.72 KB	avpaderno
#42	269911-cleanup-docs.patch	5.71 KB	mradcliffe
#38	269911-cleanup.patch	5.19 KB	jhodgdon
#36	269911-withtests.patch	5.02 KB	jhodgdon
#33	269911fixtests.patch	1.82 KB	jhodgdon
#26	269911.patch	1.75 KB	jhodgdon
#19	269911-3-wordsafe_trim_of_search_excerpts-d6.patch	664 bytes	Freso
#19	269911-3-wordsafe_trim_of_search_excerpts-d5.patch	665 bytes	Freso
#11	cwordseg_fast.txt	5.47 KB	brianV
#11	cwordseg_mp.txt	7.69 KB	brianV
#3	269911-3-wordsafe_trim_of_search_excerpts-d7.patch	654 bytes	Freso

Comment #1

brianV commented 4 June 2009 at 04:14

Version:

6.2

» 7.x-dev

Just assigning this to D7, since that is where this will have to be fixed.

Search results are trimming at a given character limit. They should be trimmed at a word boundary. Should be an easy fix - if nobody else takes it on, I will get to it eventually.

Log in or register to post comments

Comment #2

jhodgdon

she/her

English

commented 18 February 2010 at 22:46

Looks like this still needs to be fixed, and can we have a test for it too?

Log in or register to post comments

Comment #3

Freso commented 27 February 2010 at 17:26

Status:	Active	» Needs review
Issue tags:		+Needs tests

Status	File	Size
new	269911-3-wordsafe_trim_of_search_excerpts-d7.patch	654 bytes

I believe this this will fix it, but I haven't tested. There should be a test for this.

Log in or register to post comments

Comment #4

jhodgdon

she/her

English

commented 28 February 2010 at 15:11

The patch looks like it should work in the case that the exerpt didn't find the keywords... A couple of questions:

a) Is it possible for search_exerpt to truncate incorrectly if it did find a keyword? ... I looked at the code: No. So that's OK.

b) Is putting " ..." after the excerpt the right thing to do in all languages? I doubt it. However, that's probably a separate issue.

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

Log in or register to post comments

Comment #5

pasqualle

🇭🇺 Budapest

commented 29 March 2010 at 17:25

similar problem fixed in dblog module
#718662: DBLog listings truncate messages in the middle of HTML tags

Log in or register to post comments

Comment #6

andypost

he/him

Russian

commented 29 March 2010 at 19:07

Interesting question about word boundary. Suppose $boundary is language specific so should be defined in iso.inc

Log in or register to post comments

Comment #7

jhodgdon

she/her

English

commented 29 March 2010 at 19:21

Perhaps (word boundary definitions in iso.inc)... But the search excerpt function is making a search excerpt of the page for search results, which is trucated content (e.g. node title/body truncated to N characters), and the content could be in any language. Search doesn't know about languages, in its current incarnation. So having a language-aware set of boundary characters would not help until search becomes more language aware. (There are separate issues about that, but at this point it's D8 material.)

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese -- I know just enough about languages and UTF to know that I don't know much and bring up questions. Who should we ask?

Log in or register to post comments

Comment #8

Freso commented 1 April 2010 at 14:36

c) Will this patch work for languages (Chinese, Japanese) that don't have spaces between words? No, I don't think so. It looks to me as though in this case, the proposed patch will probably return a 0-character excerpt, which is not OK. Or is there something I am missing? I think the correct solution is to use $boundary to find a word boundary, rather than the truncate_utf8() function?

If this is indeed an issue, and it might very well be, it should be filed as a separate issue against the truncate_utf8() function.

Anyway, both this patch and #718662 patch are using the PHP function truncate_utf8(), which obviously wouldn't care about Drupal's iso.inc or any other definitions of word boundaries. I don't know whether or not truncate_utf9() does something sensible for Chinese/Japanese [...]

truncate_utf8() is a Drupal function, not a PHP function. (There's also no such thing as "UTF-9".)

Log in or register to post comments

Comment #9

andypost

he/him

Russian

commented 1 April 2010 at 14:52

I think it's good idea to post about this in http://groups.drupal.org/japan

Log in or register to post comments

Comment #10

jhodgdon

she/her

English

commented 1 April 2010 at 15:10

Sorry, my mistake about PHP vs Drupal function on trucate_utf8.

Obviously utf9 was a typo. There is really no need to be so snarky in the issue queue when someone makes a typo. Thanks!

Can someone find an answer about whether spaces are valid word boundaries in all languages? I am extremely busy this week and don't have time to research it on IRC.

Log in or register to post comments

Comment #11

brianV commented 1 April 2010 at 16:46

Status	File	Size
new	cwordseg_mp.txt	7.69 KB
new	cwordseg_fast.txt	5.47 KB

@jhodgon:

Taking Chinese as an example, there are no characters which denote word boundaries, since a word is often wholly contained within a given character.

I've attached two scripts I've found that handle Chinese reasonably well using different approaches. However, keep in mind that this is just one language - the same applies more or less to Japanese, Korean and other similar languages which require somewhat different algorithms to parse.

So, short form, if we want truncate_utf8() to handle languages that don't use a space for word separation, it's going to have to get a whole lot more complicated and resource hungry. If you think this is worth pursuing further, let's do so in a separate issue. In the meantime, I think we can disregard the improper handling of CJK results for the purpose of this issue, at least.

Log in or register to post comments

Comment #12

jhodgdon

she/her

English

commented 1 April 2010 at 17:16

Status:

Needs review

» Needs work

Thanks brianV for confirming what I suspected: a space is not a word boundary in all languages.

So the patch in #3 is not going to work for us in general. And since the search module unfortunately doesn't know anything about what language it's dealing with, I don't know what will. Not an easy problem.

Actually, we could fix the actual reported issue about HTML entities by converting HTML entities in this function to characters before truncating (not on a word boundary). It would still stop in a word, but it at least wouldn't make half-entities.

So I also took a quick look at the the 12 functions reported by api.drupal.org in D7 that call truncate_utf8() [note that there could be other calls in classes, since api.drupa.org is not displaying them yet!]. Most do not use the word boundary functionality. Those that do:
http://api.drupal.org/api/function/_menu_parents_recurse/7
http://api.drupal.org/api/function/_book_toc_recurse/7
http://api.drupal.org/api/function/theme_dblog_message/7
http://api.drupal.org/api/function/comment_submit/7

Those are also suspect.

Also I think the "..." after truncating that is in truncate_utf8 as well (almost all the calls either use that or put their own "..." afterwards) is possibly also not correct? I don't know about that.

We have some CJK word boundary code in the search module already, by the way. I don't know how well it works.

Not too sure what to do about all of this.

Log in or register to post comments

Comment #13

brianV commented 1 April 2010 at 18:39

Status:

Needs work

» Needs review

@jhodgon:

That is a separate issue about trancate_utf8(), and should be filed as such. It in no way affects this issue. The problems with CJK have to be fixed there, not in this issue.

Log in or register to post comments

Comment #14

jhodgdon

she/her

English

commented 2 April 2010 at 15:43

Status:

Needs review

» Needs work

Well, until truncate_utf8() is fixed, the solution suggested here is not acceptable, because it will likely result in zero-length excerpts in some languages, which would be worse than what we have currently (which is excerpts that break in the middle of a word). So I'm marking this back to "needs work".

Plus, I still don't know if putting "..." at the end of an excerpt is the right thing for all languages. Any ideas on that?

So, let's get back to basics here. The problem reported here is that the current way of doing things is broken in that it can break in the middle of an HTML entity. I think that one solution to that would be to use http://php.net/manual/en/function.html-entity-decode.php to decode the entities, then truncate without the word boundary restriction). That way we wouldn't be breaking up entities, and that would solve the reported issue.

And by the way, there are about 25 issues that come up when you search for truncate_utf8 (including this one). I will go through them later today and file a new one if this is a new issue about it not working for CJK.

But this one is almost certainly relevant:
#200185: truncate_utf8() is used as a substring function
As noted by Gabor:
http://drupal.org/node/200185#comment-662567
truncate_utf8() is NOT meant to be used as a substring function. So we should not be using it here.

Log in or register to post comments

Comment #15

jhodgdon

she/her

English

commented 11 April 2010 at 14:18

Separate issue on truncate_utf8()
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Log in or register to post comments

Comment #16

damien tournoud commented 11 April 2010 at 14:26

Title:	wrong trim used for search result	» Wrong trim used for search result
Priority:	Minor	» Normal
Status:	Needs work	» Reviewed & tested by the community
Issue tags:		+Needs backport to D6, +Needs backport to D5

#3 is ok.

@jhodgdon: in the current form of truncate_utf8(), if there is no space anywhere it will simply truncate at the indicated character length. This will never result in a zero-length string.

This likely should be backported to D5 and D6.

Log in or register to post comments

Comment #17

jhodgdon

she/her

English

commented 11 April 2010 at 14:27

Title:	Wrong trim used for search result	» wrong trim used for search result
Priority:	Normal	» Minor
Status:	Reviewed & tested by the community	» Needs work
Issue tags:	-Needs backport to D6, -Needs backport to D5

New truncate_utf8() issue:
#768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug)

Log in or register to post comments

Comment #18

damien tournoud commented 11 April 2010 at 14:35

Title:	wrong trim used for search result	» Wrong trim used for search result
Priority:	Minor	» Normal
Status:	Needs work	» Reviewed & tested by the community
Issue tags:		+Needs backport to D6, +Needs backport to D5

Log in or register to post comments

Comment #19

Freso commented 11 April 2010 at 14:40

Issue tags:

-Needs backport to D6, -Needs backport to D5

Status	File	Size
new	269911-3-wordsafe_trim_of_search_excerpts-d5.patch	665 bytes
new	269911-3-wordsafe_trim_of_search_excerpts-d6.patch	664 bytes

Here's the patch for #3 against latest DRUPAL-5 and DRUPAL-6.

Also, jhodgon, the UTF-9 comment was not meant to be snarky. I really thought you thought there would be a truncate_utf9() function. Sorry about that. :/

Log in or register to post comments

Comment #20

jhodgdon

she/her

English

commented 11 April 2010 at 14:50

Status:

Reviewed & tested by the community

» Needs review

Sorry about the cross-posting there, and about being prickly, Freso. :)

And Damien is correct about the truncate function working OK if a space isn't found (#16), although in this case it will still be broken as far as possibly breaking in the middle of an entity or HTML tag. (Hopefully we'll also fix the truncate function to work better for non-Latin languages, on that other issue).

But what about #14 suggestion -- shouldn't we be doing an entity decode and maybe a tag removal before we truncate?

Log in or register to post comments

Comment #21

dries commented 11 April 2010 at 14:52

Version:	7.x-dev	» 6.x-dev
Status:	Needs review	» Reviewed & tested by the community

Committed to CVS HEAD. Thanks!

Log in or register to post comments

Comment #24

jhodgdon

she/her

English

commented 11 April 2010 at 16:59

Version:	6.x-dev	» 7.x-dev
Status:	Reviewed & tested by the community	» Needs work

See #20. Shouldn't we be decoding entitites and removing html tags before truncating?

Log in or register to post comments

Comment #25

damien tournoud commented 11 April 2010 at 17:10

Yes, we should probably decode entities in search_excerpt(), just after the call to strip_tags().

Log in or register to post comments

Comment #26

jhodgdon

she/her

English

commented 13 April 2010 at 16:27

Status:

Needs work

» Needs review

Status	File	Size
new	269911.patch	1.75 KB

Here's a patch that does an entity decode/encode.

It appears to work. I was able to test both the "I found keywords to highlight" and "I didn't" paths in the code by using the Porter Stemmer module, and both are displaying OK on the search results page.

Log in or register to post comments

Comment #27

13 April 2010 at 16:50

Status:

Needs review

» Needs work

The last submitted patch, 269911.patch, failed testing.

Log in or register to post comments

Comment #28

jhodgdon

she/her

English

commented 13 April 2010 at 17:56

Looks like those test failures are real... will need to investigate.

Log in or register to post comments

Comment #29

andypost

he/him

Russian

commented 13 April 2010 at 18:19

comment search test is fragile because it depends on exact string comparison in search results

Log in or register to post comments

Comment #30

jhodgdon

she/her

English

commented 13 April 2010 at 18:38

Yeah, probably the exact string comparison changed a bit and the test needs a rewrite if we go with this patch.

Log in or register to post comments

Comment #33

jhodgdon

she/her

English

commented 13 April 2010 at 22:47

Status:

Needs work

» Needs review

Status	File	Size
new	269911fixtests.patch	1.82 KB

Here's a new patch. On my test box it passes the search tests... let's see what the testbot says.

Log in or register to post comments

Comment #34

jpmckinney commented 14 April 2010 at 18:29

Addresses the issue in the OP. Can we have a test for this?

Log in or register to post comments

Comment #35

jhodgdon

she/her

English

commented 15 April 2010 at 13:08

Status:

Needs review

» Needs work

Yeah, a test would be a good idea indeed. A node with lots of entities would need to be constructed so we could verify the old function would break it wrong. Probably the test would also need to use a preprocessing function in order to exercise the "I didn't actually find the keyword, just take the first N characters" path in the code (the preprocess function changes the text so the exact search keyword isn't found in the node text, and then the excerpt function doesn't work well, see #493270: search_excerpt() doesn't work well with stemming, which I would have loved to get fixed for D7 but it was moved to D8, sigh, maybe I should move it back, since all sorts of other API changes have gotten in after the deadline, sigh, and that's a really ugly bug.)

That aside, we've also been discussing on #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) that '...' is not the correct thing to do for non-Latin languages. So we should be passing the ... that is used in the search_excerpt() function through t() before pasting together the ranges.

Log in or register to post comments

Comment #36

jhodgdon

she/her

English

commented 20 April 2010 at 05:11

Status:

Needs work

» Needs review

Status	File	Size
new	269911-withtests.patch	5.02 KB

OK.

I fixed the above issues and I wrote a test. It (of course) uncovered some other problems in this function, so I had to fix them too in order to get the test to pass.

Log in or register to post comments

Comment #37

douggreen commented 20 April 2010 at 06:13

Status:

Needs review

» Needs work

Patch works for me, a couple minor comments:

Is the space after the $text not needed in strpos(' ' . $text) ... it's in original source, should it be strpos(' ' . $text . ' ')?
Avoid running strlen() twice, by using min(), [untested] something like $s = min(strlen($end) - 1, $s);
... if you do the above min(), it makes sense to do same thing using max() in line above it.
Please add comment above htmlspecialchars (around +1117) that says something about re-encoding previously decoded html entities.
Similarly, comment // If we didn't find anything, return the beginning. doesn't make sense now that we're re-encoding it, and not just returning the beginning string (and "return the beginning" isn't a complete enough sentence anyways).

Log in or register to post comments

Comment #38

jhodgdon

she/her

English

commented 20 April 2010 at 16:20

Status	File	Size
new	269911-cleanup.patch	5.19 KB

Regarding space before/after:

        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);

The space before is in the first line, and the space after is in the second line.

Otherwise, I think I've addressed your comments - thanks!

Log in or register to post comments

Comment #39

jhodgdon

she/her

English

commented 20 April 2010 at 16:20

Status:

Needs work

» Needs review

Log in or register to post comments

Comment #40

douggreen commented 20 April 2010 at 18:17

Status:

Needs review

» Reviewed & tested by the community

Looks great! I'm not ecstatic about the comments, but let's address those separately .. maybe we can work together on all search comments and have a single cleanup patch.

Log in or register to post comments

Comment #41

webchick

she/they

English

Vancouver 🇨🇦

commented 21 April 2010 at 08:17

Status:

Reviewed & tested by the community

» Needs work

Could we better comment that test a little?

Log in or register to post comments

Comment #42

mradcliffe

he/him

English

commented 22 April 2010 at 05:34

Status:	Needs work	» Needs review
Issue tags:		+Needs documentation

Status	File	Size
new	269911-cleanup-docs.patch	5.71 KB

I added the following as an introduction to the test case:

'Passes keywords and a sample marked up string, "The quick brown fox jumps over the lazy dog", and compares it to the correctly marked up string. The correctly marked up string contains either highlighted keywords or the original marked up string if no keywords matched the string.'

Log in or register to post comments

Comment #43

jhodgdon

she/her

English

commented 22 April 2010 at 06:57

Issue tags:

-Needs tests, -Needs documentation

Looks fine to me...

Log in or register to post comments

Comment #44

jhodgdon

she/her

English

commented 22 April 2010 at 06:57

Status:

Needs review

» Reviewed & tested by the community

Log in or register to post comments

Comment #45

klausi

he/him||they/them

German

🇦🇹 Vienna

commented 31 May 2010 at 09:43

Status:

Reviewed & tested by the community

» Needs work

Does not apply anymore.

Log in or register to post comments

Comment #46

jhodgdon

she/her

English

commented 31 May 2010 at 16:42

Status:

Needs work

» Needs review

#42: 269911-cleanup-docs.patch queued for re-testing.

Log in or register to post comments

Comment #47

31 May 2010 at 17:00

Status:

Needs review

» Needs work

The last submitted patch, 269911-cleanup-docs.patch, failed testing.

Log in or register to post comments

Comment #48

avpaderno

he/him

Italian

Brescia, 🇮🇹 🇪🇺

commented 31 May 2010 at 18:00

Status:

Needs work

» Needs review

Status	File	Size
new	269911_cleanup_docs_rerolled.patch	10.72 KB

I re-rolled the patch, which didn't apply anymore. It seems my editor caught some spaces that should have been removed.

Log in or register to post comments

Comment #49

31 May 2010 at 18:20

Status:

Needs review

» Needs work

The last submitted patch, 269911_cleanup_docs_rerolled.patch, failed testing.

Log in or register to post comments

Comment #50

avpaderno

he/him

Italian

Brescia, 🇮🇹 🇪🇺

commented 31 May 2010 at 19:24

Status:

Needs work

» Needs review

Status	File	Size
new	269911_cleanup_docs_cleaned.patch	5.74 KB

Let's try once more.

Log in or register to post comments

Comment #51

jhodgdon

she/her

English

commented 1 June 2010 at 14:00

Status:

Needs review

» Reviewed & tested by the community

Looks good to me.

Log in or register to post comments

Comment #52

jhodgdon

she/her

English

commented 10 June 2010 at 22:37

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

Log in or register to post comments

Comment #53

jhodgdon

she/her

English

commented 25 June 2010 at 14:51

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

Log in or register to post comments

Comment #54

pwolanin commented 9 July 2010 at 02:37

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

Log in or register to post comments

Comment #55

jhodgdon

she/her

English

commented 20 July 2010 at 17:24

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

Log in or register to post comments

Comment #56

jhodgdon

she/her

English

commented 5 August 2010 at 14:58

#50: 269911_cleanup_docs_cleaned.patch queued for re-testing.

Log in or register to post comments

Comment #57

jhodgdon

she/her

English

commented 6 August 2010 at 18:10

Title:

Wrong trim used for search result

» Search result trimming should not fall inside HTML entities/tags

pwolanin asked me to change the title to something better

Log in or register to post comments

Comment #58

damien tournoud commented 6 August 2010 at 18:35

Status:

Reviewed & tested by the community

» Needs work

Please use our own decode_entities() instead of html_entity_decode(), which is buggy. And check_plain() instead of htmlspecialchars().

Also, this is not really great for translatability purposes:

+  $text = (isset($newranges[0]) ? '' : t('... ')) . implode(t(' ... '), $out) . t(' ...');

I suggest we do it this way:

$separators = explode('!excerpt', t('... !excerpt ... !excerpt ...'));
$text = (isset($newranges[0]) ? '' : $separators[0]) . implode($separators[1]), $out) . $separators[2]);

That will give more context to translators.

Log in or register to post comments

Comment #59

jhodgdon

she/her

English

commented 6 August 2010 at 19:53

Good suggestions. I'll make another patch later.

Log in or register to post comments

Comment #60

jhodgdon

she/her

English

commented 6 August 2010 at 21:07

Status:

Needs work

» Needs review

Status	File	Size
new	269911-drupalway.patch	5.59 KB

Here we go (I hope)... let's see if it still passes the tests at least.

Log in or register to post comments

Comment #61

6 August 2010 at 22:10

Status:

Needs review

» Needs work

The last submitted patch, 269911-drupalway.patch, failed testing.

Log in or register to post comments

Comment #62

jhodgdon

she/her

English

commented 6 August 2010 at 22:22

Hm... Failed "Entities are converted in excerpt". Perhaps the substitution of the drupal functions wasn't successful. Will need to investigate.

Log in or register to post comments

Comment #63

jhodgdon

she/her

English

commented 6 August 2010 at 22:38

Status:

Needs work

» Needs review

Status	File	Size
new	269911-fixchars.patch	5.59 KB

Ah. Looks like there was a screwed-up character in that last patch. This should work better.

Log in or register to post comments

Comment #64

andypost

he/him

Russian

commented 7 August 2010 at 08:43

jhodgdon, could you use cvs diff -up to see which functions are changing

While you on it, is it possible to make highlighting swappable?
As maintainer of http://drupal.org/project/rustemmer I need to patch search.module because I need highlight a stemmed words which mostly different from words contained in fragment

Log in or register to post comments

Comment #65

jhodgdon

she/her

English

commented 7 August 2010 at 15:21

andypost: I wish. I use Eclipse for patching and they don't have a -p option as far as I can tell. I'm too lazy to learn another tool...

Regarding making highlighting swappable: I tried, and it was rejected for D7. So it's been pushed off to D8. I know what you're saying -- I am the maintainer of Porter Stemmer for English... Anyway, it's a separate issue. See
#493270: search_excerpt() doesn't work well with stemming
If you're interested, I do have a working version of this in the Search by Page contrib module and Porter Stemmer (with a hook in SBP that lets Porter Stemmer do its own highlighting). Your stemming module could implement the same hook... I have both D6 and D7 versions in SBP/Porter.

And I would still love to get that into D7, but you can read the history on that other issue...

Log in or register to post comments

Comment #66

pwolanin commented 9 August 2010 at 23:09

Status:

Needs review

» Reviewed & tested by the community

I think it's possible we can rip decode_entities() in favor of the php built-in html_entity_decode() since Drupal 7 requires PHP 5.2 and utf8 support was added in 5.0. http://www.php.net/manual/en/function.html-entity-decode.php

However, that should be a separate issue. This patch looks good.

Log in or register to post comments

Comment #67

dries commented 10 August 2010 at 01:11

Status:

Reviewed & tested by the community

» Fixed

Committed to CVS HEAD. Thanks.

Log in or register to post comments

Comment #68

damien tournoud commented 10 August 2010 at 01:22

@pwolanin: I can confirm that html_entity_decode($input, ENT_QUOTES, 'UTF-8') works correctly and passes our tests. This charset parameter must have hidden in a cave or something... we refactored the whole decode_entities() in D7 to workaround supposed limitations in the entity table set of PHP :)

Log in or register to post comments

Comment #69

pwolanin commented 10 August 2010 at 01:49

ok, well assuming the built-in is dramatically faster, let's have a follow-up for that conversion: #878408: Replace decode_entities() with built-in html_entity_decode()

Log in or register to post comments

Comment #70

24 August 2010 at 01:50

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Log in or register to post comments

Search result trimming should not fall inside HTML entities/tags

Comments