GHOP #135 Add an escape mechanism to Glossary module [#208217]

Add an escape mechanism to Glossary module

Task description

The heart of the Glossary module is a Drupal filter (see 'Resources' below). There are three cases in which the current glossary filter doesn't work as desired. Adding a mechanism to address this will make this module much more useful.

occasionally, the user just wants to skip a section of content entirely;
in some cases, we need to flag a phrase rather than single words ("coal mining" rather than "coal" and "mining");
within certain already identified sections, such as a block of code, we need to not flag the terms as this can lead to erroneous copy-and-pasting.

Possible approaches

It has been suggested that special case <SPAN> tags could be used. For example:

<span class="glossary-skip">a section of content that should not be scanned for terms</span>
<span class="glossary-phrase">a phrase that should be used as a single entity</span>
<code>code that should be ignored when scanning.</code>

Another suggestion has been to use some code inside comment tags ().

Deliverables

Before coding, a brief description of the mechanism chosen.
A patch ready to be committed, created against the current "-dev" version that provides the desired functionality. This may include (but is not anticipated) changes to the README.txt distribution, if appropriate.
Changes or additions ready to be committed (other than minor editorial corrections) to the Handbook pages.

Resources

Primary contact

Nancy Wichmann, direct email

Comment	File	Size	Author
#24	glossary_documentation1.txt	1.2 KB	mwrochna
#23	glossary_filter5.patch	12.07 KB	mwrochna
#21	glossary_filter4.patch	10.34 KB	mwrochna
#19	glossary_phrase_list.jpg	11.96 KB	nancydru
#19	glossary_phrase.jpg	54.93 KB	nancydru
#17	glossary_filter3.patch	10.34 KB	mwrochna
#15	glossary_list.jpg	106.87 KB	nancydru
#15	glossary_prob1.jpg	146.54 KB	nancydru
#13	glossary_filter2.patch	8.2 KB	mwrochna
#8	glossary_filter1.patch	8.86 KB	mwrochna

Comments

Comment #1

nancydru

she/her/hers

English

Boston

commented 10 January 2008 at 01:24

Title:

GHOP# xx Add an escape mechanism to Glossary module

» GHOP #133 Add an escape mechanism to Glossary module

Comment #2

nancydru

she/her/hers

English

Boston

commented 10 January 2008 at 18:10

Title:

GHOP #133 Add an escape mechanism to Glossary module

» GHOP #135 Add an escape mechanism to Glossary module

Comment #3

nancydru

she/her/hers

English

Boston

commented 11 January 2008 at 03:05

If someone claims this, I have also discovered that it is inserting links inside IMG tags.

Comment #4

nancydru

she/her/hers

English

Boston

commented 21 January 2008 at 22:04

Claimed by mwrochna.

Comment #5

mwrochna commented 22 January 2008 at 18:46

Assigned:	Unassigned	» mwrochna
Status:	Active	» Postponed (maintainer needs more info)

a) Personally for skipping I would use square brackets [no-glossary]mismatches[/no-glossary], mainly because it's the shortest form I can think of. What do you think about it? Maybe I should do a quick research (ex. contact glossary contributors for opinions)? Also it's easier to implement, and probably faster (than searching for matching </span>s)
b) about glossary-phrases - are overlaping matches the only problem (like in terms C,C++)? If so, glossary terms can include spaces, so it would be enough to, for ex., put every match in 'invisible' <acronym> or [no-glossary] tags.
c) I can't reproduce any match within img tags (<img sth here />), they shouldn't match, because < and > chars are counted. Could you precise the input format and matching settings used?

Comment #6

nancydru

she/her/hers

English

Boston

commented 22 January 2008 at 22:20

a) That looks fine to me; short is better. One reason for the original suggestions was that they could be left in the HTML output for later debugging, but I am not married to that idea. Please make sure it's in keeping with the recent Codefilter changes that are also using square brackets. http://drupal.org/node/38047 and http://drupal.org/node/210046

b) Additionally, http://drupal.org/node/82776 mentions wanting to do "coal mining" rather than "coal" and "mining."

c) <img src="/files/docs/drupal_add_cvs.jpg"> ended up with the "drupal" part being flagged because I have a "Drupal" term in my glossary. This caused a "page not found" every time the page displayed.

Comment #7

nancydru

she/her/hers

English

Boston

commented 23 January 2008 at 13:01

Status:

Postponed (maintainer needs more info)

» Active

Comment #8

mwrochna commented 26 January 2008 at 17:36

Status	File	Size
new	glossary_filter1.patch	8.86 KB

Sorry for the delay, I ran into troubles trying to profile the code.
I almost entirely rewrote the __glossary_insertlink() function, now it should be faster and a little more 'powerful'.
It doesn't check backward for every match, so it's at least twice as fast on my test site with 20 words in the glossary.
It should use twice less memory for large texts, but it can be really memory-hungry for big glossaries (because it keeps all $ins_after parts in memory), I'll try to fix that (this will probably need moving the 'Find match candidates' part into _glossary_filter_process()).

[no-glossary] tags are implemented, escaping of code tags too.
b) 'cole mining' was also a problem of duplicate matches overlapping, it's fixed now. We don't need a "glossary-phrase" tag, unless we'd like to leave support for overlapping matches too.
c) The only problem I can see with <img> tags is when there are additional '>' chars (ex. used as 'less-than', not as html tags). Now if there is a '<' char, everything will be escaped until a '>' char. Do the <img> tags still not work? Maybe we should skip '<' chars that have whitespace or digits after it?
d) I'm not sure if we need to escape tags that probably won't contain matches (like [% ) ? Another reason is that ex. the GeShi filter allows custom code tags, so code filters should process the text earlier anyway, and then we only need to check for html <code> and <pre> tags. On the other hand it doesn't slow the parsing down that much. So I left the other version commented out on line 539.

Comment #9

mwrochna commented 26 January 2008 at 17:37

Status:

Active

» Needs review

Comment #10

nancydru

she/her/hers

English

Boston

commented 26 January 2008 at 19:00

Ouch! It looks like I need to stop working on other issues while you're doing this one. I made a big change for non-Latin characters and it's missing from your code. I hope this patch applies.

Comment #11

nancydru

she/her/hers

English

Boston

commented 26 January 2008 at 19:08

Status:

Needs review

» Needs work

Hunk #1 succeeded at 516 with fuzz 1 (offset 37 lines).
Hunk #2 FAILED at 549.

Is it possible to get the current 5.x-1.x-dev version and re-do the changes? I will postpone current issues until you are finished.

Comment #12

mwrochna commented 26 January 2008 at 19:46

Oh, sorry, my mistake.
I've glanced at the differences and I see no problem, I'll update a merged patch in a moment.

Comment #13

mwrochna commented 26 January 2008 at 21:28

Status:

Needs work

» Needs review

Status	File	Size
new	glossary_filter2.patch	8.2 KB

I just had a power outage and I still have some troubles with connection :)
I hope this patch works.

Comment #14

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 00:14

It applied cleanly. I'll test it now.

Comment #15

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 00:43

Status	File	Size
new	glossary_prob1.jpg	146.54 KB
new	glossary_list.jpg	106.87 KB

Okay, on first pass, I see several terms that are not flagged at all. I also see a term that is matched on a substring even though my setting is "whole word." See the attached screenshots.

Comment #16

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 00:44

Status:

Needs review

» Needs work

Comment #17

mwrochna commented 27 January 2008 at 10:27

Status:

Needs work

» Needs review

Status	File	Size
new	glossary_filter3.patch	10.34 KB

Oops, I forgot to rename a variable when merging, which caused the filter to act randomly.
Also there was one bug with multibyte matches and one when matching synonyms.

Comment #18

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 14:36

Great, thanks. I'll give this a shot in a moment.

Those problems are definitely gone. The problem of inserting flags within an IMG tag seems to be solved. I don't see any CODE flagged at the moment (which is good). "[no-glossary]" is working.

I'll do some more testing, but it looks pretty good right now.

Comment #19

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 14:56

Status	File	Size
new	glossary_phrase.jpg	54.93 KB
new	glossary_phrase_list.jpg	11.96 KB

The overlapping / phrase problem is still there (see screenshots).

I have a term "Site Documentation" and "site;" it flagged "site" but not "site documentation."

Comment #20

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 14:56

Status:

Needs review

» Needs work

Comment #21

mwrochna commented 27 January 2008 at 17:34

Status:

Needs work

» Needs review

Status	File	Size
new	glossary_filter4.patch	10.34 KB

A bug in parentheses (I think this should have raised a php notice...).

Comment #22

nancydru

she/her/hers

English

Boston

commented 27 January 2008 at 17:57

Looks good. It looks like there isn't much that is needed for the handbook pages other than documenting [no-glossary] and that phrases will be preferred over individual terms (I assume that is correct). That is that if "coal" and "mining" are found together and there is a term for "coal mining" that will be used, but if "coal tar" or "strip mining" are found then "coal" and "mining" will be flagged in those phrases.

BTW, and this is not part of the original request so you may refuse, would it be hard to add a setting to skip header tags from being flagged?

Comment #23

mwrochna commented 28 January 2008 at 17:11

Status	File	Size
new	glossary_filter5.patch	12.07 KB

I added a simple setting for custom blocking tags. I removed <acronym> from hard-coded tags (it might have helped earlier to avoid overlapping), and moved it to default custom tags (in case somebody uses them).

Comment #24

mwrochna commented 28 January 2008 at 18:15

Status	File	Size
new	glossary_documentation1.txt	1.2 KB

As for the documentation, I don't know in what form, where and how much I should write (should I ask for doc edting permissions?), but here's a try:
1. On page A. Installation and Settings (http://drupal.org/node/201760), after Settings>For each input format>Replace matches
2. On the main doc page (http://drupal.org/node/196880) I'd add a pararaph before Issues
3. On the 'Howto: Install glossary module' page (http://drupal.org/node/69583), next to Configuring>Replace Matches

Preview:
1.

Blocking tags - content of html tags listed here won't be filtered for matches. Glossary links will also never be included inside <a>,<code>, <pre>, and [no-glossary] [/no-glossary] tags.

Which terms match

When two different terms occur in one place, the longer one will be matched. This way, when you add definitions for 'code' and 'postal code', the phrase 'postal code' will show one definition, different than 'code' alone (of course 'machine code', if not defined, will still show only the definition of 'code'). This also applies when you change the match type (e.g. you can set 'postcode' and 'coder' to have different definitions, even though originally 'code' would match substrings).
If you don't want some text to be filtered for terms at all, enclose it with [no-glossary] [/no-glossary] tags. You can also choose not to inlude glossary links in specific html tags.

If you'd like some tags not to include any glossary links, list them in Blocking tags instead of always enclosing them with [no-glossary]. Separate names of html tags (without < or > characters) with spaces.

Comment #25

nancydru

she/her/hers

English

Boston

commented 28 January 2008 at 21:40

Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

Comment #26

nancydru

she/her/hers

English

Boston

commented 28 January 2008 at 21:41

Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

Comment #27

nancydru

she/her/hers

English

Boston

commented 28 January 2008 at 22:35

Status:

Needs review

» Reviewed & tested by the community

Comment #28

nancydru

she/her/hers

English

Boston

commented 29 January 2008 at 04:06

Status:

Reviewed & tested by the community

» Fixed

I have committed the code and made the doc changes that I can do with the DO problems going on.

Comment #29

mwrochna commented 29 January 2008 at 06:06

Thanks for testing it so thoroughly.

Format/general settings: Different formats have different tags, and though I can't think of any situation where the same tag would be used differently, it may affect performance a little (e.g. we don't want to check for all Full HTML blocking tags in every other format). But that's not a big benefit, and I can't think of a reason against, either (except some usability), so I just thought that all settings which can be format settings do so (global settings don't include settings which affect individual nodes, it seems they only change how glossary pages look).

Comment #30

nancydru

she/her/hers

English

Boston

commented 29 January 2008 at 12:49

Okay. I'll leave it like that.

Thanks for taking on this task.

Comment #31

mwrochna commented 1 February 2008 at 14:24

line 603 should be changed to

- $findtagfunc = $mb_prefix .'stripos';
+ $findtagfunc = $mb_prefix .'strpos';

because mb_stripos seems to be 20 (!) times slower than mb_strpos.

Comment #32

nancydru

she/her/hers

English

Boston

commented 1 February 2008 at 15:00

Okay, thanks. Does this require adding uppercase versions to the tags list? I will commit this with my next round of changes.

BTW, I was experimenting with "mbstring.func_overload" on php.ini and it causes a ">" on the end of ignored tags.

Comment #33

Anonymous (not verified) commented 15 February 2008 at 15:01

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.