Add an escape mechanism to Glossary module

Task description

The heart of the Glossary module is a Drupal filter (see 'Resources' below). There are three cases in which the current glossary filter doesn't work as desired. Adding a mechanism to address this will make this module much more useful.

  1. occasionally, the user just wants to skip a section of content entirely;
  2. in some cases, we need to flag a phrase rather than single words ("coal mining" rather than "coal" and "mining");
  3. within certain already identified sections, such as a block of code, we need to not flag the terms as this can lead to erroneous copy-and-pasting.

Possible approaches

It has been suggested that special case <SPAN> tags could be used. For example:

<span class="glossary-skip">a section of content that should not be scanned for terms</span>
<span class="glossary-phrase">a phrase that should be used as a single entity</span>
<code>code that should be ignored when scanning.</code>

Another suggestion has been to use some code inside comment tags (<!--xxx-->).

Deliverables

  1. Before coding, a brief description of the mechanism chosen.
  2. A patch ready to be committed, created against the current "-dev" version that provides the desired functionality. This may include (but is not anticipated) changes to the README.txt distribution, if appropriate.
  3. Changes or additions ready to be committed (other than minor editorial corrections) to the Handbook pages.

Resources

Primary contact

Nancy Wichmann, direct email

Comments

nancydru’s picture

Title: GHOP# xx Add an escape mechanism to Glossary module » GHOP #133 Add an escape mechanism to Glossary module
nancydru’s picture

Title: GHOP #133 Add an escape mechanism to Glossary module » GHOP #135 Add an escape mechanism to Glossary module
nancydru’s picture

If someone claims this, I have also discovered that it is inserting links inside IMG tags.

nancydru’s picture

Claimed by mwrochna.

mwrochna’s picture

Assigned: Unassigned » mwrochna
Status: Active » Postponed (maintainer needs more info)

a) Personally for skipping I would use square brackets [no-glossary]mismatches[/no-glossary], mainly because it's the shortest form I can think of. What do you think about it? Maybe I should do a quick research (ex. contact glossary contributors for opinions)? Also it's easier to implement, and probably faster (than searching for matching </span>s)
b) about glossary-phrases - are overlaping matches the only problem (like in terms C,C++)? If so, glossary terms can include spaces, so it would be enough to, for ex., put every match in 'invisible' <acronym> or [no-glossary] tags.
c) I can't reproduce any match within img tags (<img sth here />), they shouldn't match, because < and > chars are counted. Could you precise the input format and matching settings used?

nancydru’s picture

a) That looks fine to me; short is better. One reason for the original suggestions was that they could be left in the HTML output for later debugging, but I am not married to that idea. Please make sure it's in keeping with the recent Codefilter changes that are also using square brackets. http://drupal.org/node/38047 and http://drupal.org/node/210046

b) Additionally, http://drupal.org/node/82776 mentions wanting to do "coal mining" rather than "coal" and "mining."

c) <img src="/files/docs/drupal_add_cvs.jpg"> ended up with the "drupal" part being flagged because I have a "Drupal" term in my glossary. This caused a "page not found" every time the page displayed.

nancydru’s picture

Status: Postponed (maintainer needs more info) » Active
mwrochna’s picture

StatusFileSize
new8.86 KB

Sorry for the delay, I ran into troubles trying to profile the code.
I almost entirely rewrote the __glossary_insertlink() function, now it should be faster and a little more 'powerful'.
It doesn't check backward for every match, so it's at least twice as fast on my test site with 20 words in the glossary.
It should use twice less memory for large texts, but it can be really memory-hungry for big glossaries (because it keeps all $ins_after parts in memory), I'll try to fix that (this will probably need moving the 'Find match candidates' part into _glossary_filter_process()).

[no-glossary] tags are implemented, escaping of code tags too.
b) 'cole mining' was also a problem of duplicate matches overlapping, it's fixed now. We don't need a "glossary-phrase" tag, unless we'd like to leave support for overlapping matches too.
c) The only problem I can see with <img> tags is when there are additional '>' chars (ex. used as 'less-than', not as html tags). Now if there is a '<' char, everything will be escaped until a '>' char. Do the <img> tags still not work? Maybe we should skip '<' chars that have whitespace or digits after it?
d) I'm not sure if we need to escape tags that probably won't contain matches (like [% ) ? Another reason is that ex. the GeShi filter allows custom code tags, so code filters should process the text earlier anyway, and then we only need to check for html <code> and <pre> tags. On the other hand it doesn't slow the parsing down that much. So I left the other version commented out on line 539.

mwrochna’s picture

Status: Active » Needs review
nancydru’s picture

Ouch! It looks like I need to stop working on other issues while you're doing this one. I made a big change for non-Latin characters and it's missing from your code. I hope this patch applies.

nancydru’s picture

Status: Needs review » Needs work

Hunk #1 succeeded at 516 with fuzz 1 (offset 37 lines).
Hunk #2 FAILED at 549.

Is it possible to get the current 5.x-1.x-dev version and re-do the changes? I will postpone current issues until you are finished.

mwrochna’s picture

Oh, sorry, my mistake.
I've glanced at the differences and I see no problem, I'll update a merged patch in a moment.

mwrochna’s picture

Status: Needs work » Needs review
StatusFileSize
new8.2 KB

I just had a power outage and I still have some troubles with connection :)
I hope this patch works.

nancydru’s picture

It applied cleanly. I'll test it now.

nancydru’s picture

StatusFileSize
new146.54 KB
new106.87 KB

Okay, on first pass, I see several terms that are not flagged at all. I also see a term that is matched on a substring even though my setting is "whole word." See the attached screenshots.

nancydru’s picture

Status: Needs review » Needs work
mwrochna’s picture

Status: Needs work » Needs review
StatusFileSize
new10.34 KB

Oops, I forgot to rename a variable when merging, which caused the filter to act randomly.
Also there was one bug with multibyte matches and one when matching synonyms.

nancydru’s picture

Great, thanks. I'll give this a shot in a moment.

Those problems are definitely gone. The problem of inserting flags within an IMG tag seems to be solved. I don't see any CODE flagged at the moment (which is good). "[no-glossary]" is working.

I'll do some more testing, but it looks pretty good right now.

nancydru’s picture

StatusFileSize
new54.93 KB
new11.96 KB

The overlapping / phrase problem is still there (see screenshots).

I have a term "Site Documentation" and "site;" it flagged "site" but not "site documentation."

nancydru’s picture

Status: Needs review » Needs work
mwrochna’s picture

Status: Needs work » Needs review
StatusFileSize
new10.34 KB

A bug in parentheses (I think this should have raised a php notice...).

nancydru’s picture

Looks good. It looks like there isn't much that is needed for the handbook pages other than documenting [no-glossary] and that phrases will be preferred over individual terms (I assume that is correct). That is that if "coal" and "mining" are found together and there is a term for "coal mining" that will be used, but if "coal tar" or "strip mining" are found then "coal" and "mining" will be flagged in those phrases.

BTW, and this is not part of the original request so you may refuse, would it be hard to add a setting to skip header tags from being flagged?

mwrochna’s picture

StatusFileSize
new12.07 KB

I added a simple setting for custom blocking tags. I removed <acronym> from hard-coded tags (it might have helped earlier to avoid overlapping), and moved it to default custom tags (in case somebody uses them).

mwrochna’s picture

StatusFileSize
new1.2 KB

As for the documentation, I don't know in what form, where and how much I should write (should I ask for doc edting permissions?), but here's a try:
1. On page A. Installation and Settings (http://drupal.org/node/201760), after Settings>For each input format>Replace matches
2. On the main doc page (http://drupal.org/node/196880) I'd add a pararaph before Issues
3. On the 'Howto: Install glossary module' page (http://drupal.org/node/69583), next to Configuring>Replace Matches

Preview:
1.

  • Blocking tags - content of html tags listed here won't be filtered for matches. Glossary links will also never be included inside <a>,<code>, <pre>, and [no-glossary] [/no-glossary] tags.
  • 2.

    Which terms match

    When two different terms occur in one place, the longer one will be matched. This way, when you add definitions for 'code' and 'postal code', the phrase 'postal code' will show one definition, different than 'code' alone (of course 'machine code', if not defined, will still show only the definition of 'code'). This also applies when you change the match type (e.g. you can set 'postcode' and 'coder' to have different definitions, even though originally 'code' would match substrings).
    If you don't want some text to be filtered for terms at all, enclose it with [no-glossary] [/no-glossary] tags. You can also choose not to inlude glossary links in specific html tags.

    3.

  • If you'd like some tags not to include any glossary links, list them in Blocking tags instead of always enclosing them with [no-glossary]. Separate names of html tags (without < or > characters) with spaces.
  • nancydru’s picture

    Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

    Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

    nancydru’s picture

    Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

    Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

    nancydru’s picture

    Status: Needs review » Reviewed & tested by the community
    nancydru’s picture

    Status: Reviewed & tested by the community » Fixed

    I have committed the code and made the doc changes that I can do with the DO problems going on.

    mwrochna’s picture

    Thanks for testing it so thoroughly.

    Format/general settings: Different formats have different tags, and though I can't think of any situation where the same tag would be used differently, it may affect performance a little (e.g. we don't want to check for all Full HTML blocking tags in every other format). But that's not a big benefit, and I can't think of a reason against, either (except some usability), so I just thought that all settings which can be format settings do so (global settings don't include settings which affect individual nodes, it seems they only change how glossary pages look).

    nancydru’s picture

    Okay. I'll leave it like that.

    Thanks for taking on this task.

    mwrochna’s picture

    line 603 should be changed to

    - $findtagfunc = $mb_prefix .'stripos';
    + $findtagfunc = $mb_prefix .'strpos';
    

    because mb_stripos seems to be 20 (!) times slower than mb_strpos.

    nancydru’s picture

    Okay, thanks. Does this require adding uppercase versions to the tags list? I will commit this with my next round of changes.

    BTW, I was experimenting with "mbstring.func_overload" on php.ini and it causes a ">" on the end of ignored tags.

    Anonymous’s picture

    Status: Fixed » Closed (fixed)

    Automatically closed -- issue fixed for two weeks with no activity.