GHOP #135 Add an escape mechanism to Glossary module

NancyDru - January 10, 2008 - 01:14
Project:Glossary
Version:5.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:mwrochna
Status:closed
Description

Add an escape mechanism to Glossary module

Task description

The heart of the Glossary module is a Drupal filter (see 'Resources' below). There are three cases in which the current glossary filter doesn't work as desired. Adding a mechanism to address this will make this module much more useful.

  1. occasionally, the user just wants to skip a section of content entirely;
  2. in some cases, we need to flag a phrase rather than single words ("coal mining" rather than "coal" and "mining");
  3. within certain already identified sections, such as a block of code, we need to not flag the terms as this can lead to erroneous copy-and-pasting.

Possible approaches

It has been suggested that special case <SPAN> tags could be used. For example:

<span class="glossary-skip">a section of content that should not be scanned for terms</span>
<span class="glossary-phrase">a phrase that should be used as a single entity</span>
<code>code that should be ignored when scanning.</code>

Another suggestion has been to use some code inside comment tags (<!--xxx-->).

Deliverables

  1. Before coding, a brief description of the mechanism chosen.
  2. A patch ready to be committed, created against the current "-dev" version that provides the desired functionality. This may include (but is not anticipated) changes to the README.txt distribution, if appropriate.
  3. Changes or additions ready to be committed (other than minor editorial corrections) to the Handbook pages.

Resources

Primary contact

Nancy Wichmann, direct email

#1

NancyDru - January 10, 2008 - 01:24
Title:GHOP# xx Add an escape mechanism to Glossary module» GHOP #133 Add an escape mechanism to Glossary module

#2

NancyDru - January 10, 2008 - 18:10
Title:GHOP #133 Add an escape mechanism to Glossary module» GHOP #135 Add an escape mechanism to Glossary module

#3

NancyDru - January 11, 2008 - 03:05

If someone claims this, I have also discovered that it is inserting links inside IMG tags.

#4

NancyDru - January 21, 2008 - 22:04

Claimed by mwrochna.

#5

mwrochna - January 22, 2008 - 18:46
Assigned to:Anonymous» mwrochna
Status:active» active (needs more info)

a) Personally for skipping I would use square brackets [no-glossary]mismatches[/no-glossary], mainly because it's the shortest form I can think of. What do you think about it? Maybe I should do a quick research (ex. contact glossary contributors for opinions)? Also it's easier to implement, and probably faster (than searching for matching </span>s)
b) about glossary-phrases - are overlaping matches the only problem (like in terms C,C++)? If so, glossary terms can include spaces, so it would be enough to, for ex., put every match in 'invisible' <acronym> or [no-glossary] tags.
c) I can't reproduce any match within img tags (<img sth here />), they shouldn't match, because < and > chars are counted. Could you precise the input format and matching settings used?

#6

NancyDru - January 22, 2008 - 22:20

a) That looks fine to me; short is better. One reason for the original suggestions was that they could be left in the HTML output for later debugging, but I am not married to that idea. Please make sure it's in keeping with the recent Codefilter changes that are also using square brackets. http://drupal.org/node/38047 and http://drupal.org/node/210046

b) Additionally, http://drupal.org/node/82776 mentions wanting to do "coal mining" rather than "coal" and "mining."

c) <img src="/files/docs/drupal_add_cvs.jpg"> ended up with the "drupal" part being flagged because I have a "Drupal" term in my glossary. This caused a "page not found" every time the page displayed.

#7

NancyDru - January 23, 2008 - 13:01
Status:active (needs more info)» active

#8

mwrochna - January 26, 2008 - 17:36

Sorry for the delay, I ran into troubles trying to profile the code.
I almost entirely rewrote the __glossary_insertlink() function, now it should be faster and a little more 'powerful'.
It doesn't check backward for every match, so it's at least twice as fast on my test site with 20 words in the glossary.
It should use twice less memory for large texts, but it can be really memory-hungry for big glossaries (because it keeps all $ins_after parts in memory), I'll try to fix that (this will probably need moving the 'Find match candidates' part into _glossary_filter_process()).

[no-glossary] tags are implemented, escaping of code tags too.
b) 'cole mining' was also a problem of duplicate matches overlapping, it's fixed now. We don't need a "glossary-phrase" tag, unless we'd like to leave support for overlapping matches too.
c) The only problem I can see with <img> tags is when there are additional '>' chars (ex. used as 'less-than', not as html tags). Now if there is a '<' char, everything will be escaped until a '>' char. Do the <img> tags still not work? Maybe we should skip '<' chars that have whitespace or digits after it?
d) I'm not sure if we need to escape tags that probably won't contain matches (like [% ) ? Another reason is that ex. the GeShi filter allows custom code tags, so code filters should process the text earlier anyway, and then we only need to check for html <code> and <pre> tags. On the other hand it doesn't slow the parsing down that much. So I left the other version commented out on line 539.

AttachmentSize
glossary_filter1.patch8.86 KB

#9

mwrochna - January 26, 2008 - 17:37
Status:active» patch (code needs review)

#10

NancyDru - January 26, 2008 - 19:00

Ouch! It looks like I need to stop working on other issues while you're doing this one. I made a big change for non-Latin characters and it's missing from your code. I hope this patch applies.

#11

NancyDru - January 26, 2008 - 19:08
Status:patch (code needs review)» patch (code needs work)

Hunk #1 succeeded at 516 with fuzz 1 (offset 37 lines).
Hunk #2 FAILED at 549.

Is it possible to get the current 5.x-1.x-dev version and re-do the changes? I will postpone current issues until you are finished.

#12

mwrochna - January 26, 2008 - 19:46

Oh, sorry, my mistake.
I've glanced at the differences and I see no problem, I'll update a merged patch in a moment.

#13

mwrochna - January 26, 2008 - 21:28
Status:patch (code needs work)» patch (code needs review)

I just had a power outage and I still have some troubles with connection :)
I hope this patch works.

AttachmentSize
glossary_filter2.patch8.2 KB

#14

NancyDru - January 27, 2008 - 00:14

It applied cleanly. I'll test it now.

#15

NancyDru - January 27, 2008 - 00:43

Okay, on first pass, I see several terms that are not flagged at all. I also see a term that is matched on a substring even though my setting is "whole word." See the attached screenshots.

AttachmentSize
glossary_prob1.jpg146.54 KB
glossary_list.jpg106.87 KB

#16

NancyDru - January 27, 2008 - 00:44
Status:patch (code needs review)» patch (code needs work)

#17

mwrochna - January 27, 2008 - 10:27
Status:patch (code needs work)» patch (code needs review)

Oops, I forgot to rename a variable when merging, which caused the filter to act randomly.
Also there was one bug with multibyte matches and one when matching synonyms.

AttachmentSize
glossary_filter3.patch10.34 KB

#18

NancyDru - January 27, 2008 - 14:36

Great, thanks. I'll give this a shot in a moment.

Those problems are definitely gone. The problem of inserting flags within an IMG tag seems to be solved. I don't see any CODE flagged at the moment (which is good). "[no-glossary]" is working.

I'll do some more testing, but it looks pretty good right now.

#19

NancyDru - January 27, 2008 - 14:56

The overlapping / phrase problem is still there (see screenshots).

I have a term "Site Documentation" and "site;" it flagged "site" but not "site documentation."

AttachmentSize
glossary_phrase.jpg54.93 KB
glossary_phrase_list.jpg11.96 KB

#20

NancyDru - January 27, 2008 - 14:56
Status:patch (code needs review)» patch (code needs work)

#21

mwrochna - January 27, 2008 - 17:34
Status:patch (code needs work)» patch (code needs review)

A bug in parentheses (I think this should have raised a php notice...).

AttachmentSize
glossary_filter4.patch10.34 KB

#22

NancyDru - January 27, 2008 - 17:57

Looks good. It looks like there isn't much that is needed for the handbook pages other than documenting [no-glossary] and that phrases will be preferred over individual terms (I assume that is correct). That is that if "coal" and "mining" are found together and there is a term for "coal mining" that will be used, but if "coal tar" or "strip mining" are found then "coal" and "mining" will be flagged in those phrases.

BTW, and this is not part of the original request so you may refuse, would it be hard to add a setting to skip header tags from being flagged?

#23

mwrochna - January 28, 2008 - 17:11

I added a simple setting for custom blocking tags. I removed <acronym> from hard-coded tags (it might have helped earlier to avoid overlapping), and moved it to default custom tags (in case somebody uses them).

AttachmentSize
glossary_filter5.patch12.07 KB

#24

mwrochna - January 28, 2008 - 18:15

As for the documentation, I don't know in what form, where and how much I should write (should I ask for doc edting permissions?), but here's a try:
1. On page A. Installation and Settings (http://drupal.org/node/201760), after Settings>For each input format>Replace matches
2. On the main doc page (http://drupal.org/node/196880) I'd add a pararaph before Issues
3. On the 'Howto: Install glossary module' page (http://drupal.org/node/69583), next to Configuring>Replace Matches

Preview:
1.

  • Blocking tags - content of html tags listed here won't be filtered for matches. Glossary links will also never be included inside <a>,<code>, <pre>, and [no-glossary] [/no-glossary] tags.
  • 2.

    Which terms match

    When two different terms occur in one place, the longer one will be matched. This way, when you add definitions for 'code' and 'postal code', the phrase 'postal code' will show one definition, different than 'code' alone (of course 'machine code', if not defined, will still show only the definition of 'code'). This also applies when you change the match type (e.g. you can set 'postcode' and 'coder' to have different definitions, even though originally 'code' would match substrings).
    If you don't want some text to be filtered for terms at all, enclose it with [no-glossary] [/no-glossary] tags. You can also choose not to inlude glossary links in specific html tags.

    3.

  • If you'd like some tags not to include any glossary links, list them in Blocking tags instead of always enclosing them with [no-glossary]. Separate names of html tags (without < or > characters) with spaces.
  • AttachmentSize
    glossary_documentation1.txt1.2 KB

    #25

    NancyDru - January 28, 2008 - 21:40

    Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

    Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

    #26

    NancyDru - January 28, 2008 - 21:41

    Looks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.

    Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.

    #27

    NancyDru - January 28, 2008 - 22:35
    Status:patch (code needs review)» patch (reviewed & tested by the community)

    #28

    NancyDru - January 29, 2008 - 04:06
    Status:patch (reviewed & tested by the community)» fixed

    I have committed the code and made the doc changes that I can do with the DO problems going on.

    #29

    mwrochna - January 29, 2008 - 06:06

    Thanks for testing it so thoroughly.

    Format/general settings: Different formats have different tags, and though I can't think of any situation where the same tag would be used differently, it may affect performance a little (e.g. we don't want to check for all Full HTML blocking tags in every other format). But that's not a big benefit, and I can't think of a reason against, either (except some usability), so I just thought that all settings which can be format settings do so (global settings don't include settings which affect individual nodes, it seems they only change how glossary pages look).

    #30

    NancyDru - January 29, 2008 - 12:49

    Okay. I'll leave it like that.

    Thanks for taking on this task.

    #31

    mwrochna - February 1, 2008 - 14:24

    line 603 should be changed to

    - $findtagfunc = $mb_prefix .'stripos';
    + $findtagfunc = $mb_prefix .'strpos';

    because mb_stripos seems to be 20 (!) times slower than mb_strpos.

    #32

    NancyDru - February 1, 2008 - 15:00

    Okay, thanks. Does this require adding uppercase versions to the tags list? I will commit this with my next round of changes.

    BTW, I was experimenting with "mbstring.func_overload" on php.ini and it causes a ">" on the end of ignored tags.

    #33

    Anonymous (not verified) - February 15, 2008 - 15:01
    Status:fixed» closed

    Automatically closed -- issue fixed for two weeks with no activity.

     
     

    Drupal is a registered trademark of Dries Buytaert.