Closed (fixed)
Project:
Glossary
Version:
5.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Reporter:
Created:
10 Jan 2008 at 01:14 UTC
Updated:
15 Feb 2008 at 15:01 UTC
Jump to comment: Most recent file
The heart of the Glossary module is a Drupal filter (see 'Resources' below). There are three cases in which the current glossary filter doesn't work as desired. Adding a mechanism to address this will make this module much more useful.
It has been suggested that special case <SPAN> tags could be used. For example:
<span class="glossary-skip">a section of content that should not be scanned for terms</span> <span class="glossary-phrase">a phrase that should be used as a single entity</span> <code>code that should be ignored when scanning.</code>
Another suggestion has been to use some code inside comment tags (<!--xxx-->).
| Comment | File | Size | Author |
|---|---|---|---|
| #24 | glossary_documentation1.txt | 1.2 KB | mwrochna |
| #23 | glossary_filter5.patch | 12.07 KB | mwrochna |
| #21 | glossary_filter4.patch | 10.34 KB | mwrochna |
| #19 | glossary_phrase_list.jpg | 11.96 KB | nancydru |
| #19 | glossary_phrase.jpg | 54.93 KB | nancydru |
Comments
Comment #1
nancydruComment #2
nancydruComment #3
nancydruIf someone claims this, I have also discovered that it is inserting links inside IMG tags.
Comment #4
nancydruClaimed by mwrochna.
Comment #5
mwrochna commenteda) Personally for skipping I would use square brackets
[no-glossary]mismatches[/no-glossary], mainly because it's the shortest form I can think of. What do you think about it? Maybe I should do a quick research (ex. contact glossary contributors for opinions)? Also it's easier to implement, and probably faster (than searching for matching</span>s)b) about glossary-phrases - are overlaping matches the only problem (like in terms
C,C++)? If so, glossary terms can include spaces, so it would be enough to, for ex., put every match in 'invisible'<acronym>or [no-glossary] tags.c) I can't reproduce any match within img tags (
<img sth here />), they shouldn't match, because < and > chars are counted. Could you precise the input format and matching settings used?Comment #6
nancydrua) That looks fine to me; short is better. One reason for the original suggestions was that they could be left in the HTML output for later debugging, but I am not married to that idea. Please make sure it's in keeping with the recent Codefilter changes that are also using square brackets. http://drupal.org/node/38047 and http://drupal.org/node/210046
b) Additionally, http://drupal.org/node/82776 mentions wanting to do "coal mining" rather than "coal" and "mining."
c)
<img src="/files/docs/drupal_add_cvs.jpg">ended up with the "drupal" part being flagged because I have a "Drupal" term in my glossary. This caused a "page not found" every time the page displayed.Comment #7
nancydruComment #8
mwrochna commentedSorry for the delay, I ran into troubles trying to profile the code.
I almost entirely rewrote the __glossary_insertlink() function, now it should be faster and a little more 'powerful'.
It doesn't check backward for every match, so it's at least twice as fast on my test site with 20 words in the glossary.
It should use twice less memory for large texts, but it can be really memory-hungry for big glossaries (because it keeps all $ins_after parts in memory), I'll try to fix that (this will probably need moving the 'Find match candidates' part into _glossary_filter_process()).
[no-glossary]tags are implemented, escaping of code tags too.b) 'cole mining' was also a problem of duplicate matches overlapping, it's fixed now. We don't need a "glossary-phrase" tag, unless we'd like to leave support for overlapping matches too.
c) The only problem I can see with
<img>tags is when there are additional '>' chars (ex. used as 'less-than', not as html tags). Now if there is a '<' char, everything will be escaped until a '>' char. Do the<img>tags still not work? Maybe we should skip '<' chars that have whitespace or digits after it?d) I'm not sure if we need to escape tags that probably won't contain matches (like
[%) ? Another reason is that ex. the GeShi filter allows custom code tags, so code filters should process the text earlier anyway, and then we only need to check for html<code>and<pre>tags. On the other hand it doesn't slow the parsing down that much. So I left the other version commented out on line 539.Comment #9
mwrochna commentedComment #10
nancydruOuch! It looks like I need to stop working on other issues while you're doing this one. I made a big change for non-Latin characters and it's missing from your code. I hope this patch applies.
Comment #11
nancydruHunk #1 succeeded at 516 with fuzz 1 (offset 37 lines).
Hunk #2 FAILED at 549.
Is it possible to get the current 5.x-1.x-dev version and re-do the changes? I will postpone current issues until you are finished.
Comment #12
mwrochna commentedOh, sorry, my mistake.
I've glanced at the differences and I see no problem, I'll update a merged patch in a moment.
Comment #13
mwrochna commentedI just had a power outage and I still have some troubles with connection :)
I hope this patch works.
Comment #14
nancydruIt applied cleanly. I'll test it now.
Comment #15
nancydruOkay, on first pass, I see several terms that are not flagged at all. I also see a term that is matched on a substring even though my setting is "whole word." See the attached screenshots.
Comment #16
nancydruComment #17
mwrochna commentedOops, I forgot to rename a variable when merging, which caused the filter to act randomly.
Also there was one bug with multibyte matches and one when matching synonyms.
Comment #18
nancydruGreat, thanks. I'll give this a shot in a moment.
Those problems are definitely gone. The problem of inserting flags within an IMG tag seems to be solved. I don't see any CODE flagged at the moment (which is good). "[no-glossary]" is working.
I'll do some more testing, but it looks pretty good right now.
Comment #19
nancydruThe overlapping / phrase problem is still there (see screenshots).
I have a term "Site Documentation" and "site;" it flagged "site" but not "site documentation."
Comment #20
nancydruComment #21
mwrochna commentedA bug in parentheses (I think this should have raised a php notice...).
Comment #22
nancydruLooks good. It looks like there isn't much that is needed for the handbook pages other than documenting [no-glossary] and that phrases will be preferred over individual terms (I assume that is correct). That is that if "coal" and "mining" are found together and there is a term for "coal mining" that will be used, but if "coal tar" or "strip mining" are found then "coal" and "mining" will be flagged in those phrases.
BTW, and this is not part of the original request so you may refuse, would it be hard to add a setting to skip header tags from being flagged?
Comment #23
mwrochna commentedI added a simple setting for custom blocking tags. I removed
<acronym>from hard-coded tags (it might have helped earlier to avoid overlapping), and moved it to default custom tags (in case somebody uses them).Comment #24
mwrochna commentedAs for the documentation, I don't know in what form, where and how much I should write (should I ask for doc edting permissions?), but here's a try:
1. On page A. Installation and Settings (http://drupal.org/node/201760), after Settings>For each input format>Replace matches
2. On the main doc page (http://drupal.org/node/196880) I'd add a pararaph before Issues
3. On the 'Howto: Install glossary module' page (http://drupal.org/node/69583), next to Configuring>Replace Matches
Preview:
1.
<a>,<code>,<pre>, and[no-glossary] [/no-glossary]tags.2.
Which terms match
When two different terms occur in one place, the longer one will be matched. This way, when you add definitions for 'code' and 'postal code', the phrase 'postal code' will show one definition, different than 'code' alone (of course 'machine code', if not defined, will still show only the definition of 'code'). This also applies when you change the match type (e.g. you can set 'postcode' and 'coder' to have different definitions, even though originally 'code' would match substrings).
If you don't want some text to be filtered for terms at all, enclose it with
[no-glossary] [/no-glossary]tags. You can also choose not to inlude glossary links in specific html tags.3.
[no-glossary]. Separate names of html tags (without < or > characters) with spaces.Comment #25
nancydruLooks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.
Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.
Comment #26
nancydruLooks good so far - I still have a few more minutes of testing to do, then I'll apply the doc changes.
Question: Is there a reason why "blocking tags" should be part of the format as opposed to a general setting? I could argue this either way, I'm interested in your reasoning.
Comment #27
nancydruComment #28
nancydruI have committed the code and made the doc changes that I can do with the DO problems going on.
Comment #29
mwrochna commentedThanks for testing it so thoroughly.
Format/general settings: Different formats have different tags, and though I can't think of any situation where the same tag would be used differently, it may affect performance a little (e.g. we don't want to check for all Full HTML blocking tags in every other format). But that's not a big benefit, and I can't think of a reason against, either (except some usability), so I just thought that all settings which can be format settings do so (global settings don't include settings which affect individual nodes, it seems they only change how glossary pages look).
Comment #30
nancydruOkay. I'll leave it like that.
Thanks for taking on this task.
Comment #31
mwrochna commentedline 603 should be changed to
because mb_stripos seems to be 20 (!) times slower than mb_strpos.
Comment #32
nancydruOkay, thanks. Does this require adding uppercase versions to the tags list? I will commit this with my next round of changes.
BTW, I was experimenting with "mbstring.func_overload" on php.ini and it causes a ">" on the end of ignored tags.
Comment #33
Anonymous (not verified) commentedAutomatically closed -- issue fixed for two weeks with no activity.