First non-English <h> title is not shown in TOC
dami - May 5, 2008 - 03:46
| Project: | Table of Contents |
| Version: | 6.x-3.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | AlexisWilke |
| Status: | closed |
Description
If the title of the first <h> tag is not in English, i.e. it doesn't contain any of the characters [A-z0-9 ], then it won't show up in the TOC. The reason is that the auto-generated anchor id is empty for this first tag. Subsequent tags do not have this problem.
Looking at the code, it seems that when the title of an <h> tag is empty, the filtered result is inconsistent. The first empty tag is ignored in the TOC, but every subsequent empty tag is assigned a CSS id and therefore shows up in the TOC.
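To illustrate the behaviour being asked for, here is a minimal sketch of an anchor generator that never returns an empty id. It is not the module's actual code: the function name `sketch_anchor_id`, the fallback id `toc`, and the `-N` suffix scheme are all assumptions made for this example.

```php
<?php
// Hypothetical helper, NOT the Table of Contents module's real code.
// $used is a caller-owned array that records the ids already handed out.
function sketch_anchor_id($title, &$used) {
    // Keep only ASCII letters and digits; every other byte is dropped,
    // so a title written entirely in non-Latin script becomes ''.
    $id = preg_replace('/[^A-Za-z0-9]/', '', $title);
    if ($id === '') {
        // Assumed fallback so the heading still gets an anchor,
        // whether it is the first heading or a later one.
        $id = 'toc';
    }
    // Make the id unique by appending a counter on collision.
    $base = $id;
    $n = 1;
    while (isset($used[$id])) {
        $id = $base . '-' . $n++;
    }
    $used[$id] = true;
    return $id;
}
```

With this approach the first and the later "empty" headings are treated the same way, which is the consistency the report is asking for.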

#1
This patch seems to fix the problem, but I am not sure whether it breaks anything else.
#2
The patch in #1 will actually change the existing anchor numbering. Here is a patch that only deals with <h> tags that do not have any valid anchor id characters (i.e. [A-z0-9 ]) in their title. With this patch, any truly empty <h> tags will also show in the TOC; if this is not wanted, we can add another line of code to deal with it.
#3
I'm trying to find information, but I'm not getting a clear answer: what are the valid characters for XHTML attribute values? If they are more than just A-z 0-9, we should allow them, rather than just assigning an arbitrary attribute. If not, then some function should convert the string into something valid and use that instead.
--Andrew
#4
Can someone determine if this is still an issue in the latest development version?
#5
I have upgraded my site to 6.x, so I couldn't test 5.x-dev.
But the problem still exists in both 6.x-2.2 and 6.x-2.x-dev.
#6
deviantintegral wrote:
On the page http://www.w3.org/TR/xhtml1/#C_8 we can read:
And in the section 6.2 on http://www.w3.org/TR/html4/types.html#h-6.2 :
Also, the ID's content is sanitized, but the sanitizing does not handle words with diacritics. For example, say we have this title in French:
<h2>Été</h2>
It will be changed to:
<h2 id="t">Été</h2>
and the URL will be:
/page#t
It should be:
/page#Ete
Diacritics should be replaced with their unaccented equivalents (e.g. é => e).
So I'm attaching a patch that accepts [A-Za-z][A-Za-z0-9:_.-]* and improves the rendering of sanitized diacritics.
Edit: I forgot to also attach the i18n-ascii.txt file (it must be placed in the module's directory).
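As a rough illustration of what such sanitizing could look like, the sketch below transliterates a few accented characters and then reduces the result to the pattern [A-Za-z][A-Za-z0-9:_.-]*. It is not the attached patch: the function name `sketch_sanitize_id` is made up, and the small inline `$map` array is only a stand-in for i18n-ascii.txt, whose real format is not shown in this thread.

```php
<?php
// Hypothetical sketch, NOT the attached patch.
function sketch_sanitize_id($title) {
    // Tiny stand-in for the i18n-ascii.txt mapping (assumed content).
    $map = array(
        'É' => 'E', 'é' => 'e', 'è' => 'e', 'ê' => 'e',
        'à' => 'a', 'ç' => 'c',
    );
    // strtr() with an array replaces each key by its value in one pass.
    $id = strtr($title, $map);
    // Drop every character outside the allowed set.
    $id = preg_replace('/[^A-Za-z0-9:_.-]/', '', $id);
    // An id must start with a letter, so strip any other leading characters.
    $id = preg_replace('/^[^A-Za-z]+/', '', $id);
    return $id;
}
```

For the example above, `sketch_sanitize_id('Été')` gives `Ete`, so the URL would be /page#Ete. A title with no mappable characters can still come out empty, so a fallback id is needed in addition to this step.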
#7
File i18n-ascii.txt attached.
#8
Thanks for the patch. Finding the info about the XHTML spec was very useful. I've updated it slightly with more comments, and included the i18n-ascii.txt file in the patch. I've tested against the 5.x version.
Do you see this as RTBC?
#9
#10
Some comments:
1) I think it would be more efficient to parse i18n-ascii.txt only once, so I've moved it outside the "foreach" loop. See the new patch.
2) I think that anchors would be more readable if spaces were replaced by hyphens. See this example:
Without hyphen: /page#CurrentDrupalcoreinitiatives
With hyphen: /page#Current-Drupal-core-initiatives
I've put a suggestion in the patch:
$anchor = preg_replace("/ +/", "-", $anchor);
3) I'm not sure about this one. What about doing the same thing with apostrophes? Example with "L'arrivée de l'été l'a lentement réchauffé":
Without apostrophe: /page#Larrivee-de-lete-la-lentement-rechauffe
With apostrophe: /page#L-arrivee-de-l-ete-l-a-lentement-rechauffe
Personally I replace apostrophes with hyphens in Pathauto, but the right handling of apostrophes probably varies from language to language.
The best solution would surely be a setting in the module's admin pages, as in Pathauto, but that means more code. What do you think?
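The two options from points 2) and 3) can be compared with a small sketch. The function name `sketch_hyphenate` and its `$apostrophe_as_hyphen` flag are hypothetical and only stand in for the kind of admin setting mentioned above; the input is assumed to be already transliterated.

```php
<?php
// Hypothetical sketch; the flag plays the role of a possible admin setting.
function sketch_hyphenate($text, $apostrophe_as_hyphen = false) {
    if ($apostrophe_as_hyphen) {
        // Treat the apostrophe as a word separator.
        $text = str_replace("'", '-', $text);
    } else {
        // Drop the apostrophe, joining the two parts.
        $text = str_replace("'", '', $text);
    }
    // Collapse each run of spaces into a single hyphen.
    return preg_replace('/ +/', '-', $text);
}
```

Called on "L'arrivee de l'ete l'a lentement rechauffe", the default gives the "without apostrophe" form and passing `true` gives the "with apostrophe" form shown above.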
#11
Hi deviantintegral,
I wonder if you have any comments about the proposed patch?
Thanks.
#12
Hi Gay Luron,
My comment would be that it makes it quite complicated to add more code to support apostrophes one way or another. Especially because it will make the module slower. More code means slower module...
I will look into that later, but it sounds like a good idea to put a dash by default since that would most certainly look okay in most languages.
Thank you.
Alexis Wilke
#13
Okay, the transliteration is installed in 3.x-dev. HOWEVER, with FCKeditor, a character such as é is transformed into &eacute;. This means the transliteration does NOT work. I guess we should add all the default HTML entities to the i18n-ascii.txt file? Please feel free to re-open this issue if you provide that fix.
Thank you.
Alexis Wilke
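One possible alternative to listing every HTML entity in the mapping file would be to decode entities back to characters before transliterating. The sketch below shows that idea only; it is not what 3.x-dev does, the function name `sketch_decode_then_transliterate` is hypothetical, and the inline `$map` is an assumed stand-in for i18n-ascii.txt.

```php
<?php
// Hypothetical sketch of "decode first, then transliterate".
function sketch_decode_then_transliterate($title) {
    // Turn named entities such as &eacute; back into UTF-8 characters.
    $text = html_entity_decode($title, ENT_QUOTES, 'UTF-8');
    // Tiny stand-in for the i18n-ascii.txt mapping (assumed content).
    $map = array('É' => 'E', 'é' => 'e', 'è' => 'e', 'à' => 'a');
    return strtr($text, $map);
}
```

With this ordering, a title stored by the editor as `&Eacute;t&eacute;` and one stored as `Été` both come out as `Ete`.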
#14
Hi Alexis,
I don't know if we should add the default HTML entities in the file, because the «problem» can be resolved (I guess, I didn't test) in FCKeditor with this configuration:
FCKConfig.ProcessHTMLEntities = false ;
See ProcessHTMLEntities.
#15
Okay, I guess that should be documented somewhere... Most people would not know about such a thing!
Thank you for the pointer
Alexis Wilke
Addition: Added info in the TOC front page for now.
#16
Automatically closed -- issue fixed for 2 weeks with no activity.