geshifilter_filter should respond to $op=prepare [#181402]

I noticed that geshifilter_filter() doesn't respond to $op=prepare. This means that certain types of code, etc. can get clobbered by other filters, such as the HTML tags filter.

For example, if you had code like this:

<cpp>
 #include <stdio.h>
</cpp>

This would show up as:

#include

I took a look at codefilter.module and pretty much tried to copy what that module is doing to process the text so that it doesn't get messed up by another filter.

I've tested this patch on my site with several different types of input and it seems to work just fine. It might not be ready to go in yet, but I'd like people to be able to take a look at it and try it out on their own systems to see if there are any problems with it.

Comment	File	Size	Author
#7	preparation_3.patch	6.9 KB	soxofaan
#4	preparation_2.patch	6.83 KB	soxofaan
#2	preparation.patch	4.49 KB	soxofaan
	geshifilter_process_0.txt	4.15 KB	aclight

Comments

Comment #1

soxofaan commented 12 October 2007 at 08:04

Assigned:	aclight	» soxofaan
Priority:	Normal	» Critical

Comment #2

soxofaan commented 12 October 2007 at 23:41

Status:

Needs review

» Needs work

Status	File	Size
new	preparation.patch	4.49 KB

I worked a bit on the patch to simplify it and solve some remaining issues.
please be free to test it and report you findings.

It's not ready for committing, yet.
Thanks to this preparation a lot of the conflict detection is not needed anymore, I have to look what I can remove.

Comment #3

aclight commented 13 October 2007 at 03:36

I just ran a few quick tests with your patch and it seemed to work as well as my patch on my test page with highlighted code, but it does get closer than my patch does at fixing the problems linked to in the original post of this issue: http://drupal.org/node/70904

Specifically, your patch fixes #1, #2, and #4 on the test page at http://www.dvessel.com/node/93

Also, I'm a little worried about your use of 0xFA-0xFD. My understanding is that technically those are valid UTF-8 characters, though they were never assigned to represent anything and thus are very unlikely to be used. This statement is based on the 3rd table of the Description topic at http://en.wikipedia.org/wiki/UTF-8

Once you're done with the patch I'll try to review it more closely.

Comment #4

soxofaan commented 14 October 2007 at 12:34

Status:

Needs work

» Needs review

Status	File	Size
new	preparation_2.patch	6.83 KB

I'm also not very happy with the 0xFA-0xFF hack. I chose not to use 0xFE and 0xFF, because code filter uses them already.
I'm looking for a better solution, e.g. with legal utf8 encoding or base64 encoding of the whole code block, but I'm not sure yet.

For the moment I think it is safe enough to use 0xFA-0xFD however.

This new version of the patch also removes some unneeded conflict detection stuff: html tag filtering and conflicts with the line break filter.
(note that geshifilter.module got some other updates in the meantime)

Comment #5

aclight commented 14 October 2007 at 14:49

Status:

Needs review

» Needs work

I downloaded the newest 2.x-dev version and the patch applied, but with offsets. I guess you meant in your last comment that you created the patch before you made the other upgrades to the module? In any way, it applied successfully.

Hunk #2 succeeded at 685 (offset -4 lines).
Hunk #3 succeeded at 715 with fuzz 1 (offset -4 lines).
Hunk #4 succeeded at 839 (offset -46 lines).
Hunk #5 succeeded at 861 (offset -46 lines).
Hunk #6 succeeded at 916 (offset -46 lines).
Hunk #7 succeeded at 1013 (offset -46 lines).

I tested the patch and I had the same results as I indicated for the patch in comment #3.

As for the patch itself:

I still think it's best to use xFE and xFF. It's not just codefilter that uses these chars. The Drupal example filter module even suggests using these chars (see http://api.drupal.org/api/function/filter_example_filter/5). Given that the GeSHi filter module should only prepare sections of code that are enclosed by tags, where name is one of the languages that GeSHi has been configured to handle.
The only potential problem I see is that both codefilter and GeSHi filter might respond to <lang></lang> tags (where "lang" = "code"). In that case those sections of code would be prepared identically (assuming GeSHi filter moves to using xFE and xFF), and so whatever filter was set to come first would do the filtering. The user could get around this by using [lang][/lang] tags, which codefilter doesn't respond to.

Also, in my original patch I didn't change tags in square brackets (eg. [lang][/lang), but you've done that in your patches. In retrospect, I think you're right to also prepare those blocks.

So, to summarize, I think it's best to prepare blocks of text that will be processed by GeSHi filter in the following way:
```
<lang></lang> ==> xFElangxFFxFE/langxFF  and
[lang][/lang] ==>  xFElangxFFxFE/langxFF
```
If you wanted to make sure that only GeSHi filter touched blocks that were prepared by GeSHi filter, you could do this instead:
```
<lang></lang> ==> xFEGFlangxFFxFE/langxGFFF  and
[lang][/lang] ==>  xFEGFlangxFFxFE/langGFxFF
```
```
+function _geshifilter_prepare_callback($match) {
+  return "\xFA". $match[1] . $match[2] ."\xFB"
+    . str_replace(array("\r", "\n", '<', '>'), array('', '&#10;', "\xFC", "\xFD"), $match[3])
+    ."\xFA/". $match[1]. "\xFB";
```
In this block you're replacing any < and > chars within the block of code that will be filtered with xFC and xFD. I'm not sure why this is necessary or desirable. codefilter.module sends all text between the code tags through a function that replaces \r and \n (as the patch does above) but then sends the text through check_plain(), which then calls htmlentities() and encodes <, &, etc. in their safe form.
Likewise, in the process phase, the patch converts xFC to < and xFD to >:
```
+  // Undo linebreak and angle bracket escaping from preparation phase
+  $source_code = str_replace(array('&#10;', "\xFC", "\xFD"), array("\n", '<', '>'), $source_code);
+
```
Again, I don't think that's wise. For example, if there were a filter that processed after GeSHi filter, and if that filter did no escaping of text that it would process, it would potentially process the text returned by GeSHi filter because there are unescaped HTML entities left over.

Comment #6

soxofaan commented 15 October 2007 at 09:02

Concerning the patch+offset issue: the dev snapshots are only generated every 12 hours, so I guess you were too early and downloaded an outdated one ;). I wrongly assumed you were using CVS. Luckily the patch applied successfully.

Anyway, about your remarks:

You made a point that it shouldn't be a problem that other filters can use these bytes too. I'll change it back to 0xFE+0xFF.

The only potential problem I see is that both codefilter and GeSHi filter might respond to < lang >< /lang > tags (where "lang" = "code".

I could avoid this by flagging codefilter as a conflicting filter and raise warnings when codefilter and GeSHi filter occur in the same input format.
I did that because I had to invert the check_plain operation before the process phase. The GeSHi library needs the original source code with < and &, instead of < and &. But I could indeed do this with a check_plain() and decode_entities() pipeline.
I think this is not true, because the replacement is only done on the source code (the stuff between [code] and [/code]), the stuff outside the code blocks is left untouched.
This last point made me think of another problem however. With the GeSHi filter you have an option to do nothing with the source code by default (when there is no language attribute). This is only detected in the process phase, while it should be done in the preprocess phase, so that there would be no preparation and other filters (like html filter) could do their thing instead.

I'll work on a new patch

Comment #7

soxofaan commented 15 October 2007 at 09:58

Status	File	Size
new	preparation_3.patch	6.9 KB

here is the reworked patch

Comment #8

aclight commented 15 October 2007 at 13:13

Re:
#1: I think flagging it as conflicting is a good idea. Is it possible to only flag it as conflicting with codefilter if geshifilter is set up to handle <code></code> tags? And only if brackets are GESHIFILTER_BRACKETS_ANGLE or GESHIFILTER_BRACKETS_BOTH? I also didn't see anything in the patch in #7 that adds this warning. It wasn't clear to me if you had added that already or if you were just planning to add that.

#2: I understand. I forgot that GeSHi calls htmlentities() on its own, so I guess calling check_plain() isn't absolutely necessary, but it doesn't seem to hurt anything (now that you call decode_entities() in the process step.

#3:

This last point made me think of another problem however. With the GeSHi filter you have an option to do nothing with the source code by default (when there is no language attribute). This is only detected in the process phase, while it should be done in the preprocess phase, so that there would be no preparation and other filters (like html filter) could do their thing instead.

That's a good point. Does the new patch include detection of such a case? If so, where is it doing that? I must have missed it.

BTW, I tested the new patch and it still works as before. I also tested with highlighted code in a teaser and it was correctly highlighted (I hadn't tested this before).

I think this is getting close.

Comment #9

soxofaan commented 15 October 2007 at 14:02

at #8

Since we are now escaping both <code> and [code] to "\xFEcode\xFF", there will be a conflict in all cases. I would keep it that way. I think GeSHi filter and code filter should not occur in the same input format, even if they would have different triggers. GeSHi filter is meant as a more advanced code filter, which can replace it, and not just extend it. Also from a usability point of view I think it would be hard to educate your users that the slightly different tags <code> and [code] have a different behavior.
Anyway, the conflict detection is not in the patch from #7. I was planning to do it in a different patch, since it involves some extra code changes under the hood and I prefer small but functional patches.
Some escaping (with check_plain() now) is necessary. For example if you're inserting html or xml source code, you don't want that the tags in the source code to trigger other filters.
this is also not in the patch from #7. I'll make a separate issue and patch for this too.

BTW, I tested the new patch and it still works as before.

You mean this as a good thing, no?

If you have no further remarks on the patch from #7, I'll commit it.
(thanks for the fruitful discussion btw)

Comment #10

soxofaan commented 15 October 2007 at 14:03

Status:

Needs work

» Reviewed & tested by the community

Comment #11

aclight commented 15 October 2007 at 14:07

When I said the patch worked as before, that was a good thing. Sorry for not being clearer. Your responses in #8 all sound good to me. I have no further worries about the patch.

Thanks for working on this so much. This should really help geshifilter work more robustly.

Comment #12

soxofaan commented 15 October 2007 at 14:18

Status:

Reviewed & tested by the community

» Fixed

committed: http://drupal.org/cvs?commit=84565

Comment #13

soxofaan commented 16 October 2007 at 11:57

the remaining issues mentioned in #9 are also solved:
1) conflict detection with codefilter: http://drupal.org/cvs?commit=84644
3) not escaping in prepare phase when there won't be a real processing phase: http://drupal.org/cvs?commit=84649

Comment #14

(not verified) commented 30 October 2007 at 12:01

Status:

Fixed

» Closed (fixed)

geshifilter_filter should respond to $op=prepare

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

News items

Our community

Documentation

Drupal code base

Governance of community