Greedy regex mangles successive tags

iochroma - April 21, 2008 - 16:59
Project:Drupal Markup Engine
Version:5.x-1.0
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:active
Description

Two tags
Pre1<dme:x>foo</dme:x>Post1 Pre2<dme:x>bar</dme:x>Post2 , where dme:x will be replaced by §inner_content§ will result in Pre1§bar§Post2.

Obviously the two regexes for preparing and processing tags with close tags are too greedy.
In _dme_prepare_text I replaced
$regex_with_close_tag = "@<dme:(". implode('|', $tag_list) .')\s*((?:\s+\w+=([-+]?[0-9.]+|(["\']).+\4))*)\s*>(.*)</dme:\1>@isU';
by
$regex_with_close_tag = "@<dme:(". implode('|', $tag_list) .')\s*((?:\s+\w+=([-+]?[0-9.]+|(["\']).+\4))*)\s*>((?U).*)</dme:\1>@isU';
I'm not very familiar with PERL regexes but (.*?) did not work as I expected.

In _dme_process_text I did the same but to make it work fast I had to remove the nested tags logic (which I took no time to fully understand):
$regex = "@\[dme:(". implode('|', $tag_name_list) .')\s*((?:\s+\w+=([-+]?[0-9.]+|(["\']).+\4))*)\s*\]((?U).*)\[/dme:\1\]@isU';

I just took a very quick look into the code but I think that the solution with the tag nesting does not work anyway. A recursive approach would be better I think.

Regards,
Oliver

#1

jcfiala - April 25, 2008 - 16:17

Interesting... I did a fair number of simpletests with the various tag placement and the like - I'll have to add your example to my tests and see how it plays out. Thanks for letting me know about this!

#2

jcfiala - May 4, 2008 - 21:31

Hmm... I set up this test:

"The cute little fox jumped over the lazy darn dog."

And it worked, I'm getting

"The cute little fox whzcrq bire gur lazy qnea dog."

What version of PHP are you using? I wonder if the regex works differently on different versions. The U switch at the end makes it an ungreedy match in my tests. On the other hand, if this is a problem with various versions, changing it to match your suggestion may be better...

 
 

Drupal is a registered trademark of Dries Buytaert.