No-Markup Markdown mode and improved GeSHi support

milianw - February 11, 2008 - 00:32
Project:Markdown with SmartyPants
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:milianw
Status:needs work
Description

Hi there!

I've made some changed to marksmarty.module in order to get a decent input filter for comments. Since one really should not allow generic HTML tags there I've enabled marksmarty to support php-markdown-no-markup[^1]. This way no markup feature gets broken by using something crude like striptags(). A simple whitelist mechanism is in place.

Additionally I've added a GeSHi compliance mode. Since markup is no longer allowed with the above option activated, GeSHi needs to run after marksmarty. But I came across few problems:

  1. geshi prepares itself by replacing tags with a bracket-notation. markdown sees those as generic text and encodes entities etc.
  2. smartypants potentially replaces the quotes of the language attribute

All this is fixed with my patch. Additionally generic markdown code blocks get GeSHi featues as well (plain text mode).

I really hope this gets imported into trunk. Please give me a review. Especially the descriptions/labels for administration should be reviewed.

Thanks have a good night.

[^1]: see http://michelf.com/weblog/2007/php-markdown-no-markup/

AttachmentSize
no_markup_and_geshi.patch4.46 KB
no-markup_markdown.php_.txt1.01 KB

#1

milianw - February 11, 2008 - 01:04

I've made some changes to my patch and included a call to filter_xss_admin(). This should make this filter even more safe.

The XSS-check is only applied when markup inside markdown is disabled. Maybe this should be an additional option? Just tell me, I'll implement it.

AttachmentSize
no_markup_and_geshi.2.patch 4.48 KB

#2

moshe weitzman - February 11, 2008 - 21:48

why not rely on the core html filter to strip unwanted tags? filters are orderable, so it seems like you could put geshi before markdown, no?.

#3

milianw - February 11, 2008 - 21:54

Code HTML filter is not aware of Markdown syntax, like autolinks (<link here>) etc. Read the post by Michel Fortin, creator of PHP Markdown.

And yes, GeSHi comes after Markdown, which is a requirement as well, since I want Markdown code blocks to become geshi plaintext blocks:

    this is a markdown code block
    with my patch above this gets
    fancy linenumbers and stays consistent with
    other code blocks on my page

<blockcode>
this is no markdown, but looks just like the code block above
</blockcode>

So my changes seem to be the only way to do it.

#4

moshe weitzman - February 11, 2008 - 22:01

I did read that post. I don't see an answer to my question in "Code HTML filter is not aware of Markdown syntax, like autolinks () etc. " It seems to me like this mode would be useful if you dodn't have an HTML filter. But we do, and it is on by default.

I'm not too familiar with geshi. please submit that as a separate patch and explain again why an order of markdown and then geshi is insufficient. lets leave smartypants out of the discussion for now. i for one, never enable it and it really does not belong in our project, IMO.

#5

milianw - February 11, 2008 - 22:46

Take this input:

# Markdown

A paragraph with a link: <http://example.com>

> this is a blockquote

<h1>HTML</h1>

<p>A paragraph <a href="javascript:alert(foobar);">not wanted</a></p>

My patched marksmarty filter generates the following:

<h6>Markdown</h6>

<p>A paragraph with a link: <a href="http://example.com">http://example.com</a></p>

<blockquote>
  <p>this is a blockquote</p>
</blockquote>

<p>&lt;h1&gt;HTML&lt;/h1&gt;</p>

<p>&lt;p&gt;A paragraph &lt;a href=&#8221;javascript:alert(foobar);&#8221;&gt;not wanted&lt;/a&gt;&lt;/p&gt;</p>

This looks very good. I could whitelist p, h1 tags if I wanted to.

Now lets take a look at what HTML filter does:

HTML filter before marksmarty:

<p>Mixed content below:</p>

<h1>Markdown</h1>

<p>A paragraph with a link:</p>

<p>&gt; this is a blockquote</p>

<p>HTML</p>

<p>A paragraph <a href="alert(foobar);">not wanted</a></p>

Ok, nothing new here - it doesn't know Markdown syntax and messes things up.

marksmarty before HTML filter:

Mixed content below:

Markdown

A paragraph with a link: <a href="http://example.com">http://example.com</a>


  this is a blockquote


HTML

A paragraph <a href="alert(foobar);">not wanted</a>

What has happend here? I'd forgot to whitelist the tags I needed. But what tags are that? Which tags does markdown -possibly- generate? In my eyes this is _very_ tedious. Its marksmarties task to support the user. This is clearly not the case in this setup.

So I hold my position that HTML filter is not suitable here.

As to the GeSHi stuff, I'll try to supply a seperate patch. And while I'm at it a slimmed down patch for this thread as well.

Regarding smartypants: Please don't remove it! I love it and use it. It should be made a seperate module maybe, but that's about it.

EDIT: patch attached, I hope I haven't messed something up with while seperating this and the geshi issue.
When testing don't forget the additional markdown parser, also added.

AttachmentSize
no-markup.patch 2.22 KB
no-markup_markdown.php_.txt 1.01 KB

#6

moshe weitzman - February 20, 2008 - 17:33

I have asked the security team for input here. I am extremely reluctant to encourage users to disable the HTML filter.

#7

greggles - February 20, 2008 - 18:02

I'd forgot to whitelist the tags I needed. But what tags are that? Which tags does markdown -possibly- generate? In my eyes this is _very_ tedious. Its marksmarties task to support the user.

couldn't this be solved by better defaults and instructions in the marksmarty module?

Just add a list of those tags to the README.txt and I think this is solved. I agree with moshe, we shouldn't try to re-implement the HTML filter nor recommend that people use marksmarty without it.

#8

milianw - February 22, 2008 - 13:29

Here's a list of tags that can be created by markdown:

p, ul, ol, li, br, blockquote, code, pre, a, strong, b, em, i, img, h1, h2, h3, h4, h5, h6, hr

These are additionally supported by markdown extra:

table, th, td, tr, thead, tbody, tfoot, dl, dd, dt, div, sup, abbr

Maybe I forgot some, but I think this should be pretty much it. So adding this list to the README will fix it? People will remember to change it properly? And what about attributes, they wont be stripped by HTML filter I hope? I talk about align, id, title, href, alt, src.

What about doing something like that (assumes markdown only module, smartypants is neglected):

<?php
function _markdown_settings($format) {
 
$filters = filter_list_format($format);
 
$form = array();
  if (isset(
$filters['filter/0']) && $filters['markdown/0']->weight < $filters['filter/0']->weight) {
   
// htmlfilter is activated and ordered after markdown
   
$tags = 'p, ul, ol, li, br, blockquote, code, pre, a, strong, b, em, i, img, h1, h2, h3, h4, h5, h6, hr, table, th, td, tr, thead, tbody, tfoot, dl, dd, dt, div, sup, abbr';
   
$form['markdown'] = array(
     
'#type' => 'fieldset',
     
'#title' => 'Markdown',
     
'#collapsible' => false,
    );
   
$form['markdown']['htmlfilter'] = array(
     
'#type' => 'markup',
     
'#value' => '<p>'.t('To get full Markdown support, please allow these HTML tags in HTML filter:').'<br /><code>&#039;. $tags .&#039;</code></p>',
    );
  }
  return
$form;
}
?>

#9

milianw - February 22, 2008 - 13:40

Darn, I've spotted another issue. It's related to the GeSHi support:

Right now HTML Filter must come after Markdown, because of autolink and blockquote syntax. But GeSHi should come _after_ HTML Filter and _before_ Markdown! We'd have to add possibly even more tags to HTML Filter.

That makes me favouring the no-markup markdown approach again. Less configuration, same results.

#10

soxofaan - May 2, 2008 - 14:05
Status:needs review» needs work

Another option is to add a prepare filter phase to Markdown filtering.
The preparation step would escape things like > this is a blockquote and <http://example.com> from the example in #5
And in the final process phase, these things would be unescaped again before being markdowned.

I had to use this approach in GeSHi filter to avoid ugly filter conflicts with the core html filter and such. Conflict solving is now much simpler.

The preparation phase in the Drupal filter system is specifically designed for the sort of problems, so I think it should be used instead of reimplementing 'yet another HTML filter'.

Filtering is a two-step process. First, the content is 'prepared' by calling the 'prepare' operation for every filter. The purpose of 'prepare' is to escape HTML-like structures. For example, imagine a filter which allows the user to paste entire chunks of programming code without requiring manual escaping of special HTML characters like < or &. If the programming code were left untouched, then other filters could think it was HTML and change it.

(From http://api.drupal.org/api/function/hook_filter/5)

HTML Filter must come after Markdown, because of autolink and blockquote syntax. But GeSHi should come _after_ HTML Filter and _before_ Markdown!

With the appropriate preparation step in Markdown and appropriate HTML filter white list, you can eliminate the need that HTML must run before Markdown, which makes things easier.

Also note that it is possible to run HTML filter after GeSHi filter, but it requires tweaking of the white list to keep the syntax highlighting markup.

#11

moshe weitzman - May 2, 2008 - 15:04

Thanks saxofaan. If you have some time, I'd love a patch form you. Your comments make a lot of sense.

 
 

Drupal is a registered trademark of Dries Buytaert.