I don't know if you consider this fit for inclusion, but if not... maybe someone will benefit from the patch anyway.

I need to import HTML whose start/end tags are so badly (illegally) placed that HTML Tidy 'fixed' it in the wrong way. (^%$$%$ MS Frontpage)
So I made a custom script to fix that, and modify the HTML string before this module feeds it to HTML Tidy.

The following patch adds a facility (and config option) for running such a script. I called it a 'pre-tidy command'.
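
For readers who want a picture of what a 'pre-tidy command' step amounts to, here is a rough sketch in Python rather than the module's PHP. It assumes the command is a filter that reads HTML on stdin and writes the fixed HTML to stdout; the patch itself may pass the data differently. The names run_pre_tidy and DEMO_COMMAND are made up for illustration, and the demo command is only a stand-in for a real fixer script.

```python
import subprocess
import sys


def run_pre_tidy(html, command):
    """Pipe html through an external command; return it unchanged if none is set."""
    if not command:
        return html
    result = subprocess.run(
        command,
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command exits non-zero
    )
    return result.stdout


# Hypothetical stand-in for a real fixer script: it swaps one pair of
# close tags that were emitted in the wrong order.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().replace('</p></b>', '</b></p>'))",
]

fixed = run_pre_tidy("<p><b>x</p></b>", DEMO_COMMAND)
```

Leaving the command empty makes the step a no-op, so the option can stay unset for sites that don't need it.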

Comment  File                      Size     Author
#5       718794.patch              6.03 KB  roderik
         import_html_pretidy.diff  6 KB     roderik

Comments

roderik’s picture

Status: Active » Needs review
dman’s picture

Yeah, I've already found the need to put in one or two pre-tidy fix-ups, mostly to do with international characters, non-UTF8, and ... some other MS weirdness I can't recall.
It would be good to abstract this into a clearer config option. I was thinking a series of regular expressions to run, but I guess lower level code could be the thing also. Ah, Drupal hooks! They are MADE to be preprocessors.

I'll see how we can work this or something similar in.
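
The "series of regular expressions" idea could look roughly like the sketch below: an ordered list of (pattern, replacement) pairs applied one after another to the raw HTML. It is written in Python for illustration only, not as module code. The two rules (stripping MS Office `<o:p>` tags and conditional comments) are assumed examples of "MS weirdness", not rules taken from this module, and FIXUPS and apply_fixups are hypothetical names.

```python
import re

# Hypothetical fix-up rules, applied in order. Each entry is (pattern, replacement).
FIXUPS = [
    # Drop MS Office <o:p> and </o:p> tags, keeping whatever is between them.
    (re.compile(r"</?o:p\s*>", re.IGNORECASE), ""),
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->.
    (re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.IGNORECASE | re.DOTALL), ""),
]


def apply_fixups(html, fixups=FIXUPS):
    """Run every (pattern, replacement) pair over html, in list order."""
    for pattern, replacement in fixups:
        html = pattern.sub(replacement, html)
    return html


cleaned = apply_fixups("<p>a<o:p></o:p></p>")
```

Keeping the rules in a plain list means they could come from a config option, which is roughly what an abstracted setting would need.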

verta’s picture

I'm similarly coming to the conclusion that the pages I need to import are too broken for the parser or tidy to deal with. I'm going to look into a preprocessor as well.

roderik’s picture

Just FYI: if anyone cares why I wanted to write my own script (and not regexps or so): Python Script To Tidy Up Ugly MS Frontpage HTML

(Not much directly to do with the patch...)

roderik’s picture

Updated patch. tempnam($_ENV['TEMP'], "htm") seems to have stopped working on my server, for some reason I don't care to find out.
(I had taken this construct from tidy_functions.inc. Now replacing it with Drupal's temp dir.)
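
As an illustration of the temp-file change, here is a small Python sketch (not the patch's PHP) that writes the HTML to a uniquely named file inside an explicitly supplied directory instead of one looked up from the environment. One possible cause of the tempnam() failure, not confirmed anywhere in this thread, is that PHP only fills $_ENV when variables_order includes "E". The function name write_temp_html and the "htm" prefix reuse are assumptions for the example.

```python
import os
import tempfile


def write_temp_html(html, temp_dir):
    """Write html to a uniquely named file inside temp_dir and return its path."""
    # mkstemp creates the file atomically and returns an open file descriptor.
    fd, path = tempfile.mkstemp(prefix="htm", suffix=".html", dir=temp_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(html)
    return path
```

The caller is responsible for deleting the file once the external command has finished with it.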

@dman/#2: a hook sounds awesome, to enable both regexps and a command-line script like mine (which I didn't want to code in a PHP hook, though I might have done that if I had experience with a BeautifulSoup-like PHP library)

dman’s picture

Status: Needs review » Closed (fixed)

Clearing the old 6.x issues from the issue queue for a cleanup.
There is (now) a pre-parse hook that can operate on the raw HTML text, if anyone wants to use it. Some HTML was indeed just too broken for even HTML Tidy to eat. An example of this is in the 'extras' directory as a stand-alone module.