Run command on HTML file before tidy & import

Project: Import HTML
Version: 6.x-1.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: needs review

Issue Summary

I don't know if you consider this fit for inclusion, but if not... maybe someone will benefit from the patch anyway.

I need to import HTML with start/end tags so badly (illegally) placed that HTML Tidy 'fixes' it in the wrong way. (^%$$%$ MS Frontpage)
So I wrote a custom script to fix that; it modifies the HTML string before this module feeds it to HTML Tidy.

The following patch adds a facility (and config option) for running such a script. I called it a 'pre-tidy command'.
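For illustration only, here is a minimal sketch of what such a pre-tidy script could look like. It is not the script from this issue or the code in the patch. The calling convention (HTML in on one stream, fixed HTML out on another), the names `pre_tidy` and `run`, and the specific misnesting pattern are all assumptions made for the example.

```python
import re

# Hypothetical fix: an inline tag opened outside a paragraph but closed
# inside it, e.g. <b><p>text</b></p>, which a tidier may "repair" wrongly.
MISNESTED = re.compile(r"<(b|i|u)>\s*<p>(.*?)</\1>\s*</p>", re.IGNORECASE | re.DOTALL)


def pre_tidy(html):
    """Move the inline tag inside the paragraph so the nesting is valid."""
    return MISNESTED.sub(r"<p><\1>\2</\1></p>", html)


def run(stream_in, stream_out):
    """Assumed convention: read raw HTML from one stream, write the fix to another."""
    stream_out.write(pre_tidy(stream_in.read()))
```

Used as a command-line filter, this would be called as `run(sys.stdin, sys.stdout)`; whether the module passes the HTML on stdin or as a file name depends on the patch.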

Attachment                  Size
import_html_pretidy.diff    6 KB

Comments

#1

Status: active » needs review

#2

Yeah, I've already found the need to put in one or two pre-tidy fix-ups, mostly to do with international characters, non-UTF8, and ... some other MS weirdness I can't recall.
It would be good to abstract this into a clearer config option. I was thinking of a series of regular expressions to run, but I guess lower-level code could be the thing also. Ah, Drupal hooks! They are MADE to be preprocessors.

I'll see how we can work this or something similar in.
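The "series of regular expressions" idea could look roughly like the sketch below. It is written in Python only to show the shape of it; the module itself is PHP, and the three patterns are hypothetical examples rather than fix-ups taken from the module.

```python
import re

# Hypothetical fix-ups, applied in order; the patterns a real site needs
# depend entirely on the pages being imported.
FIXUPS = [
    (re.compile(r"<\?xml[^>]*\?>"), ""),          # drop stray XML declarations
    (re.compile(r"</?o:p>", re.IGNORECASE), ""),  # drop MS Office <o:p> tags
    (re.compile("\u00a0"), "&nbsp;"),             # raw non-breaking space -> entity
]


def apply_fixups(html, fixups=FIXUPS):
    """Run each (pattern, replacement) pair over the HTML in sequence."""
    for pattern, replacement in fixups:
        html = pattern.sub(replacement, html)
    return html
```

Keeping the list as data means it could be filled from a config option, which is the abstraction suggested above.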

#3

I'm similarly coming to the conclusion that the pages I need to import are too broken for the parser or Tidy to deal with. I'm going to look into a preprocessor as well.

#4

Just FYI, in case anyone cares why I wanted to write my own script (rather than use regexps or the like): Python Script To Tidy Up Ugly MS Frontpage HTML

(Not much directly to do with the patch...)

#5

Updated patch. tempnam($_ENV['TEMP'], "htm") seems to have stopped working on my server, for some reason I don't care to find out.
(I had taken this construct from tidy_functions.inc. Now replacing it with Drupal's temp dir.)
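As a rough Python analogue of the temp-file approach (not the PHP in the patch): write the HTML to a temporary file in an explicitly chosen directory instead of relying on an environment variable, run the command on it, and read back the result. The function name `run_pretidy_command` and the convention that the command takes the file name as its last argument and prints the fixed HTML to stdout are assumptions for this sketch.

```python
import os
import subprocess
import tempfile


def run_pretidy_command(html, command, temp_dir=None):
    """Write html to a temp file, run `command <file>`, and return its stdout."""
    # temp_dir=None lets tempfile pick the system default; a site-configured
    # directory can be passed in instead of depending on an environment variable.
    fd, path = tempfile.mkstemp(prefix="htm", suffix=".html", dir=temp_dir)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html)
        result = subprocess.run(
            list(command) + [path],
            capture_output=True,
            encoding="utf-8",
            check=True,  # raise if the command exits with a non-zero status
        )
        return result.stdout
    finally:
        os.remove(path)  # always clean up the temp file
```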

@dman/#2: a hook sounds awesome, to enable both regexps and a command-line script like mine (which I didn't want to code in a PHP hook, though I might have done that if I had experience with a BeautifulSoup-like PHP library)

Attachment      Size
718794.patch    6.03 KB