Run command on HTML file before tidy & import [#718794]

I don't know if you consider this fit for inclusion, but if not... maybe someone will benefit from the patch anyway.

I need to import HTML whose start/end tags are so badly/illegally placed that HTML Tidy 'fixed' it in the wrong way. (^%$$%$ MS Frontpage)
So I made a custom script to fix that, and modify the HTML string before this module feeds it to HTML Tidy.

The following patch adds a facility (and config option) for running such a script. I called it a 'pre-tidy command'.
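
For anyone wondering what such a script could look like, here is a minimal sketch in Python. It assumes the pre-tidy command is handed the path of a temporary file holding the HTML and is expected to rewrite that file in place; the real calling convention is whatever the patch defines, so check it before reusing this. The names fix_html and fix_file are hypothetical, and stripping FrontPage "webbot" comments is only an illustrative fix, not necessarily what the script described above does.

```python
import re

# Illustrative fix only: drop FrontPage "webbot" comments such as
# <!--webbot bot="Timestamp" ... -->. A real script would add its own repairs.
WEBBOT_COMMENT = re.compile(r"<!--\s*webbot\b.*?-->", re.IGNORECASE | re.DOTALL)


def fix_html(html: str) -> str:
    """Return the HTML string with the illustrative repair applied."""
    return WEBBOT_COMMENT.sub("", html)


def fix_file(path: str) -> None:
    """Rewrite the file at `path` in place (assumed calling convention)."""
    # errors="replace" keeps the script from crashing on non-UTF-8 bytes;
    # undecodable bytes become U+FFFD, which may or may not be acceptable.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        html = fh.read()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(fix_html(html))
```

To use it as a command, one would call fix_file(sys.argv[1]) from an `if __name__ == "__main__":` guard and point the 'pre-tidy command' option at the script.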

Comment	File	Size	Author
#5	718794.patch	6.03 KB	roderik
	import_html_pretidy.diff	6 KB	roderik


Comments

Comment #1

roderik

Dutch

Amsterdam,NL / Budapest,HU

roderik commented 18 February 2010 at 22:29

Status: Active » Needs review

Comment #2

dman commented 19 February 2010 at 04:36

Yeah, I've already found the need to put in one or two pre-tidy fix-ups, mostly to do with international characters, non-UTF8, and ... some other MS weirdness I can't recall.
It would be good to abstract this into a clearer config option. I was thinking a series of regular expressions to run, but I guess lower level code could be the thing also. Ah, Drupal hooks! They are MADE to be preprocessors.

I'll see how we can work this or something similar in.
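
The "series of regular expressions" idea from this comment could be sketched roughly as below. This is Python rather than the module's PHP, and both the rule list and the name apply_rules are hypothetical, chosen only to show rules being applied in order to the raw HTML before it reaches HTML Tidy.

```python
import re

# Hypothetical rule list: (compiled pattern, replacement), applied in order.
RULES = [
    # Drop Office-namespace <o:p> / </o:p> tags.
    (re.compile(r"</?o:p\s*>", re.IGNORECASE), ""),
    # Collapse runs of two or more &nbsp; entities into one.
    (re.compile(r"(?:&nbsp;){2,}"), "&nbsp;"),
]


def apply_rules(html: str, rules=RULES) -> str:
    """Run every rule over the HTML string, in list order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html
```

Order matters here: each rule sees the output of the previous one, so a config option exposing such a list would need to preserve the order the user entered.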

Comment #3

verta commented 16 March 2010 at 16:10

I'm similarly coming to the conclusion that the pages I need to import are too broken for the parser or tidy to deal with. I'm going to look into a preprocessor as well.

Comment #4

roderik

roderik commented 27 May 2010 at 19:04

Just FYI: if anyone cares why I wanted to write my own script (and not regexps or so): Python Script To Tidy Up Ugly MS Frontpage HTML

(Not much directly to do with the patch...)

Comment #5

roderik

roderik commented 7 September 2010 at 13:30

File	Size
718794.patch	6.03 KB

Updated patch. tempnam($_ENV['TEMP'], "htm") seems to have stopped working on my server, for some reason I don't care to find out.
(I had taken this construct from tidy_functions.inc. Now replacing it with Drupal's temp dir.)

@dman/#2: a hook sounds awesome, to enable both regexps and a command-line script like mine (which I didn't want to code in a PHP hook, though I might have done that if I had experience with a BeautifulSoup-like PHP library)

Comment #6

dman commented 8 November 2012 at 10:34

Status: Needs review » Closed (fixed)

Clearing the old 6.x issues from the issue queue for a cleanup.
There is (now) a pre-parse hook that can operate on the raw HTML text, if anyone wants to use it. Some HTML was indeed just too broken for even htmltidy to eat it. An example of this is in the 'extras' directory as a stand-alone module.