I don't know if you consider this fit for inclusion, but if not... maybe someone will benefit from the patch anyway.
I need to import HTML whose start/end tags are so badly (illegally) placed that HTML Tidy 'fixed' it in the wrong way. (^%$$%$ MS Frontpage)
So I wrote a custom script that fixes the HTML string before this module feeds it to HTML Tidy.
The following patch adds a facility (and config option) for running such a script. I called it a 'pre-tidy command'.
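For anyone who wants to picture what a 'pre-tidy command' step does, here is a minimal sketch in Python rather than the module's PHP. It assumes the external script reads HTML on stdin and writes the fixed HTML to stdout; the patch itself may hand the HTML over differently (for example through a temp file). The names `run_pre_tidy` and `command` are made up for illustration and are not part of the module.

```python
import subprocess
import sys
from typing import List, Optional


def run_pre_tidy(html: str, command: Optional[List[str]], timeout: float = 30.0) -> str:
    """Pipe `html` through an external fix-up command before tidying.

    Assumption: the command reads HTML on stdin and prints the fixed
    HTML on stdout. With no command configured, the HTML is returned
    unchanged, so the step stays optional.
    """
    if not command:
        return html
    result = subprocess.run(
        command,
        input=html,
        capture_output=True,
        text=True,
        encoding="utf-8",
        timeout=timeout,
        check=True,  # raise CalledProcessError if the script exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in for a real fix-up script: lowercases one tag name.
    demo_command = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().replace('<B>', '<b>'))",
    ]
    print(run_pre_tidy("<B>bold</b>", demo_command))
```

Failing loudly on a non-zero exit code is a design choice here: silently passing half-processed HTML on to HTML Tidy would be harder to debug than an aborted import.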
Comment | File | Size | Author |
---|---|---|---|
#5 | 718794.patch | 6.03 KB | roderik |
 | import_html_pretidy.diff | 6 KB | roderik |
Comments
Comment #1
roderik
Comment #2
dman
Yeah, I've already found the need to put in one or two pre-tidy fix-ups, mostly to do with international characters, non-UTF8, and ... some other MS weirdness I can't recall.
It would be good to abstract this into a clearer config option. I was thinking a series of regular expressions to run, but I guess lower level code could be the thing also. Ah, Drupal hooks! They are MADE to be preprocessors.
I'll see how we can work this or something similar in.
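The "series of regular expressions" idea could look roughly like the sketch below. It is written in Python for illustration only, not as module code, and the two rules are invented examples of the kind of MS-flavoured clean-up mentioned above; they are not the fix-ups the module actually applies.

```python
import re

# Hypothetical rule list: (compiled pattern, replacement) pairs, applied in order.
PRE_TIDY_RULES = [
    # Drop empty Office-namespace paragraph tags such as <o:p> and </o:p>.
    (re.compile(r"</?o:p>", re.IGNORECASE), ""),
    # Turn non-breaking space characters into plain spaces.
    (re.compile("\u00a0"), " "),
]


def apply_pre_tidy_rules(html, rules=PRE_TIDY_RULES):
    """Run each regex substitution over the HTML string, in list order."""
    for pattern, replacement in rules:
        html = pattern.sub(replacement, html)
    return html
```

Keeping the rules in a plain list means they could come from a config option, which is the "clearer config option" direction; anything that needs real parsing rather than substitution would still call for a script or a hook.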
Comment #3
verta
I'm similarly coming to the conclusion that the pages I need to import are too broken for the parser or tidy to deal with. I'm going to look into a preprocessor as well.
Comment #4
roderik
Just FYI: if anyone cares why I wanted to write my own script (and not regexps or so): Python Script To Tidy Up Ugly MS Frontpage HTML
(Not much directly to do with the patch...)
Comment #5
roderik
Updated patch.
tempnam($_ENV['TEMP'], "htm")
seems to have stopped working on my server, for some reason I don't care to find out. (I had taken this construct from tidy_functions.inc. Now replacing it with Drupal's temp dir.)
@dman/#2: a hook sounds awesome, to enable both regexps and a command-line script like mine (which I didn't want to code in a PHP hook, though I might have done that if I had experience with a BeautifulSoup-like PHP library)
Comment #6
dman
Clearing the old 6.x issues from the issue queue for a cleanup.
There is (now) a pre-parse hook that can operate on the raw HTML text, if anyone wants to use it. Some HTML was indeed just too broken for even htmltidy to eat it. An example of this is in the 'extras' directory as a stand-alone module.