Closed (fixed)
Project: Import HTML
Version: 6.x-1.x-dev
Component: Code
Priority: Normal
Category: Feature request
Assigned: Unassigned
Reporter:
Created: 18 Feb 2010 at 22:29 UTC
Updated: 8 Nov 2012 at 10:34 UTC
Comments
Comment #1
roderik
Comment #2
dman commented
Yeah, I've already found the need to put in one or two pre-tidy fix-ups - mostly to do with international characters, non-UTF8, and ... some other MS weirdness I can't recall.
It would be good to abstract this into a clearer config option. I was thinking of a series of regular expressions to run, but I guess lower-level code could be the thing also. Ah, Drupal hooks! They are MADE to be preprocessors.
I'll see how we can work this or something similar in.
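For later readers, the "series of regular expressions" idea from #2 could look roughly like the sketch below. It is written in Python (the language of the script linked in #4) rather than the module's PHP, and it is only an illustration: the rules themselves and the names FIXUPS, decode_html and pre_tidy are hypothetical and are not taken from the module or the patch.

```python
import re

# Hypothetical fix-up rules: (compiled pattern, replacement) pairs, applied in order.
FIXUPS = [
    # Drop Word/FrontPage conditional comments: <!--[if gte mso 9]>...<![endif]-->
    (re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE), ""),
    # Drop Office-namespaced tags such as <o:p> and </o:p>, keeping any text between them
    (re.compile(r"</?o:[a-z0-9]+[^>]*>", re.IGNORECASE), ""),
    # Turn non-breaking spaces into plain spaces
    (re.compile("\u00a0"), " "),
]


def decode_html(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Windows-1252 for legacy (non-UTF8) pages."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def pre_tidy(raw: bytes, fixups=FIXUPS) -> str:
    """Run every fix-up over the raw HTML before it is handed to tidy."""
    text = decode_html(raw)
    for pattern, replacement in fixups:
        text = pattern.sub(replacement, text)
    return text
```

Keeping the rules in a plain list is what would make this easy to expose as a config option: a site could add or remove pairs without touching the loop.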
Comment #3
verta commented
I'm similarly coming to the conclusion that the pages I need to import are too broken for the parser or tidy to deal with. I'm going to look into a preprocessor as well.
Comment #4
roderik
Just FYI: if anyone cares why I wanted to write my own script (and not regexps or so): Python Script To Tidy Up Ugly MS Frontpage HTML
(Not much directly to do with the patch...)
Comment #5
roderik
Updated patch.
tempnam($_ENV['TEMP'], "htm") seems to have stopped working on my server, for some reason I don't care to find out. (I had taken this construct from tidy_functions.inc. Now replacing it with Drupal's temp dir.)
@dman/#2: a hook sounds awesome, to enable both regexps and a command-line script like mine (which I didn't want to code in a PHP hook, though I might have done that if I had experience with a BeautifulSoup-like PHP library)
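The pattern described in #5 (write the raw HTML to a temporary file, run an external command-line script on it, read back the result) can be sketched as follows. This is a Python illustration of the general approach, not the PHP in the patch; run_preprocessor and UPPERCASE_COMMAND are hypothetical names, and the one-line command is only a stand-in for a real clean-up script.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for a real clean-up script (assumption): upper-cases the file it is given.
UPPERCASE_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(open(sys.argv[1], encoding='utf-8').read().upper())",
]


def run_preprocessor(raw_html: str, command: list) -> str:
    """Write raw_html to a temp file, run command on that file, return its stdout."""
    # tempfile.gettempdir() falls back to platform defaults when no TEMP-style
    # environment variable is set, so the code does not rely on one existing.
    fd, path = tempfile.mkstemp(suffix=".htm", dir=tempfile.gettempdir())
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(raw_html)
        # The temp file path is passed as the last argument; check=True raises
        # if the script exits with a non-zero status.
        result = subprocess.run(command + [path], capture_output=True, check=True)
        return result.stdout.decode("utf-8")
    finally:
        os.remove(path)
```

For example, run_preprocessor("<p>hi</p>", UPPERCASE_COMMAND) hands the file to the stand-in script and returns whatever that script printed.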
Comment #6
dman commented
Clearing the old 6.x issues from the issue queue for a cleanup.
There is (now) a pre-parse hook that can operate on the raw HTML text, if anyone wants to use it. Some HTML was indeed just too broken for even htmltidy to eat it. An example of this is in the 'extras' directory as a stand-alone module.