As briefly discussed in passing a month or two ago, I've been intending to do this properly.

By the time I got CVS access last year I'd lost interest a bit, but due to a few requests I've re-activated this project and released a very hairy first go at the Import HTML module.

I'll copy the project overview below (sorry it's big) - please take note of the requirements.

I'd be receptive to any input, but it's deployment testing and major bug squashing at the moment - no 'feature requests' :-B I've got too many settings already.


Drupal Module: Import_HTML

Synopsis

Facility to import an existing, static HTML site structure into Drupal
Nodes.

This is done by allowing an admin to define a source directory containing a
traditional HTML website, then importing (as much as possible) its content and
structure into a Drupal site.

Files will be absorbed completely, and their existing cross-links should
be maintained, whilst the standard headers, chrome and navigation blocks
should be stripped and replaced with Drupal equivalents. Site structure will be
inferred and imported from the old folder hierarchy.

Requirements

Before you begin

See the setup section for details. Because of the
number of settings, this is not just a point-and-go module.

  • XML/XSLT support on the server. Check your phpinfo().
  • HTMLTidy - either the PHP extension or the command-line version.
  • Some understanding of XSL for advanced template translation.
  • Some libraries of my own (bundled) to actually do the XSLT.

Usage

This module uses no database tables of its own. It requires XML support on
the server, which can be tricky to set up if it's not already enabled.

Given a working system, the process is thus:

  1. Visit the admin/settings/import_html page and
    check the settings.
  2. If all values look OK for now, you can try a test run by visiting
    admin/import_html/demo . Choose a 'page' sort of page, not a portal or
    layout-rich sort of thing. The demo will scrape the given file and import
    it to the system. Some of the new navigation features will not be apparent
    yet, as they apply only to large-scale imports, or at least imports that
    have a defined siteroot.
  3. Try opening the 'admin/import_html' main page and defining
    a source folder. Enter the root path of the site you wish to import and
    continue. The UI should display a treeview of the
    files, from which you can select individual files to import.

    It's recommended to just try one page at a time to begin with.
  4. Upon importing a page, a new node should be created. The object of the
    import templates is to trim the content block down to its unique
    value. This will probably require some template tuning, so make a
    new template (copy the existing html2drupal.xsl), select it (enter the new
    name in the admin page), tweak the XSL (see
    http://www.dpawson.co.uk/xsl/sect2/sect21.html) and try
    again.

    If you are extremely lucky, or don't care too much about the extras, you
    can go straight to bulk import.
  5. If you need to check how the images are turning up, they can safely
    be imported as well using the previous interface. They will be copied,
    structured in the same folders they were in originally, into the directory
    configured in the admin/setting. Imported pages will have their links
    rewritten to find them there.

    Two types of content are imported, depending on file suffix: 'pages'
    (html), which become nodes ... and everything else, which becomes 'files'.
    Already it seems that file suffixes are not good enough -
    should the suffix list be editable, or should I scan the files
    themselves?
  6. When you are happy that the body field is as tidy as it's going to get
    (test several pages), you can try a bulk import. This may fill up your node
    collection a bit, so be prepared to delete them if things don't work
    perfectly first time. Many static sites have whole sections that are not
    structured the same as the rest of the pages.
  7. On input, a menu structure and a bunch of aliases will be
    auto-generated. These can be manually adjusted easily. For instance, the
    menu branches will initially be named after the document titles found in
    the directory structure. Which is great if you used a decent folder
    hierarchy, but some of the labels can probably be tidied up a bit. For that
    matter, after input, you can safely re-arrange the menu structure
    altogether, shifting whole sections to different places without worrying
    about links breaking. These changes will show through in the menu, sitemap
    and breadcrumbs but not in the path alias. There
    may be issues navigating to pages deep in a menu where the parent has not
    been imported or created yet. This is normal Drupal behaviour when making
    menu links to non-existent paths.

By following these instructions, you should be able to end up
with a version of the old content in the new layout. For large sites (200+
pages) some extra tuning may be necessary, e.g. using different templates for
different sources.

Incremental imports, processing just sections at a time, or repeated
imports as you tune the content or the transformation should be
non-destructive. Re-importing the same file will retain the same node ID
path, and any Drupal-specific additions made so far.

Intent / Theory

This is intended as a run-once sort of tool that, once tuned right on a
handful of pages, can churn through a large number of reasonably structured,
reasonably formatted pages, doing a lot of the boring copy & paste that
would otherwise be required.

The existing file paths of the source content will be used to create an
automatic menu, and therefore a hierarchical structure identical to the source
URLs. With the path module, appropriate aliases will also be created, such that
this will enable a Drupal instance to TRANSPARENTLY REPLACE an existing
static site without breaking any bookmarks!

Methodology Overview / Tasks

A peek under the hood into what happens in what order

  • Facility for spidering/enumerating existing source files. (the
    admin/import_html page)
  • Define import rules - choose an XSL stylesheet, set some parameters on
    it, configure presets for the imported
    pages.
  • Expose per-file selection of files to import (admin UI)
    [screenshot]
  • Import each source file through a sequence of steps:
    1. (Optional) download/copy of files to local
      mirror site.
    2. Processing with html-tidy, to prepare for XSL transforms
    3. URL-rewriting via XSL. All hrefs are redirected to the new
      pseudo-location aliases, all srcs are redirected to somewhere under
      /files.
    4. Content-scraping via XSL (XSL stylesheet will probably have to be
      customized to each source site)
    5. Or content-scraping via RegExps and heuristics
    6. Deduction (as much as possible) of meta-information like page
      title, author, date
    7. Validate nodes and save them with node-insert calls
    8. Extra API-insert calls to create menu navigation and path aliases
      (taxonomy leverage also?) Two types of alias: /path/filename
      and /path/filename.html can be created for now.
  • Pages are now first-class nodes, and can be administered through the
    CMS as usual.
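
As a hypothetical sketch of that sequence (in PHP, the module's own language): only file_get_contents() and node_save() are real calls here; the other function names are invented stand-ins for the module's internals, purely to name the steps.

```php
// Hypothetical per-file pipeline, mirroring the steps above.
// Only file_get_contents() and node_save() are real calls; the rest
// are invented names standing in for the module's internals.
function import_one_file($source_path) {
  $html  = file_get_contents($source_path);                    // 1. local copy
  $xhtml = tidy_to_xhtml($html);                               // 2. HTMLTidy cleanup
  $xhtml = xsl_transform($xhtml, 'rewrite_href_and_src.xsl');  // 3. URL rewriting
  $xml   = xsl_transform($xhtml, 'html2drupal.xsl');           // 4. content scraping
  $node  = deduce_meta_information($xml);                      // 6. title, author, date
  node_save($node);                                            // 7. create the node
  create_menu_and_aliases($node, $source_path);                // 8. menus + aliases
}
```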

Notes

The more valid and more homogeneous the source site is, the better. A
creation using strict XHTML and useful, semantic tags like #title #content or
something could be imported swiftly. One with a variety of table structures
may not...

Of course, this tool is supposed to be useful when dealing with messy,
non-homogeneous legacy sites that need a makeover. Sometimes regular
expression parsing may come to the rescue for content extraction, but that's
not implemented yet.

I'm choosing XSL because I know it, it's powerful for converting content
out of (well-structured) HTML, and I've had success with this approach in the
past. Others may object to this abstract technology (XSL does NOT have an easy
learning curve) but the alternative options include RegExp weirdness or cut
and paste (which I may patch on as alternative methods -
or someone else can have a go). Both approaches I've also used
successfully in bulk site templating (over THOUSANDS of pages) but it's my
call. Making your own XSL import template is
non-trivial.

Guide

Installation/setup

XML Support

The module can use either the PHP4 or PHP5 implementations (which are
quite different) but the PHP modules do have to be enabled somehow.
This can be tricky as they often require extra libraries to be put in your
path somewhere. Please don't ask me for instructions; every time I've done it
it hurts my head.

HTMLTidy Setup

The module also uses the famous HTMLTidy tool. There is now a PHP module
that implements HTMLTidy natively, but that needs to be installed and
enabled. If you don't have access to that, we can run it from the command
line. Find the appropriate binary release of HTMLTidy for your system, and
place it in your PATH, in the module's install directory, or wherever you
like, then define the path to the executable in the settings.
This works fine under Windows too.
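
If you take the command-line route, a quick sanity check from a shell (the file names here are just examples) might look like:

```shell
# Confirm tidy is on the PATH and executable
tidy -v
# Clean one legacy page into well-formed XHTML, ready for XSL processing
tidy -asxhtml -utf8 -quiet -o clean.html legacy.html
```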

Import Templates

An import template defines the mapping between existing HTML content and
our node values. It uses the XSL language because of the power it has to
select bits of a structured document. For example,
select="//*[@id='content']" will find the block anywhere
in the page, of any type, with the id 'content', and
select="//table[@class='main']//td[3]" will locate
the third TD cell in the table called 'main'. Both these examples would be
common when trying to extract the actual text from a legacy
site.
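
As a minimal sketch of how such expressions sit inside a template (the output element names here are assumed for illustration, not the module's real import schema):

```xsl
<!-- Sketch only: copy the page title and the 'content' block out of the
     source document. The <node>/<title>/<body> output names are assumed. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <node>
      <title><xsl:value-of select="//title"/></title>
      <body><xsl:copy-of select="//*[@id='content']"/></body>
    </node>
  </xsl:template>
</xsl:stylesheet>
```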

You can begin with the example XSL template. This contains code that
attempts to translate a page containing the usual HTML structures, like
(either title or h1) and (either the div called 'content' or the entire body
tag), into an import structure similar to that used by the
XML import module. This structure may change, but for now it should
look something like: [example]

It's likely that whatever site you are importing will NOT be shaped
exactly like we need it to translate straight using this format. You have to
identify the parts of your existing pages that can reliably be scanned for to
define content, then come up with an XPath expression
(http://www.w3.org/TR/xpath) to represent this.

If your source, for example, didn't use nice H1 tags to denote the page
title, but instead always looked like

<font size='+2'><B>my
  page</B></font>

... your template could be made to find
it, wherever it was in the page, with select="//font[@size='+2']/B"
and proceed to use that as the node title.
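
A hypothetical template-rule fragment along those lines (again, the <title> output name is an assumption for illustration):

```xsl
<!-- Sketch: derive the node title from the legacy
     <font size='+2'><B>...</B></font> pattern when no real H1 exists. -->
<xsl:template match="/">
  <title><xsl:value-of select="//font[@size='+2']/B"/></title>
</xsl:template>
```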

No, the code is not pretty, and if Regular Expressions are a foreign
language to you, this is worse.

But this is why developers have been ranting for the last ten years
about using semantic markup!!

The uniformity, and the usefulness of the metadata detected in the source
files will play a big part here.

It's easier to develop and test the XSLT using a third-party tool, I
recommend Cooktop. Be sure to set
the XSL engine to 'Sablotron' which is the one that PHP uses under the
hood.

Although it would be possible to configure a
logical mapping system to select different import templates based on
different content, at this stage the administrator is expected to be doing a
bit of hand-tweaking, and predicting all possible inputs is
impossible. Some of this sort of logic can,
however, be built into the powerful XSL template, if you are good at
XSL.

Once importing is taking place, you can even filter it more to improve the
structure of the input, for example by removing all redundant FONT tags, or
by ensuring that every H1,2,3 tag has an associated #ID for anchoring. Yay
XSL.
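
The FONT-stripping example can be sketched as a standard XSLT identity transform with one extra rule (a generic technique, not the module's bundled stylesheet):

```xsl
<!-- Sketch: identity transform that copies everything through unchanged,
     except FONT tags, which are dropped while their contents are kept. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template match="font">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```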

Settings

On the admin/settings/import_html screen, you can (if you wish):

  • Choose the import template. These templates
    translate between the existing page structure and the raw content blocks.
    Currently you enter the name of the xsl file directly.
    If I make this a select, some flexibility is lost
  • Customize a parameter used in the import template - the id of the real
    'content' block of the source documents. This could be
    extended into a wizard to work towards an all-purpose template, but that
    will probably never happen. Can't predict how broken the import sites may
    be.
  • When a site is imported, it must bring along some of its baggage.
    Images and suchlike. You can choose where they will end up.
  • When the imported site is given new URLs (reflecting the original path)
    we can publish the new nodes in a 'subsite' by applying a prefix directory
    to the aliases they are issued. The existing old links will be written to
    point to where the imported neighbouring pages are EXPECTED to end up.
    Incremental processing means they may not always be there until the whole
    site is done. Link checking (preferably on-the-fly) would be a nice tidy-up
    process.
  • The new URLs generated for the new pages are url_aliases based on the
    original paths. You can choose to have tidy (no suffix) or legacy (old .htm
    or whatever suffix) aliases - or both.
  • As the input is (a fragment of) pure HTML, the content filter (input
    format) must be set correctly. I chose to define a blank filter, which
    doesn't even add extra BRs, but you can override that if you wish.
    Should this option be hidden, or is it
    useful?
  • If using non-native HTMLTidy support, the path to the tidy executable
    should be defined. Security issue giving commandline
    access to a bin? Should move into server-side settings?

Notes on the Treeview Interface

Files and folders beginning with _ or . are nominally 'hidden', so they are
skipped and do not show up in this listing. While it's
possible to list a thousand or so files, it may be a good idea to allow the
listing to be more selective, to scale to larger sites.

Development / TODO

As mentioned in Usage, this module uses no database tables of its own.
Pages are read straight into 'page' nodes. I guess it could feed into
flexinode if your import files had extra parsable content blocks, and I've
successfully used it to import other random XML formats (RecipeML), although
the advantages of doing so are limited.

It's easy to imagine this system set up as a synchroniser that could
re-fetch and refresh local nodes when remote content changes. This would
involve recording exactly what the source URL was (which isn't currently
done) but would be a fun feature.

I may fork off the page-parsing into a pluggable method, so that a regexp
version can be developed alongside, and be used for folk without XSL
support.

How to leverage this to import a local site to a remote server? You must
either unpack the source files somewhere on that machine and provide the
absolute path where the server can find them, or upload a
zip package and I'll try to take it from there. (TODO)

Also TODO is a 'spidering' method to import sites by URL.
Way in the future!

TODO Allow settings to set import content type to something
other than 'page'

TODO Find a way to map more meta-data from the original page
(assuming there is any to be extracted) to Drupal properties, eg get the
contents of META keywords into Taxonomy associations

TODO There are issues when a page links directly, via an href, to a file
that would be regarded as a resource. Most hrefs are re-written
to point to the new node, but things like large images or Word docs get
imported under 'files'. The XSL rewrite_href_and_src.xsl attempts to correct
for this, but there may be some side-effects. Always run a link checker after
import.


Dan

Comments

styro’s picture

My own intentions for making progress in this area were sidetracked by other work and it (and other Drupal stuff) kinda slipped off my radar.

I do look forward to contributing to this module in the future though. I can foresee some of your TODO items (eg syncing, using more metadata etc) being useful for where we want to head as well. I like the general approach you've taken.

Once again, nice work. I look forward to trying it :)

--
Anton

creatorsdream’s picture

First off. I want to commend you for the documentation and the amount of work you've evidently put into this module. Kudos to you Dan! But...

I've done everything you say to do, and noting that I am not a developer, I have to say that I am at a dead end trying to get this static html converter to work. I've tried a VERY simple html document to process and here are the errors/warnings I get:

warning: domxml_open_mem(): Space required after the Public Identifier
in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): SystemLiteral " or ' expected
in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): SYSTEM or PUBLIC, the URI is missing in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Opening and ending tag mismatch: meta line 9 and head
in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Opening and ending tag mismatch: hr line 59 and body in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Opening and ending tag mismatch: hr line 38 and html in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag hr line 21 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag hr line 13 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag body line 11 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag meta line 5 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag head line 4 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
warning: domxml_open_mem(): Premature end of data in tag html line 3 in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 106.
user warning: Failed to parse in xml source. in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 107.
user warning: Still failed to parse contents of file :-( Give up. in /home/sites/site1/web/modules/import_html/coders_php_library/xml-transform.inc on line 135.

Then I get a blank form field and the following:

* Failed to initialize or parse XMLdoc input
* Failed to process page.

Geez! I've spent several days looking through Drupal and trying out some outdated modules that were attempting to bring in existing static html pages and NOTHING WORKS!

You know, I started out setting up Mambo as a cms for my web site and did find a very easy way to include existing html pages into the Mambo site, but stopped developing with it because I had problems elsewhere. Then I heard about Drupal and have been very impressed with what is going on with this cms, and especially with the community which supports it. But I have to say that some of the simplest tools, like bringing in existing static web sites and clean image editing/placement, are not easy to set up... making it very frustrating to continue with my efforts to set up my existing web site at http://www.artmetal.com to use Drupal.

I really have a desire to upgrade my community driven web site to this cms, but it's taking a toll on me. Can you help me get this module working? What kind of information do I need to give you to get some feedback on my errors?

quique

dman’s picture

OK, thanks for that.

You've got XML support, which is good.

But all those errors are coming from the parser, which indicates that it was not looking at valid XHTML

Opening and ending tag mismatch: meta line 9 and head

is quite a give-away.

NOW, the content SHOULD have been passed through htmltidy before we got this far - but either
it's not available,
not found,
or not working.

It's a BIT tricky to second-guess things on remote servers, but I successfully did just that last night.

So.
How confident are you that you have a working htmltidy executable?
And where is it?
I guess you've got php4, so the core extension is not possible.

At the top of coders_php_library/tidy-functions.inc you'll see

if (!defined("TIDY_PATH")) {
  define("TIDY_PATH", system_path("/bin/")); // Include trailing slash
}

.. you need to change that to what will work on your system (just the directory that tidy is in)

I left this out of UI configs while I considered whether it was really a good idea to let a user (even an admin user) choose an absolute path to an executable.

Thanks to your feedback (and my experiment yesterday) I could probably work up a little more documentation (yes, more) about how to make tidy work right.

If you still have troubles, send me your phpinfo().

.dan.

http://www.coders.co.nz/

creatorsdream’s picture

Dan,

I installed a precompiled version of tidy which included tab2space for Mac OSX. I placed them in the /bin/ directory and this seemed to work because I stopped getting the tidyhtml errors from your module. Is there a way to do a good test from the command line that will verify this works with your module?

quique

dman’s picture

Now in CVS (commandline update, it won't be bundled for 24 hours or something)
I have made the tidy functions much more paranoid.

Just visiting settings/import_html will
verify that the paths are correct,
try to fix them if not,
allow you to define it yourself,
and if all else fails, get the binary itself from sourceforge.

Once found, it will run tidy -v to make sure it's executable.

quique - simply not getting errors is a GOOD SIGN, but if you try the latest version you should see the proof by visiting settings.

If it's not found, try updating the bin path to where it really is.
OR
DON'T use the download install (it's currently hardcoded for linux) but copy the executable you found into the path ...modules/import_html/coders_php_library/bin/tidy

... it should get auto-detected from there!

.dan.

http://www.coders.co.nz/

creatorsdream’s picture

This is Great! I am now creating pages. You're dman!

I've been working with this for about a day now and have a couple of comments that might help you fine-tune this much-needed module. The first is to add the import_html/coders_php_library/bin directory even though you will not be supplying the tidy binary. It would also be great if there were a way to automatically add a title if one is not available. If the title is not available, the document is not imported.

I have also had to change my php.ini to give the script 10 minutes to run. I have a couple of thousand pages to process and I kept running out of time. So it may be useful to add a comment about that. One thing that would be REAL USEFUL is to have the ability to have a check box in the Import HTML Site to allow for a REPLACE or SKIP of documents that have already been processed. This would probably help speed up processing of directories that have some of the documents imported, but some not due to the title issue. It's been real difficult determining what has been imported and what hasn't.

All in all Dan, you've done a fabulous job! Thank you for giving me something I can work with to import several thousand static html pages from my old site. It's still going to take some energy, but it's something that has to be done to incorporate the old site into the new Drupal site. Thanks for helping me see the light at the end of the tunnel... ;-)

quique

dman’s picture

add the import_html/coders_php_library/bin directory even though you will not be supplying the tidy binary.

No reason why not. I'll stick an extra readme in there too I guess.

It would also be great if there were a way to automatically add a title if one is not available. If the title is not available, the document is not imported.

I hadn't got as far as figuring how to handle trivial errors during bulk imports.
Throwing up the edit form is of course useless.
What would you suggest for an 'automatic' title? some version of the file path?

Perhaps I need some sort of 'problem' queue to shunt them into, but I was quite pleased to do the job without yet another database table. I guess I could use the cache...

I have also had to change my php.ini to give the script 10 minutes to run. I have a couple of thousand pages to process and I kept running out of time. So it may be useful to add a comment about that.

Yeah, that was a foreseen issue. I was intending to take care of it in the UI, by chunking up the subsections, but (as anyone who's used a browser-based file manager knows) that can be clunky.

I've addressed this issue in several ways before (in other, similar projects)
The most robust I found (on a flaky IIS server that kept timing out randomly) was to copy all my import files into a 'pending' directory in one process, and then repeatedly search the pending directory for anything that needed importing, deleting each file when done. This action page would launch, process, and do a javascript refresh until the job was flagged as done.

As this module is obviously targeted at folk with lots of legacy content to deal with, I'd be interested in hearing suggestions.

One thing that would be REAL USEFUL is to have the ability to have a check box in the Import HTML Site to allow for a REPLACE or SKIP of documents that have already been processed.

Yeah, that makes sense. Pretty Easy.

This would probably help speed up processing of directories that have some of the documents imported, but some not due to the title issue. It's been real difficult determining what has been imported and what hasn't.

I want to work on the 'subsection' UI. Currently if you focus on a deeper directory, that confuses the base_path that the navigation is built from. I need to be able to say "Just look at this dir, but remember the site root is up there."
Straightforward, but needs to be done in several hairy places.

It's great that it's doing its job for you.
There's ALWAYS some hand-tuning to do, but I can't help you out with titles that don't exist.

Did you consider trying to extract them with pattern matching? Using XSL you can, for example, select the first thing you find that is FONT="+2" or whatever messed up pseudo title the pages used.

... So do I get a link to your site?

.dan.
http://www.coders.co.nz/
PS. I've copied these suggestions into the issues register, feel free to add more there.

creatorsdream’s picture

Did you consider trying to extract them with pattern matching? Using XSL you can, for example, select the first thing you find that is FONT="+2" or whatever messed up pseudo title the pages used.

... So do I get a link to your site?

PS. I've copied these suggestions into the issues register, feel free to add more there.

XSL is WAY over my head! I guess I'm going to have to edit each html document by hand unless you can implement the "NOTITLE" request I put in. 8-P I did add some suggestions to the issues register... it's my first on adding to issues. Thanks for pointing me to the right place!

And YES, I will add a link once the new Drupal site is up and running.

Thanks for your assistance, and for bringing another great module to the Drupal community!

quique

ntozier’s picture

I'm relatively new to drupal (having only been playing with it for about a week now). To learn it better I have been recreating my personal website. When I found your Import_HTML module I was quite happy. But I cannot seem to get it to work. I'm running php4 and drupal 4.7.0-beta4. I tried 4.6.5 but had some issues (upgrading to the beta fixed them). I have the tidy executable installed in /usr/bin/ and I have xml support enabled in my php setup.

The error I get when trying to go to admin->setting->import html
is Fatal error: Call to undefined function: form_textfield() in /home/httpd/sitename/modules/import_html/import_html.module on line 245

When I try to go to admin->import html I get a slightly different error of Fatal error: Call to undefined function: form_file() in /home/httpd/sitename/modules/import_html/import_html.module on line 221

Is this because of the version of Drupal I am using, or did I set something up wrong? It looks like the module does not take advantage of the updated form API. Any chance that you are working on porting it to 4.7? Thanks for your time.

---
ntoz

dman’s picture

You're right, it's 4.6 only so far.

It'll be 4.7 soon I guess, but not this week, sorry.
Anyone else that wants to have a go in the meantime is welcome - It'll just be the admin/settings screen that needs updating I guess.

.dan.

http://www.coders.co.nz/

petasques’s picture

Hi,

Newbie but enthusiast at Drupal. Oh God, I fell in love instantly! :)

There's a tidy shared module for PHP. They say it's tidy version 1.0, and maybe you know if it will work with this Import_HTML module.

http://pecl.php.net/tidy

Found at http://www.coggeshall.org/oss/tidy/ :

"Important note: Although the PECL repository claims that it is hosting a version of Tidy compatible with PHP 4.3 and higher, note that the version of Tidy bundled with PHP 5.0+ is not the same as the one found in the PECL repository. Tidy for PHP 5.0+ (extension version 2.0+) was completely rewritten to use the new PHP 5 architecture and supports many new features not found in the version of Tidy in the PECL repository for PHP 4.3+ (extension version 1.0+) "

Thank you very much, willing to develop for Drupal!

dman’s picture

I do test for the existence of the extension and use it if possible, but I was only aware of the PHP5 implementation.

Due to the troubles that lots of beginners here seem to have with their hosting (.htaccess doesn't show up in file listings, host won't give them GRANT on the tables or write access to the filesystem) and general other installation issues, I tried not to require that a rarely-used module be installed on their system.

I know using the commandline to get to it is patchy ... but it sure seems to work with less installation hassles (now I've even bundled an installer)

But does anyone have any suggestions for a commandline XSL processor that's just as straightforward? Requiring the PHP XSL extension (which I currently do) is unfortunate.

.dan.

http://www.coders.co.nz/

petasques’s picture

At first look i didn't notice that a Tidy installer was bundled, simply great.

Now we're off to find and install the XSLT extension, but in my ignorance about XSLT transformations, I wonder whether the DOM-XML extension that's already installed and enabled (--with-dom-xslt, says phpinfo) would be valid for this module's needs. This extension is 'more popular', as it is enabled by default in more PHP installations.

XSLT extension at PHP http://www.php.net/manual/en/ref.xslt.php

DOM-XML extension at PHP http://www.php.net/manual/en/ref.domxml.php

Thank you very much,

dman’s picture

At first look i didn't notice that a Tidy installer was bundled, simply great.

That was a later addition, I guess I didn't mention it in the docs.
Thanks for reminding me.

Now we're off to find and install the XSLT extension, but in my ignorance about XSLT transformations, I wonder whether the DOM-XML extension that's already installed and enabled (--with-dom-xslt, says phpinfo) would be valid for this module's needs. This extension is 'more popular', as it is enabled by default in more PHP installations.

Yes, the dom-xslt extension is what I use, if available (there are two flavours, one called dom/xsl and one called domxml/xslt, which work differently: PHP5 and PHP4 respectively).
If either is enabled, we are good to go!
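For illustration, here is a minimal transform under each flavour (the sample XML/XSL strings are placeholders of my own; the PHP 4 branch only runs where ext/xslt exists):

```php
<?php
// Transform an XML string with an XSL stylesheet, using whichever
// extension is available. The documents here are trivial samples.
$xml = '<?xml version="1.0"?><root><item>hello</item></root>';
$xsl = '<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/"><xsl:value-of select="//item"/></xsl:template>
</xsl:stylesheet>';

if (extension_loaded('xsl')) {
  // PHP 5 flavour: DOM documents plus an XSLTProcessor object.
  $xmldoc = new DOMDocument(); $xmldoc->loadXML($xml);
  $xsldoc = new DOMDocument(); $xsldoc->loadXML($xsl);
  $proc = new XSLTProcessor();
  $proc->importStylesheet($xsldoc);
  $result = $proc->transformToXML($xmldoc);
}
elseif (extension_loaded('xslt')) {
  // PHP 4 flavour: ext/xslt (Sablotron), argument-buffer style.
  $proc = xslt_create();
  $args = array('/_xml' => $xml, '/_xsl' => $xsl);
  $result = xslt_process($proc, 'arg:/_xml', 'arg:/_xsl', NULL, $args);
  xslt_free($proc);
}
else {
  $result = '';                 // Neither extension present.
}

echo trim($result);             // "hello" where either flavour works
```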

While it's true that most new installations do have it turned on these days, especially if someone has been experimenting with syndication, it's not so certain for the existing hosting environments that some people may be setting up on.
One of my local ISPs, for instance, just doesn't, and I had a hell of a time getting it going on Win32 (a hand-rolled setup, before XAMPP), so I got a bit paranoid about requirements.
As above, lots of users seem to operate without admin rights on their host.

Do I need to clarify this in the docs even more? I get the feeling folk would start to tune out if my readme got any bigger :-B

.dan.

petasques’s picture

Hi Dan,

I've modified the 'is the extension enabled?' checker to allow the script to execute with DOM-XML, because as it was the checker did not recognize that the extension was enabled, but then we got errors. Is here the appropriate place to report errors, or is there a better place to report errors to you, Dan?

thanks

dman’s picture

I didn't go back and test the PHP5 detection after a few of the more recent changes, one of them being a change to the detection script: giving up on dynamic extension loading (PHP dl()) altogether.
I decided that if your host doesn't have it in php.ini, it's not worth the bother of attempting to pull it in at runtime. Besides, it was triggering uncatchable errors.

ANYWAY.
The nice place to put fixes like this is over there, under the project issues.

I can guess what went wrong, but you're welcome to show me how it works on your system :0B

My original stuff was all installation-specific, trying to guess what environments it might come up against, so the more diverse the set-ups that try it out, the more we can learn.

.dan.

http://www.coders.co.nz/

dado’s picture

Only a different focus. Mine also uses Drupal 4.7, XML DOM, and Tidy (PHP5 version).

My module parses fields of data out of web pages that the admin points it at. The admin configures the parsing rules using XPath and PHP statements (with a syntax of my own).

The output of my module is currently CSV text, since I was planning on using the node_import module to pull the data into nodes.
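In spirit, that kind of admin-configured, XPath-driven extraction to CSV could be sketched like this (the page, the rules, and the field names are all invented for illustration, not taken from the scraper module):

```php
<?php
// Sketch of XPath-driven field extraction to a CSV row. The sample
// page and the rule set are made up; one XPath rule per column.
$html = '<html><body>
  <h1>Widget A</h1><span class="price">9.95</span>
</body></html>';

$rules = array(
  'title' => '//h1',
  'price' => '//span[@class="price"]',
);

$doc = new DOMDocument();
@$doc->loadHTML($html);          // @ silences warnings on sloppy HTML
$xpath = new DOMXPath($doc);

$row = array();
foreach ($rules as $field => $query) {
  $nodes = $xpath->query($query);
  // Empty string when a rule matches nothing on this page.
  $row[] = $nodes->length ? trim($nodes->item(0)->textContent) : '';
}
echo implode(',', $row) . "\n";  // Widget A,9.95
```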

However, node_import isn't moving toward 4.7 compatibility. I was contemplating doing the 4.7 conversion of node_import myself, but maybe your module offers a better way of pulling outside data into nodes (from XML, I presume). Perhaps there is an opportunity to spin off an importation API module? Or other ways to collaborate?

dado

dman’s picture

Node Import was not what I wanted. Mine evolved around the outside of import/export ... but there is now almost zero of the original code left, only the XML 'schema' that import/export sorta used. I'm ready to abandon that as well.

My thoughts are to use XHTML as the common touchstone format, using a method that's now being called 'microformats' but that I just called 'annotated XHTML'.

With XSL and XPath available, I think we can reduce all inputs down to unthemed, highly semantic chunks of content: not just CSV, but valid XHTML in the in-between transport/interpretation stage!

Sound good? All XHTML, all the time.
I'm keen on establishing a schema of sorts. This schema could become the baseline for all scrapers etc. (like your own?) to target, and I believe my import_html can take care of most of the structural issues. I'm intending to embed the original source/provenance as a tag in the data somewhere, but haven't yet.
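As a rough illustration of the 'annotated XHTML' idea (the class names and the chunk below are invented, not an agreed schema), a scraper could emit plain semantic XHTML and an importer could pull the fields back out with XPath:

```php
<?php
// Invented example of an 'annotated XHTML' transport chunk: plain,
// unthemed XHTML with class attributes carrying the semantics.
$chunk = '<div class="node" xmlns="http://www.w3.org/1999/xhtml">
  <h2 class="title">About Us</h2>
  <div class="body"><p>Some imported content.</p></div>
  <span class="source">http://example.com/about.html</span>
</div>';

$doc = new DOMDocument();
$doc->loadXML($chunk);
$xpath = new DOMXPath($doc);
// The default XHTML namespace must be registered under a prefix
// before XPath queries will match the elements.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

$title  = $xpath->query('//x:*[@class="title"]')->item(0)->textContent;
$source = $xpath->query('//x:*[@class="source"]')->item(0)->textContent;
echo "$title <- $source\n";
```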

Sound like some compatible aims?
I slowed down on that path because I didn't have faith in widespread XML support. A few months on, it looks a bit better.

.dan.

http://www.coders.co.nz/

dado’s picture

dman,

You are the lizard king. I am all for the xhtml standards you speak of. I think Ber has proposed slightly related things here.

I am not sure about the areas where our two modules might share resources or interoperate. I will have to explore yours in more depth. I will upload mine to the sandbox in < 24 hrs.

As a first step, do you think you have somewhat generic code for importing XHTML as nodes? Could you point me toward it, so I can look into altering my module's output to work with your input method? Do you support importing taxonomy info associated with nodes?
dado
http://schtickdisc.org

dado’s picture

In case anyone is interested: it's called "scraper". Pre-beta maturity, though it is working well for my ~8 test cases.
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/dado/scraper/

You could jump to the readme here:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/dado/scraper/...

BACKGROUND
I have worked extensively with three commercial products which allow an admin to set up rules for extracting data out of web pages. I applied those lessons to this module, keeping the syntax and methodology as simple as possible.

From the readme:

OVERVIEW
This module scrapes data from web pages and currently outputs the data in CSV format.
The hope is to permit this data to be imported into Drupal as nodes, although such
data could be used for any purpose, e.g. pulling any data of interest out of a static web page.

I hope we find the ability to collaborate or share resources between Import HTML and Scraper!
dado
http://schtickdisc.org

adoucette’s picture

Hi all,
I have just installed Drupal 4.7 and would like to get the Import_HTML module up and running so that I can import an HTML site and move it all over to Drupal.

I get this error when I open Import_HTML:
Fatal error: Call to undefined function: form_file() in /home/mysitename/public_html/modules/import_html/import_html.module on line 221

I get this error when I open Import_HTML's settings:
Fatal error: Call to undefined function: form_textfield() in /home/mysitename/public_html/modules/import_html/import_html.module on line 245

What is going wrong here?
Also, I just had the hosting company (I'm on shared hosting) install HTML_Tidy, but they did it as a PEAR module. Can I still use Import_HTML with Tidy as a PEAR module? How do I set the path to it?

Thanks,
Ariel

nevets’s picture

You might check with the author of the module to see if they have plans to update it for 4.7.
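For what it's worth, those fatals look like the Drupal 4.6-to-4.7 Forms API change (a guess from the function names in the errors, not from the module source): 4.6 modules called helpers like form_textfield() directly, and those helpers no longer exist in 4.7, which instead expects a declarative $form array. Roughly:

```php
<?php
// Drupal 4.6 style (these helpers are gone in 4.7, hence the
// "Call to undefined function: form_textfield()" fatal):
//   $output .= form_textfield(t('Source directory'), 'srcdir', $dir, 50, 255);
//
// Drupal 4.7 Forms API style: describe the element as an array and
// let the forms system render it. The field name here is made up.
$form['srcdir'] = array(
  '#type'          => 'textfield',
  '#title'         => 'Source directory',
  '#default_value' => '/var/www/oldsite',
  '#size'          => 50,
  '#maxlength'     => 255,
);
print_r($form['srcdir']);
```

So a module written against 4.6 needs its form-building code ported before it will load under 4.7 at all.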