Due to popular demand, I've documented how this can work.
Technically most of the functionality was there under the hood already, but now it's been tweaked to recognise CCK fields explicitly.

Import HTML - Import to CCK

The base functionality places found content into the $node->body
field. It does not put content into arbitrary CCK fields out of the
box, but this is also possible.

If you have a CCK node with (eg) fields:

field_text, field_byline, field_image
and your input pages are nice and semantically tagged, eg

<body>
  <h1 id='title'>the title</h1>
  <div id='image'><img src='this.gif'/></div>
  <h3 id='byline'>By me</h3>
  <div id='text'>the content html etc</div>
</body>

A mapping from HTML ids to CCK fields will be done automatically, and
the content should just fall into place.

  $node->title = "the title";
  $node->field_image = "<img src='this.gif'/>";
  $node->field_byline = "By me";
  $node->field_text = "the content html etc";

In fact, ANY element found in the source text with an ID or class
gets added to the $node object during import, although most
data found this way is immediately discarded again if the content type
doesn't know how to serialize it.
The special case demonstrated here prepends field_ to the label when
that matches a known CCK field name. Normally labels are used as-is.
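
To make the labelling rule concrete, here is a minimal sketch of it in Python. It is an illustration only, not the module's PHP code: the function names, the KNOWN_CCK_FIELDS and NATIVE_PROPERTIES sets, and the assumption that the input is already well-formed XHTML are all hypothetical.

```python
import xml.etree.ElementTree as ET

SOURCE = """<body>
  <h1 id='title'>the title</h1>
  <div id='image'><img src='this.gif'/></div>
  <h3 id='byline'>By me</h3>
  <div id='text'>the content html etc</div>
</body>"""

# Hypothetical: the CCK fields defined on the target content type.
KNOWN_CCK_FIELDS = {"field_text", "field_byline", "field_image"}
# Hypothetical: properties the node type can store natively.
NATIVE_PROPERTIES = {"title", "body"}


def inner_markup(element):
    """Return the element's leading text plus its serialized children."""
    parts = [element.text or ""]
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


def map_to_node(xhtml, known_fields=KNOWN_CCK_FIELDS):
    """Collect every element that has an id or class into a dict."""
    node = {}
    for element in ET.fromstring(xhtml).iter():
        labels = []
        if element.get("id"):
            labels.append(element.get("id"))
        labels.extend((element.get("class") or "").split())
        for label in labels:
            prefixed = "field_" + label
            # Special case: use the field_ name when it is a known CCK field.
            key = prefixed if prefixed in known_fields else label
            node[key] = inner_markup(element)
    return node


def keep_storable(node, known_fields=KNOWN_CCK_FIELDS,
                  native=NATIVE_PROPERTIES):
    """Drop anything the content type would not know how to save."""
    return {k: v for k, v in node.items()
            if k in known_fields or k in native}


if __name__ == "__main__":
    for key, value in keep_storable(map_to_node(SOURCE)).items():
        print(key, "=", value)
```

Running it on the sample page above yields title, field_image, field_byline and field_text. An element labelled with anything else (say a class of sidebar) is collected first and then dropped by keep_storable.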

If the source data is NOT tagged, you'll have to develop a bit of
custom XSL to produce the same effect.

customtemplate2simplehtml.xsl

... xsl preamble ...
  <xsl:template name="html_doc" match="/">
    <html>
      <body>
        ... other extractions ...
        <h3 id="byline">
          <xsl:value-of select="./descendant::xhtml:img[2]/@alt" />
        </h3>
      </body>
    </html>
  </xsl:template>

In this example, the byline we wanted to extract was the alt value of
the second image found in the page (a real-world example). This has now
been extracted and wrapped in an ID-ed h3 during an early phase of the
import process, and should now turn up in the CCK field_byline as
desired.
XSL is complex, but magic.
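
If you want to check which value a selector like that will pick before running an import, a quick sketch outside of XSL can help. The Python below is not part of Import HTML. It uses the standard library's ElementTree instead of an XSLT processor, the sample page is made up, and it assumes the xhtml prefix in the template is bound to the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the xsl preamble binds the xhtml prefix to this namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up sample page: the byline lives in the alt of the second image.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div><img src="logo.gif" alt="Site logo"/></div>
    <p><img src="author.gif" alt="By me"/> Some text</p>
  </body>
</html>"""


def second_image_alt(xhtml):
    """Return the alt text of the second img in document order, or None."""
    root = ET.fromstring(xhtml)
    images = root.findall(".//{%s}img" % XHTML_NS)
    return images[1].get("alt") if len(images) > 1 else None


def wrap_byline(xhtml):
    """Wrap the extracted value in an ID-ed h3, as the template does."""
    h3 = ET.Element("h3", {"id": "byline"})
    h3.text = second_image_alt(xhtml) or ""
    return ET.tostring(h3, encoding="unicode")


if __name__ == "__main__":
    print(wrap_byline(SAMPLE))
```

Note that descendant::xhtml:img[2] counts images across the whole document in document order, which is what the list index [1] mimics here.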

Comments

This is great! It would be nice if you could document this in the new CCK Handbook, probably at http://drupal.org/node/89493. You can add either a page or a comment to describe this.

Thanks!

very exciting

the possibilities here seem endless. :)

--
Cheers,
Thong

Tip: http://drupal.org/forum-posting
Website: http://www.edoodle.co.nz
Drupal for artists - demo at Website for artists

How do I specify which CCK node?

Hi, I'm new to Drupal so pardon me if this seems obvious. But don't I need to specify which CCK node type to create? For example, I created a CCK node type called "Import Corp" and another called "Press Releases". Let's say both have a title and field_keywords field. Do I need to modify the XSLT or module so that the import creates a "Press Releases" node instead of just a page node? I have a <div id="keywords">...</div> section in the HTML page that I want to import into Drupal.

I tried changing <node type="story"> in the XSLT to be <node type="Press Releases"> but the import HTML module still created ordinary page nodes with a title and body field.

I also tried modifying the import_html.module array variable $import_html_file_classes but that only caused the import to create a static HTML file in my files/imported directory.

Any guidance would be greatly appreciated. Or am I not understanding what this feature of Import HTML is supposed to do? Thanks!

Update 1/4/07: I clarified this post in the forums at http://drupal.org/node/106855

admin/settings/import_html

Specify the import target node type at admin/settings/import_html.
There's a drop-down listing the available types (page, story, etc).
Once you've set that, the behaviour described above starts to happen.

It looks from your other post like you are dealing with a 4.6 release version. The XSLT you should be basing your tweaks on is html2simplehtml.

IF you truly have a div id="keywords" and a field_keywords element for it to drop into, then the latest version of the module SHOULD just work as expected, yes.

.dan.
How to troubleshoot Drupal | http://www.coders.co.nz/