problems with multiple nodes from one HTML file [#139770]

I'm using the import template / filter you attached in this comment to test import.

I want to get several reviews from each import page. The reviews have markup indicating author, publisher, review, reviewer, suppliers, etc.

I'm seeing several problems:
1. The "body" field gets set at the first review and remains the same throughout the import process. (Actually, my CCK node type does not even have a "body" field anyway.)

2. Similarly, the "teaser" field is set (based on the "body") at the first occurrence and remains the same for all the reviews.

3. Several of the fields (author and reviewer, for example) disappear and don't get imported. They disappear between the messages "Parsed in again. XHTML (XML)" and "TRANSLATED from messy source into a pure xhtml page to import". Images disappear here, too.

4. I get multiple warnings like this (can be ignored?)
warning: domxml_open_mem() [function.domxml-open-mem]: ID pagetitle already defined in /var/www/html/sites/all/modules/import_html/coders_php_library/xml-transform.inc on line 169.

5. Even when I confirm at the bottom of the demo run, it does not become an imported node.

Comments

Comment #1

leeksoup commented 28 April 2007 at 03:40

OK, looking through the import template, I think I see what is happening to the missing fields, i.e. item #3 above. I will try editing / customizing it some more to fix that.

Not sure what to do with images, but I'll give it a shot.

Comment #2

dman commented 28 April 2007 at 03:51

The sample was to roughly illustrate how to select and populate multiple nodes from one source file. This feature was built as a proof-of-concept, but never actually tested on multi-page imports.
As with any custom import, you have to tune the actual field-matching patterns yourself.

1. The 'body' should certainly be getting reset and not repeating. This may be a scoping problem inside the loop, or possibly in the XSL. I've seen some problems with the XPath also, so it may be there :(
To make it match into your CCK definition, which I have no idea about, change the ID it's being given in the XSL file. You'll have to at least stick some dummy text into the body for the import to proceed however.

3. That's left as an exercise for the reader. I did one sample in the XSL.

Images are disappearing because I used the simpler xsl:value-of (which returns stripped text) instead of apply-templates, which has side-effects. Sorry, but XSL is tricky.

4. The warnings are from the source text. Even after going through tidy, your source does indeed have duplicate IDs, which are invalid. And I'm not even going to try to imagine the work-around needed.

Comment #3

leeksoup commented 29 April 2007 at 06:58

OK, I've done all this to customize the import filter xsl file - I've made the body a dummy and the "review" field has the review content as I want it. Also added in the rest of the fields I want to import.

I can see the data is coming in all right in the debug messages, but:
* I can't see the mapping over to field_ field names. I understood from the help file that this is supposed to happen automagically. Right or wrong? OK, on closer inspection, this does happen correctly to one of the fields (the "review" field), but all the other CCK fields are left alone. Am I supposed to mark them somehow? If so, where?

* it still won't actually import the nodes correctly. Nothing happens when I hit the submit button at the end of the demo run. (It just goes back to the demo page.) Also, the form I see at that point contains only the title of one of the reviews and no contents at all.

Also, where do I add the code for taxonomy handling (I want to look up terms by name)?

Thanks!

Comment #4

dman commented 29 April 2007 at 10:43

As in the Import to CCK docs, in theory, if your target node type is a CCK type that includes field_publisher, and the 'simplified' half-way stage that the custom XHTML/XSL produces contains a div (or any element) with either an ID or a class='publisher'
... then, the import process will detect that [div id='publisher'] gets inserted into $node->field_publisher and things get saved. It should be auto-detecting them based on introspection of the expected node type fields. [circa line 1850]
You can also make your translation produce [div id='field_publisher'] instead, which should have a similar result without relying on autodetection.

The sample I gave included just the one example of making this happen:

<!--  insert further custom data blocks   --> 
 <xsl:for-each select="descendant::*[@class='publisher']">
 <span class="publisher">
  <xsl:value-of select="." /> 
 </span>
</xsl:for-each>

You'll want to copy & paste that for each unique field.

Change <xsl:value-of select="." /> to <xsl:apply-templates /> if you want html markup to come through also. Normally it's safer to strip them, but if you want the old markup also...

Every element in the transformed (simple) XHTML with a class or an id gets ported into the $node as a property-value with that label.

The various node_save() hooks then have to save the data they recognise. In the case of CCK, that means the field_* values that may have been absorbed.

I'm not sure why the demo save is not working. There may be something iffy with the form save in v5, and there is simply no way to save many nodes from one edit form, so the demo probably just can't handle the concept of multiples. At one time I had it pumping out several node-edit pages in one screen, but I couldn't handle those events, it just didn't make sense. The demo is not designed to actually be used, just to test the parsing. Probably not going to happen.

If you can add a way to absorb the terms by name, it would go at the line inside import_html_xhtml_to_node() saying
// If there are any other things to come from HTML into $node, let me know now!
This is after the generic found elements have been added to the $node object, and before the core properties are set.

Your node will probably (by that point) be of the form
$node->terms = array('life','love', 'laughter');
and you'll want to end up with something like

$node->taxonomy = array(
  // Keyed by term id; each entry is a term object.
  7 => (object) array(
    'tid' => 7,
    'vid' => 3,
    'name' => 'life',
    // ...
  ),
  9 => (object) array(
    'tid' => 9,
    'vid' => 3,
    'name' => 'love',
    // ...
  ),
);

(Pseudocode, I can't recall the exact taxonomy array syntax.)
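
A minimal sketch of the lookup by name, not tested against the module. It assumes Drupal 5's taxonomy_get_term_by_name(), which returns an array of matching term objects (with tid, vid and name properties). The function name example_terms_to_taxonomy() and its $lookup parameter are made up for illustration; the parameter is only there so the sketch can run outside Drupal.

```php
<?php
/**
 * Hypothetical helper: turn a list of term names into an array of term
 * objects keyed by tid, roughly the shape described above.
 *
 * $lookup is any callable that takes a name and returns an array of
 * matching term objects. Inside Drupal 5 the default,
 * taxonomy_get_term_by_name(), should do; it is a parameter here only so
 * the sketch can be run standalone.
 */
function example_terms_to_taxonomy($names, $lookup = 'taxonomy_get_term_by_name') {
  $taxonomy = array();
  foreach ($names as $name) {
    $matches = call_user_func($lookup, trim($name));
    // Names with no matching term are skipped rather than created.
    foreach ($matches as $term) {
      $taxonomy[$term->tid] = $term;
    }
  }
  return $taxonomy;
}
```

At the spot mentioned above, that would be something like $node->taxonomy = example_terms_to_taxonomy($node->terms); assuming $node->terms holds the plain names. If the same name exists in more than one vocabulary, every match is kept, so you may want to filter on vid.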

Comment #5

leeksoup commented 30 April 2007 at 06:48

> as in the Import to CCK docs, in theory, if your target node type is CCK that includes field_publisher , and the 'simplified' half-way stage that the custom XHTML/XSL produces contains a div (or any element) with either an ID or a class='publisher' ... then, the import process will detect that [div id='publisher'] gets inserted into $node->field_publisher and things get saved. It should be auto-detecting them based on introspection of the expected node type fields. [circa line 1850]

I looked at the code and I think I see the problem. It is only looking for id= and not class= ... my review field is the only one using an "id" while all the others use a "class". Here's the code line (~1866):

$ids = xml_query($datadoc,'.//*[@id]');

There's no equivalent code for the class attribute. I'm guessing that wouldn't be too hard to add but I'm getting errors when I try.

> You can also make your translation produce [div id='field_publisher'] instead, which should have a similar result without relying on autodetection.

OK, I tried that and it appears to work. Now to try importing something for real and see if it breaks.

> You'll want to copy & paste that for each unique field.

Yes, I did that already. The problem was as noted above.

On the demo save -- not a problem.

Thanks for the info on where to put the taxonomy code. I will try and get it working. If I do, I will post the code here.

Comment #6

dman commented 30 April 2007 at 07:18

Hm, well that aspect was never really extended.
The code there was from reverse-engineering my first look at CCK. I don't know what issues there were/are for those CCK fields that may be multiple. (somewhere in the $field_def I guess)
I guess I used ID (guaranteed to be unique per page) vs class (which implies non-unique to me) as the multiple-field thing was still undefined.

The code there could possibly be refactored.
At the moment it finds any ID, then checks if it matches a cck field def.
OTOH, it could load the CCK node def and search for values for all valid fields.
The second approach may be tighter, and would sit in the code better as a stand-alone phase also.
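
A rough sketch of that second approach, for illustration only. It uses PHP 5's DOM extension (DOMDocument and DOMXPath) rather than the domxml functions and xml_query() wrapper the module currently relies on, and it takes the field names as a plain array where the real code would read them from the CCK node type definition. The function name example_collect_field_values() and the sample publisher text are made up.

```php
<?php
/**
 * Hypothetical sketch: for each known CCK field name, collect the text of
 * every element whose id or class matches either the full field name
 * (field_publisher) or the short label (publisher).
 */
function example_collect_field_values(DOMDocument $doc, $field_names) {
  $xpath = new DOMXPath($doc);
  $found = array();
  foreach ($field_names as $field_name) {
    // Strip the leading 'field_' to get the short label, if present.
    $label = preg_replace('/^field_/', '', $field_name);
    $query = ".//*[@id='$field_name' or @id='$label'"
      . " or @class='$field_name' or @class='$label']";
    foreach ($xpath->query($query) as $element) {
      // Store in the array-of-arrays form that CCK text fields use.
      $found[$field_name][] = array('value' => trim($element->textContent));
    }
  }
  return $found;
}

// Made-up sample of the 'simplified' half-way XHTML.
$doc = new DOMDocument();
$doc->loadXML(
  '<div id="review"><span class="author">Greg Sabouri</span>'
  . '<span class="author">Shawn Sabouri</span>'
  . '<div id="field_publisher">Some Publisher</div></div>'
);
$values = example_collect_field_values($doc, array('field_author', 'field_publisher'));
print_r($values);
```

Because it starts from the field list, class-based matches and multiple values fall out of the same loop. The class test is an exact string match, so an element with several classes (class="author big") would be missed.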

Comment #7

leeksoup commented 1 May 2007 at 00:13

Category: support » bug

I am now trying to import a section of the website.

It finds multiple reviews in a page, but only imports (saves) the last one as a node. Also some of the fields are missing from the saved node, though I can see them in the data dump in the debug messages. I think on inspection they are the ones that allow multiple values, but I am not certain. Checking that and will post update when I'm sure.

Comment #8

leeksoup commented 1 May 2007 at 06:52

OK, more info:

1. If a field has multiple values, only the last one is saved. "field_author" for example, ought to have say 2 authors, but turns out as:
[field_author] => Array
(
[0] => Array
(
[value] => Shawn Sabouri
)

)

Oddly, if I don't map "author" over to "field_author" in the xsl template, I do get both values coming through:

[field_author] => Array
(
[0] => Greg Sabouri
[1] => Shawn Sabouri
)

BUT then they don't get put into the right field of the review, so that's no help.

2. If there's anything special about the field, say it's a Date or CT or something other than a basic text or numeric field, it doesn't get saved either.

3. As I noted above, only the last review found on a page is saved as a node. Though the data structures for the others look equally good.

Comment #9

dman commented 1 May 2007 at 08:35

The second example there is the default, naive behaviour of how to handle multiples of anything.
It predates the advent of CCK.

The first example appears to be the first attempt at replicating the internal CCK representation, although, as sorta expected, it's incomplete in that it doesn't handle multiples well.

Only text type fields were attempted. The typed fields were still growing in number and unstable at the time that was done. I haven't given any thought (or had any use yet) for the other types.

When I tried a dry run with the test page you supplied, I seemed to get all the expected nodes created in order (after a couple of tweaks). There was a bit that got left out of the loop incorrectly in the V5 upgrade which I had to repair, and did so last week - in HEAD.

Important - check your file against this change - the first one. The brace needed to be shifted down outside the building of the node array. It's a bit hard to spot.

Comment #10

leeksoup commented 2 May 2007 at 03:36

> Important - check your file against this change - the first one. The brace needed to be shifted down outside the building of the node array. It's a bit hard to spot.

Ah! That takes care of the nodes not getting imported. Now I'm getting all the reviews as nodes.

However, I am still getting some weird results. For example, I imported math/mathhigh.htm, which contains 2 reviews. I then look at my new nodes and I see 3, not 2:
1) imported/math/mathhigh = review of Jacobs' Geometry
2) imported/math/mathhigh.htm = review of Algebra I: A Teaching Textbook
AND
3) node/36 = review of Algebra I: A Teaching Textbook (yeah, it shows up twice in the admin/content/node)

2 & 3 are identical in content.
???

> The first example appears to be the first attempt at replicating the internal CCK representation, although, as sorta expected, it's incomplete in that it doesn't handle multiples well.

OK, so this feature was not implemented. Looking at the code, function import_html_absorb_properties, it does appear that new values for the field_ fields will overwrite the old.

Here's my attempt at a patch to fix this:

diff -r1.2 import_html.module
2045,2049c2045,2053
<         //is cck
<         $node->$key = array(array('value'=>$value));
<         return;
<   }
<
---
>            //is cck field, all of which are stored as arrays
>           if (! isset($node->$key)) {
>               $node->$key = array(array('value'=>$value));
>           }
>           else {
>               $a = $node->$key; $a[] = array('value'=>$value); $node->$key = $a;          }
>           return;
>         }
>

This does work correctly for my test case.
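
The same idea as a standalone function, in case it is easier to read than the diff. This is only a sketch: example_absorb_cck_value() is a made-up name, not the module's import_html_absorb_properties(), and it only covers plain text values.

```php
<?php
/**
 * Hypothetical standalone version of the patched branch: append a value to
 * a CCK-style field on $node instead of overwriting what is already there.
 */
function example_absorb_cck_value($node, $key, $value) {
  // Work on a local array, which sidesteps the ambiguity of writing
  // $node->$key[] directly.
  $values = isset($node->$key) ? $node->$key : array();
  $values[] = array('value' => $value);
  $node->$key = $values;
  return $node;
}

$node = new stdClass();
$node = example_absorb_cck_value($node, 'field_author', 'Greg Sabouri');
$node = example_absorb_cck_value($node, 'field_author', 'Shawn Sabouri');
print_r($node->field_author);
```

Both authors end up under field_author in the array('value' => ...) form, rather than the second replacing the first.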

Comment #11

dman commented 2 May 2007 at 04:12

:-}

It's totally undefined which node gets ownership of the old URL, or what the new URL should be for the other progeny. That was never worked through.

I also saw a double-up, but thought it was due to my repeated attempts. Not sure what's the story there. Needs some more debug at about the time the
// If the process has resulted in xt:document blocks, each block
// is a new item.
or the
$files = _import_html_import_files($import_files, $source_siteroot, $form_id, $form_values);

Thanks for the CCK fix. Looks about right. I guess at one point CCK fields were a bit simpler. I'll put that into the release sometime soon.

Comment #12

dman commented 28 July 2007 at 21:51

Status: Active » Closed (fixed)

The functionality and doc could still be improved, but that's a TODO, not an open bug.
