Hi. I'm posting this more to see if others might be interested in a feature that I have as a requirement for one of my projects. I'll outline my requirement and thoughts. If you like this and think it is applicable to the main module, I'm happy to roll a patch for it.

One of my challenges is that my feed's source data is sometimes incorrect. My current solution is to pre-process the data, which gets it about 90% correct. For example, I'm dealing with retail product imports. The retailers give me categories that do not line up with my taxonomy terms, so I pre-parse the file to try to make them line up. Most of the time they do, but sometimes a few items are off. When items are imported incorrectly, I manually go to the created nodes, change the field values, and save.

That all works well until the next Feeds import in which the item changes. Let's say the price changes in the feed (a different field from the one I manually changed). Feeds sees that the checksum is different and updates the item. This happens in FeedsNodeProcessor, specifically in this block of code:

// Execute mappings from $item to $node.
$this->map($item, $node);

// Save the node.
node_save($node);

Since the item changed and now needs updating, this simply copies over all of the source values and re-writes the node. So what happens is that I lose anything I manually changed.

My solution at the moment was to hack the following code into FeedsNodeProcessor, right above that:

// If we are updating, remove the three fields that I want manual control over.
// Note: test for a populated nid; "=" would assign instead of compare.
if (!empty($node->nid)) {
  unset($item['advertisercategory'], $item['color'], $item['sex']);
}
// Execute mappings from $item to $node.
$this->map($item, $node);

// Save the node.
node_save($node);

That effectively says "if the node is flagged for an update, remove those three key/value pairs". The result is that I no longer ever have those three fields updated, so any manual field overrides now stick.

The big question is this - would others have a similar need for this functionality? If so, perhaps I could add another checkbox field to the mapper (similar to Unique Target) that allows the user to flag any fields that should not be updated beyond the first import. Then the above block of code could be modified to check for the presence of that checkbox and remove those key/value pairs from the node update.
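A minimal sketch of how such a per-mapping flag could behave, written as plain PHP outside of Feeds. The `insert_only` key and the `feeds_sketch_filter_item()` function are hypothetical names used only for illustration; they are not part of the Feeds mapping API.

```php
<?php
/**
 * Drops source values whose mapping is flagged insert-only when updating.
 *
 * NOTE: 'insert_only' is a hypothetical mapping key, not an existing Feeds option.
 */
function feeds_sketch_filter_item(array $item, array $mappings, $is_update) {
  if (!$is_update) {
    // First import: every source value is passed on to the mapper.
    return $item;
  }
  foreach ($mappings as $mapping) {
    if (!empty($mapping['insert_only'])) {
      // Remove the value so the mapper never sees it on update.
      unset($item[$mapping['source']]);
    }
  }
  return $item;
}

$mappings = array(
  array('source' => 'price', 'target' => 'field_price'),
  array('source' => 'color', 'target' => 'field_color', 'insert_only' => TRUE),
);
$item = array('price' => '9.99', 'color' => 'red');

$on_insert = feeds_sketch_filter_item($item, $mappings, FALSE);
$on_update = feeds_sketch_filter_item($item, $mappings, TRUE);
```

On insert the item is left untouched; on update only the unflagged `price` value remains.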

Hopefully that all makes sense. I'd love to hear some feedback on the utility of this change.

Comments

alex_b

I see the valid use case, but I must admit that I am not too keen on adding another flag to the mapping API right now, especially as you can solve your problem easily by extending FeedsNodeProcessor in a custom module:

* Extend FeedsNodeProcessor and expose through plugins API.
* In your CustomFeedsNodeProcessor, override map() and exempt fields that you'd only like to map on insert (you can test for the presence of $node->nid to check whether you're updating or not).
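The two steps above could look roughly like the following. To keep the sketch runnable on its own, `SketchNodeProcessor` is a stand-in written for this example, not the real FeedsNodeProcessor, and its `map()` signature is an assumption; the `$insertOnlyTargets` property is likewise hypothetical. The existing node in the usage part already carries its current field values, which is an assumption of the sketch as well.

```php
<?php
/**
 * Stand-in for FeedsNodeProcessor so the sketch runs on its own.
 * The real class lives in the Feeds module; this stub only mimics a map() step.
 */
class SketchNodeProcessor {
  protected $config = array('mappings' => array());

  public function __construct(array $mappings) {
    $this->config['mappings'] = $mappings;
  }

  // Copies each mapped source value onto the target node.
  public function map($source_item, $target_node) {
    foreach ($this->config['mappings'] as $mapping) {
      if (isset($source_item[$mapping['source']])) {
        $target_node->{$mapping['target']} = $source_item[$mapping['source']];
      }
    }
    return $target_node;
  }
}

/**
 * Custom processor: maps some targets only when the node is new.
 */
class CustomSketchNodeProcessor extends SketchNodeProcessor {
  // Hypothetical list of targets that should keep manual edits.
  protected $insertOnlyTargets = array('color', 'sex');

  public function map($source_item, $target_node) {
    if (empty($target_node->nid)) {
      // No nid yet: this is an insert, so map everything.
      return parent::map($source_item, $target_node);
    }
    // Updating: temporarily drop the insert-only mappings.
    $all = $this->config['mappings'];
    $kept = array();
    foreach ($all as $mapping) {
      if (!in_array($mapping['target'], $this->insertOnlyTargets)) {
        $kept[] = $mapping;
      }
    }
    $this->config['mappings'] = $kept;
    $target_node = parent::map($source_item, $target_node);
    $this->config['mappings'] = $all;
    return $target_node;
  }
}

$processor = new CustomSketchNodeProcessor(array(
  array('source' => 'price', 'target' => 'price'),
  array('source' => 'color', 'target' => 'color'),
));
$item = array('price' => '9.99', 'color' => 'red');

// Insert: all fields are mapped.
$new_node = $processor->map($item, new stdClass());

// Update: the manually edited color is left alone.
$existing = new stdClass();
$existing->nid = 16837;
$existing->color = 'blue';
$updated = $processor->map($item, $existing);
```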

Your use case is exactly one of those corner cases I had in mind when making Feeds so pluggable. Our team routinely writes custom fetchers/parsers/processors for client site builds.

Either way, let's keep this feature request open, I'm sure others have similar use cases.

rjbrown99

Thanks Alex. I did notice an initial problem with my hack - specifically around checksums. At the top of the process function, it has this code:

if ($hash == $this->getHash($nid)) {
  continue;
}

I ran into a case where I did a large import of a few thousand items. The taxonomy terms weren't mapping correctly, so I wanted to re-import a few times while I cleaned them up. After that I wanted manual control over those fields as indicated above.

The re-import captured the new hash and wrote it to the database, but because of my logic from the initial post of this thread it did not update those fields. Later imports, where I did want those fields updated, did nothing: each item's hash matched the one already stored in the database, so every item was skipped. I had to ignore checksums and re-import every item.

So to you or anyone else who happens to want to selectively ignore fields upon update, it is important to properly integrate this logic with the hashing function.
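One possible way to tie the two together, sketched as plain PHP: fold the list of skipped fields into the hash, so that changing the list makes previously stored hashes stale and forces the items to be processed again. The function name and the md5-over-serialize() scheme are assumptions for illustration, not the Feeds implementation.

```php
<?php
/**
 * Hashes a feed item together with the list of fields skipped on update.
 *
 * Assumption: md5 over serialize() is used for illustration only.
 */
function sketch_item_hash(array $item, array $skip_on_update) {
  // Sort so the order in which fields are listed does not change the hash.
  sort($skip_on_update);
  return md5(serialize($item) . serialize($skip_on_update));
}

$item = array('price' => '9.99', 'color' => 'red');

// Hash stored while 'color' and 'sex' were protected.
$stored = sketch_item_hash($item, array('color', 'sex'));

// Same protected fields listed in a different order: hash is unchanged.
$same_config = sketch_item_hash($item, array('sex', 'color'));

// 'color' is no longer protected: hash differs, so the item is re-processed.
$changed_config = sketch_item_hash($item, array('sex'));
```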

Thanks for the logic as well Alex - I will try to follow that approach and post back results of my findings.

mweixel

Version: 6.x-1.x-dev » 6.x-1.0-alpha10

I have a related, but different, use case. I have fields in my feed item node type that are manually populated after a feed import that I don't want to lose if the feed is updated.

In my case, I am updating travel warning information from the US State Department, but I need to have extra fields in the feed item to include information describing our institution's response to each travel alert. Anything contributed by the feed is accepted when the feed item is created and then a committee adds the institutional components (internal categories, a textual response, an expiration date, etc). These are never contributed by the feed and are manually added as part of a work flow. At this point, however, if the feed is updated, these other data are lost.

It doesn't seem to me that a new custom Feeds node processor would let me change this behavior, as I want to consume everything that's coming in from the feed.

rjbrown99

Thanks mweixel, you actually helped me track down a problem I am having. Here's what was happening with my taxonomy fields. This is specific to a CCK content taxonomy field that is also set to use core taxonomy. I'm using this field to facet on with apachesolr, so it's important that it be there.

1) Import feed items to create a node. 1 new node created in this case. One of the imported fields is a CCK content taxonomy field that also saves to core taxonomy.

2) The created node is 16837 with values in the term_node taxonomy table as follows. We have a taxonomy field that showed up and was correctly populated from the import file.

nid    vid     tid
16837  16838   241

4) I then manually edit the node and set one of the blank (non-imported) content taxonomy fields to a term (also saves to core taxonomy.) This is because this particular field did not import a term for that item since I have partial data for some fields that require manual entry when they come in empty. After I saved the node, a look at term_node shows my new term in there:

nid    vid     tid
16837  16838   229 - NEW term added
16837  16838   241

5) I manually re-imported the feed. Let's say some random other field changed which triggered a different checksum and a new import. This updated my node we are talking about. But look what happened to the term_node table:

nid    vid     tid
16837  16838   241

6) Notice that the tid 229 is now gone. That's the one I manually added in step #4. The really bad part of this for me is that since this is a content taxonomy field, Feeds killed the value from term_node (which breaks Apachesolr and its facets since it is now gone), but when a user views the node it still LOOKS like the term is there because the CCK content taxonomy field is still populated. Feeds did not overwrite that, it just cut off the term_node table from underneath it.

I don't have a solution for this yet, but I absolutely need to find a way to stop this from happening. I need manual control over specific fields post-import and can't have any future imports removing tids from the term_node table from manually edited nodes.

I'd welcome any thoughts as to how to accomplish this.

rjbrown99

The issue in #4 happens regardless of whether or not the source field exists in the import. In this case, the source mapping field isn't even there, yet it still overwrites the term_node items when saved. I verified this by dumping the $node object right before the node_save().

I am wondering whether the behavior described in #4 is a bug. On update, I would assume it isn't ideal to blow away what was already in the taxonomy field. Here's a little bit of code from the file:

          // If updating populate nid and vid avoiding an expensive node_load().
          $node->nid = $nid;
          $node->vid = db_result(db_query('SELECT vid FROM {node} WHERE nid = %d', $nid));

Now have a look at this:
#471074: Taxonomy synchronization caching bug
#594004: taxonomy_node_get_terms() needs a $reset param that node_load can call when *it* is reset

Would it be appropriate to include something like that fix to load the original taxonomy terms, to ensure they are saved back to the node with the node_save()? Are there other things that may not be included because the node_load() is avoided?

Perhaps this is becoming a different issue from excluding content.

I have a fix that stops it from dropping taxonomy. In FeedsNodeProcessor.inc, inside public function process, I now have the following, which is based on #594004 above.

          // If updating populate nid and vid avoiding an expensive node_load().
          $node->nid = $nid;
          $node->vid = db_result(db_query('SELECT vid FROM {node} WHERE nid = %d', $nid));

          // START CHANGE - Do not forget the taxonomy terms.
          // Test for a populated nid; "=" would assign instead of compare.
          if (!empty($node->nid)) {
            // Collect the tids already stored for this node revision.
            $terms = array();
            $result = db_query("SELECT tid FROM {term_node} WHERE nid = %d AND vid = %d", $node->nid, $node->vid);
            while ($term = db_fetch_object($result)) {
              $terms[] = $term->tid;
            }
            // Hand the existing terms back to the node so node_save() keeps them.
            $node->taxonomy = $terms;
            unset($result, $term, $terms);
          }

peter törnstrand

I'm having this problem too. My use case is similar to #3. I have an imagefield which is populated manually after feed import. This field gets discarded on feed update. I tried to use node_load() instead of just populating nid and vid, but that resulted in duplicates of all the images in another imagefield with multiple values. I'm not sure if the duplicates are related to my custom hook_feeds_node_processor_targets_alter() function.

Since this is a rather urgent problem, I will try to find a way around it and will post here if I find anything.

peter törnstrand

I did a rather ugly work-around to get this working. Using my own FeedsNodeProcessor I'm loading the complete node with node_load() and using unset() on the fields I'm getting duplicates on.

Not a long term solution but works for now.

cboyden

I tried unset() too. One of the fields I want to preserve is the Title field, though, and unset() caused the processor to save an empty Title.

I ended up writing a custom processor and overriding two functions: process() and map().

In process(), I add a variable that flags whether the node is new or updated, and I pass that into the map() function:

          // Set a flag so map() can skip protected fields on update.
          // Test for a populated nid; "=" would assign instead of compare.
          $update_data = !empty($node->nid);
          $this->map($item, $node, $update_data);
          node_save($node);

Then in map() I add an array of fields to skip, and continue the loop if the target field is one of them:

    // Targets that should only be mapped on insert, never on update.
    $skip_targets = array('title', 'example1', 'example2');
    foreach ($this->config['mappings'] as $mapping) {
      // Skip protected targets before doing any work for them.
      if ($update_data && in_array($mapping['target'], $skip_targets)) {
        continue;
      }
      $value = $parser->getSourceElement($source_item, $mapping['source']);
      // etc.

I would like to generalize this a little by defining the list of fields to skip somewhere other than within the map() function, to make it possible to change these without editing module code. And I'd also love to hear if anyone has other suggestions for improvement.
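One way the skip list might be moved out of map(), sketched as plain PHP: keep it as a comma-separated setting and parse it at runtime. Both function names are hypothetical. In Drupal 6 the string could, for instance, be read with variable_get() from a setting of your own choosing; that part is left out so the sketch runs standalone.

```php
<?php
/**
 * Turns a comma-separated setting such as "title, example1" into a target list.
 */
function sketch_parse_skip_targets($setting) {
  $targets = array();
  foreach (explode(',', $setting) as $name) {
    $name = trim($name);
    // Ignore empty entries left by stray commas or whitespace.
    if ($name !== '') {
      $targets[] = $name;
    }
  }
  return $targets;
}

/**
 * Decides whether a mapping target should be skipped for this item.
 */
function sketch_should_skip($target, array $skip_targets, $is_update) {
  return $is_update && in_array($target, $skip_targets, TRUE);
}

// The string stands in for a value stored outside the module code.
$skip_targets = sketch_parse_skip_targets('title, example1,example2,');
```

Inside map(), the `in_array()` test would then be replaced by a call such as `sketch_should_skip($mapping['target'], $skip_targets, $update_data)`.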

alex_b

Status: Active » Closed (won't fix)

I set this issue to 'won't fix' as it can be addressed by overriding FeedsNodeProcessor and there is no plan to implement this within the Feeds project itself.