In FeedsProcessor.inc
function existingEntityId
includes ->condition('feed_nid', $source->feed_nid)
when checking for duplicates.

This allows for duplicates if multiple feeds include the same feed item. This is typical on BBC and other sites with several feeds tailored to audiences. I can't see a clean way to override this without just commenting the line out.

I suggest it is not needed considering GUID is Global.

(Thanks for all the great code!)

Comments

blazindrop’s picture

StatusFileSize
new571 bytes

+1 for removing this. We are using guid as the unique target but it's really only unique to the importer node (we have multiple nodes for a given importer) because of this condition.

Attached is a patch based on OP comments, but I'm thinking all of initial conditions shouldn't apply IF there's at least one unique target defined in this snippet of code:

I'm thinking the logic for this function should consider checking to see if there are unique targets set for the importer. If so, then don't have any conditions at all initially. So instead of this:

  protected function existingEntityId(FeedsSource $source, FeedsParserResult $result) {
    $query = db_select('feeds_item')
      ->fields('feeds_item', array('entity_id'))
      ->condition('feed_nid', $source->feed_nid)
      ->condition('entity_type', $this->entityType())
      ->condition('id', $source->id);

    // Iterate through all unique targets and test whether they do already
    // exist in the database.
    foreach ($this->uniqueTargets($source, $result) as $target => $value) {
      switch ($target) {
        case 'url':
          $entity_id = $query->condition('url', $value)->execute()->fetchField();
          break;
        case 'guid':
          $entity_id = $query->condition('guid', $value)->execute()->fetchField();
          break;
      }
      if (isset($entity_id)) {
        // Return with the content id found.
        return $entity_id;
      }
    }
    return 0;
  }

... something like this (note I have not tested this):

  protected function existingEntityId(FeedsSource $source, FeedsParserResult $result) {
    // Obtain unique targets for this source
    $targets = $this->uniqueTargets($source, $result);
  
    if (count($targets) == 0) {
      // no unique targets - nothing to identify existing entities.
      return 0;
    }
    else {
      $query = db_select('feeds_item')
      ->fields('feeds_item', array('entity_id'));
  
      // Iterate through all unique targets and test whether they do already
      // exist in the database.
      foreach ($targets as $target => $value) {
        switch ($target) {
          case 'url':
            $entity_id = $query->condition('url', $value)->execute()->fetchField();
            break;
          case 'guid':
            $entity_id = $query->condition('guid', $value)->execute()->fetchField();
            break;
        }
        if (isset($entity_id)) {
          // Return with the content id found.
          return $entity_id;
        }
      }
    }
    return 0;
  }

Thoughts?

emackn’s picture

Using GUID does not guarantee to reduce duplicates. You will still get duplicates if you are consuming feeds from different sources. Sources like AP, BBC, CNN, etc. do not provide some super id to identify items across all sources. In the end, It falls on the feed consumer to ultimately decide which items to show on your site.

blazindrop’s picture

You're right. Our particular use case uses several files from AP, each of which can duplicate an item in another AP file. This solves that problem, but the chances of ever seeing a duplicate from another source are low in our case.

Attached is a patch based on my idea of rewriting FeedsProcessor::existingEntityId(). I tested this thoroughly against two importers mapped to separate, identical files. A full import is done on the first one, but on the second there's no new nodes because of a guid match.

asb’s picture

I ran into this weird behaviour multiple times, as well, and never understood why Feeds' GUID isn't actually a Globally Unique Identifier, but more "locally limited identifier".

I don't know if this is supposed to be a bug or a feature, but I'd vote for removing the condition (unless there is a valid reason for this behaviour of existingEntityId(), and someone is able to explain why it's there).

blazindrop’s picture

Status: Active » Needs review

Should have set to needs review with supplied patch. Will reroll if needed..

rolodmonkey’s picture

Status: Needs review » Needs work

The last submitted patch, feeds-existingEntityId_rewrite-1233142-1.patch, failed testing.

johnv’s picture

I face the same problem.
IMO # 3 remove too much of the db-conditions, since the ID is not unique between Entity types. The change in #1539224: Add support for unique fields to be unique site wide seems sufficient to me.

Since I also have manually created nodes, I use #661606: Support unique targets in mappers

twistor’s picture

Status: Needs work » Closed (duplicate)