I'm running the latest version of SimpleFeed and SimplePie. I'm getting duplicated items when syncing my livejournal RSS.

Is there any way to see why the duplicate item checking is failing, or why it is re-adding things that are already there? I haven't been editing old items or anything; I just left it alone and it did it on its own. Thanks

Comments

Se7enLC’s picture

I just made some code changes and I think I might have addressed the problem.

It looks like duplicate checking is based on an md5sum of the title and body. So if either the title or the body changes slightly, the hash changes and the item is no longer recognized as the same one.

I decided to add in a call to "get_permalink()", which for livejournal feeds is the URL of the entry. Any changes made to the entry, including title and body, will NOT change that URL. This means that it will be able to update the correct feed item as needed. Fallback is to the title+body method if the $url variable doesn't get filled in.

Code is below, modified in simplefeed_item.module:

$url = $item->get_permalink();
// Hash the permalink when there is one; otherwise fall back to title + body.
if ($url) {
  $iid = md5($url);
}
else {
  $iid = md5($title . $body);
}

mrrijo’s picture

I had the same problem with the current stable 5 release, with Google Alerts. I have just put your code in, and I am waiting a while to see whether any duplication occurs. Thanks for this tip. :)

Se7enLC’s picture

well that didn't work at all, I still got duplicate items.

Se7enLC’s picture

Looking in the db, it seems that the iid field of simplefeed_feed_item is not being filled in, even though the correct $iid is being saved into $form_state. The node is created using this line of code, which I assume is internal to Drupal:

drupal_execute('feed_item_node_form', $form_state, $node);

There's no other mention of "feed_item_node_form" anywhere else in simplefeed, simplepie or drupal

Se7enLC’s picture

Another possible fix:

in the function "simplefeed_item_feed_parse" toward the end, "drupal_execute('feed_item_node_form', $form_state, $node);" is called. $node is supposed to contain the defaults and form_state should contain the per-feed-item parameters. As it turns out, the function is using the $iid from the node rather than from the form state.

I added the following line of code right before that function call:
$node->iid = $iid;

I now see the iid being filled into the database. Unsure yet how this will affect duplicate checking, but I am hopeful.

EDIT: it seems that this change was a success. New items are being imported, but changed items are not being updated. This may be intentional behavior. I might make changes to drop and re-add changed feed items under the same number

mfer’s picture

I'm in the same situation.

Instead of dropping and re-adding, why not update the body, title, and other attributes on the node and save it if there is a change? This way, if any comments or other things have been added to the node on the local site, they will remain.

mfer’s picture

What if we switched out

  $iid = md5($title . $body); 

For

  $iid = md5($item->get_id()); 

This is different than what was there in the 1.x version of simplefeed. That was using

  $iid = $item->get_id(true);

When get_id(true) is used the returned result is either

  md5($this->get_permalink() . $this->get_title());

or

  md5(serialize($this->data)); 

If no value or false is sent to get_id() it will return the id (from atom feeds), guid, identifier, permalink (link), title, or the same result as if it were true. My only concern here is if it gets to the title part. There may end up being duplicate titles on different posts.

So, are we up for going with

  $iid = md5($item->get_id()); 

or a similar function of our own that does the same thing except we do something different at the title part? In the few minutes I looked through this it looks like it might be a more robust solution.
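A function of our own along those lines might look like the sketch below. It is only an illustration: example_item_iid() is a made-up name, not part of SimpleFeed or SimplePie, and it takes plain strings rather than a SimplePie item so that it can run on its own. It tries the guid first, then the permalink, and at the point where get_id() would use the title alone it falls back to the title + body hash.

```php
<?php
// Hypothetical helper, not part of SimpleFeed or SimplePie: pick the most
// stable identifier available and never rely on the title alone.
function example_item_iid($guid, $permalink, $title, $body) {
  if ($guid) {
    // A feed-supplied id/guid survives edits to the title or body.
    return md5($guid);
  }
  if ($permalink) {
    // No guid: the item URL is the next most stable value.
    return md5($permalink);
  }
  // Last resort: the title + body hash, instead of the title alone.
  return md5($title . $body);
}
```

With this, two posts that share a title but have no guid or link still get different iids as long as their bodies differ.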

mfer’s picture

Well, the idea of going with md5($item->get_id()) seems nice until you have to update the iid in simplefeed_feed_items table where you don't have all the info you are looking for.

In the data set I have, the duplicates come up when there is a change to the body section of the node. This can be someone going in there and changing something, or it can be a change from feedburner. For example, feedburner adds blog bling at the bottom of the item, or, as I've noticed, occasionally removes extra spaces. This can happen after an item has been pulled in.

I'm experimenting with this instead:

  if (!$link = $item->get_permalink()) $link = $feed->get_permalink();
  $iid = md5($title . $link);

The title and URL should stay the same for each item, so this won't cause duplicates due to changes in the body. And if we want to update the node based on body changes, we can now detect the item by something other than the body and update it.

Updating to this is pretty easy. It's just a matter of using an update function similar to the last update.

The only big downside would be that if someone puts out 2 items with the same name that don't have item URLs (so the feed URL is used), the second one won't be added.
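A standalone sketch of the title + link key, showing both the stable case and the collision described above. The function name and the example.com URLs are made up for illustration; the function takes plain strings instead of SimplePie objects so it can be run by itself.

```php
<?php
// Illustrative only: same logic as the two lines above, on plain strings.
function example_title_link_iid($title, $item_link, $feed_link) {
  // Fall back to the feed URL when the item has no URL of its own.
  $link = $item_link ? $item_link : $feed_link;
  return md5($title . $link);
}

$feed = 'http://example.com/rss';

// The body is not part of the key, so editing it cannot change the iid.
$stable = example_title_link_iid('Hello', 'http://example.com/1', $feed);

// Two link-less items sharing a title produce the same iid, so the
// second one would be treated as a duplicate of the first.
$first = example_title_link_iid('Untitled', '', $feed);
$second = example_title_link_iid('Untitled', '', $feed);
```

Here $first and $second are identical, which is exactly the case where the second item would be skipped.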

I'm comparing this to the current setup. I'll report back in a few days when there has been a chance to see data with the differences.

mfer’s picture

FYI, I'm going to work on a patch for what I proposed in #8 sometime in the next week.

mfer’s picture

Status: Active » Needs review
Attachment: status new, size 3.77 KB

Here is a patch for the 5.x-2.x branch. The 6.x patch is coming soon.

mfer’s picture

Let's try that again on D5 without my .project and .settings files.

mfer’s picture

Here is a patch against 6.x-1.x. This is untested since I don't have simplefeed running on Drupal 6 right now, but the change seems pretty straightforward.

Also, I based the new update function on the update 1 function, but I altered it so that it goes through the whole table and not just the first 50 rows.

m3avrck’s picture

Matt this looks most excellent!

I'm going to double check these patches running 'em through a few 1000 blogs I have and confirm 'em both.

xiffy’s picture

@Se7enLC
Confirmed. I don't know if it is the Drupal way; probably not, since the $iid is attached to the form state, so that would be the preferred method, I suppose.
But that does not work: my iid in the database stayed empty (Drupal 6.2, SimplePie 1.1.1, development snapshot of SimpleFeed). Your $node->iid = $iid; worked like a charm.
This should be in a next release, or at least a working version of the preferred Drupal method.
Cheers

jmaties@drupal.org’s picture

"db_num_rows" does not exist in Drupal6.x :(

mfer’s picture

@m3avrck - any word on the patches in #11 and #12? I've had them running since I posted them without any noticeable problem.

m3avrck’s picture

Yes hope to commit this week, been on travel for a bit. I'll also triple check the patches running through a few thousand blogs too :)

m3avrck’s picture

@Se7enLC and @xiffy, please see this new issue for what you guys are experiencing: http://drupal.org/node/269185 -- this is separate from the patches in this issue. Your issue seems to be a PHP4 type issue.

m3avrck’s picture

Thanks Matt! Patch #11 committed to D5: http://drupal.org/cvs?commit=120679

m3avrck’s picture

Patch #12 committed to D6: http://drupal.org/cvs?commit=120702

m3avrck’s picture

Status: Needs review » Fixed

good to go, thanks guys!

gsnedders’s picture

However, due to things like http://www.intertwingly.net/blog/2005/04/09/Clone-Wars, just using the ID doesn't work. It's widespread enough to be an issue.

mfer’s picture

@gsnedders - what do you mean just using the id wouldn't work? I don't see how that's an issue since duplicate checking isn't based on just an id. Can you please explain what you mean?

gsnedders’s picture

I've been running around too much, so what I say may well be nonsense. Looking at it again, what you currently have doesn't work either, as you can have a conforming feed with two different items that have the same title and link.

mfer’s picture

@gsnedders You can have a feed with two separate items that have the same link and title? Can you provide an example of this?

Whatever solution we use for this module needs to take into account the existing design debt of the module as well as an update path for those who are already using it. Any suggestions for an alternative method of duplicate detection?

gsnedders’s picture

I would take the ID to be unique per item provided that it is not repeated at all in the current feed; if it is repeated, I would then fall back on the link. I would always treat the ID as unique only within the current feed, and not globally (though technically it should be).
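As a rough sketch of that rule, under some assumptions: the function name and the array layout (each item as array('id' => ..., 'link' => ...)) are invented for illustration, and mixing the feed URL into the hash is just one way to keep IDs scoped to a single feed.

```php
<?php
// Hypothetical sketch: each item is array('id' => ..., 'link' => ...).
function example_feed_scoped_keys($feed_url, $items) {
  // Count how often each non-empty id appears in this fetch of the feed.
  $counts = array();
  foreach ($items as $item) {
    if ($item['id'] !== '') {
      $seen = isset($counts[$item['id']]) ? $counts[$item['id']] : 0;
      $counts[$item['id']] = $seen + 1;
    }
  }

  $keys = array();
  foreach ($items as $item) {
    $id_is_unique = $item['id'] !== '' && $counts[$item['id']] == 1;
    // Trust the id only when it is not repeated; otherwise use the link.
    $part = $id_is_unique ? 'id:' . $item['id'] : 'link:' . $item['link'];
    // Prefix with the feed URL so keys are unique per feed, not globally.
    $keys[] = md5($feed_url . "\n" . $part);
  }
  return $keys;
}
```

Items whose ID is repeated within the feed end up keyed by their link, and the same ID appearing in two different feeds yields two different keys.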

I don't have any examples of feeds with multiple items with identical link/title with different IDs, but they shouldn't be treated as duplicates. One thing I do strongly believe is that the given ID should not just be outright ignored (even though it does need to be ignored in some cases for compatibility with the real world).

http://www.詹姆斯.com/blog/2006/08/rss-dup-detection gives a decent description of duplicate detection, FWIW.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.