Hello!! First of all, thank you very much for your module! I find it really useful!!
I am importing big XML files using the chunked XML process, which works really great! My feed items have a "unique identifier of item", like "D87EF034-8BAE-4D08-AEFE-A794844A8224".
So far I have imported around 100,000 nodes!
While the import itself goes really well, when I try to update the items some of them get updated but others don't, and I end up with duplicates in my database.
Do you think I am missing something, or are so many nodes too much for Drupal to handle?
Thank you!

Comments

dimakopu’s picture

Assigned: dimakopu » Unassigned
Sorin Sarca’s picture

Hi!
It's possible to have duplicates only if there are items with distinct unique ids but with the same content, like below:

<items>
  <item>
    <id>D87EF034-8BAE-4D08-AEFE-A794844A8224</id>
    <content>The Content</content>
  </item>
  <item>
    <id>AA7EF034-8BAE-4D08-AEFE-A794844A8225</id>
    <content>The Content</content>
  </item>
</items>

You should take a look at that XML file and search for such duplicates.

Anyway, if the import is stopped or fails, it's also possible to get duplicates, since the hashes are saved in chunks (you can check the settings for the chunk size for inserted hashes; chunking is used for performance reasons).
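Roughly, the chunking works like this (a simplified sketch, not the module's real code; the helper and the item data are made up):

<?php
// Simplified sketch of chunked hash saving (not the module's real code).
function save_hashes(array $hashes) {
  // Imagine a bulk INSERT into the hash table here.
  print 'Saved ' . count($hashes) . " hashes\n";
}

$items = array(array('id' => 'A'), array('id' => 'B'), array('id' => 'C'));
$chunk_size = 2; // the "chunk size for inserted hashes" setting
$buffer = array();
foreach ($items as $item) {
  // ... create or update the node for $item here ...
  $buffer[] = md5($item['id']);
  if (count($buffer) >= $chunk_size) {
    // Hashes reach the database only once a full chunk accumulates.
    save_hashes($buffer);
    $buffer = array();
  }
}
// If the import stops before this final flush, the last items have nodes
// but no saved hashes, so the next run re-imports them as duplicates.
save_hashes($buffer);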

dimakopu’s picture

Thanks for the answer! Actually, I found what the problem was.

I have two different feeds with different names, for example feed_data_0 and feed_data_1, connected to two XML files, data_0.xml and data_1.xml.
When the XML files are updated, it's possible for the elements to get mixed between them. For example, an element with id="D87EF034-8BAE-4D08-AEFE-A794844A8224" that was initially imported from data_0.xml may, after an update, be imported from data_1.xml. So the two different feeds sometimes import nodes with the same id.

Since the hash is also constructed from the machine name of the feed, elements with the same id but imported from different feeds have different hashes, and that is why I had duplicates.
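For illustration (just my understanding; the real hash construction may differ):

<?php
// Illustrative only -- the real hash construction may differ.
$id = 'D87EF034-8BAE-4D08-AEFE-A794844A8224';
print md5('feed_data_0' . $id) . "\n"; // hash seen by the first feed
print md5('feed_data_1' . $id) . "\n"; // different hash for the second feed,
                                       // so the same item looks new: a duplicate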

Sorin Sarca’s picture

Status: Active » Closed (fixed)
broon’s picture

Title: Get duplicates when update » How to avoid duplicates when importing the same content from different sources
Version: 7.x-2.6 » 7.x-3.0-rc2
Component: Miscellaneous » Code
Issue summary: View changes
Status: Closed (fixed) » Active

I am reopening this issue since I am stuck on a similar problem with the new 3.x version (which introduced hash groups).

I am importing several feeds from the same blog, each feed for a single tag. I don't want all tags imported, so I choose only some. As an example, think of a nature blog with the tags "flower", "bush", "tree", "moss", "algae". I only want "flower" and "tree", so I set up two feeds:

1. machine_name: nature_flower, hash group: nature
2. machine_name: nature_tree, hash group: nature

I am assigning the same group since they come from the same source and I don't want duplicates. However, if a blog post in the nature blog is tagged with both flower and tree (e.g. cherry blossom), the post will be in both feeds. On import, I get two copies of the same original blog post even though the unique id (the "guid" node in RSS) is the same.

Since involving the machine_name in the hash would produce duplicates, the whole concept of hash groups was introduced (thanks for that!). However, it doesn't seem to work; maybe I am missing a point here? How should I avoid the duplicates?
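My understanding of what a shared group should do (a hypothetical sketch with made-up helpers, not the module's code):

<?php
// Hypothetical: with a shared hash group, both feeds compute the same key,
// so the second feed finds the existing node instead of creating a new one.
function lookup_node_by_hash($hash) {
  // Imagine a SELECT on the hash table returning a node id or FALSE.
  return FALSE;
}

$guid = 'http://example.com/nature/cherry-blossom'; // same guid in both feeds
$hash = md5('nature' . $guid); // scoped by the shared group, not machine_name
if ($nid = lookup_node_by_hash($hash)) {
  print "nature_tree updates the node nature_flower already created\n";
}
else {
  print "First feed to see the post creates the node and saves the hash\n";
}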

broon’s picture

Category: Support request » Bug report
Priority: Normal » Minor

I think I found the culprit. As mentioned, I am using several feeds from the same blog, so the settings (fields, filters, ...) are nearly identical. At one point I discovered the export/import feature and of course used it to generate copies. I just changed the JSON string by altering the URL, leaving everything else untouched. However, when importing, Feed Import overwrites the imported hash group and sets it to the machine_name.

In feed_import.module, function feed_import_import_feed_form_submit(), line 2338, it reads:
$code->settings['hashes']['options']['group'] = $code->machine_name;

So this seems to be intended. The question would be: why is the option in the JSON string then? I am marking this as a bug report now; you can choose to close it again if that's the way it should be. Maybe this could be turned into an import option?
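Something like this could make it opt-in (just a sketch inside feed_import_import_feed_form_submit(); keep_hash_group is a made-up option name, not a real module setting):

// Only reset the group when the importer did not ask to keep it.
if (empty($form_state['values']['keep_hash_group'])) {
  $code->settings['hashes']['options']['group'] = $code->machine_name;
}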

Sorin Sarca’s picture

"The question would be: why is the option in the JSON string then?"

Edit: this is actually the explanation for that line of code.
It is there to avoid conflicts with other feeds when you import a new configuration. If I used the group provided by the JSON and imported the feed (without checking whether other configurations already use that group), it could overwrite entities. In your case you do want the same group, and you can edit it in the Hash Manager section before importing the feed items.
If you change the group after you already have some imported entities you will get duplicates, so in order to test it (again) you should delete all imported items, set the group for each feed configuration, and re-run the import process.
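The check that honoring the imported group would require looks roughly like this (a hypothetical sketch; the helper is made up):

<?php
// Hypothetical sketch of the missing safety check (not module code).
function hash_group_used_by_other_feed($group, $machine_name) {
  // Imagine a lookup over all saved feed configurations here.
  return FALSE;
}

$group = $code->settings['hashes']['options']['group'];
if (hash_group_used_by_other_feed($group, $code->machine_name)) {
  // Blindly reusing the group could overwrite another feed's entities,
  // so fall back to the safe default.
  $group = $code->machine_name;
}
$code->settings['hashes']['options']['group'] = $group;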

Sorin Sarca’s picture

Status: Active » Closed (works as designed)