For background, please see: http://drupal.org/node/61427

I don't know much about RSS but I think the aggregator should be using the guid of the feed item as the unique identifier and not the url of the feed item.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Steve Dondley’s picture

Status: Active » Needs review
FileSize
3.5 KB

OK, here's a first stab at a fix.

Steve Dondley’s picture

One other thing:

In addition to applying the patch, you have to create a new field in the aggregator_item table called "guid". It's a varchar with a length of 255.

Zen’s picture

Status: Needs review » Needs work

No idea about the issue, but if this requires a DB change, the patch should incorporate the changes to database.mysql (4.0 and 4.1), database.pgsql and updates.inc.

Thanks,
-K

Dries’s picture

This patch makes sense to me, but requires a database update.

pavlos’s picture

Hello,

I have applied the patch, but when used in conjunction with the node aggregator module duplicate nodes still get created. Unfortunately my programming experience is not that great in order to modify the aggregator.module.

I applied the patch by hand, line by line and added the guid column in the aggregator_item table.

Temporarily I tried a sort of hack by trying to select distinct node.title from node, but the node.nid is not selected by the query...

The other thing that just came to me is to add a unique index on the link of each feed... Will give it a try...

Any more insight on this would be excellent...

pavlos’s picture

Which unfortunately does not work....

Steve Dondley’s picture

Pavlos, I need some clarification. When you say you are getting duplicates with the "node aggregator" module, are you talking about the third party module? If so, whoever is in charge of that module would need to patch that module.

steve22’s picture

Version: 4.7.0 » 4.6.6

The patch stops producing duplicate items for google news feed. But for RSS feeds from other sites, it produces incorrect URL link of the feed. URL links of all feed items are converted to link to the homepage of the feed.
e.g. http://www.mysite.com/news/item/1 becomes http://www.mysite.com

Steve Dondley’s picture

Version: 4.6.6 » 4.7.0

steve22,

The patch was written for 4.7. I'm setting the version back to 4.7.

elvis2’s picture

Title: Aggregator module using wrong unique ID for feed items? » Another Idea: Aggregator module using wrong unique ID for feed items?
Category: bug » feature

Just wanted to add my two cents worth.

Why not just check if the title (or first few characters) exist before doing the insert. This way you eliminate any duplicate titles and from different sources. This could be a feature in the settings>aggregator area where one can tick this from a checkbox option. Here is the logic:

Before an insert do a quick query

SELECT (*) FROM aggregator_item AS ai
WHERE left(ai.title, 10) != left($edit['title'], 10) AND ai.fid != $edit['fid']
{ do insert }

this looks for a match from the rss title upto 10 characters against what the table ai.title has AND to make sure the fid's are not the same. if both are true then do the insert.

This is rough logic here but I just wanted to express the idea.

killes@www.drop.org’s picture

Title: Another Idea: Aggregator module using wrong unique ID for feed items? » Aggregator module using wrong unique ID for feed items?
Category: feature » bug

restoring status

Steve Dondley’s picture

Status: Needs work » Needs review
FileSize
798 bytes

Here is the long-awaited database update code.

beginner’s picture

If it's a bug, shouldn't a patch be provided for cvs first?

Steve Dondley’s picture

The patch will likely apply cleanly to both 4.7 and cvs.

killes@www.drop.org’s picture

Status: Needs review » Reviewed & tested by the community
bozo-1’s picture

I'm having the exact same problem as steve22, running 4.7.2 : the patch works just fine as far as Google News entries are concerned, but some other feeds no longer work correctly, redirecting us to the site's homepage instead.

One such example is Toolinux, for which the RSS feed can be found here.

bozo-1’s picture

Nailed it! An else if statement got lost in the battle.

The modified code looks like this around line 856 (second chunk in the patch) :

if ($item['LINK']) {
$link = $item['LINK'];
}
if ($item['GUID']) {
$guid = $item['GUID'];
}
else {
$link = $feed['link'];
}

Which means basically that, as long as there is no $guid, we're going to overwrite $link with $feed['link'];

Replacing it with the following code works like a charm :

if ($item['LINK']) {
$link = $item['LINK'];
}
else {
$link = $feed['link'];
}
if ($item['GUID']) {
$guid = $item['GUID'];
}

Steve, can you submit another patch fixing this?

Steve Dondley’s picture

Status: Reviewed & tested by the community » Needs work

Thanks for you work on this. I'll post a patch later today. If not, certainly by the end of the coming weekend.

bozo-1’s picture

Mmh, it's weird, I'm still having issues with a couple feeds, the aggregator is pulling some old content every now and then. Did anybody else encounter this kind of trouble?

drumm’s picture

Priority: Normal » Critical
Status: Needs work » Needs review
FileSize
1.54 KB

You need to change the database files too.

bozo-1’s picture

The guid entry has already been created in the database. Actually, my latest problem only seems to concern RSS feeds which do not have statements.

drumm’s picture

The missing column is for clean installs of Drupal HEAD.

david-cognite’s picture

FileSize
3.6 KB

Incase anyone is interested, I was able to apply this fix the version or aggregator.module I'm using with 4.6.

The supplied patch file obviously didn't work, but working through it manually matching up the lines (and those from bozo) and making the required changes by hand did. Attached is the patch for the 4.6 version.

Note, I've not tested this thoroughly, but it works for me.

Cheers.
David.

Dries’s picture

Needs to be re-rolled now the install system is in. :)

drumm’s picture

FileSize
676 bytes

Updated.

killes@www.drop.org’s picture

Version: 4.7.0 » x.y.z

cvs issue. The original issue was more of a feature so I won't backport to 4.7.

Dries’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me.

Steve Dondley’s picture

Doesn't comment 17 still need to be addressed? I think I (or someone) need to roll a new patch. Have not had an opportunity and won't for a few more days still.

alexd-1’s picture

Couple of issues:

1. Will these fixes work for 4.7 - If not, how difficult to adapt?

2. Will the fix work for both RSS and Atom feeds?

I am using 4.7 and having two problems with aggregator and google news.

If I use RSS feeds I get a duplicate entries: 1 headline only with #39; instead of apostrophes 2nd headline and summary of same story correct: eg:

Follow Tony Blair#39;s Example And Send Zaoui Home - Scoop.co.nz (press release)
Google News Feed - Wed, 12/07/2006 - 5:12pm

Follow Tony Blair's Example And Send Zaoui Home
Scoop.co.nz (press release), New Zealand - 11 Jul 2006
... allow Britain to deport Algerians back to Algeria who have links to extremist organisations, New Zealand should now follow suit and send Ahmed Zaoui home under ...

Categories: Latest News

If I use atom feeds, just get the first headline, no duplication

drumm’s picture

Status: Reviewed & tested by the community » Fixed

Committed my patch.

If #17 does need to be addressed, then reopen, hopefully with a patch.

bozo-1’s picture

In repy to #29 :

I have had the same problem. The check_plain() function is applied to the plain text between and tags, resulting in some unnecessary (at least, I guess so) sanity checks.

A simple fix is to replace the following line (line 1304 in 4.7.2 aggregator.module) :

$output .= '<a href="'. check_url($item->link) .'">'. check_plain($item->title) ."</a>\n";

with :

$output .= '<a href="'. check_url($item->link) .'" target=\"_blank\">'. $item->title ."</a>\n";

and (line 1320) :

$output = '<a href="'. check_url($item->link) .'">'. check_plain($item->title) .'</a> <span class="age">'. t('%age old', array('%age' => format_interval(time() - $item->timestamp))) .'</span>';

with :

$output = '<a href="'. check_url($item->link) .'">'. $item->title .'</a> <span class="age">'. t('%age old', array('%age' => format_interval(time() - $item->timestamp))) .'</span>';

In reply to #30

Just pulled the CVS version, the issue I raised in #17 is still present. I'm out of town tomorrow, so I'll try to roll a patch for both issues on Tuesday.

drumm’s picture

Please open a new issue for that; this one was about adding guids.

bozo-1’s picture

FileSize
469 bytes

Here's a simple patch for the issue raised in #17.

bozo-1’s picture

Status: Fixed » Needs review

Switching status back to "patch". The patch applies to the CVS version.

drumm’s picture

Status: Needs review » Fixed

Committed to HEAD.

Anonymous’s picture

Status: Fixed » Closed (fixed)
sillygwailo’s picture

Version: x.y.z » 4.7.3

Did this get into 4.7? I see the change before 4.7.3 but it doesn't seem to be in 4.7 i.e. it got tagged as MAIN but not HEAD? That said (and correct me if I'm wrong), but it looks like it's going to be in for 5.0.

killes@www.drop.org’s picture

Status: Closed (fixed) » Active

no, didn't get into HEAd. If you deem this important, roll a patch from the two patches that went in. The issue with the update system has been resolved. Marking active.

sillygwailo’s picture

Steve Dondley's patch at the top for aggregator.module still applies to 4.7. The updates.inc patch doesn't, however, so I attached one, very slightly modified (183 instead of 186).

sillygwailo’s picture

Status: Active » Needs review
drumm’s picture

Status: Needs review » Needs work

If this is for 4.7 set the version correctly. If this is for HEAD, write a patch to the correct file.

sillygwailo’s picture

As I believe this is fixed in 5.0, this issue is for 4.7.

magico’s picture

Version: 4.7.3 » 4.7.5

Giving this some time, if someone wants to provide a patch that backports from 5 to 4.7.5

killes@www.drop.org’s picture

Status: Needs work » Closed (fixed)

I don't think we still need this.