To my dismay, I learned that Google has changed the format of its RSS news feeds. A link to a news article that was once:

news.google.com/news/url?...4456b3208be04&cid=1106225330&ei=cmdXROaFBYqcHY3bnakG

becomes

news.google.com/news/url?...4456b3208be04&cid=1106225330&ei=1mZXRKfuC8WeHNmLhKAH

a few minutes later. Notice the difference in the last part of the URL (the ei parameter). Perhaps Google is doing this on purpose?

At any rate, this is causing many duplicate entries in the aggregator, because each variant of the link is treated as a new item. I think the only solution is to hack Drupal to ignore the last few characters of the URL.
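
A rough sketch of what "ignoring the last part" could look like, assuming the ei parameter is the only thing that changes between fetches (that is all the two sample links above show). The function name strip_query_param is made up for illustration and is not part of aggregator.module; the URLs in the usage note are invented too.

```php
<?php
// Hypothetical helper: drop one named parameter from a URL's query string
// so that two links differing only in that parameter compare as equal.
function strip_query_param($url, $param = 'ei') {
  $parts = explode('?', $url, 2);
  if (count($parts) < 2) {
    return $url; // No query string, nothing to strip.
  }
  $kept = array();
  foreach (explode('&', $parts[1]) as $pair) {
    // Keep every name=value pair except the one named $param.
    if ($pair !== $param && strpos($pair, $param . '=') !== 0) {
      $kept[] = $pair;
    }
  }
  return count($kept) ? $parts[0] . '?' . implode('&', $kept) : $parts[0];
}
```

With this, strip_query_param('http://example.com/news/url?id=42&cid=7&ei=AAA') and the same link ending in ei=BBB both reduce to 'http://example.com/news/url?id=42&cid=7', so a lookup on the stripped link would find the earlier entry.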

Comments

Steve Dondley’s picture

I took a close look at Google's RSS feed. I see the following in there:

<guid isPermaLink="false">tag:news.google.com,2005:cluster=41ef6ba2</guid>

Perhaps this is what should be used?
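
To make the idea concrete, here is a minimal sketch of choosing a duplicate-detection key that prefers the guid and only falls back to the link or title when no guid is present. It assumes the parsed item is an array with uppercase keys such as GUID, LINK and TITLE (the aggregator code uses uppercase keys like AUTHOR, but GUID is my assumption), and item_dedupe_key is a made-up name.

```php
<?php
// Hypothetical helper: pick the most stable identifier available for an item.
// Prefers the feed's <guid>, then the link, then the title.
function item_dedupe_key($item) {
  foreach (array('GUID', 'LINK', 'TITLE') as $field) {
    if (isset($item[$field]) && trim($item[$field]) !== '') {
      // Prefix with the field name so a guid never collides with a link.
      return strtolower($field) . ':' . trim($item[$field]);
    }
  }
  return NULL; // Nothing usable to compare on.
}
```

Two items carrying the guid shown above would get the same key even if their links differ in the trailing characters.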

--
Get better help from Drupal's forums and read this.

Roadskater.net’s picture

i've been trying to hack the code in the aggregator.module to compare the current aggregator_item title with other titles in an effort to avoid duplicate titles (regardless of url, link), but i know very little php or mysql basically, so it's muddle and bumble along and test.

i had thought this was the section of code to change, around line 894

    /*
    ** Save this item.  Try to avoid duplicate entries as much as
    ** possible.  If we find a duplicate entry, we resolve it and
    ** pass along it's ID such that we can update it if needed.
    */

    if ($link && $link != $feed['link'] && $link != $feed['url']) {
      $entry = db_fetch_object(db_query("SELECT iid FROM {aggregator_item} WHERE fid = %d AND link = '%s'", $feed['fid'], $link));
    }
    else {
      $entry = db_fetch_object(db_query("SELECT iid FROM {aggregator_item} WHERE fid = %d AND title = '%s'", $feed['fid'], $title));
    }

    aggregator_save_item(array('iid' => $entry->iid, 'fid' => $feed['fid'], 'timestamp' => $timestamp, 'title' => $title, 'link' => $link, 'author' => $item['AUTHOR'], 'description' => $item['DESCRIPTION']));
  }

i had thought a check for $title = $feed['title'] around the save might avoid duplicates but i didn't get it to work.

another option would be a "clean aggregator_items and related aggregator_category_items" routine that would work like the timestamp flush routine. around line 839, something like...

for each row of aggregator_item with a title = to the currently being parsed title {
  delete all aggregator_category_item entries with that iid 
  delete all aggregator_item rows with that iid
}

if it were in that section of the code, it MIGHT be something VAGUELY like...but i'm sure this is not right...and it would be time-expensive to run for every line of the aggregator data...

  // Find every stored item whose title matches the one being parsed.
  $result = db_query("SELECT iid FROM {aggregator_item} WHERE title = '%s'", $title);
  while ($item = db_fetch_array($result)) {
    // Remove the category links first, then the item itself.
    db_query('DELETE FROM {aggregator_category_item} WHERE iid = %d', $item['iid']);
    db_query('DELETE FROM {aggregator_item} WHERE iid = %d', $item['iid']);
  }

again, i'm groping in the dark, but still groping. i'd love help.

it strikes me that perhaps an aggcleaner module might be a better approach, running off poormanscron or cron, and certainly if this title compare were to become part of the aggregator.module there should be an option to turn it on or off for those who don't need it.

the best place for the routine is probably outside of the iterative loops if possible, perhaps after a new feed has come in and been processed but then there'd need to be a loop anyway to process it all. ok i'm rambling.

thanks to anyone who takes an interest in this.
blake

Steve Dondley’s picture

I posted a patch on this here:

http://drupal.org/node/61433

In addition to applying the patch, you have to create a new field in the aggregator_item table called "guid". It's a varchar with a length of 255.

Roadskater.net’s picture

i've put in your patch and made the db change and it is TONS better. thank you.

however, i am still getting some duplicates for some reason, likely that the guid changed somehow. this is nowhere near the number i got before.

so if there's anyone out there who can help me know what i need to do in order to clean my aggregator_item and aggregator_category_item tables of all duplicates based on title, keeping those with no duplicates and the latest timestamped of those that are duplicates, i'd really love to figure this out.
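
One possible shape for that cleanup, sketched in plain PHP on an array of rows rather than against the live database. It assumes each row has the iid, title and timestamp columns of aggregator_item, and the function name duplicate_iids_to_delete is invented for illustration. Note that it treats any two identical titles as duplicates, regardless of feed.

```php
<?php
// Hypothetical helper: given rows from aggregator_item, return the iids of
// every row that shares a title with a newer row. Rows with a unique title,
// and the newest row of each duplicated title, are kept.
function duplicate_iids_to_delete($rows) {
  $newest = array(); // title => newest row seen so far
  $delete = array();
  foreach ($rows as $row) {
    $title = $row['title'];
    if (!isset($newest[$title])) {
      $newest[$title] = $row;
    }
    elseif ($row['timestamp'] > $newest[$title]['timestamp']) {
      // This row is newer: the previously kept row becomes the duplicate.
      $delete[] = $newest[$title]['iid'];
      $newest[$title] = $row;
    }
    else {
      $delete[] = $row['iid'];
    }
  }
  sort($delete);
  return $delete;
}
```

Each returned iid would then need to be deleted from both aggregator_category_item and aggregator_item, in that order.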

i'll say for myself that i've been studying mysql and php diligently and looking at lots of drupal code, but i'm not there yet.

thanks for the patch, and thanks in advance to anyone who educates me a bit. i know how to execute mysql and open the db in php, and i know how to send some queries, but some i don't follow...especially those where %s and %d are used...i've searched for info on this on the web and will be hitting the bookstores when i get enough money for one of those php mysql books. what do you recommend?

thanks,
blake

Steve Dondley’s picture

Can you please look in the database and see if the guid did change? Also, have you double-checked to make sure the duplicates are in fact duplicates? Maybe the headlines are identical but go to different publications (like on a wire story).

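%s is just a format code for a string. %d is a format code for a decimal integer. They work like printf placeholders: db_query() replaces each one, in order, with the extra arguments you pass after the query string. This is handled by Drupal's database interface and is not MySQL syntax.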
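
Here is a simplified illustration of that substitution. It is not Drupal's actual code: format_query is a made-up name, addslashes() only stands in for the real database escaping, and the curly braces around the table name (which Drupal uses for table prefixes) are left untouched.

```php
<?php
// Hypothetical illustration of printf-style placeholder substitution.
// %d arguments are forced to integers; %s arguments are escaped as strings.
function format_query($query) {
  $args = func_get_args();
  array_shift($args); // Drop the query template, leaving only the values.
  return preg_replace_callback('/%[ds]/', function ($match) use (&$args) {
    $value = array_shift($args); // Placeholders consume arguments in order.
    if ($match[0] === '%d') {
      return (string) (int) $value;
    }
    return addslashes($value); // Stand-in for real database escaping.
  }, $query);
}
```

Calling format_query("SELECT iid FROM {aggregator_item} WHERE fid = %d AND title = '%s'", 3, "Bob's story") fills in 3 for %d and the escaped title for %s, which is why the quotes around '%s' are written in the query but none are needed around %d.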

palmosc’s picture

Sorry for the question, but I'm not sure where this goes: do I add it to the aggregator module, or somewhere else?
Jim