I have now seen this on my third FeedAPI install, across a couple of versions. I am not sure whether this is FeedAPI related at all:

Links from Drupal to the original articles or to the original site result in a "Stopped" message in Firefox and no action in other browsers (I haven't tested extensively).

http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://gigaom.co...

Copying the URL and pasting it into the browser address field results in the same behaviour.

Using the last part of the URL works fine: http://gigaom.com/2007/10/26/productivity-goes-social-with-jive/&cid=0&e...

Comments

aron novak’s picture

$ wget "http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R&url=http://gigaom.com/2007/10/26/productivity-goes-social-with-jive/&cid=0&ei=tVoiR_SHAYuUaKDL8cgB"

If I do this, I see strange things. The first few times, the request got redirected (301 Moved temporary). After a few more tries, I got 403 Forbidden.
This may be a bug in news.google.com; I think it's definitely not a bug in FeedAPI.

alex_b’s picture

Is news.google.com starting to protect their RSS service from aggregators?

alex_b’s picture

... thanks for checking this btw. Alex

Jo Wouters’s picture

Yes, it looks like Google filters on the user agent:

wget "http://news.google.com/news?hl=en&ned=us&ie=UTF-8&q=drupal&output=rss"
gives:

Connecting to news.google.com|209.85.137.99|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
02:37:04 ERROR 403: Forbidden.

wget --user-agent="testing" "http://news.google.com/news?hl=en&ned=us&ie=UTF-8&q=drupal&output=rss"
gives:

Connecting to news.google.com|209.85.137.104|:80... connected.
HTTP request sent, awaiting response... 200 OK

I did not find any information about it online though, neither in the Terms of Use of Google News nor in any articles that describe this.
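The same check can be scripted. The following is a minimal Python sketch, not anything FeedAPI does itself: `build_request`, `fetch` and the `USER_AGENT` string are hypothetical names and values chosen for illustration, and nothing here is known to be a user agent that Google News accepts.

```python
import urllib.error
import urllib.request

# Hypothetical user agent string, for illustration only.
USER_AGENT = "ExampleAggregator/0.1 (+http://example.com/contact)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT, timeout=10):
    """Fetch url and return (status, body).

    HTTP errors such as 403 Forbidden are returned as a status code
    instead of being raised, so two user agents can be compared.
    """
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""
```

Calling `fetch(feed_url, "testing")` and `fetch(feed_url, "Wget/1.10")` side by side would be one way to repeat the comparison above; what the server returns today may well differ from the results reported in this thread.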

alex_b’s picture

Hi Jo,

Thanks for checking this out... strange. It really looks like Google is starting to build walls. A way around this would be to extract the target URL of the article from the news.google.com URL. This would at least be a strategy until Google News stops embedding the original URL.

Any other ideas?

Alex
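As a rough illustration of the idea of pulling the article URL out of the news.google.com link, here is a minimal Python sketch. It assumes the original URL is carried in a `url` query parameter, as in the wget example earlier in this thread; `extract_target_url` is a hypothetical helper, not part of FeedAPI.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target_url(link):
    """Return the value of the `url` query parameter, or None if absent.

    Assumes the redirect link embeds the original article URL in a
    parameter named `url`, as in the example from this thread.
    """
    query = urlsplit(link).query
    values = parse_qs(query).get("url")
    return values[0] if values else None


google_link = (
    "http://news.google.com/news/url?sa=T&ct=us/0-0&fd=R"
    "&url=http://gigaom.com/2007/10/26/productivity-goes-social-with-jive/"
    "&cid=0&ei=tVoiR_SHAYuUaKDL8cgB"
)
print(extract_target_url(google_link))
```

This only works while the link format keeps embedding the original URL, and it says nothing about whether rewriting the links is permitted by the feed's terms of use.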

Jo Wouters’s picture

Alex,
That would not be a solution:

1) What I tested was fetching the RSS feed from news.google.com, and even that didn't work because they block the wget user agent (so they block both the RSS feed itself and the link to the original article).

2) filtering out the target URL would violate their terms of use (http://www.google.com/support/news/bin/answer.py?answer=59255&hl=en ): "include a link to the Google News cluster of related articles for each news item, using the link provided in the Google News feed."

I think the right solution would be to use a user agent that is accepted by Google News. They must have a valid reason to block these kinds of requests.

I posted a question in Google News Help ( http://groups.google.com/group/news-HelpUsers/browse_thread/thread/30315... )

Btw, blogsearch.google still seems to accept requests from wget (without a special user agent).

alex_b’s picture

Great that you posted this question to Google. I am curious to see their response...

aron novak’s picture

Status: Active » Closed (won't fix)

In my opinion FeedAPI should not include rss-publisher-related ugly hacks. If a site really needs to process such awkward feeds, it should be done by a separate parser.

AntiNSA’s picture

Component: Code » Code feedapi (core module)

I am looking at how to parse Google feeds correctly... if anyone knows, I'd appreciate some leads...