
Improve caching scheme (See description and comments for specific improvements)

Project: Amazon Store
Version: 6.x-2.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active

Issue Summary

If someone visits /amazon_store/item/ with a bogus or obsolete ASIN, the store module will always call amazon_store_http_request which contacts the Amazon servers. This is because we don't cache failed lookup results for a bogus ASIN or Browse Node.

So, if the Amazon Store site builder/webmaster (or an external site) links to detail pages for one or more malformed or non-existent ASINs in the site's store, or a misbehaving bot or scraper starts hitting the site's /amazon_store/item/xxx pages, you will probably run up against the 2000 requests per hour throttle limit pretty quickly. That will probably cost you referrals, because legitimate users will be starved of API requests.

How bad is the potential problem? This Amazon developer forum post gives an indication. (Of course, there are people who put up hundreds if not thousands of sites using some kind of Amazon API-based store system, creating all those stores using the same account, so of course they hit the wall almost immediately. But that's not what I'm trying to fix here.)

Enabling Drupal's caching mechanisms might mitigate the potential for damage, but I suspect not, because the ctools item detail panel doesn't cache by default (as far as I can see).

One approach might be to cache specific types of lookups where we can determine that it was a bogus ASIN or other malformed request based on the response from the Amazon servers.
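A minimal sketch of that approach, written in Python purely to illustrate the control flow (the module itself is PHP). The `fetch_item` callable, the `LookupResult` shape, and the `"invalid_asin"` and `"timeout"` error labels are hypothetical stand-ins, not names from the Amazon API or from this module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class LookupResult:
    asin: str
    data: Optional[dict]   # item data on success, None on failure
    error: Optional[str]   # hypothetical error label, None on success


class NegativeCachingLookup:
    """Wraps a remote lookup and remembers both good and known-bad ASINs."""

    # Hypothetical labels for errors that mean "this ID will never resolve".
    PERMANENT_ERRORS = {"invalid_asin"}

    def __init__(self, fetch_item: Callable[[str], LookupResult]):
        self._fetch_item = fetch_item
        self._cache: Dict[str, LookupResult] = {}
        self.remote_calls = 0  # counts how often we actually go upstream

    def lookup(self, asin: str) -> LookupResult:
        cached = self._cache.get(asin)
        if cached is not None:
            return cached
        self.remote_calls += 1
        result = self._fetch_item(asin)
        # Cache successes and permanent failures. Transient errors
        # (timeouts, throttling) are not cached, so they can be retried.
        if result.error is None or result.error in self.PERMANENT_ERRORS:
            self._cache[asin] = result
        return result


def fake_fetch(asin: str) -> LookupResult:
    """Stand-in for the remote call, for demonstration only."""
    if asin == "B000GOOD01":
        return LookupResult(asin, {"title": "Example item"}, None)
    if asin == "TIMEOUT":
        return LookupResult(asin, None, "timeout")
    return LookupResult(asin, None, "invalid_asin")
```

With this wrapper, repeated hits on the same bogus ASIN cost one upstream request instead of one per page view. The important design choice is to cache only failures that the response identifies as permanent.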

I suspect this feature might require complex modifications to the core API so we might want to consider this for a major version upgrade.

Any feedback and/or advice is most welcome.

Comments

#1

Oh, and lest we think that returning a 404 (not found) HTTP response will help: Probably not.

In my experience, search engine spiders and other bots keep trying for months to see if the 404'd page will reappear. :(

#2

If I remember right, this is called negative caching in the DNS world, and it would be appropriate here.

Note that we should have nofollow on everything in the Amazon Store world, so at least theoretically, we won't have robots causing trouble.

Note also that searches *are* cached. But item lookups are not, as you point out.

#3

So, would it be appropriate (in the context of the Amazon Store application) to cache error returns for ASIN item lookups if the return data indicates that it's an invalid ASIN that caused the error?

If so, it might be useful to have a separate caching policy for cached error returns -- might want to cache for shorter periods, or longer... for example, a week, or a month! After all, we're not holding on to any valid Amazon data (really just the fact that the ASIN lookup returned an invalid value, which should be a more-or-less persistent status).

(Also, if we add this feature, I can see it might be useful / desirable to offer it as a configurable option in the store admin settings pages...)
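To make the idea of a separate policy concrete, here is a rough Python sketch (illustrative only, since the module is PHP). The class name, the method names, and both lifetime values are assumptions chosen for the example. In a real implementation the two lifetimes would presumably be the configurable admin settings.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Illustrative lifetimes only; neither value comes from the module.
VALID_TTL_SECONDS = 24 * 60 * 60         # one day for real item data
INVALID_TTL_SECONDS = 7 * 24 * 60 * 60   # one week for "no such ASIN"


class TwoPolicyCache:
    """Cache with one lifetime for valid items and another for invalid ASINs."""

    def __init__(self, clock: Callable[[], float] = time.time):
        self._clock = clock
        # asin -> (expires_at, data, is_invalid)
        self._entries: Dict[str, Tuple[float, Any, bool]] = {}

    def set_valid(self, asin: str, data: Any) -> None:
        self._entries[asin] = (self._clock() + VALID_TTL_SECONDS, data, False)

    def set_invalid(self, asin: str) -> None:
        # Only the fact that the lookup failed is stored, no Amazon data.
        self._entries[asin] = (self._clock() + INVALID_TTL_SECONDS, None, True)

    def get(self, asin: str) -> Optional[Tuple[Any, bool]]:
        """Return (data, is_invalid), or None if absent or expired."""
        entry = self._entries.get(asin)
        if entry is None:
            return None
        expires_at, data, is_invalid = entry
        if self._clock() >= expires_at:
            del self._entries[asin]
            return None
        return data, is_invalid
```

The injectable `clock` is only there so the expiry behaviour can be checked without waiting. A caller treats `None` as "ask Amazon" and `(None, True)` as "known bad, do not ask".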

#4

"Note that we should have nofollow on everything in the Amazon Store world, so at least theoretically, we won't have robots causing trouble."

Unless, of course, you actually *want* your store's contents to be visible in a search engine's index. For my site, I actually do want Google/Yahoo/Bing to find my Amazon items. And, for example, if you are running a specialty store with a few hundred items, why would you not want to be indexed?

#5

Title: Improve caching scheme: Cache failed lookups (to minimize API calls and avoid throttling; mitigate DOS attacks) » Improve caching scheme (See description and comments for specific improvements)

This would be a big win for people running multiple 'stores' (sites) using one Amazon API account:

Cache: Ability to share cached lookups between Drupal Sites.

This is the recommended approach to avoiding premature throttling. The cached lookups (ASINs, Browse Nodes, etc.) should be shared between sites.

So, all cached data should go into a shared database. This would probably require moving away from using the Drupal cache API. I know Drupal has APIs for switching databases when making calls to db_query() and related functions.
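As a rough illustration of the shared-store idea, here is a Python sketch using SQLite as a stand-in for whatever shared database the sites would actually use. The table name, column names, and the `kind:identifier` key format are all assumptions made for this example, and a Drupal implementation would be written in PHP against its own database layer.

```python
import json
import sqlite3
import time
from typing import Optional


class SharedLookupCache:
    """Lookup cache kept in one database that several sites open."""

    def __init__(self, db_path: str):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS lookup_cache ("
            " cache_key TEXT PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " expires_at REAL NOT NULL)"
        )
        self._conn.commit()

    @staticmethod
    def _key(kind: str, identifier: str) -> str:
        # kind is e.g. "asin" or "browse_node" (hypothetical labels).
        return f"{kind}:{identifier}"

    def set(self, kind: str, identifier: str, payload: dict,
            ttl_seconds: float) -> None:
        self._conn.execute(
            "INSERT OR REPLACE INTO lookup_cache VALUES (?, ?, ?)",
            (self._key(kind, identifier), json.dumps(payload),
             time.time() + ttl_seconds),
        )
        self._conn.commit()

    def get(self, kind: str, identifier: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT payload, expires_at FROM lookup_cache"
            " WHERE cache_key = ?",
            (self._key(kind, identifier),),
        ).fetchone()
        if row is None or row[1] <= time.time():
            return None  # missing or expired: caller must ask Amazon
        return json.loads(row[0])
```

Each site would open the same database, so a lookup made by one store is a cache hit for all the others. Because the keys do not include a site name, every store shares a single request budget against the API account.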