When users have permission to set their default CC license, this license is loaded each time hook_user() is invoked with $op == 'load'. When a user has selected a CC license, this triggers a call to the CC API. Sometimes the API does not respond in time, resulting in the message: "Error accessing CC API: Connection timed out".

What shall we do about it?
* Modify hook_user > load
* Implement some kind of cache for CC API return in the cc lib?

Comments

turadg’s picture

How about caching it and only updating it in a cron hook? If it gets stale, anyone can hit cron.php. To guarantee it's there, it can run on module enable too.
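A minimal sketch of that last idea, assuming a hypothetical warm-up helper (creativecommons_refresh_api_cache() is not an existing function in the module):

```php
<?php
/**
 * Implementation of hook_enable() (goes in creativecommons.install).
 * Warm the cache when the module is enabled, so it isn't empty
 * before the first cron run.
 */
function creativecommons_enable() {
  // Hypothetical helper that would fetch and store the CC API data.
  creativecommons_refresh_api_cache();
}
```

The same helper would be called from hook_cron() so both paths share one code path.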

toemaz’s picture

Yes, using cron seems to be a good idea. The question is where we store it. I presume there is quite some data to capture: all the CC license links per jurisdiction, no?

toemaz’s picture

Ok, I noticed that in creativecommons_return_xml(), there is already a variable used to store the output of the API. Let's have a look how this can be fixed.

toemaz’s picture

If we wish to have a stable 1.0 version, this issue needs to be resolved. Without it, it's not production ready.

turadg’s picture

Status: Active » Needs work
kreynen’s picture

I am already using this module on several production sites and haven't had any issues with the API connections. In Open Media implementations we normally configure cron to run every 15 minutes to push files through Media Mover configs, synchronize Feeds, etc. We wouldn't need or want to refresh the cached data on every cron run for those sites.

Rather than implementing the lookup just in a cron hook, why not timestamp the variable, add an expiration variable to the settings, and only update it when the variable goes "stale", either on cron or in the current creativecommons_return_xml()? If the variable is expired but can't be updated because of API issues, display a warning. If the admin doesn't like seeing the warning and can't resolve the API issues, they can extend the expiration. Regardless, the module will function as long as at least one API call has saved the variable.
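A rough sketch of that fallback pattern for D6. The variable names, the creativecommons_api_cache_lifetime setting, and the creativecommons_fetch_from_api() helper are all assumptions for illustration, not the module's actual API:

```php
<?php
/**
 * Return cached API data for $uri, refreshing it only when stale,
 * and falling back to the expired copy if the API is unreachable.
 */
function creativecommons_cached_xml($uri) {
  $key = 'creativecommons_xml_' . md5($uri);
  $data = variable_get($key, NULL);
  $timestamp = variable_get($key . '_time', 0);
  // Admin-configurable lifetime (seconds), defaulting to 24 hours.
  $lifetime = variable_get('creativecommons_api_cache_lifetime', 86400);

  if ($data === NULL || time() > $timestamp + $lifetime) {
    // Stale or missing: try the API, but keep the old value as a fallback.
    $fresh = creativecommons_fetch_from_api($uri); // hypothetical helper
    if ($fresh !== FALSE) {
      variable_set($key, $fresh);
      variable_set($key . '_time', time());
      return $fresh;
    }
    if ($data !== NULL) {
      drupal_set_message(t('Could not reach the CC API; using expired cached data.'), 'warning');
      return $data;
    }
    return FALSE; // No cache and no API: genuine failure.
  }
  return $data;
}
```

The key property is the middle branch: an API timeout degrades to a warning instead of an error, as long as one successful call has ever populated the variable.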

toemaz’s picture

I did a quick review, and if I'm not mistaken, a serious rewrite is necessary of the way the cc api data is stored.

Suppose you let people select their jurisdiction and default license, then with the current implementation, chances are that you end up with a serious list of variables in the database. For each license uri (based on cc type, jurisdiction, locale) there are two variables used; the effective license data and a timestamp to track when the data needs to be refreshed. This implementation can introduce a performance hit.

Two aspects of implementing an interface with the CC API need to be considered:
- Collecting & refreshing the data
- Storing the data

Storage

Options are the database (variables table or a dedicated table) and the filesystem (a file in the files directory).

Collecting

Either all the data from the API can be collected in one go and then refreshed at cron time, or the data can be collected when it is required, i.e. when saving a node, user or site data. The latter can fail when the CC API is not reachable.

Can someone review this write up?

toemaz’s picture

@kreynen

Can users on your site select a default license?

To be honest, I've encountered this API issue quite a lot already, but I guess our default settings are different.

kreynen’s picture

@toemaz Yes. Users can select a default license from the enabled licenses. 1000+ nodes were added to http://www.denveropenmedia.org/stats using this module. I also test this module on my laptop.

How would I configure the module to cause the error?

turadg’s picture

fwiw, I just encountered this same error on my QCommons.org site.

Error accessing CC API: Connection timed out

kreynen’s picture

I've chatted with the Creative Commons folks on #cc about this. They've been having performance issues with the API that are likely network related. I've requested some type of a public API status page so when there are known issues, people aren't chasing their tails debugging this on the wrong end.

With the recent release of the Creative Commons Plugin for WordPress, which doesn't cache the API response either, this has the potential to get worse, so adding the option to cache seems like the way to go.

I don't have any experience with caching 'the Drupal way'. Does anyone have suggested reading or a module that does it well?

balleyne’s picture

Not sure what best practice is here either. I mostly just tweaked the old caching code a little when I was working on this last summer, but with the module doing a lot more than what the original caching mechanism was intended to handle, a rewrite probably does make sense.

Unless there's a better/recommended "Drupal way," I like toemaz's idea of collecting the data and updating at cron time, except there's a question of how much data to collect. I don't think it'd be reasonable to pull down licensing information for *every* jurisdiction, because many sites will only use a few different jurisdictions, if not only one or two.

I'm not sure if this is too complicated, but maybe we should take an approach in the middle... for example, we could update the cache at cron time, but...
- only get licensing information for jurisdictions that are in use on the site (i.e. if there's a node using a license under that jurisdiction, or if that jurisdiction is a user or site-wide default; an American site doesn't need the Polish licenses, etc.)
- only get licensing information for available licenses (i.e. skip disabled licenses; a site using only free licenses wouldn't need to cache for ND or NC)

Also, I'm not familiar with cron.php from a development perspective yet, but I think a daily update would be fine (that is, if a site's cron is running hourly, we don't need to check the CC API every time...).

Thoughts?

I like the idea of a dedicated table for the API data. I think it makes sense to cleanly separate it from the Drupal variables, which seem to be used more for settings than data.

And we could use better error handling. For example, IIRC, the current caching mechanism will check the API if more than 24 hours have passed, but I think it gives an error if the API can't be reached, rather than just using the old cached data (and maybe showing some kind of warning). The error handling could be made much friendlier.

kreynen’s picture

The cron hook is really simple, but the delay should be a configurable variable. When I configure sites that are processing media, we run Drupal's cron every 10-15 minutes. That could result in more calls to the API in some cases.

An alternative to processing this on cron would be to leave the code basically as is, with a check to see if the cached response of a specific call has expired. If it has, try the API call. If the API times out, use the cached value and display an error indicating that the cached value is expired, with a warning that says something like...

Error accessing CC API: Connection timed out. If the Creative Commons API Cache variable isn't adjusted, the Creative Commons module will attempt to connect to the API again and update the cache the next time this action is completed.

Not the best wording, but it seems like caching the API calls that are made based on the site's CC configuration makes the most sense.

toemaz’s picture

Regarding the number of possible licenses to cache, I run a website with international appeal and the jurisdiction can be selected by the user. I foresee that there will be users for each jurisdiction, believe it or not. So the cache system has to cover the worst case scenario unfortunately.

I'm pro a dedicated db table and a cron run at configurable time that updates outdated licenses.

balleyne’s picture

@toemaz: so, to be clear, would the solution I suggested above meet your requirements? That is, cron would update any available licenses from jurisdictions presently in use, and if previously unavailable licenses are made available or if a new jurisdiction comes into use, that information would be retrieved from the API as needed, if it's needed before the next cron run.

I think that would cover your use-case because, worst-case scenario, if all jurisdictions/licenses are in use/available, it'll just grab everything. But it'd still keep the cache smaller for other sites.

Make sense?

toemaz’s picture

@balleyne yes, totally!

nkinkade’s picture

We have done a bit of reconfiguration and shuffling of services, and I think the problems with the API are resolved now. Using Apache Bench I'm getting consistent response times from the API averaging just over 100ms, with absolutely no timeouts or randomly slow responses. This doesn't mean that you shouldn't still implement caching of the API responses, just that you should now have some breathing room. In fact, we cache API responses with Varnish. The results of API requests don't change very often, which makes the API a good candidate for caching, except for these two calls:

http://api.creativecommons.org/docs/readme_dev.html#license-class-issue
http://api.creativecommons.org/docs/readme_dev.html#license-class-get

Sorry about that issue. In the future, if you ever notice any problems with some CC web service, by all means jump on irc.freenode.net#cc, like @kreynen did, and/or send a message to webmaster@creativecommons.org.

Aloha,

Nathan

toemaz’s picture

Another issue why the connection with the API needs to be improved: #1022952: "Error accessing CC API: Not Found"

balleyne’s picture

Assigned: Unassigned » balleyne

I'm working on this now, with the solution outlined here:
http://drupal.org/node/873738#comment-3447322

balleyne’s picture

Ok, I've *almost* got this figured out, but I need some feedback.


First, I've committed one change so far: the custom cache system has been replaced with Drupal's Cache API. Rather than caching API data using Drupal variables and maintaining/checking timestamps through the module, as of this evening the module now uses Drupal's cache_set() and cache_get() functions. I've also implemented hook_flush_caches(), so if the "Clear Cached Data" button is clicked on the Admin > Site configuration > Performance page, the CC API caches will be cleared.

Cached API data is set to expire 24 hours after its retrieval, meaning that the cache will be cleared the first time that cron runs after 24 hours have passed. Another option is to use the CACHE_PERMANENT setting, so that data from the API is cached indefinitely unless the "Clear Cached Data" button is pushed. I went with 24 hours because that's roughly how the module was behaving before -- no user action is required to get updated data from the API. The data is still retrieved when it is first requested and not found in the cache.
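The pattern described above looks roughly like this in D6; the cache ID scheme and the creativecommons_fetch_from_api() helper are assumptions, not the committed code:

```php
<?php
/**
 * Return API data for $uri, using Drupal's Cache API with a
 * 24-hour expiry; fetch and cache on a miss.
 */
function creativecommons_get_api_data($uri) {
  $cid = 'creativecommons:' . md5($uri);
  $cached = cache_get($cid, 'cache');
  if ($cached && !empty($cached->data)) {
    return $cached->data;
  }
  $data = creativecommons_fetch_from_api($uri); // hypothetical helper
  if ($data !== FALSE) {
    // A unix-timestamp expiry means the item is removed on the first
    // cron run after 24 hours, matching the behavior described above.
    cache_set($cid, $data, 'cache', time() + 86400);
  }
  return $data;
}

/**
 * Implementation of hook_flush_caches(), so "Clear Cached Data"
 * on the Performance page also clears CC API entries.
 */
function creativecommons_flush_caches() {
  // With a dedicated bin this would return its table name; entries in
  // the default 'cache' bin are cleared with everything else.
  return array();
}
```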


Second, I have implemented a working hook_cron(), which preloads the cache with the following API calls: the license list for any active jurisdictions (where active means (a) a node exists with a license under that jurisdiction, (b) it's the site default, or (c) it's the default jurisdiction for a user), and license details for all enabled licenses across all active jurisdictions. It's working: a call to Drupal cron preloads the API data into the cache.

I haven't committed the change yet. There are a couple issues: (1) frequency, (2) execution time.

1. I'm not sure how to handle both a site that runs cron daily, and a site that runs cron every 15 minutes.
@kreynen, you suggested that "the delay should be a configurable variable" -- did you mean a variable in the CC module settings? AFAIK, hook_cron() doesn't have a configurable delay distinct from cron in general. It'd be ideal if the site administrator didn't have to think about this. I suppose a configurable variable with a reasonable default could work, but I'm hesitant to add even more to the configuration unless that's clearly the best/only way.

2. I'm not sure how the hook_cron() implementation would perform on a site like toemaz's. D7 has a cron queue hook ( http://api.drupal.org/api/drupal/modules--system--system.api.php/functio... ), but there's no such thing in core for D6. The CC cron process seems to take about 2 seconds on my Drupal test site for ~50 API calls (but that's on my laptop, through my residential ISP, not from a server). I'm just concerned about total execution time of the script if we're dealing with a massive site that may have a few hundred API calls.

A quick count of the total number of jurisdictions times the total number of licenses would mean 450-500 calls to the API. The total number of licenses is a common case, but the total number of jurisdictions isn't (they'd all have to be in use). Still, do we want the hook_cron() implementation to call the CC API 450-500 times? Potentially daily?


I think this modified solution would make sense:
- hook_cron() only preloads license information that is absent from or expired in the cache (that way, if a new license was enabled, or a new jurisdiction was used, related license details would be preloaded into the cache on a cron run, but it's not updating *everything* every time)
- The cache expiry time is either increased, or set to permanent (i.e. no auto-expire). Maybe a value like one week would be reasonable. If new licenses are released, they'll become available within a week, or sooner if a site administrator clears the Drupal cache manually; API calls would be way down.


I'm going to sleep on it, and also ask CC folks for feedback in the next day or two. It'd be nice to hear back from people using this in different production environments too.

kreynen’s picture

1. I'm not sure how to handle both a site that runs cron daily, and a site that runs cron every 15 minutes.
@kreynen, you suggested that "the delay should be a configurable variable" -- did you mean a variable in the CC module settings? AFAIK, hook_cron() doesn't have a configurable delay distinct from cron in general. It'd be ideal if the site administrator didn't have to think about this. I suppose a configurable variable with a reasonable default could work, but I'm hesitant to add even more to the configuration unless that's clearly the best/only way.

Right. creativecommons_cron() would be called as often as Drupal's cron runs, but the first check would determine whether now() is past the time the Creative Commons cache was last updated plus creativecommons_cron_minutes. If it is, or if the variable wasn't set, the rest of the creativecommons_cron() hook would be processed. If the creativecommons_cron_minutes variable was set to 1440, the cache would only be updated once a day even if cron was running every 15 minutes.

This is similar to how the core Aggregator modules add an additional delay setting for each feed. Even if cron is running every 15 minutes, a feed can be skipped while processing cron for up to 4 weeks. The Feed API and Feeds include similar options.
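A minimal sketch of that throttle, assuming the module maintains a creativecommons_cron_last timestamp variable (both variable names here are illustrative):

```php
<?php
/**
 * Implementation of hook_cron(): skip the refresh unless the
 * configured delay has elapsed, like Aggregator skips feeds
 * that aren't due yet.
 */
function creativecommons_cron() {
  // Delay is configured in minutes; 1440 = once a day.
  $interval = variable_get('creativecommons_cron_minutes', 1440) * 60;
  $last_run = variable_get('creativecommons_cron_last', 0);

  if (time() < $last_run + $interval) {
    return; // Not due yet, even though Drupal's cron fired.
  }
  variable_set('creativecommons_cron_last', time());
  // ... refresh the cached CC API data here ...
}
```

This keeps the API call rate independent of how often the site runs cron.php.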

balleyne’s picture

Ok, makes sense. I'd looked at the Aggregator module's hook_cron() implementation.

I've committed the implementation of hook_cron(). It's set to run no more often than every 24 hours, and it updates only expired (or absent) items in the cache.

This issue isn't fully resolved yet, because if the cache expires and is cleared on cron run, but the API is somehow unavailable for update, there's still a potential situation where API errors could be raised. The cache and cron changes so far do significantly decrease the likelihood of an API call during normal (non-cron) requests.

That could be fixed with a permanent flag on the cache, so that there's *always* cached data to fall back on (unless someone manually clears it), and cron would force an update. My hesitation here is the up to 500 API requests in a single cron call (I'm worried about the overall execution time of cron.php). D7's cron queue would help, but I'm not sure yet what's best for D6.
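The permanent-cache fallback could look something like this sketch (cache ID scheme and the creativecommons_fetch_from_api() helper are assumptions):

```php
<?php
/**
 * Store license data with CACHE_PERMANENT so a failed refresh never
 * leaves a gap; cron overwrites entries rather than waiting for expiry.
 */
function creativecommons_cache_license($uri, $data) {
  cache_set('creativecommons:' . md5($uri), $data, 'cache', CACHE_PERMANENT);
}

/**
 * Called from hook_cron() for each item due for refresh.
 */
function creativecommons_refresh_license($uri) {
  $fresh = creativecommons_fetch_from_api($uri); // hypothetical helper
  if ($fresh !== FALSE) {
    creativecommons_cache_license($uri, $fresh);
  }
  // On failure the permanent entry stays in place, so normal page
  // requests keep working even while the API is down.
}
```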

I'll give it some thought...