I'm starting to work on distance searching/views integration on Geofield (see #1469956: [Meta] - Improve Views-powered Geofield proximity searches), and it occurred to me that an exposed filter doing geocoded views searches is likely to run into API limits pretty quickly. Under our regular use (saving geocoded data to geofield), this isn't really an issue, but with views filters I don't have anywhere to store the geocoded data. By utilizing Drupal's caching system, this should be relatively easy to do.

Ideally, I think we could make an option that states whether or not we cache a result. The data probably should be segmented into its own table (cache_geocode?). For fields, it's probably not worth the effort, but it makes sense if we need it for other purposes.

One thing that might be an issue is the various providers' Terms of Service. Looking at Google's, for example (https://developers.google.com/maps/terms#section_10_1_3), they forbid caching results unless you're doing it for performance reasons, which I think could be reasonably argued for something like a views filter search.


Comments

phayes’s picture

Good idea.

What we should do is base the cache-keys off the md5 hash of the incoming data (that way it works on all data, including KML files and the like). Then the results should be a gzipped serialized geoPHP object.

geocoder function should take a $cache parameter, and an option should be added to geocoder widget settings.
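A minimal sketch of that proposal, assuming hypothetical helper names and a cache_geocode bin (neither exists in the module at this point):

function geocoder_cache_set_result($handler, $data, $geometry) {
  // An md5 of the serialized input works for any data, including KML/GPX blobs.
  $cid = $handler . ':' . md5(serialize($data));
  // Store a gzipped, serialized geoPHP object to keep the rows small.
  cache_set($cid, gzcompress(serialize($geometry)), 'cache_geocode');
}

function geocoder_cache_get_result($handler, $data) {
  $cid = $handler . ':' . md5(serialize($data));
  if ($cached = cache_get($cid, 'cache_geocode')) {
    return unserialize(gzuncompress($cached->data));
  }
  return FALSE;
}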

phayes’s picture

Status: Active » Needs work

I've added static caching; we still need to add persistent caching and make it configurable in the UI.

sdboyer’s picture

i'll just weigh in quickly and say +1 to having a separate cache bin, e.g., cache_geocode. in fact, i might even go so far as to recommend *multiple* separate bins, given that different providers have different terms of use. that way you can easily create tailored clearing logic for each provider.

you COULD utilize key prefixing and wildcard clears...but wildcard clears aren't necessarily the sanest thing on all cache backends (*COUGH* memcache), so bins are still the cleanest way to segment. that or cache tags, but you can't rely on those until D8 - they're now in core, iirc.

phayes’s picture

Relevant IRC discussion cut and paste:

[09:19am] sdboyer: and if you end up agreeing with my point in http://drupal.org/node/1515372#comment-5970438, there'd likely be even more
[09:19am] Druplicon: http://drupal.org/node/1515372 => Caching Results => Geocoder, Code, normal, needs work, 3 comments, 2 IRC mentions
[09:19am] sdboyer: ah well, guidance i can do, but i can't promise a patch
[09:19am] sdboyer: sorry :
[09:19am] sdboyer: :(
[09:21am] phayes: Hi sdboyer - interesting idea about having different cache bins - I was thinking we would just do prefixing...
[09:22am] sdboyer: yeah prefixing has its limitations
[09:22am] sdboyer: cache tags really are the ideal solution to your issue here, but as i said, can't rely on em till 8
[09:22am] sdboyer: you might get a different answer from someone else, but this to me looks like a valid case for multiple cache bins
[09:23am] phayes: Makes the code much trickier - especially given that additional handlers can be added dynamically by other modules...
[09:23am] sdboyer: might seem a little weird, but if it helps users play nice with a provider's TOS with minimum interaction...then yeah
[09:23am] sdboyer: well
[09:23am] sdboyer: if you wanted to do multiple cache bins, my approach would be
[09:23am] sdboyer: have a cache_geocode which is used if no other logic is provided
[09:24am] phayes: Here's what we should do -- allow plugins to define a cache-bin -- if none is defined in the plugin, we use a general cache_geocode
[09:24am] sdboyer: when cache clearing routines run, you check to see if the logic for that provider has the appropriate function for cache management. if not, use generic logic
[09:24am] sdboyer: yep
[09:24am] sdboyer: exactly
[09:25am] phayes: That way all the "local" handlers (kml, gpx, etc.) can just use cache_geocode
[09:25am] sdboyer: yep
[09:25am] sdboyer: it's the kind of thing that is more elegantly expressed if you have an abstract parent for the generic logic (which is part of why i bring up the OO question)
[09:25am] sdboyer: but it's eminently doable in procedural as well
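A rough sketch of the bin-fallback idea from the IRC discussion above (the 'cache_bin' key and the helper function are assumptions, not existing geocoder API):

function geocoder_cache_bin(array $handler) {
  // If a handler plugin declares its own bin (e.g. to satisfy a stricter TOS),
  // use it; otherwise fall back to the shared generic bin, which is all the
  // "local" handlers (KML, GPX, etc.) would ever need.
  return !empty($handler['cache_bin']) ? $handler['cache_bin'] : 'cache_geocode';
}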
jenlampton’s picture

I was just testing the static cache, and wasn't able to get persistent caching working - but it looks like that's just because it's not done yet. :)
Let me know what I can do to help move this forward.

jenlampton’s picture

I'm going to take a stab at setting up a cache_geocoder table and storing results in there for views-exposed-filter searches.

*edit* fixed typo, yes cache_geocoder not cache_geocode :)

phayes’s picture

Assigned: Unassigned » jenlampton

@jenlampton++ !

Oh, be sure to call it cache_geocoder (not cache_geocode)

steinmb’s picture

We are also interested in some kind of cache. We are currently working on migrating data from older systems into Drupal. One of the external systems contains addresses that we store on the Drupal side in addressfields and then let geocoder decode into geofields. We rather quickly ran into 'OVER_QUERY_LIMIT' at Google.

The migration (migrate module) gets rerun every time we change the code, but the addresses are 'always' the same, so hitting Google this hard does not make any sense. I thought about putting a proxy cache between us and Google with a looong static cache, but solving it here would be a much better and smarter solution.

emilyf’s picture

I'm in the same situation as steinmb. I'm importing loads of content via a feed importer; the import doesn't even include a location field, but the node type I'm importing into has an addressfield and a geofield. Every import triggers 10,000+ OVER_QUERY_LIMIT errors...

Let me know if there is anything I can test.

nd987’s picture

Big +1 on this. Is there any progress towards persistent caching? Seems like a must-have feature to me, especially for those of us who do batch imports daily. Once you have the address geocoded, it ain't going anywhere, so you can cache it indefinitely.

Dean Reilly’s picture

I would like to see some discussion of using a proxy to cache responses from services instead of implementing this in geocoder. It seems to me that this approach might have some advantages:

  1. Firstly, it would be less code that would have to be written and maintained, which is always a good thing.
  2. Secondly, the HTTP cache system is mature and flexible. It will respect whatever caching strategies the service suggests in headers, and in most cases these can be overridden by the user. It will handle expiry of stale caches and, if the service supports ETags, can even avoid re-requesting data which hasn't changed.
  3. Thirdly, it makes the cache more easily usable by other applications or modules which need to integrate with the service api.
nd987’s picture

The problem with relying on a proxy to cache geocode results is...you have to have a proxy. While the advantages @Dean Reilly posted are valid, most developers will not have or be able to setup (due to knowledge, hosting, business, or other limitations) an http proxy for this purpose. It could definitely be supported, but we need something that is easier and more trouble-free for the average developer.

Dean Reilly’s picture

Hosting limitations could be an issue. However, I'm not sure how many people would be running a site that would be making 50,000 geocoding requests a day (Yahoo!'s limit), and be running on a service where they don't have the necessary access to set up a proxy.

Exceptions might be sites hosted on a PaaS such as Acquia and Pantheon. But these are hosted on the Amazon and RackSpace clouds respectively. The user can always spin up an instance in the same AZ and configure that to act as the proxy.

There's also an argument to be made for the performance considerations of this module on smaller sites without proxies. I'm not sure the technical costs justify the benefits in this case either, though. Smaller sites on shared hosting probably have bigger performance bottlenecks than a geocoding api. (Although that is just a gut feeling and I have no numbers to back up that statement.)

I also believe the average developer could go away, learn how, and proceed to set up a proxy server fairly quickly.

-------------------------------

I'd like to point out that, all things considered, I'm not terribly opposed to the idea, but there are alternatives, with pros and cons to each.

I also think the issue of hitting the API limits suggests that a queuing system which can spread out the load is also required, as caching will only take us so far.

nd987’s picture

Keep in mind that Google's API limit is only 2500 per day (for non-business clients). Of course developers could set up a proxy if they really had to, but why make them? Adding a regular caching layer would involve the following (see the sketch after the list):

  1. Creating a schema for the cache table in the .install file (it just copies the standard cache table schema)
  2. When an address is geocoded, wrap the call in a cache_get / cache_set, using the above table name as bin, with a hash of the address as cache key
  3. There is no step 3
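A sketch of those two steps, assuming a cache_geocoder bin; the wrapper function name is made up, and geocoder() stands for the module's main lookup function:

// Step 1, in geocoder.install: reuse the standard cache table definition.
function geocoder_schema() {
  $schema['cache_geocoder'] = drupal_get_schema_unprocessed('system', 'cache');
  $schema['cache_geocoder']['description'] = 'Cache table for geocoded results.';
  return $schema;
}

// Step 2: wrap the lookup in cache_get()/cache_set(), keyed on an address hash.
function example_geocode_cached($handler, $address) {
  $cid = $handler . ':' . md5($address);
  if ($cached = cache_get($cid, 'cache_geocoder')) {
    return $cached->data;
  }
  $geometry = geocoder($handler, $address);
  cache_set($cid, $geometry, 'cache_geocoder');
  return $geometry;
}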
Dean Reilly’s picture

Step 3 would be to implement different caching rules to comply with the different terms and conditions of the various services. See replies #3 and #4.

------------------

To switch sides quickly, I'd like to give some of the reasons I think it would be a good idea for geocoder to cache results even if a proxy is being used.

I think the big advantage to having caching happen at the api level would be that it's agnostic of the service. So we could make use of multiple service end points and have the cached results of one prevent requests to another.

We can also be smarter with the cache. We can provide caching up to a variable level of resolution. E.g. are we only plotting addresses up to the level of a city to provide anonymity to users? Then we don't need to geocode two different addresses from the same city for different users.

Finally, APIs aren't always brilliantly implemented. The ToS may be much more permissive towards caching than the http headers suggest. Sure, you can configure the proxy to override these headers but that then means lots of people doing the same job. Easier to have it done by geocoder.

------------------------------

It's probably also worth keeping an eye on the pluggable http client issue for D8. This would allow us to use a tool like Guzzle which has support for HTTP caching built in and allow people to get the benefits of a proxy without having to install one.

jpstrikesback’s picture

What if step 3 was simply an opt-in on a config page?

drunken monkey’s picture

I've now taken a first shot at implementing this, for now without a config page, but with the caching variant overridable (in code) for each handler. The only problem is: it doesn't work. Seemingly, the PHP serialization functions have some problems with Geometry objects (or at least Points). Direct serialization and unserialization like in

$point = unserialize(serialize($point));

seems to work, but as soon as there is some external storage involved (be it a database or just a file or the clipboard), it fails and only returns an __PHP_Incomplete_Class object. Base64-encoding or similar tricks also don't work, as far as I can see.

So, to get this even basically started, without even considering the ToS and configuration problems, we will have to solve this. As far as I can see, we have the following options there:

  1. Someone smarter than me knows of a way this can be done, or discovers an error in my tests.
  2. We implement our own serialization layer for Geometry objects.
  3. We fix the problem elsewhere (i.e., in geoPHP, probably).

Any ideas or comments? What do you think of my approach to the other problems? What we could easily add, too, is some options on the config page to disable caching for some (or all) handlers.

jpstrikesback’s picture

I'm gonna test this and play the moment I get a second, unfortunately that could be 2 weeks from now, but dang it's nice to see this thread lives!

attiks’s picture

Status: Needs work » Needs review
FileSize
4.22 KB

This should work. I added geophp_load() to geocoder_cache_get() so the classes are known for the unserialize.

drunken monkey’s picture

Wow, great! Thanks for your help, I was completely lost with that one.

I don't think it's necessary to serialize the objects ourselves, though. Just move the geophp_load() call above the cache_get() and it's fine.
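Roughly, the essential ordering is the following (simplified; the actual patch passes more arguments):

function geocoder_cache_get($cache_id) {
  // Load geoPHP first so the Geometry classes exist when the cached object
  // is unserialized; otherwise PHP returns a __PHP_Incomplete_Class object.
  geophp_load();
  $cache = cache_get($cache_id, 'cache_geocoder');
  return $cache ? $cache->data : FALSE;
}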

Other reviews would also be very welcome, as would suggestions on how to proceed with the configuration, etc.

attiks’s picture

Status: Needs review » Reviewed & tested by the community

#20 works as well

drunken monkey’s picture

Re-roll of #20.
(Only whitespace changed, so I don't think we need a new review.)

drunken monkey’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
4.54 KB

Improved the patch in several ways regarding the cache_function callbacks (a rough sketch of the resulting callback follows the list).

  • Since there is only one callback for both get and set, we should also pass a $geometry argument when getting the cache value (in this case, NULL).
  • Pass additional parameters $processor, $data and $options to the caching callback. We don't know whether custom cache implementations will use the same cache ID we create.
  • Don't return values from geocoder_cache_set().
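So a custom cache callback ends up looking roughly like this (illustrative only; the exact argument order may differ from the committed patch):

function mymodule_geocoder_cache($cache_id, $processor, $geometry, $data, $options) {
  if ($geometry === NULL) {
    // "Get" mode: no geometry supplied, so look up and return the cached value.
    $cache = cache_get($cache_id, 'cache_geocoder');
    return $cache ? $cache->data : FALSE;
  }
  // "Set" mode: store the freshly geocoded geometry; no return value needed.
  cache_set($cache_id, $geometry, 'cache_geocoder');
}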
Sinan Erdem’s picture

Is anybody working on this issue? It would be nice to see the patch committed to a version.

drunken monkey’s picture

From the looks of it, I'd say nobody is working on this module.
See also #1563648: Taking a break from geocoder development. Some new developer would be needed – otherwise, you'll just have to hope phayes returns soon.

mikey_p’s picture

Just a note that while this patch looks great, the widget that geocoder exposes for geofield and location doesn't use this function and thus doesn't benefit from this caching.

I'm not sure what the simplest solution to this is, outside of re-thinking the plugin system and possibly moving to an object oriented structure.

mikey_p’s picture

Issue summary: View changes
FileSize
11.85 KB

Here's a slightly hacky patch that allows geocoding through the field widget to be cached as well.

arbee’s picture

I just want to thank you all for your excellent work. So glad I stumbled upon this, I was having query limit problems with a Views proximity search and this looks like it's going to take care of it.

SocialNicheGuru’s picture

I am getting this now though:
Notice: Undefined index: cache in geocoder_widget_get_field_value() (line 283 of geocoder/geocoder.widget.inc).
Notice: Undefined index: cache in geocoder_widget_get_field_value() (line 294 of geocoder/geocoder.widget.inc).
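Those notices suggest the widget settings of instances configured before the patch have no 'cache' key yet; a defensive check along these lines (illustrative) would avoid them until the instance settings are re-saved:

// Use !empty() so instances without a 'cache' setting simply skip caching
// instead of raising an undefined-index notice.
$use_cache = !empty($field_instance['widget']['settings']['cache']);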

quicksketch’s picture

Assigned: jenlampton » Unassigned

Let's unassign @jenlampton. :)

SocialNicheGuru’s picture

If I add this patch in #27 then cache tables for geocoder are not created when I install using Aegir.

rudiedirkx’s picture

Status: Needs review » Needs work

Doesn't apply to dev correctly anymore. Dev fixed a bug with multiple value fields, but this patch reintroduces it, because of

if (!isset($geometry)) {

so that part should be patched/merged manually.

rudiedirkx’s picture

Status: Needs work » Needs review

Re-roll of #27. With caching, without the multiple values bug.

Jelle_S’s picture

@rudiedirkx I think you forgot something ;-)

rudiedirkx’s picture

What's that?

Jelle_S’s picture

I think you forgot to attach the patch in #33 ;-)

rudiedirkx’s picture

Haha crap. Eeeeeeh let me find that one....

Usually I remove my local patches, because "they're on d.o anyway!". Not this time, thank god!

kerasai’s picture

Hi there, sorry to step on anyone's toes but this seems like an exceptionally complicated way to solve a fairly simple problem.

For a recent project build, I rolled a module that provides a geocoder_handler plugin that is essentially a wrapper for the existing Google plugin, and adds the caching at that layer. No need to touch any of the other parts of Geocoder and the integration with fields and Views just keeps on doing its thing.

https://www.drupal.org/sandbox/bmoresafety20/geocoder_google_adv

Take a look, please ping me with questions or input.

rudiedirkx’s picture

@kerasai But you don't want to cache just Google, or any other single handler. You want to cache ALL handlers, depending on a field's config. I agree it's overcomplicated, but it works perfectly and we REALLY need geocoder caching, so let's get it in!

(It's probably not a good idea to serialize and persist Exception objects. I don't KNOW, but it seems wrong... And why do you save into geocoder_google_adv and into cache_geocoder_google_adv?)

sheijtel’s picture

The patch from #23 works fine when just using the geocoder API, great! The patch in #37 does not include the geocoder.install file, which contains the schema.

rudiedirkx’s picture

FileSize
12.23 KB

#23 has the bug described in #32

Attached patch should include everything. Thanks @Steven. Please try that one.

blacklabel_tom’s picture

Hi Rudie,

The patch in #41 won't apply for me against the latest dev. I applied it manually and I couldn't get any results through from the module.

At first glance the cache check here is a bit too aggressive. By default $cache_reset is FALSE and $cache_type is 2, meaning ALL requests will look in the cache before doing the lookup, regardless of whether the query is being run for the first time or not.


if (!$cache_reset && $cache_type > 1) {
+    $geometry = geocoder_cache_get($cache_id, $processor, $data, $options);
+  }
+  else {
+    $geometry = call_user_func($processor['callback'], $data, $options);
+    if ($cache_type > 1) {
+      geocoder_cache_set($cache_id, $processor, $geometry, $data, $options);
+    }
+  }

I'm assuming this check needs to look at whether this location has been cached, rather than at the caching settings the function defaults to?

Cheers

Tom

blacklabel_tom’s picture

Status: Needs review » Needs work
rudiedirkx’s picture

That's weird. Applies fine for me:

rudie@home:geocoder$ patch -p1 --dry-run < patches/geocoder-1515372-41.patch
patching file geocoder.install
patching file geocoder.module
patching file geocoder.widget.inc

0 lines offset even.

I don't know about the actual code. It's weird that there are levels of cache checking, I agree, but I haven't scrutinized the code. I skimmed it, tested it and it worked. It's not perfect, but it's been over 3 years now and we really need caching.

You're very welcome to improve on it, or even start over; I'll review and test and RTBC when I'm happy =) but then I don't think there'll be anything committed in 2015 =(

blacklabel_tom’s picture

Hi Rudie,

I'm going to leave this one well alone as I'm tied to the 1.2 branch at the moment.

Cheers

Tom

Pol’s picture

Status: Needs work » Reviewed & tested by the community

Patch #41 applied and working flawlessly.

pol@x230 ~/git/geocoder $ wget https://www.drupal.org/files/issues/geocoder-1515372-41.patch
--2015-07-27 09:02:14--  https://www.drupal.org/files/issues/geocoder-1515372-41.patch
Resolving www.drupal.org... 93.184.220.99
Connecting to www.drupal.org|93.184.220.99|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12519 (12K) [text/plain]
Saving to: ‘geocoder-1515372-41.patch’

2015-07-27 09:02:15 (26.2 KB/s) - ‘geocoder-1515372-41.patch’ saved [12519/12519]

pol@x230 ~/git/geocoder $ git apply geocoder-1515372-41.patch 
pol@x230 ~/git/geocoder $

I have a question though: are we obliged to use the static cache?

lukus’s picture

Would it be possible to get this patch committed?

Cheers

L

WidgetsBurritos’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
12.38 KB
612 bytes

#41 shouldn't be committed because blacklabel_tom's concern in #42 above is a legitimate problem. I just ran into it as well.

I've attached an updated patch that fixes it, along with the interdiff so you can see the changes made.

Basically the idea is to separate the if/else into two separate steps (sketched below).
- The first step attempts to retrieve from cache if the settings ask it to.
- In the second step, if the value is still empty (the cache was either not consulted at all or failed to find a result), it forces a new lookup and recaches.
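A condensed sketch of that flow, using the variable names from the snippet in #42:

$geometry = NULL;
// Step 1: try the cache only when the settings allow it.
if (!$cache_reset && $cache_type > 1) {
  $geometry = geocoder_cache_get($cache_id, $processor, $data, $options);
}
// Step 2: on a cache miss (or with caching disabled), do a live lookup and recache.
if (empty($geometry)) {
  $geometry = call_user_func($processor['callback'], $data, $options);
  if ($cache_type > 1) {
    geocoder_cache_set($cache_id, $processor, $geometry, $data, $options);
  }
}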

rudiedirkx’s picture

If caching is enabled, do we still want to diff old and new values to see if the field value has changed? If caching is enabled, it doesn't matter anymore, does it? Settings might have changed, so it should do a new lookup. Or the geofield might not have any data yet, in which case a save should fetch it, even if the source field hasn't changed. Know what I mean?

Adding this before line 140 in geocoder.widget.inc would do it:

    if (empty($field_instance['widget']['settings']['cache'])) {

so the 'has the source value changed'-check only runs for non-cached fields.

If that's a bad idea, I RTBC.

WidgetsBurritos’s picture

I'm not sure I get what you're asking. Are you sure that's the correct line number in geocoder.widget.inc? Can you give me a little more context about the surrounding code there, so I can see what specifically you are referring to?

rudiedirkx’s picture

Oh, it was line 140 for me. In geocoder_widget_get_field_value(), below the comment

For entities being updated, determine if another geocode is necessary

there's a block of code that compares the old source field value ($field_original) against the new source field value ($source_field_values). It only compares the field values, not the widget settings etc. It's not guaranteed the coordinates will be the same (even if they exist), or that they all exist at all. The check is also 'wrong', because not all of the field data is used for geocoding. Only the geocoder knows which data is relevant, so let it handle performance. The cacher knows which data is used, so it knows when a new geocoding is necessary.

With caching enabled, the check is useless, because it'll use caching if the settings etc are the same, and will live fetch if it has to. If caching is not enabled, it shouldn't live fetch always, so the comparison code is still necessary.

The comment is my line 239. The code block ends on line 253. There's a return FALSE in the middle.

I think caching shouldn't even be an option, it should be enabled always, but that's another issue.

rudiedirkx’s picture

FileSize
10.08 KB

I've taken another stab at it. Quite different from previous patches (sorry!), so no interdiff.

It's much simpler than before. 1 cache table. Always cache results. Removed static cache. Removed cache overrides (callback, bin, ttl etc). Removed cache table from cache hook, because it should be really persistent. Coordinates don't change as often as we flush the cache. We could make a button to manually flush all/some geo cache from the settings form.
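For the "really persistent" part, the bin simply isn't returned from hook_flush_caches(), so a normal cache flush leaves geocoded coordinates in place; a settings-form button could still empty it on demand. A sketch, with a made-up submit handler name:

// Submit handler for a hypothetical "Flush geocoder cache" button on the
// settings form. Only this (or an explicit cache_clear_all() call elsewhere)
// empties the bin, since it is never listed in hook_flush_caches().
function geocoder_admin_settings_flush_submit($form, &$form_state) {
  cache_clear_all('*', 'cache_geocoder', TRUE);
  drupal_set_message(t('Geocoder cache cleared.'));
}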

I only tested geocoding from an Addressfield, only with Google, but other fields and handlers haven't changed, so should work the same. I don't know how to test the geocoder() function...

Pol’s picture

Status: Needs review » Needs work

Could you please re-roll those patches (this and #2159925: Geocode from virtual fields/entity property instead of just real fields) against the current development version so I can apply them?

thanks!

hussainweb’s picture

Status: Needs work » Needs review
FileSize
10.8 KB

Here is an attempt to reroll.

  • hussainweb authored 012fd40 on 7.x-1.x
    Issue #1515372 by drunken monkey, rudiedirkx, davidstinemetze, attiks,...
Pol’s picture

Status: Needs review » Fixed

Thanks to all for this :-)

  • 58a932f committed on 7.x-1.x
    Follow up to #1515372: Add missing .install file.
    
rudiedirkx’s picture

Yes, finally, fantastic! The 2 most important additions. Thanks @Pol, and @hussainweb for the quick work.

  • hussainweb authored 012fd40 on 7.x-2.x
    Issue #1515372 by drunken monkey, rudiedirkx, davidstinemetze, attiks,...
  • 58a932f committed on 7.x-2.x
    Follow up to #1515372: Add missing .install file.
    

  • hussainweb authored 012fd40 on 8.x-2.x
    Issue #1515372 by drunken monkey, rudiedirkx, davidstinemetze, attiks,...
  • 58a932f committed on 8.x-2.x
    Follow up to #1515372: Add missing .install file.
    

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

AdamPS’s picture

Caching is great, but this commit also seems to have accidentally made a significant API change.

  • Previously, function geocoder would throw exceptions on error.
  • Now it returns NULL.

As an example of the consequences, this has made a corresponding change to the results of a view filtered by geofield proximity when no results are returned:

  • Previously, we got the no results page.
  • Now there is no filtering - every single location in the database is returned to the view.

It's presumably impossible to identify and correct all the consequences of this change in contrib/custom code, so I think we have to put the API back how it was.

I have raised #2689545: API of function geocoder must not change.

AdamPS’s picture

Very sorry I was mistaken. The old code also returned FALSE - just the exception was previously caught in the plug-in and now it is caught in the main module file.

However I am now baffled as to how geofield proximity ever worked correctly with no results. I wonder if perhaps it never did. However the Google geocoder is very creative at interpreting text as a location, and often finds somewhere on the globe to return as a result. If there are no database entries near this, then we do in fact get no results.

So next step, I will raise an issue against geofield proximity filter. Sorry to have taken up everyone's time.

AdamPS’s picture

I believe there are some problems with caching in relation to error handling.

In particular, it looks like a temporary error (such as the Google API limit being reached) might be cached as FALSE for certain search data. This means that even after the temporary error has resolved, that particular data will not geocode.

Secondly, the comment on geocoder_cache_get() indicates an intention to cache "no results" entries using FALSE. However, it looks like this wasn't working in some cases - in particular not for the Google geocoder.
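One way to keep those two cases apart (purely illustrative; it assumes the plugins signal temporary failures with a distinct exception class, here called GeocoderTemporaryException, which does not exist today):

try {
  $geometry = call_user_func($processor['callback'], $data, $options);
  // Cache real results and genuine "no results" (FALSE) responses alike.
  geocoder_cache_set($cache_id, $processor, $geometry ? $geometry : FALSE, $data, $options);
}
catch (GeocoderTemporaryException $e) {
  // Hypothetical: rate limits such as OVER_QUERY_LIMIT would land here.
  // Do not cache, so the lookup is retried once the limit resets.
  watchdog('geocoder', 'Temporary geocoding failure: @message', array('@message' => $e->getMessage()), WATCHDOG_WARNING);
  $geometry = FALSE;
}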

I have a patch for these in my issue #2689211: Confusion with errors and logging which attempts to fix these and some other problems with error cases. Any comments from the experts here on my issue would be much appreciated.

rudiedirkx’s picture

Relevant:

they were 'fixed' before 1.3, but I think they (or one) address your concerns.

AdamPS’s picture

@rudiedirkx Thanks for a quick reply.

As far as I can see, #2649096: Fix exception throwing and empty result handling fixed things so that empty results from the Google geocoder were cached. However, it also meant that temporary errors (such as the API limit being reached) from Google and the majority of other geocoders were incorrectly cached. The problem is that the Google plugin throws an exception in both cases.

It seems that the plug-ins are currently not handling "no results" and "temporary error" in a consistent manner. Until that is fixed, it's probably not possible to get everything working. #2689211: Confusion with errors and logging provides that fix to the plug-ins.