This is a raw idea that evolved since Nick_vh pointed me to Geohash during Drupal Dev Days Barcelona 2012.

There are different clustering techniques, the main i'm focusing at is geared towards distance based clustering on maps. You can think of it as an in-Drupal implementation of http://www.maptimize.com/

Background

Geohash is easy to understand by reading the Geohash Wikipedia article or trying an Interactive Geohash demonstration website. But it is also very well described by David Smiley in his presentations on spatial search with Solr and Geohashes:

So Geohash is a lat/lon geocode system with a hierarchical spatial structure and gradual precision degradation.
For example Boston would be encoded as "DRT2Y".

The longer the prefix is that two points share, the nearer those points are. As the reverse isn't necessarily true, edges have to be considered for proximity searches (and clustering): Two points might be close to each other, but positioned next to edges of the hierarchical grid defined by the geohash algorithm.

Idea

  1. Calculate a grid level for clustering appropriate to the current zoom level (so that when visualizing points per grid, they shouldn't overlap)
  2. Do a group-by query on the defined grid level
  3. For all points returned for a grid
    1. Create a cluster (combine all points within the grid)
    2. Set it's new center point (centroid of the points or simpler: )
    3. Mark the grids that will be affected, that the cluster overlaps
    4. If a grid has already been marked by another cluster, merge those and repeat
  4. Clustering complete

The above algorithm is neither complete nor tested. While there are more sophisticated clustering algorithms out there, I think it might be a good fit for the use case of clustering on web maps. Using the grid-approach of geohashes allows to quickly identify clusters, which basically eliminates the need for complex distance calculations.

The grid-size would have to be chosen according to the pixel size of cluster / marker icons, so that they don't overlap. The additional steps solve the problem of creating a static grid and overlapping points at grid edges. The clusters don't align directly to the grid but their center points are calculated dynamically by calculating the center of the bounding box around the involved points. The bounding box then is used to mark neighbor grids. Clashes should be detected and merged not in a perfect by performant way.

I will try to visualize the described steps in the near future to better communicate the algorithm. In the meanwhile, i would like to mention an interesting post that explains a similar approach:
Grid based viral growth argorithm (archived blog post).

Steps

In order to see this work with Drupal-tools, I think we need Add geohash support to geoPHP and potentially also #1662584: Add geohash support? in Geofield.

Solr integration for Geohash should already be provided, as of the above mentioned presentations by David Smiley. Though, I haven't tested them so far. Also see the related Spatial Search SOLR-2155 documentation.

Comments

dasjo’s picture

Issue summary: View changes

added some steps

dasjo’s picture

Issue summary: View changes

minor

nick_vh’s picture

I'm all in for helping you at the solr side. Maybe we can sprint-code this during Drupalcon?

dasjo’s picture

great to hear, nick_vh!

i am about getting up to speed with search_api, solr & drupal mapping. at first, i'd like to get Geofield Map work with search_api solr searches.
it seems like it doesn't load field data from the entity, but relies on the data being present in the views result. therefore at the moment it doesn't work.

so this is like the first technical issue that i'd like to tackle and then progressively climb up the solr stack :)
trial SOLR-2155 with drupal and then prototype the algorithm that i outlined

will be excellent to sprint on this during Drupalcon!

dasjo’s picture

Issue summary: View changes

links

dasjo’s picture

Issue summary: View changes

add algorithm description

dasjo’s picture

hey there, i just committed a first trial of geohash-based clustering:
http://drupalcode.org/project/geocluster.git/commit/ea7c2f1

works quite well. it's still a very rough draft, but i will be working on it quite intensely the next days and keep you posted.

there is also some related discussion going on about the views integration: #1791796: Allow to inject a custom aggregation implementation
and once the drupal-only implementation works i'm looking forward to implement this as an apache solr plugin, which should speed up things nicely :)

neilireson’s picture

Component: Documentation » Code

Hi,
I am very interested in how you are progressing in this work. I have just hacked a rubbish geoclustering approach in SOLR using geohash. It just clusters all the points in the grids to the centre-point, very inelegant but simple to implement and fast to process. I would be keen to share work on developing a more elegant approach.

dasjo’s picture

hi,

thanks for your interest. i have already implemented a first geoclustering approach in SOLR that has been committed on thursday. see the geocluster_solr and geocluster_solr_demo modules.

it also just clusters all the points in the grids to the centre-point, as this is the most performant way and unfortunately SOLR doesn't support aggregate functions like avg() which would allow us to calculate the centre-point using built-in tools.

i'd suggest checking out the code + playing with the demo feature. i plan to release another alpha thats more tested in that regards soon.

more sophisticated clustering could be done using a solr plugin, work that i have started here: https://github.com/dasjo/SolrGeocluster + also see this related issue: #1800850: Use a solr plugin with search_api_solr for clustering
but for now i'd think i will keep with the simpler approach without the solr plugin because thats just easier for people to deploy :)

let me know what you think

neilireson’s picture

I have a terrible admission to make... I'm not really interested in the Drupal implementation, sorry.

I was actually just looking for the SOLR plugin work as that's what I'm working on.

My thoughts are thus:

To generate a SOLR field type (GeohashCluster) which takes a lat/lng value and creates a number of subfields one for each grid level of the hash tree. The SOLR query can then request a subfield (depending on the map viewpoint size and zoom level) or SOLR could do some work to generate the most appropriate clusters to send back. My current implementation just does the former.

One possible idea I'm toying with is a combination of server and client side processing. Where given that the client code can satisfactorily cluster a number of points, e.g. 1000, then the server responds with the most fine-grained grid level which provides less than that number of points. The client code can then cluster points which are too near. Currently this uses a combination of my code and google ClusterMarkers, although I need a bespoke google implementation which considers Markers which contain multiple points.

As for doing something other than returning the grid centre-point I'm not sure if I can think of anything which isn't very costly in terms of processing. Although I did find http://wiki.apache.org/solr/StatsComponent, which provides descriptive statistics on returned values.

If you are interests I will try to keep you informed of progress and perhaps if it makes sense work on this together.

dasjo’s picture

I have a terrible admission to make... I'm not really interested in the Drupal implementation, sorry.

I was actually just looking for the SOLR plugin work as that's what I'm working on.

fair enough :)

My thoughts are thus:

To generate a SOLR field type (GeohashCluster) which takes a lat/lng value and creates a number of subfields one for each grid level of the hash tree. The SOLR query can then request a subfield (depending on the map viewpoint size and zoom level) or SOLR could do some work to generate the most appropriate clusters to send back. My current implementation just does the former.

that's what i have implemented as well. here are the parameters of an example query:
fl=item_id,score&start=0&group.limit=1&qf=tm_field_place:latlon^1.0&group.field=ss_field_place:geocluster_index_4&group=true&json.nl=map&wt=json&fq=index_id:geocluster_index&rows=1000000

One possible idea I'm toying with is a combination of server and client side processing. Where given that the client code can satisfactorily cluster a number of points, e.g. 1000, then the server responds with the most fine-grained grid level which provides less than that number of points. The client code can then cluster points which are too near. Currently this uses a combination of my code and google ClusterMarkers, although I need a bespoke google implementation which considers Markers which contain multiple points.

the original idea for the Solr Plugin https://github.com/dasjo/SolrGeocluster was to do the whole clustering process in-solr. i have switched to a similar hybrid-approach where i just use Solr's grouping feature and perform final clustering in PHP.

As for doing something other than returning the grid centre-point I'm not sure if I can think of anything which isn't very costly in terms of processing. Although I did find http://wiki.apache.org/solr/StatsComponent, which provides descriptive statistics on returned values.

If you are interests I will try to keep you informed of progress and perhaps if it makes sense work on this together.

i'm definitely interested in collaborating. unfortunately my current approach will only help drupal users but at least we can exchange ideas on the concept level. i have also had some discussions with david smiley that might be of interest to you. feel free to use the contact form in order to exchange on that.

dasjo’s picture

Status: Active » Fixed

this has been implemented a while ago :)

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

added intro