This warning on the project page for this module worries me: I'm implementing it on a server with around 10k nodes using CCK facets. While the server has spare capacity, we work hard to keep things efficient, as at peak times it can become very congested. The strength of faceted search is sifting through huge numbers of nodes. Is there a way to make this more efficient, perhaps some kind of caching?

Many thanks,
Chris.

Caution

Faceted Search is database-intensive. If your server can barely keep up with your traffic, this package will make things worse. Make sure to benchmark performance before deploying this system on a busy site or on a site with many thousands of nodes.

Comments

Category:feature» support

Caching will help with searches that use only a term or two. But with more terms, or if the search includes any user-entered keywords, the results page likely won't be present in the cache and will thus have to be generated. Also, given the number of possible search term combinations, the page cache could grow quite big. In other words, page caching doesn't help search much.
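To get a rough sense of how large that page cache could grow, here is a back-of-the-envelope sketch. The facet and term counts are made-up illustration numbers, not measurements from any real site:

```php
<?php
// Hypothetical site: 10 facets, each offering 20 terms, so 200 selectable
// terms in total. Each distinct combination of selected terms is a distinct
// results page that a page cache would have to store.
function combinations($n, $k) {
  $result = 1;
  for ($i = 1; $i <= $k; $i++) {
    $result = $result * ($n - $k + $i) / $i; // n choose k, computed iteratively.
  }
  return (int) round($result);
}

$terms = 10 * 20;
printf("1 term selected:  %d possible pages\n", combinations($terms, 1));
printf("2 terms selected: %d possible pages\n", combinations($terms, 2));
printf("3 terms selected: %d possible pages\n", combinations($terms, 3));
```

Even before counting user-entered keywords, three selected terms already allow over a million distinct result pages, which is why page caching alone helps so little here.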

For Faceted Search to work smoothly with thousands of nodes, we would need to improve speed by orders of magnitude. However, at this point I'm not sure what could be done to optimize performance further. With a lot of effort, I'm sure we could gain some more speed, but not nearly as much as needed. Faceted Search's search algorithm relies on SQL and is very similar to Drupal core's; many people have gone through that core code, and in many projects it is still too slow, even though it does not have to deal with as much data as Faceted Search does!

A more adequate option might be to use a search engine such as Apache Solr, although it doesn't provide the same feature set as Faceted Search (at least not yet).

Certainly in my implementation of this, which is a job board, searching by one or two terms (e.g. Location & Industry) is quite adequate!

We use static page caching (the Boost module); hopefully it will cache these pages for anonymous visitors, which may provide a very handy workaround for the time being.

Has anyone put together any optimization tips or FAQs for this module? I'm struggling with the performance of this module on a site that has 15 to 20K nodes. I'm using Boost, but you're right that it doesn't help much, given the large number of combinations a user can filter by. Two questions I have:

1) When using CCK fields as facets, does it help to make sure the field is indexed? Or does it not matter at all?

2) When I use the Devel module to get SQL times, I see the longest-running queries are the ones doing a COUNT. Do you think it's plausible for me to make code changes to avoid those COUNT queries (assuming I'm okay with not having record counts shown beside the facets)? Or would that possibly break other things in the module that rely on them?

Are there high-water-mark recommendations for using this module? In other words, should you NOT consider using this module if your node count is higher than X and/or the number of facets is greater than Y?

Any other optimization techniques? I'd prefer to not have to go the Apache Solr route unless I have to.

1) Do you mean indexed with Field Indexer? If so, it does not matter, and that index is not of any use to the guided search.

2) The count queries are required. They appear longer than the actual search queries because they are executed first. MySQL then uses cached data from the count queries to help the search queries run faster.

Given the complexity of the database queries required for performing faceted searches, 20K nodes is a lot even for a dedicated server. There is no recommendation on the maximum number of nodes though. Results may vary greatly depending on your site's infrastructure.

Some easy-to-implement optimization ideas:

  • Reduce the number of facets being offered in the guided search.
  • Avoid node access modules.
  • I have seen a case where the content types were using a lot of CCK nodereferences, and when rendering the nodes for the search results, each nodereference was causing a node_load() of the referenced node. By changing the results display to avoid loading those nodereferences, the performance was greatly enhanced.
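To make that last point concrete: the fix described above was to stop loading the referenced nodes entirely, but when references must still be displayed, a per-request memo cache is a related mitigation. This is a hypothetical sketch; load_node_cached() is an invented helper, and the stub inside it stands in for Drupal's node_load():

```php
<?php
// Sketch: memoize node loads so that each referenced node is fetched at most
// once per request. The object built below stands in for Drupal's
// node_load(); $load_count only exists to demonstrate the saving.
$load_count = 0;

function load_node_cached($nid) {
  global $load_count;
  static $cache = array();
  if (!isset($cache[$nid])) {
    $load_count++; // One real (expensive) load per distinct nid.
    $cache[$nid] = (object) array('nid' => $nid, 'title' => "Node $nid");
  }
  return $cache[$nid];
}

// 50 search results that all reference the same 3 nodes: 3 loads, not 50.
$titles = array();
for ($i = 0; $i < 50; $i++) {
  $titles[] = load_node_cached(($i % 3) + 1)->title;
}
echo "$load_count loads for " . count($titles) . " results\n"; // 3 loads for 50 results
```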

Once you have done all that can be done on the Drupal side, you'll have to fine-tune your MySQL setup.

The Apache Solr engine is orders of magnitude faster than anything we'll ever be able to achieve with SQL.

If you're using views to present your results, you also need to consider the performance of your view. Use the query log to look at what's going on...

Another thing to consider (Drupal 6 only): With i18n, Faceted Search might get really slow for a non-default language if your site is using the 'mixed' content selection mode (in admin/settings/i18n). This is related to #337089: Mixed mode ultra slow on large db.

A quick question.
I've read posts where people argue against caching the search results. But surely it should be possible to cache the "front page" of every Faceted Search environment. We are testing Faceted Search on 60,000-70,000 nodes with 1,500,000 term_node entries. It works quite well, except for the insane load placed on the database whenever you hit the first page of a faceted search environment. The numbers shown on that page don't have to be 100% accurate and, for our part, would only need to be refreshed once a day.

I haven't looked too closely at the faceted code, but would there be an obvious spot to "hook in" to create some sort of cache/content cache for the search results?

Drupal's page cache and block cache should work with Faceted Search. Are you looking for a different caching strategy than provided by those?

I just realized that this is for version 5.x, we are using Drupal 6. Would that make a difference?

Block cache and normal page cache are turned on. However, even with those cache settings, some really heavy queries are executed on every request.
The heaviest queries are run from this function:
function load_categories($facet, $from = NULL, $max_count = NULL)
The slowest one runs for 9-10 seconds on every page hit:
SELECT COUNT(DISTINCT(n.nid)) AS count, term_data.tid AS tid, term_data.name AS name
FROM node AS n
INNER JOIN temp_faceted_search_results_1 AS results ON n.nid = results.nid
INNER JOIN term_node AS term_node ON n.vid = term_node.vid
INNER JOIN term_data AS term_data ON term_node.tid = term_data.tid
WHERE ((term_data.vid = 2) AND (n.type IN ('produkt')))
GROUP BY tid ASC
ORDER BY count DESC, term_data.weight ASC, term_data.name ASC
LIMIT 0, 11

I just threw together some code to better illustrate what I'm aiming at:

This code is what I now have at the end of the Faceted_search->load_categories function. Note that this is not i18n compatible.
What it does:
- creates a hash of every query executed in load_categories
- checks whether a cache entry is stored for this hash
- if there is cached data, returns it
- if there is no cache entry, runs the query
- if the run time exceeds $cache_threshold, stores the result of build_categories($results) in the cache

With our 65,000 nodes, 2,000+ terms and 1,500,000 term->node connections, execution time is reduced from 20-25 seconds to ~1 second for the first level of the faceted search.

    // Tunables: minimum lifetime of a cache entry, and the execution time
    // (in seconds) above which a result is considered worth caching.
    $cache_min_life = 7200;
    $cache_threshold = 0.5;
    $start_time = microtime(TRUE);
    // Build a cache key from the fully-expanded query plus the facet's
    // vocabulary settings. vsprintf() is needed because $query->args()
    // returns an array of placeholder values.
    $query_string = vsprintf($query->query(), $query->args())
      . ':vid:' . $facet->_vocabulary->vid
      . ':tree:' . $facet->_vocabulary->hierarchy;
    if (isset($from) && isset($max_count)) {
      $query_string .= ':' . $from . ':' . $max_count;
    }
    $cid = md5($query_string);
    // Note: the 'cache_faceted' table must exist (e.g. created in the
    // module's install hook by cloning the core cache table structure).
    if ($cache = cache_get($cid, 'cache_faceted')) {
      return $cache->data;
    }
    if (isset($from) && isset($max_count)) {
      $results = db_query_range($query->query(), $query->args(), $from, $max_count);
    }
    else {
      $results = db_query($query->query(), $query->args());
    }
    $return = $facet->build_categories($results);
    // Only cache results that were expensive enough to compute.
    if (microtime(TRUE) - $start_time > $cache_threshold) {
      cache_set($cid, $return, 'cache_faceted', time() + $cache_min_life);
    }
    return $return;

Version:5.x-1.0-beta4» 6.x-1.x-dev

Interesting... However, the load_categories calls do not run if the guided search block is cached or if the page is cached. You might want to tweak the block cache to cache per role instead of per user. By default, it is caching per user to support all node access control modules.

The Block Cache Alter module has support for caching per role:

      BLOCK_CACHE_PER_ROLE => t('Per role'),
      BLOCK_CACHE_PER_ROLE | BLOCK_CACHE_PER_PAGE => t('Per role per page'),

A datapoint on this: http://sf.CarnalNation.com is using Faceted Search for the main search and the events search. With 3,500 nodes it hasn't been an issue for us yet (but we also have a big DB server with 8 GB of memory, so our whole DB lives in cache). Current DB server load is less than 3%, at ~300 total DB queries a second. We also see quite a lot of page cache hits, because a lot of the event browsing is by date and on a few primary facets.

I also strongly endorse the Block Cache Alter module. We're using domains, so it's the only way we can have block caching, and it works really well.

Very interesting stuff, thanks jpp!