I've been working with the SearchApiSolrService class, and there are a couple of changes that make it easier to extend:

  • Allow the results in SearchApiSolrService::getFieldNames() to be altered even if the index itself has no other fields; don't throw an error if there's a field that the class can't figure out a prefix for. This change is in solrservice_getfieldnames.patch, attached.
  • When processing search results, separate the "extract results" and "extract facets" steps into separate methods. This makes it a lot easier to override part of the search/result behavior. I also found that specifying the mapping between 'id' and 'item_id' in SearchApiSolrService::getFieldNames() allowed me to remove the hard-coded expectation of an 'item_id' field in the Solr results. This change is in solrservice_extractresults.patch, attached.
  • Make SearchApiSolrService::flattenKeys() more readable, and apply an "AND" conjunction if requested. The caveat on this patch is that the "dismax" Solr query parser (specified as the default in the solrconfig.xml that accompanies search_api_solr) doesn't actually do "AND" or "OR" syntax--the only special syntax it uses are phrases (""), mandatory clauses (+), or prohibited clauses (-). This change is in solrservice_flattenkeys.patch, attached.

These changes have made it possible to do significant chunks of the work we've been discussing in #1064884: Add support for indexing non-entities, mostly in a module we're calling "Sarnia" (for now). Right now, Sarnia works by using its own Search API service class that extends SearchApiSolrService. Especially relevant to the flattenKeys() patch are SarniaSolrService::flattenKeys() and SarniaSolrService::preQuery() -- they provide a bit of switching to allow the service to use the 'standard' Solr query parser for more complex queries.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

becw’s picture

I've been working in the "party-integration" branch of this sandbox repository: http://drupal.org/sandbox/bec/1096826

drunken monkey’s picture

Status: Active » Needs work

Thanks for the patches! Especially the first two make a lot of sense, I'm all for making the service class more easily extendable.

I think the solrservice_getfieldnames.patch is incomplete, though – neither can the mappings for an index without fields be altered, nor does the mapping include the id.
Likewise, the extractFacets() method lacks the return statement.

As for the last patch, that doesn't make a lot of sense to me. Three quarters are just style changes that replace my style with yours, which isn't really helpful. Or are there at least instances where this is due to Drupal coding standards?
It's a good point that dismax doesn't handle "OR" and "AND", thanks for bringing that to my attention! However, your patch makes the results wrong in all cases with multiple terms (as the conjunction is always set to one of the two), instead of just the cases with OR queries, worsening the situation considerably.
Instead, we should probably either think about moving to the standard request handler, or dynamically setting the handler to the standard handler if OR expressions are used in the keys. Do you happen to know what disadvantages the standard handler has, and if we can use the dismax query as-is with the standard handler?

becw’s picture

Oops... I generated those patches a little too breezily, I think. I'll fix them up shortly.

And, it's true: the last patch does really in large part replace your style with mine--in thinking through it, I was just pushing the code around a little. Sorry! It does make code more readable when there are fewer exit points in functions, though...

Part of the reason I handled conjunctions anyway, even though it's wrong, is so that the comment would be illustrated. In some cases there are good reasons to use the 'standard' query parser instead of the 'dismax' parser--if you want to use complex and/or syntax, field qualifiers, or wildcards--but certainly dismax should be the default for user-accessible fulltext search. In the Sarnia service class, I do it like this:


  protected function flattenKeys(array $keys) {
    // This query builder only applies when using the 'standard' query parser.
    if (!isset($keys['#query_type']) || $keys['#query_type'] != 'standard') {
      return parent::flattenKeys($keys);
    }

    // Extract search terms.
    $k = array();
    foreach (element_children($keys) as $i) {
      $key = $keys[$i];

      if (is_array($key)) {
        // Render nested queries.
        $key['#query_type'] = $keys['#query_type'];
        $k[] = $this->flattenKeys($key);
      }
      elseif ($key) {
        // Escape keys.
        $k[] = SearchApiSolrConnection::phrase(trim($key));
      }
    }

    // Don't apply conjunction, negation, etc. to empty queries.
    if (empty($k)) {
      return '';
    }

    // Apply the conjunction type.
    if (!empty($keys['#conjunction']) && in_array($keys['#conjunction'], array('AND', 'OR'))) {
      $conjunction = ') ' . $keys['#conjunction'] . ' (';
      $query = '((' . implode($conjunction, $k) . '))';
    }
    else {
      $query = implode(' ', $k);
    }

    // Apply negation, if requested.
    if (!empty($keys['#negation'])) {
      $query = '-(' . $query . ')';
    }

    // Apply field qualifiers.
    if (!empty($keys['#field_qualifier'])) {
      $query = $keys['#field_qualifier'] . ':' . $query;
    }

    return $query;
  }

I'm not sure that something like this would make sense for SearchApiSolrService::flattenKeys() though, since it uses the Solr-specific 'query_type' and 'field_qualifier' terminology in the $keys array, which as I understand it is a generic structure passed through from the Search API query object.

drunken monkey’s picture

No, we should probably let the flattenKeys() method decide which request handler to use, and give it a way to switch to the standard handler when an OR expression is used.
I'll come up with a patch. If you want to explicitly insert ANDs, I guess you'll have to override the method either way, though. That doesn't seem very useful, when AND is the default operator anyways.

becw’s picture

Actually, I think that the default 'q.op' is OR.

Edit: Er, my bad! You specified AND in the Search API Solr schema.xml :) I was just going by what it said in my Solr book.

drunken monkey’s picture

Status: Needs work » Needs review
FileSize
4.71 KB

See the attached patch which should fix OR queries and allows a more flexible setting of the request handler.

becw’s picture

Ah, your patch in #6 looks very handy! I'll check it out this afternoon.

  • solrservice_getfieldnames.patch now has the id mapping, and the alter always gets called
  • solrservice_extractresults.patch now has a return statement on the ::extractFacets() method
  • solrservice_flattenkeys.patch generates proper dismax syntax... this method may or may not jive with the behavior of the $keys array on other search backends. Unfortunately I didn't see your comment in #6 before I wrote this, though they address different aspects of the issue.
becw’s picture

Status: Needs review » Needs work
FileSize
7.02 KB

Sorry for the repeated broken patches. The previous version of the "extractresults" patch had the return statement in the wrong place.

drunken monkey’s picture

getfieldnames

In general that's a nice patch – however, there are still two issues:
- This will produce a notice if $index->options['fields'] isn't set (rather harmless).
- This will break if a field named "id" is indexed, which is rather serious. Even though IDs will probably only rarely be indexed, we can't allow the search server to break in such cases.
The former is fixed easily enough, but I guess that for the latter you won't get around some special-casing. You should probably call the field "search_api_solr_id" instead, and special-case it when setting the results.
Ah, you'll also have to introduce a special case for the "score" field, as that should be mapped to "score", not to "search_api_relevance", when extracting results.

extractresults

Looks good now, on the whole. But also some issues:
- In the outer loop in extractResults() you should first set $result to an empty array, to prevent data from previous documents to leak through.
- extractFacets() doesn't work, as you iterate (in the outer loop) over an array that you create two lines above, empty. You probably want to iterate over the 'search_api_facets' query option, if it exists.

Did you even test this patch before posting it?

flattenkeys

This isn't needed anymore now, is it?

becw’s picture

getfieldnames

So, one of the things I'm thinking here is that I see "score" and "search_api_relevance" used in different places through Search API and Search API Solr. Do they refer to the same piece of data? In that case, "search_api_relevance" would be the "local property name" and "score" would be the "solr property name"--which is the mapping that is happening in SearchApiSolrService::getFieldNames().

This mapping would also apply to an ID field--if Search API should always be able to find the entity id in the 'search_api_item_id' field, we can map it to 'item_id' (which is the name of the field that the Search API indexer uses for the entity id). Right now, sometimes this field is 'id', sometimes it's 'item_id' (especially in the extract results code--the item_id property is transferred to the id property, regardless of whether there's an existing id property). The extractresults patch does address this--it maps all the properties returned by ::getFieldNames() from solr names back to the property names used by Search API (removes prefixes, etc.)

The mapping field means that it's possible for other service classes to change what Solr field is used as the entity id, or in general to accommodate schemas that vary from the one provided with Search API Solr.

The patch attached to this comment:

  • makes sure that there's no error if $index->options['fields'] is not set.
  • adds straight pass-thru 'score' and 'item_id' properties. IMO, this makes sense when the ::getFieldNames() mappings are used in extracting results.

extractresults

I've been working with the ::extractResults() output, but honestly I haven't been testing with facets at all yet. It seemed convenient to package it up, but it didn't affect me directly. In the attached patch I've fixed up the ::extractResults() method, but I'll have to rearrange a little to test ::extractFacets().

flattenKeys

Yes, your patch from #6 does address alternative query handlers. It still produces incorrect dismax query syntax for AND queries with negation, though.

p.s. I managed to remember to --no-prefix when generating this round of patches.

becw’s picture

One roadblock the search_api_solr.request_handlers patch presents is that searching on non-fulltext fields using keys (rather than a filter query) is not supported by SearchApiQuery and SearchApiIndex. The patch addresses the query parser issues, but then I run into the fact that the only way to get "field qualifiers" into the search query is to set them via $query->fields(), which throws errors if any of the specified fields are not "fulltext" according to the SearchApiIndex entity.

In the current code, all per-field filtering (specifically relevant re: Views integration) is done with "filter queries"--these are cached and don't affect a search result's score. Filter queries are absolutely the best approach for slicing data into searchable subsets, but for user input, or selecting one specific record, I'm concerned that the caching behavior of filter queries could become a limitation. Am I looking for optimization too early? Is a "large number" of filter queries orders of magnitude greater than something I would encounter when allowing "filter queries" based on exposed filter values in Views?

Additionally, I don't know that this approach takes the best advantage of dismax's simplified "user friendly" syntax (translation: doesn't throw syntax errors) and per-field boosts, or of the standard query parser's nested logic, wildcards, and field qualifiers.

drunken monkey’s picture

flattenKeys

OK, this turned out to be a lot more complicated than I had thought, but attached is a patch which should completely correct the flattenKeys() method even for the most complex queries. It even includes a verbose comment about what happens and why.

Regarding filter queries: since the dismax handler doesn't support specifying fields directly in the query, we really have no other choice, there. (At least I don't see any.) So the point, whether this has any performance drawbacks, is rather moot.
However, when switching to the standard handler, we could really move some or all of the filter queries inside the main query. The question is if this is the right thing to do – we can't really tell which filter queries are set by the admin and which are due to user input, so maybe we would just make queries slower by removing that caching behaviour for filters where they would be correct.

getFieldNames

extractResults

(As they only work together, I think that we should probably discuss them together.)

So, one of the things I'm thinking here is that I see "score" and "search_api_relevance" used in different places through Search API and Search API Solr. Do they refer to the same piece of data? In that case, "search_api_relevance" would be the "local property name" and "score" would be the "solr property name"--which is the mapping that is happening in SearchApiSolrService::getFieldNames().

They do refer to the same piece of data, but in different situations. When being used as a field, e.g. for sorting, we have to call it "search_api_relevance", since otherwise entities with an indexed "score" property would break the Search API. In the result array, on the other hand, the used keys are exactly defined, so we can use "score". This is both defined in the Search API, and not specific to the Solr service class, so also can't be changed here.

Your latest patches for this part remove the "id" mapping, resulting in a lot of notices and an empty result. Please test the patches you post!

Also, the patch again doesn't solve the problems the service class will have when an entity has indexed "score" or "id" properties. I think you should just use the "search_api_relevance" and "search_api_item_id" (or, better, "search_api_solr_item_id") keys in the mapping (and drop "score" and "item_id") and then specifically check for those in extractResults(). But even if you decide for another way, you somehow have to solve this. I'm pretty sure you won't get around making a special case for those fields, though.

Maybe you should introduce a dedicated array in a "fields" key (or similar) for the field values, and just set defined values (i.e., 'id' and 'score') directly in the result array?

becw’s picture

flattenKeys

Well, I'm glad that the dismax/standard syntax stuff is working in the general case. I do still want to use more advanced standard query syntax, but it looks like it makes sense for me to keep that separate so that SearchApiSolrService can accommodate the more general needs of Search API.

getFieldNames & extractResults

I have been testing, but at this point I'm working with five different patches, and testing against an extended Solr service class in addition to the standard Solr service class. Clearly some things are slipping through the cracks.

The issue about entities with indexed "score" or "id" properties already exists. In the current loop that extracts Solr results (service.inc, line 453), it:

  • requires the presence of an 'item_id' field
  • sets the the 'id' field equal to the 'item_id' field
  • loops through every key => value in a Solr result document, and sets the key => value pair on the Search API result array to match. This means that if the Solr result has an 'id' field, it overrides the previously set value
  • the 'score' field works because the key in the Solr result matches the expected key in the Search API result array

The solrconfig.xml file included with Search API Solr specifies that for any search, the only fields that should be returned are item_id and score (solrconfig.xml, line 545), so this issue never presents itself with a standard Search API setup. However, it comes up as soon as you request more fields back from solr, or use a Solr core with a different schema. As soon as the Solr query returns documents with more fields, the $results array that Search API expects is polluted. At first this seemed like a feature to me, but after this conversation it looks more like a bug.

My intent with these two patches was to remove all of the hard-coded fields from the ::search(), and put them in ::getFieldNames() where they can be easily overridden or altered.

... introduce a dedicated array in a "fields" key (or similar) for the field values, and just set defined values (i.e., 'id' and 'score') directly in the result array

This suggestion will fix the issue in both versions of the code, and is fairly clean. If we go with this solution, it will probably make sense to update the documentation accompanying SearchApiQueryInterface::execute().

Here's a combined getFieldNames and extractResults patch with the change. This may also affect #1089758: Make use of new flexibility for Views field handlers. I've tested this patch with unpatched Search API dev code and a View based on a standard Search API index + standard Search API Solr service, both with and without the flattenKeys patch.

drunken monkey’s picture

Title: Revisions to the SearchApiSolrService class » Add increased flexibility to the service class
Status: Needs work » Needs review

OK, this seems fine to me now. So would the last two patches (#12 and #13) also work for you in their current state? (I'd just change a few comments in #13.)
I think I could then commit them.

And yes, I agree I should then document the optional "fields" key for search results. And keep this in mind for the Views handlers, first looking into "fields", then "entity" and finally doing the default.

becw’s picture

Status: Needs review » Reviewed & tested by the community

The patches in #12 and #13 work for me. I really, really appreciate your willingness to review and discuss these changes.

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

OK, great – just committed this! Thanks for your input and work!

I really, really appreciate your willingness to review and discuss these changes.

Well, after creating the Search API specifically because other solutions lack flexibility, I can't really ignore advice for increasing it here, can I? ;)
So, no problem! I'm always glad to make things more flexible in my modules (as far as it makes sense).

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.