When working with a Solr stemmer like SnowballPorterFilterFactory, it's a common practice to index both the stemmed and non-stemmed versions of the field. This gives two advantages:

  • Words that match the text exactly rank higher than near misses that match only on the stem
  • It avoids awkward cases where the stemmer reduces a word to a stem that then fails to match the original word (e.g. in English I've found that with SnowballPorterFilterFactory, searches on 'unpublished' and 'unravelling' don't match content containing those words, but searches on the stems, 'unpublish' and 'unravel', do match)

The most common method seems to be to use a <copyField> in the Solr schema.
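
For illustration, here is a minimal sketch of that copyField pattern in a hand-written schema.xml. The names text_unstemmed, body and body_unstemmed are hypothetical placeholders and not taken from any Search API Solr schema; body stands in for an existing field whose type applies the stemmer.

```xml
<!-- Hypothetical unstemmed type: tokenizes and lowercases, but applies no stemmer. -->
<fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical destination field that holds the unstemmed copy. -->
<field name="body_unstemmed" type="text_unstemmed" indexed="true" stored="false" multiValued="true"/>

<!-- copyField copies the raw input value, so each field runs its own analysis on it. -->
<copyField source="body" dest="body_unstemmed"/>
```

Queries would then have to search both fields, typically with a higher boost on the unstemmed one, for exact matches to rank first. Note that the source field is named explicitly here.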

This is a problem when using Drupal Search API Solr, since fields are defined dynamically, not hard-coded in the schema.

Imagine a fairly ordinary Solr Search API server indexing nodes, processing as fulltext the fields Title, Body, Teaser, and one custom field named Notes, with SnowballPorterFilterFactory enabled on all fulltext fields.

What would be the most robust, Search API-friendly approach to indexing both the stemmed and unstemmed versions of these fields?

(Question also posted on Drupal Answers with no reply)

Comments

RAWDESK’s picture

Hi,
A response after 6 years of silence on this topic, but I thought it would be useful to share my use case and my attempts to get phonetic search working in a way similar to that described above (using copyField directives in schema.xml).

Here's what I added manually inside schema.xml :

    <fieldType name="phonetic" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="EXACT" concat="true" languageSet="auto"/>
      </analyzer>
    </fieldType>

Define fields dedicated to the phonetic field type:

    <!-- These fields are used to build the phonetic index -->
    <field name="cast_name" type="phonetic" indexed="true" stored="true" multiValued="true"/>
    <field name="crew_name" type="phonetic" indexed="true" stored="true" multiValued="true"/>

Copy values from the fields indexed by Search API Solr:

    <copyField source="tm_field_movie_person_cast$title" dest="cast_name"/>
    <copyField source="tm_field_movie_person_director$title" dest="crew_name"/>

After re-indexing the Solr instance, searches on phoneticized field values unfortunately do not yield any results.
So my first thought was that the Drupal View responsible for executing the Solr search query is not aware of the "copied fields" in the altered schema.xml.

An attempt to have the query also pick up the copied fields, using the hook_search_api_views_query_alter() implementation below, also failed.

    /**
     * Implements hook_search_api_views_query_alter().
     */
    function my_module_search_api_views_query_alter(&$view, SearchApiViewsQuery &$query) {
      // Add the copied schema.xml fields to the fields searched by the fulltext filter.
      $view->filter['search_api_views_fulltext']->options['fields']['cast_name'] = 'cast_name';
      $view->filter['search_api_views_fulltext']->options['fields']['crew_name'] = 'crew_name';
    }

Note: the BeiderMorseFilterFactory phonetic Solr filter requires both the index and the query analyzer to be configured in schema.xml in order to work correctly.
See page 68 and 69 in this e-Book :
https://books.google.be/books?id=u6GrCQAAQBAJ&pg=PA68&lpg=PA68&dq=Beider...

So my question is :
Is there a way to make Search API Solr aware of the existence of the copied fields inside schema.xml?

drunken monkey’s picture

Status: Active » Fixed

Thanks a lot for posting this, might always help others looking for information!

Is there a way to make Search API Solr aware of the existence of the copied fields inside schema.xml?

See the handbook. You probably just want to change the type of the fields in Solr, or use hook_search_api_solr_field_mapping_alter() to change the fields’ mapping to one with the proper prefix.
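
As a rough sketch of the first option (changing the type of the fields in Solr), assuming the schema.xml from the comment above with its phonetic field type and the Solr field names shown in its copyField lines: Solr gives an explicitly declared field precedence over a matching dynamicField pattern, so declaring the Search API Solr fields by name changes their type without any copy. The attribute values are assumptions carried over from the cast_name and crew_name definitions, not something taken from the handbook.

```xml
<!-- Explicit declarations take precedence over the dynamic field pattern that would
     otherwise match these names, so both fields are analyzed as "phonetic". -->
<field name="tm_field_movie_person_cast$title" type="phonetic" indexed="true" stored="true" multiValued="true"/>
<field name="tm_field_movie_person_director$title" type="phonetic" indexed="true" stored="true" multiValued="true"/>
```

Because this replaces the fields' analysis instead of adding a second copy, all fulltext matching on these two fields would then go through the phonetic analyzers. The content has to be re-indexed after the schema change.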

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.