Problem/Motivation
In the SOLR schema.xml you can define dynamic field types for different purposes to enable things such as filters etc. This allows you to do more with your searches, such as using EdgeNGramFilterFactory to enable partial matching in fulltext searches. However, there are currently limitations in Search API Solr (or perharps Search API itself) that prevent you from being able to do this.
Proposed resolution
The ideal would be able to set up a dynamic field type in the schema.xml for Solr and then be able to select this as an option in the index (ie an alternative to fulltext, string etc). You can define these types by implementing hook_search_api_data_type_info(), including providing a fallback for servers that support the implementation. However, it may be better to provide a hook in Search API Solr that allows you to define the additional bits of information required for Solr (such as prefixes, support etc) which Solr then declares in it's own implementation of hook_search_api_data_type_info().
So the way I imagine it working would be:
- Developer adds the new field type and dynamic field to their schema.xml with the settings that they want. For example:
<dynamicField name="te_*" type="edge_n2_kw_text" termVectors="true" />
Will set any fields named "te_*" to use the Edge N Gram definition that is already in the schema.xml. - Developer then (in a custom module) implements something like:
/** * Implements hook_search_api_solr_dynamic_field_info(). */ function mymodule_search_api_solr_dynamic_field_info() { return array( 'edge_n2_kw_text' => array( // Definitions for hook_search_api_data_type_info(). 'name' => t('Fulltext (with partial matching)'), 'fallback' => 'text', // Additional definitions for Search API Solr. 'prefix' => 'te', 'always multiValued' => TRUE, ), ); }
The issues
- To use a custom data type with a fallback, the server needs to declare that is supports the data type.
SearchApiIndex::index()does this by calling the server's implementation ofSearchApiAbstractService::supportsFeature()with the feature being called'search_api_data_type_' . $custom_type(search_api/includes/index_entity.inc line 430). To allow custom dynamic data types, Search API Solr would need to return TRUE for these custom types, which it currently doesn't. - In
SearchApiSolrService::getFieldNames()we get hold of the type using$field['type'](search_api_solr/service.inc line 347). However, custom fields that provide a fallback store the fallback in'type'but the real type in'real_type'. InSearchApiIndex::index()they use the following code to decide what type to use:
if (isset($info['real_type'])) { $custom_type = search_api_extract_inner_type($info['real_type']); if ($this->server()->supportsFeature('search_api_data_type_' . $custom_type)) { $fields[$field]['type'] = $info['real_type']; ... } }I think
SearchApiSolrService::getFieldNames()needs to do the same process so that it knows what type it should be using. SearchApiSolrService::getFieldNames()then needs to figure out the naming convention for the field. First off it gets hold of the prefix. This list of prefixes is hard coded as part ofSearchApiSolrService, so any dynamic fields would need to be able to be added in. After that, it then works out whether the field is a single or multiple:
if ($pref != 't') { $pref .= $type == $inner_type ? 's' : 'm'; }This hard coding then means it's impossible to declare other text based field types, as they would have a different prefix.
I'm up for doing most of the leg work, but would definitely appreciate some reviewing of it to make sure I've not missed anything obvious or if there is a better solution to the problem!
Comments
Comment #1
andrewbelcher commentedOk, here is a patch that works smoothly for me!
What I've done
hook_search_api_solr_dynamic_field_info()(andhook_search_api_solr_dynamic_field_info_alter()) to define how Search API Solr should index different Search API Data Types.search_api_solr_get_dynamic_field_info()) for retrieving the information.hook_search_api_solr_dynamic_field_info_alter()to give the definitions for the types that were previously hard-coded.How to use it
Here is an example of how to set up an alternative fulltext using the EdgeNGramFilter as already defined in the schema.xml Search API Solr ships with:
mymodule.module
Comment #2
bachbach commentedHi Andrew,
patch #1 works for me against 1.0-rc3, thanks a lot for that, i use it to setup an autocomplete field based on solr, this solution is far better than this one
Comment #3
andrewbelcher commentedI've fixed some of the field type definitions that were missing the prefix key and re-rolled it against the latest dev version. Have also attached a patch for rc3.
Comment #4
drunken monkeyThanks for the feature request and the patch! The feature really does make sense, there currently doesn't seem to be an easy way to do this. (I hope you know, though, that for custom web sites you can just make the specific field use a certain type, instead of using a dynamic field. So it's still easily possible to do this, it's just admittedly far less comfortable than when you can add the field type and later just pick the fields to use it in the UI.)
Your patch also looks great, I rarely get patches this good … Sorry it took me so long to get back to you!
Anyways, I'd still like to discuss this a bit before committing:
hook_search_api_data_type_info()ourselves in this module and there just pass on the information from the newhook_search_api_solr_dynamic_field_info(); or we could let users specify the additional, Solr-specific information right inhook_search_api_data_type_info()– including additional keys isn't forbidden (we should probably just prefix them accordingly), other modules would just ignore them but we would look for them and automatically add support for these types.supportsFeature()to returnTRUEfor all types that have a mapping.Of course, even though functionally probably equivalent it would be a bit less comfortable, having to both add the data type and alter the prefix mappings. Still, in the current patch there's two hooks to implement, too.
always multiValuedflag? People could just specify both (e.g.)tes_*andtem_*as multi-valued fields. If we document this properly (we should add a short explanation on how to add dynamic fields anyways), that's probably still more comfortable than having a special case for that flag. Well, or not. However, with the previous idea of only allowing altering of the prefix mappings, having the flag wouldn't be possible anyways.OK, that's about all of my thoughts so far. Sorry it's a bit much, writing stuff down helps me think. ;)
Generally, I like low-impact solutions, but making this as comfortable as possible for developers is of course important, too. Given that, I guess one of the two variants in item 1 would be the way to go.
What do you think?
Comment #5
andrewbelcher commenteddrunken monkey, no worries about the time! Thanks for your feedback!
hook_search_api_data_type_info()and then usehook_search_api_data_type_info_alter()to add our info to the core Search API types...hook_search_api_data_type_info()/hook_search_api_data_type_info_alter(), is there any reason to move the bits we provide out of a hook? We can just implement it in the same way we'd recommend others to?supportsFeature()method, so my inclination here would be for me to update the patch with the above and then let you tell me if I've done the right thing withsupportsFeature():PSo it was already doing a manual check for
tmbefore deciding whether to treat things as multi valued or not. Obviously leaving it hard-coded for onlytmdoesn't make sense, so I brought it into a parameter. But if everything can be single/multi valued, then that check could just go and we can ignore thealways mulitValuedstuff. Again, you probably know better than me why it was forcingtmto be multi valued, so I'll leave that one to your call...Things my end are manic atm, so may be a couple weeks before I've got the time to take another look at this, but it will happen!
Comment #6
drunken monkeyYou're right, good idea!
It feels a bit hack-ish, but I guess it could really be the best alternative.
You're right, I guess there isn't, and it might make things clearer for others.
OK. ;)
The problem here is two-fold: first, in the Search API, fulltext data can be passed to the service class already tokenized. Also, Solr will tokenize the data anyways, so having it as single-valued (and using
implode()to force tokenized data to be single-valued) doesn't give us any advantage. Therefore, this service class always uses a multi-valued field for fulltext data.However, since we're using common configurations, we can't just define
ts_*to be multi-valued, too. Therefore, we just always force the field to have thetm_*prefix, no matter whether the Search API field is single-valued.However, for custom-added fields there really is no reason why we shouldn't let users define both
XXs_*andXXm_*as multi-valued (as long as we document this properly), and it saves us from having to have this additional key and changing all of these checks.However, as I write this up I'm becoming less and less certain this is the right way, maybe having the key will be more comfortable and/or less error-prone, or maybe there is a hidden assumption somewhere I didn't take into consideration.
I guess we'll just have to see how well this works in the implementation, and re-evaluate whether we want to have the
multiValuedkey after all.But for now, I think we should try to do this without it and see how well it works.
(Oh, in any case it would be great if you could add this use case to the developer documentation once this patch is committed.)
Comment #7
smithmilner commentedThis looks pretty interesting. What kind of performance hits might there be for integrating EdgeNGramFilterFactory into regular text fields?
Comment #8
danielnolde commentedRerolled patch #4 against current dev to make use of it. Works well for me so let's think about whether committing this is better than further discussion. My 2cents: Commit this, it works as intended, solves a problem well, looks clean in terms of code and architecture (to be as sapi noob at least).
Comment #9
drunken monkeyComment #10
andrewbelcher commentedSorry I've not got round to re-working as per the suggestions, things have been absolutely manic! I should have more space in the next couple weeks, so I will try my best to get an update patch up!
Alternatively if you want to commit it create some follow up issues and I can pick them up.
Comment #11
drunken monkeyAbsolutely no problem, happens to me all the time, too. ;)
Follow-up issues are rarely a good idea, especially outside of core, so I'll just leave this for now until you have the time to work on it. Please don't feel pressured, though, lack of time is very common among Drupal developers. ;)
Comment #12
drclaw commentedHi All,
I thought I might take a stab at it as I need this functionality as well. I've attached a patch that I think addresses previous discussions:
hook_search_api_data_type_info_alter()now instead of a custom hook to add our additional data type info to default data types (as well as location, geohash and tokens -- more on this at the end)search_api_get_data_type_info()in our service classgetFieldNames()method to get field infoSupportsFeaturehas been modified to include any defined custom data typesalways multivaluedkey in there, but I lowercased it to keep it drupal-esqueThere was one thing I wasn't sure about. When I first started working on this, I added our extra data type info for the tokens, location and geohash data types. However, this was causing them to show up in the type list on the fields admin page in search api. The seemed incorrect to me since I couldn't choose those before as a type. So, I changed it to only add the extra data if the type had already been defined, presumably by another module (e.g. https://drupal.org/project/search_api_location for location fields)... Is this correct?
Thanks!
drclaw
Comment #13
drclaw commentedOops, needs review. =)
Comment #14
nick_vhI'd change this to cardinality, having a big string that says always multiple is a bit of redundant information because if it's set in code it will always be always ;-)
Also, this is more in line with how drupal defines its structure of telling is a field is singular or multiple (cardinality). See https://api.drupal.org/api/drupal/modules!field!field.module/group/field/7
Comment #15
nick_vhComment #16
drclaw commentedI don't think Cardinality is the right terminology in this case. We're not determining the number of elements of the field. We just want to know if it is or isn't a multivalue field...
Comment #17
drunken monkeyI agree with drclaw, we don't want to know the exact number of items. And "multivalued" would convey the wrong impression that
FALSEmeans that the field will always be single-valued. "always multivalued" is really a bit long, but I can't think of a better option.Sorry for not having had the time yet to do a review on this, I have way too many open issues and way too little time. I'll get back to you, hopefully soon!
Comment #18
nick_vhSo, in Drupal 6 we used the terminology of "multiple", in Drupal 7 this changed to "cardinality". In the apachesolr module we use this field from Drupal to define if a field is single of multivalued. And each multivalued field we also index the first value in a single valued field in order to allow dynamic switch backs if these fields would ever change.
Now, I think it's a good idea to use the existing structure in Drupal to define if a field is single or multiple, certainly in a module that is so Drupal/Entity agnostic like Search API.
If not "cardinality", you should probably use "multiple" and omit the "always" as always is not clear enough.
Comment #19
drclaw commentedYep, I see your point. I don't really like that "always" in there either. However, the terminology we're trying to match here is Solr's, not Drupal's. I think I made a mistake by lower casing the term "multiValued" because that is the how it's spelled in the dynamicField tag of Solr's schema.xml (I've updated the patch). Also, the purpose of the flag is not to decide if a field is single or multiple. The purpose is to indicate whether or not SearchAPI Solr should automatically detect whether or not the field is multiValued, or if it should just always consider it multiValued.
I'm not opposed to changing the phrasing of the flag, but I want to make sure it conveys the correct information.
Comment #20
nick_vhok I got it.
So you are trying to define which fields you have available in Solr and tell Search API a specific field is not available in single or multiValued type.
In order to do so, we can probably have the property single and multiple, default both of them to true when loading the definitions and then in the hook you can set single to FALSE or multiple to FALSE depending on the field definition?
That sounds like a more understandable data structure.
Comment #21
drunken monkeyIt's only half of that: either it's available for both single and multiple values, or it's always multi-valued (or
multiValued). This just comes from my initial decision to make text fields always multi-valued (as you might recall, I previously only had thet_*prefix for fulltext) – to avoid larger changes in the code, we have to be able to define whether new dynamic fields should behave like fulltext fields – i.e., always bemultiValuedin Solr. It doesn't say anything about whether the original data is single- or multi-valued.I now know that this initial decision was probably the wrong one, but now we're stuck with it and have to make do. You are welcome to get rid of this mess in D8. ;)
As for the patch, I now reviewed it. See my comments below. Functionally, it seems to work fine, at least as far as the existing functionality is concerned. But I don't see why it shouldn't work for new dynamic fields, too.
You can't do it this way – Solr doesn't automatically support all custom data types, and there are no
search_api_data_type_*features for the built-in types (although that of course shouldn't lead to any errors).This is needed to support the new, "dynamic fields" data types, right? In that case, I'd probably define the static ones as before, then check against those and return
TRUEfor a match. And only if there is none and the feature starts with the right prefix, get the data type information – and only returnTRUEif the one in question has the Solr prefix defined.Since you reference thatabove already, I don't think the additional
@seeis necessary.I wouldn't override the names here, and also I think for
'always multiValued' == FALSEwe can just leave this blank (we have to useempty()everywhere anyways).tokensis a special case – it's actually defined by the Search API, so it should always be available, but it's no "official" data type (= none that can be selected in the UI). It'S used internally to mark already tokenized fulltext data.So, I guess it's necessary to somehow add this type to our Solr-internal type collection, but not to the "official" one via
hook_search_api_data_type_info_alter().Probably you should wrap all
search_api_get_data_type_info()calls (apart from the one insupportsFeature()) in a custom method which addstokensto the array.Comment #22
drclaw commentedAlright, patch round #2 (at least, round 2 from me). I incorporated all your feedback. Hope I understood everything correctly. In doing #4 I realized we don't even need to implement
hook_search_api_data_type_info()orhook_search_api_data_type_info_alter(). Just the wrapper function is enough. Because of this, I added an example of ahook_search_api_data_type_info()implementation with our new keys to search_api_solr.api.php. Hope that makes sense!Thanks!
drclaw
Comment #23
drunken monkeyThanks for revising the patch, great work! It already looks pretty good, I just found a few remaining issues (largely nit-picking, and nothing functional):
Comments should be full sentences ending with a period.
First, you don't need the
elseif you return in theif.Second, this code is too complicated: you only need to check a single key – just check it, no need to loop. Also, you only need to check this at all if
$featurehas the right prefix.Looping through all the types would only be useful if you then (statically) cache that information (the first time you need it) so further lookups for any data types will go faster. However, since
search_api_get_data_type_info()is already statically cached, there probably wouldn't be much performance to gain here.This should probably not be committed. ;)
Great that you included this example!
However, while I'm not completely sure whether there's a standard or best practice for this, I'd use the documentation method the Entity API uses with
entity_crud_hook_entity_info().That way, primarily, we can properly document how the additional keys should look.
The first sentence of a documentation block should be on a single line of at most 80 characters.
This is called often enough that we should definitely statically cache it.
I already included all of these in the attached patch, so as far as I'm concerned this would be good to go. Please test/review!
Comment #24
drclaw commentedThanks for the notes (and re-roll). All good points. Your docblock explanations are much better than mine too. =P
Tested it out and it works a treat! Am I being too ambitious by setting it RTBC?
Comment #25
drunken monkeyNo, you aren't. Thanks for reviewing, and always good when people then actually set the issue to RTBC!
So, committed. Thanks again for your work, everyone, especially andrewbelcher for the initial idea!
Comment #26
polHi all,
I'm looking for a bit of help on this.
I'm looking for a way to use the 'edge_n2_kw_text' type for the title of a node, so a user can find a substring in the Title of a node easily.
To do this, according to this commit, I have to implement two hooks:
Then, in the fields of my index, I select "Fulltext (partial)" for the Title field. Right ?
I did it, but still, the 'label' of my node is still using the type 'text'.
What am I doing wrong ?
Comment #27
drunken monkeyThe
labelfield in Solr, if your are talking about that, isn't used at all by the Search API Solr module, and it's also not possible to change its type (since that's hardcoded).tem_titleshould be the Solr field containing your node titles.Comment #28
polHi,
Thanks for your insights, I did more research and I'm still not able to get it working.
When I select "Fulltext (partial)" for the
Title field, it should use in Solr thetem_titletype.Using that code in #26, the
Titlefield is still using thetm_titletype.If I select the "
String" type for theTitle field, the schema browser in Solr finds it underss_title, which is right.In the search_api_solr.api.php, there's also an example on how to use it, I copied it and I'm able to see the "Fulltext (w/ partial matching)" option in the select box, when I use it for the title field, I'm unable to find it in the schema browser of Solr. It's always using
tm_title.I guess I'm doing something wrong, but I don't know what I'm missing.
Comment #30
drclaw commented@Pol,
I think maybe we made a mistake in the documentation for using the 'edge_n2_kw_text' field that comes preconfigured in the Schema.xml provided by search_api_solr. When a field is configured as "always multiValued", you need to provide the full dynamicField name prefix as it's defined in Schema.xml. For the edge_n2_kw_text field, there are two dynamicFields defined. One that is multivalued and one that isn't:
Since the field is always multiValued, the 'm' and 's' part isn't automatically added to the prefix for you so you have to do it yourself:
Our example needs to be update to include the above "m" in the prefix. Also it's possible that the hook documentation itself could go into more detail of how the "m" and "s" letters are automatically added for you if you don't specify a field as always multiValued. Maybe @DrunkenMonkey has some thoughts?
Comment #31
polOh cool, that 'm' missing makes my day :)
Indeed, I think the documentation should be updated as well :)
Thanks for the heads up drclaw, it's fantastic, it works pretty good !!!
Comment #32
sgdev commented@drunken monkey, I fixed the documentation issue with a patch over 2 years ago, and @mkalkbrenner continues to close it and recommend that I contact you if it needs to be added.
Please include it in the 7.x branch. Thank you.
https://www.drupal.org/project/search_api_solr/issues/3004407