Just to get a first grip, here's a diff of the two (3.x) schema.xml configs. The schemas probably shouldn't be 100% identical, but let's see what we can learn from each other.
| Comment | File | Size | Author |
|---|---|---|---|
| #16 | 1601982-16.patch | 30.83 KB | nick_vh |
| #15 | 1601982-15.patch | 30.83 KB | nick_vh |
| #10 | 1601982-10.patch | 5.98 KB | nick_vh |
| #1 | 1601982-2.patch | 29.57 KB | nick_vh |
Comments
Comment #1
nick_vh
Comment #2
nick_vh
This applies to all the field types defined in the Search API schema. It's not necessary to define indexed/stored if that already happens in a dynamic field definition.
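For illustration, a minimal sketch of that point. The type and prefix names below are assumptions chosen for the example, not lines copied from either module's schema.xml:

```xml
<!-- Hypothetical type definition: no indexed/stored attributes here -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<!-- The dynamic field that uses the type sets indexed/stored itself -->
<dynamicField name="ss_*" type="string" indexed="true" stored="true" multiValued="false"/>
```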
This was added to allow base64 content. It makes the schema more complete, and Display Suite even uses it.
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-sol...
"Finally, I tried out the Trie stuff by changing the query to use integer_tri. The results are the same as the sortable stuff, but, in my totally informal testing, the Trie stuff is a lot faster (others have done more formal testing, so I feel comfortable with my results.) Good news!"
Not sure when this was added, but it probably solved an existing problem :-)
Why was this commented out? Seems like we want processing of the, by default, English language?
Have to run now, but I'll do more later
Comment #3
cpliakas
Thanks Nick. This will be a tricky one. The diff is much appreciated, and thanks for the explanations.
Comment #4
cpliakas
Adding new component flag.
Comment #5
drunken monkey
Yes, that really is a tough one. I'd have said we should leave that for when we have the solrconfig.xml figured out, but I guess we might as well start now. However, we probably should split this as well?
In any case, I guess having common type definitions might be a good subordinate goal, and an achievable one.
For the actual field definitions, a quick look tells me there won't be much chance of those being unified. The approaches are just too different, as far as I can tell. Maybe unifying some fields, e.g., the required ones, might be possible, but I'm not sure how much of an advantage that would bring us.
Search API Solr also uses Trie fields, but we didn't rename the type. This way, people can still use the "normal" types, even though by default only Trie fields are used.
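As a rough sketch of what keeping a plain type name on top of a Trie class can look like: the names `integer` and `sint` are hypothetical here and not taken from the Search API Solr schema; only the Solr classes and attributes are standard for 3.x.

```xml
<!-- Hypothetical name: a Trie-backed type that keeps a plain, un-renamed name -->
<fieldType name="integer" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

<!-- Hypothetical name: a non-Trie type that stays available for custom fields -->
<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
```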
Well, you'll only very rarely want case-sensitive matching in fulltext searches, so I guess that makes sense. And since this field type isn't used by default, I guess adding it to the Search API side wouldn't hurt, either.
Why? Why should English be the default?
In the end, no matter what language we choose, most people will have to change it. Maybe some won't need any setting at all, and in any case that will (very probably) be much better than having stemming for a wrong language.
These are roughly the thoughts that led me to that decision.
Comment #6
nick_vh
Regarding the English stemming: I think we should design for the 90%, and those who want to modify it can do so afterwards. Having a sensible default only seems logical to me. On a side note, I am a native Dutch speaker, but I do expect things to work as smoothly as possible in English, with a set of clear instructions somewhere that explains how to modify the schema to support another language.
The whole multilingual story is still an issue anyhow and is not easily solved. Maybe this module can provide a couple of languages, or we should set up some site that can generate your schema, similar to what we discussed at DrupalCon London.
I know that Typo3 has a repository with more than 20 schemas (see https://svn.typo3.org/TYPO3v4/Extensions/solr/trunk/resources/solr/typo3...) plus some stopwords per language.
Comment #7
nick_vh
Adding a defaultSearchField is always useful when you want to debug Solr directly in the admin interface.
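A minimal sketch of such a setting in a Solr 3.x schema.xml, assuming a field named `content` exists, as in the apachesolr schema discussed in this thread:

```xml
<!-- Queries without an explicit field (e.g. typed into the admin interface) search this field.
     Assumes a "content" field is defined elsewhere in the schema. -->
<defaultSearchField>content</defaultSearchField>
```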
We should probably discuss this, since Search API does not have such a field.
"Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms." Auto-complete does not need boosting nor full text search so this can be omitted
We'll get back to this after we've done the solrconfig. Changes here are less critical as cpliakas mentioned
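To illustrate the omitNorms remark above, a hedged sketch of an autocomplete-style type without norms. The type name and the analyzer chain are assumptions for the example, not the definitions from the patch:

```xml
<!-- Hypothetical autocomplete type: omitNorms="true" disables length normalization
     and index-time boosting for fields of this type, which saves some memory -->
<fieldType name="autocomplete_example" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```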
Comment #8
drunken monkey
I highly doubt that 90% is anywhere near the right figure.
Since there is no field like "content", which will always contain sensible data, in Search API Solr, we cannot easily do that. The best we could do is specify "id" or something like that, to at least avoid errors. Don't know if that makes much sense, though, as the error more clearly shows that you should explicitly specify a field.
Comment #9
nick_vh
I've been talking to a lot of people about this initiative, and they all vote yes to having this. I've also asked whether a default English schema would be a problem, and not a single one of them was against the idea.
I really want to stress the importance of an English default schema and of moving the multilanguage question to a separate issue. This way we can enable the stopwords/protwords/... at once and allow an easy integration. A readme could be included that explains the possible configuration options for different languages.
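As a sketch of what an English default could look like, modeled on the stock Solr 3.x example schema rather than on the patch itself (the type name and the exact filter chain are assumptions):

```xml
<!-- Hypothetical English-by-default text type -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Drops the words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- English stemming; words listed in protwords.txt are not stemmed -->
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```

Supporting another language would then mostly mean changing the `language` attribute and replacing the stopwords and protwords files.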
Comment #10
nick_vh
This diff combines both schemas without cutting functionality or breaking backwards compatibility. The patch was diffed against the solr-3.x schema of apachesolr.
Last thing I have to check is the geo functionality in search_api_location so it works with this schema. I do have some questions but will post them as a review.
Update on the geo front: I've asked the search_api_location maintainers whether they could manage with the current fieldsets. From the looks of it, they are just adding unnecessary fields, and I think they can use the dynamic fields that exist in this patch. #1647520: Schema.xml consolidation efforts
Comment #11
nick_vh
Why are these prefixed with f_ss/f_sm*?
If the prefix is meant to say they are for faceting, it conflicts with the namespace philosophy that the namespace is an abbreviation of the field type (string, int, ...).
Could we rename them somehow? Having both fss and f_ss is not so nice.
Proposal: sts_* and stm_* (string termvector single, string termvector multiple).
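A sketch of how the proposed prefixes could be declared, assuming a `string` type (solr.StrField) is defined elsewhere in the schema; the attribute choices are illustrative and not taken from the patch:

```xml
<!-- string, termvector, single-valued -->
<dynamicField name="sts_*" type="string" indexed="true" stored="true" multiValued="false" termVectors="true"/>
<!-- string, termvector, multi-valued -->
<dynamicField name="stm_*" type="string" indexed="true" stored="true" multiValued="true" termVectors="true"/>
```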
Also, the default key is still content, but it is not required to have it, so it doesn't harm search api at all.
Comment #12
cpliakas
Nick, that's awesome!
Comment #13
nick_vh
Marking as needs review, because I need input from Tomas/drunkenmonkey.
Comment #14
drunken monkey
Great work!
It makes the schema pretty cluttered, especially in comparison with the Search API, but I guess if this really has no disadvantages regarding functionality, that's probably worth it. We'll just have to unify the type definitions now, up to a certain extent. Specifically, we seem to use a very different approach regarding strings. We might have to introduce two different dynamic fields for those …
Other than that, how compatible are the types used (or, to what extent are they equal)? If they differ in other aspects, maybe we'll have to generally introduce prefixes for our dynamic fields (i.e., all apachesolr fields start with "a_", all Search API ones with "s_").
However, if we have to change field names, or to what fields what data is indexed, this would necessitate re-indexing for users and also make updating more complicated (on code update, users would have to immediately replace the config files, restart Solr and re-index, or things would break).
So maybe forcefully (completely) unifying the schema, too, won't be worth it after all? Having uniform type definitions (and then maybe use different types for the same fields) would be a good first (and maybe only) step here.
Hm, I guess it would be OK to do that …
However, at the very least we should make the "English" a variable, so people can easily change the language (and then just replace the stopwords, etc., files).
Tab.
It's not really for backwards-compatibility if we still use it.
But by the way, is there any merit to distinguishing single-valued from multi-valued text fields? I couldn't really think of any, thus only t_*.
We should probably also note that these are only used by the Search API.
So does, e.g., sort_*, and so do the three-letter prefixes Apachesolr uses when compared to the Search API.
We should maybe just make the distinction clearer, which fields are used by both modules and which only by one. This way, f_ss_* and fss_* would be more clearly separated and I wouldn't see such a problem with that.
(That is, provided we choose to unify the field definitions at all.)
Comment #15
nick_vh
After some discussion, drunkenmonkey agreed to use the conventions from apachesolr in the schema.xml. The attached patch is the full schema, ready to be used. It is not yet compatible with search_api_solr, so there is some small work to be done there.
Comment #16
nick_vh
Fixed whitespace issues.
Comment #17
nick_vh
Committing this file just so it becomes easier to see diffs.
Comment #18
drunken monkey
There are two trailing spaces, but apart from that I can live with this. ;)
Amazing to see we really managed to do this!
Comment #19
nick_vh
Just checked, there are no trailing spaces in that file anymore :)