Problem
If a field name in a content type definition is near the maximum length (32 characters), the apachesolr_index_key() function prepends the data type and plurality hints (e.g. "bs_" for boolean, single) to the field name, then truncates the result to the maximum length of a field name.
Sample error message when indexing nodes where the machine name is close to 32 characters long:
"400" Status: Bad RequestApache Tomcat/6.0.32 - Error report HTTP Status 400 - ERROR: [doc=4v3uk8/node/47489] : [0, 1, 1, 0, 0]type Status reportmessage ERROR: [doc=4v3uk8/node/47489] multiple values encountered for non multiValued field bs_field_siderbar_1_related_tab_: [0, 1, 1, 0, 0]description The request sent by the client was syntactically incorrect (ERROR: [doc=4v3uk8/node/47489] multiple values encountered for non multiValued field bs_field_siderbar_1_related_tab_: [0, 1, 1, 0, 0]).Apache Tomcat/6.0.32
Proposed resolution
Changing the routine to account for the maximum size of a field name + length of the "type hint" + length of the "plurality hint" resolves the issues. A patch is attached as a comment.
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | apachesolr-field-name-truncates-1707404-9.patch | 447 bytes | hedinfaok |
| #1 | apachesolr-field-name-truncates-1707404.patch | 454 bytes | hedinfaok |
Comments
Comment #1
hedinfaok commentedComment #2
hedinfaok commentedComment #3
ianthomas_ukWon't that cause problems elsewhere? You're effectively increasing the string length to 36 characters, but I think there are database columns that are 32 characters wide.
If not, then why is the limit there are all?
Comment #4
hedinfaok commentedI'm running this in production, but it would be awesome if you could apply the patch and see if it causes problems for you. The more QA the better, I always say. As it stands, the original line of code causes indexing failures if the field name is close to the maximum machine name for a field (32 chars).
Yup, the field name is a database column (in SQL databases), which has a maximum limit of 32 characters. The function apachesolr_index_key() returns an array, which is fed into the Apache Solr web service. Solr does not have a 32 character limit on it's DynamicField names. (The limit is the Java limit for a string.)
You'd have to ask @Nick_vh, as he's the author of the commit, according to git log. My belief is that there is nothing wrong with limiting the length of a variable, and I believe it is generally a good practice. I personally don't think it's necessary, but I like to err on the side of the author's original intentions on this one.
Comment #5
nick_vhThe issue here is the block delta that is being used in facet api. As it states above the code :
Facet API does hash the block deltas if they become bigger but preferably we do not want this practice. The current patch allows the field to grow > 32 chars so I don't think this is a good solution?
Do we have any other route to go to?
Comment #6
hedinfaok commentedThanks for clearing that up :)
I'm assuming that it is preferred to avoid block delta hashing to keep block hook implementations readable. Is this correct? I guess I would argue that it is reasonable to have hashes for deltas as part of the Drupal experience. I admit, it's ugly to read and the resulting hook function is not self describing, but it's expected behavior (See Views module).
If "hash dodging" is more important than the indexing function working for field names that are of legal value, then
It seems to me then, that the best option is to create a static map of "Apache solr truncated field names" and actual drupal field names. Because end characters will be truncated, collision is possible, so that would need to be accounted for as well. This would allow both 32 character truncated deltas and the ability to reference the original drupal field name.
Is this overhead worth the readability? If not, I'd have to push for the pragmatic solution as previously submitted.
Thanks again, I appreciate the input.
Comment #7
ianthomas_ukI've had a closer look at this, and would now advocate removing the substr() altogether.
But this function isn't returning a block delta, it's returning a field name for apachesolr so only needs to respect apachesolr's limits, which I think is practically unlimited (the only reference to a limit that I could find was a performance reference saying the field names would be included in the response).
There could be a problem if something takes the value from this function (or the field name pulled from Solr) and uses it straight as a block delta, but that would be a bug in the code calling this function, rather than the function itself. I've had a look through facetapi/query_type_numeric_range.inc and it's only used for two purposes: sent as a parameter to apachesolr, and used as part of a cache key. Neither should suffer from longer strings.
apachesolr.index.inc doesn't look like it would have any problems, and even adds _end to the key for dates.
apachesolr.module does use it as the key in a couple of arrays, and I'm not sure if it would cause a problems.
There are no other calls to apachesolr_index_key in apachesolr or facetapi.
If we do change the index key then already indexed content will have the wrong key. Do we need to handle that, or do we just add a release note advising a full reindex if you have long field names?
Comment #8
pwolanin commentedI think this is a legacy of the much earlier code that used these directly as block deltas.
Comment #9
pwolanin commentedso, I don't think it's a big problem to change this t stop truncating, but it might causing existing indexes and facets to stop working.
Comment #10
hedinfaok commented/me capitulates
I've re-rolled the patch. I'll look into providing some test coverage on this.
Comment #11
pwolanin commentedWell, look at the prior issue, if we are changing the keys we might also need an update function to check wither any of the facetapi keys changed and fix those, plus at least warn the user if re-indexing is required. See: #1161538: The numeric field id should not be used for Solr index field names
Comment #12
nick_vhI'm ok with this change, but I agree this needs a huge warning as it might change indexes that are currently running
Comment #13
heacu commentedany chance we'll see this patch incorporated into the next release?
Comment #14
nick_vhBefore we roll this change we need an update function and I'm not seeing this here yet. It's very tricky to suddenly change this and not tell people in advance.
Perhaps we could set a flag in the index_key function that tells it not to truncate while the default behaviour is still truncating?
Comment #15
nick_vhComment #16
nick_vhworking on #2118947: Allow mapping to accept multiple index_callbacks that makes this issue a duplicate