Hi,
I talked at Szeged with Robert about adding support for dismax and query fields (to allow people to boost title or any other fields they want). Here's a preliminary patch to address that.
It adds a page in admin where you can configure the weight of each individual field and an option to choose which request handler is used.
This is against the current development version for Drupal 5. I'm willing to work on this (port to Drupal 6, make it more usable, etc.) if you guys think this makes sense.
Andu
| Comment | File | Size | Author |
|---|---|---|---|
| #54 | weighting-extra-323015-54.patch | 6.75 KB | pwolanin |
| #53 | weighting-extra-323015-53.patch | 6.77 KB | pwolanin |
| #52 | weighting-extra-323015-52.patch | 3.81 KB | pwolanin |
| #49 | vocab-names-323015-49.patch | 1.2 KB | pwolanin |
| #45 | rework-weighting-323015-45.patch | 17.25 KB | pwolanin |
Comments
Comment #1
robertdouglass commented
Thanks Andu! Looking forward to reviewing.
Comment #2
janusman commented
It works! I guess it needs some improvements (some friendlier way to specify weights, perhaps the actual names of the vocabularies instead of names like "imfield_vid1")...
However, I don't understand how fields like "changed" or "comment count" would affect search order, if they are not being searched? Perhaps they might affect ranking not by matching but by affecting sort? (Although if they became facets...)
Comment #3
robertdouglass commented
You can replace your Luke call with Solr_Base_Query::get_fields_in_index();
Comment #4
voidberg commented
Yes, it's pretty rough. I'll do some more work on it these days.
Comment #5
flexer commented
Hello,
is this patch supposed to be merged into the stable version?
Isn't the dismax handler the default one with Solr 1.3?
Comment #6
jarchowk commented
Is it just me or does adding dismax params break the faceted filtering?
Comment #7
janusman commented
@jarchowk: make sure you update your schema.xml, restart Solr, and rebuild the Solr index!
Comment #8
pwolanin commented
The Solr docs say: "NOTE: As of Solr 1.3, the DisMaxRequestHandler is simply the standard request handler with the default query parser set to the DisMax Query Parser (defType=dismax)."
It seems like we also need to work on a customized solrconfig.xml? E.g. we could make dismax the default handler:
"If no qt is defined, the requestHandler that declares default="true""
Drupal core search boosts, for example, the title and terms inside h tags: http://api.drupal.org/api/function/search_index/6
Looking at the Solr docs, it seems that you can only boost an entire field at index time, not a term within a field (please correct me if I'm wrong), or boost an entire field at query time. Thus, to get something more similar to core, it would seem like we need to search across several fields with different boost values by default? Alternatively, for the default keyword search, perhaps we can just append to the text field multiple copies of each term we want to boost?
Perhaps we could treat the text like this, to first capture certain tags for boosting:
http://api.drupal.org/api/function/search_index/6
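To make the "several fields with different boost values" idea concrete, here is a minimal sketch of how such a request could be assembled. It is written in Python rather than the module's PHP, and the field names and weights are made up for illustration; only `defType`, `q` and `qf` are actual Solr parameters.

```python
from urllib.parse import urlencode


def build_dismax_params(keywords, field_boosts):
    """Return Solr query parameters for a dismax search over weighted fields."""
    # qf is a space-separated list of field^boost pairs. Fields whose
    # weight is zero or negative are left out of the search entirely.
    qf = " ".join(
        f"{field}^{boost}" for field, boost in field_boosts.items() if boost > 0
    )
    return {"defType": "dismax", "q": keywords, "qf": qf}


# Hypothetical weights, chosen only for illustration.
params = build_dismax_params(
    "my keywords", {"title": 5.0, "body": 1.0, "taxonomy_names": 2.0}
)
query_string = urlencode(params)
```

The same `qf` string could equally live in solrconfig.xml as a handler default, which is the option discussed above; building it per request is what lets an admin page change the weights without touching the config.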
Comment #9
pwolanin commented
Jacob found this interesting article:
http://e-mats.org/2008/04/solr-using-the-dismax-query-handler-and-still-...
One take-home: "dismax does not support fielded searches through the regular query", which jibes with what I found playing with it today.
Comment #10
pwolanin commented
So, the motivation for using dismax is to be able to boost certain terms within the text. For that we could alter the node-to-doc code to something like this (borrowing from http://api.drupal.org/api/function/search_index/6):
And new schema fields like:
and we'd need to configure the default dismax `qf` param to include the `text` and `title` fields as well as all of these boost fields (or however we want to name them).
The boost would actually be performed at index time (or at query time as suggested in the patch above, or we can put default boosts in the solrconfig.xml, I think). To do it at index time needs tweaking of the base PHP class, I think, to get output like:
see also: http://lucene.apache.org/java/2_1_0/scoring.html
Shorter fields always get a boost relative to longer fields, so we may want to actually use lower/different boost numbers than above.
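The code from this comment did not survive in the thread, so the following is only a rough sketch of the idea, not the patch itself: pull the text out of selected HTML tags so that it can be indexed in separate fields that dismax weights differently. It is in Python rather than PHP, and the tag-to-field mapping and field names (`tags_h1` and so on) are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tag to the Solr field that receives its text.
BOOST_FIELDS = {
    "h1": "tags_h1",
    "h2": "tags_h2_h3",
    "h3": "tags_h2_h3",
    "strong": "tags_inline",
    "em": "tags_inline",
}


class TagTextCollector(HTMLParser):
    """Collect the text found inside the tags listed in BOOST_FIELDS."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # boost tags that are currently open
        self.fields = {}     # field name -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in BOOST_FIELDS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text nested in several boost tags goes to each matching field once.
        for field in {BOOST_FIELDS[t] for t in self.open_tags}:
            self.fields.setdefault(field, []).append(text)


def collect_boost_fields(html):
    """Return a dict of boost field name -> extracted text for one node body."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return {field: " ".join(parts) for field, parts in parser.fields.items()}
```

For example, `collect_boost_fields("<h1>Solr search</h1><p>Some <strong>bold</strong> text</p>")` yields one entry for the heading text and one for the bold text, while the plain paragraph text is left for the regular body field.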
Comment #11
robertdouglass commented
Configurable boost numbers, plz. Somewhere floating around is a patch that does this:
It might already be in D7 - I forget.
Comment #12
voidberg commented
pwolanin, why not use the fq parameter? It also does caching on individual pairs of facet:value which will speed things up.
Also, the motivation imho is not just to boost title. It is to allow people to tweak the searching behavior to better suit their need without hacking code or the schema.
Comment #13
pwolanin commented
@Robert - yes, I already had that as a plan, to make it a variable, though I'm not sure that it needs any UI in the base module. It's already in D7 indeed: http://api.drupal.org/api/function/search_index/7
@pixelmonk - we will indeed have to use `fq` for any facets. If you look at the code above, it is really oriented to providing a good out-of-the-box result. I'd certainly want to leave a mechanism for specifying an explicit qf with different boosts, though again it's not clear to me that this is really going to be commonly used if we get a reasonable default working.
Comment #14
pwolanin commented
@Robert - core seems to have code that also tries to boost a node if it has inbound links - have you ever had thoughts about the value of that metric?
Comment #15
pwolanin commented
Discussing w/ Jacob, we need to refactor for dismax - we should take all facets out of the Drupal path and make them a separate GET parameter, something like:
/search/apachesolr_search/my+keywords&facets=tid:2
This would make it much easier to transform them into the filter query (`fq`) arguments for the real Solr query.
We also need to think about doing default date biasing: "How can I boost the score of newer documents"
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994...
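As an illustration of the facets-to-`fq` transformation described in this comment, here is a small Python sketch (not the module's PHP). The thread only shows a single `facets=tid:2` value, so the space-separated format for multiple facets is an assumption, as is the helper name.

```python
from urllib.parse import urlencode


def facets_to_solr_params(keywords, facets_param):
    """Turn keywords plus a 'field:value field:value' string into Solr params."""
    params = [("q", keywords)]
    for facet in facets_param.split():
        # Skip anything that is not a field:value pair.
        field, sep, value = facet.partition(":")
        if sep and field and value:
            # Each facet becomes its own fq so Solr can cache it separately.
            params.append(("fq", f"{field}:{value}"))
    return params


solr_params = facets_to_solr_params("my keywords", "tid:2 type:story")
query_string = urlencode(solr_params)
```

Emitting one `fq` per facet, rather than one combined filter, matches the point made earlier in the thread that Solr caches each filter query individually.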
Comment #16
robertdouglass commented
@pwolanin: I don't find the "page rank" algorithm in core very important, to be frank. Others may disagree.
Comment #17
jarchowk commented
I know this is by no means publish-worthy code (I'm new to PHP/Drupal), but adding this to the search() in Service.php got my facets to work:
Comment #18
pwolanin commented
@jarchowk - that's more-or-less what we need to do in the module, though I want to separate out fields into a separate GET parameter.
Comment #19
pwolanin commented
Looking at our schema, I'm not sure the current use of termVectors makes sense. The Solr wiki has this:
http://wiki.apache.org/solr/SchemaXml#head-38400cd474e4b7139160ed9c0f921...
Which suggests we should be setting this on fields we commonly wish to use for highlighting (i.e. the body).
By switching to dismax, I think we can drop the text field altogether if we append the comments to the body (or add them as a separate field) - right now we just have it so we can have the title, body, comments, and taxonomy names in a single keyword search.
Comment #20
pwolanin commented
here's a start on a patch for all the back-end pieces. Note, this is still mostly broken.
we get rid of the text field, add some boost fields, and make a dismax handler the default in solrconfig.xml. Also a start on re-writing all the fields to be fq rather than q.
Comment #21
robertdouglass commented
The term vectors are there exclusively for the support of more-like-this content recommendation, which I see as one of the best features.
Comment #22
pwolanin commented
closer to something that works - note the changes to schema.xml AND solrconfig.xml
no UI config yet.
Comment #23
robertdouglass commented
removing text and adding a term vector to body might have the nice added side-effect of making the content recommendation more flexible. Yay.
Comment #24
pwolanin commented
ok, here's a pretty much working patch, with a minimal admin interface even.
Comment #25
pwolanin commented
whoops - missed diffing the query class. how about this...
Comment #26
pwolanin commented
Robert tested this out and it worked for him. Committing to 6.x with the knowledge that there are surely rough edges or bugs still.
Note - you need to load solr with the new .xml files and reindex to get this to work right.
Please follow-up with other concerns.
Comment #27
pwolanin commented
we should think about default date biasing. Possible options:
if we switch the "changed" field to a date format, we could use something like this example query:
or something like this example:
which will boost the 1st thousand docs...
Comment #28
pwolanin commented
like this? (including Luke(2) change for DamZ)
The curves show the effect of changing the steepness - I put in 4 as the default in the patch. No UI component.
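The curves themselves are not preserved in the thread. As a rough stand-in, the sketch below evaluates Solr's `recip(x, m, a, b)` function, defined as `a / (m*x + b)`, which the Solr relevancy FAQ uses for boosting newer documents. Treating the "steepness" setting as the `m` argument and using 1000 for `a` and `b` are assumptions made for illustration, not something the patch is known to do.

```python
def recip(x, m, a, b):
    """Solr's recip function query: a / (m*x + b)."""
    return a / (m * x + b)


def date_boost(rank, steepness=4, scale=1000):
    """Boost for a document that is `rank` positions older than the newest one.

    Assumption: steepness is passed as recip's m, and a = b = scale, so the
    newest document (rank 0) always gets a boost of 1.0.
    """
    return recip(rank, steepness, scale, scale)


# Compare how quickly the boost decays for two steepness values.
for rank in (0, 250, 1000):
    print(rank, round(date_boost(rank, steepness=1), 3), round(date_boost(rank, steepness=4), 3))
```

A larger steepness makes the boost fall off faster: under these assumptions a document 1000 positions back keeps half the boost at steepness 1 but only a fifth at steepness 4.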
Comment #29
pwolanin commented
committed this to get us started...
Comment #30
john.money commented
tag +1
Comment #31
pwolanin commented
found an uncaught exception.
Also, better this way since we'd rather keep side-effects out of the form builder.
Comment #32
pwolanin commented
added another form too.
Comment #33
pwolanin commented
let's set the body to 1.0 by default too.
Comment #34
pwolanin commented
We may need to work on the extraction of `a` tag contents. Turns out the top terms are things like:
It's not very useful if the URL filter is enabled and all links become contents of `a` tags.
Comment #35
pwolanin commented
rats - even with this regex, most of the stuff indexed seems to be junk. Maybe we should omit `a` tags normally?
Comment #36
pwolanin commented
here's a patch which also omits `a` tags from the default weighting.
Comment #37
pwolanin commented
committed that patch.
Comment #38
pwolanin commented
bug in the form builder and field sorting.
Comment #39
pwolanin commented
committed to 6.x
Comment #40
pwolanin commented
Testing this out today, I think we have much too high boosts on some of the fields (like title, H1) and also we should probably omit norms on most of those fields. Having norms means that a keyword match for 1 word in a 2-word title gives a substantially higher boost than 1 keyword in a 3- or 4-word title. Omitting norms means that finding a keyword in the title gives the same boost regardless of the length of the title.
I'd iterate next to something like:
Along with omitting norms.
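To put numbers on the norms effect: Lucene's default similarity computes a field's length norm as `1 / sqrt(number of terms)`. The Python sketch below applies that formula. It deliberately ignores the lossy one-byte encoding Lucene uses when storing norms and any index-time boost folded into the norm, so the values are approximate.

```python
import math


def length_norm(num_terms, omit_norms=False):
    """Approximate Lucene default length norm for a field with num_terms terms."""
    if omit_norms:
        # With omitNorms="true" the field length no longer affects the score.
        return 1.0
    return 1.0 / math.sqrt(num_terms)


# A one-keyword match in titles of different lengths.
for words in (2, 3, 4):
    print(words, round(length_norm(words), 3), length_norm(words, omit_norms=True))
```

With norms, the 2-word title gets a factor of about 0.71 against 0.5 for the 4-word title, roughly a 40% difference from title length alone. With norms omitted both get 1.0.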
Comment #41
pwolanin commented
needs more testing, but this seems to give something closer to the core results.
Comment #42
pwolanin commented
while we are at it, we ought to be able to remove this form_alter since we are only ever taking keywords via the form.
Comment #43
pwolanin commented
And, I just discovered we can use copyField with wildcard fields - so we can avoid sending an extra copy of this data on each post.
Also, fixes a bug where the term names were not separated by a space, and adds an extra '_' after 'vid' in field names for namespace reasons (e.g. to distinguish _vid from _video).
Adds field name="type_name", per my conversation w/ Robert about prep for multi-site search, and also changes the smfield_vid_* fields to use the taxonomy vocabulary name rather than the vid - again as prep for multi-site, which was a TODO in the code. Note - need to check that the transform there is multi-lingual safe.
Comment #44
pwolanin commented
Also, I realized we should probably replace the stripped chars with a space, not an empty string. For example, what if there is a set of tab-delimited words?
I'd really like to send this in soon - any comments?
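A tiny Python illustration of the tab-delimited case. The exact set of characters the module strips is not shown in the thread, so the control-character class used here is an assumption.

```python
import re

# Assumption: the stripped characters are ASCII control characters.
CONTROL_CHARS = re.compile(r"[\x00-\x1f]+")


def strip_to_empty(text):
    """Old behaviour: drop control characters entirely."""
    return CONTROL_CHARS.sub("", text)


def strip_to_space(text):
    """Proposed behaviour: replace each run of control characters with a space."""
    return CONTROL_CHARS.sub(" ", text)


sample = "alpha\tbeta\tgamma"
print(strip_to_empty(sample))  # words run together into one token
print(strip_to_space(sample))  # words stay separate for the tokenizer
```

Replacing with an empty string glues the three words into a single unsearchable token, while replacing with a space keeps them as three terms.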
Comment #45
pwolanin commented
ah, just found the PHP preg code for a unicode letter: \p{L}
Comment #46
robertdouglass commented
The code looks good to me and I'm very happy with all of the proposed changes.
Comment #47
janusman commented
Using \p{L} introduces a dependency on the PCRE libraries being compiled _a certain way_ in PHP, which is difficult for end users to detect and/or fix (they would have to recompile PHP in most cases!)
This is a known problem that has been worked around ("hacked around"?) in Drupal core itself (search module). The workaround is to explicitly specify a whitelist/blacklist of Unicode characters to either filter out or include.
Meaning, to keep our module's dependencies to a minimum, we have to go this way (or some other).
References:
* Search Core module (search.module) PREG_CLASS_SEARCH_EXCLUDE
* http://drupal.org/node/315486#comment-1117887
Comment #48
pwolanin commented
The goal here was to generate valid Solr field names from arbitrary vocabulary names that might include international characters.
However, what we actually need to ensure as well is that these are valid PHP object property names. Thus, the correct filter is probably this one from the PHP site:
Comment #49
pwolanin commented
Since we already start with something like 'smfield' as the prefix, we can just use the latter part of the above regex.
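The regex itself is missing from the thread. Presumably it is the label pattern from the PHP manual, `[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*`, whose latter part allows letters, digits, underscores and any byte from 0x7f to 0xff. Under that assumption, a byte-level sketch in Python (the committed code is PHP, and the prefix and the underscore replacement are illustrative choices) could look like this:

```python
import re

# Bytes NOT allowed by the latter part of the PHP label pattern (assumed).
INVALID_BYTES = re.compile(rb"[^a-zA-Z0-9_\x7f-\xff]")


def vocab_field_name(vocab_name, prefix="smfield_vid_"):
    """Build a field name from a vocabulary name (hypothetical helper)."""
    # Work on UTF-8 bytes, as PHP does, so multi-byte characters are kept
    # intact because all of their bytes fall in the \x7f-\xff range.
    raw = vocab_name.encode("utf-8")
    cleaned = INVALID_BYTES.sub(b"_", raw)
    return prefix + cleaned.decode("utf-8")


print(vocab_field_name("Free tags"))
print(vocab_field_name("Thèmes & co"))
```

Because the prefix already starts with a letter, the stricter first-character rule of the full pattern never comes into play, which is the point made in the comment above. Unlike `\p{L}`, this byte-range approach does not depend on PCRE's Unicode property support.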
Comment #50
JacobSingh commented
The patch looks good. I haven't tested it, but I get the concept, and that regex looks like it will work.
Comment #51
pwolanin commented
committed to 6.x
Comment #52
pwolanin commented
the weighting still seems to be way off.
This patch greatly boosts body, changed, and comment count weighting.
Also fixes a bug with the extra field for comment counts.
Comment #53
pwolanin commented
A little more cleanup as well. Looking now, it seems like having the q.alt param defined could be useful.
Comment #54
pwolanin commented
committing attached to 6.x.
Comment #55
pwolanin commented
Comment #56
pwolanin commented