Add support for dismax and query fields [#323015]

Comment	File	Size	Author
#54	weighting-extra-323015-54.patch	6.75 KB	pwolanin
#53	weighting-extra-323015-53.patch	6.77 KB	pwolanin
#52	weighting-extra-323015-52.patch	3.81 KB	pwolanin
#49	vocab-names-323015-49.patch	1.2 KB	pwolanin
#45	rework-weighting-323015-45.patch	17.25 KB	pwolanin
#44	rework-weighting-323015-44.patch	17.17 KB	pwolanin
#43	rework-weighting-323015-43.patch	16.49 KB	pwolanin
#42	rework-weighting-323015-42.patch	11.5 KB	pwolanin
#41	rework-weighting-323015-41.patch	9.85 KB	pwolanin
#38	form-sort-323015-38.patch	2.1 KB	pwolanin
#36	a-tag-323015-36.patch	2.37 KB	pwolanin
#35	a-tag-323015-35.patch	1006 bytes	pwolanin
#33	form-catch-exception-323015-33.patch	6.26 KB	pwolanin
#32	form-catch-exception-323015-32.patch	5.19 KB	pwolanin
#31	catch-exception-323015-31.patch	1.45 KB	pwolanin
#29	dismax-date-323015-29.patch	3.94 KB	pwolanin
#28	dismax-date-323015-28.patch	3.91 KB	pwolanin
#28	date-curves.png	60.62 KB	pwolanin
#25	dismax-323015-25.patch	47.95 KB	pwolanin
#24	dismax-323015-24.patch	33 KB	pwolanin
#22	dismax-323015-22.patch	35.07 KB	pwolanin
#20	dismax-323015-20.patch	27.82 KB	pwolanin
	dismax.patch	5.85 KB	voidberg

Comment #1

robertdouglass commented 23 October 2008 at 17:17

Thanks Andu! Looking forward to reviewing.

Comment #2

janusman commented 30 October 2008 at 22:09

Status:	Needs review	» Reviewed & tested by the community

It works! I guess it needs some improvements (some friendlier way to specify weights, perhaps instead of names like "imfield_vid1" the actual names of the vocabularies)...

However, I don't understand how fields like "changed" or "comment count" would affect search order, if they are not being searched? Perhaps they might affect ranking not by matching but by affecting sort? (Although if they became facets...)

Comment #3

robertdouglass commented 31 October 2008 at 07:10

Status:	Reviewed & tested by the community	» Needs work

You can replace your Luke call with Solr_Base_Query::get_fields_in_index();

Comment #4

voidberg commented 5 November 2008 at 15:37

Yes, it's pretty rough. I'll do some more work on it these days.

Comment #5

flexer commented 17 November 2008 at 15:24

Hello,

is this patch supposed to be merged into the stable version?

Isn't the dismax handler the default one with SOLR 1.3?

Comment #6

jarchowk commented 28 November 2008 at 10:37

Is it just me or does adding dismax params break the faceted filtering?

Comment #7

janusman commented 28 November 2008 at 22:13

@jarchowk: make sure you update your schema.xml, restart Solr, and rebuild the Solr index!

Comment #8

pwolanin commented 29 November 2008 at 16:31

The Solr docs say: "NOTE: As of Solr 1.3, the DisMaxRequestHandler is simply the standard request handler with the default query parser set to the DisMax Query Parser (defType=dismax)."

It seems like we need to work on a customized solrconfig.xml too? E.g. we could have dismax be the default handler:
"If no qt is defined, the requestHandler that declares default="true""

Drupal core search boosts, for example, the title and terms with h tags: http://api.drupal.org/api/function/search_index/6

Looking at the Solr docs, it seems that you can only boost an entire field at index time, not a term within a field (please correct me if I'm wrong), or boost an entire field at query time. Thus, to get something more similar to core, it would seem like we need to search across several fields with different boost values by default? Alternatively, for the default keyword search, perhaps we can just append to the text field multiple copies of each term we want to boost?

Perhaps we could treat the text like this, to first capture certain tags for boosting:

http://api.drupal.org/api/function/search_index/6
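
A minimal sketch of the "several fields with different boost values" idea, assuming a hypothetical helper `example_build_qf()` and illustrative field names and boosts (none of these come from a patch in this issue). It only shows how a dismax `qf` value could be assembled using Solr's `field^boost` syntax:

```php
<?php
// Hypothetical helper: build a dismax "qf" value from field => boost pairs.
// The field names and boost values used below are illustrative assumptions.
function example_build_qf(array $boosts) {
  $parts = array();
  foreach ($boosts as $field => $boost) {
    // Solr's query-time boost syntax is field^boost.
    $parts[] = $field . '^' . number_format($boost, 1, '.', '');
  }
  return implode(' ', $parts);
}

$params = array(
  'defType' => 'dismax',
  'q' => 'my keywords',
  'qf' => example_build_qf(array('title' => 5, 'body' => 1, 'taxonomy_names' => 2)),
);
echo $params['qf'], "\n"; // title^5.0 body^1.0 taxonomy_names^2.0
```

Keeping the boosts in an array rather than a hard-coded string would make it straightforward to expose them as a setting later.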

Comment #9

pwolanin commented 30 November 2008 at 03:11

Jacob found this interesting article:

http://e-mats.org/2008/04/solr-using-the-dismax-query-handler-and-still-...

One take-home: "dismax does not support fielded searches through the regular query", which jibes with what I found playing with it today.

Comment #10

pwolanin commented 1 December 2008 at 03:52

So, the motivation for using dismax is to be able to boost certain terms within the text. For that we could alter the node to doc code something like this (borrowing from http://api.drupal.org/api/function/search_index/6):

    // Build the node body.
    $node->build_mode = NODE_BUILD_SEARCH_INDEX;
    $node = node_build_content($node, FALSE, FALSE);
    $node->body = drupal_render($node->content);
  
    $text = $node->title . ' ' . $node->body;
  
    // Fetch extra data normally not visible
    $extra = node_invoke_nodeapi($node, 'update index');
    $text .= implode(' ', $extra);
    // Multipliers for scores of words inside certain HTML tags.
    // Note: 'a' must be included for link ranking to work.
    $tag_weights = array(
      'h1' => 25,
      'h2' => 18,
      'h3' => 15,
      'h4' => 12,
      'h5' => 9,
      'h6' => 6,
      'u' => 3,
      'b' => 3,
      'i' => 3,
      'strong' => 3,
      'em' => 3,
      'a' => 10
    );

    // Strip off all ignored tags, but insert space before/after them to keep word boundaries.
    $text = str_replace(array('<', '>'), array(' <', '> '), $text);
    $text = strip_tags($text, '<'. implode('><', array_keys($tag_weights)) .'>');

    preg_match_all('@<('. implode('|', array_keys($tag_weights)) .')[^>]*>(.*)</\1>@Ui', $text, $matches);
    foreach ($matches[1] as $index => $tag) {
      $doc->{'boostfield_' . $tag_weights[$tag]} .= ' '. $matches[2][$index];
    }

And new schema fields like:

   <field name="boostfield_25" type="text" indexed="true" stored="false" omitNorms="false"/>

and we'd need to configure default dismax qf param to include the text and title fields as well as all of these boost fields (or however we want to name them).

The boost would actually be performed at index time (or at query time as suggested in the patch above, or we can put default boosts in the solrconfig.xml I think). To do it at index time needs tweaking of the base PHP class, I think, to get output like:

    <field name="office" boost="2.0">Bridgewater</field>

Shorter fields always get a boost relative to longer fields, so we may want to actually use lower/different boost numbers than above.
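
As a rough, self-contained illustration of the extraction step above (outside Drupal, so without `$node` or `$doc`), the sketch below collects the text inside weighted tags into per-weight buckets. The function name `example_extract_boost_fields()` is hypothetical, and the added `\b` and `s` modifier are assumptions meant to stop `<b` from matching longer tag names such as `<body>` and to let matches span newlines:

```php
<?php
// Sketch: collect the text inside weighted HTML tags into per-weight buckets.
// Returns an array keyed by hypothetical field names such as "boostfield_25".
function example_extract_boost_fields($html, array $tag_weights) {
  $fields = array();
  $tags = implode('|', array_keys($tag_weights));
  // U = ungreedy, i = case-insensitive, s = let "." match newlines.
  // \b keeps "<b" from matching the start of longer tag names.
  preg_match_all('@<(' . $tags . ')\b[^>]*>(.*)</\1>@Uis', $html, $matches);
  foreach ($matches[1] as $index => $tag) {
    $name = 'boostfield_' . $tag_weights[strtolower($tag)];
    if (!isset($fields[$name])) {
      $fields[$name] = '';
    }
    // Drop any nested markup and keep a space as the word boundary.
    $fields[$name] .= ' ' . trim(strip_tags($matches[2][$index]));
  }
  return $fields;
}

$html = '<h1>Solr</h1><p>Intro with <strong>dismax</strong> and <em>boosts</em></p>';
$fields = example_extract_boost_fields($html, array('h1' => 25, 'strong' => 3, 'em' => 3));
print_r($fields);
```

Tags nested inside an already matched tag are not matched a second time in this sketch, so their text only lands in the outer tag's bucket.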

Comment #11

robertdouglass commented 1 December 2008 at 07:18

Configurable boost numbers, plz. Somewhere floating around is a patch that does this:

    $tag_weights = variable_get('search_html_weights', array(
      'h1' => 25,
      'h2' => 18,
      'h3' => 15,
      'h4' => 12,
      'h5' => 9,
      'h6' => 6,
      'u' => 3,
      'b' => 3,
      'i' => 3,
      'strong' => 3,
      'em' => 3,
      'a' => 10
    ));

It might already be in D7 - I forget.

Comment #12

voidberg commented 1 December 2008 at 11:41

pwolanin, why not use the fq parameter? It also caches individual facet:value pairs, which will speed things up.
Also, the motivation imho is not just to boost the title. It is to allow people to tweak the search behavior to better suit their needs without hacking code or the schema.

Comment #13

pwolanin commented 1 December 2008 at 13:04

@Robert - yes, I already had that as a plan, to make it a variable, though I'm not sure that it needs any UI in the base module. It's already in D7 indeed: http://api.drupal.org/api/function/search_index/7

@pixelmonk - we will indeed have to use fq for any facets. If you look at the code above, it is really oriented to providing a good out-of-the box result. I'd certainly want to leave a mechanism for specifying an explicit qf with different boosts, though again it's not clear to me that this is really going to be commonly used if we get a reasonable default working.

Comment #14

pwolanin commented 1 December 2008 at 13:29

@Robert - core seems to have code that tries also to boost a node if it has inbound links - have you ever had thoughts about the value of that metric?

Comment #15

pwolanin commented 1 December 2008 at 15:40

Discussing w/ Jacob, we need to refactor for dismax - we should take all facets out of the Drupal path and make them a separate GET parameter something like:

/search/apachesolr_search/my+keywords?facets=tid:2

This would make it much easier to transform them into the filter query (fq) arguments for the real Solr query.
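
A minimal sketch of that transformation, assuming a hypothetical comma-separated format for the `facets` value (e.g. `tid:2,type:story`) and values without spaces; neither the separator nor the function name `example_facets_to_fq()` is taken from a patch in this issue:

```php
<?php
// Hypothetical format: facets=tid:2,type:story (comma-separated field:value pairs).
// Returns one filter query (fq) string per valid pair.
function example_facets_to_fq($facets) {
  $fq = array();
  foreach (array_filter(explode(',', $facets), 'strlen') as $pair) {
    // Accept only simple field:value pairs and silently skip anything else.
    if (preg_match('/^([a-z_][a-z0-9_]*):(.+)$/i', trim($pair), $m)) {
      $fq[] = $m[1] . ':' . $m[2];
    }
  }
  return $fq;
}

$params = array(
  'q' => 'my keywords',
  'fq' => example_facets_to_fq('tid:2,type:story'),
);
print_r($params['fq']);
```

Keeping the keywords in `q` and the facets in `fq` means the dismax parser never sees `field:value` syntax in the keyword string.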

We also need to think about doing default date biasing: "How can I boost the score of newer documents"
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994...

Comment #16

robertdouglass commented 2 December 2008 at 17:25

@pwolanin: I don't find the "page rank" algorithm in core very important, to be frank. Others may disagree.

Comment #17

jarchowk commented 3 December 2008 at 10:54

I know this is by no means publish-worthy code (I'm new to PHP/Drupal), but adding this to the search() in Service.php got my facets to work:


		$params['qf'][] = 'title^2';
		$params['qf'][] = 'text';
		$params['defType'] = 'dismax';

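		// Split the query on spaces: tokens containing ':' are treated as
		// filters (fq); everything else stays in the keyword query (q).
		$qtemp = '';
		$queryexplode = explode(' ', $query);
		foreach ($queryexplode as $field)
		{
			if (!strchr($field,':'))
			{
				$qtemp = $qtemp . $field.' ';
			}
			else
			{
				$params['fq'][] = $field;
			}
		}
		$params['q'] = trim($qtemp);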

Comment #18

pwolanin commented 3 December 2008 at 16:49

@jarchowk - that's more-or-less what we need to do in the module, though I want to separate out fields into a separate GET parameter

Comment #19

pwolanin commented 5 December 2008 at 22:36

Looking at our schema, I'm not sure the current use of termVectors makes sense. The Solr wiki has this:

http://wiki.apache.org/solr/SchemaXml#head-38400cd474e4b7139160ed9c0f921...

Which suggests we should be setting this on fields we wish to commonly use for highlighting (i.e. the body).

By switching to dismax, I think we can drop the text field altogether if we append the comments to the body (or add them as a separate field). Right now we only have it so we can cover the title, body, comments, and taxonomy names in a single keyword search.

Comment #20

pwolanin commented 6 December 2008 at 04:14

Status	File	Size
new	dismax-323015-20.patch	27.82 KB

here's a start on a patch for all the back-end pieces. Note, this is still mostly broken.

we get rid of the text field, add some boost fields, and make a dismax handler the default in solrconfig.xml. A start on re-writing all the fields to be fq rather than q

Comment #21

robertdouglass commented 6 December 2008 at 09:51

The term vectors are there exclusively for the support of more-like-this content recommendation, which I see as one of the best features.

Comment #22

pwolanin commented 8 December 2008 at 04:23

Status	File	Size
new	dismax-323015-22.patch	35.07 KB

closer to something that works - note change schema.xml AND solrconfig.xml

no UI config yet.

Comment #23

robertdouglass commented 8 December 2008 at 11:59

removing text and adding a term vector to body might have the nice added side-effect of making the content recommendation more flexible. Yay.

Comment #24

pwolanin commented 8 December 2008 at 20:13

Assigned:	voidberg	» pwolanin
Status:	Needs work	» Needs review

Status	File	Size
new	dismax-323015-24.patch	33 KB

ok, here's a pretty much working patch, with a minimal admin interface even.

Comment #25

pwolanin commented 8 December 2008 at 21:13

Status	File	Size
new	dismax-323015-25.patch	47.95 KB

whoops - missed diffing the query class. how about this...

Comment #26

pwolanin commented 9 December 2008 at 00:58

Robert tested this out and it worked for him. Committing to 6.x with the knowledge that there are surely rough edges or bugs still.

Note - you need to load solr with the new .xml files and reindex to get this to work right.

Please follow-up with other concerns.

Comment #27

pwolanin commented 9 December 2008 at 01:57

we should think about default date biasing. Possible options:

if we switch the "changed" field to a date format, we could use something like this example query:

<str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>

or something like this example:

bf=recip(rord(creationDate),1,1000,1000)^2

which will boost the 1st thousand docs...
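
To see what the second example does numerically: Solr's `recip(x,m,a,b)` function evaluates `a / (m*x + b)`, and `rord()` is the reverse ordinal of the field value, which should be 1 for the newest date. The sketch below just reproduces that arithmetic in PHP, with `example_recip()` as a hypothetical name and the `^2` boost treated as a plain multiplier:

```php
<?php
// Solr's recip(x, m, a, b) function query evaluates a / (m * x + b).
function example_recip($x, $m, $a, $b) {
  return $a / ($m * $x + $b);
}

// Assumed reverse ordinals: 1 is the newest document, larger values are older.
foreach (array(1, 1000, 10000) as $rord) {
  // The ^2 in bf=recip(rord(creationDate),1,1000,1000)^2 scales the result by 2.
  printf("rord=%d boost=%.3f\n", $rord, 2 * example_recip($rord, 1, 1000, 1000));
}
```

With these constants the contribution decays smoothly rather than cutting off: the document at reverse ordinal 1000 still receives half the boost of the newest one.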

Comment #28

pwolanin commented 9 December 2008 at 03:17

Status	File	Size
new	date-curves.png	60.62 KB
new	dismax-date-323015-28.patch	3.91 KB

like this? (including Luke(2) change for DamZ)

The curves show the effect of changing the steepness - I put in 4 as the default in the patch. No UI component.

Comment #29

pwolanin commented 9 December 2008 at 03:42

Status	File	Size
new	dismax-date-323015-29.patch	3.94 KB

committed this to get us started...

Comment #30

john.money commented 9 December 2008 at 07:11

tag +1

Comment #31

pwolanin commented 9 December 2008 at 14:28

Status	File	Size
new	catch-exception-323015-31.patch	1.45 KB

found an uncaught exception.

Also, better this way since we'd rather keep side-effects out of the form builder.

Comment #32

pwolanin commented 9 December 2008 at 19:29

Status	File	Size
new	form-catch-exception-323015-32.patch	5.19 KB

added another form too.

Comment #33

pwolanin commented 10 December 2008 at 14:53

Status	File	Size
new	form-catch-exception-323015-33.patch	6.26 KB

let's set the body to 1.0 by default too.

Comment #34

pwolanin commented 11 December 2008 at 03:30

We may need to work on the extraction of <a> tag contents. Turns out the top terms are things like:

http	1091
org	832
drupal	807
com	606
www	506
node	478
httpdrupalorgnod	303

It's not very useful if the URL filter is enabled and all links become the contents of <a> tags.

Comment #35

pwolanin commented 11 December 2008 at 04:04

Status	File	Size
new	a-tag-323015-35.patch	1006 bytes

rats - even with this regex, most of the stuff indexed seems to be junk. Maybe we should omit <a> tags normally?

Comment #36

pwolanin commented 11 December 2008 at 15:08

Status	File	Size
new	a-tag-323015-36.patch	2.37 KB

here's a patch which also omits <a> tags from the default weighting.

Comment #37

pwolanin commented 11 December 2008 at 16:18

committed that patch.

Comment #38

pwolanin commented 11 December 2008 at 21:22

Status	File	Size
new	form-sort-323015-38.patch	2.1 KB

bug in the form builder and field sorting.

Comment #39

pwolanin commented 11 December 2008 at 21:37

committed to 6.x

Comment #40

pwolanin commented 14 December 2008 at 23:59

Testing this out today, I think we have much too high boosts on some of the fields (like title, H1) and also we should probably omit norms on most of those fields. Having norms means that a keyword match for 1 word in a 2-word title gives a substantially higher boost than a keyword match in a 3- or 4-word title. Omitting norms means that finding a keyword in the title gives the same boost regardless of the length of the title.

I'd iterate next to something like:

<str name="qf">
body^1.0 title^5.0 name^5.0 taxonomy_names^2.0 tags_h1^5.0 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0
</str>
<str name="pf">body^1.0 title^1.0</str>

Along with omitting norms.
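
For a sense of scale on the norms point, the classic Lucene length normalization is `1 / sqrt(number of terms in the field)`. The sketch below only computes that factor, assuming no index-time boosts and ignoring the precision lost when Lucene stores norms compactly; `example_length_norm()` is a hypothetical name:

```php
<?php
// Classic Lucene length normalization: 1 / sqrt(number of terms in the field).
// This ignores index-time boosts and the lossy encoding of stored norms.
function example_length_norm($num_terms) {
  return 1.0 / sqrt($num_terms);
}

foreach (array(2, 3, 4) as $n) {
  printf("%d-word title: norm=%.3f\n", $n, example_length_norm($n));
}
```

Under these assumptions a match in a 2-word title is weighted roughly 1.4 times as heavily as the same match in a 4-word title, which is the effect that setting omitNorms="true" removes.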

Comment #41

pwolanin commented 15 December 2008 at 18:14

Status	File	Size
new	rework-weighting-323015-41.patch	9.85 KB

needs more testing, but this seems to give something closer to the core results.

Comment #42

pwolanin commented 15 December 2008 at 18:56

Status	File	Size
new	rework-weighting-323015-42.patch	11.5 KB

while we are at it, we ought to be able to remove this form_alter since we are only ever taking keywords via the form.

Comment #43

pwolanin commented 16 December 2008 at 00:34

Status	File	Size
new	rework-weighting-323015-43.patch	16.49 KB

And, I just discovered we can use copyField with wildcard fields - so we can avoid sending an extra copy of this data on each post.

Also, fixes a bug where the term names were not separated by a space, adds an extra '_' after 'vid' in field names for name-space reasons (e.g. to distinguish _vid from _video).

adds field name="type_name", per my conversation w/ Robert about prep for multi-site search, and also changes the smfield_vid_* fields to use the taxonomy vocabulary name rather than the vid - again as prep for multi-site, which was a TODO in code. Note - need to check that the transform there is multi-lingual safe.

Comment #44

pwolanin commented 16 December 2008 at 02:38

Status	File	Size
new	rework-weighting-323015-44.patch	17.17 KB

Also, I realized we should probably replace the stripped chars with space, not an empty string. for example - what if there is a set of tab-delimited words.
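
The tab-delimited case can be shown in a few lines. The character class here (ASCII control characters) is only an assumption for illustration, since the exact set of stripped characters is defined in the patch rather than in this comment:

```php
<?php
$text = "alpha\tbeta\tgamma";

// Assumed pattern for illustration: one or more ASCII control characters.
$pattern = '/[\x00-\x1f]+/';

// Replacing with an empty string glues the words together.
$joined = preg_replace($pattern, '', $text);   // "alphabetagamma"

// Replacing with a space keeps the word boundaries intact.
$spaced = preg_replace($pattern, ' ', $text);  // "alpha beta gamma"

echo $joined, "\n", $spaced, "\n";
```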

I'd really like to send this in soon - any comments?

Comment #45

pwolanin commented 16 December 2008 at 03:16

Status	File	Size
new	rework-weighting-323015-45.patch	17.25 KB

ah, just found the PHP preg code for a unicode letter: \p{L}

Comment #46

robertdouglass commented 17 December 2008 at 04:29

The code looks good to me and I'm very happy with all of the proposed changes.

Comment #47

janusman commented 17 December 2008 at 15:57

Status:	Needs review	» Needs work

Using \p{L} introduces a dependency on PCRE libraries being compiled _a certain way_ in PHP that is difficult for end users to detect and/or fix (having to recompile PHP in most cases!)

This is a known problem, and it has been worked around ("hacked around"?) in Drupal core itself (search module). The workaround is to explicitly list a whitelist/blacklist of Unicode characters to either include or filter out.

Meaning, to keep our module's dependencies to a minimum, we have to go this way (or some other).

References:
* Search Core module (search.module) PREG_CLASS_SEARCH_EXCLUDE
* http://drupal.org/node/315486#comment-1117887
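
One way to make the PCRE dependency visible at runtime is to probe for it and fall back to a byte-range class when it is missing. This is only a sketch under stated assumptions: both function names are hypothetical, and the fallback class is a simplification, not core's PREG_CLASS_SEARCH_EXCLUDE list:

```php
<?php
// Returns TRUE when PCRE supports Unicode properties, so \p{L} compiles.
function example_pcre_has_unicode_properties() {
  // preg_match() returns FALSE when the pattern fails to compile;
  // @ suppresses the warning PHP raises in that case.
  return @preg_match('/\p{L}/u', 'a') === 1;
}

// Replace runs of non-letter, non-digit characters with a single space.
function example_strip_non_letters($text) {
  if (example_pcre_has_unicode_properties()) {
    return preg_replace('/[^\p{L}\p{N}_]+/u', ' ', $text);
  }
  // Fallback: keep ASCII letters, digits, underscore, and every high byte,
  // which leaves multi-byte UTF-8 sequences untouched.
  return preg_replace('/[^a-zA-Z0-9_\x7f-\xff]+/', ' ', $text);
}

echo example_strip_non_letters('foo-bar'), "\n"; // foo bar
```

The fallback is deliberately permissive: it cannot tell a letter from punctuation above ASCII, which is the trade-off the explicit character lists in core are meant to address.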

Comment #48

pwolanin commented 17 December 2008 at 16:32

The goal here was to generate valid Solr field names from arbitrary vocabulary names that might include international characters.

However, what we actually need to ensure as well is that these are valid PHP object property names. Thus, the correct filter is probably this one from the PHP site:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

Comment #49

pwolanin commented 17 December 2008 at 22:12

Status:	Needs work	» Needs review

Status	File	Size
new	vocab-names-323015-49.patch	1.2 KB

Since we already start with something like 'smfield' as the prefix, we can just use the latter part of the above regex.
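
A minimal sketch of that idea, assuming a hypothetical function `example_vocab_field_name()` and an assumed `smfield_vid_` prefix. Whether the committed patch replaces disallowed characters with an underscore or simply removes them is not stated here, so the replacement character is also an assumption:

```php
<?php
// Build a field name from a vocabulary name using the "rest of the label"
// character class from the PHP manual: [a-zA-Z0-9_\x7f-\xff].
// The prefix already supplies a valid first character.
function example_vocab_field_name($vocab_name, $prefix = 'smfield_vid_') {
  // Replace each run of disallowed characters with a single underscore.
  $suffix = preg_replace('/[^a-zA-Z0-9_\x7f-\xff]+/', '_', $vocab_name);
  return $prefix . $suffix;
}

echo example_vocab_field_name('Free tags & topics'), "\n"; // smfield_vid_Free_tags_topics
```

Because bytes in the \x7f-\xff range are kept, UTF-8 encoded vocabulary names pass through unchanged and no Unicode-aware PCRE build is needed.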

Comment #50

JacobSingh commented 18 December 2008 at 06:05

Status:	Needs review	» Reviewed & tested by the community

The patch looks good. I haven't tested it, but I get the concept, and that regex looks like it will work

Comment #51

pwolanin commented 18 December 2008 at 15:28

Version:	6.x-1.x-dev	» 5.x-1.x-dev
Status:	Reviewed & tested by the community	» Patch (to be ported)

committed to 6.x

Comment #52

pwolanin commented 18 December 2008 at 19:06

Version:	5.x-1.x-dev	» 6.x-1.x-dev
Status:	Patch (to be ported)	» Needs review

Status	File	Size
new	weighting-extra-323015-52.patch	3.81 KB

the weighting still seems to be way off.

This patch greatly boosts body, changed, and comment count weighting.

Also fixes a bug with the extra field for comment counts.

Comment #53

pwolanin commented 18 December 2008 at 22:40

Status	File	Size
new	weighting-extra-323015-53.patch	6.77 KB

A little more cleanup as well. Looking now, it seems like having the q.alt param defined could be useful.

Comment #54

pwolanin commented 19 December 2008 at 01:21

Status	File	Size
new	weighting-extra-323015-54.patch	6.75 KB

committing attached to 6.x.

Comment #55

pwolanin commented 27 January 2009 at 02:44

Status:	Needs review	» Fixed

Comment #56

pwolanin commented 27 January 2009 at 02:44

Status:	Fixed	» Closed (fixed)
