Hi,

I talked at Szeged with Robert about adding support for dismax and query fields (to allow people to boost title or any other fields they want). Here's a preliminary patch to address that.

It adds an admin page where you can configure the weight of each individual field, plus an option to choose which request handler is used.

This is against the current development version for Drupal 5. I'm willing to work on this (Drupal 6, making it more usable, etc.) if you guys think this makes sense.

Andu

Comments

robertdouglass’s picture

Thanks Andu! Looking forward to reviewing.

janusman’s picture

Status: Needs review » Reviewed & tested by the community

It works! I guess it needs some improvements (some friendlier way to specify weights, perhaps instead of names like "imfield_vid1" the actual names of the vocabularies)...

However, I don't understand how fields like "changed" or "comment count" would affect search order, if they are not being searched? Perhaps they might affect ranking not by matching but by affecting sort? (Although if they became facets...)

robertdouglass’s picture

Status: Reviewed & tested by the community » Needs work

You can replace your Luke call with Solr_Base_Query::get_fields_in_index();

voidberg’s picture

Yes, it's pretty rough. I'll do some more work on it these days.

flexer’s picture

Hello,

is this patch supposed to be merged into the stable version?

Isn't the dismax handler the default one with SOLR 1.3?

jarchowk’s picture

Is it just me or does adding dismax params break the faceted filtering?

janusman’s picture

@jarchowk: make sure you update your schema.xml, restart Solr, and rebuild the Solr index!

pwolanin’s picture

The Solr docs say: "NOTE: As of Solr 1.3, the DisMaxRequestHandler is simply the standard request handler with the default query parser set to the DisMax Query Parser (defType=dismax)."

It seems we also need to work on a customized solrconfig.xml? E.g. we could have dismax be the default handler:
"If no qt is defined, the requestHandler that declares default="true""

Drupal core search boosts, for example, the title and terms with h tags: http://api.drupal.org/api/function/search_index/6

Looking at the Solr docs, it seems that you can only boost an entire field at index time, not a term within a field (please correct me if I'm wrong), or boost an entire field at query time. Thus, to get something more similar to core, it would seem we need to search across several fields with different boost values by default. Alternatively, for the default keyword search, perhaps we can just append to the text field multiple copies of each term we want to boost?

Perhaps we could treat the text like this, to first capture certain tags for boosting:

http://api.drupal.org/api/function/search_index/6

pwolanin’s picture

Jacob found this interesting article:

http://e-mats.org/2008/04/solr-using-the-dismax-query-handler-and-still-...

One takeaway: "dismax does not support fielded searches through the regular query", which jibes with what I found playing with it today.

pwolanin’s picture

So, the motivation for using dismax is to be able to boost certain terms within the text. For that we could alter the node to doc code something like this (borrowing from http://api.drupal.org/api/function/search_index/6):

    // Build the node body.
    $node->build_mode = NODE_BUILD_SEARCH_INDEX;
    $node = node_build_content($node, FALSE, FALSE);
    $node->body = drupal_render($node->content);
  
    $text = $node->title . ' ' . $node->body;
  
    // Fetch extra data normally not visible
    $extra = node_invoke_nodeapi($node, 'update index');
    // Leading space keeps the last body word from fusing with the first extra.
    $text .= ' ' . implode(' ', $extra);
    // Multipliers for scores of words inside certain HTML tags.
    // Note: 'a' must be included for link ranking to work.
    $tag_weights = array(
      'h1' => 25,
      'h2' => 18,
      'h3' => 15,
      'h4' => 12,
      'h5' => 9,
      'h6' => 6,
      'u' => 3,
      'b' => 3,
      'i' => 3,
      'strong' => 3,
      'em' => 3,
      'a' => 10
    );

    // Strip off all ignored tags, but insert space before/after them to keep word boundaries.
    $text = str_replace(array('<', '>'), array(' <', '> '), $text);
    $text = strip_tags($text, '<'. implode('><', array_keys($tag_weights)) .'>');

    preg_match_all('@<('. implode('|', array_keys($tag_weights)) .')[^>]*>(.*)</\1>@Ui', $text, $matches);
    foreach ($matches[1] as $index => $tag) {
      $doc->{'boostfield_' . $tag_weights[$tag]} .= ' '. $matches[2][$index];
    }

And new schema fields like:

    <field name="boostfield_25" type="text" indexed="true" stored="false" omitNorms="false"/>

and we'd need to configure default dismax qf param to include the text and title fields as well as all of these boost fields (or however we want to name them).
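
As a rough illustration of what such a default qf value could look like, here is a minimal sketch that assembles it from a map of field names to boosts. The helper name example_build_qf() is hypothetical, and reusing the tag weights as the boost values is only an assumption for the example, not a recommendation.

```php
<?php
// Hypothetical helper: turn a map of field name => boost into a dismax qf string.
function example_build_qf(array $boosts) {
  $parts = array();
  foreach ($boosts as $field => $boost) {
    // Solr's "field^boost" syntax, with the boost written to one decimal place.
    $parts[] = $field . '^' . number_format($boost, 1, '.', '');
  }
  return implode(' ', $parts);
}

// Illustrative values only (assumption): title, text and two boost fields.
$boosts = array(
  'title' => 5.0,
  'text' => 1.0,
  'boostfield_25' => 25.0,
  'boostfield_18' => 18.0,
);
$qf = example_build_qf($boosts);
echo $qf, "\n";
```

The resulting string could be sent as the qf request parameter or pasted into the dismax handler defaults in solrconfig.xml.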

The boost would actually be performed at index time (or at query time as suggested in the patch above, or we can put default boosts in the solrconfig.xml I think). To do it at index time needs tweaking of the base PHP class, I think, to get output like:

    <field name="office" boost="2.0">Bridgewater</field>

see also: http://lucene.apache.org/java/2_1_0/scoring.html

Shorter fields always get a boost relative to longer fields, so we may want to actually use lower/different boost numbers than above.

robertdouglass’s picture

Configurable boost numbers, plz. Somewhere floating around is a patch that does this:

    $tag_weights = variable_get('search_html_weights', array(
      'h1' => 25,
      'h2' => 18,
      'h3' => 15,
      'h4' => 12,
      'h5' => 9,
      'h6' => 6,
      'u' => 3,
      'b' => 3,
      'i' => 3,
      'strong' => 3,
      'em' => 3,
      'a' => 10
    ));

It might already be in D7 - I forget.

voidberg’s picture

pwolanin, why not use the fq parameter? It also does caching on individual pairs of facet:value which will speed things up.
Also, the motivation imho is not just to boost title. It is to allow people to tweak the searching behavior to better suit their need without hacking code or the schema.

pwolanin’s picture

@Robert - yes, I already had that as a plan, to make it a variable, though I'm not sure that it needs any UI in the base module. It's already in D7 indeed: http://api.drupal.org/api/function/search_index/7

@pixelmonk - we will indeed have to use fq for any facets. If you look at the code above, it is really oriented to providing a good out-of-the box result. I'd certainly want to leave a mechanism for specifying an explicit qf with different boosts, though again it's not clear to me that this is really going to be commonly used if we get a reasonable default working.

pwolanin’s picture

@Robert - core seems to have code that tries also to boost a node if it has inbound links - have you ever had thoughts about the value of that metric?

pwolanin’s picture

Discussing w/ Jacob, we need to refactor for dismax - we should take all facets out of the Drupal path and make them a separate GET parameter something like:

/search/apachesolr_search/my+keywords?facets=tid:2

This would make it much easier to transform them into the filter query (fq) arguments for the real Solr query.
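
A minimal sketch of that transformation, assuming several facets would be comma-separated inside the one GET value (the separator and the function name example_facets_to_fq() are assumptions, not something settled in this thread):

```php
<?php
// Hypothetical helper: split a "facets" GET value such as "tid:2,type:story"
// into a list of Solr fq (filter query) strings.
function example_facets_to_fq($facets) {
  $fq = array();
  foreach (explode(',', $facets) as $pair) {
    $pair = trim($pair);
    // Keep only well-formed field:value pairs; anything else is dropped.
    if (preg_match('/^[a-zA-Z_][a-zA-Z0-9_]*:.+$/', $pair)) {
      $fq[] = $pair;
    }
  }
  return $fq;
}

// Keywords stay in q; each facet becomes its own fq entry.
$params = array(
  'q' => 'my keywords',
  'fq' => example_facets_to_fq('tid:2,type:story'),
);
print_r($params);
```

Keeping each facet as a separate fq entry also lets Solr cache each filter on its own.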

We also need to think about doing default date biasing: "How can I boost the score of newer documents"
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-b1b1cdedcb9cd9bfd9c994...

robertdouglass’s picture

@pwolanin: I don't find the "page rank" algorithm in core very important, to be frank. Others may disagree.

jarchowk’s picture

I know this is by no means publish-worthy code (I'm new to PHP/Drupal), but adding this to search() in Service.php got my facets to work:


		$params['qf'][] = 'title^2';
		$params['qf'][] = 'text';
		$params['defType'] = 'dismax';

		// Split the raw query: plain words go to q, field:value pairs go to fq.
		$qtemp = '';
		foreach (explode(' ', $query) as $field)
		{
			if (strpos($field, ':') === FALSE)
			{
				$qtemp .= $field . ' ';
			}
			else
			{
				$params['fq'][] = $field;
			}
		}
		$params['q'] = trim($qtemp);

pwolanin’s picture

@jarchowk - that's more-or-less what we need to do in the module, though I want to separate out fields into a separate GET parameter

pwolanin’s picture

Looking at our schema, I'm not sure the current use of termVectors makes sense. The Solr wiki has this:

http://wiki.apache.org/solr/SchemaXml#head-38400cd474e4b7139160ed9c0f921...

Which suggests we should be setting this on fields we wish to commonly use for highlighting (i.e. the body).

By switching to dismax, I think we can drop the text field altogether if we append the comments to the body (or add them as a separate field). Right now we just have it so we can have the title, body, comments, and taxonomy names in a single keyword search.

pwolanin’s picture

Attachment: new, 27.82 KB

here's a start on a patch for all the back-end pieces. Note, this is still mostly broken.

we get rid of the text field, add some boost fields, and make a dismax handler the default in solrconfig.xml. A start on re-writing all the fields to be fq rather than q

robertdouglass’s picture

The term vectors are there exclusively for the support of more-like-this content recommendation, which I see as one of the best features.

pwolanin’s picture

Attachment: new, 35.07 KB

closer to something that works - note change schema.xml AND solrconfig.xml

no UI config yet.

robertdouglass’s picture

removing text and adding a term vector to body might have the nice added side-effect of making the content recommendation more flexible. Yay.

pwolanin’s picture

Assigned: voidberg » pwolanin
Status: Needs work » Needs review
Attachment: new, 33 KB

ok, here's a pretty much working patch, with a minimal admin interface even.

pwolanin’s picture

Attachment: new, 47.95 KB

whoops - missed diffing the query class. how about this...

pwolanin’s picture

Robert tested this out and it worked for him. Committing to 6.x with the knowledge that there are surely rough edges or bugs still.

Note - you need to load solr with the new .xml files and reindex to get this to work right.

Please follow-up with other concerns.

pwolanin’s picture

we should think about default date biasing. Possible options:

if we switch the "changed" field to a date format, we could use something like this example query:

<str name="bq">incubationdate_dt:[* TO NOW/DAY-1MONTH]^2.2</str>

or something like this example:

bf=recip(rord(creationDate),1,1000,1000)^2

which will boost the 1st thousand docs...
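
To get a feel for that second example, here is a small sketch that evaluates the curve. It assumes Solr's documented definition recip(x, m, a, b) = a / (m * x + b) and that rord() yields 1 for the newest value; the function name example_recip() is made up for the illustration.

```php
<?php
// Solr's recip(x, m, a, b) function query evaluates to a / (m * x + b).
function example_recip($x, $m, $a, $b) {
  return $a / ($m * $x + $b);
}

// rord() is the reverse ordinal: 1 for the newest value, larger for older ones.
foreach (array(1, 1000, 10000) as $rord) {
  printf("rord=%d -> %.3f\n", $rord, example_recip($rord, 1, 1000, 1000));
}
```

With these constants the factor (before the ^2 multiplier) falls off smoothly and reaches one half at rord 1000, so it works as a gradual bias toward newer content and not as a hard cutoff.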

pwolanin’s picture

Attachments: new, 60.62 KB; new, 3.91 KB

like this? (including Luke(2) change for DamZ)

The curves show the effect of changing the steepness - I put in 4 as the default in the patch. No UI component.

pwolanin’s picture

Attachment: new, 3.94 KB

committed this to get us started...

john.money’s picture

tag +1

pwolanin’s picture

Attachment: new, 1.45 KB

found an uncaught exception.

Also, better this way since we'd rather keep side-effects out of the form builder.

pwolanin’s picture

Attachment: new, 5.19 KB

added another form too.

pwolanin’s picture

Attachment: new, 6.26 KB

let's set the body to 1.0 by default too.

pwolanin’s picture

We may need to work on the extraction of <a> tag contents. Turns out the top terms are things like:

http	1091
org	832
drupal	807
com	606
www	506
node	478
httpdrupalorgnod	303

It's not very useful if the URL filter is enabled and all links become the contents of <a> tags.

pwolanin’s picture

Attachment: new, 1006 bytes

rats - even with this regex, most of the stuff indexed seems to be junk. Maybe we should omit <a> tags normally?

pwolanin’s picture

Attachment: new, 2.37 KB

here's a patch which also omits <a> tags from the default weighting.

pwolanin’s picture

committed that patch.

pwolanin’s picture

Attachment: new, 2.1 KB

bug in the form builder and field sorting.

pwolanin’s picture

committed to 6.x

pwolanin’s picture

Testing this out today, I think our boosts are much too high on some of the fields (like title, H1), and we should probably also omit norms on most of those fields. Having norms means that a keyword match for 1 word in a 2-word title gives a substantially higher boost than a keyword match in a 3- or 4-word title. Omitting norms means that finding a keyword in the title gives the same boost regardless of the length of the title.

I'd iterate next to something like:

<str name="qf">
body^1.0 title^5.0 name^5.0 taxonomy_names^2.0 tags_h1^5.0 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0
</str>
<str name="pf">body^1.0 title^1.0</str>

Along with omitting norms.
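
For a sense of scale on the norms point, here is a small sketch assuming the classic Lucene default length normalisation of 1 / sqrt(number of terms in the field). The function name is made up, and the real stored values are coarser because Lucene packs each norm into a single byte.

```php
<?php
// Classic Lucene length normalisation (assumed default similarity):
// 1 / sqrt(number of terms in the field).
function example_length_norm($num_terms) {
  return 1.0 / sqrt($num_terms);
}

// Compare a 2-word title with a 4-word title.
foreach (array(2, 4) as $n) {
  printf("%d-word title: norm %.3f\n", $n, example_length_norm($n));
}
```

Under that assumption a match in a 2-word title is weighted about 1.4 times as heavily as the same match in a 4-word title, on top of whatever qf boost the field carries.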

pwolanin’s picture

Attachment: new, 9.85 KB

needs more testing, but this seems to give something closer to the core results.

pwolanin’s picture

Attachment: new, 11.5 KB

while we are at it, we ought to be able to remove this form_alter since we are only ever taking keywords via the form.

pwolanin’s picture

Attachment: new, 16.49 KB

And, I just discovered we can use copyfield with wildcard fields - so we can omit an extra copy of this data being sent on each post.

Also, fixes a bug where the term names were not separated by a space, adds an extra '_' after 'vid' in field names for name-space reasons (e.g. to distinguish _vid from _video).

adds field name="type_name", per my conversation w/ Robert about prep for multi-site search, and also changes the smfield_vid_* fields to use the taxonomy vocabulary name rather than the vid - again as prep for multi-site, which was a TODO in code. Note - need to check that the transform there is multi-lingual safe.

pwolanin’s picture

Attachment: new, 17.17 KB

Also, I realized we should probably replace the stripped chars with a space, not an empty string. For example, what if there is a set of tab-delimited words?

I'd really like to send this in soon - any comments?

pwolanin’s picture

Attachment: new, 17.25 KB

ah, just found the PHP preg code for a unicode letter: \p{L}

robertdouglass’s picture

The code looks good to me and I'm very happy with all of the proposed changes.

janusman’s picture

Status: Needs review » Needs work

Using \p{L} introduces a dependency on PCRE libraries being compiled _a certain way_ in PHP that is difficult for end users to detect and/or fix (having to recompile PHP in most cases!)

This is a known problem, and Drupal core itself (search module) works around it ("hacks around"?). The workaround is to specify an explicit whitelist/blacklist of unicode characters to either filter out or include.

Meaning, to keep our module's dependencies to a minimum, we have to go this way (or some other).

References:
* Search Core module (search.module) PREG_CLASS_SEARCH_EXCLUDE
* http://drupal.org/node/315486#comment-1117887

pwolanin’s picture

The goal here was to generate valid Solr field names from arbitrary vocabulary names that might include international characters.

However, what we actually need to ensure as well is that these are valid PHP object property names. Thus, the correct filter is probably this one from the PHP site:

Variable names follow the same rules as other labels in PHP. A valid variable name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: '[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*'

pwolanin’s picture

Status: Needs work » Needs review
Attachment: new, 1.2 KB

Since we already start with something like 'smfield' as the prefix, we can just use the latter part of the above regex.
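
A minimal sketch of that idea, assuming disallowed bytes are replaced with an underscore (the replacement character, the 'smfield_vid_' prefix as written here, and the function name are assumptions for illustration, not what the attached patch necessarily does):

```php
<?php
// Hypothetical helper: build a field name from a vocabulary name, keeping only
// the bytes allowed after the first character of a PHP label.
function example_vocab_field_name($vocab_name, $prefix = 'smfield_vid_') {
  // Bytes outside [a-zA-Z0-9_\x7f-\xff] become underscores. No /u modifier or
  // \p{L} is used, so this does not depend on PCRE Unicode support.
  return $prefix . preg_replace('/[^a-zA-Z0-9_\x7f-\xff]/', '_', $vocab_name);
}

echo example_vocab_field_name('Forum topics'), "\n";
```

Because the character class works on bytes in the \x7f-\xff range, multi-byte UTF-8 letters in a vocabulary name pass through unchanged.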

JacobSingh’s picture

Status: Needs review » Reviewed & tested by the community

The patch looks good. I haven't tested it, but I get the concept, and that regex looks like it will work.

pwolanin’s picture

Version: 6.x-1.x-dev » 5.x-1.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

committed to 6.x

pwolanin’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev
Status: Patch (to be ported) » Needs review
Attachment: new, 3.81 KB

the weighting still seems to be way off.

This patch greatly boosts body, changed, and comment count weighting.

Also fixes a bug with the extra field for comment counts.

pwolanin’s picture

Attachment: new, 6.77 KB

A little more cleanup as well. Looking now, it seems like having the q.alt param defined could be useful.

pwolanin’s picture

Attachment: new, 6.75 KB

committing attached to 6.x.

pwolanin’s picture

Status: Needs review » Fixed

pwolanin’s picture

Status: Fixed » Closed (fixed)