Motivation

Currently I am working on a project where we need to search multi-word information as a field content

According to
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fields
for search we have to format filters in our code as
title:"Do it right"
instead of
title:Do it right

Proposed resolution

I think it is a core functionality so small addition for the SolrFilterSubQuery->addFq() fucntion had been made and it works fine for my case.

I'd like to invite people who knows the stuff to review the patch. As well as to verify that I choose optimal place for support Lucene functionality.

Let me know if patches like this needs a test for them.

Thanks!

Comments

nick_vh’s picture

StatusFileSize
new742 bytes

What about this? So we do not have to add more if statements? I tested it purely on basis of what is analyzed, not on functionality.

Let me know :)

dpovshed’s picture

Your version is works nicely! Tested on my development server.

Just please tune up your comment

instead of
// Ex.: bundle:(article or page)
lucene likes boolean operators in ALL CAPS
// Ex.: bundle:(article OR page)

thank you for rapid answer! I am voting for commit this.

nick_vh’s picture

Status: Needs review » Fixed

Fixed, you can find it in the latest dev

pwolanin’s picture

We should probably clean this up further - at least validate that the parens or quotes are paired?

nick_vh’s picture

I guess we have 3 options

Easiest:
Not validate anything more, the creativity of developers goes in a weird way so it might be better to not restrict them more.

Easy :
http://www.php.net/manual/en/function.count-chars.php
This functions counts all the chars in a string and then we could do a check on every character we'd like to verify to see if they have equal amounts.

Harder:
preg_match_all
Similar but harder to understand, this allows us to run a regexpr on a string and returns us with all the matches. We could do a regexpr on the parens and quotes (and anything else) and do a similar validation

nick_vh’s picture

Status: Fixed » Needs work
nick_vh’s picture

StatusFileSize
new9.87 KB

Update to also allow [ and ] signs

nick_vh’s picture

StatusFileSize
new1.25 KB

My bad, wrong patchfile! :)

dpovshed’s picture

In my opinion the best option is [Easiest]

IMHO Drupal layer shall be as relaxed as possible about data sent from developer/end-user to SOLR. If people making mistake - they just shall get a response from SOLR about this.

nick_vh’s picture

Maybe you are right, but if we want to make it as easy as possible we should somehow tell and notify people of the mistakes they make. We could add some advanced settings and put a setting there to not validate the search strings but that would be further away from home.

I'm more up for validation that does not break a site radically.

joakaune’s picture

I'm unsure if this is the same issue, but it is in the same method and very similar at least.
I bumped into this bug when trying to facet my search with a facet that contains a whitespace character.
The problem is with the last elseif in the addFq() function.
the current line (line 328) reads like this:
elseif (preg_match('/(-|)([^:]+):(\S+)/', $string, $matches)) {

To fix this problem I changed the last part of the regex to this:
elseif (preg_match('/(-|)([^:]+):(\[\x20-\x7E]+)/', $string, $matches)) {

This makes the last part of the regex match whitespace chars as well.
(\S+) was changed to (\[\x20-\x7E]+)

nick_vh’s picture

@joakaune, can you tell us what you were trying to filter for? In which case do you need a whitespace?

joakaune’s picture

A few of the facets the page I had indexed was using contained whitespaces.
They are in norwegian, so I don't know if they would mean anything to you.
But shouldn't whitespaces in facets be supported anyway?

nick_vh’s picture

Well, I'd like to see the example so I can modify and test some new cases?

would this work?
(-|)([^:]+):(-|)([^:]+)

I'm currently re-thinking all of this. The more examples we have here, the better we can make it

nick_vh’s picture

Came up with the following test to check if all would pass


// Cases to know :
// Exclude? Only when there is an exclamation mark?
// Local?
// addFilter needs documentation!
// addFilter($name, $value, $exclude = FALSE, $local = '') {
//'#name', for the facet field name,
//'#value', for the facet field value, 
//'#local', for Solr LocalParams
//'#exclude' set to TRUE if it is an exclusion filter.
// check if we see any mistakes and show a message
// add the filter to value.
// Examples :
//$example = '{!anything that you want to exclude}';
//$example = 'bundle:(article OR page)';
//$example = '{!bundle:(article OR page)}';
//$example = 'title:"double words"';
//$example = "field_date:[1970-12-31T23:59:59Z TO NOW]";
$example = "field_date:1970-12-31T23:59:59Z TO NOW";
//$example = "field_date:field with space";

    $string = trim($example);
    $local = '';
    $exclude = FALSE;
    // If {!something} is found as first character then exclude = yes
    if (preg_match('/\{!([^}]+)\}(.*)/', $string, $matches)) {
      $exclude = TRUE;
      $local = $matches[0];
      $string = $matches[1];
    }
    // Something with a complicated right-hand-side.
    if (preg_match('/(-|)(.+):(-|)(.+)/', $string, $matches)) {
      //$this->addFilter($matches[2], $matches[4], $exclude, $local);
     dsm('complex');
    }
    elseif (preg_match('/(-|)(.+)/', $string, $matches)) {
      // Something else between brackets
      //$this->addFilter('', $matches[2], $exclude, $local);
     dsm('anything else');
    }
   dsm($exclude, 'Exclude');
   dsm($matches, 'last');

Seems like it does not split on $example = "field_date:1970-12-31T23:59:59Z TO NOW"; yet
More work needs to be done!

nick_vh’s picture

Testing also has to happen with

fq={!cache=false}inStock:true
fq={!frange l=1 u=4 cache=false}sqrt(popularity)
fq={!cache=false cost=5}inStock:true&fq={!frange l=1 u=4 cache=false cost=50}sqrt(popularity)
fq={!tag=impala}model:Impala
fq={!anything that you want to exclude}
fq=bundle:(article OR page)
fq={!bundle:(article OR page)}
fq=title:"double words"
fq=field_date:[1970-12-31T23:59:59Z TO NOW];
joakaune’s picture

I'm not entirely sure what info would be helpful to you.
But the terms I was trying to facet on were simply two letter ones like this one: "Best Practice" (translated from norwegian: "Beste Praksis" .

Let me know if I can provide anything else that would be helpful for you.

dpovshed’s picture

@ joakaune, in my opinion your case is the same as mine, which I described in the node body. It falls under "Enclose multiple words in quotes" rule. Am I right?

nick_vh’s picture

@joakaune, would it be possible to write your example as the earlier examples?

Was it similar to this?

fq=title:"double words"

nick_vh’s picture

StatusFileSize
new6.55 KB

I tried adding all these cases into a simpletest and a renewed logic that allows the addition of any kind of query.
Please try it out for your usecase and report back if this new logic works out for you

If you would have more examples and rarities let me know

nick_vh’s picture

StatusFileSize
new6.55 KB

Some code layout changes

nick_vh’s picture

Status: Needs work » Needs review
nick_vh’s picture

StatusFileSize
new8.99 KB

Added a static function that allows us to validate the manually added fq parameter.

nick_vh’s picture

StatusFileSize
new8.99 KB

Formatting corrections

pwolanin’s picture

Status: Needs review » Needs work

The validation must be much tighter. For example, a range query has a format like "[X TO Y]" and we need to validate the whole thing matches exactly that format.

nick_vh’s picture

StatusFileSize
new10.57 KB

And we tighten the user input validation!
This new patch checks the format if it is a range/date query format (anything between [ and ] )
It also checks on invalid characters

If more advanced input is required they should probably use custom coding. Maybe testing is also required for this function.

nick_vh’s picture

StatusFileSize
new9.57 KB

Cleaner one. The spaces between the EOFs are handled in issue #1359294: Test is not enabling the correct modules

nick_vh’s picture

For the normal value we should use the drupal_validate_utf8($text)
For the field name we need preg_match('/^[a-zA-Z0-9_\x7f-\xff]+$/'

nick_vh’s picture

Status: Needs work » Needs review
StatusFileSize
new9.49 KB

Added those recommendations to the patch and removed a dsm. They pass all the tests so I would maybe need someone to review this and then we can commit this.
No urgent status on this one though

pwolanin’s picture

Status: Needs review » Needs work

Need to validate that we have actually matching ( vs ) and [ vs ]

Right now it looks like (foo] would pass because the code would decide it has the same number of opening and closing elements.

nick_vh’s picture

StatusFileSize
new10.07 KB

I haven't found any simple version (or readable) for counting the brackets/parentheses. The patch I am providing is trying to validate name and value, counting these brackets and verifying the date syntax if the value is a date.

Simpletests are provided to prove the functionality of this test. If anyone can come up with more tests or an example that breaks this test I would love to see it.

nick_vh’s picture

Status: Needs work » Fixed

And committed!

dpovshed’s picture

Thank you Nick for pushing this issue till the successful commit!

nick_vh’s picture

Status: Fixed » Closed (fixed)
nick_vh’s picture

Status: Closed (fixed) » Needs work

Reopening this because of a problem with the and drupal_validate_utf8 and preg_match below

nick_vh’s picture

Status: Needs work » Closed (fixed)