Motivation
Currently I am working on a project where we need to search multi-word information as a field content
According to
http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fields
for search we have to format filters in our code as
title:"Do it right"
instead of
title:Do it right
Proposed resolution
I think it is a core functionality so small addition for the SolrFilterSubQuery->addFq() fucntion had been made and it works fine for my case.
I'd like to invite people who knows the stuff to review the patch. As well as to verify that I choose optimal place for support Lucene functionality.
Let me know if patches like this needs a test for them.
Thanks!
| Comment | File | Size | Author |
|---|---|---|---|
| #31 | 1313698-31.patch | 10.07 KB | nick_vh |
| #29 | 1313698-29.patch | 9.49 KB | nick_vh |
| #27 | 1313698-27.patch | 9.57 KB | nick_vh |
| #26 | 1313698-26.patch | 10.57 KB | nick_vh |
| #24 | 1313698-24.patch | 8.99 KB | nick_vh |
Comments
Comment #1
nick_vhWhat about this? So we do not have to add more if statements? I tested it purely on basis of what is analyzed, not on functionality.
Let me know :)
Comment #2
dpovshed commentedYour version is works nicely! Tested on my development server.
Just please tune up your comment
instead of
// Ex.: bundle:(article or page)
lucene likes boolean operators in ALL CAPS
// Ex.: bundle:(article OR page)
thank you for rapid answer! I am voting for commit this.
Comment #3
nick_vhFixed, you can find it in the latest dev
Comment #4
pwolanin commentedWe should probably clean this up further - at least validate that the parens or quotes are paired?
Comment #5
nick_vhI guess we have 3 options
Easiest:
Not validate anything more, the creativity of developers goes in a weird way so it might be better to not restrict them more.
Easy :
http://www.php.net/manual/en/function.count-chars.php
This functions counts all the chars in a string and then we could do a check on every character we'd like to verify to see if they have equal amounts.
Harder:
preg_match_all
Similar but harder to understand, this allows us to run a regexpr on a string and returns us with all the matches. We could do a regexpr on the parens and quotes (and anything else) and do a similar validation
Comment #6
nick_vhComment #7
nick_vhUpdate to also allow [ and ] signs
Comment #8
nick_vhMy bad, wrong patchfile! :)
Comment #9
dpovshed commentedIn my opinion the best option is [Easiest]
IMHO Drupal layer shall be as relaxed as possible about data sent from developer/end-user to SOLR. If people making mistake - they just shall get a response from SOLR about this.
Comment #10
nick_vhMaybe you are right, but if we want to make it as easy as possible we should somehow tell and notify people of the mistakes they make. We could add some advanced settings and put a setting there to not validate the search strings but that would be further away from home.
I'm more up for validation that does not break a site radically.
Comment #11
joakaune commentedI'm unsure if this is the same issue, but it is in the same method and very similar at least.
I bumped into this bug when trying to facet my search with a facet that contains a whitespace character.
The problem is with the last elseif in the addFq() function.
the current line (line 328) reads like this:
elseif (preg_match('/(-|)([^:]+):(\S+)/', $string, $matches)) {To fix this problem I changed the last part of the regex to this:
elseif (preg_match('/(-|)([^:]+):(\[\x20-\x7E]+)/', $string, $matches)) {This makes the last part of the regex match whitespace chars as well.
(\S+)was changed to(\[\x20-\x7E]+)Comment #12
nick_vh@joakaune, can you tell us what you were trying to filter for? In which case do you need a whitespace?
Comment #13
joakaune commentedA few of the facets the page I had indexed was using contained whitespaces.
They are in norwegian, so I don't know if they would mean anything to you.
But shouldn't whitespaces in facets be supported anyway?
Comment #14
nick_vhWell, I'd like to see the example so I can modify and test some new cases?
would this work?
(-|)([^:]+):(-|)([^:]+)
I'm currently re-thinking all of this. The more examples we have here, the better we can make it
Comment #15
nick_vhCame up with the following test to check if all would pass
Seems like it does not split on $example = "field_date:1970-12-31T23:59:59Z TO NOW"; yet
More work needs to be done!
Comment #16
nick_vhTesting also has to happen with
Comment #17
joakaune commentedI'm not entirely sure what info would be helpful to you.
But the terms I was trying to facet on were simply two letter ones like this one: "Best Practice" (translated from norwegian: "Beste Praksis" .
Let me know if I can provide anything else that would be helpful for you.
Comment #18
dpovshed commented@ joakaune, in my opinion your case is the same as mine, which I described in the node body. It falls under "Enclose multiple words in quotes" rule. Am I right?
Comment #19
nick_vh@joakaune, would it be possible to write your example as the earlier examples?
Was it similar to this?
fq=title:"double words"Comment #20
nick_vhI tried adding all these cases into a simpletest and a renewed logic that allows the addition of any kind of query.
Please try it out for your usecase and report back if this new logic works out for you
If you would have more examples and rarities let me know
Comment #21
nick_vhSome code layout changes
Comment #22
nick_vhComment #23
nick_vhAdded a static function that allows us to validate the manually added fq parameter.
Comment #24
nick_vhFormatting corrections
Comment #25
pwolanin commentedThe validation must be much tighter. For example, a range query has a format like "[X TO Y]" and we need to validate the whole thing matches exactly that format.
Comment #26
nick_vhAnd we tighten the user input validation!
This new patch checks the format if it is a range/date query format (anything between [ and ] )
It also checks on invalid characters
If more advanced input is required they should probably use custom coding. Maybe testing is also required for this function.
Comment #27
nick_vhCleaner one. The spaces between the EOFs are handled in issue #1359294: Test is not enabling the correct modules
Comment #28
nick_vhFor the normal value we should use the drupal_validate_utf8($text)
For the field name we need preg_match('/^[a-zA-Z0-9_\x7f-\xff]+$/'
Comment #29
nick_vhAdded those recommendations to the patch and removed a dsm. They pass all the tests so I would maybe need someone to review this and then we can commit this.
No urgent status on this one though
Comment #30
pwolanin commentedNeed to validate that we have actually matching ( vs ) and [ vs ]
Right now it looks like
(foo]would pass because the code would decide it has the same number of opening and closing elements.Comment #31
nick_vhI haven't found any simple version (or readable) for counting the brackets/parentheses. The patch I am providing is trying to validate name and value, counting these brackets and verifying the date syntax if the value is a date.
Simpletests are provided to prove the functionality of this test. If anyone can come up with more tests or an example that breaks this test I would love to see it.
Comment #32
nick_vhAnd committed!
Comment #33
dpovshed commentedThank you Nick for pushing this issue till the successful commit!
Comment #34
nick_vhComment #35
nick_vhReopening this because of a problem with the and drupal_validate_utf8 and preg_match below
Comment #36
nick_vhFixing this in another issue. #1379128: Treat cardinality=1 fields as single-value + caching stats response