Provide PCRE documentation (examples)

yan - April 30, 2009 - 14:05
Project: Search 404
Version: 6.x-1.x-dev
Component: Documentation
Category: support request
Priority: normal
Assigned: anoopjohn
Status: needs review
Description

It would be nice to have at least a couple of examples for the input for "PCRE REGEX". I am trying to figure out how to exclude all numbers from the search, but with the PHP Manual I can't manage to do so.

#1

zyxware - June 15, 2009 - 10:57
Status: active » needs review

Hi yan

We have improved the PCRE Filter documentation in the latest dev release of 5.x.

Here is the help info:

This regular expression will be applied to filter all queries. The parts of the path that match the expression will be EXCLUDED from the search. You do NOT have to enclose the regex in forward slashes when defining the PCRE. For example, use "[foo]bar" instead of "/[foo]bar/". For details on how to write a PCRE regex, please refer to the PCRE pages in the PHP Manual.

Basically, if you have "[fb]oo" as the regex and the search404 URL is http://www.example.com/foo/boo/bar/baz, then the search will be performed for "bar" and "baz", while "foo" and "boo" will be excluded from the search.
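
To make the "[fb]oo" example concrete, here is a minimal sketch of the filtering step. The function name is hypothetical, and splitting the remaining path on "/" to get keywords is an assumption made for illustration, not something confirmed for the module.

```php
<?php
// Hypothetical helper, not the module's code: strip everything that matches
// the filter, then split what is left of the path into search keywords.
function search404_example_keywords($path, $regex_filter) {
  // The filter is entered without delimiters, so they are added here.
  $filtered = preg_replace('/' . $regex_filter . '/', '', $path);
  // Assumption: path segments become keywords; empty segments are dropped.
  return array_values(array_filter(explode('/', $filtered), 'strlen'));
}

$keywords = search404_example_keywords('foo/boo/bar/baz', '[fb]oo');
echo implode(' ', $keywords), "\n"; // bar baz
```

Both "foo" and "boo" match "[fb]oo" and are removed, so only "bar" and "baz" remain.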

Hope this helps you

Regards
Zyxware

#2

zyxware - June 15, 2009 - 11:20
Assigned to: Anonymous » zyxware

#3

fammangold.nl - June 17, 2009 - 16:53
Version: 5.x-1.x-dev » 6.x-1.x-dev
Category: feature request » support request

zyxware,

I have the same question as the OP.
Your post doesn't answer it. The OP and I want to exclude all numbers from a URL, for example: /boek/343_some_title
How to skip "boek" is already clear from your previous answer, but how do we skip the "343" (which can be entirely different in another URL with the same syntax)?

Regards, Lise

#4

zyxware - July 6, 2009 - 14:19

Hi Lise

My answer was about the functionality of the PCRE Regex. As for excluding numbers from parts of a URL like the example you stated above, /boek/343_some_title: this cannot be done by using PCRE Regex. I thought this would be understood from the explanation and the example provided.

You can exclude the number in /boek/343/some_title using PCRE; just use [1234567890]+ as the PCRE Filter entry.

Regards
Zyxware

#5

yan - July 6, 2009 - 18:34

exclude numbers from parts of URL like the example you stated above /boek/343_some_title, cannot be done by using PCRE Regex.

In that case, does the use of this module make any sense? I think it is quite common to include a number in the URL, for example, because search engines like Google News require you to do so. But as soon as there is a number in the URL, the search always yields no results since the number is included in the search query. Or am I getting something wrong?

#6

anoopjohn - July 15, 2009 - 16:08
Assigned to: zyxware » anoopjohn

To exclude all the numbers from the search, you can use the following regex:
[0-9]*
This will filter the following URL
http://example.com/123/456abc/def789ghi/jkl
and search for
abc defghi jkl
To exclude only the numbers that are followed by an underscore (together with the underscore), use
[0-9]*_
This will filter the following URL
http://example.com/123/456_abc/def789ghi/jkl
and search for
123 abc def789ghi jkl
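
The two filters can be checked outside Drupal with a short script. This is only a sketch: both function names are hypothetical, and turning the remaining slashes into spaces is an assumption about how the keywords are built.

```php
<?php
// Apply a filter the same way as the preg_replace() call shown in this
// comment: wrapped in "/" delimiters, every match replaced by nothing.
function search404_example_strip($path, $regex_filter) {
  return preg_replace('/' . $regex_filter . '/', '', $path);
}

// Assumption for illustration only: whatever is left between slashes
// is used as the search keywords.
function search404_example_words($filtered_path) {
  return trim(preg_replace('#/+#', ' ', $filtered_path));
}

$all_numbers = search404_example_strip('123/456abc/def789ghi/jkl', '[0-9]*');
$before_underscore = search404_example_strip('123/456_abc/def789ghi/jkl', '[0-9]*_');

echo search404_example_words($all_numbers), "\n";       // abc defghi jkl
echo search404_example_words($before_underscore), "\n"; // 123 abc def789ghi jkl
```

In the second case only "456_" is removed, because it is the only run of digits directly followed by an underscore.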

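  // The requested path, e.g. "boek/343_some_title".
  $keys = $_REQUEST['q'];
  ...
  // The PCRE filter as entered on the settings page (no delimiters).
  $regex_filter = variable_get('search404_regex', '');
  $keys_array[] = $keys;
  if (!empty($regex_filter)) {
    // Wrap the filter in "/" delimiters and remove every match from the path.
    $keys = preg_replace("/" . $regex_filter . "/", '', $keys);
  }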

This is the relevant code from the module. The function preg_replace removes the parts of the URL that match the regex. This gives you a lot of freedom in deciding what gets stripped out.
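
One consequence of the quoted code is that the filter is always wrapped in "/" delimiters, so a literal "/" inside the filter has to be escaped as "\/". The sketch below imitates the snippet in a stand-alone function. The function name is hypothetical, and the filter is passed in as a parameter because variable_get() only exists inside Drupal.

```php
<?php
// Stand-alone imitation of the quoted snippet, for experimenting only.
function search404_example_filter($keys, $regex_filter) {
  if (!empty($regex_filter)) {
    // Same call as in the module: the filter is wrapped in "/" delimiters.
    // "@" only silences the warning PHP raises for a malformed pattern.
    $keys = @preg_replace('/' . $regex_filter . '/', '', $keys);
  }
  return $keys;
}

// A literal "/" inside the filter is escaped as "\/".
$escaped = search404_example_filter('boek/343/some_title', '\/[0-9]+');
// Unescaped, the "/" closes the pattern early and preg_replace() returns NULL.
$unescaped = search404_example_filter('boek/343/some_title', '/[0-9]+');

var_dump($escaped);   // string(15) "boek/some_title"
var_dump($unescaped); // NULL
```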

#7

yan - July 17, 2009 - 03:00

Thanks anoopjohn, that did the trick! Maybe this (and other examples) could be added to the PCRE filter explanation.

#8

MacRonin - August 3, 2009 - 22:22

Another suggested example for the PCRE entry.

I figure that a common exclusion would be the date from a blog entry. A checkbox to exclude the yyyy/mm/dd portion of the URL would be perfect, but the next best choice is the following expression, which I use:

((19[0-9][0-9]|20[0-9][0-9])|(0[1-9]|[12][0-9]|3[01]))

This drops any four-digit string that starts with 19 or 20 and any two-digit string from 01 to 31. I'm just happy I got this one to work :-)
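
A quick way to see what this expression removes is to run it through preg_replace() with the "/" delimiters that the module adds. The sample paths below are made up for illustration. As written, the two-digit part matches 01 through 31, and because the pattern has no anchors it also matches digits inside longer strings.

```php
<?php
// The expression from this comment, entered without delimiters.
$date_filter = '((19[0-9][0-9]|20[0-9][0-9])|(0[1-9]|[12][0-9]|3[01]))';

// Hypothetical blog path with a yyyy/mm/dd portion.
$blog_path = preg_replace('/' . $date_filter . '/', '', 'blog/2009/08/03/my-first-post');

// Side effect: a number that is not a date is stripped as well.
$other_path = preg_replace('/' . $date_filter . '/', '', 'top10-lists');

echo $blog_path, "\n";  // blog////my-first-post
echo $other_path, "\n"; // top-lists
```

If the side effect matters, the filter would need to be tied to the surrounding slashes, which in turn means escaping them as "\/".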

#9

promes - October 7, 2009 - 11:28

Thanks anoopjohn, it almost solves my problem. Mine is almost like #4.
I have some nodes that are only accessible to special user roles, for which I have a small piece of PHP code that issues a drupal_not_found() call. The drupal_not_found function translates the URL via $_GET['q'] into the node number, like "node/123", which is shown by Search 404 as "node 123". But I don't want the node number shown, since it marks the node as an existing one.
If I use the expression "node|[0-9]", the text "node 123" is not shown. But it creates a problem: it removes all digits and the character string "node" from regular URLs.
Does anyone know how to solve this?

 
 
