I've just been trying to search drupal.org for recommended webhosting services in the UK.

To filter for the UK I tried using "UK" as a search term, and "£*", both of which, of course, fall foul of the 2-character search filter thing.

Now I could try "Britain" or "GBP", but as I'm looking for the terms that people actually use, they wouldn't really help much.

Hmm. Any thoughts?

Comments

3stripe’s picture

John,

I totally agree!

Just been trying to search for any other UK users here, and only managed to find your comment because I was forced to do a search for 'Britain' instead.

Am sure there must be a few other 2 character search terms out there...

robin monks’s picture

Yeah, including the abbreviation of every state and province in existence.

Although I can see no clear way to correct this.

Robin

JohnG-1’s picture

Manual Custom Filters for Drupal Search Engine (DSE):

Personally I would rather spend admin time manually tuning the site search engine than f***ing about with menus and categories ...

a/ word-filters

1 - Didn't there used to be a list of 'excluded words' for the search engine in 4.5? If I remember those dim and distant olden times properly, there was a list of specific words that the DSE would ignore: 'the', 'and', etc. In fact 'etc' was probably one of them :) Looking through my DSE index file on 4.6, I see a lot of 'junkwords' now. I think the 'excluded words' filter was a good idea.

2 - If this were re-introduced, it shouldn't be too difficult to also have a 'reverse filter' - a list of 'keywords' that the DSE would always index, so our 'UK's and '£*'s could be put into that. They would be exceptions to the 'wordlength' filter. I don't know if special characters like &pound; (£) will need fancy coding.

3 - If someone was very clever, they could add a fine-tuning function to the DSE admin, so you go through the list of indexed words and give them a relevance weighting manually (a bit like the Bayesian spam filters... ?!!!):

  1. could be 'ignore completely'
  2. could be 'normal'
  3. could be 'special' ie like good-old keywords for the content of your site ...
  4. could be 'very special' whatever that might mean!

would a 'synonyms' function here be going too far? we're getting into the realms of glossaries and taxonomy :)

b/ context-filters

4 - So while I'm at it, some kind of taxonomy filtering / weighting should also be pretty useful. The early versions of taxonomy/categories module used to be called 'metadata' ... the clue is in the question! Personally I would love to see a 'filter by category' function for the DSE, it would translate to the UI as 'search within these categories' (user select from drop-down list).

5 - If taxonomy filters were too complex I would settle for a node-type filter: We used to have (on 4.5) 7 filter combinations for searching by 'nodes' and/or 'comments' and/or 'users'. In 4.6 we're down to 2: 'content' or 'users'. If this could be extended to distinguish between node-types, you could have search in 'books', 'pages', 'images', etc. On the face of it, a bit vague, but once you start using flexinodes, it would make the DSE much more powerful...

6 - I'm trying to figure out how "search whole phrases" can work on the Drupal Search Engine (http://drupal.org/node/21599) - it would save everyone a lot of time hunting down help for error messages. The simplest way (for the module developer) could be to allow phrases to be included (by admin hand) into the proposed DSE index/keywords. eg "Fatal Error: call to undefined function" ... etc. Laborious for admin, but at least you'd have something.

From what I can deduce, the DSE creates an index (like a slimmed down version of Google's cache) which it can search quickly and keep up to date. The problem is this system only seems to count the number of occurrences of a given word for a document title, so DSE-indexed words become isolated from any context like a "whole phrase".

I've often dreamed - and I don't know if even the mighty google has worked out how to do this efficiently - if a "phrase" search could be made more adaptable by calculating a weighting from the sequence of words and their inline proximity to give you a closest match to an actual phrase.
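To make that idea a bit more concrete, here is a rough sketch in Python of one way such a weighting could be calculated. It is only an illustration: the function name and the scoring formula (phrase length divided by the span of text the words are spread over) are made up for this example, and it assumes the raw word sequence of each page is available, which the DSE index does not keep.

```python
def proximity_score(text, phrase):
    """Score 1.0 for an exact phrase, less as the words drift apart, 0.0 if absent."""
    words = text.lower().split()
    query = phrase.lower().split()
    if not query:
        return 0.0
    best = 0.0
    for start, word in enumerate(words):
        if word != query[0]:
            continue
        pos = start
        found_all = True
        # Look for the remaining query words, in order, after the first one.
        for q in query[1:]:
            try:
                pos = words.index(q, pos + 1)
            except ValueError:
                found_all = False
                break
        if found_all:
            span = pos - start + 1
            best = max(best, len(query) / span)
    return best
```

A page containing the phrase word for word scores 1.0, a page with the same words in order but with other words in between scores somewhere below that, and a page missing any of the words scores 0.0.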

Anyway, one last point:

c/ search result weighting

7 - I don't know exactly how this is done at the moment (with 4.6). However, it would seem that the following should be taken into account in this, proposed, order (most important first):

  1. taxonomy/category associations of the node
  2. node title
  3. keywords module if enabled
  4. <h1> tags
  5. <h2> tags
  6. <h3> tags
  7. etc
  8. <strong> tags
  9. normal content

look familiar?

BTW once I'd written this, I thought it might be better to post it as a feature request for the search.module. Unfortunately there isn't a project page for the search.module http://drupal.org/project/Modules ??? Does anyone know a better place to address these issues?

Steven’s picture

A discussion of the search.module won't go very far until you are familiar with what it does now already and why it does so:
http://drupal.org/node/12232

Search issues belong in the Drupal project, because it is a core module.

In any case, the 3-letter minimum is a setting. On your own site, you are free to change it. On Drupal.org, it would make the search index way too big.

Noise words
There used to be a noise words mechanism, but it was removed because no-one ever used it. The current relative word ranking is a much better mechanism as it figures out on its own which words are commonly used on your site. If you check the search_total table you'll see. See this comment for an example.
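For anyone wondering what "relative word ranking" looks like in practice, here is a minimal Python sketch of the general idea. The formula and names are assumptions chosen for illustration and are not taken from search.module: each word gets a weight that shrinks as its total count across the site grows, so common words end up contributing very little without anyone maintaining a list.

```python
import math
from collections import Counter

def word_weights(documents):
    """Return a weight per word that shrinks as the word gets more common."""
    totals = Counter()
    for text in documents:
        totals.update(text.lower().split())
    # Hypothetical formula: a word seen once gets log10(2), frequent words tend to 0.
    return {word: math.log10(1 + 1 / count) for word, count in totals.items()}

docs = ["the drupal search module", "the search index", "the the the"]
weights = word_weights(docs)
```

Here "the" ends up with a much lower weight than "search", which in turn weighs less than "drupal".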

2 - If this were re-introduced, it shouldn't be too difficult to also have a 'reverse filter' - a list of 'keywords' that the DSE would always index, so our 'UK's and '£*'s could be put into that. They would be exceptions to the 'wordlength' filter. I don't know if special characters like &pound; (£) will need fancy coding.

I'm not sure about this: each word to be indexed would have to be compared against this list, further slowing down the indexing process.
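To show what is being weighed up here, a minimal Python sketch of a word-length filter with an exception list follows. The constant names and the exception words are hypothetical, and nothing here reflects how search.module is written. In this sketch the exceptions live in a set, so the extra check per word is a single lookup; whether that cost matters in the real indexer is a separate question.

```python
import re

MIN_WORD_LENGTH = 3                      # the 3-letter minimum discussed above
ALWAYS_INDEX = frozenset({"uk", "gb"})   # hypothetical admin-defined exceptions

def indexable_words(text, min_length=MIN_WORD_LENGTH, always_index=ALWAYS_INDEX):
    """Keep words that are long enough, plus any short word on the exception list."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if len(w) >= min_length or w in always_index]
```

With these settings, "Web hosting in the UK" keeps "web", "hosting", "the" and "uk", and drops only "in".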

By the way, the pound sign works fine, but really you should type the character literally instead of using fugly entities. They are unnecessary with UTF-8 encoding (although search module will handle them if it encounters any).

4 - So while I'm at it, some kind of taxonomy filtering / weighting should also be pretty useful. The early versions of taxonomy/categories module used to be called 'metadata' ... the clue is in the question! Personally I would love to see a 'filter by category' function for the DSE, it would translate to the UI as 'search within these categories' (user select from drop-down list).

This is not that hard to add actually, just that there is no elegant method of integrating it into the UI. Ideally we would have a collapsed "Advanced search options" fieldset. But we need to wait for that patch first.

5 - If taxonomy filters were too complex I would settle for a node-type filter: We used to have (on 4.5) 7 filter combinations for searching by 'nodes' and/or 'comments' and/or 'users'. In 4.6 we're down to 2: 'content' or 'users'. If this could be extended to distinguish between node-types, you could have search in 'books', 'pages', 'images', etc. On the face of it, a bit vague, but once you start using flexinodes, it would make the DSE much more powerful...

The old system is no longer applicable, all it did was give you separated lists of X, Y and Z. With the new search, each tab gives you one result set. Comments are indexed with nodes. User search is orthogonal to content search, so it was moved to a separate tab. Ideally, user search would be expanded to search user profiles too.

6 - I'm trying to figure out how "search whole phrases" can work on the Drupal Search Engine (http://drupal.org/node/21599) - it would save everyone a lot of time hunting down help for error messages. The simplest way (for the module developer) could be to allow phrases to be included (by admin hand) into the proposed DSE index/keywords. eg "Fatal Error: call to undefined function" ... etc. Laborious for admin, but at least you'd have something.

The problem with phrase searching is that you can no longer collapse identical words down to one index entry, which means the search index grows to an insane size as each instance of a word is stored separately. The only compact phrase searching mechanism I've seen so far was one where they simply stored which words followed each instance of a particular word ("drupal" followed by "release, updates, modules") and did some sort of match on that. It would result in a (low) number of false positives though.
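A small Python sketch of that "followed by" mechanism, as I understand the description, may help. The structure and function names are assumptions for illustration only. For every word it stores, per document, the set of words seen directly after it; a phrase then matches any document in which each adjacent pair of the phrase occurs somewhere.

```python
from collections import defaultdict

def build_followers(documents):
    """Map each word to {doc_id: set of words that directly follow it}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in documents.items():
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            index[current][doc_id].add(following)
    return index

def phrase_candidates(index, phrase):
    """Return doc ids in which every adjacent word pair of the phrase occurs."""
    words = phrase.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return set()  # a single word is not a phrase; use the normal index
    result = None
    for current, following in pairs:
        docs = {d for d, nxt in index.get(current, {}).items() if following in nxt}
        result = docs if result is None else result & docs
    return result

docs = {
    1: "drupal release updates",
    2: "drupal release notes and new release updates",
    3: "drupal modules",
}
followers = build_followers(docs)
matches = phrase_candidates(followers, "drupal release updates")
```

Document 2 is returned alongside document 1 even though it never contains the exact phrase, because "drupal release" and "release updates" both occur in it separately. That is the kind of false positive mentioned above.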

7 - I don't know exactly how this is done at the moment (with 4.6). However, it would seem that the following should be taken into account in this, proposed, order (most important first)...

The search module already recognizes and uses HTML tags. It even resolves links between nodes.
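As a rough picture of what tag-based weighting means, here is a Python sketch using the standard library's HTML parser. The weights are invented for the example and are not the values search.module uses, and link resolution between nodes is left out. Each word scores according to the heaviest tag it sits inside, and plain content scores 1.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights; real values would need tuning.
TAG_WEIGHTS = {"h1": 10, "h2": 8, "h3": 6, "strong": 3}

class WeightedWords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.scores = Counter()

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1) for t in self.open_tags), default=1)
        for word in data.lower().split():
            self.scores[word] += weight

def score_html(html):
    """Return a Counter of word scores for one HTML fragment."""
    parser = WeightedWords()
    parser.feed(html)
    parser.close()
    return parser.scores

scores = score_html("<h1>Drupal search</h1><p>search <strong>index</strong> works</p>")
```

In this fragment "search" scores 11 (10 from the heading plus 1 from the paragraph), "drupal" 10, "index" 3 and "works" 1.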

JohnG-1’s picture

Steven, thanks for taking the trouble to reply at such length and in such depth. I sincerely hope you might find something useful in all this! I must say I know very little about search engine theory and haven't the foggiest about making them! But I do use them a lot and value them very highly. I hope you take everything I say the way it is intended; as constructive criticism and doing my best to help.

A discussion of the search.module won't go very far until you are familiar with what it does now already and why it does so: http://drupal.org/node/12232

Thanks, I've been searching for that for ages ;) Perhaps the context-weighting (eg page title) doesn't carry enough SE clout, or I think I would have found your patch thread through the Drupal search. I trawled through several titles which contained none of my search terms (guess how relevant they were!) and ended up using the in-browser 'find' tool to scan search results listings page after page (it kept getting stuck on teasers in sidebars, which was annoying).

1. Your 'noisewords' weighting/soft-filter system is a work of art! It's up there with the spam.module for truly beautiful logic.

But ... I don't see how it helps to keep the index filesize or workload down (that can't help search-processing-speed?) and it doesn't lend itself to admin-tuning (unless you spam your own search engine!). Perhaps the noisewords it identifies could be used as hard filters in subsequent index passes ... a noise-threshold would have to be established, likely based on a % of the total wordcount for the site ...?
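To spell out what I mean by a noise-threshold, here is a quick Python sketch. The function name and the threshold values are only guesses for illustration. It counts every word on the site and flags those whose share of the total wordcount is above the threshold; those could then be skipped as hard filters on the next index pass.

```python
from collections import Counter

def noise_words(documents, threshold=0.05):
    """Words whose share of all word occurrences on the site exceeds threshold."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {w for w, c in counts.items() if total and c / total > threshold}

docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "a search for the index",
]
noisy = noise_words(docs, threshold=0.1)
```

With a 10% threshold only "the" is flagged in this tiny sample; the right percentage for a real site would have to be found by experiment.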

2. index file structure? The index you describe sounds like a book-index (term : list of page refs). That's not how I imagined a search engine index file. How about creating the index by caching each page after stripping out anything that you don't want for the search engine, ie: almost all formatting and noisewords - in theory you should be left with keywords and smaller index files. This can be performed when the page is submitted (or updated). So you're kind of generating a set of metatags native to the DSE. Then you run an SQL table search? I don't know anything about that bit ...! In fact, why not just use the cache that Drupal generates anyway? With CSS and theme template data stored externally (no?) it should be pretty well stripped down already - and avoids druplicating data ...?

3. manual override for exceptions to filter-rules (=administrator hacks):

In any case, the 3-letter minimum is a setting. On your own site, you are free to change it. On Drupal.org, it would make the search index way too big.

Thanks, I'm not asking how to set the wordlength filter, I'm asking how to override it where exceptions require ... The wordlength filter is effective but crude. You must acknowledge that there will be exceptions to every rule, and this too should be provided for.

I do think there is a case for admin-defined keywords which could also be phrases. These would always be indexed.
- Keyphrases like error-messages, would be useful for sites like drupal.org.
- All acronyms and terms like 'FAQ', 'How to', 'How do I', even 'How ...?' could benefit from a way of setting & manipulating synonyms.
- It would also allow for requests like 'UK' to be processed by hand (admin) when necessary (they can't come up that often, surely!).

4. "phrase searching" I realise this is tricky because of the streamlined index design. But it seems like such a useful function ... How do most other search engines do it?

5. combining search & taxonomy modules

Search issues belong in the Drupal project, because it is a core module.

Thanks. I was a bit confused, 'core modules' sounds a bit paradoxical to me. Why not put it on the projects/modules list anyway - seems like the obvious place to find it? This is opensource, right? And ... how about combining the search and taxonomy modules? They would make such a powerful partnership.

Your word-count data sample is very interesting and instructive: ( I can't get these logs for my site can I?)
a/ to compare with a search-terms-used.
b/ The context-keywords it throws up (drupal,module,updates) suggest sitenav section headings/ categories/ etc

Advanced Search: Taxonomy Filters ? - since when were advanced-search interfaces elegant anyway? I'm absolutely delighted to hear that it's not too far off!
I guess checkboxes for every term & subterm could get a bit clumsy.
Should definitely allow admin to prevent certain vocabularies from appearing as search filters (like the sitemenu.module).

How about a row (not a column mind you) of drop-down (multiple) select boxes, one for each 'vocabulary', containing the list of terms & subterms plus *none* and *all*, so you select the categories you want to search in (as opposed to selecting the ones to exclude...).
What have I missed?

JohnG-1’s picture

SQL search (trip search) module has been revived and does most of the things you would expect from a search engine. http://drupal.org/project/trip_search

Steven’s picture

Status: Active » Closed (won't fix)