Two improvements to search.module that I'd like to discuss:

- When indexing words, also store which $node field each word appears in. It's useful to know whether a search result refers to $node->title, $node->author or $node->body. With custom, complex node types this is even more necessary.

- When storing words from freeform node fields, such as body, it would be good to store 'the n words before' + 'the word' + 'the n words after', to provide context for the result, maybe even highlighting the sought word. For example:

Search: 'ipsum'

Result: 'My title'
By 'author'
Context: '... lorem __ipsum__ sic amet ...'
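To make the 'n words before' + 'the word' + 'n words after' idea concrete, here is a minimal sketch. It is written in Python rather than Drupal's PHP, purely for illustration; the function name `context_snippet` and the `__word__` highlighting are assumptions taken from the example above, not existing search.module code.

```python
def context_snippet(text, word, n=3):
    """Return up to n words on each side of the first match, match highlighted."""
    tokens = text.split()
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring punctuation around the token.
        if token.strip(".,;:!?'\"()").lower() == word.lower():
            before = tokens[max(0, i - n):i]
            after = tokens[i + 1:i + 1 + n]
            return "... " + " ".join(before + ["__" + token + "__"] + after) + " ..."
    return None  # the word does not occur in the text


snippet = context_snippet("Lorem ipsum dolor sit amet, consectetur", "ipsum", n=1)
print(snippet)
```

Whether such a snippet is stored at index time or computed when results are displayed is a separate design question.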

I'm not sure whether either of these things is feasible and/or interesting to anybody else, so I'd like some feedback before assigning them to myself.

Comments

Bèr Kessels’s picture

http://drupal.org/node/view/8237 was marked duplicate of this. It might still be of some use though, for people looking at this specific case.

stevew’s picture

I would really love to have some context along with the search results. People are not strict about how they title their nodes, comments, etc, so when you get back a list of search results with titles like "LOL!", "Broken module", and "feature request", you still have to go to each one of those links and check it out to see if what you want is there. A few words before and a few words after, like Google, would be a great improvement on the search page.

Bèr Kessels’s picture

Status: new, file size: 6.41 KB

Hi All,

I spent an afternoon working on the search engine.

I made it all themeable.
I removed search_item() since those things should be handled by the theme, not the modules.
I introduced theme_search_item()
I introduced theme_search_display() that handles the display of the results

The latter also handles the ordering.

The UI for administering the ordering might need some improvement. Maybe some dropdowns with weighting, at some point in the future?

And I did some other improvements on the returned variables. They are now all associative arrays, so that the themeable functions have total power.

dries’s picture

Ideally, search results should be ordered by how well they match the query, not by their type. Grouping the search results by type requires more effort from the user. Would it be hard to get rid of this grouping?

Also, I suggest renaming theme_search_display() to theme_search_results(). The term 'display' is rather generic.

Bèr Kessels’s picture

Status: new, file size: 6.28 KB

Yes, Dries, that is possible.

What I already did is add the type to the search result, so the themeable function could do the magic you describe.

The problem I found is that the search rating as it currently stands is BAD. If you ordered only on rating, the results would make no sense at all.

The next step in this search improvement project was to make a rating system that the administrator can define. At that point I can simply remove the grouping.

But I believe the grouping can be of use to a lot of people. Grouping things like projects and users, especially, makes a lot of sense to me. And on a blogging site you would definitely want to split comments and nodes! Because it is themeable, you could redefine your theme function for drupal.org to not group the results.

The function name has been changed as suggested.

moshe weitzman’s picture

Grouped results are quite standard IMO. As an example, see the upcoming Spotlight search engine in OSX: http://www.apple.com/macosx/tiger/spotlight.html

dries’s picture

The Apple-way to present the different result groups is much more intuitive than the long, long pages generated by Drupal. It is only effective when you have a good overview of all results.

moshe weitzman’s picture

trip_search.module already takes the Apple approach (before Apple did, mind you). You have 'more' links in each group.

mort-1’s picture

Hi,

(I'm really sorry I lost track of this discussion, but I never noticed any follow-ups on drupal-devel until tonight)

In the meantime, I've been working on this. A summary of the first part of my work can be seen here:

http://drupal.org/node/view/8739

As for the 'context' part, I've got some semi-working code that does the following:

1 Alter the search_index table, adding a 'field_type' and a 'context' column
2 Modify the module_update_index hook, so that context can be stored for some fields and not for others. For example, an 'author' or 'recipe_ingredients' field hardly deserves context, but 'body' or 'recipe_text' do.
3 Modify the actual indexing:
3.1 For each of the indexed field strings
3.1.1 Store the field type in the search_index table
3.1.2 After removing whitespace etc., but before removing the punctuation from the string, duplicate it.
3.1.3 Explode both strings (punctuated and unpunctuated) into two different arrays. Punctuated = alpha / unpunctuated = beta
3.1.4 For each word in the beta array that is not a noise word
3.1.4.1 Do an array_slice on the alpha array to get the n words before and the n words after this word. The reason for doing this on alpha is, obviously, to keep the punctuation in the contexts
3.1.4.2 Concatenate the 'pre' slice, the word and the 'post' slice
3.1.4.3 Store this string in an array: since there is only one record per word per node, the context has to be an array whenever the word occurs more than once
3.1.4.4 Serialize the contexts array and insert it into search_index
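The indexing steps in 3.1 can be sketched roughly as follows. This is an illustration in Python, not the actual (unposted) PHP code: `build_contexts` and the `NOISE_WORDS` set are hypothetical, `json.dumps` stands in for PHP's serialize(), and punctuation is stripped per token (rather than from the whole string) so that the alpha and beta arrays are guaranteed to stay aligned.

```python
import json
import string

NOISE_WORDS = {"the", "a", "and", "is"}  # hypothetical noise-word list


def build_contexts(text, n=2):
    """Map each non-noise word to the list of contexts it occurs in."""
    alpha = text.split()  # punctuated tokens, used for display
    beta = [w.strip(string.punctuation).lower() for w in alpha]  # cleaned tokens
    contexts = {}
    for i, word in enumerate(beta):
        if not word or word in NOISE_WORDS:
            continue
        pre = alpha[max(0, i - n):i]      # the n words before
        post = alpha[i + 1:i + 1 + n]     # the n words after
        contexts.setdefault(word, []).append(" ".join(pre + [alpha[i]] + post))
    return contexts


contexts = build_contexts("Lorem ipsum, the ipsum is here.", n=1)
# One record per word per node: the whole list is serialized into one column.
row = json.dumps(contexts["ipsum"])
print(row)
```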

I've got this working up to this point, except for a stupid bug that causes the serialized array to get corrupted when the insert takes place. I'll post the code between tomorrow and Monday, to ease the headaches caused by this block of nonsense :)

Comments?

Bèr Kessels’s picture

Hi all, back from the movies and found quite some life here!

It's good to mention that my patch does none of this itself, but allows *themers* to do all of it.

The grouping is done by a themeable function.
The weighting of grouped results is done in the same themeable function, so if you want, you could even create Google-adbar-like results in the theme.
The display of a single result is done by a themeable function.

The result is that on the outside nothing has yet changed for xtemplate (and other themes too).

See this patch as a re-wiring of the search engine, so that we can now start tweaking and upgrading the engine to give better search results!

dries’s picture

The fact that the theme can sort search results doesn't mean this became a non-issue. As Ber said, we must tweak the search/index code itself. So far very little changed: the real problem ('Better search results') is still unsolved. (In many ways, trip_search.module does a better job.)

Bèr Kessels’s picture

I am wondering what others think is the best route to take from here.

We can go three ways, AFAIK:
1. Replace core search with trip search
2. Add trip search code to the core search
3. Leave trip search alone and add weighting to the core search.

To be honest: I would only be able to do 3, since I have the ideas and schemes lying here to do so. But I think it's no good idea to re-invent something that works well. So can anybody here tell me what the disadvantages of trip_search are (if there are any)? And can people tell me what they would prefer to happen from here?
If we go for either 2 or 3, the patch attached above should be applied, because I intend to continue working on top of those changes.

Bèr

dries’s picture

I like the idea of removing the search index (trip search doesn't have an index), however, that introduces new issues as we do have quite a few PHP nodes here on drupal.org and these can be searched with a query.

Having both 'trip search' and the current search module in core is not an option, IMO.

Adding weighting to core might be a good path to take, but that is not going to address the fact that searching for 'Drupal 4.4.1 release' does not yield any results. Earlier today I searched for 'CVS access' and 'apply CVS access' and could not find the CVS application form. In the end, I used Google and it came up as the first hit. Indexing is broken more than anything else (and weighting is not going to fix that). I see two paths here:

1. Fix the indexing of content.
2. Look into removing the index altogether.

matt westgate’s picture

Perhaps we should again explore the possibilities of using MySQL fulltext indices. I don't know what percentage of our user base uses Postgres, but they could use the current search system's indexing scheme or a fulltext plugin.

Steven’s picture

I think indexed searching is still the way to go for speed reasons. Full-text search doesn't scale properly. Someone suggested using PHPDig, but I took a look at it and like most PHP projects, its code is a big mess. There's some useful things in there, but they're packed inside code duplication and messy design.

Still, I think the search index could work great if we do the following things:

- Have stricter rules for including words: have a larger "noise words" list and only include words of 5 letters or more, except those that appear in a "clean words" list (lots of acronyms). Taking a look at my own blog's index, it's full of 2-3-4 letter words with no meaning (not, but, is, for, ...). These noise lists would of course be language specific: maybe this belongs with locale? These can be determined easily just by taking a look at the most popular words in an existing index and picking out the noise ones.

- Use different splitting rules: split on spacing characters only first, then trim away punctuation from the beginning and end (this helps for acronyms, version numbers, website addresses, etc). Apply the same rules to indexing as well as typed queries.

- Use basic weighting for different fields, as provided by the module. Titles score higher than teasers, score higher than bodies, etc. Improvements for determining extra weights could also have a positive effect: this would be mostly arbitrary, but I'm sure together we can come up with some meaningful rules (early vs late in the text?).

- If we store word positions as well as counts, we could support phrase searching (where word 1 is in position X and word 2 in position Y). Not sure how the queries for this would look though.

- I'm not sure storing context along with the search index is a good idea: the index is already huge for a normal-sized site. Wouldn't simply loading the node/comment/thing in question be a better idea? If search results are more relevant, we can afford to show fewer results per page.

- Making sure all content is indexed is obviously important (PHP pages and such).
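The inclusion and splitting rules in the first two points could look something like the sketch below. It is Python for illustration only; the word lists are made-up placeholders, and `index_words` is a hypothetical name. The same function would be applied to indexed content and to typed queries.

```python
import string

NOISE_WORDS = {"about", "which", "there"}  # hypothetical, language-specific
CLEAN_WORDS = {"cvs", "php"}               # hypothetical short words worth keeping


def index_words(text, min_length=5):
    """Split on whitespace first, then trim punctuation from both ends."""
    words = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if not word or word in NOISE_WORDS:
            continue
        # Keep long words, plus short ones that are explicitly whitelisted.
        if len(word) >= min_length or word in CLEAN_WORDS:
            words.append(word)
    return words


words = index_words("Is the Drupal 4.4.1 release (about CVS) ready?")
query = index_words("CVS access")
print(words, query)
```

Because punctuation is only trimmed from the ends of each token, a version number such as 4.4.1 survives as a single word instead of being split into digits.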

Bèr Kessels’s picture

Hi,

"- Have stricter rules for including words: have a larger "noise words" list and only include words of 5 letters or more, except those that appear in a "clean words" list (lots of acronyms). Taking a look at my own blog's index, it's full of 2-3-4 letter words with no meaning (not, but, is, for, ...). These noise lists would of course be language specific: maybe this belongs with locale? These can be determined easily just by taking a look at the most popular words in an existing index and picking out the noise ones."

This was exactly what I had in mind. Some user power should be in place too, though, because a user might want to allow certain words to appear anyway (not that the results for such a word would make sense). I was thinking of adding another column to the DB table with a flag: noise. That way future changes to the noise words can be made easily. Currently that is not possible: once your index is dirty you have to clean it by hand.

"- Use basic weighting for different fields, as provided by the module. Titles score higher than teasers, score higher than bodies, etc. Improvements for determining extra weights could also have a positive effect: this would be mostly arbitrary, but I'm sure together we can come up with some meaningful rules (early vs late in the text?)."

This was exactly what I wanted to introduce. In the configuration you can set a weighting multiplier for each field. The sum of the multiplied fields makes up the score.

"- If we store word positions as well as counts, we could support phrase searching (where word 1 is in position X and word 2 in position Y). Not sure how the queries for this would look though."

My plan/idea for this was to do a node_load for all the results of a query, up to a maximum of (let's say) 100 (but that should be definable from the UI), and then to perform certain filter actions on the teaser, body, etc. This way you can:
Add context (the phrase) around the word.
Search for phrases.
Perform the weighting I described above.
Do other cool stuff (a hook_search_node_load($key, $node)).
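As an illustration of the 'load the candidates, then filter' plan, here is a small Python sketch of the phrase-search step. It assumes the index query has already produced candidate nodes and that their text has been loaded; `phrase_filter` and the (nid, text) tuples are hypothetical stand-ins for node_load results.

```python
def phrase_filter(candidates, phrase, limit=100):
    """Keep the nids of loaded candidates whose text contains the phrase.

    candidates: list of (nid, text) pairs, already loaded.
    Note: this is a plain substring match, so 'cvs access' would also
    match 'cvs accessory'; word-boundary handling is left out for brevity.
    """
    needle = " ".join(phrase.lower().split())
    hits = []
    for nid, text in candidates[:limit]:
        haystack = " ".join(text.lower().split())  # normalize whitespace
        if needle in haystack:
            hits.append(nid)
    return hits


candidates = [(1, "How to apply for CVS access"), (2, "Access the CVS tree")]
hits = phrase_filter(candidates, "CVS access")
print(hits)
```

Both nodes contain both words, so the index alone would return both; only the post-filter tells them apart.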

Also note that I am not a search guru. But waiting for one was taking too long, so I'm just giving it a try.

Steven’s picture

Perhaps we should also look at taxonomy-based searching. For example, imagine that searching with a taxonomy term in the keywords automatically restricts the search to nodes with that taxonomy term.

I feel exactly the same way as Ber: I know very little about searching, but I'm tired of waiting for a guru. If we put our heads together I'm sure we can come up with a great solution.

bertboerland’s picture

as per this posting, why not make POST the standard or at least make it an option

bertboerland’s picture

oops, correct link. so the question is: make both POST and GET available, with a dropdown to choose the default for searching, or make GET the default (it is more useful, for making links etc.)

merlimat’s picture

Just a couple of thoughts:

1. I've looked closely at the DB schema for the search index and (although an SQL table is not
the perfect system for storing a search index) I believe its space usage can be reduced by
using word IDs in the table.
That is: the current table schema ( word, nid, count ) can be changed to ( wid, nid, count )
with 'wid' an integer. This brings the space used down to at most 4 bytes per word.
You can then use another table to keep the mapping word => wid, or (better for performance) use
a hash function to compute the word ID.

2. Collation in searching. If you use the search engine with non-ASCII text you will likely want
to match text whether or not it contains accented letters. In fact, with the current system you
cannot match the Italian word 'perché' by entering 'perche' in the query.
The issue is that words (especially names) are often written without accents, so
when searching you cannot know in which form the word you are looking for was written.
Collation can be done in several ways:
- Through MySQL. From version 4.x you can specify the collation for a DB, a table
or a field. MySQL then does it transparently.
- By normalization. That is, normalizing accented letters to their ASCII counterparts. This
can easily be done when working on Unicode strings.
Another way of doing normalization is through the PHP recode module.
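The normalization route can be sketched in a few lines. This is Python using the standard unicodedata module, as an illustration rather than a Drupal implementation; `fold_accents` is a hypothetical name. The idea is to decompose each character and drop the combining accent marks, applying the same folding to indexed words and to queries.

```python
import unicodedata


def fold_accents(word):
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_accents("perché")
print(folded)
```

Letters that have no decomposition (for example 'ø') pass through unchanged, so this covers accents but is not a full transliteration.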

ciao
matteo

merlimat’s picture

>>4 bytes for every word.
>>   Now you can use another table to keep mapping from  word => wid, or
>>(better for performance) use
>>   an hash function to compute the word id."

> What would be the purpose of using a hash there?

By hash function I mean a function that derives an integer (ideally a unique one) from the input string. That way there is no need to keep a separate table for the conversion (word => wid); the hash function does it for you.

Steven’s picture

With hashes, you could no longer do wildcard searches though.
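To illustrate both the hashing idea and this objection, here is a small Python sketch. Using CRC32 as the hash is purely an assumption for the example (it yields a 4-byte integer, but collisions between different words are possible, so it is not truly unique); the two tiny in-memory indexes are made up.

```python
import zlib


def word_id(word):
    """Hypothetical wid: a 32-bit integer computed from the word itself."""
    return zlib.crc32(word.encode("utf-8"))


# The same toy index, keyed by word and keyed by hashed word ID.
index_by_word = {"drupal": [1, 2], "drupalcon": [3]}
index_by_wid = {word_id(w): nids for w, nids in index_by_word.items()}


def prefix_search(prefix):
    """Wildcard search 'prefix*' needs the words themselves, not their hashes."""
    return sorted(
        nid
        for word, nids in index_by_word.items()
        if word.startswith(prefix)
        for nid in nids
    )


exact = index_by_wid[word_id("drupal")]   # exact lookups still work with hashes
wildcard = prefix_search("drupal")        # only possible on the word-keyed index
print(exact, wildcard)
```

A separate word => wid table keeps the words available for wildcard matching, at the cost of an extra lookup.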

moshe weitzman’s picture

So far, only 2 complaints against trip_search.module remain standing. The oft cited performance concerns are unfounded; we compared trip_search vs. search.module using the drupal.org database - they were equally fast.

The first concern is lack of relevance ranking. This is a matter of preference. In practice, I find reverse chrono ordering quite agreeable. Our current search.module does relevance ranking, but it is quite crude. It ranks based on word frequency. Compared to that algorithm, I'd actually prefer reverse chrono.

The second concern is inability to properly index PHP pages. This is a problem with current search.module, and search.module has the luxury of maintaining its own index! To resolve this shortcoming, we have no choice but to add an index which modules can optionally use. This index would initially be used only for PHP pages. It need not be as sophisticated as current search_index - just a hunk of HTML is all we need.

In summary, I think trip_search.module is a big step forward, and recommend it be considered for core. Folks - please activate this module on your sites (it can coexist with current search.module) and evaluate the quality of results.

P.S. Full text indexing using the DB server does not resolve the 'PHP pages' problem.

bertboerland’s picture

Something that might be useful for ordering the search results: number of pageviews.

We don't have to make a very complicated ranking mechanism. It *is* okay to weight taxonomy, number of words, etc. in the ranking, but why not use just the number of hits on the page as a last resort? I think this has at least one drawback: giving pages with more views a higher ranking creates its own reality (higher ranking > more pageviews > higher ranking > ...). This might be solved by only taking external pageviews into account, or by excluding internal searches.

Bèr Kessels’s picture

Before we all dive into giving our preferred set of metadata to rank on, I would like to outline the code I have here (which still needs to be debugged, cleaned and tested first).

It should be possible to count any metadata in the ranking. Different sites need different criteria to rank on. News portals will indeed want results ranked by date, but an aficionado site will want the first article ever written about "Sahara Cacti" to rank #1. And we at Drupal probably want book pages to always rank higher than forum entries.

Therefore I made a more general ranking algorithm that allows administrators to set the weight (0-10) of each of these metadata items (I call them multipliers).
For now my code ranks on:
Date,
Title,
Teaser,
Body.

But we have (amongst others): flexinode fields, path, taxonomy, menu callback, pageviews, external pageviews, number of comments, number of comments containing the keys, author, type of node, parent of a book page, et cetera.

So it should be extensible in some way. I am not sure how I will actually implement this extensibility, but I will probably end up creating a hook. Of course we need to keep in mind that adding all of the above-mentioned metadata to the search ranking might not make sense (overhead!), but adding some will for sure enable administrators to generate a search ranking on their specific site that makes the most sense for their audience and their content.
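One possible shape for such an extensible ranking, sketched in Python for illustration: each metadata item registers a factor function, and the administrator's multipliers decide how much each one counts. The registry, the decorator and the example factors are all hypothetical; in Drupal this role would presumably be played by a hook.

```python
RANKING_FACTORS = {}  # factor name -> function(node, keys) returning a number


def ranking_factor(name):
    """Register a factor function under a name (a stand-in for a hook)."""
    def register(func):
        RANKING_FACTORS[name] = func
        return func
    return register


@ranking_factor("title")
def title_factor(node, keys):
    # Number of times the search keys occur in the title.
    return sum(node["title"].lower().count(key) for key in keys)


@ranking_factor("comments")
def comment_factor(node, keys):
    return node.get("comment_count", 0)


def rank(node, keys, multipliers):
    """Score = sum of (administrator-set multiplier * factor value)."""
    return sum(
        multipliers.get(name, 0) * factor(node, keys)
        for name, factor in RANKING_FACTORS.items()
    )


node = {"title": "Sahara Cacti", "comment_count": 3}
total = rank(node, ["cacti"], {"title": 10, "comments": 1})
print(total)
```

A factor whose multiplier is 0 (or unset) contributes nothing, so a site can ignore metadata it does not care about, although in this sketch the factor is still computed.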

I am not sure if and how this will fit into trip_search, but I assume it will not be much harder than fitting it into search.module.

dries’s picture

First, I'd like to have a patch that makes the search module return good results (in whatever order). Only _after_ such a patch landed, and the fundamental problem is fixed, I'll consider a patch that worries about the ranking.

I just tried trip_search but it doesn't seem to work with Drupal HEAD. The idea of removing the index is really tempting.

dries’s picture

Marking this 'active' as there is no consensus yet.

Steven’s picture

Ranking and results are very much related though. Any Google query will return tons of results, but only the first 10-20 are useful for 99.9% of the queries.

Bèr Kessels’s picture

I have been off this for a while, due to lack of time. However, I found that the problem of pages not being indexed is not really a search.module problem. If we agree that only the end result of any page should be indexed (the user searches for what he sees, not for the tokens or the like that were typed in), we can narrow this down to a small push vs. pull problem.

If we have the search engine pull all information in order to index it, it needs to apply all sorts of filters, and also run PHP code.
A push system, however, would force module designers to think about their indexing. They would, at some point, have to add a search-index hook or function, one that inserts the output into the index.
We do such a thing now, but I see no search hook in, for example, page.module. Nor do I see any filters doing indexing.

I know this has some other problems. One is that if you insert the output of evaluated code, it might change with every page view. But having at least some of the output in the index might improve things drastically.
If in addition we make the re-indexing a bit smart, so that it only refreshes parts of the index (a flag column in the DB stating dynamic or not), I think this might help.
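A rough sketch of the push model with a 'dynamic' flag, in Python for illustration. The in-memory `index` dict stands in for the DB table, and `push_to_index` / `refresh_dynamic` are hypothetical names, not existing Drupal functions.

```python
index = {}  # nid -> {"text": rendered output, "dynamic": bool}


def push_to_index(nid, rendered_output, dynamic=False):
    """Called by a module after rendering: pushes the final output into the index."""
    index[nid] = {"text": rendered_output, "dynamic": dynamic}


def refresh_dynamic(render):
    """Re-index only the entries flagged as dynamic; return the refreshed nids."""
    refreshed = []
    for nid, entry in index.items():
        if entry["dynamic"]:
            entry["text"] = render(nid)
            refreshed.append(nid)
    return refreshed


push_to_index(1, "static page")
push_to_index(2, "old output of a PHP page", dynamic=True)
refreshed = refresh_dynamic(lambda nid: "new output of a PHP page")
print(refreshed)
```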

So I would like to know: Should we move this indexing from a hybrid push-pull (as it is now) to a full push system? If so, we can do this after branching 4.5 and start with all core modules and filters.

As Steven said, the weighting of the results is very important too. As I said before, I have some code for that here; I just have not found the time to redo that code and put it in the search module.

Conclusion:
Indexing needs re-thinking to have all results
Weighting needs to be implemented to sort all those results

moshe weitzman’s picture

Status: Active » Closed (fixed)

search has changed since this issue