D6 search refactoring (backport to 5)

Steven - May 24, 2007 - 06:01
Project:Drupal
Version:5.x-dev
Component:search.module
Category:task
Priority:normal
Assigned:douggreen
Status:patch (code needs review)
Description

This patch implements the node link tracker changes in the search indexer, as discussed in #145560.

  • Instead of keeping track of linked words in search_index, we put this relation in a separate table search_node_links. Then, whenever a node is indexed, we fetch all the links pointing to it from the database, and add their caption in as normal body content.

    In order to ensure all links are indexed properly, we touch the changed timestamp of any node that is linked to, to force it to be reindexed. It will correctly deal with nodes that point to each other: it only forces a re-indexing if new links have been discovered or old ones have been removed. In extreme cases, this can (temporarily) cause the indexed percentage to decrease or stagnate. However, it will always still converge to 100% eventually.

    The end result is behaviour that is near-identical to what we had before. The main difference is that as a simplification, we no longer look at any inline markup inside links, since it is so rarely used. This virtually no effect on real life HTML content.

  • This change simplifies the search tables a lot (especially search_index) and allows us to use primary (unique) keys for more columns.
  • Since I changed the table structure anyway, I also took the liberty of removing some redundant keys. Any multi-column database index (a, b, c) automatically serves as an index for (a, b) and (a) too. So, with a primary key of (sid, type, word), having a separate (sid, type) index is not necessary. I verified this with EXPLAIN on MySQL, and checked both the MySQL and PgSQL docs (which confirm this).
  • I also realized the (word) index on search_index is sub-optimal, since the search query always searches by type, as well as keyword. So, the index (type, word) should be used instead.
  • Finally, as an improvement, I kept the link caption in the source text under a strong score penalty, rather than removing it entirely. This means the source page will still show up in queries for the link caption, but that the target node will show up higher and was a one line change I've long wanted to do.

If someone wants to benchmark this, feel free to, but it's rather obvious this will improve things: a lot of complexity is moved out of the search process and replaced with simple logic in the indexer. The tables are much smaller, and there are less and stricter indexes. This means that the first search pass will be more effective as well.

Because the database changes are so invasive, a re-indexing is needed.

Testers: the search settings page is broken due to the Form API 3 changes and the re-index button is broken. For now, you can manually go to admin/settings/search/wipe to wipe the index.

Note: search_update_1() calls search_install(). This unorthodox pattern is justified because we're wiping the search index completely. Should a future schema change occur, we wouldn't ever want to deal with the intermediate schema version.

AttachmentSize
search-node-links.patch9.25 KB

#1

moshe weitzman - May 24, 2007 - 06:07

subscribe

#2

douggreen - May 24, 2007 - 14:21

Thanks! Two very minor changes, $links = array() should be initialized and there's a lingering print_r.

Would you like to consider improvements here or just refactoring? For example, we're not catching path.module aliases.

#3

douggreen - May 24, 2007 - 16:32

Just testing out the decay function, and for anyone else who is curious: From the patch:

  // Focus is a decaying value in terms of the amount of unique words up to this point.
  // From 100 words and more, it decays, to e.g. 0.5 at 500 words and 0.3 at 1000 words.
  $focus = min(1, .01 + 3.5 / (2 + count($results[0]) * .015));

Is this the TF/IDF math from #145560?

To verify the comment, I wrote this snippet:

  for ($i = 1; $i < 1000; $i ++) {
    print "[$i]=". min(1, .01 + 3.5 / (2 + $i * .015)) ."<br>\n";
  }

and the abridged results are:

[1]=1
.. all values = 1 until...
[102]=1
[103]=0.99730606488011
[104]=0.99314606741573
... decays to ...
[500]=0.37842105263158
[501]=0.37784025223332
... and further to this ...
[998]=0.21624631703005
[999]=0.21606417427142

These results are a little off from the comment in the code. Is the level of decay important? Or is it just important that it's decaying?

#4

Steven - May 24, 2007 - 19:50

The decay portion has not been touched. It is still the same as before and not part of this patch.

#5

pwolanin - June 20, 2007 - 22:20
Status:patch (code needs review)» patch (code needs work)

needs to be redone for schema API, at the least

#6

hunmonk - June 21, 2007 - 17:24
Status:patch (code needs work)» patch (code needs review)
  1. updated the patch to current HEAD -- this only involved getting it in line w/ the schema API.
  2. fixed a variable declaration bug -- $links must be set to an array, or a foreach loop bombs later in the search index code.
  3. created a couple of nodes, 1 with a link, and examined the data in the search tables -- seems to be ok, although i don't really know anything about the scoring stuff.
  4. tested both a new install and D5 -> D6 upgrade on postgres -- both work perfectly.
AttachmentSize
search_links.patch7.87 KB

#7

moshe weitzman - June 26, 2007 - 01:27

i tried adding a link from node/3 to node/2 and then running cron but no matter what i am not seeing an entry into search_node_links(). here is the source of node 3

preamble <a href="node/2">two is great shoe</a>. appendix

any ideas?

#8

Steven - June 26, 2007 - 05:07

<a href="node/2"> is not a valid link to a node. Drupal menu paths != urls. You need to run them through url(), to get the base path prefixed and the non-clean url marker inserted.

#9

moshe weitzman - June 26, 2007 - 14:13

fair enough.

yet if a link works, people will use it. the vast majority of authors have no access to url(), so they fiddle with URLs until they find one that works. it would be nice if our link tracker recognized these "non standard" yet still working links.

#10

moshe weitzman - June 27, 2007 - 23:38
Status:patch (code needs review)» patch (reviewed & tested by the community)

i rested and this works as designed. the link discovery algorithm can be improved in a different patch. it is not part of this patch.

#11

Gábor Hojtsy - June 28, 2007 - 00:18

I'd name the update function as with other core modules, not starting from one, but from 6000. Otherwise as the reviewers indicated that the code runs fine, with this little fix, I think this is ready to go in.

#12

Dries - June 28, 2007 - 08:50

I'm trying to figure out whether this has any unwanted side-effects:

<?php
+function search_touch_node($nid) {
db_query("UPDATE {node} SET changed = %d WHERE nid = %d", time(), $nid);
+}
?>

Touching the node's changed date will cause Drupal to render 'updated' tags next to titles.

In general, I think the code could benefit from some additional documentation -- a paragraph of text that provides an algorithmic overview of the code would help me figure things out. Like, it would be useful to explain in 1 or 2 sentences why it is important to update/delete the captions of nodes.

#13

douggreen - August 3, 2007 - 02:33

Per the conversation on the [infrastructure] list, attached is a Drupal 5.x version of the patch. This needs to be heavily tested. I've only confirmed that this version, doesn't spew errors.

AttachmentSize
search-node-links-d5.patch8.99 KB

#14

Gerhard Killesreiter - August 6, 2007 - 00:13

I've applied the patch to scratch.d.o and I am re-indexing the site.

#15

Gerhard Killesreiter - August 6, 2007 - 00:16

Invalid argument supplied for foreach() in /var/www/scratch.drupal.org/htdocs/modules/search/search.module on line 639

which is this line:
foreach ($links as $nid) {

#16

Gerhard Killesreiter - August 6, 2007 - 00:23

Initializing the $links array should help.

#17

Gerhard Killesreiter - August 6, 2007 - 19:18

Re-indexing a site of the size of drupal.org really takes ages and puts a lot of stress on the server.

Would it be possible to find an upgrade path that not simply drops the tables?

#18

Narayan Newton - August 6, 2007 - 23:39

One issue with this, it removes the word index on search_index. This makes the preen of the search_total table really painful, see below:

mysql> explain SELECT t.word AS realword, i.word FROM search_total t LEFT JOIN search_index i ON t.word = i.word WHERE i.word IS NULL;
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows   | Extra            

  |  1 | SIMPLE      | t     | index | NULL          | PRIMARY   | 152     | NULL |  23193 | Using index      
                |
|  1 | SIMPLE      | i     | index | NULL          | type_word | 202     | NULL | 483309 | Using where;
                Using index; Not exists |

Where as before the join examined around 23 rows, this has expanded dramatically without the word index. Word is covered in type_word, but its not the left-most prefix and can't be used well for this. Adding a word index make this query look like this:

|  1 | SIMPLE      | t     | index | NULL          | PRIMARY | 152     | NULL           | 22028 | Using index                          |
|  1 | SIMPLE      | i     | ref   | word          | word    | 152     | scratch.t.word |     8 | Using where; Using index; Not exists |

#19

Dries - August 7, 2007 - 08:35
Status:patch (reviewed & tested by the community)» patch (code needs work)

I guess this means the code needs work.

#20

moshe weitzman - August 7, 2007 - 11:29

there is better way to update than to reindex. this patch makes indexing slightly less heavy, but yes it is unfortunate.

#21

Gerhard Killesreiter - August 15, 2007 - 00:02

I withdraw the comment about re-indexing. The tables were re-created as myISAM and the MySQL config didn't deal with this correctly. Indexing is much faster now.

scratch.drupal.org is now runnogn this patch, so please test it.

#22

moshe weitzman@... - August 15, 2007 - 01:21

oops. my comment was supposed to read "there is *no* better way to update than reindexing"

#23

BioALIEN - August 23, 2007 - 11:20

Subscribing.

#24

douggreen - August 28, 2007 - 12:22

The 5.x update doesn't properly reset the search variables. So while the search indexes are wiped clean, the indexing doesn't actually pick up unless you visit admin/settings/search and re-index. I'll check if the 6.x version has the same problem and submit a patch within the next couple days.

#25

douggreen - August 28, 2007 - 13:45

Two minor 5.x fixes:

  1. update needs to reset variables
  2. uninitialized array $links
AttachmentSize
146466-D5.patch9.09 KB

#26

douggreen - August 28, 2007 - 15:09

Without the index on search_index (word) the SUM() in search_update_totals take forever so this patch adds that index back.

AttachmentSize
146466-D5_0.patch9.04 KB

#27

douggreen - September 11, 2007 - 14:44

@Steven, is "refactor" the right word in this title? Does this patch maintain identical results? I didn't think that it did, but the title of the node has me confused.

#28

fgm - September 16, 2007 - 07:08

subscribe

#29

douggreen - September 22, 2007 - 06:17

I spoke to Steven yesterday at DrupalCon, and he reminded me that Dries had a pending unsolved question in this thread. The problem is that by re-setting the nodes changed time to indicate that re-indexing is needed, causes Drupal to render 'updated' tags next to titles.

Steven suggested that we add a queue to keep track of which nodes need indexing/re-indexing. Rather then create a new table, we could add a new field to the search_data table (call it reindex and make it a timestamp). I don't like changing the behavior of the table, because I think that the table should then be renamed, and I don't really like renaming tables. But I think that this would be the better implementation to creating an entirely new table used only for the reindexing queue.

#30

douggreen - September 22, 2007 - 06:30

The above suggestion (search_dataset.reindex field) feels like a separate patch. So that we can simplify the query for getting the list of nodes needing re-indexing, we'd would also need to explicitly set the reindex field whenever the node is changed.

#31

douggreen - September 23, 2007 - 15:50

I spoke to chx today and he suggested that we roll a "fix search" patch, which will include the above, plus the fix for changed that Dries asked for, plus the new queries that don't use db_temporary_table. I was originally waiting to do the new queries until THIS link patch landed. But based on chx's recommendation, I hope to submit a new patch later today.

I'm at the unofficial DrupalCon Barcelona Hackathon and I've mostly completed a search patch that includes this one and the fixes for changed. My new work completely removes the cron variable's by adding a reindex field to the search_dataset. This value is set on hook_nodeapi 'update' and hook_comment 'insert'. The query to determine if something needs to be indexed, then becomes:

SELECT COUNT(*) FROM {node} n LEFT OUTER JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0

I spoke to Dries at DrupalCon and he said that as this is a relatively important performance patch, that it will get looked at for 6.x.

More to come soon...

#32

douggreen - September 23, 2007 - 16:22

As promised above, here's the first 80% of the patch. What's left for me to do is rewrite do_search() to use the new tables, remove the db_query_temporary calls, and add use subquery for the exclude and phrase searches. I've mostly done this in VFS, wrote an article on it here and spoke about the changes at DrupalCon ... so hopefully, the actually writing of the code will be trivial.

AttachmentSize
146466.patch12.11 KB

#33

douggreen - September 23, 2007 - 16:27

To test this you can use the devel module to create nodes (apply this patch to create a lot of nodes) and the reindex module to quickly index this site.

#34

catch - September 23, 2007 - 17:00

Not to muddy the water, but this seems to be a similar approach (for different reasons) to: http://drupal.org/node/148849 - which adds an "activity" timestamp column in node to replace node_comment_statistics

Last patch there was at: http://drupal.org/node/148849#comment-283965

#35

douggreen - September 23, 2007 - 17:26

Sortof ... However, the bulk of this patch so far is the link refactoring. With the change, a new column is needed by search to keep track of which nodes need reindexing, and a by-product of this new column is that it drastically simplified the search indexing. Since all of this involves search, I don't think that this patch is really similar to the one you mentioned.

#36

catch - September 23, 2007 - 17:37

Yeah no problem. Just thought the new column could double up to be used for the denormalisation in Drupal 7, or even help the tracker query a bit now as a side effect. No point holding this up because of that though so feel free to ignore.

#37

douggreen - September 23, 2007 - 17:39

Minor update to fix the Re-Index button on admin/settings/search found by snufkin.

AttachmentSize
146466-1.patch12.16 KB

#38

douggreen - September 23, 2007 - 20:30

Attached is an update that has the refactored SQL that takes advantage of having a cleaner {search_index}.

It fixes a bug found by dopry in hook_comment that was not checking all of the appropriate comment operations. And if fixes another bug (also found by dopry) that was updating nid when it should of updated sid.

AttachmentSize
146466-2.patch24.18 KB

#39

douggreen - September 23, 2007 - 20:51

Still left to do:

  1. I really haven't tested the new SQL handling well
  2. Try to restore the word rank normalization (which was previously done in a separate query)

Note: This patch drops some syntax that was previously available. It still supports implied AND's, like "term1 term3", and it supports "OR", such as "term1 OR term2". But it no longer supports, mixing these operators. as in "term1 term2 OR term3". I actually don't think that this ever worked as expected, and IMHO dropping this functionality is not really a loss. The new SQL handling that I just added does need to cleaned up in search_parse_query, so that it won't try to process the previously supported mixing of operators.

The new code also needs to be heavily commented.

Anyone wanting to test should test:

  • That the change in the search indexing queue mechanism works
  • That the search syntax generates the right queries to return the right results set
  • That the link relevancy improves the scoring of frequently linked to nodes (node/$nid links only)
  • the scoring (sorting) still works -- remembering that we still need to try to normalize the word rank to be in the same range as the most-recently-changed and number-of-comments ranks

I think that we might need to add {search_index} unique key to guarantee unique values (the new SQL won't work if there are duplicates). I'd prefer to have the index but Steven was against it. We don't need the index if search can't run on the same node twice simultaneously. And since cron no longer can have overlapping runs, this shouldn't happen. The only problem with not having the unique index is that we will need a yet-to-be-implemented synchronization method to coordinate contrib modules (such as reindex) indexing and search.module indexing.

AttachmentSize
146466-2_0.patch25.29 KB

#40

douggreen - September 23, 2007 - 21:02

My apologies to anyone who tried the last patch... here's the correct one without my local sandbox's bootstrap changes.

AttachmentSize
146466-3.patch24.03 KB

#41

douggreen - September 23, 2007 - 21:13

One more TODO: The search_parse_query needs to return different values for the exclude and phrase queries so that the SQL can be built as either "i.sid IN (SELECT ..)" or "i.sid NOT IN (SELECT ...)" so that the IN clause has the fewest values possible.

#42

dopry - September 23, 2007 - 23:08

After a cursory review of the patch I like the re-index column in the db instead of trying the crazy joins and checking timestamps.

I don't think the loss of mixed search operators is a big loss.

#43

chx - September 24, 2007 - 07:49
Title:Refactor node link tracker in search.module» Remove temporary table usage from search module

This makes things so much more speedier: temp tables with a 'text' field can not be stored in memory by MySQL and temp tables are not cacheable.

#44

Steven - September 24, 2007 - 19:52

But it no longer supports, mixing these operators. as in "term1 term2 OR term3". I actually don't think that this ever worked as expected, and IMHO dropping this functionality is not really a loss.

Why do you say this never worked as expected? These types of mixed OR/ANDs were resolved previously by doing a join on the dataset table, just like we did for phrase searching. I thought the plan was to detect the situations where joins on the dataset are not needed (pure AND, pure OR, no phrases) and avoid the dataset join only in these cases as a performance optimization.

As far as I know, phrase searching and mixed OR/AND can be similarly implemented using search_dataset.

The updates should also be done in search.install, not system.install. What if I never turned on search.module?

Next, the following queries assume only the node system is using the indexer and blast away any other indexed types:
+ db_query("UPDATE {search_dataset} SET reindex = %d WHERE sid = %d", time(), $nid);
+ db_query('UPDATE {search_dataset} SET reindex = %d', time());

I also don't know where you got this:

I'd prefer to have the index but Steven was against it

With fromsid removed, adding a primary key on search_index.sid/type seems like a natural progression.

#45

chx - September 24, 2007 - 20:13

Steven, thank you for the review. I am taking the blow for the install file 'cos I told Doug that we decided to keep stuff in system.install there is an issue for moving updates to their install file, now postponed to 7.x and I do not want to clutter this issue with that. http://drupal.org/node/80526 Doug is adding a db_table_exists though. Surely the index was some miscommunication. About the mixed AND/OR implementation, if we need to choose between a huge performance blow and a not-too-much-used feature regression...

#46

douggreen - September 24, 2007 - 21:25
Category:task» bug report
Priority:normal» critical
Status:patch (code needs work)» patch (code needs review)

@Steven, thanks also from me. I've added:

  • AND type = 'node' to the two update statements. Is that the right change?
  • the unique index.

I've also added the following (with some help from chx):

  • a header comment to do_search
  • changed to standard LEFT JOIN SQL syntax instead LEFT OUTER JOIN
  • changed (i.word = '%s' OR i.word = '%s') to i.word IN ('%s', '%s')

And lastly, I've:

  • removed the subqueries and replaced them with an INNER JOIN
  • fixed the missing ORDER BY (was a simple variable typo)
  • refactored node_update_index so that _node_index_node can be called from other places.
  • ran the new code set through coder's style review

The only TODO on my list is to try and restore the word rank normalization.

I'm changing the status to code-needs-review, and also marking as critical and bug (per chx's request).

AttachmentSize
146466-4.patch26.79 KB

#47

douggreen - September 25, 2007 - 06:50

I think that we can improve the re-indexing so that the current search results aren't initially dropped. I propose that we ALTER the tables rather than DROP and CREATE them, then UPDATE {search_dataset} SET reindex=1. Each word/nid keyword relevancy will be slightly different once it gets re-indexed. But this way, we still have search results while the site re-indexes itself.

#48

Steven - September 25, 2007 - 10:26

About the mixed AND/OR implementation, if we need to choose between a huge performance blow and a not-too-much-used feature regression...

Sorry, but I just can't participate in these discussions anymore. In my previous post, I pointed out that we can have our cake and eat it too. Yet like so many times before, it ends up not parsing in other contributor's heads for whatever reason, and Drupal ends up doing less, and ends up doing things badly.

I am effectively talking to Search person #2 in this issue, and instead of looking at algorithmic improvements, I'm doing damage control.

#49

chx - September 25, 2007 - 20:34

Steven, could "whatever reason" be that you are smarter than any two of us, combined? Please do not stop helping us just elaborate a bit more. We know from Lord of the Rings that it tires wizards to explain their things to mere mortals, but please bear with us.

#50

douggreen - September 26, 2007 - 00:06

@Steven, I just got off the plane from Barcelona and am responding from the airport while awaiting my connection. I'm sad you feel this way.

I have a few concerns about the mixed AND's and OR's.

  1. I really don't know what syntax this supports... and if I don't know what the syntax is, I doubt many of our users do. Since it doesn't accept parenthesis, IMO, it doesn't work in a natural language way. For example "sheep dog OR miniature poodle" doesn't work. I know why it doesn't work, but as a user of the system, that would be the natural language expression.
  2. What do the major search engines do? Again, since I don't "get" the syntax, then I can't try it out
  3. If we support the mixed AND's and OR's, we'll be adding 20-30 lines of code back to core that I suspect are used by very few. I just feel that since we're trying to make the php footprint smaller that this is the wrong direction. I'd be happy to try it however, so we can really see how much it adds.

#51

douggreen - September 26, 2007 - 00:07

Attached patch adds support to not drop the search tables, so re-indexing can happen over time, while the search still "somewhat" works.

AttachmentSize
146466-5.patch28.13 KB

#52

chx - September 27, 2007 - 05:36

I am posting here Doug's email with my own additions about testing. A simpletest is forthcoming for these from me if I can pull it.

  1. node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc, add a link from one node to another and make sure that both nodes get reindexed, make sure that the node that is linked to gets higher relevancy on some words from the linking-from node (I think that's how this node linking works, but Steven wrote it)
  2. search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back
  3. relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word.

#53

douggreen - September 27, 2007 - 10:32

For 3 above, relevancy isn't just about word count, but also which tags the word appears in. So a tag inside an <H1> should have higher relevancy for tags inside <H2>. This actually shouldn't have changed from the previous implementation. I'm more concerned about testing relevancy of linking documents. So, if document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

#54

mfer - September 27, 2007 - 16:42

@douggreen - I think what you mean to write was

<H1> should have higher relevancy for tags inside <H2>. This actually shouldn't have changed from the previous implementation. I'm more concerned about testing relevancy of linking documents. So, if document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

#55

pillarsdotnet - September 28, 2007 - 06:38

Patch for D5 had errors under Postgres; re-rolled for 5.2.

AttachmentSize
146466-D5_1.patch8.43 KB

#56

Dries - September 28, 2007 - 08:30

Steven: Doug is helping and willing to listen. Please.

#57

chx - September 29, 2007 - 05:51

@pillarsdotnet please do not litter this issue with Drupal 5 backport, open a new issue with "backporting the new search patch to Drupal 5".

#58

douggreen - September 29, 2007 - 09:59

@chx, good point. I was the one who originally posted the 5.x backport, but I should point out that the 5.x backport is no longer current. We should wait for the 5.x backport issue until this one is completely resolved.

#59

moshe weitzman - October 3, 2007 - 01:08
Status:patch (code needs review)» patch (code needs work)

Patches apply easier if rolled from root of drupal. I use cvs diff instead of diff. Anyway, I got a hunk fail in search.module. Please reroll and I will look.

#60

douggreen - October 3, 2007 - 17:54
Status:patch (code needs work)» patch (code needs review)

Please be sure to use the one I uploaded for D6 (146466-5.patch, being the 6.x patch I've rolled), and not the one uploaded by pillarsdotnet for D5. The 6.x patch is rolled with CVS diff from the root, and seems to work for me. It's cut against the TDRUPAL-6-0-BETA-1 tag. Is this what doesn't work anymore?

#61

chx - October 4, 2007 - 09:42

Patch against HEAD.

AttachmentSize
146466-61.patch27.5 KB

#62

moshe weitzman - October 4, 2007 - 22:08
Status:patch (code needs review)» patch (code needs work)

I applied chx' patch but got an error with a similarly numbered update. So i moved that update to search.install as search_update_6001 (is OK?) and proceeded.

I then enabled search.module and saw an index status of "-9% of the site has been indexed. There are 23 items left to index." That could be a problem in my dev database, but merits some checking I think.

During indexing, I saw a query like this UPDATE variable SET value = 'd:0.200000000000000011102230246251565404236316680908203125;' WHERE name = 'node_cron_comments_scale'. Whats the point of that gigantic string?

I posted content with a link to another node and the link was nicely recorded in the search_node_links table upon indexing. But I can't yet grok how that link affects search result ranking. I pumped up keyword relevance and searched for the linked word but the linked to article did not bubble to the top as i expected. The search query is a bit complicated (no temp tables though - nice). here is that query. my keyword was persto.

SELECT i.type, i.sid, COUNT(*) AS matches, (8 * SUM(i.score * t.count) + 5 * POW(2, (GREATEST(n.created, n.changed, c.last_comment_timestamp) - 5) * 6.43e-8) + 1191535083 * (2.0 - 2.0 / (1.0 + c.comment_count * 0.2))) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN node_comment_statistics c ON c.nid = i.sid INNER JOIN users u ON n.uid = u.uid INNER JOIN search_dataset d ON i.sid = d.sid WHERE (n.status = 1)AND d.data LIKE '% persto %' AND i.word IN ('persto') AND i.type = 'node' GROUP BY i.type, i.sid ORDER BY score LIMIT 0, 10

the default order by is ASC and we want DESC no? seems like thats missing.

#63

douggreen - October 5, 2007 - 02:32
Status:patch (code needs work)» patch (code needs review)

Thanks moshe for looking at this!

I think that you identified a few problems:

  1. obviously, another install update has been added, so this one needs to be changed to 6034. It shouldn't be moved to search.install for reason's chx has mentioned before.
  2. I think I know what the problem is with the -9%, and I think that the solution is to add a unique key to {search_dataset}(type, sid)
    • The only way that this can happen is if node_search($op = 'status') returns more $remaining items than $total items. I think that the only way this can happen is if {search_dataset}(type, sid) has duplicates. If you check for these dups (SELECT type, sid FROM search_dataset GROUP BY type, sid HAVING COUNT(*) > 1), I think that you'll have two (type, sid) dups.
    • If we wiped the indexes on update, this problem wouldn't happen. The update will start a full index, but I'm trying not to wipe the index on update, so that search will continue to work while the site re-indexes itself.
  3. Yes, the sort order should be DESC.

I've re-rolled with three changes:

  1. Changed the update to 6034
  2. Added a unique index to {search_dataset}(type, sid). Note: the update might not work on pgsql for existing sites with duplicates. Does anyone know if postgres supports ALTER IGNORE TABLE? If not, we might just have to wipe the index on pgsql updates.
  3. Added the sort for DESC

And a little explanation:

  1. nothing has changed in regards to node_cron_comments_scale. This value is used to help normalize the ranking for comments. The admin/settings/search page allows you to set relative weights for the three ranking factors (keyword relevancy, node create recency, and number of comments), and thus it normalizes the rankings for all three of these. For example, you'd want the highest ranked word to contribute the same weight as the node with the most comments and as the most recently posted node. The very long float value you are seeing could easily be truncated at a certain number of digits, but you're basicallly looking at something divided by 1, with some float rounding. Again, nothing here is new.
  2. The link indexing works such that when you add a link in node 1 to node 2, that node 2 gets re-indexed with the link text from node 1 added to it. The desire is to improve the keyword relevancy of the node being linked to, in this case node 2. This only works if you search for the one of the words within the <a> (in node 1 that links to node 2).
  3. The search query isn't any more or less complicated than what we use now. HOWEVER THE IMPORTANT DISTINCTION IS THAT IT DOESN'T USE TEMPORARY TABLES.

While I'm marking this back to "patch (code needs review)", I'd like to say again that this patch is still missing keyword normalization and that this needs to be done before we RTBC. (See comment 1 in "a little explanation" for rank normalization.)

AttachmentSize
146466_0.patch28.57 KB

#64

douggreen - October 9, 2007 - 13:14

I'm no longer certain on my statement:

... this patch is still missing keyword normalization and that this needs to be done ...

What I'm referring to is this snippet from the existing code (search.module+854):

  // Calculate maximum relevance, to normalize it
  $normalize = db_result(db_query('SELECT MAX(relevance) FROM temp_search_sids'));
  if (!$normalize) {
    return array();
  }
  $select2 = str_replace('i.relevance', '('. (1.0 / $normalize) .' * i.relevance)', $select2);

However, I think that I was previously mistaken that this was the keyword normalization. I think that each of the rank normalization is done in node.module.

The only purpose that I can surmise for the above was to make the score values symmetric across search's (that is, a score of 1 is always the highest ranked result, and the search results degrade from there, however because of the two phase search's, a 1 was the highest possible value, and not actually the highest value -- I think). This is a nifty feature, but I can't figure out how to implement it without a second query and/or a temporary table.

#65

catch - October 9, 2007 - 14:24

This is a lot over my head, but would losing the normalisation actuall affect the order of results at all?

#66

douggreen - October 9, 2007 - 16:41

No, that is if my assumption about what we're normalizing here is correct. Another set of eyes on the code would help.

#67

Steven - October 10, 2007 - 00:26

The only purpose that I can surmise for the above ... [...] This is a nifty feature, but I can't figure out how to implement it without a second query and/or a temporary table.

Two things:

  • Normalization of the TF*IDF scores isn't just a "nifty feature", it is required for rank mixing to be meaningful.
  • I said so in very unambiguous terms in the issue that is linked to in the first line on this very page. It also proposes a solution for this problem.

Thanks for yet another kick in the nuts. It's very much appreciated.

#68

douggreen - October 10, 2007 - 04:07
Version:6.x-dev» 7.x-dev
Priority:critical» normal
Status:patch (code needs review)» patch (code needs work)

I give up :(

#69

catch - October 10, 2007 - 10:19

That normalization stuff from the other issue starts here: http://drupal.org/node/145560#comment-247397

#70

douggreen - October 14, 2007 - 20:23
Version:7.x-dev» 6.x-dev
Priority:normal» critical
Status:patch (code needs work)» patch (code needs review)

Per a request from chx and dries, here's one last 6.x attempt... I admit that the Barcelona patch was striff with problems. This is much cleaner and (I hope) truer to the original search module. In this patch, I:

  1. started over with CVS HEAD.
  2. added link tracking from #145560.
  3. added system.install changes from the Barcelona patch.
  4. moved the search.schema changes from the Barcelona patch to search.install.
  5. changed the node_cron_last variable logic to use the search_data.reindex method from the Barcelona patch
  6. kept search_parse_query() mostly unchanged from 5.x, adding two additional returns.
  7. reworked do_search() without temporary tables, maintaining the keyword normalization, and conditionally joins {search_dataset} when needed.
  8. reworked the comment in do_search() to reflect the new algorithm.

I tried to keep as much of the existing code as possible, making as few changes as possible. I did add an additional warning message (lower case "or") and I re-factored one function (node_index_node).

I feel like something is missing, but I can't put my finger on it. Everything seems to be working. I tested:

  • A fresh 6.x install with this patch, creating 100 nodes with devel, indexing, and searching
  • A HEAD 6.x install, creating 100 nodes with devel, indexing, install this patch, update.php, search, reindex, search again
  • The four types of searches, and some combinations thereof:
    • alpha beta
    • alpha OR beta
    • "alpha beta"
    • alpha -beta
    • "alpha beta" OR delta

I haven't run any performance tests. I'm concerned with the complexity of the COUNT and normalization queries. I wonder if any of the SQL would be faster if we rounded the floating point numbers to a decimal place or two.

The COUNT query is more complicated than it needs to be because pager_query() passes the same $args to the $query and $count_query, and thus the count query needs to keep the additional columns from $select2.

AttachmentSize
146466_1.patch25.77 KB

#71

moshe weitzman - October 15, 2007 - 00:09

I have no idea if it makes a difference, but you can run your own count query to get the actual count and then pass a count query like SELECT 345 or whatever the count result is.

#72

Bevan - October 15, 2007 - 02:21

subscribing

#73

douggreen - October 15, 2007 - 09:25

Sorry about the formatting problems :( I really should hit "PREVIEW".

In comment #70 above, should read

2. added link tracking from this node's original patch.

@moshe, Thanks for the COUNT suggestion. I'm aware of that technique. That's actually how search.module used to do it. We used to have more readily accessible because the results were sitting in a temporary table. I'm not sure that I could do it here, though. My comment that the COUNT query is more complicated than it needs to be is that I think the SQL could probably be simplified to not join as many tables, since all were doing is counting. However, I don't know how smart the database engine is in caching this against the actual count, and we shouldn't simplify it, if indeed it is cached.

#74

chx - October 23, 2007 - 06:22

How to test:

  • check whether node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc,
  • search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back
  • relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word. Also try putting that word inside H1 tags -- should have higher relevancy than the same word appearing inside H2 and higher than just appearing in the document.
  • add a link from one node to another and make sure that both nodes get reindexed. If document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

#75

catch - October 23, 2007 - 08:50

Patch applied fine.

* check whether node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc

All this fine - indexing, reindexing after edits, everything - until I deleted a comment then:

Fatal error: Cannot use object of type stdClass as array in /home/libcomwww/public_html/drupal6/modules/search/search.module on line 646

- the comment actually deletes - got drupal_set_message and the comment was gone when I went manually back to /node/2 - but it's not removed from the search index after running cron again. Forcing a reindex takes the comment out of the index.

To help any prospective re-rollers should doug not be around, line 646 is the search_touch_node bit of:

/**
* Implementation of hook_comment().
*/
function search_comment($a1, $op) {
  switch ($op) {
    // Reindex the node when comments are added or changed
    case 'insert':
    case 'update':
    case 'delete':
    case 'publish':
    case 'unpublish':
      search_touch_node($a1['nid']);
      break;
  }
}

Continuing:

* search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back.

All of these work great, everything as expected.

* relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word. Also try putting that word inside H1 tags -- should have higher relevancy than the same word appearing inside H2 and higher than just appearing in the document.

Did exactly this and it's good again!

* add a link from one node to another and make sure that both nodes get reindexed. If document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

That works fine as well.

So everything on chx's list works (and with my two nodes the search is nice and snappy) - except deleting a comment throws that fatal error.

Setting to code needs work but that shouldn't stop others from reviewing the rest of the functionality until it gets re-rolled, as far as I'm concerned all of that is good to go. I did some very rough tests on recency and number of comments and these seem fine too.

#76

catch - October 23, 2007 - 08:52
Status:patch (code needs review)» patch (code needs work)

ack!

#77

catch - October 23, 2007 - 09:02
Status:patch (code needs work)» patch (code needs review)

OK this one should downgrade that fatal error to a notice on submit instead of a fatal error on delete.

Trying to get property of non-object - line 646

But I can confirm that deleting comments now takes nodes out of the index. Marking back to review since a notice really shouldn't hold this up getting some more testing.

AttachmentSize
146466_1_0.patch25.77 KB

#78

snufkin - October 23, 2007 - 21:15

the last patch results indeed in that notice on comment change. it removes the comment from index when it is being removed, however i noticed two things in the patch that look suspicious:

Extra space between ++ operator and variable on line 246, 283, 460. According to php.net manuals this should be without the separating space ($var++ instead of $var ++). I did not find relevant entry in the handbook.

The return of searc_nodeapi is return '<a>('. implode(', ', $output) .')</a>'; on line 355, this looks suspicious.

loading the cron.php nicely updates the index, except that when i started it it said 100 items to index, after start it jumped to 101, and 99% of the items got indexed running cron.php.

Relevancy: repeating keyword superseeds those nodes where keyword is occurring fewer, so this is ok. Linking put the linked node to the first hit, so i think this is ok too.

The search doesn't seem to understand AND, however it understands OR, so if it should, then this is a bug. excluding works, phrase search gives back the phrase too, so its ok except for AND.

Editing nodes wont add the changes to the index, but running cron.php after the change will add the modified content to the index.

#79

catch - October 24, 2007 - 07:12
Status:patch (code needs review)» patch (code needs work)

Just to clarify with AND searches.

Searching with [term term] which defaults to AND works fine. [term AND term] - that isn't working. So CNW.

#80

robertDouglass - October 24, 2007 - 10:28

subscribe

#81

douggreen - October 24, 2007 - 12:43
Status:patch (code needs work)» patch (code needs review)

@catch, Thanks for all the review. A few things:

  • Looks like I need to reroll for a problem in #75 - #77... what was the solution to this?
  • What do you think is wrong with the spacing on the $var ++? I think that the spacing is part of our coding standards.
  • I don't think that there's anything wrong with return '<a>('. implode(', ', $output) .')</a>', but please look at it closer. The reason an href isn't included here is that it isn't needed. The <a> is returned to search.module itself, so that it will find the anchor and process it and increase the relevancy. No actual HTML is rendered and search.module doesn't need the href.
  • [term1 AND term2] never worked and isn't part of the spec. AND is implied. Try it on a 5.x site. Switching back to CNR because of this.

#82

catch - October 24, 2007 - 13:21

75 - 77

hook_comment I changed:

search_touch_node($a1['nid']); (which produces a fatal error on comment delete) to
search_touch_node($a1->nid); (notice on comment submit).

It may well be the wrong approach since my PHP is ultra-basic, but it "works".

I see [term AND term] actually searches for [term and term] on drupal.org - so we were wrong and that's not a regression at all!

Can't comment on code style issues but if they're not really issues then it's just that hook_comment error that needs fixing as far as I can see.

#83

douggreen - October 24, 2007 - 15:44

If [term1 AND term2] searches on d.o match [term1 and term2], what I think is happening is that "and" is a three letter word that is being search for and you are finding documents that match all three words "term1" "and" and "term2".

#84

catch - October 24, 2007 - 15:57

re #83 - yeah that's exactly what I meant, sorry it wasn't clearer.

I think it's a valid issue but it has nothing to do with this patch, so have split it into a new one: http://drupal.org/node/186242

#85

catch - October 25, 2007 - 10:28
Status:patch (code needs review)» patch (code needs work)

So #40934 went in but this still applies with offset.

Only the notice to fix as far as I can see, notwithstanding snufkin's code style comments, so marking as needs work for that/those.

#86

chx - October 26, 2007 - 07:47
Status:patch (code needs work)» patch (reviewed & tested by the community)

So... what we have here. Reviews that say the code works. Yay. And then some code style issues with variables and ++. I do not know whether we have a rule for variables and ++ but I feel like these "stick" to the variable unlike + or - . Core seems to follow this notation. And then a notice. Gone. Added some missing doxygen (yes the function was there earlier, but why no doc it while we are here). No functionality was changed. So if it works and adheres to standards and has comments then it is ready, isn't it?

By the way, I feel like I understand now what search module does. I do not think I would want to debug/patch it but I still :) Very nice job simplifying it.

AttachmentSize
search_146466-86.patch25.46 KB

#87

catch - November 1, 2007 - 13:11

Patch still applies with 2-15 lines offset.

#88

catch - November 4, 2007 - 18:05
Status:patch (reviewed & tested by the community)» patch (code needs work)

Not any more - system_update_6035 now that system index got in.

#89

chx - November 4, 2007 - 21:43
Status:patch (code needs work)» patch (reviewed & tested by the community)

Bother! said Pooh and rerolled the patch.

AttachmentSize
search_146466-89.patch25.81 KB

#90

Arancaytar - November 5, 2007 - 21:38
Status:patch (reviewed & tested by the community)» patch (code needs work)

Hunk #4 in node.module fails. This looks like a linebreak problem:

@@ -2620,4 +2607,4 @@ function _node_mass_update_batch_finishe
     $message .= theme('item_list', $results);
     drupal_set_message($message);
   }
-}
\ No newline at end of file
+}

#91

Arancaytar - November 5, 2007 - 21:58
Status:patch (code needs work)» patch (reviewed & tested by the community)

I have rerolled the patch from the changed files, effectively removing hunk #4 from the patch. As this hunk, from the above quote and the .rej file, doesn't seem to do anything, the patch should now work fine.

If this is a problem specific to my version of patch/diff, sorry for the trouble.

(Setting back to RTBC as the patch is essentially identical to the previous one.)

AttachmentSize
search_146466-91.patch26.33 KB

#92

catch - November 6, 2007 - 16:31

Just to confirm the RTBC - tested 91 and it throws no errors, still works etc. #89 that 4th hunk in node module failed but the patch worked without it.

#93

Liam McDermott - November 6, 2007 - 17:18

Tested #91 myself on a CVS Drupal and it seems RTBC. No errors, used devel to generate lots of test data, ran a cron, then did a search. Search results seemed relevant and to be ordered in a logical way.

Yay! Said Piglet.

#94

Gábor Hojtsy - November 7, 2007 - 10:29

This is a huge database and API change, so I am handing off decision to Dries on this issue.

#95

douggreen - November 7, 2007 - 12:05

Dries and I spoke in Barcelona and he indicated that because of the performance impact of this patch, he was still open to it. And we emailed about the time of #68, and he encouraged me to continue trying. It should be Dries's decision.

I think that it's mostly a huge database change and not a huge API change.

There are a few new functions, and a few new optional arguments tagged onto the end of existing functions, but I think that all existing function calls should work as-is. For this reason, I think that the API should be "backwards" compatible, something that we rarely strive for in Drupal. I'm not guaranteeing this (and it would be good if someone could double-check this statement) but that's what I recall.

#96

Dries - November 10, 2007 - 09:00

I'm OK with this change. I promised Doug long ago that it could make its way into Drupal 6, and I want to keep my word.

However, it would be great if someone could setup two Drupal sites, one indexed with Drupal 6 and one indexed with Drupal 6 + patch. Then, do some simple and complex searches and compare the results side by side.

Should searches return the same result sets in all cases?

#97

catch - November 10, 2007 - 12:38

OK here goes. Excuse the verbosity but want to make sure I don't have to do this again!

I did a fresh drupal install. Used devel module to generate 110 nodes + some taxonomy terms.

Some nodes have no comments
Some have comments
Some have comments + taxonomy terms.

I then used mysqlhotcopy to clone the db, changed the settings.php in another subdomain with D6 installed. Patched the cloned site (applies with some offset now but all fine). Ran system update 6036 although no queries performed.

Install 1 is 'd6'. Install 2 is 'd6-patched' from now on.

Enabled search on both installs. hit cron.php a few times (not that it matters, but indexing feedback was the same on each, percentages, numbers left etc.)

Left weighting on 5 5 5 for now.

OK now for some searches. Screenshots attached

1. Simple search for {vero} - identical first page

2. {vero OR singularis} - first eight results the same, then some variance - then last page results the same.

3. {vero -singularis} - identical results

OK this next set are with relevance set to 10, the other two weights to 0.

4. {nibh} again identical - even down to page 6

5. {gemino nibh} - looks identical again.

6. {"Gemino proprius"} - identical again.

So to me, it looks like just about every result is the same. Given this is devel generated content, timestamps and comment numbers have very little variance, the slight differences may well be down to the homegeneity of the content rather than any big change in functionality - nodes getting the same scores even.

I have to go out, and this is already a lot of screenshots. Will leave those sites as they are in case there's more testing needs to be done (didn't do any very, very complex searches) - probably late tonight or tomorrow though.

Anyway looks good to me, and the patched site was noticeably more snappy.

AttachmentSize
6-d6-patched.png62.73 KB
6-d6.png62.19 KB
5-d6-pg7.png68.98 KB
5-d6-pg1.png69.97 KB
5-d6-patched-pg7.png69.49 KB
5-d6-patched-pg1.png70.46 KB
4-d6-pg6-patched.p