D6 search refactoring (backport to 5) [#146466]

Comment	File	Size	Author
#159	146466.patch	25 KB	douggreen
#142	146466-2.patch	25 KB	douggreen
#134	146466-1.patch	25.01 KB	douggreen
#124	146466-5x.patch	21.48 KB	douggreen
#106	146466.patch	26.13 KB	douggreen
#99	search-link-hello-world.png	39.4 KB	catch
#97	1-d6.png	68.84 KB	catch
#97	1-patched.png	70.4 KB	catch
#97	2-d6-firstpage.png	70.34 KB	catch
#97	2-d6-patched-firstpage.png	70.06 KB	catch
#97	3-d6.png	41.12 KB	catch
#97	3-d6-patched.png	41.65 KB	catch
#97	4-d6patched-pg1.png	71.28 KB	catch
#97	4-d6-pg1.png	70.78 KB	catch
#97	4-d6-pg6.png	68.47 KB	catch
#97	4-d6-pg6-patched.png	68.97 KB	catch
#97	5-d6-patched-pg1.png	70.46 KB	catch
#97	5-d6-patched-pg7.png	69.49 KB	catch
#97	5-d6-pg1.png	69.97 KB	catch
#97	5-d6-pg7.png	68.98 KB	catch
#97	6-d6.png	62.19 KB	catch
#97	6-d6-patched.png	62.73 KB	catch
#91	search_146466-91.patch	26.33 KB	cburschka
#89	search_146466-89.patch	25.81 KB	chx
#86	search_146466-86.patch	25.46 KB	chx
#77	146466_1_0.patch	25.77 KB	catch
#70	146466_1.patch	25.77 KB	douggreen
#63	146466_0.patch	28.57 KB	douggreen
#61	146466-61.patch	27.5 KB	chx
#55	146466-D5_1.patch	8.43 KB	pillarsdotnet
#51	146466-5.patch	28.13 KB	douggreen
#46	146466-4.patch	26.79 KB	douggreen
#40	146466-3.patch	24.03 KB	douggreen
#39	146466-2_0.patch	25.29 KB	douggreen
#38	146466-2.patch	24.18 KB	douggreen
#37	146466-1.patch	12.16 KB	douggreen
#32	146466.patch	12.11 KB	douggreen
#26	146466-D5_0.patch	9.04 KB	douggreen
#25	146466-D5.patch	9.09 KB	douggreen
#13	search-node-links-d5.patch	8.99 KB	douggreen
#6	search_links.patch	7.87 KB	hunmonk
	search-node-links.patch	9.25 KB	Steven

Comment #1

moshe weitzman commented 24 May 2007 at 06:07

subscribe

Log in or register to post comments

Comment #2

douggreen commented 24 May 2007 at 14:21

Thanks! Two very minor changes, $links = array() should be initialized and there's a lingering print_r.

Would you like to consider improvements here or just refactoring? For example, we're not catching path.module aliases.

Log in or register to post comments

Comment #3

douggreen commented 24 May 2007 at 16:32

Just testing out the decay function, and for anyone else who is curious: From the patch:

  // Focus is a decaying value in terms of the amount of unique words up to this point.
  // From 100 words and more, it decays, to e.g. 0.5 at 500 words and 0.3 at 1000 words.
  $focus = min(1, .01 + 3.5 / (2 + count($results[0]) * .015));

Is this the TF/IDF math from #145560?

To verify the comment, I wrote this snippet:

  for ($i = 1; $i < 1000; $i ++) {
    print "[$i]=". min(1, .01 + 3.5 / (2 + $i * .015)) ."<br>\n";
  }

and the abridged results are:

[1]=1
.. all values = 1 until...
[102]=1
[103]=0.99730606488011
[104]=0.99314606741573
... decays to ...
[500]=0.37842105263158
[501]=0.37784025223332
... and further to this ...
[998]=0.21624631703005
[999]=0.21606417427142

These results are a little off from the comment in the code. Is the level of decay important? Or is it just important that it's decaying?

Log in or register to post comments

Comment #4

Steven commented 24 May 2007 at 19:50

The decay portion has not been touched. It is still the same as before and not part of this patch.

Log in or register to post comments

Comment #5

pwolanin commented 20 June 2007 at 22:20

Status:

Needs review

» Needs work

needs to be redone for schema API, at the least

Log in or register to post comments

Comment #6

hunmonk commented 21 June 2007 at 17:24

Status:

Needs work

» Needs review

Status	File	Size
new	search_links.patch	7.87 KB

updated the patch to current HEAD -- this only involved getting it in line w/ the schema API.
fixed a variable declaration bug -- $links must be set to an array, or a foreach loop bombs later in the search index code.
created a couple of nodes, 1 with a link, and examined the data in the search tables -- seems to be ok, although i don't really know anything about the scoring stuff.
tested both a new install and D5 -> D6 upgrade on postgres -- both work perfectly.

Log in or register to post comments

Comment #7

moshe weitzman commented 26 June 2007 at 01:27

i tried adding a link from node/3 to node/2 and then running cron but no matter what i am not seeing an entry into search_node_links(). here is the source of node 3


preamble <a href="node/2">two is great shoe</a>. appendix

any ideas?

Log in or register to post comments

Comment #8

Steven commented 26 June 2007 at 05:07

<a href="node/2"> is not a valid link to a node. Drupal menu paths != urls. You need to run them through url(), to get the base path prefixed and the non-clean url marker inserted.

Log in or register to post comments

Comment #9

moshe weitzman commented 26 June 2007 at 14:13

fair enough.

yet if a link works, people will use it. the vast majority of authors have no access to url(), so they fiddle with URLs until they find one that works. it would be nice if our link tracker recognized these "non standard" yet still working links.

Log in or register to post comments

Comment #10

moshe weitzman commented 27 June 2007 at 23:38

Status:

Needs review

» Reviewed & tested by the community

i rested and this works as designed. the link discovery algorithm can be improved in a different patch. it is not part of this patch.

Log in or register to post comments

Comment #11

gábor hojtsy

he/him

Hungarian

Hungary

commented 28 June 2007 at 00:18

I'd name the update function as with other core modules, not starting from one, but from 6000. Otherwise as the reviewers indicated that the code runs fine, with this little fix, I think this is ready to go in.

Log in or register to post comments

Comment #12

dries commented 28 June 2007 at 08:50

I'm trying to figure out whether this has any unwanted side-effects:

+function search_touch_node($nid) {
+  db_query("UPDATE {node} SET changed = %d WHERE nid = %d", time(), $nid);
+}

Touching the node's changed date will cause Drupal to render 'updated' tags next to titles.

In general, I think the code could benefit from some additional documentation -- a paragraph of text that provides an algorithmic overview of the code would help me figure things out. Like, it would be useful to explain in 1 or 2 sentences why it is important to update/delete the captions of nodes.

Log in or register to post comments

Comment #13

douggreen commented 3 August 2007 at 02:33

Status	File	Size
new	search-node-links-d5.patch	8.99 KB

Per the conversation on the [infrastructure] list, attached is a Drupal 5.x version of the patch. This needs to be heavily tested. I've only confirmed that this version, doesn't spew errors.

Log in or register to post comments

Comment #14

gerhard killesreiter commented 6 August 2007 at 00:13

I've applied the patch to scratch.d.o and I am re-indexing the site.

Log in or register to post comments

Comment #15

gerhard killesreiter commented 6 August 2007 at 00:16

Invalid argument supplied for foreach() in /var/www/scratch.drupal.org/htdocs/modules/search/search.module on line 639

which is this line:
foreach ($links as $nid) {

Log in or register to post comments

Comment #16

gerhard killesreiter commented 6 August 2007 at 00:23

Initializing the $links array should help.

Log in or register to post comments

Comment #17

gerhard killesreiter commented 6 August 2007 at 19:18

Re-indexing a site of the size of drupal.org really takes ages and puts a lot of stress on the server.

Would it be possible to find an upgrade path that not simply drops the tables?

Log in or register to post comments

Comment #18

nnewton commented 6 August 2007 at 23:39

One issue with this, it removes the word index on search_index. This makes the preen of the search_total table really painful, see below:

mysql> explain SELECT t.word AS realword, i.word FROM search_total t LEFT JOIN search_index i ON t.word = i.word WHERE i.word IS NULL;
 | id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows   | Extra             

  |  1 | SIMPLE      | t     | index | NULL          | PRIMARY   | 152     | NULL |  23193 | Using index       
                | 
 |  1 | SIMPLE      | i     | index | NULL          | type_word | 202     | NULL | 483309 | Using where; 
                Using index; Not exists |

Where as before the join examined around 23 rows, this has expanded dramatically without the word index. Word is covered in type_word, but its not the left-most prefix and can't be used well for this. Adding a word index make this query look like this:

|  1 | SIMPLE      | t     | index | NULL          | PRIMARY | 152     | NULL           | 22028 | Using index                          | 
|  1 | SIMPLE      | i     | ref   | word          | word    | 152     | scratch.t.word |     8 | Using where; Using index; Not exists |

Log in or register to post comments

Comment #19

dries commented 7 August 2007 at 08:35

Status:

Reviewed & tested by the community

» Needs work

I guess this means the code needs work.

Log in or register to post comments

Comment #20

moshe weitzman commented 7 August 2007 at 11:29

there is better way to update than to reindex. this patch makes indexing slightly less heavy, but yes it is unfortunate.

Log in or register to post comments

Comment #21

gerhard killesreiter commented 15 August 2007 at 00:02

I withdraw the comment about re-indexing. The tables were re-created as myISAM and the MySQL config didn't deal with this correctly. Indexing is much faster now.

scratch.drupal.org is now runnogn this patch, so please test it.

Log in or register to post comments

Comment #22

moshe weitzman@drupal.org commented 15 August 2007 at 01:21

oops. my comment was supposed to read "there is *no* better way to update than reindexing"

Log in or register to post comments

Comment #23

BioALIEN commented 23 August 2007 at 11:20

Subscribing.

Log in or register to post comments

Comment #24

douggreen commented 28 August 2007 at 12:22

The 5.x update doesn't properly reset the search variables. So while the search indexes are wiped clean, the indexing doesn't actually pick up unless you visit admin/settings/search and re-index. I'll check if the 6.x version has the same problem and submit a patch within the next couple days.

Log in or register to post comments

Comment #25

douggreen commented 28 August 2007 at 13:45

Status	File	Size
new	146466-D5.patch	9.09 KB

Two minor 5.x fixes:

update needs to reset variables
uninitialized array $links

Log in or register to post comments

Comment #26

douggreen commented 28 August 2007 at 15:09

Status	File	Size
new	146466-D5_0.patch	9.04 KB

Without the index on search_index (word) the SUM() in search_update_totals take forever so this patch adds that index back.

Log in or register to post comments

Comment #27

douggreen commented 11 September 2007 at 14:44

@Steven, is "refactor" the right word in this title? Does this patch maintain identical results? I didn't think that it did, but the title of the node has me confused.

Log in or register to post comments

Comment #28

fgm

French

Paris, France

commented 16 September 2007 at 07:08

subscribe

Log in or register to post comments

Comment #29

douggreen commented 22 September 2007 at 06:17

I spoke to Steven yesterday at DrupalCon, and he reminded me that Dries had a pending unsolved question in this thread. The problem is that by re-setting the nodes changed time to indicate that re-indexing is needed, causes Drupal to render 'updated' tags next to titles.

Steven suggested that we add a queue to keep track of which nodes need indexing/re-indexing. Rather then create a new table, we could add a new field to the search_data table (call it reindex and make it a timestamp). I don't like changing the behavior of the table, because I think that the table should then be renamed, and I don't really like renaming tables. But I think that this would be the better implementation to creating an entirely new table used only for the reindexing queue.

Log in or register to post comments

Comment #30

douggreen commented 22 September 2007 at 06:30

The above suggestion (search_dataset.reindex field) feels like a separate patch. So that we can simplify the query for getting the list of nodes needing re-indexing, we'd would also need to explicitly set the reindex field whenever the node is changed.

Log in or register to post comments

Comment #31

douggreen commented 23 September 2007 at 15:50

I spoke to chx today and he suggested that we roll a "fix search" patch, which will include the above, plus the fix for changed that Dries asked for, plus the new queries that don't use db_temporary_table. I was originally waiting to do the new queries until THIS link patch landed. But based on chx's recommendation, I hope to submit a new patch later today.

I'm at the unofficial DrupalCon Barcelona Hackathon and I've mostly completed a search patch that includes this one and the fixes for changed. My new work completely removes the cron variable's by adding a reindex field to the search_dataset. This value is set on hook_nodeapi 'update' and hook_comment 'insert'. The query to determine if something needs to be indexed, then becomes:

SELECT COUNT(*) FROM {node} n LEFT OUTER JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex <> 0

I spoke to Dries at DrupalCon and he said that as this is a relatively important performance patch, that it will get looked at for 6.x.

More to come soon...

Log in or register to post comments

Comment #32

douggreen commented 23 September 2007 at 16:22

Status	File	Size
new	146466.patch	12.11 KB

As promised above, here's the first 80% of the patch. What's left for me to do is rewrite do_search() to use the new tables, remove the db_query_temporary calls, and add use subquery for the exclude and phrase searches. I've mostly done this in VFS, wrote an article on it here and spoke about the changes at DrupalCon ... so hopefully, the actually writing of the code will be trivial.

Log in or register to post comments

Comment #33

douggreen commented 23 September 2007 at 16:27

To test this you can use the devel module to create nodes (apply this patch to create a lot of nodes) and the reindex module to quickly index this site.

Log in or register to post comments

Comment #34

catch

he/him

English

commented 23 September 2007 at 17:00

Not to muddy the water, but this seems to be a similar approach (for different reasons) to: http://drupal.org/node/148849 - which adds an "activity" timestamp column in node to replace node_comment_statistics

Last patch there was at: http://drupal.org/node/148849#comment-283965

Log in or register to post comments

Comment #35

douggreen commented 23 September 2007 at 17:26

Sortof ... However, the bulk of this patch so far is the link refactoring. With the change, a new column is needed by search to keep track of which nodes need reindexing, and a by-product of this new column is that it drastically simplified the search indexing. Since all of this involves search, I don't think that this patch is really similar to the one you mentioned.

Log in or register to post comments

Comment #36

catch

he/him

English

commented 23 September 2007 at 17:37

Yeah no problem. Just thought the new column could double up to be used for the denormalisation in Drupal 7, or even help the tracker query a bit now as a side effect. No point holding this up because of that though so feel free to ignore.

Log in or register to post comments

Comment #37

douggreen commented 23 September 2007 at 17:39

Status	File	Size
new	146466-1.patch	12.16 KB

Minor update to fix the Re-Index button on admin/settings/search found by snufkin.

Log in or register to post comments

Comment #38

douggreen commented 23 September 2007 at 20:30

Status	File	Size
new	146466-2.patch	24.18 KB

Attached is an update that has the refactored SQL that takes advantage of having a cleaner {search_index}.

It fixes a bug found by dopry in hook_comment that was not checking all of the appropriate comment operations. And if fixes another bug (also found by dopry) that was updating nid when it should of updated sid.

Log in or register to post comments

Comment #39

douggreen commented 23 September 2007 at 20:51

Status	File	Size
new	146466-2_0.patch	25.29 KB

Still left to do:

I really haven't tested the new SQL handling well
Try to restore the word rank normalization (which was previously done in a separate query)

Note: This patch drops some syntax that was previously available. It still supports implied AND's, like "term1 term3", and it supports "OR", such as "term1 OR term2". But it no longer supports, mixing these operators. as in "term1 term2 OR term3". I actually don't think that this ever worked as expected, and IMHO dropping this functionality is not really a loss. The new SQL handling that I just added does need to cleaned up in search_parse_query, so that it won't try to process the previously supported mixing of operators.

The new code also needs to be heavily commented.

Anyone wanting to test should test:

That the change in the search indexing queue mechanism works
That the search syntax generates the right queries to return the right results set
That the link relevancy improves the scoring of frequently linked to nodes (node/$nid links only)
the scoring (sorting) still works -- remembering that we still need to try to normalize the word rank to be in the same range as the most-recently-changed and number-of-comments ranks

I think that we might need to add {search_index} unique key to guarantee unique values (the new SQL won't work if there are duplicates). I'd prefer to have the index but Steven was against it. We don't need the index if search can't run on the same node twice simultaneously. And since cron no longer can have overlapping runs, this shouldn't happen. The only problem with not having the unique index is that we will need a yet-to-be-implemented synchronization method to coordinate contrib modules (such as reindex) indexing and search.module indexing.

Log in or register to post comments

Comment #40

douggreen commented 23 September 2007 at 21:02

Status	File	Size
new	146466-3.patch	24.03 KB

My apologies to anyone who tried the last patch... here's the correct one without my local sandbox's bootstrap changes.

Log in or register to post comments

Comment #41

douggreen commented 23 September 2007 at 21:13

One more TODO: The search_parse_query needs to return different values for the exclude and phrase queries so that the SQL can be built as either "i.sid IN (SELECT ..)" or "i.sid NOT IN (SELECT ...)" so that the IN clause has the fewest values possible.

Log in or register to post comments

Comment #42

dopry commented 23 September 2007 at 23:08

After a cursory review of the patch I like the re-index column in the db instead of trying the crazy joins and checking timestamps.

I don't think the loss of mixed search operators is a big loss.

Log in or register to post comments

Comment #43

chx commented 24 September 2007 at 07:49

Title:

Refactor node link tracker in search.module

» Remove temporary table usage from search module

This makes things so much more speedier: temp tables with a 'text' field can not be stored in memory by MySQL and temp tables are not cacheable.

Log in or register to post comments

Comment #44

Steven commented 24 September 2007 at 19:52

But it no longer supports, mixing these operators. as in "term1 term2 OR term3". I actually don't think that this ever worked as expected, and IMHO dropping this functionality is not really a loss.

Why do you say this never worked as expected? These types of mixed OR/ANDs were resolved previously by doing a join on the dataset table, just like we did for phrase searching. I thought the plan was to detect the situations where joins on the dataset are not needed (pure AND, pure OR, no phrases) and avoid the dataset join only in these cases as a performance optimization.

As far as I know, phrase searching and mixed OR/AND can be similarly implemented using search_dataset.

The updates should also be done in search.install, not system.install. What if I never turned on search.module?

Next, the following queries assume only the node system is using the indexer and blast away any other indexed types:
+ db_query("UPDATE {search_dataset} SET reindex = %d WHERE sid = %d", time(), $nid);
+ db_query('UPDATE {search_dataset} SET reindex = %d', time());

I also don't know where you got this:

I'd prefer to have the index but Steven was against it

With fromsid removed, adding a primary key on search_index.sid/type seems like a natural progression.

Log in or register to post comments

Comment #45

chx commented 24 September 2007 at 20:13

Steven, thank you for the review. I am taking the blow for the install file 'cos I told Doug that we decided to keep stuff in system.install there is an issue for moving updates to their install file, now postponed to 7.x and I do not want to clutter this issue with that. http://drupal.org/node/80526 Doug is adding a db_table_exists though. Surely the index was some miscommunication. About the mixed AND/OR implementation, if we need to choose between a huge performance blow and a not-too-much-used feature regression...

Log in or register to post comments

Comment #46

douggreen commented 24 September 2007 at 21:25

Category:	task	» bug
Priority:	Normal	» Critical
Status:	Needs work	» Needs review

Status	File	Size
new	146466-4.patch	26.79 KB

@Steven, thanks also from me. I've added:

AND type = 'node' to the two update statements. Is that the right change?
the unique index.

I've also added the following (with some help from chx):

a header comment to do_search
changed to standard LEFT JOIN SQL syntax instead LEFT OUTER JOIN
changed (i.word = '%s' OR i.word = '%s') to i.word IN ('%s', '%s')

And lastly, I've:

removed the subqueries and replaced them with an INNER JOIN
fixed the missing ORDER BY (was a simple variable typo)
refactored node_update_index so that _node_index_node can be called from other places.
ran the new code set through coder's style review

The only TODO on my list is to try and restore the word rank normalization.

I'm changing the status to code-needs-review, and also marking as critical and bug (per chx's request).

Log in or register to post comments

Comment #47

douggreen commented 25 September 2007 at 06:50

I think that we can improve the re-indexing so that the current search results aren't initially dropped. I propose that we ALTER the tables rather than DROP and CREATE them, then UPDATE {search_dataset} SET reindex=1. Each word/nid keyword relevancy will be slightly different once it gets re-indexed. But this way, we still have search results while the site re-indexes itself.

Log in or register to post comments

Comment #48

Steven commented 25 September 2007 at 10:26

About the mixed AND/OR implementation, if we need to choose between a huge performance blow and a not-too-much-used feature regression...

Sorry, but I just can't participate in these discussions anymore. In my previous post, I pointed out that we can have our cake and eat it too. Yet like so many times before, it ends up not parsing in other contributor's heads for whatever reason, and Drupal ends up doing less, and ends up doing things badly.

I am effectively talking to Search person #2 in this issue, and instead of looking at algorithmic improvements, I'm doing damage control.

Log in or register to post comments

Comment #49

chx commented 25 September 2007 at 20:34

Steven, could "whatever reason" be that you are smarter than any two of us, combined? Please do not stop helping us just elaborate a bit more. We know from Lord of the Rings that it tires wizards to explain their things to mere mortals, but please bear with us.

Log in or register to post comments

Comment #50

douggreen commented 26 September 2007 at 00:06

@Steven, I just got off the plane from Barcelona and am responding from the airport while awaiting my connection. I'm sad you feel this way.

I have a few concerns about the mixed AND's and OR's.

I really don't know what syntax this supports... and if I don't know what the syntax is, I doubt many of our users do. Since it doesn't accept parenthesis, IMO, it doesn't work in a natural language way. For example "sheep dog OR miniature poodle" doesn't work. I know why it doesn't work, but as a user of the system, that would be the natural language expression.
What do the major search engines do? Again, since I don't "get" the syntax, then I can't try it out
If we support the mixed AND's and OR's, we'll be adding 20-30 lines of code back to core that I suspect are used by very few. I just feel that since we're trying to make the php footprint smaller that this is the wrong direction. I'd be happy to try it however, so we can really see how much it adds.

Log in or register to post comments

Comment #51

douggreen commented 26 September 2007 at 00:07

Status	File	Size
new	146466-5.patch	28.13 KB

Attached patch adds support to not drop the search tables, so re-indexing can happen over time, while the search still "somewhat" works.

Log in or register to post comments

Comment #52

chx commented 27 September 2007 at 05:36

I am posting here Doug's email with my own additions about testing. A simpletest is forthcoming for these from me if I can pull it.

node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc, add a link from one node to another and make sure that both nodes get reindexed, make sure that the node that is linked to gets higher relevancy on some words from the linking-from node (I think that's how this node linking works, but Steven wrote it)
search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back
relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word.

Log in or register to post comments

Comment #53

douggreen commented 27 September 2007 at 10:32

For 3 above, relevancy isn't just about word count, but also which tags the word appears in. So a tag inside an <H1> should have higher relevancy for tags inside <H2>. This actually shouldn't have changed from the previous implementation. I'm more concerned about testing relevancy of linking documents. So, if document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

Log in or register to post comments

Comment #54

mfer commented 27 September 2007 at 16:42

@douggreen - I think what you mean to write was

<H1> should have higher relevancy for tags inside <H2>. This actually shouldn't have changed from the previous implementation. I'm more concerned about testing relevancy of linking documents. So, if document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

Log in or register to post comments

Comment #55

pillarsdotnet commented 28 September 2007 at 06:38

Status	File	Size
new	146466-D5_1.patch	8.43 KB

Patch for D5 had errors under Postgres; re-rolled for 5.2.

Log in or register to post comments

Comment #56

dries commented 28 September 2007 at 08:30

Steven: Doug is helping and willing to listen. Please.

Log in or register to post comments

Comment #57

chx commented 29 September 2007 at 05:51

@pillarsdotnet please do not litter this issue with Drupal 5 backport, open a new issue with "backporting the new search patch to Drupal 5".

Log in or register to post comments

Comment #58

douggreen commented 29 September 2007 at 09:59

@chx, good point. I was the one who originally posted the 5.x backport, but I should point out that the 5.x backport is no longer current. We should wait for the 5.x backport issue until this one is completely resolved.

Log in or register to post comments

Comment #59

moshe weitzman commented 3 October 2007 at 01:08

Status:

Needs review

» Needs work

Patches apply easier if rolled from root of drupal. I use cvs diff instead of diff. Anyway, I got a hunk fail in search.module. Please reroll and I will look.

Log in or register to post comments

Comment #60

douggreen commented 3 October 2007 at 17:54

Status:

Needs work

» Needs review

Please be sure to use the one I uploaded for D6 (146466-5.patch, being the 6.x patch I've rolled), and not the one uploaded by pillarsdotnet for D5. The 6.x patch is rolled with CVS diff from the root, and seems to work for me. It's cut against the TDRUPAL-6-0-BETA-1 tag. Is this what doesn't work anymore?

Log in or register to post comments

Comment #61

chx commented 4 October 2007 at 09:42

Status	File	Size
new	146466-61.patch	27.5 KB

Patch against HEAD.

Log in or register to post comments

Comment #62

moshe weitzman commented 4 October 2007 at 22:08

Status:

Needs review

» Needs work

I applied chx' patch but got an error with a similarly numbered update. So i moved that update to search.install as search_update_6001 (is OK?) and proceeded.

I then enabled search.module and saw an index status of "-9% of the site has been indexed. There are 23 items left to index." That could be a problem in my dev database, but merits some checking I think.

During indexing, I saw a query like this UPDATE variable SET value = 'd:0.200000000000000011102230246251565404236316680908203125;' WHERE name = 'node_cron_comments_scale'. Whats the point of that gigantic string?

I posted content with a link to another node and the link was nicely recorded in the search_node_links table upon indexing. But I can't yet grok how that link affects search result ranking. I pumped up keyword relevance and searched for the linked word but the linked to article did not bubble to the top as i expected. The search query is a bit complicated (no temp tables though - nice). here is that query. my keyword was persto.

SELECT i.type, i.sid, COUNT(*) AS matches, (8 * SUM(i.score * t.count) + 5 * POW(2, (GREATEST(n.created, n.changed, c.last_comment_timestamp) - 5) * 6.43e-8) + 1191535083 * (2.0 - 2.0 / (1.0 + c.comment_count * 0.2))) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid LEFT JOIN node_comment_statistics c ON c.nid = i.sid INNER JOIN users u ON n.uid = u.uid INNER JOIN search_dataset d ON i.sid = d.sid WHERE (n.status = 1)AND d.data LIKE '% persto %' AND i.word IN ('persto') AND i.type = 'node' GROUP BY i.type, i.sid ORDER BY score LIMIT 0, 10

the default order by is ASC and we want DESC no? seems like thats missing.

Log in or register to post comments

Comment #63

douggreen commented 5 October 2007 at 02:32

Status:

Needs work

» Needs review

Status	File	Size
new	146466_0.patch	28.57 KB

Thanks moshe for looking at this!

I think that you identified a few problems:

obviously, another install update has been added, so this one needs to be changed to 6034. It shouldn't be moved to search.install for reason's chx has mentioned before.
I think I know what the problem is with the -9%, and I think that the solution is to add a unique key to {search_dataset}(type, sid)
- The only way that this can happen is if node_search($op = 'status') returns more $remaining items than $total items. I think that the only way this can happen is if {search_dataset}(type, sid) has duplicates. If you check for these dups (SELECT type, sid FROM search_dataset GROUP BY type, sid HAVING COUNT(*) > 1), I think that you'll have two (type, sid) dups.
- If we wiped the indexes on update, this problem wouldn't happen. The update will start a full index, but I'm trying not to wipe the index on update, so that search will continue to work while the site re-indexes itself.
Yes, the sort order should be DESC.

I've re-rolled with three changes:

Changed the update to 6034
Added a unique index to {search_dataset}(type, sid). Note: the update might not work on pgsql for existing sites with duplicates. Does anyone know if postgres supports ALTER IGNORE TABLE? If not, we might just have to wipe the index on pgsql updates.
Added the sort for DESC

And a little explanation:

nothing has changed in regards to node_cron_comments_scale. This value is used to help normalize the ranking for comments. The admin/settings/search page allows you to set relative weights for the three ranking factors (keyword relevancy, node create recency, and number of comments), and thus it normalizes the rankings for all three of these. For example, you'd want the highest ranked word to contribute the same weight as the node with the most comments and as the most recently posted node. The very long float value you are seeing could easily be truncated at a certain number of digits, but you're basicallly looking at something divided by 1, with some float rounding. Again, nothing here is new.
The link indexing works such that when you add a link in node 1 to node 2, that node 2 gets re-indexed with the link text from node 1 added to it. The desire is to improve the keyword relevancy of the node being linked to, in this case node 2. This only works if you search for the one of the words within the <a> (in node 1 that links to node 2).
The search query isn't any more or less complicated than what we use now. HOWEVER THE IMPORTANT DISTINCTION IS THAT IT DOESN'T USE TEMPORARY TABLES.

While I'm marking this back to "patch (code needs review)", I'd like to say again that this patch is still missing keyword normalization and that this needs to be done before we RTBC. (See comment 1 in "a little explanation" for rank normalization.)

Log in or register to post comments

Comment #64

douggreen commented 9 October 2007 at 13:14

I'm no longer certain on my statement:

... this patch is still missing keyword normalization and that this needs to be done ...

What I'm referring to is this snippet from the existing code (search.module+854):

  // Calculate maximum relevance, to normalize it
  $normalize = db_result(db_query('SELECT MAX(relevance) FROM temp_search_sids'));
  if (!$normalize) {
    return array();
  }
  $select2 = str_replace('i.relevance', '('. (1.0 / $normalize) .' * i.relevance)', $select2);

However, I think that I was previously mistaken that this was the keyword normalization. I think that each of the rank normalization is done in node.module.

The only purpose that I can surmise for the above was to make the score values symmetric across search's (that is, a score of 1 is always the highest ranked result, and the search results degrade from there, however because of the two phase search's, a 1 was the highest possible value, and not actually the highest value -- I think). This is a nifty feature, but I can't figure out how to implement it without a second query and/or a temporary table.

Log in or register to post comments

Comment #65

catch

he/him

English

commented 9 October 2007 at 14:24

This is a lot over my head, but would losing the normalisation actuall affect the order of results at all?

Log in or register to post comments

Comment #66

douggreen commented 9 October 2007 at 16:41

No, that is if my assumption about what we're normalizing here is correct. Another set of eyes on the code would help.

Log in or register to post comments

Comment #67

Steven commented 10 October 2007 at 00:26

The only purpose that I can surmise for the above ... [...] This is a nifty feature, but I can't figure out how to implement it without a second query and/or a temporary table.

Two things:

Normalization of the TF*IDF scores isn't just a "nifty feature", it is required for rank mixing to be meaningful.
I said so in very unambiguous terms in the issue that is linked to in the first line on this very page. It also proposes a solution for this problem.

Thanks for yet another kick in the nuts. It's very much appreciated.

Log in or register to post comments

Comment #68

douggreen commented 10 October 2007 at 04:07

Version:	6.x-dev	» 7.x-dev
Priority:	Critical	» Normal
Status:	Needs review	» Needs work

I give up :(

Log in or register to post comments

Comment #69

catch

he/him

English

commented 10 October 2007 at 10:19

That normalization stuff from the other issue starts here: http://drupal.org/node/145560#comment-247397

Log in or register to post comments

Comment #70

douggreen commented 14 October 2007 at 20:23

Version:	7.x-dev	» 6.x-dev
Priority:	Normal	» Critical
Status:	Needs work	» Needs review

Status	File	Size
new	146466_1.patch	25.77 KB

Per a request from chx and dries, here's one last 6.x attempt... I admit that the Barcelona patch was striff with problems. This is much cleaner and (I hope) truer to the original search module. In this patch, I:

started over with CVS HEAD.
added link tracking from #145560.
added system.install changes from the Barcelona patch.
moved the search.schema changes from the Barcelona patch to search.install.
changed the node_cron_last variable logic to use the search_data.reindex method from the Barcelona patch
kept search_parse_query() mostly unchanged from 5.x, adding two additional returns.
reworked do_search() without temporary tables, maintaining the keyword normalization, and conditionally joins {search_dataset} when needed.
reworked the comment in do_search() to reflect the new algorithm.

I tried to keep as much of the existing code as possible, making as few changes as possible. I did add an additional warning message (lower case "or") and I re-factored one function (node_index_node).

I feel like something is missing, but I can't put my finger on it. Everything seems to be working. I tested:

A fresh 6.x install with this patch, creating 100 nodes with devel, indexing, and searching
A HEAD 6.x install, creating 100 nodes with devel, indexing, install this patch, update.php, search, reindex, search again
The four types of searches, and some combinations thereof:
- alpha beta
- alpha OR beta
- "alpha beta"
- alpha -beta
- "alpha beta" OR delta

I haven't run any performance tests. I'm concerned with the complexity of the COUNT and normalization queries. I wonder if any of the SQL would be faster if we rounded the floating point numbers to a decimal place or two.

The COUNT query is more complicated than it needs to be because pager_query() passes the same $args to the $query and $count_query, and thus the count query needs to keep the additional columns from $select2.

Log in or register to post comments

Comment #71

moshe weitzman commented 15 October 2007 at 00:09

I have no idea if it makes a difference, but you can run your own count query to get the actual count and then pass a count query like SELECT 345 or whatever the count result is.

Log in or register to post comments

Comment #72

Bevan commented 15 October 2007 at 02:21

subscribing

Log in or register to post comments

Comment #73

douggreen commented 15 October 2007 at 09:25

Sorry about the formatting problems :( I really should hit "PREVIEW".

In comment #70 above, should read

2. added link tracking from this node's original patch.

@moshe, Thanks for the COUNT suggestion. I'm aware of that technique. That's actually how search.module used to do it. We used to have more readily accessible because the results were sitting in a temporary table. I'm not sure that I could do it here, though. My comment that the COUNT query is more complicated than it needs to be is that I think the SQL could probably be simplified to not join as many tables, since all were doing is counting. However, I don't know how smart the database engine is in caching this against the actual count, and we shouldn't simplify it, if indeed it is cached.

Log in or register to post comments

Comment #74

chx commented 23 October 2007 at 06:22

How to test:

check whether node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc,
search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back
relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word. Also try putting that word inside H1 tags -- should have higher relevancy than the same word appearing inside H2 and higher than just appearing in the document.

add a link from one node to another and make sure that both nodes get reindexed. If document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

Log in or register to post comments

Comment #75

catch

he/him

English

commented 23 October 2007 at 08:50

Patch applied fine.

* check whether node indexing works - use cron.php, edit a node after it's been indexed, add a comment, remove a comment, etc

All this fine - indexing, reindexing after edits, everything - until I deleted a comment then:

Fatal error: Cannot use object of type stdClass as array in /home/libcomwww/public_html/drupal6/modules/search/search.module on line 646

- the comment actually deletes - got drupal_set_message and the comment was gone when I went manually back to /node/2 - but it's not removed from the search index after running cron again. Forcing a reindex takes the comment out of the index.

To help any prospective re-rollers should doug not be around, line 646 is the search_touch_node bit of:

/**
 * Implementation of hook_comment().
 */
function search_comment($a1, $op) {
  switch ($op) {
    // Reindex the node when comments are added or changed
    case 'insert':
    case 'update':
    case 'delete':
    case 'publish':
    case 'unpublish':
      search_touch_node($a1['nid']);
      break;
  }
}

Continuing:

* search syntax - create a couple nodes and then do the four types of searchs (AND, OR, exclude, and phrase searches) and make sure the right results come back.

All of these work great, everything as expected.

* relevancy. create two nodes where in node A one of the words appear more times than in node B. Check searching that word. Also try putting that word inside H1 tags -- should have higher relevancy than the same word appearing inside H2 and higher than just appearing in the document.

Did exactly this and it's good again!

* add a link from one node to another and make sure that both nodes get reindexed. If document A is linked to by document B, and document A does not refer to word X, but document B does refer to word X, when you search for word X, both documents should return results. This is in essence how linking from one node to another, improves the word relevancy of the linked to document.

That works fine as well.

So everything on chx's list works (and with my two nodes the search is nice and snappy) - except deleting a comment throws that fatal error.

Setting to code needs work but that shouldn't stop others from reviewing the rest of the functionality until it gets re-rolled, as far as I'm concerned all of that is good to go. I did some very rough tests on recency and number of comments and these seem fine too.

Log in or register to post comments

Comment #76

catch

he/him

English

commented 23 October 2007 at 08:52

Status:

Needs review

» Needs work

ack!

Log in or register to post comments

Comment #77

catch

he/him

English

commented 23 October 2007 at 09:02

Status:

Needs work

» Needs review

Status	File	Size
new	146466_1_0.patch	25.77 KB

OK this one should downgrade that fatal error to a notice on submit instead of a fatal error on delete.

Trying to get property of non-object - line 646

But I can confirm that deleting comments now takes nodes out of the index. Marking back to review since a notice really shouldn't hold this up getting some more testing.

Log in or register to post comments

Comment #78

snufkin commented 23 October 2007 at 21:15

the last patch results indeed in that notice on comment change. it removes the comment from index when it is being removed, however i noticed two things in the patch that look suspicious:

Extra space between ++ operator and variable on line 246, 283, 460. According to php.net manuals this should be without the separating space ($var++ instead of $var ++). I did not find relevant entry in the handbook.

The return of searc_nodeapi is return '<a>('. implode(', ', $output) .')</a>'; on line 355, this looks suspicious.

loading the cron.php nicely updates the index, except that when i started it it said 100 items to index, after start it jumped to 101, and 99% of the items got indexed running cron.php.

Relevancy: repeating keyword superseeds those nodes where keyword is occurring fewer, so this is ok. Linking put the linked node to the first hit, so i think this is ok too.

The search doesn't seem to understand AND, however it understands OR, so if it should, then this is a bug. excluding works, phrase search gives back the phrase too, so its ok except for AND.

Editing nodes wont add the changes to the index, but running cron.php after the change will add the modified content to the index.

Log in or register to post comments

Comment #79

catch

he/him

English

commented 24 October 2007 at 07:12

Status:

Needs review

» Needs work

Just to clarify with AND searches.

Searching with [term term] which defaults to AND works fine. [term AND term] - that isn't working. So CNW.

Log in or register to post comments

Comment #80

robertdouglass commented 24 October 2007 at 10:28

subscribe

Log in or register to post comments

Comment #81

douggreen commented 24 October 2007 at 12:43

Status:

Needs work

» Needs review

@catch, Thanks for all the review. A few things:

Looks like I need to reroll for a problem in #75 - #77... what was the solution to this?
What do you think is wrong with the spacing on the $var ++? I think that the spacing is part of our coding standards.
I don't think that there's anything wrong with return '<a>('. implode(', ', $output) .')</a>', but please look at it closer. The reason an href isn't included here is that it isn't needed. The <a> is returned to search.module itself, so that it will find the anchor and process it and increase the relevancy. No actual HTML is rendered and search.module doesn't need the href.
[term1 AND term2] never worked and isn't part of the spec. AND is implied. Try it on a 5.x site. Switching back to CNR because of this.

Log in or register to post comments

Comment #82

catch

he/him

English

commented 24 October 2007 at 13:21

75 - 77

hook_comment I changed:

search_touch_node($a1['nid']); (which produces a fatal error on comment delete) to
search_touch_node($a1->nid); (notice on comment submit).

It may well be the wrong approach since my PHP is ultra-basic, but it "works".

I see [term AND term] actually searches for [term and term] on drupal.org - so we were wrong and that's not a regression at all!

Can't comment on code style issues but if they're not really issues then it's just that hook_comment error that needs fixing as far as I can see.

Log in or register to post comments

Comment #83

douggreen commented 24 October 2007 at 15:44

If [term1 AND term2] searches on d.o match [term1 and term2], what I think is happening is that "and" is a three letter word that is being search for and you are finding documents that match all three words "term1" "and" and "term2".

Log in or register to post comments

Comment #84

catch

he/him

English

commented 24 October 2007 at 15:57

re #83 - yeah that's exactly what I meant, sorry it wasn't clearer.

I think it's a valid issue but it has nothing to do with this patch, so have split it into a new one: http://drupal.org/node/186242

Log in or register to post comments

Comment #85

catch

he/him

English

commented 25 October 2007 at 10:28

Status:

Needs review

» Needs work

So #40934 went in but this still applies with offset.

Only the notice to fix as far as I can see, notwithstanding snufkin's code style comments, so marking as needs work for that/those.

Log in or register to post comments

Comment #86

chx commented 26 October 2007 at 07:47

Status:

Needs work

» Reviewed & tested by the community

Status	File	Size
new	search_146466-86.patch	25.46 KB

So... what we have here. Reviews that say the code works. Yay. And then some code style issues with variables and ++. I do not know whether we have a rule for variables and ++ but I feel like these "stick" to the variable unlike + or - . Core seems to follow this notation. And then a notice. Gone. Added some missing doxygen (yes the function was there earlier, but why no doc it while we are here). No functionality was changed. So if it works and adheres to standards and has comments then it is ready, isn't it?

By the way, I feel like I understand now what search module does. I do not think I would want to debug/patch it but I still :) Very nice job simplifying it.

Log in or register to post comments

Comment #87

catch

he/him

English

commented 1 November 2007 at 13:11

Patch still applies with 2-15 lines offset.

Log in or register to post comments

Comment #88

catch

he/him

English

commented 4 November 2007 at 18:05

Status:

Reviewed & tested by the community

» Needs work

Not any more - system_update_6035 now that system index got in.

Log in or register to post comments

Comment #89

chx commented 4 November 2007 at 21:43

Status:

Needs work

» Reviewed & tested by the community

Status	File	Size
new	search_146466-89.patch	25.81 KB

Bother! said Pooh and rerolled the patch.

Log in or register to post comments

Comment #90

cburschka

they

commented 5 November 2007 at 21:38

Status:

Reviewed & tested by the community

» Needs work

Hunk #4 in node.module fails. This looks like a linebreak problem:

@@ -2620,4 +2607,4 @@ function _node_mass_update_batch_finishe
     $message .= theme('item_list', $results);
     drupal_set_message($message);
   }
-}
\ No newline at end of file
+}

Log in or register to post comments

Comment #91

cburschka

they

commented 5 November 2007 at 21:58

Status:

Needs work

» Reviewed & tested by the community

Status	File	Size
new	search_146466-91.patch	26.33 KB

I have rerolled the patch from the changed files, effectively removing hunk #4 from the patch. As this hunk, from the above quote and the .rej file, doesn't seem to do anything, the patch should now work fine.

If this is a problem specific to my version of patch/diff, sorry for the trouble.

(Setting back to RTBC as the patch is essentially identical to the previous one.)

Log in or register to post comments

Comment #92

catch

he/him

English

commented 6 November 2007 at 16:31

Just to confirm the RTBC - tested 91 and it throws no errors, still works etc. #89 that 4th hunk in node module failed but the patch worked without it.

Log in or register to post comments

Comment #93

liam mcdermott commented 6 November 2007 at 17:18

Tested #91 myself on a CVS Drupal and it seems RTBC. No errors, used devel to generate lots of test data, ran a cron, then did a search. Search results seemed relevant and to be ordered in a logical way.

Yay! Said Piglet.

Log in or register to post comments

Comment #94

gábor hojtsy

he/him

Hungarian

Hungary

commented 7 November 2007 at 10:29

This is a huge database and API change, so I am handing off decision to Dries on this issue.

Log in or register to post comments

Comment #95

douggreen commented 7 November 2007 at 12:05

Dries and I spoke in Barcelona and he indicated that because of the performance impact of this patch, he was still open to it. And we emailed about the time of #68, and he encouraged me to continue trying. It should be Dries's decision.

I think that it's mostly a huge database change and not a huge API change.

There are a few new functions, and a few new optional arguments tagged onto the end of existing functions, but I think that all existing function calls should work as-is. For this reason, I think that the API should be "backwards" compatible, something that we rarely strive for in Drupal. I'm not guaranteeing this (and it would be good if someone could double-check this statement) but that's what I recall.

Log in or register to post comments

Comment #96

dries commented 10 November 2007 at 09:00

I'm OK with this change. I promised Doug long ago that it could make its way into Drupal 6, and I want to keep my word.

However, it would be great if someone could setup two Drupal sites, one indexed with Drupal 6 and one indexed with Drupal 6 + patch. Then, do some simple and complex searches and compare the results side by side.

Should searches return the same result sets in all cases?

Log in or register to post comments

Comment #97

catch

he/him

English

commented 10 November 2007 at 12:38

Status	File	Size
new	6-d6-patched.png	62.73 KB
new	6-d6.png	62.19 KB
new	5-d6-pg7.png	68.98 KB
new	5-d6-pg1.png	69.97 KB
new	5-d6-patched-pg7.png	69.49 KB
new	5-d6-patched-pg1.png	70.46 KB
new	4-d6-pg6-patched.png	68.97 KB
new	4-d6-pg6.png	68.47 KB
new	4-d6-pg1.png	70.78 KB
new	4-d6patched-pg1.png	71.28 KB
new	3-d6-patched.png	41.65 KB
new	3-d6.png	41.12 KB
new	2-d6-patched-firstpage.png	70.06 KB
new	2-d6-firstpage.png	70.34 KB
new	1-patched.png	70.4 KB
new	1-d6.png	68.84 KB

OK here goes. Excuse the verbosity but want to make sure I don't have to do this again!

I did a fresh drupal install. Used devel module to generate 110 nodes + some taxonomy terms.

Some nodes have no comments
Some have comments
Some have comments + taxonomy terms.

I then used mysqlhotcopy to clone the db, changed the settings.php in another subdomain with D6 installed. Patched the cloned site (applies with some offset now but all fine). Ran system update 6036 although no queries performed.

Install 1 is 'd6'. Install 2 is 'd6-patched' from now on.

Enabled search on both installs. hit cron.php a few times (not that it matters, but indexing feedback was the same on each, percentages, numbers left etc.)

Left weighting on 5 5 5 for now.

OK now for some searches. Screenshots attached

1. Simple search for {vero} - identical first page

2. {vero OR singularis} - first eight results the same, then some variance - then last page results the same.

3. {vero -singularis} - identical results

OK this next set are with relevance set to 10, the other two weights to 0.

4. {nibh} again identical - even down to page 6

5. {gemino nibh} - looks identical again.

6. {"Gemino proprius"} - identical again.

So to me, it looks like just about every result is the same. Given this is devel generated content, timestamps and comment numbers have very little variance, the slight differences may well be down to the homegeneity of the content rather than any big change in functionality - nodes getting the same scores even.

I have to go out, and this is already a lot of screenshots. Will leave those sites as they are in case there's more testing needs to be done (didn't do any very, very complex searches) - probably late tonight or tomorrow though.

Anyway looks good to me, and the patched site was noticeably more snappy.

Log in or register to post comments

Comment #98

douggreen commented 10 November 2007 at 15:08

@catch, Thanks for the independent testing. You get my award for D6 tester of the year too! (I saw webchick's post on this :)

In any further testing, please try adding arbitrary link references <a href="node/123">hello world</a> (where node/123 is a valid node number) and then search for "hello world" and see what happens. The results should be slightly different between the current search implementation and patched versions, but the results should be reasonable with the new patched version.

I would also like to see some postgres upgrade testing from 5.x because I think we might need to change them a bit. Steven's original upgrade code required a full site index on upgrade; I was able to do something a little less extreme which is going to help us on all mysql sites, including d.o, but we may need to fall back to the original full upgrade code for postgres installs. But some testing is needed. I don't have postgres.

Log in or register to post comments

Comment #99

catch

he/him

English

commented 10 November 2007 at 20:22

Status	File	Size
new	search-link-hello-world.png	39.4 KB

Doug:

Added a link in the same node (to the same, different, node) in each version. Searched for hello world.

patched version was fine. Screenshot attached.

unpatched site gave me these notices after cron:

# notice: Undefined index: hello in drupal6/modules/search/search.module on line 538.
# notice: Undefined index: world in drupal6/modules/search/search.module on line 538.

and didn't reindex the nodes on a cron run.

I forced a reindex in admin/settings/search, ran cron a few times again - got those same notices after the last cron run as well when I refreshed the home page. Although 100% of the site was indexed, no results for Hello World at all. So this is a bug with the search module as it is at the moment in D6.
Like I said, patched version works fine though.

Log in or register to post comments

Comment #100

catch

he/him

English

commented 11 November 2007 at 09:54

I did a couple more OR searches to check that variance.

three different searches all gave me identical results (including the {vero OR singlaris} search from earlier with keywords weighted to 10.

With recency and comment rating back up to 5 there's some variance again, but I'm 99% sure this is because I have loads of comments with identical timestamps and comment numbers due to devel generation. The only drupal 5 site I could upgrade takes about a month or more to re-index, so no chance of testing with that.

Log in or register to post comments

Comment #101

douggreen commented 11 November 2007 at 18:11

@catch, just to be clear, the php errors are on the unpatched site, and you don't get errors on the patched site?

Log in or register to post comments

Comment #102

catch

he/him

English

commented 11 November 2007 at 21:30

doug: yeah the errors are on the unpatched site. Patched site is flawless.

Log in or register to post comments

Comment #103

agentrickard

he/him

English

Georgia (US)

commented 11 November 2007 at 22:38

Digging in (thanks to chx).

ALTER IGNORE is mySQL specific.

http://dev.mysql.com/doc/refman/5.0/en/alter-table.html

IGNORE is a MySQL extension to standard SQL. It controls how ALTER TABLE works if there are duplicates on unique keys in the new table or if warnings occur when strict mode is enabled. If IGNORE is not specified, the copy is aborted and rolled back if duplicate-key errors occur. If IGNORE is specified, only the first row is used of rows with duplicates on a unique key, The other conflicting rows are deleted. Incorrect values are truncated to the closest matching acceptable value.

Taking a look to see what pgSQL options are available.

(Although this may no longer be an issue.) pgSQL testing still needed?

Log in or register to post comments

Comment #104

catch

he/him

English

commented 12 November 2007 at 11:31

agentrickard: pgSQL testing is still needed, especially for the upgrade path, for the reasons you mention. However, as discussed on IRC, I think we should open a new issue for this, since that seems to be the only thing holding this patch up at this point.

To summarise the pgSQL issues. The current patch allows an upgraded 5.x site on MySQL to maintain some search functionality using the old search index - this is due to the ALTER IGNORE TABLE command (MySQL only) which skips over possible duplicate sids when adding the unique index, to allow for a less destructive re-indexing over time. It's also worth noting that duplicate sids are only caused by failed cron runs - so this is a relatively small proportion of sites in total, and may be close to zero actual pgSQL sites anyway.

Since IGNORE isn't supported by pgSQL, it may be necessary to wipe the old index and re-index from scratch if an alternative can't be found, and/or a change in the upgrade code to ensure the new index is created at all. A re-index from scratch isn't necessarily all that painful, since Doug Green already has a module using the batch api to speed up the process. Either way, these seem like tweaks to me, rather than anything critical.

Log in or register to post comments

Comment #105

catch

he/him

English

commented 12 November 2007 at 12:02

Possible solution for pgSQL proposed on the development list:

http://lists.drupal.org/archives/development/2007-11/msg00226.html

Log in or register to post comments

Comment #106

douggreen commented 12 November 2007 at 12:10

Status	File	Size
new	146466.patch	26.13 KB

To remove the pgsql hurdle, I've rerolled this with the original simpler code that forces a full re-index for pgsql. I've kept the more complicated, but more optimal, ALTER TABLE IGNORE update code for mysql.

Log in or register to post comments

Comment #107

douggreen commented 12 November 2007 at 14:53

I think that there might be some confusion in regards to the update code, so I'd like to clarify it here. A full site index is needed for both mysql and pgsql. With mysql (and thus d.o), the search index will remain intact while the full index takes place, which means that users will be able to continue to search the site and get somewhat meaningful results while the full index takes place.

Log in or register to post comments

Comment #108

agentrickard

he/him

English

Georgia (US)

commented 12 November 2007 at 15:04

I get it. So the converse to that is that pgSQL users may have search "broken" for a while during the upgrade phase. The upgrade phase could take an unknown length of time because it runs during cron to build the search index a piece at a time.

I'll try the approach from #105 and see if is works for pgSQL. Otherwise, we should go with the 're-index' pgSQL with documentation notes in the UPGRADE.txt file.

Log in or register to post comments

Comment #109

gábor hojtsy

he/him

Hungarian

Hungary

commented 12 November 2007 at 16:01

agentrickard: Several people suggested the reindex module, which could be useful for pgSQL users, instead of depending on the uncertain cron timeframes.

Log in or register to post comments

Comment #110

dries commented 13 November 2007 at 14:07

Status:

Reviewed & tested by the community

» Needs work

I've committed this patch. Thanks for your hard work Steven and Doug. Doug, your persistence is much appreciated. :)

I'm not marking this patch as 'fixed' yet as I'd like to see us extend the high-level documentation a bit. It would be great if we could update the big picture. For example, an two-sentence paragraph that explains why we have a separate table for links. A bird eye's view would help people to navigate its way through some of the design decisions. I think this would help other people to wrap their head around the design. We want to involve more search people and we should go the extra mile to pave the path.

If you disagree that this would be useful, feel free to mark this issue 'fixed'.

Thanks!

Log in or register to post comments

Comment #111

gábor hojtsy

he/him

Hungarian

Hungary

commented 13 November 2007 at 14:39

Title:	Remove temporary table usage from search module	» Document search module design
Category:	bug	» task
Priority:	Critical	» Normal

Great, removing from the critical queue, so we have a better picture of code to fix still.

Log in or register to post comments

Comment #112

bjaspan commented 15 November 2007 at 02:16

FYI, the applied patch results in incorrect database indices after an upgrade from D5 to D6. See http://drupal.org/node/192348.

Log in or register to post comments

Comment #113

David_Rothstein commented 31 December 2007 at 20:40

I found two bugs introduced as a result of this patch that have to do with how the search index deals with links from one node to another (well, the first bug definitely was introduced by this patch; the second might not have been). Since the patch has already been committed, I filed separate issues:
http://drupal.org/node/205199
http://drupal.org/node/205202

Just thought it would be good to "advertise" them here... and also to point out they should be looked at if this patch ever gets backported to 5.x. Thanks.

Log in or register to post comments

Comment #114

robertgarrigos commented 1 January 2008 at 16:17

subscribing

Log in or register to post comments

Comment #115

robertgarrigos commented 1 January 2008 at 16:22

Log in or register to post comments

Comment #116

Anonymous (not verified) commented 8 January 2008 at 10:51

Could you point me to the current 5.5 backport? I can't find it but would like to test it.

Thanks a lot,
Stephan

Log in or register to post comments

Comment #117

catch

he/him

English

commented 8 January 2008 at 12:06

There is no 5.5 backport, and will not be. You can install 6.x and test that of course.

Log in or register to post comments

Comment #118

Anonymous (not verified) commented 8 January 2008 at 13:02

I think there should be one and http://drupal.org/node/146466#comment-559982 shows interest in it, doesn't it? A lot of people have drastic search engine problems right now. In short, the search engine in 5.5 is not usable for many of us, see http://drupal.org/node/207361 , http://drupal.org/node/203806 as only two examples - the boards are full of problem descriptions.

I have just updated a bunch of 4.7 installations in the hope the search engine issues are resolved. Now I am told to upgrade to 6.x of which most of the modules are not even available, yet. Sorry, but this is unrealistic from a professional point of view. Don't leave your users out in the cold, please.

Log in or register to post comments

Comment #119

catch

he/him

English

commented 8 January 2008 at 13:13

In that case I'd recommend opening a new issue for the 5.x backport, as recommended in #57, #58 and #59.

Log in or register to post comments

Comment #120

Anonymous (not verified) commented 8 January 2008 at 13:14

Ok, I will. Thanks.

Log in or register to post comments

Comment #121

Anonymous (not verified) commented 8 January 2008 at 13:40

Opened a new issue for backporting these search engine improvements to 5.x.

Log in or register to post comments

Comment #122

robertdouglass commented 15 April 2008 at 09:45

Status:

Needs work

» Active

Just a reminder, this issue is now one of documentation:

I'm not marking this patch as 'fixed' yet as I'd like to see us extend the high-level documentation a bit. It would be great if we could update the big picture. For example, an two-sentence paragraph that explains why we have a separate table for links. A bird eye's view would help people to navigate its way through some of the design decisions. I think this would help other people to wrap their head around the design. We want to involve more search people and we should go the extra mile to pave the path.

Log in or register to post comments

Comment #123

douggreen commented 27 April 2008 at 19:04

Let's talk through this at the search sprint, and let someone else take a stab at the documentation...

Log in or register to post comments

Comment #124

douggreen commented 13 July 2008 at 19:12

Version:	6.x-dev	» 5.x-dev
Status:	Active	» Needs work

Status	File	Size
new	146466-5x.patch	21.48 KB

Attached is a version backported for 5.x... It needs testing and review. I've only done minimal testing and I haven't tested the upgrade paths or written much of the pgsql sections.

This patch was created so we could improve performance on drupal.org, before the 6.x upgrade.

Log in or register to post comments

Comment #125

jeremy commented 20 July 2008 at 14:41

I was trying to benchmark this patch with SearchBench using a snapshot of the D.O. database from scratch. When trying to re-index the site content, the indexer got stuck looping forever.

This is because there are numerous nodes on D.O. that contain a call to drupal_goto(). In the Drupal 5 version of search there is a node_update_shutdown() function that "solves" this by recording the last nid that we tried indexing. In the Drupal 6 backported patch, this function has been removed. Thus, when indexing and we get to a drupal_goto() node we get stuck in an infinite loop.

Log in or register to post comments

Comment #126

jeremy commented 20 July 2008 at 15:05

I added some debug output to the code, and found that the problematic node has a nid of 9730. To get past it, I executed the following query:
insert into search_dataset (sid, type, reindex) VALUES (9730, 'node', 0);

I was then able to successfully index nid 9731, the next with a reindex flag, but was almost immediately stuck again when we got to nid 9732 which also seems to contain a drupal_goto(). Anyway, this is just to confirm that drupal_goto() is the problem and still needs to be properly handled.

Log in or register to post comments

Comment #127

catch

he/him

English

commented 20 July 2008 at 21:08

Title:

Document search module design

» D6 search refactoring (backport to 5)

hmm, http://drupal.org/node/9370 http://drupal.org/node/9372 - both look harmless enough.

Changing title of the issue.

Log in or register to post comments

Comment #128

jeremy commented 21 July 2008 at 13:05

The issue is with the nodes on scratch.drupal.org. http://scratch.drupal.org/node/9730 and http://scratch.drupal.org/node/9732 are only two examples if nodes causing this problem.

Log in or register to post comments

Comment #129

catch

he/him

English

commented 21 July 2008 at 13:33

It's possible that scratch needs updating per: #277021: Convert all PHP nodes and blocks into modules - which ought to fix the issue for the d.o database indexing if not in general.

Log in or register to post comments

Comment #130

douggreen commented 23 July 2008 at 12:45

This is probably a bug in the 6.x version as well and probably needs to be fixed. if d.o is having this problem, other sites will as well. It feels like a new issue. @catch, can you reproduce this in 6.x? And if so, can you open up a new issue (and link to it here). Thanks!

Log in or register to post comments

Comment #131

jeremy commented 23 July 2008 at 13:33

I have confirmed that this is also a bug in Drupal 6 proper, and opened an issue.

Log in or register to post comments

Comment #132

Rhino commented 28 July 2008 at 14:16

subscribing.

Log in or register to post comments

Comment #133

c960657 commented 12 August 2008 at 08:22

From 146466.patch :

@@ -94,10 +88,10 @@ function search_schema() {
       ),
     ),
     'indexes' => array(
-      'from_sid_type' => array('fromsid', 'fromtype'),
       'sid_type' => array('sid', 'type'),
       'word' => array('word')
     ),
+    'unique keys' => array('word_sid_type' => array('word', 'sid', 'type')),
   );
 
   $schema['search_total'] = array(

I know it is late in the game, but doesn't the addition of "word_sid_type" make the "word" index redundant? Or is the ability to use the left-most part of an index a MySQL-specific feature?

Log in or register to post comments

Comment #134

douggreen commented 28 August 2008 at 13:32

Status:

Needs work

» Needs review

Status	File	Size
new	146466-1.patch	25.01 KB

Attached is an updated version that includes #286263: Make search indexing smart to handle nodes wth php redirects

Log in or register to post comments

Comment #135

damienmckenna

TN, USA

commented 29 August 2008 at 16:21

FYI when I run the update on a patched 5.10 install I get the following:

The following queries were executed
system module

Update #1023
* Failed:
* C
* UPDATE {search_dataset} SET reindex = 1
* ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY sid_word_type (sid, word, type)
* ALTER IGNORE TABLE {search_dataset} ADD UNIQUE KEY sid_type (sid, type)

That said all of the updates applied correctly. Go figure.

Log in or register to post comments

Comment #136

damienmckenna

TN, USA

commented 29 August 2008 at 16:45

After starting the reindexing process I've started getting these:

user warning: Duplicate entry '117-meribel-node' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('meribel', 117, 'node', 11) in \includes\database.mysql.inc on line 172

In this case the node's title is "Méribel, France". Odd.

Log in or register to post comments

Comment #137

damienmckenna

TN, USA

commented 4 September 2008 at 03:47

An extra bonus: have the search index status ignore nodes of content types that shouldn't be indexed:

Index: modules/node/node.module
===================================================================
--- modules/node/node.module	(revision 668)
+++ modules/node/node.module	(working copy)
@@ -859,9 +859,18 @@
       return;
 
     case 'status':
-      $total = db_result(db_query('SELECT COUNT(*) FROM {node} WHERE status = 1'));
-      $remaining = db_result(db_query("SELECT COUNT(*) FROM {node} n LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND (d.sid IS NULL OR d.reindex > 0)"));
-      $pending = db_result(db_query("SELECT COUNT(*) FROM {node} n LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND d.reindex < 0"));
+      // see if some of the types have been disabled
+      $types = variable_get('search_config_disable_index_type', '');
+      if(is_array($types) && count($types) > 0) {
+        foreach($types as $i => $type)
+          $types[$i] = "'{$type}'";
+        $types = " AND n.type NOT IN (" . implode(',', $types) .")";
+      }
+      else
+        $types = '';
+      $total = db_result(db_query('SELECT COUNT(*) FROM {node} WHERE status = 1'.$types));
+      $remaining = db_result(db_query("SELECT COUNT(*) FROM {node} n LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND (d.sid IS NULL OR d.reindex > 0)".$types));
+      $pending = db_result(db_query("SELECT COUNT(*) FROM {node} n LEFT JOIN {search_dataset} d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND d.reindex < 0".$types));
       return array('remaining' => $remaining, 'total' => $total, 'pending' => $pending);
 
     case 'admin':

Note that the revision number is against our internal code release, but it should be easy to manually apply the patch.

Log in or register to post comments

Comment #138

douggreen commented 7 September 2008 at 06:37

Please create a new ticket for the "extra bonus," and link to it here if you wish.

Log in or register to post comments

Comment #139

damienmckenna

TN, USA

commented 9 September 2008 at 02:44

douggreen, good idea, I've submitted it as #305659: Patch for node_search($op=status) to give correct status message totals.

NOTE: the patch actually requires the search_config module, sorry about the confusion.

Log in or register to post comments

Comment #140

damienmckenna

TN, USA

commented 9 September 2008 at 02:12

Has anyone else noticed that their advanced search no longer works after applying this patch? The regular search is great, but advanced search doesn't return anything regardless of what is entered.

Log in or register to post comments

Comment #141

doc2@drupalfr.org commented 9 September 2008 at 08:55

Is patch from comment #134 suitable for d5, as the topic's title and header version suggest? If not, which patch is the latest for d5?

So many modules involved, this is concerning for core/security updates. But I have thousands of nodes in my DMS, with a lot of word redundancy. Core searches are much too slow. So I need this fix.

Many thanks for the work. Best regards.
Arsène

Log in or register to post comments

Comment #142

douggreen commented 9 September 2008 at 12:30

Status	File	Size
new	146466-2.patch	25 KB

@DamienMcKenna,

replying to 135, that's a bug in the patch. Please try the attached patch.
replying to 136, 'meribel' and 'Méribel' are different words, but they are stored in the same {search_index} row. We have run into this before, and I thought that it was fixed (I don't recall the issue # though). Can someone else search for that issue and/or test this against 6.x or 7.x. If it's still an issue in 7.x, please either reopen the previous issue on it, or create a new one if you can't find that issue.
replying to 140, The advanced search form works for me. I have a 100k devel generated database that I'm using to test with. It has "story" and "page" nodes. I was able to use the advanced search form to search "containing any words", to search for a phrase, and to search for a type. Please be more specific on what is broken for you.

@doc2@drupalfr.org, replying to 141, yes, that is the latest patch for 5.x. But beware that this is still beta code... and not guaranteed to work yet.

Log in or register to post comments

Comment #143

doc2@drupalfr.org commented 9 September 2008 at 15:35

With the latest 146466-2.patch, I immediately got on a 5.10 drupal install:

user warning: Unknown column 'd.reindex' in 'where clause' query: SELECT COUNT(*) FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND (d.sid IS NULL OR d.reindex > 0) in drupal/includes/database.mysql.inc on line 172.
user warning: Unknown column 'd.reindex' in 'where clause' query: SELECT COUNT(*) FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE n.status = 1 AND d.reindex < 0 in drupal/includes/database.mysql.inc on line 172.

The watchdog warning corresponding to it seem to be:
Unknown column 'd.reindex' in 'where clause' query: SELECT n.nid FROM node n LEFT JOIN search_dataset d ON d.type = 'node' AND d.sid = n.nid WHERE d.sid IS NULL OR d.reindex > 0 ORDER BY d.reindex ASC, n.nid ASC LIMIT 0, 50 dans /Shared Items/iDoc/drupal/includes/database.mysql.inc à la ligne 172.

I then ran update.php
On the "finished" page I got:

user warning: Duplicate entry '104-1-node' for key 1 query: ALTER TABLE search_index DROP fromsid in drupal/includes/database.mysql.inc on line 172.
Updates were attempted. If you see no failures below, you may proceed happily to the administration pages. Otherwise, you may need to update your database manually. All errors have been logged.

and:

The following queries were executed

system module

Update #1023
ALTER TABLE {search_dataset} ADD reindex int NOT NULL default '0'
ALTER TABLE {search_index} DROP INDEX from_sid_type
Failed: ALTER TABLE {search_index} DROP fromsid
ALTER TABLE {search_index} DROP fromtype
ALTER TABLE {search_dataset} DROP INDEX sid_type
CREATE TABLE {search_node_links} ( sid int unsigned NOT NULL default '0', type varchar(16) default NULL, nid int unsigned NOT NULL default '0', caption TEXT NOT NULL, PRIMARY KEY sid_type_nid (sid, type, nid), KEY nid (nid) ) /*!40100 DEFAULT CHARACTER SET UTF8 */
UPDATE {search_dataset} SET reindex = 1
ALTER IGNORE TABLE {search_index} ADD UNIQUE KEY sid_word_type (sid, word, type)
ALTER IGNORE TABLE {search_dataset} ADD UNIQUE KEY sid_type (sid, type)

I must say that I use VFS and I ran the SQL as suggested in admin/settings/search before I applied this patch.
Before the patch as well, as my crons were still running (thousands of nodes to index), I wanted to make sure I had no duplicate words. So I used the SQL query a second time. As I had two sid indexes, I deleted sid_2, which had no effect on my site (no cache). I hope this is not a problem for this patch.

My watchdog shows, on node views:
Cannot modify header information - headers already sent by (output started at drupal/index.php:33) dans drupal/includes/common.inc à la ligne 319.

EDIT: by the way, a search in the default search bar (not the block) redirect to a clear server cache page (or the node in which I pasted the PHP code to do so). A second attempt in this search bar wipe the URL to http://. But a null value redirects correctly to the search/node page and searches here are normal.
- I get (too) many messages that the "cache is cleared".
- In admin/settings/search, the "Indexing status" (translation from French) indicated first
100% of the site has been indexed for quite a while. I couldn't understand cause my crons and search settings weren't able to index so much in so few time.
But it finally says now 0% of the site has been indexed. xxxx elements remaining. I did nothing to get that. It's gonna take a while.
- On a cron run success message, I got dozens of error messages like this one exactly on the same page:
warning: Invalid argument supplied for foreach() in drupal/modules/cck/fieldgroup.module on line 394.

Hope this helps.

I'm on a Beta site as well, but I'm not sure this patch is safe enough. (EDIT) I'll wait before I downgrade and hope something will be fixed/commited soon... Keep it up

Arsène

Log in or register to post comments

Comment #144

damienmckenna

TN, USA

commented 11 September 2008 at 18:59

I solved this problem, I had applied my #291854: Turn off Redirection on teaser view patch to the wrong part of search.module which stopped it redirecting to the search results. Doh.

Log in or register to post comments

Comment #145

doc2@drupalfr.org commented 15 September 2008 at 08:44

I applied the patch at the right place and it seems to work. But what about the wrong place where it was implemented...
Could you re-roll the patch please?

Log in or register to post comments

Comment #146

doc2@drupalfr.org commented 15 September 2008 at 12:39

I have no more problems appearing on cron runs.

But, each time a single node is saved/updated, my watchdog logs dozens of error messages such as this one:
Duplicate entry '3124-photographie-node-0' for key 1 query: INSERT INTO search_index (word, sid, type, score) VALUES ('photographie', 3124, 'node', 1) dans drupal/includes/database.mysql.inc à la ligne 172.

This line 172 is trigger_error(check_plain(mysql_error($active_db) ."\nquery: ". $query), E_USER_WARNING);

Log in or register to post comments

Comment #147

catch

he/him

English

commented 15 September 2008 at 15:15

This is the duplicate entry issue: #218403: Duplicate entry errors in search indexer

Log in or register to post comments

Comment #148

doc2@drupalfr.org commented 8 October 2008 at 13:52

Status:

Needs work

» Needs review

EDIT: Most of the topic is about d5.

ORIGINAL MESSAGE:
_{Thank you Catch. Any quick hint for a d5 fix? Cause your topic deals more about d6... and it's quite complicated. Please, could you point out any d5 relevant comment in your quoted topic?}

Best Regards,
Arsène

Log in or register to post comments

Comment #149

vkr11 commented 22 September 2008 at 18:13

Subscribe.

-Victor
http://drupalsearch.org

Log in or register to post comments

Comment #150

doc2@drupalfr.org commented 8 October 2008 at 13:50

Status:

Needs review

» Needs work

So, all patches are now applied for Views_FastSearch. I wipped my search index and re-indexed my 8000+ nodes site. 3rd time now. I tried to simplify my problem:

Each time I did, my index status from admin/settings/search says, at the very end of the index process:

Cron ran successfully
Indexing status
99% of the site has been indexed. There are 2 items left to index.

Neither another manual cron-run nor an automated one could get them indexed.

Now, I have 2 search indexes, and '2 items left to index' for ever in the latest one, while the process was about enhancement.

What about the following message on the admin/settings/search page?
'The search index is not cleared but systematically updated to reflect the new settings. Searching will continue to work but new content won't be indexed until all existing content has been re-indexed.'

Many thanks in advance for your answer, and for all the work you've already done.

Arsène

Log in or register to post comments

Comment #151

damienmckenna

TN, USA

commented 22 October 2008 at 19:31

Status:

Needs review

» Needs work

[quote]Now, I have 2 search indexes, and '2 items left to index' for ever in the latest one, while the process was about enhancement.[/quote]

Do you have content types that are not being indexed (using the Search Config module)? Try one of the queries listed earlier in the discussion to find out which nodes are not indexed and see if there is a reason.

Log in or register to post comments

Comment #152

naught101 commented 13 December 2008 at 07:42

subscribe. I've a hunch this is killing my server...

Log in or register to post comments

Comment #153

mpaler commented 12 January 2009 at 07:43

After reading http://drupal.org/node/139537 and now this thread, I must confess that I am thoroughly confused about this search indexing issue.

A couple questions to get us non-core coders up to speed:

1. Is search indexing "repaired" in 5.14?
2. Does core indexing with work alongside http://drupal.org/project/search_config?

I've just imported 30k vbulletin threads into my drupal installation and indexing fails without errors. Haven't switched off search_config module, however, it's critical to my current mission :(

Any updates are greatly appreciated,

Mike

Log in or register to post comments

Comment #154

damienmckenna

TN, USA

commented 15 January 2009 at 02:35

mpaler:
This has not been applied to v5, I believe the reasons being:
* it is a massive change to the search engine logic rather than just a "bug" fix thus out of the scope of a regular 5.x release,
* I believe it probably needs more testing.

That said, it has been working on skinet.com quite well since September 2008 :) though we've (hopefully temporarily, and due to management request) changed to using GoogleCSE though we're still indexing using this patch.

Log in or register to post comments

Comment #155

moshe weitzman commented 29 January 2009 at 15:49

The update function related to this issue, system_update_6036(), is horribly horribly ineffecient. The alter table statements are so slow when we drop columns and indexes. We really should have dropped all the data first IMO. Am using InnoDb on groups.drupal.org and seeing this. Live and learn.

Log in or register to post comments

Comment #156

doc2@drupalfr.org commented 4 February 2009 at 13:52

@ DamienMcKenna (#150 & 151): thank you, that was it: Search Config.
Works now, after resetting and reindexing.

Log in or register to post comments

Comment #157

mpaler commented 7 March 2009 at 18:50

Can anyone tell me which patche(s) above (post # would be ideal) to apply to get my 5.15 search indexing working? I've been stuck here:

0% of the site has been indexed. There are 40893 items left to index.

Edit: Applied the patch in #142 to 5.15 install (there were offsets but it applied successfully) and I can confirm that my site is now indexing.

Log in or register to post comments

Comment #158

douggreen commented 19 April 2009 at 12:13

Moshe, we intentionally didn't drop the data, with the idea that a full re-index would take a long time, and that we'd still have search results during the update. But I've run into the same problem that the new indexes take quite a long time and hang the update process. So maybe it was a poor decision, but it was discussed.

Log in or register to post comments

Comment #159

douggreen commented 19 April 2009 at 12:16

Status:

Needs work

» Needs review

Status	File	Size
new	146466.patch	25 KB

Attached is an updated 5.x patch per #437998: Search score normalization incorrect when node access returns multiple rows. I think that this is the solution, but it needs some review.

Log in or register to post comments

Comment #160

drumm

he/him

NY, US

commented 29 April 2009 at 18:57

Status:

Needs review

» Closed (won't fix)

Large backports won't go into Drupal 5 unless there is a really good reason.

Log in or register to post comments

Comment #161

heliod commented 22 October 2009 at 17:10

Tried to install this patch on my server (5.10) and the result was that immediately began seeing SQL errors in everything connected to Search or Reindex.

Log in or register to post comments

Comment #162

lias commented 1 March 2010 at 20:11

subscribe, using v 5.21

Log in or register to post comments

D6 search refactoring (backport to 5)

Comments