Index content when created

catch - June 27, 2009 - 16:36
Project:Drupal
Version:7.x-dev
Component:search.module
Category:task
Priority:normal
Assigned:Unassigned
Status:needs work
Description

Search module should add content to the search index when it's created by default. This avoids search being 'broken' on new sites when cron hasn't been set up (or is running more infrequently than an admin is testing search). I added a checkbox to the search settings form, defaulting to on, for sites which want to disable this - since with a big search index it could make node creation very slow. Node updates are still done on cron - we just want content added to the index in the first place so it shows up at all.

AttachmentSizeStatusTest resultOperations
search_index_content.patch4.51 KBIdleFailed: 11477 passes, 0 fails, 2 exceptionsView details | Re-test

#1

System Message - June 27, 2009 - 16:50
Status:needs review» needs work

The last submitted patch failed testing.

#2

catch - June 27, 2009 - 18:05
Status:needs work» needs review

search_index() had a hidden bug after body as field -looks like it's missing some test coverage (this patch at least runs the code through).

AttachmentSizeStatusTest resultOperations
search_index_content.patch3.83 KBIdleFailed: 9010 passes, 65 fails, 1 exceptionView details | Re-test

#3

catch - June 27, 2009 - 18:13

webchick reviewed in irc.

1. remove dpm($node), whoops.

2. Convert the broken search query to dbtng while we're in there.

Otherwise unchanged.

AttachmentSizeStatusTest resultOperations
search_index_content.patch3.81 KBIdleFailed: 11825 passes, 1 fail, 0 exceptionsView details | Re-test

#4

Dries - June 27, 2009 - 18:39

This patch feels a bit like a one-off.

With this patch, some people might think that they no longer need cron, but the reality is that they still need cron to handle node updates, new comments being posted, taxonomy terms being updated, etc. We _really_ want people to setup cron for search to work correctly. I wonder if there are better ways to educate users by that.

Maybe we should have a very basic poorman's cron in core instead?

#5

pwolanin - June 27, 2009 - 18:44
Status:needs review» needs work

While this is in general a fine idea, this patch totally breaks the framework model of the search module, which basically says that search module itself knows nothing about what kind of data is being indexed by which modules.

However, was discussing the search module in D7 with webchick and she preferred to be able to have search module choose which module's have active search vs. splitting node and user search into separate modules. That makes it hard to delegate this operation to node module.

#6

webchick - June 27, 2009 - 18:47

@Dries: There's an issue for poormanscron in core at #331611: Add a poormanscron-like feature to core. There were some pretty compelling cases over there not to include it, including from Moshe, but these might not be insurmountable.

#7

JacobSingh - June 27, 2009 - 19:06

I'm on the fence about it.

Ideally, Drupal wouldn't need to run cron, and still function, but as Dries says, it is essential. One of the disadvantages of indexing as you save content is that it can introduce load to end users which is not needed. For instance, consider a digg type site which is getting 100 comments a minute. If that node indexes (with all of those comments) every time it is saved, it's going to hurt.

Plus, it puts too great a responsibility on the site to be stable. Since search indexing is asynch, if some part of saving (but not saving itself) fails before search runs, that node would presumably never make it in.

ON the flip side, cron time indexing is slow, and doesn't respond instantly to unpublishing, etc unless a nodeapi, or whatever api hook is hit.

#8

catch - June 27, 2009 - 19:42
Status:needs work» needs review

@Dries, we already have a cron warning if it's not run after 24 hours. However, even with the poormanscron in core patch, search index would run every three hours. That's a long time for an admin to wait to try out search module.

Webchick's primary concern with enabling search by default in the other issue was that users would 1. install drupal 2. post content 3. search for content 4. post a support request in the forums when that didn't work. If cron runs every 3 hours, one hour, even 30 minutes, that's plenty of time to run into trouble minutes after installing Drupal. So this is about instant gratification in general, not about compensating for lack of cron runs - and the reason I restricted this only to new content.

If we want to add a search box to the 'find content' screen - then again, we need quicker turnaround times - Post my page, lose it, go to 'find content', search, empty result, big fail.

@pwolanin, it doesn't break any search API cleanliness that's not already broken - see http://api.drupal.org/api/function/search_nodeapi/6. That issue needs fixing much more centrally than this one, so I'm marking this back to CNW.

@JacobSingh - please read the patch. This is done on hook_node_insert(), not hook_node_update() - so it's not going to run every time a comment is saved, just the initial node creation so that /something/ is in the index when you immediately try to find that thing you just lost. There's also an option to turn it off which is recommended for high traffic sites.

#9

gpk - June 27, 2009 - 20:12
Status:needs review» needs work

Interesting, I've been experimenting with immediate search index updates on node submit, via a custom action: http://drupal.org/node/216508#comment-1724668.

A few comments:

- search indexing may require a 2nd pass (which generally happens the next time search_cron is invoked), because the bottom of http://api.drupal.org/api/function/search_index/7 may http://api.drupal.org/api/function/search_touch_node/7 various nodes. However, if per #9 the aim is merely to have *something* in the search index then maybe we can live with that. But is the fact that updates to the node (whether comments or directly to the node itself) don't get indexed immediately going to cause more confusion?

- this would likely cause problems with content permissions module since the node content in that case will depend on the current user; simple fix - turn this off. Poormanscron has the same problem, as has the run-cron link on the status report page #431776: Cron should run as anonymous when invoked via the run-cron link on the status report page.

- @2: are we saying that the redirect problem with links to uncacheable nodes has gone away? Or perhaps it was useless anyway since search would in any case take cron down when indexing a PHP node with a drupal_goto() in it?

- @4: as webchick notes, poormanscron in core #331611: Add a poormanscron-like feature to core has currently got the thumbs down. Granted there are technical and usability issues, but there also seem to be 2 perspectives in that debate - those who are thinking from the perspective of "Drupal must be great for big sites" but also the "Drupal must be great for small/blog sites" point of view. 2 different audiences, 2 different requirements. Question: does Drupal need to be great for both audiences? If so, can we manage to achieve it while keeping both audiences happy?

#10

gpk - June 27, 2009 - 19:54
Status:needs work» needs review

Cross posted..

#11

catch - June 27, 2009 - 20:00

On updates - we really can't index every node on update as well because that's going to get slow (especially with lots of comments, that takes ages) - by the time you're updating content, then searching for it, then you've probably been using Drupal slightly longer so should hopefully have found your way to search settings and/or setting up cron by then. I really think it's fine to pre-optimize this just for brand new content and leave the rest as is - it's balance between the immediate feedback and slowing sites down. And it's easy to add more options for this from contrib (I think there's already a Drupal 6 module which does this for new and updated content).

#12

gpk - June 27, 2009 - 20:07

@11: can easily distinguish between addition of comments and updating of the node itself... http://api.drupal.org/api/function/hook_comment_insert/7 vs. http://api.drupal.org/api/function/hook_node_update/7. The challenge here is I think to get round the initial 'big fail' without causing confusion for the user on day 2 (why do some things appear in the search index immediately but not others).

#13

catch - June 27, 2009 - 20:17

@gpk, hmm. That's a good point - I'm still worried about performance though. If I add a node, it will be indexed once with or without this patch. If I add a comment, 50 people could add a comment before the next cron run, resulting in 50 times as much work done by the server. It's also less likely that admins are posting comments to their own content and searching for the contents of the comments than it is for nodes, although I agree it's an extremely fine line.

Attached patch uses register_shutdown_function() for node_index_node() and search_update_totals() - this is in line with the logic in search_cron(), and I think it should partially allay Jacob's concerns about node saving getting interrupted.

AttachmentSizeStatusTest resultOperations
search_index_content.patch4 KBIdleFailed: Failed to apply patch.View details | Re-test

#14

gpk - June 27, 2009 - 20:21

>50 people could add a comment before the next cron run
Sounds like a high traffic site to me ;)

#15

catch - June 27, 2009 - 20:26

Agreed, added that to the patch.

AttachmentSizeStatusTest resultOperations
search_index_content.patch4.44 KBIdleFailed: Failed to apply patch.View details | Re-test

#16

catch - June 27, 2009 - 20:34

This time in a less broken way.

AttachmentSizeStatusTest resultOperations
search_index_content.patch4.81 KBIdleFailed: Failed to apply patch.View details | Re-test

#17

TheRec - June 28, 2009 - 01:41

The patch does what is advertised so far. Tried to reindex multiple times, tested with and without the checkbox enabled and all works as expected. (On a side note... the sentence "100% of the site has been indexed. There are 0 items left to index." is descriptive, but in terms of usability, the user does not know how many items were indexed.)

Indexing content when it is created is a good idea, but this might give a false impression to users that their content is generally indexed on-the-fly and make them think that this will be done on update too, which is not the case (and, as stated before, "should not" or we would end up with huge performance issues). I think that the same people who follow the path described by catch ("1. install drupal 2. post content 3. search for content 4. post a support request in the forums when that didn't work") will just see their problem be delayed until they realize that their content is not updated in the index (because they did not run cron on a regular basis) when they modify it (or it gets modified with comments, etc.).

In all cases, the users who enable search module should be warned that if cron is not ran on a regular basis, their content will not be indexed... or if this patch gets committed that it will be indexed upon creation and not updated (unless cron is ran). We should also give a scope of what is indexed (content, fields, comments, etc.), because this is not clear either and somewhat leave the users in obscurity (unless they read the documentation).

#18

System Message - July 2, 2009 - 11:00
Status:needs review» needs work

The last submitted patch failed testing.

#19

catch - July 2, 2009 - 11:38
Status:needs work» needs review

Re-rolled.

AttachmentSizeStatusTest resultOperations
search_index_content.patch3.94 KBIdleFailed: 11560 passes, 2 fails, 0 exceptionsView details | Re-test

#20

System Message - July 2, 2009 - 11:55
Status:needs review» needs work

The last submitted patch failed testing.

#21

Bojhan - July 17, 2009 - 11:19

So, spend some time looking at the code - it looks like its outdated hence why tests are failing. Other then that, I see no UX reason why this shouldn't be done.

#22

catch - July 17, 2009 - 16:30

Hmm, sadly not stale. I have no idea why the search ranking tests or upload are failing.

#23

catch - July 31, 2009 - 20:55
Status:needs work» needs review

Scaling back to just nodes.

AttachmentSizeStatusTest resultOperations
search_index_content_5.patch3.19 KBIdleFailed: 11816 passes, 2 fails, 0 exceptionsView details | Re-test

#24

System Message - July 31, 2009 - 21:25
Status:needs review» needs work

The last submitted patch failed testing.

#25

Damien Tournoud - September 20, 2009 - 14:40
Status:needs work» needs review

The previous patches failed because the node was indexed before all the hook_node_insert() implementation had the time to kick-in (especially the upload module one). Moved the indexation to a shutdown function instead.

AttachmentSizeStatusTest resultOperations
504012-index-on-creation.patch3.05 KBIdleFailed: 13657 passes, 0 fails, 72 exceptionsView details | Re-test

#26

pwolanin - September 20, 2009 - 14:45

Seems like this is even less necessary with the poorman's cron in core?

#27

catch - September 20, 2009 - 16:03

The idea here is that you can enter content, then immediately have it return in search results - mainly if we replace the filters at admin/content with a search box, poormanscron wouldn't help with that since it's every three hours rather than immediate. So it's two different things IMO.

#28

Bojhan - September 20, 2009 - 16:50

Wow, great - some traction on this topic! I think catch explains it clearly, from almost all of the observations done where the content administration page is flooded (many nodes) it becomes nearly impossible to filter correctly. I spoke about this big usability problem last year, I think its very fundamental to how we should approach filters.

#29

pwolanin - September 24, 2009 - 12:39

I'm a bit concerned that anyone trying to do a mass import will have their site crippled by this code - any way to index a max # of nodes in one page load this way?

#30

catch - September 24, 2009 - 13:14

@pwolanin - there's an admin option to switch this off, I imagine sites doing mass imports wouldn't want this enabled anyway.

#31

pwolanin - September 24, 2009 - 23:25

@catch - wouldn't want is different than realizing it's an option. It's an edge case, but it would be nice to at least respect the same # per page load as per cron run.

#32

Bojhan - September 25, 2009 - 05:07

I think we need to understand that its not the small movement of beign able to search fo a specific item, but current interactions that we can increase tremendously. For example, menu would be able to have a search box. instead of having to remember the url for every menu item.

#33

pwolanin - September 25, 2009 - 11:23

@Bojan - what are you talking about and how does it relate to this patch?

#34

Bojhan - September 25, 2009 - 21:55

@pwolanin I am kind of getting dissapointed that so code to feature freeze we are getting stuck up on these issues, but basicly as I layed out. The current interaction that you need to know the node url to create a menu item is somewhat broken.

However I am not sure how this is all related to this issue, and why its beign held up. I dont see it beign an edge case that you are trying to find content in less then 3 hours after you have created it. But if node import is the problem, then lets fix it. I dont see any outcome of that hurting the UX at all.

#35

webchick - September 29, 2009 - 16:55

Bojhan asked me to chime in here.

Poormanscron has fixed a number of cron-related usability problems, including peoples' watchdog entries ballooning to 4,000 GB and bringing down the site. However, the usability issue still remains of when people create a piece of content and immediately lose it, they can't find it through the search box (at least until cron has reset).

What I don't like about this fix though, and believe Dries said the same up in earlier replies, is that it couples search module and node module together, in a way that contrib cannot replicate. It's a one-off for node module and doesn't help any other module who thinks its content ought to be instantly available.

Is it possible to achieve the same results in a more generic manner? For example, node.module triggering that cron needs to be run again on insert/update, so that poormanscron would perform the action in the background when they're redirected to their view page?

#36

JacobSingh - September 29, 2009 - 17:11

I totally get how this helps with the obvious problem of time lapse and finding new content.

Couple things to sort out.

The deluge:
Indexing is a dicey operation, and can be quite a drain on the system.
From our experience managing indexing / querying of tons of Drupal sites, there is an issue with indexing on nodeapi. With ApacheSolr, we index immediately in the case of a delete or unpublish (for obvious reasons). But when people do massive deletes or massive inserts, it sends a massive amount of index requests which can be a problem. This would likely kill core search.

Also, take into account a site with a lot of writes, like a blog post that gets dugg or something...

Solution?
I can't think of any other way to do it than to implement a leaky bucket. So the search indexer can handle 5 indexing requests per minute, and if it gets to 10, it will start putting them in the queue to run at cron, if it gets below 5 again, it will start processing them. It adds a check on every page load, but it's likely a small check.

Plugable
Also it needs to be plugable as @webchick mentions. Sadly, we never got a massive overhaul of the search API into D7 as we (speaking to myself as a search guy) should have done. We certainly cannot have node module have anything to do with search module.

Solution? - No, but an idea
I'm thinking in terms of the concept of Entities which we have in fields.

So the psuedo code is:

$entity being created?
search_index($entity);

which actual handler(s) indexes the content and people use to search is defined elsewhere, in a variable, or whatever, and it would be specific per entity. Perhaps they would have a signature something like:

(need not be OOP, just easier to psuedocode)
ApachesolrIndexer ext SearchIndexer
+indexableEntities()
+index($entity)
....

where as entities would have to explain themselves in a common format to Indexers.

Best,
Jacob

#37

catch - September 29, 2009 - 17:13

Search module and node module are already coupled together in horrible, horrible ways, the coupling here is a symptom of that, not its cuase.

Also if other modules are maintaining their own index, and have to provide their own equivalent of node_index_node() to do that, then nothing stops them calling that directly when they insert content. They won't have search module do it for them, but search module doesn't provde an equivalent of hook_node_update_index() for them either.

#38

catch - September 29, 2009 - 17:24

I really think refactoring search module is out of scope for this issue - are we going to deal with http://api.drupal.org/api/function/node_search_execute/7 here too? I agree it's horrible, but needs to be dealt with in its own right since that's not the problem this patch is trying to solve.

If a post gets dugg - then surely the issue is a lot of reads rather than a lot of writes? We're not indexing when comments are posted in the latest patch - so once it's posted in the first place, that's it.

On bulk updates, search_node_index() could either keep a static count, or like Jacob says a leaky bucket across requests - we don't really have a convenient place to store that kind of information across requests though off the top of my head, well, maybe flood control would do it.

#39

Damien Tournoud - September 29, 2009 - 17:31

What I don't like about this fix though, and believe Dries said the same up in earlier replies, is that it couples search module and node module together, in a way that contrib cannot replicate. It's a one-off for node module and doesn't help any other module who thinks its content ought to be instantly available.

This is not really accurate. The search module and the node module *are* tightly coupled together, in the sense that the search modules indexes *nodes*. I already explained in a few other places that this is not wrong per se. Three pieces are, at minimum, required to make any search solution work:

  1. an indexer
  2. one or more data providers
  3. one or more search UIs

The data indexer can be cleanly separated from the other two pieces. It currently lives in the search module.

The data providers provide data from the entities (node, user, etc.) to the data indexer. They cannot be logically separated from either the entity nor the data indexer. In core, the data providers live in their entity (see node_search()). In contrib (see Apachesolr), they are implemented by the search solution itself. None of those architecture make more sense then the other.

The search UIs are kind of a grey area. It could be theoretically possible to cleanly separate them from the indexer (if the data store has a clean query interface), but not from the entities.

All this to say that this patch doesn't add anything to the coupling between node and search.

#40

Xano - October 4, 2009 - 14:53

This functionality should be in Node and *not* in Search. Let's use Search only for indexing and searching. Other modules should expose themselves to Search so we don't have functions that are specific to Node, User, etc in Search. This patch also stabs #375397: Make Node.module optional in the back.

#41

Bojhan - October 4, 2009 - 15:39
Status:needs review» reviewed & tested by the community

After trying to persuade many into looking at this issue, I am going to be bold and marking it RTBC - perhaps to get more eyes, but mostly to get it in. As Damien explained in #39 are webchick's concerns in-exclusive to the implementation direction - and do not increase the coupling of node and search. It seems a lot of arguments, are more related to search.module its weirdness then this actual issue, I understand there is a large gray area - but please don't let that hold up this improvement.

This is important to get in, to fix a major critical (really that critical) issue in creating menu's (searching on a node title, when creating a menu item)

#42

webchick - October 5, 2009 - 02:24
Status:reviewed & tested by the community» needs work

Xano makes a good point. Also, the way this is implemented (I think) would result in inconsistent user experience between core search and something like Apache Solr, unless every search module implemented this same code in it (or am I wrong?).

I agree that Jacob's approach is ultimately the best, but too heavy for this late in the development cycle. Is there a middle-ground here, like a more minor "API clean-up" style of refactoring in search.module that can be done to allow this code to be exposed in node rather than search module?

Also, I still need to clear with Dries to see whether he's up for one-offing this behaviour in general, since he was opposed in #4, though those reasons are largely moot now that we do have poormanscron in core. I can see the obvious usability benefits this patch has. But I do not want to sacrifice software architecture for usability.

#43

catch - October 5, 2009 - 03:23
Assigned to:catch» Anonymous
Status:needs work» reviewed & tested by the community

I'm slightly perturbed that #38 and #39 have been ignored when marking this CNW, please point out why those comments don't hold up. Xano also admitted on irc that he hadn't read the issue, and that he more or less agreed with me and Damien until there's a significant refactoring after reading the whole thing. This is my last post on this issue, I'm frankly annoyed that stuff like the toolbar gets in with critical bugs all over the place., but small patches fixing critical usability issues get held up on philosophical objections with zero code backing them up.

Oh, and we can do this in node.module by sticking a module_exists('search') in node_save(), that'll be pretty.

#44

webchick - October 5, 2009 - 03:50

If you can point me to code in menu.module that does an if (module_exists('toolbar')) and/or implements hook_toolbar_whoozit_whatzit(), I will happily eat my misgivings about hard-coding (more) node assumptions into search module to fix a usability issue. Unless I'm terribly mistaken, there is nothing of the sort in there, because it's widely understood that the menu module is an API-related module that should be de-coupled from any specific implementation.

So is search. The fact that it is not currently as de-coupled from node as it ideally needs to be doesn't change that fact.

Is refactoring search out of scope for this issue? Of course. But that doesn't mean it couldn't be a prerequisite for it. We are in API clean-up phase right now, after all. I asked a very simple question about whether or not there was a middle ground which could be explored between gutting the entire thing and de-coupling this one area so that node could implement its own stuff. I didn't say that the lack of this was a complete no-go for this patch, it was simply a question.

Committing this patch anyway would mean committing something with a known flaw in its implementation. Not only does this give general "ooky" feelings to me as a core committer, but it also is totally insufficient for solving this very real usability problem in contributed modules. For example, I'm sure that Advanced Help module would also love to have its content indexed as soon as it's created, but this patch leaves it with no option apart from "modify search module directly, and add these few lines in." :\

So, the question: what would the most bare-bones modification to Search module's APIs to allow both node and Advanced Help module to index their content immediately upon insert/deletion/update? Is it something that could fall within scope of API clean-up? Or is it D8 material? Once we know the answer to that, we can move back to discussing whether we commit this solution anyway.

#45

webchick - October 5, 2009 - 03:51
Status:reviewed & tested by the community» needs review

#46

catch - October 5, 2009 - 04:57

node.module is an API module too, and unlike search is currently required, and the order of responsibility currently lies with optional modules to implement any hooks. This is why taxonomy module implements node hooks, and not the other way around - only field API and a generic query builder in core will put an end to those hook implementations / assumptions, a job which belongs to neither taxonomy nor node, and relies on two third party APIs to have proper decoupling. For an example of d7ux patches putting their own specific stuff in API modules, see the latest overlay patch, which adds overlay-specific markup to node module regardless of whether overlay is enabled or not, instead of implementing hook_form_alter(). That code has been in the patch for nearly six months now. toolbar, rather than putting it's own stuff in menu module, or attempting to fix menu module's API, did it's own re-implementation of menu's API functions instead, in the process introducing both bugs and performance issues all over the place, which are only recently starting to get cleaned up.

Back to this patch, there are two ways to index content with search module:

1. Call search_index() within a hook_update_index() to get your content indexed on cron. In fact hook_update_index() is just your own hook_cron() implementation, without the need to do if (module_exists('search')) in there, since it doesn't offer any index updating functionality of its own.

2. Call search_index() at any other time to index content whenever you want.

To index new help as soon as it's available, advanced_help.module could implement hook_modules_enabled() and call search_index() within there for any newly enabled modules, that's equivalent to the node_save()/module_exists('search') option here. So search.module already provides this ability, to the extent that it provides any API for indexing content at all.

However, what's being asked for here, is that search module offer a hook which is run on any possible system event where you might possibly have new indexable content added - to do that, it would not only have to duplicate hook_cron() as it currently does, but also duplicate a few dozen other hooks as well. If you can explain another way it could offer such a service that'd be great, but I don't see it.

We have two modules here - either we put assumptions about search module (which is optional, and has a far less rich API than node does) inside node, or we put assumptions about node inside search.

To avoid either, you need a search_node_sql_index.module which implements both search and node hooks to maintain the {search_index} tables, and would do the trick here on hook_node_insert() - that's not going to get done in 10 days though.

/me fails to not post on this issue again :(

#47

Xano - October 5, 2009 - 05:47

I'm with catch and Damien that a small search_node_insert() outweighs the downsides in this case. There are plans in the works for Drupal 8 that will make this code (and a lot more) unnecessary for D8, but for now we'll at least have this feature/UX improvement.

#48

kika - October 12, 2009 - 20:42

I'd strongly support getting this in. There is a well-documented usability problem: search is broken in early and most tender stage of people first experiencing Drupal. catch has given a thorough explanations why this change do not make things architecturally worse than it is now. We do not have time to rewrite search API for D7 in any case. So -- let's get this in, fix search API in proper way in D8.

#49

System Message - October 16, 2009 - 08:25
Status:needs review» needs work

The last submitted patch failed testing.

#50

catch - October 16, 2009 - 13:57

Haven't looked at those exceptions here.

By the way, while everyone was picking on search_node_insert() being such an API horror, we forgot to mention http://api.drupal.org/api/function/search_node_update/7

Which is almost exactly the same anyway except for the 'instant' bit.

 
 

Drupal is a registered trademark of Dries Buytaert.