#2232253: Make the indexing batch process work introduced a default tracker which performs write operations during index for keeping track of indexed items. This should be optimized so that keeping track of indexed items has no heavy impact on large indexes.

Comments

drunken monkey’s picture

Priority: Major » Normal
Issue tags: -sprint

I think that's neither "in focus"/"sprint" nor "Major" – this can be done at any time, in beta, RC or even after the initial stable release. We can either improve the default tracker, or add a new, optimized one (if there is some trade-off involved), or even decide we are happy with this one and wait for people to complain (or tell them to put it into a different project).
In any case, getting the basic framework/API stable has priority now, everything not impacting that (or making testing/development harder in some way) can wait.

However, if you really want to work on it and can't find something else you'd want to do, you can of course also work on this. But what optimization exactly would you implement, and how (in the existing or in a new tracker)?

freblasty’s picture

Assigned: freblasty » Unassigned
drunken monkey’s picture

Ah, I see, so this is about Nick's complaint in #2232253-40: Make the indexing batch process work, right?
OK, then let's continue the discussion here.

Well, we don't want to keep that 0 thing. Let's make it conform with everything. Since we have a FIFO now and you are tracking the new items with their real changed value we do need the pointer don't we? Or we have to start writing 0's again for each item we index. I don't like that approach :)

Ah, OK, then I misunderstood you in #2232253-27: Make the indexing batch process work. I thought you had now agreed with me that using a pointer for keeping track of indexing status doesn't offer the same flexibility, clarity and fault-tolerance as our old approach.
I don't really want to rehash my arguments from there, but I really think our current implementation is more than "good enough" for now and the pointer solution causes more problems than it solves. Also, with the pluggable tracker system, anyone out there (including us in this module) can provide a new tracker implementation at any point. So if, out in the wild, companies really start to complain about the additional DB writes for this, we can still easily implement this. For now, I'd rather concentrate on the framework and API itself, getting Views integration, facets and (last, but not least) the service classes ready, etc.

Nick_vh’s picture

That's also why I agreed to have it as a follow up so we don't block the framework. I don't necessarily need to see the pointer in there but setting 0 to the changed column when there's an update just doesn't make sense in a system like this. It's a hack, and it works. But it's not the most correct way of doing things. The pointer also has it's flzws as people can change it unwillingly in their code and mess up the tracking system.

It's a good issue to think more about it and as we extend the tests we can think of changing the system to be better or add another tracker as you said.

Doing a pointer + having the priority queue was not working in combination. But there was nothing wrong or broken with the pointer system if all you have is a fifo approach.

drunken monkey’s picture

I don't necessarily need to see the pointer in there but setting 0 to the changed column when there's an update just doesn't make sense in a system like this. It's a hack, and it works. But it's not the most correct way of doing things.

None of this is really an argument, these are just axioms. Why shouldn't it be correct? The changed column reflects the time the item was changed since it was last indexed – 0 meaning that it wasn't. (Though I'd have to admit NULL would be cleaner here, don't know why we're not using it.)
If you want cleaner, then we can re-introduce the status column, I'd be fine with that, too. Would have at least one benefit (getting all three counts (indexed, remaining, total) in one query) and certainly make things a bit cleaner and hopefully clearer.

Doing a pointer + having the priority queue was not working in combination. But there was nothing wrong or broken with the pointer system if all you have is a fifo approach.

Yes, there was, as I discussed verbosely in #2232253-26: Make the indexing batch process work. None of these arguments (as far as I can see/remember) were targeted at a priority queue, they were all problems with just the normal functionality of a tracker we have defined.

freblasty’s picture

The changed column reflects the time the item was changed since it was last indexed – 0 meaning that it wasn't.

Actually its the otherway around. As items should be processed according to the added order we use a timestamp to determine that order. Once an item is indexed its marked with changed value 0.

drunken monkey’s picture

Actually its the otherway around. As items should be processed according to the added order we use a timestamp to determine that order. Once an item is indexed its marked with changed value 0.

I think that's what I wrote. It's the same as in D7, right?

freblasty’s picture

Indeed it's the same as in D7.

drunken monkey’s picture

OK, then that's what I meant, too.

Another thing, maybe OK to discuss here, maybe better in a new issue: Do we really want to catch exceptions in the track*() methods? This code should only fail in case of database errors anyways, so the chance of this actually happening (without coding errors, which we should fix then ASAP of course) is pretty small.
However, when it happens and there really was some action (e.g., a node got created) which succeeded, then we might want to roll back that initial action, too. Otherwise, we'd be in an invalid state, in which only a complete re-indexing would help (or manually looking through the logs for the action which failed to be tracked and adding it to the tracking table).

In general, I'd say that SearchApiExceptions should never leave the Search API ecosystem (the module itself or any module that explicitly uses it – e.g., for a search). However, catching \Exception is something completely different, since we don't know what kind of error we would hide there. And since it's in any case not an error in our domain, I'm not sure we should be responsible for handling it.

For now, in my #2230931: Supporting multiple item types per index code, I have just removed the boolean return values from those methods (as it's more or less impossible to react to those in a meaningful way anyways) but kept the try/catch.

What's everyone else's opinion about that?

Nick_vh’s picture

I agree with keeping the exceptions in the core of search api only. People working with the APIs should not be expected to deal with the exceptions unless they work really low level.

Nick_vh’s picture

We discussed that it would be good to split up the changed column with a changed column and with a state column. The state would have just 0 and 1. 1 for needing to be indexed and 0 for the contrary.

This would make it easier to say, index everything from moment X again.

Nick_vh’s picture

Committed a change so the status column is now present.

drunken monkey’s picture

Project: Search API (8.x) » Search API
Version: » 8.x-1.x-dev
Component: Backend » Plugins
drunken monkey’s picture

Title: Default tracker needs to optimized » Optimize the tracker plugin (or add a new one)
Category: Bug report » Feature request

Not a bug.

legolasbo’s picture

Status: Active » Closed (outdated)

This issue has not seen activity in over 2,5 years. I am therefore closing this issue to clean up the issue queue. Feel free to re-open and update this issue if you feel this issue is still relevant and of importance.