Problem/Motivation

The Drupal.org issue queue is powered by Views and has a large amount of joins that could bring down the server if we don't simplify this or improve the logic. On Drupal.org this could be a huge performance win (since a lot of the queries, even if we optimize the default case, are still painful if you do alternate sorts or filters). It'd also simplify our infra and reduce DB overhead, since we'd no longer need the core search index for anything.

Historically we tried to figure out if Solr would be a solution for this problem. But the challenge to make this work are not small, and this would either require Solr 4.0 (which is not supported in any of the modules, Apachesolr nor Search API solr) or suffer from noticeable latency between updating an issue and that change being reflected in search results.

Solution

We are using Search API with Search API DB. The dev site is http://searchapi-drupal_7.redesign.devdrupal.org/project/issues/searchapi

To do:

  • Store single-valued fields in a common table, reducing JOINs.
  • Participants field, usernames who have commented on the issue.
  • Multiple, comma-separated selection for Project, Assigned, Author, Participants, and Issue Tags.
  • Autocomplete for Project, Assigned, Author, Participants, and Issue Tags. (Not Search API autocomplete's "did you mean"-like suggestions.)
  • Color table according to issue status.
  • Export everything to code.
  • #1832356: Figure out how to dereference option lists with Search API

Comments

dww’s picture

senpai’s picture

@DamZ, based upon your https://drupal.org/node/1699164#comment-6354532 comment, what is the approximate level of effort needed to implement something like this?

dww’s picture

Title: Consider a Solr backend for issue views » Port issue views to Search API so we can have a Solr backend
Version: 6.x-1.x-dev » 7.x-2.x-dev
Priority: Normal » Critical
Status: Postponed » Active
Issue tags: +solr, +project, +drupal.org D7, +80hr

nnewton and I sat down at the coder lounge yesterday to investigate the queries from the default issue views in D7. As we expected, thanks to Field API and per-table storage, the queries are absolutely punishingly bad. nnewton said it's a critical blocker to *not* use SQL-backed views for the issue queues. D.o will melt.

So, we have to do something here. Based on numerous conversations, by far my preference would be to port all the default views provided by project_issue to depend on Search API. Then, sites can use whatever backend is appropriate for their needs, including a SQL-table based denormalization thing (Database Search). The only downside of using Search API is that we're going to need a separate index for the issues. But we already have that problem on d.o since we're separately indexing the issues with the core search index for the SQL-based views text search to work. ;) So, that's a bogus reason not to do this.

I'm going to solicit feedback from some of the folks who were interested in doing Search API + Project* porting to get some more direction on this before we start coding, but I'm pretty sure this is the right path forward.

Based on a quick chat with Senpai, tagging this as a 2-week (80hr) critical task.

damien tournoud’s picture

I took this for a spin, so as to figure out a precise task list.

The good news is that it was relatively easy to put a prototype together. The bad news is that not everything magically worked :)

Here is the task list (I tagged the open issues with the "Search API on Drupal.org" tag):

damien tournoud’s picture

I just pushed the prototype to the 949372-search-api branch.

dww’s picture

Great, thanks! Haven't had a chance to look at any of this closely, but one quick comment based on #4:

except that Assigned itself is not an Entity Reference field

Uhh, sure it is:

  $field = array(
    'cardinality' => '1',
    'entity_types' => array(),
    'field_name' => 'field_issue_assigned',
    'module' => 'entityreference',
    'settings' => array(
      'target_type' => 'user',
      'handler' => 'assigned',
      'handler_submit' => 'Change handler',
      'handler_settings' => array(
        'project_field' => 'field_project',
        'sort' => array(
          'type' => 'property',
          'property' => 'name',
          'field' => 'none',
          'direction' => 'ASC',
        ),
        'behaviors' => array(
          'views-select-list' => array(
            'status' => 0,
          ),
        ),
      ),
    ),
    'translatable' => '0',
    'type' => 'entityreference',
  );
  if (!field_info_field('field_issue_assigned')) {
    field_create_field($field);
  }

It's just that it has a custom selector plugin (to do the magic of project-specific choices for assigned, given the per-project maintainer permission stuff).

damien tournoud’s picture

@dww: that was a brain fart. I meant "Submitted by" is not an entity reference field.

drunken monkey’s picture

There is already an issue for the comments proerty: #1414688: Add the comments property in entity_metadata_node_entity_property_info(). I tagged it accordingly and marked yours as a duplicate. It seems like fago is fine with committing it, it just needs a quick polish.

Figure out how to implement autocomplete for Assigned, Submitted by, Participant and Issue Tags (I suppose we could implement a generic solution for this in Entity Reference, except that Assigned "Submitted by" itself is not an Entity Reference field...)

Would this really deal with the form generated by Views? Or do you mean overriding the Views integration in the Entity Reference module?
In any case, a custom override will probably be necessary, at least for the “Submitted by” field. And probably will be rather simple to do.

Discuss if it's acceptable not to be able to sort by title

Otherwise we can always add a aggregated field for indexing the title as fulltext, and index the title itself as a string. (In case you didn't know that.)
In principle, I'd say sorting by title is no really useful feature, or only in rare cases. However, those cases exist, I guess – and also the table would look a bit off, with only one column not sortable.
(Also, when sorting by „Assigned to” remember that this will normally sort by the users' UIDs, not their names! If you want the probably expected behavior of sorting by name, the name is what you need to index, not the “Assigned to” field itself.)

Discuss if we should transform the advanced search for into a facetted form

Adding facets in one way or another would definitely be a huge plus! I can't really say if the impact on the performance would be acceptable, but I'd say that having facets in lieu of nearly all exposed form elements (except “Search” and “Status”, I'd say) would even be a big win for the normal search.
I'm not a UX expert in any way (as you're using the Search API, you're bound to have noticed), but that's what I'd rather use, I think.

grendzy’s picture

What are the benefits of Search API vs apachesolr_views?

dww’s picture

re #8: Thanks for all the input, very helpful!

- Submitted: I'm not sure why that matters much for the issue views. We don't currently render that in any of the views, neither for display nor filtering. Oh, whoops, except that's one of the filters for advanced search. Hrm. Yeah, that. ;) So I guess that's the issue -- we just need to make sure there's a way to filter by node author. Seems like that's *got* to be handled by Search API already, no?

- Sorting by title: I'm not too concerned about this. AFAICT, the only useful *sorts* for issue views are: 'last commented on' (default), 'priority', 'created', # of replies, and maybe version and category. Generally I'd say that category, version and assigned are much more useful as a filter. If it's a pain in the ass to be able to sort by title, I'm fine with dropping that (although I must admit I don't understand Search API well enough to know why this doesn't already work).

- Facets: I've never really understood the difference between facets and filters. All seems the same to me. ;) But yes, to the extent that Search API makes it trivial to get counts of issues matching a certain combination of filters, I think that'd be a huge win. I was under the impression that search engines like solr make the performance cost of facets basically negligible.

While we're talking about implementation details, the biggest thing I'm concerned about would be how to get this to play nicely with the "Follow" flag as described at #1828756: D7 views don't honor tracker and issue following. I think we're going to need a field in the Search API index for each issue with the uids of everyone following that issue, so that both the Your issues view and the "participant" filter on advanced search work (although that's a bit confusing, and we should probably keep a separate filter for "participant" vs. "following" or something). That'd also mean we'd need to reindex the issue for Search API every time a user (un)flags that issue. Ideally with little or no latency. ;) I'd rather discuss this at the other issue, but I wanted to raise it here to get it in everyone's mind.

re #9 and the benefits over apachesolr_views: Good question! Here's how I see it:

For me, as the maintainer of Project* for both the d.o and non-d.o cases, by far the biggest win is that (at least as I understand it all) I'd only have to maintain a single set of all the default views for project_issue, which would all use the search API views handlers. Then, sites can use whatever Search API backend they want (solr vs. SQL vs. whatever), none of the views have to be touched, and it all Just Works(tm). If we use apachesolr_views, we have to manually craft all the solr-specific views, and keep them in sync with the SQL-based default views, and then if someone wants something *else* (e.g. mongo, xapian, whatever), we'd have to either provide yet another set of views, or they'd be totally on their own, etc. Plus, even if we have apachesolr_views to *display* the issue data out of a solr index, we'd *still* have to write something like project_issue_solr.module that knew how to index all the issue-specific stuff for solr. So, we have to manually handle solr-based display, and we'd have to manually manage solr-specific indexing.

Basically, Search API seems like The Right Way To Do It(tm) from an architecture standpoint -- you write stuff once and let that API handle the abstraction of how all the different backends have to work, how things need to be indexed/denormalized, and how all the views handlers work. Not my problem. I'd rather keep it that way.

At least, that's the koolaid I've been offered. ;) Whether to drink it or not depends on how painful this is in practice once we actually start trying it. Thankfully, DamZ is willing and able to crank forward with a prototype, so I'm thrilled to just let him see how far he can get and re-assess when we have more concrete data.

Cheers,
-Derek

drunken monkey’s picture

So I guess that's the issue -- we just need to make sure there's a way to filter by node author. Seems like that's *got* to be handled by Search API already, no?

Of course that's handled – it probably just lacks autocompletion. And you have to remember to index and use “Author » Name”, not “Author”.

- Sorting by title: I'm not too concerned about this. AFAICT, the only useful *sorts* for issue views are: 'last commented on' (default), 'priority', 'created', # of replies, and maybe version and category. Generally I'd say that category, version and assigned are much more useful as a filter. If it's a pain in the ass to be able to sort by title, I'm fine with dropping that (although I must admit I don't understand Search API well enough to know why this doesn't already work).

Since fulltext fields are parsed and tokenized, we don't have the single string we'd need in the index to sort by the field. Therefore, you can have a single field either fulltext-searchable or sortable, not both. You'd have to copy the field with the “Aggregated fields“ data alteration to do that.

Anyways, I totally agree with you, most of the sorts aren't very useful, so we could just drop some. However, being able to sort by title isn't really a “pain in the ass”, just a few clicks and a bit of extra space consumption for the index.

While we're talking about implementation details, the biggest thing I'm concerned about would be how to get this to play nicely with the "Follow" flag as described at #1828756: D7 views don't honor tracker and issue following. I think we're going to need a field in the Search API index for each issue with the uids of everyone following that issue, so that both the Your issues view and the "participant" filter on advanced search work (although that's a bit confusing, and we should probably keep a separate filter for "participant" vs. "following" or something). That'd also mean we'd need to reindex the issue for Search API every time a user (un)flags that issue. Ideally with little or no latency. ;) I'd rather discuss this at the other issue, but I wanted to raise it here to get it in everyone's mind.

Ah, yes, that's an interesting issue … Including that data into the index wouldn't be much work, I guess. However, having to reindex the whole issue when a user (un)flags it would really be unfortunate. I can't see how we could avoid that, though. I can't really tell how much additional strain this would put on the server, though. Probably there aren't a lot of people subscribing to new issues, compared to the whole rest – but I might be mistaken there.

pwolanin’s picture

I think it would be an very bad architecture decision to use Search API in addition to Apache Solr.

nick_vh’s picture

Just to add on to this discussion, you should figure out if solr is even an option for this problem. Keep in mind that Solr (< 4) has a waiting time before it processes the items. If you create an issue, it will not show up in the issue list because it will need to wait before it is being sent to solr and then you need to wait before it actually processes it. This latency can get pretty high.

Imagine : A user creates an issue

wait for cron to pass by : 5 min (optimistic)
wait for solr to process : 2 min
wait for solr to replicate to the slave : 1 min

You could avoid the cron by adding it to the index directly when you do a node save, but I'm still seeing some thing being shoehorned in to place where it does not belong.

Please think this through before making hasty decisions. I also agree with pwolanin that adding Search API now, just for this functionality & with the arguments about solr I added above, is not the right way to go.

pwolanin’s picture

If table joins are an issue, maybe look at http://drupal.org/project/pbs instead of materialized Views?

dww’s picture

@pwolanin and @nick_vh: Thanks for the input on this.

Re: #12:

I think it would be an very bad architecture decision to use Search API in addition to Apache Solr.

I explained why I think this is a win in #10. Although I definitely respect pwolanin's abilities and opinions, the above quote doesn't do much to change my mind. Can you provide any justification, other than that we'd have to maintain two stacks/indexes, etc? I know we've been doing our own custom solr stuff ever since DamZ and I wrote project_solr module in Boston while we were porting d.o to D6, but both of us hate that module and have been trying to kill it ever since, while other folks seem to want to keep it alive for some reason I can't understand. ;)

Re: #13: Yes -- the latency is an issue (as I also wrote in #10). The 5 seconds for cron is bogus, we can have something churning through reindexing much faster than that. If there's additional latency once it gets to the solr server before the results are changing in searches, we can try to tune our solr cores to reduce that latency, too. nnewton got very high-end solr hardware since he knew (and wanted) that the site would rely on it more and more heavily to take some load off the DB. Already, the issue queue search pages are often between 2 and 10 seconds for a page load, and it's going to get dramatically worse if we keep trying to do this via "instantaneous" SQL-backed views as opposed to some kind of search index.

Re #14: pbs hasn't been touched in 3 years and has no releases at all, much less an official stable release. That's totally unworkable as an alternative here. If we're going to do denormalized SQL tables, I'd rather use database search as I linked in #3. At least that's at 1.0-beta4 and under active development and support. Besides, issue queues still contain some things that aren't just core fields that we need to be able to search + sort on (comment count, last updated, flag data for following and "my issues", etc), so we'd need to denormalize beyond even the field data to get anything resembling performant SQL queries.

Basically, if we're going to put any real custom effort into this problem, I'd rather see that going into reducing latency for a Search API + Solr backed solution. Everything else seems like continuing down a path that ties us to one possible backend, and even then, in a non-ideal sort of way that would require even more customization and effort on our part.

I continue to believe Search API is our best option here. I'm totally willing and ready to be convinced otherwise, but so far, that hasn't happened. ;)

Thanks again,
-Derek

nick_vh’s picture

@dww I wrote in minutes, you clearly ignored my statement of the solr replication and process time?

There is no way that in Solr < 4 you can tune solr to directly return you the committed docs. Not without completely breaking it. If you will rely on solr for this listing, I think we can expect that a new issue is seen almost directly and that will never be the case with solr.
I've seen many people try, and it's just not feasible right now. Unless drupal.org wants to make investments in upgrading to solr 4.0. But given the timeframe, that is also not a workable option.

pwolanin’s picture

I have to agree strongly with Nick here - the indexing and replication lag makes Solr unsuitable for listing pages where uses expect to see immediate updates.

damien tournoud’s picture

Solr 3.x should be able to handle a ~10s auto commit time, without a lot of issue (especially on our hardware). The near realtime feature of Solr 4.x is compelling to reduce the latency further, and I don't see why we couldn't update the cluster to this version.

nick_vh’s picture

http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Fr...

I'll try to explain why what you are saying is not plausible in a solr master-slave setup. And I assume drupal.org does have a master-slave setup.
A quote from this wiki

If you desire frequent new collections in order for your most recent changes to appear "live online", you must have both frequent commits/snapshots and frequent snappulls. The most frequently you can distribute index changes and maintain good performance is probably in the range of 1 to 5 minutes, depending on your reliance on caching for good query times, and the time it takes to autowarm those caches.

1 to 5 minutes, nothing even below that is recommended.

There is significant work in upgrading both solr modules to solr 4.0 and as I said before, then we need to prepare for this work and this won't be that easy given the timeframe and available human forces working on this.

pwolanin’s picture

@Damien Tournoud - Solr 4 needs a revised set of config files. We have never deployed them in production yet. Have you?

The NRT feature does NOT come free. I don't think you want to delay this d.o upgrade by figuring out how to run SolrCloud or otherwise doing distributed updates rather than master-slave replication.

Note also - the current d.o setup has Varnish in front of Solr, and I expect the TTL is in the 100's of seconds?

Again, I think this is not the right tool for the job in this case.

damien tournoud’s picture

All this confirms that Search API is the good solution for this. If Solr as a backend doesn't work in this particular scenario, we can easily switch to another backend (ElasticSearch and Sphinx come immediately to mind).

dww’s picture

Title: Port issue views to Search API so we can have a Solr backend » Port issue views to Search API so we can have a performant backend

We're having a good discussion about this in #drupal-infrastructure right now. Everyone agrees Solr is off the table. Now we're trying to hash out our other options. For now, I continue to agree with DamZ that Search API is the right solution on the project_issue side, and we can continue to argue about what backend makes the most sense.

nnewton’s picture

Here is the major problem query off the main issue listing (/project/issues): http://drupal.privatepaste.com/208f1d6a0a

I have not review it in detail, but this + the COUNT() is taking 8+ seconds.

-N

dww’s picture

I still haven't had a chance to update the issue summary, but I had a thought I wanted to capture for this thread:

#1836928: Materialized Views Search API backend

That seems ideal. Apparently MV already needs some work to do views integration. It could just implement the Search API stuff and let that talk to views as needed.

If we're going to deploy for d.o D7, we'd also need some kind of tagged release, preferably a 7.x-1.0.

Fall-back is we stick with the DB Search backend (perhaps use that for now for prototyping).

pwolanin’s picture

This final part of the where looks rather odd:

(node.status = 1 OR (node.uid = 94675 AND 94675 <> 0 AND 1 = 1) OR 1 = 1)

So, is it just the one single view for the issue listings that's the issue?

drunken monkey’s picture

Thanks for your input, Nick! It seems Solr is really not an option then, unless you want to loosen the realtime criterion …

Regarding MV, I can't really say I know much about this module, but what I've just read doesn't sound like it would fit into the Search API model very neatly. Since it defines both the data and how it is queried, in Search API terms I think it would have to be both an item type and a service class? Or just a service class, where the index settings determine what the materialized views look like – but that would be practically no different from the Database search backend. We'd probably also have to add fulltext searching capabilities on our own, since I wouldn't think MV supports that currently (other than through the DBMS's native method).
Also, it seems the module hasn't had a commit in more than a year and hasn't got more than two installs, both of which also aren't very good signs …

Using the normal Database search backend for now might really be the best option – I usually can't recommend this at all to larger sites, but it ought to be better than the normal Views queries.
And, as you say, using the Search API gives you at least the flexibility to change this at any point should something better come up.

nick_vh’s picture

In the weekly solr - drupal.org meeting we still were not convinced there is a solution that is able to be deployed on drupal.org in a short-term. I started looking at mysql views and materialized views.

The code below is the complete views query without filters and others. It is not a full materialized view but we could make it that way if we want, a full materialized view would be more performant but it requires a trigger in the node table. This trigger would update the entry in the materialized view and this way we would have our denormalized table in mysql with real time updates.

There is *no* need for a drupal module to achieve this functionality, just some mariadb/mysql mastery. The only drupal adjustment we'd have to make then is to expose this table to drupal views and recreate the view.

On another note, I'd advise against materialized views as a module as this would add more overhead to a site that does not want more modules. Also we know not a lot about it, and lack the expertise I think? I'd be happy to be proven otherwise.

In the meantime, if you want to start playing with this and/or make the code to expose the table to the view :

CREATE OR REPLACE VIEW issue_queue_nodes_cache AS 
SELECT node.title AS node_title, node.nid AS nid, history.timestamp AS history_timestamp, node.created AS node_created, node.changed AS node_changed, node_comment_statistics.comment_count AS node_comment_statistics_comment_count, node.type AS node_type, node_comment_statistics.last_comment_timestamp AS node_comment_statistics_last_comment_timestamp, users_field_data_field_issue_assigned.name AS users_field_data_field_issue_assigned_name, users_field_data_field_issue_assigned.uid AS users_field_data_field_issue_assigned_uid, 'node' AS field_data_field_issue_status_node_entity_type, 'node' AS field_data_field_issue_priority_node_entity_type, 'node' AS field_data_field_issue_category_node_entity_type, 'node' AS field_data_field_issue_version_node_entity_type, 'node' AS field_data_field_issue_component_node_entity_type
FROM 
node node
LEFT JOIN field_data_field_issue_assigned field_data_field_issue_assigned ON node.nid = field_data_field_issue_assigned.entity_id AND (field_data_field_issue_assigned.entity_type = 'node' AND field_data_field_issue_assigned.deleted = '0')
LEFT JOIN users users_field_data_field_issue_assigned ON field_data_field_issue_assigned.field_issue_assigned_target_id = users_field_data_field_issue_assigned.uid
LEFT JOIN field_data_field_project field_data_field_project ON node.nid = field_data_field_project.entity_id AND (field_data_field_project.entity_type = 'node' AND field_data_field_project.deleted = '0')
INNER JOIN node node_field_data_field_project ON field_data_field_project.field_project_target_id = node_field_data_field_project.nid
INNER JOIN field_data_field_issue_status field_data_field_issue_status ON node.nid = field_data_field_issue_status.entity_id AND (field_data_field_issue_status.entity_type = 'node' AND field_data_field_issue_status.deleted = '0')
LEFT JOIN history history ON node.nid = history.nid AND history.uid = '1'
INNER JOIN node_comment_statistics node_comment_statistics ON node.nid = node_comment_statistics.nid
ORDER BY node_comment_statistics_last_comment_timestamp DESC

Some links : http://www.fromdual.com/mysql-materialized-views

webchick’s picture

The thing I find appealing about getting MV module to work is that if it can be made to work for Drupal.org, it can likely also be made to work for any other D7 site out there hitting performance issues with Views. This helps justify the investment in the D7 upgrade by bringing real, tangible benefits 99% of our userbase, where custom SQL triggers and custom searching infrastructure only benefit ourselves and are not easily reproducible in other environments that do not have a Narayan. ;) They also make deployments trickier (or at least Solr has been blamed for staving off at least 3 d.o features I've been involved with, and it's been really frustrating).

I pointed this out on IRC and was told that the query was more complicated than merely MV being able to help with, but it's not clear to me exactly how/why that is. Would be great to get an issue summary that reflects the current state of things.

nick_vh’s picture

Title: Port issue views to Search API so we can have a performant backend » Make the issue queue listings faster by leveraging techniques such as materialized views or solr

Changing the topic title, will tackle the summary also

nick_vh’s picture

Project: Project issue tracking » Drupal.org customizations
Issue summary: View changes

Adding several sections, and a blocker: This issue is being blocked by #949374: Add a project_issue_solr module to put issue metadata into the solr index

damien tournoud’s picture

Title: Make the issue queue listings faster by leveraging techniques such as materialized views or solr » Port issue views to Search API so we can have a performant backend
Project: Drupal.org customizations » Project issue tracking

Let's not forget that we need a fulltext search. The main view is perfectly trivial to denormalize. This issue is about rebuilding the Project Issue architecture using Search API so that we can leverage external fulltext search engines instead of trying to do this in SQL.

nick_vh’s picture

Then let's confirm the full-text search in sql is the issue. FWIW, this has been working in Drupal 6 and I don't see major changes in D7 in the functional requirements.
My understanding is that the join causes the high mysql load as pointed out in and this materialised view could solve this. And don't get me wrong, I don't care how it is being solved, but the solution should be a workable option for everyone.
I just don't see how adding Search API to this equation would solve the issue? Or is someone committed to work on the materialised views backend for search api? That would be awesome.

Just keep in mind that we are ending the development period for drupal.org on D7 and that we should be careful in adding new layers. Perhaps we should do some more profiling to find and make the issue reproducible, so that we can properly test and benchmark the work that is being done.

damien tournoud’s picture

@Nick_vh: it has always been an issue. Those are part of the slowest queries of Drupal.org.

The pages at http://drupal.org/project/issues/search?text=[any popular keyword] usually just timeout. This whole view is disabled to anonymous users because it is so slow.

Let's stop using SQL as a fulltext search engine, this is what this issue is about.

ElasticSearch seems to be a good match for all our requirements (including the realtime aspect). There is a Elasticsearch connector for Search API already (actually two: 1 and 2), so we have something to start working from.

We had good success leveraging ElasticSearch in production in the past (on non-Drupal projects). ElasticSearch is in particular the indexation engine behind the Graylog2 log management solution. The setup (on a two server cluster with one replica) should be trivial on Drupal.org.

damien tournoud’s picture

@webchick:

They also make deployments trickier (or at least Solr has been blamed for staving off at least 3 d.o features I've been involved with, and it's been really frustrating).

This has been the case because we built a lot of custom layers on top of the relatively feature-poor Apache Solr module. One of those features is the project browsing that we implemented in the project_solr module with dww back when we ported Drupal.org to Drupal 6. This module can (and should) die. It can be replaced by a Search API index and a standard view very very easily.

Rebuilding those features on top of the standard Search API would allow us to be more agile and more flexible. This is in part what this issue is about. I'm not sure what all the push-back is about.

nick_vh’s picture

Even though I dig elasticsearch, I'm not sure if it is a solution Drupal.org wants to maintain next to a dozen of mysql and solr servers. And the investment is just as much as moving to Solr 4.0. Perhaps drupal.org can also wait till the solr modules are stable on Solr 4.0 as it would offer the same solution and there would be better community support (regardlessly of which backend, apachesolr or search api, you use) imho.

I'll leave it to the drupal.org crew to decide what they want to do here.

nick_vh’s picture

Project: Project issue tracking » Drupal.org customizations
Issue summary: View changes

Changing summary

fabianx’s picture

Project: Drupal.org customizations » Project issue tracking

For full text I have another suggestion:

* I have worked on a prototype after Munich to replace the core search module with a Solr Query.

That is actually quite easy as Views is just loading the nids anyway, so all we need to do is filter:

SELECT MV QUERY ... WHERE nid IN (list-of-nids-from-solr) AND ...

The only problematic parts are:

* Exposed filters need to be translated into Solr queries
* Paging needs to be done via Solr

Besides that core search is doing nothing else than to restrict the result-set of the MySQL query, so with some constraints it is a viable solution.

Yes, it will take a moment for new content to show up, but for full text search that should not be a problem ... (and core search is dependent on cron too).

damien tournoud’s picture

@FabianX: are you really suggesting doing the intersection of two queries like this? This will fail in 99% of the use cases. The way to make this work is to guarantee that the list fetched from Solr is bounded, which means doing the paging in Solr, which means doing the ordering in Solr too. So this will only work when doing full text queries *and* sorting by relevancy.

Anyway, this is just too much work when we can just rely on Search API and be done with it :)

steven jones’s picture

Just want to clarify, where is the requirement for Solr 4 coming from? Is it just the speed of the index update?

dww’s picture

Re Steven Jones: There's no requirement for Solr 4. In fact, there's no requirement for Solr at all. The main requirements are:

1) Doesn't melt the d.o DB servers.

2) Doesn't introduce significant latency in issue queue search results (e.g. if you update an issue to change the status, that change needs to be reflected in the search results In A Timely Manner(tm) -- for some definition of "timely").

3) Doesn't regularly take 5-15 seconds just to find the results (aka "performant" as per the title of this issue).

---

Ideally, whatever we do here also:

4) Doesn't require massive code maintenance and custom effort to get it working.

5) Benefits other sites suffering from similar performance problems due to D7's per-field data storage.

From all the Solr expertise that's already been shared in this issue, Solr seems like the wrong solution to this, primarily because of point 2 (latency). In theory, Solr 4 would make it possible to do better, but that would open a whole new can of worms about trying to upgrade all of d.o's Solr infra to Solr 4. I don't know enough about Solr 4 to speak intelligently about the pros/cons of that, but it's definitely *not* a requirement that we use Solr for this, much less Solr 4.

Re Nick_vh: As Damien pointed out, the issue queue views have regularly been some of the most expensive and problematic queries on d.o. This is not a new issue. What's new is the magnitude of the problem, due to D7's core field storage exploding the cost of the queries. In D6 we could play some denormalization tricks and got things sorta-kinda-working, but we don't want to continue doing our own custom hacks for this if we can help it.

Finally, we're all keenly aware of the d.o D7 development process and that we're "ending the development period for drupal.org on D7". I wouldn't have tagged this for the upgrade and bumped it to critical if nnewton thought it was okay to launch without it. This isn't just adding new layers for the sake of elegance or because it seems like a nice and clean solution. It is a critical launch blocker to resolve this massive performance regression. However, given that we have to do something here, I'm going to strongly resist custom hacks that only work part of the time and just for the benefit of d.o if there *is* an elegant and clean solution that will solve our needs *and* help other sites in the same boat.

@all: Since this is a critical task, and it seems we're at risk of getting hung up going around in pointless circles rehashing the fact that Solr isn't going to work for this, can we please move on? My proposal:

- Continue with the Search API port that Damien started in the 949372-search-api branch.
- Continue with the Database Search backend as I mentioned at #3 and drunken monkey supported in #26 in response to #1836928: Materialized Views Search API backend.
- Get nnewton's input on if he'd like to continue with database search or investigate some other possible search backends.

Can I get an 'amen'? ;)

If not, can someone clearly articulate why there appears to be so much hostility towards using Search API in this case? I'm really confused. Smart people who I generally trust are making what seem like irrational arguments against a solution that appears to pretty obviously address all of the concerns here. Pardon my candor, but WTF?

Thanks,
-Derek

steven jones’s picture

@dww Thanks for the summary of where we are.

I've just had a quick look at the code for the Search API DB backend, and it seems to index all fields into different tables, so isn't denormalised at all? The 'quick win' Search API DB backend might be a non-starter too.

dww’s picture

Thanks for looking. Are you sure it's indexing all of its fields into separate tables? I was under the impression it was putting all the data for each index into a separate table, and that each index corresponded to a given view (more or less).

Hrm, but you're right, a very quick skim of the code seems to agree with what you're saying. Hrmph. Maybe we're going to need #1836928: Materialized Views Search API backend after all if this is actually going to be performant at d.o-scale. Or we go with a real search backend as Damien is proposing. Webscale FTW! ;) (Sorry, I couldn't resist).

webchick’s picture

I'm a huge proponent of something like #1836928: Materialized Views Search API backend which makes the solution to this problem something sites outside of Drupal.org can benefit from. That's a sure-fire way to both demonstrate "return on investment" for the Drupal.org upgrade, as well as help to encourage people other than the Drupal Association to work on further improvements.

steven jones’s picture

Issue tags: +solr

How about instead of #1836928: Materialized Views Search API backend we improve the Search API Database Search module to be able to put as much in a single table as possible, given the usage statistics of the two modules, that might benefit the community more? It might also be significantly easier since Search API Database Search already does full-text search, whereas Materialised views does not.

If such an approach was desired, then I think the following could happen:

  1. Continue with the Search API port that Damien started in the 949372-search-api branch.
  2. Continue with the Database Search backend, but try to improve it so that as much of the data as possible gets denormalised (PBS has some nice code that might be useful here)
  3. (In the future) When the Search API Solr backend, and apachesolr modules can talk to Solr 4, and the d.o infra has it, move to Solr 4, and the low latency goodness that that brings..
drumm’s picture

Issue tags: -solr

Removing the solr tag since solr isn't a likely solution, and if it is, it is different from the solr elsewhere on Drupal.org D7.

drumm’s picture

Looking at http://drupal.org/project/search_api_db, I'm not seeing that it is suited fully for our needs. It appears to keep each field in a separate table. What we need is a a big, denormalized table so we can generate issue queue listings with more simple queries that can be executed well enough by SQL, that's Materialized Views.

The way forward is still what dww said in #38:

- Continue with the [Project Issue] Search API port that Damien started in the 949372-search-api branch.
- Continue with the Database Search backend as I mentioned at #3 and drunken monkey supported in #26 in response to #1836928: Materialized Views Search API backend.
- [Do a proof of concept and] get nnewton's input on if he'd like to continue with database search or investigate some other possible search backends.

drunken monkey’s picture

Issue tags: -solr

Looking at http://drupal.org/project/search_api_db, I'm not seeing that it is suited fully for our needs. It appears to keep each field in a separate table. What we need is a a big, denormalized table so we can generate issue queue listings with more simple queries that can be executed well enough by SQL, that's Materialized Views.

As Steven said, we'd need to improve/adapt it so it is also capable of storing as much as possible in a single table. Which could be less work than creating a MV Search API backend from scratch.
The main problem I see with using a single table for the whole index are multi-valued and fulltext fields. When using a single table, we'd practically need to have one row per field value (or, for fulltext fields, one row per distinct word in it). That, of course, would be insane. So, we'd need a plan to deal with this. I'm coming up with the following options here:

  • Store only all single-valued, non-fulltext fields in the combined table, and keep the separate tables for multi-valued and fulltext fields. I'd imagine this would be an improvement for most sites, so we could just add this as a normal patch to the DB search module.
  • Possible solution for fulltext fields: Don't index them in a special way, and use LIKE (or, optionally, vendor-specific methods) at search time to search it. This would also allow the backend to correctly handle phrase searches, which currently aren't supported by the DB backend.
  • Possible solution for multi-valued fields: Maybe we could concatenate all values of those with a special separator, so we wouldn't need the additional table. Probably too detrimental to performance to be generally used, but maybe still the right thing for drupal.org? Could be tested, I guess.

Apart from the first one, these probably couldn't be just added to the Search API Database Search module. We'd either have to add an option for servers, letting users decide which way they'd like their fulltext/multi-valued fields handled; or fork the module (either into a new project or a 2.x branch) and hard-code the other mechanism, explaining this in the project description. (The latter would also allow us to specify that some features are not supported with that service backend, should we find we don't need them on drupal.org and they'd make it more difficult to implement.) The first would of course make the code more bloated and complicated (since code in the service class would then have to deal with two different data structures), the second would be harder to maintain (having to commit patches unrelated to the differing part to both projects).

That said, I don't really know much about Materialized Views, so it could easily be the better option here. The above are just my thoughts if you decided to go with the existing DB backend.

nick_vh’s picture

Before we dive in the very complex matter of creating a new module, did someone ever test materialized views with the current problem set?

steven jones’s picture

I really don't want to open up can of worms again, but what exactly were the reasons for not using Solr as the backend?

Latency (or speed of updates)? Infrastructure change/setup?

dww’s picture

Re: #46: Not sure what you mean. What do you want to test with MV? Also, sadly Materialized Views API has no releases, and I have no idea what state the code is in.

Re: #47 and Solr: See comment #38:

From all the Solr expertise that's already been shared in this issue, Solr seems like the wrong solution to this, primarily because of point 2 (latency). In theory, Solr 4 would make it possible to do better, but that would open a whole new can of worms about trying to upgrade all of d.o's Solr infra to Solr 4. I don't know enough about Solr 4 to speak intelligently about the pros/cons of that, but it's definitely *not* a requirement that we use Solr for this, much less Solr 4.

The main problem (as I understand it) is the (fairy significant) indexing latency. If it takes 10 seconds or more for the Solr index to reflect that an issue has been updated, the fear is that people will get confused and pissed off if they mark an issue needs work, reload the issue queue, and still see the issue as RTBC or something.

Cheers,
-Derek

drumm’s picture

Re #45, search_api_db improvements:

Drupal.org doesn't really have multi-value fields for issues, so it is a moot point. The best thing to do for search_api_db would be to leave the existing functionality of one field per table. That is probably best for sites that do have multi-value fields, a JOIN seems nicer than a LIKE, etc; they can do their own improvements if they need them.

For full text, I think it is worth considering using MySQL-specific full text. That would certainly be better than LIKE. I'm guessing the existing word-by-word scoring in search_api_db is nicer than LIKE, especially when we get into multiple words. Unless I'm mistaken, core search style word-by-word scoring is what we are doing today, so that's the baseline to meet.

The end goal is to have scalable queries. Less tables and less JOINs are great for that. We don't need to get down to one table. In the past, we used Materialized Views to denormalize the tables, we don't necessarily have to use it again, but some code has to do it. Either way, we're going to be improving one of the modules.

drunken monkey’s picture

Drupal.org doesn't really have multi-value fields for issues, so it is a moot point.

It has tags and participants, right? So while the majority of properties is luckily single-valued, we would have to solve the multi-valued problem. But if you're also of the opinion that it needn't necessarily be only a single table, then I guess we're good to go.
For fulltext, we could really just add a server option for letting users decide whether they want to use our old method (you're right, it works similar to core search – although they use two queries, I think) or a vendor-specific method of searching a single field (LIKE for systems we don't know or which have no fulltext searching capabilities).

You can (and probably should) of course also evaluate MV as an option, but I think it wouldn't be too much effort with search_api_db, and I'd sure be willing to help with that. Currently, I'd even have the time.

drunken monkey’s picture

Project: Project issue tracking » Drupal.org customizations
Issue summary: View changes

a) this is no longer blocked on the (won't fixed) proposal to add more custom project* solr crap to manually add issue metadata to the solr index since we're probably not going to use solr at all. b) the PM folks have changed their minds on how to track hours estimates -- that's now handled via issue tags, not blurbs in the issue summary like this.

drumm’s picture

Project: Drupal.org customizations » Project issue tracking

I updated the issue summary a bit for how to proceed with Search API. It integrates well with Views and gets the boilerplate code, which we would need for updating Materialized Views, out of the way. And I think there is good community & maintainer momentum around the Search API modules.

I clicked together a test view at http://searchapi-drupal_7.redesign.devdrupal.org/project/issues/searchapi. The dev site does come with some caveats, mainly our migration is buggy, #1975866: Issue data being migrated to D7 out of order, and #1982520: Need to migrate priority to new int values hadn't made it into the DB snapshot this dev site was built with. It does have data we can work with. More people can hack on it, see http://drupal.org/node/1018084.

The default page's query is currently

SELECT t.item_id AS item_id, '1' AS score FROM search_api_db_default_node_index_search_api_language t INNER JOIN search_api_db_default_node_index_status t_2 ON t.item_id = t_2.item_id INNER JOIN search_api_db_default_node_index_field_issue_status t_3 ON t.item_id = t_3.item_id INNER JOIN search_api_db_default_node_index_changed t_4 ON t.item_id = t_4.item_id WHERE (( (t_2.value = '1') AND( (t_3.value = '1') OR (t_3.value = '13') OR (t_3.value = '8') OR (t_3.value = '14') OR (t_3.value = '15') OR (t_3.value = '2') OR (t_3.value = '4') OR (t_3.value = '16') OR (t_3.value = '0') OR (t_3.value = '0') OR (t_3.value = '0') OR (t_3.value = '0') OR (t_3.value = '0') ))) ORDER BY t_4.value DESC LIMIT 50 OFFSET 0

which is 3 JOINs, a temporary table, and a filesort. Consolidating single-valued fields into one table will help a lot.

The rest of the rendering is fetching data from the field cache, which will be backed by MemCache in production.

drumm’s picture

Oh and I ran into #1984694: Increase weight #delta along with the number of grouped filters along the way. That's not a blocker, just annoying when manually configuring Views.

drumm’s picture

Project: Project issue tracking » Drupal.org customizations
Issue summary: View changes

Updating with solution

drunken monkey’s picture

Project: Drupal.org customizations » Project issue tracking

Since the issue summary just got updated which made me look at the paragraph about Solr: Solr 4 is now supported (by both Search API Solr and Apachesolr Search Integration) – see #1676224: Support Apache Solr 4.0. I don't know to what extent this changes things (i.e., makes Solr an eligible option), though, or if re-evaluating this decision is even still possible at this time. Just wanted to note that.

drumm’s picture

For Solr 4, #18 makes it sound like we could get a Solr 4 index updated in near realtime. We could consider it.

I'd like to avoid a full reindex after upgrading, reducing our search downtime. We already have to reindex all projects to get the data moved to fields. Issue indexing will have to happen with any solution. For search_api_db, we could shortcut it with a hacky INSERT SELECT and then proper indexing to come back and get the full text index.

If we do go with Solr for issues, I like the idea of using Search API to keep a separate index from the main site index. I don't want to upgrade our main search code again in the Drupal.org D7 project. Solr 3 and 4 will have to coexist on our Solr server.

I'm still leaning toward search_api_db, but search_api_solr is now an option.

drunken monkey’s picture

I have to admit, I haven't really worked with Solr 4 on a production site yet (let alone one as large as d.o's issue queues), so I can't say what obstacles you might run into with it. However, my first impulse is generally to use Solr, since most queries just perform way better with it than with the database. On the other hand, if we mostly have just filter queries (which I think is the case for the issue queues) and don't use facets, this might be the other way round.
Also, your point with re-indexing is of course valid, you couldn't really circumvent this with Solr. You could perhaps do incremental updates as well (leaving out the fulltext at first), but I don't know how much time that would save on the initial run.
Using the database first, letting a Solr server index the data parallel to that and switching the index to the Solr server later should be easily possible with the Search API, but would probably be too much additional effort for you/us.

So, I don't know how you were planning on deciding this, but I can see how the database might still be the better option here.

drumm’s picture

Assigned: Unassigned » drumm
drumm’s picture

Project: Project issue tracking » Drupal.org customizations
Version: 7.x-2.x-dev » 7.x-3.x-dev
Component: Views integration » Drupal 7 upgrade

I posted an initial pass at the necessary changes to Search API DB in #2083079: Store single-valued fields in a single, denormalized table.

Moving this to drupalorg since this will be a bit Drupal.org-specific initially. For example, issue tags are not in project module. We may be able to get by with altering the Search API and Views configuration in Drupalorg, but that will be a bit of a pain to manage.

dww’s picture

Actually, the issue tag filters on the default issue views are provided by project_issue. see _project_issue_views_add_taxonomy_filters() from project_issue/views/project_issue.views_default.inc.

One of the reasons I was so excited about this issue was that we would not have to maintain a bunch of separate sets of default views. So, I'd really prefer to keep doing this in a way that can be shared and reused instead of making a bunch of assumptions and having all the effort end up only being a drupal.org customization.

That said, I know we're desperately late getting this done, and we might have to cut corners to get something launched and then have to refactor it later. I just fear that later will become never and we'll be stuck with d.o-specific Search API views that are used and maintained, and project_issue non-Search API views that are poorly maintained.

So, one final vote for doing this properly from the start and not turning it into a bunch of d.o-customizations unless absolutely necessary (and via altering whenever possible).

Thanks!
-Derek

drumm’s picture

Do we want project issue to depend on Search API? With Search API DB, installation and indexing could mostly just work, but it is another dependency to install and inevitably will need some amount of attention from site admins. Alternatively, Views of core nodes and fields will tend to work more simply. I think from the perspective of others running project issue, they should get regular, basic views.

I don't want to attempt any sort of magic that would convert a basic View to a Search API view. I'm sure it could be done, and would get us back down to one set of views to maintain, and then they get converted. That leaves us with two sets of views to maintain, unless we require Search API.

It took me a few minutes to wrap my head around Project*/drupalorg's views altering for #1807726: Port project index view to D7. While good & extensible, I had a moment of wondering why we don't just export the view we need for our site, shortcutting moving bits of the export into 3 places.

At this point, this issue feels like something we are trying out on Drupal.org and is indeed site specific. I expect this first draft might be a bit rough. If there is a need for faster views in project issue itself, the second round of work using what we know will be a whole lot better.

dww’s picture

Yes, I was prepared to make Search API itself another requirement, and leave it up to sites to choose an appropriate backend. However, your other points are well-taken. Perhaps as a transition period we should just leave the plain-vanilla views as what's provided by default, and try out the Search API backed views as a d.o customization at first. Once we've got more experience with it, are happy with it, and can more easily document and support it, we can reconsider moving those back into project_issue and generalizing them as the default views.

Thanks,
-Derek

drumm’s picture

#2083079: Store single-valued fields in a single, denormalized table is now passing its simpletests. More testing in general would be great.

The next steps for this issue are setting up the index and views to replace project_issue's views. And then filling in the missing pieces from the issue summary.

drumm’s picture

I pushed a few commits for support in project_issue:

drumm’s picture

Issue summary: View changes

Update summary to explain why Solr 3 doesn't meet the requirements

drumm’s picture

Issue summary: View changes

update todo

drumm’s picture

2 more commits for /project/issues:

I've also updated the issue summary for completed work.

drumm’s picture

Issue summary: View changes

more done

drumm’s picture

2 more project_issue commits:

drumm’s picture

https://drupal.org/node/185188/commits has numerous fixes, including some mentioned in #2112839: Various issue queue view anomalies.

Remaining work:

  • A big INSERT SELECT statement to quickly prepopulate the non-fulltext index.
  • Queuing proper indexing to get the full text indexed.
  • The red updated/new after the issue title.
  • Show and default sort by the score column when full text searching.
drumm’s picture

The big INSERT SELECT statement to quickly prepopulate the non-fulltext index was added with http://drupalcode.org/project/drupalorg.git/commit/5ab11e0.

Now git7site.devdrupal.org has a complete index of non-text fields. It is fully indexing updated nodes immediately, and existing nodes oldest to newest on cron. I think this is ready for performance testing.

The red updated/new after the issue title - the first half of this is in http://drupalcode.org/project/drupalorg.git/commit/452c263. I still need to update the Views to use it.

drumm’s picture

Status: Active » Fixed

I've gotten everything I've been able to spot so far done:

drumm’s picture

Issue summary: View changes

More done!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

  • Commit 0e521ec on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] If searching for text in an issue queue, show and default sort...
  • Commit 120c678 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Add initial SearchAPI issues
    
  • Commit 18e7dc0 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Make score lables consistent.
    
  • Commit 2469718 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Add Create a new issue...
  • Commit 2a054c8 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#2044475] Use canonical path for issue search, and more on [#949372]
    
  • Commit 32440f0 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] SearchAPI issue queue updates.
    
  • Commit 356cb05 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Add new indicator
    
  • Commit 452c263 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Add an entity property for the red updated/new after the issue...
  • Commit 5ab11e0 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Prepopulate project issues search index.
    
  • Commit 6d7fe0d on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Do not overwrite existing rows from legitimate indexing.
    
  • Commit 9f4b47d on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Move project/issues view to SearchAPI
    
  • Commit a1e5cfa on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Move project/issues view to SearchAPI
    
  • Commit aeca21e on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Add advanced search view
    
  • Commit ba4dea0 on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Do not render anonymous in issue tables.
    
  • Commit bfee7eb on 7.x-3.x, 7.x-3.x-dev by drumm:
    [#949372] Switch to entity type to user so user-related filtering works...