Port issue views to Search API so we can have a performant backend [#949372]

Problem/Motivation

The Drupal.org issue queue is powered by Views and has a large amount of joins that could bring down the server if we don't simplify this or improve the logic. On Drupal.org this could be a huge performance win (since a lot of the queries, even if we optimize the default case, are still painful if you do alternate sorts or filters). It'd also simplify our infra and reduce DB overhead, since we'd no longer need the core search index for anything.

Historically we tried to figure out if Solr would be a solution for this problem. But the challenge to make this work are not small, and this would either require Solr 4.0 (which is not supported in any of the modules, Apachesolr nor Search API solr) or suffer from noticeable latency between updating an issue and that change being reflected in search results.

Solution

We are using Search API with Search API DB. The dev site is http://searchapi-drupal_7.redesign.devdrupal.org/project/issues/searchapi

To do:

~~Store single-valued fields in a common table, reducing JOINs.~~
~~Participants field, usernames who have commented on the issue.~~
~~Multiple, comma-separated selection for ~~Project~~, Assigned, Author, Participants, and Issue Tags.~~
~~Autocomplete for Project, Assigned, Author, Participants, and Issue Tags. (Not Search API autocomplete's "did you mean"-like suggestions.)~~
~~Color table according to issue status.~~
~~Export everything to code.~~
#1832356: Figure out how to dereference option lists with Search API

Comments

Comment #1

dww

we/he/they

commented 22 October 2010 at 08:22

Status:

Active

» Postponed

First step is here: #949374: Add a project_issue_solr module to put issue metadata into the solr index

Comment #2

senpai commented 16 August 2012 at 18:28

@DamZ, based upon your https://drupal.org/node/1699164#comment-6354532 comment, what is the approximate level of effort needed to implement something like this?

Comment #3

dww

we/he/they

commented 5 November 2012 at 01:16

Title:	Consider a Solr backend for issue views	» Port issue views to Search API so we can have a Solr backend
Version:	6.x-1.x-dev	» 7.x-2.x-dev
Priority:	Normal	» Critical
Status:	Postponed	» Active
Issue tags:		+solr, +project, +drupal.org D7, +80hr

nnewton and I sat down at the coder lounge yesterday to investigate the queries from the default issue views in D7. As we expected, thanks to Field API and per-table storage, the queries are absolutely punishingly bad. nnewton said it's a critical blocker to *not* use SQL-backed views for the issue queues. D.o will melt.

So, we have to do something here. Based on numerous conversations, by far my preference would be to port all the default views provided by project_issue to depend on Search API. Then, sites can use whatever backend is appropriate for their needs, including a SQL-table based denormalization thing (Database Search). The only downside of using Search API is that we're going to need a separate index for the issues. But we already have that problem on d.o since we're separately indexing the issues with the core search index for the SQL-based views text search to work. ;) So, that's a bogus reason not to do this.

I'm going to solicit feedback from some of the folks who were interested in doing Search API + Project* porting to get some more direction on this before we start coding, but I'm pretty sure this is the right path forward.

Based on a quick chat with Senpai, tagging this as a 2-week (80hr) critical task.

Comment #4

damien tournoud commented 5 November 2012 at 15:27

I took this for a spin, so as to figure out a precise task list.

The good news is that it was relatively easy to put a prototype together. The bad news is that not everything magically worked :)

Here is the task list (I tagged the open issues with the "Search API on Drupal.org" tag):

#1832348: Add a computed property to load the comments of a node
#1832356: Figure out how to dereference option lists with Search API
#1832360: Implement a Views argument for entities
Figure out how to implement autocomplete for Assigned, Submitted by, Participant and Issue Tags (I suppose we could implement a generic solution for this in Entity Reference, except that ~~Assigned~~ "Submitted by" itself is not an Entity Reference field...)
Discuss if it's acceptable not to be able to sort by title
Discuss if we should transform the advanced search for into a facetted form

Comment #5

damien tournoud commented 5 November 2012 at 15:44

I just pushed the prototype to the 949372-search-api branch.

Comment #6

dww

we/he/they

commented 5 November 2012 at 18:45

Great, thanks! Haven't had a chance to look at any of this closely, but one quick comment based on #4:

except that Assigned itself is not an Entity Reference field

Uhh, sure it is:

  $field = array(
    'cardinality' => '1',
    'entity_types' => array(),
    'field_name' => 'field_issue_assigned',
    'module' => 'entityreference',
    'settings' => array(
      'target_type' => 'user',
      'handler' => 'assigned',
      'handler_submit' => 'Change handler',
      'handler_settings' => array(
        'project_field' => 'field_project',
        'sort' => array(
          'type' => 'property',
          'property' => 'name',
          'field' => 'none',
          'direction' => 'ASC',
        ),
        'behaviors' => array(
          'views-select-list' => array(
            'status' => 0,
          ),
        ),
      ),
    ),
    'translatable' => '0',
    'type' => 'entityreference',
  );
  if (!field_info_field('field_issue_assigned')) {
    field_create_field($field);
  }

It's just that it has a custom selector plugin (to do the magic of project-specific choices for assigned, given the per-project maintainer permission stuff).

Comment #7

damien tournoud commented 5 November 2012 at 19:49

@dww: that was a brain fart. I meant "Submitted by" is not an entity reference field.

Comment #8

drunken monkey

he/him

German

Vienna, Austria

commented 5 November 2012 at 22:44

There is already an issue for the comments proerty: #1414688: Add the comments property in entity_metadata_node_entity_property_info(). I tagged it accordingly and marked yours as a duplicate. It seems like fago is fine with committing it, it just needs a quick polish.

Figure out how to implement autocomplete for Assigned, Submitted by, Participant and Issue Tags (I suppose we could implement a generic solution for this in Entity Reference, except that ~~Assigned~~ "Submitted by" itself is not an Entity Reference field...)

Would this really deal with the form generated by Views? Or do you mean overriding the Views integration in the Entity Reference module?
In any case, a custom override will probably be necessary, at least for the “Submitted by” field. And probably will be rather simple to do.

Discuss if it's acceptable not to be able to sort by title

Otherwise we can always add a aggregated field for indexing the title as fulltext, and index the title itself as a string. (In case you didn't know that.)
In principle, I'd say sorting by title is no really useful feature, or only in rare cases. However, those cases exist, I guess – and also the table would look a bit off, with only one column not sortable.
(Also, when sorting by „Assigned to” remember that this will normally sort by the users' UIDs, not their names! If you want the probably expected behavior of sorting by name, the name is what you need to index, not the “Assigned to” field itself.)

Discuss if we should transform the advanced search for into a facetted form

Adding facets in one way or another would definitely be a huge plus! I can't really say if the impact on the performance would be acceptable, but I'd say that having facets in lieu of nearly all exposed form elements (except “Search” and “Status”, I'd say) would even be a big win for the normal search.
I'm not a UX expert in any way (as you're using the Search API, you're bound to have noticed), but that's what I'd rather use, I think.

Comment #9

grendzy commented 5 November 2012 at 23:03

What are the benefits of Search API vs apachesolr_views?

Comment #10

dww

we/he/they

commented 6 November 2012 at 02:19

re #8: Thanks for all the input, very helpful!

- Submitted: I'm not sure why that matters much for the issue views. We don't currently render that in any of the views, neither for display ~~nor filtering~~. Oh, whoops, except that's one of the filters for advanced search. Hrm. Yeah, that. ;) So I guess that's the issue -- we just need to make sure there's a way to filter by node author. Seems like that's *got* to be handled by Search API already, no?

- Sorting by title: I'm not too concerned about this. AFAICT, the only useful *sorts* for issue views are: 'last commented on' (default), 'priority', 'created', # of replies, and maybe version and category. Generally I'd say that category, version and assigned are much more useful as a filter. If it's a pain in the ass to be able to sort by title, I'm fine with dropping that (although I must admit I don't understand Search API well enough to know why this doesn't already work).

- Facets: I've never really understood the difference between facets and filters. All seems the same to me. ;) But yes, to the extent that Search API makes it trivial to get counts of issues matching a certain combination of filters, I think that'd be a huge win. I was under the impression that search engines like solr make the performance cost of facets basically negligible.

While we're talking about implementation details, the biggest thing I'm concerned about would be how to get this to play nicely with the "Follow" flag as described at #1828756: D7 views don't honor tracker and issue following. I think we're going to need a field in the Search API index for each issue with the uids of everyone following that issue, so that both the Your issues view and the "participant" filter on advanced search work (although that's a bit confusing, and we should probably keep a separate filter for "participant" vs. "following" or something). That'd also mean we'd need to reindex the issue for Search API every time a user (un)flags that issue. Ideally with little or no latency. ;) I'd rather discuss this at the other issue, but I wanted to raise it here to get it in everyone's mind.

re #9 and the benefits over apachesolr_views: Good question! Here's how I see it:

For me, as the maintainer of Project* for both the d.o and non-d.o cases, by far the biggest win is that (at least as I understand it all) I'd only have to maintain a single set of all the default views for project_issue, which would all use the search API views handlers. Then, sites can use whatever Search API backend they want (solr vs. SQL vs. whatever), none of the views have to be touched, and it all Just Works(tm). If we use apachesolr_views, we have to manually craft all the solr-specific views, and keep them in sync with the SQL-based default views, and then if someone wants something *else* (e.g. mongo, xapian, whatever), we'd have to either provide yet another set of views, or they'd be totally on their own, etc. Plus, even if we have apachesolr_views to *display* the issue data out of a solr index, we'd *still* have to write something like project_issue_solr.module that knew how to index all the issue-specific stuff for solr. So, we have to manually handle solr-based display, and we'd have to manually manage solr-specific indexing.

Basically, Search API seems like The Right Way To Do It(tm) from an architecture standpoint -- you write stuff once and let that API handle the abstraction of how all the different backends have to work, how things need to be indexed/denormalized, and how all the views handlers work. Not my problem. I'd rather keep it that way.

At least, that's the koolaid I've been offered. ;) Whether to drink it or not depends on how painful this is in practice once we actually start trying it. Thankfully, DamZ is willing and able to crank forward with a prototype, so I'm thrilled to just let him see how far he can get and re-assess when we have more concrete data.

Cheers,
-Derek

Comment #11

drunken monkey

he/him

German

Vienna, Austria

commented 6 November 2012 at 15:54

So I guess that's the issue -- we just need to make sure there's a way to filter by node author. Seems like that's *got* to be handled by Search API already, no?

Of course that's handled – it probably just lacks autocompletion. And you have to remember to index and use “Author » Name”, not “Author”.

- Sorting by title: I'm not too concerned about this. AFAICT, the only useful *sorts* for issue views are: 'last commented on' (default), 'priority', 'created', # of replies, and maybe version and category. Generally I'd say that category, version and assigned are much more useful as a filter. If it's a pain in the ass to be able to sort by title, I'm fine with dropping that (although I must admit I don't understand Search API well enough to know why this doesn't already work).

Since fulltext fields are parsed and tokenized, we don't have the single string we'd need in the index to sort by the field. Therefore, you can have a single field either fulltext-searchable or sortable, not both. You'd have to copy the field with the “Aggregated fields“ data alteration to do that.

Anyways, I totally agree with you, most of the sorts aren't very useful, so we could just drop some. However, being able to sort by title isn't really a “pain in the ass”, just a few clicks and a bit of extra space consumption for the index.

While we're talking about implementation details, the biggest thing I'm concerned about would be how to get this to play nicely with the "Follow" flag as described at #1828756: D7 views don't honor tracker and issue following. I think we're going to need a field in the Search API index for each issue with the uids of everyone following that issue, so that both the Your issues view and the "participant" filter on advanced search work (although that's a bit confusing, and we should probably keep a separate filter for "participant" vs. "following" or something). That'd also mean we'd need to reindex the issue for Search API every time a user (un)flags that issue. Ideally with little or no latency. ;) I'd rather discuss this at the other issue, but I wanted to raise it here to get it in everyone's mind.

Ah, yes, that's an interesting issue … Including that data into the index wouldn't be much work, I guess. However, having to reindex the whole issue when a user (un)flags it would really be unfortunate. I can't see how we could avoid that, though. I can't really tell how much additional strain this would put on the server, though. Probably there aren't a lot of people subscribing to new issues, compared to the whole rest – but I might be mistaken there.

Comment #12

pwolanin commented 8 November 2012 at 15:00

I think it would be an very bad architecture decision to use Search API in addition to Apache Solr.

Comment #13

nick_vh

he/him

Ghent

commented 8 November 2012 at 16:37

Just to add on to this discussion, you should figure out if solr is even an option for this problem. Keep in mind that Solr (< 4) has a waiting time before it processes the items. If you create an issue, it will not show up in the issue list because it will need to wait before it is being sent to solr and then you need to wait before it actually processes it. This latency can get pretty high.

Imagine : A user creates an issue

wait for cron to pass by : 5 min (optimistic)
wait for solr to process : 2 min
wait for solr to replicate to the slave : 1 min

You could avoid the cron by adding it to the index directly when you do a node save, but I'm still seeing some thing being shoehorned in to place where it does not belong.

Please think this through before making hasty decisions. I also agree with pwolanin that adding Search API now, just for this functionality & with the arguments about solr I added above, is not the right way to go.

Comment #14

pwolanin commented 8 November 2012 at 18:10

If table joins are an issue, maybe look at http://drupal.org/project/pbs instead of materialized Views?

Comment #15

dww

we/he/they

commented 9 November 2012 at 00:53

@pwolanin and @nick_vh: Thanks for the input on this.

Re: #12:

I think it would be an very bad architecture decision to use Search API in addition to Apache Solr.

I explained why I think this is a win in #10. Although I definitely respect pwolanin's abilities and opinions, the above quote doesn't do much to change my mind. Can you provide any justification, other than that we'd have to maintain two stacks/indexes, etc? I know we've been doing our own custom solr stuff ever since DamZ and I wrote project_solr module in Boston while we were porting d.o to D6, but both of us hate that module and have been trying to kill it ever since, while other folks seem to want to keep it alive for some reason I can't understand. ;)

Re: #13: Yes -- the latency is an issue (as I also wrote in #10). The 5 seconds for cron is bogus, we can have something churning through reindexing much faster than that. If there's additional latency once it gets to the solr server before the results are changing in searches, we can try to tune our solr cores to reduce that latency, too. nnewton got very high-end solr hardware since he knew (and wanted) that the site would rely on it more and more heavily to take some load off the DB. Already, the issue queue search pages are often between 2 and 10 seconds for a page load, and it's going to get dramatically worse if we keep trying to do this via "instantaneous" SQL-backed views as opposed to some kind of search index.

Re #14: pbs hasn't been touched in 3 years and has no releases at all, much less an official stable release. That's totally unworkable as an alternative here. If we're going to do denormalized SQL tables, I'd rather use database search as I linked in #3. At least that's at 1.0-beta4 and under active development and support. Besides, issue queues still contain some things that aren't just core fields that we need to be able to search + sort on (comment count, last updated, flag data for following and "my issues", etc), so we'd need to denormalize beyond even the field data to get anything resembling performant SQL queries.

Basically, if we're going to put any real custom effort into this problem, I'd rather see that going into reducing latency for a Search API + Solr backed solution. Everything else seems like continuing down a path that ties us to one possible backend, and even then, in a non-ideal sort of way that would require even more customization and effort on our part.

I continue to believe Search API is our best option here. I'm totally willing and ready to be convinced otherwise, but so far, that hasn't happened. ;)

Thanks again,
-Derek

Comment #16

nick_vh

he/him

Ghent

commented 9 November 2012 at 09:06

@dww I wrote in minutes, you clearly ignored my statement of the solr replication and process time?

There is no way that in Solr < 4 you can tune solr to directly return you the committed docs. Not without completely breaking it. If you will rely on solr for this listing, I think we can expect that a new issue is seen almost directly and that will never be the case with solr.
I've seen many people try, and it's just not feasible right now. Unless drupal.org wants to make investments in upgrading to solr 4.0. But given the timeframe, that is also not a workable option.

Comment #17

pwolanin commented 9 November 2012 at 14:16

I have to agree strongly with Nick here - the indexing and replication lag makes Solr unsuitable for listing pages where uses expect to see immediate updates.

Comment #18

damien tournoud commented 9 November 2012 at 15:30

Solr 3.x should be able to handle a ~10s auto commit time, without a lot of issue (especially on our hardware). The near realtime feature of Solr 4.x is compelling to reduce the latency further, and I don't see why we couldn't update the cluster to this version.

Comment #19

nick_vh

he/him

Ghent

commented 9 November 2012 at 15:38

http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Fr...

I'll try to explain why what you are saying is not plausible in a solr master-slave setup. And I assume drupal.org does have a master-slave setup.
A quote from this wiki

If you desire frequent new collections in order for your most recent changes to appear "live online", you must have both frequent commits/snapshots and frequent snappulls. The most frequently you can distribute index changes and maintain good performance is probably in the range of 1 to 5 minutes, depending on your reliance on caching for good query times, and the time it takes to autowarm those caches.

1 to 5 minutes, nothing even below that is recommended.

There is significant work in upgrading both solr modules to solr 4.0 and as I said before, then we need to prepare for this work and this won't be that easy given the timeframe and available human forces working on this.

Comment #20

pwolanin commented 9 November 2012 at 15:42

@Damien Tournoud - Solr 4 needs a revised set of config files. We have never deployed them in production yet. Have you?

The NRT feature does NOT come free. I don't think you want to delay this d.o upgrade by figuring out how to run SolrCloud or otherwise doing distributed updates rather than master-slave replication.

Note also - the current d.o setup has Varnish in front of Solr, and I expect the TTL is in the 100's of seconds?

Again, I think this is not the right tool for the job in this case.

Comment #21

damien tournoud commented 9 November 2012 at 16:57

All this confirms that Search API is the good solution for this. If Solr as a backend doesn't work in this particular scenario, we can easily switch to another backend (ElasticSearch and Sphinx come immediately to mind).

Comment #22

dww

we/he/they

commented 9 November 2012 at 18:40

Title:

Port issue views to Search API so we can have a Solr backend

» Port issue views to Search API so we can have a performant backend

We're having a good discussion about this in #drupal-infrastructure right now. Everyone agrees Solr is off the table. Now we're trying to hash out our other options. For now, I continue to agree with DamZ that Search API is the right solution on the project_issue side, and we can continue to argue about what backend makes the most sense.

Comment #23

nnewton commented 9 November 2012 at 21:54

Here is the major problem query off the main issue listing (/project/issues): http://drupal.privatepaste.com/208f1d6a0a

I have not review it in detail, but this + the COUNT() is taking 8+ seconds.

-N

Comment #24

dww

we/he/they

commented 9 November 2012 at 23:26

I still haven't had a chance to update the issue summary, but I had a thought I wanted to capture for this thread:

#1836928: Materialized Views Search API backend

That seems ideal. Apparently MV already needs some work to do views integration. It could just implement the Search API stuff and let that talk to views as needed.

If we're going to deploy for d.o D7, we'd also need some kind of tagged release, preferably a 7.x-1.0.

Fall-back is we stick with the DB Search backend (perhaps use that for now for prototyping).

Comment #25

pwolanin commented 10 November 2012 at 00:12

This final part of the where looks rather odd:

(node.status = 1 OR (node.uid = 94675 AND 94675 <> 0 AND 1 = 1) OR 1 = 1)

So, is it just the one single view for the issue listings that's the issue?

Comment #26

drunken monkey

he/him

German

Vienna, Austria

commented 10 November 2012 at 01:30

Thanks for your input, Nick! It seems Solr is really not an option then, unless you want to loosen the realtime criterion …

Regarding MV, I can't really say I know much about this module, but what I've just read doesn't sound like it would fit into the Search API model very neatly. Since it defines both the data and how it is queried, in Search API terms I think it would have to be both an item type and a service class? Or just a service class, where the index settings determine what the materialized views look like – but that would be practically no different from the Database search backend. We'd probably also have to add fulltext searching capabilities on our own, since I wouldn't think MV supports that currently (other than through the DBMS's native method).
Also, it seems the module hasn't had a commit in more than a year and hasn't got more than two installs, both of which also aren't very good signs …

Using the normal Database search backend for now might really be the best option – I usually can't recommend this at all to larger sites, but it ought to be better than the normal Views queries.
And, as you say, using the Search API gives you at least the flexibility to change this at any point should something better come up.

Comment #27

nick_vh

he/him

Ghent

commented 12 November 2012 at 23:11

In the weekly solr - drupal.org meeting we still were not convinced there is a solution that is able to be deployed on drupal.org in a short-term. I started looking at mysql views and materialized views.

The code below is the complete views query without filters and others. It is not a full materialized view but we could make it that way if we want, a full materialized view would be more performant but it requires a trigger in the node table. This trigger would update the entry in the materialized view and this way we would have our denormalized table in mysql with real time updates.

There is *no* need for a drupal module to achieve this functionality, just some mariadb/mysql mastery. The only drupal adjustment we'd have to make then is to expose this table to drupal views and recreate the view.

On another note, I'd advise against materialized views as a module as this would add more overhead to a site that does not want more modules. Also we know not a lot about it, and lack the expertise I think? I'd be happy to be proven otherwise.

In the meantime, if you want to start playing with this and/or make the code to expose the table to the view :

CREATE OR REPLACE VIEW issue_queue_nodes_cache AS 
SELECT node.title AS node_title, node.nid AS nid, history.timestamp AS history_timestamp, node.created AS node_created, node.changed AS node_changed, node_comment_statistics.comment_count AS node_comment_statistics_comment_count, node.type AS node_type, node_comment_statistics.last_comment_timestamp AS node_comment_statistics_last_comment_timestamp, users_field_data_field_issue_assigned.name AS users_field_data_field_issue_assigned_name, users_field_data_field_issue_assigned.uid AS users_field_data_field_issue_assigned_uid, 'node' AS field_data_field_issue_status_node_entity_type, 'node' AS field_data_field_issue_priority_node_entity_type, 'node' AS field_data_field_issue_category_node_entity_type, 'node' AS field_data_field_issue_version_node_entity_type, 'node' AS field_data_field_issue_component_node_entity_type
FROM 
node node
LEFT JOIN field_data_field_issue_assigned field_data_field_issue_assigned ON node.nid = field_data_field_issue_assigned.entity_id AND (field_data_field_issue_assigned.entity_type = 'node' AND field_data_field_issue_assigned.deleted = '0')
LEFT JOIN users users_field_data_field_issue_assigned ON field_data_field_issue_assigned.field_issue_assigned_target_id = users_field_data_field_issue_assigned.uid
LEFT JOIN field_data_field_project field_data_field_project ON node.nid = field_data_field_project.entity_id AND (field_data_field_project.entity_type = 'node' AND field_data_field_project.deleted = '0')
INNER JOIN node node_field_data_field_project ON field_data_field_project.field_project_target_id = node_field_data_field_project.nid
INNER JOIN field_data_field_issue_status field_data_field_issue_status ON node.nid = field_data_field_issue_status.entity_id AND (field_data_field_issue_status.entity_type = 'node' AND field_data_field_issue_status.deleted = '0')
LEFT JOIN history history ON node.nid = history.nid AND history.uid = '1'
INNER JOIN node_comment_statistics node_comment_statistics ON node.nid = node_comment_statistics.nid
ORDER BY node_comment_statistics_last_comment_timestamp DESC

Some links : http://www.fromdual.com/mysql-materialized-views

Comment #28

webchick

she/they

English

Vancouver 🇨🇦

commented 13 November 2012 at 02:08

Issue tags:

+Needs issue summary update

The thing I find appealing about getting MV module to work is that if it can be made to work for Drupal.org, it can likely also be made to work for any other D7 site out there hitting performance issues with Views. This helps justify the investment in the D7 upgrade by bringing real, tangible benefits 99% of our userbase, where custom SQL triggers and custom searching infrastructure only benefit ourselves and are not easily reproducible in other environments that do not have a Narayan. ;) They also make deployments trickier (or at least Solr has been blamed for staving off at least 3 d.o features I've been involved with, and it's been really frustrating).

I pointed this out on IRC and was told that the query was more complicated than merely MV being able to help with, but it's not clear to me exactly how/why that is. Would be great to get an issue summary that reflects the current state of things.

Comment #29

nick_vh

he/him

Ghent

commented 13 November 2012 at 09:34

Title:

Port issue views to Search API so we can have a performant backend

» Make the issue queue listings faster by leveraging techniques such as materialized views or solr

Changing the topic title, will tackle the summary also

Comment #29.0

nick_vh

he/him

Ghent

commented 13 November 2012 at 09:34

Project:	Project issue tracking	» Drupal.org customizations
Issue summary:	View changes

Adding several sections, and a blocker: This issue is being blocked by #949374: Add a project_issue_solr module to put issue metadata into the solr index

Comment #30

damien tournoud commented 13 November 2012 at 12:56

Title:	Make the issue queue listings faster by leveraging techniques such as materialized views or solr	» Port issue views to Search API so we can have a performant backend
Project:	Drupal.org customizations	» Project issue tracking

Let's not forget that we need a fulltext search. The main view is perfectly trivial to denormalize. This issue is about rebuilding the Project Issue architecture using Search API so that we can leverage external fulltext search engines instead of trying to do this in SQL.

Comment #31

nick_vh

he/him

Ghent

commented 13 November 2012 at 13:07

Then let's confirm the full-text search in sql is the issue. FWIW, this has been working in Drupal 6 and I don't see major changes in D7 in the functional requirements.
My understanding is that the join causes the high mysql load as pointed out in and this materialised view could solve this. And don't get me wrong, I don't care how it is being solved, but the solution should be a workable option for everyone.
I just don't see how adding Search API to this equation would solve the issue? Or is someone committed to work on the materialised views backend for search api? That would be awesome.

Just keep in mind that we are ending the development period for drupal.org on D7 and that we should be careful in adding new layers. Perhaps we should do some more profiling to find and make the issue reproducible, so that we can properly test and benchmark the work that is being done.

Comment #32

damien tournoud commented 13 November 2012 at 13:13

@Nick_vh: it has always been an issue. Those are part of the slowest queries of Drupal.org.

The pages at http://drupal.org/project/issues/search?text=[any popular keyword] usually just timeout. This whole view is disabled to anonymous users because it is so slow.

Let's stop using SQL as a fulltext search engine, this is what this issue is about.

ElasticSearch seems to be a good match for all our requirements (including the realtime aspect). There is a Elasticsearch connector for Search API already (actually two: 1 and 2), so we have something to start working from.

We had good success leveraging ElasticSearch in production in the past (on non-Drupal projects). ElasticSearch is in particular the indexation engine behind the Graylog2 log management solution. The setup (on a two server cluster with one replica) should be trivial on Drupal.org.

Comment #33

damien tournoud commented 13 November 2012 at 13:17

@webchick:

They also make deployments trickier (or at least Solr has been blamed for staving off at least 3 d.o features I've been involved with, and it's been really frustrating).

This has been the case because we built a lot of custom layers on top of the relatively feature-poor Apache Solr module. One of those features is the project browsing that we implemented in the project_solr module with dww back when we ported Drupal.org to Drupal 6. This module can (and should) die. It can be replaced by a Search API index and a standard view very very easily.

Rebuilding those features on top of the standard Search API would allow us to be more agile and more flexible. This is in part what this issue is about. I'm not sure what all the push-back is about.

Comment #34

nick_vh

he/him

Ghent

commented 13 November 2012 at 13:19

Even though I dig elasticsearch, I'm not sure if it is a solution Drupal.org wants to maintain next to a dozen of mysql and solr servers. And the investment is just as much as moving to Solr 4.0. Perhaps drupal.org can also wait till the solr modules are stable on Solr 4.0 as it would offer the same solution and there would be better community support (regardlessly of which backend, apachesolr or search api, you use) imho.

I'll leave it to the drupal.org crew to decide what they want to do here.

Comment #34.0

nick_vh

he/him

Ghent

commented 13 November 2012 at 13:19

Project:	Project issue tracking	» Drupal.org customizations
Issue summary:	View changes

Changing summary

Comment #35

fabianx commented 16 November 2012 at 19:15

Project:

Drupal.org customizations

» Project issue tracking

For full text I have another suggestion:

* I have worked on a prototype after Munich to replace the core search module with a Solr Query.

That is actually quite easy as Views is just loading the nids anyway, so all we need to do is filter:

SELECT MV QUERY ... WHERE nid IN (list-of-nids-from-solr) AND ...

The only problematic parts are:

* Exposed filters need to be translated into Solr queries
* Paging needs to be done via Solr

Besides that core search is doing nothing else than to restrict the result-set of the MySQL query, so with some constraints it is a viable solution.

Yes, it will take a moment for new content to show up, but for full text search that should not be a problem ... (and core search is dependent on cron too).

Comment #36

damien tournoud commented 20 November 2012 at 08:22

@FabianX: are you really suggesting doing the intersection of two queries like this? This will fail in 99% of the use cases. The way to make this work is to guarantee that the list fetched from Solr is bounded, which means doing the paging in Solr, which means doing the ordering in Solr too. So this will only work when doing full text queries *and* sorting by relevancy.

Anyway, this is just too much work when we can just rely on Search API and be done with it :)

Comment #37

steven jones commented 20 November 2012 at 21:38

Just want to clarify, where is the requirement for Solr 4 coming from? Is it just the speed of the index update?

Comment #38

dww

we/he/they

commented 21 November 2012 at 20:42

Re Steven Jones: There's no requirement for Solr 4. In fact, there's no requirement for Solr at all. The main requirements are:

1) Doesn't melt the d.o DB servers.

2) Doesn't introduce significant latency in issue queue search results (e.g. if you update an issue to change the status, that change needs to be reflected in the search results In A Timely Manner(tm) -- for some definition of "timely").

3) Doesn't regularly take 5-15 seconds just to find the results (aka "performant" as per the title of this issue).

---

Ideally, whatever we do here also:

4) Doesn't require massive code maintenance and custom effort to get it working.

5) Benefits other sites suffering from similar performance problems due to D7's per-field data storage.

From all the Solr expertise that's already been shared in this issue, Solr seems like the wrong solution to this, primarily because of point 2 (latency). In theory, Solr 4 would make it possible to do better, but that would open a whole new can of worms about trying to upgrade all of d.o's Solr infra to Solr 4. I don't know enough about Solr 4 to speak intelligently about the pros/cons of that, but it's definitely *not* a requirement that we use Solr for this, much less Solr 4.

Re Nick_vh: As Damien pointed out, the issue queue views have regularly been some of the most expensive and problematic queries on d.o. This is not a new issue. What's new is the magnitude of the problem, due to D7's core field storage exploding the cost of the queries. In D6 we could play some denormalization tricks and got things sorta-kinda-working, but we don't want to continue doing our own custom hacks for this if we can help it.

Finally, we're all keenly aware of the d.o D7 development process and that we're "ending the development period for drupal.org on D7". I wouldn't have tagged this for the upgrade and bumped it to critical if nnewton thought it was okay to launch without it. This isn't just adding new layers for the sake of elegance or because it seems like a nice and clean solution. It is a critical launch blocker to resolve this massive performance regression. However, given that we have to do something here, I'm going to strongly resist custom hacks that only work part of the time and just for the benefit of d.o if there *is* an elegant and clean solution that will solve our needs *and* help other sites in the same boat.

@all: Since this is a critical task, and it seems we're at risk of getting hung up going around in pointless circles rehashing the fact that Solr isn't going to work for this, can we please move on? My proposal:

- Continue with the Search API port that Damien started in the 949372-search-api branch.
- Continue with the Database Search backend as I mentioned at #3 and drunken monkey supported in #26 in response to #1836928: Materialized Views Search API backend.
- Get nnewton's input on if he'd like to continue with database search or investigate some other possible search backends.

Can I get an 'amen'? ;)

If not, can someone clearly articulate why there appears to be so much hostility towards using Search API in this case? I'm really confused. Smart people who I generally trust are making what seem like irrational arguments against a solution that appears to pretty obviously address all of the concerns here. Pardon my candor, but WTF?

Thanks,
-Derek

Comment #39

steven jones commented 21 November 2012 at 23:02

@dww Thanks for the summary of where we are.

I've just had a quick look at the code for the Search API DB backend, and it seems to index all fields into different tables, so isn't denormalised at all? The 'quick win' Search API DB backend might be a non-starter too.

Comment #40

dww

we/he/they

commented 22 November 2012 at 00:03

Thanks for looking. Are you sure it's indexing all of its fields into separate tables? I was under the impression it was putting all the data for each index into a separate table, and that each index corresponded to a given view (more or less).

Hrm, but you're right, a very quick skim of the code seems to agree with what you're saying. Hrmph. Maybe we're going to need #1836928: Materialized Views Search API backend after all if this is actually going to be performant at d.o-scale. Or we go with a real search backend as Damien is proposing. Webscale FTW! ;) (Sorry, I couldn't resist).

Comment #41

webchick

she/they

English

Vancouver 🇨🇦

commented 22 November 2012 at 00:20

I'm a huge proponent of something like #1836928: Materialized Views Search API backend which makes the solution to this problem something sites outside of Drupal.org can benefit from. That's a sure-fire way to both demonstrate "return on investment" for the Drupal.org upgrade, as well as help to encourage people other than the Drupal Association to work on further improvements.

Comment #42

steven jones commented 10 April 2013 at 04:10

Issue tags:

+solr

How about instead of #1836928: Materialized Views Search API backend we improve the Search API Database Search module to be able to put as much in a single table as possible, given the usage statistics of the two modules, that might benefit the community more? It might also be significantly easier since Search API Database Search already does full-text search, whereas Materialised views does not.

If such an approach was desired, then I think the following could happen:

Continue with the Search API port that Damien started in the 949372-search-api branch.
Continue with the Database Search backend, but try to improve it so that as much of the data as possible gets denormalised (PBS has some nice code that might be useful here)
(In the future) When the Search API Solr backend, and apachesolr modules can talk to Solr 4, and the d.o infra has it, move to Solr 4, and the low latency goodness that that brings..