I have a project and I hope Search API can be part of the solution. For this project, a Solr core will index a set of non-Drupal data. I would like to use Search API, and in particular its Views support, to allow users to search this data and display it in list and detail record formats.

After looking through the code, it seems like search_api is well suited for something like this. I could see how I might extend the views handlers for any special data types I need to handle, and in general, it looks like it is well abstracted to support any data set.

Do you see any major roadblocks to using Search API to pull data from a Solr core that indexes non-Drupal data? Any hints or advice?


Comments

drunken monkey’s picture

I think that your impression is misleading — sadly, the Search API's architecture isn't very suitable for searching data indexed by another application. Especially the fact that searches (and views) are always linked to an index (which you wouldn't have) is a roadblock here. This is probably the biggest flaw in the Search API at the moment.

To do it you'd probably have to set up a "dummy" entity type with property information to tell the Search API the whole metadata (otherwise, Views integration won't work). If you want to use your own Views field/filter/* handlers for some fields, use hook_views_data_alter(). Then create an index and set the indexed fields appropriately (again, to make Views integration work — even though I guess you could also just add all Views handlers manually in the alter hook, which also means you would only have to provide minimal property info for the entity type). You also either need a way to load the data by ID, or use a customized Solr service class to load the entity data into the result set. And if you want to use the normal Solr service class (resp. its indexing method), you'd have to index the data with field names according to the Solr service class' schema (with the field names and appropriate prefixes).
I've also never tried this myself, so there might be some additional hurdles that I'm overlooking at the moment.
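For illustration, the "dummy" entity type approach described above might start out something like this. This is only a sketch of the Drupal 7 hooks involved; the module name, entity type, table, and field names are all hypothetical:

```php
/**
 * Implements hook_entity_info().
 *
 * Declares a minimal "dummy" entity type the Search API can attach an
 * index to. All names here (mymodule, solr_document) are hypothetical.
 */
function mymodule_entity_info() {
  return array(
    'solr_document' => array(
      'label' => t('Solr document'),
      'base table' => 'mymodule_solr_document',
      'entity keys' => array('id' => 'id'),
      'fieldable' => FALSE,
    ),
  );
}

/**
 * Implements hook_entity_property_info().
 *
 * Provides the metadata the Search API needs for its "Fields" form and
 * for Views integration; one entry per Solr field you want exposed.
 */
function mymodule_entity_property_info() {
  $info['solr_document']['properties'] = array(
    'id' => array(
      'type' => 'integer',
      'label' => t('Document ID'),
    ),
    'title' => array(
      'type' => 'text',
      'label' => t('Title'),
    ),
  );
  return $info;
}
```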

I still think it can be done, though, and might well be worth the effort, if done well. And if you still want to do this, I would be glad to help with any further questions or problems that come up, as well as explain my plan above in more detail.
This drawback of the Search API has been pointed out numerous times now and a whole bunch of people would probably be glad if a feasible work-around would be known and tested. Sadly, even though I would love to address this directly with the Search API, this would require significant architectural changes which would (very surely) break compatibility and therefore can't easily be introduced now.

robeano’s picture

Component: Code » Miscellaneous

To start, you ROCK drunken monkey! Thanks for the thoughtful response to my open-ended question.

The more I investigate Drupal, Apachesolr, and this non-Drupal data set, the clearer it becomes that I must create an entity to support it. This entity needs to store a minimal amount of data to support a much larger schema. My goal is to limit the amount of knowledge Drupal needs in order to handle a large schema. It isn't clear to me whether I will have to manually create the Views handlers. I'm not against that at this point, especially if it simplifies other parts of this integration.

I see what you're saying about search_api/views needing an index (hence requiring a new entity type). The Solr cores I will be working with will have hundreds of fields. Do you think Search API can scale to a data structure of that size?

drunken monkey’s picture

I see what you're saying about search_api/views needing an index (hence requiring a new entity type). The Solr cores I will be working with will have hundreds of fields. Do you think Search API can scale to a data structure of that size?

Hm, the "Fields" page will be pretty huge and hard to handle, but since no indexing is done by the Search API, I can't imagine any real problems, performance-wise. The result data will of course have to be in the memory when searching, but there is hardly a way around that, and the Views computations also shouldn't take too long. And the search time is (or should be) completely independent of data size — as long as you don't filter on all of those fields simultaneously, of course … ;)

If you do find any performance bottlenecks in the Search API part, though, it would be great to know about them!
Likewise, it would of course be great to hear how this worked for you, what you did and what you learned. As said, there surely are others with similar requirements.

robeano’s picture

Thanks again. This info is helping me come to some decisions.

Crell pointed me to this Search in D8 discussion. You mention in this comment: http://www.google.com/url?q=http%3A%2F%2Fgroups.drupal.org%2Fnode%2F1172...

"So basically the Search API's design should be able to handle other data sources, as long as the data-extracting step in the index is abstracted. And, of course, the "Fields" form, resp. the act of retrieving the metadata."

You make it sound so easy. ;)

Do you have an idea of how long (in hours) it would take to abstract the data-extraction step from the index and to change the Fields form to support other data sources?

I'm not sure if I can work on that, but I'd like to keep it open as an option for this project I'm working on.

As it stands, my current plan is to

* create a Solr entity
* dynamically add fields to this entity based on the Solr core I want to search, or at least the ones that Search API needs to know about
* create the Search API index
* create a view to work with that Search API index
* extend Views to support a large set of fields so that the Views UI can provide an autocomplete text field for adding fields to a view (instead of the usual checkbox list it provides now)

drunken monkey’s picture

"So basically the Search API's design should be able to handle other data sources, as long as the data-extracting step in the index is abstracted. And, of course, the "Fields" form, resp. the act of retrieving the metadata."

You make it sound so easy. ;)

Well, it is comparatively easy, if you are rewriting the Search API to a 2.0 or D8 version. Alas, it's a lot more complicated if you want to work with it now, with the existing Search API code. :-/ Rewriting the Search API while maintaining compatibility would then probably be somewhere in between …

Do you have an idea of how long (in hours) it would take to abstract the data-extraction step from the index and to change the Fields form to support other data sources?

I'm terrible at estimates, and doing it for someone else is even worse. I'd guess it would take about 20 to 30 hours — but usually I underestimate the effort badly, so be warned. (I have gotten better, lately, though …)
The question is, how you would want to do this: as a patch for the Search API, as a hack, as kind of a 2.x branch or as a compatible extension. My estimate is for the first.

Your current plan looks good. I'll just have to say once again that you should keep in mind that the data in Solr will have to comply with the schema used by the search_api_solr module: id as the unique ID, a correct index_id for all documents, and correct field names for the data. Otherwise you'd also have to use hook_search_api_solr_query_alter() to dynamically adapt all queries. copyField elements in the schema.xml would probably help, if your third-party application sends the indexed data in a fixed way.
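If the external core's field names don't match, the alter hook mentioned above might be used roughly like this. The field mapping is a made-up example, and the exact structure of $call_args should be checked against search_api_solr's API documentation:

```php
/**
 * Implements hook_search_api_solr_query_alter().
 *
 * Rewrites filter queries so that the field names the module expects
 * match the ones actually used in the external Solr core. The mapping
 * below is purely illustrative.
 */
function mymodule_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  $mapping = array(
    'ss_title' => 'title_s',
    'its_year' => 'year_i',
  );
  if (!empty($call_args['params']['fq'])) {
    foreach ($call_args['params']['fq'] as $i => $fq) {
      $call_args['params']['fq'][$i] = strtr($fq, $mapping);
    }
  }
}
```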

robeano’s picture

I was thinking of a patch.

The Solr schema I'm using may not be the same, and I'm prepared to use the alter hook to adapt all queries as needed. My bigger issue is that I cannot assume that index_id will exist in my Solr core. As I've played with Search API the last couple of days, it seems like the index_id is used throughout. I'll need to come up with some sort of map, or have the patch support an index with no index_id field, or something. That is unclear to me. What do you think?

I was thinking 30 hours too for a patch.

I may have another patch to add to Search API which retrieves the available fields in a Solr core. I need that for my Solr entity. I could put it in a custom module too, but I thought you might like to add this feature to Search API?

drunken monkey’s picture

The Solr schema I'm using may not be the same, and I'm prepared to use the alter hook to adapt all queries as needed. My bigger issue is that I cannot assume that index_id will exist in my Solr core. As I've played with Search API the last couple of days, it seems like the index_id is used throughout. I'll need to come up with some sort of map, or have the patch support an index with no index_id field, or something. That is unclear to me. What do you think?

I don't think that's much of a problem, as long as only a single index is stored on the server. (Which I assume is the case.) Then you can just change the Solr request being made to exclude the filter on index_id (either via the hook or the service class' brand-new preQuery() method).
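A sketch of the preQuery() approach, assuming a custom service class; the exact filter-query format inside $call_args may differ between module versions, so treat this as an outline rather than working code:

```php
/**
 * Hypothetical Solr service class for a core filled by a third-party
 * application, where no index_id field exists on the documents.
 */
class MyModuleSolrService extends SearchApiSolrService {

  protected function preQuery(array &$call_args, SearchApiQueryInterface $query) {
    // Remove any filter query that restricts results to a Search API
    // index ID, since the external documents don't carry that field.
    foreach ($call_args['params']['fq'] as $i => $fq) {
      if (strpos($fq, 'index_id:') === 0) {
        unset($call_args['params']['fq'][$i]);
      }
    }
    // Re-key the array so Solr receives a clean list of filters.
    $call_args['params']['fq'] = array_values($call_args['params']['fq']);
  }

}
```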

If you are really doing the patch (which would of course be awesome!), it would just be good if we went through the planned approach beforehand, to avoid having to redo part of the patch afterwards.

I may have another patch to add to Search API which retrieves the available fields in a Solr core. I need that for my Solr entity. I could put it in a custom module too, but I thought you might like to add this feature to Search API?

Well, not in the Search API itself, of course, but for the Solr backend module it would be a nice addition, yes. I thought about that, too, already, but then decided it wasn't really necessary.
If you are now implementing this anyways, I'd be glad to include it into the module. How would you do this, with additional methods in the service and/or connection class?

fago’s picture

Haven't read the whole issue in detail, but for integrating non-Drupal data I'd go with wsclient and improve/extend it to provide the remote data as entities. As the wsclient module already makes use of the entity property info, all data properties would be available in the Search API then!

robeano’s picture

Thanks drunken monkey.

I have two thoughts for my approach to supporting non-Drupal Solr data:

1) change the SearchApiIndex class to support non-Drupal data
OR
2) create a new index class that supports non-Drupal data and leave the SearchApiIndex class Drupal-specific

The latter option allows us to move forward without interrupting others as much. Plus, I don't think it's all that bad to have an index class that is optimized for Drupal-centric data. I'm open to opinions here, so let me know.

During my testing, I have been connecting to a Solr core that has indexed data from another Drupal site. I can exclude the index_id field and filter in SearchApiSolrService and I get results back, but I run into minor issues in several spots that seem to rely on the index_id.

In particular, the SearchApiSolrService class's search() method relies on index_id, as does the SearchApiViewsQuery class.

For the SearchApiViewsQuery issue, I think that if I load my non-Drupal entities before the SearchApiViewsQuery class's addResults() method is run, then Views should be able to work with a non-Drupal result set.

If I have this right, let's say I created a new index class to support non-Drupal data. This class can provide a postprocessor which loads the entities and provides a hook so other modules can help define the attributes of these entities. Basically, I want to create a generic Solr index that allows for the ability to get results from Solr no matter what fields are in the Solr core. This generic Solr index can provide some generic attribute loading. Another module (be it custom or contributed) can define any specific attributes that need to be available.

I feel like I'm still getting my head wrapped around Search API architecture. Is this making any sense? I'm totally open to working with you to make sure a patch is created that works for all of us. BTW, are you going to DrupalCon Chicago next week? I feel like we're at a point where we need to communicate in additional ways besides the Issue queue. :)

You are welcome to ping me in IRC. I'm usually on #drupal-contribute most days.

drunken monkey’s picture

Category: support » feature

Sorry, no, I won't be at DrupalCon Chicago. You are right, it would probably have been the best way to discuss this. I also don't really use IRC, but I think it will probably also work like this, in the issue queue.

Regarding the rest of the post, it seems like you are already quite familiar with how the Search API works, which is of course needed for this task. For the approach, I'd suggest a combination of your two suggestions, similar to the current layout of the server:

  • There is a single SearchApiIndex class used as the entity class (i.e., returned for search_api_index_load() and used in all API methods).
  • The class only implements the CRUD methods, and similarly generic methods.
  • All data-source specific methods are moved to a new "datasource" class, calls to which are passed through by the index class.
  • All current methods in the SearchApiIndex class that are entity-specific would have to be moved to a new entity-based datasource class, that is provided directly by the Search API module.
  • The datasource object is automatically loaded by the index, based on a new "datatype" (?) field. (Or we could change the entity_type field to the format "DATATYPE:ENTITYTYPE" (e.g., "entity:node"), storing both the type of data source and the entity type (or, specific type in the data source).)
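The delegation described in the list above might be sketched as follows; the interface, class members, and the factory function are all hypothetical names for illustration:

```php
/**
 * Hypothetical interface for the proposed datasource classes.
 */
interface SearchApiDatasourceInterface {

  /**
   * Loads the items with the given IDs, analogous to entity_load().
   */
  public function loadItems(array $ids);

  /**
   * Returns field metadata, analogous to entity property info.
   */
  public function getPropertyInfo();

}

/**
 * Sketch of the pass-through in the slimmed-down index class.
 */
class SearchApiIndex extends Entity {

  protected $datasourceObject;

  /**
   * Lazily instantiates the datasource based on the new "datatype" field.
   */
  public function datasource() {
    if (!isset($this->datasourceObject)) {
      // search_api_get_datasource() is a hypothetical factory function.
      $this->datasourceObject = search_api_get_datasource($this->datatype, $this->entity_type);
    }
    return $this->datasourceObject;
  }

  public function loadItems(array $ids) {
    // Delegate instead of calling entity_load() directly.
    return $this->datasource()->loadItems($ids);
  }

}
```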

This would (hopefully) take care of all entity-specific code / logic in the index. However, there are other things that would then still be entity-specific. As you have noticed, I have already provided a way for a search service class or postprocessor to bypass entity loading and provide the entities themselves. Nevertheless, there are still several other places that can't be bypassed and that use, for the most part, either entity_load() or the Entity API's metadata wrapper. The datasource class would therefore have to provide methods abstracting from this principle. To ease code updating, these abstractions should probably be as similar to the currently used entity-based variants as possible. The default entity datasource would just implement these methods by calling the mentioned entity-specific methods, while other data sources would have to provide their own means.
Or, maybe, we'll even discover that we can contain all things that use the entity metadata wrapper inside of the datasource class, which would again greatly ease the task (as abstracting the wrapper doesn't sound like an easy thing to do).

Methods/Hooks for discovering new, changed and deleted items of other data sources would of course also be needed in most cases, but those could probably be taken care of by the module providing the respective datasource class.

What also came to my mind while thinking about this is that some forms, like the "Fields" tab, might best be made flexible by moving them into the datasource class, too. This would have to be investigated, though.

Basically I think that once there is a generic way to specify different data sources, we can more easily see what needs to be adapted to that more generic approach, and what can remain "hard-coded". Making the "Fields" tab generic later on, after the patch has landed, probably wouldn't be hard, for instance.
(Regarding "landing": I hope to go RC, and then stable, soon, so this would then probably start a backwards-compatible 2.x branch, with the 1.x branch only receiving bug fixes and no database changes.)

For your use case, you might then want to just change the index() method of your data source class to do nothing, for example, so data can only be searched, not indexed, via indexes of that type.
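For the search-only use case just mentioned, the datasource's indexing method might simply be a no-op. Class names here are hypothetical:

```php
/**
 * Hypothetical datasource for data indexed entirely outside of Drupal.
 */
class MyModuleExternalDatasource extends SearchApiEntityDatasource {

  /**
   * Overridden to do nothing: the external application fills the Solr
   * core, so Drupal-side indexing is skipped and indexes of this type
   * are effectively read-only.
   */
  public function index(array $items) {
    return array();
  }

}
```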

becw’s picture

Hi drunken monkey, I'm working with robeano so I have some thoughts here too...

It might make sense to be smarter about wrapping our data in an entity, rather than rearchitecting Search API. I'm inclined to try to use Drupal entities as the datasource, rather than wrapping them in an abstraction--as you note, abstracting entity wrappers really does not sound easy. Wrapping our data in an entity also provides more opportunities to interact with it through other Drupal APIs.

This still means a couple of changes to Search API and Search API Views, though:

  • make sure that Search API can use alternative index classes
  • provide a Search API index class that is read only
  • make the Views integration (in Search API Views) more flexible; it
    needs to match Field API field types with Search API Views field
    handlers via a hook implementation, rather than using hard-coded
    values in _search_api_views_add_handlers()

I'm sure I need to check in with robeano to see what parts of this exist, and surf the issue queue to see if any are in progress.

Anyway, I'm thinking we can provide arbitrary properties from a Solr doc in a Field API field, and put the field on a Drupal entity--and with more flexible Field API field handling in Search API plus a read-only index class, we can make this solr doc field behave how we like. The solr doc field stuff would be specific to solr, but the read-only index class need not be (because of your separation between the search service class and the index class).

drunken monkey’s picture

While just defining your data as entities does of course work (and even has some benefits), it's still a workaround. I agree with you that it is currently a viable option, and maybe really better than working on a generic solution right now. However, in the long run, more people will want to do this, so at some point this really should be possible more comfortably/directly, without any "hacks" to the entity system.
I'm thinking right now, maybe we don't even have to abstract from the entity metadata wrappers. They are also capable of wrapping data other than entities, as far as I know, so we'd maybe only let modules wanting to define search metadata use the same structure as for providing entity property information. Then, only the problem of the used entity-specific functions remains, which could probably just be abstracted to analogous methods on the index class.

make the Views integration (in Search API Views) more flexible; it
needs to match Field API field types with Search API Views field
handlers via a hook implementation, rather than using hard-coded
values in _search_api_views_add_handlers()

I really don't think that has many use cases. This isn't something you couldn't do already with hook_views_data_alter() (as far as I can tell), and with flexibility a custom hook very probably couldn't reach. Granted, it's a little less comfortable than having a dedicated hook for this, but considering the probably rather small number of people who'll do something like that, I don't really think that the added comfort justifies the effort (and added complexity).
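For reference, swapping in a custom handler via hook_views_data_alter() might look like this; the index machine name and handler class are made up:

```php
/**
 * Implements hook_views_data_alter().
 *
 * Replaces the field handler for one field of a Search API index's
 * Views "table". Table, field, and handler names are hypothetical.
 */
function mymodule_views_data_alter(&$data) {
  $table = 'search_api_index_my_solr_index';
  if (isset($data[$table]['title'])) {
    $data[$table]['title']['field']['handler'] = 'MyModuleSolrTitleHandler';
  }
}
```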

Another potential point of interest: I plan to submit a proposal for this year's Google Summer of Code that would include this issue. If I get accepted, you'd have my full support (or could even leave the issue to me) by the end of May, when the program starts. But of course, I'll also help here until then, and wouldn't object if you would get this working before GSoC starts. ;)

Crell’s picture

Hi monkey. I'm the third part of the Palantir team working on this project along with becw and robeano. :-) (Full court press, I know.)

I agree that a solr pseudo-entity is not the best long-term solution. The best long-term solution is to separate search_api from its entity dependency in general. However, we do have a limited budget and timeframe in which to complete our project, and unfortunately that does not allow for a complete refactoring. It also needs to be substantially done before GSoC starts. :-(

Our current plan involves only two changes of note to search_api, I believe. One involves splitting the index class into two composed classes, a searcher and an indexer, which can be selected in configuration. In our case we would select a "null" indexer that does nothing while in the common case people would just use the default, which is essentially the current code just moved around.

The second is the views integration. Having hard-coded values in a list like that is very inflexible and not very Drupal-ish. In our case, it also means that we cannot add decent support for a new type of field with its own custom views handlers, which we will need to do, without a very hackish and round-about approach.

It also tends to hide problems in the API that you could find if you were leveraging your own API. I actually wrote a very similar change for ApacheSolr for Drupal 7, which I reference twice in my DrupalCon Chicago presentation on API design: http://chicago2011.drupal.org/sessions/aphorisms-api-design </shameless plug>

While it's possible you could accomplish the same thing with hook_views_data_alter (I'm not actually sure), that's actually a really ugly hook to deal with since it's such a complex array structure. If we can provide an easier way to manipulate that data in a more Drupal-consistent way, so much the better for everyone. There's really not much additional complexity, either.

The rest of what we're planning to do is new code development, which could live in search_api or not. (The Solr pseudo-entity, solr doc field, etc.) We can by and large decide that later.

Bec put together a nice graphic of what we're proposing:
http://skitch.com/becw/ripg4/search-api-sarnia-1.4

We're ready to go, but at this point just need to make sure you'd accept the patches we need as they're written. We really don't want to go down a path that you as maintainer wouldn't accept, for obvious reasons. Even if it's not the ideal long-term path forward, it's at least an incremental improvement on the current code.

Let us know if you have any serious issues with the above approach. If you're available on skype it may be good to try and get some voice time in as well, since the issue queue, while effective, sucks for synchronous discussion. ;-)

drunken monkey’s picture

OK, that's of course understandable. The indexer patch I'd of course accept and support (as already stated — it could be leveraged to give it complete entity-independence later); and if you see definite use cases for the Views field handler hook, I'd accept that, too, yes.
I believe the Search API very much relies on leveraging its own API, since you mention it. There are only very few places where stuff provided by the Search API is really special-cased or hard-coded. Of course, following this line of thought, I'd also have to conclude that making field handler selection easily alterable is a good step.

The Solr pseudo-entity also sounds like it would be interesting for some people, but I think it should probably be in a separate module. If I link to it on the Solr backend project page, people interested in doing this should find it.

becw’s picture

So, I've started on an indexer patch, and in the process I built a method usage diagram--I felt like I didn't know the code very well, so I spent some time reading it. The diagram came out pretty cool; see attached.

BenK’s picture

Subscribing

becw’s picture

Here's a quick status update:

I have some work on abstracting the searcher and indexer components of the SearchApiIndex class in a sandbox: http://drupal.org/sandbox/bec/1096826

I've started to experiment with implementing the searcher and indexer classes as CTools plugins. I'm also wondering how to abstract the SearchApiIndex classes a bit more cleanly.

I've been going a little crazy with the diagrams this week. This is a diagram of where the external Solr data will tie into Search API: http://img.skitch.com/20110318-kgm1m9f34csye46d72ttqi8esc.png

This is a revised version of the diagram from the other day, of SearchApiIndex methods: http://img.skitch.com/20110325-tw3fe1n6rw2drxmwxxrwubqxsd.png

And this is a quick diagram of the entity inheritance going on here: https://img.skitch.com/20110326-kcjmk7h7wdxdwsad7mpt4auwth.png

drunken monkey’s picture

Thanks for the update! Haven't really had the time to look at the code in detail, though. But as long as there are no problems, there's probably no need to. ;)

I'd just be careful about using CTools, I'm a bit hesitant to add another dependency to the module. Although, of course, only very few people won't use Views / CTools …
I don't really know, haven't looked into this: What are the advantages of designing those classes as CTools plugins?

On a different note: I didn't realize you also want to abstract the searcher part. I thought that there's only very little index-specific code there — or, really, none at all. What exactly is it you want to abstract there?
Also, how would selecting the indexer (and searcher) plugin to use, work? Would this be fixed on the entity type you select, or would this be another option when creating the index? Would this be stored directly on the index or in the options array, and could this be edited later?

becw’s picture

I'm definitely wary of adding a ctools dependency, but I found myself building an info hook for plugins, and ctools seemed like a good place to start. Since there are some indexer-specific forms on the index config, it may make sense to give indexer classes a method like "isConfigurable()", and then to use form callbacks provided in the plugin info hook to determine which forms should be used for configuring the indexer. I do not like the pattern of providing form callbacks as object methods.

Abstracting the searcher is, at this point, about symmetry and encapsulation. The only use case I can think of is if you wanted multiple sites writing to a core and only one site reading from it. Even in that case, disabling the searcher wouldn't be strictly necessary.

One thought I had is that the next step here could be providing the datasource as a plugin too, as you described above. This way each index would have a searcher, indexer, and datasource plugin, and we could deal with providing config interfaces and options storage with the same mechanism across each of them--they would all just be "plugins" as far as the index was concerned.

As far as selecting indexer and searcher plugins for a particular index, I think that there should be a select box in the UI of the index config page, with a list of the options from the plugin info hook. If a particular indexer is only appropriate for a particular data source, that could be negotiated via the plugin info array later.
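Such an info hook might look something like this. It is entirely hypothetical, since no such hook exists yet; the plugin keys and classes are placeholders:

```php
/**
 * Hypothetical info hook declaring the available indexer plugins; the
 * index config form would build its select box from these entries.
 */
function hook_search_api_indexer_info() {
  return array(
    'default' => array(
      'label' => t('Default indexer'),
      'class' => 'SearchApiDefaultIndexer',
    ),
    'null' => array(
      'label' => t('Read only (no indexing)'),
      'class' => 'SearchApiNullIndexer',
    ),
  );
}
```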

I'm stretching my brain to think of use cases for alternate indexers and searchers other than "off" and "on", though. And I also would rather only abstract data in one place rather than three.

drunken monkey’s picture

I'm definitely wary of adding a ctools dependency, but I found myself building an info hook for plugins, and ctools seemed like a good place to start. Since there are some indexer-specific forms on the index config, it may make sense to give indexer classes a method like "isConfigurable()", and then to use form callbacks provided in the plugin info hook to determine which forms should be used for configuring the indexer. I do not like the pattern of providing form callbacks as object methods.

What's wrong with providing form callbacks as object methods? Especially since you want to use the object's properties anyways, I think this approach makes much sense. I also wouldn't have thought that CTools promotes a different pattern, as Views handlers also work that way.
In any case, even if there are good reasons against this approach: it's the approach taken in all other Search API plugin classes (servers, processors, data alterations) and should therefore be kept just for the sake of consistency.
Which would, of course, also be an argument against using the CTools plugin system (as this doesn't happen anywhere else in the Search API). However, I don't know how much this would even affect other contributors. They don't really have to know whether they are coding a CTools plugin or a specific Search API one, do they?

(One unrelated thing I'm curious about, though: why do you still have to specify the class file and path for CTools plugins? Do you happen to know?)

I'm fine with your other points. The indexer/searcher options on the index page make perfect sense, of course. We just have to think about whether later changing those makes sense.
While abstracting the searcher without any real use case in mind seems a bit strange, I guess there also isn't any real drawback (the select boxes can be hidden anyways, if there is only one option to select), so doing it just for the sake of a cleaner architecture might make sense.
On the other hand:

I'm stretching my brain to think of use cases for alternate indexers and searchers other than "off" and "on", though. And I also would rather only abstract data in one place rather than three.

Wouldn't it then be a lot easier, and maybe also more practical, to just add a "Read only" option to indexes? It's maybe kinda late to ask that question, but if there isn't really a good use case in sight, why go all the way and abstract all of this with plugins, etc.? And while it provides me with a good example for later abstracting the data source, the indexer / searcher abstraction doesn't seem to be really necessary for that step.

drunken monkey’s picture

Just noting that #1109130: Add better structure for Views field rendering might affect you (API change slightly changing how Views field rendering works).

becw’s picture

I've created a ticket (with code) in the search_api_solr queue about adding a getFields() method to the Solr service class: #1110820: Add support for the Luke request handler

becw’s picture

I ended up adding a 'read only' flag to Search API index entities. This was straightforward, but involved touching several places in the code. These changes are in read_only_flag-1064884-23.patch, attached.

I also ran into some issues with search_api_views:

  • There is a comment in the code, "Maybe the service class or a postprocessor already set the entities." This is exactly the capability I was looking for, but I wasn't able to figure out where to do this. The service class or a postprocessor doesn't have access to the View object--the only way to pass through loaded entities is in the $results array itself.
  • Search API Views generates an array of arrays as the View result (the results property of the View object, like $view->results), but Views core always has an array of objects. This makes it hard or impossible to make new Views handlers for use with Search API Views that extend Views' own bundled handlers.

I've worked around these two issues with the changes in search_api_views_workarounds-1064884-23.patch, attached. I'm not sure that this should go in as-is, and I'm willing to write more code or review an alternative solution if we can work out how to solve these issues.

It did turn out that, since my module is providing a field, I can implement hook_field_views_data() to attach custom fields/filters/sorts/arguments to base tables provided by Search API Views, rather than altering Search API Views' hook_views_data() implementation... that was handy.
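The hook_field_views_data() approach from the last paragraph might be sketched like this; the field type and handler class names are made up:

```php
/**
 * Implements hook_field_views_data().
 *
 * Attaches a custom field handler for a hypothetical "solr_document"
 * field type to whatever base tables expose it, including the ones
 * provided by Search API Views.
 */
function mymodule_field_views_data($field) {
  if ($field['type'] != 'solr_document') {
    return;
  }
  // Start from the default Views data for this field ...
  $data = field_views_field_default_views_data($field);
  // ... and swap in our own handler wherever the field appears.
  foreach ($data as $table_name => &$table_data) {
    $field_name = $field['field_name'];
    if (isset($table_data[$field_name])) {
      $table_data[$field_name]['field']['handler'] = 'MyModuleSolrDocHandler';
    }
  }
  return $data;
}
```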

drunken monkey’s picture

Phew, quite a large patch. It mostly looks fine, though, and I agree that it would be a nice addition to the Search API. A few things that need to be fixed, though:

  • Switching between read-only states (enabled / disabled) isn't really handled, as far as I see. You should delete the corresponding items from the search_api_item table when marking an index as read-only, and then re-insert all items when unsetting read-only again.
    For this, it would probably be a good idea to split the "Remember items to index" logic from SearchApiIndex::postCreate() into its own method. (And also use that in search_api_enable() – what the heck did I drink while writing that?)
  • Why remove the search_api_mark_dirty() function, instead of adapting it? I also don't really understand the code you replace it with – why use a merge query with that IF condition instead of just filtering on changed = 0 like I do? And why execute one query for each index instead of a single one for all?
  • I also don't really understand how the relation to the server is handled. As far as I can see, you don't call $server->removeIndex() for read-only indexes, but otherwise there don't seem to be any changes. You can't call addIndex() when you won't call removeIndex(), as you can't know what the server does in those methods, and what structures it might set up.
    I guess this is just so the server doesn't delete the items, which might have been added by another program? Please figure out another way to do this. If necessary we can even say that servers will have to check for the index's read-only state themselves when removing it.
    Behaviour when moving a read-only index to another server is also unclear – would this even make sense? Or does the feature even make sense for normal servers? In some sense, this is more of a server feature than something on the index side, the index is (at least in your case) just a front for the server getting its data from elsewhere.
    In other cases, where you really just set a regular index on a regular server to read-only, I'd expect the server to remove its indexed items when the index is removed. So maybe the index should just be removed normally, and you can special-case this in your Sarnia server?
  • While forbidding access to the "Status" tab might make sense, you'll still have to let users define indexed fields and processors, as those are used at search time, too. (This is also the answer to your @TODO in the index class.)
  • Thanks for showing me _element_validate_integer(), good to know!
  • #1068342: Provide a "fields to run on" option for processors already introduces a search_api_update_7107() function, please increment your number (as the other patch will probably land earlier).
  • The new field should be included in search_api_entity_property_info().
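The toggle handling suggested in the first point could be sketched roughly as follows. This is illustrative only, not the actual patch: it assumes the search_api_item tracking table with an index_id column, and mymodule_track_all_items() is a hypothetical stand-in for the "remember items to index" logic from SearchApiIndex::postCreate().

```php
/**
 * Sketch: react to an index's read-only flag being switched.
 */
function mymodule_index_set_read_only(SearchApiIndex $index, $read_only) {
  if ($read_only) {
    // Stop tracking: drop the index's rows from the tracking table.
    db_delete('search_api_item')
      ->condition('index_id', $index->id)
      ->execute();
  }
  else {
    // Re-insert all items so they are queued for indexing again.
    // Hypothetical helper; see SearchApiIndex::postCreate().
    mymodule_track_all_items($index);
  }
}
```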

I think that's all. This is just a code review, though, I haven't tried it out yet.

Hm, this issue is getting a little crowded – could you please create a new issue, exclusively for the read-only flag? And please tag it with "API change".

There is a comment in the code, "Maybe the service class or a postprocessor already set the entities." This is exactly the capability I was looking for, but I wasn't able to figure out where to do this. The service class or a postprocessor doesn't have access to the View object--the only way to pass through loaded entities is in the $results array itself.

Yes, and that's where Views should find them. What's the problem? There's just no single "processed_results" array, but rather an "entity" field on each individual result. I agree that this probably wasn't the ideal choice, but using a simple loop to set the entities in the right locations isn't too hard, and certainly better than introducing a new, undocumented key.
Or have you tried that already and it didn't work?
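For illustration, the kind of loop meant here might look like this (a sketch only; my_module_load_record() is a hypothetical loader for the external data):

```php
// In a custom service class's search() method (or a postprocessor), after
// the result set has been built: attach the pre-loaded objects where the
// Views integration expects them, as an 'entity' key on each single result.
foreach ($results['results'] as $item_id => &$result) {
  $result['entity'] = my_module_load_record($item_id);
}
unset($result);
```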

Search API Views generates an array of arrays as the View result (the results property of the View object, like $view->results), but Views core always has an array of objects. This makes it hard or impossible to make new Views handlers for use with Search API Views that extend Views' own bundled handlers.

Yes, you're right, I sadly realized that too late. However, does the patch really help? Wouldn't it be better to go the whole way and change the code to use objects instead of arrays for the individual results? In that case, we should probably do this together with #1089758: Make use of new flexibility for Views field handlers (which I hope to get to some time soon).

becw’s picture

I've created a new issue for the "read-only" patch, as requested: #1138992: Read-only indexes, and I'll see what I can do with #1089758: Make use of new flexibility for Views field handlers.

drunken monkey’s picture

Component: Miscellaneous » Framework
Issue tags: +API change
drunken monkey’s picture

Title: Integrating Non-Drupal Data » Add support for indexing non-entities
FileSize
112.67 KB

I can't really believe it, but this now really seems to work. The tests fail exactly as before, and when clicking through the site nothing exploded in my face … Seems ready for some other eyes!
(NB: I haven't updated the docs (README.txt, .api.php) yet, as I wanted the API to be final before doing that, so you'll have to make do with just my explanations here – or ask, if you'd like to help and something is unclear. But no worries, I'll definitely update the documentation before committing this.)

Initial remark: If you are not a developer, the following explanations probably won't interest you much (or make much sense to you), but you can still help with testing! Just apply the patch, run update.php and then do as much stuff with the Search API as possible. Afterwards, report back anything that explodes (or shows notices, …) and previously didn't. Just keep in mind that, of course, other projects aren't updated yet (although I'll create issues for some of mine in a few minutes). So if, e.g., the Ranges module doesn't work anymore, just disable it for the tests.

OK, and now for the actual explanations …

So, the attached patch adds support for indexing any kind of data, not just entities. This is done by implementing hook_search_api_item_type_info(). The code for dealing with the new types has to be implemented in a "data source controller", which encapsulates everything that was previously hard-coded in that regard. For example, it's now also quite easily possible to index stuff that has non-integer IDs. (Search backends will have to beware of that, of course.)
The wrappers provided by the Entity API are still used for getting metadata and item data – therefore, data source controllers also have to implement a method that returns such a wrapper suitable for the specific item type. The Entity API handles this (wrappers for non-entities) quite well, almost without any disadvantages compared to entities (that I'm aware of), so I figured it would be an awful idea to implement something of my own for the same task.
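Registering a new item type might look roughly like this. All names are hypothetical; see search_api_search_api_item_type_info() for the keys actually used.

```php
/**
 * Implements hook_search_api_item_type_info().
 *
 * Sketch: registers a custom, non-entity item type.
 */
function mymodule_search_api_item_type_info() {
  return array(
    'mymodule_record' => array(
      'name' => t('External record'),
      // A class implementing the data source controller interface
      // from includes/datasource.inc (hypothetical name).
      'datasource controller' => 'MyModuleRecordDataSourceController',
    ),
  );
}
```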

As said, hook_search_api_item_type_info() isn't documented yet, so see search_api_search_api_item_type_info() to get an idea of how it works. There actually aren't any other possible settings there, although types can of course add any number of data source-specific settings.
The methods a data source controller has to implement are documented in the interface in includes/datasource.inc; the controller the Search API provides for all entities (with property information), in includes/datasource_entity.inc, can serve as an example.
And, as mentioned minimalistically in .api.php, the module providing the types is responsible for calling the search_api_track_item_*() functions at the appropriate times (since the Search API can of course not know when this would be the case). Note, however, that it's also possible to not use this mechanism at all and have the items indexed some other way (e.g., as the OP required, not by Drupal at all – although there is now also the "read only" index option for that).
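Calling the tracking functions from the owning module's own save path might look like this (a sketch; mymodule_record_save() and its storage layer are hypothetical, the item type matches the one registered via the hook):

```php
function mymodule_record_save($record) {
  $is_new = empty($record->id);
  mymodule_record_write($record); // Hypothetical storage layer.
  // Tell the Search API that an item was added or changed, so all
  // indexes for this item type can schedule it for (re-)indexing.
  if ($is_new) {
    search_api_track_item_insert('mymodule_record', array($record->id));
  }
  else {
    search_api_track_item_change('mymodule_record', array($record->id));
  }
}
```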

OK, so much for people wanting to add their own item type. It would of course be great if someone did this right away, so we can see whether it really works, but I guess the chances of that are pretty slim … Otherwise, please ask away if there are problems.
And without adding new types, as said: please just test whether the existing stuff still works with the patch.

But of course, you'll also want to know the API changes contained in here. I.e., what things you'll have to change in your existing modules that build on the Search API.

  • You can't assume that an index indexes entities anymore.
    This is of course by design and should be clear. To ensure that this is followed by all modules, I renamed the index property "entity_type" to "item_type", so a) it doesn't convey the wrong impression and b) everyone using this will have to change this everywhere (and then also see whether it is used in an entity-specific way). (In other instances (when semantics didn't really change), I've kept the name "entity", even though it is "wrong", to keep necessary changes at a minimum.)
    For retrieving the entity metadata wrapper for an index, now always use $index->entityWrapper(), which calls the data source controller's corresponding method. There's now also a flag to specify that the properties in the wrapper shouldn't be altered according to the index's data alteration. (Previously, you had to use entity_metadata_wrapper() directly for that.)
    For loading items for an index, use $index->loadItems(). This will, again, just call the corresponding method on the data source.
  • Item IDs don't have to be integers anymore
    This will be relevant to most service classes, I guess. (At least for my two it was.) Use SearchApiDataSourceControllerInterface::getIdFieldInfo() to retrieve the type of the ID field for a certain index. The data source controller for the item type of an index can be obtained with $index->datasource().
  • Changed/Removed functions:
    • search_api_mark_dirty() was renamed to search_api_track_item_change() to fit into the search_api_track_item_*() scheme.
    • search_api_set_items_indexed() was renamed to search_api_track_item_indexed() for the same reason.
    • The signatures of search_api_index_specific_items(), search_api_index_status() and _search_api_index_reindex() have changed.
    • search_api_list_servers() and search_api_list_indexes() are finally gone, after being marked "deprecated" for the better part of the decade. (OK, for the whole decade so far, which isn't really that hard, but still …)
  • There are also search_api_get_item_type_info() and search_api_get_datasource_controller() now, to help deal with the framework additions.
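To illustrate the service class changes from the list above, a sketch (assuming $index is a SearchApiIndex and $item_ids holds the IDs from a result set; the exact keys of the info array are per the interface docs):

```php
// Item IDs are no longer guaranteed to be integers: ask the index's
// data source controller what the ID field looks like before using them.
$id_info = $index->datasource()->getIdFieldInfo();
if ($id_info['type'] != 'integer') {
  // Treat IDs as strings, e.g. quote/escape them for the backend
  // instead of casting to int.
}

// Load the result items through the index instead of entity_load().
$items = $index->loadItems($item_ids);
```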

Huh, this doesn't look like all that much … I've probably forgotten several things, will review this properly again when it's not after midnight.

A rule of thumb for most corrections: Just search for "entity" (or "entit", to also catch the plural and correct comments and variable names) and check everything that comes up.

drunken monkey’s picture

Status: Active » Needs review
Issue tags: +gsoc, +gsoc2011, +gsoc2011-drunken_monkey

And here are two issues for my related modules:
- #1219310: Adapt to upcoming API change for the Solr backend
- #1219314: Adapt to upcoming API change for the Saved searches module

The Autocompletion module doesn't seem to need corrections.

drunken monkey’s picture

This issue still has too few tags!

drunken monkey’s picture

OK, the documentation is added. I also added a base class for data source controllers that represent external data, which might be useful for the original use case of this issue. You'll then only have to specify the available fields and write a fitting service class, and you should be done. (If the data lies in the right form on a Solr server, you could even get away without the latter step.)

So, last chance for any objections before I commit this tomorrow, or so. One week of lying here ought to be long enough …

drunken monkey’s picture

Status: Needs review » Fixed

Committed.

drunken monkey’s picture

Status: Fixed » Needs review
FileSize
5.29 KB

Forgot to update some code, especially hook_enable() and hook_disable().

das-peter’s picture

Looks like there are still two locations where the necessary replacement didn't take place:

  • SearchApiAlterBundleFilter::alterItems()
  • SearchApiAlterBundleFilter::configurationForm()

Attached patch fixes that and also replaces the artefacts in the example code in search_api.api.php for the sake of consistency.
Maybe I'll be able to review the patch in #32 - but first I have to boot myself into search_api ;)

drunken monkey’s picture

Thanks, good catch! I seem to just have added the supportsIndex() method there, without remembering that the property name changed, too.

Damien Tournoud’s picture

Status: Needs review » Needs work

It seems that #33 is in, but #32 is not.

#32 has a small issue: in search_api_enable(), we should not start tracking entities attached to a non-enabled index.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
5.53 KB

Patch attached with the small change mentioned in #35.

drunken monkey’s picture

Status: Needs review » Fixed

You're right, thanks for reviewing and spotting this!
Committed.

Status: Fixed » Closed (fixed)
Issue tags: -gsoc, -API change, -gsoc2011, -gsoc2011-drunken_monkey, -D7 stable release blocker

Automatically closed -- issue fixed for 2 weeks with no activity.

raajkumar.kuru’s picture

Issue summary: View changes

Hi, I want to know how to select indexed fields on Solr using a Search API query.

lquessenberry’s picture

I have been reading in circles, and I wanted to stop and ask whether this is the way to index a table that is not Drupal-based into an index that Search API can use. Is this part of the Search API module now? If it is, where are the configuration options for it, or am I just missing the whole point?

joshua.boltz’s picture

I'd be interested to hear whether anyone was able to use the information provided here to index non-entity content, such as Panels pages.