Searching over multiple (heterogeneous) indexes [#296198]

While implementing the attachment indexing mechanism, we (febbraro, robertDouglass and I) stumbled across a problem: how to store the attachment text?
It would be easy to append it to the "text" field, add a new multi-valued field, or both. But then it would be impossible to tell at search time where the term occurred, which unfortunately is a requirement: the attachments themselves should appear in the search results, not just links to the nodes containing them.

To achieve this, we have to store attachments as separate documents, but adding them to the same index as the nodes wouldn't be very clean. So we'd have to create a second index for attachments and then always search both indexes (or give the user the option to search only one of them). This is supposed to be possible, but none of us knows how.

Anyone here got an idea? Or a suggestion for an entirely different route, even?

Comments

Comment #1

JacobSingh commented 17 August 2008 at 09:38

I don't believe this is possible. Solr 1.3 does provide sharding, etc., but only for indexes that share the same schema. Perhaps a better way to implement this (I realize it is an overhaul of the implementation) is to change our primary key from an nid to a URL, so node/123 would be the new primary key, and file/myfile.doc could be another.
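A minimal sketch of the path-as-key idea, written in Python for brevity rather than the module's PHP. The helper name `doc_id`, the set of entity types, and the document fields are hypothetical; only the `node/123` and `file/myfile.doc` key shapes come from the suggestion above.

```python
def doc_id(entity_type, identifier):
    """Build a path-style unique key such as 'node/123' (hypothetical helper)."""
    if entity_type not in ("node", "file"):
        raise ValueError(f"unsupported entity type: {entity_type!r}")
    return f"{entity_type}/{identifier}"


# Nodes and files share one key space without colliding.
# The 'nid' on the file document is an assumption: it keeps a link to the parent node.
node_doc = {"id": doc_id("node", 123), "nid": 123, "title": "Example node"}
file_doc = {"id": doc_id("file", "myfile.doc"), "nid": 123, "title": "myfile.doc"}
```

Because the key is an opaque string, other page types could get their own prefix later without touching existing keys.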

What do you guys think of this? I don't think there are major performance concerns, as Solr's indexes have no trouble with string-based IDs. I know that it would require a total re-index for everyone using it in production, though.

The major advantage is that panels and other pages could then expose their data to Solr and be searchable.

Best,
Jacob

Comment #2

robertdouglass commented 17 August 2008 at 21:58

That was similar to febbraro's suggestion. I'll chew on it.

Comment #3

febbraro commented 17 August 2008 at 23:08

I was thinking about it this weekend too. With the url field, a multi-valued attachment field, and the nid field kept in place, I can also make it easily configurable whether attachment indexing adds to an existing node doc or creates a new document that references the attachment directly.
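As a rough illustration of that configurable behaviour, here is a Python sketch (not the module's actual PHP). The function name `build_docs`, the `separate_attachments` flag, and the shape of the `node` dictionary are all assumptions made for the example.

```python
from typing import Dict, List


def build_docs(node: Dict, separate_attachments: bool) -> List[Dict]:
    """Return the Solr-style documents for one node (illustrative only)."""
    node_doc = {"id": f"node/{node['nid']}", "nid": node["nid"], "text": node["body"]}
    if not separate_attachments:
        # Mode 1: fold attachment text into the node document
        # as a multi-valued field.
        node_doc["attachments"] = [f["text"] for f in node["files"]]
        return [node_doc]
    # Mode 2: one extra document per attachment, still carrying the nid
    # so a result can be traced back to its node.
    file_docs = [
        {"id": f"file/{f['filename']}", "nid": node["nid"], "text": f["text"]}
        for f in node["files"]
    ]
    return [node_doc] + file_docs


# Hypothetical node with a single attachment.
node = {
    "nid": 123,
    "body": "Quarterly update",
    "files": [{"filename": "report.pdf", "text": "budget figures"}],
}
merged = build_docs(node, separate_attachments=False)
split = build_docs(node, separate_attachments=True)
```

In the first mode a match can only point at the node; in the second, each attachment is a search result in its own right.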

Thomas, thanks for getting this issue created and so accurately depicting the scenario.

Comment #4

JacobSingh commented 17 August 2008 at 23:58

I don't know about the performance implications in Solr, but I think that for enterprises it will be important to make the schema less Drupal-specific. That is, Solr provides dynamic fields (which we use for CCK); perhaps we should namespace all fields, though. I can easily imagine federated search requirements that involve searching other systems: CRMs, MIS systems, knowledge bases, etc. In that case, I think using some type of URI as the ID is the only way to go. It need not be an actual URL to a document or file, but it should contain some type of identifier so it can remain a unique field for updates without being so node-centric...

Does this sound crazy? Am I so worried about the 10% that I lose the 90%? I'm not sure, but I don't like being so married to nodes, as I think many potential users of Apache Solr are going to be larger organizations that will need more flexible indexes.

Comment #5

robertdouglass commented 18 August 2008 at 13:31

#4 @JacobSingh: I'm not comfortable with full URIs for all IDs. I'd need benchmarks before even considering that.

I see the current index as a node-specific index. I don't want to suddenly veer from that approach. It is easy enough to make new indexes: copy the apachesolr_search module and add things there (the issue of addressing different indexes in Solr remains, since it is currently hardcoded to 'solr').

The problem I see with the suggestions so far is that they're trying to do an uploaded-file-centric search on a node-centric index. Making the index "generic" is not the solution. This will just lose the efficiency that we have for node searching.

We have to decide here which problem we're trying to solve: do we want to do a keyword search for a word in a file and find the node? Or do we want to find the file? I suppose the answer is that we want to find the node but know which file it is in. The reason this differs from, say, comments is that comments are displayed on the screen, whereas the text of files is buried in the file. With comments it's perfectly okay to search for a word in a comment and find the node.

What I'm looking for as a resolution to this issue for the current module is a way to search for a word in a file, and find the node that it is attached to. This can be solved by appending the text from the files to the node, in a multi-valued "attachments" field.

The next issue is to figure out how people are supposed to know what file the text is in. Does the order of the file attachments array on the $node object, and the order of the attachments field in the $response search result always correspond? If so we can identify the attached file by the position of the matching text in the search result array that comes back. If not we need a different solution.
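The positional lookup described above could look roughly like this Python sketch. It rests entirely on the open assumption that the node's file list and the returned attachments values are in the same order; the function name and arguments are hypothetical, and the plain substring test only stands in for whatever matching the search backend really does.

```python
def files_for_matches(node_files, attachment_values, term):
    """Map matching attachment text back to filenames by position.

    Assumes node_files[i] produced attachment_values[i]; if that
    ordering is not guaranteed, this approach does not work.
    """
    if len(node_files) != len(attachment_values):
        raise ValueError("attachment order cannot be trusted: lengths differ")
    term = term.lower()
    return [
        filename
        for filename, text in zip(node_files, attachment_values)
        if term in text.lower()
    ]


# Hypothetical data: two files attached to one node.
matches = files_for_matches(
    ["a.pdf", "b.doc"],
    ["budget report", "meeting notes"],
    "Notes",
)
```

The length check catches only the most obvious mismatch; a reordered list of equal length would still give wrong answers silently, which is why the ordering question matters.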

Why don't I want a solution like febbraro's site, where you click from the search result straight through to the file? I think for most cases, directly displaying the file (or downloading it) is missing the context of where that file lives in the Drupal site. If I show you a node, you can see the context and discussion going on with the file (and you can open or download the file). If I send you straight to the file, you have no navigation, no context, nothing. How will you get back to that file? Search for it again? You won't know whom it was uploaded by, how it was classified with taxonomy, comments on the node, nothing.

It's possible (obviously) that some sites will want this differently. These sites should build a new index that either indexes file text all by itself or finds a comfortable lowest-common-denominator schema between files and nodes. Later we might be able to offer such a search as well.

Comment #6

febbraro commented 18 August 2008 at 13:37

That was a thought of mine too: Drupal (and nodes specifically) is just one piece of the search equation, and as a search solution we should be able to index more than just nodes. However, since I was not involved in the initial implementation or the thought that went into it, I did not want to assume that it was an easy swap-out. For that reason I would defer to those who know the actual implementation better. As for URLs as the identifier, that idea is much better than what I was doing (nid--fid), and I would retrofit my module to that scheme should we all agree on that implementation.

Comment #7

robertdouglass commented 18 August 2008 at 13:41

febbraro: I agree that the Drupal path makes a good identifier. I think a URI would likely be a mistake. Somewhere we need a roadmap towards searching across multiple sites, though, since there we'll need a namespace (such as $base_url).

Comment #8

febbraro commented 18 August 2008 at 13:45

robertDouglass: Right, not the full http://.... but path/to/whatever.
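One way the path identifier could later be namespaced per site is sketched below in Python. Hashing the base URL into a short prefix is an assumption made for this example, not something proposed in the thread, and `site_doc_id` is a hypothetical name.

```python
import hashlib


def site_doc_id(base_url, path):
    """Prefix a Drupal path with a short, stable hash of the site's base URL."""
    # Assumption: a 12-character SHA-1 prefix is enough to tell sites apart.
    site_hash = hashlib.sha1(base_url.encode("utf-8")).hexdigest()[:12]
    return f"{site_hash}/{path.lstrip('/')}"


# The same path on two sites yields two distinct keys.
id_a = site_doc_id("http://example.com", "node/123")
id_b = site_doc_id("http://example.org", "/node/123")
```

The path part stays readable, so single-site code can still recover `node/123` by dropping the prefix.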

Comment #9

JacobSingh commented 18 August 2008 at 21:34

Hey Robert,

I agree you can create more indexes, but this will never address the federated search issue...

Do you think there is a problem with using some type of uuid or uri because of performance reasons only? Or are there other issues at play? I think there is a huge benefit in opening up the schema as long as there are no performance implications.

To that end, I don't think there would be. The unique key is not sorted on, and my guess is that because it is never searched on, its format would have no effect on performance. Again, a guess, but I'm willing to put my time where my mouth is and benchmark it / ask on solr-user if you are in favor of a non-node-specific index.

Btw, separate topic, but I agree that the content found in files should by default link to an anchor where the files are on the node page. For files not in nodes... I'd like to see something be possible, because I have a lot of clients with folders full of PDFs who don't want to attach them to nodes.

Best,
J

Comment #10

robertdouglass commented 19 August 2008 at 21:18

We're all more or less in agreement.

- I only oppose changing the IDs if the goal is also "genericizing" the search so that it is no longer node-centric. I want to stay node-centric. I'm also interested in benchmarks if we start using 64-bit UUIDs or full URIs.
- A generic file search that includes the whole files directory plus any other directories would be super useful... as a separate module, probably with a separate schema.xml

Anybody have other suggestions on how to link to a node page but anchor to the file, or at least indicate that the searched text exists in the file?

Comment #11

febbraro commented 20 August 2008 at 14:06

robertDouglass: Just curious, why is your goal to keep the index purely node-centric? Knowing the reasons behind that might make potential solutions more readily apparent.

WRT indexing the files directories and such, I can (eventually) make a more general file indexing module for other uploads.

As far as attachments go, though, I think it should be configurable whether you link to the node with the file attachment or directly to the attachment document itself. At least for the current crop of requirements I'm getting, my clients are in both camps.

Comment #12

robertdouglass commented 5 September 2008 at 11:52

Configurable might be good. Two links in the search results might be better. I agree that the current implementation is less than satisfactory due to the user experience.
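The two-links idea could be sketched like this in Python. The `attachments` anchor name, the function name, and the example file path are hypothetical; the real markup and file URLs would depend on the theme and the site's file settings.

```python
from urllib.parse import quote


def result_links(nid, filepath, anchor="attachments"):
    """Return both targets for one attachment hit (illustrative only)."""
    return {
        # Link to the node, jumping to where its attachments are listed.
        "node": f"node/{nid}#{anchor}",
        # Direct link to the file; quote() escapes spaces and similar characters.
        "file": quote(filepath),
    }


links = result_links(123, "files/my report.pdf")
```

Showing both lets the user choose between the context of the node and the file itself, without a site-wide setting.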

Comment #13

JacobSingh commented 23 June 2009 at 03:48

Status: Active » Closed (won't fix)

I think this issue is badly named and out of date. Unless someone wants to re-open, I'm shutting this puppy down :)
