Index into entity field [#348668]

Comment	File	Size	Author
#38	entity-type-348668-38.patch	2.16 KB	pwolanin
#27	entity-type-348668-27.patch	3.64 KB	pwolanin
#21	entity-type-348668-21.patch	3.52 KB	pwolanin
#2	apachesolr_users.zip	2.81 KB	Scott Reynolds

Comment #1

pwolanin commented 18 December 2008 at 15:41

Well, at the least you'll need to add a small module to index/search users.

I talked with Robert about this a little. Your approach will depend on whether you want users to be mixed in with nodes in your search, or whether you want a totally separate search.

Comment #2

Scott Reynolds commented 23 February 2009 at 23:43

Status	File	Size
new	apachesolr_users.zip	2.81 KB

Here is a module that is in use on community.mylifetime.com. It provides a new type, 'user' for facet search. See: http://community.mylifetime.com/community/search/apachesolr_search/scott...

A couple of points to answer anticipated questions:
1.) Why write a separate _index_alter()?
Frankly, indexing nodes vs users is very different. Different fields may need to be added to the index. If they both used the same function, most implementations would contain an if ($type == 'node') ... elseif ($type == 'user'), which defeats the purpose.

2.) What is this user_module_invoke?
My team and I have for a while used this to 'build' an indexable profile. It is not a standard op for hook_user, but it has made it easy for us to roll out sites and 'build' profiles without changing any of our platform code. And yes, the module works even if no other module does anything on $op == 'index'.

3.) What about changes to the solrconfig?
Yes, there are changes to the solrconfig. We run a pretty heavily modified solrconfig file, though, so providing a patch would be a pain. So here is the major change:

<requestHandler name="drupal" class="solr.SearchHandler" default="true">
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">
        body^1.0 title^5.0 name^3.0 mail^3.0 taxonomy_names^2.0 tags_h1^5.0 tags_h2_h3^3.0 tags_h4_h5_h6^2.0 tags_inline^1.0
     </str>
     <str name="pf">
        body^2.0
     </str>
     <int name="ps">15</int>
     <str name="mm">
        2&lt;-35%
     </str>
     <str name="q.alt">*:*</str>

That is, the new field 'mail' is added to qf.

4.) Schema changes?
Yes the mail field needs to be added.

   <field name="mail" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>

5.) Is the user facet exposed in the facet blocks?
No, it isn't. I'm looking for guidance on that, because the facet block is generated by the apachesolr_search module, which only deals with nodes, and there is no way to 'facet_alter'.

We use a custom search block on the site to do the facet filtering.

I think that addresses them all.

Please note that currently, we are not up to speed with the latest dev/beta versions of this module. It is mostly because we are not ready to switch to Solr 1.4 yet. So the configs might need to be changed.

Comment #3

Scott Reynolds commented 26 February 2009 at 03:48

Status:

Active

» Needs review

I realized that no one could install it without our private user_profile module. That dependency no longer exists (it did in a previous iteration), so you may edit the .info file and remove this line:
dependencies[] = user_profile

Code needs a review. I know it isn't perfect; I'm willing to make it right, but I'm asking for guidance.

Comment #4

JacobSingh commented 26 February 2009 at 04:06

Hi Scott,

I haven't looked at this, but I imagine that it would need a lot of work because the APIs have changed significantly as we moved to Beta.
Can you try making a patch and rolling it in?

I'll review it anyway, because it is a killer feature, but I'll probably not get to it very soon if it is a zip file against an old code base.

Thanks!

-Jacob

Comment #5

pwolanin commented 26 February 2009 at 04:12

@Scott - I think there are (at least) two possible approaches: index nodes and users in the same index, or create a totally separate schema for users and expand the apachesolr framework module to more easily handle indexing multiple types of content into different indexes.

Frankly, the choice between these two for a particular site may also depend on whether they are using nodes as user profiles, vs. the core profile module or another solution. However, to me it seems that at least thinking about the latter would be useful.

For inclusion in this module, we should probably attempt a relatively bare-bones search that might just look at core profile module. It might not even meet your needs, but would be more general.

It's not clear to me that to index users into the same index you'd even need to change the schema, since the mail field could go into a dynamic field.

Comment #6

Scott Reynolds commented 26 February 2009 at 19:30

In regards to

It's not clear to me that to index users into the same index you'd even need to change the schema, since the mail field could go into a dynamic field.

The reason is that I wanted to use it in the dismax equation to add weight to that field.

My thought on node vs users is simple. If you're using node profile/bio or whatever else, then you don't need this. And I really like that they are in the same index. The facet searching that provides is cool and I think adds value. And through the use of dynamic fields, I'm not sure there is a technical need to build out a different schema and index.

I am willing to build it out for core profile. But I should mention that this works without any profile data. Meaning that if all you have is the basic Drupal install with apachesolr turned on, it's fine. It makes the user's name and mail searchable (which is in line with Drupal core's user_search() implementation).

In regards to the API change, this module is 123 lines of code with comments. It is pretty tiny. The only place that might be affected, without my having looked at the changes in Beta, would be:

/**
 * takes a set of documents and puts them to Solr
 *
 * @param $documents
 * array of documents to index on Solr
 */
function apachesolr_users_index_documents($documents) {
  try {
    $solr = apachesolr_get_solr();
    if (!$solr->ping()) {
      throw new Exception(t('No Solr instance available during indexing'));
    }
    
    // here we have solr ready to go
    $docs_sub_set = array_chunk($documents, 20);
    foreach ($docs_sub_set as $docs) {
      $solr->addDocuments($docs);
    }
    $solr->commit();
    $solr->optimize(FALSE, FALSE);
    
    // save the variable so it could be used later
  }
  catch (Exception $e) {
    watchdog('Apache Solr', $e->getMessage(), NULL, WATCHDOG_ERROR);
    throw new Exception(t('Failed to Index'));
  }
}

That is really the only place in this code it uses the Apachesolr PHP Client libraries. Everything else is self-contained. So I encourage you to take a look.

Comment #7

baumanis commented 20 April 2009 at 20:54

I've put the apachesolr_users module into Drupal 6, but it did not work on my Drupal 6 site yet. The new table apachesolr_users_queue gets updated, but the function apachesolr_users_index_documents does not update Solr.

Since I am not yet up the learning curve of Drupal module coding, I have done this the old bandaid way until the author of this module (Scott Reynolds?) decides to work on it (which will be lovely and elegant). Basically, right now I am running a script through cron that collects the uid and name from my user table and creates an XML file out of them. Then the cron job uses post.jar to update Solr with the new XML file. This way user info shows up with all the rest of the search results and I don't have to use the coresearches modules with their extra tabs.
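
A minimal sketch of that kind of stopgap, assuming the standard Solr XML update format (add/doc/field). The function name, the field names (id, name), the "user/" id prefix, and the sample rows are all hypothetical and are not taken from the script described above.

```php
<?php
/**
 * Builds a Solr XML update message from user rows.
 * Hypothetical helper for illustration; field names are assumptions.
 */
function example_users_to_solr_xml(array $users) {
  $xml = "<add>\n";
  foreach ($users as $user) {
    $xml .= "  <doc>\n";
    $xml .= '    <field name="id">user/' . (int) $user['uid'] . "</field>\n";
    // Escape the name so characters such as & or < keep the XML well-formed.
    $xml .= '    <field name="name">' . htmlspecialchars($user['name'], ENT_QUOTES, 'UTF-8') . "</field>\n";
    $xml .= "  </doc>\n";
  }
  return $xml . "</add>\n";
}

// Sample rows standing in for a query against the users table.
$users = array(
  array('uid' => 1, 'name' => 'admin'),
  array('uid' => 2, 'name' => 'Tom & Jerry'),
);
$xml = example_users_to_solr_xml($users);
```

The resulting string could be written to a file and sent to Solr with post.jar, as described above.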

Comment #8

Scott Reynolds commented 20 April 2009 at 20:59

Doh, severe bug in there:

$users = db_query_range("SELECT uid FROM {apachesolr_users_queue} ORDER BY modified ASC", $last_checked, 0, 1000);

should be

$users = db_query_range("SELECT uid FROM {apachesolr_users_queue} ORDER BY modified ASC", 0, 1000);

Comment #9

robertdouglass commented 21 April 2009 at 10:41

No matter which approach we take I'd like to be able to guarantee that you don't have to go to a new tab to search for users. I'd like to use a unified search and get both users and nodes in the results.

Comment #10

pwolanin commented 21 April 2009 at 15:25

@Robert - really?

For that use case now, just use nodes-as-profiles. In the longer term, I think we would want to have a federated search. I think it's very poor to mix users and content in the same result set. E.g., http://www.princeton.edu/main/tools/search/?q=stock has two panes of search results.

Comment #11

baumanis commented 23 April 2009 at 18:48

My users want content and user info in one search result set. I guess it depends on each drupal website's audience what kind of search result set to provide. So, it would be nice to have a choice of separating these or putting them together.

Comment #12

Scott Reynolds commented 25 April 2009 at 23:34

I think we need to separate out here indexing vs display. I do happen to think that representing users on the same level as content types is advantageous. I do think though, mixing users and nodes together can sometimes be a mess. So I think that providing a default facet of just 'nodes' is a good thing, and then can be turned off so the admin can say "mix nodes and users". So I think that in terms of the Apachesolr index, type:user should be there.

And I think this speaks to providing a single function:

function apachesolr_get_params($params = array()) {
  $default_params = array(
    // All the variable_get() calls go here,
    // including variable_get('mix_users_in?') to set the default fq types.
  );

  $final_params = $params + $default_params;

  return $final_params;
}

That's how I would try to design it. And I believe that if $params['fq'][$key] = 'type:page' were there, it wouldn't get overridden in this function, thereby allowing us to have a default behavior of just filtering to node types.

Hope that makes sense, and I think that could work.
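
To illustrate how the array union in that function behaves, here is a minimal standalone sketch. The function name example_get_params() and the default values are hypothetical placeholders, not the module's real settings. Note that the + operator only compares top-level keys, so a caller-supplied 'fq' replaces the default 'fq' as a whole rather than being merged with it.

```php
<?php
/**
 * Hypothetical sketch of the proposed function; the default values are
 * placeholders, not the module's real settings.
 */
function example_get_params(array $params = array()) {
  $default_params = array(
    'rows' => 10,
    'fq' => array('type:page'),
  );
  // Array union keeps the left-hand value for any key present in both.
  return $params + $default_params;
}

$custom = example_get_params(array('fq' => array('type:user')));
// $custom['fq'] is array('type:user'); $custom['rows'] is 10.
$defaults = example_get_params();
// $defaults['fq'] is array('type:page').
```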

Comment #13

rapsli commented 16 September 2009 at 08:07

Any chance this gets into the module?

Comment #14

deltab commented 3 November 2009 at 12:06

Subscribing...

Comment #15

deltab commented 7 November 2009 at 08:25

+1

Comment #16

pwolanin commented 7 November 2009 at 15:12

If we want to go this route: the "type" for users needs to be a string that could never be a valid node type - e.g. 'user' is not acceptable, since I can create a node type using that string.

This is actually a more general problem also - for file attachments we are adding an extra Boolean, but that's not very scalable.

Perhaps generally something like "object/user" and "object/file" or "object-user" or "a_s-user" or "a_s@user" or "@user"?
('a_s' == apachesolr_search)

Comment #17

Scott Reynolds commented 9 November 2009 at 17:06

Version:

6.x-1.x-dev

» 6.x-2.x-dev

What if it were a new facet, 'entity_type', instead of fancy strings? So "entity_type='node'" or "entity_type='user'" or "entity_type='comment'".

Of course this hits on the comment searching of 2.x, but given the recent Drupal nomenclature, I think this will translate well and would really provide more power in a clean way.

If we want to go the fancier strings route, can we leverage apachesolr_document_id() and facet.prefix?

Comment #18

pwolanin commented 11 November 2009 at 03:35

Well, it would be nice to figure out some usable approach for 6.x-1.0

Comment #19

pwolanin commented 11 November 2009 at 14:30

We already roughly do this with the ID field:

function apachesolr_document_id($id, $type = 'node') {
  return apachesolr_site_hash() . "/$type/" . $id;
}

where the 'type' param there is the same as what you suggest as entity_type. I imagine, however, that doing a wildcard match on the ID field will perform much worse than if we have this as a separate string field.
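
A rough sketch of the difference, runnable outside Drupal. The example_* names are hypothetical, the site hash is a fixed stand-in for apachesolr_site_hash(), and the document is shown as a plain array rather than a real Solr document object.

```php
<?php
// Hypothetical stand-in for apachesolr_site_hash(), fixed so the
// example runs outside Drupal.
function example_site_hash() {
  return 'abc123';
}

// Same composition as apachesolr_document_id() above.
function example_document_id($id, $type = 'node') {
  return example_site_hash() . "/$type/" . $id;
}

// Hypothetical document array that also stores the type in its own field.
function example_build_document($id, $type = 'node') {
  return array(
    'id' => example_document_id($id, $type),
    'entity_type' => $type,
  );
}

$doc = example_build_document(42, 'user');
// $doc['id'] is 'abc123/user/42'; $doc['entity_type'] is 'user'.
```

With the type in a separate string field, a filter can be an exact match such as fq=entity_type:user instead of a wildcard pattern over the ID.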

Comment #20

Scott Reynolds commented 11 November 2009 at 16:19

Status:

Needs review

» Needs work

We already roughly do this with the ID field:

Yeah, that was my final comment in #17. I spent some time on it early this morning and couldn't come up with a better solution than wildcarding on the ID field. I don't think that is a good idea. Adding an 'entity_type' field to the schema seems like the way through to me.

Comment #21

pwolanin commented 14 November 2009 at 03:01

Version:	6.x-2.x-dev	» 6.x-1.x-dev
Status:	Needs work	» Needs review

Status	File	Size
new	entity-type-348668-21.patch	3.52 KB

Here's a possible schema change patch. This would more readily enable this user search as contrib.

Comment #22

robertdouglass commented 25 November 2009 at 13:55

Discussed with Peter. +1 for entity type.

Comment #23

anarchivist commented 25 November 2009 at 17:29

Entity type sounds good to me. Once this code gets committed, I recommend writing up docs that would act as implementation guidelines for non-node content.

Comment #24

pwolanin commented 25 November 2009 at 19:08

I think we should just call it "entity" in the schema - no reason to make it longer.

Comment #26

pwolanin commented 25 November 2009 at 20:20

Well, even if you want them separate, adding such a field makes it easy to filter the search results if all the data is in one index.

Comment #27

pwolanin commented 30 November 2009 at 00:07

Status	File	Size
new	entity-type-348668-27.patch	3.64 KB

like so.

Comment #28

robertdouglass commented 2 December 2009 at 20:01

Status:

Needs review

» Reviewed & tested by the community

I believe this change is positive.

Comment #29

Scott Reynolds commented 2 December 2009 at 21:00

Status:

Reviewed & tested by the community

» Needs work

Small change

<!-- enty_type is 'node', 'file', 'user', or some other Drupal object type -->

should be entity not enty_type.

Also, we probably need to set fq[]=entity:node on apachesolr_search queries. That way we have a clear separation.

Comment #30

robertdouglass commented 2 December 2009 at 21:40

Ah, yes, entity:node is a good catch. This change requires a re-index so it should be clear in both CHANGELOG and release notes.

Comment #31

pwolanin commented 3 December 2009 at 00:14

I think we can omit the fq by default - there is no need if you are only indexing nodes.

Comment #32

Scott Reynolds commented 3 December 2009 at 02:34

Well, that would mean that user search would then need to say "For all queries that aren't mine, fq[]=NOT entity:user".

I really think it should be the job of the 'entity' search module to add its own fq in there. Otherwise things get kind of silly: if you also have a comment search module, then it's fq[] = NOT entity:user, fq[] = NOT entity:comment.
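
A small sketch contrasting the two styles. The helper names are hypothetical and the filter strings follow the notation used in this comment; this is not code from the module.

```php
<?php
// Exclusion style: every search must list each entity type it does NOT
// want, so the list grows with every new entity search module.
function example_exclusion_filters(array $other_entities) {
  $fq = array();
  foreach ($other_entities as $entity) {
    $fq[] = 'NOT entity:' . $entity;
  }
  return $fq;
}

// Inclusion style: each search module names only the entity it wants.
function example_inclusion_filter($entity) {
  return array('entity:' . $entity);
}

$exclude = example_exclusion_filters(array('user', 'comment'));
// array('NOT entity:user', 'NOT entity:comment')
$include = example_inclusion_filter('node');
// array('entity:node')
```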

I take it you don't want to add the fq for performance reasons; I'm just not sure it's worth the performance gain.

And I know I have argued previously for showing nodes and users in the same result page. But having implemented a couple sites with both user and node searching, I think I was the only one to get that paradigm and probably the only one who thought it was cool.

Comment #33

pwolanin commented 3 December 2009 at 03:46

Well, mostly I don't want to force re-indexing on people who don't actually need it.

Comment #34

robertdouglass commented 3 December 2009 at 10:00

If we want to avoid re-indexing then we should be focusing this effort on 6.2. I believe Scott's argument is correct, so the decision mainly comes down to whether we should be making schema changes to 6.1. I've held that we shouldn't be, but I defer to Peter for the final decision on 6.1 releases.

Comment #35

greggles commented 15 December 2009 at 23:18

I think it's fine to force re-indexing. We should just have a warning message "if you enable this module you will have to reindex your site" on the download page or the project release page.

Is the code in a currently testable state?

Comment #36

robertdouglass commented 16 December 2009 at 11:20

Status:

Needs work

» Closed (won't fix)

#641954: Update schema.xml for Solr 1.4 changes to schema version contained the entity field, and Scott says he's going to make a standalone user search module, so this issue is closed. greggles, note that the only warning for reindexing we'll have, currently, is in the release notes. People's sites won't break, however, so tracking latest devel versions is safe enough.

Comment #37

pwolanin commented 17 December 2009 at 03:06

Title:	User search	» Index into entity field
Status:	Closed (won't fix)	» Needs work

Actually - user search per se is "won't fix" but my last patch needs to be applied in some form

Comment #38

pwolanin commented 17 December 2009 at 03:16

Status:

Needs work

» Needs review

Status	File	Size
new	entity-type-348668-38.patch	2.16 KB

Re-rolled with fewer schema changes.

Comment #39

pwolanin commented 17 December 2009 at 03:50

Version:	6.x-1.x-dev	» 6.x-2.x-dev
Status:	Needs review	» Patch (to be ported)

Committing this minimal patch to 6.x-1.x - need to come back to the language code soon.

Comment #40

robertdouglass commented 17 December 2009 at 12:48

Status:

Patch (to be ported)

» Fixed

#38 was committed to 6.2 and 5.2.

Comment #41

Scott Reynolds commented 17 December 2009 at 20:56

Version:	6.x-2.x-dev	» 6.x-1.x-dev
Status:	Fixed	» Needs work

Again, there is no fq[] = entity:node.

See #32 for the argument as to why this is a bad idea.

Comment #42

pwolanin commented 17 December 2009 at 21:08

Status:

Needs work

» Fixed

@Scott - I think it's a bad idea to add it by default. I want to always mix file and node results, for example.

Comment #43

Scott Reynolds commented 17 December 2009 at 21:18

Then how do I solve this?

Well, that would mean that user search would then need to say "For all queries that aren't mine, fq[]=NOT entity:user".

How do I prevent user documents from showing up in the file + node search? Am I going to have to do that? It seems incredibly brittle and prone to issues. I'm going to have to add a special case for Solr Views to not add the clause.

if ($caller != 'apachesolr_users' && $caller != 'apachesolr_views') {
  // Exclude 'user'.
}

That feels pretty dirty.

Comment #44

pwolanin commented 18 December 2009 at 01:34

What's the alternative - I have to know to remove or OR together any entity fq entry to search everything in the index?

Unfortunately, neither of these are ideal situations. At least for the 2.x branch, we can think about a hook to collect all entity types and apachesolr_search could limit to an admin-selected subset for example.

Comment #45

Scott Reynolds commented 18 December 2009 at 03:27

Status:

Fixed

» Needs work

Unfortunately, neither of these are ideal situations. At least for the 2.x branch, we can think about a hook to collect all entity types and apachesolr_search could limit to an admin-selected subset for example.

I think this supports my point. Either build out the 'files + node' properly or not at all. (side note: where is the issue for files + nodes?)

My real problem with this change is that it makes implementing the Apache Solr api harder on other module developers. If we keep this as is, then I ask that we add something to the documentation and draw a red box, a red arrow and blinking lights around it.

Hence, CDW, we need documentation for the module developer.

Frankly, I'm surprised you are pushing for this; you were the one who was against this in #10.

Comment #46

robertdouglass commented 18 December 2009 at 13:15

Adding additional cores, schemas etc. and collecting results from multiple search indexes is not in the works. Thus, if we want to support searches on different entities they have to share one index. We can keep working on the schema and the logic to support this better, but I'm fully in favor of indexing entity type. I don't find it an unreasonable requirement for vertical search implementations to have to add entity:foo as a filter to all searches.

Scott, it's not your responsibility as the user search module author to prevent users from showing up elsewhere, it's rather the responsibility of the other implementations to prevent them from showing up. Going forward it has to be assumed that all sorts of stuff can be in the index, and you have to ask for exactly what you want, or you risk getting stuff you didn't count on.

function hook_apachesolr_modify_query(&$query, &$params, $caller) {
  // Restrict this search to user documents only.
  $query->add_filter('entity', 'user');
}

Comment #47

pwolanin commented 18 December 2009 at 18:38

Status:

Needs work

» Fixed

@Robert - ok, so you are agreeing with Scott. My point is that we need some systematic way for modules to opt in or out of a particular query.

For 6.x-1.x the patch as committed does nothing except add more data into the index that can optionally be used. In that sense, it's a feature beyond what the module supported before. We certainly did not cause a regression. Hence "fixed".

@Scott - please open a new issue for discussing how to move forward so we can improve the API.

Comment #48

robertdouglass commented 19 December 2009 at 10:37

Peter, I think I partly agree with Scott. I see the main search module as the dumping grounds of everything that gets indexed, but I think he sees this as problematic, and maybe you do too. So yeah, if we don't want users to show up in main search, then we have to specify what entities main search searches on.

In the not too distant future we'll be doing this all with Views and having the entity index is definitely a good thing there.

Comment #49

2 January 2010 at 10:40

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.
