The new development release for apachesolr (7.x-1.x-dev) got overhauled big time. The module now supports indexing of custom built entities.

More information about the changes here : http://drupal.org/node/966796

Currently the indexing of attachments fails on line 160 of apachesolr_attachments.module.

$success = apachesolr_index_nodes($rows, 'apachesolr_attachments');

It would be nice to see this fixed before the latest version of apachesolr goes to alpha / beta.

Comments

pwolanin’s picture

We should probably look even deeper at the architecture, though the core file API is still a bit weak.

nick_vh’s picture

We should also decide if we want to index files as separate entities or if we want to stick to the file per node logic. I assume that a file entity can have it's own entity indexing system and that we index also the links between a file_id and a node.

This should be clearly defined before we start coding onto the new D7 solr API because there are a lot of possibilities.

pwolanin’s picture

The problem is that the file entity doesn't have back-references to the nodes or other entities that reference it.

A better approach we should take for 7 is possibly not to maintain a separate indexing tracker, but rather make multiple documents from each entity being indexed - i.e. an additional doc for each file attachment. However, the problem with that is the timeout protection, and the fact that attachment indexing can get rather broken. For that reason it might be safer, at least for nodes, to retain this scheme?

pwolanin’s picture

Discussed an alternative scheme with Nick of trying to extract and cache the attachment text before indexing the node. That way we could do the multiple documents without worrying about the timeout.

However, there would be a potential disconnect in terms of ordering that would be a pain to account for.

osopolar’s picture

I gave a glance on the apachesolr changes and saw that where substantial changes to the api, anyway is there a small workaround to just to get the code work again (without any improvements)? Thanks.

pwolanin’s picture

Looking at the code we replace namespace with the environment ID. I hadn't realized that would make it hard to have a second indexing loop like we did before.

I'm thinking of an approach using the queue API to extract and index files whose text is not already cached.

pwolanin’s picture

If we wanted to try to revive the hackish way, we'd need to modify the main module something like:

index b4e6983..f45f42e 100644
--- a/apachesolr.module
+++ b/apachesolr.module
@@ -740,36 +740,43 @@ function apachesolr_set_stats_message($text, $type = 'status', $repeat = FALSE)
   }
 }
 
+function _apachesolr_index_position_key($env_id, $namespace = NULL) {
+  $key = $namespace ? $env_id . ':' . $namespace : $env_id;
+}
+
 /**
  * Returns last changed and last ID for an environment and entity type.
  */
-function apachesolr_get_last_index_position($env_id, $entity_type) {
+function apachesolr_get_last_index_position($env_id, $entity_type, $namespace = NULL) {
   $stored = variable_get('apachesolr_index_last', array());
-  return isset($stored[$env_id][$entity_type]) ? $stored[$env_id][$entity_type] : array('last_changed' => 0, 'last_entity_id' => 0);
+  $key = _apachesolr_index_position_key($env_id, $namespace);
+  return isset($stored[$key][$entity_type]) ? $stored[$env_id][$entity_type] : array('last_changed' => 0, 'last_entity_id' => 0);
 }
 
 /**
  * Sets last changed and last ID for an environment and entity type.
  */
-function apachesolr_set_last_index_position($env_id, $entity_type, $last_changed, $last_entity_id) {
+function apachesolr_set_last_index_position($env_id, $entity_type, $last_changed, $last_entity_id, $namespace = NULL) {
   $stored = variable_get('apachesolr_index_last', array());
-  $stored[$env_id][$entity_type] = array('last_changed' => $last_changed, 'last_entity_id' => $last_entity_id);
+  $key = _apachesolr_index_position_key($env_id, $namespace);
+  $stored[$key][$entity_type] = array('last_changed' => $last_changed, 'last_entity_id' => $last_entity_id);
   variable_set('apachesolr_index_last', $stored);
 }
 
 /**
  * Clear a specific environment, or clear all.
  */
-function apachesolr_clear_last_index_position($env_id = NULL, $entity_type = NULL) {
+function apachesolr_clear_last_index_position($env_id = NULL, $entity_type = NULL, $namespace = NULL) {
   $stored = variable_get('apachesolr_index_last', array());
+  $key = _apachesolr_index_position_key($env_id, $namespace);
   if (empty($env_id)) {
     $stored = array();
   }
   elseif ($entity_type) {
-    unset($stored[$env_id][$entity_type]);
+    unset($stored[$key][$entity_type]);
   }
   else {
-    unset($stored[$env_id]);
+    unset($stored[$key]);
   }
   variable_set('apachesolr_index_last', $stored);
 }
nick_vh’s picture

Oh no! Namespaces!

This might be a solution for a backwards compatibility but the better way here is definitely not namespaces.

We could add another indexing callback to the node entity which handles the attachments/files part. And then you also hook into the entity_update and entity_insert to tell solr a certain node has to be resent.
It would reduce the code effort in the apachesolr_attachment module.

This + the approach pwolanin suggests of the queue api to do the extraction separately would offer us a stable module.

pwolanin’s picture

Ya, I'd rather avoid the namespace thing, though it's a little frustrating that our current model ties you to jsut one node indexing process.

zilla’s picture

any further thoughts on this as the solr main moves towards an RC status? http://drupal.org/node/1323650
right now it seems like people are downgrading to beta 13 (per some other thread that i followed) which feels like a patchwork approach

happy to test any update if you'd like, i'm running a dev site on acquia right now so just let me know (right now the search is all wonked out, and attachments ain't working at all in the indexing, think my downgrade broke things big time)

pwolanin’s picture

I've been thinking hard about this - in many ways having the second tracker of nodes was the most efficient - but really is insufficient now that we can index many types of entities.

I'm thinking now about using the indexing loop over file entities to just do the text extraction and combining use of the Drupal queue when the file contents are not yet extracted when an entity has an attached file.

pwolanin’s picture

Ok, thinking a lot about this - the quickest fix would be to just try to extract all needed files in the middle of the node indexing process.

zilla’s picture

i hear you - though it's such a critical module/feature that it seems like the quick fix is less important than the right fix...while i'd love to have it running tomorrow with latest solr integration (as would others i assume), i'd much rather have a solution that's going to work for the long haul (e.g. addressing multiple file attachments and so on, or looking at how attachments are/can be exposed as facets, or whatever else has come up)

at any rate, i'm more than happy to test a dev version against solr integration beta 16 (which is where it's currently breaking) in about a week or so - doing a demo next week so need to not break anything until at least the 7th ;)

i'm also doing this on acquia's servers, so feel free to email me directly via contact if you'd like me to do a test installation and run some additional test scenarios for you (or pull logs, etc) so that you can get some insight into how it will behave in their hosted search environment

pwolanin’s picture

Well, ok that won't quite work either.

Possibly need to populate a new/separate tracking table each time an entity with attached files is indexed, and then have a pointer into this separate tracking table.

This could also resolve the bug with files attached to more than one node.

drasgardian’s picture

Has there been any further progress on this issue? I'd be happy to assist with any testing, or even put some dev time towards it if pointed in the right direction.

pwolanin’s picture

Status: Active » Needs work
StatusFileSize
new5.76 KB

Here's a start on the schema changes.

I feel like we might need an extra hook or two invoked by apacehsolr so we can react to bundles being removed.

nick_vh’s picture

StatusFileSize
new27.92 KB

Taking a different approach in the following patch. We are making use of the new entity support in apachesolr 7.x-1.x-dev

For every file it looks up the nodes that are attached to it and indexes them accordingly. since it is a separate entity now we can choose to enable or disable indexing using the apachesolr admin UI. No longer is an extra indexing process necessary.

I'm sure I've left out some important things, but I invite you all to test this.

Once you enable the module you will notice that a new entity type was added : Files
Enable this file checkbox
Check if your tika settings are valid

Index away!

Image of the index result
Image of the configuration UI

Todo : Make the text extraction cache better again and clean up the admin UI + make sure I did not delete functionality unwillingly
@Peter, I started doing this before I saw your patch - We might want to join the patches when appropriate

nick_vh’s picture

StatusFileSize
new27.92 KB

Forgot to do git pull before I created the patch, nothing major but just to be certain

nick_vh’s picture

Title: Upgrade apachesolr_attachments to co-operate with latest beta of apachesolr » apachesolr_index_nodes() no longer exists in apachesolr 7.x-1.x-dev
StatusFileSize
new44.18 KB

Now with working admin UI and exclusion callbacks.

The one drawback that I've noticed :
When you have 2 nodes, referencing to the same node and the setting "Exclude files attached to a node excluded by Apache Solr Search" excludes the attachment completely, even though the other node might be active. This is not a bug for now, it's a feature ;-)

I also merged the database update patch from pwolanin as much as it made sense. The module now completely relies on the entity support in apachesolr. In theory, attachments that are added to entities other than node should also be supported. Bugs could appear though...

Have fun and please test!

nick_vh’s picture

Title: apachesolr_index_nodes() no longer exists in apachesolr 7.x-1.x-dev » Upgrade apachesolr_attachments to co-operate with latest beta of apachesolr
StatusFileSize
new44.97 KB

Just fixing a typo, in some cases it was still returning tika-0.3.jar as a variable

pwolanin’s picture

Title: apachesolr_index_nodes() no longer exists in apachesolr 7.x-1.x-dev » Upgrade apachesolr_attachments to co-operate with latest beta of apachesolr

note - I think we need to get this patch in to the main module:
#1515822: Support notion of parent entity for derived documents

That will alleviate the need to execute extra Solr delete queries entity deletions and similar.

nick_vh’s picture

Status: Needs work » Needs review

Marking as needs review because I'd like some feedback from anyone that wants to be involved :-)

pwolanin’s picture

Status: Needs review » Needs work

At least in D6 there is not easy way to know when an attachment is removed from a node, so short of tracking that (which is what the removed column was for) stale docs would be left in the index.

jmdeleon’s picture

Getting the following error when I am asked to update following applying patch #21 above:

Error: Unsupported operand types in /home/drupal-newer/includes/database/mysql/schema.inc, line 85

pwolanin’s picture

I don't understand why the big added function

+function apachesolr_attachments_excluded_nodes($entity_id, $entity_type) {

I think a better approach is to allow the attachment to be indexed as many times as it is attached each time with a different ID.

pwolanin’s picture

I think 'apachesolr_attachments_exclude_types' will go away and always be true - if the node is excluded, we will never actually buidl the document, right?

pwolanin’s picture

Another downside of using the main indexing loop is that if you enable this module you essentially need to re-index all content on order to get file attachments indexed? That seems like a bummer.

nick_vh’s picture

StatusFileSize
new44.74 KB

I think 'apachesolr_attachments_exclude_types' will go away and always be true - if the node is excluded, we will never actually buidl the document, right?

Removed it with the patch that is attached. Since we call status callback it won't index attachments for nodes or entity types that are excluded

Another downside of using the main indexing loop is that if you enable this module you essentially need to re-index all content on order to get file attachments indexed? That seems like a bummer.

Afaik that is not true. The file entity is enabled and only the file entities will be added to the tracking table. We do need entity information in order to get attachments indexed but it does not require us to re-index all?

I think the way it is done seems the most flexible if an attachment was ever added to another entity different from node.

drasgardian’s picture

testing the patch in #21 I've hit a couple of problems so far

First, I started getting errors like this:

PDOException: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '7914' for key 'PRIMARY': INSERT INTO {apachesolr_index_entities_file} (entity_id, status, bundle, entity_type, changed) SELECT fm.fid AS entity_id, fm.status AS status, 'file' AS bundle, 'file' AS entity_type, 1334016075 AS changed FROM {file_managed} fm INNER JOIN {file_usage} fu ON fu.fid = fm.fid WHERE (fu.type = :db_condition_placeholder_0) ; Array ( [:db_condition_placeholder_0] => node ) in apachesolr_attachments_solr_reindex() (line 195 of [path to site]/modules/apachesolr_attachments/apachesolr_attachments.module).

Which seemed to be caused by having more than one node referencing a single file. In my case 2 translations referencing a single pdf.

This can be overcome by adding select->distinct() to the query in apachesolr_attachments_solr_reindex()

Next, I found that it kept trying to process the same batch of file entities over and over again. This is because when indexing a file fails, the file entity is not flagged as processed and so gets attempted again in the next batch. apachesolr_attachments should really skip over the failures in order to process all the other files.

In my case there were two reasons I found that extracting text from a file was failing:

1. because the actual pdf couldn't be found in the filesystem (I'd deliberately removed most of the pdfs from my dev environment to keep it lean). This shouldn't really ever occur in a production environment, but I think apachesolr_attachments should be able to skip over it and continue on with indexing all the other files for the sake of robustness.

2. where I'd placed field-based permission restrictions on a filefield field, so extracting text fails because apachesolr tries to index everything as anonymous. This case is legit so again, apachesolr_attachments should be able to cope with that, skip over it and continue on with indexing all the other files.

I'll have some more time later in the week so I'll hopefully be able to sort some of these out issues out and post a patch back.

nick_vh’s picture

StatusFileSize
new45.83 KB

Included a patch with small improvements and simplifications.

@drasgardian
Thanks for the distinct one, added that immediately.

In regards of your comments :
I tried removing a file, but watchdog catches that and tells me : "public://fast_esp_to_apache_solr_search_migration_20120227.pdf is not a valid file path"

Could you copy/paste the exceptions here that you encounter?

Also this issue is now dependent on #1519900: Error in apachesolr_index_get_entities_to_index() function because of the apachesolr_remove_entity function

drasgardian’s picture

StatusFileSize
new2.5 KB

I was getting the same watchdog message too, but the entity wasn't being removed from the index so the same entities just kept being attempted in a loop. Looks like you've corrected that now by adding apachesolr_attachments_is_file to the status callbacks.

The next issue I ran into was

WD php: EntityMalformedException: Missing bundle property on entity of type node. in entity_extract_ids() (line [error]
7501 of [path to drupal]/includes/common.inc).

This was coming from apachesolr_attachments_excluded_entities() when a referencing node had revisions. entity_load() was being called with the vid rather than nid.

I also noticed that apachesolr_attachments_solr_document() was reusing the $documents variable name.

Patch to correct these items attached.

my attachments are now being indexed :) but I have turned off my field-based permissions for the moment. Next I'll turn that back on and report back if it still causes issues.

drasgardian’s picture

StatusFileSize
new895 bytes

I ran into another old issue where the search result links for documents within the private filesystem point to the filepath of the files (e.g. /home/path/to/private/files/) rather than a url.

Attached is a patch to correct this.

This is addressing the same issue as #1188860: support private filesystem link, but so much has changed within this issue thread that a single patch could not apply both there and here.

nick_vh’s picture

Please keep this patch as one big one until the moment we commit it. Small patches will go lost in space. However, adding the small path together with the full patch is a +++! (interdiffs)

nick_vh’s picture

StatusFileSize
new47.46 KB

Renamed all variables with reference to parent/parents to make more sense when reading the code. I also included the previous patches from drasgardian (#32 and #33)

You should re-index to test these changes. Also do update.php

drasgardian’s picture

I'm getting the following error with patch #35

reset() expects parameter 1 to be array, null given apachesolr_attachments.module:237 [warning]
WD php: EntityMalformedException: Missing bundle property on entity of type node. in entity_extract_ids() (line 7501 of [error]

I'd make another patch but was just about to go home. The following change should fix it.

- $parents_list = reset($parents);
+ $parents_list = $parents_list ? reset($parents_list) : NULL;
nick_vh’s picture

perfect, will be incorporated in the next one.

Todo : Make sure file changes are recognized and kept track of in our own tables for easier backportability

nick_vh’s picture

StatusFileSize
new54.74 KB

I stopped using file_usage and file_manage in favor of field Api. Since all files that we are targeting are handled by fields (file field most probably) it is preferable that we loop over all those files instead of trusting file_usage and file_manage.

There are some tricks in the code that override a small part of EntityFieldQuery but it was to add more fields to the return value copared to the default EntityFieldQuery.

Try to re-index your site and see if it works, I'll do more testing later on.

drasgardian’s picture

StatusFileSize
new55.32 KB

Here's another update. It wasn't accounting for different file fields having different names for the columns storing the fid values.

Plus a few other small fixes. I added back in the hook_apachesolr_query_alter function to get the results displaying the mime type and a link back to the parent node.

drasgardian’s picture

StatusFileSize
new5.88 KB

for reference, here are the changes between #38 and #39

nick_vh’s picture

+++ b/apachesolr_attachments.moduleundefined
@@ -255,13 +258,14 @@ function apachesolr_attachments_excluded_entities($entity_id, $entity_type) {
-        $parent_entity = reset($parents_entities);

funky space difference here?

drasgardian’s picture

ah, sorry about that. My IDE was doing weird tab stuff instead of double spaces.

the change was removing the extra s off $parentS_entities though - I wasn't just adding whitespace ;)

nick_vh’s picture

StatusFileSize
new55.27 KB

Incorporated changes from drasgardian + made use of entity_uri for all and a cleanup of functions and variable assignments that were not needed.

nick_vh’s picture

StatusFileSize
new55.67 KB

Small mistake in query_alter

nick_vh’s picture

StatusFileSize
new55.63 KB

I hope this one is a very stable patch. Fixed a notice in the update + some weird text that was added to the install file.

nick_vh’s picture

StatusFileSize
new55.82 KB

drasgardian found a small mistake in _apachesolr_attachments_get_all_files();

The results were not merged with each other. Now they are getting merged. I'm not 100% sure if array_merge is the correct solution for this but hopefully someone will report back with the answer.

nick_vh’s picture

StatusFileSize
new55.24 KB

Removed some complex query in update_7003. It now re-indexes only your files so you should submit all of them again to your index when this update applies. It is the easiest way to get a consistent table from the first try.

nick_vh’s picture

StatusFileSize
new55.95 KB

In co-operation with Swentel the apachesolr_attachments_requirements() was not behaving properly.
We added a value key to the return value that fixed the notice

nick_vh’s picture

Status: Reviewed & tested by the community » Needs review

Additional todo's (most probably for other patch):

  • Make UI better
  • Test button to test the tika extraction. This allows site admins to verify if text extraction works as expected.
swentel’s picture

Status: Needs work » Reviewed & tested by the community

Works fine here!

drasgardian’s picture

Status: Needs review » Reviewed & tested by the community
StatusFileSize
new55.95 KB

patch #48 still had the extra 's' in $parents_entities

@@ -274,7 +274,7 @@ function apachesolr_attachments_excluded_entities($entity_id, $entity_type) {
         // load the parent entity and reset cache
         $parent_entities = entity_load($parent_entity_type, array($parent_entity_id), NULL, TRUE);
         // Take the first entity from the stack
-        $parent_entity = reset($parents_entities);
+        $parent_entity = reset($parent_entities);
 
         // Skip invalid entities
         if (empty($parent_entity)) {

updated patch attached

nick_vh’s picture

StatusFileSize
new58.18 KB

Now it is more in line with #1515822: Support notion of parent entity for derived documents and allows multiple parents. However, this added extra complexity that we even have to "fake" an entity and serialize it into the index. I added code comments why this was necessary.

We go for 0 node loads in the search results page :)

nick_vh’s picture

StatusFileSize
new59.67 KB

And some code style cleanups

nick_vh’s picture

Status: Reviewed & tested by the community » Fixed

And committed!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

jenlampton’s picture

Issue summary: View changes

This change renamed the database table, and now there's a conflict with the apachesolr_file module. That module also uses apachesolr_index_entities_file.
See #2568479: Schema conflict with apachesolr_file module

ohthehugemanatee’s picture

Version: 7.x-1.x-dev » 6.x-1.x-dev
Status: Closed (fixed) » Patch (to be ported)

I can't believe I'm saying this, with D8 so close to a reality... but this still needs backporting to D6.

janusman’s picture

Status: Patch (to be ported) » Closed (outdated)