I'm new to Drupal, which is why I have not filed this as a bug report. It's entirely possible that I am missing something basic.

Crossposted here: http://drupal.stackexchange.com/questions/25739/search-api-attachments-n...

I'm running Drupal 7.12. I have a requirement to index files attached to nodes (DOC, PDF, XLS, etc.).

After quite a bit of trial and error, reading and searching, I have been unable to get any of the attachment file indexing modules to work.

Search API Attachments seems the most promising, but after configuring it, it doesn't seem to index any of the documents. It does call Tika to process attachments to new nodes, but doesn't seem to re-index any existing nodes. Searching on any file contents yields no results.

Any thoughts on what I should investigate to track this down would be greatly appreciated. Is anyone running this module successfully? If so, how did you configure it?

More info:

I followed these instructions: http://permalink.gmane.org/gmane.comp.php.drupal.support/22390

To see if I can track this down:

I configured a fresh installation of Drupal 7.12.

I installed Search API, Search API DB and Search API Attachments:

projects/search_api_db-7.x-1.x-dev.tar.gz
projects/search_api_attachments-7.x-1.2.tar.gz
projects/search_api-7.x-1.0.tar.gz

I installed Tika and verified that it is working per instructions.

I created a field called Attachment1 of type File for the "Article" content type: structure -> content types -> article. label: Attachment1 name: field_attachment1, field: File.

I created a search server in Configuration -> Search and metadata -> Search API using the database service class.

I then edited the Default Node Index to use the search server I defined and enabled it.

I then went to the Default Node Index fields and using "Add Related Fields" added Attachment1 » The file.

I checked the box next to Attachment1 >> The file and Attachment content: field_attachment1

I created a single Article and attached a word document to it.

I ran the cron service from the admin menu.

In the database I ran select * from search_index and I notice that only the terms from the node content are present, nothing from the document that I uploaded.

I verified that Tika was being called by adding the syslog call prior to shell_exec which yields:

Mar 17 13:32:06 xeon httpd: Calling Tika: java -Dfile.encoding=UTF8 -cp '/usr/local/tika' -jar '/usr/local/tika/tika-app-1.0.jar' -t 'http:///drupal/sites/default/files/Resume_0.doc'

However, if I select to re-index the site by going to Configuration -> Search and metadata -> search settings -> clear index and re-run cron I notice that Tika is NOT being called.

Also I notice that selecting clear index does not affect the search_index table. (Should it?)

I have separately verified that Tika is properly extracting the text from the document by logging it to a file prior to the call.
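
For anyone wanting to repeat that check outside Drupal, here is a minimal sketch that builds the same kind of Tika call for a single file. The jar location and file name are taken from the log line above; the path relative to the Drupal root and the helper name tika_text_cmd are assumptions for illustration only.

```shell
#!/bin/sh
# Jar location copied from the log line above; override TIKA_JAR if yours differs.
TIKA_JAR="${TIKA_JAR:-/usr/local/tika/tika-app-1.0.jar}"

# Hypothetical helper: print the Tika command that extracts plain text (-t)
# from one file, without running it.
tika_text_cmd() {
  printf "java -Dfile.encoding=UTF8 -jar '%s' -t '%s'\n" "$TIKA_JAR" "$1"
}

# Assumed path, relative to the Drupal root.
tika_text_cmd "sites/default/files/Resume_0.doc"
```

Piping the printed line to sh (for example: tika_text_cmd somefile.doc | sh) should run the extraction and print the text, provided Java and the jar are where the sketch assumes.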

My feeling is that I'm missing something obvious.

Comments

Yermo-1’s picture

Doing some more digging:

I'm not sure if this is how it's intended to work, but if you re-index the site (Configuration -> Search and Metadata -> Search Settings -> Clear Index) and then re-run cron, Search API Attachments does not get called UNLESS a node has changed.

It does call extract_tika() on the node that has changed. I'm not sure under what circumstances it re-indexes all the nodes. At first it seemed changing a node was required to get it to reindex all file attachments but I am now not able to reproduce that behavior. (So only nodes that I edit have extract_tika called).

I have verified that extract_tika is returning strings from the attached documents up the chain.

However, those search terms do not seem to be making it into the database unless I'm misunderstanding something basic. Searching on keywords contained in the documents still yields no results.

Am continuing to dig through the code.

Yermo-1’s picture

Title: File Attachments Not Getting Indexed » File Attachments Not Getting (Re)Indexed
Component: Code » Documentation
Category: support » bug
Priority: Major » Normal

It looks like this is a case of PEBKAC or maybe it could be listed as a case of lack of documentation.

What threw me was the fact that Tika was not being called when re-indexing the site. That led me to chase red herrings and believe the problem was in indexing.

What I did not understand was that the default Drupal search form does not interface with Search API, and therefore not with Search API Attachments either. (I had thought it plugged into some hook in the default search ... is it supposed to?)

I installed Search API Pages, and using that form I am able to get my attachment search results as I would expect.
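
For anyone retracing this: the search_index table queried earlier belongs to the core Search module, while Search API DB writes to tables of its own. A minimal sketch for listing them follows. It assumes drush is installed, a MySQL backend, and the search_api_db_ table prefix used by Search API DB 7.x; treat the prefix as an assumption and check your own schema. The helper name list_tables_sql is hypothetical.

```shell
#!/bin/sh
# Assumption: Search API DB 7.x prefixes its tables with "search_api_db_".
PREFIX="${PREFIX:-search_api_db_}"

# Hypothetical helper: build the query that lists those tables (MySQL syntax).
list_tables_sql() {
  printf "SHOW TABLES LIKE '%s%%'" "$1"
}

# Print the drush command rather than running it, so this is safe to paste.
echo "drush sqlq \"$(list_tables_sql "$PREFIX")\""
```

If the printed command returns tables that contain terms from the attached documents, indexing is working and only the search front end needs attention.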

lotyrin’s picture

Component: Documentation » Miscellaneous
Category: bug » support
Status: Active » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

BarisW’s picture

Version: 7.x-1.2 » 7.x-1.x-dev
Status: Closed (fixed) » Active

I'm not sure why this issue is marked as Fixed?
How can the tika index be cleared? When does that happen?

izus’s picture

Hi,
did you try clearing the node index and reindexing?

BarisW’s picture

Yes, obviously. But that doesn't seem to clear the file indexes as well. Only when I upload the file to a node again does it seem to get reindexed.

izus’s picture

And did you try clearing the cache after deleting the index?
Here are my thoughts:
- In getFileContent we cache the files' content, as extracting it is a performance-intensive operation.
We delete that cache on file update and file delete.

If clearing the cache after deleting the index fixes it, then I think we should document this well and keep that behaviour, as reindexing all the files can be very expensive.
Does this help?
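
The sequence suggested above (delete the index, clear the cache, reindex) can be sketched with drush as follows. This assumes drush with the Search API commands is available, that the index machine name is default_node_index, and that the cached file content lives in a bin that "drush cc all" flushes; none of this is confirmed in this thread. The helper names run and reindex_with_attachments are hypothetical.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clear the index, drop caches so cached file extractions are discarded,
# then index everything again.
reindex_with_attachments() {
  index="$1"
  run drush search-api-clear "$index"
  run drush cc all
  run drush search-api-index "$index"
}

# Assumed machine name of the "Default node index".
reindex_with_attachments default_node_index
```

By default this only prints the three commands; set DRY_RUN=0 to execute them. On a site with many large files the last step may take a long time, since every attachment is extracted again.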

BarisW’s picture

I expect the file's snippet to be added to the search result. When I add an attachment to a node, I see the file's snippet in the search results (after applying this patch: #2134163: Add the snippet to the search result). However, when I change the attachment on a node, the snippet becomes empty and doesn't get updated. Also, when I clear the index and re-index it, the snippet doesn't get added again either.

I assumed this to be related, but maybe I'm looking at the wrong issue?

nerdcore’s picture

I've just installed search_api_attachments-7.x-1.3 and am using search_api_solr-7.x-1.3 with a Solr 4.6.0 server.

My index is called "Default multilingual node index".

I've tried using the Solr server for file attachment indexing, and I've tried using a local Tika JAR.

I've modified the Basic Page Content Type to add a File field called "field_attachments".

I've enabled these fields for indexing at /admin/config/search/search_api/index/default_multilingual_node_index/fields:

* Attachments » The file. (field_attachments:file)
* Attachments » The file. » File name (field_attachments:file:name)
* Attachments » The file. » MIME type (field_attachments:file:mime)

I've rebuilt the index multiple times and see no errors from either Drupal or on the Solr server.

I can search node content (I'm also indexing the node body) and titles (as title_field provided by Entity Translation), but cannot get any results based on either the file name or the MIME type matching my attachment in field_attachments.

izus’s picture

Actually, it's the file content that is indexed, not the name of the file or its MIME type.

nerdcore’s picture

@izus, Are you saying that it is impossible to use this module to provide MIME type information or filename information in search results? If so, why are these fields exposed as searchable fields in the list of searchable fields?

Is there an alternative way to provide these pieces of information? I'm trying to build a search which allows users to filter based on file type.

izus’s picture

That's a use case I haven't tried yet. I'm just saying that while indexing the node content, we also index the files' content in the same index. I will post a comment here if I find some time to test your case. Please feel free to do the same if you succeed, with or without a patch!

izus’s picture

Status: Active » Fixed

This issue was about files not getting reindexed.
With the latest code base, the extracted file contents are cached for performance. When the cache is deleted, the files are reindexed.
Closing the issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.