Will I be able to use the site's search function to find keywords inside attachments such as .doc and .pdf files? Thank you.

Comments

markj’s picture

I notice someone asked about this back in April but there were no responses. I don't see any indication that Drupal can do this, but there are Linux utilities like pdftotext (for PDF) and catdoc (for MS Word) that can extract the text from binary files. I'm not sure whether these are available for Windows or whether there are equivalents. A simple solution to your problem would be to find all the attachments on a node, extract the text of the attachments, and add it to the node's entry in the search_dataset table. A search would then match on the node's own text and the text of all its attachments, which is crude but at least a place to start.
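To make the extraction step concrete, here is a minimal sketch of how a file could be routed to a helper application by extension. The function names (example_extract_command, example_extract_text) and the extension-to-helper mapping are hypothetical, chosen only for illustration, and the sketch assumes pdftotext, catdoc, and cat are on the PATH.

```php
<?php

/**
 * Hypothetical helper: returns the shell command that would dump a file's
 * text to stdout, or NULL when the extension is not supported.
 */
function example_extract_command($path) {
  // Assumed mapping of file extension to helper application.
  $helpers = array(
    'pdf' => 'pdftotext %s -',  // The trailing "-" sends the text to stdout.
    'doc' => 'catdoc %s',
    'txt' => 'cat %s',
  );
  $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
  if (!isset($helpers[$ext])) {
    return NULL;
  }
  // escapeshellarg() guards against spaces and shell metacharacters.
  return sprintf($helpers[$ext], escapeshellarg($path));
}

/**
 * Hypothetical helper: runs the extractor and returns the extracted text,
 * or '' if the file type is unsupported or the helper produced no output.
 */
function example_extract_text($path) {
  $command = example_extract_command($path);
  if ($command === NULL) {
    return '';
  }
  $output = shell_exec($command);
  return is_string($output) ? $output : '';
}
```

Calling example_extract_text() on an attachment's path would then give a string that could be appended to the node's indexed text.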

I'm not that familiar with the search module but I'll take a look to see if this is possible.

ryyz’s picture

Thanks, markj. I'll try a search with a .doc attachment. Curious. Thank you again.

markj’s picture

I've got a skeletal module that attaches some test text to each node's search data as part of the normal indexing process:

<?php

/**
 * Implementation of hook_nodeapi().
 */
function search_attachments_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  switch ($op) {
    case 'update index':
      // Text returned here is indexed along with the node's own content.
      return search_attachments_node_update_index($node);
  }
}

/**
 * Helper for the 'update index' operation of hook_nodeapi().
 *
 * Returns the extra text to add to the node's search data.
 */
function search_attachments_node_update_index(&$node) {
  return ' test text to attach';
}
?>

The next step is to identify the attachments for each node, extract the text of each, and add that text to the parent node's search data. I'll work on this over the next week or so. If all goes right, the result will be a module that indexes .doc and .pdf files and merges their text with that of the parent node, so that search hits in the attached text will return the parent node. We'll go from there. The module will only work if you have the extractor utilities (catdoc, pdftotext) installed, however.
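As a rough sketch of that next step (not the module's actual code), the loop over a node's attachments might look something like this. It assumes $node->files is an array of objects that each carry a filepath property, and that $extractor names a function taking a file path and returning plain text. The function name example_collect_attachment_text is hypothetical.

```php
<?php

/**
 * Hypothetical helper: gathers the text of every attachment on a node.
 *
 * Assumes $node->files is an array of objects with a filepath property,
 * and that $extractor is the name of a function that takes a file path
 * and returns that file's plain text.
 */
function example_collect_attachment_text($node, $extractor) {
  $text = '';
  if (empty($node->files)) {
    return $text;
  }
  foreach ($node->files as $file) {
    if (!is_file($file->filepath)) {
      continue;  // Skip records whose file is missing on disk.
    }
    // A leading space keeps words from adjacent files from running together.
    $text .= ' ' . call_user_func($extractor, $file->filepath);
  }
  return $text;
}
```

In the skeleton, the 'update index' case would return this combined string in place of the test text.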

markj’s picture

I've got this working on PDFs, Word, and text files. I'm still in the code cleanup/documentation stage, and I'd also like to add some basic form validation, like telling you if the module can't find the helper apps you specify. Stay tuned...
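For the helper-app validation, a check along these lines is one possibility. This is only a sketch: the function name example_validate_helper_path and the message wording are hypothetical, and it assumes the settings form stores a full path to each helper application.

```php
<?php

/**
 * Hypothetical validation helper: returns an error message if the given
 * helper application path is unusable, or NULL if it looks fine.
 */
function example_validate_helper_path($path) {
  if ($path === '') {
    return 'No helper application path was given.';
  }
  // is_file() also rejects directories, which can report as executable.
  if (!is_file($path)) {
    return "Helper application not found: $path";
  }
  if (!is_executable($path)) {
    return "Helper application is not executable: $path";
  }
  return NULL;
}
```

In a settings form, a non-NULL result could be shown to the administrator as a validation error.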

ryyz’s picture

markj, this is great. I am still new to Drupal and have no experience with PHP either (I'm an old legacy guy). Can you include a 'how to' covering what has to be added to Drupal (any external or additional modules)? Many thanks.

ymcp’s picture

This sounds like a very interesting module. I'd be happy to help test it.

markj’s picture

I think it's ready to test. You'll have to install pdftotext (http://www.bluem.net/downloads/pdftotext_en/ for an OS X package; I can't seem to find the Linux version at the moment) and catdoc (http://www.45.free.net/~vitus/software/catdoc/ for source) in order to use search_attachments.module. If you are on a system that comes with 'cat' installed, you could instead just test it with .txt attachments without installing any additional helper applications.

Email me through my Drupal contact form and I'll send you the module later tonight. It would be nice to have a couple of people test it prior to making a general announcement.

ryyz’s picture

Maybe the swish-e module might work.

markj’s picture

I've packaged search_attachments.module up and made it available at http://interoperating.info/mark/search_attachments

Thanks to all who assisted.

xamox’s picture

This works for me! Awesome and thank you, this saves me SO much time.

---------------------------------------------------------------
http://xamox.NET

rancor’s picture

Nice! This was the last thing I needed to make my site perfect.

Thanks

Edit: Nooo =( Please make your release available from drupal.org