I am facing a tricky problem and would love any help. Right now we have users creating nodes with many CCK fields and a search that uses Views exposed filters to search those nodes. All good so far, except now they're uploading files (using CCK FileField) and want to be able to search the file content in the same View.

To do the Views-based search, currently we're using the "Search: Search Terms" exposed filter which hooks into Drupal's core search. I found http://drupal.org/project/search_files and http://drupal.org/project/search_by_page and *AM* successfully indexing the PDFs and text files. Trouble is -- these modules don't hook into Drupal's core search, but instead provide a separate tab on the search page. This means they don't work with the "Search: Search Terms" filter.

Does anyone know how to do this, or have ideas on alternative approaches?

Comments

feedbackloop

This is my current solution. Can any search experts provide advice on improvements?

/**
 * Implementation of hook_nodeapi().
 *
 * Adds the text of attached files to the search index for profile nodes.
 */
function mymodule_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op == 'update index' && $node->type == 'profile') {
    //watchdog('debug', 'profile node with nid ' . $node->nid . ' is being indexed');
    $extra = '';

    // Guard the declaration: 'update index' can run for several nodes in one
    // request, and declaring the same function twice is a fatal error in PHP.
    if (!function_exists('index_files')) {
      function index_files($field) {
        $extra_ = '';
        if ($field) {
          foreach ($field as $file) {
            // Skip empty field items, which have no file ID.
            if (empty($file['fid'])) {
              continue;
            }
            // Get cached value in DB if possible; otherwise make pdftotext etc calls.
            $query = db_query("SELECT content FROM {file_dumps} WHERE fid=%d", $file['fid']);
            if ($row = db_fetch_array($query)) {
              $content = $row['content'];
            } else {
              $content = search_files_attachments_get_file_contents($file['filepath']);
              if (!empty($content)) {
                db_query("INSERT INTO {file_dumps} (fid, content) VALUES (%d, '%s')", $file['fid'], $content);
              }
              //if ($content) { watchdog('debug', 'FILE: ' . $content); }
            }
            $extra_ .= " $content ";
          }
        }
        return $extra_;
      }
    }

    $extra .= index_files($node->field_user_files);
    $extra .= index_files($node->field_user_publications);
    if (strlen($extra) > 0) {
      watchdog('debug', 'Added ' . strlen($extra) . ' bytes to node ' . $node->nid . ' from file attachments for search indexing.');
    }
    // The returned text is indexed together with the node's own content.
    return $extra;
  }
}

P.S.: file_dumps is a custom table with three fields I made for caching file contents:

CREATE TABLE `YOUR_SITE_DATABASE_NAME`.`file_dumps` (
`did` INT( 10 ) NOT NULL AUTO_INCREMENT ,
`fid` INT( 10 ) NOT NULL ,
`content` MEDIUMTEXT NOT NULL ,
PRIMARY KEY ( `did` )
) ENGINE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

The cache table isn't essential, but it saves re-running the text extraction (pdftotext etc.) every time a node is reindexed.
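
If you'd rather not create the table by hand, something like the following in mymodule.install should set it up through the Drupal 6 Schema API. This is only a sketch: the module name mymodule is the same placeholder used above, and the index on fid is my own addition (it is not in the CREATE TABLE statement) to speed up the lookup by fid.

```php
<?php
/**
 * Implementation of hook_schema().
 *
 * Describes the file_dumps cache table. The 'fid' index is an assumption
 * added for faster lookups; the hand-written table above does not have it.
 */
function mymodule_schema() {
  $schema = array();
  $schema['file_dumps'] = array(
    'description' => 'Cache of text extracted from uploaded files, used for search indexing.',
    'fields' => array(
      // Auto-incrementing row ID.
      'did' => array('type' => 'serial', 'not null' => TRUE),
      // ID of the file the text was extracted from.
      'fid' => array('type' => 'int', 'not null' => TRUE),
      // Extracted text (MEDIUMTEXT on MySQL).
      'content' => array('type' => 'text', 'size' => 'medium', 'not null' => TRUE),
    ),
    'primary key' => array('did'),
    'indexes' => array('fid' => array('fid')),
  );
  return $schema;
}

/**
 * Implementation of hook_install().
 */
function mymodule_install() {
  drupal_install_schema('mymodule');
}

/**
 * Implementation of hook_uninstall().
 */
function mymodule_uninstall() {
  drupal_uninstall_schema('mymodule');
}
```

With this in place the table is created when the module is first enabled and dropped when the module is uninstalled, so the {file_dumps} queries also pick up any table prefix the site uses.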

feedbackloop

One final addition. If you used the code above, you probably want this in your module, too:

/**
 * Implementation of FileField's hook_file_delete().
 *
 * When a FileField file is deleted, remove the cache of it from the file_dumps table.
 */
function mymodule_file_delete($file) {
  if ($file->fid) {
    // Remove cached data (if it exists)
    db_query("DELETE FROM {file_dumps} WHERE fid=%d", $file->fid);
  }
}

feedbackloop

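Whoops! Be sure index_files is wrapped in if (!function_exists('index_files')), or PHP will throw a fatal "Cannot redeclare" error when nodeapi('update index') is called more than once in the same request, which happens whenever cron indexes more than one profile node.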
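
For anyone wondering why the guard matters, here is a stripped-down illustration outside Drupal. The function names example_outer and example_inner_helper are made up for the demo; the point is only that a function declared inside another function gets declared again on every call unless it is guarded.

```php
<?php
/**
 * Hypothetical stand-in for a hook that is called once per node.
 */
function example_outer($text) {
  // Without this check, the second call to example_outer() would try to
  // declare example_inner_helper() again and PHP would stop with a fatal error.
  if (!function_exists('example_inner_helper')) {
    function example_inner_helper($s) {
      return strtoupper($s);
    }
  }
  return example_inner_helper($text);
}

// Safe to call repeatedly, as the search indexer does for each node.
$first = example_outer('first node');
$second = example_outer('second node');
```

Declaring the helper at the top level of the module file, under a module-prefixed name, would avoid the problem as well and is probably the tidier option.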