Is the indexing of documents (attached files like PDF) supported?

open-keywords - October 15, 2008 - 10:19
Project:xapian
Version:5.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:needs review
Description

The initiative to integrate Xapian and Drupal looks really great.
It would be even better if it leveraged Xapian's ability to index document files such as PDFs, office documents, etc.

I don't see anything in the documentation about this. Is it supported?
Are there any limitations or constraints (CCK, File upload, etc.)?

If not, are there any plans to support it?

Regards

#1

open-keywords - October 15, 2008 - 20:57

Actually, this post seems to say yes!
http://www.trellon.com/blog/xapian-search-drupal

#2

singularo - October 16, 2008 - 04:29

We have not yet added indexing of PDF, DOC, and similar files. It is in our plans, but it has not been a high priority for our own projects so far.

Patches are welcome ;-)

#3

miiimooo - November 21, 2008 - 12:47
Component:Documentation» Code
Status:active» needs review

Hi. I've made a start on this. The patch may not help you, since my scenario might be different from yours: I needed to integrate the fileshare module, so it's a dirty little hack. I do the indexing in an external cron job using Omega. The patch adds a tab called 'Files' to the normal search, so there is no need to patch Drupal core. It then returns search results from whatever the cron job indexed; in my case, that is the fileshare folders.

--- xapian.module.orig  2008-09-17 14:43:07.000000000 +0100
+++ xapian.module       2008-11-21 12:39:32.000000000 +0000
@@ -476,10 +476,17 @@
     while (!$i->equals($matches->end())) {
       $count++;
       $document = $i->get_document();
+//       drupal_set_message("<pre>" . print_r($document->get_data(), TRUE) . "</pre>");
       if (is_object($document)) {
+        if (is_numeric($document->get_data())) {
         $results[$count]->type = 'node';
         $results[$count]->sid = (int)($document->get_data());
         $results[$count]->score = (int)($i->get_percent());
+        } else {
+          $results[$count]->type = 'file';
+          $results[$count]->data = $document->get_data();
+          $results[$count]->score = (int)($i->get_percent());
+        }
       }
       $i->next();
     }
@@ -572,7 +579,58 @@
     return $results;
   }
}
+function xapian_search ($op = 'search',  $keys = null){
+  global $pager_total;
+  global $pager_page_array;
+
+  switch ($op) {
+       case 'name':
+         return t('Files');
+       case 'reset':
+         return;
+       case 'search':
+      $links = array();

+//       drupal_set_message("<pre>" . print_r($keys, TRUE) . "</pre>");
+      $page = (!empty($_REQUEST['page']) ? $_REQUEST['page'] : 0);
+      $words = '"'.chop(str_replace('\(', '(', str_replace('\)', ')', str_replace('\*','*', escapeshellcmd($keys).' ')))).'"';
+      /// TODO handle pager
+      $extra = array();
+      list($count, $results) = xapian_query($keys, 0, 10, $extra);
+      $pager_total[0] = (int)($count / variable_get('xapian_search_results_per_page', 10)) + 1;
+      $pager_page_array[0] = $page;
+//       drupal_set_message("<pre>" . print_r($results, TRUE) . "</pre>");
+      foreach($results as $result) {
+        if ($result->type != 'file') {
+          continue;
+        }
+
+        $arrData = explode("\n", $result->data);
+        $arrParsed = array();
+        foreach($arrData as $line) {
+          list($key, $val) = array_pad(explode("=", $line, 2), 2, '');
+          $arrParsed[$key] = $val;
+        }
+        $found = array(
+          'type' => t('Files'),
+          'link' => $arrParsed['url'],
+          'title' => basename($arrParsed['url']),
+          'snippet' => search_excerpt($keys, $arrParsed['sample']),
+          'date'=>$arrParsed['modtime'],
+          'extra' => array(
+            'folder'=>$arrParsed['url'],
+            'file'=>$arrParsed['url'],
+            'score'=>$arrParsed['score'],
+            'file_size'=>$arrParsed['size'],
+            'file_type'=>$arrParsed['type'],
+           )
+         );
+        //TODO check for any unusual field - maybe 'caption'
+        $links[] = $found;
+      }
+      return $links;
+  }
+}
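
The record parsing in the patch above can be tried on its own. The following is a minimal sketch, not part of the xapian module: the helper name `example_parse_file_record` and the sample record are made up for illustration, and the field names simply mirror the ones the patch reads (`url`, `sample`, `modtime`, `size`, `type`). It assumes the document data is stored as one `key=value` pair per line.

```php
<?php
// Hypothetical helper (not part of the xapian module): parse a record made
// of "key=value" lines into an associative array.
function example_parse_file_record($data) {
  $fields = array();
  foreach (explode("\n", $data) as $line) {
    // Split on the first "=" only, so values such as URLs with query
    // strings keep any later "=" characters.
    $parts = explode('=', $line, 2);
    if (count($parts) < 2) {
      // Skip blank or malformed lines rather than raising a notice.
      continue;
    }
    $fields[$parts[0]] = $parts[1];
  }
  return $fields;
}

// Made-up sample record; the keys are the ones the patch reads.
$record = "url=/files/report.pdf?rev=2\n"
        . "sample=Quarterly results\n"
        . "modtime=1227271172\n"
        . "size=2048\n"
        . "type=application/pdf";

$fields = example_parse_file_record($record);
print $fields['url'] . "\n";
```

Limiting `explode()` to two parts keeps a value intact when it contains its own `=`, and skipping lines without a separator avoids undefined-offset notices on trailing blank lines.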

 
 
