Is indexing of documents (attached files such as PDFs) supported?
open-keywords - October 15, 2008 - 10:19
| Project: | xapian |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
Description
The initiative to integrate Xapian and Drupal looks really great.
It would be even better if it leveraged Xapian's ability to index document files such as PDFs and office documents.
I don't see anything in the documentation about this. Is it supported?
Are there any limitations or constraints (CCK, File upload, etc.)?
If not, are there any plans to support it?
Regards

#1
Actually, this post seems to say yes!
http://www.trellon.com/blog/xapian-search-drupal
#2
We have not yet added indexing of PDF, DOC, and similar files. It is in our plans, but it has not been a high priority for our own projects so far.
Patches are welcome ;-)
#3
Hi. I've made a start on this. The patch may not help you, since my scenario might be different from yours: I needed to integrate the fileshare module, so it's a dirty little hack. I do the indexing in an external cron job using omega. The patch adds a tab called 'Files' to the normal search, so there is no need to patch Drupal core. It then returns search results from whatever the cron job indexed; in my case, that is the fileshare folders.
--- xapian.module.orig 2008-09-17 14:43:07.000000000 +0100
+++ xapian.module 2008-11-21 12:39:32.000000000 +0000
@@ -476,10 +476,17 @@
while (!$i->equals($matches->end())) {
$count++;
$document = $i->get_document();
+// drupal_set_message("<pre>" . print_r($document->get_data(), TRUE) . "</pre>");
if (is_object($document)) {
+ if (ctype_digit((string) $document->get_data())) { // get_data() returns a string, so is_int() would never match
$results[$count]->type = 'node';
$results[$count]->sid = (int)($document->get_data());
$results[$count]->score = (int)($i->get_percent());
+ } else {
+ $results[$count]->type = 'file';
+ $results[$count]->data = $document->get_data();
+ $results[$count]->score = (int)($i->get_percent());
+ }
}
$i->next();
}
@@ -572,7 +579,58 @@
return $results;
}
}
+function xapian_search ($op = 'search', $keys = null){
+ global $pager_total;
+ global $pager_page_array;
+
+ switch ($op) {
+ case 'name':
+ return t('Files');
+ case 'reset':
+ return;
+ case 'search':
+ $links = array();
+// drupal_set_message("<pre>" . print_r($keys, TRUE) . "</pre>");
+ $page = (!empty($_REQUEST['page']) ? $_REQUEST['page'] : 0);
+ $words = '"'.chop(str_replace('\(', '(', str_replace('\)', ')', str_replace('\*','*', escapeshellcmd($keys).' ')))).'"';
+ /// TODO handle pager
+ $extra = array();
+ list($count, $results) = xapian_query($keys, 0, 10, $extra);
+ $pager_total[0] = (int)($count / variable_get('xapian_search_results_per_page', 10)) + 1;
+ $pager_page_array[0] = $page;
+// drupal_set_message("<pre>" . print_r($results, TRUE) . "</pre>");
+ foreach($results as $result) {
+ if ($result->type != 'file') {
+ continue;
+ }
+
+ $arrData = explode("\n", $result->data);
+ $arrParsed = array();
+ foreach($arrData as $line) {
+ list($key, $val) = array_pad(explode("=", $line, 2), 2, ''); // limit of 2 keeps '=' inside values; padding guards lines without '='
+ $arrParsed[$key] = $val;
+ }
+ $found = array(
+ 'type' => t('Files'),
+ 'link' => $arrParsed['url'],
+ 'title' => basename($arrParsed['url']),
+ 'snippet' => search_excerpt($keys, $arrParsed['sample']),
+ 'date'=>$arrParsed['modtime'],
+ 'extra' => array(
+ 'folder'=>$arrParsed['url'],
+ 'file'=>$arrParsed['url'],
+ 'score'=>$result->score,
+ 'file_size'=>$arrParsed['size'],
+ 'file_type'=>$arrParsed['type'],
+ )
+ );
+ //TODO check for any unusual field - maybe 'caption'
+ $links[] = $found;
+ }
+ return $links;
+ }
+}
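
For anyone adapting this, here is a small standalone sketch of the two ideas the patch relies on: telling node results apart from file results by looking at the document data, and splitting the `key=value` lines that the omega indexer stores for each file. The function names `example_parse_omega_data` and `example_result_type` and the sample record are made up for illustration and are not part of xapian.module. The field names (`url`, `sample`, `type`, `modtime`, `size`) are the ones the patch reads, and I am assuming the cron job writes them one per line.

```php
<?php
// Hypothetical helper, not part of xapian.module: split the "key=value"
// lines stored as document data into an associative array.
function example_parse_omega_data($data) {
  $fields = array();
  foreach (explode("\n", $data) as $line) {
    // Skip blank or malformed lines instead of raising a notice.
    if (strpos($line, '=') === FALSE) {
      continue;
    }
    // Limit to 2 parts so values that contain '=' (e.g. URLs) stay intact.
    list($key, $value) = explode('=', $line, 2);
    $fields[$key] = $value;
  }
  return $fields;
}

// Hypothetical helper: the document data arrives as a string, so test for
// an all-digit payload (a node ID) rather than relying on is_int().
function example_result_type($data) {
  return ctype_digit((string) $data) ? 'node' : 'file';
}

// Made-up sample record in the assumed one-field-per-line layout.
$sample = "url=/files/report.pdf?v=1\n"
  . "sample=Quarterly report\n"
  . "type=application/pdf\n"
  . "modtime=1227271172\n"
  . "size=2048";

$fields = example_parse_omega_data($sample);
$kind = example_result_type($sample);
```

Keeping the parsing in its own function would also make it easier to handle extra fields such as 'caption' later without touching the search hook.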