The initiative to integrate Xapian and Drupal looks really great.
It would be even better if it leveraged Xapian's ability to index document files such as PDFs and office documents.

I don't see anything in the documentation about this. Is it supported?
Are there any limitations or constraints? (CCK, File upload, etc.)

If not, are there any plans to support it?

Regards

Comments

open-keywords’s picture

Actually, this post seems to say yes!
http://www.trellon.com/blog/xapian-search-drupal

singularo’s picture

We have not yet added indexing of PDF, DOC and similar files. It is in our plans, but it has not been a high priority for our own projects so far.

Patches are welcome ;-)

miiimooo’s picture

Component: Documentation » Code
Status: Active » Needs review

Hi. I've made a start on this. The patch may not help you, since my scenario might differ from yours: I needed to integrate the fileshare module, so it's a dirty little hack. The indexing runs in an external cron job using Omega. The patch adds a tab called 'Files' to the normal search, so there is no need to patch Drupal core, and it returns search results from whatever the cron job indexed. In my case that is the fileshare folders.

--- xapian.module.orig  2008-09-17 14:43:07.000000000 +0100
+++ xapian.module       2008-11-21 12:39:32.000000000 +0000
@@ -476,10 +476,17 @@
     while (!$i->equals($matches->end())) {
       $count++;
       $document = $i->get_document();
+//       drupal_set_message("<pre>" . print_r($document->get_data(), TRUE) . "</pre>");
       if (is_object($document)) {
+        if (is_numeric($document->get_data())) {
         $results[$count]->type = 'node';
         $results[$count]->sid = (int)($document->get_data());
         $results[$count]->score = (int)($i->get_percent());
+        } else {
+          $results[$count]->type = 'file';
+          $results[$count]->data = $document->get_data();
+          $results[$count]->score = (int)($i->get_percent());
+        }
       }
       $i->next();
     }
@@ -572,7 +579,58 @@
     return $results;
   }
 }
+function xapian_search ($op = 'search',  $keys = null){
+  global $pager_total;
+  global $pager_page_array;
+
+  switch ($op) {
+       case 'name':
+         return t('Files');
+       case 'reset':
+         return;
+       case 'search':
+      $links = array();
+
+//       drupal_set_message("<pre>" . print_r($keys, TRUE) . "</pre>");
+      $page = (!empty($_REQUEST['page']) ? $_REQUEST['page'] : 0);
+      $words = '"'.chop(str_replace('\(', '(', str_replace('\)', ')', str_replace('\*','*', escapeshellcmd($keys).' ')))).'"';
+      /// TODO handle pager
+      $extra = array();
+      list($count, $results) = xapian_query($keys, 0, 10, $extra);
+      $pager_total[0] = (int)($count / variable_get('xapian_search_results_per_page', 10)) + 1;
+      $pager_page_array[0] = $page;
+//       drupal_set_message("<pre>" . print_r($results, TRUE) . "</pre>");
+      foreach($results as $result) {
+        if ($result->type != 'file') {
+          continue;
+        }
+
+        $arrData = explode("\n", $result->data);
+        $arrParsed = array();
+        foreach($arrData as $line) {
+          list($key,$val) = explode("=", $line);
+          $arrParsed[$key] = $val;
+        }
+        $found = array(
+          'type' => t('Files'),
+          'link' => $arrParsed['url'],
+          'title' => basename($arrParsed['url']),
+          'snippet' => search_excerpt($keys, $arrParsed['sample']),
+          'date'=>$arrParsed['modtime'],
+          'extra' => array(
+            'folder'=>$arrParsed['url'],
+            'file'=>$arrParsed['url'],
+            'score'=>$arrParsed['score'],
+            'file_size'=>$arrParsed['size'],
+            'file_type'=>$arrParsed['type'],
+           )
+         );
+        //TODO check for any unusual field - maybe 'caption'
+        $links[] = $found;
+      }
+      return $links;
+  }
+}
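
The parsing loop in the patch splits each line of the stored record on every `=`, so a value that itself contains `=` (for example a URL with a query string) would be cut short, and blank lines would trigger notices. A minimal sketch of a more defensive version follows, assuming the record really is one `key=value` pair per line as the patch expects. The function name `parse_omega_record` and the sample data are hypothetical and not part of the patch.

```php
<?php
/**
 * Hypothetical helper: turn a "key=value"-per-line record into an array.
 */
function parse_omega_record($data) {
  $fields = array();
  foreach (explode("\n", $data) as $line) {
    // Split on the first "=" only, so the rest of the line stays in the value.
    $parts = explode('=', $line, 2);
    if (count($parts) < 2 || $parts[0] === '') {
      // Skip blank lines and lines without a key.
      continue;
    }
    $fields[$parts[0]] = $parts[1];
  }
  return $fields;
}

// Made-up sample record, for illustration only.
$fields = parse_omega_record("url=/files/report.pdf?rev=2\nsample=Annual report\n\nsize=1024");
```

With something like this in place, the `$arrParsed` lookups in the patch could read from the returned array instead.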
marvil07’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev

moving to 6.x

marvil07’s picture

Title: Is the indexing of documents (attached files like PDF) supported ? » index uploaded files
Status: Needs review » Active
Issue tags: +parsing, +third-party tools

Changing the title so it makes a little more sense.

The plan is to support the upload and filefield modules.

IMO we should follow Omega's implementation, using external tools (any2text-style apps), copying and rearranging from there.

Formats/apps I see as easier:

  • text files (.txt, .text) - no parsing needed
  • PDF (.pdf) if pdftotext is available (comes with xpdf)
  • PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available
  • MS Word documents (.doc, .dot) if antiword is available
  • MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
  • MS Powerpoint documents (.ppt, .pps) if catppt is available (comes with catdoc)
  • Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
  • MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
  • Rich Text Format documents (.rtf) if unrtf is available
  • Perl POD documentation (.pl, .pm, .pod) if pod2text is available
  • TeX DVI files (.dvi) if catdvi is available
  • DjVu files (.djv, .djvu) if djvutxt is available

Other formats/apps:

  • HTML (.html, .htm, .shtml) - no parse? (suggestions?)
  • PHP (.php) - suggestions?
  • OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available + own parser .. suggestions?
  • OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available + own parser?
  • MS Office 2007 documents (.docx, .dotx, .xlsx, .xltx, .pptx, .potx, .ppsx) if unzip is available
  • AbiWord documents (.abw) - suggestions?
  • Compressed AbiWord documents (.zabw) if gzip is available - suggestions?
  • XPS files (.xps) if unzip is available + pw - suggestions?
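
As a rough illustration of the "easier" group above, the extension-to-tool mapping could be kept in a simple lookup table. This is only a sketch: the function name `plain_text_command` is hypothetical, the table covers just a few of the listed formats, and the command-line arguments are the basic "write text to stdout" forms, which should be checked against the versions actually installed.

```php
<?php
/**
 * Hypothetical helper: build the shell command that would convert a file
 * to plain text on stdout, or return NULL when no converter is mapped.
 */
function plain_text_command($path) {
  // Extension => command template; %s is replaced by the escaped file path.
  $converters = array(
    'pdf'  => 'pdftotext %s -',
    'doc'  => 'antiword %s',
    'xls'  => 'xls2csv %s',
    'ppt'  => 'catppt %s',
    'rtf'  => 'unrtf --text %s',
    'pod'  => 'pod2text %s',
    'djvu' => 'djvutxt %s',
  );
  $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
  if (!isset($converters[$extension])) {
    return NULL;
  }
  // escapeshellarg() keeps odd file names from being interpreted by the shell.
  return sprintf($converters[$extension], escapeshellarg($path));
}

$command = plain_text_command('/tmp/Report.PDF');
$unknown = plain_text_command('/tmp/archive.xyz');
```

The returned string could then be handed to `shell_exec()`, after checking that the tool is actually available on the server. A NULL result would mean the file is skipped or handled by another parser.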
marvil07’s picture

Version: 6.x-1.x-dev » 6.x-2.x-dev
marvil07’s picture

Title: index uploaded files » Index uploaded files
Version: 6.x-2.x-dev »
Status: Active » Postponed

I think this will fit well in 7.x alongside #923752: Integrating with search_api, so I am postponing it until that gets in.

marvil07’s picture

Status: Postponed » Active
marvil07’s picture

Version: » 7.x-1.x-dev
ywarnier’s picture

The following code might serve as inspiration, at least for the text-extraction part (marvil07 also has experience with the code in this project related to Xapian and indexing all uploaded documents): http://code.google.com/p/chamilo/source/browse/main/inc/lib/document.lib...

marvil07’s picture

Assigned: Unassigned » marvil07

let's finally try this

marvil07’s picture

Status: Active » Needs work
Attachment (new): 4.99 KB

After exploring/writing some code and constantly rewriting it (aka familiarizing myself with the D7 fields API for the first time, which is not really that similar to CCK, whatever :-p), I ended up thinking that I will create an independent project for getting a plain text version of each field, if possible.

So instead of being just another module in xapian, it would be a separate module that xapian checks for with one module_exists().

Attaching the current code (a new xapian submodule), but I hope to move it to its own project soon, before adding code to the xapian project.

marvil07’s picture

Status: Needs work » Postponed

I started a sandbox project module to generate a plain text representation of fields: Plain. Postponing this a little until I move it to a full project.

I also opened a discussion to get feedback about it on the Contributed Module Ideas group: Plain text for fields.

marvil07’s picture

See http://drupal.org/project/search_api_attachments and http://drupal.org/sandbox/cpliakas/1145040
Hopefully the code can be merged to unify backend extraction (maybe in converter or in plain) and then integrated (making search_api_attachments pluggable).