Index uploaded files [#321538]

The initiative to integrate Xapian and Drupal looks really great.
It would be even better if it leveraged Xapian's ability to index document files such as PDFs, office documents, etc.

I don't see anything in the documentation about this. Is it supported?
Are there any limitations or constraints (CCK, File upload, etc.)?

If not, are there any plans to support it?

Regards

Comment	File	Size	Author
#12	0001-Initial-code-for-indexinf-files.patch	4.99 KB	marvil07

Comments

Comment #1

open-keywords commented 15 October 2008 at 20:57

Actually, this post seems to say yes!
http://www.trellon.com/blog/xapian-search-drupal

Comment #2

singularo

Adelaide, AUS

commented 16 October 2008 at 04:29

We have not yet added indexing of PDF, DOC, etc. It is in our plans, but it has not been a high priority for our own projects so far.

Patches are welcomed ;-)

Comment #3

miiimooo

Europe

commented 21 November 2008 at 12:47

Component:	Documentation	» Code
Status:	Active	» Needs review

Hi. I've made a start on this. The patch may not help you, since my scenario might be different from yours: I needed to integrate the fileshare module, so it's a dirty little hack. I do the indexing in an external cron job using omega. The patch adds a tab called 'Files' to the normal search, so there is no need to patch Drupal core. It then returns search results from whatever the cron job indexed; in my case that is the fileshare folders.

--- xapian.module.orig  2008-09-17 14:43:07.000000000 +0100
+++ xapian.module       2008-11-21 12:39:32.000000000 +0000
@@ -476,10 +476,17 @@
     while (!$i->equals($matches->end())) {
       $count++;
       $document = $i->get_document();
+//       drupal_set_message("<pre>" . print_r($document->get_data(), TRUE) . "</pre>");
       if (is_object($document)) {
+        if (is_numeric($document->get_data())) {
         $results[$count]->type = 'node';
         $results[$count]->sid = (int)($document->get_data());
         $results[$count]->score = (int)($i->get_percent());
+        } else {
+          $results[$count]->type = 'file';
+          $results[$count]->data = $document->get_data();
+          $results[$count]->score = (int)($i->get_percent());
+        }
       }
       $i->next();
     }
@@ -572,7 +579,58 @@
     return $results;
   }
 }
+function xapian_search ($op = 'search',  $keys = null){
+  global $pager_total;
+  global $pager_page_array;
+
+  switch ($op) {
+       case 'name':
+         return t('Files');
+       case 'reset':
+         return;
+       case 'search':
+      $links = array();
+
+//       drupal_set_message("<pre>" . print_r($keys, TRUE) . "</pre>");
+      $page = (!empty($_REQUEST['page']) ? $_REQUEST['page'] : 0);
+      $words = '"'.chop(str_replace('\(', '(', str_replace('\)', ')', str_replace('\*','*', escapeshellcmd($keys).' ')))).'"';
+      /// TODO handle pager
+      $extra = array();
+      list($count, $results) = xapian_query($keys, 0, 10, $extra);
+      $pager_total[0] = (int)($count / variable_get('xapian_search_results_per_page', 10)) + 1;
+      $pager_page_array[0] = $page;
+//       drupal_set_message("<pre>" . print_r($results, TRUE) . "</pre>");
+      foreach($results as $result) {
+        if ($result->type != 'file') {
+          continue;
+        }
+
+        $arrData = explode("\n", $result->data);
+        $arrParsed = array();
+        foreach($arrData as $line) {
+          list($key, $val) = array_pad(explode("=", $line, 2), 2, '');
+          $arrParsed[$key] = $val;
+        }
+        $found = array(
+          'type' => t('Files'),
+          'link' => $arrParsed['url'],
+          'title' => basename($arrParsed['url']),
+          'snippet' => search_excerpt($keys, $arrParsed['sample']),
+          'date'=>$arrParsed['modtime'],
+          'extra' => array(
+            'folder'=>$arrParsed['url'],
+            'file'=>$arrParsed['url'],
+            'score'=>$arrParsed['score'],
+            'file_size'=>$arrParsed['size'],
+            'file_type'=>$arrParsed['type'],
+           )
+         );
+        //TODO check for any unusual field - maybe 'caption'
+        $links[] = $found;
+      }
+      return $links;
+  }
+}
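
The file branch of the patch reads the document data as newline-separated key=value lines, which, as far as I can tell, is the layout omega's indexer stores. A minimal standalone sketch of that parsing step follows. The function name `xapian_parse_record_data` and the sample record are made up for illustration and are not part of the module. It splits on the first `=` only, so a URL containing a query string survives intact, and it skips lines that carry no `=` at all.

```php
<?php
/**
 * Parse newline-separated "key=value" document data into an array.
 * Hypothetical helper for illustration; not part of the xapian module.
 */
function xapian_parse_record_data($data) {
  $parsed = array();
  foreach (explode("\n", $data) as $line) {
    // Skip lines that carry no key=value pair.
    if (strpos($line, '=') === FALSE) {
      continue;
    }
    // Split on the first "=" only, so values may contain "=" themselves.
    list($key, $val) = explode('=', $line, 2);
    $parsed[$key] = $val;
  }
  return $parsed;
}

// Made-up sample record, not taken from a real index.
$record = "url=/files/report.pdf?rev=2\nsample=Annual report\nsize=5120";
$fields = xapian_parse_record_data($record);
print $fields['url'] . "\n";
```

With a helper like this, the `$arrParsed` lookups in the search callback would not lose anything after a second `=` in a value.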

Comment #4

marvil07 commented 11 December 2009 at 20:48

Version:	5.x-1.x-dev	» 6.x-1.x-dev

moving to 6.x

Comment #5

marvil07 commented 6 January 2010 at 05:51

Title:	Is the indexing of documents (attached files like PDF) supported ?	» index uploaded files
Status:	Needs review	» Active
Issue tags:		+parsing, +third-party tools

Changing the title so it makes a little more sense.

The plan is to support the upload and filefield modules.

IMO we should follow the omega implementation, using external tools (any2text apps), copying and rearranging from there:

Formats/apps I see as easier:

text files (.txt, .text) - no parse
PDF (.pdf) if pdftotext is available (comes with xpdf)
PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes with xpdf) are available
MS Word documents (.doc, .dot) if antiword is available
MS Excel documents (.xls, .xlb, .xlt) if xls2csv is available (comes with catdoc)
MS Powerpoint documents (.ppt, .pps) if catppt is available (comes with catdoc)
Wordperfect documents (.wpd) if wpd2text is available (comes with libwpd)
MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
Rich Text Format documents (.rtf) if unrtf is available
Perl POD documentation (.pl, .pm, .pod) if pod2text is available
TeX DVI files (.dvi) if catdvi is available
DjVu files (.djv, .djvu) if djvutxt is available

Other formats/apps:

HTML (.html, .htm, .shtml) - no parse? (suggestions?)
PHP (.php) - suggestions?
OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm, .sxw, .sxg, .stw) if unzip is available + own parser .. suggestions?
OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb, .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is available + own parser?
MS Office 2007 documents (.docx, .dotx, .xlsx, .xlst, .pptx, .potx, .ppsx) if unzip is available
AbiWord documents (.abw) - suggestions?
Compressed AbiWord documents (.zabw) if gzip is available - suggestions?
XPS files (.xps) if unzip is available + pw - suggestions?
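
The easier part of the list above boils down to a lookup from file extension to an external converter. Here is a rough sketch of that lookup, assuming each converter prints plain text to stdout. The function name `xapian_extract_command` is hypothetical, only a few formats are shown, and the command lines are assumptions that would need checking against each tool's manual page.

```php
<?php
/**
 * Build the shell command that would convert a file to plain text.
 * Hypothetical helper for illustration; returns NULL for unknown formats.
 */
function xapian_extract_command($path) {
  // Extension => command template. Each tool is assumed to write to stdout.
  $converters = array(
    'pdf' => 'pdftotext %s -',
    'doc' => 'antiword %s',
    'xls' => 'xls2csv %s',
    'ppt' => 'catppt %s',
    'rtf' => 'unrtf --text %s',
  );
  $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
  if ($ext === 'txt' || $ext === 'text') {
    // Plain text needs no conversion; an empty command signals that.
    return '';
  }
  if (!isset($converters[$ext])) {
    return NULL;
  }
  // escapeshellarg() quotes the path so spaces and metacharacters are safe.
  return sprintf($converters[$ext], escapeshellarg($path));
}

print xapian_extract_command('/tmp/My Report.PDF') . "\n";
```

The returned string could then be handed to shell_exec() once the tool is known to be installed; keeping the map in one array makes it easy to add the remaining formats later.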

Comment #6

marvil07 commented 1 June 2010 at 14:17

Version:	6.x-1.x-dev	» 6.x-2.x-dev

Comment #7

marvil07 commented 21 November 2010 at 20:02

Title:	index uploaded files	» Index uploaded files
Version:	6.x-2.x-dev	»
Status:	Active	» Postponed

I think this is going to fit well in 7.x with #923752: Integrating with search_api, so postponing until that gets in.

Comment #8

marvil07 commented 27 January 2011 at 22:15

Status:	Postponed	» Active

#923752: Integrating with search_api got in!

Comment #9

marvil07 commented 24 April 2011 at 23:12

Version:		» 7.x-1.x-dev

Comment #10

ywarnier commented 26 April 2011 at 02:49

The following code might serve as inspiration, at least for the text extraction bit (plus, marvil07 has experience with the piece of code in this project related to Xapian and indexing all uploaded documents): http://code.google.com/p/chamilo/source/browse/main/inc/lib/document.lib...

Comment #11

marvil07 commented 21 May 2011 at 21:10

Assigned:	Unassigned	» marvil07

let's finally try this

Comment #12

marvil07 commented 23 May 2011 at 08:44

Status:	Active	» Needs work

Status	File	Size
new	0001-Initial-code-for-indexinf-files.patch	4.99 KB

After exploring, writing some code, and constantly rewriting it (aka familiarizing myself for the first time with the D7 fields API, which is not really that similar to CCK, whatever :-p), I ended up thinking that I will create an independent project for getting a plain text version of each field, if possible.

So instead of yet another module inside xapian, the integration would be a single module_exists() check.

Attaching the current code (a new xapian submodule), but I hope to move it to its own project soon, before adding code to the xapian project.

Comment #13

marvil07 commented 10 June 2011 at 10:21

Status:	Needs work	» Postponed

I started a sandbox project module to generate a plain text representation of fields: Plain. Postponing this a little until I move it to a full project.

I also opened a discussion to get feedback about it on the Contributed Module Ideas group: Plain text for fields.

Comment #14

marvil07 commented 18 September 2012 at 05:21

See http://drupal.org/project/search_api_attachments and http://drupal.org/sandbox/cpliakas/1145040
Hopefully the code can be merged to unify backend extraction (maybe in converter or in plain) and then integrated (making search_api_attachments pluggable).
