Attached is a patch to allow text to be extracted from files referenced from within the body field of a node. Where I work, we often just insert hyperlinks to documents (uploaded with imce) from within the body field. The module as it stands does not look at the body field at all.

The attached patch uses a regex to find the href attribute of links within the body (absolute or relative), then loads the appropriate $file objects for these files from the files table before passing the array back to be processed in the usual fashion.
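
For readers who want a feel for the approach described above, here is a minimal sketch. It is not the code from the attached patch. The function names `example_extract_hrefs` and `example_href_to_path` are hypothetical, and the sketch assumes the site is installed at the web root so that a URL path maps directly onto a path as stored in the files table.

```php
<?php
// Hypothetical helper, not taken from the attached patch.
// Collects the href value of every <a> tag found in a chunk of body HTML.
function example_extract_hrefs($body) {
  $hrefs = array();
  // Match href="..." or href='...' inside an opening <a> tag.
  if (preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/i', $body, $matches)) {
    foreach ($matches[2] as $href) {
      // Decode entities such as &amp; and drop exact duplicates.
      $hrefs[html_entity_decode($href)] = TRUE;
    }
  }
  return array_keys($hrefs);
}

// Hypothetical helper: reduce an absolute or relative URL to a bare path
// with no leading slash. Assumes the site lives at the web root.
function example_href_to_path($href) {
  $path = parse_url($href, PHP_URL_PATH);
  return is_string($path) ? ltrim(rawurldecode($path), '/') : '';
}

$body = '<p>See <a href="/sites/default/files/report.pdf">the report</a> and '
      . '<a class="x" href=\'http://example.com/files/notes.doc\'>notes</a>.</p>';

foreach (example_extract_hrefs($body) as $href) {
  print example_href_to_path($href) . "\n";
}
```

The resulting paths would then be looked up in the files table. A regex is a pragmatic choice here but will miss unusual markup; a DOM parser would be more robust at the cost of more code.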

Let me know if you think this patch is OK or needs work.

Comment  File             Size     Author
#3       body text.patch  3.84 KB  aaron1234nz
         bodyfield.patch  2 KB     aaron1234nz

Comments

pwolanin’s picture

Title: Extract text from the body filed » Extract text from the body field
Status: Needs review » Needs work

I'm not sure I really like this idea. It seems that a single file could be referenced from many nodes?

Whether or not to attempt this should certainly be a configurable option - off by default.

$placeholders = substr(str_repeat("'%s',", count($files_to_find)), 0, -1);

Instead, use: http://api.drupal.org/api/drupal/includes--database.inc/function/db_plac...
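
To illustrate the suggestion, the sketch below compares the hand-rolled expression quoted above with a stand-in for the helper. `example_placeholders` is a hypothetical function written for this example so that it runs outside Drupal; it only mirrors what Drupal 6's `db_placeholders($arguments, 'varchar')` is understood to return, and the file paths are made up.

```php
<?php
// Hypothetical stand-in mirroring what db_placeholders($arguments, 'varchar')
// is understood to return in Drupal 6. The real function needs Drupal loaded.
function example_placeholders(array $arguments, $placeholder = "'%s'") {
  // One placeholder per argument, joined by commas, with no trailing comma.
  return implode(',', array_fill(0, count($arguments), $placeholder));
}

// Made-up paths for illustration only.
$files_to_find = array('sites/default/files/a.pdf', 'sites/default/files/b.doc');

// The hand-rolled version from the patch: repeat, then trim the last comma.
$manual = substr(str_repeat("'%s',", count($files_to_find)), 0, -1);

// Both approaches build the same placeholder list.
var_dump($manual === example_placeholders($files_to_find));
print $manual . "\n";
```

Inside Drupal 6 the query would presumably become something like `db_query("SELECT * FROM {files} WHERE filepath IN (" . db_placeholders($files_to_find, 'varchar') . ")", $files_to_find)`, which avoids the manual string trimming.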

aaron1234nz’s picture

I agree that this should be a configurable option (along with the other extraction methods). There is also a high chance that one file could be referenced by more than one node. My patch #937720: One-to-many relationship between files and nodes will solve this issue.

I'll look to do some more work on this over the next couple of weeks and post a new patch if you are interested in progressing this.

aaron1234nz’s picture

Status  File  Size
new           3.84 KB

Here is an updated patch that adds three checkboxes to the admin interface where the admin can choose to extract text from the following locations: File attachments, CCK filefields, body text.

jpmckinney’s picture

Title: Extract text from the body field » Check the body field for hyperlinks to files
jpmckinney’s picture

Status: Needs work » Needs review