Project: File import
Version: 6.x-1.0-beta3
Component: Code
Category: feature request
Priority: normal
Assigned: grobot
Status: needs work

Issue Summary

We had a need to batch-import file attachments from an old site, and create a node per attachment.

As most of these attachments were .doc, .pdf or .txt files, we also considered automatically extracting the node body from the document.

This is only partial code, but we're continuing our work on it and hope to improve it further.

Comments

#1

Title: Batch import a directory of files » Batch import files as nodes

OK, so there are a lot of TODO items sitting in this patch. But hopefully it's a step in the right direction.

Some notes:

  • Some form elements have been added but don't do anything yet, e.g. the "Move or copy?" element. See the TODO list at the top of file_import.batch.inc.
  • I've restricted changes in the main module to just adding the menu entry. The rest is all in a separate file for simplicity.
  • I've used ereg to define the file attachment matches, but I think this should be simplified for Regular Humans to use: e.g. a comma-separated list of extensions, each matched to a content extractor tool.
  • PDF imports really failed. Extracting PDF content usefully via ghostscript is non-trivial.
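
To illustrate the comma-separated extension list idea from the notes above, here is a minimal sketch. It is written in Python purely as an illustration, not as the module's PHP, and it is not what the patch does. The settings format (`extensions = command`), the `%file` placeholder and the function names are all assumptions made up for this example, and the commands shown are only sample values.

```python
import os


def parse_extractor_rules(text):
    """Parse lines like 'doc, rtf = some-tool %file' into {extension: command}.

    The 'extensions = command' format and the %file placeholder are
    hypothetical; they are not taken from the patch.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        extensions, sep, command = line.partition("=")
        if not sep:
            continue  # ignore malformed lines that have no '='
        for ext in extensions.split(","):
            ext = ext.strip().lower().lstrip(".")
            if ext:
                rules[ext] = command.strip()
    return rules


def extractor_for(filename, rules):
    """Return the command template for a filename, or None if no rule matches."""
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    return rules.get(ext)


# Sample settings text; the tools named here are example values only.
EXAMPLE_RULES = """
txt, text = cat %file
doc = antiword %file
"""
```

With rules like these, a file with no matching extension (such as a PDF here) simply gets no extractor, and the node could be created with an empty body.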

Some changes / future steps which this module suggests:

  • Code in file_import_form_submit() which handles attaching a file to a node can be abstracted into a new function, file_import_attach_file_to_node(), so that it can be reused elsewhere in the module.
  • Going forward, I'd like to make the attachment process an action triggered when the above function is called, so that we could also import files and attach them using other methods (e.g. as a CCK filefield, an Ubercart file download, or a video for the video/op_video module). Currently the only option is a node's file attachment.
  • We could do better things with the node title generation.
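
The pluggable attachment idea above could look roughly like the following. This is a language-neutral sketch in Python, not Drupal code: nodes are plain dictionaries, and the handler names, the registry and the `field_file` field name are all hypothetical.

```python
# Registry of attachment methods, keyed by a short name (hypothetical).
ATTACH_HANDLERS = {}


def attach_handler(name):
    """Decorator that registers a function as an attachment method."""
    def register(func):
        ATTACH_HANDLERS[name] = func
        return func
    return register


@attach_handler("upload")
def attach_as_upload(node, filepath):
    # Stand-in for the current behaviour: a plain node file attachment.
    node.setdefault("files", []).append(filepath)
    return node


@attach_handler("filefield")
def attach_as_filefield(node, filepath, field="field_file"):
    # Stand-in for a CCK filefield; the field name is an assumption.
    node.setdefault(field, []).append({"filepath": filepath})
    return node


def attach_file_to_node(node, filepath, method="upload"):
    """Single entry point that dispatches to the chosen attachment method."""
    try:
        handler = ATTACH_HANDLERS[method]
    except KeyError:
        raise ValueError("Unknown attachment method: %s" % method)
    return handler(node, filepath)
```

The point of the design is that the import loop only ever calls one function, and new attachment methods are added by registering another handler rather than editing the loop.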

Attached are the patch and a downloadable copy which you can experiment with easily.

To try it out:

  1. Install the module attached :)
  2. Visit admin/settings/file_import/batch and set up some commands to extract content from file types
  3. Visit admin/content/file_import/batch and import a directory of files. You should see nodes get created with the node body extracted from the file, and the node title matching the filename.
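
Since the node title currently just matches the filename, one small improvement might be to tidy the filename first. The sketch below is in Python for illustration only and is not part of the patch; the function name and the cleanup rules (drop the extension, turn underscores and hyphens into spaces) are assumptions.

```python
import os
import re


def title_from_filename(path):
    """Derive a readable title from a file path (hypothetical helper)."""
    basename = os.path.basename(path)
    stem = os.path.splitext(basename)[0]
    # Treat runs of underscores and hyphens as word separators.
    words = re.sub(r"[_\-]+", " ", stem)
    # Collapse any repeated whitespace and trim the ends.
    words = re.sub(r"\s+", " ", words).strip()
    # Fall back to the raw filename if nothing readable is left.
    return words or basename
```

For example, `title_from_filename("archive/2004/meeting_minutes-final.doc")` gives `"meeting minutes final"`.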

I know the UI has some gaping holes and I'd really welcome any complaints and suggestions for how this could be made more awesome. Please test and give feedback!

Thanks

Attachment                                   Size
532754-file_import-directory_import.patch    18.74 KB
532754-file_import-batch.tgz                 15.29 KB

#2

Have there been any further improvements in this direction?

#3

Status: active » needs review

This really needs testing and feedback.

It worked for our use case, a one-off import of a large existing document structure (many hundreds of MB of old mailing list attachments), but I don't think the feature is ready for prime time due to UI issues.

#4

Status: needs review » needs work

Actually - needs WORK, not needs REVIEW. You can help by giving feedback and suggestions.

#5

I just tested this. Seems to be OK. For me at least, it doesn't show up as a MENU_LOCAL_TASK, just a callback. No idea why.

I don't think it matters that the UI is not great. Quick & dirty does the trick for me in this case, because I just want to import a bunch of stuff & clean it up later. It's not really possible to predict beforehand what good node titles would be.

I think that if you wanted a better UI, you could look at Audio Import, part of drupal.org/project/audio.

You should turn the tarball into a sandbox, I think - then it could have its own issue queue, branches, etc.

#6

Thanks Evan, been a while since I even thought about this. Done.

http://drupal.org/sandbox/grobot/1216958

#7

Nice. Ironically, my boss told me after I found this that we would just have the file attachments manually added by interns, so I don't know if I'll have time to test more. But probably best in a sandbox regardless, now that they are easier to see on d.o. than pre-Git migration.

#8

Heh ... no worries. Yes, a good prompt. Later today I saw a Lullabot article on bulk image uploads which looks like it might help in similar cases.

Tell me though, do you see "revision 1 by evan donovan" under the issue summary heading? Are you aware of any special issue editing powers that your d.o account has? That's new to me.

EDIT: The 'revision X by {last_comment_author}' behaviour is a known issue: #1217286: Posting a comment changes too many issue node revision properties

#9

It says "Revision 1 by grobot" for me. I think it is a bug with d.o.

#10

Hi, maybe I'm missing something.
I access /admin/content/file_import/batch and /admin/content/file_import and I see the same form.
And it's not letting me create a node for each file uploaded.
Thanks in advance.
