I would like a user to be able to upload a Word document and have it programmatically turned into the body of a node. (I am, for the moment, ignoring images.) I can output something resembling decent HTML using wvWare with a custom XML filter. I'm still having trouble getting the spacing right: I get three or more blank lines between the <p> and the start of a paragraph. (Possibly a Drupal filter could be used to help eliminate this.) But I'm not sure how to proceed from here. I would also like to scan the content for what may be references to other nodes and insert links, or at least have this as an option.
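
To make the spacing problem concrete, here is a minimal sketch of the kind of cleanup pass described above. It is written in Python purely for illustration, not as Drupal filter code, and the function name `tighten_paragraphs` is hypothetical. It assumes the converter output is plain HTML text in which the unwanted blank lines sit directly after `<p>` or directly before `</p>`.

```python
import re


def tighten_paragraphs(html: str) -> str:
    """Remove stray whitespace between <p> tags and the text they wrap."""
    # Drop whitespace (including blank lines) right after an opening <p ...> tag.
    # The \b keeps this from matching other tags such as <pre>.
    html = re.sub(r"(<p\b[^>]*>)\s+", r"\1", html, flags=re.IGNORECASE)
    # Drop whitespace right before a closing </p> tag.
    html = re.sub(r"\s+(</p\s*>)", r"\1", html, flags=re.IGNORECASE)
    # Collapse any remaining run of three or more newlines to one blank line.
    html = re.sub(r"\n{3,}", "\n\n", html)
    return html


if __name__ == "__main__":
    sample = "<p>\n\n\n\nHello world.\n\n</p>\n\n\n\n<p>\n\nSecond paragraph.</p>"
    print(tighten_paragraphs(sample))
```

A regular expression pass like this is only safe for simple, predictable converter output; anything with nested or malformed markup would be better handled by a real HTML parser.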

I'm also unsure whether I should wait for the 6.x CCK release, or whether dependencies on the upload and filter modules would be enough.

Having searched drupal.org, this issue seems to come up about once a year, so I'd like to contribute my module. Any thoughts or ideas would be greatly appreciated.

Thanks,

Tistur

Comments

igdrupal’s picture

I've been looking for something just like this. Unfortunately, I don't have much expertise, but a module that can do this would be an awesome addition to Drupal. Currently, I just have members select all in their Word doc and paste into FCKeditor. I also fooled around with the fileview module, even though it doesn't support .doc. It does actually convert the file, but the formatting is really ugly; it probably needs some kind of custom filter. Hopefully some knowledgeable coders will take a look at this and come up with something.

grimsy’s picture

I did some work a while ago on turning .doc and .pdf files into usable HTML. From memory, I ended up using .mht files because, likewise, I could never quite get a satisfactory result using wvWare. Unfortunately .mht only worked with IE, but it was a start.
I'm trying to resurrect the virtual machine I did the work on to see how far along I got with it.

If you don't hear back from me, my email address is available here.

grimsy’s picture

I used the class from here: http://www.php.net/manual/en/ref.com.php#42918 to do the actual conversion. As I said, I was using .mht for my files, but you might also be able to just save as .html or something. It would be good to see this tied into a document management system module, or at least a file browser anyway, so that users could preview documents before downloading them. Documents could be converted either as they're uploaded, via cron or on first preview.

The only problem with all this, of course, is that it relies on Windows and Word being installed. Maybe the conversion could be separated from the actual Drupal install and run on another box (for those running *nix).
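
As a rough illustration of the "convert on first preview" option mentioned above, the sketch below caches converted HTML on disk, keyed by a hash of the document's bytes. It is Python for illustration only, not Drupal code. The `cached_html` function and the `convert` callback are hypothetical names; `convert` stands in for whatever converter is actually used, which could equally be a call out to a separate conversion box.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_html(doc_path: Path, cache_dir: Path,
                convert: Callable[[Path], str]) -> str:
    """Return HTML for a document, converting it only on first request."""
    # Key the cache on file content, so a re-uploaded, edited document
    # is converted again while an unchanged one is served from cache.
    digest = hashlib.sha256(doc_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    # First preview: run the (assumed) converter and store its output.
    converted = convert(doc_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(converted, encoding="utf-8")
    return converted
```

The same function could be called from an upload handler or a cron job to warm the cache ahead of time, so the three timing options do not need three separate code paths.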

jscoble’s picture

If you're using the native .doc format, it can get very complicated fast. I looked into transforming a Word document back in the Office '97 days to ingest into our CRM system and quickly realized I was better off saving it as an RTF and working with that.

Since Microsoft finally published the specs for the format, Joel Spolsky recently wrote a post about why it's so tough: "Why are the Microsoft Office file formats so complicated? (And some workarounds)".

Nowadays Word can save in even more formats and you would probably be better off working with one of those formats and writing some code to clean up the output so that it turns out the way you want it to.

The cleanup, and the scanning for and linking of certain words, wouldn't be that hard from a coding perspective, but wringing out all the gotchas and special cases would be time-consuming.
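
To give a feel for the scanning and linking step and a few of its gotchas, here is a minimal Python sketch (illustration only, not Drupal code). The function name `link_node_titles` and the title-to-path mapping are hypothetical; in a real module the mapping would come from the node table. The sketch links every occurrence of a known title, and it handles three special cases: text inside tag attributes, text already inside a link, and short titles that are contained in longer ones.

```python
import html
import re
from typing import Dict


def link_node_titles(body: str, nodes: Dict[str, str]) -> str:
    """Wrap occurrences of known node titles in links to those nodes."""
    # Longest titles first, so "Style Guide" wins over plain "Guide".
    titles = sorted((t for t in nodes if t), key=len, reverse=True)
    if not titles:
        return body
    lookup = {t.lower(): nodes[t] for t in titles}
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in titles) + r")(?!\w)",
        re.IGNORECASE,
    )

    def replace(match):
        url = lookup[match.group(1).lower()]
        return '<a href="%s">%s</a>' % (html.escape(url, quote=True),
                                        match.group(1))

    out = []
    inside_anchor = False
    # re.split with a capturing group puts tags at odd indexes and text
    # at even indexes, so attribute values are never rewritten.
    for i, chunk in enumerate(re.split(r"(<[^>]+>)", body)):
        if i % 2 == 1:
            if re.match(r"<a\b", chunk, re.IGNORECASE):
                inside_anchor = True
            elif re.match(r"</a\s*>", chunk, re.IGNORECASE):
                inside_anchor = False
            out.append(chunk)
        elif inside_anchor:
            out.append(chunk)  # do not nest a link inside an existing link
        else:
            out.append(pattern.sub(replace, chunk))
    return "".join(out)
```

Matching is case-insensitive here, which is a design choice rather than a requirement; common-word titles would likely need an exclusion list or a "link only the first occurrence" rule to avoid over-linking.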

Good luck.

kenyob’s picture

I am looking for a way to turn Word document attachments into nodes themselves, and possibly to turn everything into one giant book.
My site http://www.section2athletics.org has a massive directory that is updated each year as multiple Word documents. I would like to try to move them over to Drupal and use the website as a publishing system.