Create articles from HTML, RTF, Microsoft Word, Microsoft Excel documents

By hamlet on 16 Feb 2004 at 07:34 UTC

Drupal is a very advanced system nowadays. The one thing it lacks is a converter for HTML, RTF, Microsoft Word and Microsoft Excel documents (especially the last two). Because these formats are so popular in business, it would be great to support them in the CMS.

E.g. many firms keep their price lists in Excel documents and want to post them to the Web easily, without reformatting. The same goes for "About us"-style brochures kept as Word documents.

I really have no idea whether there is such a module, or any work in core, that would make all of the above possible with Drupal. Any comments?

Comments

I think Drupal needs to support open-source formats

axel@debian.linuxrulez.ru commented 16 February 2004 at 19:17

Just a drop of flame from me ;) Drupal is open-source software, and I think it needs to focus more on open-source formats, like OpenOffice's for example, instead of supporting proprietary closed formats from MS.

--

Axel

Russian Debian Community

OpenOffice Supports All Microsoft Formats

joseph commented 16 February 2004 at 19:48

Axel, as a long time Star/OpenOffice user I am happy to remind others that Star/OpenOffice has always supported .doc/.xls/.ppt and other MS file formats. If it is good for OpenOffice, why not Drupal?

But opensource first :)

axel@debian.linuxrulez.ru commented 16 February 2004 at 21:56

Right, but I think open-source products should get higher priority when it comes to implementation. Otherwise they will always be riding at the tail of the MS train.

And as an OpenOffice user I can also say: there have always been problems with _correct_ support of proprietary formats, especially MS formats, because they are shielded by patents, have no documentation, and so on.

--

Axel

Russian Debian Community

Drupal barely even does plain text

irwin commented 17 February 2004 at 03:19

As of 4.3.0, Drupal barely supports file uploading for plain text as it is. For example, the book module doesn't accept file uploads for the page, which would probably be useful. With the file API, it should be almost trivial to add the functionality for 4.5.0 (4.4.0 is in freeze now).

In any case, my advice is to walk before you run - do plain text, and then HTML, as those formats are probably the easiest to convert.

-- Irwin

1) I don't need such a conver

killes@www.drop.org commented 17 February 2004 at 21:32

1) I don't need such a converter so I won't code one.

2) If you use an external converter you probably could pipe any document through it after upload and use the result as the node's body.

3) Implementing such a converter in PHP is Not A Good Idea(tm).
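
Point 2 can be sketched roughly as follows. This is only an illustration in Python, not Drupal code: the function name `convert_with_external_tool` and the idea of passing the converter as an argument list are assumptions made for the example, and which converter to call (and with which options) is left entirely to the caller.

```python
import subprocess


def convert_with_external_tool(document_path, command):
    """Run an external converter on an uploaded file and return its output.

    `command` is the converter's argument list (hypothetical here; it
    depends on the tool you install). The document path is appended as
    the last argument, and whatever the tool prints on standard output
    is returned as text, ready to be stored as the node's body.
    """
    result = subprocess.run(
        [*command, document_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Keeping the converter outside the CMS this way means a failed or slow conversion shows up as an ordinary process error, and swapping the tool only changes the `command` list.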

--
Drupal services
My Drupal services

Converters seem to be a natural progression for filestore

joe lombardo commented 17 February 2004 at 22:20

It seems that a mechanism for 'piping' files through a converter would be a natural progression for filestore, and it would keep the maintenance of supporting different file types in someone else's court.

Joe Lombardo | joe@familytimes.com | FamilyTimes Online Journals

Yes, that would be great. And

hamlet commented 18 February 2004 at 08:14

Yes, that would be great. And even more so if it could store the document body together with its images :) That's a challenge!

Any module could do it, the n

killes@www.drop.org commented 14 March 2004 at 04:20

Any module could do it, the nodeapi hook is very flexible.

--
Drupal services
My Drupal services

External programs to convert MSword

seanfarrell commented 3 November 2005 at 14:26

One possibility is to allow upload and attachment of an MS Word (or other) file to a node. Then call an external library or program to automatically produce a variety of "preview" files in other formats.

Here are two open source ones; binaries are available for a variety of platforms:

AntiWord - http://www.winfield.demon.nl/

WV library - http://wvware.sourceforge.net/ (previously known as: mswordview)

There is also at least one commercial library which uses PHP to convert Word into other formats.

Power Point Presentation Converter

timtak commented 14 March 2004 at 03:34

At a recent demonstration of a Japanese EMS/CMS* called dotcampus, the assembled professors were very impressed with drag-and-drop conversion of PowerPoint presentations into a series of HTML files.

I am not sure, but I think dotcampus may use .NET or some similar Microsoft technology that I don't understand. Perhaps it is possible to use a part of MS Office, or some available "dll", to do this sort of conversion; after all, the functionality is there in PowerPoint (and in OOo's presentation program, which can also convert to Flash).

The professors could convert to flash using OOo and then link to them, but the ability to drag and drop their course materials into a CMS and let that make a course web page for them made their eyes light up.

Tim
*Education Management System, Course (not content) Management System, also Virtual Learning Environment.

Converting old sites

joel_guesclin commented 18 March 2004 at 12:38

Has anybody thought about the issue of converting old sites? I have a site which is based almost entirely on HTML files, with a tiny bit of Perl coding to present some automated menus and archive lists. This is a magazine site with several hundred articles in various languages, far too much work to contemplate doing by hand. So I'm thinking of using Perl's HTML::FileParse to go through the whole lot, extracting the title and description (for the teaser), cleaning up the HTML (getting rid of all the old styles) and then injecting the whole thing into book type nodes by writing directly into Drupal's database. Then there would be a lot of work to categorise the articles, and I've been wondering whether I could use the HTML keywords to extract some stuff to match vocabulary terms... Ummm, any ideas appreciated.
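
The extraction step described above (title, description for the teaser, keywords as candidate vocabulary terms) might look something like this. It is a sketch in Python using the standard library's `html.parser` rather than the Perl module mentioned in the post; the class and function names are made up for the example, and it assumes the old pages carry ordinary `<title>` and `<meta name="description">` / `<meta name="keywords">` tags.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            meta = dict(attrs)
            name = (meta.get("name") or "").lower()
            content = meta.get("content") or ""
            if name == "description":
                self.description = content.strip()
            elif name == "keywords":
                # Split the comma-separated list and drop empty entries.
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_article_meta(html_text):
    """Return the title, teaser and keyword list found in one HTML page."""
    parser = ArticleMetaParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "teaser": parser.description,
        "keywords": parser.keywords,
    }
```

The keyword list could then be compared against existing vocabulary terms, with anything unmatched left for manual categorisation.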

Importing HTML files

jsloan commented 19 March 2004 at 01:43

This is similar to the route I'll be taking, but I'm not going to go as far as injecting the entire document into the database. My plan is to use HTML Tidy to clean up the existing files (a lot of the old files are from an MS FrontPage site, a huge HTML mess!) and format them as valid XHTML. From there I am going to import them into Drupal using the same approach as the image module: scan the directories, map each path to the actual file location, and create node "pointers". At this point the node can be classified and used like any other node. When it is time to retrieve the page, Drupal will read in the file and treat it like $node->body. This is a very simple description of what I am developing, and as it gets closer I will be back with some examples and questions of my own.
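
The "pointer" idea can be illustrated with a short Python sketch. It is not the module being described: the record layout (`url_path`, `file`) and both function names are assumptions for the example, and it only looks at files ending in `.html`.

```python
from pathlib import Path


def build_node_pointers(root):
    """Scan `root` recursively and map each HTML file to a pointer record.

    Each record keeps the site-relative path (without the extension) and
    the actual file location, so the content itself stays on disk.
    """
    root = Path(root)
    pointers = []
    for file_path in sorted(root.rglob("*.html")):
        relative = file_path.relative_to(root)
        pointers.append({
            "url_path": relative.with_suffix("").as_posix(),
            "file": str(file_path),
        })
    return pointers


def load_body(pointer):
    """Read the file behind a pointer when the page is requested."""
    return Path(pointer["file"]).read_text(encoding="utf-8")
```

The trade-off against importing everything into the database is that the files remain editable in place, but the site depends on the directory layout staying put.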

-jim-

I did just that for three sites

kbahey commented 1 April 2004 at 20:32

My sites were running a custom developed primitive CMS that I wrote myself. The files were HTML which included two include files, std_header.inc at the top, and std_footer.inc at the bottom. The header one generated the menus, linked to the stylesheet, ...etc.

I wrote some custom PHP scripts (running from command line mind you), that extracted the statistics data (which was stored in a database), and stripped everything up to the body tag, and then inserted them into the Drupal database directly.

Remember you have to increment the counter table to make it point to the last node id you used. Also, if you want an alias for each node (e.g. the same old path so visitors do not get 404s), then you have to insert those too. If you have a taxonomy, then you have to point each node to where it should go in the taxonomy hierarchy. Also, any internal links in the documents may have to be modified as well.
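
The bookkeeping described above can be sketched as follows. This is a Python illustration, not the PHP scripts mentioned in the post, and it does not touch a database: the row layout (`nid`, `alias`, `body`) and the function names are hypothetical and are not Drupal's actual schema. It only shows stripping each page down to its body, keeping the old path as an alias, and tracking the last node id so the counter can be updated afterwards.

```python
import re

# Matches the contents of <body ...> ... </body>, ignoring case.
_BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_body(html_text):
    """Return what is inside the body tag, or the whole text if there is none."""
    match = _BODY_RE.search(html_text)
    return (match.group(1) if match else html_text).strip()


def alias_for(old_path):
    """Keep the old URL path as the alias, minus any leading slash."""
    return old_path.lstrip("/")


def plan_import(pages, last_nid):
    """Turn (old_path, html_text) pairs into rows with consecutive node ids.

    Returns the rows plus the last id used, which is the value the
    counter has to be set to once the rows are inserted.
    """
    rows = []
    nid = last_nid
    for old_path, html_text in pages:
        nid += 1
        rows.append({
            "nid": nid,
            "alias": alias_for(old_path),
            "body": extract_body(html_text),
        })
    return rows, nid
```

Taxonomy assignment and rewriting of internal links would be further passes over the same rows.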

This way, everything is in the database cleanly, and Drupal works as expected.

--
Drupal performance tuning and optimization, hosting, development, and consulting: 2bits.com, Inc. and Twitter at: @2bits
Personal blog: Ba

Converting MS Office Documents

jsloan commented 19 March 2004 at 02:03

I would agree with the comments that document conversion belongs outside the Drupal framework.
I've been using a product called Net-It Central to do this. The document is "printed" into a format that can be viewed with either a Java applet or an ActiveX control. There are also options to convert the output to PDF or Flash. All of the output is wrapped in a template that can be customized. All of the processed files are copied to a web server, and I use RSS feeds to import the links into my departmental Drupal sites. It is all automated, and once the feeds are set up the Drupal sites stay in sync without any editing (cron runs once a night). As my next step I am looking to link directly to the RSS files and use PHP to build dynamic navigation menus on the fly.

For very large collections of MS documents this was our best option. If you are looking at a very small number of documents and want to make them available in their exact layout, then I would use OpenOffice and create PDFs. It works great!

-jim-

Jade/Open Jade

mmx-1 commented 1 April 2004 at 03:47

James Clark, the author of expat, which is included with PHP, developed JADE a number of years ago to perform DSSSL (Document Style Semantics and Specification Language) transformations on SGML documents. As the library grew, support was added to perform XSLT transformations. I believe it can transform multiple formats (doc, rtf, TeX, MIF, XML, SGML, PS, HTML, troff, etc) into other formats and perform style transformations in secondary threads. Continued work on the library is being performed by the Open Jade group on SourceForge. The Kawa language for Java started as a port of Jade. I have not investigated the capabilities of expat, but if I recollect correctly, it was compiled using the JADE library.

mmx

Document Management

javanaut commented 1 April 2004 at 19:42

Please participate in the discussion on document management here: http://drupal.org/node/view/6850.

-Mark

Word -> RTF -> XML -> Web page content

peterx commented 9 October 2005 at 07:13

I have some sites converting RTF to XML to Web page content. Word documents are saved as RTF. A system equivalent to Drupal's cache then converts the RTF to XML to XHTML. The RTF to XML occurs only when the source page changes. The XML to XHTML occurs at page display because it includes dynamic changes.

If coupled with Drupal's cache then the XML to XHTML will occur in a filter and be cached. I already have an XML to XHTML conversion in a Drupal filter so that part would be easy.

I also had an AbiWord to XML conversion working using the same XML, so that AbiWord documents could be mixed with Word ones. Both systems interpreted styles as XML tags so that PHP code or XSLT could convert the styles to headings and specialised tags such as code. If you want text in your Word document to appear as code, you use a character style named code and the converter changes that to an XML element named code. The XML to XHTML step can then turn the code element into a span element with a class of code.

A slight refinement would be to turn all paragraph styles into XHTML divisions and all character styles into spans, with both spans and divs having their class attribute set to the name of the style.
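
The refinement just described might be sketched like this. It is a Python illustration, not the PHP filter from the post, and the intermediate XML layout is an assumption: children of the root element are taken to be paragraph styles, and elements nested one level inside them are taken to be character styles. Deeper nesting is ignored to keep the example short.

```python
import xml.etree.ElementTree as ET


def styles_to_xhtml(xml_text):
    """Convert style-named XML elements into divs and spans with classes.

    Paragraph-level elements become <div class="STYLE"> and character-level
    elements become <span class="STYLE">, where STYLE is the element name.
    """
    source = ET.fromstring(xml_text)
    container = ET.Element("div", {"class": "document"})
    for para in source:
        div = ET.SubElement(container, "div", {"class": para.tag})
        div.text = para.text
        for run in para:
            span = ET.SubElement(div, "span", {"class": run.tag})
            span.text = run.text
            span.tail = run.tail  # text that follows the styled run
    return ET.tostring(container, encoding="unicode")
```

With this shape, a stylesheet can decide what each style looks like, and the converter never needs to know which styles are headings and which are body text.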

I develop using PHP 5. Most of the code is PHP 4 compatible. The interface module could be a standard Drupal module. There would be a separate library to download similar to using Gallery or PHPlist. The separate library would probably be LGPL.

http://petermoulding.com/technology/content_management_systems/drupal/

petermoulding.com/web_architect

so, how easy to import from OpenOffice into XHTML?

tomcalloway commented 14 October 2005 at 16:21

Peter,

Your post makes it sound easy to convert OO XML to XHTML that can be published on a webpage.

Do you already have a module that does something like that?

Is it possible to use OO 2.0 as an XML editor, saving the entries in an XML format/schema that is a transformation of the one OO uses (in case I don't want to save as an OpenDocument file)?

~Tom

I looked at OO 1

peterx commented 18 October 2005 at 00:20

I looked at converting OO 1.something. You have to unzip the file then merge about 6 files. I have not looked at OO 2. May look over the next few weeks. Where do they publish the schema and documentation?
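
The "unzip, then merge" step could start from something like the sketch below, written in Python with only the standard library. The function name is made up for the example, and the default member names (`content.xml`, `styles.xml`, `meta.xml`) are the ones commonly found in OpenOffice packages; treat them as assumptions to check against your own files.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_package_parts(path_or_file, wanted=("content.xml", "styles.xml", "meta.xml")):
    """Open an OpenOffice document as a zip archive and parse its XML parts.

    Returns a dict mapping each member name that is present to the root
    element of its parsed XML. Members that are missing are skipped.
    """
    parts = {}
    with zipfile.ZipFile(path_or_file) as package:
        names = set(package.namelist())
        for name in wanted:
            if name in names:
                parts[name] = ET.fromstring(package.read(name))
    return parts
```

Merging then becomes a matter of walking the parsed trees, for example resolving style names used in the content part against the definitions in the styles part.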

http://petermoulding.com/technology/content_management_systems/drupal/

petermoulding.com/web_architect

OO Documentation for its OpenDocument XML

tomcalloway commented 18 October 2005 at 03:41

I found this free online book that looks like a very practical, complete introduction to working with OO's XML file formats:
http://books.evc-cit.info/

Here is the direct link to the OO project's official schema and docs:

for OO 2.0 http://xml.coverpages.org/ni2005-09-26-a.html (this has recent news, context, and links undoubtedly to the schema/docs--I just didn't trace)
for OO 1.0 (http://www.oasis-open.org/committees/download.php/6037/office-spec-1.0-c...)

I'm still evaluating which road to take with XML, but thanks for keeping this inquiry alive. It may prove a useful journey.

And I'm willing to learn to be able to contribute to the effort, if it aligns with my goals for the project I'd use it for.

OO version 2 to XML through POOO

peterx commented 24 October 2005 at 05:22

http://drupal.org/node/35036 POOOO, the Pet OpenOffice Odt Obtainer. You can tell I like the letter O.

http://petermoulding.com/technology/content_management_systems/drupal/

petermoulding.com/web_architect

has been suggested before

bertboerland commented 14 October 2005 at 19:28

and I do think it will be implemented within a year. However, reading these formats is only the first step; having LDAP (ADS?) rights on these documents would make it a true SharePoint killer.

but then again, what is the point of killing sharepoint? ;-)
--
groets
bertb

--
groets
bert boerland

Amaya and Mozile

tomcalloway commented 18 October 2005 at 14:05

Amaya (W3C and Apache) handles schemas, though from what I have read, I believe it is meant as a browser with XML features, not as a JavaScript editor for integration with a CMS. Still, it's worth looking at. http://www.w3.org/Amaya/

Mozile is only supported in Firefox and does not support schemas, but as an XHTML editor it is in-line and fast, though it has few features. http://mozile.mozdev.org/ Worth taking a look at.
