As a newcomer to internet technology, I've been investigating and weighing the merits of the approaches various CMSs take to content/documents, and would like to discuss those approaches (and others) from a general design point of view, not to promote a certain CMS.

Possible formats of import/export/publish/storage:
plain text, .rtf, html, xhtml, xml (various--incl. DocBook, specialized DTDs, etc.), OpenOffice 1.1, OpenDocument (OO 2.0), .pdf, more?

Design decisions:
flat files or relational database
single or multiple storage formats, and which one(s)
format conversion technology (xslt, etc.)
web publishing techniques (as html, xhtml + css)
WYSIWYG editors: xml, xhtml, html, text (Bitflux, Kupu, TinyMCE, etc.)

Decision Criteria:
* The storage format allows the greatest flexibility in import/export/publishing to and from other formats
* Usability and ease-of-integration of editor
* Relative Storage space requirements
* Relative Performance

I see that the consensus on a few Drupal forum threads addressing this topic was to do conversions outside of the Drupal framework and to avoid doing such conversions in PHP.

My specific project requirements (they carry over to the next page):
* Users can upload OpenDocument files to my site
* Users can input and edit content using a WYSIWYG editor without needing to know tags or wiki syntax
* Automatic import of periodically published xml data from another website that uses a very specialized DTD
* Create customized, structured anthologies/compilations of the content entered via the three routes above during a user-defined period, and output them as html and .pdf documents to the user's browser
* For the anthologies, either publish them using the formatting in the user's original input or let my css or report format define the output. Either way, this requires that the structure of the user's content be preserved (e.g. if it was entered as xhtml, the headings they used)
* The ability to index, search, filter, sort, and tag the content in many ways, using taxonomy and other functions/modules
* The format is flexible enough to accommodate future formats and content management techniques that may be developed
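
To illustrate why the anthology requirement depends on preserved structure, here is a minimal Python sketch (not Drupal code). It assumes each stored entry is a well-formed XHTML fragment with a creation date; the `compile_anthology` function, the entry dictionary layout, and the class names are all hypothetical. Because the headings are real elements, they can be pushed down one level so each entry nests under the anthology title.

```python
import xml.etree.ElementTree as ET
from datetime import date


def compile_anthology(title, entries, start, end):
    """Build one XHTML document from the entries created between start and end."""
    root = ET.Element("div", {"class": "anthology"})
    ET.SubElement(root, "h1").text = title

    # Keep only entries inside the requested period, oldest first.
    selected = sorted(
        (e for e in entries if start <= e["created"] <= end),
        key=lambda e: e["created"],
    )
    for entry in selected:
        # Wrap the fragment so it parses as a single element.
        section = ET.fromstring(f'<div class="entry">{entry["xhtml"]}</div>')
        for el in section.iter():
            # Demote h1..h5 by one level; h6 is left as it is.
            if len(el.tag) == 2 and el.tag[0] == "h" and el.tag[1] in "12345":
                el.tag = f"h{int(el.tag[1]) + 1}"
        root.append(section)
    return ET.tostring(root, encoding="unicode")


# Made-up sample data.
entries = [
    {"created": date(2006, 1, 10), "xhtml": "<h1>Second</h1><p>b</p>"},
    {"created": date(2006, 1, 5), "xhtml": "<h1>First</h1><p>a</p>"},
    {"created": date(2006, 3, 1), "xhtml": "<h1>Late</h1>"},
]
anthology = compile_anthology("January", entries, date(2006, 1, 1), date(2006, 1, 31))
```

If the content had been stored as plain text or tag soup instead, the heading step would need guesswork, which is the main argument for a structured storage format here. Producing the .pdf output is left out of the sketch.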

Here are discussions and/or examples of other approaches taken by PHP CMSs:

Flux CMS (www.bitflux.org): uses the Bitflux XML WYSIWYG editor and the conversion technique "Popoon" (a PHP version of the Apache Cocoon project)

eZpublish (ez.no): a recent discussion thread in their forums debates the merits of an xhtml datatype vs. their specialized "ezxml" database storage datatype format [note that eZpublish provides an OpenOffice publishing as html capability] http://ez.no/community/forum/suggestions/add_xhtml_datatype_replace_xml_...(offset)/20#msg84776

Comments

peterx:

An XHTML internal storage format makes sense as all Web pages should be XHTML.

An XML internal storage format makes sense as that gives us easier support for all those extension modules. We need only one transformation for everything. My XSLT transformation module is working toward multiple transformations in one.

Support for an XML editor is a great idea, preferably an independent XML editor with wide support, similar to FCKeditor. The editor needs local spell checking, not the FCK server spell checking. The editor needs to use a schema to prompt the entry of new elements. The editor must show both the raw and the finished versions of the document. Is there an editor like that? BXE is close but was initially aimed only at the Flux CMS and does not work with IE. I could not use BXE because some Internet cafes use IE. Kupu is open to multiple CMSs and works with both IE and Firefox, but it does not use a schema, which greatly reduces the value of an XML editor.
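
As a rough picture of what "use a schema to prompt the entry of new elements" could mean, here is a small Python sketch. It is not how BXE or Kupu work; the content model, the element names, and both functions are hypothetical, and a real editor would derive the table from a DTD or schema rather than hard-code it.

```python
# Hypothetical content model: parent element -> child elements it may contain.
CONTENT_MODEL = {
    "article": ["title", "section"],
    "section": ["title", "para", "list"],
    "list": ["item"],
    "item": ["para"],
}


def allowed_children(parent):
    """Return the elements the editor could offer at the cursor inside `parent`."""
    return CONTENT_MODEL.get(parent, [])


def is_valid_insert(parent, child):
    """Check a requested insertion against the content model."""
    return child in allowed_children(parent)
```

With the cursor inside a `list`, such an editor would offer only `item`, so the user never has to know the tag names in advance.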

There should be two XSLT conversions. The first is on input, similar to the image module making a thumbnail. You upload your RTF or XML document, store the document in a table, and edit or replace it any time. When you hit publish, it is transformed by XSLT into XML and stored in the normal Drupal node. The second happens when the node goes from the node table into the cache, or out to the user if not cached: the final XSLT pass converts the XML to XHTML. During the final conversion most of the XML will already be XHTML, which should make the overhead less than the repeated preg_replace calls used in Drupal's current content transformations.
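
To make the two-stage idea concrete, here is a minimal Python sketch (not Drupal code). Python's standard library has no XSLT processor, so a simple tag-renaming pass over the parsed tree stands in for the stylesheet. `TAG_MAP`, `import_document`, `render_node`, the in-memory cache, and the source element names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a made-up source vocabulary to XHTML element names.
TAG_MAP = {"article": "div", "title": "h2", "para": "p", "emph": "em"}

# Stand-in for the page cache: node id -> rendered XHTML string.
_cache = {}


def import_document(raw_xml):
    """Stage 1 (on publish): parse the upload and return normalized XML to store."""
    root = ET.fromstring(raw_xml)  # raises ParseError if the upload is not well-formed
    return ET.tostring(root, encoding="unicode")


def render_node(node_id, stored_xml):
    """Stage 2 (on view): convert the stored XML to XHTML, once per node."""
    if node_id in _cache:
        return _cache[node_id]
    root = ET.fromstring(stored_xml)
    for el in root.iter():
        # Elements that are already XHTML are not in the map and pass through unchanged.
        el.tag = TAG_MAP.get(el.tag, el.tag)
    xhtml = ET.tostring(root, encoding="unicode")
    _cache[node_id] = xhtml
    return xhtml


stored = import_document(
    "<article><title>Hi</title><para>Some <emph>text</emph>.</para><p>raw</p></article>"
)
page = render_node(1, stored)
```

The document is parsed once per stage and every rule is applied in a single walk over the tree, which is the contrast with running several regular-expression passes over the same string.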

See also http://drupal.org/node/5887

http://petermoulding.com/technology/content_management_systems/drupal/

peterx:

My Pet module now includes an XSLT transformation in the filter process. Code, external link, and PHP elements are formatted via XSLT. After the overhead of sucking in the XML, the transformation is fast, which means I can save time by moving more of the transformations into XSLT.

XML and XSL together at a Drupal near you. I will experiment more next week.

http://petermoulding.com/technology/content_management_systems/drupal/