export book as xml for formatting [#1482]

Comment	File	Size	Author
#29	book_34.patch	17.2 KB	puregin
#27	book_32_0.patch	16.19 KB	puregin
#26	book_31_0.patch	16.33 KB	puregin
#20	book_30.patch	15.89 KB	Uwe Hermann
#19	book.module_6.patch	13.62 KB	puregin
#14	xml-export-03.patch	22.3 KB	puregin
#7	explode2dir.php	6.64 KB	puregin
#6	xml-export-01.patch	14.19 KB	puregin
#5	xml-export.patch	14.06 KB	puregin

Comment #1

moshe weitzman commented 3 June 2004 at 00:10

Can someone suggest an XML schema for this? I think we need a general XML schema for nodes. After that, this becomes a simple matter of nesting node elements (I think).

Comment #2

Teto commented 1 February 2005 at 09:42

Hi,

Is there any news about such a feature?
All I've found about a DocBook schema is here: http://docbook.sourceforge.net/projects/schema/
It seems there isn't much in the docbook cvs about that.

Teto.

Comment #3

puregin commented 14 May 2005 at 09:11

Here's a list of the Book publishing DTDs I know about:

Name	Notes	Ref
ISO 12083:1998//DTD Book//EN - this includes ISO 12093:1993//DTD Mathematics//EN	Committee standard - very general. Used e.g. by University of California Press.	www.xmlxperts.com/bookdtd.htm
DocBook	Applications - widely used by Computer book publishers, e.g. O'Reilly. Good support.	docbook.org
TEI/TEI-Lite	Applications - scholarly/historical/literary documents	www.tei-c.org
MIL-STD-38784 (CALS)	Applications - Military/Govt/Enterprise publishing	http://xml.coverpages.org/mil-std-38784-a1-dtd.txt

I'd highly recommend DocBook as a useable, technically focused, XML DTD with strong toolset support.

Comment #4

puregin commented 18 May 2005 at 08:37

I'd suggest we start with something very simple.

The patch which I submitted for http://drupal.org/node/1898 wraps each node in <div> tags, with a level, and a node id attribute, for printer friendly output.

We can't rely in general on the contents of a node being XHTML, even if we force output through an XHTML validator such as tidy. So our best bet is to encode the entire contents of a node as CDATA. This gives us hierarchy, and encapsulated contents (of any kind - later this could also be other kinds of data or markup)

This output will be valid XML, with a pretty simple DTD. It is easy to take such a file and write simple XSLT-based scripts on the client side to explode this file into a directory tree of HTML, a single HTML file, or many other formats.
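
A minimal sketch of the idea above, written in Python rather than the module's PHP. The element names (`title`, `content`), the attribute names (`level`, `nid`), and the node dictionaries are assumptions made for illustration, not the format the patch emits. The one subtle point is that a body containing `]]>` would end the CDATA section early, so the sketch splits it across two sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot close the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def export_node(node, level=0):
    """Serialize a node dict as a <div> with level and nid attributes (hypothetical layout)."""
    out = ["<div level=%s nid=%s>" % (quoteattr(str(level)), quoteattr(str(node["nid"])))]
    out.append("<title>%s</title>" % escape(node["title"]))
    out.append("<content>%s</content>" % cdata(node["body"]))
    for child in node.get("children", []):
        # Child pages nest inside the parent's <div>, one level deeper.
        out.append(export_node(child, level + 1))
    out.append("</div>")
    return "".join(out)


# Sample data: bodies are deliberately not well-formed XHTML.
book = {
    "nid": 1,
    "title": "Handbook & FAQ",
    "body": "<p>loose <b>HTML",
    "children": [
        {"nid": 2, "title": "Install", "body": "tricky ]]> text", "children": []},
    ],
}

xml_text = export_node(book)
root = ET.fromstring(xml_text)  # parses even though the bodies are not XML
```

After parsing, `root.find("content").text` returns the original body unchanged, which is what makes the container usable for a round trip.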

Importing is trickier. It's relatively easy to import an exported file, and update the nodes of the book according to the hierarchy defined by the sectional <div> elements. Importing needs to take care of structure which has changed - child nodes added, deleted, or moved.
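
One way to picture the structural part of an import is as a comparison of two parent maps. The sketch below is in Python and assumes each section in the imported file carries a node id, and that the stored and imported hierarchies can each be reduced to a `nid -> parent nid` dictionary; both the function name and that representation are hypothetical.

```python
def diff_structure(stored, imported):
    """Classify node ids as added, deleted, or moved between two parent maps."""
    added = sorted(nid for nid in imported if nid not in stored)
    deleted = sorted(nid for nid in stored if nid not in imported)
    moved = sorted(
        nid for nid in imported
        if nid in stored and stored[nid] != imported[nid]
    )
    return {"added": added, "deleted": deleted, "moved": moved}


# Node 3 was removed, node 5 is new, node 4 moved from parent 2 to parent 1.
stored = {1: None, 2: 1, 3: 1, 4: 2}
imported = {1: None, 2: 1, 4: 1, 5: 2}
changes = diff_structure(stored, imported)
```

Nodes that appear in none of the three lists kept their place in the hierarchy, so only their contents need checking.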

It would also be nice to have some client-side scripts to import other formats into this nested sectional <div> based format - for example, to take a directory tree of HTML fragments, and make this into an importable file.

Comment #5

puregin commented 1 June 2005 at 10:33

Status	File	Size
new	xml-export.patch	14.06 KB

This patch enables export of books as XML documents.

The XML is DocBook "at the level of structure", but
node contents are wrapped as CDATA, since we
can't be sure that the contents are valid XML.

Several other bugs/feature requests are also
addressed with this patch:

- Fixes bugs

http://drupal.org/node/1898
http://drupal.org/node/1482
http://drupal.org/node/8049
http://drupal.org/node/1899

Should go a long way towards implementing feature request
http://drupal.org/node/2062

It should also be easy to extend this to produce OPML,
for example.

- Adds about 170 lines, of which more than 100 are comments
- Added doxygen comments
- Made doxygen comment format consistent; fixed minor grammatical slips
- A proper Doctype and more informative HTML element is generated
for printer-friendly HTML output.
- Refactored book_print() to use book_recurse().
- Refactored book_recurse(). Applies 'visitor' callback functions to nodes
during weight/title order tree-traversal. The parameterized
visitor callbacks can be used to generate different kinds of output.
There are many other kinds of operations on books which can be implemented
by writing a pre-node/post-node pair of callback functions: word-count/
statistics gathering, comparison, copying, search and replace...
- Introduced book_export() which uses book_recurse() to generate
DocBook-like XML to export book contents in a structured form.
An md5 hash is computed for each node to help import code to
decide if a node needs to be updated or not.
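
The visitor arrangement described in the list above can be sketched as follows. This is Python, not the patch's PHP, and `recurse_book`, the node dictionaries, and the two visitor pairs are hypothetical stand-ins for book_recurse() and its callbacks. The traversal stays the same while the callbacks decide what gets produced.

```python
def recurse_book(node, depth, visit_pre, visit_post):
    """Depth-first walk; children are visited in (weight, title) order."""
    out = [visit_pre(node, depth)]
    children = sorted(node.get("children", []), key=lambda n: (n["weight"], n["title"]))
    for child in children:
        out.append(recurse_book(child, depth + 1, visit_pre, visit_post))
    out.append(visit_post(node, depth))
    return "".join(out)


def outline_pre(node, depth):
    """Pre-node visitor: emit an indented title line."""
    return "  " * depth + node["title"] + "\n"


def no_output(node, depth):
    """Visitor that contributes nothing to the output."""
    return ""


def make_word_counter():
    """Build a pre-node visitor that gathers statistics instead of output."""
    totals = {"words": 0}

    def count_pre(node, depth):
        totals["words"] += len(node["body"].split())
        return ""

    return totals, count_pre


book = {
    "title": "Book", "weight": 0, "body": "intro text",
    "children": [
        {"title": "Zebra", "weight": 0, "body": "a b c"},
        {"title": "Apple", "weight": 1, "body": "x"},
        {"title": "Mango", "weight": 0, "body": ""},
    ],
}

outline = recurse_book(book, 0, outline_pre, no_output)
totals, count_pre = make_word_counter()
recurse_book(book, 0, count_pre, no_output)
```

With this sample, "Mango" and "Zebra" (both weight 0) come before "Apple" (weight 1), and equal weights fall back to title order.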

Comment #6

puregin commented 3 June 2005 at 08:29

Assigned:

Unassigned

» puregin

Status	File	Size
new	xml-export-01.patch	14.19 KB

This updated patch adds "weight" metadata, which I forgot to capture in the previous patch. I'm not sure how much other metadata I should include.

Comment #7

puregin commented 3 June 2005 at 09:06

Status	File	Size
new	explode2dir.php	6.64 KB

The attached command-line PHP script may be useful in testing the XML export patch supplied.

Assuming your local version of PHP is built with CLI support and XML parser support, you should be able to run the script against an XML export file generated by the book module with my patch. After you have installed the patch, you can select a book page, click on the 'export XML' link, and save the result as a file, say 'test.xml'. Then run the script. On my system this looks like this:

% ./explode2dir.php test.xml

This will produce output that looks something like this:

./explode2dir.php test.xml
md5: 9e8ca98c6a8be35c21f31f7937608acc
weight: 1
md5: 11a1956a1592feac37abee6b469e62c8
weight: 0
md5: ed4c91279d3bed28b56899b75ccaa9aa
weight: 0

It will generate a directory hierarchy, with one directory per book node. Each directory contains a file containing the node contents and a file 'nid' containing the metadata. You can check, for example, that the md5 signature of the contents matches the md5 signature recorded with the metadata.

Djun

Comment #8

dries commented 3 June 2005 at 18:46

I like the approach taken in this patch! Let's tidy up the menu structure. I suggest changing

book/export   (docbook)
book/print    (plain-text)

to

export/docbook
export/text

If we add OPML-support, it would then become:
export/opml

Similarly, I suggest renaming 'export XML' to 'export DocBook XML' (or something).

Comment #9

dries commented 3 June 2005 at 18:53

Haven't tried it but how does DocBook handle CDATA? Does it come out OK? Read: does it make sense to do it this way?

Comment #10

puregin commented 3 June 2005 at 23:04

Dries,

The output XML which I've implemented is only 'DocBook-like', not true DocBook. I've been dealing with structure at the top level (book, chapter, section). I've hidden away the actual content inside a CDATA section. The XML is really intended to provide a container for export.

At this point I'm trying to focus on a relatively simple way to do an export/import round trip of the current content format (text/'loose' HTML). Most people would probably not edit this using an XML editor.

I think I can generate a tar/gzip archive of the directory structure I described (output of explode2dir.php) directly on the server, either by calling an external pipeline, or by using the tar/gzip PEAR extension. So the XML format I described would be useful primarily as means to do the import, unless we can think of a better way to import such a directory structure (perhaps via the node_import module?)

How the CDATA section is handled depends on the application. Most XML editors will display this as CDATA, allow the user to edit as CDATA, and to perform edit operations such as cut/paste to convert the CDATA to other XML elements. DocBook formatting applications could do various things - ignore the CDATA; format as 'preformatted', e.g., source listing; or try to do something clever, like attempting to parse and convert the CDATA into real DocBook before proceeding.

To generate true DocBook, we would have to:

- emit a document type declaration
- decide if we want to export complete documents (i.e., top level elements such as books, articles, set) or document fragments.

The real difficulty is dealing with the content: to convert this into DocBook, we'd have to attempt to map (possibly not well-formed) text and/or HTML into well-formed DocBook XML. This would require guessing in many cases, since text/HTML doesn't directly encode the author's intent. A problem with this would be that the content might not be returned exactly as exported in an (export/import) 'round trip'.

If we want to support true DocBook, it would probably be better to do this via an input filter, similar to the PHP input filter - definitely worth pursuing, but perhaps a separate issue?

I will rewrite the patch to make sure that the exported XML validates as a DocBook fragment, and punt off to people who actually want to deal with real DocBook the problem of embedding and converting content. These folks would probably not be so interested in round-trip import/export, until we have native DocBook nodes, at which point many of these issues vanish (I hope :)

Regards, Djun

Comment #11

Amazon commented 4 June 2005 at 00:01

As an active member of the documentation team I would greatly appreciate any ability to export and import content to and from the documentation handbooks. Drupal is slow for editing, and has some usability issues that I will be following up on.

Please consider accepting this as an incremental step to assisting the documentation team and other editors.

Kieran

Comment #12

dries commented 4 June 2005 at 07:28

If the goal is to generate books, DocBook-export is key. However, if the output is only DocBook-like, it is only going to be used by a handful of people. I think the code comments should mention that it is only DocBook-like.

If the goal is to import/export books, OPML might be the better choice. I think book syndication (publish/subscribe) is going to be the more popular.

Either way, let's clean up the URL scheme and extend the book_help() function a bit (if not already).

Comment #13

dries commented 5 June 2005 at 11:00

I made some changes (URL scheme) and committed this patch to HEAD. Please update your tree before making more changes.

- Can you update the issues affected by this commit?

- I wonder why the release info (md5-sum) isn't stored as XML attributes so it can be parsed/extracted easily. Probably DocBook-specific.

- Firefox did not recognize the XML document as being XML. I think we might have to send the proper headers:
drupal_set_header('Content-Type: text/xml; charset=utf-8');. Haven't checked yet.

Great work Djun.

Comment #14

puregin commented 6 June 2005 at 09:18

Status	File	Size
new	xml-export-03.patch	22.3 KB

Thanks, Dries.

Here is a new patch, against revision r1.299.

- now generate valid DocBook XML (fragments). Level 0 nodes are exported as books; level 1 as chapters, and level 2 and higher as sections.

Content is still wrapped as CDATA, for the moment. I am working on code to generate proper DocBook from HTML, but this will require 1) Tidy, HTMLCorrector or something of the sort, to ensure that the HTML is well-formed XHTML; and 2) XSLT support enabled for PHP. I assume that this will need to be packaged as a contributed module?

- generate OPML (titles only)

- issue Content-type: text/xml headers for XML output (DocBook, OPML)

- changed URL scheme (single callback for all exports, takes export type as 1st argument)

- merged book_export_html and book_export_xml into a single function book_export with 'switch()' logic to handle different output formats, to support the above URL scheme.

So for example,

book/export/html/154

will generate printer friendly HTML, while

book/export/docbook/154

generates DocBook.

- added function _book_get_depth($nid) which computes the absolute depth of a node in a book hierarchy

- printer-friendly HTML is now generated with the exported node embedded to its absolute depth (e.g., level 3 nodes will always be marked-up as level 3 nodes)

- changed name book_node_visitor_print -> name book_node_visitor_html

- added parameter depth to 'post-node' visitor function call

- fixed incorrect parameter to node_invoke_nodeapi() - changed 'view' to 'print' so as to avoid rendering book navigation in printer friendly (or other export format) output.

- Added the new admin/help documentation, updated to include the XML export functionality.
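
The URL scheme and the depth rules in the list above can be sketched together. This is Python rather than the module's PHP; `route_export`, `get_depth`, `docbook_element`, and the `parents` map are hypothetical stand-ins (the patch's own helpers are book_export() and _book_get_depth($nid)), and the `text/html` content type for the printer-friendly case is an assumption.

```python
# Content types per export type; XML formats share text/xml as described above.
CONTENT_TYPES = {
    "html": "text/html; charset=utf-8",
    "docbook": "text/xml; charset=utf-8",
    "opml": "text/xml; charset=utf-8",
}


def route_export(path):
    """Split 'book/export/<type>/<nid>' into (type, nid, content type)."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "book" or parts[1] != "export":
        raise ValueError("not an export path: %r" % path)
    export_type, nid = parts[2], int(parts[3])
    if export_type not in CONTENT_TYPES:
        raise ValueError("unknown export type: %r" % export_type)
    return export_type, nid, CONTENT_TYPES[export_type]


def get_depth(nid, parents):
    """Absolute depth of a node: number of ancestors up to the book root."""
    depth = 0
    while parents[nid] is not None:
        nid = parents[nid]
        depth += 1
    return depth


def docbook_element(depth):
    """Level 0 is a book, level 1 a chapter, level 2 and higher a section."""
    if depth == 0:
        return "book"
    if depth == 1:
        return "chapter"
    return "section"


# Hypothetical hierarchy: node 154 sits two levels below the book root.
parents = {1: None, 10: 1, 154: 10}
export_type, nid, content_type = route_export("book/export/docbook/154")
element = docbook_element(get_depth(nid, parents))
```

Because the depth is absolute, exporting node 154 on its own still marks it up as a section rather than as a top-level book.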

Comment #15

muralik commented 9 June 2005 at 01:08

when i patch, i get this error, any clues?
failed at line 321

Comment #16

puregin commented 9 June 2005 at 03:11

I'd guess you are applying the patch to the wrong version. Probably your best bet is to grab the latest version from CVS, since this patch has been applied. Keep watching though because there are some other changes I'm currently working on.

Regards, Djun

Comment #17

killes@www.drop.org commented 7 August 2005 at 01:30

Status:

Needs review

» Fixed

patch has been applied.

Comment #18

ax commented 9 August 2005 at 14:09

Status:

Fixed

» Active

is xml export supposed to reveal the code of php-pages? have a look at http://drupal.org/book/export/docbook/3 for an example. this strikes me as a security issue, so i'm marking this ACTIVE.

Comment #19

puregin commented 23 August 2005 at 21:30

Status:

Active

» Needs review

Status	File	Size
new	book.module_6.patch	13.62 KB

The attached patch for book module implements the following changes

- new XML export format includes all attributes required for re-import, supports
book import (which I am going to provide as a contributed module)
- DocBook XML export functionality has been removed (to a separate module)
- Re-architected book_export() to enable external modules to provide support for
different kinds of export

I have not addressed ax's concern re: visibility of PHP code. I agree this is important.

We could introduce a new permission: 'export books', or go for something more granular,
'export PHP code in books'.

What would people prefer to see here?

Djun

Comment #20

Uwe Hermann commented 23 August 2005 at 22:25

Status	File	Size
new	book_30.patch	15.89 KB

Rerolled patch to improve format (diff -u -p).

Comment #21

moshe weitzman commented 3 October 2005 at 19:48

i think 'export PHP code in books' could be an admin pref on the book settings page. no need to have the book vary by role like that. just my .02

Comment #22

Tobias Maier commented 5 October 2005 at 01:42

there is not just this security issue.
some people like me don't like that this feature is active by default and that there is no option to deactivate it.
--> think about brochure sites: do they need export features available?

Comment #23

Tobias Maier commented 5 October 2005 at 01:43

should such a feature be available per node / per book or for the whole site? or all together?

Comment #24

moshe weitzman commented 2 November 2005 at 22:19

i propose that a single setting apply to the whole site

Comment #25

puregin commented 3 November 2005 at 08:12

I will add such a setting in the book admin page, then.

Djun

Comment #26

puregin commented 28 November 2005 at 03:03

Priority:

Normal

» Critical

Status	File	Size
new	book_31_0.patch	16.33 KB

The attached patch removes the export functionality for DocBook XML and OPML. It also supersedes the previous patch (book_30.patch) which introduced export to 'Drupal XML' functionality.

The removed functionality has been re-packaged; these changes have been, or will shortly be, submitted as contributed modules.

I feel that this removes non-core functionality from book.module, provides a cleaner architectural delineation, and will encourage additional development for book module and contributed modules supporting book.module and other structural modules.

This change will enable administrators to include/exclude support for exporting various formats by installing/enabling modules, so people who don't want the functionality won't have to put up with it.

At the same time, this patch provides support for external modules to provide export functionality.
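
As a rough picture of that arrangement, the sketch below uses a registry that outside code adds formats to, so the core dispatcher needs no knowledge of DocBook or OPML. It is Python, not Drupal's hook system, and `EXPORTERS`, `register_exporter`, and the handler bodies are hypothetical.

```python
EXPORTERS = {}  # export type -> handler; starts with whatever core registers


def register_exporter(export_type):
    """Decorator a module would use to contribute an export format."""
    def decorator(func):
        EXPORTERS[export_type] = func
        return func
    return decorator


def book_export(export_type, nid):
    """Dispatch to a registered handler; None means the format is not installed."""
    handler = EXPORTERS.get(export_type)
    if handler is None:
        return None
    return handler(nid)


@register_exporter("html")
def export_html(nid):
    # Stand-in for the printer-friendly output that stays in core.
    return "<html><!-- node %d --></html>" % nid


before = book_export("opml", 154)  # no module provides OPML yet


@register_exporter("opml")
def export_opml(nid):
    # Stand-in for a contributed module enabling a new format.
    return '<opml version="1.0"><!-- node %d --></opml>' % nid


after = book_export("opml", 154)
```

Sites that never enable the contributed module never register the handler, so the format is simply unavailable there.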

I hope that this simultaneous simplification and extension of functionality will meet with approval for the 4.7 release.

Djun

Comment #27

puregin commented 28 November 2005 at 07:07

Status	File	Size
new	book_32_0.patch	16.19 KB

Sorry, I created the last patch against an older version of book.module. This patch is made against v333.

Comment #28

Bèr Kessels commented 28 November 2005 at 13:37

I like this patch
+1

A small issue: IMHO you should remove the export permissions too, from book module.

Comment #29

puregin commented 29 November 2005 at 20:04

Status	File	Size
new	book_34.patch	17.2 KB

Per the suggestion by Bèr, I've reworked the patch (attached) to remove the 'export books' permission from book.module.

Each export format, as supported by contributed modules, can have its own permissions.

By the way, I have adopted Moshe's suggestion and set up a flag 'Allow PHP Export' in the
export_dxml contributed module. Via this option, site admins can enable or disable
export of PHP code in the dxml book export.

Comment #30

dries commented 30 November 2005 at 13:16

Status:

Needs review

» Fixed

Makes sense. Committed to HEAD. Thanks.

Comment #31

(not verified) commented 14 December 2005 at 13:20

Status:

Fixed

» Closed (fixed)
