Preserve hierarchical structure of document by using sections instead of bridgeheads
Bodo Maass - November 4, 2008 - 11:41
| Project: | Export DocBook |
| Version: | 5.x-1.x-dev |
| Component: | Code |
| Category: | task |
| Priority: | normal |
| Assigned: | Bodo Maass |
| Status: | active |
Description
The current version of export_docbook converts all headers to the bridgehead tag in DocBook. It would be better to convert them to sections, because this preserves the hierarchy and allows a better table of contents to be generated in the resulting PDF.
I guess this would be very hard in XSL, because the headers have no corresponding closing tag that marks the end of their scope. However, the table of contents module does a good job of extracting the header hierarchy, so I'm planning to use that code as a base for an HTML -> DocBook conversion with sections instead of bridgeheads.
If anyone is interested, I'll post my results here (although the code will probably be quite experimental).
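
To illustrate the idea of recovering section scope from headers that have no closing tag, here is a minimal sketch in Python. It is not the module's code and not the table of contents module's code. The input shape (a flat list of "heading" and "para" tuples) and the function name headings_to_sections are assumptions made up for the example; real HTML would need to be parsed into that shape first, and the output is not validated against the DocBook schema.

```python
from xml.sax.saxutils import escape


def headings_to_sections(blocks):
    """Turn a flat list of blocks into nested DocBook <section> markup.

    blocks: list of ("heading", level, title) or ("para", text) tuples.
    This input shape is a hypothetical simplification of parsed HTML.
    """
    out = []
    open_levels = []  # stack of heading levels whose <section> is still open

    for block in blocks:
        if block[0] == "heading":
            _, level, title = block
            # A heading ends every open section at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                open_levels.pop()
                out.append("</section>")
            open_levels.append(level)
            out.append("<section><title>%s</title>" % escape(title))
        else:
            out.append("<para>%s</para>" % escape(block[1]))

    # Close whatever is still open at the end of the page.
    while open_levels:
        open_levels.pop()
        out.append("</section>")
    return "".join(out)
```

The stack plays the role of the missing closing tags: a section stays open until a header of the same or a higher rank shows up, or the page ends.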

#1
XSL is evil.
Although at least I discovered that Visual Studio contains a fairly usable XSL debugger. The hierarchical stuff was quite easy, but now I've found that the table conversion gets the column count wrong on some tables where cells have colspan>1 in more than one row.
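
For reference, one way to get a column count that does not depend on which rows contain spanning cells is to sum the colspan values in each row and take the largest sum. The sketch below is in Python rather than XSL and is only an illustration under assumptions: the function name column_count and the list-of-colspans input are made up for the example, rowspan is ignored, and this is not what the supplied stylesheet currently does.

```python
def column_count(rows):
    """Return the number of columns in a table.

    rows: list of rows, each row a list of colspan values, one per cell
    (a cell without a colspan attribute counts as 1).
    Hypothetical input shape; rowspan is deliberately not handled.
    """
    # The widest row, measured in spanned columns, determines the count.
    return max((sum(row) for row in rows), default=0)
```

For example, column_count([[1, 1, 1], [2, 1], [1, 2]]) gives 3, even though two of the three rows have only two cells.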
#2
You're welcome to use any stylesheet in place of the supplied one.
As you pointed out, the header tags in HTML are presentational and not structural markup, hence they are essentially useless for generating genuine sections in DocBook.
A better approach, which will save much grief in maintaining revisions between the HTML and DocBook versions, would be to use the book structure to provide document structure and avoid author-generated headers.
Good luck... Djun
#3
I'd be happy to provide a contrib directory for additional stylesheets etc.
So, please feel free to attach your XSL once you're satisfied, along with good documentation, and I'll include it.
Also please note: My plans for the Export Docbook module are to roll this into Book Import/Export.
Regards, Djun
#4
Hi Djun,
It's more than just a stylesheet. I reused some code from the tableofcontents project to inject proper hierarchical structure into the document before it is processed by the XSL, which allows the XSL to be simplified significantly. However, there are still a few assumptions about the original HTML, so I need to play with this some more before it could be useful for general consumption. I'll post my results here once they are ready.
Best regards,
Bodo
#5
OK, this is just completely the wrong approach, then.
If you need to convert from "structure" provided by headers, as the table of contents module does, this should be done once only, when converting to book structure. Thereafter, use the book module to provide structure, enforce editing guidelines on your authors, and treat H* tags as presentational only.
Otherwise, you will have an editing maintenance nightmare on your hands.
I will not commit any code that attempts to guess structure from HTML. Why bother with book module at all in this case?
#6
We are talking about two levels of structure. The book module provides structure between book pages. The HTML headers provide additional structure within a single page. Both levels have their justification. I guess the intra-page structure is not essential if you only want DocBook for printed PDFs. But for CHM, or for a PDF with a deep, browsable table of contents, the additional structure is helpful for long pages.
For example, the page http://www.sygyt.com/en/quickstart is fairly long. With the original DocBook export, the whole page would get just one entry in the table of contents. With the additional intra-page structure from the headers, one can also navigate within the page. Attached is a screenshot of how this looks in the PDF viewer.
With this approach I'm essentially treating some book pages as small chapters in a book. The quickstart page is an extreme example of this; most other pages on my site are much shorter. But I don't see an editing nightmare here, because the consistency of the book structure is in no way affected by the content of the pages. Or would you suggest that I should break this page down into 6-9 individual book pages?