Internationalization support for PDF [#6795]

The following was originally submitted by joel_guesclin on another issue. I reproduce it here because it isn't applicable in its original location:

I would very much like to use PDFview on the site I am planning, even if it only does simple rendering. However, I have a major problem: the only font sets provided with the product are in ISO-8859-1. Since Drupal (and all the other CMSs I have looked at) very sensibly works in UTF-8, as soon as I try creating output in any language other than English, I lose all the "special" characters (e.g. é, è, ç, à etc). Since the site I am working on includes English, French, Spanish, German, Italian, Portuguese.... but also Russian, Hindi (हर व्यक्ति के लिए वितरण के अधिकार देता ), Bengali, Farsi... this is obviously a restriction that makes the product unusable. Has anyone found any way round this, or tested PDFView with fonts other than the defaults - indeed, does anyone know where to find some UTF-8 open source freeware fonts?

Comment	File	Size	Author
#7	pdfview_fpdf_patch_0.txt	13.73 KB	TheLibrarian
#4	pdfview_fpdf_patch.txt	11.38 KB	TheLibrarian

Comments

Comment #1

Steven commented 30 March 2004 at 10:17

I'm not sure what fonts PDFView needs, you might find some useful things on this page.

You might also find Gentium interesting, which is a very comprehensive Latin-alphabet font (including extensions).

Comment #2

Steven commented 30 March 2004 at 10:19

By the way, don't confuse UTF-8 (an encoding) with Unicode (the character set). Searching for UTF-8 fonts won't get you much, searching for Unicode fonts will.

Comment #3

TheLibrarian commented 22 May 2004 at 17:16

After looking around a bit, I think that FPDF [1] will suit your needs better. It supports addition of new fonts and encodings, so it will not be limited to Western character sets. If no one has any objections, I'm going to rewrite PDFview to use FPDF instead of phppdflib.

[1] http://www.fpdf.org/

Comment #4

TheLibrarian commented 22 May 2004 at 19:43

Status	File	Size
new	pdfview_fpdf_patch.txt	11.38 KB

As a proof of concept, this patch switches from phppdflib to FPDF. The code isn't very clean yet, but this is a functional modification. In the future, the inner class will move to its own file.

Future features will include more support for HTML elements such as TABLE, BLOCKQUOTE, and HR. I also will re-add support for image nodes and add configuration items for changing fonts and encodings.

If anyone has requests for more HTML elements or PDF features (such as display of taxonomy terms), please indicate them by posting to this thread.

Comment #5

TheLibrarian commented 22 May 2004 at 20:50

The patched version is currently running on http://www.nosir.org if anyone cares to take a look at the output.

Comment #6

TheLibrarian commented 23 May 2004 at 00:29

This new patch has much cleaner code. There is now a separate file for the PDF class. The code fits Drupal's coding guidelines. Image support has been re-added. Blockquote is functional.

Tables will prove to be slightly harder than I previously thought. I don't think I'll finish it up tonight. However, this version of PDFview completely replaces the old, and is self-standing. Can it get committed to HEAD while I work on table support?

This version is now running on http://www.nosir.org. Check out a post with a blockquote to see how it looks! One good example is: http://www.nosir.org/node/view/34.

Comment #7

TheLibrarian commented 23 May 2004 at 00:30

Status	File	Size
new	pdfview_fpdf_patch_0.txt	13.73 KB

The patch mentioned in the above post.

Comment #8

SupaDucta commented 25 June 2004 at 20:28

I have just installed the pdfview plugin, patched - uses FPDF.

Basically, it produces a PDF that is too simple.

I have tried making the PDF out of a Drupal node in Croatian language, and all accent characters are displayed as garbage.

FPDF by default supports several encodings, but not UTF-8. Let's say it can be done fiddling with FPDF's makefont functions, but surely - for a system that uses UTF-8 like Drupal, UTF-8 encoding for FPDF should be included with the module install archive.

Additionally, when fonts are included in PDFs they need a proper embedding policy. A font can be embedded as a subset (only the characters in use, good for online PDFs) or in full (for imagesetting and press printing - not necessary here), but the embedding info should be present. So PDFs generated from Drupal should have the Embedded Subset policy set, which embeds only the used characters of a font into the PDF. Otherwise, if we use Helvetica with Unicode support in the PDF but without embedding, the font will be substituted with the printing system's Helvetica when printing, and special chars may again be garbled.

The testing node itself included a few HTML tags: -br- and -b- and -i- tags, and an occasional -img src=...- tag. All -br- tags are OK during PDF generation, but if the author typed ENTER key for new line there are no spaces between paragraphs as were in node. Only -b- tags were interpreted. -ul- tag doesn't indent a text, it produces a space between paragraphs in PDF.

Instead of just a page number in the PDF's footer, it would be nice to have at least the originating site's name and URL.

However, the best solution by far would be the ability of choosing a template for a PDF file - so we could make a template for PDF docs which would be filled on-the-fly.

Now what happens if authors use bbCode tags? They are echoed as they were input in page's source - no formatting.

FPDF is very powerful, and it can be used to produce really great stuff.

I am involved in development of French PDF-related site http://www.abracadabrapdf.net, authored by Adobe Certified Expert Jean-Renaud Boulay, and we are planning on putting Drupal behind it and to expand French language content to French, English and at later stages Croatian. We could REALLY use this module to present users with articles and tutorials in PDF format, generated on-the-fly.

But, for that we need fully rich HTML code support, and template abilities as a must. When the image abilities are inserted, it would be good for all GIF and PNG images to be resampled to JPEG in PDF document.

We could solve template needs fiddling with FPDF, but I am sure many Drupal users interested in PDF output would like to have the ability of templating the PDF in advance.

In general, I believe this is a great idea, and a great thing you initiated here, but needs a LOT of work before it gets actually usable.

Comment #9

(not verified) commented 26 June 2004 at 17:44

Thank you very much for the feedback SupaDucta. I have been looking for someone to mention whether they use this module before I put too much work into it. For now, I'm going to continue using FPDF to expand on the areas you mention, especially HTML support. I realize I also need to work on support for BBCode and PHP content. I am also on the lookout for even better utilities than FPDF, so if you have any suggestions, please feel free to speak up. Absent a better solution, I think that it may be in our best interests to create a 'Drupal' release of FPDF that has some of the settings/fonts tweaked to support our needs.

It is exciting that you are looking into using this module on http://www.abracadabrapdf.net. While I am unable to work very much on PDFview until the first of July, I am curious if you have a timeframe for the migration of site. This will help me allocate my time to the proper projects.

TheLibrarian

Comment #10

SupaDucta commented 26 June 2004 at 18:56

Oh yes, we would use it - extensively, and then some ;)

This would serve as a great extension to abracadabraPDF as a brand, and as a great extension to Drupal. When I imagine all our articles generated as a PDF on the fly... yummie ;)

Concerning templating - I believe it would be easier to implement a pure HTML template in for ex. subfolder pdfview/template than a FPDF's template because template editing needs to be available and easy for all Drupal's users.

Colours - it may be a good idea to implement a colour-conversion function that would replace all HEX colour values from the HTML output into RGB, so PDF thus created can have a proper, standardized colourspace: RGB tag. That's also one of the reasons all GIFs and PNGs would be better off converted to JPEGs with appropriate compression settings.
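The colour conversion suggested above could look roughly like the following. This is only a sketch in Python (the thread itself contains no code, and the module is PHP); the function name `hex_to_rgb` is made up for illustration and is not part of PDFview or FPDF.

```python
def hex_to_rgb(value: str) -> tuple:
    """Convert an HTML colour such as '#ff8800' or '#f80' to an (r, g, b) tuple."""
    digits = value.strip().lstrip("#")
    # Expand the three-digit shorthand: 'f80' -> 'ff8800'.
    if len(digits) == 3:
        digits = "".join(ch * 2 for ch in digits)
    if len(digits) != 6:
        raise ValueError(f"not a hex colour: {value!r}")
    # Each pair of hex digits is one 0-255 component.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
```

The resulting 0-255 components are in the form that RGB colour setters in PDF libraries usually take.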

Font issue - FPDF supports TrueType and Type1 fonts, and there is even an AddFont script that makes makefont.php manipulation easier. I believe a cross-platform solution driven through Unicode support can be implemented here, enabling all those accent characters in various languages to be displayed and printed properly.

Concerning PHP PDF libraries, the information I have tells me FPDF is the best of the free libs. Have you checked the contributed scripts on their site?

Timeframe - I am struggling with my first Drupal site at the moment which I also plan to enrich with PDF output at later stages (Drupal is a hard nut, I must say but very good), and after that we are going to start developing the new abracadabraPDF under Drupal. Since it is going to be a big chunk of work required, the old site will keep functioning but the new site will be online on another testing location. Our timeframe is when done properly, it is not strict, and we can implement PDFview features when they are finished ;) Mr. Boulay was delighted on the possibility of PDF creation, especially through FPDF.

Furthermore, I believe I can speak on my own and on Mr. Boulay's behalf to say - whatever you need, we can help. Additionally, I think Mr. Boulay has contacts with the guys from FPDF.org so, if needed, we can work potential problems out together or get advice from them. We can especially help from the PDF side of the story.

I say again - BIG THUMB UP for PDFView! :)

Comment #11

joel_guesclin commented 31 August 2004 at 09:55

Priority:

Normal

» Critical

I have a major problem with FPDF: I posted a request on their forum for recommended Unicode fonts. The reply that came back: "you know - that unicode is not supported by FPDF?". Well no, I didn't know. And since Drupal is wholly based on Unicode (quite right too!), it seems to me impossible to make FPDF work properly for any language other than plain US English - unless the coding in pdfview is able to determine what the language is and convert from Unicode to a single-byte coding. But can it do this?

Comment #12

Steven commented 31 August 2004 at 11:01

If you only use one character set, you can convert from Unicode/UTF-8 into it before sending it off to PDFview. For the Latin-1 character set, you use PHP's utf8_decode() to convert from UTF-8 to ISO-8859-1. Otherwise you'll need to use the iconv, librecode or mbstring extensions for PHP.
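To illustrate the conversion step, here is a minimal sketch in Python rather than PHP (PHP's iconv or mbstring would play the same role). The helper name `to_single_byte` is made up; the point is only that characters outside the target character set cannot survive the conversion.

```python
def to_single_byte(utf8_bytes: bytes, charset: str = "iso-8859-1") -> bytes:
    """Re-encode UTF-8 input into a single-byte charset.

    Characters the target charset cannot represent become '?'.
    """
    text = utf8_bytes.decode("utf-8")
    return text.encode(charset, errors="replace")


latin1 = to_single_byte("é".encode("utf-8"))                 # fits in Latin-1
latin2 = to_single_byte("č".encode("utf-8"), "iso-8859-2")   # fits in Latin-2
lost = to_single_byte("č".encode("utf-8"))                   # not in Latin-1
```

Here `latin1` and `latin2` are each one byte long, while `lost` is a literal question mark, which is the kind of damage to expect once a text mixes languages that need different codepages.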

Comment #13

killes@www.drop.org commented 31 August 2004 at 15:47

I'd actually prefer to do no conversion and produce UTF-8 PDFs right out of the database. I have spent some time searching but didn't find an obvious solution. Maybe we could modify the FPDF sources.

Comment #14

SupaDucta commented 31 August 2004 at 20:47

Well, according to FPDF, PDF doesn't support that encoding.

See

http://www.fpdf.org/phorum/read.php?f=1&i=16111&t=16111

However ;) a little hack could be performed, where special characters would be remapped. I am really not sure how it would be done on the PHP side, but basically the following would have to be performed:

Me, for example. I am using UTF-8 to display Croatian accented characters in Drupal. For example č, ć etc. Half of these characters lie in the same positions within the font map as in 8859-2 ISO codepage. But, the rest lies within the so-called extended Unicode range. So, for Croatian language, characters coming from Drupal's UTF-8 would have to be remapped to ISO-8859-2 codepage positions, prior to FPDF handling, and then the FPDF would actually use ISO-8859-2 prepared font to be able to display characters in PDF.

So when FPDF runs, it would need to set the font where the desired characters would already be remapped.

However, I believe this could be altered for such a 'hack'. Please see makefont.php and the font map files, i.e. iso-8859-2.map etc., in the FPDF distribution. Drupal's output would have to be adjusted prior to FPDF handling, i.e. character positions remapped to correspond to the proper *.map file of FPDF. So the PDFView module would need an option on its module settings page, something like "PDF Remapping CodePage", where the user would select the appropriate ISO codepage and the module would remap character positions accordingly. Perhaps I could try to provide assistance with this, and with some PDF testing, since I have done a lot of it in prepress.

So basically, the following codepages are available, and I believe this would cover almost the complete UTF-8 character space: cp874, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258, 8859-1, 8859-2, 8859-4, 8859-5, 8859-7, 8859-9, 8859-11, 8859-15, 8859-16, koi8-r, koi8-u. UTF-8 output has to be remapped to one of those inputs prior to FPDF. Most of the 'usual' characters share the same positions within a group (125*), only certain accents etc. have specific positions.
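As a rough sketch of that remapping idea, the snippet below tries a list of candidate codepages and returns the first one that can represent the whole text. It is written in Python for illustration only; the function name and the shortened candidate list are assumptions, not anything PDFview or FPDF provides.

```python
# A short, illustrative selection of codepages (Python codec names).
CANDIDATES = ["iso8859_1", "iso8859_2", "iso8859_5", "cp1250", "cp1251", "koi8_r"]


def pick_codepage(text, candidates=CANDIDATES):
    """Return the first codepage that can encode every character, or None."""
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character has no position in this codepage
        return name
    return None
```

With this list, Croatian text such as "čćšž" ends up in ISO-8859-2, Russian text in ISO-8859-5, and Hindi text finds no match at all, so a per-site codepage setting only helps sites that stay within one script.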

Comment #15

Steven commented 31 August 2004 at 22:04

A bit off topic, but:
"So basically, the following codepages are available, and I believe this would cover almost the complete UTF-8 character space: cp874, 1250, 1251, 1252, 1253, 1254, 1255, 1257, 1258, 8859-1, 8859-2, 8859-4, 8859-5, 8859-7, 8859-9, 8859-11, 8859-15, 8859-16, koi8-r, koi8-u. UTF-8 output has to be remapped to one of those inputs prior to FPDF. Most of the 'usual' characters share the same positions within a group (125*), only certain accents etc. have specific positions."

You have got to be kidding me ;). Each of those codepages has 256 positions, but 128 of them are the same between all of them (ASCII). Even if you assume all positions are used (they're not) and that the upper 128 characters of each codepage do not overlap with others (they do), then you still get only 128 * 20 = 2560 codepoints. Unicode 4.0 has 96382 assigned codepoints (of which 'only' 70207 are Chinese Ideographs).
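The bound above can be checked empirically. The sketch below (Python, for illustration) decodes the upper 128 byte values of each of the twenty codepages and counts the distinct characters they produce; the codec names are Python's spellings of the codepages listed in the quote.

```python
CODEPAGES = [
    "cp874", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255",
    "cp1257", "cp1258", "iso8859_1", "iso8859_2", "iso8859_4", "iso8859_5",
    "iso8859_7", "iso8859_9", "iso8859_11", "iso8859_15", "iso8859_16",
    "koi8_r", "koi8_u",
]


def upper_half_characters(codepage):
    """Characters assigned to byte values 128-255 in one codepage."""
    chars = set()
    for byte in range(128, 256):
        try:
            chars.add(bytes([byte]).decode(codepage))
        except UnicodeDecodeError:
            pass  # this position is unassigned in the codepage
    return chars


covered = set().union(*(upper_half_characters(cp) for cp in CODEPAGES))
upper_bound = 128 * len(CODEPAGES)  # the 128 * 20 = 2560 figure
print(len(covered), "distinct characters, upper bound", upper_bound)
```

Because of unassigned positions and overlap between codepages, the printed count stays below the 2560 upper bound.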

Comment #16

SupaDucta commented 31 August 2004 at 22:27

LOL :) Yep, lots of characters. But as you correctly said, maybe iconv, librecode or mbstring extensions for PHP could be used to do the actual remapping. Should be sufficient for ex. Croatian accents (there are not many of them actually - 10 only to remap or UTF-8 decode, out of which at least 4-5 lie in the standard range. That should also cover Serbian, Slovenian). Don't know about Chinese though :). I mean I could use 8859-2 in Drupal and the same encoding in FPDF, but UTF-8 is sooooo great displaying all those weird Š Ž chars worldwide ;)

A friend of mine and I have encountered the Unicode problem when (I was) translating and (he was) developing a few small JS based Acrobat plugins where no Croatian accents were displayed in Acrobat's menu item that plugin generated and to this day we haven't figured a solution out.

Comment #17

Steven commented 31 August 2004 at 23:46

I've been doing some PDF reading the last days (for something else, but it was about Unicode too).

Here's what I've managed to gather from the specs:

PDF supports 2 forms of font mapping: simple and composite. Simple is where every byte in a string maps to a glyph in the font, using glyph names. Composite fonts are a lot more complex. Of interest are the CID-keyed composite fonts. These allow you to specify a way to map an arbitrary byte stream to characters (for example from UTF-8 or UTF-16 to Unicode codepoints). This is good for TrueType fonts, because they contain a mapping from Unicode codepoints to glyphs. You can embed TrueType fonts as 'CIDFont Type 2'.

The easiest approach would be to use UTF-16BE and restrict it to the Basic Multilingual Plane (codepoints up to U+FFFF, no surrogates [1], which is often called UCS-2). It is especially interesting, because for these characters, every 16-bit word maps to the identical Unicode codepoint, and PDF easily allows you to use that simple mapping with the 'Identity' CIDtoGID map [2].

Now there's the issue of fonts. There are a lot of good fonts out there with a large coverage of Unicode, but the problem is that you probably do not want to embed a 23MB universal font in your PDF. So ideally, you would have a set of smaller fonts (Gentium, Bitstream Cyberbit, ...), check which Unicode ranges they support, check which Unicode codepoints are used by the text, and include only the required fonts.
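A minimal sketch of that font selection step follows, in Python for illustration. The coverage table is invented: real code would read each font's own Unicode-to-glyph table instead, and the font names here are placeholders.

```python
# Hypothetical coverage table: font name -> inclusive codepoint ranges.
FONT_COVERAGE = {
    "latin-font": [(0x0000, 0x024F)],
    "cyrillic-font": [(0x0400, 0x04FF)],
    "devanagari-font": [(0x0900, 0x097F)],
}


def fonts_needed(text, coverage=FONT_COVERAGE):
    """Return (fonts to embed, characters no font covers) for a text."""
    needed = []
    missing = set()
    for ch in text:
        codepoint = ord(ch)
        for font, ranges in coverage.items():
            if any(low <= codepoint <= high for low, high in ranges):
                if font not in needed:
                    needed.append(font)
                break
        else:
            missing.add(ch)  # no font in the table covers this character
    return needed, missing
```

Only the fonts in the first return value would then be embedded; anything in the second would need another font or a fallback glyph.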

To get FPDF to do all this is very difficult however. As I said, it's easiest to use UTF-16BE for PDF. PHP strings are 8-bit only, but they are binary-safe so you can put UTF-16 data in them and process them by byte pairs. Conversion from UTF-8 to UTF-16 is very straightforward if you disallow UTF-16 surrogates.
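The UTF-8 to UTF-16BE step, with surrogates disallowed, can be sketched like this. It is Python rather than PHP and the function name is made up; it simply writes each codepoint as one big-endian 16-bit word and refuses anything outside the BMP.

```python
def utf8_to_ucs2_be(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 to big-endian 16-bit words, BMP only (no surrogates)."""
    text = utf8_bytes.decode("utf-8")
    out = bytearray()
    for ch in text:
        codepoint = ord(ch)
        if codepoint > 0xFFFF:
            # Would need a surrogate pair in UTF-16, which is disallowed here.
            raise ValueError(f"U+{codepoint:04X} is outside the BMP")
        out += codepoint.to_bytes(2, "big")
    return bytes(out)
```

For example, "Ač" becomes the four bytes 00 41 01 0D, so processing the result in byte pairs gives back one codepoint per pair.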

FPDF only supports so called simple fonts. So you would need to:
- Alter the font handling code to embed the file as a composite 'CIDFont Type 2' font with the Identity mapping.
- Change the glyph metrics code so it uses Unicode codepoints rather than WinANSI mappings [3].
- Alter the string handling so it processes the string in byte-pairs and treats each pair as one character (UTF-16/UCS-2).
- Add a smart font selector and switcher, based on the unicode codepoints used in the text.
- Change the output code so that every string is written out as UTF-16BE, properly escaped for the PDF (FPDF's current escape mechanism assumes the string only contains printable ASCII/ANSI characters).
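For the last step, here is a sketch of two ways a 16-bit string could be written into a PDF safely, in Python for illustration. PDF syntax allows both hexadecimal strings in angle brackets and literal strings in parentheses; the function names are made up and this is not FPDF's actual output code.

```python
def pdf_hex_string(text: str) -> str:
    """Write BMP text as a PDF hexadecimal string of UTF-16BE words."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("only BMP characters are supported")
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


def pdf_literal_string(data: bytes) -> bytes:
    """Write raw bytes as a PDF literal string, escaping the special bytes."""
    out = bytearray(b"(")
    for byte in data:
        if byte in b"\\()":
            out += b"\\" + bytes([byte])  # backslash and parentheses
        elif byte == 0x0D:
            out += b"\\r"  # a raw CR would be read as a line ending
        else:
            out.append(byte)
    out += b")"
    return bytes(out)
```

The hex form doubles the size of every string but needs no escaping at all, which makes it the simpler choice when the data is arbitrary byte pairs rather than printable ASCII.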

Instead of UTF-16BE/UCS-2, UTF-8 could be used, either only internally in PHP (although it would not make the PHP string processing easier) or also in the PDF itself. In the latter case, a CIDtoGID CMap mapping would need to be created which maps UTF-8 byte sequences to Unicode codepoints.

Notes:
[1] The PDF specs say that the range of a CID is 0-65535. This means that you could support more than just the BMP, but only through multiple fonts which each cover at most one plane of Unicode. Restricting ourselves to the BMP is not that big of a problem.

[2] I haven't managed to figure out if, for TrueType fonts, the map maps to Unicode codepoints, or glyph indices. If it maps to glyph indices, then a custom map needs to be built per font, because a font's internal glyph indices are often assigned arbitrarily. This custom map would be based on the font's own Unicode-to-glyph mapping. In the case of a custom map, the UTF-16 Identity-mapping advantage is lost.

[3] FPDF now seems to rely on .afm (Adobe Font Metrics) files for font metric information, even for TrueType fonts (even though they contain their own metric information). I'm not sure if .afm files even work for Truetype fonts with lots of Unicode characters. In any case, the use of glyph names rather than glyph indices would be very slow for larger fonts.

Comment #18

Steven commented 31 August 2004 at 23:50

And for your reading pleasure, here's the link to the PDF specs that I used:
http://partners.adobe.com/asn/tech/pdf/specifications.jsp
(a whopping 1172 pages).

Comment #19

SupaDucta commented 2 September 2004 at 03:19

Quote:

Or embed only a subset of a font. PDF can easily be instructed to embed only the subset. And any of Adobe's OpenType Pro fonts could be used as a universal font (only the Pro versions are 'real' OpenType fonts - OpenType non-Pro fonts are simply Type 1 fonts reencoded to OpenType format). Not that it helps in this ever growing complication. :(

Comment #20

Steven commented 2 September 2004 at 03:44

I've gotten a first version of a Unicode-compatible FPDF up at http://www.acko.net/node/56 .
But it only embeds TTF fonts entirely. Splitting them up would be a lot more work.

Comment #21

SupaDucta commented 3 September 2004 at 01:15

Steven, I see we have posted in the same UTF-8 related forum topic on FPDF.org. :) I believe you are onto something really good. I have downloaded UFPDF and will implement it on one site that is almost finished to see what happens and how it goes. Will keep you posted on anything I notice, to provide as much feedback as possible.

Since I am terribly booked at the moment, please give me several days until the first testing.

This could be GREAT ;)

Active

» Fixed

Comment #29

(not verified) commented 5 July 2006 at 10:45

Status:

Fixed

» Closed (fixed)
