PDF support (i.e. PDF to JPG conversion)

duntuk - January 29, 2009 - 21:30
Project: ImageCache
Version: 6.x-2.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active
Description

The ability to support PDF docs, to convert them into JPG, would be great.

(this would probably require support with imagefield or filefield as well)

#1

duntuk - January 29, 2009 - 21:44
Category: task » feature request

#2

egfrith - March 7, 2009 - 22:42

I'm interested in this issue, as I'm looking into writing some code to create a preview of a PDF document uploaded in a filefield, and to make these previews available to Views.

The conversion would require imagemagick to be installed, as gd can't convert pdf to jpg.

#3

egfrith - March 7, 2009 - 22:53

Hmmm... looks like PDF to jpg conversion isn't going to happen in imagefield module: #339266: Feature Request: Convert PDF to image. Perhaps in filefield module? Or as a contrib module? There is also the pdfstamper module, though this is not yet views enabled, and does more than I really want: #391308: Future direction of the module.

#4

TyraelTLK - April 15, 2009 - 19:13

Subscribing

#5

egfrith - June 20, 2009 - 23:42

To get this to work, a sequence of patches to imageapi and imagecache modules is required:

  1. imageapi module: create an imageapi_image_get_info() function to replace the use of getimagesize() in imagecache: #416254: Add equivalent of image_get_info() at the toolkit level
  2. imagecache module: use imageapi_image_get_info() instead of getimagesize(). No patch yet; for now replace getimagesize($src) around line 412 in the imagecache module with imageapi_image_get_info($src).
  3. imageapi module: the patch at #375218: Changing file type with imagemagick needs to be applied, so that imageapi_imagemagick can save PDF files that have been converted to a browser-viewable format (e.g. jpeg or png) while keeping the .pdf extension.

Then, create an imagecache preset which contains the "Change File Format" action from imagecache_coloractions module. You can specify that the pdf (or any other type) is converted to jpeg, png or gif.

To move the work on this forward, reviews are needed of #416254: Add equivalent of image_get_info() at the toolkit level.

#6

fei - June 7, 2009 - 21:20

Subscribing (thanks for the support)

#7

egfrith - June 20, 2009 - 23:45

I've got this working - just. I've edited comment #5 so that it gives up-to-date instructions.

#8

egfrith - June 23, 2009 - 22:26

There is now a patch for imagecache. Here are updated instructions for testing:

1. imageapi module: apply latest 6.x patch at #416254: Add equivalent of image_get_info() at the toolkit level
2. imagecache module: apply patch attached here.
3. imageapi module: apply patch at #375218: Changing file type with imagemagick

Then, create an imagecache preset which contains the "Change File Format" action from imagecache_coloractions module. You can specify that the pdf (or any other type) is converted to jpeg, png or gif.

Attachment: imagecache_366373-8.patch (1.21 KB)

#9

alex_andrascu - July 10, 2009 - 10:40

All OK after applying the patches in steps 1 and 2 of #8, but the last one fails to apply against this version. Please review.

#10

egfrith - July 10, 2009 - 10:59

I've just tested this with the latest -dev version of imageapi. The last patch applies OK for me, though with an offset:

$ patch -p0 < imageapi_375218-2.patch
patching file imageapi/imageapi_imagemagick.module
Hunk #1 succeeded at 111 (offset 9 lines).

Does it apply at all for you?

#11

alex_andrascu - July 10, 2009 - 11:09

I applied the patches in the order you describe in #8 with TortoiseSVN. Maybe that's why. Anyhow, I've applied it by hand and it seems to be working. Now I don't know how to use all this stuff to write a raw IM command to try a tiff->jpg conversion.

Thanks for the blitz reply :)

[Update]

I figure that we should make a cumulative patch from

#416254: Add equivalent of image_get_info() at the toolkit level
#375218: Changing file type with imagemagick

for this to work without errors.

#12

egfrith - July 10, 2009 - 11:13

Have you tried using imagecache_actions.module (as described at the end of #8)? It may not be how you want to do things in the long run, but it would confirm whether things are working. I'd be interested to know!

Re the patch, once you've confirmed things are working, perhaps it would make sense to post a new combined patch to #416254: Add equivalent of image_get_info() at the toolkit level

#13

alex_andrascu - July 10, 2009 - 11:58

I guess we're very close now... I just lack some PHP / ImageMagick skills

ImageMagick command: /usr/bin/identify -format "%w %h %m" 'sites/default/files/ads/angel_copy_0.tif'
ImageMagick output: 626 926 TIFF
ImageMagick command: /usr/bin/identify -format "%w %h %m" 'sites/default/files/ads/angel_copy_0.tif'
ImageMagick output: 626 926 TIFF
ImageMagick command: /usr/bin/convert 'sites/default/files/ads/angel_copy_0.tif[0]' -resize 50x74! -colorspace RGB -quality '90' -append 'jpeg:sites/default/files/imagecache/thumbnail_100X100/ads/angel_copy_0.tif'
ImageMagick output:

[UPDATE]

Holy moly, this is working :)

It just doesn't append .jpg to the end of the file name.
It creates a JPEG with the .tif extension. I wonder where the problem is.

#14

egfrith - July 10, 2009 - 11:56

Great! Yes, the file has to have the original extension, otherwise imagecache will think it doesn't exist. As far as I can see, this doesn't cause problems when viewing in browsers.
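
The command logged in #13 shows how this works: the `jpeg:` prefix on the output argument selects the format that ImageMagick writes, whatever the file extension says. Below is a minimal Python sketch of assembling such a command. It is an illustration only; `build_convert_command` and its parameters are hypothetical names, not functions from imageapi or imagecache.

```python
from pathlib import PurePosixPath


def build_convert_command(src, preset, width, height, out_format="jpeg",
                          files_dir="sites/default/files"):
    """Return an argv list for ImageMagick's convert (illustrative only).

    The derivative keeps the source's relative path and extension, so a
    lookup by the original name still finds it. The "FORMAT:" prefix on
    the output argument is what selects the actual encoding.
    """
    relative = PurePosixPath(src).relative_to(files_dir)
    dest = PurePosixPath(files_dir) / "imagecache" / preset / relative
    return [
        "convert",
        f"{src}[0]",                      # [0] selects the first page/frame
        "-resize", f"{width}x{height}!",  # "!" ignores the aspect ratio
        "-quality", "90",
        f"{out_format}:{dest}",           # e.g. jpeg:.../angel_copy_0.tif
    ]


cmd = build_convert_command(
    "sites/default/files/ads/angel_copy_0.tif", "thumbnail_100X100", 50, 74)
print(" ".join(cmd))
```

The last argument comes out as a `jpeg:`-prefixed path that still ends in `.tif`, which is the same shape as the output argument in the log above.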

#15

alex_andrascu - July 10, 2009 - 11:59

No it doesn't :) But we need to fix this anyhow.

#16

schildi - July 10, 2009 - 12:26

Not sure if this hint is helpful for your project, but
- converting PDF to JPG will drop the complete text (no more cut and paste)
- you will get the well-known JPEG artefacts around sharp edges

Maybe you could have a look at the DjVu format, which is also a raster format but preserves the text when converting from PDF. Text is still selectable. And it has some other advantages. For more background please see http://en.wikipedia.org/wiki/DJVU.

The disadvantage might be that the format is not as widespread today.

#17

egfrith - July 10, 2009 - 14:56

Thanks for your hint schildi. I hadn't thought about DJVU, which does have the advantages you say over jpeg. However, is it viewable in a browser? And can imagemagick convert to it?

Your comment also reminds me that I've had problems with the JPEGs that imagemagick has produced from some PDF files. On some machines I've used (all Linux) they have either not shown in the browser at all or displayed only partially. On other machines (again Linux) they have been fine. The workaround has been to convert the files to png rather than jpeg.

#18

egfrith - July 10, 2009 - 14:58

@alex_andrascu: I agree fixing the filenames would be nice, but I think that it is a separate - and potentially very thorny - issue. I think I may have seen it discussed elsewhere, so it might be worth searching.

#19

cbrody - July 15, 2009 - 14:56

I've got #8 to work using a CCK filefield and Views to display the imagecache converted images but the images are each displayed multiple times in the view (as many times as there are images, e.g. three images results in each being displayed three times). Any hints?

#20

schildi - July 12, 2009 - 14:23

On Linux, DjVu installs with some standalone applications (converters like cjb2) and a plugin for Firefox.
I checked this out and it worked well for me.
For a complete conversion cycle you can start from e.g. a jpeg or tif file and use one of the converters mentioned above to create the djvu file.
For example command lines see

http://en.wikisource.org/wiki/Help:DjVu_files

Converting from png is also described as possible. You probably have to use "convert" to get a pbm stream and pipe the result through cjb2 (not checked).
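
As a rough, equally unchecked sketch of that png route, the two steps could be assembled like this in Python. It uses an intermediate .pbm file rather than a pipe, which is an assumption made for simplicity; `djvu_commands` is a hypothetical helper name, and the 300 dpi default is only a placeholder.

```python
from pathlib import PurePosixPath


def djvu_commands(src, dpi=300):
    """Return the two argv lists for a PNG -> PBM -> DjVu conversion.

    Step 1 uses ImageMagick's convert to write a bitonal PBM file.
    Step 2 feeds that file to cjb2, which takes an input and an output path.
    """
    source = PurePosixPath(src)
    pbm = source.with_suffix(".pbm")    # bitonal intermediate file
    djvu = source.with_suffix(".djvu")  # final DjVu document
    return [
        ["convert", str(source), str(pbm)],
        ["cjb2", "-dpi", str(dpi), str(pbm), str(djvu)],
    ]


for command in djvu_commands("scans/page1.png"):
    print(" ".join(command))
```

Since cjb2 is a bitonal encoder, this would only suit black-and-white material such as scanned text.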

#21

vthirteen - July 18, 2009 - 07:11

subscribing

#22

egfrith - July 20, 2009 - 12:43

@19 cbrody: I'm not sure that this is an issue with the code in this patch. To test whether it is, can you check that the multiple images are actually one image file? E.g. examine the HTML on the pages on which you have the multiple images displayed, and find the href of a converted image, and then view it in the browser on its own. Also, you could check the HTML of the page to make sure there aren't multiple hrefs to the same image.

If the image itself is fine, and there are multiple hrefs, perhaps there is a problem with the view?
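
If checking the HTML by eye is tedious, a short Python sketch like the following could count how often each image source appears in a saved copy of the page. It is only an illustration using the standard library; `ImageSourceCounter` and `duplicated_sources` are made-up names, and the sample markup is invented.

```python
from collections import Counter
from html.parser import HTMLParser


class ImageSourceCounter(HTMLParser):
    """Count how many times each <img src="..."> value occurs."""

    def __init__(self):
        super().__init__()
        self.sources = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> tags.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources[src] += 1


def duplicated_sources(html):
    """Return {src: count} for every image referenced more than once."""
    parser = ImageSourceCounter()
    parser.feed(html)
    return {src: n for src, n in parser.sources.items() if n > 1}


sample = (
    '<div><img src="/files/imagecache/thumb/a.pdf" />'
    '<img src="/files/imagecache/thumb/a.pdf" />'
    '<img src="/files/imagecache/thumb/b.pdf" /></div>'
)
print(duplicated_sources(sample))
```

Any source that shows up in the result is referenced more than once in the markup, which would point at the view rather than at the converted file.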

#23

cbrody - July 20, 2009 - 22:02

Hi egfrith, the img src and href are the same for all the images. It seems this could be a problem with Views, as I have it set to select distinct and group multiple values. The query is as follows:

SELECT DISTINCT(node.nid) AS nid, node_data_field_menu.field_menu_data AS node_data_field_menu_field_menu_data, node_data_field_menu.nid AS node_data_field_menu_nid, node.type AS node_type, node.vid AS node_vid FROM node node  LEFT JOIN content_field_menu node_data_field_menu ON node.vid = node_data_field_menu.vid WHERE node.status <> 0

#24

Justin W Freeman - August 2, 2009 - 20:51

Subscribing.

#25

ricklawson - August 5, 2009 - 13:29

subscribing

#26

egfrith - September 9, 2009 - 21:39

I've merged the two imageapi patches, and fixed a problem with one of them which prevented images appearing the first time they were generated, leaving "Failed generating an image..." messages in the logs.

Here are updated instructions for using the patch:

1. imageapi module: apply latest 6.x patch at #416254: Add equivalent of image_get_info() at the toolkit level, #19
2. imagecache module: apply patch attached at #8.

Then, create an imagecache preset which contains the "Change File Format" action from imagecache_coloractions module. You can specify that the pdf (or any other type) is converted to jpeg, png or gif.

Other news: the changes to the core code now mean that this functionality should be in D7 with the imageapi_imagemagick module; see #269337: Support for more image types (PDF, TIFF, EPS, etc.).

Another point: it seems that some versions of Safari have a built-in PDF viewer, so JPEG or PNG files which have a .pdf ending aren't displayed, because the built-in viewer tries to display them as PDFs. At the moment, my best guess about what to do is to implement a wrapper module for imagecache that would map URLs such as imagecache_wrapper/files/test.jpg to imagecache/files/test.pdf ... but other ideas are welcome.

#27

ricklawson - September 10, 2009 - 10:09

I'm wondering if there is another way to approach this that might be more flexible.

In the flashvideo module, they have a flashvideo_cck module that takes the incoming video file in one cck field, converts it and sticks the .flv result into a second cck field. The module itself hides the appropriate fields on the input form from the authors.

If we were to implement the pdf --> .jpg system in a similar way, we would have access to the original pdf and also to the resultant jpg. It may even be possible to output multiple pages of the pdf into multiple occurrences of the jpg cck field.
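
To make the multi-page idea concrete, here is a minimal Python sketch that builds one ImageMagick convert command per page, using the `[n]` page index that also appears in the command logged in #13. The helper name `page_commands` and the output naming scheme are assumptions for illustration, and the page count is passed in rather than detected.

```python
from pathlib import PurePosixPath


def page_commands(pdf, page_count):
    """Return one convert argv list per PDF page (zero-based page index)."""
    source = PurePosixPath(pdf)
    commands = []
    for page in range(page_count):
        # Hypothetical naming scheme: report.pdf -> report-page1.jpg, ...
        target = source.with_name(f"{source.stem}-page{page + 1}.jpg")
        commands.append(["convert", f"{source}[{page}]", str(target)])
    return commands


for command in page_commands("files/report.pdf", 3):
    print(" ".join(command))
```

Each resulting file could then be stored as one value of the multi-value jpg field.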

I know what I'd like to do but I'd need guidance on how to do it. I am a willing volunteer to help, though...

#28

ricklawson - September 10, 2009 - 10:29

and I guess because the filename can then properly reflect the content, Safari will be okay...

#29

egfrith - September 10, 2009 - 11:12

Thanks for your comments ricklawson.

At present we do have access to the original PDF - it's at a location like /files/original.pdf . The problem is that the file ending of the resultant JPEG or PNG file is also .pdf

The solution you propose should fix this problem, but I'm wondering if it's more complicated than it needs to be? I was thinking of a bit of code that didn't have to insert anything into the database, but which would pretty much use the tools given by imagecache. Also, we would have to work out how to map different imagecache presets onto the CCK fields. And what would happen when the presets are altered or flushed?

It might be possible to create a module that effectively re-implements imagecache_cache() so that, if asked to generate presetname/files/original.jpg, it would look for /files/original.pdf if it couldn't find /files/original.jpg
http://drupalcontrib.org/api/function/imagecache_cache/6
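
The lookup I have in mind could work roughly like the Python sketch below. This is not imagecache code: `resolve_source` is a hypothetical name, a plain set of paths stands in for the filesystem, and the list of fallback extensions is an assumption.

```python
from pathlib import PurePosixPath


def resolve_source(requested, existing, fallbacks=(".pdf",)):
    """Return the source file to render for a requested derivative name.

    requested: path asked for, e.g. "files/original.jpg"
    existing:  collection of paths that exist (stands in for the filesystem)
    fallbacks: extensions to try when the requested file itself is missing
    """
    if requested in existing:
        return requested
    for ext in fallbacks:
        candidate = str(PurePosixPath(requested).with_suffix(ext))
        if candidate in existing:
            return candidate
    return None  # nothing to convert; the caller would return a 404


files = {"files/original.pdf", "files/photo.jpg"}
print(resolve_source("files/original.jpg", files))
print(resolve_source("files/photo.jpg", files))
```

With this approach the derivative can be served under a .jpg name while the original stays a .pdf, and nothing has to be stored in the database.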

#30

Boobaa - October 13, 2009 - 16:17

Subscribe

#31

anrikun - October 24, 2009 - 22:06

Please have a look at http://drupal.org/node/578804
If you're interested in it, I will post the code.

#32

iva2k - October 25, 2009 - 01:35

This would be an awesome feature to have. Please, think also of supporting other file types, or at least a roadmap to do it. Can it potentially employ a mimedetect module to recognize file types?

Once the feature is committed, I would be picking it up into iTweak Upload module. People there are requesting previews of other file types besides images #601896: Allow custom preview / thumbnail for attachments.

What I would like to have from imagecache is a function that returns TRUE if there is a preview image for any given file, either an image or any other supported type, like PDF. This will decouple nicely and make iTweak Upload's code support (without modifications) any future imagecache updates. See _itweak_upload_isimage() and itweak_upload_itweak_upload_preview() functions in itweak_upload.module - these are the ones I will modify/replace with corresponding imagecache call.
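
To illustrate the kind of call I mean, here is a small Python sketch of such a predicate. It is only a stand-in: `has_preview` and the `PREVIEWABLE` set are hypothetical, and the type is guessed from the file extension with Python's standard `mimetypes` module rather than by inspecting the file contents.

```python
import mimetypes

# Hypothetical set of types for which a preview image could be generated.
PREVIEWABLE = {"image/jpeg", "image/png", "image/gif", "application/pdf"}


def has_preview(filename, previewable=PREVIEWABLE):
    """Return True if a preview could be produced for this file name."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime in previewable


for name in ("brochure.pdf", "photo.jpg", "notes.txt"):
    print(name, has_preview(name))
```

The caller never needs to know which types are supported, so new formats can be added on the imagecache side without touching the calling module.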

@drewish
Before I get too excited: would you consider committing a final patch from this issue into the imagecache project? What would your requirements be?

#33

egfrith - November 9, 2009 - 23:53

There's now a solution for the problem with Safari (and some other browsers, it turns out) not displaying converted thumbnails. See #628146: Some browsers do not display converted images. The code is in the attachment - it's not committed to a module yet.

 
 
