Wednesday, October 27, 2010

Counting pages

I've been working on a Ruby on Rails project for a while. One area of it has morphed into a bit of document management, and for some users it is important to know how many pages a specific document has in it. At least for PDFs and TIFFs.

Well, ImageMagick is one approach, letting you load the document then review its properties. But as anybody who has used it will know, unless you are careful, this can be a huge memory sink. In fact I use ImageMagick 'convert' as a way to force my machine to run out memory during testing, to see if it fails gracefully.

So, I hunted around a bit and came up with these programs: tiffdump and pdfinfo. I also considered tiffinfo, although the 'rawness' of tiffdump just seemed more appealing when parsing out the data I needed.

To install them (on Ubuntu):

sudo apt-get install libtiff-tools poppler-utils

Then use the command line programs from Ruby, something like this:


path = '/home/someone/somewhere/somefile.xxx'
mime_type = WEBrick::HTTPUtils.mime_type(path, WEBrick::HTTPUtils::DefaultMimeTypes)
if mime_type=='image/tiff'      
  return `tiffdump '#{path}' | grep 'Directory'`.count('\n')
elsif mime_type=='application/pdf'      
  return `pdfinfo '#{path}' | grep 'Pages'`.split(':')[1].chomp.to_i
else
  # whatever
end


Not pretty, not clever, but a lot faster than RMagick, and a lot easier than the Ghostscript approaches I've seen discussed but never actually working.

No comments:

Post a Comment