Well, ImageMagick is one approach, letting you load the document and then review its properties. But as anybody who has used it will know, unless you are careful, this can be a huge memory sink. In fact I use ImageMagick 'convert' as a way to force my machine to run out of memory during testing, to see if it fails gracefully.
So, I hunted around a bit and came up with these programs: tiffdump and pdfinfo. I also considered tiffinfo, although the 'rawness' of tiffdump just seemed more appealing when parsing out the data I needed.
To install them (on Ubuntu):
sudo apt-get install libtiff-tools poppler-utils
Then use the command line programs from Ruby, something like this:
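require 'webrick'
require 'shellwords'

def page_count(path)
  mime_type = WEBrick::HTTPUtils.mime_type(path, WEBrick::HTTPUtils::DefaultMimeTypes)
  escaped = Shellwords.escape(path)  # guard against quotes and spaces in the path
  if mime_type == 'image/tiff'
    # tiffdump prints one "Directory" line per page
    `tiffdump #{escaped} | grep '^Directory'`.count("\n")
  elsif mime_type == 'application/pdf'
    # pdfinfo prints a line such as "Pages:          12"
    `pdfinfo #{escaped} | grep '^Pages:'`.split(':')[1].to_i
  else
    # whatever
  end
end

page_count('/home/someone/somewhere/somefile.xxx')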
Not pretty, not clever, but a lot faster than RMagick, and a lot easier than the Ghostscript approaches I've seen discussed but never seen actually working.
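If the backticks and grep feel too fragile, the same idea can be sketched without a shell at all. This is only a rough variant, not a tested drop-in: the helper names (count_tiff_directories, parse_pdf_pages, run_tool) are made up for illustration, it picks the tool from the file extension rather than WEBrick's MIME table (WEBrick is no longer bundled with Ruby since 3.0), and it assumes both tools exit with a non-zero status on failure, which I have not verified for every error case.

```ruby
require 'open3'

# Count pages from the text tiffdump prints: one "Directory N: ..." line per page.
def count_tiff_directories(tiffdump_output)
  tiffdump_output.lines.count { |line| line.start_with?('Directory') }
end

# Extract the number from pdfinfo's "Pages:   N" line; nil if the line is missing.
def parse_pdf_pages(pdfinfo_output)
  match = pdfinfo_output.match(/^Pages:\s+(\d+)/)
  match && match[1].to_i
end

# Run a command with its arguments passed separately, so no shell is involved
# and the path needs no quoting. Returns stdout, or nil on failure.
def run_tool(*argv)
  stdout, _stderr, status = Open3.capture3(*argv)
  status.success? ? stdout : nil
rescue Errno::ENOENT
  nil # the tool is not installed
end

# Returns the page count, or nil for unknown types or failed commands.
def page_count(path)
  case File.extname(path).downcase
  when '.tif', '.tiff'
    out = run_tool('tiffdump', path)
    out && count_tiff_directories(out)
  when '.pdf'
    out = run_tool('pdfinfo', path)
    out && parse_pdf_pages(out)
  end
end
```

Keeping the parsing in its own small functions means it can be checked against captured output without having the tools or any sample files around, and anchoring on "Pages:" at the start of a line avoids false matches such as a document title that happens to contain the word.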