Wednesday, October 27, 2010

Counting pages

I've been working on a Ruby on Rails project for a while. One area of it has morphed into a bit of document management, and for some users it is important to know how many pages a specific document has, at least for PDFs and TIFFs.

Well, ImageMagick is one approach, letting you load the document and then review its properties. But as anybody who has used it will know, unless you are careful, this can be a huge memory sink. In fact I use ImageMagick 'convert' as a way to force my machine to run out of memory during testing, to see if it fails gracefully.

So, I hunted around a bit and came up with these programs: tiffdump and pdfinfo. I also considered tiffinfo, although the 'rawness' of tiffdump just seemed more appealing when parsing out the data I needed.

To install them (on Ubuntu):

sudo apt-get install libtiff-tools poppler-utils

Then use the command line programs from Ruby, something like this:


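require 'webrick'
require 'shellwords'

path = '/home/someone/somewhere/somefile.xxx'
mime_type = WEBrick::HTTPUtils.mime_type(path, WEBrick::HTTPUtils::DefaultMimeTypes)
if mime_type == 'image/tiff'
  # tiffdump prints one 'Directory' line per page; count the newlines
  # (double quotes matter here: '\n' in single quotes is not a newline)
  return `tiffdump #{Shellwords.escape(path)} | grep 'Directory'`.count("\n")
elsif mime_type == 'application/pdf'
  # pdfinfo prints a line like 'Pages:          12'
  return `pdfinfo #{Shellwords.escape(path)} | grep '^Pages:'`.split(':')[1].to_i
else
  # whatever
end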


Not pretty, not clever, but a lot faster than RMagick, and a lot easier than the Ghostscript approaches I've seen discussed but never actually working.
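
For what it's worth, the same idea can be sketched without grep or a shell at all, which sidesteps quoting problems with odd file names. This is only a rough sketch, not what the project runs: the method names (tiff_pages_from, pdf_pages_from, run_tool, page_count) are made up for illustration, it picks the tool by file extension rather than by WEBrick's MIME lookup, and it assumes tiffdump prints one line starting with "Directory" per page and pdfinfo prints a single "Pages:" line.

```ruby
require 'open3'

# Count pages in tiffdump's output: assumes one "Directory N:" line per page.
# Anchoring to the start of the line avoids matching stray text elsewhere.
def tiff_pages_from(output)
  output.lines.count { |line| line =~ /^\s*Directory \d+:/ }
end

# Read the number from pdfinfo's "Pages:" line; nil if there is no such line.
def pdf_pages_from(output)
  line = output.lines.find { |l| l.start_with?('Pages:') }
  line && line.split(':', 2)[1].to_i
end

# Hypothetical helper: run a tool with an argument list (no shell involved)
# and return whatever it wrote to stdout.
def run_tool(*cmd)
  out, status = Open3.capture2(*cmd)
  raise "#{cmd.first} failed" unless status.success?
  out
end

# Returns the page count, or nil for file types this sketch doesn't handle.
def page_count(path)
  case File.extname(path).downcase
  when '.tif', '.tiff' then tiff_pages_from(run_tool('tiffdump', path))
  when '.pdf'          then pdf_pages_from(run_tool('pdfinfo', path))
  end
end
```

Keeping the parsing separate from running the tools means the parsing can be checked against saved output, without needing tiffdump or pdfinfo on the test machine.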

Tuesday, October 12, 2010

Chroot - ooh now I can run OpenOffice

I've been struggling with OpenOffice crashes ever since I started running Ubuntu 10.04 (Lucid). I've tried everything. I've added horrible red herrings to one of the many seemingly relevant bug reports on Launchpad. In the process I tried debugging (the debug symbols seem to be inadequate), and then I saw a discussion about recreating a bug from a previous Ubuntu version in a chroot-based basic installation. So I followed the instructions for creating a chroot with a basic Ubuntu installation, installed a few basic packages (nano, for example), and set up the en_US UTF-8 locale following Andrew Beacock's blog (necessary to install Java). I also had to add some archives for apt to pick up OpenOffice. Now I have Lucid running chroot'd inside Lucid.

I don't know chroot well and have heard of some issues around mounting disks, and I'm using ext4 with ecryptfs encryption for my home, which kinda gets in the way. So I went the roundabout route and mounted my home directory over ssh using sshfs (yes, I had to install both of these first).

Finally I installed openoffice.org-ubuntu and openoffice.org-human-style. And I finally ran ooffice and edited a document all day long with no crashes. I don't know if that's the fix, with the chroot letting me avoid some strange library conflict, or whether tomorrow is another day and another crash. But currently I like the chroot method for testing a clean install without actually doing a whole clean install, and without making it difficult to get at my documents, which I find a VM image tends to do. It took me about half an hour in total to get it working, without chewing up half my disk or half my memory. I like chroot for this. Hopefully it will keep me productive for a while.