How To: Digitise a book

Got a book that you want a digital copy of? No option but to do it yourself? Read on.

I have loads of great books, quite a few of them textbooks, that are long out of print and never received a digital release. In the interests of preserving them, I’d like to take a few important excerpts from them as digital copies to refer to from time to time. Doing this is an absolute pain in the arse though – for some books the only option is to do it yourself, and it’s a time-consuming process that often ends up as an exercise in frustration.

But it can be done! It just takes a bit of patience. Here’s how you can do it.

Wait: Is this legal?

Well, that depends on your local jurisdiction. Digitising an entire book probably isn’t legal. I’d go so far as to say it “almost certainly” isn’t legal, but hey, there might be a strange law in a strange land I’m not aware of. In most sane jurisdictions, though, copying a section of a book for personal or academic use might fall under ‘fair use’ laws. So that’s the legality of it – check your local copyright laws.

That said, if it’s a book you purchased, and you’re not going to share it with anyone, chances are nobody will know except you. If you go share it on a torrent network… well, be it on your own head if things go awry. It’s not up to me what you do with this knowledge.

Can’t I just use a scanner?

Flatbed scanners are great… at copying sheets of paper. Your average consumer scanner handles a few flat pages well, but it’s typically slow (the inexpensive ones at least). Scanners do have some inherent advantages: the lighting is even, the resolution is good (because the page sits flat against the glass), and they don’t suffer from motion blur. But unless you’re prepared to disassemble the book into individual sheets of paper, it’s probably not going to work, particularly if the book happens to be thick. You’ll end up with guttering (a black void where the edge of the page meets the binding) and warped text, and towards the gutter the image will become darker and discoloured.

So no, for some books this simply isn’t an option.

What you’ll need

In order to get this right, at a minimum you’ll need:

  1. A camera. A DSLR or a high-end smartphone camera will do; it’s up to you.
  2. An outdoor space.
  3. A PDF program or similar to process and assemble the images.

By far the easiest and quickest way is to simply use your smartphone outside. Why outside? By shooting the pages outdoors, you get even ambient lighting, making it much easier to get a good image of the page. Shooting indoors should be avoided, unless you’re willing to rig up a special space with appropriate lighting.

There are plenty of smartphone apps that can be used as ‘scanners’ – these can be a help or a hindrance. For iOS I used Scanner Pro – while it chewed through battery life and turned the phone into a hotplate, it had the best post-processing of any of the scanner apps I tried. Post-processing can help a lot – even if it’s just correcting geometric distortion. Scanner Pro also generated a PDF for me that I could easily bring into Adobe Acrobat. Post-processing apps will also try to brighten the background and increase contrast with the text – making it easier to process the page into a PDF or readable format later on.
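If you’re curious what “brighten the background and increase contrast” means in practice, here’s a rough sketch of one common approach, a percentile-based contrast stretch. To be clear, I don’t know what Scanner Pro or any other app actually does under the hood; the function name, the 5%/95% cut-offs and the tiny sample “page” are all assumptions made up for illustration. The page is just a grid of greyscale values, where 0 is black and 255 is white.

```python
def stretch_contrast(page, low_pct=5, high_pct=95):
    """Remap greyscale values so dark text moves towards 0 and the paper towards 255.

    `page` is a list of rows of 8-bit greyscale values (0 = black, 255 = white).
    The percentile cut-offs are illustrative defaults, not taken from any app.
    """
    values = sorted(p for row in page for p in row)
    n = len(values)
    lo = values[(n - 1) * low_pct // 100]    # roughly "ink" level
    hi = values[(n - 1) * high_pct // 100]   # roughly "paper" level
    if hi == lo:
        # Completely flat image: nothing to stretch, return a copy.
        return [row[:] for row in page]

    def remap(p):
        scaled = (p - lo) * 255 / (hi - lo)
        return max(0, min(255, round(scaled)))  # clamp to the 8-bit range

    return [[remap(p) for p in row] for row in page]


# A dull, greyish "page": paper around 180-200, text around 60-70.
dull_page = [[180, 190, 60],
             [185, 70, 200]]
print(stretch_contrast(dull_page))
```

After stretching, the paper values sit close to pure white and the text close to pure black, which is the kind of separation that makes OCR easier later on.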

Photographing the pages

Down to business! When you’re outside, position the page as flat as possible. This will help to avoid distortion. While software can correct some distortion, by removing as much of it as possible to begin with, you’ll reduce the amount of work required later on. You might need to lift up the opposite part of the book to get the page to flatten out.

Also ensure that there’s as little glare as possible on the image – in fact, aim for no glare. If you’re outside this might just require some shifting around to get the right angle from the sun. If you’re indoors, this will be a lot more challenging – you need the scene to be well lit, with the lights carefully positioned to avoid glare. This process is a bit beyond the scope of this article, but there are lots of examples online. If you’ve only got room lighting and can deal with a poorer quality image, you can always stand the book upright and photograph it that way.

Avoid shadows! Shadows will show up on the final image and cause uneven lighting on the final scan, and can affect post-processing. Some apps can process the image so that shadows are removed, but if you need to retain colours or an accurate reproduction of an image on the page, the shadow will show up.

Avoid using the flash, particularly with glossy pages! The flash tends to overexpose some areas, leading to text discolouration (or worse – the text disappearing entirely!), and it introduces glare as well. Glossy pages are especially prone to this problem. When shooting outside you won’t need a flash, which is one more reason I recommend shooting outside.

If you’re using an app, and it offers multiple image modes, consider picking one appropriate for the content. Scanner Pro (iOS) offers a few modes: Black and White, Colour Document, Colour Photo, and Grayscale Photo. I was taking excerpts from a textbook, so I used a mixture. If all you’re dealing with is text on a white background, something like Black and White works well because it’ll turn the background white and make the text stand out nicely – making it much easier to process the text later on, and removing artefacts like shadows. It can also help to reduce file sizes.
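For the curious, a Black and White mode boils down to picking a brightness cut-off and pushing every pixel to pure black or pure white. The sketch below uses Otsu’s method, a standard way of choosing that cut-off automatically. This is not how Scanner Pro is known to work (I have no idea what it uses); the function names and the sample data are hypothetical, and the page is again just a grid of greyscale values.

```python
def otsu_threshold(pixels):
    """Pick the cut-off that best separates dark pixels from light ones.

    `pixels` is an iterable of 8-bit greyscale values (0 = black, 255 = white).
    Returns t such that values <= t are treated as ink, values > t as paper.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark = 0    # count of pixels at or below the candidate cut-off
    sum_dark = 0  # sum of those pixels' values
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one side is empty, so this cut-off separates nothing
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def to_black_and_white(page):
    """Turn a grid of greyscale values into pure black (0) and white (255)."""
    t = otsu_threshold(p for row in page for p in row)
    return [[255 if p > t else 0 for p in row] for row in page]


# Paper with a slight shadow (200-215) and dark text (35-40).
shadowed_page = [[200, 210, 40],
                 [205, 35, 215]]
print(to_black_and_white(shadowed_page))
```

Notice that the slightly darker “shadowed” paper values end up just as white as the rest, which is why this mode is good at hiding mild shadows on text-only pages, and also why it’s a poor choice when you need to keep colours or photos intact.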

If your app offers batch shooting, use it – this will allow you to shoot all the images one after the other, assembling them into a single file (usually a PDF) at the end of it.

Processing it into a readable file format

Most apps will spit out a PDF with an inflated file size, so you’ll want to process it to get the size down. How much processing you do will depend on what you want to do with the file. We’ll start with something simple.

Simple PDF Processing: Adobe Acrobat Pro

Acrobat Pro (or a similar PDF editing program) will be fine if all you want is to output a PDF that matches what you put into it. The best way to use Acrobat is to run its Optical Character Recognition (OCR) engine over the images. To do this, open up the file it generated and go to Tools > Optimise PDF > Optimise Scanned Pages.

This brings up a new dialogue box. Tick “Recognise Text” in the bottom third of the box, and then click the EDIT button. The “Output” combo box will usually have “Searchable Image” selected. Don’t use this – all this does is overlay OCR’d text on top of the original image, which just makes the file larger again. Instead, select “Editable Text and Images”. This will OCR the text and then discard much of the underlying image, dramatically reducing the file size. It does keep the text fairly close to the original image though – so any inherent distortion in the image will be left in place. This is why it’s so important to try to get a good, flat, straight shot the first time – it reduces the work the software has to do later on.

Don’t try to use image compression. On raw pages straight from your scanning app, it’ll process the entire page and blur the text, turning the pages into a mess.

This approach works especially well for pages from a textbook or books with a specific format that output well to a PDF. But if you want to take it even further, either reducing the file size yet again or turning it into an ePub file, there’s another option…

Big Guns: ABBYY FineReader

Acrobat Pro isn’t bad at OCR on most documents, but if you really want to process the hell out of those files, FineReader is the best in the business. It has a much more powerful OCR engine, built-in image editing tools to process pages (straightening text, removing distortion, and other tools to fix up images), and a much more powerful set of tools to manage content on the page. You can also export to multiple layouts and file formats – you can, for example, export to an ePub file.

If all you want is a PDF that matches the one you put in, FineReader still offers a few additional advantages. Firstly, its OCR engine tends to be a bit more powerful than Acrobat’s; if Acrobat can’t get a decent recognition on a section of text, it’ll just give up and leave it as an image, while FineReader will at least have a go. Secondly, FineReader will remove a lot more of the background than Acrobat – it’ll actively pick out text and images and discard everything else. This translates into a significantly smaller file size than anything Acrobat can create.

That said, this is a tedious process. While FineReader can do a lot of things without much input, to get good results you need to have excellent quality images to begin with. Sometimes FineReader can’t deal with special in-text images or symbols – but instead of recognising this, it’ll have a go anyway and end up screwing up the formatting. Sometimes you’ll just have to manually go through it and block out sections of text or images to get the results you want. Even then, every so often it’ll screw up text in some almost inconceivable way – and you’ll need to go through and fix it.

Outputting to ePub or similar can be a chore too – unless the book you’re processing is just straight text without too many images, you’ll probably need to do a lot of manual formatting, or it’ll end up looking like a bad Kindle ebook. But this is probably the best way to get the job done properly.

A few words of advice

In doing this a couple of times, I’ve learned that the best way to make this a smooth process is to do it in stages, and to make sure I get the best-quality images I possibly can. I’d shoot one chapter at a time, process it, and then combine them all into a single PDF later on. Yes, it’s a pain in the arse to shoot the pages carefully one at a time – but there is no substitute for a well-shot page. Software can help but it can’t really give you what wasn’t there to begin with. Outside of some fairly simple distortion correction, any other post-processing you do will affect the image in some other way – it’s a trade-off.
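One small gotcha when you work a chapter at a time: when you come to combine the files, a plain alphabetical sort puts “chapter10” before “chapter2”. Here’s a little sketch of a “natural” sort that gets the order right before you feed the files to whatever tool does the merging. The file names are made up for the example, and the merging itself is left to your PDF program.

```python
import re


def natural_key(name):
    """Sort key that compares runs of digits as numbers, not as text."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so the keys always compare text with text and numbers with numbers.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


# Hypothetical chapter files, one per shooting session.
chapter_files = ["chapter10.pdf", "chapter2.pdf", "chapter1.pdf"]

print(sorted(chapter_files))                   # plain sort: chapter10 comes too early
print(sorted(chapter_files, key=natural_key))  # chapter1, chapter2, chapter10
```

Alternatively, just name your files with leading zeros (chapter01, chapter02, and so on) from the start and the problem never comes up.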

Best tip of all though? Shoot outside. Seriously, makes lighting a lot easier.

Have fun!


