Google Books may want to rule the world, but can they actually produce as promised?

Here’s an admission for you: I have a hard time comprehending all of the details around what Google Books is up to. I gather they’re scanning everything (though not necessarily well — more on that in a sec) ever published, posting the out-of-copyright stuff, linking to snippets of the other stuff, and there are some folks objecting to that, so there’s a lawsuit, and a settlement in the works. This article, The Audacity of the Google Book Search Settlement, caught my eye today, and laid out some of the legal hoop-jumping, so that helped. And I know that the American Library Association has been having their say, as they generally do (not a fan, but that’s another post). I’m just not sure what I think of it all.

I digitized a book myself last year: A History of the Town of Amherst, New York, 1818-1865, which is long out of print and which the Town of Amherst holds the copyright to, so no worries there. It was a painful, arduous process, and even though I had a grant to do it, oy, I’m not sure I’d gear up for that again. But there’s Google Books, with tons of resources, just scanning and posting away, and doing all the heavy lifting, so that’s good, right? For the same reason I scanned my one book (it’s useful to local researchers and there are only a few copies out there, so this way anyone can access it), Google Books is scanning, well, all the books.

But — I’ve been playing around with the Barnes & Noble ereader app for the iPhone, and used it to download a handful of free ebooks from Google Books. The results have all been poor. Emma chopped off a few words at the beginning and had all kinds of bad characters, poor OCR. Ditto for Persuasion. Virgil’s Aeneid wasn’t so bad. But Anna Karenina was missing the first four chapters — completely missing them. I was appalled. The sloppy OCR I can try to get past (though…) but leaving out four chapters?!

  1. That was interesting. I guess no one is actually checking the work. Making mistakes like that is easy when scanning, but only if the scanner is not being very careful. Still, it is disappointing.

    • When you scan with OCR (optical character recognition), the better the software, the better the result. But you can never trust it to get everything. I should clarify that Google Books says upfront in its disclaimer that there will be errors, bad characters, etc., and they go on about perfecting their software to reduce those in the future. According to Google Books, these errors “will not detract from your enjoyment of the book”. I beg to disagree, especially if by “error” they mean “might be missing four chapters”!

  2. In most cases, the page is presented to the browser as an image, and the OCR layer is not visible, but used only for full-text searching of the content. Each word in the OCR layer is mapped to the image coordinates on the pages where that word appears, which allows highlighting the word on the image. But for ebook readers, my gut instinct is that they need the OCR for display purposes, because downloading only the text requires a tiny fraction of the bandwidth needed for the image-plus-text.

    By the way, there’s a new website about Google’s book scanning:

  3. Google has certainly undertaken a massive project. The results make me wonder how many project management resources they assigned to it – something that often runs through my mind in this kind of situation given that June is a PMP. lol.

    At any rate – ITA that missing four chapters from a book would diminish your enjoyment. No kidding. Apparently Google is more interested in quantity than quality at this point? It seems like a slipshod way of going about things.

  4. As an author, I’m not sure how I feel about this. I understand they’re only showing a part of the books that are copyrighted, but it still feels a little “infringey” to me. Maybe it helps sell a book, I don’t know–I’d have to do more research to understand whether people who actually read a snippet then feel compelled to buy the book. Hard to say. From a purely business perspective, that would be the only reason for an author to want a snippet featured, anyway.

    As for the older works, it’s a shame that they’re not coming through in a complete way. Quality is always better than quantity, in my opinion.

  5. Joe, I’m a neophyte with such things. I think I understand you to say that if ereaders were using the image scan instead of the OCR we’d get better results?
    Thanks for the link. I’m trying to keep up with this issue, but it’s fairly complex — sites like that help to gather all the info together.

