Business Brains

CAPTCHAs now being leveraged to digitize the world's print books


Through an audacious crowdsourcing strategy, anyone who buys a ticket to an event or conducts an online transaction is helping to convert classic books to digital format -- to the tune of 100 million words a day.

We've all encountered the challenge-response test when ordering things online -- a couple of distorted words in strange fonts are displayed, and you have to retype them to verify that you are a living, breathing human being and not a bot. That's called a CAPTCHA, which stands for "Completely Automated Public Turing test to tell Computers and Humans Apart."

Luis von Ahn, associate professor of computer science at Carnegie Mellon University and original creator of the CAPTCHA challenge screen, had a brainstorm a couple of years back -- why not harness all that time and energy people are putting into retyping CAPTCHA codes, and put it to good use?

Now it is. Many of the CAPTCHA words presented to verify human end-users are actually taken from scans of classic print books -- words that optical character recognition (OCR) software could not read -- and farmed out to the people typing them for conversion to digital format.

As von Ahn put it at a recent TED presentation, there's a lot of potential energy and brainpower that can be harnessed out there:

"It turns out that approximately 200 million CAPTCHAs are typed every day by people around the world. When I first heard this, I was quite proud of myself. I thought, look at the impact that my research has had. But then I started feeling bad. See, here's the thing: each time you type a CAPTCHA, essentially you waste 10 seconds of your time. And if you multiply that by 200 million, you get that humanity as a whole is wasting about 500,000 hours every day typing these annoying CAPTCHAs. So then I started feeling bad."
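
The arithmetic behind that figure is easy to check. Here is a minimal Python sketch that uses only the two numbers von Ahn cites; the function name is made up for illustration, and both inputs are his approximations rather than measured values.

```python
# Figures quoted in the talk; both are rough approximations.
CAPTCHAS_PER_DAY = 200_000_000
SECONDS_PER_CAPTCHA = 10


def wasted_hours_per_day(captchas: int, seconds_each: int) -> float:
    """Total hours spent per day typing CAPTCHAs (hypothetical helper)."""
    return captchas * seconds_each / 3600  # 3600 seconds in an hour


hours = wasted_hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # 555,556 hours per day
```

That works out to roughly 555,000 hours a day, which is consistent with the "about 500,000" he quotes.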

Von Ahn and his team launched the "reCAPTCHA" project, which works with libraries and publishers to deliver images of scanned words to websites' security checks, essentially using the wisdom of the crowd to convert those words into text. While OCR technology automatically converts many words into digital text, it cannot recognize about 30% of the words in printed works more than 50 years old. "So the next time you type a CAPTCHA, these words that you're typing are actually words that are coming from books that are being digitized that the computer could not recognize," he says.
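
The article does not say how the typed answers are combined into a final transcription. One plausible way to picture the "wisdom of the crowd" step is a simple majority vote over what different people typed for the same scanned word. The Python sketch below rests on that assumption and is not reCAPTCHA's actual algorithm; the function name and both thresholds are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def resolve_word(answers: List[str],
                 min_votes: int = 3,
                 min_share: float = 0.6) -> Optional[str]:
    """Return the transcription most people agree on, or None if not settled.

    Assumption: a word counts as settled once at least `min_votes` people
    have answered and at least `min_share` of them typed the same thing.
    """
    # Ignore case and surrounding whitespace; drop empty answers.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None  # not enough answers yet
    word, count = Counter(normalized).most_common(1)[0]
    return word if count / len(normalized) >= min_share else None


print(resolve_word(["upon", "Upon", "upon ", "upom"]))  # upon
print(resolve_word(["morning", "mourning"]))            # None
```

In the first call three of four answers agree, so the word is accepted; in the second there are too few answers to decide, so the word would be shown to more people.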

Currently, reCAPTCHA is helping to digitize 100 million words a day, or the equivalent of about two and a half million books a year, von Ahn says.

"Every time you buy tickets on Ticketmaster, you help to digitize a book. Facebook: Every time you add a friend or poke somebody, you help to digitize a book. Twitter and about 350,000 other sites are all using reCAPTCHA."


Joe McKendrick

Contributing Editor

Joe McKendrick is an independent analyst who tracks the impact of information technology on management and markets. He is a co-author of the SOA Manifesto and has written for Forbes, ZDNet and Database Trends & Applications. He holds a degree from Temple University. He is based in Pennsylvania.