PDFZone Ziff-Davis Enterprise
Can Batch OCR Scanning and Image PDF Search Coexist?
By Elizabeth Millard

San Francisco-based financial planner uses Adobe's Paper Capture tool to ensure that text in images is searchable once scanned. Plus, PDFzone checks out some other options.

A rhetorical question for the Internet age: If you can't search on the contents of a digital document, is it still better than paper? For Michael Altschul, IT manager at San Francisco-based financial planning firm Kochis Fitz, the answer is both yes and no.

With the proliferation of desktop search technology, Altschul has been excited about including search in his company's budget for the coming year. So far, he said, everything works magically with desktop search tools, including Word, Excel, e-mail, and non-image PDFs. But there's one major area that's lacking: image PDF search in conjunction with batch OCR (optical character recognition).


Like many companies that are awash in paper files and client documents, Kochis Fitz has been striving to cut down on paper use. Toward this goal, about 80 percent of all the files on the network are scanned PDFs. But Altschul has not found a search product that adequately searches on the text contained within images. The ability is vital, he noted, because many of the images within documents contain relevant information on client accounts.


The lack of image search capability is hindering the company from going fully paperless, Altschul said.

"At present, we scan client, mission-critical documents only," he noted. "For these documents, the accessibility of paper files is almost on par with electronic accessibility for the lay user, so we don't gain anything by scanning documents pertaining to the day-to-day firm management."

If the company could easily access scanned documents via keyword searches, Altschul said, then scanning everything in the office would be a snap.

Some Solutions

The difficulty with image PDFs is that when they are created from a scanner or an online fax service, the file doesn't contain any text information. Like a photograph, the PDF is just a snapshot of the document. That makes it tough to tease out searchable elements, such as individual pieces of information contained in the image.
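As a rough illustration of the difference, the sketch below flags PDFs that are probably image-only by checking whether the raw file mentions any font resources, which a text layer normally requires. This is a heuristic under stated assumptions, not how any product named in this article works: the function name looks_image_only is hypothetical, and a file that stores its objects in compressed streams can hide its font entries, so a real check would need a proper PDF parser.

```python
def looks_image_only(pdf_bytes):
    """Heuristic guess: True if the PDF appears to have no text layer.

    Assumption: searchable text in a PDF references a font resource,
    so a file with no "/Font" entry anywhere is likely a bare scan.
    Compressed object streams can hide such entries, so treat a True
    result as "probably needs OCR", not as proof.
    """
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("not a PDF file")
    return b"/Font" not in pdf_bytes
```

A file flagged this way would be a candidate for OCR before a desktop search tool could be expected to find anything in it.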

When Altschul went looking for vendors to solve the problem, he found few that could address the challenge, but one did seem up to the task.

Adobe Systems Inc.'s Acrobat Professional, version 6.0 or 7.0, can do the conversion using the Paper Capture tool, either on individual files or in batch mode. The Standard version of the application has the OCR tool but doesn't include the batch mode.

"What I like about Capture is that it seems to write the OCR-[translated] text to the PDF file, such that desktop search utilities can recognize text," said Altschul. "The other vendors I researched lacked such tight integration."

Another option is Google Desktop Search, which Altschul has been considering in place of Capture. Released in March, the Google utility supports PDF search for the first time.

To take advantage of the technology, ScanSoft Inc. has developed the OmniPage Search Indexer, a plug-in that mainly supports PDFs containing text, but also has the ability to use OCR and index image-based PDFs that have scanned text. The application returns results on a locally served page.

ScanSoft's OCR technology recreates the text information from the image content without doctoring the original file. This is crucial, according to the company, because many images have to be preserved as is for legal or regulatory reasons to show the authenticity of a receipt, contract, or piece of correspondence.


One more tool up for consideration is AutoCapture-X4 from Acrobotics Inc. According to the company, the tool fully automates OCR on an unlimited number of PDF files. It continuously monitors complex folder and subfolder structures and runs automated, unattended OCR on PDF files that are added to any folder or subfolder. AutoCapture-X4 supports and automates all Acrobat 7.0 Paper Capture features.
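The folder-monitoring idea can be sketched in a few lines. The example below is a minimal illustration only, not AutoCapture-X4's actual method: the function name find_new_pdfs is hypothetical, and it simply walks a folder tree and reports PDFs it has not seen before, leaving the OCR step to whatever tool is in use.

```python
import os


def find_new_pdfs(root, seen):
    """Return PDFs under root (including subfolders) not yet in seen.

    Newly found paths are added to the seen set, so repeated calls
    report each file only once.
    """
    new_files = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(folder, name)
                if path not in seen:
                    seen.add(path)
                    new_files.append(path)
    return sorted(new_files)
```

Calling this on a timer and handing each returned path to an OCR tool would approximate unattended processing. A production tool would also need to handle files that are still being written and to remember processed files across restarts.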

Once Altschul has PDF image search in place, he's confident that he can move closer to a goal of implementing a utility that uses OCR to batch-convert all of the company's scanned PDFs.

An application that can handle both tasks is FineReader, from Abbyy Software House. The 7.0 Professional Edition offers enhanced PDF recognition accuracy and functionality, according to the company.

"Individually, these products are cool," Altschul said. "Collectively, they would quickly become invaluable."


