And yes, they could make scanned-text PDFs searchable with on-the-fly OCR, but that's not quite ready for prime time yet.
Right now, desktop searching is one of those emerging technologies that
in theory (looking through both your hard drive and the Web at the same time for
relevant hits) sounds life-changing but in practice still seems like the MP3
player before the iPod came out: Yeah, it works, but no one uses it.
Microsoft and Apple are building desktop search features into future
versions of their operating systems (respectively code-named Longhorn and
Tiger), and search engine giants Google and Yahoo have their own
branded desktop search utilities. Even ScanSoft gets into the discussion with
its PaperPort OCR/desktop organizer software, which doesn't search the Web but
overlaps with a lot of the desktop search capabilities that the others offer.
And then there's Blinkx, a software company taking on all of the above.
Some of this startup's competitors work only on one platform or, in the case of
Google, don't search PDFs.
We caught up with Blinkx's founder and CTO Suranga Chandratillake to chat
about PDFs, searching them, why he thinks PDF users will like Blinkx and what on
earth the company was thinking earlier this month when it issued a Blinkx
Mac beta to coincide with Macworld and Steve Jobs' trumpeting of Spotlight,
Tiger's desktop search package.
Don Fluckinger: Is searching a PDF technically more challenging than
searching other documents?
Suranga Chandratillake: No, not really. If anything, they're slightly better.
Indexing is pretty much the same thing, but once you've got a search, you can
highlight text inside a PDF. The words that you search for are highlighted in a
PDF, which you can't do in Word, for example.
Fluckinger: Do we already have more PDFs than we can organize on our hard
drives? Will we have more in the future?
Chandratillake: It's an extremely
popular format, particularly in the business context--everything from sales
orders and proposal letters all the way to ad copy and brochures. ... Being able
to index them and sort through them is critical. There's no way we could have
launched a product without support for PDF.
PDF is definitely as significant as any Microsoft Office format. In the
surveys and analyses that we've done, the biggest data type, by far, is e-mail,
which can be up to 60 percent of the average person's data. ... The other 40
percent are split between the productivity formats. There are some exceptions.
You do get designers, for example, that have a lot of CAD files, but for the
average office worker we see a split--by file size--of 35 percent to 40 percent
of what's left is PDF, and the rest is split among the Windows Office
formats.
I think that PDFs are going to get extremely popular. Think of the things
we buy on the Web and the services we pay for on the Web. I pay my phone bill,
cable bill and power bill on the Web. All these people send me PDFs. People are
just going to keep using it more and more and more. Sure, at the moment it's
companies sending things off to individuals, but that inevitably leads to
individuals sending things to each other. That all points to a massive increase.
Finding the right PDF at the right time gets to be a bigger and bigger problem,
and that's where Blinkx wants to step in.
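For readers who want the arithmetic behind the split Chandratillake describes, here is a minimal sketch. It assumes his upper-end figure (e-mail at 60 percent of a person's data) and treats his "35 percent to 40 percent of what's left" as PDF's share of the non-e-mail remainder, by file size. The function and constant names are illustrative, not anything from Blinkx.

```python
# Shares quoted in the interview, treated as rough fractions of total data by size.
EMAIL_SHARE = 0.60                 # assumption: the "up to 60 percent" upper figure
PDF_SHARE_OF_REST = (0.35, 0.40)   # PDF's slice of whatever is not e-mail


def pdf_share_of_total(email_share, pdf_share_of_rest):
    """Return PDF's share of all data, given its share of the non-e-mail remainder."""
    return (1.0 - email_share) * pdf_share_of_rest


low, high = (pdf_share_of_total(EMAIL_SHARE, s) for s in PDF_SHARE_OF_REST)
print(f"PDF share of total data: {low:.0%} to {high:.0%}")
```

Under those assumptions, PDF comes to roughly 14 to 16 percent of an average office worker's total data. If e-mail takes up less than 60 percent, PDF's overall share rises accordingly.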
Fluckinger: How does Blinkx deal with password-protected PDFs or files that
are otherwise rights-managed?
Chandratillake: We can't search through
anything that's encrypted. If you don't have the password, there's no way for us
to get to it. If it's in a secure directory or a secure file-share, but you have
access to it, then we can still see that file.
We can index any version of PDF, including pre-version 4 [of Acrobat],
which were actually pretty different formats than the ones that are popular
today. The only ones we don't support are those that don't have any text, just
images. We do index metadata, however, so if those PDFs have metadata embedded
like the company name or author name, it will draw that out.
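The rules in that answer can be sketched as a small decision function. This is an illustration only: `PdfInfo` and `indexable_parts` are hypothetical names, not Blinkx's code or any real PDF library's API, and it assumes that an encrypted file becomes readable once the password is known.

```python
from dataclasses import dataclass, field


@dataclass
class PdfInfo:
    """Hypothetical summary of one PDF; not a Blinkx or Acrobat data structure."""
    encrypted: bool
    password_known: bool
    has_text: bool
    metadata: dict = field(default_factory=dict)


def indexable_parts(pdf: PdfInfo) -> list:
    """Return which parts of the PDF an indexer could read under the stated rules."""
    if pdf.encrypted and not pdf.password_known:
        return []                  # encrypted with no password: nothing is readable
    parts = []
    if pdf.has_text:
        parts.append("text")       # ordinary text PDFs are indexed in full
    if pdf.metadata:
        parts.append("metadata")   # author, company name, etc., even for image-only files
    return parts


# An image-only scan with embedded metadata, as in the fax example.
scanned_fax = PdfInfo(encrypted=False, password_known=False, has_text=False,
                      metadata={"author": "Example Corp"})
print(indexable_parts(scanned_fax))
```

In this sketch an image-only PDF still yields its metadata, while an encrypted file with no password yields nothing at all.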
Fluckinger: PDFzone users typically have PDFs of faxes or scanned paper text
pages. Will future versions of Blinkx be able to do OCR and search these too?
Chandratillake: We can do it, but it's
very difficult for us to release. We've played around [with a freeware OCR
engine plug-in]. That essentially works, but we need to find a good OCR engine
and see if we can license it.
And secondly, everything gets a lot bigger. Good OCR engines tend to be
10 to 15MB in size, which completely blows our download size out of the water;
right now we're around 5MB. It's definitely possible, but right now it's not
ideally structured [for Blinkx]. Once it becomes available in the right way,
we'll definitely do it.
Fluckinger: Why would Blinkx release a Mac-side utility with Apple building
similar features into Tiger?
Chandratillake: The main reason is
because people asked us to do it. We got lots of e-mails, really from Day One of
the PC release, saying, "When are you going to do a Mac version?" Funny thing
is, we talked to a lot of journalists, bloggers, writers of various sorts and
analysts about the technology in the earlier days, and many of them are closet
Mac users and said their personal stuff was on the Mac. Because of that massive
demand, we always were going to build one.
The thing with Spotlight, it's a phenomenal operating system-level
metadata-based search engine, but it doesn't do all the things Blinkx does. When
it comes to keyword search or search based on metadata, and doing it very fast,
I think Spotlight will rapidly become the industry standard on the Mac.
The Blinkx toolbar does a whole bunch of other things: conceptual
linking, which automatically looks at text and links you to text without you
searching for it, and smart folders. I think when Spotlight comes out it will
take care of the straightforward search and we will be seen as the complementary
tool that does some stuff alongside it.