trenchant.org

by adam mathes

On Fishers

My friends who founded KnowNow are very smart. Adam Rifkin and Rohit Khare want systems that provide instant search for their email, and really, all of their data. Adam dubs this a Fisher -

"I find myself wondering if there is still an opportunity to launch a desktop search product that fits the classic definition of platform. The equivalent of a "Browser" for the next decade that brings together existing disparate tools by mixing SMTP and HTTP and throws in a healthy dose of instant messaging and RSS -- except that instead of browsing for information it lets you go (for lack of a better word) "Fish" for information. It's got a simple browser interface and query language (like Google), is lightning fast (due to regular re-indexing), and offers search results of your personal stuff in that simple UI.

Rohit is three steps ahead of me here -- that any good "Fisher" of all your emails, IM's, desktop files, web history, and RSS feeds needs a great algorithm to rank the results of your "Go Fish" queries. (Ranking is something Zoe doesn't do, and therefore it cannot handle the volumes of email I receive daily.)"

Fisher as Product Category -

"There is no ranking that makes the results better than grep. At least (unlike Zoe) X1 and Bloomba return matches in order of most-recent-first, but for quality search results, Google has proven to me that ranking is of utmost importance. There is no ranking algorithm a la PageRank that acknowledges even the simplest truths about my mail (stuff from Rohit ranks higher than Orkut notifications, say :).

Many folks can't even imagine 100K messages, but I'm closer to a million (!). Sounds absurd, sure, but the design target for Microsoft Longhorn is supposedly 1 terabyte PCs! And that speaks to another basic criticism of all the aforementioned tools: email may be the center of my universe, but not the entirety of it. How about an "image search" of my hard drive that didn't require me to laboriously pre-caption each photo? Or a "version search" of our latest spec sheet that doesn't trip up on the fact that there are 32 separate Word attachments that all contain the same paragraph over the last year? Or a way to search all the web pages I've visited before? There's hard drive to spare -- why not cache everything?"

This is a hugely important problem. It’s a software problem, an interface problem, and fundamentally an information science problem. (But I’m biased, that’s what I study now.)

I’ve tried to deal with this issue in the domain of web pages when I prototyped Flick, and currently I’ve been using Furl for a similar end.

I don’t think caching everything is the answer, at least for web pages. The important thing there is distinguishing the “good” bits necessary to index and keep from the “bad” bits that you accessed but were not relevant, or didn’t answer your search query, or only have temporal value. This requires more work, but is almost second nature to people who have been surfing the web and saving pages or weblogging for a long time. Having just the good bits indexed and searchable by keyword probably solves a lot of the “I saw this web page about X and I need to find it again,” or “show me the important pages I read that solved X problem” if you make a moderate effort to mark the good bits as you realize they are good.
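To make the "index only the good bits" idea concrete, here is a minimal sketch of a keyword index that only knows about pages you explicitly mark. It is not how Flick or Furl work; the class name `GoodBitsIndex`, its methods, and the example URLs are all made up for illustration.

```python
import re
from collections import defaultdict


class GoodBitsIndex:
    """Tiny inverted index over pages the user has marked as worth keeping."""

    def __init__(self):
        # word -> set of URLs whose saved text contains that word
        self._postings = defaultdict(set)

    def mark_good(self, url, text):
        """Index a page at the moment the reader decides it is a 'good' bit."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[word].add(url)

    def search(self, query):
        """Return the marked pages that contain every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        results = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            results &= self._postings.get(word, set())
        return results


index = GoodBitsIndex()
index.mark_good("http://example.com/fix-dns", "How to fix DNS caching problems")
index.mark_good("http://example.com/bread", "A sourdough bread recipe")

found = index.search("dns caching")
missing = index.search("bread dns")
```

Pages that were visited but never marked simply never enter the index, which is the whole point: the filtering is done by the reader, not the search engine.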

The problem with email management, and the larger world of the unindexed desktop, is much harder. First, due to the braindead nature of most operating systems and 1970s-era file systems, separating the “good” bits from the “bad” bits isn’t as easy in a world without standard file-level metadata. But at least in the email world, those messages that aren’t deleted could be considered “good.”

But this doesn’t help much as the sheer number of messages that accumulate over the years becomes daunting. Even if you index them and search them by keyword instantaneously (a mere issue of programming) you haven’t solved the problem. Retrieval isn’t that hard: as they discuss, ranking the results is the really hard part.

I think ranking is part of the problem, but not all of it. The over-reliance on traditional text-based search techniques is the other half.

First let’s look at ranking. To rank random bits of content in a Fisher, what metrics could the system use? Date is an obvious one, but is probably more helpful as a way to narrow searches. Frequency of use is one metric to look at.

Google’s PageRank technology is a variation on the citation-based authority that academics sometimes use to measure the “importance” of journal articles and other academic works. The basic concept in using this on the web is to examine the link structure to determine authority. I’m not sure how, or if, this concept can be applied to email, but the hunch is that there is something similar in the implicit structure of your email archive (or your corporate email archive) that, when analyzed, could correlate with the importance of individual messages.
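One speculative way to read that hunch: treat every reply as a "link" from the person replying to the person being replied to, and run a PageRank-style iteration over correspondents. This is only a sketch under that assumption, not a claim about how Google or any mail tool works; the `replies` data and the correspondent labels are invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of node -> list of targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / count for n in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Repeated targets get a proportionally larger share.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A correspondent who never replies spreads rank evenly.
                share = damping * rank[node] / count
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical reply graph: who replied to whom (assumption, not real data).
replies = {
    "me": ["rohit", "rohit", "adam"],
    "rohit": ["me"],
    "adam": ["me", "rohit"],
    "notifications": [],  # automated mail nobody ever replies to
}
sender_rank = pagerank(replies)
```

Messages could then inherit the score of their sender, so mail from people you actually converse with would outrank automated notifications. Whether reply structure is the right signal is exactly the open question.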

A more complicated metric that a well-designed operating system might be able to measure is to correlate documents by use. That is, if you had X documents open during similar timeframes, those X documents are probably related. Assuming you are currently using one of those X documents, the other documents in the set could be ranked higher in the results.
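A minimal sketch of that co-use metric, assuming the operating system could hand over a log of "sessions" (sets of documents that were open at roughly the same time). The session log, file names, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_co_use(sessions):
    """Count how often each pair of documents was open in the same session."""
    co_use = defaultdict(int)
    for open_docs in sessions:
        for a, b in combinations(sorted(set(open_docs)), 2):
            co_use[(a, b)] += 1
    return co_use


def rank_by_co_use(matches, current_doc, co_use):
    """Order search matches so documents often used with current_doc come first."""
    def score(doc):
        key = tuple(sorted((doc, current_doc)))
        return co_use.get(key, 0)
    return sorted(matches, key=score, reverse=True)


# Hypothetical usage log: each list is one working session.
sessions = [
    ["spec.doc", "budget.xls", "notes.txt"],
    ["spec.doc", "budget.xls"],
    ["vacation.jpg", "notes.txt"],
]
co_use = build_co_use(sessions)

# Suppose a keyword search matched these three files while spec.doc is open.
ranked = rank_by_co_use(
    ["vacation.jpg", "budget.xls", "notes.txt"], "spec.doc", co_use
)
print(ranked)  # ['budget.xls', 'notes.txt', 'vacation.jpg']
```

In a real Fisher this score would be one input among several (date, keyword match quality), not the whole ranking.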

This metric brings about two points. First, the context of the search - what documents and text you have open or have recently modified - could help immensely. Second, it points out that a text-based keyword search may not be the whole answer. A content-based information retrieval system that allows you to construct search queries based on the kind of content you’re searching for could be an important area to look at. This isn’t the best example, but consider finding correspondence with members of a company. Rather than just searching for the company name in your email, if you have one email from that company, your Fisher could notice that all email from that company comes from the same domain name. It might rank email to and from that specific person as most relevant, and email to and from anyone else at that company as also relevant. When you think of your documents and content as query statements, interesting possibilities open up.
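The email example can be sketched as a "query by example": hand the Fisher one message and let it score the rest of the archive by sender and domain. The message layout (dicts with `id`, `from`, and `to` keys), the addresses, and the two-tier scoring are assumptions made for illustration.

```python
def address_domain(address):
    """Return the part of an email address after the last '@', lowercased."""
    return address.rsplit("@", 1)[-1].lower()


def relevance(message, example):
    """2 = involves the same correspondent, 1 = same company domain, 0 = unrelated."""
    person = example["from"].lower()
    domain = address_domain(person)
    participants = [message["from"].lower()] + [a.lower() for a in message["to"]]
    if person in participants:
        return 2
    if any(address_domain(a) == domain for a in participants):
        return 1
    return 0


def query_by_example(archive, example):
    """Return related messages, most relevant first (the example itself included)."""
    scored = [(relevance(m, example), m) for m in archive]
    hits = [(score, m) for score, m in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in hits]


# Hypothetical archive using reserved example domains.
archive = [
    {"id": 1, "from": "news@lists.example.org", "to": ["me@example.net"]},
    {"id": 2, "from": "bob@acme.example", "to": ["me@example.net"]},
    {"id": 3, "from": "me@example.net", "to": ["alice@acme.example"]},
    {"id": 4, "from": "alice@acme.example", "to": ["me@example.net"]},
]
related = query_by_example(archive, archive[3])
related_ids = [m["id"] for m in related]
print(related_ids)  # [3, 4, 2]
```

No keyword was typed at any point; the message itself acted as the query.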

When the domain switches from email to media - like music or images - the possibilities for content-based retrieval seem even more interesting.

Regardless, I think the “Fisher” is an important goal in computing today. I’d rank it second in importance. The only thing I want more is a Good Writing Tool. (More on that soon.)
