Gmail Search: Readers Respond, and a Proposal
Well. That was an interesting response.
My post on the failings of Gmail search proved quite popular and garnered a number of interesting responses. Aside from the name-calling (first time I was ever called an asshat, so circle the day on my calendar!) and the one guy who accused me of being a shill for Yahoo, most of the responses here at my blog and on the posting at Hacker News fell into three main areas.
“You are stupid for wanting search to work like that!”
Fair enough, if that’s what you think. But your saying so doesn’t change the fact that Gmail’s search is failing to help me manage my email. And I can see clearly that as the years of email pile up it will only get more important to be able to search very specifically.
Some of these responses were along the lines of “Use Search Autocomplete from Labs for searching for people.” OK, but that violates the experience Google has habituated me to expect: one single box, type some text in, and it’ll do a good job finding it. I don’t want to know there is a special autocomplete. There’s a search box at the top; I want to use it.
“All email search sucks, but at least Gmail is fast!”
This is both funny and true. One of the real virtues of Gmail’s existing search is that it is really damn fast. The Yahoo Mail search I referenced in the original post was *much* slower than Gmail’s search. But one poster referenced Gerald Weinberg’s story from Code Complete regarding speed, and another poster had the same reaction I did:
It's not "completely broken," but no hits for a query of "zag" in an email that contains "zagg" comes uncomfortably close to "doesn't work."
So for me, Gmail’s speed is nice, but it is useless speed if I can’t find what I need. A number of people agreed with me.
This comment made me laugh though:
I love Gmail search because I've seen the difference - have you ever tried searching email in Mobile Me? It's like they ignore your query and search a random string.
Finally, the guy who commented (paraphrasing) “… I couldn’t find a specific email so I assumed I deleted it; now I’m not so sure” describes a very real problem. If you can’t find it by search and you know search is deficient, you end up wondering if you deleted it. If you never find it, the effect is the same as if Google were randomly deleting emails.
“Google can’t afford to index all that email! It’s hard.”
Well, yes. It is hard, and having implemented various substring indexing systems in my professional career I know all too well how expensive in both time and space it can be. I grant the point. But if we don’t think about the hard things, nothing interesting ever gets done.
Plus, searching through ten years of email without substring search seems like it will become impossible.
Gmail encourages you to not delete; archive is the mantra. But ten years from now I’ll have a ton of email. I will likely need amazing search capabilities to make use of it; otherwise I might as well just delete my old email and be done with it. Maybe you shouldn’t keep email, but the prospect of having all those old emails interests me.
So what could Google do?
A really intriguing suggestion was for Google to use Google Gears to substring index your mail locally on your computer. This has a certain appeal for sure, but is not really a general solution when you want to be able to use public access computers at the library to check your email, or use your iPhone, etc. Still, a really clever idea.
Substring indexing itself was generally regarded as too hard for this reason: it would take a ton of effort and storage to fully index all the text in all the emails across all the Gmail accounts. This argument seems persuasive, until you look at it too hard:
The number of words/tokens to substring index is not really proportional to (bytes per email) × (emails per account) × (number of accounts). It is much smaller than that, for several reasons:
- Most of the space emails take up is not really text; it is attachments. Pictures, etc. Those bytes don’t matter for this sort of indexing.
- The overlap in tokens (words) across all the emails Gmail holds must be huge; I bet the total number of unique tokens is vastly smaller than the total number of words stored. Manageable, even. So you could have each unique word substring indexed once (tries, trigrams, take your pick of a hundred ways to do this).
- When a new email arrives, indexing it would be a matter of recording which tokens it contains (bloom filters?).
- Some tokens would be new and would have to be substring indexed on the fly; over time (pretty quickly, I expect), this would become infrequent.
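To make the bullets above a bit more concrete, here is a rough Python sketch of the idea. Everything in it is my own illustration, not anything Google does: the `TokenIndex` class, the whitespace tokenizer, and the choice of trigrams are all assumptions, and a plain Python set stands in for the bloom filter I hand-waved at. The point it shows is that substring-indexing work happens once per new token, not once per email.

```python
from collections import defaultdict


def trigrams(token):
    """Return the set of 3-character substrings of a token."""
    return {token[i:i + 3] for i in range(len(token) - 2)}


class TokenIndex:
    """Hypothetical substring index over the shared vocabulary, not per email."""

    def __init__(self):
        self.trigram_to_tokens = defaultdict(set)  # e.g. "zag" -> {"zagg"}
        self.known_tokens = set()
        self.newly_indexed = 0  # how many tokens needed on-the-fly indexing

    def add_token(self, token):
        # Only a token never seen before costs any substring-indexing work.
        if token in self.known_tokens:
            return
        self.known_tokens.add(token)
        self.newly_indexed += 1
        for gram in trigrams(token):
            self.trigram_to_tokens[gram].add(token)

    def index_email(self, text):
        """Return the set of tokens in an email, indexing any new ones."""
        tokens = set(text.lower().split())  # naive tokenizer, an assumption
        for token in tokens:
            self.add_token(token)
        return tokens


index = TokenIndex()
index.index_email("my zagg headphones arrived today")
after_first = index.newly_indexed
index.index_email("my headphones arrived broken today")
after_second = index.newly_indexed
print(after_first, after_second)
```

In this toy run the first email pays for five new tokens and the second pays for only one ("broken"), which is the effect the last bullet is counting on.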
Then, to perform a substring search, you identify the tokens the substring matches (from the total corpus of known tokens), then filter for the emails containing those tokens which belong to the user. From there you have ground the problem down pretty far.
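Here is a minimal sketch of that two-step search, again in Python and again purely illustrative. The `EMAILS` data, the `postings` map, and the `substring_search` function are names I made up, and the linear scan over the vocabulary is a stand-in for whatever trie or trigram lookup you would really use.

```python
from collections import defaultdict

# Hypothetical toy mailbox: (user, email_id, text).
EMAILS = [
    ("alice", 1, "Your Zagg order has shipped"),
    ("alice", 2, "Lunch on Friday?"),
    ("bob", 3, "Zigzag quilting patterns"),
]

# token -> set of (user, email_id); built once as mail arrives.
postings = defaultdict(set)
for user, email_id, text in EMAILS:
    for token in text.lower().split():
        postings[token].add((user, email_id))


def substring_search(user, query):
    """Return sorted ids of the user's emails with a token containing query."""
    query = query.lower()
    # Step 1: find matching tokens in the global vocabulary. A linear scan
    # stands in for a real trie or trigram lookup.
    matching_tokens = [t for t in postings if query in t]
    # Step 2: keep only emails that contain those tokens and belong to user.
    hits = set()
    for token in matching_tokens:
        hits.update(eid for owner, eid in postings[token] if owner == user)
    return sorted(hits)


print(substring_search("alice", "zag"))
```

Note that "zag" matches both "zagg" and "zigzag" in the shared vocabulary, but the per-user filter keeps Bob's quilting email out of Alice's results.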
Yes, I simplified to beat the band. Yes, scale is a real issue. Sure, you could make assumptions that a user will stick to a single language. Whatever. But it is one solution that might work.
Another thought is to do aggressive tag generation, so that the token “Zagg” is automatically indexed under “headphone”, “earbud”, “headset”, and so on; a search for “headset” might then be able to suggest Zagg. This is kinda like what Google’s web search already does, just integrated into email search.
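A toy version of the tagging idea might look like the sketch below. The `RELATED_TAGS` table is hand-written and entirely hypothetical; where a real system would get those associations from (web search data, presumably) is exactly the part I am waving my hands about.

```python
# Hypothetical, hand-written mapping from a token to related tags.
RELATED_TAGS = {
    "zagg": {"headphone", "earbud", "headset"},
}


def build_tag_index(emails):
    """Map each tag to the ids of emails whose tokens imply it."""
    tag_index = {}
    for email_id, text in emails.items():
        for token in text.lower().split():
            for tag in RELATED_TAGS.get(token, ()):
                tag_index.setdefault(tag, set()).add(email_id)
    return tag_index


emails = {1: "Your Zagg order has shipped", 2: "Lunch on Friday?"}
tag_index = build_tag_index(emails)

# A search for "headset" can now suggest the Zagg email even though the
# word "headset" never appears in it.
suggestions = sorted(tag_index.get("headset", set()))
print(suggestions)
```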
Why can’t Google come up with a better idea? Um, they do index the web, for Pete’s sake!
Can you come up with another approach? An even better one? Cool! Let me know.
Heck, let Google know. Because at the moment it is still Search Fail for Gmail.