The Problem: You have a big database of stories and comments. I'll take rmg's word that we're
talking about maybe 7 gigabytes of text in 5 million comments. Your mission, should you choose to
accept it, is to make a list of all the comments that contain a particular phrase.
The traditional answer to this is to use a Relational Data Base Management System or
RDBMS, a generic
tool that automatically indexes such a pile of data as it is created. The RDBMS takes whatever
measures are necessary as the database is manipulated (largely transparent to all but the most
well-informed users) so that you can later give it a query amounting to "give me a list of all the
comments containing the phrase foo-bar," and out pops the list like magic.
When a tool like this works it is truly sweet. When such a tool is open-source it is even sweeter. When even profit-hungry industry
behemoths are offering free as in beer
solutions, it's easy to see why programmers have gone almost universally to RDBMS engines when they
need to store and retrieve data, whether they actually need the advanced features of the RDBMS or
not. Programming tutorials which in days of yore would have walked new users through the task of
creating, writing, and re-reading files, now introduce the neophyte to linked data controls and SQL
queries instead.
Like all good solutions, though, the RDBMS can't be good at everything. As the old saying
goes, the jack of all trades is master of none; and the RDBMS introduces a few problems that
eventually become apparent in real-world use:
- The indexing increases storage and processing requirements.
- When the indexing becomes corrupted, the entire database can become unreadable instead of the
damage being limited to the corrupted records. Elaborate tools are necessary to recover from this.
- Even simple, linear queries can involve traversing widely separated parts of the database in
order to follow relational pointers.
- RDBMS code is extremely complicated and when the RDBMS engine itself has a bug or fault it can
be extremely difficult to find and fix.
- Because most programmers don't know exactly what has to be done to process their queries,
latencies can be innocently introduced which harm scaleability and are very difficult to
eliminate.
The latency problem is at the heart of K5's nonfunctional search function, and there aren't
many solutions.
- You can buy enough RAM to accommodate the whole DB; this will work but gets expensive.
- You can become an expert on your DB engine and tweak the indexing and query structure until it
squeaks; this might or might not work and obviates the whole point of using a convenient generic
tool to access your data.
- You can create your own simpler parallel data structure either within or outside of the RDBMS
to do specialized indexing.
All the really workable suggestions fall in the last category. Before we get to my idea and the
gales of laughter it inspired, let's look at something a bit more obviously practical:
Tex's version of the
vector space search:
Then I would make another database, like a binary tree or something like that. Every
word would be a separate record.
So basically if a word was already in the word search database, the comment id, story id, diary id,
whatever, would just be added to that word's record.
And if not, then a new word record in the wordsearch database would be created.
Once this word database is created, doing the search becomes simplicity itself -- you just go to
the word record and spit out its contents. And as rmg pointed out, you can
even do phrase searches with a little extra work either at search time or by including word order
in the word index.
This is such a good idea, you might be asking: What's my problem with it?
The answer, mainly, is that it's a pain in the ass to implement. First of all if you don't want
the scheme to bog down under its own weight you need to do a custom data structure for it; it's
essentially going to be a forest of pointers with a few bits of data sprinkled throughout. You
will be adding dynamically to all of them through the course of a day, and their growth rates are
very unpredictable. Those word indexes will range from common articles like "the" which list
nearly every comment, down to individual posts which contain unusually misspelled words. You can
pare the list down a bit (as Google does) by eliminating the very common words and spell-checking,
but that just adds more logic.
Maintaining a database of this type and keeping it from becoming corrupted is hard. And while it
will probably be smaller than the sum of all the original text, it probably won't be that
much smaller; remember that the few hundred most common words are going to all have hundreds of
thousands of pointers. It might still fit in RAM; really it has to because when you add a
post you're tacking new pointers onto hundreds of widely separated data structures. As those
pointer lists grow you will have to move them around and garbage collect the storage space in real
time. If they're on disk you'll burn out the force coil trying to get to them all.
The word index helps with overhead, but it also moves a lot of heavy lifting from the search itself
to realtime index maintenance. And at the risk of repeating myself, let me say that writing an
engine to keep all those word pointer lists updated and handy is hard. You will certainly
write it in a higher-level language if you're not a complete masochist. If you get lazy and try to
use the RDBMS to do it you'll find that it still slows to a crawl, now when you leave a post
instead of when you do a search. This kind of thing has to be somewhat optimized. You'll have
more fun than the law allows trying to debug it (especially under real world load conditions), and
Gaia help you if it gets corrupted.
My modest proposal was to simply tack every new post to the end of a big text file as
they're created, and write a really efficient parser to scan it. At first this sounds like, soooo
1980. But on reflection, you'll see that it's not 1980 at all -- it wouldn't have been possible,
much less practical in 1980.
In 1981 I was privileged to visit a major Air Force computer installation that featured a Cray I
processor -- the famous "C" with the bench around it and its freon-cooled circuit boards. It had
eight megabytes of RAM and sixteen vector CPU's working in parallel to achieve about 100 megaflops
performance. Disk and tape drives filled a room the size of a small warehouse. The 80 megabyte
removable-platter hard disks, with media the size and shape of a cake, had access times reckoned in
tenths of a second. We were told that this machine was used, among other things, to sort Exxon's
100 million record mailing list.
You no longer need a ten million dollar supercomputer to do that. And you'd no longer bother with
vector processing and the many tricks that were necessary to optimize memory and disk paging on a
dataset that size. And similarly, as recently as 1990 searching a seven-gigabyte dataset on a
personal computer would have been well nigh impossible without a lot of fancy indexing and careful
memory and disk management.
But it's 2004, not 1990, and things have changed.
Right now it is not hard to find hard disks with over 100 megabyte per second sustained transfer
rates. This means you could load and scan seven gigabytes in about a minute. While that sounds
like terrible performance there are several reasons why it is actually acceptable.
- Most searches don't scan the entire database; so if you follow Scoop's convention and keep the
stories in a separate, smaller pile, only the comment search will have the performance problem.
- Searches are already limited by default to a more modest "current" dataset.
- Searches terminate and return results when a page of results has been acquired; on most
searches this will not require scanning the whole DB.
- The time required for a search is definite and predictable, so the user can be given
appropriate status indications.
- As blocks are loaded into RAM they can be used for all ongoing concurrent searches, so while
one worst-case search might take 60 seconds, a thousand concurrent worst-case searches will
also take 60 seconds.
- The effect on the main database is effectively zero.
Now it's possible to argue that 60 seconds is entirely too long for a search to complete, and if I
was doing a lot of arcane searches I'd probably agree. But this is on the order of how long
archive comment searches were taking anyway before the plug was pulled; and it's on the order of
how long searches take on many web BBS systems.
The advantage here is that the search routine spends nearly all of its time waiting for the hard
drive to load another block, so that no matter how many searches are requested none of them will
ever take more than the predictable worst-case time and the effect on K5's other functionality will
be unnoticeable. And it was the effect on K5's other functionality more than anything else that
made it a problem for the site.
In the original comment I suggested that the scanner should be coded in assembly language, and
that's partly to make sure the search stays invisible in this way; but mostly it's because the
thing is just so simple that you can. No matter how good your compiler is you tend to do things in
a higher level language that look stupid when translated to machine code, like recalculating
intermediate results, moving strings around instead of examining them in situ, and using memory
where registers are available. If you're avoiding this kind of performance hit it's probably
because you are very familiar with how your compiler writes code and keeping it in the back of your
mind; and that's not high-level programming even if you are doing it in C++.
A tight loop like a search is exactly the kind of thing that would benefit from being hand tuned.
The maintenance of Tex's word directory would benefit too, but that's so complicated only a pure
masochist would even attempt it. (I've done projects like that, but I wouldn't expect anybody else
to.) The big advantage of my scheme is that it is simple. Maintaining the search database
is simple and scanning it is simple. This means you can easily and completely debug it, and you
don't have to worry about it getting corrupted when the power fails unexpectedly.
All of your files, including the RDBMS and search index, live on another kind of database, the hard
disk file system. Unlike the RDBMS, a file system is designed to do one thing -- store and
retrieve blocks of data without worrying too much about what they are -- and if it's worth the
electrons it's made of, a file system will do this very well. Any good well-maintained file system
will tend to group a big file like the search index all in one place, so that linear access is
blazingly fast, limited mostly by the rotation of the platters rather than repositioning of the
head. The flat index leverages this to its advantage.
And if it still seems unworkable and silly, keep in mind that next year's hard drives will be even
faster than today's. RAID controllers are getting cheap enough for personal systems. The
technology to bring the worst-case search under 10 seconds is already there for a reasonable cost,
and within a shorter time than K5 has already existed I'd guess that 1 second will be
achievable.
Meanwhile, RDBMS code never gets any simpler. The opposite is true; faster machines and more
memory encourage feature bloat, so that performance doesn't rise as rapidly as the low-level
performance of the machine does.
I have argued in the past, loudly and at great length, that modern computing suffers because
lazy programmers use powerful hardware as a crutch instead of learning how to do things right. An
alert reader might accuse me of violating my own principles by leaning on the fast CPU and hard
drive instead of building a proper index.
It's a valid criticism and I'd answer the same way modern programmers do; it's a tradeoff. It is
lazy to depend on gigahertz processors and super-fast hard drive interfaces that didn't even exist
a few years ago. But you also have to ask what you get in trade for that performance hit.
I like fast code, but even more than that I like code that works. This means I like code
that it simple, lean, and written with an awareness of its hardware platform. What the flat search
index buys you is extreme simplicity. Simple code is code that is easily optimized and easily
debugged. It also buys you short development time and low cost, which aren't enough to justify
requiring a bigger computer but aren't to be laughed at, either.
In the final analysis, if time and resources are limited there is really only one choice that makes
any sense. A search function that works, however laughably implemented, is surely better than one
you don't have the time to write and debug.