Kuro5hin.org: technology and culture, from the trenches
create account | help/FAQ | contact | links | search | IRC | site news
[ Everything | Diaries | Technology | Science | Culture | Politics | Media | News | Internet | Op-Ed | Fiction | Meta | MLP ]
We need your support: buy an ad | premium membership

[P]
A Modest Search Algorithm Proposal

By localroger in Technology
Mon May 03, 2004 at 06:43:49 AM EST
Tags: Software (all tags)
Software

I recently expressed an opinion as to how a robust search algorithm might be built. All right, stop laughing. You too, Tex. I will now explain why it isn't nearly as dumb an idea as it sounds.


The Problem: You have a big database of stories and comments. I'll take rmg's word that we're talking about maybe 7 gigabytes of text in 5 million comments. Your mission, should you choose to accept it, is to make a list of all the comments that contain a particular phrase.

The traditional answer to this is to use a Relational Data Base Management System or RDBMS, a generic tool that automatically indexes such a pile of data as it is created. The RDBMS takes whatever measures are necessary as the database is manipulated (largely transparent to all but the most well-informed users) so that you can later give it a query amounting to "give me a list of all the comments containing the phrase foo-bar," and out pops the list like magic.

When a tool like this works it is truly sweet. When such a tool is open-source it is even sweeter. When even profit-hungry industry behemoths are offering free as in beer solutions, it's easy to see why programmers have gone almost universally to RDBMS engines when they need to store and retrieve data, whether they actually need the advanced features of the RDBMS or not. Programming tutorials which in days of yore would have walked new users through the task of creating, writing, and re-reading files, now introduce the neophyte to linked data controls and SQL queries instead.

Like all good solutions, though, the RDBMS can't be good at everything. As the old saying goes, the jack of all trades is master of none; and the RDBMS introduces a few problems that eventually become apparent in real-world use:

  • The indexing increases storage and processing requirements.
  • When the indexing becomes corrupted, the entire database can become unreadable instead of the damage being limited to the corrupted records. Elaborate tools are necessary to recover from this.
  • Even simple, linear queries can involve traversing widely separated parts of the database in order to follow relational pointers.
  • RDBMS code is extremely complicated and when the RDBMS engine itself has a bug or fault it can be extremely difficult to find and fix.
  • Because most programmers don't know exactly what has to be done to process their queries, latencies can be innocently introduced which harm scaleability and are very difficult to eliminate.
The latency problem is at the heart of K5's nonfunctional search function, and there aren't many solutions.
  1. You can buy enough RAM to accommodate the whole DB; this will work but gets expensive.
  2. You can become an expert on your DB engine and tweak the indexing and query structure until it squeaks; this might or might not work and obviates the whole point of using a convenient generic tool to access your data.
  3. You can create your own simpler parallel data structure either within or outside of the RDBMS to do specialized indexing.
All the really workable suggestions fall in the last category. Before we get to my idea and the gales of laughter it inspired, let's look at something a bit more obviously practical:

Tex's version of the vector space search:

Then I would make another database, like a binary tree or something like that. Every word would be a separate record. So basically if a word was already in the word search database, the comment id, story id, diary id, whatever, would just be added to that word's record. And if not, then a new word record in the wordsearch database would be created.
Once this word database is created, doing the search becomes simplicity itself -- you just go to the word record and spit out its contents. And as rmg pointed out, you can even do phrase searches with a little extra work either at search time or by including word order in the word index.

This is such a good idea, you might be asking: What's my problem with it?

The answer, mainly, is that it's a pain in the ass to implement. First of all if you don't want the scheme to bog down under its own weight you need to do a custom data structure for it; it's essentially going to be a forest of pointers with a few bits of data sprinkled throughout. You will be adding dynamically to all of them through the course of a day, and their growth rates are very unpredictable. Those word indexes will range from common articles like "the" which list nearly every comment, down to individual posts which contain unusually misspelled words. You can pare the list down a bit (as Google does) by eliminating the very common words and spell-checking, but that just adds more logic.

Maintaining a database of this type and keeping it from becoming corrupted is hard. And while it will probably be smaller than the sum of all the original text, it probably won't be that much smaller; remember that the few hundred most common words are going to all have hundreds of thousands of pointers. It might still fit in RAM; really it has to because when you add a post you're tacking new pointers onto hundreds of widely separated data structures. As those pointer lists grow you will have to move them around and garbage collect the storage space in real time. If they're on disk you'll burn out the force coil trying to get to them all.

The word index helps with overhead, but it also moves a lot of heavy lifting from the search itself to realtime index maintenance. And at the risk of repeating myself, let me say that writing an engine to keep all those word pointer lists updated and handy is hard. You will certainly write it in a higher-level language if you're not a complete masochist. If you get lazy and try to use the RDBMS to do it you'll find that it still slows to a crawl, now when you leave a post instead of when you do a search. This kind of thing has to be somewhat optimized. You'll have more fun than the law allows trying to debug it (especially under real world load conditions), and Gaia help you if it gets corrupted.

My modest proposal was to simply tack every new post to the end of a big text file as they're created, and write a really efficient parser to scan it. At first this sounds like, soooo 1980. But on reflection, you'll see that it's not 1980 at all -- it wouldn't have been possible, much less practical in 1980.

In 1981 I was privileged to visit a major Air Force computer installation that featured a Cray I processor -- the famous "C" with the bench around it and its freon-cooled circuit boards. It had eight megabytes of RAM and sixteen vector CPU's working in parallel to achieve about 100 megaflops performance. Disk and tape drives filled a room the size of a small warehouse. The 80 megabyte removable-platter hard disks, with media the size and shape of a cake, had access times reckoned in tenths of a second. We were told that this machine was used, among other things, to sort Exxon's 100 million record mailing list.

You no longer need a ten million dollar supercomputer to do that. And you'd no longer bother with vector processing and the many tricks that were necessary to optimize memory and disk paging on a dataset that size. And similarly, as recently as 1990 searching a seven-gigabyte dataset on a personal computer would have been well nigh impossible without a lot of fancy indexing and careful memory and disk management.

But it's 2004, not 1990, and things have changed.

Right now it is not hard to find hard disks with over 100 megabyte per second sustained transfer rates. This means you could load and scan seven gigabytes in about a minute. While that sounds like terrible performance there are several reasons why it is actually acceptable.

  • Most searches don't scan the entire database; so if you follow Scoop's convention and keep the stories in a separate, smaller pile, only the comment search will have the performance problem.
  • Searches are already limited by default to a more modest "current" dataset.
  • Searches terminate and return results when a page of results has been acquired; on most searches this will not require scanning the whole DB.
  • The time required for a search is definite and predictable, so the user can be given appropriate status indications.
  • As blocks are loaded into RAM they can be used for all ongoing concurrent searches, so while one worst-case search might take 60 seconds, a thousand concurrent worst-case searches will also take 60 seconds.
  • The effect on the main database is effectively zero.
Now it's possible to argue that 60 seconds is entirely too long for a search to complete, and if I was doing a lot of arcane searches I'd probably agree. But this is on the order of how long archive comment searches were taking anyway before the plug was pulled; and it's on the order of how long searches take on many web BBS systems.

The advantage here is that the search routine spends nearly all of its time waiting for the hard drive to load another block, so that no matter how many searches are requested none of them will ever take more than the predictable worst-case time and the effect on K5's other functionality will be unnoticeable. And it was the effect on K5's other functionality more than anything else that made it a problem for the site.

In the original comment I suggested that the scanner should be coded in assembly language, and that's partly to make sure the search stays invisible in this way; but mostly it's because the thing is just so simple that you can. No matter how good your compiler is you tend to do things in a higher level language that look stupid when translated to machine code, like recalculating intermediate results, moving strings around instead of examining them in situ, and using memory where registers are available. If you're avoiding this kind of performance hit it's probably because you are very familiar with how your compiler writes code and keeping it in the back of your mind; and that's not high-level programming even if you are doing it in C++.

A tight loop like a search is exactly the kind of thing that would benefit from being hand tuned. The maintenance of Tex's word directory would benefit too, but that's so complicated only a pure masochist would even attempt it. (I've done projects like that, but I wouldn't expect anybody else to.) The big advantage of my scheme is that it is simple. Maintaining the search database is simple and scanning it is simple. This means you can easily and completely debug it, and you don't have to worry about it getting corrupted when the power fails unexpectedly.

All of your files, including the RDBMS and search index, live on another kind of database, the hard disk file system. Unlike the RDBMS, a file system is designed to do one thing -- store and retrieve blocks of data without worrying too much about what they are -- and if it's worth the electrons it's made of, a file system will do this very well. Any good well-maintained file system will tend to group a big file like the search index all in one place, so that linear access is blazingly fast, limited mostly by the rotation of the platters rather than repositioning of the head. The flat index leverages this to its advantage.

And if it still seems unworkable and silly, keep in mind that next year's hard drives will be even faster than today's. RAID controllers are getting cheap enough for personal systems. The technology to bring the worst-case search under 10 seconds is already there for a reasonable cost, and within a shorter time than K5 has already existed I'd guess that 1 second will be achievable.

Meanwhile, RDBMS code never gets any simpler. The opposite is true; faster machines and more memory encourage feature bloat, so that performance doesn't rise as rapidly as the low-level performance of the machine does.

I have argued in the past, loudly and at great length, that modern computing suffers because lazy programmers use powerful hardware as a crutch instead of learning how to do things right. An alert reader might accuse me of violating my own principles by leaning on the fast CPU and hard drive instead of building a proper index.

It's a valid criticism and I'd answer the same way modern programmers do; it's a tradeoff. It is lazy to depend on gigahertz processors and super-fast hard drive interfaces that didn't even exist a few years ago. But you also have to ask what you get in trade for that performance hit.

I like fast code, but even more than that I like code that works. This means I like code that it simple, lean, and written with an awareness of its hardware platform. What the flat search index buys you is extreme simplicity. Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost, which aren't enough to justify requiring a bigger computer but aren't to be laughed at, either.

In the final analysis, if time and resources are limited there is really only one choice that makes any sense. A search function that works, however laughably implemented, is surely better than one you don't have the time to write and debug.

Sponsors

Voxel dot net
o Managed Hosting
o VoxCAST Content Delivery
o Raw Infrastructure

Login
Make a new account
Username:
Password:

Note: You must accept a cookie to log in.

Poll
I keep my data in...
o Oracle 5%
o DBase 0%
o FoxBase 0%
o Access 0%
o MySQL 19%
o PostGreSQL 20%
o Fixed-width random files 4%
o Text files 18%
o XML 2%
o Post-It Notes 4%
o My head 16%
o Shredder 7%

Votes: 124
Results | Other Polls

Related Links
o Scoop
o Google
o opinion
o laughing
o You too, Tex
o rmg's word
o RDBMS
o open-sourc e
o free as in beer
o Tex's version
o pointed out
o Also by localroger


Display: Sort:
A Modest Search Algorithm Proposal | 253 comments (246 topical, 7 editorial, 1 hidden)
+1, technology (1.33 / 18) (#1)
by Hide The Hamster on Sat May 01, 2004 at 04:05:23 PM EST




Free spirits are a liability.

August 8, 2004: "it certainly is" and I had engaged in a homosexual tryst.

Yah (2.33 / 6) (#2)
by melia on Sat May 01, 2004 at 04:06:45 PM EST

What the flat search index buys you is extreme simplicity. Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost, which aren't enough to justify requiring a bigger computer but aren't to be laughed at, either.

Yah it looks pretty similar to PHP.


Disclaimer: All of the above is probably wrong

Easy fix to K5's database woes (2.41 / 12) (#3)
by Jed Smith on Sat May 01, 2004 at 04:12:55 PM EST

http://www.slashcode.com/
_____
K5 is dead. Steve Ballmer made the most insightful comment on a story. -- jw32767
I'd just like the record to reflect that (2.47 / 19) (#6)
by Tex Bigballs on Sat May 01, 2004 at 04:27:14 PM EST

I was just throwing my idea out there because I had to do a similar project in college. I have no idea if it would be easy or difficult to implement. In fact, the only programming I've ever done is some C in college and some visual basic for my job.

Therefore, I would defer to someone who knows more about databases to evaluate local roger's method.

Finally, I would just like to add that local roger is an egotistical windbag and here is yet another sterling example.

Why not just make comments smaller? (nt) (2.57 / 21) (#7)
by ennui on Sat May 01, 2004 at 04:33:05 PM EST



"You can get a lot more done with a kind word and a gun, than with a kind word alone." -- Al Capone
-1, moron (2.79 / 34) (#8)
by pb on Sat May 01, 2004 at 04:41:39 PM EST

Here is why your proposal is total bullshit: kuro5hin effectively uses it already, and it doesn't work.

Since you admit your ignorance about databases, you'll just have to take my word that "SELECT * FROM comments WHERE comment LIKE '%foo%'" is roughly equivalent to searching for the string 'foo' in a flat text file of comment text. You'll also have to accept the fact that searching K5 currently works sort of like that now, and that's why it's so slow.

To search it faster, what you'd want to do would involve a slightly more complicated scheme, and it doesn't really matter if you store your data on the filesystem (as I do) or in a database (as K5 does); you just have to do it right. That is to say, your search algorithm shouldn't involve sifting through 7GB of data every time.

After all, if sifting through 7GB of data is 'fast', then sifting through 7MB would be faster! Ideally, you'd want to reduce the amount you have to search for any given query to some small fraction of the actual data, and then use that to rank and look up the relevant records. This is called an index.

Now, why do people generally use databases for this instead of writing their own system in assembler with flat text files? Well, because databases are designed for these sorts of problems. They were originally written in assembler, by experts, and these experts have refined their techniques and incorporated new algorithms over the years. Do you know what a B-tree is? Well, I guarantee you that they do.

So although it is possible that for a highly specialized task, you could 'beat them' with some hand-crafted program, it isn't likely. Chances are, if your database or search program is dog-slow, it's because your database schema is badly designed, and/or your choice of algorithms in accessing your data is poor. Other people who know what they're doing could use the same database, do the same task, and do it far faster, because they know how to do it better.

In short... leave it to the professionals.
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

Uh... (2.44 / 9) (#9)
by The Honorable Elijah Muhammad on Sat May 01, 2004 at 04:46:17 PM EST

What happened to vector search?


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
Nobody wants to wait 60 seconds on search results (2.76 / 17) (#13)
by bc on Sat May 01, 2004 at 05:01:57 PM EST

The solution is to use a system with proper caching, db pooling etc. Many such solutions exist. Some are complicated, some are absurdly easy. All this work's been done already, better, by people smarter than you and me.

You might also want to think about using a proper database, and not a dreadful piece of shit.

Nonetheless, blaming a perverse, dreadful implementation of search and DB and design and so on (scoop) on the whole world of RDBMS, and advocating something that will "only take 60 seconds" (and may indeed be better and faster than perl + mysql, lets face it, these two things should be banned) is to miss the point that we can take a step forwards, into a new, sunlit upland of lightning performance and piss-easy development, rather than a step backwards into nasm, flat files, and general agony.

♥, bc.

-1 and then some (2.28 / 14) (#19)
by toulouse on Sat May 01, 2004 at 05:21:56 PM EST

You've already exhausted this troll, and it was fucking tedious and uninspired the first time.

A perusal of your literary output demonstrates that you are not above flogging a dead horse, but kindly refrain from indulging yourself here. PlzDitchKThx.


--
'My god...it's full of blogs.' - ktakki
--


Won't work at all (2.87 / 16) (#20)
by vadim on Sat May 01, 2004 at 05:24:35 PM EST

Ok, let's take a look at your claims: Search through 7 GB of data stored on disk in 10 seconds. That is about 716MB of data parsed per second.

I suppose this approach might work if it was all in RAM, but on a hard disk? No way. A rough benchmark says my hard disk can do 55 MB/s. Okay, let's say we can get 80MB/s. That's still 9 hard drives in RAID-0 in very optimistic conditions, and a horrible MTBF. So make that 18 drives at least.

Next, we need a bus capable of handling that. I have two PCI 66MHz 64 bit slots in this machine, that's 532 MB/s. Oops. Bottleneck.

At this point, it's pretty evident that we need some hardware to support this that's more decent than my dual Athlon. AMD64 maybe, or some expensive Sun box. I don't think that Rusty can afford one of these.

Also, with the cost of 18 good (probably SCSI) hard drives and just a 7 GB file, you'd be better off just getting a machine with enough RAM to keep the whole thing in memory.

The whole idea appears to be braindead. For the money this idea would cost you probably could get enough RAM, and a decent database. For some reason you discard using indexes. You ignore that parsing 716MB of data a second will have a horrible impact on the system bus, and the disk activity will slow down the database and reduce the available amount of cache memory. You ignore that the data already is in a database and want to duplicate it for some reason.

Your approach also loses lots of possibilities that databases give you. So the search is too slow because there's too much data. Why not restrict the search?

Force users to select a timeframe to search, encourage limiting the search to a story, user or section. Maybe even add the moderation score as a parameter. Allow to search for signatures separately.

True, scanning through the whole database with a RDBMS would be slower than doing it with a text file. But that's a completely stupid idea, you could use the database's capabilities and get a better performance out of it! If the database is bad, then there are others available (postgres?)

--
<@chani> I *cannot* remember names. but I did memorize 214 digits of pi once.

Your head is clouded (2.58 / 12) (#21)
by cameldrv on Sat May 01, 2004 at 05:25:06 PM EST

Your knowledge of algorithms is stunted by your real-time experience and the need for guaranteed worst-case performance. Suppose we coded google in assembly and scanned a big text file of the whole web every time you did a search. Oh, but it's fast and easy to write because it's in assembly.

Roger: (2.50 / 10) (#25)
by mcc on Sat May 01, 2004 at 05:56:39 PM EST

This faux "discussion" that you are having between your main account and your "tex bigballs" dupe account is getting really grating. Please stop it.

Hey Roger. (2.64 / 14) (#29)
by scanman on Sat May 01, 2004 at 06:27:21 PM EST

If you hate vector searches so much, why don't you stop using them? No one is forcing you to use vector searches. To help you out, I'll give you the contact information for ICANN so you can ask them for the IP address of kuro5hin.org without using a vector search:

(ICANN) Internet Corporation for Assigned Names and Numbers
4676 Admiralty Way, Suite 330
Marina del Rey, CA 92092

Phone: 310-823-9358
Fax: 310-823-8649

"[You are] a narrow-minded moron [and] a complete loser." - David Quartz
"scanman: The moron." - ucblockhead
"I prefer the term 'lifeskills impaired'" - Inoshiro

OH SHIT I GAVE IT +1FP (2.26 / 23) (#30)
by James A C Joyce on Sat May 01, 2004 at 06:27:43 PM EST

I'm really sorry everybody.

On a more topical note, I bet I could come up with a better way of doing this just by picking a random techie phrase from the front page of perl.com. Ah, here we are..."Bloom Filters". There ya go, Mister Localroger Troll Sir, implement something with that.

I bought this account on eBay

Bah. (2.84 / 13) (#32)
by regeya on Sat May 01, 2004 at 07:07:19 PM EST

As other people have said, RDBMS systems are handy because someone else has already used optimized routines for you. I'll admit that I don't know much about the subject myself, but I'm sorry to say that I think your approach is, well, stupid. I ended up getting a journalism degree because I quickly learned that CS wasn't the major for me, but not before learning something about indexing. There are far more efficient ways of indexing data than what you propose, and a decent RDBMS could handle it. The problem is that MySQL (or does kuro5hin use something different?) just isn't up to the task.

A lot of people have thrown brain power at such problems. A flat text file + parser sounds like a stupid Perl programmer's idea. This notion of reinventing the wheel and denouncing using a modular solution as lazy, though, is absurd, IMHO. Again, others have worked on problems before, and

Heh...I guess I'm a lamer, but I like the idea of using files instead of throwing everything everything into an RDBMS, though. With a decent filesystem, indexing scheme, and caching scheme, this could work. You wouldn't be able to rely on the RDBMS as a security device (that's fine, I suppose) and you'd have to handle things like locking on your own, but it could work. Hell, someone do it and see if it's possible to make it work. Let's go further: How 'bout someone takes some existing pieces and tries to make a Scoop-type system using, say (this is an example, and I've no experience in such things, so feel free to laugh) a PHP system that reads/writes XML, and a pre-built indexer such as SWISH-E (again, feel free to point and laugh if it wouldn't work) along with a home-grown indexing scheme (which again could be based on XML, though localroger will blast it for being lazy ;-D) and see if it could even come close to Scoop. I'd be interested to see if it could work.

[ yokelpunk | kuro5hin diary ]

uh... (2.80 / 21) (#35)
by coderlemming on Sat May 01, 2004 at 07:17:08 PM EST

Well, I'm going to take you seriously and assume you're not trolling.  Perhaps I'm wrong and you're just insanely subtle.

I'll grant you that your solution would work.  It's a quick, dirty hack, and if you throw enough hardware at it, it'll work.  Maybe.  For now.  But it's stupid.

First of all, in this writeup and in your comments, you proclaim the amazing power of modern disk IO systems, what with their caching and all of that.  Add to that the ability of any multitasking OS to schedule around your 99.9% IO-bound thread, and you take this to mean that you're home free.  Unfortunately, you've overlooked a few things (these are just the ones that come to mind):

  •  You're going to absolutely trash all of the disk caches.  In a fraction of a second, you're going to totally blow the one on the disk, even if it's huge, and a few seconds after that, you'll probably also get the kernel's cache.  By the time you're done, they'll likely have a small fraction of your database in cache, and not much else.  Your next search won't benefit from this at all, and nor will anything else.  Filesystem efficiency will suffer.
  •  Your suggestion that no other processes will suffer from your low-priority thread thrashing the disk is a bit short-sighted, as well.  As you point out, the RDBMS is storing its data on the disk as well, and every single other action that scoop does is going to go through the disk.  You're definitely going to severely bog that down by scanning your flat-file.
  •  When data is invalidated, how do you find the entry in the flatfile to mark it as invalid?  Another search?  This question was raised in some of the threads you linked to but never answered.
Beyond these points, you also have to realize that a simple implementation of a word-index search is just as trivial as implementing your flatfile, scan, and garbage-collection systems.  RDBMSs are made for huge data sets like this, and there's been hard research done on how to make them fast.  Besides, if it's inefficient, find ways to optimize it later.  That's almost an exact quote from you.

And one final point: there are already word-index packages out there, even open-source projects, that could be dropped in and used with very little coding.  It's been done, and done, and done some more.

Your suggestion is not (quite) laughable, because you're pointing out something many of today's programmers forget: if it works, and it's easy, why bother making it more complicated?  But in this case, the complication factor is not that high, and the benefit is several orders of magnitude.


--
Go be impersonally used as an organic semen collector!  (porkchop_d_clown)

real full text search/index is easy enough (2.55 / 9) (#37)
by simul on Sat May 01, 2004 at 08:08:46 PM EST

Full text indexing is a well-researched issue, with a long history. I think, maybe, that you really don't know how to use indexes.

Sure, storing comments in a database can be slow. I built a commenting system that puts each comment in it's own text file.

But if you're going to be searching text files, a full-text index is *the* only way to go. You could use WAIS, or experiment with a new system like CLucense. What takes 1 second, in your article, would take a few milliseconds.

However, full-text indexing works well with MYSQL. You have you create your comemnt table as MyISAM - NOT INNODB. the problem with this is that you get slow updates. Teh advantage is highspeed searching.

http://dev.mysql.com/doc/mysql/en/Fulltext_Search.html

I'm surprised this article is receiving votes though... considering it's lack of technical merit. I thought kuro5hin'ers were more sophisticated than that.

Read this book - first 24 pages are free to browse - it rocks

linux? (1.72 / 18) (#38)
by SilentChris on Sat May 01, 2004 at 08:10:18 PM EST

might i recommend linux?

A real, ready seach engine (2.50 / 6) (#39)
by danharan on Sat May 01, 2004 at 08:10:58 PM EST

Brag all you want about seeing a Cray in 1980, but you're not lazy enough to be an outstanding programmer. Lucene exists, it's free and it works. Why reinvent the wheel?

You know (2.42 / 14) (#42)
by Dr Phil on Sat May 01, 2004 at 08:40:53 PM EST

I heard some people are so good that they can actually write assembly that moves faster than the speed of light.

*** ATTENTION *** Rusty has disabled my account for anti-Jewish views. What a fucking hypocrite.
Uhm. (2.00 / 8) (#43)
by SIGNOR SPAGHETTI on Sat May 01, 2004 at 09:26:51 PM EST

You can't write better assembly language than a good C compiler. Unbelievably, the 8086 is no more. Aside from that, I'm down with flat files; they fit good on magnetic tape.

--
Stop dreaming and finish your spaghetti.

Hmm. Not sure this makes too much sense. (2.66 / 6) (#44)
by xC0000005 on Sat May 01, 2004 at 09:30:27 PM EST

I appreciate the value of something that would work.  However, let me point out a few things that, as a person who makes a living programming, stood out to me.

#1.  Don't re-invent.  There are already systems that do full text indexing, written by people who have specialized in it. Your flat file idea is interesting, but I would personally shun it because there's not too much point. (See other comments on this)

#2.  Regarding that search index tree, again as someone who has worked on one such thing, I have to say it is the best tool for the job. But why would you maintain the complete index in memory? (More on the idea of killing the disk later) Garbage collection is only an issue if you are allowing deletes.  Yes, Kuro5hin does allow deletes, but I'll bet they are the edge case, not the norm.  Now you need some way to maintain the integrity of your tree, hash/checksum/something.  And probably a lru/cache plan.  Does this sound familiar?  It should, because it's what already exists in a number of existing search implementations.  And in regards to killing the disk getting there, the solutions are again as simple or complex as you make them.  Simple - use a filesystem with heavy read/write caching, (or a SAN, preferrably one of the ones that like to play fun tricks with the IO queue).  The downside is you are tied to heavy metal hardware or a particular FS.  Complex is writing the buffer manager yourself.  (I do not recommend this).  

Finally, the optimal approach is actually a mixture.  You have the index tree for lightning fast searches with a slow insertion time.  You have a midlevel tagged query.  You have the last level (on the newest posts) where x% of resources can be consumed doing a slow crawl.  

I think leveraging an existing search engine like goodle is a better idea.

Voice of the Hive - Beekeeping and Bees for those who don't

I encourage this... (1.77 / 9) (#55)
by JahToasted on Sat May 01, 2004 at 11:11:04 PM EST

The biggest problem with in Computer Science today is that we've forgotten the old KISS (Keep It Simple, Stupid) rule. Let's face it, we don't need all of the features of the (IMHO) bloated MySQL. Yeah, if you're trying to set up a mission critical database server and need things like transactions, views, and inline functions, then maybe MySQL is the right choice for you. But for the rest of us, a flat file with an assembly parsing programme is the way to go. And a good programmer could write the assembler programme in such a way as to be portable across multiple architectures, something that MySQL can't really claim.

+1FP
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison

I call bullshit (2.72 / 11) (#57)
by skyknight on Sun May 02, 2004 at 02:52:16 AM EST

I have argued in the past, loudly and at great length, that modern computing suffers because lazy programmers use powerful hardware as a crutch instead of learning how to do things right.

This is an archaic mentality. You make it sound like writing in anything other than assembly is laziness. Yes, there are lazy programmers who do stupid shit and rely on blazing fast hardware to mask their incompetence. No, not everyone who uses high level languages is a dolt. You may have heard of a guy by the name of Donald Knuth. He asserts that "premature optimization is the root of all evil." Writing code in assembly in the 21st century is almost always premature optimization.

You seem to harbor a righteous sentiment that the ultimate metric for goodness of code is speed of execution. That mentality was correct in the past when machines were ridiculously expensive such as the Cray you mentioned. Today, there are much better things for which to optimize, e.g. discoverability, ease of debugging, portability, and rapidity of development.

A smart engineer determines the bottlenecks on a project and works to reduce them in a way that optimizes the translation of human labor to system value. Worshiping arbitrary and Puritanical metrics may make you feel good, but it is a waste of time.



It's not much fun at the top. I envy the common people, their hearty meals and Bruce Springsteen and voting. --SIGNOR SPAGHETTI
5 million comments? (1.42 / 7) (#59)
by omghax on Sun May 02, 2004 at 03:50:58 AM EST

so K5 has been around for, what, 4 years? 4 * 365 = 1460. And 5,000,000 / 1460 = 3200 comments PER DAY!?!?! Hell no (and remember, this is talking about the early years, when K5 saw minimal traffic).

YOU STUPID FUCKING IMBECILE. Look at http://www.kuro5hin.org/pages/stats and see how many hits per day there are. And then find the ratio between the number of posts per day (3200) versus the number of page hits.

You sir, are necessarily a moron.

I put the "LOL" in phiLOLigcal leadership - vote for OMGHAX for CMF president!

Yes, but no, and also, no (2.83 / 6) (#60)
by ZorbaTHut on Sun May 02, 2004 at 04:34:57 AM EST

Yes, what you've described would work.

However, it would be ludicrously slow, for no good reason. We don't expect our database software to be bug-ridden. It isn't. It works. It might not be utterly bugfree, but chances are, for the kind of simple thing you need on a full-text search, it will work flawlessly.

Second, it would be *ludicrously* slow. Linear-time. We can do better, and linear-time will bring even modern boxes to their knees on 7gb of data unless you want to spend thousands on RAM (totally unnecessarily).

So: We can do better, more easily, and without any expected major problems. Why exactly is yours even a good idea?

Third, incidentally, full-text searches suck. Honestly. We've known that for years. It's a horrible way of searching.

A full-text search, implemented through a DB, won't take much longer than what you've described here. It will likely be as bugless, or even more bugless, and it will be far faster. And it will still suck.

Assembler Too Slow (2.40 / 5) (#61)
by OldCoder on Sun May 02, 2004 at 06:01:25 AM EST

The naive idea of
a really efficient parser to scan it.
has been superseded by string search algorithms some decades ago. If you are going to make a huge text file, then rip off a fast algorithm coded in C and let it rip. It will easily beat your assembler-coded naive string search. If it is still too slow, then translating it to hand-coded assembler will probably not help. Choose a different algorithm or buy a faster computer.

Of course finding the string inside the huge text file isn't the same as finding out what article the phrase was used in. That's because lots of algorithms skip around in the text and will ignore article headings. Once you've found the string you might have to search backwards to find the header. And so on.

Compressing the text may enable you to keep it all in ram, and people usually don't know exactly and precisely what they are looking for, so that it may be faster and more practical to use one of the algorithms for approximate string searching of compressed text. However, be aware that these methods are still new and research is ongoing.

Given the problem statement it might be better to search uncompressed text for more than one pattern at a time using Multiple Approximate String Matching.

--
By reading this signature, you have agreed.
Copyright © 2003 OldCoder

OR... (2.00 / 4) (#62)
by Farq Q. Fenderson on Sun May 02, 2004 at 06:37:18 AM EST

You could use an RDBMS that has the option fine-tuned indexing. Then, you could use it. MySQL's indexing is for people who don't care about how the indexing actually behaves.

I don't think you're going to get much out of writing a bunch of assembly for the purpose of doing string searches. Tight code is nice, but tight algorithms are where it's at. Besides, most worthwhile compilers can write better assembly than you can, and do.

farq will not be coming back

This problem was discussed recently (2.66 / 6) (#63)
by liquidcrystal3 on Sun May 02, 2004 at 06:54:57 AM EST

Building an open source search engine. At 7GB of searchable data, I'm pretty sure whatever is the best method for them is the best method here; a good search engine can search the entire web in a couple of seconds.

well written article but stupid idea: -1 (2.00 / 6) (#64)
by ant0n on Sun May 02, 2004 at 07:32:41 AM EST

Coding in assembly can be an interesting and rewarding activity in itself, and an understanding of assembly will improve your skills in programming in any high-level language, too. But coding in assembly as a means of speeding up a program is a very bad idea. If you want to speed up a program, you should speed up the underlying algorithm, not the actual implementation. Your proposal sounds quite naive: you have a very slow algorithm (brute-force, linear text search in a huge, flat textfile) and you want to speed this up by using fast hardware and assembly in 'a tight loop'... that's the way script kiddies think. Your argumentation is on the line of those morons who believe that every programming problem can be solved as soon as we have 7 Gigaherz - Athlons with 10 Terabytes of RAM.


-- Does the shortest thing the tallest pyramid's support supports support anything green?
Patrick H. Winston, Artificial Intelligence
Aaarrgh (2.85 / 20) (#73)
by bugmaster on Sun May 02, 2004 at 01:11:18 PM EST

God god man, not again !

I don't know how I can make this any clearer. Should I use sock puppets or something ? Look. You seem to have two main problems with the vector search and similar approaches:

  1. RDBMS is not an ideal solution to search
  2. Smart search algorithms are hard to write
I agree with #1; for pure search, RDBMS doesn't really fit the bill. Note that the vector search proposal is not an RDBMS, but a separate engine. In case of Scoop, however, the rest of the site is already written as an RDBMS (AFAIK), so it would avoid some code duplication to write search that way, too.

Your #2 reason is just stupid. Yes, search is hard to write -- but that's because it's a hard problem. Period. You write,

Simple code is code that is easily optimized and easily debugged. It also buys you short development time and low cost...
Ok. Here's a simple challenge for you, then: write a bubble-sort program in pure Assembly. Optimize the crap out of it. Now, write a quicksort program in C or Java (actually, they come with quicksort IIRC, not sure how good their implementations are). According to you, the simple bubble-sort will outperform the complex quicksort every time, right ? Well, this is something that's trivially easy to test: just keep increasing the data set in a loop, and graph the resulting time for both algorithms. One graph will look like O(n^2), the other one will look like O(N log N); I am going to let you figure out what that means.

The problem with your approach is not that you can't code (I'm sure you can) or that I'm employed by some evil commercial-code corporation (I'm not). The problem is that it's just so damn stupid. Your approach is a linear search through records on disk; this ensures that your search is only as fast as the disk. Disks are slow, localroger. Very, very slow. Even RAID arrays. Your search is O(N) on disk with a very small constant factor. Every other search engine is something like O(log N), with a large constant factor. Pull out your trusty TI-85 to see what I mean.

When you need something to take less time, the tradeoff is usually to have it take more space. For example, this is how caching works. RDBMS, vector search, other search engines... They all utilize this approach: they create some alternate data structure that takes space, and use it to save some (or, actually a lot of) time. Is this more complex ? Yes. But it's essential. Yes, a bicycle is much simpler than a car -- but you wouldn't go from San Francisco to Toronto with just a bicycle. It's simply too slow. We are talking about seven gigabytes of text here, man. Modern drives have the throughput (and that's without seek time, assuming only one user, etc.) of about 30-60 MB/s (I admit, I just pulled that number off of Google). That's on the order of 200 seconds to scan the entire file, assuming there's just one k5 user doing it.

Your statements on speed ("a thousand concurrent worst-case searches will also take 60 seconds", "The technology to bring the worst-case search under 10 seconds is already there for a reasonable cost", "and within a shorter time than K5 has already existed I'd guess that 1 second will be achievable") tells me that you are not only ignorant of the big-O notation, but of UI design as well. 60 seconds per search is not just a bit slow; it's unacceptable. 10 seconds is unacceptable. 1 second would be barely acceptable -- but I want comment search to work today, not in some bright future when hard drive maufacturers will do your thinking for you. You see, if any operation takes more than 50 ms (again, I pulled this off of Google, but I've seen similar figures before), users assume that the application has locked up (or that k5 is timing out). This includes drduck, duxup and rusty. This includes me and you, yes, you. We human beings are just wired that way. No one will sit there waiting for your best-case-scenario 60s search to terminate (especially since it's more like 200s in reality). Acceptable latency for user interaction is measured in milliseconds.

It sucks, but it looks like you might actually have to think about things before launching your search engine project -- as opposed to coding linear search in Assembly right away. Life is hard.
>|<*:=

+1FP, but (none / 2) (#77)
by nlscb on Sun May 02, 2004 at 03:42:31 PM EST

Careful with A Modest ... Proposal. Only you localroger can get away with that cliche. It should be retired.

Seriously, great article as usual.

Comment Search has returned - Like a beaten wife, I am pathetically grateful. - mr strange

How does a novice begin w/Assembly? (none / 1) (#78)
by nlscb on Sun May 02, 2004 at 03:48:23 PM EST

dear localroger, your complaints on programmers not knowing assembly are long and many. fine. how does an aspiring programmer correct this horrible oversight in our programming education system? i would currently describe myself as a beginning intermediate in my skills. what do you suggest i do to understand the magic of assembly, besides 20 years working on programming industrial machine tools? sincerely, nlscb

Comment Search has returned - Like a beaten wife, I am pathetically grateful. - mr strange

Why is this being voted up? (3.00 / 11) (#85)
by The Honorable Elijah Muhammad on Sun May 02, 2004 at 04:39:27 PM EST

Dear localroger,

Being a competent programmer may give you license to bitch and moan about how the kids these days aren't taught asm at birth, but it's no excuse for not realizing why a flatfile O(n) search is a very bad idea for such a large amount of data. Yes, optimization is good, but the minor increase you can wring out of writing proper asm is miniscule compared to the improvement you can get out of choosing a proper algorithm in the first place. Yes, your idea works in theory, but in practice it's so expensive, and such a pain in the ass as to be infeasible.

If you can sit 7gb of data (that's growing every day) in ram, using your expensive disk arrays and memory, and then search through it flat... just think about how much better you could do it if you had chosen a proper method of searching in the first place.

Options:
(1) Go ahead with localroger's plan, rusty takes several thousand dollars out of the yacht fund to pay for ram, servers and disks.
(2) Go with the correct solution and implement pb's vector search code.

Sincerely,
The Honorable Elijah Muhammad


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
How the fuck is this getting voted up? [nt] (none / 2) (#88)
by tlhf on Sun May 02, 2004 at 05:00:17 PM EST



I know nowt about this (none / 0) (#89)
by GenerationY on Sun May 02, 2004 at 05:01:21 PM EST

but how does this solution stand up against adding a Google Site Search box to the top of the front page?

I'm tempted to think 600 USD is a small price to pay to make the problem disappear (600 USD wouldn't buy you consultancy time sufficent to implement the above would it?) I assume, however, I am missing the critical flaw in this fiendishly simple plan.

A Modest Improvement (2.90 / 11) (#92)
by Russell Dovey on Sun May 02, 2004 at 05:48:38 PM EST

My idea is so simple as to be obvious in hindsight, but:

If the searcher is faster when written in assembler, then why don't you code the textfile in assembler? Assembly language speeds up everything by a factor of 2 or 3, according to your earlier comment, so using both an assembler searcher and an assembler textfile would be 4 to 9 times faster! That's a total search time of either 15 seconds, or 6 and a half seconds!

I'll split the patent with you, we'll make millions selling this to Google alone!

"Blessed are the cracked, for they let in the light." - Spike Milligan

Information Retrieval (3.00 / 11) (#105)
by Pseudonym on Sun May 02, 2004 at 08:11:04 PM EST

Disclaimer: I write highly-scalable text database servers for a living. Therefore anything I say may be biassed.

It's a dumb idea. It's signature files all over again, only this time, we're not bothering with the actual signature files. (For the record, using relational databases for text is also a dumb idea. Text does not fit in columns. Unless your DBMS has a specialised text index built-in, performance will always be poor and space usage, especially on 7Gb of data, will be worse even than localroger's proposal. Just don't go there.)

The limiting factor these days is not disk at all, but cache. Having multiple concurrent searches will not run at the same speed as a single search because every memory access will be a cache miss.

Recent research (though there is some evidence that the "big" search engines have known this for a while) has found that compressed inverted indexes are faster than uncompressed indexes (e.g. "vector space searches") even when the vector space index fits in RAM, so long as you use the right compression algorithm. Plus, it's not nearly as hard to implement as you might think.

Oh, one more thing. A search engine that takes 60 seconds to finish, even if it's predictable, is useless. People like feedback, and the ability to refine then re-run their searches.


sub f{($f)=@_;print"$f(q{$f});";}f(q{sub f{($f)=@_;print"$f(q{$f});";}f});
Why not just keep a hash in memory? (none / 2) (#106)
by big fat idiot on Sun May 02, 2004 at 08:30:49 PM EST

How many words are in the English language? Would adding all the non words in common use on the internet even manage to double that number?

So why not just build a hash of words together with a pointer to a list of all the comments that contain those words?

Then have a search engine that runs as a separate process or as a demon that scoop could hand off the code to?

Oh wait! It's been done already

Go do it (2.90 / 10) (#107)
by lukme on Sun May 02, 2004 at 08:37:51 PM EST

Since you seem to have so much conviction, and assembler experience, I think it is a wonderful learning experience for you.




-----------------------------------
It's awfully hard to fly with eagles when you're a turkey.
*yawn* (2.66 / 9) (#110)
by Estanislao Martínez on Sun May 02, 2004 at 10:13:28 PM EST

Time complexity of scanning a text for a match: O(n), where n is the size of the text; for the sake of argument, we'll measure it in words.

Time complexity of balanced binary tree lookup: O(log2 n), where n is the number of keys.  (A key in a keyword search will be a word.)

Constant speedup factor that yokelroger claims from using assembly over C (in comment here): 3

By the time n becomes as small as 9, the C-coded balanced binary tree is finding hits faster than the assembler-coded sequential scan.  (Based on solving n = 3*log2 n)  Sure, there's additional complexities to be considered (the hit you get from searching a tree for a single keyword is presumably a list of documents containing the word being searched), but even for medium-sized values of n, the balanced binary tree is devastating.  (And using a trie might be faster and more memory efficient than a binary tree.)

--em

PS (1.60 / 5) (#111)
by Estanislao Martínez on Sun May 02, 2004 at 10:26:42 PM EST

Your article is stupid *and* unfunny.

--em

+1 Fiction [nt] (3.00 / 12) (#113)
by nooper on Sun May 02, 2004 at 11:18:52 PM EST



+1 FP, Made me realize something STUPENDOUS! (none / 2) (#114)
by Empedocles on Sun May 02, 2004 at 11:39:00 PM EST

That localroger's brain fossilized shortly after 1981.

---
And I think it's gonna be a long long time
'Till touch down brings me 'round again to find
I'm not the man they think I am at home

Best of both worlds? (none / 0) (#121)
by For Whom The Bells Troll on Mon May 03, 2004 at 12:38:02 AM EST

Indexes and parsing text-files => NXDB.

Been tinkering around with quite a few for sometime now, but admittedly, most of them (including the non-OSS ones) have some distance to go before you want to take a serious look at them.

---
The Big F Word.

Not smart, better use reiserfs (none / 1) (#123)
by Highlander on Mon May 03, 2004 at 02:27:24 AM EST

A time of 60 seconds to scan a text file is much too long, and it also takes away full seconds of raw processing power.

Not to use a binary tree seems stupid, since the binary tree scales so much better. One can still use your idea of using the filesystem and keep the binary tree somehow on the file system.

Actually I think there are file systems around that already use a binary tree, I think reiserfs, so you really need only to put your keys into filenames and hits into the files. If it doesn't perform, complain to the reiserfs team and make it work, because it should.

You propose ending the search after the first hit, but we really want all stories matching a query. I rather suggest that only stories be searchable, since adding every comment to the database might cost performance, but maybe I am overestimating the power of a cluster of monkeys here.

Moderation in moderation is a good thing.

HAHAHAHA (none / 2) (#124)
by the77x42 on Mon May 03, 2004 at 04:43:41 AM EST

Since rusty has so much money, he can get one of these.

Random binary from good ol' DBASE is my favorite search method. It would pick a random number and then see if your query was higher or lower. It can then get rid of half the database results. Repeat. Wash. Rinse.


"We're not here to educate. We're here to point and laugh." - creature
"You have some pretty stupid ideas." - indubitable ‮

you sound like alta vista (none / 1) (#126)
by dimaq on Mon May 03, 2004 at 06:58:48 AM EST

(IMHO)

anyway I propose you make a simple test - wget -R a large enough archive of messages of some sort - be it newsgroup or k5, then try your linear search (which can be done in a database too btw), and do the same with some reasonably dumb database like mysql using it's native b-trees to search for the words first and then parse the result to check whether a "phrase", however you define it actually occurs there.

btw you don't have to store words as such - you could store some derivative be it a truncated hash function or some sort of normalization (like plural->singular and past->present tense conversion for English). You would arrive a smaller but less precise index.

In the end the point against grepping through a large file is that it's a linear operation whereas selecting a subset of messages is logN.

And finally if I recall correclty there was an article of some sort published by folks who created alta vista search engine which discussed tradeoffs and implementations of full text search.

Oh yeah I forgot - you could let google do the job instead :)

I beg of you rusty, delete this shit (2.00 / 7) (#129)
by boxed on Mon May 03, 2004 at 07:16:34 AM EST

Dear Rusty,

this article makes all of kuro5hin look stupid. Not stupid as in "a little naive", but stupid as in "can't fucking read, search google or think at all". Delete this piece of shit article PLEASE, before hundreds of ignorant newbies actually believe localrogers statements and decides to code everything in assembler with bogosort. Yes, bogosort is EXACTLY as intelligent as this trite.

hubris (none / 2) (#136)
by dlec on Mon May 03, 2004 at 09:16:39 AM EST

What nonsense is this? I just can't believe what I'm reading. Your arguing that we should just search the entire text each time. Unbelievable! A modern hard drive will do about 25MB/second so your query over this 7GB file will take about 5 minutes. How is that possibly an improvement?

Let's get this straight: relational databases aren't designed for textual indexing, but for queries over structured data. What do you think the relational means anyway? They usualy have basic text searching tacked on, but it's not a core compotency.

If your finding that MySQL isn't working as well as you'd like, that is because your using the wrong tool. Did you never consider using an indexing server? Why didn't you just ask someone before coming up with this nonsense.

The only indexing server I have ever used is Lucence, but you can find a compilation of other open source Java indexers here, or open source indexing servers in general here.



Ignore this article (none / 3) (#137)
by DDS3 on Mon May 03, 2004 at 09:35:50 AM EST

...this article is horrible and plagued with bad, misinformation.

The simple fact that he thinks MySQL hangs the moon, is reason enough to ignore it, yet it gets factually worse.

The indexing ... processing requirements.

Not if they are done properly.  That's the whole friggen point of having indexes.  Since we are talking about searches, once the data is inserted, either the index is being used and reducing process or it's worthless (ignored).

Searches are already limited by default to a more modest "current" dataset.

What?  What is he trying to say here?  That databases have a non-current dataset?  WTH?

As blocks are loaded into RAM they can be used for all ongoing concurrent searches, so while one worst-case search might take 60 seconds, a thousand concurrent worst-case searches will also take 60 seconds.

What?  This guy is on crack!  This makes sooo many assumptions.  Chances are much higher that a thousand worst case concurrent searches will take somewhere between 60 and 60,000 seconds. Since when can a linear increase in work be completed with zero overhead?  Especially if we are talking about MySQL.  Plus, this assumes that the entire dataset nicely fits into RAM.  If it doesn't, expect performance to fall quickly with each additional search.  As it relates to MySQL, MySQL has serious scaling issues, so I'm absoluetely sure that you will not see 60-seconds for 1000, 60-second, concurrent queries.  Chances are, it will be time for a quick nap.

The effect on the main database is effectively zero

Main database?  Text search isn't considered the main database?  Does this assume that the text search DB is on another box?  Seems to have more bad assumptions.

This is just putting my toe in the water.  Basically, we can ignore anything this article offers.

This proposal is absurd (none / 2) (#143)
by p3d0 on Mon May 03, 2004 at 10:07:41 AM EST

Tell you what... You write your whole-file-search in assembler, and I'll write a Python script in one day that outperforms it by three orders of magnitude on a 7GB dataset. I'll achieve this by using "weird hash algorithms" and "strange lookup files" because that's what it takes to do the job properly.
--
Patrick Doyle
My comments do not reflect the opinions of my employer.
Conclusion: a large number of kurobots (none / 0) (#144)
by Undesirable Username on Mon May 03, 2004 at 11:27:02 AM EST

have a really sick sense of humor.

Although... reading through all the disparaging comments one experiences a level of meta-amusement hardly seen before. Almost orgasmic. Maybe not so sick after all =)

check out lurker for high-performance indexing (none / 2) (#148)
by klash on Mon May 03, 2004 at 12:15:04 PM EST

I highly recommend looking at the software package lurker. It's a mailing list archiver that creates keyword indexes using a highly specialized on-disk database. The database code is very cleanly separated from the rest of the package in the directory "libesort." The code is also very clear and very well-documented. Here is what the code comments have to say about this database (sorry about the bad formatting of the table, for some reasons scoop doesn't support <pre>):

/* What is ESort? -- An Online External Sort
*
* ESort provides a very primitive database API: An insert-only set of keys.
*
* There are very different trade-offs when compared with a standard database.
*
* N = number of keys in the set, M = number of operations
* The seeks needed to for each repeated operation in one transaction:
*
* ESort BTree Hash
* Insertion O(1) O(M*log(N)) O(M)
* Read in sequence O(1) O(M) impossible
* Seek to key O(M*log^2(N)) O(M*log(N)) O(M)
* Delete impossible O(M*log(N)) O(M)
*
* From this table, one can conclude that ESort allows blindly fast insertion
* and sequential read, but pays in the cost to seek. This is what makes it
* highly appropriate for keyword indexes.
*
* An additional restriction is that ESort only supports a single-writer.
* An additional advantage is that ESort readers get snap-shots.
*/

Lurker is designed to scale extremely well, so it seems likely that it could handle the demands of keyword-indexing all of kuro5hin's stories.



Dear localroger, (2.75 / 8) (#150)
by razumiking on Mon May 03, 2004 at 12:34:30 PM EST

I've enjoyed reading you stories for about two years now. Your writing is been a credit to K5. Since I got my account, I've always reflexively voted up your articles, but now I don't know if I can trust what comes out of your account anymore.

As I read this article, I couldn't believe what I was reading. It was misinformed, arrogant, even deceptive. It seemed as if it was designed to provoke people. I felt like, and I don't use this word lightly, I was reading a troll. It was my old usenet days all over again.

I've had bad experiences with trolls, localroger. I've had two of my favorite newsgroups and my favorite website overrun. Trolls have posted my personal information online in places where I thought my anonymity was assured. Trolling is not a game, localroger. When you troll, real people get hurt. It is as you said in a diary a while back: Trolls are the bullies of the internet.

I know you're a smart guy. I think everyone here knows it. You don't need to show us how clever you are by tripping up less canny readers with your trickery. That's the sort of thing RobotSlave and rmg get off on. It's not you.

People say K5 is dying, but I didn't believe it until today. Today, one of my favorite posters, a central figure in this community, and surely the most respected, has turned to trolling. I'm stunned and disillusioned.

From now on, I'll have to read your stories very carefully before I vote on them. I'll read all your posts with a suspicious eye. I'm sorry it has come to this, but I don't think we really have any choice anymore. You've broken the seal of trust. It will take along time to heal.

Sincerly,

Graeme.

1 word article replacement: (none / 2) (#151)
by phred on Mon May 03, 2004 at 12:35:58 PM EST

grep

assembly's fast! (none / 3) (#152)
by rmg on Mon May 03, 2004 at 01:14:21 PM EST

yeah yeah yeah!!

it's not slow!!

no no no !!

----

dave dean

I'm not going to enter this debate (none / 0) (#156)
by Haunting Koan on Mon May 03, 2004 at 02:41:00 PM EST

But isn't there some way to use Google to search k5?

Off topic - (none / 1) (#159)
by mami on Mon May 03, 2004 at 03:41:10 PM EST

I wonder, if my impressions are right. All the time something very controversial and upsetting is going on in politics, some oldtimers like localroger and irmdkl write regularly very uncontroversial and more or less boring, but valuable "technical or cultural" articles.

Two words. (none / 0) (#169)
by hummassa on Mon May 03, 2004 at 06:45:48 PM EST

Radix search.

It's called AST (none / 3) (#173)
by KWillets on Mon May 03, 2004 at 07:57:15 PM EST

Assembler

Search

Technology

Google appliance (none / 0) (#176)
by jabber on Mon May 03, 2004 at 08:46:43 PM EST

How many of these could Rusty buy for the $30k he collected in his CMF stunt?

[TINK5C] |"Is K5 my kapusta intellectual teddy bear?"| "Yes"

You apparently ignore basic facts (none / 2) (#182)
by vyruss on Mon May 03, 2004 at 10:47:02 PM EST

of computer science. One of them is, even if the hard disk does nothing but max out its sustained transfer rate on your (must be non-fragmented) single text file, you need A BUFFER to read the text into. After having skimmed your article and preceding comment and found nothing of the sort, I refuse to even read any more of it.

This is not a personal attack, I admire you for your writing but this is like reinventing air, not the wheel. And badly, might I say.

  • PRINT CHR$(147)

et tu, roger? (none / 1) (#194)
by Entendre Entendre on Tue May 04, 2004 at 03:18:07 AM EST

Now I REALLY wish adequacy was still around.

--
Reduce firearm violence: aim carefully.

wtf are you smoking? (none / 3) (#196)
by polyglot on Tue May 04, 2004 at 03:26:26 AM EST

gaining a few percent by writing in asm gets you nowhere if you use an O(n) or worse algorithm instead of O(log n). We don't let people out of first year CS until they understand that.

Seriously, go back to writing fiction. That was good. This is an unmitigated disaster.
--
"There is no God and Dirac is his prophet"
     -- Wolfgang Pauli
‮־

Use a relational database (none / 0) (#197)
by Ward57 on Tue May 04, 2004 at 03:37:28 AM EST

to hold the index system, and then prune the index so that it fits into main memory. Say a "prune" that triggers every so often, and when it can't allocate more space (for a carefully defined meaning of space, 1000,000 entries or a sensible number).

I haven't followed this discussion (none / 1) (#200)
by cyberdruid on Tue May 04, 2004 at 03:56:48 AM EST

But why not use one of the many good open source search engines? Swish-e is wicked fast.

our new k5 motto (none / 1) (#210)
by auraslip on Tue May 04, 2004 at 10:24:02 AM EST

"we hate this site"
124
how does google do it so fast? (none / 0) (#211)
by auraslip on Tue May 04, 2004 at 10:24:42 AM EST

with their "4,285,199,774 web pages"
124
Let google do it? (none / 1) (#215)
by univgeek on Tue May 04, 2004 at 01:24:47 PM EST

I know, I know google clobbers the db engine. However, would the following work -
  1. Create a separate 'archive directory'. This is basically a directory tree where each file is a comment in a flat text file, with only one link.
  2. This whole tree is cordoned off from the rest of the site, and ONLY this tree is allowed for search by google. I guess the robots.txt would allow this.
  3. The link would be to the db version of the comment. This way, anyone who is interested in following up, can go to the 'live' comment. Google wouldn't poke at this link because it would be part of a 'banned' tree in the robots.txt. Perhaps this may be too painful for users? Or it could even be a 20 sec refresh, leading to the db 'live' version.
  4. Add new comments to this tree, as they are created.
The tree could be
a) year -> month -> day -> comment, or
b) year -> month -> day -> story -. comments

Disclaimer: I have no clue on how mySQL will handle this, or any clue of whether the space occupied by this archive will be too large. Please poke holes in this. Sorry if this is too dumb.

Arguing with an Electrical Engineer is liking wrestling with a pig in mud, after a while you realise the pig is enjoying it!

The Poll (none / 0) (#223)
by theboz on Tue May 04, 2004 at 03:48:07 PM EST

I wish K5 had the multiple poll options like HuSi does, because I keep my data in most of the options.

Stuff.

Are you a programmer? (none / 2) (#225)
by trhurler on Tue May 04, 2004 at 04:26:30 PM EST

I could write Tex's word based solution and debug it to a very high level of reliability in a matter of hours, even in C. In less than a week, I could have an assembler solution for a single chosen target, but I wouldn't want to. The time required to do this in a high level language is probably stupidly small, but performance would suffer in most of them(and yes, you stupid language bigots, in this case, O(N) != 2(O(N)), so shut up already!), and it isn't hard to do this in C anyway.

Also, database corruption is easy to fix. Once every n time units(you pick based on practicality and your need for data integrity,) you back up the whole shebang by doing an export. You keep the last m exports just in case, with an automated system to shuffle them as needed(this will take maybe an hour to create, maybe two hours, given that you include debugging and documentation.)

In conclusion, your idea is bad. It is bad for one simple reason: if today, we are manipulating seven gigs of data, then tomorrow, it will be seventy. Are you willing to wait ten minutes for your data? Even if disks become twice as fast(this takes a LONG time, historically,) you'll still be screwed; five minutes isn't good enough.

The correct solution probably IS to learn how to tweak the RDBMS, if that can be done. It may "defeat the purpose," but it is probably the minimal effort solution that provides good performance, and that's what counts. The problem is, I doubt MySQL can be made to do what is needed. Frankly, it sucks goat nads. Tex's idea is a close second; my main concern is whether it would actually perform as well as one would think. I'd have to give that more thought and maybe some trial runs before I was sure.

--
'God dammit, your posts make me hard.' --LilDebbie

You can't be serious... (none / 1) (#236)
by ucblockhead on Wed May 05, 2004 at 11:32:12 PM EST

Your proposal just makes no sense. There's a reason databases exist. Storing the the data in one big chunk only makes sense if the data can only be used in one big chunk.

One way to make the comment search faster is to create a word index as comments/stories/etc are posted.

Roughly speaking, you have a table where the primary index is a single word, and the other columns are the comment id and perhaps the post date. Then to find all comments with that word, you do a select on that index, getting a set of rows in date order, each one representing a different post.

Of course, this will be a bit disk-hungry. But that's just an instance of trading memory (or diskspace) for speed. (K5 may have 7 gb of data, but a 250 gb disk costs $200.)

It's the difference between a Unix find and a Unix locate. (Roughly...the index for locate isn't created on the fly.)

(Having a real database like Oracle, and having a seperate database servers obviously helps.)

See, the thing is that searching the amount of data k5 has is not at all a "hard" problem. Real, professional systems generally have larger datasets and stricter requirements.
-----------------------
This is k5. We're all tools - duxup

localroger, (none / 0) (#237)
by ethereal on Thu May 06, 2004 at 11:02:05 AM EST

You ARE the Brute Squad!

--

Stand up for your right to not believe: Americans United for Separation of Church and State

if you had money... (none / 0) (#241)
by metalotus on Fri May 07, 2004 at 11:04:00 PM EST

could you pay Google to index your site? I'm not sure how most web sites store their data, but if you were inclined, it could save a lot of time.

I've found that if you want to sell your algorithm, just show someone how well it works! You could take a sample of K5 data, say 20 stories, and work with a reduced data set. You could write a simplified version of your algorithm and show how much work it does. Then you could hypothesize how the optimizations would help, and explain how well your program will scale from 20 stories to 7GB of stories. Without sample data & code, it is not convincing.



Ha! (none / 0) (#244)
by BOredAtWork on Sun May 09, 2004 at 02:42:47 PM EST

Well done, localroger. Some very interesting conversations have started because of this; Jonathan Swift would be proud.

What the fuck? (none / 0) (#245)
by ksandstr on Tue May 11, 2004 at 12:12:31 PM EST

Tex's suggestion is almost exactly the way that the typical full-text search is implemented[2]! If you were to have knowledge of a proper relational database system (such as PostgreSQL), you'd know that such things can be implemented as database extensions (OpenFTS springs to mind) while being entirely transparent to the application[1]. Hell, from what I've gathered, a simple combination of stored procedures (defined as operators if you really need the syntactic sugar), table rules and a secondary schema for the full-text searching system would be sufficient for such an implementation, even without resorting to a shared object (i.e. C linkage) extension!

Then again, migrating an application from MySQL to PostgreSQL is nowhere as painful as changing your mindset with regard to how RDBMS-oriented applications are supposed to work.

[1] Though I suppose this would introduce some confusion along MySQL-heads who're already used to tossing all their database logic into the application, the one place where it simply does not belong.
[2] Even MySQL does this behind the scenes, although in a characteristically immature and inflexible manner. PostgreSQL supports a special-case form of full text search (really a column prefix search) when a b-tree index is created on a varchar column and a LIKE operator is applied as the primary search rule with exactly one percent sign at the end.

--
Gegen kommunismus und bolschewismus und terrorismus, jawohl!

I'm missing a detail (none / 1) (#248)
by gstover on Tue May 18, 2004 at 09:12:08 PM EST

If the search algorithm does its search through the big file, & it finds the target string, how do you map the target string's location back to the file that the user wants?

gene



A Modest Search Algorithm Proposal | 253 comments (246 topical, 7 editorial, 1 hidden)
Display: Sort:

kuro5hin.org

[XML]
All trademarks and copyrights on this page are owned by their respective companies. The Rest © 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.
Need some help? Email help@kuro5hin.org.
My heart's the long stairs.

Powered by Scoop create account | help/FAQ | mission | links | search | IRC | YOU choose the stories!