Kuro5hin.org: technology and culture, from the trenches
Building a Search Engine.

By byronm in Internet
Sat May 01, 2004 at 12:24:05 AM EST

Without a doubt this has been one of the strangest and most absurd projects I have started so far. Not long ago, the idea that I could build a search engine capable of indexing the Internet as a whole seemed far out of reach. Now it is becoming a reality. Without further ado, I wish to announce the early release of mozdex.com, an open search engine.


Mozdex.com was dreamed up from the belief that searching should be a science and a factual process rather than a proprietary and secretive one. Through the beauty of open source and the hard work of the Nutch team, we have been able to use Nutch to build a beta test index of nearly 50 million pages.

What we want to do is provide a search system where you can see how the algorithm ranks pages. The ability to see incoming anchors and references to a page gives more insight into the results. We feel that with an open API and an open algorithm, the mass of great minds on the Internet can work together to come up with an algorithm that doesn’t lend itself so easily to being cheated by “spammy” sites. The premise being that a well thought out algorithm can understand the basic tricks of the trade and more quickly react to new hacks & cheats used to "spam" indexes.

Mozdex was initially seeded from the Dmoz.org directory. We imported the RDF dump and spidered out from those links to create our beta index. This is how we arrived at the name “mozdex”, short for the dmoz.org index. Over the next few days we will be spidering out further, working through at least 100 outbound and 100 inbound URLs, which will increase the anchors and ranking of pages and create a more balanced ranking index. Interestingly, because the data is a limited subset drawn from dmoz.org sites, the results are “true” to what they would be for that specific market. Oddly enough, with our smaller index subset, non-English sites had more rank than a lot of their English counterparts.
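
As a rough illustration of the seeding step, here is a minimal Python sketch that pulls a de-duplicated seed list out of a directory dump. It is not the Nutch importer. It assumes each listed site appears in the dump as an `<ExternalPage about="URL">` element, and the function name `seed_urls` and the sample lines are made up for this example.

```python
import re
from html import unescape
from typing import Iterable, Iterator

# Assumption: every listed site shows up as <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def seed_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct listed URL once, in first-seen order."""
    seen = set()
    for line in lines:
        for raw in EXTERNAL_PAGE.findall(line):
            url = unescape(raw)  # the dump is XML, so "&amp;" must become "&"
            if url not in seen:
                seen.add(url)
                yield url


# Hypothetical sample lines standing in for the real dump file.
sample = [
    '<Topic r:id="Top/Computers">',
    '<ExternalPage about="http://example.org/">',
    '  <d:Title>Example</d:Title>',
    '</ExternalPage>',
    '<ExternalPage about="http://example.com/a?x=1&amp;y=2">',
    '<ExternalPage about="http://example.org/">',
]

if __name__ == "__main__":
    for url in seed_urls(sample):
        print(url)
```

Reading the dump line by line keeps memory flat, which matters because the full directory dump is far too large to parse into a single in-memory tree comfortably.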

Through early May our goal is to hit the magical 250 million page mark. At 250 million pages we have sufficient data to start ranking results based on anchors and analyzer algorithms. To reach this goal we have a network of two db servers using the Lucene index system (Jakarta Project) with two terabytes of disk space on each server, as it generally takes 10 KB per page to store the data and index segments. Our query farm is five P4s, soon to be joined by five AMD Opterons with 16 GB of memory. Through some early testing we realized that our biggest cost was rack and facility space, and that the performance and memory capacity of the Opterons offered us the best value. With query servers and indexes, the goal is to hold as much of the index segments in memory as possible for the quickest retrieval. The memory capacity and throughput of the Opterons are a great advantage in this arena.
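
The storage figures above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic. It reads "10 KB" as decimal kilobytes and assumes the two 2 TB servers split the index between them rather than mirroring it, which the article does not state.

```python
BYTES_PER_PAGE = 10 * 1000  # "10 KB per page", read as decimal kilobytes (assumption)
TB = 1000 ** 4              # decimal terabyte


def storage_tb(pages: int, bytes_per_page: int = BYTES_PER_PAGE) -> float:
    """Disk needed for the stored data and index segments, in terabytes."""
    return pages * bytes_per_page / TB


# Assumption: two db servers with 2 TB each, holding different parts of the index.
capacity_tb = 2 * 2

milestones = [
    ("beta index", 50_000_000),
    ("May goal", 250_000_000),
    ("year-end goal", 2_500_000_000),
]

if __name__ == "__main__":
    for label, pages in milestones:
        need = storage_tb(pages)
        print(f"{label}: {need:.1f} TB needed, fits in {capacity_tb} TB: {need <= capacity_tb}")
```

Under these assumptions the 250 million page goal needs about 2.5 TB, which fits, while the year-end goal of 2.5 billion pages needs about 25 TB and would call for several times the current disk.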

Obviously, data availability and query performance are crucial when building a large index, which carries not only the overhead of index maintenance but also query processing and day-to-day searches. Each query server has a master and a replication server that are load balanced and fail over to each other. On our web tier we use Jakarta Tomcat JSP servers load balanced behind Squid. Squid offers us a highly efficient way to load balance, cache and tweak the throughput of our server farm. Many hardware-based systems are built from Squid, so we are taking it a step further and using Squid as an integral part of a high-availability, high-performance web farm.

I will be putting up a daily blog of activities, events and issues as they come up, and it will be available as a link from the search page. Open search isn’t just the technology we run but also the process and concepts we use to reach our planned result of 2.5 billion pages by the end of the year.

We ask the kuro5hin community for your opinions and thoughts on such an index. Are webmasters, publishers and searchers generally interested in how they get the information they are presented with when they search? Do you think it is actually feasible to work on a process that stays ahead of the cheaters, or do you believe it will be something doomed to fail just because it's competing against the likes of Yahoo or Google?

If there is any interest in this subject, we would be more than happy to publish white papers on our network, our servers and what it takes to build an index capable of indexing and searching millions of pages. Through open technologies and the minds of the Internet, we feel we can provide an invaluable search tool. Let us know what your opinion is.


Poll
Is there demand for open search?
o Yes 43%
o No 26%
o I don't Care 30%

Votes: 88



Building a Search Engine. | 99 comments (68 topical, 31 editorial, 0 hidden)
General thoughts. (1.16 / 18) (#1)
by The Honorable Elijah Muhammad on Wed Apr 28, 2004 at 05:53:26 PM EST

(1) -1, buy an ad.
(2) You have no hope of competing on the same level as Google, Microsoft etc.
(3) ... I was trying to work in something about how K5 has no comment search, but I'm lazy.


___
localroger is a tool.
In memory of the You Sad Bastard thread. A part of our heritage.
Infrastructure (3.00 / 7) (#5)
by dennis on Wed Apr 28, 2004 at 06:34:04 PM EST

If you can make it a peer-to-peer algorithm so nobody has to buy 100,000 servers, then maybe this has potential. Don't ask me how to do it, though!

The paradox of an open server (2.80 / 5) (#11)
by whazat on Wed Apr 28, 2004 at 07:57:01 PM EST

Say I decide to set up a commercial server using Nutch. I want to make my server different from any other Nutch server, so it will stand out from the crowd.

I am under no obligation to give these changes back to the community, because I am not distributing any software. So the user still won't know the bias put in by the search engine, despite it being based on open source.

What I would like to see is an open source search engine that could be specialised by subject. So that I could run a search engine on plasma fusion, or a specialist subject of my choice. Also give it a non-graphical interface, so that other people could collate my results. That I might find useful.

Just wondering... (2.80 / 5) (#12)
by JahToasted on Wed Apr 28, 2004 at 08:48:30 PM EST

why doesn't google ban the sites that abuse it? I mean if I create a bunch of bogus sites and link back to my main site to increase my ranking, why doesn't google simply delete my site, and all of the sites that link to it (ie. the bogus sites)? Yeah, a few legit sites might get delisted, and people will bitch and moan, but it would be better than having every second result going to some crap site trying to sell junk.

So why doesn't google just ban the bad sites?
______
"I wanna have my kicks before the whole shithouse goes up in flames" -- Jim Morrison

p2p search engine (3.00 / 4) (#17)
by speek on Wed Apr 28, 2004 at 09:26:29 PM EST

I'd like to see a search engine that trades speed for brains. Something like a p2p app that sends out search requests, and nodes return information that particular node has indexed. A node might index sites based on the browsing behavior of the node owner (I say "based on" because it has to be careful not to violate the owner's privacy/security). Basically, I'm thinking of an implementation of the kind of search you read about in sci-fi stories where the person sends out "agents" that return info. Well, sending out agents won't work because who's going to agree to run your agent? Hmm, maybe a p2p app could accomplish the same goal? You send out search requests (complicated requests) and results filter back in over the next few hours/days.

--
al queda is kicking themsleves for not knowing about the levees

The search is terrible (3.00 / 12) (#21)
by JackStraw on Wed Apr 28, 2004 at 10:40:23 PM EST

Try searching for kuro5hin, or even Google! You get random pages within those sites; not the pages that you'd actually want to go to.

Not that I don't love the idea, but it certainly is far from being useful.


-The bus came by, I got on... that's when it all began.

From what I understand (2.47 / 21) (#23)
by Dr Phil on Wed Apr 28, 2004 at 10:55:48 PM EST

The fastest text search method is to make a flat text file of the entire web and use highly optimised assembly algorithms to search it in a matter of seconds. At least that's what localroger told me.

*** ATTENTION *** Rusty has disabled my account for anti-Jewish views. What a fucking hypocrite.
You asked (2.55 / 9) (#24)
by tricknology2002 on Wed Apr 28, 2004 at 11:00:52 PM EST

[D]o you believe it will be something doomed to fail just because it's competing against the likes of Yahoo or Google?

No. I'm sure there will be reasons for failure other than Google and Yahoo. But those two will probably be at the top of a long list.

Scaling (2.75 / 4) (#30)
by helianthi on Thu Apr 29, 2004 at 05:31:05 AM EST

Great idea, but before putting too much energy into this, think twice about the scaling factor. Building a 50 million page index doesn't mean at all that your engine will handle 5 or 50 billion. Think big right from the beginning.

um (2.75 / 4) (#49)
by reklaw on Thu Apr 29, 2004 at 03:01:40 PM EST

is there some reason why that search engine gives me a results screen in something that looks like Spanish ["Resultados 1-10 (de un total de 6.531 documentos)" and "Search" changes to "Buscar"] no matter what I search for?

We need alternatives (2.83 / 6) (#53)
by QuickFox on Thu Apr 29, 2004 at 05:02:31 PM EST

Our dependence on Google makes us very vulnerable. Suppose Google starts charging users five dollars for each search. Or something. There should be alternatives. So this is a great idea.

But to make it useful you need to make it much faster than it is now. Why is it so slow, even though you seem to describe powerful hardware? And of course it must return much better search results.

Give a man a fish and he eats for one day. Teach him how to fish, and though he'll eat for a lifetime, he'll call you a miser for not giving him your fi

nice search engine, but needs some work.. (none / 3) (#54)
by Suppafly on Thu Apr 29, 2004 at 05:08:56 PM EST

You need to work on the algorithm that orders the results. If I search for openbeos, the main openbeos website should be first or 2nd, not like 8th.
---
Playstation Sucks.
On gaming and finances (3.00 / 4) (#59)
by danharan on Thu Apr 29, 2004 at 09:29:46 PM EST

There is a lot of money to be made in search engines with text ads, and it would be rather cool if a search engine's revenues were to finance OSS projects like those you use.

Second, about gaming. Jakob Nielsen suggested an extremely simple system that could get rid of most of it: give people a toolbar that lets them give a page a thumbs up or thumbs down. As long as my votes are not personally identifiable, this might also be a marketing gold mine as well as a way to give people some ownership over their search engine.

Where the big G was revolutionary was in using metadata. Rather than simply take a document -or god forbid, its meta-tags- at face value, it used links. The only way we are going to have another dramatic improvement in search results is by tapping another type of metadata. The Open Directory is a good start, but it too has flaws. Editors can introduce bias, old sites are rarely updated, and new sites take forever to be updated.

As an aside, there is one meta-tag that I would like to see taken at face value: the ICBM address.

---

Oh, I really like the idea, and think it has economic potential too. I also like the fact that I'll be able to figure out why fine art greeting cards does not bring up a site I'm redesigning although the same expression in quotes does.

+1FP, but... (3.00 / 6) (#63)
by pb on Thu Apr 29, 2004 at 11:45:04 PM EST

Although this is a topic that I am interested in, (and we could always use some more tech around here) I have some serious questions.

What is your relation to the Nutch project?

I see that you use the same icons, but I don't see any visible links to them, or indeed any mention of them. Are you guys one and the same, or do you just like to rip off people's websites and freely available source code?

...which brings me to my second question.

Where is the source?

If you profess to have or want some sort of 'open' API or algorithm, then you might--at the least--want to provide a download for it. I understand that it may take time to fully document your (or The Nutch Project's) algorithms and API's; it shouldn't take nearly as much time to put up a tarball of the source. In fact, I managed to get one from the (unmentioned on your site) Nutch project's website.

Whoa, why do your search results suck so badly?

Now, I understand that this is a "BETA" project, but... some simple suggestions. First, cluster results from the same site, and list the base URL first. A simple search for 'slashdot' yields all of their various servers as separate results.
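
To make the clustering suggestion concrete, here is a small Python sketch of one way it could work. It is not how Nutch or mozdex groups results. The names `site_key` and `cluster_by_site` are invented for the example, and taking the last two labels of the hostname as "the site" is a simplification that breaks on domains such as example.co.uk.

```python
from urllib.parse import urlsplit


def site_key(url: str) -> str:
    """Crude site identifier: the last two labels of the hostname (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def cluster_by_site(ranked_urls):
    """Group ranked URLs by site, listing base URLs first within each group.

    Groups appear in the order of their best-ranked member. Inside a group,
    root pages come before deeper pages and shorter hostnames come first;
    the sort is stable, so the original ranking breaks any remaining ties.
    """
    groups = {}
    for url in ranked_urls:
        groups.setdefault(site_key(url), []).append(url)

    def base_first(url):
        parts = urlsplit(url)
        is_root = parts.path in ("", "/") and not parts.query
        return (not is_root, len(parts.hostname or ""))

    return [(site, sorted(urls, key=base_first)) for site, urls in groups.items()]


# Hypothetical ranked result list for a query like 'slashdot'.
ranked = [
    "http://apple.slashdot.org/",
    "http://slashdot.org/faq/",
    "http://example.com/x",
    "http://slashdot.org/",
    "http://games.slashdot.org/",
]

if __name__ == "__main__":
    for site, urls in cluster_by_site(ranked):
        print(site)
        for url in urls:
            print("   ", url)
```

A production version would want a public-suffix list to find the registered domain, plus a cap on how many results per site are shown before a "more from this site" link.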

Meanwhile, a search for 'mozdex' seems to favor logs of your spidering on other sites. I guess some tweaking is in order here.

What made you pick 'Lucene' and 'Nutch' for indexing and searching, respectively?

There are lots of search engines already out there, open ones in fact. Do these scale better? Do they do something radically new? Are all the other ones just meant for searching a local site? etc.

What exactly are your algorithms?

A simple summary will do. I guess this is in the same place that the source is.

Directions for the future?

When I was playing around with searching, I stumbled upon latent semantic indexing and vector-space searching; in fact, I wrote a little search engine based around it. I rather like the idea of it. You could add a button or a link to every page that said "find pages like this", and it would actually work... maybe.
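
For readers who have not met vector-space searching, a "find pages like this" feature can be sketched in Python as below. This is plain TF-IDF with cosine similarity, not latent semantic indexing, which would add a dimensionality-reduction step on top. The tokenizer, the weighting formula and the names `tfidf_vectors`, `cosine` and `similar_pages` are choices made for this illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Turn each document into a sparse {term: weight} vector."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def similar_pages(index, docs):
    """Rank every other document by similarity to docs[index], best first."""
    vectors = tfidf_vectors(docs)
    scores = [(i, cosine(vectors[index], v)) for i, v in enumerate(vectors) if i != index]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Tiny made-up corpus.
docs = [
    "open source search engine index",
    "search engine ranking algorithm",
    "fine art greeting cards",
]

if __name__ == "__main__":
    for i, score in similar_pages(0, docs):
        print(f"{score:.3f}  {docs[i]}")
```

Here the second document ranks above the third for the first one, since they share "search" and "engine" while the third shares no terms at all.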

I know Google uses something like this for Google News; I don't know what they do for their "similar pages" links.

Good luck, and remember... if you're going to open yourself up to public scrutiny, expect people to start asking you questions. :)
---
"See what the drooling, ravening, flesh-eating hordes^W^W^W^WKuro5hin.org readers have to say."
-- pwhysall

Interesting, but poorly explained (2.66 / 6) (#64)
by coryking on Fri Apr 30, 2004 at 12:50:14 AM EST

It seems as though you've failed at one of the primary goals of your site: to make it public why the engine is ranking stuff the way it does. And you've failed that goal miserably.

Sure you say "cory, read the fucking FAQ (idiot)" but that is not good enough. In fact, I am sure if I read your FAQ, you'd explain it. But you know what? I don't care about your FAQ, I haven't read the FAQ and I don't care to. Neither does 99.99999999999999999999999% of your future, hypothetically large audience. Your "explanation" link doesn't explain anything to anybody except the people who wrote your engine.

If you want to meet your (very interesting) goal of informing your users why they got the results they did, you had better clean this up. I don't think ANYBODY but somebody closely involved in your project can understand that page. To everybody else, it's Greek. Think of all the people who use Google. Your grandma, your mom, your girlfriend, your teacher, co-worker, garbage man. EVERYBODY uses Google. Could your garbage man understand that? NO. Your grandma? No. You either need to rethink your goal, or revisit that page and make it easy to understand.

(-1), buy an ad (1.20 / 10) (#66)
by guyjin on Fri Apr 30, 2004 at 01:17:22 AM EST

[nt]
-- 散弾銃でおうがいして ください
Al Gorithm (2.75 / 4) (#70)
by adimovk5 on Fri Apr 30, 2004 at 10:08:56 AM EST

The premise being that a well thought out algorithm can understand the basic tricks of the trade and more quickly react to new hacks & cheats used to "spam" indexes.

Spammers and cheats have to guess at the inner workings of search algorithms. They stay one step behind sites like Google who can discover their fakery and change the rules of the game accordingly.

A visible algorithm will allow spammers and cheats to more efficiently exploit your algorithm. It will be rendered useless in a short time and so will your search engine. The defenders will be acting out of altruism and good will. The attackers will be acting out of greed and self interest. Who do you think will win?



Why This is Good (2.80 / 5) (#75)
by KWillets on Sat May 01, 2004 at 12:29:16 AM EST

(I posted this earlier during voting, but mistakenly left it as an editorial comment.  So, I'll just repost it.)

This is, as the author admits, an incomplete project, but the idea is well worth exploring.

The Internet was conceived as an open system, where almost everything is accessible to all users, over open protocols. While the early Web was built on link navigation or "web surfing" to find information, search rapidly became the dominant access method (in fact, I think hyperlinking is actually declining, in usage if not in volume, but that's just a guess so far).

As search has become dominant, we have seen the rise of the private search engine, with proprietary crawling, indexing, and ranking technology. The obvious implication is that the basic web framework is simply broken; almost all web users must go through centralized search engines to find content, and these engines must guard their ranking methods to keep from being manipulated.

So, the current situation is that we rely on trusted intermediaries to interpret the web for us, using occult algorithms. In short, it's the Middle Ages.

Given the effectiveness of search as a basic access method, the only way out is to remove the mystery from the search and ranking algorithms. Mozdex may be on the right track, but the ultimate result is going to have even more user control over ranking factors. The idea that one ranking method should work for all searches should be discarded, and replaced with a diverse range of user-driven query methods.


Ummm... cool idea... however... (none / 0) (#84)
by bigchris on Sat May 01, 2004 at 01:58:02 PM EST

currently it's spitting out Apache Tomcat/4.1.30 errors.

type Exception report

message

description The server encountered an internal error () that prevented it from fulfilling this request.

exception

org.apache.jasper.JasperException: Lock obtain timed out
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:254)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:199)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:700)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:584)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:534)

root cause

java.io.IOException: Lock obtain timed out
    at org.apache.lucene.store.Lock.obtain(Lock.java:97)
    at org.apache.lucene.store.Lock$With.run(Lock.java:147)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:99)
    at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:75)
    at net.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:43)
    at net.nutch.searcher.NutchBean.init(NutchBean.java:71)
    at net.nutch.searcher.NutchBean.<init>(NutchBean.java:60)
    at net.nutch.searcher.NutchBean.<init>(NutchBean.java:50)
    at net.nutch.searcher.NutchBean.get(NutchBean.java:42)
    at org.apache.jsp.search_jsp._jspService(search_jsp.java:65)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:210)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:247)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:193)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:256)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardContext.invoke(StandardContext.java:2422)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:180)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.valves.ErrorDispatcherValve.invoke(ErrorDispatcherValve.java:171)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:163)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:641)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:174)
    at org.apache.catalina.core.StandardPipeline$StandardPipelineValveContext.invokeNext(StandardPipeline.java:643)
    at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:480)
    at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:995)
    at org.apache.coyote.tomcat4.CoyoteAdapter.service(CoyoteAdapter.java:199)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:700)
    at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:584)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:534)


---
I Hate Jesus: -1: Bible thumper
kpaul: YAAT. YHL. HAND. btw, YAHWEH wins ;) [mt]

doesn't work atm (none / 1) (#90)
by neetij on Sun May 02, 2004 at 07:09:46 PM EST

well, it doesn't seem to be working... a search query of 'mozdex' (in IE/opera) hasn't shown results for a few minutes. something sure is lacking.

All trademarks and copyrights on this page are owned by their respective companies. The Rest © 2000 - Present Kuro5hin.org Inc.
See our legalese page for copyright policies. Please also read our Privacy Policy.
Kuro5hin.org is powered by Free Software, including Apache, Perl, and Linux, The Scoop Engine that runs this site is freely available, under the terms of the GPL.