WebProWorld Part of WebProNews.com
Search Engine Optimization Forum SEO is much easier with help from peers and experts! The WebProWorld SEO forum is for the discussion and exploration of various search engine optimization topics. Any non (engine) specific SEO or SEM topics should go here.

  #1 (permalink)  
Old 02-09-2005, 08:41 AM
WebProWorld Veteran
 

Join Date: Feb 2005
Location: Forchheim, Germany
Posts: 947
Faglork RepRank 0
Anyone running a spider/search engine?

Hello,

Does anyone here run a spider? I am looking for spider software that can index frames as well as PDF, Word and Excel files. The front end should provide a reasonably intelligent search function that can handle multi-term searches. The database should be MySQL.

I am currently considering PhpDig:
http://www.phpdig.net

Does anyone have experience with this software?

Alex
  #2 (permalink)  
Old 02-09-2005, 10:26 PM
WebProWorld Member
 

Join Date: Nov 2004
Location: London UK
Posts: 43
AccuraCast RepRank 0

You might find some on javascript.internet.com and HotScripts.com. I've never used any, so I can't really recommend one.
  #3 (permalink)  
Old 02-12-2005, 04:26 AM
WebProWorld Veteran
 

Join Date: Feb 2005
Location: Forchheim, Germany
Posts: 947
Faglork RepRank 0

Thx, but I already searched there. They mostly offer rather simple scripts, and I need a *really good* spider that can index frames, Word files, PDF, and so on.

I find it somewhat strange that almost nobody seems to use spiders. In nearly all cases (I am talking about the "business directory" type of website) some directory software is used, but no spider for a "web search".

Why a spider? Feed the spider with your directory database, gather all information and create *your own infospace*. You can offer a full-text search over all the websites in your directory. This is a service you find almost NOWHERE and could be the killer application which distinguishes your directory from countless others.

Of course, most likely you need a dedicated server ...

Anyone care to discuss that topic?

Alex
  #4 (permalink)  
Old 02-12-2005, 09:01 AM
WebProWorld Pro
 

Join Date: Sep 2004
Location: Oslo, Norway
Posts: 114
kservik RepRank 0

Same here, I am very interested in a spider. Here is a listing of Open Source Search Engines and a comparison of Open Source Indexers by Eric Lease Morgan.

Possibly the most exciting project around is Nutch, but it may take some time before it is finished.

My thought has been to create vertical directories where people can submit to a search engine and mix this with directory results.
__________________
Neteffects - Europe Search Marketing
Europe Business Directory
Ranking Directories - Resource of Search Engine Ranking Directories.
  #5 (permalink)  
Old 02-13-2005, 04:55 PM
WebProWorld Member
 

Join Date: Jun 2004
Posts: 88
emils RepRank 0

I find this quite an interesting topic. We have been working for a while on tools for the following three cases, which come very close to the ideas mentioned above:
- feeding a directory with spidered data. This was meant for cases where there is not enough data to start a web directory. Such data of course needs a lot of manual processing as well, but it really helped to launch a new directory.
- spidering external sites using directory data (obtained as above or through user additions)
- searching within the resulting database and returning data from the directory itself as well as from spidered 'external' pages.

I think our work has pretty much covered the two cases mentioned here, with the caveat that spidering has many different uses, and it all comes down to what kind of data you are looking for and what you do with it. It was an interesting experience, and although it has never been put into full use or completely finished, it's an asset that a good webmaster can find helpful at times. We are planning to use some of this in a future regional search engine, whenever we get to allocate the time and money to invest in it and get it launched.

Anyway, instead of using external tools we decided to build our own, which is quite a difficult task. Our original aim was to build a search engine that can hold and query 1 to 10 million webpages on a single PC. I must admit that this is not an easy thing. Although the spider turned out fine, the indexing portion was never finished. Our plan was to use a common database system like MSSQL or MySQL as the basis for the index. As I said, spidering works smoothly but indexing doesn't; the problem was not with querying the database (which worked pretty well) but with updating it. Once you go over 100,000 webpages, the update process becomes too slow, so that has pretty much been a hard limit for our scenario. Overcoming such limitations makes the system a lot more complex: for example, the live database and the database being updated have to be separate, and they have to be switched automatically whenever the indexing process has finished. Mixing lots of user searches with index updates usually ends in a deadlock because of the huge strain such a scenario puts on any database system.
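
To make the "separate live and updated database" idea concrete, here is a minimal sketch, not the poster's actual implementation. It swaps in Python's built-in sqlite3 module for MSSQL/MySQL so that it runs on its own, and the names `rebuild_index`, `search`, `tokenize` and the `postings` table are made up for illustration. The new index is built in a staging file that no searcher ever opens, and is then renamed over the live file in one step. It assumes a POSIX-like filesystem where a rename over an existing file is atomic, and that searchers open a fresh connection per query.

```python
import os
import re
import sqlite3


def tokenize(text):
    """Lowercase word tokens; deliberately naive (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rebuild_index(live_path, pages):
    """Build a fresh index beside the live one, then swap it in.

    `pages` maps URL -> already extracted plain text.
    """
    staging_path = live_path + ".staging"
    if os.path.exists(staging_path):
        os.remove(staging_path)  # leftover from an aborted run
    conn = sqlite3.connect(staging_path)
    try:
        with conn:  # commit once on exit, not once per row
            conn.execute("CREATE TABLE postings (term TEXT, url TEXT)")
            conn.executemany(
                "INSERT INTO postings VALUES (?, ?)",
                ((term, url) for url, text in pages.items()
                 for term in tokenize(text)),
            )
            conn.execute("CREATE INDEX idx_term ON postings (term)")
    finally:
        conn.close()
    # Rename over the live file: a search that starts after this line
    # opens the new index, one that started earlier keeps the old one.
    os.replace(staging_path, live_path)


def search(live_path, query):
    """Return URLs that contain every term of `query` (AND semantics)."""
    terms = sorted(tokenize(query))
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    conn = sqlite3.connect(live_path)
    try:
        rows = conn.execute(
            f"SELECT url FROM postings WHERE term IN ({placeholders}) "
            "GROUP BY url HAVING COUNT(DISTINCT term) = ? ORDER BY url",
            (*terms, len(terms)),
        ).fetchall()
    finally:
        conn.close()
    return [url for (url,) in rows]
```

Because user searches only ever read the live file and the indexer only ever writes the staging file, the two never compete for the same locks. The price is that the whole index is rebuilt on each run, so fresh pages only become searchable after the next swap.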

I believe people are not using spiders because, first of all, they are pretty complex software, and you have to choose between two difficult options: either use a ready-made one, which at some point may well turn out not to be good enough for what you need, or build your own, which I personally think is a difficult and challenging task. There are four different portions here: spidering data, extracting and filtering information, indexing, and the user search portion. All of them have to work smoothly to get a decent product out of it.

Apart from this, building a spider to harvest data using a directory should be doable for a good developer if you stick to the basics at the beginning. You already have the base list (the directory's URLs), so you don't have to sweat over that. Add some spider code to scan the external sites up to some depth or number of subpages, add some indexing system, and perhaps later build some plugins to extend indexing to a variety of document types (doc, pdf etc.). If I were in such a situation, I would try to find a developer able to do this for me rather than use a pre-made spider, because there is always a point in the future where you badly need it modified and can't get that. Of course, this also depends on the funds available... and on whether you can find the right person for the task.
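
As a rough illustration of "spider code to scan the external sites up to some level or number of subpages", here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for the example: the names `crawl_site`, `LinkParser` and `fetch_html`, the default limits, and the choice to stay on the start URL's host. It follows `<a>`, `<frame>` and `<iframe>` links breadth-first and stops at a depth limit or a page limit, whichever comes first. A real spider would also need to honour robots.txt, throttle its requests and handle character encodings properly.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects link targets from <a>, <frame> and <iframe> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "frame": "src", "iframe": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def fetch_html(url):
    """Default fetcher (assumption: plain GET, decoded as UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch=fetch_html, max_depth=2, max_pages=50):
    """Breadth-first crawl of one site, bounded by depth and page count.

    Returns a dict mapping URL -> raw HTML for every page fetched.
    """
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched, keep crawling
        pages[url] = html
        if depth >= max_depth:
            continue  # page is stored, but its links are not followed
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target, _fragment = urldefrag(urljoin(url, href))
            # Stay on the directory entry's own host.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages
```

Running `crawl_site(url)` once per URL in the directory gives the raw material for the indexing step. The `fetch` parameter is there so that a different downloader, or later a per-document-type extractor for doc and pdf files, can be plugged in without touching the crawl loop.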

I personally don't know of any open source spider that I can recommend for this purpose, but who knows, there might be some out there.