I find this quite an interesting topic. We have been working for a while on tools for the following 3 cases, which come very close to the ideas mentioned above:
- feeding a directory with spidered data. This is meant for the case where there is not enough data to start a web directory. Such data of course still needs a lot of manual processing, but it really helped to launch a new directory.
- spidering external sites using directory data (obtained as above or through user additions)
- searching within the resulting database and returning data from the directory itself as well as from the spidered 'external' pages.
I think our work has pretty much covered the two cases mentioned here, with the comment that spidering can have lots of different uses, and it all comes down to what kind of data you are looking for and what you do with it. It was an interesting experience, and although it has never been put into full use or completely finished, it's an asset that a good webmaster can find helpful at times. We are planning to use some of this stuff in a future regional search engine, whenever we really get to allocate the time and money to invest in it and get it launched.
Anyway, instead of using external tools we decided to build our own. Quite a difficult task. Our original aim was to build a search engine that can hold and query from 1 to 10 million webpages on a single PC. I must admit that this is not an easy thing. Although the spider turned out fine, the indexing portion was never finished. The plan was to use a common database system like MSSQL or MySQL as the basis for the index. As I said, spidering works smoothly but indexing doesn't; the problem was not with querying the database (which went pretty well) but with updating it. Once you go over 100,000 webpages the update process becomes too slow, so that has pretty much been a hard limit for our scenario. Overcoming such limitations makes the system a lot more complex: for example, the live database and the database being updated have to be separate, and switched automatically whenever the indexing process has finished. Mixing lots of user searches with index updates usually ends in deadlocks because of the huge strain such a scenario puts on any database system.
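To make the "separate live and updated database" idea a bit more concrete, here is a minimal sketch in Python. It is not what we actually built: it uses SQLite files instead of MSSQL or MySQL just to stay self-contained, and the names `build_index`, `publish_index` and `search` as well as the `pages` table are made up for illustration. The point is only that searches never touch the database that is being written.

```python
import os
import sqlite3


def build_index(path, pages):
    """Write a fresh index database at `path` from (url, body) pairs."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, body TEXT)")
            conn.executemany("INSERT INTO pages VALUES (?, ?)", pages)
    finally:
        conn.close()


def publish_index(live_path, pages):
    """Build the new index in a staging file, then swap it in as the live one."""
    staging_path = live_path + ".staging"  # hypothetical naming scheme
    if os.path.exists(staging_path):
        os.remove(staging_path)  # leftover from an interrupted run
    build_index(staging_path, pages)
    # The switch happens only after indexing has finished.
    os.replace(staging_path, live_path)


def search(live_path, term):
    """Naive substring search against the live index only."""
    conn = sqlite3.connect(live_path)
    try:
        rows = conn.execute(
            "SELECT url FROM pages WHERE body LIKE ? ORDER BY url",
            ("%" + term + "%",),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Each `search` call opens the live file for a short time, so it picks up the new index after the next swap. On POSIX systems `os.replace` is an atomic rename; on Windows it can fail while another process still has the live file open, so a real setup would need a retry or a different switching mechanism. With a server database the same idea would mean switching between two databases or tables rather than two files.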
I believe people are not using spiders because, first of all, they are pretty complex software, and you have to choose between two difficult options: either use a ready-made one, which at some point may well turn out not to be good enough for what you need, or build your own, which I personally think is a difficult and challenging task. There are about 4 different portions here: spidering data, extracting and filtering information, indexing, and the user search portion. All of them have to work smoothly in order to get a decent product out of it.
Apart from this, building a spider to harvest data using a directory should be doable by a good developer, if you stick to the basics in the beginning. You have the base list (the directory's URLs), so you don't have to sweat over that. Add some spider code to scan the external sites up to some level or number of subpages, some indexing system, and perhaps later build some plugins to extend indexing to a variety of document types (doc, pdf etc.). If I were in such a situation, I would try to find a developer able to do this for me rather than use a pre-made spider, because there is always a point in the future where you badly need it modified and you can't have that. Of course, this also depends on the funds available... and whether you can find the right person for the task.
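Just to show how little is needed for the basic "scan up to some level or number of subpages" part, here is a rough Python sketch using only the standard library. All names (`crawl_site`, `LinkParser`, `fetch_html`) and the default limits are my own assumptions for illustration, not code from our tools. It starts from one directory URL, stays on the same host, and stops at a given depth or page count.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download one page over the network and return it as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl_site(start_url, fetch=fetch_html, max_depth=2, max_pages=20):
    """Breadth-first crawl of a single site, limited by depth and page count."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}  # url -> raw HTML, to be handed to the indexer later
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        pages[url] = html
        if depth >= max_depth:
            continue  # deepest allowed level: keep the page, ignore its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

You would call `crawl_site(url)` once for every URL in the directory and pass the returned pages on to the extraction and indexing steps. The `fetch` parameter is there so the crawler can be tried against canned pages without touching the network. Things like robots.txt handling, politeness delays, and non-HTML documents are left out on purpose.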
I personally don't know of any open source spider that I can recommend for this purpose, but who knows, there might be some out there.