07-21-2004, 04:26 PM
WebProWorld Veteran
Join Date: Oct 2003
Location: Texas
Posts: 300
Slurp is a little um... slow?
I have my apache set up to create a separate log for google and for slurp, split by user agent. The google log is 64k while the slurp log is 4k.
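For anyone who wants to reproduce that per-crawler split from an existing combined log rather than in the Apache config, here is a rough Python sketch. It is not the poster's setup: the sample log lines, the IPs, and the name `split_by_crawler` are made up for illustration, and it assumes the user agent is the last quoted field on each line (as in Apache's combined log format) and that the crawlers identify themselves with the substrings "Googlebot" and "Slurp".

```python
import re
from collections import defaultdict

# Substrings to look for in the User-Agent field (assumed; adjust to your logs).
CRAWLERS = {"google": "googlebot", "slurp": "slurp"}

# In Apache's combined log format the user agent is the last quoted field.
LAST_QUOTED = re.compile(r'"([^"]*)"\s*$')

def split_by_crawler(lines):
    """Group log lines by crawler name; lines from other clients are skipped."""
    buckets = defaultdict(list)
    for line in lines:
        match = LAST_QUOTED.search(line)
        if not match:
            continue
        agent = match.group(1).lower()
        for name, needle in CRAWLERS.items():
            if needle in agent:
                buckets[name].append(line)
                break
    return dict(buckets)

# Hypothetical sample lines, not taken from a real log.
sample = [
    '192.0.2.10 - - [21/Jul/2004:16:00:00 -0500] "GET /u/cyn HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.20 - - [21/Jul/2004:16:00:05 -0500] "GET /fback.htm HTTP/1.0" 404 310 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp)"',
    '192.0.2.30 - - [21/Jul/2004:16:00:09 -0500] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

buckets = split_by_crawler(sample)
for name, hits in buckets.items():
    print(name, len(hits))
```

Comparing the number of lines per bucket gives the same kind of rough picture as comparing the two log file sizes.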
There are some very weird things I noticed from the yahoo slurp.
It tried to access these files:
- -> /clic30de.htm
- -> /rezepten.htm
- -> /HTML-reference/sitelinks.htm
- -> /solvic8b/summary/modifications.htm
- -> /classics/cyn/Handbook.htm
- -> /bluemeanies/fave.htm
- -> /human-investigation/reply/farrsite.htm
- -> /fback.htm
None of these files are on my site. My whole site is php based, and even when I do use html I use the full extension (.html) instead of .htm. None of it really makes sense. I used google and yahoo to see if there are any pages linking to me with this nonsense. There aren't.
This is weird:
/classics/cyn/Handbook.htm
We used to have the classics section accessible through eliteskills.com/ classics / thenumber . Cyn is a user of the site. Site profiles are accessed through eliteskills.com/ u / username. Handbook.htm seems to be random.
Is it just checking for auto generated content and forwarding based on different keywords? If so then it'd be pretty smart. It spent a lot of time looking at my forwarding urls. Like, if it's a place only accessible by the members you're automatically forwarded to the registration area when trying to access it. When finding these types of urls yahoo kept trying to access them over and over.
I used to have the site set up as u.php?u=username but now it's just /u/username . Same thing for z.php?i=number -> /z/number. Every time google would try to access a file like /z/number it would scan z.php for some reason. Yahoo did this too but to a lesser degree. It didn't do it so much after adding header("HTTP/1.1 301 Moved Permanently"); , but still, I think everyone should track the access logs. The crawlers might be wasting time crawling nonsense for some reason instead of content-rich files.
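The idea described here (answer the old query-string URLs with a 301 that points at the new path) can be sketched as follows. This is in Python rather than the site's PHP, and it is only an illustration: the function name `legacy_redirect` and the routing table are hypothetical, covering just the two URL patterns named in the post.

```python
from urllib.parse import urlsplit, parse_qs, quote

# Old script -> (query parameter, new path prefix), taken from the post above.
LEGACY_ROUTES = {
    "/u.php": ("u", "/u/"),
    "/z.php": ("i", "/z/"),
}

def legacy_redirect(url):
    """Return (status, headers) for an old-style URL, or None if it is not one."""
    parts = urlsplit(url)
    route = LEGACY_ROUTES.get(parts.path)
    if route is None:
        return None
    param, prefix = route
    values = parse_qs(parts.query).get(param)
    if not values:
        return None
    # 301 marks the move as permanent; Location carries the new address.
    return 301, {"Location": prefix + quote(values[0], safe="")}

print(legacy_redirect("/u.php?u=cyn"))
print(legacy_redirect("/z.php?i=42"))
print(legacy_redirect("/tacos/"))
```

The status line alone does not tell a crawler where the page went, so the sketch always pairs it with a Location header.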
Yahoo seems far more linear in the way it accesses the sites. Google seems to have a list of the urls compiled that it builds each crawl unless it hadn't accessed the site before. Its order of crawling seems almost random.

07-21-2004, 06:03 PM
WebProWorld 1,000+ Club
Join Date: Aug 2003
Location: Central US
Posts: 1,581
Two things: either Yahoo is aware of some links that are sending it to those places, or it is a worm that is spoofing the User-agent string.
On the intelligence of Slurp vs. Googlebot, you are quite correct in your observation. Googlebot does tend to learn, while Slurp will keep jumping off the cliff with the rest of the lemmings. Alexa crawler is even more of a lemming.
I would put an exclusion in your robots.txt file and a NOINDEX, NOFOLLOW meta on the page(s) in question. That will at least cut down on its persistent attempts to access those pages (as well as those of other obedient spiders).
Slurp did show some aggressive activity last week, at least for me it did. Now it is pretty much back to being a ghost. This could be that "linear" quality that you mentioned.
I think the key is to figure out what sets Slurp off to crawl your site to begin with. I cannot figure that part out though. It is not constantly updating pages, I know that. I am associated with two other forum sites and that is definitely not the case.

07-21-2004, 09:13 PM
WebProWorld Veteran
Join Date: Oct 2003
Location: Texas
Posts: 300
Quote:
|
Two things: either Yahoo is aware of some links that are sending it to those places, or it is a worm that is spoofing the User-agent string.
|
Couldn't be. It's definitely yahoo as the weird urls are mixed in with other pages by the same IP range. /classics/cyn/Handbook.htm <-- this is not random. I checked the back links by www.eliteskills.com/ what-ever-the-weird-url-accessed-was and it showed nothing. Then I checked "whatevertheweirdurlaccessedwas" and it also showed nothing that had to do with my site. cyn is a user, classics used to be a directory, handbook is random. I think it may be scanning to see if I have autogenerated content based on keywords because I have a lot of dynamic data, and I've been fairly messy in the past.
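The check described here (do the odd requests come from the same IP range as the normal crawler requests?) could be sketched like this in Python. The hits, the IP addresses, and the /24 grouping are all assumptions for illustration, and `group_by_network` is a made-up name. Sharing a range only suggests a common source; it does not prove who operates it.

```python
import ipaddress
from collections import defaultdict

def group_by_network(hits, prefix_len=24):
    """Map each network (as a string) to the list of paths requested from it."""
    groups = defaultdict(list)
    for ip, path in hits:
        # strict=False lets us derive the enclosing network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(path)
    return dict(groups)

# Hypothetical (ip, path) pairs pulled from an access log.
hits = [
    ("192.0.2.14", "/u/cyn"),
    ("192.0.2.77", "/classics/cyn/Handbook.htm"),
    ("192.0.2.201", "/tacos/"),
    ("198.51.100.9", "/default.ida"),
]

groups = group_by_network(hits)
for network, paths in groups.items():
    print(network, paths)
```

If the strange .htm requests land in the same group as ordinary page requests, that matches what the post describes.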
Quote:
|
I would put an exclusion in your robots.txt file and a NOINDEX, NOFOLLOW meta on the page(s) in question. That will at least cut down on its persistent attempts to access those pages (as well as those of other obedient spiders).
|
I added an HTTP/1.1 301 Moved Permanently and a php redirect. If a spider is wandering around another site that has that type of link I don't want to just block it but show that the page has moved.
Quote:
|
On the intelligence of Slurp vs. Googlebot, you are quite correct in your observation. Googlebot does tend to learn, while Slurp will keep jumping off the cliff with the rest of the lemmings. Alexa crawler is even more of a lemming.
|
Yeah google seems to send all collected urls to another place where they are ordered?, dated, and ranked? somehow. It probably can kill repeats much better. It flies through the different pages of my site usually in random order. Yahoo seems to go by way of interconnection and it definitely doesn't go as deep. Alexa is a drunk just wandering aimlessly attacking my bandwidth with its broken beer bottle.
Quote:
The different crawler systems are coordinated to limit the activity on any single web server. We determine a single "web server" by IP address, so if your host is serving multiple IPs it may see higher levels of activity.
http://help.yahoo.com/help/us/ysearc.../slurp-03.html
|
So maybe if we set the crawler at User-agent: Slurp Crawl-delay: 0 it'll crawl more? Buy another IP? Google doesn't seem to care if it crawls the heck out of a site.
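To see how such a rule is read back, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt text uses the value from the post. Crawl-delay is commonly interpreted as a wait in seconds between requests, but whether a value of 0 makes Slurp crawl more is a guess that this snippet cannot confirm; it only shows what a parser sees.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post, laid out as it would appear in robots.txt.
robots_txt = """\
User-agent: Slurp
Crawl-delay: 0
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The delay applies only to the named crawler; others get no value (None).
slurp_delay = parser.crawl_delay("Slurp")
other_delay = parser.crawl_delay("Googlebot")
print(slurp_delay, other_delay)
```

The rule has to sit under its own User-agent line; a crawler without a matching entry simply has no delay configured.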
Quote:
|
I think the key is to figure out what sets Slurp off to crawl your site to begin with. I cannot figure that part out though. It is not constantly updating pages, I know that. I am associated with two other forum sites and that is definitely not the case.
|
I believe it works the same as google: by some kind of page-ranking system. If the page was just initially found, yahoo crawls only the first page while google crawls 5-10. I set up the "I like tacos project" http://www.eliteskills.com/tacos/ and tracked how it was accessed. The amount google crawled initially may be because I linked it off the main site with a pagerank of 6. When google hit a page with several links on it, it didn't rank it. I think google needs to see all outward links on the same server before it gives a rank to the page, and since it will only crawl x amount of sites initially we have to wait for the next crawl. Yahoo probably does the same but is much slower about it. It has to spend its resources on high-ranked pages first rather than on crawling smaller ones, so I think it has some kind of date-pagerank algorithm that decides when a page is due for a crawl.

07-21-2004, 09:30 PM
WebProWorld 1,000+ Club
Join Date: Aug 2003
Location: Central US
Posts: 1,581
I cannot see Slurp taking random stabs at pages like you suggest. I have not seen anything to suggest this.
Even though you cannot find a link for those pages using any Search Engine to verify that, that does not prove a thing. They do not report every link, and probably will not report a link to a page that does not exist anyway (which is your case here).
I suggest picking up a copy of Xenu's Link Sleuth. It is a free download. Run that thing on your own site and see what it turns up.
On the exclusions for robots.txt, I was simply referring to pages such as logon.php or post.php, not exclusions for these nonexistent pages. The reason I say that is that Slurp spends too much time crawling these useless pages over and over again. They have no value -- and excluding them may prompt Slurp to grab another link that is worth crawling.
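As a rough illustration of that kind of exclusion, the Python sketch below feeds a small robots.txt to the standard library's `urllib.robotparser` and checks which paths an obedient spider would skip. The two Disallow lines use the page names mentioned above; the wildcard User-agent and the `/tacos/` test path are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excluding the low-value pages for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /logon.php
Disallow: /post.php
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# An obedient crawler should skip the first two and may fetch the third.
for path in ("/logon.php", "/post.php", "/tacos/"):
    print(path, parser.can_fetch("Slurp", path))
```

This only models spiders that honor robots.txt; it has no effect on anything that ignores the file.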
This will alleviate some of the problem with the drunk Alexa crawler too (I like your analogy ...hehehehe).

07-21-2004, 11:54 PM
WebProWorld Veteran
Join Date: Oct 2003
Location: Texas
Posts: 300
Thanks for taking the time to read through. I thought it'd just be yahoo because google hadn't looked for those pages, they're all .htm files (I don't use html files), all links posted by users are checked to be online or they reset to default, and finally yahoo looked for them all at the start of the crawl. classics/cyn/handbook.htm seemed too much of a coincidence. We have a classics directory and a user named cyn, and I doubt any outside source could have linked it like that. It'd have to be me who wrote it. I scanned the database and the files (xfind, analogx), still nothing. I figured yahoo must have thought I was using some kind of exploit and scanned for dynamic data through targeted keywords. If the users did it there would be more, and it wouldn't all be .htm files. It seems too far-fetched to be coincidence.
robots.txt - I was talking about my previous dynamic pages. I used to run off of query strings but now I don't. I initially just had a redirect with php but it still kept jacking at them. Adding the permanently moved header seemed to work. I did block member pages it was crawling. Google seems more coordinated with its crawlers.
I like the lemmings example too =). I've set up multiple pages with similar settings to see what the search engines give the most rank to, but so far no pagerank on the subpages.

07-22-2004, 03:20 AM
WebProWorld Veteran
Join Date: May 2004
Location: Vienna, Austria
Posts: 967
Slurp definitely has his own programme and I have no clue what it is. After a brief existence in the Yahoo! index after 8 weeks of waiting, I have now disappeared again. Not even the index page to be found. Ah well, makes life interesting.

07-22-2004, 06:47 PM
WebProWorld Veteran
Join Date: Oct 2003
Location: Texas
Posts: 300
They may use old results to resort and analyze newer data. I bet they're doing their best to optimize the efficiency of the search before they try to implement the resources needed in order to do a full out psycho crawl to compete with google.
The page rank for each page is based on the other sites. The "index" changes each time all the results are analyzed. Some sites are dropped some are added. Keep to creating inward links and it won't be such a menace in a few months.

07-22-2004, 07:44 PM
WebProWorld Veteran
Join Date: May 2004
Location: UK
Posts: 369
Elite,
Amen to that. Yahoo seem to be settling as of today. Am going to keep a close eye on this and will keep contributing.
pne

07-23-2004, 09:34 AM
WebProWorld Pro
Join Date: Apr 2004
Location: Finland
Posts: 147
Thanks Ronnie for the program, what a cute little find =)
I'm experiencing similar file requests that fall under 404, supposedly by Yahoo. Such files haven't ever existed nor do exist within our site - the site is very young, only a couple months old so I would know.
Just as examples, provided by AWstats
/aesop/duelta/linux_drivers_isdn_eicon_eicon_isa.c.htm 1 -
/mail.cgi 1 http://www.viikko-osake.com/
/cgi-bin/tell.cgi 1 http://www.viikko-osake.com/
There are plenty of those, about a hundred, but each file is requested only once or twice except for the following weird files that I have no knowledge of:
/_vti_bin/owssvr.dll 14 -
/MSOffice/cltreq.asp 14 -
/sumthin 6 -
/default.ida 4 -
And so on.
I love Xenu already, 630 internal and external links crawled in about 30s.
Regards,

07-23-2004, 06:32 PM
WebProWorld 1,000+ Club
Join Date: Aug 2003
Location: Central US
Posts: 1,581
Quote:
|
Originally Posted by Niko Holopainen
/_vti_bin/owssvr.dll 14 -
/MSOffice/cltreq.asp 14 -
/sumthin 6 -
/default.ida 4 -
|
Now those are worms, that I do know. My stats analyzer (SawMill) filters that stuff out of the reporting because they are not true hits. The worms are looking for holes in the security of the server, and servers get pounded by them daily.
Quote:
|
I love Xenu already, 630 internal and external links crawled in about 30s.
|
Did you find any broken links on your site? If so, you will want to get them fixed ... they put a bad taste in a spider's mouth.

07-26-2004, 09:04 AM
WebProWorld Pro
Join Date: Apr 2004
Location: Finland
Posts: 147
I did actually, which surprised me (since no other tool had commented on them), but only two: silly typos like .cm instead of .com etc.
There are pages that have been created but don't have true content per se, but I'm working on it (just limited by time: hundreds of files and limited resources).
Ahh, I don't really know about worms (except they're some hostile crawlers) but that explains it, their numbers are growing constantly. I didn't pay heed to them earlier since they're filtered out in all except the 404 reports.
Are worms something to be concerned about, or are there some necessary actions to take against them? Our security should be ok, but of course no site (owned by micro companies) can stand against a proper and dedicated attack...
Thanks for your reply, regards

07-26-2004, 03:35 PM
WebProWorld 1,000+ Club
Join Date: Aug 2003
Location: Central US
Posts: 1,581
Niko - the worm that is asking for the default.ida page is a Code Red attack worm. It has been rendered pretty much harmless by server security. The others also. You should be safe and no need to worry.
The worms will not go away and will show up on everyone's server logs. Filtering them out is easy enough to do. Most log analysis software should already do that, I am surprised that AWstats is still reporting these on your 404 list (it really should not -- they are not true 404's as we come to know them). But I would imagine that if the hit count is high enough on these, you may want to call it to the attention of your Web Hosting company -- just to make them aware of it.
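For anyone whose stats package does not filter these out, the separation can be sketched in a few lines of Python. The path list below is simply the set of requests reported in this thread, not an authoritative list of probes; the sample requests, their query strings, and the name `split_404s` are hypothetical.

```python
from collections import Counter

# Paths reported in this thread; extend the set for your own logs.
PROBE_PATHS = {
    "/_vti_bin/owssvr.dll",
    "/msoffice/cltreq.asp",
    "/sumthin",
    "/default.ida",
}

def split_404s(paths):
    """Return (real_404s, probe_hits) as Counters keyed by request path."""
    real, probes = Counter(), Counter()
    for path in paths:
        # Drop any query string and compare case-insensitively.
        bare = path.split("?", 1)[0].lower()
        (probes if bare in PROBE_PATHS else real)[path] += 1
    return real, probes

# Hypothetical 404 requests as they might appear in a log.
requests_404 = [
    "/default.ida?XXXX",
    "/MSOffice/cltreq.asp?UL=1",
    "/mail.cgi",
    "/sumthin",
    "/mail.cgi",
]

real, probes = split_404s(requests_404)
print(real)
print(probes)
```

What remains in `real` is the list worth checking for genuinely broken links; the size of `probes` is the number to mention to whoever looks after the server.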

07-27-2004, 05:37 AM
WebProWorld Pro
Join Date: Apr 2004
Location: Finland
Posts: 147
Thanks for your reply and Code Red is a familiar one, I didn't recognize it from there however.
I'll ask our coder to tweak (or I will when I have the time) the AWstats filters, you're absolutely right that they shouldn't appear there.
We have our own server with hardware firewalls, and the server software configuration updates are done by one of our associates who makes a business out of remote upkeep of servers (quite a nice setup actually) and workstations (linux). Thus in effect we don't have a hosting company, and I'm grateful for any tips on things that I, our coder or the network upkeeper might have missed.
Yours truly,