How does Google decide what sites to crawl beyond the index



oneeye
01-06-2004, 10:13 AM
Hi all,

Google will not crawl my site www.mortgageratesgroup.com beyond the index page. All of the links on the site are text, there are no frames, and the site is JavaScript-free. As far as I can tell, there is no reason why Google wouldn't crawl beyond the index page. Do you need to attain a certain PR before Google will crawl beyond the index page? Or is something wrong with my site?

All your feedback is greatly appreciated.

Thanks,

Oneeye

bubbasmurf
01-06-2004, 10:32 AM
I'm no expert, but I read somewhere that having a tag like <meta name="revisit-after" content="14 days"> is not really good, because the spider is limited to the 14-day rule you set. The spider may come back sooner without it. But like I said, I'm no expert; it's just something I read somewhere.

greeneagle
01-06-2004, 10:56 AM
Real Estate Sites took a hard hit in the “GOOGLE” Dance.

If they are crawling your index page, they are still giving you a PR (PageRank) of 0. Typically a commercial site's secondary pages get the index page's PR minus 1. Since your index page has a PR of 0, they may not be proceeding to the others.

Also, they are putting much more emphasis on “Content Freshness” these days and typically do not re-crawl often unless the server query flags an update since last crawl or within a designated time.

It also seems that pages that are stagnant for a long period are penalized.

These newer crawl policies make sense from perspectives of bandwidth and content value.

You may get more help from some of the groups offering SEO (Search Engine Optimization) help.

Hope this helps.

Ken

achronister
01-06-2004, 11:22 AM
I would also recommend removing the revisit tag. It serves no purpose and Google ignores it.

As far as Google crawling deeper, it will usually crawl at least some of your pages if they have unique content, regardless of PR. I have heard though, that the higher the PR, the deeper it will crawl.

Also make sure you have the proper HTTP Last-Modified (last update) header info on your server. Google likes recently updated content and newly added pages, and will visit more frequently.

Aaron

fathom
01-06-2004, 11:42 AM
I have heard though, that the higher the PR, the deeper it will crawl.

That's the key right there.

Remember that PageRank is a measure of importance: the higher the PR, the more important you must be to the web. As PR filters through, so does Googlebot, as it follows each and every link to your site, thus crawling more often and deeper.

bubbasmurf
01-06-2004, 11:50 AM
Also make sure you have the proper HTTP Last-Modified (last update) header info on your server.

Aaron

Aaron, you lost me here. Can you tell me a little more about the HTTP header on the server?

Mel
01-06-2004, 07:48 PM
Hi Bubba:

When the Googlebot visits a site, he doesn't want to waste a lot of time indexing pages that have not changed since his last visit, and thus issues a conditional GET with an If-Modified-Since date.

If the document has not been modified since the date and time specified in the If-Modified-Since field, the server responds with a 304 status code and does not send the document body to the client. Otherwise it serves up the page requested.
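To make the 304-versus-200 decision above concrete, here is a minimal Python sketch of the server-side check. The function name `conditional_get_status` is made up for illustration, and a real server (IIS, Apache) does this internally with more rules than shown here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since, last_modified):
    """Return 304 if the page is unchanged since the client's date, else 200.

    if_modified_since: raw header value sent by the client, or None.
    last_modified: timezone-aware datetime of the file's last change.
    """
    if if_modified_since is None:
        return 200  # plain GET: always send the body
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: behave as if the header were absent
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # treat a zone-less date as GMT
    # HTTP dates only carry whole seconds, so drop sub-second precision first.
    unchanged = last_modified.replace(microsecond=0) <= since
    return 304 if unchanged else 200
```

With a file last saved on 05 Jan 2004, a request carrying `If-Modified-Since: Wed, 07 Jan 2004 00:28:43 GMT` would get a 304, while a request carrying an earlier date would get the full page with a 200.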

Oneeye:

Your server is a Windows server, which requires that the coding to do this be set up on each page. This has been done, though, and your home page is returning a response like this:

Server Response: http://www.mortgageratesgroup.com
Status: HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
X-Powered-By: ASP.NET
Content-Location: http://www.mortgageratesgroup.com/index.htm
Date: Wed, 07 Jan 2004 00:28:43 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Mon, 05 Jan 2004 21:40:05 GMT
ETag: "b96d717bd4d3c31:101a"
Content-Length: 32365


The Last-Modified date indicates that the server should be capable of responding to the conditional GET correctly.
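If you want to run the same check on a header dump yourself, something like the following Python sketch would do. The helper names (`parse_headers`, `has_validator`, `age_at_fetch`) are made up for this example. Seeing Last-Modified or an ETag only suggests the server can answer a conditional GET; it does not prove it.

```python
from email.utils import parsedate_to_datetime

# Header lines copied from the response shown above.
RAW = """\
Server: Microsoft-IIS/5.0
X-Powered-By: ASP.NET
Content-Location: http://www.mortgageratesgroup.com/index.htm
Date: Wed, 07 Jan 2004 00:28:43 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Mon, 05 Jan 2004 21:40:05 GMT
ETag: "b96d717bd4d3c31:101a"
Content-Length: 32365
"""


def parse_headers(raw):
    """Turn 'Name: value' lines into a dict keyed by lower-cased header name."""
    headers = {}
    for line in raw.splitlines():
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def has_validator(headers):
    """True if the response carries something a conditional GET can test against."""
    return "last-modified" in headers or "etag" in headers


def age_at_fetch(headers):
    """How old the page was when it was served (Date minus Last-Modified)."""
    served = parsedate_to_datetime(headers["date"])
    modified = parsedate_to_datetime(headers["last-modified"])
    return served - modified
```

For the response above, `has_validator(parse_headers(RAW))` is true, and `age_at_fetch` gives a bit over one day.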

There could be several reasons why your site is not being crawled, and you should look at fixing the following areas to make sure everything is correct:

You have a ► character in the beginning of your page title; Google has recently been cracking down on the use of this type of spam.

You have both a <title> and a <meta name ="title" ...> tag in your header. This may be seen as an attempt to spam the SEs.

You have a lot of unnecessary tags in your head:

The <meta name="revisit-after" content="14 days"> tag asks the bot NOT to visit your site more often than once every 14 days. Get rid of this; it does no good at all.

The <meta name="robots" content="follow"> tag is not correct in that it leaves out the index portion that the tag syntax calls for, and this MIGHT be interpreted as "do not index this page". Get rid of this tag: if you leave it out, the default is to index the page and follow the links.
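For anyone who wants to see what a robots meta tag works out to, here is a rough Python sketch. It assumes the conventional reading, where only an explicit noindex, nofollow, or none takes permission away and everything else falls back to the default of index and follow. Whether a particular spider reads a lone "follow" differently is not something this code can settle. The names `RobotsMetaParser` and `effective_robots` are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if one exists."""

    def __init__(self):
        super().__init__()
        self.directives = None  # None means no robots meta tag was seen

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").strip().lower() == "robots":
            content = attr.get("content") or ""
            self.directives = {
                part.strip().lower() for part in content.split(",") if part.strip()
            }


def effective_robots(html):
    """Return (index, follow) under the conventional default of index + follow."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives or set()
    index = "noindex" not in found and "none" not in found
    follow = "nofollow" not in found and "none" not in found
    return index, follow
```

Under that reading, a page with content="follow" and a page with no robots tag at all both come out as (True, True), which is why dropping the tag loses nothing.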

The following will not stop the bots from spidering your site but might help you get some better rankings.

Your description tag is 423 characters including spaces, and most SEs do not index more than 200 characters or so. Tighten up your description and put your most important keyphrases as close to the beginning as you can.

Your meta keywords tag is only going to be read by the Inktomi spider, but you should have no more than three keyphrases which are relevant to the page it's on, instead of the 20 or so you have now.
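The two rules of thumb above are easy to check with a few lines of Python. This is only a sketch: the limits are the rough figures from this post ("200 characters or so", three keyphrases), not anything published by a search engine, and `audit_meta` is a made-up name.

```python
DESCRIPTION_LIMIT = 200  # rough figure from the post above, not an official limit
KEYPHRASE_LIMIT = 3      # likewise a rule of thumb


def audit_meta(description, keywords):
    """Return a list of warnings for an over-long description or keyword list."""
    warnings = []
    if len(description) > DESCRIPTION_LIMIT:
        warnings.append(
            f"description is {len(description)} characters; "
            f"aim for about {DESCRIPTION_LIMIT} or fewer"
        )
    # The keywords tag is a comma-separated list of phrases.
    phrases = [p.strip() for p in keywords.split(",") if p.strip()]
    if len(phrases) > KEYPHRASE_LIMIT:
        warnings.append(
            f"{len(phrases)} keyphrases listed; aim for {KEYPHRASE_LIMIT} or fewer"
        )
    return warnings
```

A 423-character description with 20 keyphrases would come back with two warnings; a short description with three keyphrases would come back with none.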

cbp
01-06-2004, 07:58 PM
Mel wrote:

The <meta name="revisit-after" content="14 days"> tag asks the bot NOT to visit your site more often than once every 14 days. Get rid of this; it does no good at all.

Yes, it's a useless meta tag. The spiders visit when they are good and ready.

BUT, what surprises me is the increasing use of this meta tag. Where are people getting their info from?

I recently came across an interesting comment from a DMOZ editor. They edited in a category that gets a lot of spam, so they have to investigate sites carefully. They commented on several things that sent up a red flag as to whether the site could be spam. One of them was the use of this meta tag (another was hyphenated domain names). While the use of this tag (and hyphenated domain names) is not spam, these are probably more likely to be used by a site that is spamming. So he/she used the presence of these as an indicator for further investigation of the site.

CBP

bubbasmurf
01-06-2004, 08:11 PM
Thanks Mel. So if I'm reading this right, when I make changes to my site and upload them, that's all I need to do. There is no tag I need to change, because updating the page takes care of it by itself.
Great...

Mel
01-06-2004, 08:25 PM
Yep Bubba, that's right, so long as your server is set up right.

This is not a problem with most Apache-hosted sites, but it is often a problem with ASP.NET sites, which have to be hosted on Windows servers that, unlike Apache servers, do not set this up by default on installation.

ronniethedodger
01-06-2004, 10:00 PM
Yep Bubba, that's right, so long as your server is set up right.

This is not a problem with most Apache-hosted sites, but it is often a problem with ASP.NET sites, which have to be hosted on Windows servers that, unlike Apache servers, do not set this up by default on installation.

So is it possible that the configuration of an Apache server could be changed from its default setting? In theory it could anyway, right? I don't want to be an alarmist or anything. ;0)

You can see the GET statements and the Server response codes in your raw data server logs. It is easy enough to verify that the Server is sending back the 304 code or the entire page. Also, any good log analyzer software will show this data for you too.
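As a rough illustration of reading the logs for this, here is a Python sketch that counts Googlebot requests by path and status code. It assumes the Apache "combined" log format; an IIS server like the one discussed in this thread writes a different format by default, so the pattern would need adjusting. The function name `googlebot_status_counts` is made up, and matching on the word "Googlebot" in the user-agent string can be fooled by anything that fakes that string.

```python
import re
from collections import Counter

# Apache combined log format (an assumption; adjust for your own server):
# host ident user [time] "METHOD path protocol" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines):
    """Count (path, status) pairs for requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "googlebot" in match.group("agent").lower():
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts
```

Fed the lines of a log file, a result such as one 200 and several 304s for "/" would mean the server sent the full home page once and answered the later conditional requests with "not modified".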

But you mentioned that they issue a conditional GET with an If-Modified-Since date. What date are they sending? I am assuming they are sending the date of their last crawl...correct?

ronniethedodger
01-06-2004, 10:05 PM
I recently came across an interesting comment from a DMOZ editor. They edited in a category that gets a lot of spam, so they have to investigate sites carefully. They commented on several things that sent up a red flag as to whether the site could be spam. One of them was the use of this meta tag (another was hyphenated domain names). While the use of this tag (and hyphenated domain names) is not spam, these are probably more likely to be used by a site that is spamming. So he/she used the presence of these as an indicator for further investigation of the site.

Is that the general consensus of most DMOZ editors? Or just a few? It seems to be a logical thing to do though.

Also the way you worded that, it seems like they do not dig too deeply normally. Only if certain things send up red flags...otherwise they check a couple of normal things and let it pass on through.

Mel
01-06-2004, 11:33 PM
Hi Ronnie
I can't be sure if Googlebot sends the date of the last crawl in the conditional GET, or if they send the date the file was last saved, which they recorded when they last crawled the site, though I suspect that it's the latter.

ronniethedodger
01-06-2004, 11:44 PM
Just as long as they are not post-dating the request. I guess it would not matter, eh? ;0)

It was just a curiosity question. Thanks Mel.

Mel
01-07-2004, 01:21 AM
Normally the server will send the file's saved date as the Last-Modified date. I see no reason for a bot to use any date other than this one, which is passed to it from the server.
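That "echo back what the server gave you" approach is the common pattern for any HTTP client, and it can be sketched in Python as below. This is not a claim about how Googlebot is actually written; `conditional_headers` is a made-up helper, and it also echoes the ETag in If-None-Match, which is the standard companion to If-Modified-Since.

```python
def conditional_headers(previous_response_headers):
    """Build request headers for a re-fetch from the validators seen last time.

    previous_response_headers: dict of headers saved from the earlier response.
    """
    headers = {}
    last_modified = previous_response_headers.get("Last-Modified")
    if last_modified:
        # Send back the server's own date string, unchanged.
        headers["If-Modified-Since"] = last_modified
    etag = previous_response_headers.get("ETag")
    if etag:
        headers["If-None-Match"] = etag
    return headers
```

Because the client sends back the exact date the server supplied, clock differences between the crawler and the server never enter into it.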

ronniethedodger
01-07-2004, 01:27 AM
That was just a joke Mel...hehehe

I don't see any reason for them to do it either. Some are so stupid, they don't even bother to ask.

mm99
01-08-2004, 01:23 AM
One thing that I have observed over and over again is that if you lack doc tags (DOCTYPE declarations), what goog does is come back at a later time to crawl the site.

If you identify your site properly, using the correct doc tags, they will crawl your entire site when they visit. It seems like this is the case just about every time.

Place doc tags on all your pages then do a manual submit on all your pages to goog and you should be fine. Also on your robots tag, you should have "index, follow" and not just "follow".


peace...Paul

oneeye
01-08-2004, 04:35 PM
Wow, what a load of information. Thanks for all the great feedback.

Pardon the ignorance, but how do you know what type of doctype to use? I have seen this mentioned all over the web. I just never knew exactly what it all meant.

I have heard that it hurts your rankings if you submit your pages as opposed to letting Googlebot crawl them. Any truth to this?

Thanks again,

Oneeye

ronniethedodger
01-08-2004, 06:24 PM
Pardon the ignorance, but how do you know what type of doctype to use? I have seen this mentioned all over the web. I just never knew exactly what it all meant.


Here is a good discussion that has already addressed the issue of DOCTYPEs here at WPW.

http://www.webproworld.com/viewtopic.php?t=10663

I think that will cover just about every question you might have on that. If there is something that you don't understand, just post your question there...it is a pretty busy thread.

mm99
01-08-2004, 09:09 PM
I took another look at your site and you have the perfect doc tag for your site. You got that right, in other words. I would clean up the head area and you should be fine.

In your head, you need to add strength to your title, get rid of comments, and get rid of everything except desc, kws, and add robots. This is how robots should look:

<meta name="robots" content="index, follow">

Do the above and you'll be fine.

peace...Paul

Mel
01-10-2004, 03:44 AM
Hi Oneeye
Dug a little deeper and I see that your domain name was only registered on 11/21/03 so I am assuming that your site has been live less than a month?

Let me suppose for a moment that your site was live as of Dec 15th, that there were links that Google knew about pointing to your site, and that Googlebot came right to your site for a visit (I know, kind of a best-case scenario).

Googlebot will often retrieve only your robots.txt file on the first visit. With your tag asking him not to come back for 14 days, his next visit might not have been before the end of December, when he picked up your home page and was then asked to wait another 14 days before visiting again, which brings us to about now.

Google's cache of your page is still showing the page with the old header content, so I suspect Googlebot has not yet visited your site again.

Now that you have removed that tag from your site, he should visit soon and index more pages; look at your logs for Googlebot visits. I suspect that you will have more pages indexed soon.

Patience may be a virtue, but it's sometimes hard to accommodate.

ronniethedodger
01-10-2004, 04:06 AM
In your head, you need to add strength to your title, get rid of comments, and get rid of everything except desc, kws, and add robots. This is how robots should look:

<meta name="robots" content="index, follow">
Do the above and you'll be fine.


The robots meta is not required, so you can dump that too. This tag was created for Exclusions primarily, the default behaviour of ALL spiders is to INDEX and to FOLLOW.

The robots.txt file is the preferred method for listing all exclusions on directories or individual files. It is cleaner, easier to maintain, and keeps that unnecessary code out of your HEAD area.
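For a feel of how exclusions in robots.txt work, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt contents, the paths, and the example.com URLs are all made up for illustration, and `is_allowed` is just a convenience wrapper.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: exclude one directory and one file for all spiders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/old-rates.htm
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the file above, `is_allowed(ROBOTS_TXT, "Googlebot", "http://www.example.com/index.htm")` is true, while anything under /private/ is refused. Pages that are not listed stay open to the spiders, which matches the index-and-follow default described above.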