WebProWorld
SEO 101: Welcome to the SEO 101 forum on WebProWorld. This SEO podcast is geared towards newbies, to teach them and to bridge the gap between website owners and elusive SEO practices. So sit back, relax, enjoy, learn, and prosper from the SEO 101 Podcast.

#1
09-03-2008, 12:50 PM
WebProWorld Member
Join Date: Jun 2008
Posts: 28
rfrazee RepRank 0
Google found my test site - Even with Robots.txt limiting access

Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.

It was a copy of the article, as it appears on my test site. My test site is set up as a subdomain.

Example:
http://testsite.lejoslearning.com (not the actual subdomain, but you get the idea).

I just checked the robots.txt file in Google Webmaster Tools, and it has the correct folder disallowed, as shown in the example below.

User-agent: *
Disallow: /testsite/

But when I go to Google and type in site:testsite.lejoslearning.com, it gives me 180 pages in the index.

What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.

Please advise your thoughts. Thanks.

#2
09-03-2008, 12:53 PM
WebProWorld Pro
Join Date: Oct 2003
Location: Gibsons, BC, Canada
Posts: 271
spiderbait RepRank 5
Re: Google found my test site - Even with Robots.txt limiting access

I believe you need to treat a subdomain as a separate domain. This means you need to have a unique robots.txt file for each subdomain.
__________________
Jade Burnside, Ahead of the Web
What good is your web site if no one can find it?
SEO & Optimized Web Site Design

#3
09-03-2008, 01:59 PM
WebProWorld Member
Join Date: Jun 2008
Posts: 28
rfrazee RepRank 0
Re: Google found my test site - Even with Robots.txt limiting access

Thanks. I will work on that option to see if I can get it disallowed, and then try to get things removed from Google, Yahoo, etc.

Ryan

#4
09-03-2008, 11:58 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Re: Google found my test site - Even with Robots.txt limiting access

Try password protecting it. It won't rank for long.

#5
09-04-2008, 06:24 PM
WebProWorld Veteran
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
SemAdvance RepRank 3
Re: Google found my test site - Even with Robots.txt limiting access

A robots.txt file is not always followed.

Robots crawl servers and follow the links they find on the pages contained therein.

Anything on a server is fair game for crawling.

#6
09-04-2008, 06:29 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Each domain and/or sub-domain for which exclusions are desired requires its own robots.txt file.

See Robot Text Files - Do Not Follow Specific Links

#7
09-04-2008, 07:52 PM
WebProWorld Member
Join Date: Nov 2003
Location: uk
Posts: 51
martindow RepRank 0
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by rfrazee View Post
User-agent: *
Disallow: /testsite/
This robots.txt would have disallowed lejoslearning.com/testsite/ but not the subdomain as you have set it up.
__________________
Martin
www.spectrumwellbeing.co.uk

#8
09-04-2008, 07:56 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Correct. As set forth, /testsite/ is a sub-directory relative to the root of the domain within which the robots.txt file exists.
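
To illustrate the scope issue, here is a minimal Python sketch. It assumes a crawler that keeps one set of rules per hostname, which is how the robots.txt convention is normally described; real search engine crawlers may differ in detail. `ROBOTS_BY_HOST`, `build_parser` and `is_allowed` are hypothetical names made up for this illustration. Only `urllib.robotparser` and `urllib.parse` are real standard-library modules.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Parse robots.txt text (no network access) into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical table: the only robots.txt that exists lives on the main domain.
ROBOTS_BY_HOST = {
    "lejoslearning.com": build_parser("User-agent: *\nDisallow: /testsite/\n"),
}


def is_allowed(url, user_agent="*"):
    """Check a URL against the rules of its own host only."""
    host = urlsplit(url).hostname
    parser = ROBOTS_BY_HOST.get(host)
    if parser is None:
        # Assumption: a host with no robots.txt of its own has no restrictions.
        return True
    return parser.can_fetch(user_agent, url)


# The rule matches the sub-directory on the main domain...
print(is_allowed("http://lejoslearning.com/testsite/page.html"))
# ...but says nothing about a different host such as the subdomain.
print(is_allowed("http://testsite.lejoslearning.com/page.html"))
```

Adding an entry for `testsite.lejoslearning.com` with `Disallow: /` to the table would be the sketch's equivalent of giving the subdomain its own robots.txt file.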

#9
09-09-2008, 11:32 AM
WebProWorld Pro
Join Date: Dec 2003
Location: Eastleigh, Hampshire, UK
Posts: 160
Clarrie RepRank 2
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by rfrazee View Post
Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.

It was a copy of the article, as it appears on my test site. My test site is set up as a subdomain.

Example:
http://testsite.lejoslearning.com (not the actual subdomain but you get the idea.

I just checked Google Webmaster tools robots.txt file and it has the correct folder disallowed, as shown in the example below.

User-agent: *
Disallow: /testsite/

But when I go to Google and type in site: testsite.lejoslearning.com it gives me 180 pages in the index.

What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.

Please advise your thoughts. Thanks.
I also had a similar problem with development content showing up on Google. It surprised me, as the dev area a) was restricted by robots.txt and b) had no links into it for Googlebot to follow.

I simply password-protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
Is Google Spying!?

And realised this was the most likely culprit - it looks like if Google finds new content through means other than crawling, it doesn't bother to check robots.txt...

So if you are using the Google Toolbar, this might well be how your content has been picked up.
__________________
Clarrie
www.dvisions.co.uk - lose the camouflage and stand out...

#10
09-09-2008, 03:08 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by Clarrie View Post
I also had a similar problem with development content showing up on google. Surprised me as the dev area a) was restricted by robots.txt and b) there were no links into it for Googlebot to follow.

Simply password protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
Is Google Spying!?

And realised this was the most likely culprit - looks like if G finds new content through other means than crawling, it doesn't bother to check robots.txt...

So if you are using the google toolbar, this might well be how you content has been picked up
The days of SEs finding resources on the web by blindly crawling it are long gone. The web is expanding much too quickly for that to now be effective and/or efficient, and there is now a plethora of other means of discovering new resources.

That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered.

#11
09-10-2008, 05:40 AM
WebProWorld Pro
Join Date: Dec 2003
Location: Eastleigh, Hampshire, UK
Posts: 160
Clarrie RepRank 2
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by deepsand View Post
That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered.
Hmm? Not sure I 100% agree with this - allegedly the first thing any of the Google bots do when they index anything on a domain is check robots.txt, not just on a full crawl. The fact that the existence of the content was flagged up by the toolbar (or by any other means) doesn't change that protocol - the toolbar doesn't do the indexing, it merely tells Google that the content exists. Google then sends a bot to index it, and that bot should in theory always read and follow the robots.txt instructions.

But at the end of the day, for whatever reason robots.txt is not an infallible way to hide content from search engine bots!
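
For what it's worth, a crawler that follows the robots.txt convention looks for the file at the root of whatever host the URL is on, so the file it consults depends only on the scheme and hostname. A small Python sketch of that lookup follows. `robots_url_for` is a hypothetical helper name made up for this example, and this says nothing about what Google's bots actually do internally.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(url):
    """Return the robots.txt location that governs the given URL."""
    parts = urlsplit(url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page on the subdomain is governed by the subdomain's own file,
# not by the one on the main domain.
print(robots_url_for("http://testsite.lejoslearning.com/articles/old.html"))
print(robots_url_for("http://lejoslearning.com/testsite/articles/old.html"))
```

One consequence of this lookup is that a robots.txt placed inside a sub-folder, rather than at the root of the host, would never be requested.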
__________________
Clarrie
www.dvisions.co.uk - lose the camouflage and stand out...

#12
09-10-2008, 02:01 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by Clarrie View Post
Hmm? Not sure I 100% agree with this - Allegedly the 1st thing any of the Google bots do when they index anything on a domain is to check the robots.txt - not just for a full crawl. The fact that the existence of the content was flagged up by the toolbar (or by any other means) doesn't change that protocol - the toolbar doesn't do the indexing, it merely tells Google that the content exists and sends a bot to index it, and that bot should in theory always read and follow the robot.txt instructions.
My statement was "That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered."

Being "discovered" and being "indexed" are 2 quite different things. Likewise, the indexing of an individual resource & that of the collective work that contains it are also different things. Lastly, being "indexed" is not the same thing as being published.

That an individual resource has been discovered does not mean that any particular SE has taken any action beyond ignoring it or privately indexing it.

#13
09-10-2008, 03:04 PM
WebProWorld Member
Join Date: Jun 2008
Posts: 28
rfrazee RepRank 0
Re: Google found my test site - Even with Robots.txt limiting access

Please note, the subdomain also has its own robots.txt file, which also has a disallow rule.

My understanding was that if the robots.txt file was there, it would not be indexed.

The solution of putting a password on the front of it is the path we are working on, but the issue of the search engines not abiding by the robots.txt file is a bit concerning.

#14
09-10-2008, 03:08 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by rfrazee View Post
Please note, the subdomain also has its own robots.txt file while also has a disallow.

My understanding was that if the robots.txt file was there, it would not be indexed.

The solution of putting a password on the front of it is the path we are working on, but the issue of the search engines not abiding by the robots.txt file, is a bit concerning.
SE compliance with robot directives has always been, perforce, discretionary.

#15
09-10-2008, 06:37 PM
WebProWorld Pro
Join Date: Dec 2007
Location: Brussels, Belgium
Posts: 164
Jean-Luc RepRank 2
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by rfrazee View Post
I just checked Google Webmaster tools robots.txt file and it has the correct folder disallowed, as shown in the example below.

User-agent: *
Disallow: /testsite/

But when I go to Google and type in site: testsite.lejoslearning.com it gives me 180 pages in the index.
Quote:
Originally Posted by rfrazee View Post
Please note, the subdomain also has its own robots.txt file while also has a disallow.

My understanding was that if the robots.txt file was there, it would not be indexed.
Do you mean that you added the robots.txt of the subdomain between your first and second posts?

Do not blame the search engines if they looked at the subdomain before you added the robots.txt file there.

Jean-Luc
__________________
Checking redirects made easy | Professional AWStats Services

#16
09-15-2008, 06:32 PM
WebProWorld Member
Join Date: Jun 2008
Posts: 28
rfrazee RepRank 0
Re: Google found my test site - Even with Robots.txt limiting access

Jean-Luc,
Thanks for the question to clarify. Let me answer it quickly.

We have not added robots.txt files after we found out that Google had indexed us. We had always had the robots.txt files in place in both the specific folder in the subdomain, and the main domain of the site.

I posted again to clarify, since in my first post I had not identified the fact that we had, and still do have, robots.txt files in both locations.

Just to clarify one other thing, I am not blaming the search engines for anything. My assumption is that we did not understand something correctly, and therefore allowed the search engines to crawl something we didn't want them to crawl. So far, based on the comments we have received, it seems that the best solution is to put anything in development behind a password-protected part of the site, and also disallow it in robots.txt. My only concern is that I thought robots.txt was a definitive, black-and-white solution, but it appears not to be so black and white.

Thanks again for the question to clarify.
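
As a rough illustration of why a password is the stronger control, here is a minimal Python sketch of an HTTP Basic authentication check. Unlike robots.txt, which a client may ignore, the server itself refuses to hand over the page. The names `is_authorized` and `status_for` and the credentials `dev` / `s3cret` are hypothetical and exist only for this example. In practice you would normally switch on your web server's or host's built-in password protection rather than write this yourself.

```python
import base64
import hmac


def is_authorized(authorization_header, expected_user, expected_password):
    """Return True only if the header carries matching Basic credentials."""
    if not authorization_header or not authorization_header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(authorization_header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except (ValueError, UnicodeDecodeError):
        return False  # malformed header: treat as unauthenticated
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    password_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and password_ok


def status_for(authorization_header):
    """HTTP status a protected test site would send for this request."""
    return 200 if is_authorized(authorization_header, "dev", "s3cret") else 401


# A crawler arrives with no credentials and gets 401, whatever robots.txt says.
print(status_for(None))
```

A request that sends the right credentials, for example the header built from `base64.b64encode(b"dev:s3cret")`, would get the page; anything else, including every anonymous bot, gets the 401.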

#17
09-16-2008, 05:36 PM
WebProWorld Veteran
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
SemAdvance RepRank 3
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by Clarrie View Post
I also had a similar problem with development content showing up on google. Surprised me as the dev area a) was restricted by robots.txt and b) there were no links into it for Googlebot to follow.

Simply password protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
Is Google Spying!?

And realised this was the most likely culprit - looks like if G finds new content through other means than crawling, it doesn't bother to check robots.txt...

So if you are using the google toolbar, this might well be how you content has been picked up

Hi Clarrie

Search spiders crawl servers AND follow the links on the pages they find on the server.

Search spiders do not always follow the instructions in robots.txt, especially if the robots.txt is not constructed properly.

Hope it helps.

#18
12-18-2008, 09:25 PM
WebProWorld Veteran
Join Date: Sep 2007
Posts: 522
full house RepRank 2
Re: Google found my test site - Even with Robots.txt limiting access

Really! Spiders also crawl the server... I didn't know that. I thought spiders could only see the web. So this could help in crawling my site.

#19
12-20-2008, 01:15 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Re: Google found my test site - Even with Robots.txt limiting access

Quote:
Originally Posted by full house View Post
really! spiders also crawl server... I don't know that. I though only the web that spiders can see. so, this could help in crawling my site.
Do your pages exist someplace other than a server? How can a crawler/spider/robot reach your pages if it does not access your server?

That being said, understand that 'bots do not have unfettered access to any server. They can only access files via URIs (Uniform Resource Identifiers) that they know of; and, of those, only those that do not require any login other than "anonymous."
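
One common way a 'bot comes to "know of" a URI is by reading the links on a page it has already fetched. The Python sketch below shows that step using only the standard library. `LinkCollector` and `discover_links` are hypothetical names made up for this illustration, and the HTML snippet is invented; it is not how any particular search engine is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def discover_links(base_url, html):
    """Return the URIs a crawler would learn about from this page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


page = (
    '<a href="/a.html">A</a> '
    '<a href="http://testsite.lejoslearning.com/b.html">B</a>'
)
for link in discover_links("http://lejoslearning.com/", page):
    print(link)
```

A single stray link to the test subdomain on any public page is enough for its URL to end up in the list, which is one way a test site can be found without ever being announced.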