Google found my test site - Even with Robots.txt limiting access
rfrazee
09-03-2008, 11:50 AM
Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.
It was a copy of the article, as it appears on my test site. My test site is set up as a subdomain.
Example:
http://testsite.lejoslearning.com (not the actual subdomain, but you get the idea).
I just checked the robots.txt file in Google Webmaster Tools, and it has the correct folder disallowed, as shown in the example below.
User-agent: *
Disallow: /testsite/
But when I go to Google and type in site:testsite.lejoslearning.com it gives me 180 pages in the index.
What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.
Please share your thoughts. Thanks.
spiderbait
09-03-2008, 11:53 AM
I believe you need to treat a subdomain as a separate domain. This means you need to have a unique robots.txt file for each subdomain.
rfrazee
09-03-2008, 12:59 PM
Thanks. I will work on that option to see if I can get it disallowed, and then try to get things removed from Google, Yahoo, etc.
Ryan
incrediblehelp
09-03-2008, 10:58 PM
Try password protecting it. It won't rank for long.
SemAdvance
09-04-2008, 05:24 PM
A robots.txt file is not always followed.
Robots crawl servers and follow the links they find on the pages contained therein.
Anything on a server is fair game for being crawled.
deepsand
09-04-2008, 05:29 PM
Each domain and/or sub-domain for which exclusions are desired requires its own robots.txt file.
See http://www.webproworld.com/search-engine-optimization-forum/62731-robot-text-files-do-not-follow-sprecific-links.html
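To illustrate the per-host rule, here is a minimal Python sketch that works out which robots.txt file applies to a given page. The helper name `robots_url` and the example page paths are made up for illustration; the only point is that the file is looked up at the root of whatever host serves the page.

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt always sits at the root of the same scheme and host.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# A page in a folder of the main domain and a page on the subdomain
# are governed by two different files.
print(robots_url("http://lejoslearning.com/testsite/article.html"))
print(robots_url("http://testsite.lejoslearning.com/article.html"))
```

The two calls give different robots.txt locations, which is why a rule in the main domain's file says nothing about the subdomain.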
martindow
09-04-2008, 06:52 PM
User-agent: *
Disallow: /testsite/
This robots.txt would have disallowed the folder
lejoslearning.com/testsite/
but not the subdomain as you have set it up.
deepsand
09-04-2008, 06:56 PM
Correct. As set forth, /testsite/ is a sub-directory relative to the root of the domain within which the robots.txt file exists.
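The same point can be checked with Python's standard `urllib.robotparser`. This is only a sketch under assumptions: the `rules_by_host` dictionary and the `is_allowed` helper are hypothetical, and the rules are parsed from strings rather than fetched from the real site. It models a well-behaved crawler, not how any particular search engine actually works.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(url: str, rules_by_host: dict, agent: str = "Googlebot") -> bool:
    """Apply only the robots.txt that belongs to the URL's own host."""
    parser = rules_by_host.get(urlsplit(url).netloc)
    if parser is None:
        # No robots.txt known for this host, so nothing is excluded.
        return True
    return parser.can_fetch(agent, url)


# Only the main domain has a robots.txt, with the rule from the first post.
rules_by_host = {
    "lejoslearning.com": make_parser("User-agent: *\nDisallow: /testsite/\n"),
}

main_folder = "http://lejoslearning.com/testsite/page.html"
subdomain_page = "http://testsite.lejoslearning.com/page.html"

print(is_allowed(main_folder, rules_by_host))     # folder on the main domain
print(is_allowed(subdomain_page, rules_by_host))  # page on the subdomain

# Giving the subdomain its own file that disallows everything closes the gap.
rules_by_host["testsite.lejoslearning.com"] = make_parser(
    "User-agent: *\nDisallow: /\n"
)
print(is_allowed(subdomain_page, rules_by_host))
```

In this model the folder rule blocks only the main domain's /testsite/ path; the subdomain stays open until it gets a robots.txt of its own.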
Clarrie
09-09-2008, 10:32 AM
I also had a similar problem with development content showing up on Google. It surprised me, as the dev area a) was restricted by robots.txt and b) had no links into it for Googlebot to follow.
I simply password protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
http://www.webproworld.com/search-engine-optimization-forum/12997-google-spying.html
And realised this was the most likely culprit - it looks like if G finds new content through means other than crawling, it doesn't bother to check robots.txt...
So if you are using the Google toolbar, this might well be how your content has been picked up.
deepsand
09-09-2008, 02:08 PM
The days of SEs finding resources on the web by blindly crawling it are long gone. The web is expanding much too quickly for that to now be effective and/or efficient, and there is now a plethora of other means of discovering new resources.
That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered.
Clarrie
09-10-2008, 04:40 AM
Hmm? Not sure I 100% agree with this - allegedly the first thing any of the Google bots do when they index anything on a domain is to check the robots.txt, not just for a full crawl. The fact that the existence of the content was flagged up by the toolbar (or by any other means) doesn't change that protocol - the toolbar doesn't do the indexing, it merely tells Google that the content exists, and Google sends a bot to index it; that bot should in theory always read and follow the robots.txt instructions.
But at the end of the day, for whatever reason, robots.txt is not an infallible way to hide content from search engine bots!
deepsand
09-10-2008, 01:01 PM
My statement was "That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered."
Being "discovered" and being "indexed" are two quite different things. Likewise, the indexing of an individual resource and that of the collective work that contains it are also different things. Lastly, being "indexed" is not the same thing as being published.
That an individual resource has been discovered does not mean that any particular SE has taken any action beyond ignoring it or privately indexing it.
rfrazee
09-10-2008, 02:04 PM
Please note, the subdomain also has its own robots.txt file, which also has a disallow.
My understanding was that if the robots.txt file was there, it would not be indexed.
Putting a password on the front of it is the solution we are working on, but the issue of the search engines not abiding by the robots.txt file is a bit concerning.
deepsand
09-10-2008, 02:08 PM
SE compliance with robot directives has always been, perforce, discretionary.
Jean-Luc
09-10-2008, 05:37 PM
Do you mean that you added the robots.txt of the subdomain between your first and second post?
Do not blame the search engines if they looked at the subdomain before you added the robots.txt file there.
Jean-Luc
rfrazee
09-15-2008, 05:32 PM
Jean-Luc,
Thanks for the question to clarify. Let me answer it quickly.
We did not add the robots.txt files after we found out that Google had indexed us. We have always had robots.txt files in place, both in the specific folder in the subdomain and in the main domain of the site.
I posted again to clarify, since in my first post I had not identified the fact that we had, and still do have, robots.txt files in both locations.
Just to clarify one other thing, I am not blaming the search engines for anything. My assumption is that we did not understand something correctly, and therefore allowed the search engines to crawl something we didn't want them to crawl. So far, based on the comments we have received, it seems that the best solution is to put anything in development behind a password protected part of the site, and also exclude it with robots.txt. My only concern is that I thought robots.txt was a definitive, black and white solution, but it appears not to be so black and white.
Thanks again for the question to clarify.
SemAdvance
09-16-2008, 04:36 PM
Hi Clarrie
Search spiders crawl servers AND happen to follow the links on the pages they find on the server.
Search spiders do not always follow the instructions in robots.txt,
especially if the robots.txt is not constructed properly.
Hope it helps.
full house
12-18-2008, 08:25 PM
Really! Spiders also crawl servers... I didn't know that. I thought spiders could only see the web pages. So this could help in crawling my site.
deepsand
12-20-2008, 12:15 PM
Do your pages exist someplace other than a server? How can a crawler/spider/robot reach your pages if it does not access your server?
That being said, understand that 'bots do not have unfettered access to any server. They can only access files via URIs (Uniform Resource Identifiers) that they know of; and, of those, only those that do not require any login other than "anonymous."
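As a rough illustration of why a login keeps 'bots out where robots.txt may not, here is a minimal Python sketch of an HTTP Basic authentication check. The function names and the `dev` / `s3cret` credentials are invented for the example, and a real site would normally rely on the web server's own password protection rather than hand-written code like this.

```python
import base64
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str],
                  username: str, password: str) -> bool:
    """Check an HTTP Basic 'Authorization' header against one login."""
    prefix = "Basic "
    if not authorization_header or not authorization_header.startswith(prefix):
        return False  # anonymous request, as a crawler would send
    try:
        raw = base64.b64decode(authorization_header[len(prefix):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:
        return False  # malformed base64 or not valid UTF-8
    expected = f"{username}:{password}"
    # Constant-time comparison to avoid leaking how much matched.
    return hmac.compare_digest(decoded.encode("utf-8"), expected.encode("utf-8"))


def status_for(authorization_header: Optional[str]) -> int:
    """Return the HTTP status a protected dev area would answer with."""
    # "dev" / "s3cret" are placeholder credentials for this sketch only.
    return 200 if is_authorized(authorization_header, "dev", "s3cret") else 401


token = base64.b64encode(b"dev:s3cret").decode("ascii")
print(status_for(None))              # request without credentials
print(status_for("Basic " + token))  # request with the right login
```

A request without valid credentials gets a 401 and never receives the page content, so there is nothing to index regardless of whether the robot honours robots.txt.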