
Thread: Google found my test site - Even with Robots.txt limiting access

  1. #1
    Junior Member
    Join Date
    Jun 2008
    Posts
    28

    Google found my test site - Even with Robots.txt limiting access

    Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.

    It was a copy of the article, as it appears on my test site. My test site is set up as a subdomain.

    Example:
    http://testsite.lejoslearning.com (not the actual subdomain, but you get the idea).

    I just checked the robots.txt file in Google Webmaster Tools, and it has the correct folder disallowed, as shown in the example below.

    User-agent: *
    Disallow: /testsite/

    But when I go to Google and type in site:testsite.lejoslearning.com it gives me 180 pages in the index.

    What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.

    Please advise your thoughts. Thanks.

  2. #2
    Senior Member spiderbait's Avatar
    Join Date
    Oct 2003
    Posts
    268

    Re: Google found my test site - Even with Robots.txt limiting access

    I believe you need to treat a subdomain as a separate domain. This means you need to have a unique robots.txt file for each subdomain.
    Jade Burnside, Ahead of the Web
    What good is your web site if no one can find it?
    SEO & Optimized Web Site Design

  3. #3
    Junior Member
    Join Date
    Jun 2008
    Posts
    28

    Re: Google found my test site - Even with Robots.txt limiting access

    Thanks. I will work on that option to see if I can get it disallowed, and then try to get things removed from Google, Yahoo, etc.

    Ryan

  4. #4
    WebProWorld MVP incrediblehelp's Avatar
    Join Date
    Jan 2004
    Posts
    7,567

    Re: Google found my test site - Even with Robots.txt limiting access

    Try password protecting it. It won't rank for long.

  5. #5
    WebProWorld MVP SemAdvance's Avatar
    Join Date
    Dec 2005
    Posts
    1,037

    Re: Google found my test site - Even with Robots.txt limiting access

    A robots.txt file is not always followed.

    Robots crawl servers and follow the links they find on the pages contained therein.

    Anything on a server is fair game for crawling.

  6. #6
    Senior Member deepsand's Avatar
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,489

    Re: Google found my test site - Even with Robots.txt limiting access

    Each domain and/or sub-domain for which exclusions are desired requires its own robots.txt file.

    See http://www.webproworld.com/search-en...fic-links.html
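
    To make the "one robots.txt per host" point concrete, here is a minimal Python sketch of how a well-behaved crawler would work out which robots.txt applies to a page. The helper name robots_url_for and the article.html paths are made up for illustration; only the host names come from this thread.

```python
from urllib.parse import urlsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url.

    Hypothetical helper: it keeps only the scheme and host of the page
    and always points at /robots.txt in the root of that host.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


# article.html is an assumed path, used only as an example.
main = robots_url_for("http://lejoslearning.com/testsite/article.html")
sub = robots_url_for("http://testsite.lejoslearning.com/article.html")

print(main)  # robots.txt on the main domain
print(sub)   # a different robots.txt, on the subdomain
```

    The two calls give two different files, which is why rules placed only on the main domain say nothing about the subdomain.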

  7. #7

    Re: Google found my test site - Even with Robots.txt limiting access

    Quote Originally Posted by rfrazee View Post
    User-agent: *
    Disallow: /testsite/
    This robots.txt would have disallowed
    lejoslearning.com/testsite
    but not the subdomain as you have set it up.

  8. #8
    Senior Member deepsand's Avatar
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,489

    Re: Google found my test site - Even with Robots.txt limiting access

    Correct. As set forth, /testsite/ is a sub-directory relative to the root of the domain within which the robots.txt file exists.
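
    A rough way to see this is with Python's standard urllib.robotparser. This is only a sketch under some assumptions: the article.html paths are invented, the "block everything" rule set is just one possible file for the subdomain, and the pairing of each rule set with a host is done by hand here because, as far as I can tell, can_fetch only compares the path portion of the URL.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Rules from the original post, as served by the main domain.
main_rules = parse_rules("User-agent: *\nDisallow: /testsite/\n")
# The subdomain with no rules of its own (assumed situation).
no_rules = parse_rules("")
# One possible robots.txt for the subdomain: block everything.
block_all = parse_rules("User-agent: *\nDisallow: /\n")

# The main-domain rule only matches the /testsite/ directory path.
dir_allowed = main_rules.can_fetch(
    "Googlebot", "http://lejoslearning.com/testsite/article.html")
# With no rules on the subdomain, nothing is excluded there.
sub_allowed = no_rules.can_fetch(
    "Googlebot", "http://testsite.lejoslearning.com/article.html")
# With its own robots.txt, the subdomain is excluded for compliant bots.
sub_blocked = block_all.can_fetch(
    "Googlebot", "http://testsite.lejoslearning.com/article.html")

print(dir_allowed, sub_allowed, sub_blocked)
```

    Keep in mind this only models crawlers that honor robots.txt; it does nothing for pages that are already in the index.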

  9. #9
    Senior Member
    Join Date
    Dec 2003
    Posts
    286

    Re: Google found my test site - Even with Robots.txt limiting access

    Quote Originally Posted by rfrazee View Post
    Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.

    It was a copy of the article, as it appears on my test site. My test site is set up as a subdomain.

    Example:
    http://testsite.lejoslearning.com (not the actual subdomain, but you get the idea).

    I just checked the robots.txt file in Google Webmaster Tools, and it has the correct folder disallowed, as shown in the example below.

    User-agent: *
    Disallow: /testsite/

    But when I go to Google and type in site:testsite.lejoslearning.com it gives me 180 pages in the index.

    What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.

    Please advise your thoughts. Thanks.
    I also had a similar problem with development content showing up on Google. It surprised me, as the dev area a) was restricted by robots.txt and b) had no links into it for Googlebot to follow.

    I simply password protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
    http://www.webproworld.com/search-en...le-spying.html

    And realised this was the most likely culprit - it looks like if G finds new content through means other than crawling, it doesn't bother to check robots.txt...

    So if you are using the Google Toolbar, this might well be how your content has been picked up.
    Clarrie
    www.dvisions.co.uk - lose the camouflage and stand out...

  10. #10
    Senior Member deepsand's Avatar
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,489

    Re: Google found my test site - Even with Robots.txt limiting access

    Quote Originally Posted by Clarrie View Post
    I also had a similar problem with development content showing up on Google. It surprised me, as the dev area a) was restricted by robots.txt and b) had no links into it for Googlebot to follow.

    I simply password protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post:
    http://www.webproworld.com/search-en...le-spying.html

    And realised this was the most likely culprit - it looks like if G finds new content through means other than crawling, it doesn't bother to check robots.txt...

    So if you are using the Google Toolbar, this might well be how your content has been picked up.
    The days of SEs finding resources on the web by blindly crawling it are long gone. The web is expanding much too quickly for that to be effective or efficient now, and there is a plethora of other means of discovering new resources.

    That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and it is only when such a site crawl occurs that the robots.txt file is discovered.
