Thread: Will this robots txt stop search engines from crawling?

  1. #1
    Senior Member
    Join Date
    Mar 2008
    Posts
    754

    Will this robots txt stop search engines from crawling?

    I was poking around some website directories where I work, and found this robots.txt file in one of the sites:

    User-Agent: *
    Disallow: /

    # No search engines allowed.

    Doesn't this tell search engines to NOT crawl the site? Kind of strange, because the site holds a #2 position in Google for its targeted keyword. However, the Google listing has never included a title tag, description, or snippet of content -- just the url (which is what prompted me to investigate).
    Do the best you can - as fast as you can - then fix it later.
    --Seth Godin

  2. #2
    Administrator weegillis
    Join Date
    Oct 2003
    Posts
    5,789
    That is the most basic of directives, and yes, it will 'block', or rather, 'discourage' the honest SE's. Not all bots belong to scrupulous SE's, though, and there will always be some that ignore robots.txt or never bother to request it, so there is still a chance that resources on the domain end up in the search index, even that of the honest SE's, as a side effect.
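
For anyone who wants to check this for themselves, here is a minimal sketch using Python's standard `urllib.robotparser`, which applies the same "ask first" logic a polite crawler follows. The `example.com` URLs and the agent names are placeholders for illustration, not details of the site being discussed.

```python
from urllib.robotparser import RobotFileParser

# The rule lines from the file quoted in the first post.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /

# No search engines allowed.
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder agent names and URLs; every compliant crawler gets the same answer.
home_allowed = parser.can_fetch("Googlebot", "https://example.com/")
deep_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/products/page.html")

print(home_allowed, deep_allowed)  # both False: nothing may be crawled
```

This only models crawlers that choose to honour the file. Nothing here stops a bot that never asks.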

  3. #3
    Senior Member
    Join Date
    Mar 2008
    Posts
    754
    Originally Posted by weegillis:
    That is the most basic of directives, and yes, it will 'block', or rather, 'discourage' the honest SE's.
    There's no reason why this site should be blocked from the search engines (especially Google), so it must be a goofy mistake on someone's part to have added the file to the directory. Talk about shooting yourself in the foot! But then no one ever took a closer look because it did so well in the serps anyway.

  4. #4
    Administrator weegillis
    Join Date
    Oct 2003
    Posts
    5,789
    Was there more than one robots.txt file? From what you say above, it sounds like someone has been treating them like .htaccess files. The only robots.txt that SE's ask for is the one (if present) on the domain root. All others are pointless.
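
To make the "domain root only" point concrete, here is a small sketch of how a crawler could work out which robots.txt to request for any page it finds. The helper name `robots_url` and the sample URL are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the single robots.txt location that applies to this page's host."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with any port), replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/2008/post.html?ref=feed"))
# prints "https://example.com/robots.txt"
```

However deep the page sits, the lookup lands on the same file at the root, so a robots.txt dropped into a subdirectory is never consulted.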

  5. #5
    Senior Member
    Join Date
    Mar 2008
    Posts
    754
    Originally Posted by weegillis:
    Was there more than one robots.txt file?
    Nope. Just the one file: robots.txt located in the root directory.
    This is the entire contents of that file:

    # robots.txt for <the domain url>


    User-Agent: *
    Disallow: /


    # No search engines allowed.

  6. #6
    Administrator weegillis
    Join Date
    Oct 2003
    Posts
    5,789
    Then the best we can assume is that the site owner(s) did a bit of SEO and link building, and attracted enough inbounds to push their keywords up in the serps. It's possible that their pages have never even been visited by a crawler (from the big SE's).

    It helps to know that robots.txt is not a block. It is only a directive, and de facto at that. Pages can be and are still requested, just not indexed. Links are still followed and, assuming there is any PR on the page, juice is still passed.

    I had hoped that Webnauts or WilliamC would pipe in to debunk my response. They are both far more knowledgeable on this topic than am I.

  7. #7
    Senior Member
    Join Date
    Mar 2008
    Posts
    754
    Originally Posted by weegillis:
    It helps to know that robots.txt is not a block. It is only a directive, and de facto at that.
    Do you think the robots.txt has any bearing on the fact that Google shows only the url in the serps? The page definitely has a title tag, description, and related content -- Google just doesn't show it for some reason.

  8. #8
    Administrator weegillis
    Join Date
    Oct 2003
    Posts
    5,789
    Originally Posted by keyon:
    Do you think the robots.txt has any bearing on the fact that Google shows only the url in the serps? The page definitely has a title tag, description, and related content -- Google just doesn't show it for some reason.
    My thoughts, exactly. The algo does not work in a vacuum. The site obviously has an established link profile, just not one derived from the site pages themselves.

  9. #9
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,481
    Said directive is literally a "Do Not Trespass" command, and well-behaved 'bots will comply by not requesting any files from the URL in question. This means that they will have no knowledge of any of the contents, including any links contained within. The result will be an absence of the external display of "a title tag, description, or snippet of content."

    However, they do know of the existence of the URL, and will index it. Additionally, they may discover files internal to the URL via IBLs from other sites; these too will be indexed, presuming that a Header check reveals that such files are indeed extant.

    All such indexed files will participate in the PageRank calculation matrix.

    And, they may well be displayed in SERPs if there is evidence of public interest, as Matt Cutts noted some years ago. In the case at hand, the DN itself may be one of interest and/or there may be sufficient IBLs to evidence such interest.

    It should not be assumed that there is "no reason why this site should be blocked from the search engines." TPTB may very well have a rational reason for such that others are unaware of.
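
The behaviour described in this post can be sketched as a toy model. This is not any real search engine's pipeline: `Listing`, `build_listing`, `fake_fetch`, and the `ExampleBot` agent are all hypothetical names invented for the illustration. The assumption is simply that the URL is already known from inbound links, and that content is only read when robots.txt permits a fetch.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.robotparser import RobotFileParser


@dataclass
class Listing:
    url: str
    title: Optional[str] = None
    snippet: Optional[str] = None


def build_listing(url, robots, fetch_page, agent="ExampleBot"):
    """Toy model: the URL is always listed, content only if crawling is allowed."""
    if not robots.can_fetch(agent, url):
        return Listing(url=url)  # URL-only entry, page never requested
    title, snippet = fetch_page(url)
    return Listing(url=url, title=title, snippet=snippet)


def fake_fetch(url):
    # Stand-in for a real HTTP request and HTML parse.
    return ("Example title", "Example snippet text")


blocked = RobotFileParser()
blocked.parse(["User-Agent: *", "Disallow: /"])

open_site = RobotFileParser()
open_site.parse(["User-Agent: *", "Disallow:"])  # empty Disallow permits everything

blocked_listing = build_listing("https://example.com/", blocked, fake_fetch)
open_listing = build_listing("https://example.com/", open_site, fake_fetch)
print(blocked_listing)
print(open_listing)
```

The blocked site still produces a listing, just one with no title or snippet, which matches the bare-URL result described in the first post.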

  10. #10
    Moderator Tiggerito
    Join Date
    May 2004
    Location
    Adelaide, Australia
    Posts
    550
    Google Webmaster Tools recently flagged a client's website as having important pages blocked from indexing (or similar words).

    I checked things and the developers had accidentally copied over the development robots.txt file onto the live server. The file was like the one above, a complete block!

    WMT was already flagging thousands of pages as being "Restricted by robots.txt".

    I got them to fix the problem, urgently.

    Now waiting to see if there is any impact.
    by Tony McCreath (Tiggerito)
    owner of Web Site Advantage
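
A mix-up like this could be caught with a small pre-deploy check. The sketch below is only one possible approach: `blocks_whole_site` is a made-up helper, and the `LIVE_ROBOTS` content (with its `/admin/` rule) is a hypothetical example rather than the client's actual file.

```python
from urllib.robotparser import RobotFileParser


def blocks_whole_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if this robots.txt text forbids `agent` from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# The blanket block discussed in this thread, as a development site might use it.
DEV_ROBOTS = "User-Agent: *\nDisallow: /\n"
# Hypothetical production file that only hides one directory.
LIVE_ROBOTS = "User-Agent: *\nDisallow: /admin/\n"

for name, text in (("dev", DEV_ROBOTS), ("live", LIVE_ROBOTS)):
    status = "BLOCKS EVERYTHING" if blocks_whole_site(text) else "ok"
    print(f"{name}: {status}")
```

Running something like this against the file about to go live, and refusing to deploy when it reports a full block, would flag the problem before the search engines do.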
