Robots.txt and Sitemaps



ShoreCon
05-07-2007, 10:24 AM
Hi Everyone!

It occurred to me recently: couldn't a really detailed robots.txt file act just like a sitemap for the search engines?

Of course I realize it is still important to have a sitemap page in many respects, but wouldn't/couldn't a robots.txt file work in much the same way?

Here is the basic format of a robots file:

User-agent: *
Allow: /
Disallow: /cgi-bin/

What if you had a detailed list of "Allowed" files or directories in the list? By definition, wouldn't that simply do the trick . . .

Just a thought.

Cheers!

fctoma
05-07-2007, 03:32 PM
I don't think anything should, or does, replace a normal site map. One reason is for the visitors. A normal site map is broken into various categories (maybe by product, area of coverage) with the respective links below. Quite a few larger sites have their 404 pointing to their sitemap.

Plus, another reason is obviously for onsite links to all your pages.

Best of luck!

Frank in Idaho

SummitPK
05-07-2007, 04:04 PM
Probably.

I too am not at my PC. I'm off climbing in beautiful Ouray, Colorado, then soaking in the hot springs with a beer.

zepop
05-07-2007, 04:07 PM
While what you state seems logical, it is not historically correct. Read what Wikipedia explains:

http://en.wikipedia.org/wiki/Robots.txt

From Wikipedia, the free encyclopedia

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended, otherwise all files with names starting with that substring will match, rather than just those in the directory intended.
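To make the trailing-slash point concrete, here is a minimal Python sketch. It assumes plain prefix matching from the start of the path, which is how the original convention compares patterns. The function name is_disallowed and the sample paths are made up for illustration, not taken from any real site.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the path starts with any Disallow pattern (plain prefix match)."""
    # An empty Disallow value blocks nothing, so empty patterns are skipped.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# Hypothetical paths, chosen only to show the difference.
paths = ["/private/report.html", "/private-eye.html", "/privateroom/page.html"]

for rule in ("/private/", "/private"):
    blocked = [p for p in paths if is_disallowed(p, [rule])]
    print(f"Disallow: {rule} blocks {blocked}")
```

With the trailing slash only the file inside the directory is blocked; without it, all three paths match.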

I hope this helps you understand the robots.txt file.

Sincerely,
zepop

martindow
05-07-2007, 04:24 PM
Isn't the thing to do to add a line to the robots.txt file showing the sitemap address to search engines? For example:
Sitemap: http://www.example.com/sitemap.xml
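
For anyone curious how a crawler might pick that line up, here is a rough Python sketch. It is only an assumption about how the file could be read, not how any search engine actually does it. The function name find_sitemaps and the sample file are invented for the example.

```python
def find_sitemaps(robots_txt):
    """Collect the URL from every 'Sitemap:' line, ignoring case and comments."""
    urls = []
    for raw in robots_txt.splitlines():
        # Drop anything after '#', which starts a comment in robots.txt.
        line = raw.split("#", 1)[0].strip()
        # Split on the first colon only, so the colon in 'http://' is kept.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt content.
sample = """\
User-agent: *
Disallow: /cgi-bin/
Sitemap: http://www.example.com/sitemap.xml
"""

print(find_sitemaps(sample))
```

The Sitemap line is independent of the User-agent groups, so the sketch does not track which group it appears under.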

craigmn3
05-07-2007, 05:18 PM
Be a belt-and-suspenders person in this case. No use having your pants come down because you didn't have one or the other.

ShoreCon
05-07-2007, 05:24 PM
Indeed, I understand that "Allow" is not part of the currently accepted protocol, but Google accepts its use. What other reason might they have for accepting it?

http://www.google.com/support/webmasters/bin/answer.py?answer=40364

If you are explicitly saying to Google "feel free to index these folders or directories" it acts much like a sitemap for search engines.

Side Note: I personally do care more about Google (who drives more than 80% of our traffic) than the others, and I do realize the importance of a true sitemap for the user, as well as the development of the sitemap.xml integration in Google. BUT, a robots file seems like a quick, easy and, sure, a little bit dirty way of accomplishing the same thing :-)

zepop
05-07-2007, 07:53 PM
The problem here is that the search engines don't look at it the way you do, as a file that explicitly invites bots to crawl. The search engines prefer a site map because that was an old agreement.
The search engines and web designers may agree to your proposal some time in the future.
The search engines also accept robots instructions (such as whether to index a page or follow its links) in meta tag format, but only a small percentage of web sites use this format.
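
As an illustration of the meta tag format, here is a small Python sketch that reads the robots directives from a page. It uses only the standard html.parser module. The class name RobotsMetaParser and the sample page are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives += [d.strip().lower() for d in content.split(",") if d.strip()]


# Hypothetical page asking robots not to index it or follow its links.
page = '<html><head><meta name="robots" content="noindex, nofollow"></head><body></body></html>'

parser = RobotsMetaParser()
parser.feed(page)
print(parser.directives)
```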

Zepop B-)
Webmaster http://www.smolka.biz and http://smolka.com

nazcreative
05-07-2007, 08:54 PM
My opinion, let the sitemap do its thing and use the robots.txt file for its intended purpose.


________________________________________________

Dan Naz
Integrated Marketing Services (http://www.nazcreative.com)
Apparel Graphic Design (http://www.trenzza.com)
Puerto Rican Graphic Tees (http://www.authenticboricua.com)

Orion
05-07-2007, 09:28 PM
The "Allow" won't do anything, as the default for any file is allow... robots.txt is just to exclude files you do NOT want the robots to follow.

incrediblehelp
05-07-2007, 11:47 PM
Shore, keep them separate, but (recently, as stated above) you can now use the robots.txt as an inclusion file for the bots to find your sitemap.

Announcement: Big 3 Search Engines Team Up On Sitemaps (http://www.webproworld.com/viewtopic.php?t=69703)

dtalbot
05-08-2007, 08:27 AM
You can also create a search-engine-only sitemap based on the protocol at http://www.sitemaps.org/, load it to your site and then add the following line to your robots.txt:

Sitemap: http://www.yourdomain.com/sitemapfilename

Google and Yahoo will then spider the sitemap. I'm not sure about MSN and Ask, but I think they recognize this as well.
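
If you want to generate such a file yourself, here is a minimal Python sketch. It assumes the basic urlset/url/loc layout and the namespace from sitemaps.org, and leaves out the optional tags. The function name build_sitemap and the page URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace given by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML document listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder pages; substitute the real pages of your own site.
pages = ["http://www.yourdomain.com/", "http://www.yourdomain.com/products.html"]
print(build_sitemap(pages))
```

Save the output under the file name you reference in the Sitemap line of your robots.txt.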

itispals
05-08-2007, 01:32 PM
The syntax that you have used needs a second look.
In fact, a few have already pointed out the error.
Allow cannot appear immediately after the User-Agent.
The User-Agent has to be followed by Disallow:
As you have mentioned Google, I assume you are referring to the following code:

"If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: "
This indicates that the site should not be crawled by Googlebot, but can be crawled by Googlebot-Mobile.
It would also be advisable to use http://www.domainname.com/sitemap.xml
Please note: sitemap.xml, not sitemap.html or sitemap.txt.
This then has to be indicated in the robots.txt.
Maybe I can share the code tomorrow...
In fact, when you do this, your site will soon be crawled by MSN and Ask too. (They have agreed, but the system is not yet in place.) As of now, auto-discovery happens with Google and Yahoo.
Hope this helps...
Thanks,
with regards,
itispals
http://www.buckleupnow.com

ShoreCon
05-08-2007, 01:43 PM
It's interesting that you write that "Allow:" cannot follow the User-Agent.

Take a look at Google's own robots.txt file:

http://www.google.com/robots.txt

What are your thoughts?

deepsand
05-08-2007, 05:54 PM
It's interesting that you write that "Allow:" cannot follow the User-Agent.

Take a look at Google's own robots.txt file:

http://www.google.com/robots.txt

What are your thoughts?

Quite informative. It looks like Google follows the "what's yours is mine; what's mine is mine alone" rule!

Interesting interpretation of their "do no evil" mantra.

ShoreCon
05-08-2007, 07:22 PM
"what your's is mine; what's mine is mine alone"

I like that :-) So true!

While I was merely trying to fully understand what they were doing, I wasn't REALLY looking at what they were doing.

It is always good to understand what the standards of a process are, but personally, I will go right to the "source", if you will, and see what they are doing. So, I just followed their lead. An interesting twist, methinks . . .

itispals
05-09-2007, 05:58 AM
Hello ShoreCon,
Thanks for this input.
I am glad I have learnt something new today.
I use one of the syntax checkers for robots.txt, and this is the message I got for Google's robots.txt:
*******************
WARNING: The tool has found some directory paths that don't include a trailing slash character.

Since a missing trailing slash can be either a deliberate decision or an error, and since this tool can't guess the real intentions of the webmaster, here follow some clarifications that could prevent a potential problem:

The following command will disable just the directory "private" and all its contents:
Disallow: /private/

...while the following command will disable both the "private" directory and any file or directory path starting with the text "/private" (so "/private-eye.html", "/privateroom/page.html", etc.):
Disallow: /private

Please be sure to use the correct syntax, according to your needs.
*********************************************
The following block of code contains some errors. Please, remove all the reported errors and check again this robots.txt file.
Line 1 User-agent: *
Line 2 Allow: /searchhistory/
Unknown command. Acceptable commands are "User-agent" and "Disallow".
A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
Line 3 Disallow: /news?output=xhtml&
Line 4 Allow: /news?output=xhtml
Unknown command. Acceptable commands are "User-agent" and "Disallow".
A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
***************************************************
So it is based on these inputs that I replied earlier. But I really find it strange that Google uses robots.txt in a different way...
It was a good thought to follow the leader, and believe me, despite being in the internet field for so long, it had never occurred to me to check Google's robots.txt until now.
So thanks for that,
with regards,
Palani
http://www.buckleupnow.com
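
The Allow/Disallow pair the checker flagged in Google's file makes more sense with a small sketch. The Python below assumes the longest-match rule that Google documents for its own crawler: the most specific matching path wins, and Allow wins a tie. Other crawlers may evaluate the lines differently, and wildcards are left out. The function name is_allowed and the "start=10" parameter are made up for illustration.

```python
def is_allowed(path, rules):
    """Apply (directive, prefix) rules using longest match; Allow wins ties."""
    best_len, allowed = -1, True  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


# The three lines quoted from Google's robots.txt in the checker output.
rules = [
    ("allow", "/searchhistory/"),
    ("disallow", "/news?output=xhtml&"),
    ("allow", "/news?output=xhtml"),
]

for path in ("/news?output=xhtml", "/news?output=xhtml&start=10", "/searchhistory/"):
    print(path, "->", "allowed" if is_allowed(path, rules) else "blocked")
```

Under that assumption the bare /news?output=xhtml page stays crawlable, while the same URL with further parameters appended is blocked.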