WebProWorld
Other Engines/Directories Got a comment about directories or some other engine? This is the place. There is a subforum dedicated to directories.

  #1 (permalink)  
Old 05-07-2007, 11:24 AM
WebProWorld New Member
 
Join Date: Aug 2003
Location: New Jersey
Posts: 7
ShoreCon RepRank 0
Default Robots.txt and Sitemaps

Hi Everyone!

It occurred to me recently: couldn't a really detailed robots.txt file act just like a sitemap for the search engines?

Of course I realize it is still important to have a sitemap page in many respects, but wouldn't/couldn't a robots.txt file work in much the same way?

Here is the basic format of a robots file:

User-agent: *
Allow: /
Disallow: /cgi-bin/

What if you had a detailed list of "Allowed" files or directories in the list? By definition, wouldn't that simply do the trick . . .
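A minimal sketch of the idea, using Python's standard urllib.robotparser (the example.com URLs are made up for illustration). It suggests the limitation: a robots.txt can answer "may this URL be fetched?" for a URL the crawler already knows about, but it does not list which pages exist. Python's parser stops at the first rule that matches, so the Disallow line is placed before the catch-all Allow here; other crawlers may resolve the order differently.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in the post, with Disallow listed first because this
# parser stops at the first rule that matches.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs: the file can judge them, but it never said they exist.
blocked = parser.can_fetch("*", "http://www.example.com/cgi-bin/search")
allowed = parser.can_fetch("*", "http://www.example.com/products/")
unknown = parser.can_fetch("*", "http://www.example.com/no-such-page.html")

print(blocked, allowed, unknown)
```

Note that the page that does not exist is "allowed" just like the real one, which is why an Allow list on its own tells a crawler nothing about what to go and fetch.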

Just a thought.

Cheers!
  #2 (permalink)  
Old 05-07-2007, 04:32 PM
WebProWorld Pro
 
Join Date: Jan 2004
Location: The best hiking and fishing - Idaho
Posts: 119
fctoma RepRank 1
Default

I don't think anything should, or does, replace a normal site map. One reason is for the visitors. A normal site map is broken into various categories (maybe by product, area of coverage) with the respective links below. Quite a few larger sites have their 404 pointing to their sitemap.

Plus, another reason is obviously for onsite links to all your pages.

Best of luck!

Frank in Idaho
__________________
Living the life in the Idaho Falls and playing Disc Golf
Idaho Falls SEO
  #3 (permalink)  
Old 05-07-2007, 05:04 PM
WebProWorld New Member
 
Join Date: Oct 2003
Posts: 11
SummitPK RepRank 0
Default

Probably.

I too am not at my PC. I'm off climbing in beautiful Ouray, Colorado, then soaking in the hot springs with a beer.
  #4 (permalink)  
Old 05-07-2007, 05:07 PM
WebProWorld New Member
 
Join Date: May 2006
Location: Bronx, NY
Posts: 7
zepop RepRank 0
Default

While what you state seems logical, it is not historically correct. Read what Wikipedia explains:

http://en.wikipedia.org/wiki/Robots.txt

From Wikipedia, the free encyclopedia

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code. A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website. The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended, otherwise all files with names starting with that substring will match, rather than just those in the directory intended.
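The trailing-slash caveat is easy to see with plain prefix matching. The sketch below assumes the simple "starts with" comparison described above; the function name and the sample paths are made up for illustration.

```python
def is_disallowed(path: str, rule: str) -> bool:
    # Original-style matching: a rule blocks every path that begins with it.
    return path.startswith(rule)


PATHS = [
    "/private/",
    "/private/notes.html",
    "/private-eye.html",
    "/privateroom/page.html",
]

# Without the trailing slash the rule also catches look-alike names.
without_slash = [p for p in PATHS if is_disallowed(p, "/private")]
# With the trailing slash only the directory and its contents match.
with_slash = [p for p in PATHS if is_disallowed(p, "/private/")]

print(without_slash)
print(with_slash)
```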

I hope this helps you understand the robots.txt file.

Sincerely,
zepop
  #5 (permalink)  
Old 05-07-2007, 05:24 PM
WebProWorld Member
 
Join Date: Nov 2003
Location: uk
Posts: 51
martindow RepRank 0
Default

Isn't the thing to do to add a line to the robots.txt file showing the sitemap address to search engines? For example:
Sitemap: http://www.example.com/sitemap.xml
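As a rough illustration of how a crawler might pick that line up, Python's standard urllib.robotparser exposes Sitemap entries through site_maps() (available from Python 3.8). The Disallow rule in the sample file is an assumption added only to make the file look realistic.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None if no Sitemap line was found.
sitemap_urls = parser.site_maps()
print(sitemap_urls)
```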
__________________
Martin
www.spectrumwellbeing.co.uk
  #6 (permalink)  
Old 05-07-2007, 06:18 PM
WebProWorld Veteran
 
Join Date: Jan 2004
Location: California
Posts: 335
craigmn3 RepRank 1
Default Belts and Suspenders

Be a belt-and-suspenders person in this case; no use having your pants come down because you didn't have one or the other.
  #7 (permalink)  
Old 05-07-2007, 06:24 PM
WebProWorld New Member
 
Join Date: Aug 2003
Location: New Jersey
Posts: 7
ShoreCon RepRank 0
Default

Indeed I understand that the "Allow" is not part of the current accepted protocol, but Google accepts its use. What other reason might they choose to accept this?

http://www.google.com/support/webmas...y?answer=40364

If you are explicitly saying to Google "feel free to index these folders or directories" it acts much like a sitemap for search engines.

Side Note: I personally do care more about Google (who drives more than 80% of our traffic) than the others, and I do realize the importance of a true sitemap for the user, and of the sitemap.xml integration Google has developed. BUT a robots file seems like a quick, easy and, sure, a little bit dirty way of accomplishing the same thing :-)
  #8 (permalink)  
Old 05-07-2007, 08:53 PM
WebProWorld New Member
 
Join Date: May 2006
Location: Bronx, NY
Posts: 7
zepop RepRank 0
Default

The problem here is that the search engines don't look at it the way you do, as a file that explicitly invites bots to crawl. The search engines prefer a site map because that was an old agreement.
The search engines and web designers may agree to your proposal some time in the future.
The search engines also accept robot allow and do-not-follow instructions in meta tag format, but only a small percentage of web sites use this format.

Zepop B-)
Webmaster http://www.smolka.biz and http://smolka.com
  #9 (permalink)  
Old 05-07-2007, 09:54 PM
WebProWorld New Member
 
Join Date: Sep 2006
Location: New York
Posts: 16
nazcreative RepRank 0
Default

My opinion: let the sitemap do its thing and use the robots.txt file for its intended purpose.


________________________________________________

Dan Naz
Integrated Marketing Services
Apparel Graphic Design
Puerto Rican Graphic Tees
  #10 (permalink)  
Old 05-07-2007, 10:28 PM
WebProWorld Veteran
WebProWorld MVP
 
Join Date: Sep 2003
Location: Halton Hills, ON
Posts: 702
Orion RepRank 4
Default

The Allow line won't do anything, as the default for any file is allow... robots.txt is just for excluding files you do NOT want the robots to follow.
  #11 (permalink)  
Old 05-08-2007, 12:47 AM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
incrediblehelp RepRank 4
Default

Shore, keep them separate, but (as stated above) you can now use robots.txt as an inclusion file that lets the bots find your sitemap.

Announcement: Big 3 Search Engines Team Up On Sitemaps
  #12 (permalink)  
Old 05-08-2007, 09:27 AM
WebProWorld New Member
 
Join Date: May 2006
Location: Oklahoma
Posts: 21
dtalbot RepRank 0
Default robot.txt and site maps

You can also create a search-engine-only sitemap based on the protocol at http://www.sitemaps.org/, upload it to your site, and then add the following line to your robots.txt:

Sitemap: http://www.yourdomain.com/sitemapfilename

Google and Yahoo will then spider the sitemap. I'm not sure about MSN and Ask, but I think they recognize this as well.
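For anyone who wants to generate such a file, here is a minimal sketch using Python's standard xml.etree.ElementTree. It assumes only the required elements of the sitemaps.org format (urlset, url, loc); the function name and the page list are made up, and the domain is the placeholder from the line above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemaps.org <urlset> document as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(urlset, encoding="unicode")


# Placeholder pages on the placeholder domain.
pages = [
    "http://www.yourdomain.com/",
    "http://www.yourdomain.com/products.html",
]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

When saving to disk you would normally want a UTF-8 file with an XML declaration at the top; the string version here just keeps the sketch short.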
__________________
Daphne Talbot
http://www.TalbotServices.com
Website marketing & design
  #13 (permalink)  
Old 05-08-2007, 02:32 PM
WebProWorld New Member
 
Join Date: Apr 2007
Location: India
Posts: 18
itispals RepRank 0
Default Syntax Check

The syntax you have used needs a closer look.
In fact, a few people have already pointed out the error.
Allow cannot appear immediately after the User-agent line.
The User-agent line has to be followed by Disallow:
Since you mentioned Google, I assume you were referring to the following code:

"If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:

User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: "
This indicates that the site should not be crawled by Googlebot but can be crawled by Googlebot-Mobile.
It would also be advisable to use http://www.domainname.com/sitemap.xml
Please note: sitemap.xml, and not sitemap.html or sitemap.txt
This then has to be indicated in the robots.txt.
Maybe I can share the code tomorrow...
In fact, when you do this, your site should soon be crawled by MSN and Ask too. (They have agreed, but the system is not yet in place.) As of now, auto-discovery happens with Google and Yahoo.
Hope this helps...
Thanks,
with regards,
itispals
http://www.buckleupnow.com
  #14 (permalink)  
Old 05-08-2007, 02:43 PM
WebProWorld New Member
 
Join Date: Aug 2003
Location: New Jersey
Posts: 7
ShoreCon RepRank 0
Default

It's interesting that you write that "Allow:" cannot follow the User-Agent.

Take a look at Google's own robots.txt file:

http://www.google.com/robots.txt

What are your thoughts?
  #15 (permalink)  
Old 05-08-2007, 06:54 PM
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: May 2004
Location: Philadelphia, PA
Posts: 3,217
deepsand RepRank 9
Default

Quote:
Originally Posted by ShoreCon
It's interesting that you write that "Allow:" cannot follow the User-Agent.

Take a look at Google's own robots.txt file:

http://www.google.com/robots.txt

What are your thoughts?
Quite informative. It looks like Google follows the "what's yours is mine; what's mine is mine alone" rule!

Interesting interpretation of their "do no evil" mantra.
  #16 (permalink)  
Old 05-08-2007, 08:22 PM
WebProWorld New Member
 
Join Date: Aug 2003
Location: New Jersey
Posts: 7
ShoreCon RepRank 0
Default

Quote:
"what's yours is mine; what's mine is mine alone"
I like that :-) So true!

While I was merely trying to fully understand what they were doing, I wasn't REALLY looking at what they were doing.

It is always good to understand what the standards of a process are, but personally, I will go right to the "source" if you will and see what they are doing. So, I just followed their lead. An interesting twist, me thinks . . .
  #17 (permalink)  
Old 05-09-2007, 06:58 AM
WebProWorld New Member
 
Join Date: Apr 2007
Location: India
Posts: 18
itispals RepRank 0
Default Google Robots.txt

Hello ShoreCon,
Thanks for this input.
I am glad I have learnt something new today.
I used one of the syntax checkers for robots.txt, and this is the message I got for Google's robots.txt:
*******************
WARNING: The tool has found some directory paths that don't include a trailing slash character.

Since a missing trailing slash can be both a deliberate decision or an error, and since this tool can't ipotize the real intentions of the webmaster, here follow some clarifications that could prevent a potential problem:

The following command will disable just the directory "private" and all its contents:
Disallow: /private/

...while the following command will disable both the "private" directory and any file or directory path starting with the text "/private" (so "/private-eye.html", "/privateroom/page.html", etc.):
Disallow: /private

Please be sure to use the correct syntax, according to your needs.
*********************************************
The following block of code contains some errors. Please, remove all the reported errors and check again this robots.txt file.
Line 1 User-agent: *
Line 2 Allow: /searchhistory/
Unknown command. Acceptable commands are "User-agent" and "Disallow".
A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
Line 3 Disallow: /news?output=xhtml&
Line 4 Allow: /news?output=xhtml
Unknown command. Acceptable commands are "User-agent" and "Disallow".
A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
***************************************************
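The warnings make sense if you assume the checker only knows the two directives of the original standard. A rough sketch of that behaviour follows; it is a guess at the logic, not the actual tool's code, and the function and set names are made up.

```python
ORIGINAL_1994 = {"user-agent", "disallow"}
WITH_EXTENSIONS = ORIGINAL_1994 | {"allow", "sitemap"}


def unknown_directives(lines, known):
    """Return (line number, directive) pairs whose directive is not in `known`."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in known:
            problems.append((number, directive))
    return problems


# The four lines the checker reported on above.
SAMPLE = [
    "User-agent: *",
    "Allow: /searchhistory/",
    "Disallow: /news?output=xhtml&",
    "Allow: /news?output=xhtml",
]

strict = unknown_directives(SAMPLE, ORIGINAL_1994)
lenient = unknown_directives(SAMPLE, WITH_EXTENSIONS)

print(strict)
print(lenient)
```

A strict checker flags lines 2 and 4, exactly as in the report, while one that accepts the Allow and Sitemap extensions reports nothing.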
It was based on this output that I replied earlier. But I really find it strange that Google uses robots.txt in a different way...
It was a good thought to follow the leader, and believe me, despite being in the internet field for so long, it had never occurred to me to check Google's robots.txt until today.
So thanks for that,
with regards,
Palani
http://www.buckleupnow.com
All times are GMT -4.