Hi Everyone!
It occurred to me recently: couldn't a really detailed robots.txt file act just like a sitemap for the search engines? Of course I realize it is still important to have a sitemap page in many respects, but couldn't a robots.txt file work in much the same way?

Here is the basic format of a robots file:

    User-agent: *
    Allow: /
    Disallow: /cgi-bin/

What if you had a detailed list of "Allowed" files or directories in the list? By definition, wouldn't that simply do the trick?

Just a thought. Cheers!
|
|||
|
While what you state seems logical, it is not historically correct. Read what Wikipedia explains:

http://en.wikipedia.org/wiki/Robots.txt

The robots exclusion standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention to prevent cooperating web spiders and other web robots from accessing all or part of a website which is, otherwise, publicly viewable. Robots are often used by search engines to categorize and archive web sites, or by webmasters to proofread source code.

A robots.txt file on a website will function as a request that specified robots ignore specified files or directories in their search. This might be, for example, out of a preference for privacy from search engine results, or the belief that the content of the selected directories might be misleading or irrelevant to the categorization of the site as a whole, or out of a desire that an application only operate on certain data.

The protocol, however, is purely advisory. It relies on the cooperation of the web robot, so that marking an area of a site out of bounds with robots.txt does not guarantee privacy. Some web site administrators have tried to use the robots file to make private parts of a website invisible to the rest of the world, but the file is necessarily publicly available and its content is easily checked by anyone with a web browser.

There is no official standards body or RFC for the robots.txt protocol. It was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). The information specifying the parts that should not be accessed is specified in a file called robots.txt in the top-level directory of the website.

The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended, otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

I hope this helps you understand the robots.txt file.
Sincerely, zepop
|
|||
|
Isn't the thing to do to add a line to the robots.txt file showing the sitemap address to search engines? For example:
Sitemap: http://www.example.com/sitemap.xml |
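For anyone who wants to check what a crawler could pick up from such a line, here is a minimal Python sketch using the standard library's urllib.robotparser (its site_maps() method needs Python 3.8 or later). The robots.txt text and the sitemap_urls helper are made up for illustration; only the Sitemap URL comes from the example in the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the Sitemap URL is the example from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

Sitemap: http://www.example.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the URLs listed on Sitemap: lines, or an empty list if there are none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file has no Sitemap: line.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in sitemap_urls(ROBOTS_TXT):
        print(url)
```

The Sitemap line stands outside the User-agent groups, so it is reported regardless of which crawler is asking.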
|
|||
|
Indeed I understand that the "Allow" is not part of the current accepted protocol, but Google accepts its use. What other reason might they choose to accept this?
http://www.google.com/support/webmas...y?answer=40364
If you are explicitly saying to Google "feel free to index these folders or directories", it acts much like a sitemap for search engines.

Side note: I personally care more about Google (which drives more than 80% of our traffic) than the others, and I do realize the importance of a true sitemap for users, as well as the sitemap.xml integration Google has developed. BUT a robots file seems like a quick, easy and, sure, slightly dirty way of accomplishing the same thing :-)
|
|||
|
The problem here is that the search engines don't look at it the way you do, as a file that explicitly invites bots to crawl. The search engines prefer a sitemap because that was an old agreement.
The search engines and web designers may agree to your proposal some time in the future. The search engines also accept robots allow and do-not-follow directives in meta tag format, but only a small percentage of web sites use this format.
Zepop B-) Webmaster http://www.smolka.biz and http://smolka.com
|
|||
|
My opinion, let the sitemap do its thing and use the robots.txt file for its intended purpose.
________________________________________________ Dan Naz Integrated Marketing Services Apparel Graphic Design Puerto Rican Graphic Tees |
|
||||
|
The Allow line won't do anything, as the default for any file is allow... robots.txt is just for excluding files you do NOT want the robots to follow.
__________________
Ron Boyd website consulting (design, optimization, marketing) :: Follow Me: @orionsweb |
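The "default is allow" point can be checked with a small Python sketch built on the standard library's urllib.robotparser. Everything here is illustrative: the two robots.txt variants, the sample paths, the ExampleBot name and the helper functions are assumptions, not taken from any real site. Disallow is listed before Allow on purpose, because Python's parser checks rules in file order.

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical files: one with only a Disallow rule, one that also adds "Allow: /".
ONLY_DISALLOW = "User-agent: *\nDisallow: /cgi-bin/\n"
WITH_ALLOW = "User-agent: *\nDisallow: /cgi-bin/\nAllow: /\n"

# Made-up sample paths.
PATHS = ["/index.html", "/products/list.html", "/cgi-bin/form.pl"]


def allowed(robots_txt, path, agent="ExampleBot"):
    """Ask the parser whether the (hypothetical) agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


def verdicts(robots_txt):
    """Map every sample path to True (may crawl) or False (excluded)."""
    return {path: allowed(robots_txt, path) for path in PATHS}


if __name__ == "__main__":
    print(verdicts(ONLY_DISALLOW))
    print(verdicts(WITH_ALLOW))
```

Both files give the same verdicts, so a blanket Allow adds nothing here. Where Allow can change the outcome, for crawlers that honor it, is as an exception inside a directory that is otherwise disallowed.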
|
||||
|
Shore, keep them separate, but (as stated above) you can now also use robots.txt as an inclusion file so that the bots can find your sitemap.
Announcement: Big 3 Search Engines Team Up On Sitemaps |
|
|||
|
You can also create a search-engine-only sitemap based on the protocol at http://www.sitemaps.org/, upload it to your site and then add the following line to your robots.txt:

    Sitemap: http://www.yourdomain.com/sitemapfilename

Google and Yahoo will then spider the sitemap. I'm not sure about MSN and Ask, but I think they recognize this as well.
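As a rough sketch of what such a search-engine-only sitemap file could look like, the Python below builds a minimal one with the standard library's xml.etree.ElementTree (Python 3.8 or later for the xml_declaration argument). The build_sitemap helper and the page URLs are hypothetical; the namespace is the one published at sitemaps.org, and only the required loc element is written.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical pages on the placeholder domain used in the post.
PAGES = [
    "http://www.yourdomain.com/",
    "http://www.yourdomain.com/products.html",
]


def build_sitemap(page_urls):
    """Return a minimal sitemap document as UTF-8 bytes, one <url><loc> per page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_sitemap(PAGES).decode("utf-8"))
```

The resulting bytes would be saved under whatever file name the Sitemap line in robots.txt points to.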
|
|||
|
The syntax you have used needs a closer look.
In fact, a few people have already pointed out the error: Allow cannot appear immediately after the User-agent line. User-agent has to be followed by Disallow.

Since you referred to Google, I assume you mean the following passage:

"If you block Googlebot and want to allow another of Google's bots (such as Googlebot-Mobile), you can allow access to that bot using the Allow rule. For instance:

    User-agent: Googlebot
    Disallow: /

    User-agent: Googlebot-Mobile
    Allow: "

This indicates that the site should not be crawled by Googlebot but can be crawled by the Googlebot-Mobile robot.

It would also be advisable to use http://www.domainname.com/sitemap.xml. Please note: sitemap.xml, not sitemap.html or sitemap.txt. This then has to be indicated in robots.txt. Maybe I can share the code tomorrow...

In fact, when you do this, your site should soon be crawled by MSN and Ask too. (They have agreed, but the system is not in place.) As of now, auto-discovery happens with Google and Yahoo.

Hope this helps...
Thanks, with regards, itispals http://www.buckleupnow.com
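To make the per-crawler grouping in the quoted passage concrete, here is a small hand-rolled Python sketch. It is not Google's implementation and not a complete parser: parse_groups and may_crawl are hypothetical helpers, the "pick the most specific User-agent group, then the longest matching path" logic is a simplifying assumption, and the Googlebot-Mobile group is assumed to read "Allow: /" because the quoted snippet shows Allow with no value.

```python
# Assumed robots.txt; "Allow: /" is a guess at the value missing from the quote.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Googlebot-Mobile
Allow: /
"""


def parse_groups(robots_txt):
    """Return {lowercased user-agent token: [(directive, path), ...]}."""
    groups = {}
    agents = []             # tokens of the group currently being read
    in_agent_lines = False  # True while consecutive User-agent lines continue
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not in_agent_lines:
                agents = []  # a new group starts here
                in_agent_lines = True
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_agent_lines = False
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def may_crawl(groups, crawler, path):
    """Apply the most specific matching group; fall back to '*', else allow."""
    name = crawler.lower()
    matching = [token for token in groups if token != "*" and name.startswith(token)]
    rules = groups[max(matching, key=len)] if matching else groups.get("*", [])
    verdict, longest = "allow", 0  # nothing matched means the path is allowed
    for directive, rule_path in rules:
        # An empty rule path matches nothing; ties in length are not special-cased.
        if rule_path and path.startswith(rule_path) and len(rule_path) > longest:
            verdict, longest = directive, len(rule_path)
    return verdict == "allow"


if __name__ == "__main__":
    groups = parse_groups(ROBOTS_TXT)
    for bot in ("Googlebot", "Googlebot-Mobile"):
        print(bot, may_crawl(groups, bot, "/page.html"))
```

Under these assumptions Googlebot is kept out of the whole site while Googlebot-Mobile is let in, which matches the reading given in the post. Note that here the Allow line sits directly under a User-agent line.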
|
|||
|
It's interesting that you write that "Allow:" cannot follow the User-Agent.
Take a look at Google's own robots.txt file: http://www.google.com/robots.txt What are your thoughts? |
|
||||
|
Quote:
Interesting interpretation of their "do no evil" mantra.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
|||
|
Quote:
While I was merely trying to fully understand what they were doing, I wasn't REALLY looking at what they were doing. It is always good to understand what the standards of a process are, but personally, I will go right to the "source", if you will, and see what they are doing. So, I just followed their lead. An interesting twist, methinks . . .
|
|||
|
Hello Shorecon,
Thanks for this input. I am glad I have learnt something new today. I use one of the syntax checkers for robots.txt, and this is the message I got for Google's robots.txt:

*******************
WARNING: The tool has found some directory paths that don't include a trailing slash character. Since a missing trailing slash can be both a deliberate decision or an error, and since this tool can't ipotize the real intentions of the webmaster, here follow some clarifications that could prevent a potential problem:

The following command will disable just the directory "private" and all its contents:

    Disallow: /private/

...while the following command will disable both the "private" directory and any file or directory path starting with the text "/private" (so "/private-eye.html", "/privateroom/page.html", etc.):

    Disallow: /private

Please be sure to use the correct syntax, according to your needs.
*******************
The following block of code contains some errors. Please, remove all the reported errors and check again this robots.txt file.

    Line 1  User-agent: *
    Line 2  Allow: /searchhistory/
            Unknown command. Acceptable commands are "User-agent" and "Disallow". A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
    Line 3  Disallow: /news?output=xhtml&
    Line 4  Allow: /news?output=xhtml
            Unknown command. Acceptable commands are "User-agent" and "Disallow". A robots.txt file doesn't say what files/directories you can allow but just what you can disallow. Please refer to Robots Exclusion Standard page for more informations.
*******************

So it was based on these inputs that I replied earlier. But I really find it strange that Google handles robots.txt in a different way...

It was a good thought to follow the leader, and believe me, despite being in the internet field for so long, it never occurred to me to check Google's robots.txt until now. So thanks for that.
With regards, Palani http://www.buckleupnow.com
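The trailing-slash warning from the checker can be reproduced with a short Python sketch using the standard library's urllib.robotparser, which matches rule paths by prefix. The blocked_paths helper and the ExampleBot name are made up; the sample paths are the ones named in the tool's message, plus one file inside the directory.

```python
from urllib.robotparser import RobotFileParser

# Paths from the checker's warning, plus one page inside the "private" directory.
PATHS = ["/private/page.html", "/private-eye.html", "/privateroom/page.html"]


def blocked_paths(disallow_value, agent="ExampleBot"):
    """Return the sample paths that a single Disallow rule keeps the agent out of."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return [path for path in PATHS if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    print("Disallow: /private/ blocks", blocked_paths("/private/"))
    print("Disallow: /private  blocks", blocked_paths("/private"))
```

With the trailing slash only the page inside the directory is blocked; without it, every path that merely starts with "/private" is blocked as well.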
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |