| Google Discussion Forum Google Discussion forum is for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects. |
Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?
I would appreciate hearing your opinions.
__________________
"Being an expert isn't telling other people what you know. It's understanding what questions to ask, and flexibly applying your knowledge to the specific situation at hand. Being an expert means providing sensible, highly contextual direction." Jeff Atwood SEO Workers - Search Engine Optimization Consulting Company | SEO Analysis Tool | Webnauts Net SEO |
|
||||
|
Not myself. But they do: http://www.whitehouse.gov/robots.txt
|
||||
|
Quote:
So once again: Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?
|
||||
|
http://www.google.com/robots.txt -- PR 6.
IMO there won't be any advantage in having PageRank for a robots.txt file.
__________________
SEO Professional India |
|
|||
|
Strange, because how authoritative can a text document be?
Unless you put some text in there that is relevant. Hmmm, would be a neat experiment. |
|
|||
|
Very simple. Yahoo! Site Explorer results:
www.whitehouse.gov/robots.txt - 81k - Cached. EIGHTY-ONE KB!!!!! Inlinks (1,262). There are a crapload of people who are fascinated by this and link to it. There are also a bunch of media references to it (MSNBC, CNN, etc.). Strange, though, that it's not hidden or encrypted or something. |
|
||||
|
All the info you provide me above, I can retrieve without assistance.
So once again, my question was: Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?
|
||||
|
Quote:
Outside of SEO, most robots.txt pages are indexed, and you can search for them. Google hackers sometimes do this to look for hidden areas of web sites or to find vulnerable CMSs. Do a Google search for inurl:"robots.txt" and note the top few results.
Back in the early days of the web, and the early days of Google, a lot of educational resources and even company data files were uploaded as plain text, so the spiders learned to index them, even though they don't contain links, because they can often be the file a user is looking for. This is less common now, though.
Here is an interesting robots.txt file: http://www.webmasterworld.com/robots.txt All this file contains is commented-out blog entries, no actual robots.txt commands. Would this be considered spam?
__________________
The best way to learn anything, is to question everything. |
|
||||
|
Quote:
I think you must take a closer look at this robots.txt. Could it be that you have missed something? Or did I?
|
||||
|
You are correct. There are 2 lines of robot exclusion text, and 840 lines of spam.
On further investigation, this page displays different content to spiders than it does to web browsers: PHP Code:
Above code snippet is based on code located at http://www.webmasterworld.com/robots...ew=producecode, copyright 2007 WebMasterWorld and is shown in altered form under the fair usage justification of being reprinted for demonstrative purposes only, with editorial and unique components removed and only structure and common language conventions maintained.
__________________
The best way to learn anything, is to question everything. Last edited by wige; 08-15-2007 at 01:03 PM. |
|
||||
|
|
||||
|
Additionally, I searched Google for the term robots.txt.
WebmasterWorld came up at position 4 out of 2,420,000 pages. Cool way to target keywords. So please, everybody, don't give up on this discussion. I am sure there is some juice here.
|
||||
|
Examine their robots.txt carefully.
Check line 3: # This code found here: http://www.webmasterworld.com/robots...ew=producecode
|
||||
|
Quote:
Does that mean I can set up a text file called seoworkers.txt, put a link in my sitemap or on an HTML page, and Google will treat it like a robots.txt? Hmm... keep ideas coming.
|
||||
|
Or maybe I can make a 301 redirect from my robots.txt to my file seoworkers.txt?
But that will not serve the purpose of cloaking. Still, how can I convince Google to see my seoworkers.txt file as a robots.txt? Hmmm... what's next?
|
||||
|
Good idea. I will give it a try, even if that can take ages. I will still investigate the WebmasterWorld issue.
|
||||
|
This could be interesting. I don't think anyone's robots.txt is going to be indexed unless there are external links pointing to it. That said, what advantage, if any, could be gained by doing so? Could the contents of this file have any effect on the rankings of the rest of your site?
Using an .htaccess rewrite, you could quite easily present different content to different user-agents, targeting Googlebot, Yahoo!'s Slurp, etc. You could target specific keywords to shore up any deficiencies for a particular search engine, thus optimizing separately for each search engine's ranking algorithm. This could get quite unwieldy, but might be interesting to investigate. |
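The per-user-agent idea above can be sketched roughly as follows. This is only an illustration in Python, not the rewrite rule or PHP script discussed in the thread; the bot tokens, function names and both response bodies are hypothetical.

```python
# Hypothetical sketch: pick a response body for /robots.txt based on the
# User-Agent header. The tokens and bodies below are made up for illustration.
BOT_TOKENS = ("googlebot", "slurp", "msnbot")

RULES_FOR_BOTS = "User-agent: *\nDisallow: /private/\n"
PAGE_FOR_BROWSERS = "# Human-readable notes shown to browsers\n" + RULES_FOR_BOTS


def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BOT_TOKENS)


def robots_body_for(user_agent: str) -> str:
    """Crawlers get the bare rules; everything else gets the annotated page."""
    if is_known_bot(user_agent):
        return RULES_FOR_BOTS
    return PAGE_FOR_BROWSERS
```

A substring match on the User-Agent header is the simplest possible test, and any client can send any header, so a real setup would need something sturdier.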
|
||||
|
Quote:
To be processed as a robots file, I think the file has to be named robots.txt and must be in the root directory of the site. You would have to remap .txt files to be parsed as Perl or PHP or use mod_rewrite, pointing to a script similar to what I have above.
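As a small illustration of the "root directory" point, here is a sketch that derives the robots.txt location from any page URL. The function name and example URL are hypothetical; the only assumption is the convention that crawlers request /robots.txt at the root of the host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

Under this convention a file such as /folder/seoworkers.txt is never requested as a robots file, whatever it contains.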
|
||||
|
Webnauts & Incredible
I can help you both solve the issue. 1. Yes, the PR can have an effect by passing the PR to your site's sitemap file for the search engines. From sitemaps.org Quote:
Webnauts, I would not do a redirect of robots via the robots.txt file..... that would hurt more than help!!! |
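The sitemaps.org quote did not survive in this thread. Assuming it refers to the `Sitemap:` line that sitemaps.org documents for robots.txt, a minimal sketch of pulling those lines out of a robots.txt body could look like this. The function name and sample URL are hypothetical.

```python
def sitemap_urls(robots_txt: str) -> list:
    """Collect the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then split the line into "key: value".
        line = raw_line.split("#", 1)[0].strip()
        key, separator, value = line.partition(":")
        if separator and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```

Note that this only shows how a crawler could discover the sitemap location; it says nothing about whether PageRank flows through that line.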
|
||||
|
Quote:
I always thought PageRank could only pass through an actual link, not a plain-text URL. Is this not the case? Bear in mind, when a document is spidered, two different processes are applied. One takes all the URLs and adds them to the "to be crawled" database (discovery), and the other follows the links, distributes PageRank, and calculates keywords, which I always thought was only done on link tags. Otherwise all the forum spam issues would be pointless, since you could get PageRank by simply including a plain-text URL in a paragraph of text. That would do away with the entire point of nofollow and of every anti-spam defense I have seen on any social media site.
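The distinction between a URL inside a link tag and a URL that merely appears as text can be sketched like this. It is an illustration of the two categories only, not a description of how Google actually parses pages; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Rough pattern for URLs that appear as plain text.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


class LinkAndTextCollector(HTMLParser):
    """Separate href targets of <a> tags from URLs found in text nodes."""

    def __init__(self):
        super().__init__()
        self.href_links = []
        self.text_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.href_links.append(value)

    def handle_data(self, data):
        self.text_urls.extend(URL_PATTERN.findall(data))


def split_urls(html: str):
    """Return (href_links, text_urls) for an HTML fragment."""
    collector = LinkAndTextCollector()
    collector.feed(html)
    collector.close()
    return collector.href_links, collector.text_urls
```

A robots.txt file has no tags at all, so under this split everything in it would fall into the plain-text category.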
|
||||
|
Quote:
Quote:
The http:// or www part is what tells the search algorithm which links to crawl. They read them (parsing the links from the database, as you stated). I would think all the nofollow tag does is stop the links from being added to discovery. As for the anti-spam defenses... like anything in life... take things with a grain of salt. Heck, I think I have three anti-spyware programs running on my PC and they still miss things.... |
|
||||
|
Quote:
Quote:
Of course, this is not how links are shown in a robots.txt file anyway. Even if Google does crawl plain-text links, I doubt that Googlebot would detect and crawl /folder/somefile.html, especially since the typical robots.txt file contains the exact URLs of only the files you don't want to be crawled.
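To make the point concrete, here is a short sketch using Python's standard `urllib.robotparser` to show how the partial paths in a robots.txt file are used: as exclusion rules to test a URL against, not as links to visit. The rules and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body with partial (path-only) entries.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /folder/somefile.html
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules permit the given agent to fetch the URL."""
    return parser.can_fetch(agent, url)
```

A well-behaved crawler consults these rules before requesting a URL it already found somewhere else.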
__________________
The best way to learn anything, is to question everything. Last edited by wige; 08-17-2007 at 11:46 AM. |
|
||||
|
Quote:
Algorithms (search spiders) crawl over the server, which is why you need a file to stop them from indexing the folders. They fetch pages and happen to follow links. For further clarification, from Google's design and content guidelines: "Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link." (Webmaster Help Center - Webmaster Guidelines) Quote:
However, as I noted above, search engine spiders (algorithms) do not need to be told where to find things on the server..... they do need to be told.... NOT to look in certain folders or files, however. Does this make better sense now?? :-> Last edited by SemAdvance; 08-17-2007 at 12:18 PM. |
|
||||
|
Quote:
But this goes back to the original point of the thread - since under normal conditions there are no links or urls in the robots.txt file that you would want to be crawled, is there any benefit from the search engines for robots.txt to have a PR? I have seen none. Someone did mention that it could improve the crawl rate of the robots.txt file itself, but it seems that Google already puts these files on an accelerated crawl schedule as it is.
|
||||
|
Update on the link test - Neither link shows up in my error logs as crawled by Google. However both links were accessed by spam bots, and by a forum user.
|
||||
|
Quote:
It would be found on the server root. The server is crawled by the algorithm. The algorithm does not need a link to crawl (spider) the server. It will find anything and everything on the server if it is able to, unless it is instructed NOT to find everything. This is why it is called a spider.....
The spiders retrieve the documents and send them back to a database. They need no instructions from the host server to do this. They do need instructions about which folders and documents not to collect. The database is then analyzed and stored in the index, and the URLs are parsed so that the crawl robots can go and discover more servers and websites. So the spider collects all the documents it finds on the server; it does not need a link to do this.
As for the OP's question: if there were links found within the robots.txt file, it would seem the search engines would filter any passed PR to 0 as far as increasing SERP positions goes. I would agree it probably has little noticeable effect on crawl rate. |
|
||||
|
Quote:
The sitemap files do not have a strictly specified name. Sitemaps have a suggested name (sitemap.xml) but other extensions are also accepted (sitemap.html), and files named sitemap with these other extensions have been used for user-accessed site maps for years. In fact, I have seen sitemap.xml files that did not follow the spec and were created before the spec, for RSS use. The sitemap specification is very explicit that sitemap files do not need to be in the root folder. The highest sitemap file is considered authoritative, but additional sitemap files can be located in other sections of the site, for the reasons outlined in the specification itself.
The algorithm has only two ways to find content on a server. It can follow a link, or it can discover the content by requesting an index page from the server (requesting /, which today most servers respond to by sending the index.html page instead of a directory listing, for security reasons). If you have a page that has no links to it and is not set in the server as a root document (the document served in response to a folder root request, i.e. '/'), the spider can NOT find it. It is called a spider because it travels the web from link to link.
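The "link to link" description can be sketched as a breadth-first traversal over a link graph. The site below is an invented in-memory example, not a real crawl; it only shows that a page with no inbound links is never reached when the crawl starts from "/".

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/about.html", "/products.html"],
    "/about.html": ["/"],
    "/products.html": ["/products/widget.html"],
    "/products/widget.html": [],
    "/test/orphan.html": ["/"],  # exists on the server, but nothing links to it
}


def crawl(start: str, links: dict) -> set:
    """Return every page reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen
```

Calling `crawl("/", SITE_LINKS)` visits four pages and never sees /test/orphan.html, even though the orphan page itself links back to the home page.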
|
||||
|
Quote:
To understand better: there is an indexing bot and a crawler bot. The crawler bot is a URLserver that tells the indexing bot which URLs to follow. (It does not limit the pages it can index, but rather is a set of instructions for links to find, pulled from the last set of documents it retrieved.)
There is a URLserver that sends lists of URLs to be fetched to the crawlers. The indexer "downloads the pages to the index". The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link. Next is the URL Resolver, which reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
So, as mentioned above, the crawlers are connected to servers. Nowhere does it say that crawlers are limited to indexing the URLs they are sent... This is so that all pages found on the server can be collected.
You can learn more at the links below: Information Retrieval & Extraction; Information Technology Services: Google Search Appliance; The Anatomy of a Search Engine.
Peace! |
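The URL Resolver step described above (relative URLs to absolute URLs to docIDs, plus a links database of docID pairs) can be sketched in a few lines. This is a toy illustration of the idea only; the record layout, the sequential docIDs and the function name are assumptions, not the paper's actual data structures.

```python
from urllib.parse import urljoin


def resolve_links(anchors):
    """Turn (source_url, href, anchor_text) records into docID link pairs.

    Returns (doc_ids, link_pairs): doc_ids maps absolute URL -> integer ID,
    link_pairs is a list of (source_id, target_id) tuples.
    """
    doc_ids = {}

    def doc_id(url):
        # Assign IDs in order of first appearance (an arbitrary choice).
        if url not in doc_ids:
            doc_ids[url] = len(doc_ids)
        return doc_ids[url]

    link_pairs = []
    for source_url, href, _anchor_text in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute
        link_pairs.append((doc_id(source_url), doc_id(target_url)))
    return doc_ids, link_pairs
```

Every pair in this links database starts from an anchor that was parsed out of a fetched page, which is the part of the pipeline the thread is arguing about.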
|
||||
|
Exactly. Like I said, no search engine can "discover" pages on a server. Other than "/robots.txt" and "/", no document on a server can be found by a spider unless another document somewhere on the Internet links to it. The question becomes what does Google consider to be a link?
I had thought that the part of the indexing process that divided the PageRank among links only took into consideration those URLs within link tags, and ignored the URLs that were plain text, in the spirit of "a link is a vote". Is this inaccurate? Hate to keep harping on this, but this is the main point of the issue. The URLs in a robots.txt file are not contained in link tags, so the way Google handles pagerank for plain text URLs is the major factor. Additionally, the fact the URLs are partial makes it less likely they would be detected and operated on.
__________________
The best way to learn anything, is to question everything. Last edited by wige; 08-17-2007 at 05:22 PM. |
|
||||
|
Quote:
That's why the world has M & Ms, plain and with peanuts....we do not need to agree on the same things. If they could not find the documents on their own..... then there would be no need for a robots.txt file to block them from finding those documents!!!! Please read that several times. Last edited by SemAdvance; 08-17-2007 at 05:15 PM. |
|
||||
|
Quote:
Quote:
"So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it." "Googlebot follows HREF links and SRC links." The robots.txt file exists to tell search engines not to crawl the web pages that other documents link to: Quote:
A great example - most sites have forms, but no form result page is listed in SERPs because although the URL is contained in the page, it is not a link and the spiders will not follow it.
|
||||
|
Pages will not be found without a link. Period. Those pages are orphaned and will never be indexed. I have several pages that I use for various testing that reside on the server. Been there for years. Never been indexed or crawled. Will never be without a link.
URLs that are not linked are treated as text and not as a link to follow later. Algorithms crawl nothing. Algorithms index nothing. John... If a robots.txt file has toolbar PR, either a link would have had to be found passing PR to it, lessening the amount of PR being passed by the other links on the same page, OR once retrieved, PR was assigned to it based upon the root of the site, which does happen from time to time but is corrected when refigured. Dave Last edited by crankydave; 08-21-2007 at 12:40 PM. Reason: Additional thought |
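The remark about a link to robots.txt lessening what the other links on the same page pass can be illustrated with a textbook-style PageRank iteration. This is the simplified published formula with a damping factor, run on an invented three-page graph; it is a sketch, not Google's production calculation, and the page names are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to. A page with no
    outgoing links spreads its rank evenly over all pages.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    count = len(pages)
    rank = {page: 1.0 / count for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / count for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each outgoing link gets an equal share of this page's rank.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / count
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank
```

Comparing `{"home": ["robots.txt"], "about": ["home"]}` with `{"home": ["robots.txt", "about"], "about": ["home"]}` shows the effect: once the home page carries a second link, the share reaching robots.txt drops.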
|
||||
|
Quote:
A real robot comes and crawls the servers???? An algorithm most certainly is what crawls the server, mate. You may have confused a PageRanking algorithm with the search ranking algorithm, but there is more than one algo at work within most search engines. It is a script, just like any virtual bot is. Don't know what you think happens.... I would like to know, though...
Lastly, how does any link report on your site find broken links? If you have a page on the server at root or below and it's not linked to, it's found as a broken link. ;-> Doesn't happen by magic.
And again, if the script could not find unlinked pages... there would be no need for a robots.txt to stop the script from finding folders and pages. That would be like posting a sign "No Turn On Red" at an intersection without a stop light.... Explain to me why you would need this robots.txt file... considering the bot could not find the folder or document UNLESS it was linked to?? In your thinking... the robot would need a file to tell it which folders and documents to find if no link was built. |
|
||||
|
Quote:
The Anatomy of a Search Engine, 4.3 Crawling the Web: "Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system."
I pulled three keyword terms from the paragraph to form one informational and educational phrase: Crawling, Web Servers, Name Server. Not documents, boys and girls.... SERVERS.
Further, it states..... "It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls." I'll pull some more terms: Crawler connects to MORE THAN 500,000 SERVERS.
I have worked with algorithms and large-scale computers for over 20 years. Back before Google or any search engine thought to use algorithms, we were using them in private corporations to detect and deter internal and external theft. Algorithms save millions of dollars for banks, insurance companies, retailers, pharmaceutical, aeronautics, military and space agencies, and yes, a few little search engines. Done! Do I need a special badge or something?? |
|
||||
|
SemAdvance...
A bot or spider crawls and/or fetches. The set of instructions given to it can be referred to as an algorithm. Instructions do not crawl anything. Instructions do not index anything. I hope this is clear. Quote:
A page that cannot be found doesn't exist in the eyes of an SE. How in the world can there be a broken link if there is no link? Bots/spiders only go where they are told to go. They cannot magically find pages on a server if they don't know where they are. Quote:
There's a whole list of reasons why a robots.txt file is important. Search for them. Quote:
Dave |
|
|||||||
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Also, this spider, just like a search engine spider, can be set to either obey or ignore robots.txt. This DOES NOT mean that the spider magically finds files in this folder, it means that if a file has links but the site author does not want that page to be indexed, the spider will not attempt to crawl those URLs. Quote:
A broken link is a link to nothing, not a page that has no links to itself. Link reports find broken links by checking every link and listing the links that return error messages. I have never seen a report that says "The following documents do not have links to them". If you have such a report, or a generator that can find such documents, please post it.
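The difference between a broken-link report and a list of unlinked documents can be sketched as two small functions over an invented site. The names and data are hypothetical. The key assumption is that the orphan check needs the complete list of files on the server, which a site owner has and an outside spider does not.

```python
def broken_links(site_links, existing_pages):
    """Return (source, target) pairs whose target page does not exist."""
    return sorted(
        (source, target)
        for source, targets in site_links.items()
        for target in targets
        if target not in existing_pages
    )


def orphan_pages(site_links, existing_pages):
    """Return existing pages that no page links to.

    This requires existing_pages, i.e. a full file listing from the server.
    """
    linked = {target for targets in site_links.values() for target in targets}
    return sorted(existing_pages - linked)
```

The first report is built by checking links that were found; the second can only be built from a file listing, which matches the point that no link report says "the following documents do not have links to them" unless it has server-side access.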
__________________
The best way to learn anything, is to question everything. Last edited by wige; 08-21-2007 at 06:30 PM. |
|
||||
|
Any document on the Internet can have PageRank, be it a PDF file, an image, or a Flash document. If enough links point to a file, it will get some PageRank,
at least until Google excludes the "robots.txt" file from its PageRank algorithm.
__________________
ARFY.NET, SEO outsourcing to Pakistan SEO Pakistan, SEO Guru Pakistan, Khurram Ali Linkedin. |
|
||||
|
Can you show me a PDF or SWF file that has PageRank? I have never seen that before.
Thanks.
|
||||
|
http://www.surgeongeneral.gov/tobacco/smconsumr.pdf is a PDF file regarding quitting smoking. It has a PR of 5.
I should add, the Google toolbar does not seem to show the pagerank for PDF files. I just used a search likely to return high PR PDF files, and entered the results into a page rank analysis tool to determine the page rank of the file.
__________________
The best way to learn anything, is to question everything. Last edited by wige; 08-27-2007 at 11:43 AM. |
|
||||
|
Yeah, I have seen hundreds of PDF files with toolbar PR. Another one here:
http://www.firstamres.com/pdf/MPR_White_Paper_FINAL.pdf |
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |