WebProWorld
Google Discussion Forum
The Google Discussion forum is for topics specifically related to Google. There is a subforum dedicated to AdSense/AdWords subjects.

#1 Webnauts, 08-11-2007, 07:32 AM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: Aug 2003
Location: Worldwide
Posts: 8,167
RepRank 9
PageRank (PR) for Robots.txt?

Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?

I would appreciate hearing your opinions.
__________________
"Being an expert isn't telling other people what you know. It's understanding what questions to ask, and flexibly applying your knowledge to the specific situation at hand. Being an expert means providing sensible, highly contextual direction." Jeff Atwood
SEO Workers - Search Engine Optimization Consulting Company | SEO Analysis Tool | Webnauts Net SEO

#2 incrediblehelp, 08-13-2007, 04:43 PM
WebProWorld 1,000+ Club
WebProWorld MVP
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
RepRank 4

So you have a robots.txt file that has page rank?

#3 Webnauts, 08-13-2007, 05:02 PM

Quote:
Originally Posted by incrediblehelp
So you have a robots.txt file that has page rank?
Not myself. But they do: http://www.whitehouse.gov/robots.txt

#4 incrediblehelp, 08-13-2007, 05:12 PM

This is only because many forums/blogs have linked to it in discussion posts.

#5 Webnauts, 08-13-2007, 06:33 PM

Quote:
Originally Posted by incrediblehelp
This is only because many forums/blogs have linked to it in discussion posts.
That was not my question, Jaan. That much is clear to me.

So once again:

Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?

#6 incrediblehelp, 08-13-2007, 06:58 PM

Quote:
Originally Posted by Webnauts
Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?
No, not to my knowledge.

#7 amar, 08-14-2007, 04:49 AM
WebProWorld Pro
Join Date: Aug 2005
Location: India
Posts: 295
RepRank 0

http://www.google.com/robots.txt -- PR 6.

IMO there won't be any advantage to having PageRank for a robots.txt file.
__________________
SEO Professional India

#8 wige, 08-14-2007, 11:38 AM
Moderator
WebProWorld Moderator
Join Date: Jun 2006
Location: United States
Posts: 2,648
RepRank 9

If for some reason people were searching for it, I suppose it could, just as it would for any other plain text document.
__________________
The best way to learn anything, is to question everything.

#9 rumblepup, 08-14-2007, 12:14 PM
WebProWorld MVP
Join Date: Jul 2003
Location: Miami, Florida
Posts: 312
RepRank 2

Strange, because how authoritative can a text document be?

Unless you put some text in there that is relevant. Hmmm, would be a neat experiment.
__________________
Beautiful patio umbrellas, living in pembroke pines, I'm rumblepup.

#10 dann, 08-14-2007, 04:39 PM
WebProWorld Pro
Join Date: Jul 2006
Posts: 122
RepRank 0

Very simple, Yahoo! Site Explorer results

www.whitehouse.gov/robots.txt- 81k - Cached EIGHTY-ONE KB!!!!!

Inlinks (1,262)

There are a crapload of people who are fascinated by this and link to it. There are also a bunch of media references to it (MSNBC, CNN, etc.).
Strange, though, that it's not hidden or encrypted or something.

#11 Webnauts, 08-14-2007, 05:43 PM

All the info you provided above I can retrieve without assistance.

So once again, my question was:

Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?

#12 wige, 08-14-2007, 06:42 PM

Quote:
Originally Posted by Webnauts
Can there be any effect somewhere/somehow, if a robots.txt file of a web site has a certain PageRank?
If you are referring to an effect on the rest of your site, no. The spiders look at the file purely as text instructions to guide their behavior. The file can't pass PageRank anywhere because it does not contain crawlable links. However, in theory, if the PageRank were to become high enough, the robots.txt file itself might show up in a search, for a file path for example, or for search terms that resemble a file path.
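
As a minimal sketch of what "purely as text instructions" means, Python's standard urllib.robotparser reads a robots.txt as allow/disallow rules and nothing more. The file contents, the ExampleBot agent name, and the example.com URLs below are made up for illustration, and real crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The parser only answers "may this agent fetch this URL?".
# It has no notion of links to follow or rank to pass on.
allowed_home = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "http://www.example.com/private/a.html")
print(allowed_home, allowed_private)
```

Nothing in the parsed result resembles a hyperlink, which fits the view that the file guides crawling rather than linking to anything.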

Outside of SEO, most robots.txt pages are indexed, and you can search for them. Google hackers sometimes do this to search for hidden areas of web sites or to find vulnerable CMSs. Do a Google search for inurl:"robots.txt" and note the top few results.

Quote:
Originally Posted by rumblepup
Strange, because how authoritative can a text document be?
Back in the early days of the web, and the early days of Google, a lot of educational resources and even company data files were uploaded as plain text, so the spiders learned to index them, even though they don't contain links, because they can often be the file a user is looking for. This is less common now, though.

Here is an interesting robots.txt file: http://www.webmasterworld.com/robots.txt
All this file contains is commented out blog entries, no actual robots.txt commands. Would this be considered spam?

#13 Webnauts, 08-14-2007, 07:51 PM

Quote:
Originally Posted by wige
Here is an interesting robots.txt file: http://www.webmasterworld.com/robots.txt
All this file contains is commented out blog entries, no actual robots.txt commands. Would this be considered spam?
No actual robots.txt commands?
I think you should take a closer look at this robots.txt.

Could it be that you have missed something?

Or did I?

#14 wige, 08-15-2007, 12:44 PM

You are correct. There are 2 lines of robot exclusion text, and 840 lines of spam.

On further investigation, this page displays different content to spiders than it does to web browsers:
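Perl Code:
# Serve the real robots.txt to known crawlers and comment lines to everyone else.
$agent = $ENV{'HTTP_USER_AGENT'};
if ($agent =~ /msnbot/i || $agent =~ /googlebot/i) {
    # Recognized spider: print the actual exclusion rules.
    open(FILE, "<realrobotstext");
    print <FILE>;
    close(FILE);
} else {
    # Any other visitor: print only "#" comment lines.
    print qq|#
    #blah blah...
    |;
}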
Would this not be a direct violation of the Terms of Service of Google and the other search engines?

Above code snippet is based on code located at http://www.webmasterworld.com/robots...ew=producecode, copyright 2007 WebMasterWorld and is shown in altered form under the fair usage justification of being reprinted for demonstrative purposes only, with editorial and unique components removed and only structure and common language conventions maintained.
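
One rough way to check for this kind of user-agent switching is to request the same URL under different agent strings and compare the responses. The sketch below is in Python rather than the language of the snippet above, and it assumes a caller-supplied fetch(url, user_agent) function. fake_fetch is a hypothetical stand-in that imitates the behavior described in this post; it is not WebmasterWorld's actual server logic.

```python
from typing import Callable, Dict, List


def bodies_by_agent(fetch: Callable[[str, str], str], url: str,
                    agents: List[str]) -> Dict[str, str]:
    """Fetch the same URL once per user-agent string and collect the bodies."""
    return {agent: fetch(url, agent) for agent in agents}


def is_cloaked(bodies: Dict[str, str]) -> bool:
    """True when at least two user agents received different content."""
    return len(set(bodies.values())) > 1


def fake_fetch(url: str, user_agent: str) -> str:
    """Hypothetical fetcher: crawlers get rules, everyone else gets comments."""
    if "googlebot" in user_agent.lower() or "msnbot" in user_agent.lower():
        return "User-agent: *\nDisallow: /private/\n"
    return "#\n#blah blah...\n"


AGENTS = ["Googlebot/2.1", "Mozilla/5.0 (ordinary browser)"]
bodies = bodies_by_agent(fake_fetch, "http://www.example.com/robots.txt", AGENTS)
print(is_cloaked(bodies))
```

In practice fetch would wrap a real HTTP client. Some servers also check the requesting IP address, so a matching result from this kind of test would not prove that no switching takes place.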

Last edited by wige; 08-15-2007 at 01:03 PM.

#15 Webnauts, 08-15-2007, 11:08 PM

Quote:
Originally Posted by wige
You are correct. There are 2 lines of robot exclusion text, and 840 lines of spam.
I would not necessarily call that spam, as they are comments. I am not sure how Google will evaluate those comments. The way it does HTML comments? Or will it treat the file as plain text? If it treats it as a text file, I would seriously consider that a spamindexing technique.

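Quote:
Originally Posted by wige
On further investigation, this page displays different content to spiders than it does to web browsers:
Perl Code:
# Serve the real robots.txt to known crawlers and comment lines to everyone else.
$agent = $ENV{'HTTP_USER_AGENT'};
if ($agent =~ /msnbot/i || $agent =~ /googlebot/i) {
    # Recognized spider: print the actual exclusion rules.
    open(FILE, "<realrobotstext");
    print <FILE>;
    close(FILE);
} else {
    # Any other visitor: print only "#" comment lines.
    print qq|#
    #blah blah...
    |;
}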

Would this not be a direct violation of the Terms of Service of Google and the other search engines?
I honestly saw that too. I guess we must check whether they disallow pages for users but not for the search engines. Then we can begin suspecting that they are cloaking with their robots.txt. I have already seen something like that on some SEO sites from India.

Quote:
Originally Posted by wige
Above code snippet is based on code located at www.webmasterworld.com/robots.txt?view=producecode, copyright 2007 WebMasterWorld and is shown in altered form under the fair usage justification of being reprinted for demonstrative purposes only, with editorial and unique components removed and only structure and common language conventions maintained.
Looks like Google follows links in the robots.txt. Or did I miss something?

#16 incrediblehelp, 08-15-2007, 11:13 PM

Quote:
Originally Posted by Webnauts
Looks like Google follows links in the robots.txt. Or did I miss something?
Are you sure? How do you know?

#17 Webnauts, 08-15-2007, 11:16 PM

Additionally I made a search for the term robots.txt in Google.
There I found WebmasterWorld at position 4 out of 2,420,000 pages.

Cool way to target keywords.

So please everybody, don't give up this discussion. I am sure there is some juice here.

#18 Webnauts, 08-15-2007, 11:20 PM

Quote:
Originally Posted by incrediblehelp
Are you sure, how do you know?
Examine their robots.txt carefully.

Check line 3:
# This code found here: http://www.webmasterworld.com/robots...ew=producecode

#19 incrediblehelp, 08-15-2007, 11:30 PM

Yes John but that doesn't mean we know that Google crawled that page link from there. It could be there just for reference.

#20 Webnauts, 08-15-2007, 11:43 PM

Quote:
Originally Posted by incrediblehelp
Yes John but that doesn't mean we know that Google crawled that page link from there. It could be there just for reference.
Hey, this is getting hot. Can you tell where Google looks for the robots.txt?

Does that mean I can set up a text file called seoworkers.txt, put a link in my sitemap or on an HTML page, and Google will treat it like a robots.txt?

Hmm... keep ideas coming.

#21 Webnauts, 08-15-2007, 11:47 PM

Or maybe I can make a 301 redirect of my robots.txt to my file seoworkers.txt?
But that will not serve the purpose of cloaking. Still, how can I convince Google to see my seoworkers.txt file as a robots.txt?

Hmmm... what's next?

#22 incrediblehelp, 08-16-2007, 12:05 AM

I would place a link in a robots.txt file that is available nowhere else online and see if that link gets indexed.

#23 Webnauts, 08-16-2007, 12:11 AM

Quote:
Originally Posted by incrediblehelp
I would place a link in a robots.txt file that is available nowhere else online and see if that link gets indexed.
Good idea. I will give it a try, even if it takes ages. I will still investigate the WebmasterWorld issue.

#24 Narasinha, 08-16-2007, 12:59 AM
WebProWorld Pro
Join Date: Aug 2003
Location: Urbana, Illinois, US
Posts: 232
RepRank 1

This could be interesting. I don't think anyone's robots.txt is going to be indexed unless there are external links pointing to it. This being said, what advantage, if any, could be gained by doing so? Could the contents of this file have any effect on the rankings of the rest of your site?

Using an htaccess rewrite, you could quite easily present different content to different user-agents, targeting Googlebot, Yahoo!'s slurp, etc. You could target specific keywords to shore up any deficiencies for a particular search engine, thus optimizing separately for each search engine's ranking algorithm. This could get quite unwieldy, but might be interesting to investigate.

#25 wige, 08-16-2007, 11:28 AM

Quote:
Originally Posted by Narasinha
This could be interesting. I don't think anyone's robots.txt is going to be indexed unless there are external links pointing to it. This being said, what advantage, if any, could be gained by doing so? Could the contents of this file have any effect on the rankings of the rest of your site?
I don't think having your robots.txt file indexed or gaining PageRank gives any benefit to the rest of your site, because there are no links or even full URLs in a proper robot exclusion file. But natural links can and do come into existence for these files, simply by posting to a forum like this one and asking a question or for help. There are dozens, if not hundreds, of posts on this site alone about robots issues with links to the files for Google to crawl and index.
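
To illustrate the point, here is a toy version of the textbook PageRank iteration. It is not Google's production algorithm. The four-page link graph, the 0.85 damping factor, and the choice to let pages without outlinks simply pass nothing on are all assumptions made for this sketch.

```python
from typing import Dict, List


def toy_pagerank(links: Dict[str, List[str]], damping: float = 0.85,
                 iterations: int = 50) -> Dict[str, float]:
    """Simplified PageRank: PR(p) = (1 - d)/N + d * sum(PR(q) / outdegree(q))."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if not targets:
                continue  # no outgoing links, so nothing is handed on
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: a forum post links to a site's robots.txt,
# and the robots.txt itself links to nothing.
LINKS = {
    "home": ["forum_post"],
    "forum_post": ["home", "robots.txt"],
    "robots.txt": [],
    "unlinked_page": [],
}

ranks = toy_pagerank(LINKS)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:14s} {rank:.4f}")
```

In this toy model robots.txt collects rank from the forum post that links to it, but because it has no outgoing links, no other page's score depends on it.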
Quote:
Originally Posted by Webnauts
Or maybe I can make a 301 redirect of my robots.txt to my file seoworkers.txt?
But that will not serve the purpose for cloaking. But still how can I convince Google to see my seoworkers.txt file as a robots.txt?
To be processed as a robots file, I think the file has to be named robots.txt and must be in the root directory of the site. You would have to remap .txt files to be parsed as Perl or PHP or use mod_rewrite, pointing to a script similar to what I have above.
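
As a small sketch of that rule, a crawler can work out where to look from nothing but the scheme and host of any page URL. The robots_url helper and the example.com addresses are hypothetical, and the fixed /robots.txt path is the convention described above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the conventional robots.txt location for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/blog/2007/post.html?id=3"))
print(robots_url("http://www.example.com/seoworkers.txt"))
```

Both calls give the same location, which is why a file named seoworkers.txt would not be picked up as an exclusion file just because something links to it.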

#26 SemAdvance, 08-17-2007, 09:32 AM
WebProWorld Veteran
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
RepRank 3

Webnauts & Incredible

I can help you both solve the issue.

1. Yes, the PR can have an effect, by passing the PR to your site's sitemap file for the search engines.

From sitemaps.org

Quote:
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line:

Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml

This directive is independent of the user-agent line, so it doesn't matter where you place it in your file. If you have a Sitemap index file, you can include the location of just that file. You don't need to list each individual Sitemap listed in the index file.
Submitting your Sitemap via an HTTP request
2. Since the search engines can follow the link in your robots.txt file you can link to anywhere else you see fit.

Webnauts

I would not do a redirect of robots via the robots.txt file.....that would hurt more than help!!!
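
The Sitemap line quoted from sitemaps.org above can be picked out of a robots.txt without looking at the user-agent groups at all. The following is only a sketch: sitemap_urls is a hypothetical helper, the sample file is invented, and it shows discovery of the sitemap URL only. It says nothing about whether any rank is passed.

```python
from typing import List

# Hypothetical robots.txt using the example URL from the sitemaps.org quote.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

# Sitemap location, independent of the user-agent group above
Sitemap: http://www.example.com/sitemap.xml
"""


def sitemap_urls(robots_text: str) -> List[str]:
    """Collect the values of all 'Sitemap:' lines, ignoring comments and case."""
    urls = []
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")      # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


print(sitemap_urls(ROBOTS_TXT))
```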

#27 wige, 08-17-2007, 10:33 AM

Quote:
Originally Posted by SemAdvance
1. Yes the PR can have effect by passing the PR to your sites sitemap file for the search engines.
So, what advantage would you get by giving your sitemap file Page Rank? Sitemap files are normalized, so that internal pagerank is filtered out.
Quote:
Originally Posted by SemAdvance
2. Since the search engines can follow the link in your robots.txt file you can link to anywhere else you see fit.
I always thought PageRank could only pass through an actual link, not a plain text URL. Is this not the case? Bear in mind, when a document is spidered, two different processes are applied. One takes all the URLs and adds them to the "to be crawled" database (discovery), and the other follows the links, distributes PageRank, and calculates keywords, which I always thought was only done on link tags. Otherwise all the forum spam issues are pointless if you could get PageRank by simply including a plain text URL in a paragraph of text. That does away with the entire point of nofollow, and every anti-spam defense I have seen on any social media site.
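
The distinction between link tags and plain text URLs can be sketched with Python's standard html.parser. The sample HTML, the LinkAndTextCollector class, and the URL pattern are made up for illustration. The sketch shows only how a parser can tell the two apart, not how any search engine actually treats them.

```python
import re
from html.parser import HTMLParser

# Rough pattern for URLs that appear as ordinary text.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

SAMPLE_HTML = (
    '<p>See <a href="http://www.example.com/amalink.html">a real link</a> and '
    '<a rel="nofollow" href="http://www.example.com/nf.html">a nofollow link</a>, '
    "plus plain text http://www.example.com/notalink.html in a sentence.</p>"
)


class LinkAndTextCollector(HTMLParser):
    """Separate URLs found in <a href> tags from URLs found in page text."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # (href, is_nofollow) pairs from link tags
        self.text_urls = []  # URLs that only appear as text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            href = attr_map.get("href")
            if href:
                rel_tokens = (attr_map.get("rel") or "").lower().split()
                self.anchors.append((href, "nofollow" in rel_tokens))

    def handle_data(self, data):
        # Text between tags; anchor text that is itself a URL would land here too.
        self.text_urls.extend(URL_PATTERN.findall(data))


collector = LinkAndTextCollector()
collector.feed(SAMPLE_HTML)
print(collector.anchors)
print(collector.text_urls)
```

A crawler could feed both lists into discovery while using only the first for link-based scoring, which is the two-process split described above.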

#28 SemAdvance, 08-17-2007, 11:12 AM

Quote:
Originally Posted by wige
So, what advantage would you get by giving your sitemap file Page Rank? Sitemap files are normalized, so that internal pagerank is filtered out.
It would seem it would be crawled and indexed a bit quicker. I hadn't thought about this a great deal, but there is some reason why sitemaps.org recommends the URL in the robots.txt other than helping them find it... that can be done easily via Webmaster Central.

Quote:
Originally Posted by wige
I always thought pagerank could only pass through an actual link, not a plain text url. Is this not the case? Bear in mind, when a document is spidered, two different processes are applied. One takes all the urls and adds them to the "to be crawled" database (discovery) and the other follows the links and distributes pagerank and calculates keywords, which I always thought was only done on link tags. Otherwise all the forum spam issues are pointless if you could get pagerank by simply including a plain text URL in a paragraph of text. That does away with the entire point of nofollow, and every anti-spam defense I have seen on any social media site.
Spiders cannot click links, so all links are text URLs within the database.

The http:// or www parts are what tell the search algorithm which links to crawl.
They read (parse) the links from the database, as you stated.

I would think all the nofollow tag does is stop the links from being added to discovery.

As for the anti spam defenses... like anything in life... take things with a grain of salt.

Heck I think I have three anti spyware programs running on my PC and they still miss things....

#29 wige, 08-17-2007, 11:40 AM

Quote:
Originally Posted by SemAdvance
It would seem it would be crawled and indexed a bit quicker. I hadn't thought of this a great deal but there is some reason why sitemaps.org recommends the URL in the robots.txt other than helping them find it...that can be done easily via Webmaster Central.
I think this recommendation was so that people without accounts with every search engine could still get the sitemap added to the search engines. I know I don't have an account with MSN, Yahoo or Ask, but they can find (and do crawl) the siteindex file that is listed in my robots.txt file.
Quote:
Originally Posted by SemAdvance
Spiders cannot click links. So therefore all links are text urls within the database

The http:// or www part are what tell the search algorithm which links to crawl.
They read (parse the links from the database as you stated).

I would think all the nofollow tag does is stop the links from being added to discovery.

As for the anti spam defenses... like anything in life... take things with a grain of salt.

Heck I think I have three anti spyware programs running on my PC and they still miss things....
Google crawls plaintext links? Test: http://www.ticketwarehouse.com/notalink.html http://www.ticketwarehouse.com/amalink.html

Of course, this is not how links are shown in a robots.txt file anyway. Even if Google does crawl plain text links, I doubt that Googlebot would detect and crawl /folder/somefile.html, especially since the typical robots.txt file contains the exact URLs of only the files you don't want to be crawled.
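
To make the /folder/somefile.html point concrete: the values in Disallow lines are site-relative paths, so anything that wanted to treat them as links would first have to resolve them against the site root. A rough sketch, with a hypothetical disallowed_urls helper and an invented sample file:

```python
from typing import List
from urllib.parse import urljoin

# Hypothetical robots.txt with the kind of path mentioned above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder/somefile.html
Disallow: /cgi-bin/
"""


def disallowed_urls(site_root: str, robots_text: str) -> List[str]:
    """Resolve each non-empty Disallow path into an absolute URL."""
    urls = []
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        path = value.strip()
        if sep and key.strip().lower() == "disallow" and path:
            urls.append(urljoin(site_root, path))
    return urls


print(disallowed_urls("http://www.example.com/", ROBOTS_TXT))
```

The resolved addresses are exactly the ones the site owner asked crawlers to stay away from, so crawling them would defeat the purpose of the file.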

Last edited by wige; 08-17-2007 at 11:46 AM.

#30 SemAdvance, 08-17-2007, 12:14 PM

Quote:
Originally Posted by wige
I think this recommendation was so that people without accounts with every search engine could still get the sitemap added to the search engines. I know I don't have an account with MSN, Yahoo or Ask, but they can find (and do crawl) the siteindex file that is listed in my robots.txt file.

Google crawls plaintext links? Test: http://www.ticketwarehouse.com/notalink.html http://www.ticketwarehouse.com/amalink.html

Of course, this is not how links are shown in a robots.txt file anyway. I doubt even if Google does crawl plain text links, that Googlebot would detect and crawl /folder/somefile.html. Especially since the typical robots.txt file contains the exact urls of only the files you don't want to be crawled.
Actually, most robots.txt files only list the folders you do not want indexed.

Search spiders (algorithms) crawl over the server, which is why you need a file to stop them from indexing certain folders. They fetch pages and happen to follow links.

For further clarification from Google

Design and content guidelines

* Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link

Webmaster Help Center - Webmaster Guidelines

Quote:
Originally Posted by wige
I think this recommendation was so that people without accounts with every search engine could still get the sitemap added to the search engines. I know I don't have an account with MSN, Yahoo or Ask, but they can find (and do crawl) the siteindex file that is listed in my robots.txt file.
I agree you are right.

However as I noted above search engine spiders (algorithms) do not need to be told where to find things on the server.....

they do need to be told.... NOT to look in certain folders or files however.

Does this make better sense now??

:->

Last edited by SemAdvance; 08-17-2007 at 12:18 PM.
Reply With Quote
  #31 (permalink)  
Old 08-17-2007, 02:44 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by SemAdvance View Post
However as I noted above search engine spiders (algorithms) do not need to be told where to find things on the server.....

they do need to be told.... NOT to look in certain folders or files however.
Except for the sitemap page. Which does not need pagerank and should have no links to itself, but there is no universal name for this file. In other words, unlike robots.txt, there is no uniform location for the file so search engines can look for it, and it should have no links so search engines can't discover it.
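
As a rough illustration of how a sitemap with no fixed name can still be discovered through robots.txt, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com host and the file name siteindex.xml are all made up for the example, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; host and file names are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/maps/siteindex.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Sitemap line is read independently of any User-agent group,
# so the file can live anywhere and have any name.
sitemaps = parser.site_maps()
print(sitemaps)
```

The crawler only needs the one well-known location (/robots.txt) to learn where the sitemap lives, so no page has to link to the sitemap itself.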

But this goes back to the original point of the thread - since under normal conditions there are no links or urls in the robots.txt file that you would want to be crawled, is there any benefit from the search engines for robots.txt to have a PR? I have seen none.

Someone did mention that it could improve the crawl rate of the robots.txt file itself, but it seems that Google already puts these files on an accelerated crawl schedule as it is.
__________________
The best way to learn anything, is to question everything.
Reply With Quote
  #32 (permalink)  
Old 08-17-2007, 02:54 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

Update on the link test - Neither link shows up in my error logs as crawled by Google. However both links were accessed by spam bots, and by a forum user.
__________________
The best way to learn anything, is to question everything.
Reply With Quote
  #33 (permalink)  
Old 08-17-2007, 03:24 PM
SemAdvance's Avatar
WebProWorld Veteran
 
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by wige View Post
Except for the sitemap page. Which does not need pagerank and should have no links to itself, but there is no universal name for this file. In other words, unlike robots.txt, there is no uniform location for the file so search engines can look for it, and it should have no links so search engines can't discover it.

But this goes back to the original point of the thread - since under normal conditions there are no links or urls in the robots.txt file that you would want to be crawled, is there any benefit from the search engines for robots.txt to have a PR? I have seen none.

Someone did mention that it could improve the crawl rate of the robots.txt file itself, but it seems that Google already puts these files on an accelerated crawl schedule as it is.
The sitemap file does have a set number of possible names.

It would be found on the server root.

The server is crawled by the algorithm.

The algorithm does not need a link to crawl (spider) the server.

It will find anything and everything on the server if it is able or unless otherwise instructed NOT to find everything.

This is why it is called a spider.....

The spiders retrieve the documents and send them back to a database. It needs no instructions from the host server to do this.

It does need instructions about which folders/documents not to collect.

The database is then analyzed and stored in the index, and the URLs are parsed so that the crawl robots can go and discover more servers and websites.

So the spider collects all the documents it finds on the server; it does not need a link to do this.

As for the OP's question: if links were found within the robots.txt file, it would seem the search engines would filter any passed PR to 0 as far as increasing SERP positions goes.

I would agree it probably has little noticeable effect on crawl rate.
Reply With Quote
  #34 (permalink)  
Old 08-17-2007, 03:43 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by SemAdvance View Post
The sitemap file does have a set number of possible names.
It would be found on the server root.
The algorithm does not need a link to crawl (spider) the server.
This is why it is called a spider.....
A few things...

The sitemap files do not have a strictly specified name. Sitemaps have a suggested name (sitemap.xml) but other extensions are also accepted (sitemap.html) and files named sitemap with these other extensions have been used for user-accessed site maps for years. In fact I have seen sitemap.xml files that did not follow the spec, and were created before the spec for RSS use.

It is very explicit in the sitemap specification that sitemap files do not need to be in the root folder. The highest sitemap file is considered authoritative, but additional sitemap files can be located in other sections of the site, for the reasons outlined in the specification itself.

The algorithm has only two ways to find content on a server. It can follow a link, or it can discover the content by requesting an index page from the server (requesting /, which today most servers respond to by sending the index.html page instead of a directory listing for security reasons). If you have a page that has no links, and is not set in the server as a root document (the document served in response to a folder root request i.e. '/') the spider can NOT find it.

It is called a spider because it travels the web from link to link.
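
The link-following behaviour described here can be sketched with a toy, in-memory "site". This is only an illustration under stated assumptions: the SITE dictionary and its paths are hypothetical, and a real crawler would fetch pages over HTTP rather than read a dict.

```python
from collections import deque

# Hypothetical site: each path maps to the paths it links to.
SITE = {
    "/": ["/about.html", "/products/"],
    "/about.html": ["/"],
    "/products/": ["/products/widget.html"],
    "/products/widget.html": [],
    "/test/orphan.html": ["/"],  # exists on the server, but nothing links to it
}

def crawl(start="/"):
    """Breadth-first crawl that only visits pages reachable by links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in SITE.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

found = crawl()
print(sorted(found))
print("/test/orphan.html" in found)
```

Starting from "/", the crawl reaches every linked page but never sees /test/orphan.html, because the only way it learns a new path is by reading it out of a page it has already fetched.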
__________________
The best way to learn anything, is to question everything.
Reply With Quote
  #35 (permalink)  
Old 08-17-2007, 04:52 PM
SemAdvance's Avatar
WebProWorld Veteran
 
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by wige View Post
A few things...

The sitemap files do not have a strictly specified name. Sitemaps have a suggested name (sitemap.xml) but other extensions are also accepted (sitemap.html) and files named sitemap with these other extensions have been used for user-accessed site maps for years. In fact I have seen sitemap.xml files that did not follow the spec, and were created before the spec for RSS use.

It is very explicit in the sitemap specification that sitemap files do not need to be in the root folder. The highest sitemap file is considered authoritative, but additional sitemap files can be located in other sections of the site, for the reasons outlined in the specification itself.

The algorithm has only two ways to find content on a server. It can follow a link, or it can discover the content by requesting an index page from the server (requesting /, which today most servers respond to by sending the index.html page instead of a directory listing for security reasons). If you have a page that has no links, and is not set in the server as a root document (the document served in response to a folder root request i.e. '/') the spider can NOT find it.

It is called a spider because it travels the web from link to link.
Actually URLs are discovered after the pages are indexed.

To understand better: there is an indexing bot and a crawler bot.

The crawler bot is a URLserver that tells the indexing bot which URLs to follow. (It does not limit the pages it can index, but rather is a set of instructions for links to find, pulled from the last set of documents it retrieved.)

There is a URLserver that sends lists of URLs to be fetched to the crawlers.

The indexer "downloads the pages to the index"

The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function.

It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

Next is the URL Resolver which reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs.

It puts the anchor text into the forward index, associated with the docID that the anchor points to.
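
A minimal sketch of the URL-resolver step just described, using Python's standard urljoin. The anchors list, the example.com URLs and the integer docID scheme are assumptions made for illustration, not the actual data structures from the paper.

```python
from urllib.parse import urljoin

# Hypothetical anchors data: (page the link was found on, href as written, anchor text).
anchors = [
    ("http://www.example.com/folder/index.html", "somefile.html", "some file"),
    ("http://www.example.com/folder/index.html", "/about.html", "about us"),
    ("http://www.example.com/about.html", "folder/somefile.html", "details"),
]

doc_ids = {}

def doc_id(url):
    """Assign a small integer ID the first time a URL is seen."""
    return doc_ids.setdefault(url, len(doc_ids))

links = []        # (source docID, target docID) pairs
anchor_text = {}  # target docID -> anchor texts pointing at it

for source_url, href, text in anchors:
    target_url = urljoin(source_url, href)  # relative URL -> absolute URL
    src, dst = doc_id(source_url), doc_id(target_url)
    links.append((src, dst))
    anchor_text.setdefault(dst, []).append(text)

print(links)
print(anchor_text)
```

Note that every target URL here comes from an anchor found on an already-fetched page; the resolver has nothing to resolve for a document that no page points to.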

It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
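
To make the last step concrete, here is a small sketch that computes ranks from a list of docID pairs using the simplified formula PR(A) = (1 - d) + d * sum(PR(T) / C(T)), where C(T) is the number of outbound links on page T. The link pairs, the damping value of 0.85 and the fixed iteration count are assumptions for the example; it ignores pages with no outbound links and is not meant to reflect what any search engine runs in production.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over (source, target) pairs."""
    nodes = {n for pair in links for n in pair}
    out_count = {n: 0 for n in nodes}
    for src, _ in links:
        out_count[src] += 1

    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new_rank = {n: 1.0 - damping for n in nodes}
        for src, dst in links:
            # Each page splits its rank evenly among its outbound links.
            new_rank[dst] += damping * rank[src] / out_count[src]
        rank = new_rank
    return rank

# Hypothetical links database: pairs of docIDs.
links = [(0, 1), (0, 2), (2, 1), (1, 0)]
ranks = pagerank(links)
print(ranks)
```

Because a page's rank is divided among its outbound links, a document only ends up with rank if some pair in the links database points at it.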

So as mentioned above the crawlers are connected to servers.

Nowhere does it say that crawlers are limited to indexing the URLs they are sent...

This is so that all pages it finds on the server can be collected.

You can learn more at the links below

Information Retrieval & Extraction

Information Technology Services:Google Search Appliance

The Anatomy of a Search Engine

Peace!
Reply With Quote
  #36 (permalink)  
Old 08-17-2007, 05:07 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

Exactly. Like I said, no search engine can "discover" pages on a server. Other than "/robots.txt" and "/", no document on a server can be found by a spider unless another document somewhere on the Internet links to it. The question becomes what does Google consider to be a link?

I had thought that the part of the indexing process that divided the PageRank among links only took into consideration those URLs within link tags, and ignored the URLs that were plain text, in the spirit of "a link is a vote". Is this inaccurate? Hate to keep harping on this, but this is the main point of the issue. The URLs in a robots.txt file are not contained in link tags, so the way Google handles pagerank for plain text URLs is the major factor. Additionally, the fact the URLs are partial makes it less likely they would be detected and operated on.
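
The difference between a URL inside a link tag and a URL in plain text can be shown with a tag-based extractor. This is only a sketch of how such an extractor behaves, built on Python's standard html.parser; the page content and example.com URLs are hypothetical, and it makes no claim about how Googlebot itself treats plain-text URLs.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect URLs found in href and src attributes only."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)

# Hypothetical page with one real link, one plain-text URL and one image.
PAGE = """
<p>Real link: <a href="http://www.example.com/amalink.html">a link</a></p>
<p>Plain text: http://www.example.com/notalink.html</p>
<img src="/images/logo.gif">
"""

extractor = LinkExtractor()
extractor.feed(PAGE)
print(extractor.urls)
```

An extractor written this way never sees the plain-text URL, and a robots.txt file, being plain text with partial paths, contains no tags for it to pick up at all.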
__________________
The best way to learn anything, is to question everything.

Last edited by wige; 08-17-2007 at 05:22 PM.
Reply With Quote
  #37 (permalink)  
Old 08-17-2007, 05:12 PM
SemAdvance's Avatar
WebProWorld Veteran
 
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by wige View Post
Exactly. Like I said, no search engine can "discover" pages on a server. Other than "/robots.txt" and "/", no document on a server can be found by a spider unless another document somewhere on the Internet links to it.
I see you spent three seconds reading the information presented that proves you are mistaken... but you can believe what you want.

That's why the world has M & Ms, plain and with peanuts....we do not need to agree on the same things.

If they could not find the documents on their own..... then there would be no need for a robots.txt file to block them from finding those documents!!!!

Please read that several times.

Last edited by SemAdvance; 08-17-2007 at 05:15 PM.
Reply With Quote
  #38 (permalink)  
Old 08-17-2007, 06:59 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by SemAdvance View Post
I see you spent three seconds reading the information presented that proves you are mistaken... but you can believe what you want.

That's why the world has M & Ms, plain and with peanuts....we do not need to agree on the same things.

If they could not find the documents on their own..... then there would be no need for a robots.txt file to block them from finding those documents!!!!

Please read that several times.
I certainly could have missed something in the documents you linked. Is there a specific item you can quote from any of those pages that states that a search engine is capable of discovering a page that has no links? Google indicates that every page should have a link:
Quote:
Originally Posted by http://www.google.com/support/webmasters/bin/answer.py?answer=35769
Make a site with a clear hierarchy and text links. Every page should be reachable from at least one static text link.
Further, Google states on http://scholar.google.com/webmasters/bot.html:
"So, if there's a link to your "secret" web server or page on the web anywhere, it's likely that Googlebot and other web crawlers will find it."
"Googlebot follows HREF links and SRC links."

The robots.txt file exists to tell search engines not to crawl the web pages that other documents link to:
Quote:
Originally Posted by http://www.robotstxt.org/wc/norobots.html
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page. In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
In other words, spiders follow explicit links to documents, and robots.txt tells spiders which of those documents should not be accessed.
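
A short sketch of that division of labour, using Python's standard urllib.robotparser: links supply the candidate URLs, and robots.txt only filters them. The rules, the user-agent name ExampleBot and the URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt blocking scripts and temporary files.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# URLs the spider already discovered by following links.
discovered = [
    "http://www.example.com/index.html",
    "http://www.example.com/cgi-bin/vote.cgi",
    "http://www.example.com/tmp/draft.html",
]

# robots.txt removes candidates; it never adds any.
allowed = [url for url in discovered if rules.can_fetch("ExampleBot", url)]
print(allowed)
```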

A great example - most sites have forms, but no form result page is listed in SERPs because although the URL is contained in the page, it is not a link and the spiders will not follow it.
__________________
The best way to learn anything, is to question everything.
Reply With Quote
  #39 (permalink)  
Old 08-21-2007, 12:32 PM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,243
Default Re: PageRank (PR) for Robots.txt?

Pages will not be found without a link. Period. Those pages are orphaned and will never be indexed. I have several pages that I use for various testing that reside on the server. Been there for years. Never been indexed or crawled. Will never be without a link.

URLs that are not linked are treated as text, not as links to follow later.

Algorithms crawl nothing. Algorithms index nothing.

John... If a robots.txt file has toolbar PR, then either a link was found passing PR to it, lessening the amount of PR passed by the other links on the same page, OR once it was retrieved, PR was assigned to it based upon the root of the site, which does happen from time to time but is corrected when PR is recalculated.

Dave

Last edited by crankydave; 08-21-2007 at 12:40 PM. Reason: Additional thought
Reply With Quote
  #40 (permalink)  
Old 08-21-2007, 01:06 PM
SemAdvance's Avatar
WebProWorld Veteran
 
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by crankydave View Post

URLs that are not linked are treated as text, not as links to follow later.

Algorithms crawl nothing. Algorithms index nothing.


Dave
Ummm what then do you think happens??

A real robot comes and crawls the servers????

An algorithm most certainly is what crawls the server mate.

You may have confused a PageRank algorithm with the search ranking algorithm, but there is more than one algo at work within most search engines.

It is a script just like any virtual bot is.

Don't know what you think happens....I would like to know though...

Lastly how does any link report on your site find broken links?

If you have a page on the server at root or below and it's not linked to, it's found as a broken link.

;->

Doesn't happen by magic

And again, if the script could not find unlinked pages... there would be no need for a robots.txt to stop the script from finding folders and pages.

That would be like posting a sign "No Turn On Red" at an intersection without a stop light....

Explain to me why you would need this robots.txt file...considering the bot could not find the folder or document UNLESS it was linked to??

In your thinking... the robot would need a file to tell it which folder & documents to find if no link was built.
Reply With Quote
  #41 (permalink)  
Old 08-21-2007, 01:13 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
Default Re: PageRank (PR) for Robots.txt?

SEM I guess I am still confused on why/how you think a server gets crawled when not one person in the world links to it. Can you explain further?
Reply With Quote
  #42 (permalink)  
Old 08-21-2007, 01:36 PM
SemAdvance's Avatar
WebProWorld Veteran
 
Join Date: Dec 2005
Location: In Your Mind
Posts: 788
Default Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by incrediblehelp View Post
SEM I guess I am still confused on why/how you think a server gets crawled when not one person in the world links to it. Can you explain further?
Actually, I read what the guys who built the search engine wrote, and not what a bunch of so-called experts seem to think.

The Anatomy of a Search Engine

4.3 Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

I pulled three keyword terms from the paragraph to form one informational and educational phrase

Crawling Web Servers Name Server

Not documents boys and girls....SERVERs

Further it states.....

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls.

I'll pull some more terms

Crawler connects to MORE THAN 500,000 SERVERS.

I have worked with algorithms and large-scale computers for over 20 years. Back before Google or any search engine thought to use algorithms, we were using them in private corporations to detect and deter internal and external theft.

Algorithms save millions of dollars for banks, insurance companies, retailers, pharmaceutical companies, aeronautics, the military, space agencies, and yes, a few little search engines.

Done!

Do I need a special badge or something??
Reply With Quote
  #43 (permalink)  
Old 08-21-2007, 02:24 PM
crankydave's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Aug 2004
Location: Playing with fire!
Posts: 4,243
Default Re: PageRank (PR) for Robots.txt?

SemAdvance...

A bot or spider crawls and/or fetches. The set of instructions given to it can be referred to as an algorithm. Instructions do not crawl anything. Instructions do not index anything. I hope this is clear.

Quote:
Originally Posted by SemAdvance
If you have a page on the server at root or below and it's not linked to, it's found as a broken link.

;->

Doesn't happen by magic
Do you read what you post? This is just plain silly.

A page that cannot be found doesn't exist in the eyes of an SE. How in the world can there be a broken link if there is no link? Bots/spiders only go where they are told to go. They cannot magically find pages on a server if they don't know where they are.

Quote:
And again, if the script could not find unlinked pages... there would be no need for a robots.txt to stop the script from finding folders and pages.

That would be like posting a sign "No Turn On Red" at an intersection without a stop light....

Explain to me why you would need this robots.txt file...considering the bot could not find the folder or document UNLESS it was linked to??
What? Are you serious?

There's a whole list of reasons why a robots.txt file is important. Search for them.

Quote:
Originally Posted by SemAdvance
In your thinking... the robot would need a file to tell it which folder & documents to find if no link was built.
Ummmm... yes. That's how pages get found. A link. What do you think a bot/spider does? Make a wild guess?

Dave
Reply With Quote
  #44 (permalink)  
Old 08-21-2007, 06:26 PM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Question Re: PageRank (PR) for Robots.txt?

Quote:
Originally Posted by SemAdvance View Post
The Anatomy of a Search Engine

4.3 Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
I read this entire document, by the founders of Google, and nowhere does it mention how the spider finds documents. If it does, please post a quote. Google, on their web site, does explicitly state how documents are found. I linked and quoted this in a previous post.

Quote:
I pulled three keyword terms from the paragraph to form one informational and educational phrase

Crawling Web Servers Name Server

Not documents boys and girls....SERVERs
Yes, servers are where the documents come from. This document is describing the political and ethical issues that were encountered in the early days of large scale spidering.

Quote:
Further it states.....

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls.

I'll pull some more terms

Crawler connects to MORE THAN 500,000 SERVERS.
Yes, Google indexes a lot of stuff. Not really relevant to how they find individual documents.

Quote:
I have worked with algorithms and large-scale computers for over 20 years. Back before Google or any search engine thought to use algorithms, we were using them in private corporations to detect and deter internal and external theft.

Algorithms save millions of dollars for banks, insurance companies, retailers, pharmaceutical companies, aeronautics, the military, space agencies, and yes, a few little search engines.
Impressive, having worked with algorithms for 20 years, and not knowing:
Quote:
Originally Posted by American Heritage Dictionary
algorithm - n - A step-by-step problem-solving procedure, especially an established, recursive computational procedure for solving a problem in a finite number of steps.
Quote:
Originally Posted by WordNet
Spider - n - 3. A computer program that prowls the internet looking for publicly accessible resources that can be added to a database; the database can then be searched with a search engine
Let me give a real world example of how a spider works. Go to Google Sitemap Generator - Free Site Map Builder, XML Sitemaps, Easy and click the Webmaster tool button. Install the Java applet, and you will have access to a spider of your very own. Bear in mind that this spider, unlike a search engine spider, will not leave the domain you specify. The first screen will be prompts for various settings, specifically where the spider will look and what folders to crawl. This is the algorithm for the spider. It will also ask for the starting url. As you run the spider, you will notice that it fetches the starting page you specified, and shows a list of all the links. You can then watch as it goes to all the linked pages and finds their links and so on. You will also notice that the spider FINDS ALL LINKED DOCUMENTS YOU DON'T EXCLUDE in the settings (aka algorithm, aka instructions) but FINDS NOTHING THAT IS NOT EXPLICITLY LINKED TO.

Also, this spider, just like a search engine spider, can be set to either obey or ignore robots.txt. This DOES NOT mean that the spider magically finds files in this folder, it means that if a file has links but the site author does not want that page to be indexed, the spider will not attempt to crawl those URLs.
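
The obey/ignore setting can be sketched with a toy crawler. Everything here is hypothetical: the SITE dictionary stands in for a real web site, and DISALLOWED_PREFIXES stands in for a parsed robots.txt.

```python
from collections import deque

# Hypothetical site: each path maps to the paths it links to.
SITE = {
    "/": ["/news.html", "/private/report.html"],
    "/news.html": ["/"],
    "/private/report.html": [],
    "/private/unlinked.html": [],  # sits in the blocked folder, never linked
}
DISALLOWED_PREFIXES = ["/private/"]  # stands in for a parsed robots.txt

def crawl(start="/", obey_robots=True):
    """Follow links from start, optionally skipping disallowed paths."""
    def blocked(path):
        return obey_robots and any(path.startswith(p) for p in DISALLOWED_PREFIXES)

    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target in SITE.get(page, []):
            if target not in seen and not blocked(target):
                seen.add(target)
                queue.append(target)
    return seen

polite = crawl(obey_robots=True)
rude = crawl(obey_robots=False)
print(sorted(polite))
print(sorted(rude))
```

Ignoring robots.txt adds the linked /private/report.html to the results, but /private/unlinked.html is missing from both runs: the setting changes which linked pages are skipped, not which pages can be discovered.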

Quote:
Lastly how does any link report on your site find broken links?

If you have a page on the server at root or below and it's not linked to, it's found as a broken link.
Do you know what a "broken link" even is?

A broken link is a link to nothing, not a page that has no links to itself. Link reports find broken links by checking every link and listing the links that return error messages. I have never seen a report that says "The following documents do not have links to them". If you have such a report, or a generator that can find such documents, please post it.
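
To make the distinction concrete, here is a small sketch of both kinds of report. The file list and link data are hypothetical. The point of the second function is that it needs the server's own file listing as input, which an outside crawler does not have.

```python
# Files that actually exist on the (hypothetical) server.
FILES_ON_SERVER = {"/index.html", "/about.html", "/old/test.html"}

# Links found on each page by a crawl starting at /index.html.
LINKS = {
    "/index.html": ["/about.html", "/contact.html"],
    "/about.html": ["/index.html"],
}

def broken_links(links, files):
    """Return (source, target) pairs whose target does not exist."""
    return [(src, dst)
            for src, targets in links.items()
            for dst in targets
            if dst not in files]

def orphans(links, files, start="/index.html"):
    """Return files with no inbound link; requires the server-side file list."""
    linked = {dst for targets in links.values() for dst in targets}
    return sorted(files - linked - {start})

print(broken_links(LINKS, FILES_ON_SERVER))
print(orphans(LINKS, FILES_ON_SERVER))
```

The broken link (/contact.html) is found by checking links, while the unlinked file (/old/test.html) only shows up when the report is given a directory listing to compare against.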
__________________
The best way to learn anything, is to question everything.

Last edited by wige; 08-21-2007 at 06:30 PM.
Reply With Quote
  #45 (permalink)  
Old 08-27-2007, 04:39 AM
khurramali's Avatar
WebProWorld Veteran
 
Join Date: Aug 2005
Location: Karachi - Pakistan
Posts: 584
Default Re: PageRank (PR) for Robots.txt?

Any document on the internet can have PageRank, be it a PDF file, an image, or a Flash document. If you have enough links to a file, it will get some PageRank.

That holds for "robots.txt" too, until Google excludes the file from its PageRank algorithm.
__________________
ARFY.NET, SEO outsourcing to Pakistan
SEO Pakistan, SEO Guru Pakistan, Khurram Ali Linkedin.
Reply With Quote
  #46 (permalink)  
Old 08-27-2007, 09:32 AM
Webnauts's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Aug 2003
Location: Worldwide
Posts: 8,167
Default Re: PageRank (PR) for Robots.txt?

Can you show me a pdf or swf file that has page rank? I never saw that before.

Thanks.
__________________
"Being an expert isn't telling other people what you know. It's understanding what questions to ask, and flexibly applying your knowledge to the specific situation at hand. Being an expert means providing sensible, highly contextual direction." Jeff Atwood
SEO Workers - Search Engine Optimization Consulting Company | SEO Analysis Tool | Webnauts Net SEO
Reply With Quote
  #47 (permalink)  
Old 08-27-2007, 11:33 AM
wige's Avatar
Moderator
WebProWorld Moderator
 
Join Date: Jun 2006
Location: United States
Posts: 2,648
Default Re: PageRank (PR) for Robots.txt?

http://www.surgeongeneral.gov/tobacco/smconsumr.pdf is a PDF file regarding quitting smoking. It has a PR of 5.

I should add, the Google toolbar does not seem to show the pagerank for PDF files. I just used a search likely to return high PR PDF files, and entered the results into a page rank analysis tool to determine the page rank of the file.
__________________
The best way to learn anything, is to question everything.

Last edited by wige; 08-27-2007 at 11:43 AM.
Reply With Quote
  #48 (permalink)  
Old 08-27-2007, 01:18 PM
incrediblehelp's Avatar
WebProWorld 1,000+ Club
WebProWorld MVP
 
Join Date: Jan 2004
Location: Live in Cincy Now
Posts: 7,573
Default Re: PageRank (PR) for Robots.txt?

Yeah, I have seen hundreds of PDF files with toolbar PR. Another one here:

http://www.firstamres.com/pdf/MPR_White_Paper_FINAL.pdf
Reply With Quote
All times are GMT -4. The time now is 11:25 PM.


