SEO 101: Welcome to the SEO 101 forum on WebProWorld. This SEO podcast is geared towards newbies, to teach and bridge the gap between website owners and elusive SEO practices. So sit back, relax, enjoy, learn, and prosper from the SEO 101 Podcast.
Ugh!!! Last night, I received an e-mail from Google Alerts, with the list of pages they found for the site lejoslearning.com. It looked normal enough, but it was an old article so I checked it out.
It was a copy of the article as it appears on my test site. My test site is set up as a subdomain, for example http://testsite.lejoslearning.com (not the actual subdomain, but you get the idea).

I just checked the robots.txt file in Google Webmaster Tools, and it has the correct folder disallowed, as shown in the example below.

User-agent: *
Disallow: /testsite/

But when I go to Google and type in site:testsite.lejoslearning.com, it gives me 180 pages in the index.

What am I doing wrong? Do they treat a subdomain as its own separate domain, which would require its own separate robots.txt file? I am confused. I have a feeling this would cause some serious duplicate content issues.

Please advise your thoughts. Thanks.
|
||||
|
I believe you need to treat a subdomain as a separate domain. This means you need to have a unique robots.txt file for each subdomain.
__________________
Jade Burnside, Ahead of the Web What good is your web site if no one can find it? SEO & Optimized Web Site Design |
|
||||
|
Each domain and/or sub-domain for which exclusions are desired requires its own robots.txt file.
See Robot Text Files - Do Not Follow Specific Links
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
|||
|
That robots.txt rule would have disallowed lejoslearning.com/testsite/ but not the subdomain testsite.lejoslearning.com as you have set it up.
|
||||
|
Correct. As set forth, /testsite/ is a sub-directory relative to the root of the domain within which the robots.txt file exists.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
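To illustrate the point made in the replies above, here is a minimal sketch using Python's standard `urllib.robotparser`. It is only a stand-in for how a crawler might apply the rules, not a description of how Google's crawler works. The `ROBOTS_BY_HOST` table, the `is_allowed` helper, and the `article.html` paths are hypothetical, and the sketch assumes the subdomain serves no rules of its own.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical stand-in for what each host serves at /robots.txt.
# The main-domain rules are the ones quoted in the thread; the empty
# entry assumes the subdomain serves no rules of its own.
ROBOTS_BY_HOST = {
    "lejoslearning.com": "User-agent: *\nDisallow: /testsite/\n",
    "testsite.lejoslearning.com": "",
}


def is_allowed(url, robots_by_host=ROBOTS_BY_HOST, user_agent="Googlebot"):
    """Check a URL against the robots.txt rules of its own host only."""
    host = urlsplit(url).hostname
    parser = RobotFileParser()
    parser.parse(robots_by_host.get(host, "").splitlines())
    return parser.can_fetch(user_agent, url)


# The rule blocks the sub-directory on the main domain...
print(is_allowed("http://lejoslearning.com/testsite/article.html"))
# ...but says nothing about URLs on the subdomain host.
print(is_allowed("http://testsite.lejoslearning.com/article.html"))

# Giving the subdomain its own rules is what blocks it.
with_subdomain_rules = dict(ROBOTS_BY_HOST)
with_subdomain_rules["testsite.lejoslearning.com"] = "User-agent: *\nDisallow: /\n"
print(is_allowed("http://testsite.lejoslearning.com/article.html", with_subdomain_rules))
```

The lookup is keyed by host because a `Disallow` path is matched only against the path portion of URLs on the host that served the file.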
|
|||
|
Quote:
I simply password-protected the area to keep out prying eyes, but it still puzzled me for a while until I came across this post: Is Google Spying!? I realised this was the most likely culprit: it looks like if Google finds new content through means other than crawling, it doesn't bother to check robots.txt. So if you are using the Google Toolbar, this might well be how your content has been picked up.
|
||||
|
Quote:
That a specific resource has been newly discovered does not mean that its parent site has been completely crawled; and, it is only when such site crawl occurs that the robots.txt file is discovered.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
|||
|
Quote:
But at the end of the day, for whatever reason, robots.txt is not an infallible way to hide content from search engine bots!
|
||||
|
Quote:
Being "discovered" and being "indexed" are 2 quite different things. Likewise, the indexing of an individual resource & that of the collective work that contains it are also different things. Lastly, being "indexed" is not the same thing as being published. That an individual resource has been discovered does not mean that any particular SE has taken any action beyond ignoring it or privately indexing it.
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
|||
|
Please note, the subdomain also has its own robots.txt file, which also has a disallow rule.
My understanding was that if the robots.txt file was there, the content would not be indexed. Putting a password on the front of it is the solution we are working on, but the issue of the search engines not abiding by the robots.txt file is a bit concerning.
|
||||
|
Quote:
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
|
|||
|
Quote:
Quote:
Do not blame the search engines if they looked at the subdomain before you added the robots.txt file there.
Jean-Luc
|
|||
|
Jean-Luc,

Thanks for the question to clarify. Let me answer it quickly. We did not add the robots.txt files after we found out that Google had indexed us. We have always had robots.txt files in place, both in the specific folder in the subdomain and on the main domain of the site. I posted again because my first post did not make clear that we had, and still have, robots.txt files in both locations.

Just to clarify one other thing, I am not blaming the search engines for anything. My assumption is that we did not understand something correctly, and therefore allowed the search engines to crawl something we didn't want them to crawl.

So far, based on the comments we have received, it seems that the best solution is to put anything in development behind a password-protected part of the site, and also disallow it in robots.txt. My only concern is that I thought robots.txt was a definitive, black-and-white solution, but it appears not to be so black and white.

Thanks again for the question to clarify.
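As a rough illustration of the password-protection approach described above, here is a minimal sketch of HTTP Basic authentication written as a plain Python WSGI application. It is not the poster's setup: the `staging_app` and `status_for` names, the credentials, and the response text are all hypothetical, and a real site would more likely use the web server's own access controls.

```python
import base64
import hmac

# Hypothetical credentials, for illustration only.
USERNAME = "dev"
PASSWORD = "change-me"
EXPECTED = "Basic " + base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()


def staging_app(environ, start_response):
    """WSGI app that serves content only to authenticated requests."""
    supplied = environ.get("HTTP_AUTHORIZATION", "")
    # Constant-time comparison of the supplied and expected header values.
    if not hmac.compare_digest(supplied.encode(), EXPECTED.encode()):
        start_response(
            "401 Unauthorized",
            [("WWW-Authenticate", 'Basic realm="test site"'),
             ("Content-Type", "text/plain")],
        )
        return [b"Authentication required\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Work-in-progress article\n"]


def status_for(environ):
    """Call the app directly and return the status line it sends."""
    seen = []
    staging_app(environ, lambda status, headers: seen.append(status))
    return seen[0]


# A request without credentials, as an anonymous crawler would send it.
print(status_for({}))
# A request carrying the expected credentials.
print(status_for({"HTTP_AUTHORIZATION": EXPECTED}))
```

Unlike robots.txt, which relies on the crawler choosing to comply, this check is enforced by the server: a request without the credentials never receives the page content.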
|
||||
|
Quote:
Hi Clarrie,
Search spiders crawl servers and follow the links on the pages they find there. They do not always follow the instructions in robots.txt, especially if the robots.txt is not constructed properly. Hope it helps.
|
||||
|
Really! Spiders also crawl servers... I didn't know that. I thought spiders could only see the web. So, this could help in crawling my site.
__________________
Hawaii Events|Oahu Events|Honolulu Events |led signs|outdoor led sign |
|
||||
|
Quote:
That being said, understand that 'bots do not have unfettered access to any server. They can only access files via URIs (Uniform Resource Identifiers) that they know of; and, of those, only those that do not require any login other than "anonymous."
__________________
The Penn State Ticket Man http://www.pennstateticketman.com http://www.happyvalleytickets.com http://www.hounddogtours.com |
WebProWorld is an iEntry, Inc. ® site - © 2009 All Rights Reserved Privacy Policy and Legal iEntry, Inc. 2549 Richmond Rd. Lexington KY, 40509 |