Results 31 to 40 of 41

Thread: WWW vs non-WWW - understanding the physical file and directory background

  1. #31
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,446
    I just ran "Fetch as Google" under the canonical form that I did not specify in the canonical tag, the one under which the overwhelming majority of data had accumulated. We'll see if this gets more attention than the submission of 5 days ago did.

  2. #32
    Senior Member
    Join Date
    Sep 2005
    Posts
    188
    Deepsand - I left the "Fetch as Googlebot" for a few days, to let things settle, but am now getting a "Fetch Status" of "Missing robots.txt" for pages which were previously showing either "Success" or "Failed". This is happening for both www and non-www URLs, and I haven't made any changes to robots.txt, which is in the same folder as the site pages, where it has always been.

    Just wondered if you had encountered this in your recent tests, or if you can shed any light on why this message might be displaying?

  3. #33
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,446
    That's solely an informational message, with no material effect.

    The robots.txt is an optional exclusionary file. All well-behaved SEs look for it and honor any directives found; ill-behaved ones don't bother looking for it.

    If one has nothing to be barred from being crawled, this file is not needed.

    If you do have a robots.txt file in the home directory, then Google's screwed up. FWIW, I'm seeing a good bit of data on both GWMT and its SE that suggest that Google's lost its mind.
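
    To illustrate the point that an empty or rule-free robots.txt bars nothing, here is a minimal sketch using Python's standard urllib.robotparser. The host www.example.com and the page path are placeholders, and Python's parser only approximates how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

def allowed_with_rules(rules_text, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)

# Placeholder URL, not taken from the thread.
PAGE = "http://www.example.com/services.html"

# An empty file (no rules at all) excludes nothing.
print(allowed_with_rules("", PAGE))
# A file that explicitly states "no exclusions" behaves the same way.
print(allowed_with_rules("User-agent: *\nDisallow:\n", PAGE))
```

    In both cases the page stays crawlable, which is why the file is only needed when there is something to exclude.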

  4. #34
    Senior Member
    Join Date
    Sep 2005
    Posts
    188
    Thank God for that - I thought it was just me! If it's not privileged info, I'd be glad to hear the results of your test. For my own project, I was taken aback by getting Page 1 results for all targeted terms within 7 days. There has been some fluctuation since - I have been adding backlinks and tweaking online page texts, so most of it has been mildly positive ( pushing a SERP from maybe 7 to 6 ). The first item on my Action Plan for every client has always been to create a Google Maps page ( partly because they seem to only offer the snail-mail option for confirmation these days ). On previous projects, when G had implemented 'integrated search' ( Maps, YouTube, etc. ) I had great results from the Maps/Places SERPs. They seemed to kick in again on the current project about 4 days ago, pushing some search terms from maybe 7th to 4th, but when I checked tonight, none of the searches featured the G maps results, which means they have dropped back down to previous positions. Still Page 1, which is good, but I wish Google would make its mind up from day to day whether Maps/Places results will feature or not.

    Not sure what you mean by 'solely an informational message....' I am concerned that, while having some positive and negative results from the 'Fetch as Googlebot' previously, I was prepared to wait for things to settle down, but have to be worried when I see the 'Missing robots.txt' message, when I know that the robots file has been in place from day 1 of the project. Apart from the confusion about why G can't find the file, I have to wonder if it will mean the SERPs will drop because of some perceived problem Google has? While it is comforting that I'm not alone, and that you have also found some 'what the hell' issues with GWMT and SERPs, it doesn't help if my ethical SEO actions to help my client are skewed by some new Google f'kup.

  5. #35
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,446
    Quote Originally Posted by murphypj View Post
    Thank God for that - I thought it was just me! If it's not privileged info, I'd be glad to hear the results of your test.
    Still going round and round in the revolving door. Google keeps re-fetching the pages, updating the cache, and bumping up the "Processed" dates, but no more. It's Groundhog Day.

    Quote Originally Posted by murphypj View Post
    Not sure what you mean by 'solely an informational message....'
    It's just there to let one know that no robots.txt file was found, in case one either thinks that there's one present in the correct location or may want to exclude something but forgot to create said file.

    The only way that a robots.txt file can screw one up is if you accidentally place an unintended exclusion there, so that Google and other well behaved SEs ignore a critical page or directory.
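
    A minimal sketch of how an unintended exclusion can bite, using Python's standard urllib.robotparser as a stand-in for a well-behaved crawler. The rules and URLs are hypothetical. Disallow rules match by path prefix, so a rule meant for a /private/ folder that is written without the trailing slash also hides /private-clinic.html.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the author meant to hide only the /private/ folder
# but left off the trailing slash.
RULES = """\
User-agent: *
Disallow: /private
"""

def blocked_urls(rules_text, urls, agent="Googlebot"):
    """Return the subset of `urls` that `agent` is barred from fetching."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]

# Placeholder URLs for illustration only.
URLS = [
    "http://www.example.com/index.html",
    "http://www.example.com/private/notes.html",
    "http://www.example.com/private-clinic.html",  # caught by accident
]

for url in blocked_urls(RULES, URLS):
    print("blocked:", url)
```

    Running a site's real URL list through a check like this before uploading a changed robots.txt is a cheap way to catch that kind of slip.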

    As for your case, has Big G yet combined all of the data for the two canonical forms into the one that you selected via the canonical tag?

  6. #36
    Senior Member
    Join Date
    Sep 2005
    Posts
    188
    "As for your case, has Big G yet combined all of the data for the two canonical forms into the one that you selected via the canonical tag? "

    Yes and no, DS. In GWT, the Sitemaps page now shows the same number of pages indexed for both forms, but the www version carries two warnings. One is 'high response time'; the more worrying one is "When we tested a sample of the URLs from your Sitemap, we found that some of the URLs were unreachable. Please check your webserver for possible misconfiguration, as these errors may be caused by a server error (such as a 5xx error) or a network error between Googlebot and your server. All reachable URLs will still be submitted."

    All of the pages have a couple hundred words of text, and up to 4 small images, and are all CSS/HTML ( no Flash, PHP etc. )

    Under 'Keywords', the 'most common words on your site' are wildly different, with the www-form at least having 'widgets' and 'mytown' in top 2 places, while they show as #11 and #10 on the non-www form.

    "Fetch as Googlebot" as noted is behaving strangely, but at some point has fetched the root and index.html for both forms. The www form has fetched all sub-pages successfully, the non-www is failing on all of them.

    SERPs, on the other hand, would seem to indicate that Google has figured things out. All of the backlinks I created were to the 'www' form, but all targeted terms started to show on Page 1 a week ago ( 7 days after I began ), and are still doing so. The URL displaying is the non-www, but the new Titles and Descriptions are showing, and a few minor tweaks to on-page text yesterday were picked up almost immediately.

    Happy with overall results, but the GWT results, and messages are a bit of a concern.

  7. #37
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,446
    Quote Originally Posted by murphypj View Post
    Quote Originally Posted by deepsand
    As for your case, has Big G yet combined all of the data for the two canonical forms into the one that you selected via the canonical tag?
    Yes and no, DS. In GWT, the Sitemaps page now shows the same number of pages indexed for both forms, but the www version carries two warnings. One is 'high response time'; the more worrying one is "When we tested a sample of the URLs from your Sitemap, we found that some of the URLs were unreachable. Please check your webserver for possible misconfiguration, as these errors may be caused by a server error (such as a 5xx error) or a network error between Googlebot and your server. All reachable URLs will still be submitted."
    But, no details re. specific pages and their corresponding HTTP Header Codes?
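
    Since GWT gives no per-page codes, one can collect them directly. The following is a rough sketch, not a substitute for the server logs: it builds the www and non-www URL for each path and records the HTTP status each one returns. The function names, the domain example.com and the paths are all placeholders. Note that urllib follows redirects, so a 301 from one form to the other is reported as the final page's status, and that some servers answer HEAD requests differently from GET.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None on a network-level failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx / 5xx responses arrive as exceptions
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout

def check_both_forms(domain, paths, fetch=http_status):
    """Map each non-www and www URL for `paths` to the status `fetch` reports."""
    results = {}
    for host in (domain, "www." + domain):
        for path in paths:
            url = "http://" + host + path
            results[url] = fetch(url)
    return results
```

    Calling check_both_forms("example.com", ["/robots.txt", "/index.html"]) with the real domain substituted returns a dictionary of URL to status code. A 5xx or a None on one form only would point at the server or DNS setup rather than at Google.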

    Quote Originally Posted by murphypj View Post
    Under 'Keywords', the 'most common words on your site' are wildly different, with the www-form at least having 'widgets' and 'mytown' in top 2 places, while they show as #11 and #10 on the non-www form.
    Neither surprising nor, IMO, of concern. In fact, it may be argued that showing the different data according to the canonical form used by the user provides useful information.

    Keyword data are transient - from a sliding 30 day window - and for informational purposes only. They serve only to supplement your server log data following Google's suppression of Referer data when the user's session is via HTTPS, as is now the case if they are logged into a Google account.

    FWIW, I've noticed that the two canonical forms often report data for different time periods. Not drastically different, only 1 to 3 days apart, but still different.

    Quote Originally Posted by murphypj View Post
    "Fetch as Googlebot" as noted is behaving strangely, but at some point has fetched the root and index.html for both forms. The www form has fetched all sub-pages successfully, the non-www is failing on all of them.
    I'd hesitate to accept the veracity of WMT at face value. May simply be an unintended and inconsequential artifact of the conflating of the two canonical forms.

    As for my test, 1 tiny step. The non-preferred canonical form, which is where virtually everything had been indexed, now shows 0 pages indexed; and, the preferred form's status now shows "Pending."

    As for search results themselves, including those from the various operators, all remains unchanged, with the exception of the link operator when used on the non-preferred form only; upon submission, Google keeps changing it to the site operator.

  8. #38
    Senior Member
    Join Date
    Sep 2005
    Posts
    188
    Quote Originally Posted by deepsand View Post
    But, no details re. specific pages and their corresponding HTTP Header Codes?
    - No, no details, just the scary message!


    Quote Originally Posted by deepsand View Post
    FWIW, I've noticed that the two canonical forms often report data for different time periods. Not drastically different, only 1 to 3 days apart, but still different.
    DS - There's a slight difference in dates, but all of the on-page text changes were completed long before the earliest date. So if GWT reads the exact same set of pages in two forms, and comes up with a totally different set of results ( esp. keywords ), it begins to render GWT info suspect, if not unusable.

    Quote Originally Posted by deepsand View Post
    I'd hesitate to accept the veracity of WMT at face value. May simply be an unintended and inconsequential artifact of the conflating of the two canonical forms.
    I expected an element of that, but not the errors I am seeing.

    Quote Originally Posted by deepsand View Post
    As for my test, 1 tiny step. The non-preferred canonical form, which is where virtually everything had been indexed, now shows 0 pages indexed; and, the preferred form's status now shows "Pending."
    - Well, that sounds pretty positive, doesn't it? From my reading, that means G has de-indexed the non-pref form, and is about to effect the 'rel=canonical' to replace these index entries with the preferred versions. But I appreciate you will be sweating it out until this happens.
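
    For anyone wanting to confirm which form a page actually declares via 'rel=canonical', here is a small sketch that pulls the canonical link out of a page's HTML with Python's standard html.parser. The sample page and its URL are invented for illustration; it only reads the tag and says nothing about when or whether Google will act on it.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; compare case-insensitively.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])

def find_canonicals(html_text):
    """Return all canonical hrefs declared in `html_text`, in document order."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonicals

# Hypothetical page, as it might be served from the non-www host.
SAMPLE = """<html><head>
<title>Widgets in Mytown</title>
<link rel="canonical" href="http://www.example.com/widgets.html">
</head><body>...</body></html>"""

print(find_canonicals(SAMPLE))
```

    A page should declare exactly one canonical URL, and it should be identical whether the page is fetched from the www or the non-www host; an empty list or more than one entry is worth fixing before blaming GWT.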

    I've always found GWT reasonably useful before, but it's freaking me out this time. I've gained good Page 1's on all agreed terms, albeit on the non-preferred canonical form. If you search Google.ie for 'wexford eye centre', all but one of the Page 1 SERPs point to either the site, or the page entries on relevant local and industry directory sites. ( If this is too much information, let me know, I'll remove that line ).

    GWT is raising the following concerns:

    NON-WWW:
    Sitemaps: - Fine, but only 2 pages (unspecified) indexed.
    Crawler Access: Constantly showing robots.txt found ( 202 )
    Settings: Preferred domain: www. ( seemingly ignored )
    Links to your site: Strange mix of a few of the new links created with some others which had never shown up in backlink checks.
    Internal links: Fine, all pages shown linking to all other pages
    Crawl Errors: 2 showing 'Robots.txt unreachable', 1 showing 'Network Unreachable'. Very worrying.
    Fetch as Googlebot: All attempts now showing 'Missing Robots.txt'
    Duplicate Meta Descriptions: Errors based on the Description tags as they were before I changed them 16 days ago.

    WWW:
    Sitemaps: - 2 (unspecified) pages indexed, with the 2 worrying warning messages as outlined previously.
    Crawler Access: Constantly showing robots.txt found ( 202 )
    Settings: Preferred domain: www. ( seemingly ignored )
    Links to your site: None.
    Internal links: One page showing.
    Crawl Errors: 'Robots.txt unreachable' showing for 'index.html'
    Fetch as Googlebot: All attempts now showing 'Missing Robots.txt'
    Duplicate Meta Descriptions: None.

    Apologies for the length of this post ( and thread ). It would be great to hear if any of these GWT errors have been encountered by other members. It was reassuring to hear your general comment about the unreliability of info on GWT - I've been used to seeing a reduced level of detail there on previous projects, but the 'unreachable' and 'robots.txt' messages are of particular concern. All pages have always loaded instantly for me, and I don't want to have to dig into hosting issues unless I have to. I really appreciate all of the help and feedback I've had on this, thanks.
    Last edited by murphypj; 02-17-2012 at 07:56 AM.

  9. #39
    WebProWorld MVP deepsand
    Join Date
    May 2004
    Location
    State College, PA
    Posts
    16,446
    Quote Originally Posted by murphypj View Post
    DS - There's a slight difference in dates, but all of the on-page text changes were completed long before the earliest date. So if GWT reads the exact same set of pages in two forms, and comes up with a totally different set of results ( esp. keywords ), it begins to render GWT info suspect, if not unusable.
    Puzzling it is. In my case, I last night observed that, while all pages of the site under test were indexed under both canonical forms, the keywords data presented on WMT were noticeably different. Of particular note is the case that singular/plural variants are reported only on the form not specified via the canonical tag.

    Quote Originally Posted by murphypj View Post
    - Well, that sounds pretty positive, doesn't it? From my reading, that means G has de-indexed the non-pref form, and is about to effect the 'rel=canonical' to replace these index entries with the preferred versions. But I appreciate you will be sweating it out until this happens.
    The odd thing is that, despite the "Pending" status, all of the pages that were previously indexed only under the non-preferred canonical form are now indexed under the preferred one as well. This suggests that WMT data are not truly current.

    Quote Originally Posted by murphypj View Post
    I've always found GWT reasonably useful before, but it's freaking me out this time.
    I noticed it becoming increasingly FUBAR following the Caffeine hardware platform upgrade.

    Quote Originally Posted by murphypj View Post
    GWT is raising the following concerns:

    NON-WWW:
    Sitemaps: - Fine, but only 2 pages (unspecified) indexed.
    Crawler Access: Constantly showing robots.txt found ( 202 )
    Settings: Preferred domain: www. ( seemingly ignored )
    Links to your site: Strange mix of a few of the new links created with some others which had never shown up in backlink checks.
    Internal links: Fine, all pages shown linking to all other pages
    Crawl Errors: 2 showing 'Robots.txt unreachable', 1 showing 'Network Unreachable'. Very worrying.
    Fetch as Googlebot: All attempts now showing 'Missing Robots.txt'
    Duplicate Meta Descriptions: Errors based on the Description tags as they were before I changed them 16 days ago.

    WWW:
    Sitemaps: - 2 (unspecified) pages indexed, with the 2 worrying warning messages as outlined previously.
    Crawler Access: Constantly showing robots.txt found ( 202 )
    Settings: Preferred domain: www. ( seemingly ignored )
    Links to your site: None.
    Internal links: One page showing.
    Crawl Errors: 'Robots.txt unreachable' showing for 'index.html'
    Fetch as Googlebot: All attempts now showing 'Missing Robots.txt'
    Duplicate Meta Descriptions: None.
    That several of the above are logically inconsistent suggests a lack of database synchronization, which is not uncommon on a distributed database with dynamic load balancing. While these differences vanish with time, they are annoying as hell while they exist.

    The good news here is that the worst that can happen is that something specified in robots.txt to not be crawled will be.

  10. #40
    Senior Member
    Join Date
    Sep 2005
    Posts
    188
    DS - Well, the dichotomy between Crawler Access showing "Robots Found" and "Fetch as Googlebot" showing "Missing robots.txt" obviously indicates a flaw in GWT which didn't previously exist. The messages saying that pages are "unreachable" and pointing to possible server response and delivery problems are worrying, if they are to be believed. If the client is indeed on a slow or dodgy server, I can obviously recommend that they move, but in working with the site, all of the pages load instantly, and I know that they are small HTML pages, light on images. When you speak of "lack of database synchronization" and a "distributed database with dynamic load balancing", I assume you're referring to Google's database, load balancing etc.
