Caching and Indexing
mjtaylor
07-01-2010, 09:53 AM
I admit it, I am confused. I used to think that a page being 'cached' indicated that it was indexed. And if a page was NOT cached, it was not indexed. But these days I find lots of pages in SERPs that are not cached. What is the significance of a page being cached?
weegillis
07-01-2010, 11:43 AM
My guess: that it once was indexed? I have no idea of the logic behind page caching but it might actually be there as an alternative to searching only the index for a search query. Could the results be pulled straight from the cached pages?
chandrika
07-01-2010, 12:50 PM
It is possible that those sites have used the meta tag
<meta name="robots" content="noarchive">
which prevents Google and other bots from caching their content. People might use it if they did not want old copies of their site in places such as the Wayback Machine; that meta tag also prevents a site's content from being added there.
davidweb
07-01-2010, 05:48 PM
I admit it, I am confused. I used to think that a page being 'cached' indicated that it was indexed. And if a page was NOT cached, it was not indexed. But these days I find lots of pages in SERPs that are not cached. What is the significance of a page being cached?
The real purpose of Cache is to store all the data pertaining to your website [content part] in Google database. This data is then poured into Google Algorithm where it is processed like a cheese. If the ingredients of your website are good then Google says Cheeese otherwise you get a finger I mean lady finger :)
Google cache is where all the on-page SEO things are stored.
peskyhuman
07-01-2010, 07:05 PM
The real purpose of Cache is to store all the data pertaining to your website [content part] in Google database. This data is then poured into Google Algorithm where it is processed like a cheese. If the ingredients of your website are good then Google says Cheeese otherwise you get a finger I mean lady finger :)
Google cache is where all the on-page SEO things are stored.
How can you get Google to update the cache more often? Sometimes it lists quite old content in search results because of caching.
deepsand
07-01-2010, 10:59 PM
The crawler, which does no more than request copies of resources, dumps them into the cache, preparatory to their being processed by the indexing engine. Thus, the cache can only be updated if and when a resource is re-crawled.
The search engine decides in real time, in response to a query, whether or not to display a link to the cache, assuming that it's archived. The criteria for making such decision are, to the best of my knowledge, undisclosed. The displayed cache data are limited to, I believe, the first 101 KB of text.
For display purposes, cached files are, with the exceptions of Text and SWF files, converted into HTML format, so that no special viewer application is required.
You can access the cached data for any archived resource via the cache: operator; e.g.
cache:www.domainname.com/page.
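As a small illustration of the operator syntax above, the query string can be assembled from a URL. This is only a sketch: `cache_query` is a hypothetical helper name, and it simply drops the scheme, since the example above gives the address without one.

```python
def cache_query(url: str) -> str:
    """Build a 'cache:' query string from a URL (hypothetical helper)."""
    # Drop a leading scheme such as "http://" if one is present.
    _, sep, rest = url.partition("://")
    target = rest if sep else url
    return "cache:" + target


print(cache_query("http://www.domainname.com/page"))
```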
Google has its own command, <META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">, which tells it not to make the cache public, and not to archive it once indexing is completed. <META NAME="ROBOTS" CONTENT="NOARCHIVE"> is a universal directive.
There are other Meta Tags relating to caching, as applicable to devices other than SEs, such as proxy servers. For details, see Useful HTML Meta Tags (http://www.i18nguy.com/markup/metatags.html).
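To check whether a page carries either of the NOARCHIVE tags mentioned above, something like the following could be used. It is a rough sketch using only Python's standard `html.parser`; the names `RobotsMetaParser` and `is_noarchive` are made up for illustration, and it only looks at the two meta names discussed here.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # meta name (lowercase) -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name in ("robots", "googlebot"):
            tokens = {t.strip().lower() for t in attr.get("content", "").split(",")}
            tokens.discard("")
            self.directives.setdefault(name, set()).update(tokens)


def is_noarchive(html, bot="googlebot"):
    """True if the generic robots tag or the bot-specific tag says noarchive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(bot, set())
    return "noarchive" in generic or "noarchive" in specific


print(is_noarchive('<META NAME="GOOGLEBOT" CONTENT="NOARCHIVE">'))
```

Tag and attribute names are matched case-insensitively, so the upper-case form quoted above and the lower-case form posted earlier in the thread are treated the same.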
CReed
07-02-2010, 12:07 AM
The real purpose of Cache is to store all the data pertaining to your website [content part] in Google database. This data is then poured into Google Algorithm where it is processed like a cheese. If the ingredients of your website are good then Google says Cheeese otherwise you get a finger I mean lady finger :)
Google cache is where all the on-page SEO things are stored.
Too funny. :lol:
In reality - a site may be included within the index and be returned (rank) for relevant queries without having a cached version available. A page also does not need to be cached to be reported and/or counted as a backlink; it only needs to be indexed.
http://www.google.com/intl/en/help/features_list.html#cached
Cached Links
Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable. If you click on the "Cached" link, you will see the web page as it looked when we indexed it. The cached content is the content Google uses to judge whether this page is a relevant match for your query.
When the cached page is displayed, it will have a header at the top which serves as a reminder that this is not necessarily the most recent version of the page. Terms that match your query are highlighted on the cached version to make it easier for you to see why your page is relevant.
The "Cached" link will be missing for sites that have not been indexed, as well as for sites whose owners have requested we not cache their content.
NetProwler
07-02-2010, 03:34 AM
Some CMSs adopt a no-cache directive by default. For example, a typical default Joomla installation sets these directives in the header:
<meta HTTP-EQUIV="Pragma" Content="no-cache">
<meta HTTP-EQUIV="cache-control" content="no-cache"> and this:
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
In such cases, Google will crawl and index the pages, but will not display the cache option.
For end users, the cached version is sometimes faster to view than the original site, if the site is slow.
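For reference, the Cache-Control line above is a comma-separated list of HTTP caching directives. Here is a minimal sketch of how a client or proxy might read it. `parse_cache_control` is a hypothetical helper, not part of Joomla or any library, and the naive comma split is a simplification that would break on quoted values containing commas. Whether a search engine consults these values is not something this shows.

```python
def parse_cache_control(header_value):
    """Split a Cache-Control header value into {directive: value-or-None}."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        # Directives are either bare tokens ("no-cache") or name=value pairs.
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


header = "no-store, no-cache, must-revalidate, post-check=0, pre-check=0"
parsed = parse_cache_control(header)

# Under HTTP caching rules, "no-store" tells caches not to keep a copy at all.
may_store = "no-store" not in parsed
print(parsed, may_store)
```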
deepsand
07-02-2010, 04:15 AM
It is my understanding that said directives are not used by SEs, but by proxy servers and clients.
dburdon
07-02-2010, 04:38 AM
Post-Caffeine, Google has broken the link between the old crawl, index, cache and the SERPs. See: http://uksearch.blogspot.com/2010/06/google-caffeine-goes-live.html
In essence if you grasp the nature of the diagram Google can draw on results without the weight and time lag of the old system.
deepsand
07-02-2010, 05:04 AM
We've already seen that; and, there's nothing there re. "breaking links" between functions.
Caffeine is not a new architecture, but a new computing platform with greatly increased capacity and capabilities. Data are still requested by a crawler, deposited into a cache, then processed by an indexing engine, which manages a database which serves to source the search engine. These are fundamental steps which must be executed in said order.
The greatest difference appears to be procedural, in that the crawler now seems to be semi-autonomous. Rather than simply serve the indexing engine's requests for data, it now has at least some discretion to advise the indexing engine of a newly retrieved resource, and ask if it should independently seek and retrieve related resources.
By analogy, the old procedure sent someone to the grocery store each time a particular item was needed. The new one tells the shopper to get all the items needed to make a particular dish. However, once all the needed items are in hand, the preparation and serving of the dish remain unchanged.
mjtaylor
07-02-2010, 12:52 PM
Too funny. :lol:
In reality - a site may be included within the index and be returned (rank) for relevant queries without having a cached version available. A page also does not need to be cached to be reported and/or counted as a backlink; it only needs to be indexed.
Well, that's what I now understand from my observation. So what is the purpose of a cache? If it can be matched and returned for a query when indexed and not cached, then is the cache simply a convenience for the user? When a page is missing or to review recent changes?
http://www.google.com/intl/en/help/features_list.html#cached
Which reads:
The "Cached" link will be missing for sites that have not been indexed, as well as for sites whose owners have requested we not cache their content.
That statement directly contradicts the idea that it can be indexed and returned in the SERPs without a cache.
I hear dburdon on the supposition that Caffeine has changed the system (though I didn't get any additional info from that link to your blog ... am I missing something?) but I would have thought Caffeine made it faster.
I will look at whether the pages I am noticing are "no 'cached' ... but that isn't always the case, as sometimes it's *my* new pages that are returned in SERPs before being cached and I am pretty sure I haven't restricted the cache.
We've already seen that; and, there's nothing there re. "breaking links" between functions.
Caffeine is not a new architecture, but a new computing platform with greatly increased capacity and capabilities. Data are still requested by a crawler, deposited into a cache, then processed by an indexing engine, which manages a database which serves to source the search engine. These are fundamental steps which must be executed in said order.
The greatest difference appears to be procedural, in that the crawler now seems to be semi-autonomous. Rather than simply serve the indexing engine's requests for data, it now has at least some discretion to advise the indexing engine of a newly retrieved resource, and ask if it should independently seek and retrieve related resources.
By analogy, the old procedure sent someone to the grocery store each time a particular item was needed. The new one tells the shopper to get all the items needed to make a particular dish. However, once all the needed items are in hand, the preparation and serving of the dish remain unchanged.
Not new architecture, per se, but a new platform? I am not sure I see the difference. From everything I've read about Caffeine, it is a complete change to the structure of the index. Splicing and dicing here, but if you look at the diagram here: http://www.webproworld.com/webmaster-forum/threads/101750-Caffeine-Is-Live that could be construed as a diagram of the architecture of the index.
I like your grocery store analogy... and agree with it.
chandrika
07-02-2010, 01:51 PM
Quote Originally Posted by Google: "The "Cached" link will be missing for sites that have not been indexed, as well as for sites whose owners have requested we not cache their content."
That statement directly contradicts the idea that it can be indexed and returned in the SERPs without a cache.
At first I couldn't see the contradiction, but now you mention it, I think I see what you mean. It makes sense that the cache link will be missing for sites where owners have chosen not to be cached, but they are saying it will be missing for sites that have not been indexed... and if a site is not indexed, it would follow (to me also) that it would not even be in the SERPs, let alone have a cache link.
So maybe there is a distinction between being crawled and being indexed. Sites can be crawled and returned in search results, and maybe during that interim time between crawling and indexing Google is collecting data on the site, such as click-throughs and bounces, which it uses to determine the site's place in the eventual index. The site would then get the cache link once it is in the index. So the results aren't just indexed sites, but include crawled sites that have not yet been fully evaluated (indexed)... maybe. I am just speculating.
weegillis
07-02-2010, 02:33 PM
Since you can be almost 100% certain of finding a URL (domain, anyways) in the SERPS, chandrika, there may be some merit in your speculation.
deepsand
07-02-2010, 04:17 PM
Well, that's what I now understand from my observation. So what is the purpose of a cache? If it can be matched and returned for a query when indexed and not cached, then is the cache simply a convenience for the user? When a page is missing or to review recent changes?
It's a user convenience when the live link:
for any reason, fails to provide a prompt and desirable effect; or,
delivers content materially different from that at the time of caching.
That statement directly contradicts the idea that it can be indexed and returned in the SERPs without a cache.
There is a very important distinction between the retrieved data being temporarily cached, a necessary prerequisite for indexing to occur, and its being archived.
Once indexing is completed, the indexing engine no longer has need of the cache.
I hear dburdon on the supposition that Caffeine has changed the system (though I didn't get any additional info from that link to your blog ... am I missing something?) but I would have thought Caffeine made it faster.
Not new architecture, per se, but a new platform? I am not sure I see the difference. From everything I've read about Caffeine, it is a complete change to the structure of the index. Splicing and dicing here, but if you look at the diagram here: http://www.webproworld.com/webmaster-forum/threads/101750-Caffeine-Is-Live that could be construed as a diagram of the architecture of the index.
That pretty drawing and the accompanying words are pure marketing hype; they are not intended to, and do not, reveal any technical information. Deducing what has actually happened requires an in-depth knowledge of computing technology.
Caffeine is a new hardware platform, with some procedural changes with regard to what gets done when and how; the what itself remains the same.
So maybe there is a distinction between being crawled and being indexed. Sites can be crawled and returned in search results, and maybe during that interim time between crawling and indexing Google is collecting data on the site, such as click-throughs and bounces, which it uses to determine the site's place in the eventual index. The site would then get the cache link once it is in the index. So the results aren't just indexed sites, but include crawled sites that have not yet been fully evaluated (indexed)... maybe. I am just speculating.
The necessary steps, in the required order, are:
Crawler retrieves data;
Data are deposited in temporary cache;
Indexing engine processes cached data;
Indexing engine updates database;
Cached data are either archived or dumped; and,
Search engine retrieves data from database to generate SERPs.
Crawler retrieved data that are not yet indexed cannot appear in SERPs.
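The steps listed above can be sketched as a toy model. Every name here (`ToySearchEngine`, `temp_cache`, `archive`, the `noarchive` flag) is hypothetical and for illustration only; it mirrors the order described in this post, not Google's actual implementation.

```python
class ToySearchEngine:
    """Toy crawl -> temporary cache -> index -> archive-or-dump -> search model."""

    def __init__(self):
        self.temp_cache = {}  # url -> (content, noarchive), awaiting indexing
        self.index = {}       # word -> set of urls (the "database")
        self.archive = {}     # url -> snapshot that a "Cached" link would show

    def crawl(self, url, content, noarchive=False):
        # Steps 1-2: the crawler retrieves data and deposits it in the temporary cache.
        self.temp_cache[url] = (content, noarchive)

    def run_indexer(self):
        for url, (content, noarchive) in list(self.temp_cache.items()):
            # Steps 3-4: the indexing engine processes the cached data and updates the database.
            for word in content.lower().split():
                self.index.setdefault(word, set()).add(url)
            # Step 5: the cached data are either archived or dumped.
            if noarchive:
                self.archive.pop(url, None)
            else:
                self.archive[url] = content
            del self.temp_cache[url]

    def search(self, word):
        # Step 6: results come only from the database; each hit reports
        # whether an archived ("Cached") copy is available.
        urls = sorted(self.index.get(word.lower(), set()))
        return [(url, url in self.archive) for url in urls]


engine = ToySearchEngine()
engine.crawl("a.example/page", "caching and indexing")
before_indexing = engine.search("caching")  # crawled but not yet indexed
engine.crawl("b.example/page", "indexing only", noarchive=True)
engine.run_indexer()
after_indexing = engine.search("indexing")
print(before_indexing, after_indexing)
```

In this model a page crawled with `noarchive=True` still shows up in results, just without a cached copy, while a page that has only been crawled does not show up at all.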