02-18-2004, 04:45 AM
celox
Possible solution for alienzhavelanded

Quote:
Originally Posted by alienzhavelanded

As for my site's actual listing, it doesn't even show the title and description tags, even though they are there. Thank you so much Google!
Do you use any WYSIWYG CMS (content management system)? The reason I ask is that we had identical problems with one of the sites we were optimizing a few months ago: the site had good SERPs in many SEs other than Google or those using Google results. We checked many factors. What we found is that the only thing leading to such bad results on the most popular SE was the CMS they were using. If this is your case as well, then see below how you can check it.

There are spider simulators (aka sim-spiders). The one we used can be downloaded here: http://www.searchengineworld.com/ (it can't be accessed directly; search for 'sim spider' on the page). The test results clearly showed that the spider didn't 'see' the domain name; all the sim-spider saw was the page names, i.e. http://londonweather.html/ (note the lack of a domain name). And that's what Google's and the other main SEs' spiders were getting. How can they index the site (i.e. all of its pages) correctly when only one URL (the index) out of 60 pages is correct? This is most probably also why DMOZ didn't index the site.
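For anyone who wants to try a similar check by hand, here is a rough Python sketch of the idea. It is not the sim spider we used, only an illustration: the names `extract_links` and `suspicious_links` and the sample markup are made up for this post, and it assumes a broken link shows up as an absolute URL whose host part is really a page name.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Endings that belong to page names, not to real domain names (an assumption).
PAGE_SUFFIXES = (".html", ".htm", ".php")


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, roughly as a simple spider would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return every link found in the given HTML text."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def suspicious_links(links):
    """Return absolute links whose host looks like a page name, not a domain."""
    bad = []
    for link in links:
        host = urlparse(link).hostname or ""  # relative links have no host
        if host.endswith(PAGE_SUFFIXES):
            bad.append(link)
    return bad


# Hypothetical page source, modelled on the broken URL described above.
sample = """
<a href="http://londonweather.html/">London weather</a>
<a href="http://www.example.com/paris.html">Paris weather</a>
<a href="contact.html">Contact</a>
"""

print(suspicious_links(extract_links(sample)))
```

Run it against the page source as the spider receives it, not as the browser shows it; relative links such as contact.html are deliberately left alone.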

At that time the site owner still didn't want to get rid of the CMS. They claimed that: 1. the pages are 'seen' 100% correctly, whether by a browser (user) or by a spider, and 2. the CMS doesn't make changes to the code / URLs within the page code.

The sim spider results proved that the second claim is incorrect.

As for the first statement: the owner's side insisted that it doesn't matter whether it is a user/visitor or a spider, and that both see the page(s) in exactly the same way. I will draw this diagram; I hope WPW does not wrap the text...

                                        /*.html
+--------------+    +----------+    +--------+                   +-----------+
| User/browser | <--|          | <--|        | <---------------- | CMS /html |  index.html
| Spider       | -->| Internet | -->| Apache | --> CMS --------> +-----------+
+--------------+    |          | <--|        | --> PHP --------> | *.php     |
                    +----------+    |        | <---------------- +-----------+  index.php
                                    +--------+
                                        /*.index.php


Let's see what happens according to their theory:

The user/browser or spider asks Apache (via the Internet) for a domainname.com/.html file. As many know, when a CMS is used, Apache is usually configured so that a request for a .html page is served from the CMS_folder/html folder only (or some other folder; the point is *not* from the root). The CMS then parses the information received and shows it to the user.
So far the client's theory was correct. But it's not an axiom.

Given that we didn't have access to the server itself, we couldn't change how Apache serves .html pages. What we could do was use PHP, as part of the Apache server, to re-model the same situation. First off, we created in PHP an *exact copy* of index.html and called it index.php. You can see on the schematic above that, in the case of index.php, the Apache server calls PHP, which provides index.php to the user's browser / spider simulator *without* parsing.
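To compare the two versions, it helps to diff what the spider simulator reports for each. Below is a minimal Python sketch using the standard difflib module. The function name `diff_pages` is made up, and the two captured pages are hypothetical samples, not our client's real output.

```python
import difflib


def diff_pages(served_via_cms, served_via_php):
    """Return unified-diff lines between two captures of the same page."""
    return list(difflib.unified_diff(
        served_via_cms.splitlines(),
        served_via_php.splitlines(),
        fromfile="index.html (via CMS)",
        tofile="index.php (via PHP)",
        lineterm="",
    ))


# Hypothetical captures of the same page, as a spider simulator might report them.
cms_version = '<a href="http://londonweather.html/">London</a>\n<p>Forecast</p>'
php_version = '<a href="http://domainname.com/londonweather.html">London</a>\n<p>Forecast</p>'

for line in diff_pages(cms_version, php_version):
    print(line)
```

If the copies are truly identical from the spider's side, the diff comes back empty; any line starting with - or + marks something that changed on the way through the CMS.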

The result: spiders and browsers do see the pages differently when a WYSIWYG CMS is used. The client got rid of that CMS, which you should do as well if this is your case. SEs get a 302 (our case) or a 304 response. A 304 means, in short, "use your local copy" (a 302, strictly speaking, is a temporary redirect to another URL). An excerpt from the RFC documentation on 304:

{Document has not changed since the date and time specified in the If-Modified-Since field.}
Thus, the spider won't go deeper into the site when it gets a 304 status.
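If you want to see which status code your own server hands out, here is a small Python sketch built on the standard http.client module, which does not follow redirects on its own, so a 302 stays visible. The names `describe_status` and `fetch_status` and the `example-spider/0.1` user agent are invented for the example; real spiders send their own headers, so treat the result as a hint only.

```python
import http.client
from urllib.parse import urlparse


def describe_status(code):
    """Give a short plain-language reading of a few HTTP status codes."""
    if code == 200:
        return "ok: full page returned"
    if code in (301, 302):
        return "redirect: the page is at the URL in the Location header"
    if code == 304:
        return "not modified: use local copy"
    return "other"


def fetch_status(url, if_modified_since=None, user_agent="example-spider/0.1"):
    """Request a URL once, without following redirects.

    Returns (status code, Location header or None).
    """
    parts = urlparse(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    headers = {"User-Agent": user_agent}
    if if_modified_since:
        # Sending this header is what allows the server to answer 304.
        headers["If-Modified-Since"] = if_modified_since
    try:
        conn.request("GET", path, headers=headers)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Usage would be something like `status, location = fetch_status("http://domainname.com/index.html")` followed by `print(status, describe_status(status), location)`, with your own domain in place of domainname.com.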

Only later did we see this article on the CMS/spider interrelation, which confirmed our suspicions.

http://www.searchenginejournal.com/index.php?m=20040202