View Full Version : Do Engines read Page Source or what?



wetchman
05-17-2004, 02:38 PM
Hello,

A technical question has come up in our discussions here: do search engines essentially "see" what we see when viewing the source code of a page, or do they go a step further and "render" the page as it would be "seen" in a browser?

This topic came up when we were discussing how a search engine might handle a link with the style "display: none" applied. A link with that style shows up in the source code, but it is not displayed in the browser or when the page is printed.
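
To make the scenario concrete, here is a minimal sketch of a reader that only parses the page source and never applies CSS. It uses Python's standard html.parser module; the LinkCollector class, the collect_links helper, and the sample markup are made up for illustration and say nothing about how any real search engine is built.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects every <a href> found in the markup. It never applies CSS."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, inline style or None)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.links.append((attr_map["href"], attr_map.get("style")))


# Hypothetical page: the second link is hidden from a browser user.
SAMPLE_HTML = """
<p><a href="/visible.html">Visible link</a></p>
<p><a href="/hidden.html" style="display: none">Hidden link</a></p>
"""


def collect_links(html):
    """Return all links present in the source, hidden or not."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for href, style in collect_links(SAMPLE_HTML):
        print(href, "| inline style:", style)
```

A source-only reader like this finds both links. Treating the hidden one differently would require the extra step of interpreting the style information.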

Thanks for any input!

Duncan Pollock
05-17-2004, 07:59 PM
Brian: You may find it useful to punch your URL in on the page you'll find at http://www.1-hit.com/all-in-one/tool.search-engine-viewer.htm
I think I was referred to this useful utility through a recent WPW posting by one of our colleagues, but I couldn't find that posting in a search I've just done.
In any case, 1-Hit.com offers numerous helpful tools, most of them for free.
It always amazes me how much help there is out there, and this is just one (excellent!) example.

Duncan

Mel
05-17-2004, 08:43 PM
Search engines do not render the code that they read.

Google, for instance, does not even use the page code when ranking documents. Instead it uses doclists (lists of the documents that contain a particular word) combined with PageRank, anchor text information, and many other factors. The basic information used to determine which set of documents is considered for ranking for a particular query is the doclists.

In brief, here is what happens:
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
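
The steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Google's code: each barrel is modeled as a dict mapping a wordID to a sorted list of docIDs, the lexicon is a plain dict, and the rank function is a hypothetical stand-in for PageRank, anchor text, and the other factors. Documents that match in both barrels are ranked once, which is also an assumption of this sketch.

```python
import heapq


def intersect(doclists):
    """Step 4: yield docIDs present in every doclist (each sorted ascending)."""
    if not doclists or any(not dl for dl in doclists):
        return
    positions = [0] * len(doclists)
    while True:
        # Stop as soon as we reach the end of any doclist (steps 6 and 7).
        if any(p >= len(dl) for p, dl in zip(positions, doclists)):
            return
        current = [dl[p] for p, dl in zip(positions, doclists)]
        target = max(current)
        if all(d == target for d in current):
            yield target
            positions = [p + 1 for p in positions]
        else:
            # Advance only the lists that are behind the largest docID.
            positions = [p + 1 if d < target else p
                         for p, d in zip(positions, current)]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                      # step 1: parse
    try:
        word_ids = [lexicon[w] for w in words]         # step 2: wordIDs
    except KeyError:
        return []  # an unknown word means no document matches all terms
    if not word_ids:
        return []
    matched = {}
    for barrel in (short_barrel, full_barrel):         # steps 3 and 6
        doclists = [barrel.get(wid, []) for wid in word_ids]
        for doc_id in intersect(doclists):             # steps 4 and 7
            if doc_id not in matched:
                matched[doc_id] = rank(doc_id, word_ids)  # step 5
    return heapq.nlargest(k, matched, key=matched.get)    # step 8


# Made-up data for illustration only.
LEXICON = {"search": 1, "engine": 2, "spam": 3}
SHORT_BARREL = {1: [2, 5], 2: [5, 9]}
FULL_BARREL = {1: [1, 2, 5, 7], 2: [1, 5, 7, 9], 3: [7]}
SCORES = {1: 0.2, 5: 0.9, 7: 0.5}


def toy_rank(doc_id, word_ids):
    """Hypothetical ranking: a fixed per-document score."""
    return SCORES.get(doc_id, 0.0)


if __name__ == "__main__":
    print(search("search engine", LEXICON, SHORT_BARREL, FULL_BARREL, toy_rank, k=2))
```

Note that nothing in this flow looks at the page markup at query time; it works entirely from the precomputed doclists.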

For more detailed information on how Google (as originally designed) basically works, see the original design paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine (http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm).

Google has of course doubtless modified and improved their rankings since then, but with specific reference to your question about display:none, IMO Google's algorithm does not include the ability to interpret this information at this time.

Note however that from time to time Google seems to run special "projects" which are designed to find and weed out specific types of spam that have become troublesome.