
Google Bug Skews URL



Garrett
04-30-2004, 08:49 AM
Astute WebProNews reader Diego Palacios Soto (http://www.rosanegra.org) noticed an interesting bug in Google's result pages. He wrote recently to let me know that "when you search C + nº + B (http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=C+%2B+n%C2%BA+%2B+B) you get an error in the way Google parse the URLs of the results. Check it out."

I sent the bug over to Jason Dowdell (http://www.globalpromoter.com/), who writes the airgin (http://www.airgin.com) blog and caught the Nissan Armada headrest monitor bug (http://www.webpronews.com/insiderreports/searchinsider/wpn-49-20040421GoogleBugExposed.html).

He dug into the error and found that it occurs when there is a letter "C" followed by more than one space and the letter "B."

The actual cause of the bug is a bit more difficult to pin down, however. Jason offered these speculations:

Could it be that Google is caching specific queries, keeping those highlighted results in its result set, and then reparsing the URLs and applying the highlighting again?

Not sure, but it's possible. That doesn't explain, though, why we don't see the same effect in the title of the site, since highlighting is applied there as well as in the main description of the result.

Also, why is it that reversing the order of the keyphrase causes this bug to arise?

He told his buddy at Google about the problem (as he told him about the other bugs he's found lately) and maybe someday we'll learn the truth.

Thanks, Jason (http://www.GlobalPromoter.com), for investigating this Google bug (http://www.airgin.com/archives/2004_04_25_index.cfm#108326090203868908). Check out this bug (http://www.airgin.com/archives/2004_04_18_index.cfm#108277359462532692) he found too.

jhales
04-30-2004, 11:48 AM
want to see a more interesting bug in google, try searching for cmd.exe

It doesn't work. Weird, huh?

Some of my coworkers just discovered this the other day.

abbeyinternet
04-30-2004, 12:02 PM
Astute WebProNews reader Diego Palacios Soto noticed an interesting bug in Google's result pages.

A sequence of characters that includes the "degree" symbol seems to have the ability to disable the escaping of HTML tags in the URL portion of each result.

The query "zºb (http://www.google.com/search?q=z%C2%BAb)" alone is enough to achieve this effect.

What's happening here is that "z" is just search text that gets highlighted wherever it is found in the results, while the control sequence "ºb" that follows it has the effect of unescaping the HTML "b" tag (i.e. bold) used to highlight each instance of the search text.

Thus, for more dramatic results use "e" as your search text, because this is the most common letter so it occurs more often in URLs: "eºb (http://www.google.com/search?q=e%C2%BAb)".
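For anyone who wants to check how these queries end up in the linked URLs, here is a minimal Python sketch using only the standard library. It percent-encodes the three queries mentioned so far and looks up the Unicode name of the symbol. Nothing in it is Google-specific; it only shows that the character in the links (%C2%BA) is U+00BA, which Unicode names the masculine ordinal indicator rather than the degree sign (U+00B0).

```python
import unicodedata
from urllib.parse import quote_plus

# The symbol used in the queries, written as an escape to avoid encoding mix-ups.
SYMBOL = "\u00ba"

# Queries quoted in this thread.
queries = ["C + n" + SYMBOL + " + B", "z" + SYMBOL + "b", "e" + SYMBOL + "b"]

# quote_plus turns spaces into '+' and percent-encodes everything else as UTF-8.
encoded = [quote_plus(q) for q in queries]
for q, e in zip(queries, encoded):
    print(q, "->", e)

# Official Unicode name of the symbol.
symbol_name = unicodedata.name(SYMBOL)
print(symbol_name)
```

The encoded forms should match the q= parameters in the search links above.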

Perhaps the most interesting aspect of this anomaly is the fact that the Bs in the unescaped HTML <b> tags get highlighted as though they are part of the URL that happens to match the "b" in the original query!

The results returned by the original query "C + nº + B" look strange because the query finds pages where the letters C, N and B occur individually and these letters are highlighted on the results page. However, closer inspection reveals that they are ordinary results, apart from the unescaped HTML tags in the URL.

The first letter can be anything except for a, i, q, or r. The letters on either side of the "degrees" character can be either uppercase or lowercase.

The trick doesn't work without the letter B at the end of the query, but that's only because the effect relies on the fact that the B tag is used to highlight instances of matched search text in the URL portion of the results.

This anomaly provides a curious insight into Google's results rendering mechanism. I wonder which specific component of Google's technology is affected by these characters. I suspect that it is something to do with an XML/XSL transformation engine, or an XML parser. Perhaps it can be pinpointed to an individual software component of SQL Server.

NickLilavois
04-30-2004, 12:05 PM
The bug happens with any search term found in a URL, followed by more than one space, and then a B.

It looks like it parses the URL to add bold tags for the first term you enter, and then parses the URL again to find the Bs and bold them. The problem is it is finding the Bs in the bold tags from the first pass and re-bolding them, creating a bunch of broken tags.
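The two-pass theory above can be sketched in a few lines of Python. This is only a guess at the mechanism, not Google's actual code: the function names, the example URL and the search terms are all made up for illustration. The first function highlights one term at a time, so the second pass sees the tags the first pass added. The second function is shown purely for contrast and matches all terms in a single pass.

```python
import re

def highlight(text, term):
    """Wrap every case-insensitive match of term in <b>...</b>."""
    return re.sub(re.escape(term),
                  lambda m: "<b>" + m.group(0) + "</b>",
                  text, flags=re.IGNORECASE)

def two_pass_highlight(url, terms):
    """Hypothetical buggy approach: one pass per term over the same string."""
    for term in terms:
        # Later passes also match letters inside tags added by earlier passes.
        url = highlight(url, term)
    return url

def single_pass_highlight(url, terms):
    """Contrast: match all terms at once, so inserted tags are never rescanned."""
    # Longest terms first so a short term cannot pre-empt a longer one.
    pattern = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.sub(pattern,
                  lambda m: "<b>" + m.group(0) + "</b>",
                  url, flags=re.IGNORECASE)

# Hypothetical URL and query terms, not taken from a real result page.
url = "www.example.com/internet/bandwidth.asp"
terms = ["internet", "b"]

broken = two_pass_highlight(url, terms)
clean = single_pass_highlight(url, terms)
print(broken)
print(clean)
```

In the two-pass output, the "b" inside each bold tag from the first pass gets wrapped in its own bold tag, which is the kind of broken markup described above.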

abbeyinternet
04-30-2004, 12:42 PM
want to see a more interesting bug in google, try searching for cmd.exe


Yes! In fact, any query containing "cmd.exe" seems to completely break Google, for instance "acmd.exe" or "cmd.exea". The server doesn't even respond, resulting in an apparent DNS error!

abbeyinternet
04-30-2004, 12:45 PM
The bug happens with any search term found in a URL, followed by more than one space, and then a B.


Interesting! I tried it with any letter followed by a tab instead of a space, and then b, and got the same result (http://www.google.com/search?q=w%09b)!

intelliot
05-01-2004, 02:20 AM
Why are you all getting mixed up in checking out spacing, tabs, degree symbols and whatnot?

It's fairly simple to reproduce this bug with nothing special (at least where I am).

In many cases, you can just type any word followed by b.

For example, try a Google search for internet b - for me, the first result's URL looks like

www.pcpitstop.com/internetb>/Bandwidth.asp

It's that easy - no degrees symbols, no special spacing.

It also seems obvious to me that it's not a matter of Google caching the URL or anything complex like that. Within a single search, the URL is parsed for the first search term (in this case, internet). It's adding <b> and </b> around the term. Then, when it goes through the URL a second time, it surrounds those b (bold) tags with additional bold tags - breaking the HTML.

----
Web Hosting, Free Forums & More (http://www.sizzly.com)
Elliot Lee's Blog (http://www.intelliot.com)

idamay
05-04-2004, 07:45 PM
It happens with almost every letter (other than a) followed by a b. It doesn't even need to be a word.