Alexa: how NOT to get crawled



minstrel
12-28-2003, 07:27 PM
From the Alexa website (http://pages.alexa.com/help/webmasters/index.html?p=Corp_W_t_40_L1#prevent):


Alexa Information for Webmasters

You may be wondering about "ia_archiver" and be curious about why it is visiting your site, or you may want to invite the robot to crawl your site. To block ia_archiver from crawling your site, please read below.

Additional information regarding our privacy policy, web crawling philosophy, and technology can be found on the following pages Privacy Policy and Technology. If you wish to change the contact information for your site, please visit our contact information editor. If you would like to suggest some Related Links, please visit our related link suggestion page.

The Alexa crawler (robot), which identifies itself as ia_archiver in the HTTP "User-agent" header field, uses a web-wide crawl strategy. Basically, it starts with a list of known URLs from across the entire Internet, then it fetches all local links found as it goes. There are several advantages to this approach, most importantly that it creates the least possible disruption to the sites being crawled.

We will not index anything you would like to remain private. All you have to do is tell us. How? By using the Standard for Robot Exclusion (SRE).

The SRE was developed by Martijn Koster at Webcrawler to allow content providers to control how robots behave on their sites. All of the major Web-crawling groups, such as AltaVista, Inktomi, and Google, respect this standard. Alexa Internet strictly adheres to the standard:

Whenever ia_archiver lands on the top level of a Web site, it looks for a file called "robots.txt". Robots.txt is a file website administrators can place at the top level of a site to direct the behavior of web crawling robots.

The Alexa crawler will always pick up a copy of the robots.txt file prior to its crawl of the Web. If you change your robots.txt file while we are crawling your site, please let us know so that we can instruct the crawler to retrieve the updated instructions contained in the robots.txt file.

After retrieving any HTML file, we check for the presence of the NOINDEX, NOARCHIVE, and NOFOLLOW tags in the "<head>" element of the document. If we find a NOINDEX or NOARCHIVE tag, we throw away the copy. If there is a NOFOLLOW tag, the robot will not follow any links found on that page. This allows users to control access to their own data, without needing their site administrators to update "robots.txt".
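As a rough illustration of the check described above, here is a minimal Python sketch that looks for robots directives in a page's meta tags. It assumes the directives arrive in the common `<meta name="robots" content="...">` form; the class and function names are made up for this example, and nothing here is Alexa's actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags.

    Simplification: this sketch does not verify that the tag sits
    inside the <head> element.
    """

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for token in attr_map.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html_text):
    """Return the set of lower-cased robots directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical page used only for illustration.
page = (
    '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW">'
    "</head><body></body></html>"
)
directives = robots_directives(page)

# Mirror the behaviour described in the quoted text.
keep_copy = not ({"noindex", "noarchive"} & directives)
follow_links = "nofollow" not in directives
```

With the sample page above, a crawler following the quoted rules would discard its copy and would not follow any links on the page.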

To exclude all robots, the robots.txt file should look like this:

User-agent: *
Disallow: /

To exclude just one directory (and its subdirectories), say, the /images/ directory, the file should look like this:

User-agent: *
Disallow: /images/

Web site administrators can allow or disallow specific robots from visiting part or all of their site. Alexa's crawler identifies itself as ia_archiver, and so to allow ia_archiver to visit (while preventing all others), your robots.txt file should look like this:

User-agent: ia_archiver
Disallow:

To prevent ia_archiver from visiting (while allowing all others), your robots.txt file should look like this:

User-agent: ia_archiver
Disallow: /

For more information regarding robots, crawling, and robots.txt visit the Web Robots Pages at www.robotstxt.org, an excellent source for the latest information on the Standard for Robots Exclusion.
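One way to sanity-check rules like the ones above before uploading them is Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper name `can_crawl`, the `www.example.com` URLs, and the "Googlebot" agent string are placeholders, and Python's parser is not guaranteed to behave exactly as Alexa's crawler does.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The "keep ia_archiver out, allow everyone else" file from the quoted page.
BLOCK_ALEXA_ONLY = """\
User-agent: ia_archiver
Disallow: /
"""

# The "exclude one directory for all robots" file from the quoted page.
BLOCK_IMAGES = """\
User-agent: *
Disallow: /images/
"""

# The "allow ia_archiver" file exactly as quoted (empty Disallow).
ALLOW_ALEXA_AS_QUOTED = """\
User-agent: ia_archiver
Disallow:
"""

home = "http://www.example.com/"
image = "http://www.example.com/images/logo.gif"

print(can_crawl(BLOCK_ALEXA_ONLY, "ia_archiver", home))
print(can_crawl(BLOCK_ALEXA_ONLY, "Googlebot", home))
print(can_crawl(BLOCK_IMAGES, "ia_archiver", image))
print(can_crawl(ALLOW_ALEXA_AS_QUOTED, "Googlebot", home))
```

Under Python's parser, the last file allows ia_archiver but does not, on its own, keep other robots out, because a robot with no matching record is allowed by default. Blocking everyone else would presumably need an additional `User-agent: *` record with `Disallow: /`.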

Fill out the form below to be crawled by Alexa.

There are a few reasons that Alexa may not have visited your site. Perhaps your site is new or we haven't discovered any links on the web that lead to your site. Or perhaps we haven't had any Alexa users visit your site. It is also possible that your web site administrator has disallowed crawlers from visiting your site - please read the information about robots.txt that we have provided above.

In any event, simply by visiting your site with the Alexa Toolbar open, Alexa will learn of your site and add it to our list of sites to visit, thus ensuring your inclusion in the Alexa service and in the Alexa archive.

ronniethedodger
12-28-2003, 08:06 PM
This passage was interesting:


The Alexa crawler will always pick up a copy of the robots.txt file prior to its crawl of the Web. If you change your robots.txt file while we are crawling your site, please let us know so that we can instruct the crawler to retrieve the updated instructions contained in the robots.txt file.


Does this mean they only check the robots.txt once? That is it? If you change it to exclude the Alexa crawler after it has already visited your site (check your logs, too) then you have to inform them that it has changed?

I am sure they do check for the file. It is just that the wording stumps me a little when I look at it.

This passage is worth noting too:


In any event, simply by visiting your site with the Alexa Toolbar open, Alexa will learn of your site and add it to our list of sites to visit, thus ensuring your inclusion in the Alexa service and in the Alexa archive.

Basically, look at it from the opposite point of view too.

For anything that you DO NOT want them knowing about, make sure that you have the Alexa Toolbar disabled.

Here is an example --> http://info.alexa.com/data/details?amzn_id=alexa65-tb-20&url=http://12.10.96.163

This is Alexa data on an old IP address of mine which has not existed since August of this year. I had the Alexa bar open on my browser while I was testing pages on my local Apache Server.

Well, Alexa came in that very day (I checked my logs) and indexed quite a few things...some of it I did not want out there, if you know what I mean.

Although my IP address is dynamic and changes after some length of time, it is still a little worrisome to have to remember to disable the toolbar every time I do testing.

I guess I should put a robots.txt file in my Server root. But judging by the first paragraph I cited above, it makes me wonder about that even.

Duncan Pollock
12-29-2003, 12:30 PM
This isn't the first time I've run across this "Thou shalt not crawl" idea and it has me puzzled. Surely, the object of the exercise is to be included in search results, so why would anyone tell a search engine to ignore their site?
I must be missing something.
Can someone "unpuzzle" me, please?

Duncan

minstrel
12-29-2003, 12:40 PM
There may be pages or subdirectories on any website that either are not for general publication (e.g., "private" directories containing articles, pictures, or files that are intended for a restricted population - put there for the convenience of family members, or friends, or for example members of WebProWorld), or pages that are in progress and not yet ready for public viewing, such as revisions to the site.

In such cases, you may well want to tell spiders to ignore those pages or directories.

awall19
12-29-2003, 01:33 PM
One of the first places people can look for secure information is the robots.txt file. Therefore it's a bad call to protect secure information that way...

One would be better off password protecting an area or leaving the information off the web.

When robots first came out they were not very sophisticated and crashed many servers...thus the robots.txt file had the main goal of allowing you to kick out offensive spiders that may harm your site or server.
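To make the point about robots.txt being readable by anyone concrete, here is a small Python sketch that lists every path a robots.txt file asks robots to stay away from. The function name and the directory names in the sample are invented for illustration.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in a robots.txt string."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value".
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


# Hypothetical file; the directory names are made up.
sample = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/new-site/
"""

print(disallowed_paths(sample))
```

Anyone can fetch the file and run something like this, so a Disallow line works as a signpost to the very directory it names. It is a request to well-behaved robots, not access control.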

minstrel
12-29-2003, 01:51 PM
I agree completely, Aaron.

I wasn't talking about securing information per se. As an example, I've posted a couple of freeware or shareware programs on my site for other users, to make it easier to download them. However, these aren't my programs and, although I'm not doing anything illegal and I don't need them password-protected, I don't want them indexed by search engines, which might give the impression that I'm some sort of distributor and/or increase my bandwidth usage for things that have nothing to do with my site. Therefore, I dump them into a separate and generally inactive subdirectory where I don't expect people to be normally looking - which reminds me I probably need to update my robots.txt file...

ronniethedodger
12-29-2003, 02:22 PM
The robots.txt file works on the assumption that any spider will obey the exclusions in it. This is why you see the clause in Alexa's Terms, for example, which states that they do follow the rules. They could just as well ignore them if they chose and not even mention it in their Terms.

The paragraph at Alexa is confusing too.

In your case, Minstrel...it is not enough to just write the exclusion. Password protecting the directory would probably be a better idea. You can set it to username: guest -- password: guest and relay those instructions to your visitors in some fashion. Any spider can ignore the robots.txt file, but I don't think they are smart enough (yet) to read the instructions.

For the most part though...the spiders that matter the most, the more legit ones anyway, do follow the robots.txt file. They will play by the rules and follow only what you want them to. But even so, if parts of your site will have a profound effect on it...like a bunch of unrelated content...it might be better to take more precautions against having it crawled.

awall19
12-29-2003, 02:43 PM
Alexa will follow robots.txt and would never change that policy. Many people already view their toolbar as spyware...the last thing they would ever want to do is ignore web standards as well. If you search on Google for alexa (http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=alexa) you will see that people are advertising that it is spyware.

minstrel
12-29-2003, 02:49 PM
That's one of the reasons I posted this, Aaron - I think it's interesting that they went to the trouble to post this on their website and I think it is indeed an effort to reassure people that they are not trying to spy on them secretly.

ronniethedodger
12-29-2003, 03:08 PM
I think it is indeed an effort to reassure people that they are not trying to spy on them secretly.

So...they are spying on us out in the open then? ;0)

minstrel
12-29-2003, 03:16 PM
I think it is indeed an effort to reassure people that they are not trying to spy on them secretly.
So...they are spying on us out in the open then? ;0)

Well, actually yes. So is the Google toolbar if you enable those "Page Information" options, but both now tell you that in plain language - I think when Alexa first appeared on the scene, they perhaps didn't make it clear enough and hence got their reputation as spyware. The other thing that's different about Alexa is that they started out that way, with their prime purpose to track people's surfing habits - in Google's case, the "spyware" features are (1) entirely optional and (2) less obviously intrusive.

ronniethedodger
12-29-2003, 04:30 PM
in Google's case, the "spyware" features are (1) entirely optional and (2) less obviously intrusive.

One of those features I just learned about here at WPW from another member.

He/she (don't remember who exactly) pointed out that one of the features enables the Googlebar user to vote on the currently viewed webpage. Enabling this feature displays a yellow happy face and a blue grumpy face on the toolbar. Voting is very simple, just click on one of the faces.

I don't know how Google uses this information, for it does not really tell you that. But I can tell you that I do click on that blue face quite a bit....ain't gonna tell you when or why I click on it though. ;0)

neophytemedia
12-30-2003, 04:01 AM
Just wanted to mention this.
It all started in Feb. '99 and I remember pretty well that receiving a lot of unsolicited emails at that time or having your browser redirected whenever AltaVista or Yahoo couldn't find your search query, did not have the word "spam" associated with it. At that time Google, I think, was the pioneer of such actions, though everybody nowadays is complaining about it (see Xupiter). My point is: it seems that the big guys are either "forgotten" or forgiven for their methods of advertising their services while smaller search engines are monitored every step of the way. Anyway, who could ban Google? Or shut it down? So...I guess that playing fair is far from what Google had in mind at that time. And now they pose as having the most fair & accurate page ranking system?
Thanks for your time...
:)
Tony,
http://www.neophytemedia.com

ronniethedodger
12-30-2003, 12:06 PM
It all started in Feb. '99 and I remember pretty well that receiving a lot of unsolicited emails at that time or having your browser redirected whenever AltaVista or Yahoo couldn't find your search query, did not have the word "spam" associated with it.

Tony - I think that had more to do with Internet Explorer and how its Search worked in the browser. There are other sites out there (Comet Search comes to mind) that take over your browser's search functions also. I do not ever remember Google being that way, but back then I was using either Yahoo or WebCrawler (or was it MetaCrawler...don't remember which).

neophytemedia
12-30-2003, 06:01 PM
To make it more clear:
First of all, it had nothing to do with your browser. It could have been IE or Netscape. An adware or spyware program is installed on your computer. Whenever you perform a search query on one of your favorite SEs, a pop-up window opens with a completely different search engine showing the results found.
Back then these actions were probably overlooked and definitely not considered spam. I'm 100% sure that Google did sponsor some of those early spam campaigns, as I experienced it myself. It all lasted a few months, maybe half a year. And believe me, it was enough to make their service popular. After that it died down until it stopped. My only point was that in the early days these actions never had repercussions for the companies involved, most probably because people lacked experience in tracking these actions or because high interests were at stake. :)

Tony,
http://www.neophytemedia.com