How to stop Google from crawling secure content/directory?
Ozzman
01-02-2008, 06:37 AM
My post title states my question clearly :). But let's repeat it: how can we stop Google's bots from crawling secure (not-willing-to-share) information in one of a website's directories?
Can we use robots.txt for this, or would you suggest a better approach?
If Google crawls secure information from a website and shows it in search results, what can we do to have that information removed from Google's search results?
Your co-operation is appreciated in advance.
thindenim
01-02-2008, 07:13 AM
Add the following to your robots.txt file, where directory is the area you want to disallow. This tells compliant robots not to crawl this section.
User-agent: *
Disallow: /directory/*
Alternatively, you can use the meta noindex tag on each page:
<meta name="robots" content="noindex, nofollow">
which means that the page is not indexed and its links are not followed, or:
<meta name="robots" content="noindex, follow">
which means that the page is not indexed, but any links are followed.
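One way to sanity-check a robots.txt rule like the one above before deploying it is to run it through a parser. The following is a minimal sketch using Python's standard urllib.robotparser module; the directory name and page paths are placeholders, not taken from any real site. As far as I can tell, this parser matches rules by plain path prefix and does not expand a * inside a path, so the sketch uses the prefix form "Disallow: /directory/", which covers the same URLs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; "/directory/" is a placeholder path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /directory/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if the rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Report the verdict for one blocked and one unblocked placeholder path.
    for path in ("/directory/private.html", "/public/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", path))
```

This only shows what a compliant crawler would do; it does nothing to stop a crawler that ignores robots.txt.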
fernimac
01-02-2008, 07:57 AM
The previous post was very clear, although I would add that if your pages are in a secure area, Google should not be able to get to those pages. Even if you include a noindex, nofollow directive, users could still get to those pages. You should implement secure access via a password instead, so that no search engine and no undesired users get to your private pages.
thindenim
01-02-2008, 11:47 AM
Good point fernimac, you should always password protect sensitive information.
Palindrome
01-02-2008, 03:42 PM
Hi
All good advice has gone before me.
If you are taking the robots.txt route, it is best to use the meta tag as well. If a link to a page somehow exists, or is created by accident, that page could still be crawled regardless of robots.txt.
Peter (IMC)
01-02-2008, 03:51 PM
This is one of those philosophical questions. Googlebot is nothing more than a normal visitor who, when asked, will refer others to those pages.
If you allow visitors to that part of the site without the need to log in, how are you going to prevent them from referring their friends to that part of the site?
If you really want "secure (not-willing-to-share) information" to be available to only those that you choose, you need to password protect it.
robots.txt and other "just for the search engines" methods aren't the way to go, because they aren't meant to block "not allowed" visitors.
Jean-Luc
01-02-2008, 03:59 PM
robots.txt and meta tags are not appropriate ways to protect confidential information. Confidential pages should be password protected.
robots.txt and meta tags are only meant to pass information to well-intended robots and search engines. Some ill-intended bots will use them to detect potential weaknesses in your web site.
If your private pages are already in Google, visit How can I prevent my own content from being indexed or remove content from Google's index? (http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35301) (see the part about expediting removal).
Jean-Luc
adverlicious
01-02-2008, 04:18 PM
Google provides definitive instructions for: (1) removing sensitive pages from Google's search results, and (2) preventing pages from being indexed by them in the first place:
Preventing content from appearing in Google search results (http://www.google.com/support/webmasters/bin/topic.py?topic=8459)
Note that, with the exception of Google, Yahoo!, MSN, and Ask, you should expect both your meta tags and requests for page removals to essentially be ignored. Worse, many of these "rogues" are based overseas where you'll have little or no legal recourse in the event of a problem.
Relatedly, don't assume that your private pages are "safe" from being Google-indexed because you have no links pointing to them or you haven't submitted them. Search engines have many, many ways of finding new pages -- e.g. a visitor's Google toolbar auto-submitting them, a competitor submitting them for you, links from sites you don't control, etc.
Confidential info should always, always be password protected -- or never placed online at all.
edhan
01-02-2008, 09:49 PM
Yes. The best solution is password protection for that directory. Using nofollow or noindex may stop the crawlers, but humans can still access the pages.
incrediblehelp
01-03-2008, 08:08 AM
Sure, using password protection is best, but with some password protection scripts you may still be able to link directly to certain pages. You can also disallow directory listings through the .htaccess file on Apache web servers.
johnehogan
01-03-2008, 08:55 AM
As a systems admin, I agree that using a simple .htaccess file with a corresponding .htpasswd file is the ONLY way to protect a sensitive Directory (or folder) against prying eyes. This is easy to do on Linux boxes. The folder CAN still be accessed by those who KNOW what the password is for the folder, but search engines and others will be locked out totally :)
kurt.santo
01-03-2008, 02:16 PM
johnehogan,
Great input!
I have worked with .htaccess, but not with .htpasswd. Would you have an entry in .htaccess naming the protected directory, and then store the usernames and passwords in .htpasswd? If so, how do you do this in detail?
Thank you,
Kurt
Everything that you put on the web might be found by a human or robot visitor... So do NOT place any "really needs to be secure" information on the web.
robots.txt is NOT a good solution: it just says "this is a secret area, PLEASE don't come". Well-behaved spiders will respect your secret; bandit spiders will too, but only after exploring this advertised secret area first!
So a bare minimum would be the robots.txt exclusion AND an index.htm file in the directory that silently redirects to your homepage (no sound, no noise, do not alert the bandits), maybe with a 301 redirect.
Better is HT protection with .htaccess and its password file (usually named .htpasswd, but this name is not fixed).
Best is: don't put it on the web.
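On the .htaccess question asked above: here is a minimal sketch, in Python, of what the two files could contain. Assumptions: Apache with Basic authentication available; the path /home/example/.htpasswd, the realm name, and the user alice are made up for illustration; and the {SHA} entry format is used only because it is easy to reproduce with the standard library. That format is unsalted and weak, so in practice the htpasswd tool that ships with Apache, with a stronger scheme, is the better choice.

```python
import base64
import hashlib


def build_htaccess(passwd_file: str, realm: str = "Restricted area") -> str:
    """Return .htaccess text that asks Apache for a valid login."""
    return (
        "AuthType Basic\n"
        f'AuthName "{realm}"\n'
        f"AuthUserFile {passwd_file}\n"
        "Require valid-user\n"
    )


def htpasswd_sha1_line(user: str, password: str) -> str:
    """Return one password-file line in Apache's {SHA} format (weak, unsalted)."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    return f"{user}:{{SHA}}{base64.b64encode(digest).decode('ascii')}"


if __name__ == "__main__":
    # Hypothetical path and credentials, for illustration only.
    print(build_htaccess("/home/example/.htpasswd"))
    print(htpasswd_sha1_line("alice", "change-me"))
```

The .htaccess file goes in the directory to be protected; the password file is best kept outside the web root so that it cannot be downloaded.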
Webnauts
01-03-2008, 08:32 PM
I think I have a solution for you: Preventing Search Engine Indexing of Secure Pages - SEO Workers (http://www.seoworkers.com/seo-articles-tutorials/robots-and-https.html)
g3 creative
01-04-2008, 04:11 AM
To stop Google from crawling secure content you should always use password protection as standard.
Dave Mac
G3 Creative
Webnauts
01-04-2008, 05:25 AM
To stop Google from crawling secure content you should always use password protection as standard.
Dave Mac
G3 Creative
And what if you do not want the pages to be password protected, but still don't want those pages to be crawled?
I provided a solution above (my tutorial), but it seems to have been ignored.
Peter (IMC)
01-06-2008, 12:10 AM
but it seems to have been ignored.
oh stop crying every time you don't get a standing ovation. Your posts are read and you just have to get used to the fact that most people don't fall on their knees to thank you.
You should watch the movie "the secret". Even though I think it's a dumb ass movie, they are right that if all you want to see is negative, all you will see is negative.
elizas
06-09-2010, 05:42 AM
There are times when you wouldn't want search engines to index your web page, but how do you go about preventing it? There are a number of ways to make sure that your web page is not found by the search bots; using meta tags is one of them. Meta tags are tags that provide detailed instructions regarding the web page to the search engines.
To make sure that a particular web page is not indexed, use the "NOINDEX" value, and to prevent bots from following links from the page, use the "NOFOLLOW" value, in a robots meta tag placed between the <HEAD> and </HEAD> tags of your HTML.
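To confirm that a page really carries the tag, something like the following sketch can help. It uses Python's standard html.parser module to collect the directives from any robots meta tag in a page; the sample HTML and the names RobotsMetaParser and robots_directives are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page, for illustration only.
    page = '<html><head><meta name="robots" content="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```

As with robots.txt, the tag is only a request to well-behaved crawlers; it does not keep anyone out of the page.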
Everything that you put on the web might be found by a human or robot visitor... So do NOT place any "really needs to be secure" information on the web.
robots.txt is NOT a good solution: it just says "this is a secret area, PLEASE don't come". Well-behaved spiders will respect your secret; bandit spiders will too, but only after exploring this advertised secret area first!
So a bare minimum would be the robots.txt exclusion AND an index.htm file in the directory that silently redirects to your homepage (no sound, no noise, do not alert the bandits), maybe with a 301 redirect.
Better is HT protection with .htaccess and its password file (usually named .htpasswd, but this name is not fixed).
Best is: don't put it on the web.
Agreed. There is much misinformation in this thread.
Password protection is not secure. (Why and how are the Google and Yahoo bots registered as members of my forums?)
Putting your site on a secure server is not secure.
Redirects are not secure.
Study http://curl.haxx.se/ (There is PHP cURL) and buy this http://www.schrenk.com/nostarch/webbots/ book and you will understand why. I have an article in Norwegian about the subject.
Assumption: Google has secret / hidden / masked bots or user agents.
Any reason to believe otherwise? See http://www.webproworld.com/webmaster-forum/threads/101746-Police-to-investigate-Google-street-view-info-gathering?p=515778&viewfull=1#post515778
This thread made me start the thread Is your online password protected database on a secure server really secure? (http://www.webproworld.com/webmaster-forum/threads/101747-Is-your-online-password-protected-database-on-a-secure-server-really-secure?p=515792&viewfull=1#post515792)
elizas
06-14-2010, 06:29 AM
There are times when you wouldn't want search engines to index your web page, but how do you go about preventing it? There are a number of ways to make sure that your web page is not found by the search bots; using meta tags is one of them. Meta tags are tags that provide detailed instructions regarding the web page to the search engines.
To make sure that a particular web page is not indexed, use the "NOINDEX" value, and to prevent bots from following links from the page, use the "NOFOLLOW" value, in a robots meta tag placed between the <HEAD> and </HEAD> tags of your HTML.
Eliza
The original heading was:
"How to stop Google from crawling secure content/directory?"
How does GoogleBOT register as a member on my forums?
In addition:
The default is, as far as I know:
Meta tags: No meta tags. The bots are allowed to index and archive content and follow links.
robots.txt: No robots.txt. Bots are allowed to visit every part of your site.
.htaccess: You have full control over known bots.
Should the default have been the opposite, so that you have to explicitly allow access to your site and your content?
That should be the easiest way to get control over the known scrapers.
Related links:
The Deep Web and the Surface Web. What does it mean? (http://www.webproworld.com/webmaster-forum/threads/96112-The-Deep-Web-and-the-Surface-Web.-What-does-it-mean?p=492201&viewfull=1#post492201)
Is your online password protected database on a secure server really secure? (http://www.webproworld.com/webmaster-forum/threads/101747-Is-your-online-password-protected-database-on-a-secure-server-really-secure?p=515792&viewfull=1#post515792)
You can simply register a bot manually. Once it has access, it can collect all the content on your site without you noticing anything. If you use a good database platform, you have some level of control, such as advanced data encryption and a view of what each visitor did. There may be unknown bots and spiders that have indexed and archived much more of the content in the deep web than the known search engine bots have.
Norwegian defense has discovered this recently.
Source: http://www.webproworld.com/webmaster-forum/threads/101645-What-do-you-know-There-IS-a-duplicate-content-penalty!-A-Google-employee-says-so...?p=515908&viewfull=1#post515908