Having an MA in Library and Information Science, I've always viewed the potential of the web as being the World Wide Library rather than just the World Wide Web. So, I was delighted by Google's news release and have spent a bit more time looking into it.
1) The books to be scanned are all in the public domain, i.e., out-of-copyright.
2) USA Copyright Laws: Copyright protection is afforded for a much longer period than patent protection. For works created after 1977, copyright lasts for the life of the author plus 70 years. Under the Copyright Term Extension Act (CTEA), otherwise known as the Sonny Bono Copyright Term Extension Act, for works published before 1978 with existing copyrights as of the CTEA's effective date, the term is extended to 95 years from publication.
However, any work published in the United States before 1923 is now in the public domain (this includes, for instance, the works of Mark Twain).
International Copyright Laws: There is no such thing as an "international copyright" that will automatically protect an author's writings throughout the world. Protection against unauthorized use in a particular country basically depends on the national laws of that country. However, most countries offer protection to foreign works under certain conditions, which have been greatly simplified by international copyright treaties and conventions. There are two principal international copyright conventions: the Berne Convention for the Protection of Literary and Artistic Works (Berne Convention) and the Universal Copyright Convention (UCC).
3) Currently, Google has signed agreements with five libraries. Harvard has had a library since its founding in 1636 and is one of the primary archives of Colonial America. Oxford University's Bodleian Library traces its origins to the 1400s. The other libraries are those of Stanford University, the University of Michigan, and the New York Public Library.
While two pilot projects have been tested, it will be approximately two years before a substantial archive of materials is available. Currently, the aim is to get the 19th-century books online first and then expand deeper into the archives.
4) Google has two book projects – the "Print" project is a growing library of copyrighted materials, access to which requires paying a fee, while the library project covers public-domain materials (non-copyrighted due to age) and will be free to the public.
5) They have recently rolled out "Google Scholar," which indexes scholarly publications: usually very subject- or topic-specific periodicals, found mostly in university libraries.
6) I caught the Google Guys biography last night, which was current to about September 2004. The Google founders view themselves as librarians; that view is central to their entire mission statement and the cause of much of the chaos that the paid-for crowd is struggling with. Their original goal was, and remains, to access the huge volume of web pages as quickly and as relevantly as possible.
Part of the reason they went public with a stock offering this past summer was to fund their library project and move it forward.
7) Their original algorithm was based on authoritative sites (sites with the most incoming links). While they've adapted and changed the algorithm since, this idea remains primary to their outlook.
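To make the idea concrete, here is a minimal sketch of link-based ranking in that spirit (my own illustrative Python on a toy three-page web; the page names and damping value are assumptions, and this is not Google's actual implementation):

# Toy link-based ranking: each page's score is fed by the scores of
# the pages linking to it, so well-linked "authoritative" sites rise.
def rank_pages(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links)
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # every page keeps a small baseline score...
        new_scores = {page: (1.0 - damping) / n for page in pages}
        # ...and passes the rest of its score along its outgoing links
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * scores[page] / len(outlinks)
            for target in outlinks:
                new_scores[target] += share
        scores = new_scores
    return scores

web = {
    "authority": ["blog"],
    "blog": ["authority"],
    "newcomer": ["authority"],
}
print(rank_pages(web))  # "authority" ends up with the highest score

The page with the most incoming links ends up ranked highest, which is exactly the "authoritative site" intuition.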
They are battling multiple issues due to the growth of the net. One is that they aren't nearly as commercial-site leaning as businesses would like, partially because businesses are selfish by nature and want high rankings regardless of genuine content. All of the chaos that ensued a year ago when the "Florida" algorithm update rolled out came from an attempt to clean out many of the sites that had cheated their way into high rankings.
There will always be tension between the commercial aspect of the net versus the library content aspect.
They are, however, working to develop Froogle, which is merchant-based and free. They also offer Catalogs, which is likewise business-based, along with localized and regional content. In other words, they are working on the commercial-site paradigm.
Many of the complaints from individual website developers don't seem to take into consideration just how much time and effort, content, incoming links, etc., have accrued to the longer-established sites. With over 8 billion web pages indexed and the average person's vocabulary consisting of 5,000 to 6,000 words, the content of a site has no choice but to be on a collision course with many common search keywords, as the quick calculation below shows.
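A rough back-of-envelope calculation (my own arithmetic, using the figures above) makes the point:

# Roughly 8 billion indexed pages competing over a core vocabulary
# of about 6,000 common words (illustrative numbers only).
indexed_pages = 8_000_000_000
common_words = 6_000
print(indexed_pages // common_words)  # 1,333,333 pages per common word

On average, well over a million pages end up competing for every common word.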
Take, for example, a single simple word: baby
It breaks into three primary usages: as an infant, as a derogatory term, and as a term of endearment. Here is the M-W thesaurus entry for that term:
Entry Word: baby
1. a very young child, especially in the first year of life <the love of a mother for her baby>
   Synonyms: babe, bantling, infant, neonate, newborn
   Related Words: bambino, little one, toddler, tot; nursling, suckling, weanling; bratling
   Idioms: babe in arms
2. Synonyms: WEAKLING, doormat, invertebrate, jellyfish, milksop, Milquetoast, mollycoddle, namby-pamby, pantywaist, sissy
3. Synonyms: GIRL FRIEND 2, beloved, flame, honey, inamorata, ladylove, steady, sweetheart, sweetie, truelove
Yet "baby" will remain the most common of these words used as a search keyword, followed by "infant."
All common-language keywords are similar. One can change the content of a site to reflect a different word choice; however, how many people actually use that alternative as a keyword when they search?
I think that Google does an incredible job. I'm heartened by their constant expansion of services and different methods of helping people to search for the content that they are interested in. I'm an absolute fan of their proposed library offering.