Google (search engine)
- This article is about the search engine. For the corporation, see Google Inc. For more technical information about the platform, see Google platform.
Google is a search engine owned by Google Inc. whose mission statement is to "organize the world's information and make it universally accessible and useful." The largest search engine on the web, Google receives over 200 million queries each day through its various services.
In addition to its web page search, Google provides services for searching images, Usenet newsgroups, news websites, videos, maps, local listings, and items for sale online. As of June 2005, Google has indexed 8.05 billion web pages, 1.3 billion images, and over one billion Usenet messages, approximately 10.4 billion items in total. It also caches much of the content that it indexes. Google operates other tools and services, including Google News, Google Suggest, Froogle, and Google Desktop Search. See list of Google services and tools for a complete list.
History
The Google search engine began as a research project in early 1996 by Larry Page and Sergey Brin, two Stanford University graduate students who developed the theory that a search engine based on a mathematical analysis of the relationships between websites would produce better results than the basic techniques then in use. It was originally nicknamed "BackRub" because the system checked backlinks to estimate a site's importance.
Convinced that the pages with the most links to them from other highly relevant web pages must be the most relevant ones, Page and Brin decided to test their thesis as part of their studies, and laid the foundation for their search engine. The web site called "Google!" (with an exclamation mark) went live at the domain name google.com. They formally founded their company of the same name, Google Inc., on September 7, 1998, in a friend's garage in Menlo Park, California. Because Brin had little interest in writing the HTML code used to design web pages, the site was given a minimal interface.
Google introduced advertisements in 2000. They were sold by keyword so that they would be more relevant to the end user, and they were text-based to reduce loading time and keep the page uncluttered. In September 2001, Google's ranking mechanism PageRank was awarded a U.S. patent. The patent was officially awarded to Stanford University and lists Lawrence Page as the inventor. At its peak in early 2004, Google handled upwards of 80 percent of all search requests on the Internet through its website and clients like Yahoo!, AOL, and CNN 1. Google's share of web search fell in 2004 when Yahoo! dropped Google's search technology in favor of its own.
The Google search site includes humorous features such as cartoon modifications of its logo for special occasions (called "Google Doodles"), the option to display the site in fictional or humorous languages such as Klingon and Leet, and April Fools' Day jokes about the company.
It has been conjectured that Google's future lies in personalized search, using the data gathered from its Orkut, Gmail, and Froogle products to give results based on an individual's previous actions. In fact, there is a Personalized Google Search Beta in Google Labs, the experimental section of the site 2.
The name "Google"
Etymology
The name "Google" is an accidental misspelling of the word googol, which was coined in 1938 by Milton Sirotta, nephew of mathematician Edward Kasner, to refer to the number represented by 1 followed by a hundred zeros, <math>10^{100}</math>. Google's use of the term reflects the company's mission to organize the immense amount of information available on the Web.
Trademark and domain names
"To google," as a verb, has come to mean "to search for something on Google." Because of Google's popularity (52 percent of all web searches in January 2005 3, having been as high as 80 percent), it has also come to mean, generically, "to search the web." Google officials have discouraged this use of the company's name for fear of trademark dilution, since it could lead to the name becoming a genericized trademark.
To prevent domain hijacking by unaffiliated third parties, Google has purchased the redirecting rights to several similar-sounding domain names, such as gogle.com and googel.com. See external links below for other domain names owned by Google. The registration of other domain names to prevent hijacking and for humorous purposes is by no means restricted to Google.
The search engine
Index size
- ~ 1998: ~ 25,000,000
- August 2000: 1,060,000,000
- January 2002: 2,073,000,000
- February 2003: 3,083,000,000
- September 2004: 4,285,000,000
- November 2004: 8,058,044,651 web pages, 880,000,000 images, 845,000,000 Usenet messages, 4,500 news sources
- June 2005: 8,058,044,651 web pages, 1,187,630,000 images, 1 billion Usenet messages, 6,600 print catalogs, 4,500 news sources
(source: Internet Archive 4, GoogleBlog 5, Google Groups 6, Google Catalogs 7)
Physical structure
Google employs data centers full of computers running a custom Red Hat Linux in several locations around the world to respond to search requests and to index the web. The server farms in the data centers are built using a shared-nothing architecture. The indexing is performed by a program named Googlebot, which periodically requests new copies of web pages it already knows about. The more often a page updates, the more often Googlebot will visit. The links in these pages are examined to discover new pages to be added to its internal database of the web. This index database and web page cache are several terabytes in size. Google has developed its own file system, called Google File System, for storing all this data.
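The link-discovery step described above can be illustrated with a small sketch. It is not Googlebot's actual code, which is not public; the class and function names, the in-memory dictionary standing in for fetched pages, and the example URLs are all assumptions made for illustration. It uses only Python's standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def discover_new_pages(known_pages):
    """Return URLs linked from known pages that are not yet known.

    known_pages maps URL -> HTML text (a stand-in for fetched copies).
    """
    discovered = set()
    for url, html in known_pages.items():
        parser = LinkExtractor(url)
        parser.feed(html)
        discovered.update(link for link in parser.links if link not in known_pages)
    return discovered


# Hypothetical two-page "web" used only for illustration.
pages = {
    "http://example.com/": '<a href="/about.html">About</a> <a href="http://example.org/">Other</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
}
new_pages = discover_new_pages(pages)
print(new_pages)
```

A real crawler would also fetch the pages over the network, schedule revisits according to how often each page changes, and respect site exclusion rules; none of that is modelled here.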
The exact size and whereabouts of the data centers Google uses for its search engine are unknown, and official figures remain intentionally vague. According to John Hennessy and David Patterson's Computer Architecture: A Quantitative Approach, Google's server farm computer cluster in the year 2000 consisted of approximately 6,000 processors and 12,000 common IDE disks (one processor and two disks per machine) at four sites: two in Silicon Valley, California, and two in Virginia. Each site had an OC-48 (2488 Mbit/s) Internet connection and an OC-12 (622 Mbit/s) connection to other Google sites. The connections were eventually routed down to 4 x 1 Gbit/s lines connecting up to 64 racks, each rack holding 80 machines and two Ethernet switches. Google has almost certainly changed and enlarged its network architecture dramatically since then.
Based on the Google IPO S-1 form released in April 2004, Tristan Louis, the Vice President of application development for the Internet unit of a large financial firm, estimated the current server farm to contain something like the following 8:
- 719 racks
- 63,272 machines
- 126,544 CPUs
- 253,088 GHz of processing power
- 126,544 GB of RAM
- 5,062 TB of hard drive space
According to this estimate, the Google server farm constitutes one of the most powerful supercomputers in the world. At 126-316 teraflops, it can perform at over one third the speed of the Blue Gene supercomputer, which is currently the most powerful computing machine available to humanity.
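The per-machine configuration implied by the estimate above can be recovered with simple division. The sketch below uses only the six totals listed; the variable names are illustrative, and the assumption that a terabyte means 1,000 GB is ours, not the estimate's.

```python
# Totals taken from the estimate listed above.
racks = 719
machines = 63_272
cpus = 126_544
total_ghz = 253_088
ram_gb = 126_544
disk_tb = 5_062

machines_per_rack = machines / racks
cpus_per_machine = cpus / machines
ghz_per_cpu = total_ghz / cpus
ram_gb_per_machine = ram_gb / machines
# Assumption: decimal terabytes (1 TB = 1000 GB).
disk_gb_per_machine = disk_tb * 1000 / machines

print(f"machines per rack:  {machines_per_rack:.0f}")
print(f"CPUs per machine:   {cpus_per_machine:.0f}")
print(f"GHz per CPU:        {ghz_per_cpu:.0f}")
print(f"RAM per machine:    {ram_gb_per_machine:.0f} GB")
print(f"disk per machine:   {disk_gb_per_machine:.0f} GB")
```

In other words, the totals are consistent with racks of 88 dual-processor machines, each with 2 GHz CPUs, 2 GB of RAM, and roughly 80 GB of disk.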
Programming technology
Google uses its own approach to distributing the task of processing collected data. Chunks from the Google File System (GFS), typically 64 MB in size, are processed by the MapReduce framework. This framework makes it possible to apply the map and reduce concepts from functional programming languages across the data stored in the GFS. First a function is mapped across the collected data, then the result is reduced. For example, a function that extracts the hostname from a URL can be mapped across all pages; the output is then sorted and reduced, yielding a count of how many times each hostname occurs. All mapping and reducing is massively parallelized across the nodes and is fault tolerant, so if nodes crash or misbehave during a MapReduce job, the work is moved to another machine.
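The hostname-counting example can be sketched in a few lines of Python. This is a single-process illustration of the map, sort, and reduce steps only, not Google's MapReduce framework: there is no distribution across nodes and no fault tolerance, and the function names and example URLs are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_hostname(url):
    """Map step: emit a (hostname, 1) pair for one page URL."""
    return (urlparse(url).hostname, 1)


def reduce_counts(hostname, counts):
    """Reduce step: sum all counts emitted for one hostname."""
    return (hostname, sum(counts))


def count_hostnames(urls):
    # Map: apply the function to every input record.
    pairs = [map_hostname(url) for url in urls]
    # Sort and group: collect all values that share a key.
    groups = defaultdict(list)
    for hostname, count in sorted(pairs):
        groups[hostname].append(count)
    # Reduce: collapse each group to a single result.
    return dict(reduce_counts(h, c) for h, c in groups.items())


# Hypothetical input URLs.
urls = [
    "http://a.example/x",
    "http://b.example/",
    "http://a.example/y",
]
hostname_counts = count_hostnames(urls)
print(hostname_counts)
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be spread over many machines, which is what makes the pattern suitable for parallel execution.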
PageRank and indexing
Google uses an algorithm called PageRank to rank web pages that match a given search string. The PageRank algorithm computes a recursive figure of merit for web pages, based on the weighted sum of the PageRanks of the pages linking to them. The PageRank thus derives from human-generated links, and correlates well with human concepts of importance. Previous keyword-based methods of ranking search results, used by many search engines that were once more popular than Google, would rank pages by how often the search terms occurred in the page, or how strongly associated the search terms were within each resulting page. In addition to PageRank, Google also uses other secret criteria for determining the ranking of pages on result lists.
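The recursive definition can be made concrete with a short iterative sketch. It is a simplified reading of the description above, not Google's production algorithm: the damping factor of 0.85, the equal split of a page's rank among its outgoing links, the handling of pages without links, and the four-page example graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks for a small link graph.

    links maps each page to the list of pages it links to; every
    linked page must itself appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank on, split among its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by three pages,
# "d" is linked to by none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example the page with the most incoming links ends up with the highest rank, and the page nobody links to ends up with the lowest. A page's rank also depends on the rank of the pages linking to it, which is why the computation is repeated until the values settle.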
Google not only indexes and caches HTML files but also 13 other file types 9, which include PDF, Word documents, Excel spreadsheets and plain text files. Except in the case of text files, the cached version is a conversion to HTML, allowing those without the corresponding viewer application to read the file.
Google may have difficulty indexing some websites, in particular those that use frames, links embedded within JavaScript or Java, or complex URLs with more than six variables in the query string. Google offers an explanation of why some web pages have not been included 10.
Users can customize the search engine somewhat. They can set a default language, use "SafeSearch" filtering technology (which is set to "moderate" by default), and set the number of results shown on each page. Google has been criticized for placing long-term cookies on users' machines to store these preferences, a tactic which also enables it to track a user's search terms over time. For any query (of which only the first 32 keywords are taken into account), up to the first 1000 results can be shown, with a maximum of 100 displayed per page.
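The effect of these limits can be worked through with a little arithmetic. The sketch below only applies the three figures quoted above (32 keywords, 1000 results, 100 per page); the constant and function names are hypothetical, and splitting a query on whitespace is a simplifying assumption about how keywords are counted.

```python
import math

MAX_KEYWORDS = 32    # keywords taken into account per query
MAX_RESULTS = 1000   # results that can be shown for any query
MAX_PER_PAGE = 100   # largest results-per-page setting


def effective_keywords(query):
    """Return the keywords that count, assuming whitespace-separated words."""
    return query.split()[:MAX_KEYWORDS]


def reachable_pages(total_matches, per_page):
    """Number of result pages a user can actually page through."""
    if per_page < 1:
        raise ValueError("per_page must be at least 1")
    per_page = min(per_page, MAX_PER_PAGE)
    shown = min(total_matches, MAX_RESULTS)
    return math.ceil(shown / per_page)


print(reachable_pages(8_000_000, 10))
print(reachable_pages(8_000_000, 100))
print(len(effective_keywords(" ".join(["word"] * 40))))
```

However many pages match a query, a user showing 10 results per page can reach at most 100 result pages, and a user showing 100 per page can reach at most 10.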
Despite the immense size of Google's index, a considerable amount of data resides in databases that are accessible from websites by means of queries but not by links. This so-called deep web is minimally covered by Google and contains, for example, library catalogues, official legislative documents of governments, and phone books.
As an April Fools' parody of PageRank, Google introduced an explanation of something called "PigeonRank" 11.
Google optimization
Since Google is the most popular search engine, many webmasters have become eager to influence their websites' Google rankings. An industry of consultants has arisen to help websites raise their rankings on Google and on other search engines. This field, called search engine optimization, attempts to discern patterns in search engine listings, and then develop a methodology for improving rankings.
One of Google's chief challenges is that as its algorithms and results have gained the trust of web users, the profit to be gained by a commercial web site in subverting those results has increased dramatically. Some search engine optimization firms have attempted to inflate specific Google rankings by various artifices, and thereby draw more searchers to their clients' sites. Google has managed to weaken some of these attempts by reducing the ranking of sites known to use them.
Search engine optimization encompasses both "on page" factors (like body copy, title tags, H1 heading tags and image alt attributes) and "off page" factors (like anchor text and PageRank). The general idea is to affect Google's relevance algorithm by incorporating the keywords being targeted in various places "on page," in particular the title tag and the body copy (note: the higher up in the page, the better its keyword prominence and thus the ranking). Too many occurrences of the keyword, however, cause the page to look suspect to Google's spam checking algorithms.
One "off page" technique that works particularly well is Google bombing, in which websites link to another site using a particular phrase in the anchor text in order to give the site a high ranking when that phrase is searched for.
Google publishes a set of guidelines for website owners who would like to raise their rankings using legitimate optimization consultants 12. The New Zealand DMA offers a more comprehensive guide to SEO ethics standards 13.
Services and tools
- Main article: List of Google services and tools
Google offers a number of tools and services. Some, such as Google's calculator, stock quotes and weather results are integrated into what they call the "OneBox", meaning they appear in-line with other search results 14. The name is based on an ideal of all information being available from the one search box.
Many of Google's other services are based on applying search technology to other sources of data. Examples of this are Google Image Search, Google News, and Google Video, as well as Froogle, its catalog searching service. However, many of these services have become integrated as OneBox results and now appear in normal search results as well as having their own pages 15.
Google also provides other services that are not directly related to searching. These include its AdSense and AdWords targeted text advertising services, the Gmail e-mail service, and the Blogger weblog service.
Lastly, there are a number of tools written by Google to interact with their search and services. As of February 2005, these have been written exclusively for the Microsoft Windows operating system. Such tools include Google Desktop Search, Google Deskbar and the Gmail Notifier.
Jargon
- SEO: Search engine optimization.
- To google: To search for something using Google (also, to seek information on someone by entering their full name or other information).
- Googler: A person who searches Google skillfully, making efficient use of Google's search commands and often using the "I'm Feeling Lucky" button; a fan of Google. Sometimes also used to mean "expert online searcher". Also, a full-time Google employee.
- Noogler: A new Googler.
- Googlosophy: The science of Google.
- Googlenym, Googlonym, Memomark, Google URL: A mental bookmark expressed as a Google search ("go to my site by entering 'John Doe Chicago' into Google"). A phrase or group of random keywords for which a Google search returns a corresponding page.
- SERPs: Search engine result pages.
- Nigritude ultramarine, SERPs, Seraphim Proudleduck, Mangeur de cigogne: SEO competitions.
- Blackhat SEO: Search engine optimization using dirty tricks such as link farms, wiki or guestbook spamming, and so on.
- Googledork: A person who accidentally exposes information to the web by placing it in a location spidered by Google.
- Whitehat SEO: Search engine optimization using enhanced content, improved accessibility and usability, unique page titles, non-JavaScript linking methods, and so on.
- Google-proof: A search phrase that delivers exactly the intended result when searched with Google.
- Sandbox Effect: The name given to the phenomenon in which Google filters websites created after March 2004 from its results.
- Google bomb: An attempt to influence the ranking of a given site in results returned by the Google search engine. Also known as Google wash.
- Blue Red Yellow Blue Green Red: A synonym of Google (from the colors of its logo).
Books
- Google Hacks from O'Reilly is a book containing tips about using Google effectively. Now in its second edition. ISBN 0596008570
- Google: The Missing Manual by Sarah Milstein and Rael Dornfest (O'Reilly, 2004). ISBN 0596006136
- How to Do Everything with Google by Fritz Schneider, Nancy Blachman, and Eric Fredricksen (McGraw-Hill Osborne Media, 2003). ISBN 0072231742
- Google Power by Chris Sherman (McGraw-Hill Osborne Media, 2005). ISBN 0072257873
References
- Note 1: Statistics (http://www.onestat.com/html/aboutus_pressbox21.html)
- Note 2: Personalized Google Search (http://labs.google.com/personalized)
- Note 3: International Herald Tribune article (http://www.iht.com/articles/2005/01/30/business/google31.html)
- Note 4: Internet Archive copy of google.com (http://web.archive.org/web/%2a/google.com)
- Note 5: GoogleBlog (http://googleblog.blogspot.com/2005/02/get-picture.html)
- Note 6: Google Groups (http://groups-beta.google.com/intl/en/googlegroups/about.html)
- Note 7: Google Catalogs (http://catalogs.google.com/)
- Note 8: How many Google machines (http://www.tnl.net/blog/entry/How_many_Google_machines)
- Note 9: Filetypes FAQ (http://www.google.com/help/faq_filetypes.html)
- Note 10: Exclusion of pages explanation (http://www.google.com/webmasters/2.html)
- Note 11: Pigeonrank explanation (http://www.google.com/technology/pigeonrank.html)
- Note 12: Guidelines (http://www.google.com/webmasters/guidelines.html)
- Note 13: SEO ethics standards (http://www.dma.co.nz/pdfs/Standards_for_Search_Engine_Marketing.pdf)
- Note 14: OneBox (http://www.google.com/help/interpret.html)
- Note 15: searchenginewatch.com article (http://blog.searchenginewatch.com/blog/050207-082648)
External links
- Google website (http://www.google.com/)
- Google Services (http://www.google.com/options/index.html)
- Google Language Tools (http://www.google.com/language_tools)
- Google doodles (http://www.google.com/holidaylogos.html)
- Early Google.com (http://web.archive.org/web/19981111184551/google.com) - Google as on November 11, 1998 from Internet Archive
- Open Directory Project: Google (http://www.dmoz.org/Computers/Internet/Searching/Search_Engines/Google/)
- Original Stanford.edu Google (http://web.archive.org/web/19980502040303/google.stanford.edu) - Google on Stanford.edu
- Overview of lawsuits against Google's AdWord program (http://www.linksandlaw.com/adwords-pendinglawsuits.htm)
- Google's Ads -- and Minuses (http://www.businessweek.com/technology/content/mar2004/tc2004039_1592_tc047.htm)
- Google.ru (Russian) (http://www.google.ru/)
- An Evening with Google's Marissa Mayer (http://alan.blog-city.com/read/1003011.htm) - with several facts about Google's history
- Google Suggest for non-javascript browsers (http://www.jdhodges.com/tools/suggest/)
- List of alternative Google domains (http://whois.webhosting.info/216.239.37.99) (currently down for maintenance.)
- Official Google Weblog (http://googleblog.blogspot.com/)
- Google accepts submissions for websites that need indexing (http://www.google.com/intl/en/addurl.html)
- MapReduce article from Google Labs (http://labs.google.com/papers/mapreduce.html)
- The Search Engine Relevancy Challenge (http://www.rustybrick.com/rustysearch.php) - Shows the top 10 results from Google, Yahoo, MSN and Ask Jeeves with the name of the search engine returning each result whited out, so that visitors can vote on each result's relevancy. The site will soon post the results of the experiment.