Analysis and Implications of Hilltop Algorithm: How will it affect your site ranking?

In the article Google Florida Algo Update, we discussed why we believe that Google deployed the Hilltop algo in its Florida algo update. As usual, Google has been silent about the update, so our analysis is based on research and experiments.

Why the need for a new algo?

While the PR algo did its job well all these years, the PR system has a basic flaw, and Google knew about it. The PageRank (PR) system allocates an absolute value of importance to a web page based on the number and quality of pages that link to it. However, the PR value is not specific to search terms, so a high-PR web page that contained even a passing reference to an off-topic keyword phrase often got a high ranking for that phrase.

Krishna Bharat from California recognized this flaw in the PR-based ranking system and came up with an algorithm he called Hilltop in 1999-2000. He filed for the Hilltop patent in January 2001 with Google as an assignee. Needless to say, Google realized the advantage this new algo would offer its ranking system if combined with its own PR system: Hilltop could bridge the gap. The Hilltop algo may have gone through several refinements and iterations from its original form before this deployment.

What is the Hilltop algo?

For the geeks who wish to go into great depth, detailed info is available here. Hilltop Paper & Hilltop Patent: http://www.cs.toronto.edu/~georgem/hilltop/

For the rest of us, here is a simple explanation. In a nutshell, PR determines the authority of a web page in general. Hilltop (LocalScore) determines the authority of a web page in relation to the query or search term.

Bharat reasoned that, instead of using just the PR value to find authoritative web pages, it would be more useful if the value had topical relevance. Counting the links that point to a web page from topic-relevant documents would therefore be more useful.
He called these topic-relevant documents expert documents, and the links from these expert documents to the target documents determine the targets' authority score.

The Hilltop algo calculates the authority score of web pages (over-simplified) as follows:

1. Run a normal search on the keyphrase to locate a corpus of expert documents. The qualifying rules for expert documents are stringent, so the corpus is a manageable number of web pages.
2. Filter affiliate* sites and duplicate sites from the experts list.
3. Assign each page a LocalScore of authority based on the number and quality of votes it gets from these expert documents. Pages are then ranked by their LocalScore.

How does Hilltop define affiliate sites?

*Affiliate sites are defined as follows:

- Pages that originate from the same domain (www.ibm.com, www.ibm.com/us/, products.ibm.com, solutions.ibm.com etc.)
- Pages that originate from the same domain name but with different top-level and second-level suffixes (like www.ibm.com, www.ibm.co.uk, www.ibm.co.jp etc.)
- Pages that originate from neighborhood IPs (the first 3 octets of the IP number are the same, as in 66.165.238.xxx)
- Pages that originate from affiliates of affiliates (if www.abc.com is hosted on the same first 3 IP octets as www.ibm.com, then www.abc.com is also an affiliate of www.ibm.co.uk, even if those two are on a different IP series)

It is worth noting that the Hilltop algo bases its calculations only on expert documents. It requires at least two expert documents voting for a page. If the algo does not find a minimum of two expert documents, the score it returns is zero. This essentially means that the Hilltop algo passes no value on to the rest of the ranking algo and therefore has no effect on the search query in question.

This is a very important aspect of the Hilltop algo: it is ineffective if sufficient expert documents are not located.
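The affiliate rules and the two-expert minimum described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's or the patent's actual code: the names `SUFFIXES`, `org_name`, `ip_block`, `affiliated`, `affiliate_groups` and `local_score` are hypothetical, the suffix list is a tiny stand-in for a real one, the vote weights are made-up numbers, and keeping only the strongest vote per affiliate group is our assumption about how the filtering step might work.

```python
# Hypothetical, tiny suffix list; a real system would need a complete one.
SUFFIXES = ("co.uk", "co.jp", "com", "org", "net")


def org_name(host):
    """Strip a known suffix and return the last remaining label, e.g. 'ibm'."""
    host = host.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            host = host[: -len(suffix) - 1]
            break
    return host.split(".")[-1]


def ip_block(ip):
    """Return the first three octets of an IPv4 address."""
    return ".".join(ip.split(".")[:3])


def affiliated(a, b):
    """a and b are (host, ip) pairs; same name or same IP block means affiliate."""
    return org_name(a[0]) == org_name(b[0]) or ip_block(a[1]) == ip_block(b[1])


def affiliate_groups(sites):
    """Label each site with a group id; affiliation is treated as transitive."""
    parent = list(range(len(sites)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(sites)):
        for j in range(i + 1, len(sites)):
            if affiliated(sites[i], sites[j]):
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(sites))]


def local_score(votes, min_experts=2):
    """votes: list of ((host, ip), weight) from experts linking to one page.

    Assumption: only the strongest vote per affiliate group counts, and fewer
    than min_experts independent groups gives a score of zero.
    """
    groups = affiliate_groups([site for site, _ in votes])
    best = {}
    for group, (_, weight) in zip(groups, votes):
        best[group] = max(best.get(group, 0.0), weight)
    if len(best) < min_experts:
        return 0.0
    return sum(best.values())
```

With this sketch, two votes from www.ibm.com and www.ibm.co.uk count as a single expert and therefore produce a LocalScore of zero, while two votes from unrelated hosts on unrelated IP blocks are added together.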
This unique feature of the Hilltop algo, its high chance of returning a zero score for highly specific query terms, has led the majority of the SEO community to believe that Google is using a money-words filter list. In reality, the old Google results were displayed for those specific search terms where Hilltop failed to take effect. The collection of these terms is what the SEO community compiled and called the Money Words List. This effect is also strong evidence of the deployment of Hilltop by Google.

When Google introduced this new algo on November 15th, 2003, an analyst figured out that if you searched for a query term with some trash exclusion terms added, Google displayed the original (pre-algo-change) results, bypassing the so-called money-words filter list.

For example, if you searched for real estate -hgfhjfgjhgjg -kjhkhkjhkjhk, Google would attempt to show you pages on real estate while excluding pages that contained the terms hgfhjfgjhgjg and kjhkhkjhkjhk. Since hardly any page would contain the words hgfhjfgjhgjg and kjhkhkjhkjhk, Google should have returned the same results as for the term real estate alone. However, that did not happen. Google showed results that seemed identical to the pre-algo-change ranking. In fact, an anti-Google group set up a site (www.scroogle.org) to capture the differences in rankings and extract a so-called money-words filter list.

What's the real story behind the so-called money keywords list filter?

We believe that the money-words filter list effect was just a side-effect of the Hilltop algo. Each time someone ran a search term like real estate -hgfhjfgjhgjg -kjhkhkjhkjhk, Google passed the entire search term on to Hilltop. Since Hilltop was unable to locate sufficient expert documents for this funny-looking search term, it produced a zero result (read: zero effect).
This essentially means that Hilltop was simply bypassed by the exclusion search term. The rest of the Google algo was then left to extract and display results, which obviously looked identical to the pre-algo-update results.

The growing popularity of www.scroogle.org led Google to detect this bug. Google fixed it by running Hilltop as a 2-step process. The exclusion terms are withheld when the query is passed to Hilltop; Hilltop does its work, extracts results and passes them to the Google algo; Google then applies the exclusion terms just before displaying results. Simple. Exclusion terms are no longer passed on to Hilltop, so Hilltop now works fine. As you can see on the Google site, the exclusion method above no longer shows the old Google results.

What does the new Google algo look like? What are the implications?

The combination of the Hilltop algo, Google PR and on-page relevance factors seems to be highly potent and very difficult to beat. Not impossible, but very difficult. This new combination has far-reaching implications for how link popularity/PageRank and links from expert documents (LocalScore) affect your site ranking.

The exact Google algo is known only to Google. It is a closely guarded secret. I'm not good at maths (I wish I were), but here is an attempt to simplify the new Google algorithm in order to understand how the variables take effect.

Old Google Ranking Formula = {(1-d) + a(RS)} * {(1-e) + b(PR * fb)}

New Google Ranking Formula = {(1-d) + a(RS)} * {(1-e) + b(PR * fb)} * {(1-f) + c(LS)}

Where:

RS = RelevanceScore: (Score based on keywords appearing in the title, meta tags, headlines, body text, URL, alt text, title attribute, anchor text etc. of your site)

PR = PageRank: (Score based on the number and PR value of pages linking to your site. The original formula is PR(A) = (1-d) + d(PR(t1)/C(t1) + ... + PR(tn)/C(tn)), where the PR of page A is built from the PR of each page linking to it divided by the number of outgoing links on each of those pages. The d in this formula is PageRank's own damping factor, separate from the dampener d in the ranking formulas above. It is usually set to 0.85, which makes (1-d) equal to 0.15)

LS = LocalScore: (Score computed from expert documents. It has variables and different values for the search term appearing in the title (16), headline (6), anchor text (1), search term density etc. Figures in parentheses are the original values, which may have been changed by Google)

a, b, c = Tweak Weight Controls: (available to Google for fine-tuning the results)

d, e, f = Dampener Controls: (available to Google for fine-tuning the results. We believe that the value of f is currently set at zero. With f = 0, the third factor becomes 1 + c(LS), so a LocalScore of zero leaves the old formula unchanged, which matches the bypass effect described earlier.)

fb = FactorBase: (The PageRank scale of 1 to 10 on the Google Toolbar is not linear but exponential/logarithmic. As per our internal analysis, we believe that its base is close to 8. This means that PR5 is worth 8 times as much as PR4, and a PR8 website is worth roughly 4,000 times as much as a PR4 website (8^4 = 4,096). This factor somehow needs to be built into the algo formula. We have therefore included an fb value to accommodate it)
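
To see how the variables interact, the simplified formulas can be written out as a small Python sketch. Everything here is illustrative: the function names `old_rank`, `new_rank`, `pagerank_of` and `toolbar_ratio` are hypothetical, the default weights (a, b, c = 1 and d, e, f = 0) are placeholders rather than values known to be used by Google, `PR * fb` is taken literally from the formula, and 0.85 is the commonly cited damping value for the original PageRank formula.

```python
def old_rank(rs, pr, a=1.0, b=1.0, d=0.0, e=0.0, fb=8.0):
    """Old formula: {(1-d) + a(RS)} * {(1-e) + b(PR * fb)}."""
    return ((1 - d) + a * rs) * ((1 - e) + b * (pr * fb))


def new_rank(rs, pr, ls, a=1.0, b=1.0, c=1.0, d=0.0, e=0.0, f=0.0, fb=8.0):
    """New formula: the old formula multiplied by {(1-f) + c(LS)}."""
    return old_rank(rs, pr, a, b, d, e, fb) * ((1 - f) + c * ls)


def pagerank_of(inbound, d=0.85):
    """One evaluation of PR(A) = (1-d) + d * sum(PR(t)/C(t)).

    inbound: list of (pr_of_linking_page, outgoing_link_count) pairs.
    """
    return (1 - d) + d * sum(pr / links for pr, links in inbound)


def toolbar_ratio(pr_high, pr_low, base=8):
    """How many times more a higher toolbar PR is worth on a base-8 scale."""
    return base ** (pr_high - pr_low)


if __name__ == "__main__":
    # With f = 0 and LS = 0 the new formula reduces to the old one.
    print(new_rank(0.5, 3, 0.0) == old_rank(0.5, 3))
    # A PR8 page against a PR4 page on a base-8 scale.
    print(toolbar_ratio(8, 4))
```

The first check shows why a zero LocalScore leaves the old ranking untouched when f is zero, and the second reproduces the figure of about 4,000 quoted for PR8 against PR4.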