Chatmaster
03-06-2007, 05:05 PM
Google Patent – Detecting spam documents in a phrase based information retrieval system.
I am sure many of you are aware of this patent registered by Google on the 28th of December 2006. I thought it would be a good idea to go through the patent and share my interpretation of it. In short, I would say this patent means that Google has just secured the jobs of many SEO copywriters. Note that I haven’t read anybody else’s summary of this patent. I did this deliberately, to ensure that my views were unbiased. Please read through it and share your opinions.
What is clear is that this patent addresses search phrases of 3 or more words. Until now it seems Google had a problem providing accurate results for multi-word search phrases, and this patent addresses that issue.
First of all, a summary of the specifics in the claims section of the patent.
PRIMARY OBJECTIVE: Identifying spam documents in Google
Method
-Maintaining a list of phrases, each phrase associated with a list of related phrases.
-Determining the number of related phrases expected to be present in a document for any phrase on the list of phrases.
-Determining the actual number of related phrases present in the document.
-Identifying the document as a spam document by comparing the actual number of related phrases with the expected number.
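To make the method above a bit more concrete, here is a minimal Python sketch of how I picture the comparison. Everything in it is my own assumption for illustration: the phrase list, the function names, the plain substring matching, and especially the rule that a document is flagged when it contains far MORE related phrases than expected (the tolerance factor is made up). It is not Google's implementation.

```python
# Hypothetical phrase list: each phrase maps to its related phrases.
RELATED_PHRASES = {
    "president of the united states": {
        "white house",
        "oval office",
        "commander in chief",
        "vice president",
    },
}


def count_related_phrases(text, phrase, related=RELATED_PHRASES):
    """Count how many distinct related phrases of `phrase` occur in `text`."""
    lowered = text.lower()
    return sum(1 for rp in related.get(phrase, ()) if rp in lowered)


def looks_like_spam(text, phrase, expected, tolerance=2.0, related=RELATED_PHRASES):
    """Flag the document when the actual count exceeds expected * tolerance.

    `expected` and `tolerance` are assumed inputs; the patent summary above
    only says that the actual and expected numbers are compared.
    """
    actual = count_related_phrases(text, phrase, related)
    return actual > expected * tolerance
```

For example, a page that crams in all four related phrases when only one is expected would be flagged, while a page that mentions just one of them would not.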
This patent addresses
-“spam stuffing pages” amongst other issues.
-Words or phrases known to be of value to advertisers.
-Identifying spam phrases
-Filter “identified” spam documents from the results.
-Identifying multiple word phrases
-Identify phrases that are related to each other.
-Semantics are being used.
-URLs are used to identify the documents. Could this mean that simply changing the URL can “fix” your rankings?
-Classify good and bad phrases, where a bad phrase is a phrase that lacks predictive power.
-It also uses frequency statistics to classify good or bad phrases.
-Phrases can include stop words.
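The good/bad phrase classification mentioned in the list above could look something like the following Python sketch. I am assuming that "predictive power" means a phrase co-occurs with some other phrase more often than chance would suggest, and that frequency is counted per document. The thresholds (`min_docs`, `min_gain`) and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def classify_phrases(docs, min_docs=2, min_gain=1.5):
    """Split phrases into (good, bad) sets.

    docs: list of sets, each set holding the phrases found in one document.
    A phrase is "good" if it appears in at least `min_docs` documents and
    co-occurs with some other phrase at least `min_gain` times more often
    than expected by chance. Everything else is "bad".
    """
    n = len(docs)
    doc_freq = Counter(p for d in docs for p in d)
    pair_freq = Counter()
    for d in docs:
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(d), 2):
            pair_freq[(a, b)] += 1

    good = set()
    for (a, b), together in pair_freq.items():
        expected = doc_freq[a] * doc_freq[b] / n  # chance co-occurrence
        if together / expected >= min_gain:
            for p in (a, b):
                if doc_freq[p] >= min_docs:
                    good.add(p)
    bad = set(doc_freq) - good
    return good, bad
```

With four small documents where "australian shepherd" and "blue merle" always appear together and "click here" appears everywhere, the first two come out as good phrases and "click here" as a bad one, because its presence predicts nothing about the other phrases.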
Notes
-It is clear that deep site structures will do well under this patent. The patent clearly mentions looking at more than one document in a document collection to determine how often the phrases appear.
-It seems that linking out to other sites can be beneficial based on this patent. The link text being used will now be determined by you the website owner to target a specific keyword phrase.
-Hyperlink text is identified as distinguished text.
-Having several exit links to related sites can be good, but there is a threshold (density). This might explain why the Wikipedia site is ranking so well in the latest SERPs. The assumption is therefore that contextual exit links using topic-related keywords are good for a document.
-Phrase information is used to determine the primary and secondary topics a document is about.
-Making changes to a page that is already indexed should be considered carefully. It can be good or bad depending on the reason for the change. If a page is changed, the archive is closed for the page and the page is re-scored and re-ranked.
-Topic lists of changed documents are compared to archived document instances.
-If more than a certain % of a document’s topics have changed, the document is re-indexed for all phrases.
-Semantics play an important role in determining relevant results.
-It is a good idea to use only a few occurrences of the targeted keywords together with a number of related keywords in a document. This will work better than a high frequency of the targeted keyword phrases.
-The anchor text of incoming links also plays a big role in giving relevance to a document. This will definitely cause considerable changes in how link exchanges are done.
-Older documents will have their relevance scores rated lower, or newer documents will have their relevance scores rated higher. However, this rule does not apply to all searches.
-Date range of archives plays a major role in this patent. Frequency of updates of specific documents plays a great role.
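The notes above about comparing a changed page's topic list with the archived instance can be sketched in Python as follows. The patent summary only says "more than a certain %", so the 50% threshold, the function names and the way I measure change (topics added or removed, divided by all topics seen) are my own assumptions.

```python
def fraction_changed(archived_topics, current_topics):
    """Share of topics that were added or removed between two versions."""
    archived, current = set(archived_topics), set(current_topics)
    all_topics = archived | current
    if not all_topics:
        return 0.0
    changed = archived ^ current  # topics present in only one version
    return len(changed) / len(all_topics)


def needs_full_reindex(archived_topics, current_topics, threshold=0.5):
    """True when the topic change exceeds the (assumed) threshold."""
    return fraction_changed(archived_topics, current_topics) > threshold
```

Under these assumptions, swapping one topic out of four leaves the page below the threshold, while replacing three of the four topics would trigger a re-index for all phrases.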
How do you identify rank declines caused by this patent?
The shortest and sweetest way, I believe, is based on point 11 in the document. This point clearly states that the algorithm will determine related keyword phrases for which an identified spam document should not rank. To me this means that your rankings will decline drastically for all related keyword phrases. Therefore, if your rankings suddenly dropped across variants or related keywords, this algorithm is most probably filtering out your pages.
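If you track your rankings, a rough check for this pattern could look like the Python sketch below. It assumes you have your own before and after positions for a group of related keywords. The 20-position drop, the 80% share and the "not ranked" position of 101 are arbitrary values I picked for illustration, not anything taken from the patent.

```python
def dropped_keywords(before, after, min_drop=20, not_ranked=101):
    """Return the keywords whose position fell by at least `min_drop`.

    before/after: dicts mapping keyword -> ranking position (1 is best).
    A keyword missing from `after` is treated as position `not_ranked`.
    """
    dropped = []
    for keyword, old_position in before.items():
        new_position = after.get(keyword, not_ranked)
        if new_position - old_position >= min_drop:
            dropped.append(keyword)
    return sorted(dropped)


def looks_filtered(before, after, min_drop=20, min_share=0.8):
    """True when most of the related keywords dropped together."""
    if not before:
        return False
    share = len(dropped_keywords(before, after, min_drop)) / len(before)
    return share >= min_share
```

A drop on a single keyword would not trip this check. It only fires when the whole group of variants falls at once, which is the pattern described above.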