Bill pointed out a patent recently awarded to Google that I thought I would try to decipher. Based on the summary, they could assign a separate personalized anchor text score based on a “set of user-specific parameters” and combine this with the regular score to reorder your results. Here is the summary from the patent:
A search engine identifies a list of documents from a set of documents in a database in response to a set of query terms. For each document in the list, the search engine determines an information retrieval score based on its content and the query terms, and also identifies a set of source documents that have links to the document and that also have anchor text satisfying a predefined requirement with respect to the query terms. The search engine calculates a personalized page importance score for each of the identified source documents according to a set of user-specific parameters and accumulates the personalized page importance scores to produce a personalized anchor text score for the document. The personalized anchor text score is then combined with the document’s information retrieval score to generate a personalized ranking for the document. The documents are ordered according to their respective personalized rankings.
Key takeaways from the summary and the claims:
* “Each of the identified source documents have anchor text that satisfies a search query corresponding to the set of query terms.”
* “Each of the identified source documents have anchor text that contains at least one of the query terms in the set of query terms.”
* “The personalized page importance score of the identified source document is independent from the set of query terms.”
Those seem to be fairly basic IR methods, but looking a bit further into the patent I’ve come to a couple of conclusions/theories/guesses. First, in certain cases this could have a significant impact on the result set. There are a lot of mentions of the source document meeting “predefined requirements” further down in the patent.
* The anchor text of each of the identified source documents is examined to determine if the anchor text satisfies a predefined requirement with respect to the query terms. (They note that the predefined requirement is that it must have terms that correspond to the search query, but there is more to it than that. If that were the only requirement, it would be fairly simple to spam with massive page generation based on as many keyword combinations as you could think up. More on the possible requirements later.)
* After the above condition is met, then a personalized page score is calculated according to a set of user-specific parameters that could include a list of user favored websites or keywords suitable for identifying user favored websites. (Once again they are favoring websites you frequently visit, making it harder to find new sources of information that could be more relevant to your query, but I’ve already covered that)
* User-specific parameters may be provided by the user, collected from a third party having such information, or derived from a user’s search history and web history. (Spyware in a nicely worded sentence – TURN YOUR WEB AND SEARCH HISTORY OFF!!!)
The reason I think this could have a significant impact on the result set is that I don’t think the predefined requirement will be met very often, and when it is, you may see certain sites make a significant ranking jump in your personalized results. I would expect most documents in a result set to carry the normal IR score and a very low to zero anchor text score. When a document meets all the criteria, the anchor text score combined with the IR score will boost the page.
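To make the scoring concrete, here is a minimal Python sketch of how I read the summary. The names (`SourceLink`, `personalized_ranking`, `anchor_weight`) are mine, not the patent's, and combining the two scores as a weighted sum is purely an assumption, since the summary only says they are "combined".

```python
from dataclasses import dataclass


@dataclass
class SourceLink:
    """A page linking to the candidate document (hypothetical structure)."""
    anchor_text: str
    personalized_importance: float  # user-specific, independent of the query


def anchor_matches(anchor_text, query_terms):
    # One possible "predefined requirement": at least one query term
    # appears as a word in the anchor text.
    words = set(anchor_text.lower().split())
    return any(term.lower() in words for term in query_terms)


def personalized_anchor_score(links, query_terms):
    # Accumulate importance only over source pages whose anchor text qualifies.
    return sum(
        link.personalized_importance
        for link in links
        if anchor_matches(link.anchor_text, query_terms)
    )


def personalized_ranking(ir_score, links, query_terms, anchor_weight=1.0):
    # ASSUMPTION: a weighted sum; the patent summary does not give the formula.
    return ir_score + anchor_weight * personalized_anchor_score(links, query_terms)


query = ["stanford", "sports"]
links_a = [SourceLink("Stanford sports news", 0.6), SourceLink("click here", 0.9)]
links_b = []  # no qualifying source pages, so only the IR score counts

rank_a = personalized_ranking(1.0, links_a, query)
rank_b = personalized_ranking(1.2, links_b, query)
```

In this toy case the second document has the higher IR score, but the first one ends up ranked higher because one of its source pages meets the anchor text requirement. The "click here" link contributes nothing, however important that page is to the user.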
Assume that a user named Adam is looking for a website covering Stanford’s sports teams. For the purposes of this explanation, we will assume that Adam would prefer that the search engine return www.gostanford.com 110-2 ahead of www.espn.com 120-2. To achieve this goal, one approach would be to allow a user like Adam to instruct the search engine to personalize the rankings of search results by providing appropriate user information such as the user’s background information or a plurality of favorite websites. For example, Adam may register with the search engine that he prefers web pages whose URL includes the term “Stanford” over other web pages.
This looks to me like another indication of the increasing weight given to keyword domains. I also think it’s crap that they just assume Adam would prefer gostanford.com over espn.com just because he’s been to the first one more often or he’s specified that in his profile. Maybe he hasn’t discovered something that is much more relevant and that he would prefer. He might never find it in this system.
Conceptually, when computing personalized page importance scores, the Page Importance Ranker boosts the page importance scores of documents that are deemed to match the user-specific parameters, which in turn boosts the downstream documents linked to those documents. From another viewpoint, the Page Importance Ranker boosts the page importance scores of documents of each host whose URL matches one or more of the user-specific parameters.
Here is the boost I mentioned above; however, the amount of “boost” is complete speculation.
In some embodiments, a document is deemed to match (or not match) user-specific parameters solely based on the URL of the document.
Another hint of keyword domains being more powerful (time to increase my domain buying).
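The patent text quoted here doesn't say how the boost is actually computed, so the following is only a guess at the mechanics. It is a standard personalized (topic-sensitive) PageRank, where the random jump lands only on pages whose host matches a user-favored term. The function names, the damping value, and the tiny link graph are all made up for illustration.

```python
from urllib.parse import urlparse


def matches_user(url, favored_terms):
    # Match solely on the URL's host, as in the "some embodiments" quote.
    host = urlparse(url).netloc.lower()
    return any(term.lower() in host for term in favored_terms)


def personalized_importance(links, favored_terms, damping=0.85, iterations=50):
    """links maps each URL to the list of URLs it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    favored = [p for p in pages if matches_user(p, favored_terms)]
    # Jump only to favored pages; fall back to every page if none match.
    jump_set = favored or pages
    teleport = {p: (1.0 / len(jump_set) if p in jump_set else 0.0) for p in pages}

    score = dict(teleport)
    for _ in range(iterations):
        nxt = {p: (1 - damping) * teleport[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Pass this page's score downstream to the pages it links to.
                share = damping * score[p] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling page: hand its score back to the favored set.
                for q in pages:
                    nxt[q] += damping * score[p] * teleport[q]
        score = nxt
    return score


graph = {
    "http://www.gostanford.com/": ["http://example.org/a"],
    "http://www.espn.com/": ["http://example.org/b"],
    "http://example.org/a": [],
    "http://example.org/b": [],
}
scores = personalized_importance(graph, ["stanford"])
```

With "stanford" as the only favored term, the gostanford.com page and the page it links to end up with all of the importance, which is the "boosts the downstream documents" effect the patent describes. A real system would presumably blend this with the ordinary score rather than zeroing out everyone else.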
In some embodiments, the source pages listed for a respective document are limited to those that satisfy a predefined requirement with respect to the search query. For instance, in one embodiment the predefined requirement is that the anchor text of the link to the respective document contain at least one query term of the search query. In another embodiment, the predefined requirement is that the anchor text of the link to the respective document satisfy the entire search query, which may be a Boolean expression containing multiple query terms. In yet other embodiments, all source documents are included, without respect to whether the anchor text of the links to the respective document contain any of the query terms.
As I mentioned way up at the beginning of this post, here are some variations of their “predefined requirements”.
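The three variations in that passage can be written as interchangeable filters. This is a rough Python sketch with hypothetical names. The "entire query" case is simplified to "every term must appear", whereas the patent allows a full Boolean expression.

```python
def anchor_words(anchor_text):
    return set(anchor_text.lower().split())


def any_term(anchor_text, query_terms):
    # Variation 1: anchor text contains at least one query term.
    words = anchor_words(anchor_text)
    return any(term.lower() in words for term in query_terms)


def all_terms(anchor_text, query_terms):
    # Variation 2, simplified: treat the query as an AND of all its terms.
    words = anchor_words(anchor_text)
    return all(term.lower() in words for term in query_terms)


def no_requirement(anchor_text, query_terms):
    # Variation 3: every source document counts, whatever its anchor text.
    return True


def filter_sources(anchors, query_terms, requirement):
    # Keep only the source anchors that satisfy the chosen requirement.
    return [a for a in anchors if requirement(a, query_terms)]


anchors = ["Stanford sports", "Stanford admissions", "click here"]
query = ["stanford", "sports"]

loose = filter_sources(anchors, query, any_term)
strict = filter_sources(anchors, query, all_terms)
everything = filter_sources(anchors, query, no_requirement)
```

The stricter the requirement, the fewer source pages get to contribute their personalized importance, which is why I expect the anchor text score to be zero for most documents.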
That’s about as detailed as I’m going to get. It’s an interesting look into personalized search and ways they might reorder your results. I still think all the concepts they have come up with so far are bad for the web and users alike, but personalized search is still in its infancy, so we’ll see where it goes. If you want to take a look at the patent you can find it here. I’d also like to thank Bill for being at SES and having no time so I could be first to the market with this analysis, but I’m sure his will outshine mine; he’s the expert on these things.