From the category archives:

Personalized Search

Personalized Anchor Text Score

by TheMadHat on August 21, 2007


Bill pointed out a patent recently awarded to Google that I thought I would try to decipher. Based on the summary, they could assign a separate personalized anchor text score based on a “set of user-specific parameters” and combine this with the regular score to reorder your results. Here is the summary from the patent:

A search engine identifies a list of documents from a set of documents in a database in response to a set of query terms. For each document in the list, the search engine determines an information retrieval score based on its content and the query terms, and also identifies a set of source documents that have links to the document and that also have anchor text satisfying a predefined requirement with respect to the query terms. The search engine calculates a personalized page importance score for each of the identified source documents according to a set of user-specific parameters and accumulates the personalized page importance scores to produce a personalized anchor text score for the document. The personalized anchor text score is then combined with the document’s information retrieval score to generate a personalized ranking for the document. The documents are ordered according to their respective personalized rankings.
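To make that summary a bit more concrete, here is a rough sketch of how I read it, in Python. Everything in it is my own assumption: the names (`SourceLink`, `personalized_importance`, `anchor_weight`), the "at least one query term" matching rule, and especially the weighted sum at the end. The patent only says the two scores are "combined" and gives no formula.

```python
from dataclasses import dataclass


@dataclass
class SourceLink:
    """A link pointing at the document being ranked (hypothetical structure)."""
    source_url: str
    anchor_text: str


def anchor_matches(anchor_text, query_terms):
    """Assumed requirement: the anchor text contains at least one query term."""
    words = set(anchor_text.lower().split())
    return any(term.lower() in words for term in query_terms)


def personalized_anchor_text_score(links, query_terms, personalized_importance):
    """Accumulate the personalized page importance of each qualifying source.

    personalized_importance maps a source URL to a per-user score that is
    independent of the query terms.
    """
    return sum(
        personalized_importance.get(link.source_url, 0.0)
        for link in links
        if anchor_matches(link.anchor_text, query_terms)
    )


def personalized_ranking(ir_score, anchor_score, anchor_weight=1.0):
    """Hypothetical combination; a weighted sum is my guess, not the patent's."""
    return ir_score + anchor_weight * anchor_score


links = [
    SourceLink("http://a.example/", "Stanford sports"),
    SourceLink("http://b.example/", "click here"),
    SourceLink("http://c.example/", "stanford football"),
]
importance = {
    "http://a.example/": 0.5,
    "http://b.example/": 0.9,
    "http://c.example/": 0.25,
}
score = personalized_anchor_text_score(links, ["stanford"], importance)
ranking = personalized_ranking(1.0, score)
```

Note that the "click here" link contributes nothing, no matter how important its source page is to the user, because its anchor text fails the requirement.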

Key takeaways from the summary and the claims:

* “Each of the identified source documents have anchor text that satisfies a search query corresponding to the set of query terms.”

* “Each of the identified source documents have anchor text that contains at least one of the query terms in the set of query terms.”

* “The personalized page importance score of the identified source document is independent from the set of query terms.”

Those seem to be fairly basic IR methods but looking a bit further into the patent I’ve come to a couple conclusions/theories/guesses. First, in certain cases this could have a significant impact on the result set. There are a lot of mentions of the source document meeting “predefined requirements” further down in the patent.

* The anchor text of each of the identified source documents is examined to determine if the anchor text satisfies a predefined requirement with respect to the query terms. (They note that the predefined requirement is that it must have terms that correspond to the search query, but there is more to it than that. If that were the only requirement, it would be fairly simple to spam with massive page generation based on as many keyword combinations as you could think up. More on the possible requirements later)

* After the above condition is met, then a personalized page score is calculated according to a set of user-specific parameters that could include a list of user favored websites or keywords suitable for identifying user favored websites. (Once again they are favoring websites you frequently visit, making it harder to find new sources of information that could be more relevant to your query, but I’ve already covered that)

* User-specific parameters may be provided by the user, collected from a third party having such information, or derived from a user’s search history and web history (Spyware in a nicely worded sentence - TURN YOUR WEB AND SEARCH HISTORY OFF!!!).

I think this could have a significant impact on the result set because I don’t think the predefined requirement will be met very often, and when it is, you may see a significant ranking jump for certain sites in your personalized results. I would expect most result sets to carry the normal IR score and a very low or zero anchor text score. When a document meets all the criteria, the anchor text score combined with the IR score will boost the page.

Assume that a user named Adam is looking for a website covering Stanford’s sports teams. For the purposes of this explanation, we will assume that Adam would prefer that the search engine return www.gostanford.com 110-2 ahead of www.espn.com 120-2. To achieve this goal, one approach would be to allow a user like Adam to instruct the search engine to personalize the rankings of search results by providing appropriate user information such as the user’s background information or a plurality of favorite websites. For example, Adam may register with the search engine that he prefers web pages whose URL includes the term “Stanford” over other web pages.

This looks to me like another indication of the increasing weight given to keyword domains. I also think it’s crap they just assume Adam would prefer gostanford.com over espn.com just because he’s been to the first one more often or he’s specified that in his profile. Maybe he hasn’t discovered something that is much more relevant and something he would prefer more. He might not ever find it in this system.

Conceptually, when computing personalized page importance scores, the Page Importance Ranker boosts the page importance scores of documents that are deemed to match the user-specific parameters, which in turn boosts the downstream documents linked to those documents. From another viewpoint, the Page Importance Ranker boosts the page importance scores of documents of each host whose URL matches one or more of the user-specific parameters.

Here is the boost I mentioned above. The amount of “boost”, however, is complete speculation.

In some embodiments, a document is deemed to match (or not match) user-specific parameters solely based on the URL of the document.

Another hint of keyword domains being more powerful (time to increase my domain buying).
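A quick sketch of what URL-only matching might look like, in Python. The function names, the host-name substring check, and the `boost` multiplier are all invented for illustration; the patent doesn't say how a match is tested or how big the boost is. This also skips the part where the boost flows downstream through links.

```python
from urllib.parse import urlparse


def matches_user_parameters(url, favored_keywords):
    """Assumed check: a user keyword appears in the URL's host name.

    Expects a full URL with a scheme (e.g. "http://..."), otherwise
    urlparse returns an empty host.
    """
    host = urlparse(url).netloc.lower()
    return any(keyword.lower() in host for keyword in favored_keywords)


def personalized_importance(url, base_importance, favored_keywords, boost=2.0):
    """Multiply the base score by a made-up boost when the URL matches."""
    if matches_user_parameters(url, favored_keywords):
        return base_importance * boost
    return base_importance


adam_keywords = ["Stanford"]
gostanford = personalized_importance("http://www.gostanford.com/", 0.5, adam_keywords)
espn = personalized_importance("http://www.espn.com/", 0.5, adam_keywords)
```

With equal base scores, Adam's "Stanford" preference lifts www.gostanford.com above www.espn.com purely because of the characters in the host name, which is exactly why keyword domains look attractive here.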

In some embodiments, the source pages listed for a respective document are limited to those that satisfy a predefined requirement with respect to the search query. For instance, in one embodiment the predefined requirement is that the anchor text of the link to the respective document contain at least one query term of the search query. In another embodiment, the predefined requirement is that the anchor text of the link to the respective document satisfy the entire search query, which may be a Boolean expression containing multiple query terms. In yet other embodiments, all source documents are included, without respect to whether the anchor text of the links to the respective document contain any of the query terms.

As I mentioned way up at the beginning of this post, here are some variations of their “predefined requirements”.
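The three variations are easy to picture as interchangeable filters. This is only a sketch with hypothetical names, and the "entire search query" case is simplified to "every term must appear" rather than a real Boolean expression parser.

```python
def contains_any_term(anchor_text, query_terms):
    """Variation 1: anchor text contains at least one query term."""
    words = set(anchor_text.lower().split())
    return any(term.lower() in words for term in query_terms)


def satisfies_all_terms(anchor_text, query_terms):
    """Variation 2, simplified: treat the query as an AND of all its terms."""
    words = set(anchor_text.lower().split())
    return all(term.lower() in words for term in query_terms)


def no_requirement(anchor_text, query_terms):
    """Variation 3: every source document is included."""
    return True


def filter_sources(anchor_texts, query_terms, requirement):
    """Keep only the anchor texts that pass the chosen requirement."""
    return [a for a in anchor_texts if requirement(a, query_terms)]


anchors = ["Stanford sports", "Stanford football scores", "click here"]
query = ["stanford", "football"]

any_match = filter_sources(anchors, query, contains_any_term)
all_match = filter_sources(anchors, query, satisfies_all_terms)
everything = filter_sources(anchors, query, no_requirement)
```

The stricter the requirement, the fewer source documents feed the anchor text score, which fits my guess that the score will often be at or near zero.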

That’s about as detailed as I’m going to get. It’s an interesting look into personalized search and ways they might reorder your results. I still think all the concepts they have come up with so far are bad for the web and users alike, but personalized search is still in its infancy, so we’ll see where it goes. If you want to take a look at the patent you can find it here. I’d also like to thank Bill for being at SES and having no time so I could be first to the market with this analysis, but I’m sure his will outshine mine; he’s the expert on these things.


Personalized Search And Why It Sucks

by TheMadHat on August 1, 2007

I recently had a guest post on 8 Ways To Optimize For Personalized Search over at Search Engine Journal so I figured I would follow that up with why I think it’s not good for users, marketers, and the search engines alike.

1. Personalized Search limits the discovery of new sites and new information. The higher placement of sites you have visited and browsed before pushes anything new that you may have not seen before down the page. Users will have to start finding new content in other ways than search; through blogs, news sites, etc. The users don’t like this and the search engines certainly don’t want people finding content in other ways. If the result sets are filled with things I’ve already seen I’m going to look elsewhere.

2. Personalized Search does not equal more relevant results. Simply because I’ve visited and browsed through a site once or even multiple times doesn’t mean that I think it’s more relevant. In some cases it might be, in others probably not. Google will never know what is more relevant based on my browsing and search history. Take this fairly old post from Graywolf for an excellent example. They might have some decent indicators, but their accuracy in this area isn’t going to come close to the relevancy of their current results, which we know are filled with search pages from authority domains and edu’s selling Viagra.

3. It’s the most invasive spyware known to man. Google records everything you search for, everything you look at, and probably everything you think about. Imagine a massive database filled with every search you’ve ever done and every web page you’ve ever seen. It’s not very appealing, is it? Sure, I don’t mind if it will actually give me something useful in the long run and if I can turn it off and on at the flick of a switch. That way, when I’m searching for [how to infiltrate the Googleplex] no one will be the wiser.

4. Entering the market just got much more difficult. New (legitimate) sites are going to have problems without experienced and professional help that will be very costly. If Google does not have data on your website, even if you go out and purchase an old domain with some already established trust, you’re still up shit creek. Without any users looking at your site, you won’t have new subscribers to your feeds, you won’t have any click-through data, and you will be pushed down the results in favor of something “more popular”…like craigslist.

5. It won’t put a dent in spam. I keep hearing over and over how this is going to eliminate spam. I said in my article over at SEJ that they will be working with very large data sets and filtering out the majority of the automated spam, and they will. The majority of it. Much like they do now. The other 2% of spammers are the ones that are good enough to hijack botnets and fill the world with edu spam and everything else imaginable. Very quickly they will figure out ways to spam personalized search and the flood gates will once again open.

6. They will never be able to determine intent with any accuracy. Ever. The engine reps always bring up the “Jaguar and Jaguar” example about how to deliver searches to a biologist and a car enthusiast. WTF? Maybe the freaking biology teacher wants to drive a Jaguar and maybe the rich guy already driving a Jaguar wants to go on a safari? No search engine will ever know. That is until Google finishes SkyNet and they take over with their mind reading robots. You will be assimilated, or they’ll make you go work for Wikipedia.


Welcome to Part I of a long-awaited collaboration between TheMadHat and TheGypsy. We’ve been digging deep into personalized search lately, and there has been a lot of chatter about the subject around the SEO world. In Part I of our collaboration, I’m going to go through the latest patent on the removal and manipulation of personalized search, and what this means for you, the SEO mastermind. You can see the full patent here.

This patent is a fairly simple one (when it comes to the world of patents anyway). Essentially it goes through the different methods a user could remove unwanted results, and how this data could be used in ranking results. Let’s break it down:

Each search result would have an option to immediately remove the result for that search if the user deems it spam, irrelevant, or offensive. By default, the result would be removed only for that search session. The next time you jump back on and perform the same search, your results will be standard (assuming other personalization factors are not influencing the listings). This is the basic, bare bones example and would cause the fewest problems from an SEO standpoint.

The second option is one that could cause a little more turbulence. The user would have the option to remove that specific result from all future searches. Keep in mind we’re talking about only one listing at this point. Bye bye. You’re now invisible to this user for this particular page. It will never show up again for that search.

The third option is the killer. This option allows you to remove this result, and all documents associated with this result. Bye bye big time. Your entire site is now invisible to this user from now on.
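Here is how I picture the three removal options, sketched in Python. The class, the method names, and the choice to key removals on a (query, URL) pair and to treat "all documents associated with this result" as "everything on the same host" are my assumptions based on my reading above, not anything spelled out in the patent.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class RemovalPrefs:
    """Per-user removal state (hypothetical structure)."""
    session_removed: set = field(default_factory=set)    # (query, url) pairs
    permanent_removed: set = field(default_factory=set)  # (query, url) pairs
    blocked_hosts: set = field(default_factory=set)      # whole sites

    def remove_for_session(self, query, url):
        """Option 1: hide one result until the session ends."""
        self.session_removed.add((query, url))

    def remove_permanently(self, query, url):
        """Option 2: hide one result for this search from now on."""
        self.permanent_removed.add((query, url))

    def remove_site(self, url):
        """Option 3: hide every document on the result's host."""
        self.blocked_hosts.add(urlparse(url).netloc)

    def end_session(self):
        """Session-only removals are forgotten; the rest persist."""
        self.session_removed.clear()


def filter_results(query, urls, prefs):
    """Drop any URL the user has removed under one of the three options."""
    kept = []
    for url in urls:
        if urlparse(url).netloc in prefs.blocked_hosts:
            continue
        key = (query, url)
        if key in prefs.session_removed or key in prefs.permanent_removed:
            continue
        kept.append(url)
    return kept
```

The difference in damage is easy to see: the first option disappears when `end_session()` runs, the second sticks to one page for one search, and the third wipes out the whole host for every query.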

What does all this mean to you? Not a big deal you’re thinking? Well think again. Sure, users removing you from their results were not likely to buy anyway. Not a big loss there. However, let’s look at these specific sections of the patent:

“…aggregating information regarding documents that have been removed by a group of users; and assign scores to a set of documents based on the aggregated information.”

“…determining a remove list score associated with the documents in the set of documents based on the aggregated information;”

Now we’re talking. Each result will be assigned a “remove list score” and rankings will be determined by “link-based score, the information retrieval score, and the remove list score”. My first thought was that this would be easily gamed by all you spammers. Automated account creation, mass removals, etc. Obviously Google has thought of that as well:

“identifying a set of legitimate users and a set of illegitimate users; and collecting information regarding documents that have been removed by the set of legitimate users.”

Now it doesn’t go into detail about how they define a legitimate user. My theory (guess) is that they will look at: length of time on the account, number of searches performed, number of removals done, etc. Sort of a trust factor for the account in question.
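To show what I mean, here is a sketch of a "remove list score" built only from trusted accounts, in Python. The legitimacy thresholds, the fraction-of-users score, and the way it is subtracted from the other two scores are pure guesses on my part; the patent names the three scores but I haven't seen a formula for any of this.

```python
from dataclasses import dataclass


@dataclass
class User:
    """Account-level signals I am guessing they might look at."""
    user_id: str
    account_age_days: int
    searches: int
    removals: int


def is_legitimate(user, min_age_days=30, min_searches=50, max_removal_rate=0.5):
    """Made-up trust test: old enough, active enough, not removing everything."""
    return (
        user.account_age_days >= min_age_days
        and user.searches >= min_searches
        and user.removals <= user.searches * max_removal_rate
    )


def remove_list_score(users, removals_by_user, doc_url):
    """Fraction of legitimate users who removed the document (0.0 to 1.0)."""
    legit = [u for u in users if is_legitimate(u)]
    if not legit:
        return 0.0
    removed = sum(
        1 for u in legit if doc_url in removals_by_user.get(u.user_id, set())
    )
    return removed / len(legit)


def final_score(link_score, ir_score, remove_score, penalty_weight=1.0):
    """Hypothetical combination of the three scores named in the patent."""
    return link_score + ir_score - penalty_weight * remove_score


users = [
    User("a", account_age_days=400, searches=1000, removals=10),
    User("b", account_age_days=200, searches=500, removals=5),
    User("bot", account_age_days=1, searches=3, removals=3),
]
removals = {
    "a": {"http://spam.example/"},
    "b": set(),
    "bot": {"http://good.example/", "http://spam.example/"},
}
spam_score = remove_list_score(users, removals, "http://spam.example/")
good_score = remove_list_score(users, removals, "http://good.example/")
```

The brand-new "bot" account fails the trust test, so its removal of good.example counts for nothing, while one of the two trusted users removing spam.example is enough to drag that page down.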

What does this mean to us? The ease of removing a site will be a big factor, and if it’s easy I can see lots of users removing stuff in droves. If your site sucks, people will remove you and this will cause your rankings to plummet. It just emphasizes the fact that you need to have content that people will eat up and that demonstrates you are an expert in your niche. A professional, well-designed site that makes your users comfortable is also going to be a factor. Obviously Google doesn’t know what your site looks like, but it can make some assumptions based on removal data. Goodbye MFAs and a ton of affiliate garbage.

Make sure you’re ahead of the curve with unique and compelling content and you should reap the rewards. Seems like a standard sentiment nowadays.
