I was evaluating a particular search index recently, trying to compare the relevance of its results against those of a different search engine using the same corpus. This is the same problem you have when you’re tuning the relevance functions in a single index and trying to see if it’s “better”. Some sort of external scoring methodology is needed to compare the different result sets.
First, you need a corpus. A collection of representative documents.
Second, you need a selection of representative queries. Best if they’re taken from actual logfiles of real user queries against the actual collection. Failing that, use the AOL Search Data.
Anyway, I was working today with a real non-trivial corpus, a good set of representative queries, and a buddy’s index and search engine (not Lucene). As I performed sample searches and scored the results, I realized the naive scoring method I started with wasn’t sufficient.
As a user, I don’t care much about the relative ranking within the first page; I just want to make sure that the best results are not languishing on page seven or nineteen of the results. The question that matters is: how do I know the specific ten that search engine A picks for its first page are better than the specific ten that search engine B picks for its first page?
It’s orders of magnitude more important to get the right results onto the first page at all than it is to get them ordered properly once they’re there. Relative ordering within the first page is irrelevant to me as a user.
This also corresponds with how experts in SEO behave. Being the first result on Google is mostly an ego-inflating party trick. The difference between being the second result and the tenth (still on the first page) is negligible. The difference between being the tenth (on the first page) and the eleventh (not on the first page) is huge.
Finally, there’s a “diversity” measure. Users don’t want substantially identical pages to appear on the first results page.
This thinking led me to a very different notion of how to measure the “quality” of the results, one that probably corresponds more naturally to a user’s intuition, and certainly corresponds to mine.
Here’s how I’d score, if I had to.
- Rank all possible results for a query (or at least a very large number, say the top 100 possible results) using some external “true” ranking methodology (Amazon Mechanical Turk, or the TREC data?) to discover the “true top ten.” It’s important that this be external to the relevance mechanism of the search engine itself; otherwise you’ve just created a feedback loop, and everything else is meaningless.
- Flag substantially identical pages in the “true top ten” so that you count only one of them. For example, if your “top ten” has two substantially identical pages, you’ll include the eleventh as part of the “true top ten”.
- Ignore order on first page.
- The base score is a count of the “true top ten” that appear in the first page of the result set. (scores from 0 to 10)
- Add in a +1 “bonus” for getting the “true best one” in the first position.
- If more than one of the “substantially identical” pages flagged in rule two appears on the first page, count only one and score the others as zero.
Let’s call this NQM (Naive Quality Measure). While other more complicated measures might be more defensible academically (and there’s no shortage of academic ways to compare search results), I believe that NQM is going to correspond much more closely to actual user perception. In addition, it’s a lot easier to explain than more complicated academic measures, since most of the score comes from rule four, the base count.
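To make the rules concrete, here’s a minimal Python sketch of how NQM might be computed for a single query. Everything in it is my own illustration rather than an established implementation: the names `true_top_n` and `nqm`, the `canonical` dictionary (mapping each near-duplicate page to one representative page), and the document IDs are all hypothetical. It also makes two assumptions the rules don’t spell out: a near-duplicate of a “true top ten” page counts as a hit for that page, and a near-duplicate of the “true best one” in first position still earns the bonus.

```python
from typing import Dict, List, Optional, Sequence


def true_top_n(ranked: Sequence[str], canonical: Dict[str, str], n: int = 10) -> List[str]:
    """Walk the external ranking best-first, skipping near-duplicates (rules 1 and 2)."""
    kept: List[str] = []
    seen = set()
    for doc in ranked:
        key = canonical.get(doc, doc)
        if key in seen:
            continue  # a substantially identical page is already in the list
        seen.add(key)
        kept.append(doc)
        if len(kept) == n:
            break
    return kept


def nqm(first_page: Sequence[str], ranked: Sequence[str],
        canonical: Optional[Dict[str, str]] = None, n: int = 10) -> int:
    """Score one engine's first page against an external 'true' ranking."""
    canonical = canonical or {}

    def key(doc: str) -> str:
        return canonical.get(doc, doc)

    top = true_top_n(ranked, canonical, n)
    top_keys = {key(doc) for doc in top}
    page = list(first_page)[:n]

    # Rules 3, 4 and 6: order is ignored, and using a set means that
    # several near-duplicates of the same page only count once.
    score = len({key(doc) for doc in page} & top_keys)

    # Rule 5: +1 bonus if the true best result is in first position.
    if page and top and key(page[0]) == key(top[0]):
        score += 1
    return score


# Hypothetical data: "a2" is substantially identical to "a".
ranked = ["a", "a2", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
canonical = {"a2": "a"}
page = ["a", "a2", "b", "x", "y", "c", "z", "w", "v", "u"]
print(nqm(page, ranked, canonical))  # 4: three distinct hits plus the bonus
```

Note that with the bonus, a perfect first page scores 11, not 10. To compare two engines, you’d run this over the whole query set for each and compare the totals or averages.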