Issues on TREC Session Track

Posted on December 14, 2010


William Webber’s post on the TREC Session Track got me thinking about what it means to evaluate a user’s session, or an interactive IR system in general. The TREC Session Track aims to extend the horizon of evaluation beyond a single query-response interaction; its potential benefit is described in the following excerpt from the track overview paper:

A search engine may be able to better serve a user not by ranking the most relevant results to each query in the sequence, but by ranking results that help “point the way” to what the user is really looking for, or by complementing results from previous queries in the sequence with new results, or in other currently-unanticipated ways.

However, this seemingly simple and natural extension leads to many complications, one of which is pointed out by William.

This problem is that the reformulation of the second query is independent of the results retrieved to the first. The response to the first query could be gibberish, or it could be an excellent answer to that first query; the retrieval system could say “I’m a teapot, I’m a teapot, I’m a teapot”, over and over.

In my understanding, he argues that a system’s subsequent responses in a session should, by definition, be evaluated in the context of its earlier responses. However, given that the track provides a predetermined type of reformulation (generalization, specialization, or drift) for each query, although the specific type is not disclosed to the participants, the first response of a system cannot influence the second request of a user.

In this sense, the evaluation model for the track precludes the possibility that the user’s action at time t+1 is affected by the system’s response at time t. On the other hand, the impact of the user’s actions (the original and subsequent queries) on the system can easily be evaluated in this model, since participants are given an original and a reformulated query for each topic. The following diagram illustrates this interaction model, where the only missing link is marked with a dashed line.

Despite this simplification, I believe this is a reasonable approximation of a user’s session in terms of repeatability. Accounting for the variability of the subsequent query due to the system’s initial response should be possible, yet it would multiply the assessors’ effort, since every variation of the user’s query would have to be judged.

Furthermore, the track guideline below suggests that a system’s responses (to the original and reformulated queries) should be evaluated with respect to each other, which is another unique aspect of the track (although the new measure sounds somewhat like a diversification metric).

In our new measures, a document ID that appears in both RL1 and RL2/RL3 will be penalized in RL2/RL3, with the penalization decreasing by the depth at which it appeared in RL1. For example, if document A appears at rank 1 in RL1, it will be heavily penalized for reappearing in RL2 or RL3. If document B appears at rank 100 in RL1, it will not be penalized much for reappearing in RL2 or RL3. The exact form of the penalization has yet to be determined.
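Since the exact form of the penalization was left open in the guideline, here is a purely hypothetical sketch of how such a depth-discounted penalty might look. The function name, the logarithmic decay of the penalty with RL1 rank, and the DCG-style rank discount are all my own assumptions, not the track’s eventual measure:

```python
import math

def discounted_gain(rl1, rl2, relevance):
    """Hypothetical score for RL2 that discounts documents already seen in RL1.

    rl1, rl2  : ranked lists of document IDs
    relevance : dict mapping document ID -> graded relevance

    Assumption: the penalty decays logarithmically with a document's RL1
    rank, so a document at rank 1 in RL1 keeps almost none of its gain
    when it reappears, while one at rank 100 keeps most of it.
    """
    rl1_rank = {doc: r for r, doc in enumerate(rl1, start=1)}
    score = 0.0
    for r, doc in enumerate(rl2, start=1):
        gain = relevance.get(doc, 0)
        if doc in rl1_rank:
            # Penalty shrinks as the RL1 rank grows.
            gain *= 1.0 - 1.0 / math.log2(rl1_rank[doc] + 1)
        score += gain / math.log2(r + 1)  # standard DCG-style rank discount
    return score
```

For instance, a relevant document that was already at rank 1 in RL1 contributes nothing when it reappears at the top of RL2, whereas the same document unseen in RL1 would contribute its full gain.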

All in all, as William eloquently put it in his post (excerpt below), I hope the Session Track will be a solid first step toward repeatable evaluation of interactive IR scenarios, including PIM. The idea of evaluating a user’s session seemed ambitious at first, yet I think the organizers have carved out an interesting and tractable piece of the whole problem.

…history has shown that well-formed test collections are employed for an enormous variety of tasks beyond that which they were originally designed for.

Posted in: IR