CIKM’10 Papers – Reverted Indexing for Feedback and Expansion

Posted on November 7, 2010


I haven’t been blogging for a while, but several interesting papers at the recent CIKM conference in Toronto motivated me to write something here. While many papers make incremental contributions by adding features to existing models or by applying an existing technique to a new domain, a few seem ‘groundbreaking’ in that they offer a new perspective on familiar things. Here’s the first paper I’ll look at.

Reverted Indexing for Feedback and Expansion

This paper from FXPAL is interesting in that it looks at the traditional query-document relationship from the opposite direction: documents retrieve (basis) queries instead of a query retrieving documents, as shown in the following diagram:

In their model, the basis queries, together with the top documents they retrieve, form reverted documents. These reverted documents are indexed the same way the original documents are, so all the familiar statistics carry over: normalized document scores become term frequency (TF), the sum of TFs per basis query becomes the (reverted) document length, and the number of reverted documents containing each original document gives its IDF.
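To make the mapping concrete, here is a minimal sketch of building a reverted index. The `retrieve` callback, the score-sum normalization, and the data layout are my assumptions for illustration; the paper only requires that some retrieval model produce ranked, scored lists per basis query.

```python
from collections import defaultdict

def build_reverted_index(basis_queries, retrieve, top_k=1000):
    """Map each original document to the basis queries that retrieve it.

    `retrieve(q)` is assumed to return [(doc_id, score), ...] sorted by
    score; any retrieval model can play this role. The normalized score
    plays the role of term frequency in the reverted document.
    """
    reverted = defaultdict(list)  # doc_id -> [(basis_query, normalized_score)]
    for q in basis_queries:
        results = retrieve(q)[:top_k]  # cut-off: only top-ranked docs survive
        if not results:
            continue
        total = sum(score for _, score in results)
        for doc_id, score in results:
            reverted[doc_id].append((q, score / total))
    return reverted
```

With this layout, `len(reverted[d])` is the number of reverted documents containing `d`, which is exactly the quantity that gives the IDF analogue.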

For evaluation, they use relevance feedback: a set of relevant documents serves as a (reverted) query to retrieve reverted documents (basis queries), which in turn are used, along with the original query, as expansion terms for retrieving original documents.
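The first stage of that pipeline can be sketched as follows: score each basis query by how strongly it is associated with the relevant documents, then keep the top few as expansion terms. The TF-IDF-style weighting here is my assumption for illustration; the paper's exact scoring may differ.

```python
import math
from collections import defaultdict

def expansion_queries(relevant_docs, reverted, n_reverted_docs, top_n=5):
    """Treat the relevant documents as a query against the reverted index
    and return the best-matching basis queries as expansion terms.

    `reverted` maps doc_id -> [(basis_query, tf), ...];
    `n_reverted_docs` is the total number of reverted documents.
    """
    scores = defaultdict(float)
    for doc_id in relevant_docs:
        postings = reverted.get(doc_id, [])
        if not postings:
            continue
        # a doc found in fewer reverted documents gets a higher IDF weight
        idf = math.log(n_reverted_docs / len(postings))
        for basis_query, tf in postings:
            scores[basis_query] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```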

They show that expansion terms derived by reverted indexing are more selective (lower DF) than terms produced by traditional relevance feedback methods. After the conference, William Webber raised the question of why this is the case, which led to a thought-provoking discussion.

I believe the following factors contribute to the advantage of reverted indexing: 1) the cut-off (only the top 1,000 documents are included in a reverted document, so basis queries that rank the query documents low are never considered at all); 2) document length normalization (basis queries with shorter docid lists are given priority); 3) document frequency (query documents that appear in a smaller number of basis queries have more impact). In contrast, traditional term-weighting methods for relevance feedback mostly consider only term frequency (e.g., the relevance model).

While Webber’s point on why it works was interesting, I’m also interested in its applications. I remember Jeremy mentioning during (or after) the talk that they expect reverted indexing to be useful for interactive IR, which makes sense given their impressive relevance feedback results.

In particular, I’m interested in seeing how it works when applied to document similarity search, which is a natural fit for reverted indexing since similarity search also looks for relevant documents given a document. I can imagine it working better than the traditional approach of cosine similarity with TF-IDF term weighting.
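One way this could work, sketched under my own assumptions (the paper does not describe a similarity search procedure): look up the basis queries that best retrieve the seed document, re-run them, and merge the result lists. The score-summing merge and the parameter choices are hypothetical.

```python
def similar_documents(doc_id, reverted, retrieve, top_queries=3, top_k=10):
    """Similarity search via reverted indexing (illustrative sketch).

    `reverted` maps doc_id -> [(basis_query, tf), ...];
    `retrieve(q)` returns [(doc_id, score), ...] sorted by score.
    """
    # the basis queries most strongly associated with the seed document
    postings = sorted(reverted.get(doc_id, []), key=lambda p: p[1], reverse=True)
    merged = {}
    for basis_query, _ in postings[:top_queries]:
        for other_id, score in retrieve(basis_query)[:top_k]:
            if other_id != doc_id:
                # summing scores across basis queries is one simple merge
                merged[other_id] = merged.get(other_id, 0.0) + score
    return sorted(merged, key=merged.get, reverse=True)
```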

As I presented in my CIKM paper, similarity-based browsing can be a good model for a user interacting with personal documents, and reverted indexing may improve its effectiveness further, especially since personal documents contain many named entities that can serve as basis queries.

In summary, I believe more interesting things are still ahead, since reverted indexing is a general framework for retrieving (basis) queries using (relevant) documents, and the model has many parameters: the choice of basis queries, the original retrieval model, the score normalization method, the cutoff used in building reverted documents, and so on. In fact, any method for ranking documents can be used here; the authors suggest using click-through data, for example.

Posted in: IR