How would you search for your favorite movie?

Posted on August 25, 2009


How would you search for your favorite movie? I tried to answer this question in my paper ‘A Probabilistic Retrieval Model for Semistructured Data’, which was presented at ECIR’09 in Toulouse, France.

In this problem, we can assume that the collection is structured into many different fields (e.g. title, cast, genre) and that users may recall only partial information about some aspects of the target item. People have either used an advanced search interface or let the user issue a structured query (e.g. XPath), which turns the problem into structured retrieval.

I wanted to see the problem differently. The first obvious thing was that people usually don’t want to deal with a multi-field search form. Also, formulating an XPath query is beyond the capability of average users (it’s hard even for me!). I thought it would be useful if users could issue a simple keyword query yet get the same effect as using those advanced techniques.

Thinking about typical querying behavior made me realize that we implicitly map each query term to some aspect of the item we are looking for. Suppose a user is trying to find the movie ‘French Kiss’ with partial information about the cast (‘meg ryan’) and genre (‘romance’). He or she may type ‘meg ryan romance’, yet it is clear which aspect of the data (the movie) the user meant by each query word. We can infer this mapping between query terms and document fields by Bayesian estimation (more detail in the paper).
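To make the mapping idea concrete, here is a minimal sketch that estimates P(field | term) with Bayes’ rule from per-field term counts. The toy collection, the uniform field prior, the add-one smoothing, and the names `field_term_counts` and `field_posteriors` are all assumptions made for illustration; the paper describes the actual estimation.

```python
from collections import Counter

# Toy collection (hypothetical data): each movie has text in several fields.
MOVIES = [
    {"title": "french kiss", "cast": "meg ryan kevin kline", "genre": "romance comedy"},
    {"title": "sleepless in seattle", "cast": "tom hanks meg ryan", "genre": "romance drama"},
    {"title": "saving private ryan", "cast": "tom hanks matt damon", "genre": "war drama"},
]


def field_term_counts(docs):
    """Count how often each term occurs in each field across the collection."""
    counts = {}
    for doc in docs:
        for field, text in doc.items():
            counts.setdefault(field, Counter()).update(text.split())
    return counts


def field_posteriors(term, counts):
    """Return P(field | term), proportional to P(term | field) * P(field).

    Assumptions of this sketch: a uniform prior over fields and
    add-one smoothing of the per-field term likelihood.
    """
    vocab = {w for c in counts.values() for w in c}
    prior = 1.0 / len(counts)
    joint = {}
    for field, c in counts.items():
        likelihood = (c[term] + 1) / (sum(c.values()) + len(vocab))
        joint[field] = likelihood * prior
    total = sum(joint.values())
    return {field: p / total for field, p in joint.items()}


if __name__ == "__main__":
    counts = field_term_counts(MOVIES)
    for term in "meg ryan romance".split():
        post = field_posteriors(term, counts)
        print(term, max(post, key=post.get))
```

On this toy data, ‘meg’ and ‘ryan’ come out most strongly associated with the cast field and ‘romance’ with the genre field, even though ‘ryan’ also appears in one title.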

Given this observation, and taking into account that each aspect of the information is encoded in a different XML element, it is natural that a ranking algorithm for this kind of document can benefit from this mapping between query words and document fields. The solution is to put a higher weight on the element that the user most likely intended. In the example above, the ‘cast’ element needs to be weighted higher for ‘meg ryan’, and the ‘genre’ element for ‘romance’.
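The weighting idea can be sketched roughly as follows: for each query term, a document’s score is a weighted sum of per-field term probabilities, with the weights taken from the term-to-field mapping. Everything here is an assumption for illustration, not the exact model from the paper: the toy documents, the hand-set mapping weights in `MAPPING`, the crude smoothing in `term_prob`, and the function names.

```python
import math

# Toy documents (hypothetical data), each with three text fields.
DOCS = {
    "French Kiss": {"title": "french kiss", "cast": "meg ryan kevin kline", "genre": "romance comedy"},
    "Top Gun": {"title": "top gun", "cast": "tom cruise kelly mcgillis meg ryan", "genre": "action drama"},
    "Saving Private Ryan": {"title": "saving private ryan", "cast": "tom hanks matt damon", "genre": "war drama"},
}

FIELDS = ("title", "cast", "genre")

# Hand-set term-to-field weights, standing in for an estimated mapping.
MAPPING = {
    "meg": {"title": 0.1, "cast": 0.8, "genre": 0.1},
    "ryan": {"title": 0.3, "cast": 0.6, "genre": 0.1},
    "romance": {"title": 0.2, "cast": 0.1, "genre": 0.7},
}


def term_prob(term, text, lam=0.1, floor=0.001):
    """Smoothed probability of a term in one field's text.

    Mixes the relative frequency with a small constant floor so that
    a missing term never drives the whole score to zero.
    """
    tokens = text.split()
    ml = tokens.count(term) / len(tokens) if tokens else 0.0
    return (1 - lam) * ml + lam * floor


def score(query, doc, mapping=MAPPING):
    """Log-score of a document: sum over terms of log(weighted field mix)."""
    uniform = {f: 1.0 / len(FIELDS) for f in FIELDS}
    total = 0.0
    for term in query.split():
        weights = mapping.get(term, uniform)  # unknown terms get equal weights
        total += math.log(sum(w * term_prob(term, doc.get(f, "")) for f, w in weights.items()))
    return total


def rank(query, docs=DOCS):
    """Return document names ordered from best to worst score."""
    return sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)


if __name__ == "__main__":
    print(rank("meg ryan romance"))
```

With these toy numbers, ‘French Kiss’ ranks first for ‘meg ryan romance’, and a match on ‘meg’ in the cast field counts for more under the mapping weights than under uniform field weights.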

This simple idea later turned out to improve retrieval performance significantly. The performance gain was more noticeable for collections with clear semantics (e.g. movie descriptions), since it was easier for the system to map each query word to the correct document field.

I’m currently working on applying this retrieval model to the desktop search problem, where XML data are replaced with documents that have metadata fields.

Posted in: IR