Retrieval Experiments in Pseudo-desktop Collections

Posted on September 2, 2009


My paper ‘Retrieval Experiments in Pseudo-desktop Collections’ (co-authored with my advisor Bruce Croft) will be presented at CIKM 2009. It proposes a new model for desktop search research, in which we introduce the ‘pseudo-desktop’: a simulated desktop collection composed of automatically gathered documents and generated queries. We also suggest a method for validating the generated collection.

From a retrieval model perspective, we treat desktop search as a known-item search task over a semistructured document collection, since people usually look for items they already know about and each document on a desktop has metadata. For instance, an e-mail has sender and receiver fields in addition to the usual title and content fields. We present an improved retrieval model based on PRM-S, which was introduced in my previous work.

I believe that the significance of this work is threefold. First, it is an attempt to bring more scientific rigor to desktop search (or searching personal information in general). People have studied desktop search for a long time, yet they have mostly built their own systems and reported the results of user studies, which lack reproducibility. A pseudo-desktop can be a sharable data collection that addresses this problem: new researchers can test their algorithms against state-of-the-art baselines without building yet another desktop search engine.

Second, the experimental results show the value of simulation as a method in IR research. Simulated queries are not only free to obtain but also give total control over their parameters. In our experiments, using algorithmically generated queries with many different characteristics, we gained insights into the performance of the tested retrieval methods. Another paper at SIGIR 2009 also demonstrated this value of simulated queries.
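To make the idea of a generated query concrete, here is a minimal sketch of one simple way to simulate a known-item query: pick a target document and sample query terms from it in proportion to their frequency. This is only an illustration under my own assumptions, not the exact generation procedure from the paper; the function name `generate_query`, the field-dictionary layout, and the `query_len` parameter are all hypothetical.

```python
import random
from collections import Counter


def generate_query(doc_fields, query_len=3, rng=None):
    """Sample a simulated known-item query from one target document.

    doc_fields: dict mapping a field name to a list of tokens
                (a hypothetical document layout for this sketch).
    query_len:  number of query terms to draw (a controllable parameter).
    rng:        optional random.Random instance, for reproducible runs.
    """
    rng = rng or random.Random()
    # Pool the tokens of every field into one bag of words.
    tokens = [tok for toks in doc_fields.values() for tok in toks]
    if not tokens:
        raise ValueError("cannot generate a query from an empty document")
    counts = Counter(tokens)
    terms = list(counts)
    weights = [counts[t] for t in terms]
    # Draw terms with replacement, proportionally to term frequency.
    return rng.choices(terms, weights=weights, k=query_len)
```

Because the target document is known by construction, each generated query comes with its relevance judgment for free, and parameters such as the query length can be varied systematically.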

Lastly, PRM-S, a novel retrieval model based on the mapping between query words and document structure, was found to be useful in noisy settings like e-mail (e.g. heavy word overlap between document fields) as well as in collections with clean structure like a movie database.
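The mapping idea can be sketched roughly as follows: for each query word, estimate how likely it is to belong to each field using collection-level statistics, then use those probabilities as per-word field weights when scoring a document. The code below is an illustrative sketch rather than the exact model from the paper; the uniform field prior, the simple linear smoothing with `lam`, and the names `mapping_probs` and `prms_score` are assumptions made for this example.

```python
def mapping_probs(term, coll):
    """Estimate P(field | term) from collection-level field term counts.

    coll: dict mapping a field name to a dict of term counts over the
          whole collection. A uniform prior over fields is assumed.
    """
    likelihoods = {}
    for field, counts in coll.items():
        total = sum(counts.values())
        likelihoods[field] = counts.get(term, 0) / total if total else 0.0
    norm = sum(likelihoods.values())
    if norm == 0:
        # Term unseen in every field: fall back to a uniform mapping.
        return {f: 1.0 / len(likelihoods) for f in likelihoods}
    return {f: p / norm for f, p in likelihoods.items()}


def prms_score(query, doc, coll, lam=0.1):
    """Score a document as a product over query terms of a
    mapping-weighted mixture of smoothed field language models.

    doc: dict mapping a field name to a list of tokens.
    lam: weight of the collection model in linear smoothing (assumed).
    """
    score = 1.0
    for term in query:
        term_score = 0.0
        for field, p_map in mapping_probs(term, coll).items():
            tokens = doc.get(field, [])
            p_doc = tokens.count(term) / len(tokens) if tokens else 0.0
            total = sum(coll[field].values())
            p_coll = coll[field].get(term, 0) / total if total else 0.0
            # Weight the smoothed field model by the mapping probability.
            term_score += p_map * ((1 - lam) * p_doc + lam * p_coll)
        score *= term_score
    return score
```

In this sketch, a word such as a person's name that occurs mostly in sender fields across the collection gets most of its weight from the sender field of each document, without the user having to specify the field.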

Recently, I’ve been working on the development of LiFiDeA, a prototype PIM system, with which I plan to compare experimental results on pseudo-desktop and real desktop collections. I believe that this will provide the ultimate validation of our approach.

Posted in: IR, PIM