Toward Repeatable Evaluation for PIM Research

Posted on December 2, 2010


It’s been a while since I started my research on PIM (personal information management), focusing on the retrieval of personal documents. Empirical verification of research findings is the holy grail of science in general and of information retrieval in particular; however, the evaluation of PIM research has been a challenging problem for several reasons.

The Challenge

Among many reasons, the biggest issue is the private nature of the data. Few people are willing to donate their own document collections to random researchers. Even when we can find a few participants willing to share at least some parts of their collections, there are usually conditions attached, which makes sharing such data practically difficult.

If we consider the benefit that TREC-style repeatable evaluation has brought to the IR research community over the years, this has been a major limiting factor for PIM research. Researchers have built their own systems and run user studies, yet their findings could not be verified by other parties, because the data was not available to anyone else. This challenge in evaluation has prevented research findings from accumulating, hindering the advancement of the research area as a whole.

Initial Attempts

As an initial attempt, my recent work (CIKM’09 and SIGIR’10) tried to address this point by proposing a general retrieval model and a method for building test collections, including queries and documents, that can be shared within the research community. The basic idea is to use a simulation technique and a human computation game to collect search log data.

Since I introduced some artificiality in collecting the data, an obvious problem lies in external validity: whether the data gathered reflect the behavior of actual users. Despite this limitation, the experimental control and the reusability of the collection outweigh the disadvantages, in my view.


During the SIGIR’10 conference and the co-located Desktop Search Workshop, we discussed ways to do repeatable evaluation for PIM systems. Paul Thomas (CSIRO) suggested sharing the algorithm instead of the collection. The basic idea is that one can instrument a PIM system equipped with many models of information access to measure their relative benefit.

If participating researchers can agree on a common platform and share the outcomes of their user studies, this can be a viable way to evaluate the benefits of a PIM system. As a downside, the design and implementation of such a system (on top of which many PIM modules can be reliably run and evaluated) can be challenging in itself.
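To make the idea of a shared, instrumented platform concrete, here is a minimal sketch in Python. Everything in it is hypothetical — the interface name `RetrievalModel`, the toy `TermOverlapModel` baseline, and the `run_study` harness are my own illustration, not anything from the workshop — but it shows the shape of the proposal: many models of information access plug into one platform, which records how each model performs on the same task.

```python
from abc import ABC, abstractmethod

class RetrievalModel(ABC):
    """One pluggable model of information access (hypothetical interface)."""

    @abstractmethod
    def rank(self, query: str, docs: list) -> list:
        """Return document indices, best match first."""

class TermOverlapModel(RetrievalModel):
    """Toy baseline: rank documents by the number of shared query terms."""

    def rank(self, query, docs):
        q = set(query.lower().split())
        scores = [len(q & set(d.lower().split())) for d in docs]
        return sorted(range(len(docs)), key=lambda i: -scores[i])

def run_study(models, query, docs, target):
    """Record, for each model, the rank at which the known target appears."""
    return {type(m).__name__: m.rank(query, docs).index(target) + 1
            for m in models}

docs = ["budget meeting notes", "holiday photos", "project budget plan"]
print(run_study([TermOverlapModel()], "budget plan", docs, target=2))
# → {'TermOverlapModel': 1}
```

The point of the design is that `run_study` (and its logging) is shared and fixed, so results from different research groups running different `RetrievalModel` implementations remain comparable even though the underlying personal collections never leave their owners.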

Another way I suggested during the workshop is to use only the publicly available parts of personal information for evaluation. Given that an increasingly large part of personal information is published on blogs, Twitter, and other places on the web, it might be a challenge (at least for some people) to manage all the information spread across these places. However, some participants questioned whether there is even such a need (to search across all these SNS sites), and this approach still requires a working system.

A Next Step

After the workshop, several people (including me) gathered and discussed what we could do about the problem. One common voice was that we as a community need some venue for standardized evaluation, although we remain open to new forms of evaluation. In the end, we could have something like a TREC PIM track, where many methods for PIM can be evaluated in a standardized manner.

As a first step, together with David Elsweiler and Liadh Kelly, I proposed a workshop titled “Evaluating Personal Search” for ECIR’11. A key difference of this workshop from previous PIM or desktop search workshops is that we aim to reach a consensus on the evaluation of personal information retrieval, as opposed to having participants simply present new ideas.

To facilitate this process through the workshop, I made public the two datasets I created for my research — the pseudo-desktop and CS collections. Here are brief descriptions of the datasets from the workshop webpage:

Pseudo-desktop collections were created to contain the typical file types found in desktop collections — emails, web pages (HTML), and office documents (PDF, DOC, and PPT) — related to specific individuals. Documents were collected by filtering the W3C email collection and by using the Yahoo! web search API. Queries were generated by sampling terms from each target document and were later validated against a separate set of hand-written queries.
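The query-generation step above can be sketched in a few lines of Python. This is only an illustration of the general term-sampling idea, not the actual procedure used to build the collections: the function name, the whitespace tokenization, and the TF-weighted sampling are all my assumptions.

```python
import random
from collections import Counter

def simulate_known_item_query(doc_text, query_len=3, seed=0):
    """Sample query terms from a target document, weighted by term frequency.

    A minimal sketch of simulated known-item query generation: terms that
    occur more often in the document are more likely to enter the query.
    """
    rng = random.Random(seed)  # fixed seed for reproducible queries
    terms = [t.lower() for t in doc_text.split() if t.isalpha()]
    tf = Counter(terms)
    vocab = list(tf)
    weights = [tf[t] for t in vocab]
    query = []
    while len(query) < query_len and vocab:
        # draw one term, then remove it so query terms are distinct
        i = rng.choices(range(len(vocab)), weights=weights, k=1)[0]
        query.append(vocab.pop(i))
        weights.pop(i)
    return " ".join(query)

doc = ("meeting notes budget review budget planning "
       "quarterly budget meeting agenda")
print(simulate_known_item_query(doc))
```

A validation pass like the one described above would then compare such simulated queries against hand-written queries for the same target documents, e.g. by measuring term overlap or retrieval effectiveness.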

The computer science (CS) collection was created for the evaluation of desktop search. Documents of various types were collected from many public sources in the Computer Science department of the University of Massachusetts Amherst, and known-item queries were created by people from the same department using the DocTrack game.

Although there is room for improvement in how these datasets were created and used, I hope they can be a useful resource for many researchers, and a good starting point from which we can develop a reasonable evaluation method that many PIM researchers can use.

Posted in: PIM