Implementing PRM-S retrieval model on Galago

Posted on March 10, 2011

1


PRM-S (Probabilistic Retrieval Model for Semistructured Data) is a novel retrieval model for structured documents — i.e. XML documents or emails.  My previous works show its effectiveness in the IMDB collection [1], and also in the TREC email known-item finding task [2]. (I have blogged about the model before)

Recently I got requests for codes implementing the PRM-S, and I started implementing it using the Galago search engine. Galago is a Java-based search engine & framework for IR research, developed by Trevor Strohman, and being maintained and improved by some of my fellow researchers in CIIR (especially Sam and Marc). Being intended for IR education and research, it has a flexible and extensible architecture, which allows the implementation of retrieval models like the PRM-S to be done in a page of code.

Let’s look at what’s PRM-S, and how it can be implemented in Galago.

PRM-S

In PRM-S, we assume that each document is composed of several fields (e.g. emails). Previous work tried to use the structure by using the weighted combination of field-level scores, such as the mixture of field language models, or BM25F. (More detailed introduction of these models can be found in my paper.)

PRM-S is similar to previous retrieval models exploiting the field structure of documents, but with a key difference in that it allows each field to have weights depending on the field and query-term. For example, while previous work allowed to explain the notion of one field being more important than another (e.g. title is more important than body), yet not the case where each field has a different importance for each query-term. To see why this is a useful thing to do, please refer to my previous post about this model.

In terms of estimation, mapping probabilities are determined based on the Bayesian classification of query-term into one of field-level language models. The following excerpt from my ECIR slide shows (1) the calculation of mapping probability (MP), and (2) the corresponding retrieval model where field-level query-likelihood scores are combined using MP as weights.

As you can see above, the implementation of the mapping probabilities (MP) requires the field-level collection frequency of each query-term, and the length of each field. And then the MP can be used as weights in a field-level mixture language model. In what follows, I’ll introduce steps for implementing such retrieval model within the Galago search engine framework.

Our Goal : Transforming a Simple Query into a PRM-S Query

Like its predecessor the Indri, the Galago is a based on the combination of inference network and language modeling approaches for IR, although an arbitrary retrieval model to be implemented thanks to its flexible architecture. In Galago, you can construct a complex query by building a hierarchy of nodes. More details can be found in Don Metzler’s Indri query language description.

Let’s start by defining what we want to do clearly. Given a simple query like ‘meg ryan romance’ against a movie collection composed of fields such as {title, genre, cast, team, plot}, our goal is to transform it into the following structured query.

#combine(
#combine:0=0.407:1=0.382:2=0.187 ( meg.cast  meg.team  meg.title )#combine:0=0.601:1=0.381:2=0.017 ( ryan.cast  ryan.team  ryan.title )#combine:0=0.927:1=0.070:2=0.002 ( romance.genre  romance.title  romance.plot ))

In Galago, such transformation can be done by implementing Traversal interface, which defines how a given query can be transformed into another query. Since each query is represented as a hierarchy of nodes, we are essentially performing a graph-to-graph transformation here.

Getting the Length of Each Field

Before we construct actual retrieval model, we first need to get the length of each field that will be used as a denominator in calculating MP.

In Galago, you can fetch some information from a part of the index by creating an appropriate node structure. In our case, we first fetch a field list from user’s query, and then create an all node for each fields which iterates through all the occurrences of extents. And the term count can be calculated using the xcount method, which returns the  occurrences count of a given expression.

Constructing the Retrieval Model

To complete the implementation, we need to construct the following node (circle) structure with corresponding parameters (table). Here, each arrow denotes a parent-child relationship between nodes. (A node can contain multiple children in Galago)

To look at the structure from bottom-up, we first create two extent nodes whose function is to read the information from corresponding index files. Here, the first node is for reading positions of a term (from “posting” file), and the second node is for reading positions of a field (from “extents” file). While the extent node returns a set of start and end positions for each extent, a term is considered as an extent with size 1 in Galago, so we use the extent node for representing both terms and extents.

Given these two extent nodes, we then create an inside node whose function is to return the number of times the left extent is contained by the right node. If the extents are identical, then they are counted. In our case, since the left child has positions of a term and the right one has positions of an extent, the output of the inside node is the count of a specific term within an extent. (e.g. How many times ‘meg’ occurred in cast field? “<cast> meg </cast> <title> meg says hello </title>”  returns the count = 1)

We then need to convert the count into the score that can be used as a input for other operators. In Galago, a feature node is responsible for turning a raw expression count into a score. While this can be done in many different ways, we used the query-likelihood score with Dirichlet smoothing in our example.

The code below shows how the node structure as above can be built. For each query-term (outer loop) and field (inner loop), you can see that a feature node is created and a mapping probability is calculated, which are then used to create a combine node. Here, Combine nodes return a normalized weighted sum of the scores produced by each child node. The weights are specified through the parameter object. The weights are specified as in an array, where index 0 defines the weight of the first child node, index 1 defines the weight of the second, and so on. It is important to note that the weights are normalized by the total weight sum.

Using PRM-S in Galago

Given the implementation as above, you can use the PRM-S using the following query syntax, which then is transformed into a structured query with mapping probabilities built-in. While PRM-S is not checked-in to Galago source repository at this point, it will be available for use within a week or two.

Here I introduced the PRM-S briefly, and then explained how I implemented it using the Galago. Please let me know  if you have any questions or comments.

p.s. Special thanks to Sam and Marc for helping me in learning Galago and implementing this!

 

References

[1] A Probabilistic Retrieval Model for Semi-structured Data [paper]
Jinyoung Kim, Xiaobing Xue and W. Bruce Croft, Proceedings of the 31st European Conference on Information Retrieval (ECIR), Tolouse, France, Apr. 6-9, 2009, pp. 228-239.

[2] Retrieval Experiments using Pseudo-Desktop Collections [paper]
Jinyoung Kim and W. Bruce Croft, Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, China, Nov. 2-6, 2009, pp. 1297-1306.

Advertisements
Posted in: IR