Research issues in context-aware retrieval: retrieval duplication

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

In a context-aware retrieval (CAR) application, if the user performs one retrieval, changes their context a little, and then performs a second retrieval, the second is likely to deliver much the same set of documents as the first; that is, there will be many duplicates between the two sets (though the ranking scores of two duplicate documents may differ slightly, e.g. when the user's context has moved slightly closer to the document). If a document is of only passing interest, the user does not want to see it many times; hence there is often a need to eliminate duplicates. This document discusses the issues involved. The same problem can arise in ordinary information retrieval (IR), but less severely, since in ordinary IR it is unlikely that each query will be a small perturbation of its predecessor.

Introduction

In context-aware retrieval the user is continually retrieving sets of documents, and, assuming the context is changing slowly, it is likely that one set of documents will be much the same as its predecessor. There is therefore a potential duplication problem: the same document may be presented to the user many times.

In order to throw light on this issue we first look at the general case of document delivery and presentation. Typically there are three components:

A delivery system. This may deliver documents singly or in batches; in the latter case each document may have an associated ranking. In some delivery systems, e.g. mail or information filtering, duplicates are rare or impossible. In others, such as information retrieval systems, duplicates are common -- though most systems avoid duplicates within the same batch.
A presentation system that shows documents to the end-user. Typically this works at several levels: e.g. (1) brief title; (2) full title + further info; (3) full document. Thus the presentation system provides a ramping interface.
A storage system that keeps some or all of the delivered documents so that the user can see them again at a later date. Sometimes the storage system is automatic, as with most mailers, and sometimes it relies entirely on the user saving documents in directories of their own creation.

General points about duplication

We now look at some general characteristics of duplication.

Duplication may have an associated "time-out", which may simply be the end of the current session or may be a fixed period such as a month. If a document is re-delivered after its time-out, then it is considered to be a new document. (An alternative, more sophisticated, approach is to reduce a document's ranking according to how recently the user has previously seen it.)
A re-delivered document may sometimes be the same as a previous one, but with a slightly changed content. For simplicity we will not consider this to be a duplicate but will assume a changed document is a new document.
If a document is a duplicate, but its predecessor has not yet been presented to the user, then it is often convenient to delete the record of the previous document and treat the duplicate as if it were a new document (perhaps with a slightly different score from its predecessor).
Some documents are by their nature "read-once" whereas others are more for reference, i.e. regular or continuous reading. A map is an example of the latter. For the user, duplication is most troublesome with read-once documents. For reference documents there are cases where duplication is no problem at all. For example, assume that the presentation system in a context-aware application always shows a map relating to the user's current location. Here, if the user goes back to a previously-visited location, and a previously-seen map is retrieved and presented, then this is quite natural. (In a retrieval system, relevance feedback can help establish the user's view of a document's role, and whether it is used for regular reference.)
To extend the previous point, in some applications duplication is not an issue at all: for example the application may say "deliver, as a batch of documents, all documents with property XXX" with the purpose of using the whole batch to form a cache. It is then irrelevant whether some of the documents occurred in a previous batch.
The smaller the screen space available for presentation to the user, the more important it is to eliminate duplication: even a title may occupy 20% of a small screen. Similarly an intrusive interface, such as one that beeps continuously on delivery of a document, has no place for delivery of unwanted documents.
Users are already familiar with interfaces that indicate whether documents have been viewed before: examples are mailers, which mark messages as read or unread, and web browsers, which typically shade links in different ways according to whether the user has previously visited the destination. Such interfaces can be extended to mark newly-delivered documents that are duplicates of previous ones.
To detect duplicates it helps if each document has a unique ID.
Retrieval and other information delivery systems normally eliminate duplicates in their basic data. Thus a document collection for an IR system will not contain the same document twice, and an information filtering (IF) system will try to ensure that its document stream does not contain the same document twice. (There can be problems with copied documents: e.g. a newsflash from Reuters may be used by several different information providers, and a document collection that covers several such providers might then include the same original Reuters document several times. We doubt, however, that this is a widespread problem.)
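
To make the time-out idea and its ranking-based alternative more concrete, here is a minimal Python sketch. The function names, the choice of a half-life penalty, and the sample dates are illustrative assumptions of ours; they are not taken from any existing system.

```python
from datetime import datetime, timedelta


def is_duplicate(last_seen, now, timeout):
    """Return True if the document was seen before and its time-out
    has not yet expired. A document never seen (last_seen is None),
    or seen longer ago than the time-out, counts as new."""
    if last_seen is None:
        return False
    return now - last_seen < timeout


def recency_adjusted_score(score, last_seen, now, half_life):
    """Alternative to a hard time-out: scale the ranking score down
    the more recently the document was seen.

    Assumed penalty (illustrative only): a document seen just now
    scores 0, and the penalty halves with every half_life that passes.
    A never-seen document keeps its full score."""
    if last_seen is None:
        return score
    elapsed = (now - last_seen) / half_life  # timedelta / timedelta -> float
    return score * (1.0 - 0.5 ** elapsed)


# Example with arbitrary dates: a one-month time-out.
now = datetime(2000, 1, 31, 12, 0)
one_month = timedelta(days=30)
seen_yesterday = now - timedelta(days=1)
seen_long_ago = now - timedelta(days=60)

print(is_duplicate(seen_yesterday, now, one_month))  # still within the time-out
print(is_duplicate(seen_long_ago, now, one_month))   # time-out has expired
```

With the hard time-out a document is either suppressed or treated as entirely new; with the score adjustment a recently-seen document can still surface if little else is relevant.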

Context-aware retrieval

As we have said, duplication is likely to be a severe problem in context-aware retrieval. Indeed, if a user's context is changing slowly, a retrieval may yield exactly the same batch of documents as its predecessor, albeit perhaps with slightly different scores. We believe that this problem generally needs to be tackled by the application rather than by the retrieval system, since only the application has any idea of whether duplicates are wanted (see the examples above of the map and of the documents for a cache). To tackle the problem, the application needs a storage system to record previously-seen documents: at the very least, each record should contain a unique ID to identify the document and the time at which it was last viewed.
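
A minimal sketch of such a storage system might look as follows. The class name SeenStore, its methods, and the representation of a batch as (ID, score) pairs are hypothetical choices made for illustration; times are plain numbers (e.g. seconds) to keep the example short.

```python
class SeenStore:
    """Records, for each document the user has viewed, its unique ID
    and the time at which it was last viewed (hypothetical design)."""

    def __init__(self):
        self._last_viewed = {}  # doc_id -> time last viewed

    def mark_viewed(self, doc_id, when):
        """Record that the user actually viewed the document."""
        self._last_viewed[doc_id] = when

    def last_viewed(self, doc_id):
        """Return the time of last viewing, or None if never viewed."""
        return self._last_viewed.get(doc_id)

    def split_batch(self, batch):
        """Partition a batch of (doc_id, score) pairs into
        (new, duplicates), preserving the order of delivery."""
        new, duplicates = [], []
        for doc_id, score in batch:
            if doc_id in self._last_viewed:
                duplicates.append((doc_id, score))
            else:
                new.append((doc_id, score))
        return new, duplicates


store = SeenStore()
store.mark_viewed("doc-17", 1000.0)

# A later retrieval returns doc-17 again, with a slightly different score.
new, duplicates = store.split_batch([("doc-17", 0.82), ("doc-42", 0.61)])
```

Because only viewing is recorded, a document that is re-delivered before it was ever presented is simply treated as new. What to do with the duplicates (drop them, mark them, or show them anyway, as with a map) remains the application's decision.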

There are particular problems if the retrieval system is connected to the application via an expensive and/or low-bandwidth transmission line. Clearly, transmission of unwanted duplicate documents should be avoided. The solution is to have an application-oriented filter, which takes a batch of documents delivered by the retrieval system and eliminates unwanted documents before they are sent down the transmission line. (Arguably it is a matter of viewpoint whether this filter is regarded as an extra module of the retrieval system or as a part of the application that happens to be at the opposite end of the transmission line from the rest of the application. Our feeling is, however, that it needs to be tied to an application rather than being part of a general retrieval system.)
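
The filter can be sketched as follows, under the assumption that the application supplies its own policy on which duplicates it still wants. The Document type, the function filter_for_transmission and the wants_duplicate callback are all hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str   # unique ID used to detect duplicates
    score: float  # ranking score assigned by the retrieval system
    body: str


def filter_for_transmission(batch, already_sent, wants_duplicate):
    """Run at the retrieval end of the transmission line.

    batch           -- documents delivered by the retrieval system
    already_sent    -- set of IDs sent earlier (updated in place)
    wants_duplicate -- application-supplied policy: True if a document
                       should be sent even though it was sent before
    """
    to_send = []
    for doc in batch:
        if doc.doc_id in already_sent and not wants_duplicate(doc):
            continue  # unwanted duplicate: do not transmit it
        to_send.append(doc)
        already_sent.add(doc.doc_id)
    return to_send


# Assumed policy for this example: only the map is wanted repeatedly.
def policy(doc):
    return doc.doc_id == "map"


sent = set()
first = filter_for_transmission(
    [Document("guide", 0.90, "..."), Document("map", 0.80, "...")], sent, policy)
second = filter_for_transmission(
    [Document("guide", 0.91, "..."), Document("map", 0.79, "...")], sent, policy)
```

Here the policy is passed in by the application, which reflects the view that the filter belongs to the application even though it runs next to the retrieval system.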
