Research issues in context-aware retrieval: retrieval duplication
Peter Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk
ABSTRACT
In a context-aware retrieval (CAR) application, if the user does one retrieval, changes their context a little,
and then does a second retrieval, it is likely that the second retrieval
will deliver much the same set of documents as the first, i.e. there will be
many duplicates between the two sets (though the ranking scores
of two duplicate documents may differ slightly, e.g. when the
user's context has moved slightly closer to the document).
If a document is of passing interest the user does not want to see it
many times: hence there is often a need to eliminate duplicates.
This document discusses the issues involved.
The same problem can arise in ordinary IR, but not in such a severe way,
since in ordinary IR it is unlikely that each query will be a small perturbation
of its predecessor.
Introduction
In context-aware retrieval the user is continually retrieving sets of
documents, and, assuming the context is changing slowly, it is likely
that one set of documents will be much the same as its predecessor.
There is therefore a potential duplication problem: the
same document may be presented to the user many times.
In order to throw light on this issue we first look at the general
case of document delivery and presentation.
Typically there are three components:
-
A delivery system.
This may deliver documents singly or in batches; in the latter case each
document may have an associated ranking.
In some delivery systems, e.g. mail or information filtering, duplicates are
rare or impossible.
In others, such as information retrieval systems, duplicates are common --
though most systems avoid duplicates within the same batch.
-
A presentation system that shows documents to the end-user.
Typically this works at several levels: e.g. (1) brief title;
(2) full title + further info; (3) full document.
Thus the presentation system provides a ramping interface.
-
A storage system that keeps some or all of the delivered documents
so that the user can see them again at a later date.
Sometimes the storage system is automatic, as with most mailers,
and sometimes it relies entirely on the user saving documents in
directories of their own creation.
General points about duplication
We now look at some general characteristics of duplication.
-
Duplication may have an associated "time-out", which may simply be the end of the current session
or a fixed period such as a month.
If a document is re-delivered after its time-out, then it is considered to
be a new document.
(An alternative, more sophisticated, approach is to reduce a document's
ranking according to how recently the user has previously seen it.)
-
A re-delivered document may sometimes be the same as a previous one, but with
a slightly changed content.
For simplicity we will not consider this to be a duplicate but will assume a changed
document is a new document.
-
If a document is a duplicate, but its predecessor has not yet been
presented to the user, then it is often convenient to delete the record
of the previous document and treat the duplicate as if it were a new document (perhaps
with a slightly different score from its predecessor).
-
Some documents are by their nature "read-once" whereas others are more for
reference, i.e. regular or continuous reading.
A map is an example of the latter.
For the user, duplication of read-once documents is the worst case.
For reference documents there are cases where duplication is no problem at all.
For example, assume that the presentation system in a context-aware application
always shows a map relating to the user's current location.
Here, if the user goes back to a previously-visited location, and a
previously-seen map is retrieved and presented, then this is quite natural.
(In a retrieval system, relevance feedback can help establish the user's
view of a document's role, and whether it is used for regular reference.)
-
To extend the previous point,
in some applications duplication is not an issue at all: for example the application
may say "deliver, as a batch of documents, all documents with property XXX" with the purpose
of using the whole batch to form a cache.
It is then irrelevant whether some of the documents occurred in a previous
batch.
-
The smaller the screen space available for presentation to the user, the more
important it is to eliminate duplication: even a title may occupy 20% of
a small screen.
Similarly an intrusive interface, such as one that beeps continuously on
delivery of a document, has no place for delivery of unwanted documents.
-
Users are already familiar with interfaces that indicate whether documents
have been viewed before: examples are mailers and web browsers, which
typically shade links in different ways according to whether the user has
previously visited the destination.
Such interfaces can be extended to mark newly-delivered documents
that are duplicates of previous ones.
-
To detect duplicates it helps if each document has a unique ID.
-
Retrieval and other information delivery systems normally eliminate
duplicates in their basic data.
Thus a document collection for an IR system will not contain the same
document twice [[but there can be problems with copied documents, e.g. a newsflash
from Reuters may be used by several different information providers, and a
document collection that covers several such providers might then
include the same original Reuters document several times -- I doubt, however,
that this is a widespread problem]], and an information filtering (IF) system will
try to ensure that its document stream does not contain the same
document twice.
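The more sophisticated alternative to a time-out mentioned earlier, reducing a document's ranking according to how recently the user has seen it, can be illustrated with a minimal sketch. The exponential form, the function name `recency_adjusted_score` and the `half_life` parameter are all assumptions made for illustration; this note does not prescribe any particular formula.

```python
def recency_adjusted_score(score, seconds_since_seen, half_life=3600.0):
    """Reduce a ranking score for a document the user has seen recently.

    seconds_since_seen is None for a document the user has never seen.
    The penalty is total immediately after viewing and halves every
    half_life seconds, so the score recovers towards its original value.
    (Hypothetical formula, chosen only for illustration.)
    """
    if seconds_since_seen is None:
        return score  # never seen: no penalty
    penalty = 0.5 ** (seconds_since_seen / half_life)
    return score * (1.0 - penalty)
```

With this form a document viewed a moment ago scores close to zero, while one viewed long ago is ranked almost as if it were new, so a hard time-out is not needed.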
Context-aware retrieval
As we have said, duplication is likely to be a severe problem in
context-aware retrieval.
Indeed if a user's context is changing slowly it may be that a retrieval
yields exactly the same batch of documents as its predecessor, albeit
perhaps with slightly different scores.
We believe that this is generally a problem that needs to be tackled
by the application rather than the retrieval system, since only the
application has any idea of whether duplicates are wanted (see the
above examples of the map and the documents for a cache).
To tackle the problem, the application needs a storage system to
record previously-seen documents: at the very least this should contain
a unique ID to identify a document and the time at which it was last viewed.
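A minimal sketch of such a storage system is given below, assuming that document IDs are hashable values and that the time-out is a fixed number of seconds. The class name `SeenStore` and its methods are hypothetical, introduced only for illustration.

```python
import time


class SeenStore:
    """Records, per unique document ID, when the document was last viewed."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self._last_viewed = {}  # doc_id -> time of last viewing

    def mark_viewed(self, doc_id, now=None):
        """Record that the user has just viewed doc_id."""
        self._last_viewed[doc_id] = time.time() if now is None else now

    def is_duplicate(self, doc_id, now=None):
        """True if doc_id was viewed within the time-out period."""
        now = time.time() if now is None else now
        last = self._last_viewed.get(doc_id)
        return last is not None and (now - last) < self.timeout_seconds
```

A document re-delivered after its time-out is reported as not being a duplicate, so it is treated as a new document. The optional `now` argument exists only to make the behaviour easy to test.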
There are particular problems if the retrieval system is connected to the
application via an expensive and/or low-bandwidth transmission line.
Clearly transmission of unwanted duplicate documents should be avoided.
The solution is to have an application-oriented filter, which takes
a batch of documents delivered by the retrieval system, and eliminates
unwanted documents before they are sent down the transmission line.
(Arguably it is a matter of viewpoint whether this filter is regarded as an
extra module of the retrieval system or as a part of the application
that happens to be at the opposite end of the transmission line from the
rest of the application.
Our feeling is, however, that it needs to be tied to an application
rather than being part of a general retrieval system.)
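The application-oriented filter can be sketched as follows. This is only an illustration under stated assumptions: a batch is a list of (ID, score) pairs, the record of viewing times is a plain dictionary, and the function name `filter_batch` and the parameter `always_send` are hypothetical. The `always_send` set stands for the application's knowledge of where duplicates are wanted, e.g. a map.

```python
def filter_batch(batch, last_viewed, now, timeout_seconds,
                 always_send=frozenset()):
    """Return the (doc_id, score) pairs worth sending to the application.

    batch:        list of (doc_id, score) pairs from the retrieval system.
    last_viewed:  dict mapping doc_id to the time it was last viewed.
    always_send:  IDs for which the application wants duplicates
                  (e.g. reference documents such as a map).
    """
    kept = []
    for doc_id, score in batch:
        seen_at = last_viewed.get(doc_id)
        # New, or seen so long ago that the time-out has passed.
        expired = seen_at is None or (now - seen_at) >= timeout_seconds
        if doc_id in always_send or expired:
            kept.append((doc_id, score))
    return kept
```

Because the filter runs before transmission, only the surviving pairs cross the expensive or low-bandwidth line; the order and scores of the retrieval system are left unchanged.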