Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
ABSTRACT
TODO
Our research interest is in Context-Aware Retrieval (CAR): retrieving information according to the user's context. This context consists of such fields as the user's location, the temperature and other aspects of the weather, time-of-day, computation state, user preferences, search terms, behaviour of peers, etc. The context may be partly set via sensors, partly set directly by the user, and partly synthesized by the application on the basis of information available locally or via the web.
In CAR the user's current context is matched against each document in a document collection. For example, relating this to conventional Information Retrieval technology, the user's context may be turned into a query. Basic CAR systems use Boolean retrieval, but we believe that systems for serious production use need best-match retrieval: instead of simply matching or not matching, each document is given a score. The higher the score, the better the match. Documents delivered to the user are typically either the N documents with the highest score, or all the documents whose score beats some preset threshold.
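The two delivery policies just described can be sketched as follows. This is a minimal illustration, not the Context Matcher's code: the function names `top_n` and `above_threshold` are hypothetical, scores are assumed to be already computed and held in a dictionary keyed by document identifier, and "beats" is read as strictly greater than the threshold.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]


def _ranked(scores: Dict[str, float]) -> Scored:
    # Sort documents from best score to worst.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_n(scores: Dict[str, float], n: int) -> Scored:
    # Policy 1: deliver the n documents with the highest scores.
    return _ranked(scores)[:n]


def above_threshold(scores: Dict[str, float], threshold: float) -> Scored:
    # Policy 2: deliver every document whose score beats the threshold
    # (assumption: strictly greater than).
    return [(doc, s) for doc, s in _ranked(scores) if s > threshold]


scores = {"castle": 0.9, "museum": 0.4, "quay": 0.7}
best_two = top_n(scores, 2)
good_enough = above_threshold(scores, 0.5)
```

With these example scores both policies happen to deliver the same two documents; in general the first fixes the number of results and the second fixes their minimum quality.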
We have implemented a prototype CAR system called the Context Matcher. The Context Matcher uses best-match retrieval. It also allows each field to be either active (involved in the matching process) or inactive (not involved). For example at a certain time, the user's location and preferences may be active, whereas the user's current body temperature may be inactive. Generally the Context Matcher has been designed to allow experimentation, e.g. there are facilities for plug-ins to change its behaviour.
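The effect of active and inactive fields can be illustrated with a small sketch. It rests on assumptions that are not taken from the Context Matcher itself: contexts are plain dictionaries, each field has its own scoring function returning a value between 0 and 1, and the overall score is the mean of the scores of the active fields. All names here (`match_score`, `exact`) are hypothetical.

```python
from typing import Callable, Dict, Set

FieldScorer = Callable[[object, object], float]


def exact(user_value: object, doc_value: object) -> float:
    # Simplest possible field scorer: 1.0 on equality, otherwise 0.0.
    return 1.0 if user_value == doc_value else 0.0


def match_score(user_ctx: Dict[str, object],
                doc_ctx: Dict[str, object],
                active: Set[str],
                scorers: Dict[str, FieldScorer]) -> float:
    field_scores = []
    for name in active:
        # Assumption: a field missing on either side is skipped.
        if name not in user_ctx or name not in doc_ctx:
            continue
        field_scores.append(scorers[name](user_ctx[name], doc_ctx[name]))
    if not field_scores:
        return 0.0
    # Assumption: active fields are combined by a simple mean.
    return sum(field_scores) / len(field_scores)


user = {"location": (0, 0), "body_temperature": 37.0}
doc = {"location": (0, 0), "body_temperature": 20.0}
scorers = {"location": exact, "body_temperature": exact}

location_only = match_score(user, doc, {"location"}, scorers)
both_fields = match_score(user, doc, {"location", "body_temperature"}, scorers)
```

Making body temperature inactive removes its mismatch from the calculation, so `location_only` is a perfect score while `both_fields` is not.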
Each document in the collection must have an associated context against which the user's context can be matched. These contextual fields (a) may be explicit -- e.g. they may be stored in metadata associated with a document -- or (b) may be embedded in the document content (in this case intelligence may be needed to extract some of the fields, e.g. to recognize that a reference to Exeter means that the document should be matched against the location of Exeter). We will here assume case (a), and will also assume that location is represented in a geometric form rather than a symbolic form, i.e. as co-ordinates rather than as place names [Lee02]. Thus a document about a tourist site might have an associated location field and a time field (its value being a time period representing opening hours). (Case (a) would not apply to a field representing search terms, which need to be sought in the content of a document; in our experiments, however, the search term field was not set to be active.)
One aspect of our research involves guessing future contexts of the user. In some situations forecasting may be impossible, but in situations where the context is changing gradually and continuously it can be very effective. Forecasting may be done, for example, by:
Context-aware retrieval is most frequently used in mobile applications, with location being a key field of the user's context. Tourist applications are a popular example, and we will use that example in this paper. Often the user displays information on a small hand-held device that is either permanently or sporadically connected to a server. Context-aware applications are especially challenging to implement because:
The points specially relevant to this paper are (b) and (c).
One purpose of trying to guess future contexts is to anticipate the user's future retrieval needs, and to perform retrievals in advance of the need. Assuming the guess is correct, response to retrieval requests will then be very fast, since the necessary retrieval will have been done in advance. The other purpose -- the one that is most relevant to this paper -- is to help build context-aware caches. A context-aware cache works as follows:
This paper reports some experiments on cache building and subsequent retrieval from the cache. The paper is not about forecasting, and our assumptions are simple: retrieval is based on location alone, and the forecast is that the user will move in a straight line at constant speed or, to be exact, will make retrievals after every M metres of progress, where M is a constant. (The current context actually contained many other fields besides location, but these were either set to be inactive in the matching or given values that always yielded a perfect score.)
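The straight-line forecast can be made concrete with a short sketch that lists the locations at which future retrievals are expected. It assumes planar co-ordinates measured in metres and a heading measured anticlockwise from the x-axis; the function name `forecast_points` and its parameters are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def forecast_points(start: Point, heading_deg: float,
                    step_m: float, count: int) -> List[Point]:
    # Unit vector in the direction of travel.
    theta = math.radians(heading_deg)
    dx, dy = math.cos(theta), math.sin(theta)
    # One forecast location after every step_m metres of progress.
    return [(start[0] + k * step_m * dx, start[1] + k * step_m * dy)
            for k in range(1, count + 1)]


# Three forecast retrieval points, 50 metres apart, heading along the x-axis.
points = forecast_points((0.0, 0.0), 0.0, 50.0, 3)
```

Here `step_m` plays the role of the constant M, and the resulting points are the forecast locations against which later retrievals would be made.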
When a cache is built a set of documents to form the cache is retrieved; in our experiments each entire document, including its metadata, is copied to the cache. Thus the cache behaves identically to the original document collection as far as retrieval is concerned.
In some applications it may be specified that the cache contains a fixed number of documents, e.g. the 100 documents that best match the supercontext. (If the documents are of variable size, the total size of the cache will, of course, still be variable.) In other applications the cache will contain a variable number of documents, e.g. all those documents that get a matching score of at least 60% against the supercontext. We cover both approaches in this paper. Obviously in practical applications the cache may need to be reduced from its specified size, e.g. when the documents it contains turn out to be huge. We will, however, not worry about such pruning here.
In normal Information Retrieval, retrieval takes place not from the original source documents, but from a set of surrogate structures, such as inverted files, that represent the original documents. This improves retrieval speed by orders of magnitude, and that is why the speed of web search engines is so impressive. Building the surrogates may, however, take considerable time, and may only be done, say, once a week.
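For readers unfamiliar with inverted files, the following is a minimal sketch of the idea for text terms: a mapping from each term to the set of documents containing it, so that a lookup avoids scanning the documents themselves. It is a textbook illustration under simplifying assumptions (whitespace tokenization, lower-casing, no ranking), and the names `build_inverted_index` and `lookup` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Set[str]]:
    # Built once, in advance: map each term to the documents containing it.
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def lookup(index: Dict[str, Set[str]], term: str) -> Set[str]:
    # At retrieval time a term is resolved with a single dictionary access.
    return index.get(term.lower(), set())


docs = {"d1": "Exeter cathedral", "d2": "Exeter quay"}
index = build_inverted_index(docs)
exeter_docs = lookup(index, "Exeter")
```

The cost of building the index grows with the size of the collection, which is why such surrogates are typically rebuilt only occasionally.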
We did not cover surrogates in our experiments: retrieval was direct from the source documents. An unfortunate practical consequence was that retrieval was very slow and, since we performed a large number of experiments, we needed to keep our document collections small -- a true Information Retrieval researcher would regard them as pitifully small. (TODO Does it scale?)
If we had covered surrogates, there would have been three possible approaches to using them in the cache:
Obviously, since we did not have surrogates in the original document collection, we used the last approach.
One of our research interests is matching algorithms for numeric fields. Assume we are matching two numeric values N and M. The algorithm built into the Context Matcher calculates a score that decays linearly with the difference between N and M. For a multi-dimensional field, such as location, the decay is based on the distance between the locations. As part of our experimentation we tried replacing the linear algorithm with an N-squared algorithm, i.e. one in which scores decayed much more severely as distances apart increased.
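The contrast between the two decay rules can be sketched as below. The exact formulas used by the Context Matcher are not given here, so the forms shown are assumptions: the score starts at 1 for identical values, falls off with the Euclidean distance d (linearly in one case, with d squared in the other), and is clamped at 0. The `rate` parameter and both function names are hypothetical.

```python
import math
from typing import Sequence


def linear_score(a: Sequence[float], b: Sequence[float], rate: float) -> float:
    # Assumed linear rule: score falls in proportion to the distance.
    d = math.dist(a, b)
    return max(0.0, 1.0 - rate * d)


def squared_score(a: Sequence[float], b: Sequence[float], rate: float) -> float:
    # Assumed squared rule: score falls in proportion to the distance squared.
    d = math.dist(a, b)
    return max(0.0, 1.0 - rate * d * d)


# Two locations 5 units apart (a 3-4-5 triangle), with the same rate.
here, there = (0.0, 0.0), (3.0, 4.0)
linear = linear_score(here, there, 0.1)
squared = squared_score(here, there, 0.1)
```

In this example the linear rule still gives a score of one half, whereas the squared rule has already dropped to zero. A one-dimensional field can be scored the same way by passing single-element tuples such as `(N,)` and `(M,)`.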