Some experiments in context-aware caching

Peter Brown and Gareth Jones

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk

ABSTRACT

TODO

Background

Our research interest is in Context-Aware Retrieval (CAR): retrieving information according to the user's context. This context consists of such fields as the user's location, the temperature and other aspects of the weather, time-of-day, computation state, user preferences, search terms, behaviour of peers, etc. The context may be partly set via sensors, partly set directly by the user, and partly synthesized by the application on the basis of information available locally or via the web.

In CAR the user's current context is matched against each document in a document collection. For example, relating this to conventional Information Retrieval technology, the user's context may be turned into a query. Basic CAR systems use Boolean retrieval, but we believe that systems for serious production use need best-match retrieval: instead of a document simply matching or not matching, it is given a score. The higher the score, the better the match. Documents delivered to the user are typically either the N documents with the highest scores, or all the documents whose score beats some preset threshold.
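The two delivery policies can be illustrated with a minimal sketch in Python. This is not the Context Matcher's code: the function names and the dictionary of per-document scores are assumptions made for illustration, and the threshold test is taken to be "at least" the threshold.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def rank(scores: Dict[str, float]) -> Ranked:
    """Order documents by descending score (best match first)."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_n(scores: Dict[str, float], n: int) -> Ranked:
    """Deliver the n documents with the highest scores."""
    return rank(scores)[:n]


def above_threshold(scores: Dict[str, float], threshold: float) -> Ranked:
    """Deliver every document whose score is at least the threshold."""
    return [(doc, s) for doc, s in rank(scores) if s >= threshold]


# Hypothetical scores for three documents.
scores = {"castle": 0.9, "cafe": 0.3, "museum": 0.6}
best_two = top_n(scores, 2)
good_enough = above_threshold(scores, 0.5)
```

With these made-up scores both policies deliver the castle and the museum, but in general the first fixes the number of documents delivered and the second fixes their minimum quality.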

We have implemented a prototype CAR system called the Context Matcher. The Context Matcher uses best-match retrieval. It also allows each field to be either active (involved in the matching process) or inactive (not involved). For example at a certain time, the user's location and preferences may be active, whereas the user's current body temperature may be inactive. Generally the Context Matcher has been designed to allow experimentation, e.g. there are facilities for plug-ins to change its behaviour.
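A rough sketch of how active and inactive fields might enter the matching is given below, in Python. It rests on assumptions that the description above does not settle: the overall score is taken to be the mean of the per-field scores, fields missing from either context are skipped, and the names match_score and scorers are hypothetical.

```python
from typing import Callable, Dict, Set

# A per-field scorer maps (user value, document value) to a score in [0, 1].
FieldScorer = Callable[[object, object], float]


def match_score(
    user_ctx: Dict[str, object],
    doc_ctx: Dict[str, object],
    scorers: Dict[str, FieldScorer],
    active: Set[str],
) -> float:
    """Average the per-field scores over active fields only.

    Assumption: fields absent from either context, or lacking a scorer,
    are ignored; a document with no usable field scores 0.
    """
    used = [f for f in active if f in user_ctx and f in doc_ctx and f in scorers]
    if not used:
        return 0.0
    return sum(scorers[f](user_ctx[f], doc_ctx[f]) for f in used) / len(used)


def exact(a: object, b: object) -> float:
    """Toy scorer: 1 for equal values, 0 otherwise."""
    return 1.0 if a == b else 0.0


scorers = {"location": exact, "body_temperature": exact}
user = {"location": "Exeter", "body_temperature": 37.0}
doc = {"location": "Exeter", "body_temperature": 40.0}

location_only = match_score(user, doc, scorers, {"location"})
both_fields = match_score(user, doc, scorers, {"location", "body_temperature"})
```

Making body temperature inactive removes its mismatch from the score, so the same document matches perfectly on location alone.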

Each document in the collection must have an associated context against which the user's context can be matched. These contextual fields (a) may be explicit -- e.g. they may be stored in metadata associated with a document -- or (b) they may be embedded in the document content (in this case intelligence may be needed to extract some of the fields, e.g. that a reference to Exeter means that the document should be matched against the location of Exeter). We will here assume case (a), and will also assume that location is represented in a geometric form rather than a symbolic form, i.e. as co-ordinates rather than as place names [Lee02]. Thus a document about a tourist site might have an associated location field and a time field (its value being a time period representing opening hours). (Case (a) would not apply to a field representing search terms, which need to be sought in the content of a document; in our experiments, however, the search term field was not set to be active.)

One aspect of our research involves guessing future contexts of the user. In some situations forecasting may be impossible, but in situations where the context is changing gradually and continuously it can be very effective. Forecasting may be done, for example, by:

(a): looking at past contexts in order to analyze past behaviour, and making future predictions on the basis of this. In the simplest case past locations for a user may show him moving in an approximately straight line at a constant speed: extrapolation can then be used to guess future locations. As another example, analysis of the past may show that, at around noon, the user's preferences are dominated by food.
(b): the future may be gleaned from the user's diary, e.g. from an entry that says they have a meeting in two hours' time at a certain location with certain people.
(c): some aspects of the future may be guessed from the content of web documents such as weather forecasts (or horoscopes?).
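The simplest case under (a), straight-line extrapolation of location, might be sketched as follows in Python. The sketch assumes that past positions were recorded at equal time intervals and are given as (x, y) co-ordinates in metres; the function name extrapolate is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in metres; an assumed representation


def extrapolate(past: List[Point], steps: int) -> List[Point]:
    """Guess future positions by repeating the average past displacement.

    Assumes the past fixes are equally spaced in time, so that the mean
    displacement per fix stands in for a constant velocity.
    """
    if len(past) < 2:
        raise ValueError("need at least two past positions")
    intervals = len(past) - 1
    dx = (past[-1][0] - past[0][0]) / intervals
    dy = (past[-1][1] - past[0][1]) / intervals
    x, y = past[-1]
    return [(x + dx * k, y + dy * k) for k in range(1, steps + 1)]


# A user who has moved 10 m east and 5 m north between successive fixes.
future = extrapolate([(0, 0), (10, 5), (20, 10)], steps=2)
```

Here the two forecast positions continue the same line: (30, 15) and (40, 20).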

Context-aware retrieval is most frequently used in mobile applications, with location being a key field of the user's context. Tourist applications are a popular example, and we will use that example in this paper. Often the user displays information on a small hand-held device that is either permanently or sporadically connected to a server. Context-aware applications are especially challenging to implement because:

(a): they often require near-continuous retrieval, every time any aspect of the user's context has changed.
(b): retrieval must be fast: in a tourist application it is no good telling a user about a site they have just passed.
(c): retrieval may need to be ubiquitously available, in particular when there is disconnected operation.
(d): retrieved documents must really be relevant to the user.
(e): implementing a good HCI is extremely difficult.

The points specially relevant to this paper are (b) and (c).

One purpose of trying to guess future contexts is to anticipate the user's future retrieval needs, and to perform retrievals in advance of the need. Assuming the guess is correct, response to retrieval requests will then be very fast, since the necessary retrieval will have been done in advance. The other purpose -- the one that is most relevant to this paper -- is to help build context-aware caches. A context-aware cache works as follows:

Since time is a fundamental part of a forecast, it is treated as a special field. Almost always time is part of a user's context. Time could, for example, be used in a tourist application to match opening hours of attractions. Given the importance of time, the first stage in building a cache is to choose a time period for which, ideally, the cache should be useful.
The user's future contexts are estimated for this time period. A union of all these contexts is then taken: we call it the supercontext. We assume fields of a context can be ranges or sets rather than single values. Thus a temperature field may have as value the range 15 to 20, and a location field may have as value a circle or a rectangle. Typically most fields of a supercontext will be ranges or sets.
A retrieval is made, using the supercontext as the current context. This retrieval may have a fairly low threshold, so that documents with contexts outside, but close to, the supercontext will be included. The set of retrieved documents forms a cache.
The cache is then used in place of the original document collection. Retrieval from the cache will typically be much faster, since the cache is often many orders of magnitude smaller than the original document collection.
The cache is updated if the user subsequently strays out of the supercontext. (In disconnected operation this may not be possible, but at least the user can be informed that retrieval performance is likely to degrade -- because documents that would be retrieved if the original document collection were used are not in the cache.) When the cache needs updating this might be done incrementally or by complete replacement. Assuming that time is part of the current context -- and thus the time field of the supercontext contains the time period covered by the cache -- the cache will naturally expire when this time period expires, since the user's time field has strayed out of the supercontext's time field.
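The cache-building steps above can be sketched in Python for the location field alone. Everything named here is an assumption made for illustration: the supercontext is taken to be the bounding rectangle of the forecast locations, a document's score decays linearly with its distance from that rectangle (reaching zero at a distance of falloff metres), and the time field is omitted for brevity.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) in metres; an assumed representation


@dataclass(frozen=True)
class Rect:
    """Location field of a supercontext, held as a rectangle."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, p: Point) -> bool:
        return self.xmin <= p[0] <= self.xmax and self.ymin <= p[1] <= self.ymax

    def distance_to(self, p: Point) -> float:
        """Distance from p to the rectangle (0 if p lies inside it)."""
        dx = max(self.xmin - p[0], 0.0, p[0] - self.xmax)
        dy = max(self.ymin - p[1], 0.0, p[1] - self.ymax)
        return math.hypot(dx, dy)


def supercontext(forecast: List[Point]) -> Rect:
    """Union of the forecast locations, taken as their bounding rectangle."""
    xs = [p[0] for p in forecast]
    ys = [p[1] for p in forecast]
    return Rect(min(xs), min(ys), max(xs), max(ys))


def build_cache(docs: Dict[str, Point], sc: Rect,
                threshold: float, falloff: float) -> Dict[str, Point]:
    """Keep documents scoring at least threshold against the supercontext."""
    cache = {}
    for doc_id, location in docs.items():
        score = max(0.0, 1.0 - sc.distance_to(location) / falloff)
        if score >= threshold:
            cache[doc_id] = location
    return cache


def needs_update(sc: Rect, current: Point) -> bool:
    """True once the user has strayed out of the supercontext."""
    return not sc.contains(current)


# Hypothetical data: a user forecast to walk 200 m due east.
sc = supercontext([(0, 0), (100, 0), (200, 0)])
docs = {"castle": (100, 10), "quay": (250, 0), "airport": (5000, 5000)}
cache = build_cache(docs, sc, threshold=0.8, falloff=500.0)
```

The quay lies 50 m outside the supercontext but is still cached, which is the effect of the fairly low threshold; the airport is not. needs_update(sc, (300, 0)) then signals that the user has left the supercontext and the cache should be refreshed.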

This paper reports some experiments on cache building and subsequent retrieval from the cache. The paper is not about forecasting, and our assumptions are simple: retrieval is based on location alone, and the forecast is that the user will move in a straight line at constant speed, or, to be exact, that he will make retrievals after every M metres of progress, where M is a constant. (The current context actually contained many other fields besides location, but these were either set to be inactive in the matching or given values that always yielded a perfect score.)

todo somewhere; Cache size and content

When a cache is built a set of documents to form the cache is retrieved; in our experiments each entire document, including its metadata, is copied to the cache. Thus the cache behaves identically to the original document collection as far as retrieval is concerned.

In some applications it may be specified that the cache contains a fixed number of documents, e.g. the 100 documents that best match the supercontext. (If the documents are of variable size, the total size of the cache will, of course, still be variable.) In other applications the cache will contain a variable number of documents, e.g. all those documents that get a matching score of at least 60% against the supercontext. We cover both approaches in this paper. Obviously in practical applications the cache may need to be reduced from its specified size, e.g. when the documents it contains turn out to be huge. We will, however, not worry about such pruning here.

Surrogates; todo somewhere

In normal Information Retrieval, retrieval takes place not from the original source documents, but from a set of surrogate structures, such as inverted files, that represent the original documents. This improves retrieval speed by orders of magnitude, and that is why the speed of web search engines is so impressive. Building the surrogates may, however, take considerable time, and may only be done, say, once a week.

We did not cover surrogates in our experiments: retrieval was direct from the source documents. An unfortunate practical consequence of this was that retrieval was very slow, and, since we performed a large number of experiments, we needed to keep our document collections small -- a true Information Retrieval researcher would regard them as pitifully small. (TODO Does it scale?)

Had we covered surrogates, there would have been three possible approaches to using them in the cache:

each time a cache is created, build surrogates for it.
for the cache, take the relevant subset of the original surrogates used in the document collection. For this to work the original surrogates must possess appropriate granularity; this may be a challenging research problem.
not use surrogates for the cache, on the basis that the cache will be so small that surrogates are unnecessary.

Obviously, since we did not have surrogates in the original document collection, we used the last approach.

Todo somewhere Matching numeric fields

One of our research interests is matching algorithms for numeric fields. Assume we are matching two numeric values N and M. The algorithm built into the Context Matcher calculates a score that decays linearly according to the difference between N and M. For a multi-dimensional field, such as location, the decay is based on the distance between the locations. As part of our experimentation we tried replacing the linear algorithm with an N-squared algorithm, i.e. scores decayed much more severely as distances apart increased.
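The two kinds of decay might be sketched as below, in Python. The exact formulas and normalization used by the Context Matcher are not given here, so this is only one possible reading: both scores are assumed to lie between 0 and 1 and to reach zero at a hypothetical cutoff distance, with the squared variant applying a penalty that grows with the square of the difference.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def location_distance(a: Point, b: Point) -> float:
    """Straight-line distance between two locations given as co-ordinates."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def linear_score(difference: float, cutoff: float) -> float:
    """Score falls in proportion to the difference; zero at the cutoff."""
    return max(0.0, 1.0 - difference / cutoff)


def squared_score(difference: float, cutoff: float) -> float:
    """Penalty grows with the square of the difference (assumed form)."""
    return max(0.0, 1.0 - (difference / cutoff) ** 2)


d = location_distance((0, 0), (3, 4))
half_way_linear = linear_score(50.0, 100.0)
half_way_squared = squared_score(50.0, 100.0)
```

Under this particular normalization the squared score falls slowly for small differences and increasingly steeply as the difference approaches the cutoff; a different scaling of the penalty would change where the two curves cross.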

References

Newman, W.M., Eldridge, M.A. and Lamming, M.G., `Pepys: Generating Autobiographies by Automatic Tracking', Proc. ECSCW `91, Amsterdam, September 1991.
Rhodes, B.J., `The Wearable Remembrance Agent: A system for augmented memory', Personal Technologies, 1, pp. 218-224, 1997.
Brown, P.J. and Jones, G.J.F., `Context-aware retrieval: exploring a new environment for information retrieval and information filtering', Personal and Ubiquitous Computing, 5, 4, pp. 253-263, 2001.
Brown, P.J. and Jones, G.J.F., `Exploiting contextual change in context-aware retrieval', Proceedings of the 17th ACM Symposium on Applied Computing (SAC 2002), Madrid, ACM Press, New York, pp. 650-656, 2002.
Das, R.E. and Sen, S.K., `Adaptive location prediction based on a hierarchical network model in a cellular mobile environment', Computer Journal, 42, 6, pp. 474-486, 1999.
Lee, D.K., Xu, J. and Zheng, B., `Data management in location-dependent information services', IEEE Pervasive Computing, 1, 3, pp. 65-70, 2002.
Abowd, G.D. and Dey, A.K., `Towards a better understanding of context and context-awareness', panel statement in Gellersen, H.-W. (Ed.) Handheld and Ubiquitous Computing, Springer, pp. 304-5, 1999.