Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF,
UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
Gareth: I'm not too familiar with the literature on caching algorithms, but it's probably worth exploring these, or at least making clear reference to them, if
you have not done so already. Should we also consider issues related to something like paging within a cache, or is this moving too quickly at this point?
Time is a difficult parameter to use: it moves forward linearly, so it is possible to make predictions (probably application-specific) of when to refresh.
Other contexts, such as location (or at least location in one dimension, to keep things easy), will generally not change linearly, and so there needs to be a mechanism to
refresh the cache as needed. Perhaps we can relate this more closely to the virtual memory principle of "locality".
The idea behind a context-aware cache is explained in [1]. It is useful to compare context-aware caching with other types of cache. Two well-known examples are memory caches [5] and web-page caches [6]. We will call these addressed caches because information is accessed through an address (the memory address or the URL) -- and this address is stored in the cache against the information it contains. Context-aware caches, on the other hand, are retrieval caches. Both addressed caches and retrieval caches are used (a) to improve retrieval speed and/or (b) to cater for disconnected operation (or more generally to cater for cases where data transmission is cheap/fast at the time the cache is installed, and expensive/slow at times when the cache is used). A second similarity is that either sort of cache may apply to a single user or to a group of users. There are, however, several differences between addressed caches and retrieval caches:
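The distinction can be sketched in code. The following Python fragment is purely illustrative (both classes and their methods are our own hypothetical names, not part of any existing system): an addressed cache needs the exact address to score a hit, whereas a retrieval cache is simply a smaller collection that any query can be run against.

```python
class AddressedCache:
    """Entries are keyed by an exact address (memory address, URL, ...)."""
    def __init__(self):
        self._store = {}

    def put(self, address, document):
        self._store[address] = document

    def get(self, address):
        # A hit needs the exact address; anything else is a miss.
        return self._store.get(address)


class RetrievalCache:
    """Entries form a small document collection, searched like the original."""
    def __init__(self, documents):
        self._documents = documents   # a subset of the full collection

    def retrieve(self, score, threshold):
        # Any query can be answered: score each cached document and keep
        # those above the threshold, exactly as for the full collection.
        return [d for d in self._documents if score(d) >= threshold]
```

The point of the sketch is that the addressed cache can only answer "do I hold this address?", while the retrieval cache answers arbitrary queries, just over fewer documents.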
The conclusions from the above list of differences are that, if we are looking for literature references to throw light on context-aware retrieval, we can certainly find some relevant material in the literature relating to addressed caches, but we also need to look in the retrieval literature.
As an aside, we could go back on the assumption that a context-aware cache is treated like the original document collection, and instead try to take advantage of information gleaned in building the cache to devise fast ways of accessing the cache. For example, if our application centred on location we could build a cache that was indexed by location, although the original document collection was not. Alternatively we could explore indexing the cache by means of queries. If any of the query fields are numeric, it will be rare for exactly the same query to be issued twice (e.g. the location will normally be different from previous locations, if only by a few centimetres); however, if the cache is indexed by queries, we might find a query in the index that is close to the present one, and use this as a route into the cache. We then have a hybrid form of cache: an "addressed retrieval cache". However, all this is, we suggest, for the future. For the time being the cache is simply another, hopefully smaller, document collection. Gareth: Interesting points which made me think of something else. The cache should serve several purposes: make it better able to cope with disconnection, avoid overwhelming the main server with requests (aside: perhaps we should consider application of cache "push" -- following on from the Rutgers talk at SIGIR last year), and to give rapid access to information. On the last point, rapid access relies on appropriate data structures (e.g. an inverted file) which take time to build and are expensive to update. In view of this we will need to think very carefully about the physical data structure, update and searching of the cache, to make search fast and update cost-effective.
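As a sketch of the "addressed retrieval cache" idea, assuming numeric query fields that can be compared by Euclidean distance (the class and its tolerance parameter are our own hypothetical choices):

```python
import math

class QueryIndexedCache:
    """Cache indexed by past queries: a new query reuses the cached results
    of the nearest previous query, if one is close enough."""
    def __init__(self, tolerance):
        self.tolerance = tolerance     # how close a past query must be
        self._entries = []             # list of (query_vector, results)

    def store(self, query, results):
        self._entries.append((query, results))

    def lookup(self, query):
        # Find the nearest stored query (Euclidean distance over the
        # numeric fields); fall through to the collection if none is close.
        best, best_dist = None, math.inf
        for past_query, results in self._entries:
            dist = math.dist(query, past_query)
            if dist < best_dist:
                best, best_dist = results, dist
        return best if best_dist <= self.tolerance else None

cache = QueryIndexedCache(tolerance=5.0)
cache.store((100.0, 200.0), ["doc1", "doc2"])
# A query a few centimetres away counts as "the same" for caching purposes:
print(cache.lookup((100.1, 200.0)))   # -> ['doc1', 'doc2']
print(cache.lookup((500.0, 500.0)))   # -> None (too far; miss)
```

A linear scan suffices for a sketch; a real index over queries would need a spatial data structure.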
With web caching a key issue is consistency, and this has been widely studied [3]. The same issues can apply to context-aware caches when the content of the document collection is dynamic. We suspect that the issues are similar to those for web caches and will have similar solutions. Here we ignore them, and assume the document collection is static. With memory caches, at least for data, there is the issue that the application may write to memory, and this necessitates an update both to the memory and to the original source; such issues are unlikely to apply to retrieval caches: retrieval applications do not tend to allow end-users to change the information they have retrieved. Gareth: This is a real difference. We don't need to write to memory, and thus consistency for data sharing is not a problem, at least for a static document set; a problem may only arise if the document set is dynamic.
In general a cache can either be connected to the original document collection, or disconnected from it.
A connected cache offers extra possibilities on how the cache is built and how it is monitored. If we consider a retrieval cache, a disconnected cache will consist of entire documents taken from the document collection. For a connected cache, on the other hand, the cache might just contain the information about a document that is needed for retrieval, such as the context associated with it, and the full version of each document, when needed, can be taken from the original document collection. Gareth: Need to think about the reasons for having a particular cache. Probably want to limit requests to central server for individual documents.
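As a sketch of this (all interfaces here are hypothetical), a connected cache might hold only each document's associated context as a surrogate, match against these, and fetch the full document from the central server only for documents that actually score above the threshold:

```python
class ConnectedCache:
    """Connected retrieval cache holding only per-document surrogates."""
    def __init__(self, fetch_document):
        # fetch_document(doc_id) contacts the central server; it is assumed
        # to be available while we remain connected.
        self._fetch = fetch_document
        self._surrogates = {}   # doc_id -> context used for matching

    def add(self, doc_id, context):
        self._surrogates[doc_id] = context

    def retrieve(self, match, query, threshold):
        # Match against the lightweight surrogates; request full documents
        # from the server only for the actual hits, limiting server load.
        hits = [d for d, ctx in self._surrogates.items()
                if match(query, ctx) >= threshold]
        return {d: self._fetch(d) for d in hits}
```

This keeps the cache small and limits requests to the central server to one per retrieved document.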
An alternative, more radical, approach for a connected cache is still to use the original document collection, but to change its surrogate data structures so that:
Gareth: I don't think that this would scale to large numbers of users.
For example if a surrogate data structure is an index, we could cut out, say, 99% of the index entries, thus yielding a smaller but more limited index. This approach might be unlikely for a textual index, but if the index were, say, an index of locations, it might be attractive to build a sub-index that only covered locations near to the user. Whether such an approach can be called "caching" is an open question: it is essentially a different implementation technique to achieve the same purpose as caching.
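For instance, a location sub-index might be cut down as follows (a sketch; the index layout, mapping locations to document ids, is our own assumption):

```python
def build_sub_index(location_index, centre, radius):
    """Keep only index entries whose location is within `radius` of `centre`.

    `location_index` maps (x, y) co-ordinates to lists of document ids.
    """
    cx, cy = centre
    return {loc: docs for loc, docs in location_index.items()
            if (loc[0] - cx) ** 2 + (loc[1] - cy) ** 2 <= radius ** 2}

full_index = {(0, 0): ["a"], (3, 4): ["b"], (100, 100): ["c"]}
near = build_sub_index(full_index, centre=(0, 0), radius=10)
print(sorted(near))   # -> [(0, 0), (3, 4)]; the distant entry is cut out
```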
Another advantage of a connected retrieval cache is that its performance can be continually monitored, e.g. when the user issues a query that is applied to the cache, this is treated normally, but, as a background job, the same request is applied to the original document collection. When the background job has finished, its retrieved documents are compared with those retrieved from the cache. If the two differ to the extent that the cache is deemed to be failing, the cache can be replaced or incrementally updated. Otherwise the cache can continue to be used indefinitely. Gareth: This idea is interesting. I don't think that we would want to check all the time, but maybe an occasional check. This comparison might provide some useful input to a cache management algorithm and some parameter update using some sort of machine learning.
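A minimal sketch of the comparison step, assuming a simple overlap measure (the function name and the 80% default are our own illustrative choices, and a real cache-management algorithm might weight documents by score):

```python
def cache_still_adequate(cache_results, full_results, min_overlap=0.8):
    """Deem the cache failing when it misses too many of the documents the
    full collection would have retrieved for the same query."""
    if not full_results:
        return True
    overlap = len(set(cache_results) & set(full_results)) / len(set(full_results))
    return overlap >= min_overlap

print(cache_still_adequate(["a", "b", "c"], ["a", "b", "c", "d"]))        # -> False (3/4 = 0.75)
print(cache_still_adequate(["a", "b", "c", "d"], ["a", "b", "c", "d"]))   # -> True
```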
The above possibilities are interesting, but initially we shall confine ourselves to caching techniques that apply to disconnected operation as well as connected operation. We will therefore continue with our assumption that the cache is a subset of the documents in the original document collection; furthermore we will also assume, following our earlier comment that an "addressed retrieval cache" is something for the future, that the cache is accessed in the same way that the original document collection was.
When building a context-aware cache, we need to decide the set of contexts the cache is intended to cover. We call this the coverage of the cache. The coverage is often derived from a forecast centred round one or two contextual fields, the key parameters, that are vital to the application. For example:
Thus the coverage of the cache is the union of all the contexts that are likely to occur. If the fields are numeric, the union can be expressed as a range of values for each numeric field, e.g. the range of locations that can occur, the range of temperatures, etc. (Actually, taking the union on a field-by-field basis leads to a rather bigger coverage, since relationships between values of different fields are lost, but this probably does not matter.) If this is done, the coverage is then a single context, but one whose values are likely to have wide ranges. We call this the context-range approach. (A much more subtle approach uses probabilities, e.g. the probability that the user will visit a given location [4], rather than a range where all values are treated equally.)
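The field-by-field union can be sketched as follows (a minimal illustration, assuming each forecast context is a dict of numeric fields; note how the relation between Temperature and Time in the individual forecasts is lost):

```python
def coverage(forecast_contexts):
    """Take a list of forecast contexts (dicts mapping field name to a
    numeric value) and return one context-range: field -> (min, max)."""
    fields = forecast_contexts[0].keys()
    return {f: (min(c[f] for c in forecast_contexts),
                max(c[f] for c in forecast_contexts))
            for f in fields}

forecasts = [{"Temperature": 10, "Time": 0},
             {"Temperature": 16, "Time": 90},
             {"Temperature": 20, "Time": 180}]
print(coverage(forecasts))
# -> {'Temperature': (10, 20), 'Time': (0, 180)}
```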
All our techniques to build context-aware caches assume that a small change in the current context will lead to a correspondingly small change in the documents retrieved and their scores, i.e. that the retrieval space is in some sense continuous. Building a cache consists of the following sequence: (1) setting the key parameters for the cache; (2) making a forecast of the current contexts that form the coverage of the cache; (3) performing a retrieval, using the coverage as the query; (4) adjusting the scores of retrieved documents to factor in, e.g., experience of past retrievals -- thus documents that were often retrieved by similar users in the past might get their scores raised; (5) taking as the cache all the retrieved documents whose score exceeds some threshold.
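The five-step sequence above can be sketched as a pipeline. In the following Python fragment every function passed in (`forecast`, `retrieve`, `adjust_score`) is a hypothetical stand-in for an application-specific component, not an existing interface:

```python
def build_cache(current_context, forecast, retrieve, adjust_score, threshold):
    # (1)-(2): forecast the coverage of the cache from the current context
    # (the key parameters are assumed to be baked into `forecast`).
    cov = forecast(current_context)
    # (3): perform a retrieval, using the coverage as the query.
    scored = retrieve(cov)                        # list of (doc, score)
    # (4): adjust scores, e.g. raising documents often retrieved by
    # similar users in the past.
    scored = [(doc, adjust_score(doc, s)) for doc, s in scored]
    # (5): the cache is every document whose score exceeds the threshold.
    return [doc for doc, s in scored if s >= threshold]
```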
The crudest approach to building a cache is to forecast that the current context will not change much, and thus to use the current context at step (3). We might set a low threshold for the retrieval so that documents "far away" are still retrieved. A slight improvement on this is to assume that numeric values will change randomly from the current ones, and thus to set each as a range centred on the current point. This is an example of a context-range approach.
More sophisticated context-range approaches depend on a specific forecast of the value of each contextual field within the coverage of the cache. One approach, the one explained in [1], is to forecast, using past experience or information about the future, the union of values of each field of the current context within the coverage of the cache, e.g. that, given past data and the current weather forecast, that the temperature will be in the range T1 to T2.
An alternative approach is to forecast the current context at successive time intervals, every five minutes say, during the time coverage of the cache, and then to do a retrieval for each forecast. The cache is then the union of the documents retrieved for each forecast. Thus with a five-minute time interval and a time-based cache with a lifetime of three hours, there would be a sequence of 36 forecasts, and the cache would be built from the union of 36 retrievals (which might take a significant time to do). We call this the stepwise approach. With the stepwise approach there is increasing uncertainty at each step in the sequence. For example, if a forecast of a numeric value at the end of a five-minute interval was deemed to involve 10% uncertainty, then there is the analogy of a 10% compound interest rate. Generally the stepwise approach, since its forecasts are more specific, is a more high-risk, high-reward approach.
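The stepwise approach can be sketched as follows. The forecast and retrieval functions are hypothetical stand-ins, and the compound-interest analogy appears as a forecast half-width growing by 10% per step:

```python
def stepwise_cache(forecast_at_step, retrieve, n_steps):
    """One retrieval per forecast step; the cache is the union of results."""
    cache = set()
    for step in range(n_steps):
        cache |= set(retrieve(forecast_at_step(step)))
    return cache

# A three-hour lifetime with five-minute steps gives 36 forecasts and
# 36 retrievals:
n_steps = (3 * 60) // 5
print(n_steps)   # -> 36

# Compounding uncertainty: a forecast half-width growing at 10% per step
# has roughly tripled after one hour (12 steps).
base_width = 1.0
print(round(base_width * 1.1 ** 12, 2))   # -> 3.14
```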
The operation of forecasting and, for the context-range approach, taking a union, may be applied to numeric or to non-numeric data types. In the numeric cases the union can be expressed as a range, but in the non-numeric cases such as text and image the forecast is likely to be expressed as an ORing of individual values. To take a numeric example, the forecast for a Temperature field can be expressed as a range such as 10..20 (the notation "10..20" means the range from 10 to 20, inclusive, expressed in the syntax used by the Context Matcher). To take a non-numeric example, we assume that there is a User-interest field whose value acts as a textual query, e.g. "gothic architecture" or "recreation". There is no concept of a continuous range with textual fields such as this, but instead there may be sudden change between discrete values. Forecasting is, however, possible: the context diary (past history) may record the value of User-interest in contexts that are similar to the present one, and the forecast may then be an ORing together of the values derived from the context diary. Gareth: perhaps the profile could be predicted into the future from the diary. (Alternatively the forecast could be a sequence of discrete values tied to time, e.g. that the value will change from "gothic architecture" to "recreation" after one hour; in practice, however, it is unlikely that a forecast would attempt to be this accurate; it would be more realistic to forecast a union of discrete values, any of which might occur at any time.)
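A sketch of diary-based forecasting for a textual field (the diary layout and the similarity test are our own assumptions): the forecast is the set of values the field took in past contexts similar to the present one, i.e. an ORing of discrete values.

```python
def forecast_textual_field(diary, field, is_similar, current_context):
    """diary: list of past context dicts.  Returns the set of values the
    field took in diary entries similar to the current context; this set
    is the OR of discrete values used as the forecast."""
    return {entry[field] for entry in diary
            if field in entry and is_similar(entry, current_context)}

diary = [{"Location": (1, 1), "User-interest": "gothic architecture"},
         {"Location": (1, 2), "User-interest": "recreation"},
         {"Location": (9, 9), "User-interest": "shopping"}]
# A deliberately crude similarity test for illustration:
near = lambda e, c: abs(e["Location"][0] - c["Location"][0]) <= 2
print(sorted(forecast_textual_field(diary, "User-interest", near,
                                    {"Location": (1, 1)})))
# -> ['gothic architecture', 'recreation']
```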
Consider a particular numeric one-dimensional case, a Temperature field. Assume a time-based cache whose lifetime is 3 hours, and the current Temperature value is 10; two sample forecasts, which tie temperature to time, might be:
In the stepwise approach this forecast can be used as it stands. In the context-range approach, however, some information is thrown away, and we effectively use a summary of a forecast. With the context-range approach we assume for simplicity that all fields are forecast independently. Hence the forecast relation of Temperature to Time will be lost, and there will be two independent forecasts: one for Temperature, and one for Time (which will represent the next 3 hours). In the two above examples, the forecasts for Temperature would be 10..20 and 5..15 respectively.
Overall this puts the following requirements on the search query used to build a cache:
The current Context Matcher caters for (a), though ranges in 2D, 3D, etc. must be rectangular; it does not cater for (b) at all. Because of this, our initial work in investigating caches will concentrate on numeric fields -- these are, in any case, probably the types of field that will be most commonly forecast in practice.
A linear range is an adequate way of capturing the forecast values of a one-dimensional numeric value. With multi-dimensional values, on the other hand, a forecast might in principle cover an area of any shape.
For the purpose of example, we will here consider a Location field, which we will assume to be a two-dimensional value represented by X and Y co-ordinates. A forecast range of locations could be a rectangle, a circle, or a much more complicated shape. For example, with the context-range approach, a forecast might assume that the user was following a certain circular guided tour. Forecast shapes get particularly complex when the topology of the ground and the nature of travel of the user are taken into account. For example the shape might be:
Complex shapes can be approximated by polygons, but for simplicity we will assume here that (1) a forecast is just based on linear interpolation, and (2) no account is taken of ground topology and inaccessible areas. With these assumptions the area covered by a forecast can be represented by a segment taken from a circle with the current position at the centre, i.e. a piece of pie. (If forecasts are deemed to have high uncertainty it will be a generous helping of pie, but it will become vanishingly thin as forecasts are assumed to be increasingly certain.)
With the current Context Matcher we need to approximate the piece of pie by a rectangle. Worse still, the rectangle must have its sides parallel with the N-S and E-W axes. This is not a problem if, say, the user was forecast to travel due East, but would lead to an unnecessarily large rectangle if the user was, say, travelling North-East (indeed the rectangle would be at least as big as the quadrant bounded by the N and E axes). Similarly, circular areas need to be approximated by rectangles. (Todo: can we still do some useful tests, or do we have to invest a lot of effort extending the Context Matcher to cover arbitrary rectangles?) Gareth: See your point, not sure what the answer is. Why was it implemented this way to start with?
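To make the approximation concrete, the following sketch (assumed geometry; the function name is ours) computes the axis-aligned bounding rectangle of a piece of pie by sampling its arc. Running it for a due-East heading and then a North-East heading illustrates how the rectangle inflates when travel is not axis-aligned:

```python
import math

def sector_bounding_box(centre, radius, heading, half_angle, samples=360):
    """Axis-aligned bounding box of a circular sector ("piece of pie").

    heading and half_angle are in radians; the sector spans the angles
    [heading - half_angle, heading + half_angle] out from `centre`.
    """
    cx, cy = centre
    xs, ys = [cx], [cy]            # the apex (current position) is included
    for i in range(samples + 1):
        a = heading - half_angle + (2 * half_angle) * i / samples
        xs.append(cx + radius * math.cos(a))
        ys.append(cy + radius * math.sin(a))
    return (min(xs), min(ys)), (max(xs), max(ys))

# Travelling due East: the rectangle is a reasonably tight fit.
print(sector_bounding_box((0, 0), 10, heading=0.0,
                          half_angle=math.radians(10)))
# Travelling North-East: the axis-aligned rectangle is much larger than
# the sector it encloses.
print(sector_bounding_box((0, 0), 10, heading=math.radians(45),
                          half_angle=math.radians(10)))
```

Sampling the arc is crude but adequate for a sketch; an exact box would check whether the sector crosses each axis direction.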
In all cases the documents actually retrieved and placed in the cache might have associated locations outside the forecast shape. This is because the matching algorithm for locations might give a good score for a match even if two locations were a mile apart; in such a case retrieved documents could have an associated location one mile away from the edge of the forecast shape.
An important research question is the evaluation of cache strategies and the optimal setting of parameters (e.g. the time interval to be used in the stepwise approach) -- see [2] for a general discussion of the issues. The current Context Matcher is most suitable for investigating the stepwise approach. Some possible experiments are:
Todo