Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF,
UK
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
Gareth: I'm not too familiar with the literature on caching algorithms, but it's probably worth exploring these, or at least making clear reference to them, if
you have not done so already. Should we also consider issues related to something like paging within a cache, or is this moving too quickly at this point?
Time is a difficult parameter to use: it moves forward linearly, so it is possible to make predictions (probably application-specific) of when to refresh.
Other contexts, such as location (or at least location in one dimension, to keep things easy), will generally not change linearly, and so there needs to be a mechanism to
refresh the cache as needed. Perhaps we can relate this more closely to the virtual memory principle of "locality".
The idea behind a context-aware cache is explained in [1]. It is useful to compare context-aware caching with other types of cache. Two well-known examples are memory caches [5] and web-page caches [6]. We will call these addressed caches because information is accessed through an address (the memory address or the URL) -- and this address is stored in the cache against the information it contains. Context-aware caches, on the other hand, are retrieval caches. Both addressed caches and retrieval caches are used (a) to improve retrieval speed and/or (b) to cater for disconnected operation (or more generally to cater for cases where data transmission is cheap/fast at the time the cache is installed, and expensive/slow at times when the cache is used). A second similarity is that either sort of cache may apply to a single user or to a group of users. There are, however, several differences between addressed caches and retrieval caches:
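The distinction can be sketched in code. The following Python fragment is purely illustrative (both classes and their methods are our own hypothetical names, not part of any existing system): an addressed cache needs the exact address to score a hit, whereas a retrieval cache is simply a smaller collection that any query can be run against.

```python
class AddressedCache:
    """Entries are keyed by an exact address (memory address, URL, ...)."""
    def __init__(self):
        self._store = {}

    def put(self, address, document):
        self._store[address] = document

    def get(self, address):
        # A hit needs the exact address; anything else is a miss.
        return self._store.get(address)


class RetrievalCache:
    """Entries form a small document collection, searched like the original."""
    def __init__(self, documents):
        self._documents = documents   # a subset of the full collection

    def retrieve(self, score, threshold):
        # Any query can be answered: score each cached document and keep
        # those above the threshold, exactly as for the full collection.
        return [d for d in self._documents if score(d) >= threshold]
```

The point of the sketch is that the addressed cache can only answer "do I hold this address?", while the retrieval cache answers arbitrary queries, just over fewer documents.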
The conclusions from the above list of differences are that, if we are looking for literature references to throw light on context-aware retrieval, we can certainly find some relevant material in the literature relating to addressed caches, but we also need to look in the retrieval literature.
As an aside, we could go back on the assumption that a context-aware cache is treated like the original document collection, and instead try to take advantage of information gleaned in building the cache to devise fast ways of accessing the cache. For example, if our application centred on location we could build a cache that was indexed by location, although the original document collection was not. Alternatively we could explore indexing the cache by means of queries. If any of the query fields are numeric, it will be rare for exactly the same query to be issued twice (e.g. the location will normally be different from previous locations, if only by a few centimetres); however, if the cache is indexed by queries, we might find a query in the index that is close to the present one, and use this as a route into the cache. We then have a hybrid form of cache: an "addressed retrieval cache". However, all this is, we suggest, for the future. For the time being the cache is simply another, hopefully smaller, document collection. Gareth: Interesting points which made me think of something else. The cache should serve several purposes: make it better able to cope with disconnection, avoid overwhelming the main server with requests (aside: perhaps we should consider application of cache "push" -- following on from the Rutgers talk at SIGIR last year), and to give rapid access to information. On the last point, rapid access relies on appropriate data structures (e.g. an inverted file) which take time to build and are expensive to update. In view of this we will need to think very carefully about the physical data structure, update and searching of the cache, to make search fast and update cost-effective.
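As a sketch of the "addressed retrieval cache" idea, assuming numeric query fields that can be compared by Euclidean distance (the class and its tolerance parameter are our own hypothetical choices):

```python
import math

class QueryIndexedCache:
    """Cache indexed by past queries: a new query reuses the cached results
    of the nearest previous query, if one is close enough."""
    def __init__(self, tolerance):
        self.tolerance = tolerance     # how close a past query must be
        self._entries = []             # list of (query_vector, results)

    def store(self, query, results):
        self._entries.append((query, results))

    def lookup(self, query):
        # Find the nearest stored query (Euclidean distance over the
        # numeric fields); fall through to the collection if none is close.
        best, best_dist = None, math.inf
        for past_query, results in self._entries:
            dist = math.dist(query, past_query)
            if dist < best_dist:
                best, best_dist = results, dist
        return best if best_dist <= self.tolerance else None

cache = QueryIndexedCache(tolerance=5.0)
cache.store((100.0, 200.0), ["doc1", "doc2"])
# A query a few centimetres away counts as "the same" for caching purposes:
print(cache.lookup((100.1, 200.0)))   # -> ['doc1', 'doc2']
print(cache.lookup((500.0, 500.0)))   # -> None (too far; miss)
```

A linear scan suffices for a sketch; a real index over queries would need a spatial data structure.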
With web caching a key issue is consistency, and this has been widely studied [3]. The same issues can apply to context-aware caches when the content of the document collection is dynamic. We suspect that the issues are similar to those for web caches and will have similar solutions. Here we ignore them, and assume the document collection is static. With memory caches, at least for data, there is the issue that the application may write to memory, and this necessitates an update both to the memory and to the original source; such issues are unlikely to apply to retrieval caches: retrieval applications do not tend to allow end-users to change the information they have retrieved. Gareth: This is a real difference. We don't need to write to memory, and thus consistency for data sharing is not a problem, at least for a static document set; a problem may only arise if the document set is dynamic.
In general a cache can either be connected to the original document collection, or disconnected from it.
A connected cache offers extra possibilities on how the cache is built and how it is monitored. If we consider a retrieval cache, a disconnected cache will consist of entire documents taken from the document collection. For a connected cache, on the other hand, the cache might just contain the information about a document that is needed for retrieval, such as the context associated with it, and the full version of each document, when needed, can be taken from the original document collection. Gareth: Need to think about the reasons for having a particular cache. Probably want to limit requests to central server for individual documents.
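As a sketch of this (all interfaces here are hypothetical), a connected cache might hold only each document's associated context as a surrogate, match against these, and fetch the full document from the central server only for documents that actually score above the threshold:

```python
class ConnectedCache:
    """Connected retrieval cache holding only per-document surrogates."""
    def __init__(self, fetch_document):
        # fetch_document(doc_id) contacts the central server; it is assumed
        # to be available while we remain connected.
        self._fetch = fetch_document
        self._surrogates = {}   # doc_id -> context used for matching

    def add(self, doc_id, context):
        self._surrogates[doc_id] = context

    def retrieve(self, match, query, threshold):
        # Match against the lightweight surrogates; request full documents
        # from the server only for the actual hits, limiting server load.
        hits = [d for d, ctx in self._surrogates.items()
                if match(query, ctx) >= threshold]
        return {d: self._fetch(d) for d in hits}
```

This keeps the cache small and limits requests to the central server to one per retrieved document.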
An alternative, more radical, approach for a connected cache is still to use the original document collection, but to change its surrogate data structures so that:
Gareth: I don't think that this would scale to large numbers of users.
For example if a surrogate data structure is an index, we could cut out, say, 99% of the index entries, thus yielding a smaller but more limited index. This approach might be unlikely for a textual index, but if the index were, say, an index of locations, it might be attractive to build a sub-index that only covered locations near to the user. Whether such an approach can be called "caching" is an open question: it is essentially a different implementation technique to achieve the same purpose as caching.
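For instance, a location sub-index might be cut down as follows (a sketch; the index layout, mapping locations to document ids, is our own assumption):

```python
def build_sub_index(location_index, centre, radius):
    """Keep only index entries whose location is within `radius` of `centre`.

    `location_index` maps (x, y) co-ordinates to lists of document ids.
    """
    cx, cy = centre
    return {loc: docs for loc, docs in location_index.items()
            if (loc[0] - cx) ** 2 + (loc[1] - cy) ** 2 <= radius ** 2}

full_index = {(0, 0): ["a"], (3, 4): ["b"], (100, 100): ["c"]}
near = build_sub_index(full_index, centre=(0, 0), radius=10)
print(sorted(near))   # -> [(0, 0), (3, 4)]; the distant entry is cut out
```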
Another advantage of a connected retrieval cache is that its performance can be continually monitored, e.g. when the user issues a query that is applied to the cache, this is treated normally, but, as a background job, the same request is applied to the original document collection. When the background job has finished, its retrieved documents are compared with those retrieved from the cache. If the two differ to the extent that the cache is deemed to be failing, the cache can be replaced or incrementally updated. Otherwise the cache can continue to be used indefinitely. Gareth: This idea is interesting. I don't think that we would want to check all the time, but maybe an occasional check. This comparison might provide some useful input to a cache management algorithm and some parameter update using some sort of machine learning.
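A minimal sketch of the comparison step, assuming a simple overlap measure (the function name and the 80% default are our own illustrative choices, and a real cache-management algorithm might weight documents by score):

```python
def cache_still_adequate(cache_results, full_results, min_overlap=0.8):
    """Deem the cache failing when it misses too many of the documents the
    full collection would have retrieved for the same query."""
    if not full_results:
        return True
    overlap = len(set(cache_results) & set(full_results)) / len(set(full_results))
    return overlap >= min_overlap

print(cache_still_adequate(["a", "b", "c"], ["a", "b", "c", "d"]))        # -> False (3/4 = 0.75)
print(cache_still_adequate(["a", "b", "c", "d"], ["a", "b", "c", "d"]))   # -> True
```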
The above possibilities are interesting, but initially we shall confine ourselves to caching techniques that apply to disconnected operation as well as connected operation. We will therefore continue with our assumption that the cache is a subset of the documents in the original document collection; furthermore we will also assume, following our earlier comment that an "addressed retrieval cache" is something for the future, that the cache is accessed in the same way that the original document collection was.
When building a context-aware cache, we need to decide the set of contexts the cache is intended to cover. We call this the coverage of the cache. The coverage is often derived from a forecast centred round one or two contextual fields, the key parameters, that are vital to the application. For example:
Thus the coverage of the cache is the union of all the contexts that are likely to occur. If the fields are numeric, the union can be expressed as a range of values for each numeric field, e.g. the range of locations that can occur, the range of temperatures, etc. (Actually, taking the union on a field-by-field basis leads to a rather bigger coverage, since relationships between values of different fields are lost, but this probably does not matter.) If this is done, the coverage is then a single context, but one whose values are likely to have wide ranges. We call this the context-range approach. (A much more subtle approach uses probabilities, e.g. the probability that the user will visit a given location [4], rather than a range where all values are treated equally.)
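The field-by-field union can be sketched as follows (a minimal illustration, assuming each forecast context is a dict of numeric fields; note how the relation between Temperature and Time in the individual forecasts is lost):

```python
def coverage(forecast_contexts):
    """Take a list of forecast contexts (dicts mapping field name to a
    numeric value) and return one context-range: field -> (min, max)."""
    fields = forecast_contexts[0].keys()
    return {f: (min(c[f] for c in forecast_contexts),
                max(c[f] for c in forecast_contexts))
            for f in fields}

forecasts = [{"Temperature": 10, "Time": 0},
             {"Temperature": 16, "Time": 90},
             {"Temperature": 20, "Time": 180}]
print(coverage(forecasts))
# -> {'Temperature': (10, 20), 'Time': (0, 180)}
```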
All our techniques to build context-aware caches assume that a small change in the current context will lead to a correspondingly small change in the documents retrieved and their scores, i.e. that the retrieval space is in some sense continuous. Building a cache consists of the following sequence: (1) setting the key parameters for the cache; (2) making a forecast of the current contexts that form the coverage of the cache; (3) performing a retrieval, using the coverage as the query; (4) adjusting the scores of retrieved documents to factor in, e.g., experience of past retrievals -- thus documents that were often retrieved by similar users in the past might get their scores raised; (5) taking as the cache all the retrieved documents whose score exceeds some threshold.
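The five-step sequence above can be sketched as a pipeline. In the following Python fragment every function passed in (`forecast`, `retrieve`, `adjust_score`) is a hypothetical stand-in for an application-specific component, not an existing interface:

```python
def build_cache(current_context, forecast, retrieve, adjust_score, threshold):
    # (1)-(2): forecast the coverage of the cache from the current context
    # (the key parameters are assumed to be baked into `forecast`).
    cov = forecast(current_context)
    # (3): perform a retrieval, using the coverage as the query.
    scored = retrieve(cov)                        # list of (doc, score)
    # (4): adjust scores, e.g. raising documents often retrieved by
    # similar users in the past.
    scored = [(doc, adjust_score(doc, s)) for doc, s in scored]
    # (5): the cache is every document whose score exceeds the threshold.
    return [doc for doc, s in scored if s >= threshold]
```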
The crudest approach to building a cache is to forecast that the current context will not change much, and thus to use the current context at step (3). We might set a low threshold for the retrieval so that documents "far away" are still retrieved. A slight improvement on this is to assume that numeric values will change randomly from the current ones, and thus to set each as a range centred on the current point. This is an example of a context-range approach.
More sophisticated context-range approaches depend on a specific forecast of the value of each contextual field within the coverage of the cache. One approach, the one explained in [1], is to forecast, using past experience or information about the future, the union of values of each field of the current context within the coverage of the cache, e.g. that, given past data and the current weather forecast, that the temperature will be in the range T1 to T2.
An alternative approach is to forecast the current context at successive time intervals, every five minutes say, during the time coverage of the cache, and then to do a retrieval for each forecast. The cache is then the union of the documents retrieved for each forecast. Thus with a five-minute time interval and a time-based cache with a lifetime of three hours, there would be a sequence of 36 forecasts, and the cache would be built from the union of 36 retrievals (which might take a significant time to do). We call this the stepwise approach. With the stepwise approach there is increasing uncertainty at each step in the sequence. For example, if a forecast of a numeric value at the end of a five-minute interval was deemed to involve 10% uncertainty, then there is the analogy of a 10% compound interest rate. Generally the stepwise approach, since its forecasts are more specific, is a more high-risk, high-reward approach.
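The stepwise approach can be sketched as follows. The forecast and retrieval functions are hypothetical stand-ins, and the compound-interest analogy appears as a forecast half-width growing by 10% per step:

```python
def stepwise_cache(forecast_at_step, retrieve, n_steps):
    """One retrieval per forecast step; the cache is the union of results."""
    cache = set()
    for step in range(n_steps):
        cache |= set(retrieve(forecast_at_step(step)))
    return cache

# A three-hour lifetime with five-minute steps gives 36 forecasts and
# 36 retrievals:
n_steps = (3 * 60) // 5
print(n_steps)   # -> 36

# Compounding uncertainty: a forecast half-width growing at 10% per step
# has roughly tripled after one hour (12 steps).
base_width = 1.0
print(round(base_width * 1.1 ** 12, 2))   # -> 3.14
```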
The operation of forecasting and, for the context-range approach, taking a union, may be applied to numeric or to non-numeric data types. In the numeric cases the union can be expressed as a range, but in the non-numeric cases such as text and image the forecast is likely to be expressed as an ORing of individual values. To take a numeric example, the forecast for a Temperature field can be expressed as a range such as 10..20 (the notation "10..20" means the range from 10 to 20, inclusive, expressed in the syntax used by the Context Matcher). To take a non-numeric example, we assume that there is a User-interest field whose value acts as a textual query, e.g. "gothic architecture" or "recreation". There is no concept of a continuous range with textual fields such as this, but instead there may be sudden change between discrete values. Forecasting is, however, possible: the context diary (past history) may record the value of User-interest in contexts that are similar to the present one, and the forecast may then be an ORing together of the values derived from the context diary. Gareth: perhaps the profile could be predicted into the future from the diary. (Alternatively the forecast could be a sequence of discrete values tied to time, e.g. that the value will change from "gothic architecture" to "recreation" after one hour; in practice, however, it is unlikely that a forecast would attempt to be this accurate; it would be more realistic to forecast a union of discrete values, any of which might occur at any time.)
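A sketch of diary-based forecasting for a textual field (the diary layout and the similarity test are our own assumptions): the forecast is the set of values the field took in past contexts similar to the present one, i.e. an ORing of discrete values.

```python
def forecast_textual_field(diary, field, is_similar, current_context):
    """diary: list of past context dicts.  Returns the set of values the
    field took in diary entries similar to the current context; this set
    is the OR of discrete values used as the forecast."""
    return {entry[field] for entry in diary
            if field in entry and is_similar(entry, current_context)}

diary = [{"Location": (1, 1), "User-interest": "gothic architecture"},
         {"Location": (1, 2), "User-interest": "recreation"},
         {"Location": (9, 9), "User-interest": "shopping"}]
# A deliberately crude similarity test for illustration:
near = lambda e, c: abs(e["Location"][0] - c["Location"][0]) <= 2
print(sorted(forecast_textual_field(diary, "User-interest", near,
                                    {"Location": (1, 1)})))
# -> ['gothic architecture', 'recreation']
```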
Consider a particular numeric one-dimensional case, a Temperature field. Assume a time-based cache whose lifetime is 3 hours, and the current Temperature value is 10; two sample forecasts, which tie temperature to time, might be:
In the stepwise approach this forecast can be used as it stands. In the context-range approach, however, some information is thrown away, and we effectively use a summary of a forecast. With the context-range approach we assume for simplicity that all fields are forecast independently. Hence the forecast relation of Temperature to Time will be lost, and there will be two independent forecasts: one for Temperature, and one for Time (which will represent the next 3 hours). In the two above examples, the forecasts for Temperature would be 10..20 and 5..15 respectively.
Overall this puts the following requirements on the search query used to build a cache:
The current Context Matcher caters for (a), though ranges in 2D, 3D, etc. must be rectangular; it does not cater for (b) at all. Because of this, our initial work in investigating caches will concentrate on numeric fields -- these are, in any case, probably the types of field that will be most commonly forecast in practice.
A linear range is an adequate way of capturing the forecast values of a one-dimensional numeric value. With multi-dimensional values, on the other hand, a forecast might in principle cover an area of any shape.
For the purpose of example, we will here consider a Location field, which we will assume to be a two-dimensional value represented by X and Y co-ordinates. A forecast range of locations could be a rectangle, a circle, or a much more complicated shape. For example, with the context-range approach, a forecast might assume that the user was following a certain circular guided tour. Forecast shapes get particularly complex when the topology of the ground and the nature of travel of the user are taken into account. For example the shape might be:
Complex shapes can be approximated by polygons, but for simplicity we will assume here that (1) a forecast is just based on linear interpolation, and (2) no account is taken of ground topology and inaccessible areas. With these assumptions the area covered by a forecast can be represented by a segment taken from a circle with the current position at the centre, i.e. a piece of pie. (If forecasts are deemed to have high uncertainty it will be a generous helping of pie, but it will become vanishingly thin as forecasts are assumed to be increasingly certain.)
With the current Context Matcher we need to approximate the piece of pie by a rectangle. Worse still, the rectangle must have its sides parallel with the N-S and E-W axes. This is not a problem if, say, the user was forecast to travel due East, but would lead to an unnecessarily large rectangle if the user was, say, travelling North-East (indeed the rectangle would be at least as big as the quadrant bounded by the N and E axes). Similarly, circular areas need to be approximated by rectangles. (Todo: can we still do some useful tests, or do we have to invest a lot of effort extending the Context Matcher to cover arbitrary rectangles?) Gareth: See your point, not sure what the answer is. Why was it implemented this way to start with?
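To make the approximation concrete, the following sketch (assumed geometry; the function name is ours) computes the axis-aligned bounding rectangle of a piece of pie by sampling its arc. Running it for a due-East heading and then a North-East heading illustrates how the rectangle inflates when travel is not axis-aligned:

```python
import math

def sector_bounding_box(centre, radius, heading, half_angle, samples=360):
    """Axis-aligned bounding box of a circular sector ("piece of pie").

    heading and half_angle are in radians; the sector spans the angles
    [heading - half_angle, heading + half_angle] out from `centre`.
    """
    cx, cy = centre
    xs, ys = [cx], [cy]            # the apex (current position) is included
    for i in range(samples + 1):
        a = heading - half_angle + (2 * half_angle) * i / samples
        xs.append(cx + radius * math.cos(a))
        ys.append(cy + radius * math.sin(a))
    return (min(xs), min(ys)), (max(xs), max(ys))

# Travelling due East: the rectangle is a reasonably tight fit.
print(sector_bounding_box((0, 0), 10, heading=0.0,
                          half_angle=math.radians(10)))
# Travelling North-East: the axis-aligned rectangle is much larger than
# the sector it encloses.
print(sector_bounding_box((0, 0), 10, heading=math.radians(45),
                          half_angle=math.radians(10)))
```

Sampling the arc is crude but adequate for a sketch; an exact box would check whether the sector crosses each axis direction.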
In all cases the documents actually retrieved and placed in the cache might have associated locations outside the forecast shape. This is because the matching algorithm for locations might give a good score for a match even if two locations were a mile apart; in such a case retrieved documents could have an associated location one mile away from the edge of the forecast shape.
An important research question is the evaluation of cache strategies and the optimal setting of parameters (e.g. the time interval to be used in the stepwise approach) -- see [2] for a general discussion of the issues. The current Context Matcher is most suitable for investigating the stepwise approach. Some possible experiments are:
Todo