Research issues in context-aware retrieval: using history

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

Todo

INTRODUCTION

We believe that recording and using history has great potential for improving the performance of CAR systems. In particular it relates to the key issue of improving the relevance of the documents delivered to the user. Three possible types of history of interest are:

history of the current context.
history of the document collection: which documents have been delivered; which documents have changed.
adaptive feed-back: history of how relevant the user found the documents previously delivered.

We discuss each of these in turn. The first is our primary interest, since it is specific to CAR.

PART A: HISTORY OF THE CURRENT CONTEXT

Some authors have said that what is of interest is not the current context itself, but change in the current context. Most of us agree that there is a lot of truth in this. Obviously keeping a history is a pre-requisite for detecting change. In general the current context will consist of several fields, and some may be changing fast and some may be static. (One field, time, is perhaps unique in that it changes continually in a predictable way -- assuming the user does not wish to change the time by pretending that they are at some time in the past or future.) We would like to investigate ideas about weighting fields that are changing, perhaps giving special priority to fields that have suddenly changed after a long static period, or, more generally, fields that suddenly have an increased rate of change.
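As a minimal sketch of the weighting idea, assume each field's history is a list of (time, value) pairs for a numerical field. The function names, the split into an "earlier" and a "recent" window, and the boost formula are all our own illustrative assumptions, not part of any existing system. A field whose recent rate of change exceeds its earlier rate gets a raised weight, and the largest boost goes to a field that changes after a static period.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (time in seconds, field value)


def rate_of_change(samples: Sequence[Sample]) -> float:
    """Mean absolute change per second across consecutive samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    total = sum(abs(b[1] - a[1]) for a, b in zip(samples, samples[1:]))
    return total / elapsed


def change_weight(samples: Sequence[Sample], recent_window: float,
                  base: float = 1.0, boost: float = 2.0) -> float:
    """Hypothetical rule: raise the weight when the rate of change increases.

    The weight runs from `base` (no increase) up to `base + boost`
    (the field was static before the recent window and is changing now).
    """
    if not samples:
        return base
    cutoff = samples[-1][0] - recent_window
    earlier = [s for s in samples if s[0] <= cutoff]
    recent = [s for s in samples if s[0] >= cutoff]
    r_earlier = rate_of_change(earlier)
    r_recent = rate_of_change(recent)
    if r_recent <= r_earlier:
        return base  # static, steady, or slowing down: no special priority
    return base + boost * (r_recent - r_earlier) / r_recent
```

With these assumptions a field that sat at one value for several minutes and then jumped receives the full boost, while a field drifting at a constant rate keeps its base weight.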

Such uses of change mean that the application should keep a complete history of change, at least for the current session. (We could even keep histories of past sessions too -- perhaps even relating to different users -- in order to try to detect patterns, but at present we are not interested in such deep and complex analysis.)

History is most obviously useful for numerical fields, but may also be useful for fields of other data types, in particular textual fields. Change and prediction of separate fields may be interdependent: if the air pressure is dropping, the likelihood of rain is increasing, i.e. two fields Pressure and Rain-likelihood may be interdependent; if the time is approaching 1 p.m., the user preferences field may be likely to relate to eating, i.e. the Time and Preferences fields may be interdependent.

Interestingly, history can sometimes be generalised to include the future as well as the past. For example if the user's diary says they are planning to be at a meeting at a certain location in two hours' time, then this can be recorded as a "future" item in the history of their location field. Indeed some fields may naturally relate to the future rather than the past, e.g. a temperature field derived from a weather forecast; the "historical" information for such fields may be largely or even totally related to the future. Hence when we use the word "history" in this paper, we include the future too: in other words our history is looking back from a viewpoint long in the future. (An alternative term would be "time record".) Of course as time goes by, future events can become past events -- but perhaps only if they are detected as really happening, e.g. if sensor values indicate that a user really did attend a scheduled meeting.

In addition history can be used for prediction: predicting the future values of fields. The rationale for this is that users are likely to be more interested in information relating to contexts that are ahead of them than to contexts that are behind them. For example a location field for a location ahead of the user might be more interesting than a location behind. Of course predictions can be invalidated by a sudden change (the temperature suddenly drops after a period of rise or the user veers from their previous path): in such cases the application may wish to take fast remedial action by cancelling the previous retrieval operation if it is still running, and initiating a new one.
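The simplest form of such prediction can be sketched as follows. We assume a numerical field whose history is a list of (time, value) pairs, and we use linear extrapolation from the last two samples purely for illustration; any other predictor could be substituted. The function names and the fixed tolerance are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (time, value)


def extrapolate(samples: Sequence[Sample], at_time: float) -> float:
    """Predict a field value at `at_time` from the last two samples.

    Assumption: the field keeps changing at its most recent rate.
    """
    if not samples:
        raise ValueError("cannot predict from an empty history")
    if len(samples) == 1:
        return samples[-1][1]
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 == t0:
        return v1
    slope = (v1 - v0) / (t1 - t0)
    return v1 + slope * (at_time - t1)


def prediction_invalidated(predicted: float, observed: float,
                           tolerance: float) -> bool:
    """True when the observed value has strayed too far from the prediction.

    A caller might respond by cancelling a running retrieval and issuing
    a new one based on the observed value.
    """
    return abs(observed - predicted) > tolerance
```

For instance, a temperature that rose from 10 to 12 over ten seconds is predicted to reach 14 after a further ten; an observed value of 9 at that point would, with a tolerance of 2, mark the prediction as invalid.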

Prediction is also useful in other circumstances, though these are not of priority to us at present:

to reduce the search space: if the retrieval engine has a good idea of what the future holds, it can spend some time extracting the subset of the documents in the collection that are likely to be relevant in the future, and to use this subset for future queries. For example if the documents have attached locations, and if the user is currently on the Cornwall/Devon border, then the subset used may be just the documents relating to Cornwall and/or Devon. Caching is another example of the same mechanism, and is particularly useful if the user is only periodically connected to the document collection: the strategy may then be to download into a cache those documents likely to be relevant while the user is disconnected. (The application will then need to have its own internal simplified retrieval engine in order to retrieve from this cache while disconnected from the main retrieval engine.)
to anticipate future queries: CAR often requires fast retrieval, but there will also usually be times when the retrieval engine is idle. During these idle times, the retrieval engine may profitably try to predict the next query, and to perform this retrieval so that, if the prediction was correct, the response can be delivered immediately the next query arrives.
to cater for a sensor that has ceased to work (e.g. GPS in an urban canyon): prediction can be used to supply a value (or a range of possible values) for the malfunctioning sensor, though in such cases the field may be given a decreased weight because of its uncertainty.
to deal with retrieval delay: if each retrieval request takes two minutes, either because of slow performance or periodic connectivity, then each request should ideally relate to the predicted current context in two minutes' time.
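The sensor-failure case in the list above can be sketched as follows. We assume a numerical field, linear extrapolation from the last two real samples, and a weight that halves for every `half_life` seconds since the last real reading; the decay rule, the names and the default values are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class FieldReading:
    value: float
    weight: float
    predicted: bool  # True when the value was not measured


def read_or_predict(sensor_value: Optional[float],
                    samples: Sequence[Tuple[float, float]],
                    now: float,
                    full_weight: float = 1.0,
                    half_life: float = 60.0) -> FieldReading:
    """Return the sensor value, or a predicted value with a reduced weight."""
    if sensor_value is not None:
        return FieldReading(sensor_value, full_weight, predicted=False)
    if not samples:
        raise ValueError("sensor unavailable and no history to predict from")
    t1, v1 = samples[-1]
    value = v1
    if len(samples) >= 2:
        t0, v0 = samples[-2]
        if t1 != t0:
            value = v1 + (v1 - v0) / (t1 - t0) * (now - t1)
    # Hypothetical decay: the older the last real reading, the lower the weight.
    age = max(0.0, now - t1)
    return FieldReading(value, full_weight * 0.5 ** (age / half_life),
                        predicted=True)
```

The same extrapolation, applied with `now` set to the expected completion time of a request, would give the context to submit when retrieval is slow.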

We will use the collective term history exploitation strategies to describe the kinds of strategy we have indicated above.

Implementing history and history exploitation strategies

Implementing the history of the current context is not a major task. It is just necessary to remember each current context at one of the following occasions:

at the time it is sent to the retrieval engine. This is the approach we prefer.
at other intervals as set by the application (e.g. every 5 minutes, or whenever a field value has changed by more than a certain threshold). In this case the history should also record when retrieval queries are sent to the retrieval engine, because the optimisation strategy may depend on how frequent the queries are, and on when the next one is anticipated.

We call the set of remembered current contexts the history archive. [[NB: Name now changed to CONTEXT DIARY, since it covers the future as well as history.]] Each remembered context needs to have a time attached to it. (Alternatively we could abandon the traditional concept of a history based on time, and instead organise history around some other field, such as location; as an added refinement, the concept of remembering user trails is a history based on two fields: time and location.) A possible complication is that there might be several histories: e.g. the history of the real current context, as detected by sensors, etc., and a history of pretended fields created by the user. The latter might be completely different from the former, so it is useful to keep them separate. Alternatively the system might only create a history archive for real values detected by sensors, and might only record the history of a field when that field was known to have a real value. (We assume that the pretended worlds are much less predictable and less continuous than the real world, and thus the value of their history is much less -- or even worthless.)
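One possible shape for such an archive is sketched below. The class and method names are hypothetical, and we assume that each context is a simple mapping from field names to values. Entries are kept in time order, and a `confirmed` flag (our own addition) distinguishes values that were really observed from planned or forecast ones.

```python
import bisect
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass(order=True)
class DiaryEntry:
    time: float                                   # only `time` is used for ordering
    context: Dict[str, Any] = field(compare=False)
    confirmed: bool = field(default=True, compare=False)  # False: planned/forecast


class ContextDiary:
    """Time-ordered record of remembered contexts."""

    def __init__(self) -> None:
        self._entries: List[DiaryEntry] = []

    def record(self, time: float, context: Dict[str, Any],
               confirmed: bool = True) -> None:
        """Remember a copy of `context`, keeping entries sorted by time."""
        bisect.insort(self._entries, DiaryEntry(time, dict(context), confirmed))

    def field_history(self, name: str,
                      until: Optional[float] = None) -> List[Tuple[float, Any]]:
        """(time, value) pairs for one field, optionally only up to `until`."""
        return [(e.time, e.context[name]) for e in self._entries
                if name in e.context and (until is None or e.time <= until)]


# Keeping real and pretended histories apart is then just a matter of
# holding two separate diaries.
real_diary = ContextDiary()
pretended_diary = ContextDiary()
```

A diary organised round a field other than time would need a different ordering key; this sketch covers only the time-based case.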

A much bigger task is how to incorporate into the retrieval engine some hooks for history exploitation strategies. This is in addition to the need for general hooks that are not connected with history, e.g. a hook to insert an algorithm to match two temperatures and return a score that records how well the two matched. We assume we have the following architecture:

(1): a pre-processor (or set of pre-processors) that can massage the current context and set field-weightings.
(2): a retrieval engine. This has a facility for the caller to set the relative weights of fields. It also has hooks for plugging in algorithms which (a) take two values corresponding to two field values that are to be matched, and return a score indicating how well those values match (this is the general hook we mentioned above), and (b) calculate the overall score for a match by combining the scores from its constituent fields. We also assume that any of these can be changed dynamically.
(3): a post-processor (or set of post-processors) that takes the output from the retrieval engine, which is a ranked set of documents each with an overall matching-score, and massages this output before presentation to the user.

We assume that the history archive is available to each of the above, if required.
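The assumed architecture might be sketched as follows. Everything here is illustrative: the call signatures for pre-processors and post-processors, the `RetrievalEngine` class, and the two example hooks (`match_temperature` and `weighted_mean`) are our own assumptions, chosen only to show where the hooks sit.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Matcher = Callable[[Any, Any], float]
Ranked = List[Tuple[float, Context]]


def match_temperature(wanted: float, found: float, scale: float = 10.0) -> float:
    """Example hook (a): 1.0 for equal temperatures, falling linearly to 0.0."""
    return max(0.0, 1.0 - abs(wanted - found) / scale)


def weighted_mean(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Example hook (b): weighted mean of field scores (default weight 1.0)."""
    total = sum(weights.get(name, 1.0) for name in scores)
    if total == 0:
        return 0.0
    return sum(s * weights.get(name, 1.0) for name, s in scores.items()) / total


class RetrievalEngine:
    def __init__(self, matchers: Dict[str, Matcher], combine) -> None:
        self.matchers = dict(matchers)  # per-field matching algorithms
        self.combine = combine          # overall scoring algorithm
        self.weights: Dict[str, float] = {}

    def retrieve(self, context: Context, documents: List[Context]) -> Ranked:
        ranked = []
        for doc in documents:
            scores = {name: self.matchers[name](context[name], doc[name])
                      for name in context
                      if name in doc and name in self.matchers}
            ranked.append((self.combine(scores, self.weights), doc))
        ranked.sort(key=lambda pair: pair[0], reverse=True)
        return ranked


def run_pipeline(context, documents, diary, preprocessors, engine, postprocessors):
    """Pre-process, retrieve, post-process; the diary is passed to every stage."""
    weights: Dict[str, float] = {}
    for pre in preprocessors:
        context, weights = pre(context, weights, diary)
    engine.weights = weights
    results = engine.retrieve(context, documents)
    for post in postprocessors:
        results = post(results, context, diary)
    return results
```

Because the matchers and the combiner are ordinary attributes of the engine, they can be replaced between queries, which is one way to meet the requirement that any of them can be changed dynamically.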

One approach is to use pre-processing. A pre-processor is a natural place to incorporate field-weighting strategies, but it can also be used for wider strategies such as prediction. In the latter case a possible principle is to assume that the user does not want information relating to their current context as such, but instead that they want information relating to their context-of-interest, which is a context slightly ahead of their current context. With this principle the task of the pre-processor is to take a current context and a history archive and to derive a context-of-interest on the basis of prediction. The context-of-interest is then passed to the retrieval engine in place of the current context. If a pre-processor is almost totally committed to prediction, it is convenient to call it a prediction engine, so that its purpose is clear.
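A prediction engine of this kind might look like the sketch below. We assume that the history is available as a mapping from field names to (time, value) lists, that numerical fields are projected forward linearly by a fixed look-ahead, and that the context carries a field called "time"; all of these are illustrative assumptions. Non-numerical fields are passed through unchanged.

```python
from typing import Any, Dict, List, Tuple

Number = (int, float)


def context_of_interest(current: Dict[str, Any],
                        history: Dict[str, List[Tuple[float, Any]]],
                        now: float,
                        lookahead: float) -> Dict[str, Any]:
    """Derive a context slightly ahead of the current one.

    Each numerical field with at least two samples is moved forward by
    `lookahead` seconds at its most recent rate of change.
    """
    predicted = dict(current)
    for name, samples in history.items():
        if name not in current or len(samples) < 2:
            continue
        (t0, v0), (t1, v1) = samples[-2], samples[-1]
        if t1 == t0:
            continue
        if not all(isinstance(v, Number) for v in (v0, v1, current[name])):
            continue  # e.g. textual fields: no prediction attempted here
        slope = (v1 - v0) / (t1 - t0)
        predicted[name] = current[name] + slope * lookahead
    predicted["time"] = now + lookahead  # assumed name of the time field
    return predicted
```

The result is what would be handed to the retrieval engine in place of the current context.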

An alternative is to incorporate history exploitation strategies into the retrieval engine itself, or more specifically into the matching algorithms that are plugged into it. For example the algorithm for matching two field values would look at the history of the field that represented the present context, and would use this as a factor in calculating the score. We have a certain reluctance to use this approach, since it has a flavour of being monolithic rather than modular, but perhaps we should allow it. The only cost appears to be making the history archive available to algorithms plugged in by the user. If the retrieval engine is a separate server, an alternative approach to our history archive may be attractive: each field of the current context supplied by the client consists of the current value together with a history of past values. However we will assume the history archive approach since it appears to be more widely useful.

The third alternative is to incorporate history exploitation strategies into a post-processor. We assume the retrieval engine returns a set of matched documents with an overall matching-score on each document, and an individual matching-score on each field that was matched within the document. The uses of history in a post-processor mirror the uses of a post-processor in general:

it might be most convenient to issue a relatively crude retrieval request (i.e. not worrying about pre-processing and/or tailoring the matching process of the retrieval engine) and then to extract what is really wanted during post-processing (e.g. picking out documents that are "ahead"). In some cases the requirements might be difficult to specify in advance (e.g. to give higher matching-scores to locations that are roughly in the current direction of travel), and thus post-processing is a better answer.
following on from the previous point, it can be useful to see the whole picture before refining the results of a search. As an example if the user was interested only in documents where both the location and the time had a high matching-score, this can only be done after all the scores have been seen.
combatting errors in prediction and delays in time. When the retrieval engine is relatively slow, the current context may have changed in a surprising way since the original request was made. The ill-effects of this can be mitigated by post-processing, e.g. increasing matching-scores of documents whose location matches the user's true current location rather than the one expected at the time the request was made.
taking the results from several retrieval operations (perhaps using different retrieval engines) and combining the results.
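As one illustration of the first use listed above, a post-processor might favour documents that lie roughly in the current direction of travel. In this sketch we assume flat (x, y) coordinates, a heading vector derived elsewhere (for example from the last two recorded positions), and a score multiplier based on the cosine of the angle between the heading and the direction to the document; the function name and the `boost` parameter are hypothetical.

```python
import math
from typing import Any, Dict, List, Tuple

Ranked = List[Tuple[float, Dict[str, Any]]]


def boost_ahead(results: Ranked,
                position: Tuple[float, float],
                heading: Tuple[float, float],
                boost: float = 0.5) -> Ranked:
    """Re-rank results, raising scores ahead of the user and lowering those behind.

    Each score is multiplied by (1 + boost * cos(angle)), so a document dead
    ahead gains `boost` and one directly behind loses the same proportion.
    """
    hx, hy = heading
    hnorm = math.hypot(hx, hy)
    reranked = []
    for score, doc in results:
        dx = doc["location"][0] - position[0]
        dy = doc["location"][1] - position[1]
        dnorm = math.hypot(dx, dy)
        if hnorm == 0 or dnorm == 0:
            cosine = 0.0  # stationary user, or document at the user's position
        else:
            cosine = (dx * hx + dy * hy) / (dnorm * hnorm)
        reranked.append((score * (1.0 + boost * cosine), doc))
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked
```

The same function could be applied with the user's true current position once a slow retrieval has completed, which addresses the delay problem described in the list.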

PART B: HISTORY OF THE DOCUMENT COLLECTION

In some applications, the document collection may be dynamic, with documents continually being added, deleted or altered. An example would be a collection relating to traffic information. In such cases the history of change may be useful to the retrieval process: for example recent or frequently changing documents may be given a higher score.

Even with static collections, another piece of history may be important: the history of which documents have been already passed to the application. A document never before retrieved may be given a higher score than one that was retrieved on the last retrieval request. These pieces of history may also be important if a document is deleted: for example if a document relating to a traffic problem has now been deleted, and if this document has recently been passed to the application, then the application might like to be told of the deletion. Again, this kind of history may incorporate the future. It may happen, for example, that a document's content is updated regularly every hour, and this knowledge may be used to refine scores (e.g. if the content was updated 59 minutes previously, its score may be low).
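A score adjustment based on these two pieces of history might be sketched as follows. The penalty factors, the margin and the function name are illustrative assumptions; we assume the application keeps a mapping from document identifiers to the time each document was last delivered, and that the next update time of a document is known when its updates are regular.

```python
from typing import Dict, Optional


def adjust_for_collection_history(score: float,
                                  doc_id: str,
                                  delivered_at: Dict[str, float],
                                  now: float,
                                  next_update_at: Optional[float] = None,
                                  repeat_penalty: float = 0.5,
                                  update_margin: float = 60.0,
                                  pending_update_penalty: float = 0.5) -> float:
    """Lower the score of documents already delivered or about to be updated."""
    adjusted = score
    if doc_id in delivered_at:
        # Already passed to the application: a fresh document is preferred.
        adjusted *= repeat_penalty
    if next_update_at is not None and 0 <= next_update_at - now <= update_margin:
        # An update is imminent, so delivery may be better left to a later request.
        adjusted *= pending_update_penalty
    return adjusted
```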

All these uses of the history of the document collection are not in principle tied to context-aware retrieval, but could apply in any situation where the document collection is dynamic. However the strategies used may be specific to context-aware retrieval, mainly because in CAR the user is generally issuing a continuous stream of slowly-changing queries (thus, for example, a document that is to be updated in one minute's time may be given a low score because it would be better to deliver it at the next request).

PART C: ADAPTIVE FEED-BACK

Todo

HISTORY AS USED IN TRADITIONAL IR/IF

Part B and Part C above have many parallels with traditional IR/IF, and it will probably be possible to use similar approaches. Part A, however, is almost entirely specific to CAR, and involves many new research issues. There is, however, one minor parallel: in IF, the query, i.e. the profile, usually changes very slowly (in contrast to the CAR current context, change may be over months rather than over seconds). In principle, however, it may be treated in a similar manner to change in CAR. For example one could conjecture that if the user has made a small change to their profile, then the changed part of the profile should have a higher weighting than the rest.

Todo: analysis of other similarities, if any, between the two sorts of history.

SOME RELEVANT PAPERS

1.: The manual for the Context Matcher.
2.: Context-aware retrieval: exploring a new environment for information retrieval and information filtering. P.J. Brown, G.J.F. Jones. To be published in Personal Technologies, 2001.

todo: more citations