Peter Brown and Gareth Jones
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
School of Computing, Dublin City University, Dublin 9, Ireland
P.J.Brown@ex.ac.uk, G.J.F.Jones@ex.ac.uk
In mobile IRiX (Information Retrieval in Context) there are great potential gains if future information needs can be predicted. Some possible gains are building in advance a cache of documents that the user is predicted to need, e.g. to cover a period of disconnection, and faster retrieval from such a cache.
Obviously prediction can sometimes be wrong, not only in the example cited above, but in all cases; normally, however, the cost of a wrong prediction is small. For example a cache built from predicted information needs may be only 10% useful, but this is better than no cache at all if the user becomes disconnected.
In normal IR the scope for prediction is small: traditionally every query is treated entirely separately from its predecessor, and the assumption is that queries are random (except perhaps when relevance feedback causes the previous query to be enhanced). In IRiX, the context involves a large number of fields, and often these change slowly (e.g. location) or remain constant for long periods (e.g. certain user preferences); thus prediction becomes possible. (In this paper we assume the entire information need is captured by the context; thus one field of the context might be a traditional IR query typed by the user, whereas other fields might have values generated automatically by sensors -- these being the more predictable fields.)
In IRiX there is the added possibility that retrieval is sometimes pro-active, e.g. done by an agent working on the user's behalf, rather than directly by the user. In general pro-active retrieval is more regular and predictable than retrieval requested interactively by the user, and involves fewer, if any, "random" fields, such as queries typed by the user.
In real life many fields of the context change continuously; examples are location, air temperature and time-of-day. In fact sensor readings to determine the values of these fields are taken at discrete intervals, every 15 seconds say, so rather than a continuum we have a sequence of discrete values where each value is "close" to its predecessor. Closeness is easy to define for numerical values such as location, temperature and time-of-day, but can also be defined for other types of data such as camera images or text. Prediction is possible when the sequence follows some pattern, such as a continual steady increase. Patterns can be determined not only from a sequence itself, but also from history, e.g. sequences that occurred on previous days, such as a regular path of locations that the user has often followed. (In principle patterns can also be seen in a sequence of values that are not "close" to each other, but closeness is probably the easiest case. Closeness is also important in cache-building -- see below.)
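As an illustration, here is a minimal sketch (in Python, with hypothetical names and thresholds) of detecting a steady pattern in such a sequence of discrete readings and extrapolating it; a real predictor would of course be more sophisticated:

```python
from typing import Optional

def predict_next(readings: list[float], tolerance: float = 0.1) -> Optional[float]:
    """Predict the next sensor reading if the sequence shows a steady trend;
    return None when no consistent pattern is detected."""
    if len(readings) < 3:
        return None
    # Successive differences; a continual steady increase means they are similar.
    deltas = [b - a for a, b in zip(readings, readings[1:])]
    mean_delta = sum(deltas) / len(deltas)
    if any(abs(d - mean_delta) > tolerance for d in deltas):
        return None  # no steady pattern: decline to predict
    return readings[-1] + mean_delta

# Temperature sampled every 15 seconds, rising steadily:
print(predict_next([20.0, 20.5, 21.0, 21.5]))  # -> 22.0
```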
Prediction is further enhanced if something about the future (as it is thought to be) is known, e.g. if the prediction system has access to a user's diary, which says that, in 30 minutes' time, she has a scheduled meeting at location X about topic T with a set of people P. We explore this in our concept of a context diary [2].
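The kind of diary entry we have in mind might be represented as follows; this is a sketch only, and the field names are hypothetical rather than those of our actual context diary:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class DiaryEntry:
    """One scheduled event in a user's context diary."""
    start: datetime
    end: datetime
    location: str            # e.g. location X
    topic: str               # e.g. topic T
    participants: list[str]  # e.g. the set of people P

def predicted_context(diary: list[DiaryEntry], when: datetime) -> Optional[DiaryEntry]:
    """Return the diary entry, if any, covering the future time `when`."""
    for entry in diary:
        if entry.start <= when <= entry.end:
            return entry
    return None
```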
We referred earlier to the common need to build a cache of documents that the user might need in the future. We have already done work on this [1]. Essentially this work involves trying to predict a supercontext, which is the union of the contexts that will apply while the cache is needed, e.g. during the time the user is predicted to be disconnected. The supercontext concept works best when all the contexts that comprise it are close to each other, such as a set of locations all within 800 metres of the user's current position. The tactic is to retrieve in advance the documents relevant to the supercontext, and place all of these in the cache. Obviously if certain important fields are completely unpredictable, or are predicted wrongly (e.g. the user switches from walking to driving a car, and is soon more than 800 metres away), this technique will work poorly if at all.
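A minimal sketch of forming a supercontext, assuming purely numeric fields and taking the per-field union to be the bounding range of the predicted values (both assumptions are ours, for illustration):

```python
def supercontext(contexts: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Union a set of predicted contexts into one supercontext: each
    numeric field maps to the (min, max) range covering every context."""
    fields: dict[str, tuple[float, float]] = {}
    for ctx in contexts:
        for name, value in ctx.items():
            lo, hi = fields.get(name, (value, value))
            fields[name] = (min(lo, value), max(hi, value))
    return fields

# Contexts predicted for a period of disconnection:
print(supercontext([
    {"temperature": 21, "hour": 10},
    {"temperature": 23, "hour": 12},
    {"temperature": 26, "hour": 15},
]))  # -> {'temperature': (21, 26), 'hour': (10, 15)}
```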
Building a cache can also be an aid to speed, independently of the needs of mobile use with its periodic disconnection. Assuming a context is rich, context-aware retrieval may be slow. If a cache is built that is, say, 100 times smaller than the original document collection, then retrieval from the cache may (depending on the retrieval algorithm) be 100 times faster. This gain may more than offset the loss from missing good hits because of using the cache. Predicted supercontexts can also be used in building such caches; sometimes a cache may be built to cover several users, all with similar predicted contexts (e.g. participants in a sponsored walk), rather than just one.
As an aside, building a cache might involve another element of prediction, not covered here: predicting which documents in the collection may change during the lifetime of the cache, and downweighting such documents. It may also involve predictions already available from public sources, such as a weather forecast for a temperature field.
We have assumed that a context consists of a number of fields, and we also assume that the documents to be matched also have fields attached to them, e.g. a document may relate to a certain location, time, etc. (Such fields may be explicitly attached to a document or may be derived automatically by looking at its textual content -- that is not our concern here.) The retrieval process consists of matching these individual fields and deriving an overall score. As we have said, one field of the user's context may well be a traditional textual query.
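One possible way to combine per-field matches into an overall score is a weighted sum; this sketch assumes each field has a matcher returning a score in [0, 1], and both the matchers and the weights are hypothetical:

```python
from typing import Callable

# One matcher per field, each returning a score in [0, 1]; the matchers
# themselves (for text, location, time-of-day, ...) are not shown here.
Matcher = Callable[[object, object], float]

def overall_score(context: dict[str, object],
                  document: dict[str, object],
                  matchers: dict[str, Matcher],
                  weights: dict[str, float]) -> float:
    """Weighted combination of per-field match scores (one possible scheme)."""
    total = weight_sum = 0.0
    for name, matcher in matchers.items():
        if name in context and name in document:
            w = weights.get(name, 1.0)
            total += w * matcher(context[name], document[name])
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0
```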
A supercontext, by its nature, will generally have several fields that are unions of values rather than individual values. If each possible value of each field is taken in turn, a supercontext is the union of what is typically a very large number of individual contexts.
It might be possible to retrieve for each of these individual contexts in turn, and then take the union of all the sets of retrieved documents. This union would then be taken as the set of retrieved documents corresponding to the supercontext. However in most real cases the combinatorial explosion in the number of individual contexts makes this approach infeasible. A better approach is to work with ranges of values, and this is the focus of the rest of this Section.
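The explosion is easy to see: the number of individual contexts is the product of the numbers of distinct predicted values of the fields. With hypothetical counts:

```python
from math import prod

# Hypothetical numbers of distinct predicted values per field:
values_per_field = {"location": 500, "hour": 6, "temperature": 10, "query": 3}
print(prod(values_per_field.values()))  # 90000 individual retrievals needed
```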
If the individual values of a field are all close to each other, the set can be represented as a range. This is simplest for 1D numerical fields, e.g. a temperature field of a supercontext may be the range 15 to 25, and a time-of-day field may have a range of 10 to 15 (meaning, of course, 10 a.m. to 3 p.m.).
With such linear ranges, a simple approach to matching is just to replace the range by its centre point. The problem with such a simple approach is that it is likely to give added weight to values near this centre point. This weighting may be unjustified: for example if a time-of-day field of a supercontext is 10 to 15, and this is matched against documents representing scheduled events that have an associated time-of-day field, then the user is unlikely to want special precedence to be given to events scheduled at 12.30, the centre of the range. In general it is better to have a retrieval algorithm that is explicitly designed to cater for ranges. In fact ranges may occur both in the supercontext associated with a user and in the corresponding fields attached to documents (e.g. the document may have an associated time-of-day field representing opening hours of 14 to 18). Optimum matching algorithms for pairs of ranges are by no means obvious.
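One possible range-against-range matcher that avoids the centre-point bias is to score by the fraction of the query range that the document's range covers, giving uniform weight across the query range; this is an illustrative choice of ours, not a claim about the optimum:

```python
def range_match(query: tuple[float, float], doc: tuple[float, float]) -> float:
    """Score two 1D ranges with uniform weight across the query range:
    the fraction of the query range covered by the document range, so
    no special precedence is given to values near the centre."""
    q_lo, q_hi = query
    d_lo, d_hi = doc
    if q_hi == q_lo:  # query range degenerates to a single point
        return 1.0 if d_lo <= q_lo <= d_hi else 0.0
    overlap = max(0.0, min(q_hi, d_hi) - max(q_lo, d_lo))
    return overlap / (q_hi - q_lo)

# Time-of-day range 10 to 15 against opening hours 14 to 18:
print(range_match((10, 15), (14, 18)))  # -> 0.2
```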
(If a range represents a range of possibilities, e.g. because a sensor is not accurate, then replacing, for matching purposes, a range by its centre point is often a reasonable approach since the centre is the most probable value. Such ranges, we believe, need to be treated differently from the sorts of range we discuss here.)
If we move from 1D to 2D, as a location field usually is (though it may be 3D), a range becomes a geometric shape. For example a location field of a supercontext may be a circle centred on the user's current location, or, if the user is known to be heading in a certain direction, a segment of this circle. Moreover, location fields attached to documents may also have complex shapes, e.g. a document may relate to the county of Devon, and may thus have a location field that is the geometric shape that Devon covers. (In practice the shape would need to be simplified, e.g. to a six-sided polygon representing a rather bigger area that encompasses Devon. Likewise the supercontext may be bigger than it needs to be, i.e. including values that were not predicted: an example would be turning a set of predicted values of 21, 23, 26 into the range 21 to 26.) Optimum matching algorithms for 2D ranges and shapes are again not obvious.
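To illustrate one crude option for such 2D matching (our assumption, with the document's shape simplified to an axis-aligned bounding rectangle): estimate, by Monte Carlo sampling, what fraction of the circular supercontext region falls inside the document's region:

```python
import random

def circle_rect_overlap(cx: float, cy: float, r: float,
                        rect: tuple[float, float, float, float],
                        samples: int = 10_000) -> float:
    """Estimate the fraction of a circle (centre cx,cy, radius r) lying
    inside an axis-aligned rectangle (x0, y0, x1, y1), by sampling
    points uniformly in the circle's bounding square."""
    x0, y0, x1, y1 = rect
    inside_circle = inside_both = 0
    for _ in range(samples):
        x = random.uniform(cx - r, cx + r)
        y = random.uniform(cy - r, cy + r)
        if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
            inside_circle += 1
            if x0 <= x <= x1 and y0 <= y <= y1:
                inside_both += 1
    return inside_both / inside_circle if inside_circle else 0.0

# 800-metre circle around the user against a simplified document polygon:
print(circle_rect_overlap(0, 0, 800, (-400, -400, 2000, 2000)))
```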
If a field cannot be predicted at all it can be given an all-encompassing range; we call this ANY in our implementation, and ANY matches everything with equal weight.
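ANY can be folded into matchers like those sketched above as a sentinel value; in this sketch the sentinel and the wrapper are hypothetical names:

```python
ANY = object()  # sentinel for a field that cannot be predicted

def match_with_any(query, doc, matcher):
    """ANY matches every document value with equal, full weight."""
    return 1.0 if query is ANY else matcher(query, doc)

print(match_with_any(ANY, (14, 18), range_match))  # -> 1.0
```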
The work we have done assumes all predictions are equally probable. This is, of course, simplistic, and work needs to be done on incorporating the notion of probability into our models. For the time being, however, we believe the need is to research the problems we have, rather than adding probability too.
Prediction of information needs is an important, and, we believe, somewhat neglected aspect of IRiX. One research topic is the prediction algorithms themselves, and how these can be made to learn from predictions that turn out to be wrong. An equally important topic, and the one we have focussed on in this paper, arises in using the predictions: matching algorithms to retrieve documents for caches. Here the predictions cover a number of future contexts, rather than just one, and we need ways of incorporating ranges and aggregates of values into matching algorithms.