Research issues in context-aware retrieval: a working paper on weighting within best-match strategies

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

Todo

Boolean versus best-match

A central task in context-aware retrieval is to match the user's present context against the contexts associated with each document in the document collection, and to retrieve those documents that (best) match. Pilot applications have often used Boolean matching of fields, because this is simple. It is, however, likely that a best-match strategy will generally yield better results in terms of relevance and precision. We say "generally" because there are some cases -- essentially those cases where the application deals with absolutes -- where Boolean matching still has a place: e.g. to retrieve information that is only relevant to a particular room, to a particular temperature range (below freezing-point, say) or to a particular range of opening hours. In such absolute cases, one can argue that if the present context does not match the context associated with a document, then the document should not be retrieved: being close is not good enough. (Actually the example of opening hours is asymmetric in this respect: just after is a killer, but just before is no great problem.) Overall, however, we stick by our claim that best-match retrieval is generally the preferred approach, and this paper explores issues surrounding this.

Some assumptions

We will make a number of assumptions for the sake of simplicity:

  1. Todo: our assumptions about explicit sensors and about broader concepts such as personalisation.
  2. the software in use consists of two components: a retrieval engine and an application. (There may be some further components, such as pre-processors to the retrieval engine, but we are not interested in such detail here.) The application maintains the current context, and presents retrieved documents to the user. We assume that the retrieval engine is associated with a single document collection, and can optimise its performance by pre-processing the documents in this collection, e.g. by building indexes or by sorting documents according to certain field values, such as their associated location. In some cases the retrieval engine might be a server that is logically and perhaps physically separate from the application; the server might be stateless, and if so any historical information needs to be preserved by the client.
  3. the query may either be derived from the current context and applied to a document in the collection, or vice versa: the reverse case often occurs with pro-active retrieval (e.g. the requirement is to give certain information when the user is next in a certain context, such as when they next visit a certain room); in principle retrieval is the same in each case. See [1] for more details.
  4. a context is a set of independent fields, each with a value; a query may involve one or more fields, each to be matched separately and independently. (Todo: is this too strong?) Fields may be of many possible types. Numerical fields of a context may be 1D, 2D, 3D, etc., and, although numeric fields predominate in the literature, there are plenty of possible uses for text fields (e.g. the document the user is currently reading or writing). Todo: ref to Rhodes [3] + products that find "similar" web pages to your current one, and some discussion.
  5. the query uses the same names for fields as the document, and values are expressed in the same units (e.g. we are not trying to match imperial units against metric ones).
  6. within the query some fields may be required to match (i.e. the matching score for the field must be greater than some threshold) and some fields may be optional matches. For example if a query is derived from the current context, it may specify that the location field must match but that the heartrate field may or may not match.
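Assumption 6 can be sketched in code. The sketch below is illustrative only: the names (`match_document`, the query-triple layout, the threshold value) are our own assumptions, not part of any proposed interface.

```python
# Sketch of a query as a set of field matchers, some required, some optional.
def match_document(query, document, field_scorers, threshold=0.5):
    """Score a document against a query of (field, value, required) triples.

    Returns None if any required field fails to reach the threshold;
    otherwise returns the per-field scores for later combination.
    """
    scores = {}
    for field, value, required in query:
        if field not in document:
            if required:
                return None          # a required field is absent: no match
            continue                 # an optional absent field is simply ignored
        score = field_scorers[field](value, document[field])
        if required and score <= threshold:
            return None              # required field present but too poor a match
        scores[field] = score
    return scores

# Usage: location must match; heartrate may or may not.
query = [("location", (50.7, -3.5), True), ("heartrate", 72, False)]
doc = {"location": (50.72, -3.53), "title": "Exeter cathedral"}
scorers = {"location": lambda a, b: 1.0,         # dummy scorers for the sketch
           "heartrate": lambda a, b: 1.0}
print(match_document(query, doc, scorers))       # {'location': 1.0}
```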

History

An important characteristic of CAR is that history is a valuable aid to improving performance. This is for two reasons:

  1. many CAR applications are running continuously or nearly continuously; there is thus a regular series of retrieval requests, and often each query is similar to its predecessor (e.g. there may just be a small change in location).
  2. the way in which a field is changing may be used to determine its weighting; this can apply even if the application is not continuous, but is, say, performing a retrieval every quarter of an hour. A conjecture is that (a) changing fields are more important than static fields, and (b) fields that are changing fast (relative to their normal rate of change) are more important than slowly changing ones.

History is probably most important in the user's current context, but it can also be important in the content of the document collection, in cases where this is not static. The main uses of history are (a) for recording change, and (b) for prediction of future field values. The way a CAR application uses history is likely to be very different to traditional IR and IF applications. We discuss history, and the way it might be exploited in pre- and post-processors to the retrieval engine in [7].

Database or information retrieval

Traditionally the disciplines of databases and information retrieval have been entirely separate, though some technologies, such as fuzzy databases, begin to bridge the gap. CAR applications will sometimes lie uneasily between databases and information retrieval. Numeric fields or multiple-choice fields are often best treated with a database approach, whereas textual fields are likely to need a retrieval engine (for example, a user-preferences field for hotels may contain strings such as "fishing", "garden", "countryside" and "old manor house", and the best way of matching these is likely to involve using a textual description of each hotel, rather than expecting a database to relate to each of the user's preferences).

Often it will be necessary to exploit an existing body of information, such as existing web pages or existing database content. Obviously this will affect the decision on what retrieval technology to use. In contrast, there are some applications where the document collection is created with CAR in mind. This applies to "memory aid" systems that capture events in the user's life with a view to later CAR, and to CAR systems for conference attendees [4]. In the latter case, all the conference information may be prepared with mark-up designed for CAR. For example the conference streams may be marked with their rooms, times and perhaps the audience they are designed for. Indeed it may be that CAR is the only way to retrieve the conference information: someone away from the conference site may need to pretend, using a suitable interface, that they are at the conference time and place in order to retrieve information. Arguably this is a good and natural metaphor anyway. Todo: relate to Rooms interface from Xerox PARC.

Obviously applications that can decide the form of the document collection have a potential advantage for achieving good retrieval performance.

Overall we have a preference for an information retrieval approach, because it is likely to be more flexible, provided that some of the associated research issues can be solved. We discuss this issue further in [8].

Accumulation of scores

Todo: discuss existing retrieval applications where scores are combined. We assume that the scoring mechanism is such that an individual score is calculated for each field, and then these field scores are combined in some way to yield an overall score for a document match. A previous paper [2], which explores an example where there is just one contextual field (location), suggests that scores be multiplied together, and we will follow that approach here. For more details of scoring see [9].
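As a minimal sketch of that approach (field names and the score values are illustrative):

```python
def combine_scores(field_scores):
    """Multiply the individual field scores to give an overall document score."""
    overall = 1.0
    for score in field_scores.values():
        overall *= score
    return overall

# One consequence of choosing multiplication over addition: a single zero
# (a hard non-match on any field) forces the overall score to zero.
print(combine_scores({"location": 1.5, "temperature": 1.0}))
print(combine_scores({"location": 1.5, "temperature": 0.0}))
```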

It is worth saying a little more about non-matches. There are three possible cases, as illustrated by the following examples:

Case 2 above is less serious than case 1, and perhaps should be given a low score but not a zero score.

Matching of individual fields

The scoring of matches of individual fields will depend on the nature of each field. For textual fields, information retrieval has produced a large body of research and effective algorithms, and this can be exploited. There is much more need for research into scoring of numeric matches, since such fields are likely to form the bulk of contextual fields. Numeric values can be one-dimensional, two-dimensional (as used for many location fields), three-dimensional (as used for locations that include height), or more. Reference [2] has suggested some approaches to scoring the matching of two two-dimensional locations. We suggest that there be generic algorithms that can be used for matching any numeric field of a certain dimension, but recognise that there will be special cases where the generic algorithms need to be overridden. An example of a special case is a time field representing the opening hours of a tourist attraction: assume this has a value of 14.00 to 17.00. If the current time is 14.05, or even 13.55, then this is a good match since the attraction is just opening, whereas if the current time is 16.55 it is a bad match; thus there is an asymmetry about matching, with a bias towards the start of a range, and a special algorithm will need to be used to reflect this.
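A special algorithm of this kind might look like the following sketch. The decay constants are arbitrary assumptions, and times are expressed as decimal hours rather than the hh.mm notation used above; only the asymmetry itself is taken from the discussion.

```python
def opening_hours_score(current, open_time, close_time):
    """Score a current time (decimal hours) against an opening-hours range."""
    if current > close_time:
        return 0.0                               # just after closing is a killer
    if current < open_time:
        # Just before opening is no great problem: decay gently with the wait.
        return max(0.0, 2.0 - (open_time - current) * 4.0)
    # Currently open: the score falls steeply as closing time approaches.
    return min(2.0, 2.0 * (close_time - current))

print(opening_hours_score(13.9, 14.0, 17.0))     # good match: opens very soon
print(opening_hours_score(16.9, 14.0, 17.0))     # bad match: closes very soon
```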

As indicated already, the values of numeric fields are often likely to be ranges rather than single values. For example a document may relate to a certain area, which is represented by a circle, rectangle or polygon; opening hours will typically be a range of times, or even several disjoint ranges. Fields of the user's current context may also be represented as ranges, especially when the recording method is inaccurate or uncertain. For example a user whose location is set via a GPS sensor may have the location represented as a circle, with the current GPS reading as centre, whereas another user whose location is recorded more precisely using a short-range beacon may have their location recorded as a point.

Generic algorithms for matching

In this Section we discuss some details of the generic algorithms that can be used for matching, and we will start with some general considerations.

Firstly, matching will involve two values, V1 and V2, with V1 derived from the query and V2 derived from the document being matched; we suggest that the generic algorithms be commutative, i.e. matching V1 with V2 yields the same score as matching V2 with V1. (However some possible scoring systems go against this: if V1 and V2 are ranges, one possible scoring system is to give extra credit if V2 completely includes V1.)

Secondly, we must remember that not all the matches will involve the same fields, e.g. some tourist attractions may have an associated Temperature field that needs to be matched, whereas others might have a Time-of-day field. As a result there is a need for fairness between the scores on different fields; for example if Temperature matching tended always to get a low score and Time-of-day a high score, this would distort what was delivered to the user.

Thirdly consider the two cases:

  1. what is the score for matching 10 with 20?
  2. what is the score for matching 283 with 293?
One might think that the second gives the higher score, and this will often be sensible. However if both cases relate to temperature, but case (1) is in Centigrade and case (2) is in Kelvin, then the two cases are fundamentally the same and the two scores should be identical. To cater for this, we suggest that at the start of a session there be a prepass that looks at the spread, which is the overall range of values used for each field (ignoring infinite ranges). This might find, in case (1), that all the values in Temperature fields lie between -10 and 40; a conclusion might be that case (1) gets a reasonably low score, being a miss of 10 in a spread of 50. In case (2), assuming identical data expressed on the Kelvin scale, it would find that values lie between 263 and 313, and case (2) would get the same low score, since it is again a miss of 10 in a spread of 50. On the other hand, if the spread were 500, as might apply if the field measured something else (e.g. a share price for a certain company), then a miss of 10 might get a good score.
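The prepass and the spread-relative scoring can be sketched as follows. The linear scoring function is our own assumption; the point being demonstrated is only that Centigrade and Kelvin data yield identical scores once misses are measured relative to the spread.

```python
def spread(values):
    """Overall range of values observed for a field (ignoring infinities)."""
    finite = [v for v in values if abs(v) != float("inf")]
    return min(finite), max(finite)

def miss_score(v1, v2, spread_size, perfect=2.0):
    """Score a 1D match relative to the spread: a perfect match scores 2,
    a miss of the whole spread scores 0, linearly in between (an assumption)."""
    return max(0.0, perfect * (1.0 - abs(v1 - v2) / spread_size))

centigrade = [-10, 0, 15, 25, 40]
kelvin = [c + 273 for c in centigrade]           # identical data, Kelvin scale

lo, hi = spread(centigrade)                      # (-10, 40): spread-size 50
print(miss_score(10, 20, hi - lo))               # a miss of 10 in a spread of 50
lo_k, hi_k = spread(kelvin)                      # (263, 313): spread-size 50
print(miss_score(283, 293, hi_k - lo_k))         # the identical score, as required
```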

Fourthly we assume each match is independent, both of the matching of other fields and of the matching of the same field with a different query/document. For example, in a tourist application, if an attraction is one of 20 that are within a mile of the user, its score is just the same as if it had been the only one. In the former case we assume the thresholding algorithm (e.g. only deliver the 5 best matches to the user) will act as a discriminator.

Fifthly we assume (at least until experience dictates otherwise) that scores are the same irrespective of whether matching is proactive or interactive.

Sixthly, we discuss only basic matching scores here: these may be changed by looking at wider considerations, such as history of change and prediction. Such changes are achieved by separate pre- and post-processors.

Some examples

In order to illustrate the properties of an ideal generic scoring system, we will give some examples of suggested scores. The first set of examples applies to two-dimensional locations, and assumes an interactive query from a user who has given as their location an area (i.e. a range of locations) covering Devon. The list below gives suggested scores for documents with the given associated locations:

  1. Devon: score 2.
  2. Cornwall (an area of similar size to Devon, with a common border but no overlap): 0 or very small.
  3. Exeter (smallish area within Devon): score 1.
  4. Lustleigh (small village within Devon): score 0.1.
  5. A single point within Devon, e.g. the location of a celtic cross: score 0.1.
  6. South-West England (large area of which Devon comprises about a quarter): score 1.5.
  7. England (a larger area that includes Devon): score 0.3.
  8. Europe (a still larger area): score 0.1.
  9. Exmoor (an area half in Devon, half outside): the part within Devon is scored, and the score then halved.
  10. Circle of 180 miles radius from London (one hundredth of this area lies within Devon): as the previous case, but the score is divided by 100.

Our next list assumes proactive matching (though we have provisionally assumed this gives the same scores as interactive matching) against a user's location that is a point in the small village of Lustleigh.

  1. Lustleigh: score 2.
  2. Lustleigh Tea Rooms (a point less than 100 yards from the user): score 2.
  3. A point ten miles from Lustleigh: score 0.1.
  4. West Devon (a large area whose nearest point is 10 miles from Lustleigh): score 0.02.
  5. Teignbridge (an area 20 miles square that includes Lustleigh): score 1.
  6. Exeter (an area of several square miles that is 16 miles from Lustleigh): score 0.3.
  7. Devon (a large area that includes Lustleigh): score 0.3.
  8. Other large areas that include Lustleigh, e.g. England, Europe, etc.: steeply decaying scores according to their size.

Our next list assumes matching a temperature of 10 -- a one-dimensional case:

  1. A temperature of 10 gets a score of 2.
  2. A range of 9..11 gets a score of 2.
  3. A range of 10..12 gets a score of 1.
  4. A range of 11..20 gets a score of 0.5 (fairly near miss).
  5. A temperature of 11 gets a score of 1.
  6. A range of 11.00..11.01 gets a score of 0.99. Our initial feeling was that this should get a score of zero, since the author has set a small precise range, and the value is, proportionally to the range size, a long way off. However, if the author had set a series of documents with end-to-end ranges, e.g. 10.00..10.01, 10.01..10.02, etc., then the closer ones would get higher scores and thus make 11.00..11.01 a low choice. If, on the other hand, 11.00..11.01 is the nearest small range, it might well be relevant. Nevertheless there are worries: what if the documents are for operating a finely calibrated machine, where a certain procedure only applies in the range 11.00..11.01? Answer: the application designer writes a special algorithm, perhaps a Boolean one, for this case.

Matching the centre of a range

We suggest the default algorithms score a match in the centre of a range more highly than a match at the edge (or in matching two ranges, they score higher if their centres are close). There is a possible counter-example to this: a person located at Dover (on the edge of England) is more likely to be interested in a page about England than a user at Birmingham (near the centre of England); this is because a person who has just entered an area is likely to be most interested in it. We are inclined to reject this counter-example: we believe that it lies in the realm of pre-processing and post-processing algorithms that may take account of history and of newly-triggered pages that have not been previously triggered. This is a separate (though important) concern, and should be kept apart from the basic scoring algorithms: it would be wrong to have two sets of algorithms each trying to do the same thing.

Infinite ranges

It is convenient to allow infinite ranges. There are three possible cases:

fully infinite
A range can be infinite at both ends; e.g. a tourist attraction that is suitable for all temperatures may have a fully infinite range supplied as its value. A fully infinite range will match anything; in our algorithms we always give such a match a score of 1.
low-infinite
A range can be infinite at the low end, e.g. to cover all temperatures below zero. One purpose of this is to avoid redundant specification: why specify the lower bound when it is irrelevant? A somewhat similar case might arise with an area such as "the Northern hemisphere".
high-infinite
This is the opposite of low-infinite.
It is possible that a match involves two infinite ranges, e.g. a low-infinite one with a high-infinite one. We suggest that in all cases there is a score of 1 if the ranges overlap and a score of 0 (or perhaps some credit for a near-miss) otherwise.
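The three cases, and the overlap rule for two part-infinite ranges, can be sketched using IEEE infinities (the function name and the decision to return None for fully finite ranges, which the generic algorithm should handle instead, are our own assumptions):

```python
INF = float("inf")

def infinite_range_score(r1, r2):
    """Score two (low, high) ranges when at least one bound is infinite.

    Returns None if neither range has an infinite bound: the generic
    numeric algorithm should be used for that case instead.
    """
    if all(abs(b) != INF for b in r1 + r2):
        return None
    lo = max(r1[0], r2[0])
    hi = min(r1[1], r2[1])
    return 1.0 if lo <= hi else 0.0      # overlap test; no near-miss credit here

print(infinite_range_score((-INF, INF), (14.0, 17.0)))   # fully infinite: 1.0
print(infinite_range_score((-INF, 0.0), (5.0, INF)))     # disjoint low/high: 0.0
print(infinite_range_score((-INF, 10.0), (5.0, INF)))    # overlapping: 1.0
```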

Continuous performance

Two desirable properties of our algorithms for scoring a match of V1 and V2 are:

A prototype scoring algorithm

We suggest the following scoring algorithm as a prototype. It applies to numeric values of any dimensionality. We assume a point is treated as a range whose low and high values are the same; thus the algorithm always matches two ranges. It uses the concept of spread introduced earlier. The spread is calculated on a pre-pass at the start of a session (or alternatively could be declared by the author, e.g. in a special document, labelled `spread', within the document collection), but it needs to be updated if any dynamic value, e.g. a value from a sensor, extends the spread. There is a spread for each numerical field; if the value is multi-dimensional then there is a spread for each dimension. The spread-size is the size of the spread: thus if the spread is -20..60, the spread-size is 80. The maximum-distance for a value is the square root of the sum of the squares of the spread-sizes for each dimension. For a one-dimensional value the maximum-distance is simply the spread-size. The proportional-overlap of two ranges is the proportion of the smaller range that overlaps the larger (?plus halo). If the smaller range is infinite.. todo.
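The quantities just defined can be sketched for rectangular ranges of any dimensionality. Here a range is a (low, high) pair of equal-length tuples, and a point is a range whose low and high values coincide; the halo and infinite-range questions marked as open above are left out.

```python
import math

def maximum_distance(spread_sizes):
    """Square root of the sum of squares of the per-dimension spread-sizes."""
    return math.sqrt(sum(s * s for s in spread_sizes))

def proportional_overlap(r1, r2):
    """Proportion of the smaller rectangular range overlapping the larger."""
    dims = range(len(r1[0]))
    # Signed per-dimension overlap: negative means the ranges are disjoint there.
    raw = [min(r1[1][d], r2[1][d]) - max(r1[0][d], r2[0][d]) for d in dims]
    def volume(extents):
        v = 1.0
        for e in extents:
            v *= e
        return v
    smaller = min(volume([h - l for l, h in zip(*r1)]),
                  volume([h - l for l, h in zip(*r2)]))
    if smaller == 0.0:                 # the smaller range is a point:
        return 1.0 if all(o >= 0.0 for o in raw) else 0.0   # all or nothing
    return volume([max(0.0, o) for o in raw]) / smaller

# The one-dimensional spread -20..60 has spread-size 80, and for a 1D value
# the maximum-distance equals the spread-size.
print(maximum_distance([80.0]))                                    # 80.0
print(proportional_overlap(((0.0,), (10.0,)), ((5.0,), (25.0,))))  # 0.5
```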

Possible algorithms

Assumptions and notation: a range of x to y is written x..y; a score is in the range 0..2, with 2 a perfect score; if a value-tuple is (e.g.) a pair of values, then it represents a geometric point, not two independent values -- we therefore need to treat values as a whole, rather than as a sequence of 1D values; ranges are rectangular.

Possible algorithm for cases where at least one is a point (MD=maximum-distance):

score = sqrt(((MD - distance-between-centres) / MD) ^ 2 + ((MD - distance-outside-edge) / MD) ^ 2)

If a point is within a range, its distance-outside-edge is 0; hence if a point is within a range its score is at least 1. (If, bizarrely, either of the distances in the formula is greater than MD, then that distance is set to MD.)
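A direct transcription of this formula, with the clamping rule included (the function name is ours; the formula and clamping behaviour are as given above):

```python
import math

def point_range_score(md, dist_between_centres, dist_outside_edge):
    """Score a point against a range; md is the maximum-distance (MD)."""
    d1 = min(dist_between_centres, md)   # distances beyond MD are set to MD
    d2 = min(dist_outside_edge, md)
    return math.sqrt(((md - d1) / md) ** 2 + ((md - d2) / md) ** 2)

# A point inside the range has distance-outside-edge 0, so the second term
# contributes 1 under the square root and the score is at least 1.
print(point_range_score(80.0, 40.0, 0.0))
```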

If comparing two ranges, we take two opposite corners of each range, e.g. the NE corner and the SW corner if in 2D. The distance-outside-edge is then the average of the distance between the two NE corners and the distance between the two SW corners. Another algorithm for ranges:

score = (2 - (distance-between-range-centres)/maximum-distance - (size-difference-of-ranges)/(greater-size)) * proportional-overlap
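A transcription of this second algorithm, taking the distances, sizes and proportional-overlap as already computed (the parameter names are ours):

```python
def range_range_score(md, dist_between_centres, size1, size2, prop_overlap):
    """The alternative range-matching score given above; md is maximum-distance."""
    size_difference = abs(size1 - size2)
    greater_size = max(size1, size2)
    return (2.0 - dist_between_centres / md
                - size_difference / greater_size) * prop_overlap

# Identical ranges score (2 - 0 - 0) * 1 = 2; disjoint ranges score 0,
# since their proportional-overlap is 0.
print(range_range_score(80.0, 0.0, 10.0, 10.0, 1.0))   # 2.0
print(range_range_score(80.0, 40.0, 10.0, 20.0, 0.0))  # 0.0
```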

Some possible heuristics for matching ranges of values are:

  1. if one range completely includes another, this may have added weight.
  2. if the centres of two matched ranges are close to each other, this should have added weight.
  3. small ranges should have higher weight than large ranges (e.g. information marked as relating to the village the user is in should weight more highly than information relating to the country they are in).

Discontinuities

As our experience with a field trial of location-based stick-e notes showed [6], absolute distance is an unsatisfactory measure of relevance. This is because the terrain and its accessibility are important, e.g. roads, private areas, cliffs, rivers, walls. Similar issues doubtless apply to other types of field.

There is interesting research to be done in this area, but, given all the other issues, we do not plan to do it. Instead we will ignore discontinuity problems.

Probabilities

Interestingly, the application widely held to be the first context-aware system, the Olivetti active badge system [5], presented contexts in terms of probabilities. For example a display showed the people in a building, each with a room and a probability that they were really there (based on the nature and time of the sighting). Later CAR systems have often failed to follow this good example.

When a CAR application supplies a field value, it might also usefully supply a probability that the value is correct. This information is particularly useful to a retrieval engine that naturally works in a probabilistic way. (For numeric fields, an alternative to probabilities is to use a range of possible values.)

In spite of its possible attractions, we do not plan to follow a probabilistic approach in our implementation: we are already experimenting in many areas and we do not want to compound this with an experimental probabilistic retrieval engine.

Some problems to resolve

Cheating: if a temperature range of 10..20 matched against a user temperature of 15 gives a score of 1.5, but a temperature of 15 against a temperature of 15 gives a score of 2, then authors might cheat and supply several different documents, one for a temperature of 11, one for a temperature of 12, ... .

Wrong assumption: an author might assume that a range of, e.g., 10..20, is a Boolean concept, and anything outside gets a score of 0!

Do we need to have a symmetrical algorithm?

Some relevant papers

1.
P.J. Brown and G.J.F. Jones, `Context-aware retrieval: exploring a new environment for information retrieval and information filtering', to be published in Personal Technologies, 2001.
2.
G.J.F. Jones and P.J. Brown, `Information access for context-aware appliances', Proceedings of ACM SIGIR 2000, Athens, July 2000, pp. 382-384.
3.
B.J. Rhodes, `The Wearable Remembrance Agent: a system for augmented memory', Personal Technologies, 1
4.
A.K. Dey, D. Salber, G. Abowd and M. Futakawa, `The conference assistant: combining context-awareness with wearable computing', Third International Symposium on Wearable Computers, San Francisco, Cal., pp. 21-28, 1999.
5.
R. Want, A. Hopper, V. Falcao and J. Gibbons, `The active badge location system', ACM Transactions on Information Systems, 10, 1, pp. 91-102, 1992.
6.
P.J. Brown, J.D. Bovey and X. Chen, `Context-aware applications: from the laboratory to the marketplace', IEEE Personal Communications, 4, 5, pp. 58-64, 1997.
7.
Companion paper to this: using history.
8.
Companion paper to this: databases versus IR.
9.
Companion paper to this: scores.