Research issues in context-aware retrieval: specifying the fields to be matched

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

During any matching process some fields will be active, i.e. used to derive a query, and others will be inactive and play no part in the matching process. Moreover, active fields may be compulsory or optional. This document discusses a proposed way to allow the application designer to specify to the Context Matcher which fields are active, and whether these are optional or compulsory.

Some basic factors

One of the factors that makes retrieval more difficult is that data is often inhomogeneous. Thus, although the majority of documents in the document collection may have a <Location> field, there may be documents that do not. For example there may be a document, triggered by a <Temperature> field, that gives a general warning about the dangers of exposure to certain high temperatures. This, of course, applies independently of location. Moreover -- and I am not sure how realistic this would be in practice -- the document collection may be a mixture of completely different documents, some location-based ones for tourists and some triggered by share prices and aimed at investors.

Inhomogeneity can also apply to the user's current context. In a multi-user system, for example, some users may have a temperature sensor, and hence a <Temperature> field in their current context, and some may not. Even with one user, the temperature sensor may periodically fail, thus causing the temporary removal of the <Temperature> field from their current context. As a result of all this inhomogeneity, there is the possibility of a field being present on one side of a match and absent on the other (e.g. present in the query but not in the document being queried).

One answer to inhomogeneity within the document collection is to fill in the missing fields. A metavalue `ANY' comes in handy here: e.g. if there is no location field the location is set as `ANY'. This answer, which is essentially a database approach, works if inhomogeneity is minor, but is not a panacea.
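As a minimal sketch of this fill-in approach, assume documents are represented as plain dictionaries from tag name to value (a representation chosen here for illustration, not taken from the Context Matcher). The helper names `fill_missing` and `values_match` are likewise hypothetical.

```python
ANY = "ANY"  # metavalue from the text; the constant name is an assumption


def fill_missing(document, tags):
    """Return a copy of `document` in which every absent tag is set to ANY."""
    filled = dict(document)
    for tag in tags:
        filled.setdefault(tag, ANY)
    return filled


def values_match(a, b):
    """Exact-match comparison in which ANY on either side matches anything."""
    return a == ANY or b == ANY or a == b


# A temperature warning that applies independently of location.
warning = {"Temperature": "high"}
filled = fill_missing(warning, ["Location", "Temperature"])
print(filled)
```

The original document is left untouched, so the fill-in can be applied at retrieval time rather than stored in the collection.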

The second basic factor concerns the concept of active fields, i.e. the fields that are to be involved in the matching. In interactive retrieval, which corresponds to ordinary IR, this is not an issue: the query involves certain fields, and these are therefore the active ones. In proactive retrieval, on the other hand, the active fields can vary. Experience has shown that it is unwise to wire into a document a specification of which of its fields are to be used for proactive triggering, i.e. are to be active. The same document may on one occasion be set to trigger on its <Location> field, and on another on its <Body> field -- in the latter case it might be desired to trigger those documents whose subject matter matched the text the user was currently composing on their PDA (and which formed the <Body> of their current context). Thus there is merit in specifying the active fields as parameters of the retrieval process, rather than wiring them into the data involved. We have assumed this in our model; we have taken the slightly restricting approach that the active fields are specified globally by their tags, i.e. the same active fields, as specified by their tags, apply to all documents. If a proactive document contains at least one field that corresponds to an active tag, the document itself is active and is thus involved in the matching: a query is derived from its active fields and matched against the current context. (An active document does not have to contain all the active tags.)
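The rule for active documents can be sketched as follows. This assumes, purely for illustration, that a document is a dictionary from tag name to value and that the active tags are passed in as a set; the function names are hypothetical.

```python
def derive_query(document, active_tags):
    """Keep only the fields whose tags are globally active."""
    return {tag: value for tag, value in document.items() if tag in active_tags}


def is_active(document, active_tags):
    """A document is active if it contains at least one active tag."""
    return bool(derive_query(document, active_tags))


guide = {"Location": "Exeter Cathedral", "Body": "Gothic architecture"}
memo = {"Body": "share prices"}

# The same document yields different queries under different parameters.
by_location = derive_query(guide, {"Location", "Time"})
by_body = derive_query(guide, {"Body"})
```

Note that `guide` is active under the tags {Location, Time} even though it has no <Time> field, whereas `memo` is not active under those tags at all.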

The third and final factor concerns whether matching of an active field is compulsory or optional, i.e. if the matching of that field fails, does the whole document match fail? In fact the compulsory case can be divided into sub-cases:

presence-compulsory: the field must be present on both sides of the match; for example if a proactive document required a <Time> field and the user's current context (which represents the document being queried) did not have one, then the overall document match would fail. The alternative is just to give a low or zero score to the match, and to continue looking at other fields.
value-match-compulsory: the values of the field, if present on both sides of the match, must get a matching score greater than zero; if not, the overall document match fails. (If the algorithm for aggregating field scores into an overall document score is a geometric mean, then all fields are value-match-compulsory, since a zero score for any one field makes the overall score zero.)
fully-compulsory: both of the above.
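The three sub-cases can be expressed as two independent flags, with fully-compulsory being both set. The sketch below is one possible reading, not the Context Matcher's own code; it assumes that an absent field is represented by None, and `field_fails_document` and `exact` are hypothetical names.

```python
def field_fails_document(query_value, context_value, score_fn,
                         presence_compulsory=False,
                         value_match_compulsory=False):
    """Return True if this one field makes the whole document match fail.

    A value of None means the field is absent on that side of the match.
    """
    if query_value is None or context_value is None:
        return presence_compulsory
    if value_match_compulsory:
        return score_fn(query_value, context_value) <= 0
    return False


def exact(a, b):
    """Hypothetical scoring function: 1 for equal values, 0 otherwise."""
    return 1.0 if a == b else 0.0
```

With neither flag set the field is optional: it can lower the overall score but never rules the document out.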

The concept of compulsory matching is a Boolean concept. Arguably, therefore, it is more suited to Boolean matching. However even in best-match matching it is sometimes useful to impose Boolean constraints such as compulsory fields. The concept is also relevant if queries are written in a Boolean notation: X AND Y means that both X and Y are compulsory (or alternatively the notation used by certain web search engines of putting a `+' in front of compulsory words). However for the moment we do not wish to pursue Boolean queries with AND and OR (? is this agreed?).

Implementation within the Context Matcher

In its previous incarnation the Context Matcher used the mechanism of <activeTags> to specify the active tags. It has tried various mechanisms for differentiating compulsory and optional tags, but none has found much favour. Here we make a new proposal, which has two advantages: simplicity and the casting away of a lot of previous baggage. (The other side of the simplicity coin is that, as the examples below bring out, the approach can be crude and ad hoc; however, I think it will serve our purpose.)

The first part of the proposal concerns specifying the active tags: this is now done by means of the field weights; if a tag is not specified in the field weights (or is specified with a weight of zero) it is inactive. Thus each active tag must be specified in the weights, even if it has a weight of one. (Previously, active tags were specified separately and had a default weight of one.)
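In code terms the rule is very small. The sketch assumes the field weights are held in a dictionary from tag name to number, which is an illustrative choice rather than the Context Matcher's actual representation.

```python
def active_tags(field_weights):
    """Tags with a non-zero weight are active; all others are inactive."""
    return {tag for tag, weight in field_weights.items() if weight != 0}


# <Location> must be listed even though its weight is one;
# <Temperature> is listed but inactive; <Body> is unlisted and so inactive.
field_weights = {"Location": 1, "Time": 2.5, "Temperature": 0}
```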

The second part of the proposal is that there be no explicit mechanisms for compulsory/optional tags. I believe it can be covered by a combination of the existing mechanisms:

the facility for the application designer to plug into the Context Matcher algorithms that match the values of two fields and return a score. I propose three (hopefully minor) extensions to this. The first is that the plugged-in algorithm be called by the Context Matcher even in the case where the field is not present on both sides of the match, i.e. one of the arguments to the algorithm will be null; this covers the presence-compulsory case, and the algorithm can decide whether to treat this as the end of a document match. The second extension is that the plug-in algorithms should be able to return a result, -1 say, that specifies that the whole document match should fail. (Returning a value of 0 does not always accomplish this: it depends on whether value-match-compulsory is the default.) The third extension is that the plug-in algorithm should be able to call the Matcher's default algorithm for its data type. Thus a plug-in for location could call the Context Matcher's default algorithm for matching two locations, and then change the result in some way, such as turning a low score into a score of -1.
the facility, mentioned above, to define a weight for each individual tag. This weight can be changed dynamically, i.e. between a retrieval for one current context and a retrieval for the next current context.
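To make the two mechanisms concrete, here is a rough sketch of a matching loop with the three proposed extensions. Everything in it is an assumption made for illustration: the dictionary representation, the plug-in signature (query value, context value, default algorithm), the weighted arithmetic mean used for aggregation, the exact-match stand-in for the default algorithm, and the choice that absent fields score 0 when no plug-in is supplied.

```python
FAIL = -1  # assumed sentinel: "the whole document match fails"


def default_score(tag, query_value, context_value):
    """Stand-in for the Matcher's default algorithm for the tag's data type."""
    return 1.0 if query_value == context_value else 0.0


def match_document(query, context, weights, plugins):
    """Return a weighted-mean score, or None if the document match fails
    (or if the query has no active tags)."""
    total = weight_sum = 0.0
    for tag, query_value in query.items():
        weight = weights.get(tag, 0)
        if weight == 0:
            continue  # tag is inactive
        context_value = context.get(tag)  # None when the field is absent
        plugin = plugins.get(tag)
        if plugin is not None:
            # Extensions 1 and 3: call the plug-in even with a null argument,
            # and hand it the default algorithm so that it can delegate.
            score = plugin(query_value, context_value,
                           lambda a, b, tag=tag: default_score(tag, a, b))
        elif context_value is None:
            score = 0.0  # assumed default: absent fields are optional
        else:
            score = default_score(tag, query_value, context_value)
        if score == FAIL:
            return None  # extension 2: a plug-in has failed the document
        total += weight * score
        weight_sum += weight
    return total / weight_sum if weight_sum else None


def strict(query_value, context_value, default):
    """Example plug-in: absence or a zero default score fails the document."""
    if query_value is None or context_value is None:
        return FAIL
    score = default(query_value, context_value)
    return FAIL if score == 0 else score


weights = {"Location": 1, "Temperature": 1}
query = {"Location": "Exeter", "Temperature": "warm"}
```

Because the weights are passed in on every call, they can be changed between one retrieval and the next without touching the plug-ins.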

The solution assumes that being optional/compulsory is a global property associated with a tag; this is a somewhat limiting assumption, but, I think, will serve. It is not therefore possible under the proposal to say that the location field is compulsory on one proactive document but optional on another.

Defaults

The old Context Matcher had the rule that if no active tags were specified, the names of the tags used in the current context were by default the active tags. This is often convenient, but I would also be happy if there were no defaults: i.e. it was an error if the active tags were not explicitly specified. The old Context Matcher also had a rule that tags were value-match-compulsory: i.e. if a tag was actually present in the matched document it had to match (with a best-match strategy this would mean getting a score greater than 0), but if the tag was not there, then the document could still match. For example if the user's location was unknown, a location-based document could still be triggered, but if the user's location was known it had to match the location on the proactive document. In the new Context Matcher, I would be happy with this default, or perhaps a default that all fields were optional: the latter is more in keeping with a best-match strategy.

Some examples

We show some examples of how compulsory fields can be effected. All the examples apply to proactive matching.

Example 1: compulsory Boolean

Requirement: if a document specifies a <CompanionsQualifications> field, then the value supplied is fully-compulsory (i.e. both presence-compulsory and value-match-compulsory) and must therefore be matched by a corresponding field in the current context, e.g.

  <CompanionsQualifications> medic

Strategy: a plug-in algorithm is supplied for matching <CompanionsQualifications> fields. This returns a score of -1 if there is a non-match, or if one of the fields is absent; otherwise it returns the score for a perfect match, 2 say.
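A plug-in along these lines might look as follows. The two-argument signature, the use of None for an absent field, and the case-insensitive comparison are assumptions of this sketch.

```python
FAIL = -1     # assumed "whole document match fails" result
PERFECT = 2   # the perfect-match score suggested in the text


def companions_qualifications_plugin(query_value, context_value):
    """Fully-compulsory Boolean match; None means the field is absent."""
    if query_value is None or context_value is None:
        return FAIL
    same = query_value.strip().lower() == context_value.strip().lower()
    return PERFECT if same else FAIL
```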

Example 2: close location

Requirement: that the <Location> fields, if present, must be within a mile of each other; provided that this is satisfied, locations are scored for closeness.

Strategy: a plug-in algorithm is specified in a similar way to Example 1. If a <Location> field was absent from the document being matched (e.g. the user has no location sensor) then a 0 score is returned, or if the Matcher's default is value-match-compulsory, a small positive score in order to override the default; if the two locations are more than a mile apart a score of -1 is returned; otherwise the Matcher's default algorithm for comparing locations is called.
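The following sketch shows one way such a plug-in could be written. It assumes that locations are planar (x, y) coordinates measured in miles and that the Matcher's default location algorithm scores closeness as 1 / (1 + distance); both are stand-ins invented for the example, as are the names and the small positive value.

```python
import math

FAIL = -1              # assumed "whole document match fails" result
SMALL_POSITIVE = 0.01  # assumed score used to override a compulsory default


def default_location_score(a, b):
    """Stand-in for the Matcher's default closeness score."""
    return 1.0 / (1.0 + math.dist(a, b))


def close_location_plugin(query_loc, context_loc,
                          default=default_location_score,
                          value_match_compulsory_default=False):
    """Locations must be within a mile; otherwise score for closeness."""
    if query_loc is None or context_loc is None:
        # Field absent on one side, e.g. the user has no location sensor.
        return SMALL_POSITIVE if value_match_compulsory_default else 0.0
    if math.dist(query_loc, context_loc) > 1.0:
        return FAIL
    return default(query_loc, context_loc)
```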

Example 3: combining the two above examples

Requirement: the requirements for both of the above examples apply.

Strategy: this is no problem, since the above two strategies can be used in parallel.

Example 4: missing fields

Requirement: a proactive document collection for tourists uses, among others, fields for <Location>, <Time> and <Temperature>. For example gardens and other open-air attractions may have suggested temperatures that make a visit pleasant. The application designer makes the somewhat arbitrary decision that the <Temperature> field should be optional, but all other fields should be fully-compulsory. The rationale is that the user would not be interested in attractions many miles away, but could be interested in visiting a garden even if the temperature was near freezing point. (Of course garden attractions would get a low score if their suggested temperature had a large mismatch, but they would not be ruled out.)

Strategy: depending on the Matcher's defaults, plug-ins are written for those fields for which the application designer wants to override the default. If for example the default was optional, plug-ins would be written for all fields except <Temperature>: these plug-ins would return -1 if the field was not present in the current context; otherwise they would call the Matcher's built-in scoring method for the field, but if the result was 0 it would be turned into -1.
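Since the same wrapping is needed for every compulsory field, it could be produced by a small factory. The sketch assumes an optional default, a two-argument plug-in signature, and an exact-match function standing in for the Matcher's built-in scoring; none of these are taken from the Context Matcher itself.

```python
FAIL = -1  # assumed "whole document match fails" result


def make_fully_compulsory(builtin_score):
    """Wrap a built-in scoring method so that the field is fully-compulsory."""
    def plugin(query_value, context_value):
        if query_value is None or context_value is None:
            return FAIL  # presence-compulsory
        score = builtin_score(query_value, context_value)
        return FAIL if score == 0 else score  # value-match-compulsory
    return plugin


def exact(a, b):
    """Stand-in for the Matcher's built-in scoring method."""
    return 1.0 if a == b else 0.0


# Plug-ins for every field except <Temperature>, which stays optional.
plugins = {tag: make_fully_compulsory(exact) for tag in ("Location", "Time")}
```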

Example 5: cutting out a field to stop triggering

Requirement: the current context has a <HeartRate> field, set by a heart rate sensor. As a result lots of documents giving dire warnings about the user's health are triggered. These make the user worse still, and he does not want to see any more of them; he therefore switches off his heart rate sensor.

Strategy: If triggering is presence-compulsory, the application can simply remove the <HeartRate> field from the current context. Other possible approaches are to set the weight of the <HeartRate> field to zero, thus making it inactive, or to provide a plug-in for <HeartRate> that always returns -1. (The former will still trigger documents that use the heart rate in conjunction with some other field(s) -- see next Example.)

Example 6: cutting out a field to widen triggering

Requirement: the current context has a <Temperature> field, set by a temperature sensor. However the user is a hardy type, who is willing to do anything in any weather. She feels that the temperature information is stopping certain useful documents being triggered; she therefore switches her temperature sensor off. Note that the user's purpose in switching off a sensor is in this case the exact opposite of the previous example, with its heart rate sensor. This user's goal is more information whereas the previous one's was less. Obviously the user interface must deduce or explicitly ask for the reason for the switching off, in order to differentiate the two cases.

Strategy: the simplest solution is to give the <Temperature> field a weight of zero, thus making it inactive. Then, for example, documents that previously had active <Temperature> and <Location> fields would now be triggered just by location; this is presumably the user's wish. A further effect is that documents that were just triggered by temperature will no longer be triggered; if such documents are wanted, an alternative overall strategy is to set the <Temperature> field to ANY. A moral of this and other examples is that you really need to have an idea of what is in the document collection in order to gauge the effects of changes in the triggering process; user feedback can also help, e.g. the user saying that this document is an example of what they want and this other document is an example of what they do not want.
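The difference between the two strategies shows up on a toy collection. The sketch below deliberately uses the crudest possible matching (every active field must be present and match exactly, or be ANY in the context), which is an assumption made to keep the example short; the documents and values are invented.

```python
ANY = "ANY"


def triggered(documents, context, weights):
    """Names of documents whose active fields all match the context exactly."""
    active = {tag for tag, weight in weights.items() if weight != 0}
    hits = []
    for name, document in documents.items():
        query = {t: v for t, v in document.items() if t in active}
        if not query:
            continue  # no active fields, so the document is not active
        if all(context.get(t) in (v, ANY) for t, v in query.items()):
            hits.append(name)
    return hits


documents = {
    "garden": {"Location": "Exeter", "Temperature": "warm"},
    "frost_warning": {"Temperature": "freezing"},
}
context = {"Location": "Exeter", "Temperature": "freezing"}

before = triggered(documents, context, {"Location": 1, "Temperature": 1})
weight_zero = triggered(documents, context, {"Location": 1, "Temperature": 0})
with_any = triggered(documents, {**context, "Temperature": ANY},
                     {"Location": 1, "Temperature": 1})
```

With a zero weight the garden is triggered by location alone but the temperature-only warning drops out; with ANY in the current context both documents are triggered.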

Comparison with previous proposal

A previous proposal for specifying active tags allowed a more fine-grained selection of active tags: instead of a tag being selected just by its name it could be selected by its name and value, e.g. all <Location> tags with a value in a certain range. Under the new proposal given here this would need to be done in two stages: (1) build a cache of potentially relevant documents, e.g. those whose locations were within a desired range -- extracting this cache is a straightforward retrieval process; (2) mark, within the cache, the active tags. Arguably this two-stage process is appropriate, since two logically separate operations are being performed.