Research issues in context-aware retrieval: setting scores and weights on fields

Peter Brown

Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk

ABSTRACT

Retrieval operations involve matching fields of documents. Weights can be used to adjust the relative importance of different fields (a multi-field query), and matching-scores can be used to indicate how well fields and documents have been matched. One use of such scores and weights is by pre- and post-processors, which fine-tune how the matching is performed and what is presented to the user.

This paper consists of rough notes and thoughts about how a scoring/weighting system might work. As ideas harden this paper should evolve into a specification.

Some basic assumptions

The following are our basic assumptions:

  1. Retrieval involves matching each document in a document collection against the user's current context (in previous documentation the current context was called the "present context").
  2. The current context has the same structure as a document in the collection: it is a collection of fields, each field being a name/value pair. Values may be numbers, 2D locations, 3D locations, strings, images, etc.
  3. Retrieval may be proactive (driven by a document) or interactive (driven by the current context). The query is derived from fields of the document or, in the interactive case, from the current context.
  4. In the proactive case the fields to be used may be selected dynamically (e.g. "use the Location and Temperature fields from now on"; this means that the retrieval queries would be derived from the Location and Temperature fields of each document in the collection).
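
To make the shape of the data concrete, here is a minimal sketch in Java (the language of the existing implementation described later) of a document -- or equally the current context, which has the same shape -- as a collection of name/value fields. The class and method names are illustrative only, not part of any existing code.

   import java.util.LinkedHashMap;
   import java.util.Map;

   // A document -- or the current context, which has the same shape --
   // is a named collection of fields, i.e. name/value pairs.  Values
   // are held as Object here; real code would use typed values
   // (numbers, 2D/3D locations, strings, images, ...).
   class ContextDocument {
       private final Map<String, Object> fields = new LinkedHashMap<>();

       void put(String name, Object value) { fields.put(name, value); }
       Object get(String name)             { return fields.get(name); }
       boolean has(String name)            { return fields.containsKey(name); }
   }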

Overall a retrieval operation may consist of a pipeline with the following stages:

  1. a preprocessing stage, which takes a copy of the user's current context, perhaps massages it a bit, and derives the query to be used; the query will usually involve lots of different fields, e.g. Location matches .. and Time matches .. and ... . The preprocessor also sets weightings of fields (we call these field-weights). For example, if the user has suddenly started moving, it may (a) massage the query derived from the current context by pretending the user is slightly in advance of their true position, and adjust the field-weights so that Location is now more important, and (b), in the proactive case, promote Location to be a field from which the query is derived (e.g. our above example of "use Location now").
  2. calls of the retrieval engine, which will take queries and match them against the current context (proactive case) or against each document in the document collection (interactive case). For a multi-field query we assume here that each field is matched separately and independently (though in the future we may move on to sophisticated algorithms that match fields in combination). There may be several stages of retrieval operations, e.g. a proactive retrieval followed by one or more conventional IR stages to knock out unwanted documents (e.g. the user only wants documents about architecture).
  3. post-processing stages that decide what to present to the user and how it is presented.

Pre- and post-processing stages may use a separate body of information, e.g. a history of how the user's context has been changing. Scores are used to relay information about matching from one pipeline stage to the next. However, their driving purpose is for use on the final output: to provide a ranked list to the user. (The ranked list may be null if all the matching-scores are below a certain threshold, i.e. there is nothing important enough to show the user.)
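
As a rough picture of how these stages might fit together in code, here is a hypothetical set of Java interfaces for the pipeline, reusing the ContextDocument sketch above. Query and ScoredDocument are placeholders; none of these interfaces exist in the current implementation.

   import java.util.List;

   // Placeholder types: Query stands for a derived multi-field query,
   // and ScoredDocument pairs a document with its document score.
   class Query {}
   record ScoredDocument(ContextDocument document, double score) {}

   interface Preprocessor {
       // Takes a copy of the current context (plus any history it
       // keeps), massages it, and derives the query to be used,
       // setting field-weights as it goes.
       Query derive(ContextDocument currentContext);
   }

   interface RetrievalEngine {
       // Matches the query against each document in the collection
       // (or, in the proactive case, document-derived queries against
       // the current context), producing field and document scores.
       List<ScoredDocument> match(Query query, List<ContextDocument> collection);
   }

   interface PostProcessor {
       // Decides what to present to the user, e.g. filtering on
       // document scores and producing a ranked list.
       List<ScoredDocument> present(List<ScoredDocument> scored);
   }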

Some details

There are matching-scores attached to fields (field scores) and matching-scores attached to documents (document scores). Document scores are usually calculated by first matching the individual fields of the document, and then combining the field scores obtained. A field score measures how well that field has matched, within a multi-field query. Scores are not attached to the current context (?even when matching is interactive?); they are attached to documents in the collection and to each field within them.

  1. A matching-score is a real number in the range [0,2]. A score of 2 is a perfect match, a score of 1 is an "average match" and a score of 0 is a non-match. (A minor advantage of this scheme is that if we use arithmetic or geometric means to combine the field scores into a document score, then two average matches still yield an average match.) (?Change this: it is only a minor advantage; would [0,1] be better?)
  2. Each matching-score is given an initial value. Each time a pipeline is re-evaluated matching-scores are re-set to their initial values. (Todo: think about initial scores and how scores propagate through a pipeline.)
  3. Scores are changed by the matching process. A field may be in three possible states: active (must be matched), optional (may be matched) or inactive. Scores of active and optional fields may be changed by the matching process; matching-scores of inactive fields remain unchanged (e.g. if Body is not used in the current query, Body fields keep an unchanged matching-score). In retrieval terms it is the active and optional fields that are used to derive a query. For example we may decide that Time is an active field and Location is an optional field. In this case the matching-scores on Time and Location fields will be set according to how well they match.
  4. Field scores may also be used to determine whether a field is active at the next stage, e.g. the first stage of a pipeline sets some field scores, and fields with a score above some designated threshold may then be designated as active for the next pipeline stage (see the discussion towards the end of the paper of how queries are derived from document fields).
  5. A field score is calculated according to a formula, which will generally use the two values being matched (i.e. the value in the query and the value in the corresponding field of the target document), and may also use the previous field score. For example, if the values to be matched are x and y, the formula may be (2 - abs(x-y)) ^ 2; a code sketch of this appears after this list. (If the formula yields a matching-score less than 0, the result is treated as 0; likewise a matching-score greater than 2 is treated as 2.) Normally formulae need to be set by a human who knows what a field means and what constitutes a near miss: e.g. two centigrade temperatures of 0 and 20 may appear to be a long way apart, but expressed on the absolute (Kelvin) scale they are 273 and 293, within 10% of each other. An automaton might think the former is a complete miss but the latter a near miss.
  6. A document score is accumulated from the scores of fields within the document; it may also use the previous document score. Field scores may have weights, e.g. Location is twice as important as Temperature. The formulae, field-weights, etc., may vary between fields and may be set dynamically.
  7. Document scores may be used to filter out documents during the pipeline (e.g. "delete all documents whose score is less than 1.5").
  8. On output, at the end of the retrieval stages of the pipeline, matching-scores may optionally be attached in source form to fields. If we assume an SGML notation, it is natural to use attributes, e.g. a field extracted from a document might be
      <temperature score="1.6"> 24
    

    Such matching-scores may be used by post-processors to fine-tune what is presented to the user.
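
To make points 1, 5 and 6 concrete, here is a small Java sketch combining the [0,2] clamping, the example field-score formula (2 - abs(x-y))^2, and a weighted combination of field scores into a document score. The weighted arithmetic mean used here is only one plausible combination rule; nothing in this sketch is part of the existing implementation.

   // Sketch of field-score and document-score calculation.  The
   // field-score formula is the example from point 5; the weighted
   // arithmetic mean is one possible reading of point 6.
   class Scoring {
       // Scores live in [0, 2]: 2 = perfect match, 1 = average, 0 = miss.
       static double clamp(double s) { return Math.max(0.0, Math.min(2.0, s)); }

       // Example field-score formula for 1D numeric values:
       // (2 - abs(x - y))^2, clamped into [0, 2].
       static double fieldScore(double queryValue, double targetValue) {
           return clamp(Math.pow(2.0 - Math.abs(queryValue - targetValue), 2));
       }

       // Combine field scores into a document score using field-weights
       // (e.g. Location weight 2, Temperature weight 1).
       static double documentScore(double[] fieldScores, double[] weights) {
           double sum = 0, weightSum = 0;
           for (int i = 0; i < fieldScores.length; i++) {
               sum += weights[i] * fieldScores[i];
               weightSum += weights[i];
           }
           return weightSum == 0 ? 0 : sum / weightSum;
       }
   }

Note that, with this combination rule, two average matches (scores of 1) still combine to an average match, as observed in point 1.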

To summarise, the parameters for deriving scores in the matching process are:

  1. the initial matching-scores;
  2. the status of each field: active, optional or inactive;
  3. the formula used to calculate each field score;
  4. the field-weights;
  5. the formula used to accumulate field scores into a document score;
  6. any thresholds used to filter out documents.

All of the above can be changed dynamically between each usage of the retrieval engine; if the pipeline uses several retrieval stages these stages may well use different parameters.

The existing implementation

The existing implementation by pjb (a Java program written by someone who was trying to learn OO at the time, I am afraid), called the Context Matcher, provides context-aware matching facilities based on the stick-e note framework. Documents -- which are called "notes" -- are encoded in an SGML form. Each begins with the tag <note>. The current context is also treated as a document, and thus also begins with the <note> tag; there can be a sequence of such documents, each representing a new setting of the current context, e.g. as location and other fields change. Currently all settings of the current context are simulated and input as data, but potentially there could be real sensors behind these settings. The Context Matcher (hereafter called CM) automatically generates queries from the fields of the documents it processes; it uses a global declaration that specifies which fields are active, i.e. to be turned into queries.

CM supports a pipeline. Documents representing settings of the current context can each be assigned to a stage in the pipeline -- this is done using the STAGE attribute, e.g.

   <note STAGE="5">
   <!-- setting(s) of the current context for stage 5 now follow -->
     <location> ...
     <temperature> ...

Each stage of the pipeline uses the same retrieval engine and works on the same document collection -- except that each stage of the pipeline potentially acts as a filter and can knock out some of the documents in the collection, thus making them unseen by subsequent stages. The parameters that control how the matching engine is to work at each pipeline stage can be set by the <notedefaults> tag, e.g.

   <notedefaults STAGE="5" WEIGHTS="location 2; temperature 1">

(Currently STAGE is implemented, but not WEIGHTS.) [[Todo: can the <note> tag be used for this too, i.e. for setting parameters local to the matching of a particular current context?]]
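
As a rough indication of what implementing WEIGHTS would involve, here is a sketch of parsing the attribute value in the example above into a field-to-weight map. The "field weight; field weight" syntax is taken from the example; none of this is in the current CM.

   import java.util.LinkedHashMap;
   import java.util.Map;

   // Sketch of parsing a WEIGHTS attribute value such as
   // "location 2; temperature 1" into a field -> weight map.
   class WeightsParser {
       static Map<String, Double> parse(String weights) {
           Map<String, Double> result = new LinkedHashMap<>();
           for (String entry : weights.split(";")) {
               String[] parts = entry.trim().split("\\s+");
               if (parts.length != 2)
                   throw new IllegalArgumentException("bad WEIGHTS entry: " + entry);
               result.put(parts[0], Double.parseDouble(parts[1]));
           }
           return result;
       }
   }

Here parse("location 2; temperature 1") yields a map from location to 2.0 and temperature to 1.0, which could feed the document-score weights sketched earlier.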

Specifying formulae and field-weights

We will want to run lots of tests, investigating the effects of different parameters, e.g. different formulae. We want an easy way to do this.

We will assume initial matching-scores are provided as part of the source data; the natural way, if we assume SGML notation, is to use attributes of fields. Field-weights can be supplied as parameters to each call of the retrieval engine. [[In the current CM field-weights can also be WEIGHT attributes of individual fields; we assume weights are taken from the query and any weights on the target are ignored.]]

A more interesting question is how to specify the formulae. There are two approaches:

(a) to embed the formulae in the Java code of the retrieval engine (perhaps using a mechanism involving abstract classes); each set of test data would then somehow specify which set of formulae to use, e.g. by a switch value.

(b) to embed the formulae as declarations in the test data, e.g.

   <notedefaults FORMULA="temperature: (2*x + 3*y)/5">

Here the FORMULA attribute declares the formula to be used for the next set(s) of test data; in the above example it sets the formula for the Temperature field.

Given that our immediate purpose is research, and that we want to keep trying out different formulae, etc., for matching algorithms, approach (b) is potentially much more attractive. Its main advantage is that several different researchers can use the same CM, and each can easily try out new ideas. The disadvantage is that CM needs to parse the formula supplied by the user, and then execute it; if the formula could be arbitrarily complicated (e.g. any program), this parsing would be a big task. However it is reasonable (i) to place severe limits on the nature of the formula (e.g. it must be a polynomial -- though this is probably too restrictive) and (ii) not to worry if the notation used is somewhat clumsy, since we are aiming at researchers rather than an end-user facility. There must, however, be full error checking: we want to avoid the danger of a researcher supplying a syntactically incorrect formula and being given spurious results with no error message.

[[Possible approach though I suspect not a runner: write the formula using some existing language such as Perl; CM then just calls Perl to parse each formula, and calls Perl at run-time to evaluate each formula, returning a numerical result to CM.]]

My feeling is that we need to use approach (a) for "complicated" matching, such as text and images, but we should at least investigate the viability of approach (b) for "simple" matching such as numbers or 2D locations. We also probably need to use approach (a) for the formulae that calculate document scores from field scores.

Here are some details of how the formula may work in approach (b):

  1. All the values (matching-scores, field-weights, variables, etc.) are real numbers.
  2. CM pre-sets certain variables to represent the values being matched. (In the same way the Unix shell has preset variables that can be used in shell scripts.) For example, for 1D numerical values q1 and t1 might be the values of the two fields to be matched between the query and the target document, and d might be the absolute difference between q1 and t1. (?Issue: we are often dealing with ranges rather than single numbers, e.g. matching a temperature in the range 0-9; in these cases do we provide some extra variables, such as t1l, the lower bound, and t1u, the upper bound, with t1 itself set to the mid-point, i.e. 4.5?) For 2D values we would have t1, t2, q1 and q2.
  3. Formulae may involve variables with pre-set values (essentially these are named constants), constants, parentheses and the five common arithmetic operators (+, -, *, /, ^); there is also a case for sqrt and an if-then-else construct like that in C (i.e. (boolean) ? e1 : e2); if-then-else leads on to relational operators, etc. (A sketch of a parser/evaluator for this appears after this list.)
  4. Matching of certain fields is essentially asymmetric: e.g. an event relating to an hour in the past is of no interest, but one relating to an hour in the future may be. It is an open question whether we deal with this by (1) pre- and post-processing, (2) an if-then-else within a formula, or (3) a more sophisticated algorithm that uses approach (a).
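
To gauge the size of the parsing task, here is a sketch of a recursive-descent parser/evaluator for the restricted language of points 1-3: real constants, preset variables, parentheses and the five arithmetic operators. sqrt and the C-style ?: construct are omitted but would be straightforward extensions. This is exploratory code only; the real syntax is undecided and none of this is in CM.

   import java.util.Map;

   // Recursive-descent evaluator for formulae over real numbers:
   // constants, preset variables (q1, t1, d, ...), parentheses and
   // + - * / ^.  Errors raise exceptions rather than giving spurious
   // results (the "full error checking" requirement).
   class FormulaEvaluator {
       private final String src;
       private final Map<String, Double> vars;  // preset variables
       private int pos;

       FormulaEvaluator(String src, Map<String, Double> vars) {
           this.src = src;
           this.vars = vars;
       }

       double evaluate() {
           double v = expr();
           skipSpaces();
           if (pos != src.length())
               throw new IllegalArgumentException("trailing input at " + pos);
           return v;
       }

       // expr := term (('+' | '-') term)*
       private double expr() {
           double v = term();
           while (true) {
               skipSpaces();
               if (eat('+')) v += term();
               else if (eat('-')) v -= term();
               else return v;
           }
       }

       // term := power (('*' | '/') power)*
       private double term() {
           double v = power();
           while (true) {
               skipSpaces();
               if (eat('*')) v *= power();
               else if (eat('/')) v /= power();
               else return v;
           }
       }

       // power := atom ('^' power)?  -- right-associative; note that
       // unary minus binds tighter than ^ in this sketch.
       private double power() {
           double v = atom();
           skipSpaces();
           return eat('^') ? Math.pow(v, power()) : v;
       }

       // atom := number | variable | '(' expr ')' | '-' atom
       private double atom() {
           skipSpaces();
           if (eat('(')) {
               double v = expr();
               skipSpaces();
               if (!eat(')')) throw new IllegalArgumentException("missing ')'");
               return v;
           }
           if (eat('-')) return -atom();
           int start = pos;
           if (pos < src.length()
                   && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) {
               while (pos < src.length()
                       && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.'))
                   pos++;
               return Double.parseDouble(src.substring(start, pos));
           }
           while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos)))
               pos++;
           if (start == pos)
               throw new IllegalArgumentException("unexpected character at " + pos);
           String name = src.substring(start, pos);
           Double v = vars.get(name);
           if (v == null) throw new IllegalArgumentException("unknown variable: " + name);
           return v;
       }

       private boolean eat(char c) {
           if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
           return false;
       }

       private void skipSpaces() {
           while (pos < src.length() && src.charAt(pos) == ' ') pos++;
       }
   }

For example, evaluating "(2 - d) ^ 2" with d preset to 0.5 yields 2.25, which the scoring layer would then clamp to 2.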

An example pipeline

We will assume a standard tourist application; the current context of the tourist consists of Location, Temperature, Time and Season (a number from 1 (Spring) to 4 (Winter)). The documents in the collection contain these fields (though not every document will contain every field -- for example, the Temperature field might only appear in documents relating to attractions that are only attractive at certain temperatures, such as an open-air restaurant). Documents will have other fields such as Title, Author and Body (a textual description of a tourist attraction).

The pipeline is then as follows:

  1. the preprocessor keeps a database of the history of the current context, so that it can detect change. It sets field-weights for CM, based on the way fields are changing. (Our research is about how best to do this.) It also changes the current context to extrapolate what the values might be in one minute's time. (We might also explore preprocessors that add new fields, e.g. a Location-Change field.)
  2. CM does a proactive retrieval, using the Time, Location and Temperature fields. (CM's rule is, incidentally, that an active (i.e. compulsory) field must be matched only if it is present in the document that the query is derived from; thus in our example documents that do not specify a Temperature field can still be matched.)
  3. CM does a second, interactive, retrieval. The user is interested in architecture, and specifies that the Body of a target document must match the word "architecture". (Currently CM has a crude text-matching algorithm, but it could have a better one, like those used by web search engines, that gives a score according to how good the match appears to be: e.g. a document that mentions architecture several times gets a higher score.) Document scores for the two retrieval stages are multiplied together to yield the final document score (?or something more subtle?).
  4. a post-processor, which is a front-end to the application the user is running, decides what to display to the user (a sketch of the simplest part of this step follows this list). It displays just those documents that have a matching-score greater than 1.2. It also boosts the matching-scores of documents that are just ahead of the user's current location, using the database of context changes that the preprocessor has produced. It also presents a "timely attractions" list containing the documents whose Time field scores are best, and a "very near" list displaying the documents whose Location field scores are best. (We also want to investigate post-processors that use different retrieval engines, e.g. CM and a separate engine that does good text matching; the post-processor then combines the results using the field scores obtained.)
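
As an illustration of step 4, here is a sketch of its simplest part -- the threshold filter and the ranked list -- using the placeholder types from the pipeline sketch earlier. The 1.2 threshold is the example's; the score-boosting and the two extra lists are noted only in comments.

   import java.util.Comparator;
   import java.util.List;

   // Sketch of step 4's filtering and ranking.  A fuller version
   // would also boost scores of documents just ahead of the user,
   // using the context-history database, and would build the
   // "timely attractions" and "very near" lists from field scores.
   class ThresholdPostProcessor implements PostProcessor {
       private static final double THRESHOLD = 1.2;

       public List<ScoredDocument> present(List<ScoredDocument> scored) {
           return scored.stream()
                        .filter(d -> d.score() > THRESHOLD)   // drop weak matches
                        .sorted(Comparator.comparingDouble(ScoredDocument::score)
                                          .reversed())        // best first
                        .toList();
       }
   }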

Task for Lindsey

Think about what formulae people might want to use in real situations. Consider whether approach (b) is feasible for simple situations; my feeling is that if we could get something useful out of 2-3 weeks of Java coding, it would be worthwhile. If feasible, how far should we go (e.g. is if-then-else worth the effort?), and how much can we use existing, publicly available code? The Java code would go into CM -- but the hooks are not there yet: indeed there is no proper mechanism for matching-scores. Maybe the best answer is completely different from anything here; if so, suggestions are welcome.

Please come back for clarifications if necessary: some of the above is very rough.

Some relevant papers

1. The manual for the Context Matcher.
2. P.J. Brown and G.J.F. Jones. Context-aware retrieval: exploring a new environment for information retrieval and information filtering. To be published in Personal Technologies, 2001.
3. A sample document collection, marked up as stick-e notes, and a test script that applies some sample current contexts to the document collection.
4. An explanation of how a query is derived from fields of a document.

todo: more citations