Research issues in context-aware retrieval: setting scores and weights on fields
Peter Brown
Department of Computer Science, University of Exeter, Exeter EX4 4QF, UK
P.J.Brown@ex.ac.uk
ABSTRACT
Retrieval operations involve matching fields of documents.
Weights can be used to adjust the relative importance of
different fields (a multi-field query),
and matching-scores can be used to indicate how well fields and
documents have been matched.
One use of such scores and weights is by pre- and post-processors, which fine-tune
how the matching is performed and
what is presented to the user.
This paper consists of rough notes and thoughts about how a scoring/weighting system
might work.
As ideas harden, this paper should evolve into a specification.
Some basic assumptions
The following are our basic assumptions.
Retrieval involves matching each document in a document collection against
the user's current context (in previous documentation the current context
was called the "present context").
The current context has the same structure as each document in the collection: it is a collection
of fields.
Each field is a name/value pair.
Values may be numbers, 2D locations, 3D locations, strings, images, etc.
Retrieval may be proactive (driven by a document) or interactive
(driven by the current context).
The query is derived from fields of the document, or, in the
interactive case, from the current context.
In the proactive case the fields to be used may be selected dynamically
(e.g. "use the Location and Temperature fields from now on"; this means that
the retrieval queries would be derived from the Location and Temperature
fields of each document in the collection).
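To make this concrete, here is a minimal sketch in Java (the names, types and values are illustrative assumptions, not taken from the actual implementation) of deriving a query from a dynamically selected set of active fields:

// Hypothetical sketch: derive a query from the currently active fields of
// a document (or of the current context). Field names and values are invented.
import java.util.*;

class QueryDerivation {
    // A document is just a set of name/value fields; so is the query.
    static Map<String, String> deriveQuery(Map<String, String> document,
                                           Set<String> activeFields) {
        Map<String, String> query = new LinkedHashMap<>();
        for (String name : activeFields) {
            // Only fields actually present in the document contribute to the query.
            if (document.containsKey(name)) {
                query.put(name, document.get(name));
            }
        }
        return query;
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of(
            "Location", "50.72,-3.53",
            "Temperature", "24",
            "Title", "Open-air restaurant");
        // "use the Location and Temperature fields from now on"
        Set<String> active = Set.of("Location", "Temperature");
        System.out.println(deriveQuery(doc, active));
    }
}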
Overall, a retrieval operation may consist of a pipeline with the following stages:
-
a preprocessing stage, which takes a copy of the user's current context, perhaps massages
it a bit,
and derives the query to be used; the query will usually involve
lots of different fields, e.g. Location matches ... and Time matches ... and so on.
The preprocessor also sets weightings of fields (we call these field-weights).
For example, if the user has suddenly started moving, the preprocessor may
(a) massage the query derived from the current context by pretending the user
is slightly in advance of their true position, and adjust
field-weights so that Location is now more important; and (b), in the proactive case,
promote Location to be a field from which the query is derived (e.g. our above
example of "use Location now").
-
calls of the retrieval engine, which will take queries and match them
against the current context (proactive case) or against each document in
the document collection (interactive case).
For a multi-field query we assume here that each field is matched
separately and independently (though in the future we may move on to
sophisticated algorithms that match fields in combination).
There may be several stages of retrieval operations, e.g. proactive retrieval followed by
one or more conventional IR stages to knock out unwanted documents
(e.g. the user only wants documents about architecture).
-
post-processing stages that decide what to present to the user and how it
is presented.
Pre- and post-processing stages may use a separate body of information, e.g. a history
of how the user's context has been changing.
Scores are used to relay information about matching from one pipeline stage to
the next.
However, their driving purpose is their use on the final output: to provide
a ranked list to the user.
(The ranked list may be null if all the matching-scores are below a certain threshold, i.e.
there is nothing that is important enough to show the user.)
Some details
There are matching-scores attached to fields (field scores) and matching-scores attached to documents (document scores).
Document scores are usually calculated by first matching the individual
fields of the document, and then combining the field scores obtained.
A field score measures how well that field has matched, within a multi-field query.
Scores are not attached to the current context (?even when matching is interactive?); they are attached to documents in the collection and each field within them.
-
A matching-score is a real number in the range [0,2].
A score of 2 is a perfect match, a score of 1 is an "average match", and a score
of 0 is a non-match.
(A minor advantage of this scheme is that if we use arithmetic or geometric means
to combine the field scores into a document score, then two average matches still yield
an average match.)
(?Change this: it is only a minor advantage; would [0,1] be better?)
-
Each matching-score is given an initial value.
Each time a pipeline is re-evaluated, matching-scores are reset to their initial values.
(Todo: think about initial scores and how scores propagate through a
pipeline.)
-
Scores are changed by the matching process.
A field may be in one of three states: active (must be matched),
optional (may be matched) or inactive.
Scores of active and optional fields may be changed by the matching process;
matching-scores of inactive fields remain unchanged (e.g. if Body is not used
in the current query, Body fields keep an unchanged matching-score).
In retrieval terms it is the active and optional fields that are used to derive a query.
For example we may decide that Time is an active field and Location is an
optional field.
In this case the matching-scores on Time and Location fields will be set according to how well
they match.
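A minimal sketch (hypothetical Java, not the actual CM code) of the three states and the rule that inactive fields keep their scores:

// Hypothetical sketch of field states and the score-update rule.
enum FieldState { ACTIVE, OPTIONAL, INACTIVE }

class Field {
    String name;
    FieldState state;
    double score; // matching-score in [0,2]

    // Only active and optional fields have their scores changed by matching;
    // an inactive field (e.g. Body when it is not in the query) keeps its score.
    void applyMatch(double newScore) {
        if (state != FieldState.INACTIVE) {
            score = newScore;
        }
    }
}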
-
Field scores may also be used to determine whether a field is active at
the next stage, e.g. the first stage of a pipeline sets some field scores, and
fields with a score above some designated threshold may then be designated
as active for the next pipeline stage
(see the discussion towards the end of the paper
of how queries are derived from document fields).
-
A field score is calculated according to a formula, which will generally
use the two values being matched (i.e. the value in the query and
the value in the corresponding field of the target document), and may also use the previous field score.
For example, if the values to be matched are x and y, the formula may be (2 - abs(x-y)) ^ 2.
(If the formula yields a matching-score less than 0, the result is treated as 0; likewise a matching-score
greater than 2 is treated as 2.)
Normally formulae need to be set by a human who knows what a field means and
what constitutes a near miss: e.g. two centigrade temperatures of 0 and 20
may appear to be a long way apart, but expressed as absolute temperatures
they are 273 and 293, within 10% of each other.
An automaton might think the former is a complete miss but the latter a near miss.
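As a concrete illustration, a minimal sketch (hypothetical Java) of this example formula, with the clamping to [0,2]:

// Hypothetical sketch: the example field-score formula (2 - abs(x-y))^2,
// clamped so the result always lies in the score range [0,2].
class FieldScore {
    static double score(double x, double y) {
        double raw = Math.pow(2.0 - Math.abs(x - y), 2.0);
        return Math.max(0.0, Math.min(2.0, raw));
    }

    public static void main(String[] args) {
        System.out.println(score(24.0, 24.0)); // raw 4.0, clamped to 2.0 (perfect match)
        System.out.println(score(24.0, 25.5)); // (2 - 1.5)^2 = 0.25
    }
}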
-
A document score is accumulated from scores of fields within the document;
it may also use the previous document score.
Field scores may have weights, e.g. Location is twice as important
as Temperature.
The formulae, field-weights, etc, may vary between fields and may be set
dynamically.
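For illustration, a minimal sketch (hypothetical Java) of one obvious document-score formula, a field-weighted arithmetic mean of the field scores:

// Hypothetical sketch: a document score as the weighted arithmetic mean
// of its field scores. The weights and scores are illustrative.
import java.util.*;

class DocumentScore {
    // scores: field name -> field score in [0,2]
    // weights: field name -> field-weight (default 1 if absent)
    static double weightedMean(Map<String, Double> scores,
                               Map<String, Double> weights) {
        double sum = 0.0, weightSum = 0.0;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 1.0);
            sum += w * e.getValue();
            weightSum += w;
        }
        return weightSum == 0.0 ? 0.0 : sum / weightSum;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("Location", 1.8, "Temperature", 1.0);
        Map<String, Double> weights = Map.of("Location", 2.0); // Location counts double
        System.out.println(weightedMean(scores, weights)); // (2*1.8 + 1*1.0)/3 ≈ 1.53
    }
}

With this formula two fields that are both average matches (score 1) still yield a document score of 1, preserving the property noted earlier.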
-
Document scores may be used to filter out documents during the pipeline
(e.g. "delete all documents whose score is less than 1.5").
-
On output, at the end of the retrieval stages of the pipeline, matching-scores may optionally be attached in source form to fields.
If we assume an SGML notation, it is natural to use attributes, e.g.
a field extracted from a document might be
<temperature score="1.6"> 24
Such matching-scores may be used by post-processors to fine-tune what is
presented to the user.
To summarise, the parameters for deriving scores in the matching process are:
-
the initial value of each matching-score.
Each individual field in each document may have its own unique initial field score.
Normally, at the start of a pipeline of retrieval operations, fields will not be given specific initial matching-scores but
will use a system-wide default value, such as 1.
However in subsequent pipeline stages the scores output by one stage of the pipeline
will be used as initial scores by the next stage.
-
the formula used to derive each field score.
The formula may use the values being matched and perhaps the previous field score.
We assume a formula is associated globally with each field name, e.g. for all Location fields, use
this formula.
-
the relative weights of fields; field-weights are
attached as attributes to fields; most often they will
be on the current context (having perhaps been put there
by a pre-processor); they may, however, be on the document
from the collection (e.g. a site may be best if the temperature
is in a certain range, but this is not vital -- i.e. Temperature has
a relatively low field-weight).
Given this, it is possible that both
fields in a match have a field-weight; if so, the field-weight
of the current context overrides.
All of the above is independent of whether matching is
interactive or proactive.
(An alternative, not currently pursued, is a policy
whereby field-weights,
like formulae, are associated globally with field
names, e.g. to specify that all Location
fields have a weight of two.
We can achieve the same effect by embedding pre-defined
field-weights into each specific formula.)
We assume that a field-weight can be any real number, even
a negative one (which has interesting possibilities,
e.g. programming a form of NAND).
-
the formula used to calculate a document score from the scores of its constituent
fields and the field-weights; the formula may also use the previous document score.
We assume the same formula is used globally for all documents.
The formula will usually (I assume) be a weighted arithmetic or geometric
mean of the field scores, and these two should be provided as defaults.
However there are issues about (a) incorporating the previous document score;
(b) the treatment of optional fields (are those that match given matching-scores
using the same formula as for compulsory fields, and those that do not match ignored?); and (c)
giving extra credit where more fields are matched, even if only with
a score of 1.
(Example: if each of your Location, Time, Orientation and Temperature is
an average match, then this may be considered better than an average match
where the query just involves Location.
This might apply in the proactive case, where all the queries are different and some
will contain more fields than others.
If we accept this, and we are using arithmetic/geometric means,
then we may wish to further weight the means according to the number
of values we are taking a mean of: the more the better.)
All of the above can be changed dynamically between each usage of
the retrieval engine; if the pipeline uses several retrieval stages these stages may well use
different parameters.
The existing implementation
The existing implementation by pjb (a Java program written by someone who
was trying to learn OO at the time, I am afraid), called the Context Matcher, provides context-aware
matching facilities based on the stick-e note framework.
Documents -- which are called "notes" -- are encoded in an SGML form.
Each begins with the tag <note>.
The current context is also treated as a document, and thus also begins with the
<note> tag; there can be a sequence of such documents, each
representing a new setting of the current context, e.g. as location and other
fields change.
Currently all settings of the current context are simulated and input as data, but potentially
there could be real sensors behind these settings.
The Context Matcher (hereafter called CM) automatically generates queries from the fields of the documents it
processes; it uses a global declaration that specifies which fields are active, i.e.
are to be turned into queries.
CM supports a pipeline.
Documents representing settings of the current context can each be assigned to
a stage in the pipeline -- this is done using the STAGE attribute, e.g.
<note STAGE="5">
<!-- setting(s) of the current context for stage 5 now follow -->
<location> ...
<temperature> ...
Each stage of the pipeline uses the same retrieval engine and works on the
same document collection -- except that each stage of the pipeline
potentially acts as a filter and can knock out some of the documents in the collection, thus making them unseen by subsequent stages.
The parameters that control how the matching engine is to work at each pipeline stage
can be set by the <notedefaults> tag, e.g.
<notedefaults STAGE="5" WEIGHTS="location 2; temperature 1">
(Currently STAGE is implemented, but not WEIGHTS.)
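Since WEIGHTS is not yet implemented, here is a minimal sketch (hypothetical Java, not part of the current CM) of how such an attribute value might be parsed into a field-name/weight map:

// Hypothetical sketch: parsing a WEIGHTS attribute such as
// "location 2; temperature 1" into a field-name -> weight map.
import java.util.*;

class WeightsParser {
    static Map<String, Double> parse(String weights) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String entry : weights.split(";")) {
            String[] parts = entry.trim().split("\\s+");
            if (parts.length != 2) {
                throw new IllegalArgumentException("bad WEIGHTS entry: " + entry);
            }
            result.put(parts[0], Double.parseDouble(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("location 2; temperature 1"));
        // prints {location=2.0, temperature=1.0}
    }
}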
[[Todo: can the <note> tag be used for this too, i.e. setting parameters local
to the matching of a particular current context?]]
Specifying formulae and field-weights
We will want to run lots of tests, investigating the effects of different
parameters, e.g. different formulae.
We want an easy way to do this.
We will assume initial matching-scores are provided as part of the source data; the natural
way, if we assume SGML notation, is to use attributes of fields.
Field-weights can be supplied as parameters to each call of the retrieval engine.
[[In the current CM field-weights can also be WEIGHT attributes of individual fields;
we assume weights are taken from the query and any weights on the target
are ignored.]]
A more interesting question is how to specify the formulae.
There are two approaches:
- (a) to embed the formulae in the Java code of the retrieval engine
(perhaps using a mechanism involving abstract classes).
Each set of test data would then somehow specify which set of
formulae to use, e.g. by a switch value.
- (b) to embed the formula as a declaration in the test data, e.g.
<notedefaults FORMULA="temperature: (2*x + 3*y)/5">
Here the FORMULA attribute declares the formula to be used for the next set(s) of test data; in the above example it sets the formula for the Temperature
field.
Given that our immediate purpose is research, and that we want to keep
trying out different formulae, etc., for matching algorithms, approach
(b) is potentially much more attractive.
Its main advantage is that several different researchers can use the
same CM, and each can easily try out new ideas.
The disadvantage is that CM needs to parse the formula supplied by the user, and then execute it.
If the formula could be arbitrarily complicated
(e.g. any program), this parsing would be a big task; however it is
reasonable (i) to place severe limits on the nature of the formula
(e.g. it must be a polynomial -- though this is probably too restrictive) and
(ii) not to worry if the notation used is somewhat
clumsy, since we are aiming at researchers rather than an end-user facility.
However, there must be full error checking: we want to avoid the danger of a researcher
supplying a syntactically incorrect formula and being given spurious results
with no error message.
[[Possible approach though I suspect not a runner: write the
formula using some existing language such as Perl; CM then
just calls Perl to parse each formula, and calls Perl at run-time to
evaluate each formula, returning a numerical result to CM.]]
My feeling is that we need to use approach (a) for "complicated"
matching, such as text and images, but we should at least investigate the
viability of approach (b) for "simple" matching such as numbers or
2D locations.
We also probably need to use approach (a) for the formulae that calculate
document scores from field scores.
Here are some details of how formulae may work in approach (b):
-
All the values (matching-scores, field-weights, variables, etc.) are real numbers.
-
CM pre-sets certain variables to represent the values being matched.
(In the same way, the Unix shell has preset variables that can be used
in shell scripts.)
For example, for 1D numerical values q1 and t1 might
be the values of the two fields to be matched between the query and the target document,
and d might
be the absolute difference between q1 and t1.
(?Issue: we are often dealing with ranges rather than single numbers, e.g. matching
a temperature in the range 0-9; in these cases do we provide some extra variables,
such as t1l, the lower bound, t1u, the upper bound,
and with t1 itself set to the mid-point, i.e. 4.5?)
For 2D values we would have t1, t2, q1
and q2.
-
Formulae should involve variables with pre-set values (essentially these are
named constants), constants, parentheses and the five common arithmetic
operators (+, -, *, /, ^); there is also a case for sqrt and for an if-then-else
construct like that in C (i.e. (boolean) ? e1 : e2); the if-then-else leads
on to relational operators, etc.
-
Matching of certain fields is essentially asymmetric: e.g. an event relating
to an hour in the past is of no interest, but one relating to an hour in
the future may be.
It is an open question whether we deal with this by (1) pre- and post-processing; (2) an if-then-else within a formula; or (3) a more sophisticated
algorithm that uses approach (a).
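To gauge the size of the task, here is a minimal sketch (hypothetical Java, written for this note and not part of CM) of a recursive-descent evaluator covering constants, preset variables, parentheses and the five operators; sqrt, the range variables and if-then-else are omitted for brevity:

// Hypothetical sketch, not part of the current CM: a tiny recursive-descent
// evaluator for formulae such as "(2 - d) ^ 2", given preset variables.
import java.util.Map;

class FormulaEval {
    private final String src;
    private final Map<String, Double> vars; // preset variables, e.g. q1, t1, d
    private int pos = 0;

    FormulaEval(String src, Map<String, Double> vars) {
        this.src = src;
        this.vars = vars;
    }

    double eval() {
        double v = expr();
        skipSpaces();
        if (pos != src.length()) throw new RuntimeException("junk at position " + pos);
        return v;
    }

    // expr := term (('+' | '-') term)*
    private double expr() {
        double v = term();
        while (true) {
            skipSpaces();
            if (eat('+')) v += term();
            else if (eat('-')) v -= term();
            else return v;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double v = factor();
        while (true) {
            skipSpaces();
            if (eat('*')) v *= factor();
            else if (eat('/')) v /= factor();
            else return v;
        }
    }

    // factor := base ('^' factor)?  -- right-associative exponentiation
    private double factor() {
        double v = base();
        skipSpaces();
        return eat('^') ? Math.pow(v, factor()) : v;
    }

    // base := number | variable | '(' expr ')' | '-' base
    private double base() {
        skipSpaces();
        if (eat('(')) {
            double v = expr();
            skipSpaces();
            if (!eat(')')) throw new RuntimeException("missing )");
            return v;
        }
        if (eat('-')) return -base();
        int start = pos;
        if (pos < src.length() && Character.isLetter(src.charAt(pos))) {
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            String name = src.substring(start, pos);
            Double v = vars.get(name);
            if (v == null) throw new RuntimeException("unknown variable " + name);
            return v;
        }
        while (pos < src.length() && (Character.isDigit(src.charAt(pos)) || src.charAt(pos) == '.')) pos++;
        if (start == pos) throw new RuntimeException("syntax error at position " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    private void skipSpaces() { while (pos < src.length() && src.charAt(pos) == ' ') pos++; }

    private boolean eat(char c) {
        if (pos < src.length() && src.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Double> vars = Map.of("q1", 24.0, "t1", 22.5, "d", 1.5);
        System.out.println(new FormulaEval("(2 - d) ^ 2", vars).eval()); // prints 0.25
    }
}

As argued above, full error checking is essential; the sketch reports every syntax error by throwing an exception rather than returning a spurious result. sqrt, the range variables (t1l, t1u) and a C-style if-then-else could be added along the same lines.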
An example pipeline
We will assume a standard tourist application; the current context of the tourist
consists of Location, Temperature, Time and Season (a number from 1 (Spring) to 4 (Winter)).
The documents in the collection contain these fields (though not every
document will contain every field -- for example the Temperature field
might only appear in documents relating to attractions that are only attractive
at certain temperatures, such as an open-air restaurant.)
Documents will have other fields such as Title, Author and Body (a textual
description of a tourist attraction).
The pipeline is then as follows:
-
the preprocessor keeps a database of the history of the current context, so
that it can detect change.
It sets field-weights for CM, based on the way fields are changing.
(Our research is about how best to do this.)
It also changes the current context to extrapolate what the values might be in
one minute's time.
(We might explore preprocessors that add new fields, e.g. a Location-Change field.)
-
the CM does a proactive retrieval, using the Time, Location and Temperature
fields.
(CM's rule is, incidentally, that a compulsory field must be matched if it
is present in the document that the query is derived from; thus in
our example documents that do not specify a Temperature field can still be matched.)
-
CM does a second, interactive, retrieval.
The user is interested in architecture, and specifies that the Body of
a target document must match the word "architecture".
(Currently CM has a crude text matching algorithm, but it could have
a better algorithm, like those used by web search engines, that gave a
score according to how good a match there appears to be: e.g. a document that
mentions architecture several times gets a higher score.)
Document scores for the two retrieval stages are multiplied together to
yield the final document score (?or something more subtle?); a sketch of this combination appears after this list.
-
a post-processor, which is a front-end to the application the user is running,
decides what to display to the user.
It just displays documents that have a matching-score greater than 1.2.
It also boosts the matching-scores of documents that are just ahead of
the user's current location, using the database of how the context is
changing that the preprocessor has produced.
It also presents a "timely attractions" list that contains documents
whose Time field score is best, and a "very near" list that displays
the documents whose Location field score is best.
(We also want to investigate post-processors that use different retrieval
engines, e.g. CM and a separate engine that does good text matching;
the post-processor then combines the results using the field scores obtained.)
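As flagged above, a minimal sketch (hypothetical Java; the document names and scores are invented) of multiplying the two stage scores and applying the post-processor's display threshold of 1.2:

// Hypothetical sketch: combine the document scores of the two retrieval
// stages by multiplication, then display only documents scoring above 1.2.
// (Multiplying two [0,2] scores can exceed 2; whether something more subtle
// is needed is left open above.)
import java.util.*;

class PipelineCombine {
    public static void main(String[] args) {
        Map<String, Double> stage1 = Map.of("castle", 1.6, "restaurant", 0.6);
        Map<String, Double> stage2 = Map.of("castle", 1.1, "restaurant", 1.8);

        for (String doc : stage1.keySet()) {
            double combined = stage1.get(doc) * stage2.getOrDefault(doc, 0.0);
            if (combined > 1.2) {
                System.out.println(doc + " displayed, score " + combined);
            }
        }
        // castle: 1.6 * 1.1 = 1.76 -> displayed
        // restaurant: 0.6 * 1.8 = 1.08 -> filtered out
    }
}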
Task for Lindsey
Think about what formulae people might want to use in real situations.
Consider whether approach (b) is feasible for simple situations; my feeling
is that if we could get something useful out of 2-3 weeks of Java coding,
it would be worthwhile; if it is feasible, how far should we go
(e.g. is if-then-else worth the effort?), and how much can we use existing
publicly-available code?
The Java code would go into CM -- but the hooks are not there yet: indeed
there is no proper mechanism for matching-scores.
Maybe the best answer is completely
different from anything here; if so, suggestions welcome.
Please come back for clarifications if necessary: some of the
above is very rough.
Some relevant papers
1. The manual for the Context Matcher.
2. P.J. Brown and G.J.F. Jones, "Context-aware retrieval: exploring a new environment for information retrieval and information filtering", to be published in Personal Technologies, 2001.
3. A sample document collection, marked up as stick-e notes, and a test script that applies some sample current contexts to the document collection.
4. An explanation of how a query is derived from fields of a document.
todo: more citations